We’ve all been there. You write a prompt, test it a few times, and it works great. You ship it. Then, a week later, you find out it’s been quietly failing on some weird edge case or spitting out chatty, unusable fluff instead of the clean data you needed.
That "works most of the time" approach is fine for a weekend project, but when you're building something real—something that other code or paying customers depend on—"most of the time" is just another word for "broken."
The good news is that we don't have to just guess and check. The research community has been formalizing prompting into a real engineering discipline. Think of it less like whispering a wish into a magic box and more like designing a specific tool for a specific job.
Today, I want to walk you through five of my favorite techniques that go way beyond the basics. These aren't about fine-tuning or spinning up new infrastructure. They work entirely at the prompt level, and they are designed to solve very specific problems with how LLMs think and respond.
Let’s get into it.
First, a Quick Setup
Just so we're all on the same page, we'll be using OpenAI's gpt-4o-mini for these examples. Here’s a tiny bit of Python to set up a clean way to chat with the API. It’s just a simple wrapper so we can focus on changing the prompts and nothing else. The section and divider functions are just for making the output pretty in our terminal.
import json
from openai import OpenAI
import os
from getpass import getpass

# A safer way to load your key at runtime
if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass('Enter OpenAI API Key: ')

client = OpenAI()
MODEL = "gpt-4o-mini"


def chat(system: str, user: str, **kwargs) -> str:
    """Minimal wrapper around the chat completions endpoint."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        **kwargs,
    )
    return response.choices[0].message.content


def section(title: str) -> None:
    print("\n" + "=" * 60)
    print(f" {title}")
    print("=" * 60)


def divider(label: str) -> None:
    print(f"\n── {label} {'─' * (54 - len(label))}")
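If you want to put a number on "works most of the time," one option is to run the same prompt several times and count how often the output passes a check you care about. The sketch below is my own assumption, not part of the setup above: `pass_rate`, `fake_generate`, and `looks_like_bare_json` are hypothetical names, and the canned strings stand in for real model output so the example runs offline.

```python
from typing import Callable


def pass_rate(generate: Callable[[], str], check: Callable[[str], bool], runs: int = 10) -> float:
    """Call `generate` `runs` times and return the fraction of outputs that pass `check`."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passed = sum(1 for _ in range(runs) if check(generate()))
    return passed / runs


# Hypothetical stand-in for an LLM call so the sketch runs without an API key.
_canned = iter(['{"ok": true}', "Sure! Here is the JSON: {}", '{"ok": false}', '{"ok": true}'])


def fake_generate() -> str:
    return next(_canned)


def looks_like_bare_json(text: str) -> bool:
    """Crude check: the output starts with '{' and ends with '}'."""
    stripped = text.strip()
    return stripped.startswith("{") and stripped.endswith("}")


rate = pass_rate(fake_generate, looks_like_bare_json, runs=4)
print(f"pass rate: {rate:.0%}")
```

In real use you would pass something like `lambda: chat(system, user)` as the generator and a check that matches your own definition of "usable."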
Alright, let's get to the fun stuff.
1. Give Your LLM a Job Title: The Power of Role-Prompting
Think about it this way: if you have a legal question, you don't ask a random person on the street. You ask a lawyer. LLMs are trained on a massive mix of data from every domain imaginable: marketing, security, engineering, history, you name it. When you ask a generic question, the model has no signal about which domain's vocabulary and priorities to favor, so you usually get a correct but bland, generic answer.
Role-prompting is how we fix this. By giving the model a specific persona in the system prompt (e.g., "You are a senior application security researcher"), you're essentially telling it which hat to wear. It doesn't give the model any knowledge it didn't already have; it steers the answer toward the language, priorities, and mental models of that specific role.
Let’s see it in action.
section("TECHNIQUE 1 -- Role-Specific Prompting")

QUESTION = "Our web app stores session tokens in localStorage. Is this a problem?"

baseline_1 = chat(
    system="You are a helpful assistant.",
    user=QUESTION,
)

role_specific = chat(
    system=(
        "You are a senior application security researcher specializing in "
        "web authentication vulnerabilities. You think in terms of attack "
        "surface, threat models, and OWASP guidelines."
    ),
    user=QUESTION,
)

divider("Baseline")
print(baseline_1)
divider("Role-specific (security researcher)")
print(role_specific)
Notice the difference? Both answers correctly identify the Cross-Site Scripting (XSS) risk. But the framing is completely different. The baseline response talks about localStorage as a choice with tradeoffs. The "security researcher" response immediately frames it as an attack surface. It doesn't just say "XSS is bad"; it explains what an attacker does once they achieve XSS.
That’s the magic. We didn't give it any new facts. We just changed its perspective, and in doing so, we got a much more valuable, context-aware answer.
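If you end up writing a lot of these personas, it can help to build them from the same three ingredients the example uses: a job title, a specialty, and the frameworks that role thinks with. Here's a minimal sketch of that idea. The `build_role_prompt` helper and its parameter names are my own invention for illustration, not something from a library.

```python
def _join_with_and(items: list[str]) -> str:
    """Join items as natural-language text: 'a', 'a and b', or 'a, b, and c'."""
    if len(items) == 1:
        return items[0]
    if len(items) == 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + ", and " + items[-1]


def build_role_prompt(title: str, specialty: str, lenses: list[str]) -> str:
    """Compose a role-specific system prompt from a title, a specialty, and the role's mental models."""
    if not lenses:
        raise ValueError("give the role at least one lens to think with")
    return (
        f"You are a {title} specializing in {specialty}. "
        f"You think in terms of {_join_with_and(lenses)}."
    )


security_prompt = build_role_prompt(
    "senior application security researcher",
    "web authentication vulnerabilities",
    ["attack surface", "threat models", "OWASP guidelines"],
)
print(security_prompt)
```

This reproduces the system prompt from the example above, which makes it easy to swap in a different role and rerun the same question.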
2. Tell It What Not to Do: Cleaning Up with Negative Prompting
LLMs are trained to be helpful and conversational. This means they love to add friendly intros ("Great question!"), hedge their bets ("Well, it depends..."), and wrap everything up in a neat little summary. That’s great for a chatbot, but for technical documentation or an API response? It’s just noise.
Negative prompting is my favorite way to cut through the fluff. Instead of only describing what you want, you also explicitly list what you don't want. This aggressively narrows the model's output space, forcing it to be more direct and concise.
Here, we'll ask it to explain a database index.
section("TECHNIQUE 2 -- Negative Prompting")

TOPIC = "Explain what a database index is and when you'd use one."

baseline_2 = chat(
    system="You are a helpful assistant.",
    user=TOPIC,
)

negative = chat(
    system=(
        "You are a senior backend engineer writing internal documentation.\n"
        "Rules:\n"
        "- Do NOT use marketing language or filler phrases like 'great question' or 'certainly'.\n"
        "- Do NOT include caveats like 'it depends' without immediately resolving them.\n"
        "- Do NOT use analogies unless they are necessary. If you use one, keep it to one sentence.\n"
        "- Do NOT pad the response -- if you've made the point, stop.\n"
    ),
    user=TOPIC,
)

divider("Baseline")
print(baseline_2)
divider("With negative prompting")
print(negative)
The difference is night and day. The baseline response is a mini-essay with analogies and headers. The negatively prompted version gives you the same core information but in a format that's ready to be dropped straight into your team's Confluence page. Nothing important was lost, but all the unnecessary conversational padding is gone.
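Negative rules are instructions, not guarantees, so it's worth checking the output against them. The sketch below is a deliberately simple string check, assuming a short list of banned phrases taken from the rules above. `find_banned_phrases` and the sample strings are hypothetical, and a plain phrase match can't tell whether an "it depends" was resolved, so treat a hit as a flag for review rather than a verdict.

```python
import re
from typing import Sequence

# Illustrative list based on the phrases named in the rules above.
BANNED_PHRASES = ("great question", "certainly", "it depends")


def find_banned_phrases(text: str, banned: Sequence[str] = BANNED_PHRASES) -> list[str]:
    """Return the banned phrases that appear in `text` (case-insensitive, whole phrases only)."""
    hits = []
    for phrase in banned:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append(phrase)
    return hits


# Hypothetical outputs, standing in for real model responses.
sample_chatty = "Great question! Certainly, an index speeds up lookups."
sample_terse = "An index is a sorted structure that speeds up lookups on a column."

print(find_banned_phrases(sample_chatty))
print(find_banned_phrases(sample_terse))
```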
3. Speak the Language of Code: Forcing Structured JSON Output
This one is absolutely critical anytime an LLM's output is meant for another machine, not a human. Free-form text is a nightmare to parse reliably. The structure changes, key details are buried in paragraphs, and a tiny change in wording can break your code.
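By spelling out the expected JSON shape right in the prompt, you make the output structure an explicit requirement instead of something the model improvises. Because this lives entirely at the prompt level, it's a strong constraint rather than a guarantee, so you should still validate what comes back. This is more than just formatting; it pushes the model to organize its reasoning into distinct, predictable fields.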
Let's try parsing a product review.
section("TECHNIQUE 3 -- JSON Prompting")

REVIEW = """
Honestly mixed feelings about this laptop. The display is stunning --
easily the best I've seen at this price range -- and the keyboard is
surprisingly comfortable for long sessions. Battery life, on the other hand,
barely gets me through a 6-hour workday, which is disappointing. Fan noise
under load is also pretty aggressive. For light work it's great, but I
wouldn't recommend it for anyone who needs to run heavy software.
"""

SCHEMA = """
{
  "overall_sentiment": "positive | negative | mixed",
  "rating": <integer 1-5>,
  "pros": ["<string>", ...],
  "cons": ["<string>", ...],
  "recommended_for": "<string describing ideal user>",
  "not_recommended_for": "<string describing user who should avoid>"
}
"""

baseline_3 = chat(
    system="You are a helpful assistant.",
    user=f"Summarize this product review:\n\n{REVIEW}",
)

json_output = chat(
    system=(
        "You are a product review parser. Extract structured information from reviews.\n"
        "You MUST return only a valid JSON object. No preamble, no explanation, no markdown fences.\n"
        f"The JSON must match this schema exactly:\n{SCHEMA}"
    ),
    user=f"Parse this review:\n\n{REVIEW}",
)

divider("Baseline (free-form)")
print(baseline_3)
divider("JSON prompting (raw output)")
print(json_output)
divider("Parsed & usable in code")
parsed = json.loads(json_output)
print(f"Sentiment : {parsed['overall_sentiment']}")
print(f"Rating : {parsed['rating']}/5")
print(f"Pros : {', '.join(parsed['pros'])}")
print(f"Cons : {', '.join(parsed['cons'])}")
print(f"Recommended for : {parsed['recommended_for']}")
print(f"Avoid if : {parsed['not_recommended_for']}")
See that? The baseline is a readable paragraph, but you'd need some fragile regex to pull out the pros and cons. The JSON version, however, is immediately usable. You can load it directly into a variable and start working with the data. Information that was just implied in the text is now explicit, structured, and ready for your database.
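One caveat: calling `json.loads` directly on the raw output will raise if the model wraps the JSON in a code fence or drops a field. Here's a hedged sketch of a more defensive parser. The helper names (`strip_code_fence`, `parse_review`) and the sample output are my own assumptions; the field names and value ranges mirror the schema template in the prompt.

```python
import json

# Field names and types mirror the schema template used in the prompt.
REQUIRED_FIELDS = {
    "overall_sentiment": str,
    "rating": int,
    "pros": list,
    "cons": list,
    "recommended_for": str,
    "not_recommended_for": str,
}
ALLOWED_SENTIMENTS = {"positive", "negative", "mixed"}
FENCE = "`" * 3  # a Markdown code fence marker


def strip_code_fence(text: str) -> str:
    """Drop a surrounding Markdown code fence if the model added one anyway."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
    return "\n".join(lines).strip()


def parse_review(raw: str) -> dict:
    """Parse and validate one review; raises ValueError on any problem."""
    data = json.loads(strip_code_fence(raw))  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        value = data[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"wrong type for field: {field}")
    if data["overall_sentiment"] not in ALLOWED_SENTIMENTS:
        raise ValueError("overall_sentiment must be positive, negative, or mixed")
    if not 1 <= data["rating"] <= 5:
        raise ValueError("rating must be between 1 and 5")
    return data


# Hypothetical model output that ignored the "no markdown fences" rule.
sample = FENCE + "json\n" + json.dumps({
    "overall_sentiment": "mixed",
    "rating": 3,
    "pros": ["display", "keyboard"],
    "cons": ["battery life", "fan noise"],
    "recommended_for": "light work",
    "not_recommended_for": "heavy software users",
}) + "\n" + FENCE

review = parse_review(sample)
print(review["rating"], review["pros"])
```

Since the `chat` wrapper forwards keyword arguments, you can also pass `response_format={"type": "json_object"}` to turn on the API's JSON mode. As far as I know that only helps with syntactically valid JSON, not with field names or value ranges, so a validation step like this one is still useful.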
4. Don't Just Think, Follow the Checklist: A Smarter Chain-of-Thought
You've probably heard of Chain-of-Thought (CoT) prompting, where you ask the model to "think step by step." It's a great start, but it has a weakness: the model decides what steps to take, which can lead it down irrelevant paths or cause it to miss critical details.
Attentive Reasoning Queries (ARQ) are like a supercharged version of CoT. Instead of a vague instruction to "think," you give the model a mandatory checklist of questions it must answer in order. This puts you, the developer, in control of its reasoning process. It's the difference between telling a pilot to "fly safely" and handing them a pre-flight checklist.
Let's use it to review a piece of code.
section("TECHNIQUE 4 -- Attentive Reasoning Queries (ARQ)")

CODE_TO_REVIEW = """
def get_user(user_id):
    query = f"SELECT * FROM users WHERE id = {user_id}"
    result = db.execute(query)
    return result[0] if result else None
"""

ARQ_QUESTIONS = """
Before giving your final review, answer each of the following questions in order:
Q1 [Security]: Does this code have any injection vulnerabilities? If yes, describe the exact attack vector.
Q2 [Error handling]: What happens if db.execute() throws an exception? Is that acceptable?
Q3 [Performance]: Does this query retrieve more data than necessary? What is the cost at scale?
Q4 [Correctness]: Are there edge cases in the return logic that could cause a silent bug downstream?
Q5 [Fix]: Write a corrected version of the function that addresses all issues found above.
"""

baseline_cot = chat(
    system="You are a senior software engineer. Think step by step.",
    user=f"Review this Python function:\n\n{CODE_TO_REVIEW}",
)

arq_result = chat(
    system="You are a senior software engineer conducting a security-aware code review.",
    user=f"Review this Python function:\n\n{CODE_TO_REVIEW}\n\n{ARQ_QUESTIONS}",
)

divider("Baseline (free CoT)")
print(baseline_cot)
divider("ARQ (structured reasoning checklist)")
print(arq_result)
The standard CoT response is pretty good: it catches the main SQL injection flaw. But the ARQ version is more disciplined. It systematically covers security, error handling, performance, and correctness because we forced it to. Each question acts as a checkpoint, making the final review more complete, structured, and easy to audit.
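Because the checklist is just numbered text, you can generate it from data and then check that the response touched every item. This is a rough sketch under one big assumption: that the model labels its answers with the same `Q1`, `Q2`, ... tags, which you may need to ask for explicitly. `build_arq_block`, `unanswered`, and the sample response are hypothetical.

```python
import re


def build_arq_block(questions: list[tuple[str, str]]) -> str:
    """Render (tag, question) pairs as a numbered checklist to be answered in order."""
    lines = ["Before giving your final review, answer each of the following questions in order:"]
    for number, (tag, question) in enumerate(questions, start=1):
        lines.append(f"Q{number} [{tag}]: {question}")
    return "\n".join(lines)


def unanswered(response: str, count: int) -> list[str]:
    """Return the labels Q1..Q<count> that never appear in the response."""
    return [
        f"Q{n}"
        for n in range(1, count + 1)
        if not re.search(rf"\bQ{n}\b", response)
    ]


questions = [
    ("Security", "Does this code have any injection vulnerabilities?"),
    ("Error handling", "What happens if db.execute() throws an exception?"),
    ("Performance", "Does this query retrieve more data than necessary?"),
    ("Correctness", "Are there edge cases in the return logic?"),
]
arq_block = build_arq_block(questions)
print(arq_block)

# Hypothetical response that skipped the performance question.
sample_response = (
    "Q1: Yes, user_id is interpolated into the SQL string.\n"
    "Q2: The exception propagates to the caller.\n"
    "Q4: An empty result returns None."
)
print(unanswered(sample_response, len(questions)))
```

A missing label is a cheap signal to retry or flag the review. It says nothing about whether the answers that are present are any good.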
5. Embrace Uncertainty: Getting Multiple Hypotheses, Not One Confident Answer
This one tackles a huge, fundamental problem with LLMs: they tend to sound confident even when they are guessing. Alignment training encourages them to give a single, decisive answer, which means they often hide their own uncertainty from you.
Verbalized sampling flips this on its head. You explicitly ask the model to generate multiple possible interpretations or hypotheses, complete with confidence scores and supporting evidence. Instead of one black-box answer, you get a ranked list of possibilities.
Let's try classifying a customer support ticket.
section("TECHNIQUE 5 -- Verbalized Sampling")

SUPPORT_TICKET = """
Hi, I set up my account last week but I can't log in anymore.
I tried resetting my password but the email never arrives. I also
tried a different browser. Nothing works.
"""

baseline_5 = chat(
    system="You are a support ticket classifier. Classify the issue.",
    user=f"Ticket:\n{SUPPORT_TICKET}",
)

verbalized = chat(
    system=(
        "You are a support ticket classifier.\n"
        "For each ticket, generate 3 distinct hypotheses about the root cause. "
        "For each hypothesis:\n"
        " - State the category (Authentication, Email Delivery, Account State, Browser/Client, Other)\n"
        " - Describe the specific failure mode\n"
        " - Assign a confidence score from 0.0 to 1.0\n"
        " - State what additional information would confirm or rule it out\n\n"
        "Order hypotheses by confidence (highest first). "
        "Then provide a recommended first action for the support agent."
    ),
    user=f"Ticket:\n{SUPPORT_TICKET}",
)

divider("Baseline (single answer)")
print(baseline_5)
divider("Verbalized sampling (multiple hypotheses + confidence)")
print(verbalized)
The baseline gives us a single label: "Password Reset Issue." Okay, but what if that's wrong? The verbalized version is a powerful diagnostic tool. It offers multiple plausible causes, ranks them by the model's own stated confidence, and tells the support agent exactly what to ask to confirm or rule out each one. It turns a single, possibly wrong, classification into an actionable decision-making aid.
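To use those hypotheses in code, you can combine this with the JSON technique from earlier and ask for the list as a JSON array. The sketch below assumes exactly that: the key names (`category`, `failure_mode`, `confidence`), the sample output, and the `rank_hypotheses` and `route` helpers are all hypothetical. The confidence values are self-reported by the model, so I'd treat them as a rough ranking signal rather than calibrated probabilities.

```python
import json

# Hypothetical model output, assuming the prompt also asked for a JSON array.
SAMPLE_OUTPUT = """
[
  {"category": "Account State", "failure_mode": "Account was locked or never verified", "confidence": 0.3},
  {"category": "Email Delivery", "failure_mode": "Reset emails are being filtered or bounced", "confidence": 0.5},
  {"category": "Authentication", "failure_mode": "Login uses a different email than signup", "confidence": 0.2}
]
"""


def rank_hypotheses(raw: str) -> list[dict]:
    """Parse the hypotheses, validate each confidence, and sort highest first."""
    hypotheses = json.loads(raw)
    if not isinstance(hypotheses, list):
        raise ValueError("expected a JSON array of hypotheses")
    for hypothesis in hypotheses:
        confidence = hypothesis.get("confidence")
        if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
            raise ValueError("confidence must be a number")
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")
    return sorted(hypotheses, key=lambda h: h["confidence"], reverse=True)


def route(ranked: list[dict], auto_threshold: float = 0.7) -> str:
    """Auto-assign only when the top hypothesis is confident enough; otherwise ask a human."""
    if not ranked:
        return "human_review"
    top = ranked[0]
    if top["confidence"] >= auto_threshold:
        return f"auto:{top['category']}"
    return "human_review"


ranked = rank_hypotheses(SAMPLE_OUTPUT)
for hypothesis in ranked:
    print(f"{hypothesis['confidence']:.1f}  {hypothesis['category']}: {hypothesis['failure_mode']}")
print(route(ranked))
```

The threshold is a design choice you'd tune against real tickets. The point is that a ranked list lets your code decide when the model's top guess isn't strong enough to act on alone.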
Moving from Art to Engineering
The big takeaway here is that prompting doesn't have to be a dark art. By applying these kinds of systematic techniques, we can move from hoping for a good result to engineering a reliable one. Each of these tools gives you more control, more predictability, and ultimately, a more robust system.
So next time you find yourself tweaking a prompt over and over, take a step back and ask: Am I just wishing for a better answer, or am I building a better process? The answer to that question can make all the difference.




