Below, each pattern is explored with concrete examples and explanations for why it works.
1. Few-Shot Examples: Show the Model What Good Looks Like
A few carefully crafted code examples can steer the LLM away from fabricating APIs and towards real, working code. This is the single most effective “priming” technique for code correctness.
# Prompt: Write a Python function to compute the nth Fibonacci number.
# Examples:
# Input: 5
# Output: 5
# Input: 7
# Output: 13
def fibonacci(n):
    # Model fills in the correct logic based on the example patterns
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)
Why it matters: By seeding the prompt with real code and expected outputs, you anchor the model in valid syntax and discourage creative "gap-filling."
What is a few-shot example? In prompt engineering, "few-shot" refers to giving the model a few demonstrations of the desired task before asking it to perform the task itself. These examples show the model the format, logic, and expectations, reducing the chance it will invent new, unsupported code.
For instance, if you only provide the model with "Write a function to calculate Fibonacci numbers," it might invent new approaches or even reference non-existent libraries. By explicitly showing valid inputs and outputs, you focus its generation on what is correct and expected.
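If you build prompts programmatically, the same idea is easy to automate. Below is a minimal sketch; build_few_shot_prompt and the example pairs are illustrative helpers, not part of any library.

# A minimal sketch of assembling a few-shot prompt from worked examples.
# `build_few_shot_prompt` is a hypothetical helper, not a library function.
EXAMPLES = [("5", "5"), ("7", "13")]

def build_few_shot_prompt(task, examples):
    """Prepend input/output demonstrations so the model imitates them."""
    demos = [f"# Input: {i}\n# Output: {o}" for i, o in examples]
    return f"# Prompt: {task}\n# Examples:\n" + "\n".join(demos)

prompt = build_few_shot_prompt(
    "Write a Python function to compute the nth Fibonacci number.",
    EXAMPLES,
)
print(prompt)  # reproduces the prompt shown above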
Collaboration on prompt design: Clear examples help steer LLMs away from fiction.
2. Constraint Specification: Box in the Output
Tell the model what APIs, data structures, or libraries are allowed—and what to avoid. Constraints can be as simple as "use only standard library" or as strict as "no external imports, only use sorted()."
# Prompt:
# Implement a Python function that sorts a list of integers.
# Constraints:
# - Only use built-in list.sort() or sorted()
# - Do not import any external libraries
def sort_numbers(nums):
    nums.sort()
    return nums
Why it matters: Constraints cut the "surface area" for hallucination, blocking the model from inventing unsupported modules or calls.
What does constraint specification mean in prompting? It means specifying rules and boundaries for the model's output. By clearly stating what is allowed, you prevent the model from defaulting to potentially incorrect or imaginary solutions.
For example, if you ask for a sorting algorithm but don’t mention library restrictions, the model may import a non-existent module or use an unfamiliar pattern. Providing constraints like “do not import any external libraries” ensures it sticks to reliable, documented methods such as list.sort() or sorted().
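Constraints can also be enforced mechanically after generation. The sketch below uses the standard-library ast module to reject output that imports anything outside an allowlist; violates_constraints is an illustrative helper, not an established API.

import ast

ALLOWED_IMPORTS = set()  # the constraint here: no imports at all

def violates_constraints(generated_code):
    """Return the names of any imports that break the 'no external imports' rule."""
    problems = []
    for node in ast.walk(ast.parse(generated_code)):
        if isinstance(node, ast.Import):
            problems += [a.name for a in node.names if a.name not in ALLOWED_IMPORTS]
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module not in ALLOWED_IMPORTS:
                problems.append(node.module)
    return problems

print(violates_constraints("import numpy\nsorted([3, 1, 2])"))  # ['numpy']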
3. Test-Driven Prompting: Embed Tests in the Prompt
Give the model the tests its code must pass. This approach, sometimes called "test-driven prompting," improves correctness by making the model check its own logic against concrete inputs and outputs.
# Prompt:
# Write a function add(a, b) that returns the sum of two numbers.
# The following tests must pass:
# assert add(2, 3) == 5
# assert add(-1, 1) == 0
def add(a, b):
    return a + b
Why it matters: Because the output must satisfy concrete assertions, the model is less likely to hallucinate logic or call non-existent functions.
What is test-driven prompting? Test-driven prompting means including test cases directly in your prompt. This gives the model not only a goal, but also a method for checking correctness. It’s similar to test-driven development (TDD) in software engineering, where you write tests before the code.
By embedding tests, you help the model focus on producing code that is both syntactically correct and logically valid for the scenarios you care about. For instance, asking the model to “write a function such that assert add(2, 3) == 5 passes” forces it to check its logic, not just its syntax.
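Embedding tests also makes the output machine-checkable. Here is a minimal sketch of that check, where generated stands in for the model's raw output; in practice, execute untrusted code only inside a sandbox.

# Re-run the prompt's own assertions against the generated code.
generated = """
def add(a, b):
    return a + b
"""

tests = [
    "assert add(2, 3) == 5",
    "assert add(-1, 1) == 0",
]

namespace = {}
exec(generated, namespace)   # define the generated function
for test in tests:
    exec(test, namespace)    # each embedded assertion must pass
print("All embedded tests passed.")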
Test-driven prompting: Embedding test cases in your prompt raises output reliability.
4. Chain-of-Thought and Self-Verification: Make the Model Think Out Loud
Explicitly ask the model to reason step-by-step before producing code, or to review its answer for errors and confidence level.
# Prompt:
# Let's write a function to reverse a singly linked list.
# Step 1: How do we traverse the list?
# Step 2: How do we reverse the pointers?
# Step 3: How do we handle an empty list?
# Now, generate the code step-by-step, explaining each step.
# [Model output includes reasoning steps, then the code.]
Why it matters: When the model must explain its logic or self-critique, hallucinated leaps are more likely to be caught and corrected.
What is chain-of-thought prompting? Chain-of-thought (CoT) prompting encourages the model to articulate its reasoning process step by step, rather than jumping straight to an answer. Self-verification takes this further by asking the model to review and critique its own answer.
For example, if the task is to reverse a linked list, asking for a step-by-step approach helps ensure the model doesn't skip over details or make unwarranted assumptions. By verbalizing each part of the process, the model is more likely to catch potential mistakes and avoid fabricating unsupported logic.
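The pattern is straightforward to wrap in a two-pass loop. In the sketch below, call_llm is a hypothetical placeholder for whatever client you actually use.

# A generate-then-verify loop. `call_llm` is a hypothetical placeholder;
# wire it to your real LLM client.
def call_llm(prompt):
    raise NotImplementedError("connect this to your LLM client")

def generate_with_verification(task):
    draft = call_llm(f"{task}\nThink step by step, then write the code.")
    return call_llm(
        "Review the following answer. Flag any API call or logic step you "
        "are not confident is correct, then return a fixed version:\n" + draft
    )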
Together, these patterns offer a practical toolkit for reducing hallucinations. The next section looks at what happens when these strategies are not applied, illustrating common failure modes in code generation.
Failure Modes: Real Examples of Hallucinated APIs and Imports
Even with strong prompts, code LLMs can produce:
Nonexistent imports: import mystical_api
Fake endpoints: response = requests.get('https://api.fake.com/user/42')
Imagined utility functions: user = get_user_data_by_id(id) (not in any public package)
Unsupported parameters: json.dump(data, file, indent=4, sort_keys=True, encoding='utf-8') (encoding is not a valid parameter here)
What are these failure modes? Failure modes are typical mistakes or incorrect outputs that LLMs produce when generating code, such as referencing APIs or modules that do not exist, or using functions in unsupported ways.
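The unsupported-parameter case is easy to confirm: in Python 3, json.dump forwards unknown keyword arguments to JSONEncoder, which rejects encoding outright.

import io
import json

data = {"b": 1, "a": 2}
buf = io.StringIO()
try:
    json.dump(data, buf, indent=4, sort_keys=True, encoding="utf-8")
except TypeError as exc:
    print(exc)  # e.g. "__init__() got an unexpected keyword argument 'encoding'"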
Here's a real prompt and typical hallucinated output:
# Prompt:
# Write a Python function to fetch user info from an API endpoint.
# Use only the `requests` library.
# Hallucinated output:
import user_api  # <--- FAKE MODULE

def get_user(user_id):
    response = user_api.fetch(f"https://api.example.com/user/{user_id}")  # <--- FAKE FUNCTION
    return response.json()
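A cheap guard against fabricated imports is to confirm that each module resolves in the current environment before running anything. The sketch below uses only the standard library; all_imports_exist is an illustrative helper, and extracting the names from generated code is left to an ast pass like the earlier one.

import importlib.util

def all_imports_exist(module_names):
    """True only if every named module resolves in this environment."""
    return all(
        importlib.util.find_spec(name.split(".")[0]) is not None
        for name in module_names
    )

print(all_imports_exist(["requests"]))  # True if requests is installed
print(all_imports_exist(["user_api"]))  # False: the module is fabricated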
In production, these errors can:
Trigger runtime exceptions (crash the script)
Slip past review if code "looks plausible"
Production Safeguards: Grounding, Temperature, and Iteration
Retrieval-augmented context: Provide up-to-date docs, codebases, or API references directly in the prompt (see InfoWorld, 2023). Retrieval-augmented generation (RAG) helps ground the model in real, current information.
Model temperature tuning: For factual code, use a lower temperature (0.3–0.5) to reduce creative drift. The "temperature" parameter controls the randomness of model outputs—lower values result in more predictable, fact-based responses.
Iterative prompting: Force the model to critique and improve its own output, e.g., "Review your answer and flag any claim you are less than 90% confident in." A combined sketch follows below.
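The last two safeguards combine naturally. The sketch below assumes the OpenAI Python SDK's v1 chat-completions interface; the model name is a placeholder, and any client that exposes a temperature parameter works the same way.

# Low temperature plus an iterative critique pass. Assumes the OpenAI Python
# SDK (v1 client) with OPENAI_API_KEY set; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

def ask(prompt):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,      # lower temperature reduces creative drift
    )
    return resp.choices[0].message.content

draft = ask("Write a Python function that merges two sorted lists.")
final = ask(
    "Review the following answer and flag any claim you are less than 90% "
    "confident in, then return a corrected version:\n" + draft
)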
Key Takeaways:
Combining these patterns yields the lowest hallucination rates; layering verification further locks down reliability.
Hallucinations can't be eliminated, but the right prompts and production safeguards can push them down to single-digit rates.
Further Reading & References
For more developer-centric guides, see our coverage of Python decorator patterns and Python vs Go performance benchmarks.