Wrote a small tool to compare how different prompts perform across GPT and Claude; some results were surprising

Enhancing Prompt Evaluation: Developing a Tool to Compare AI Model Performance Across Multiple Prompts

Evaluating how prompt wording influences model output remains a real challenge. Rephrasing prompts by hand and comparing the results is time-consuming and inconsistent. To address this, I developed a Python-based tool that runs a set of prompt formulations against multiple AI models and compares the results systematically.

Purpose and Functionality

The core idea was to create a flexible, efficient way to test different prompts against several AI models, analyze the outputs objectively, and visualize the results conveniently. The tool allows users to define a set of prompt variations and specify which models to evaluate, streamlining the process and reducing manual effort.

Implementation Details

The tool uses a YAML configuration file where users specify the task, input data, models, prompts, scoring criteria, and other parameters. Here’s an illustrative example of how such a configuration might look:

```yaml
task: code_review
input: |
  def get_user_data(user_id):
      conn = sqlite3.connect("users.db")
      cursor = conn.cursor()
      query = f"SELECT * FROM users WHERE id = {user_id}"
      cursor.execute(query)
      result = cursor.fetchone()
      return result
models:
  - openai/gpt-5-mini
  - anthropic/claude-sonnet-4
prompts:
  - "Review this code and list any bugs or security issues:"
  - "What's wrong with this code?"
  - "Improve this code and explain your changes:"
scoring:
  criteria: [correctness, thoroughness, clarity]
  judge_models: [openai/gpt-5-mini, anthropic/claude-sonnet-4]
  exclude_self_judge: true
```
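To make the configuration concrete, here is a rough sketch of how it could expand into individual runs, one per model and prompt pair. This is not taken from the tool's source: the `build_runs` helper, the dictionary layout, and the way the input is appended to the prompt are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical in-memory form of the configuration (what a YAML loader
# might return); only the fields needed for the run matrix are shown.
config = {
    "task": "code_review",
    "models": ["openai/gpt-5-mini", "anthropic/claude-sonnet-4"],
    "prompts": [
        "Review this code and list any bugs or security issues:",
        "What's wrong with this code?",
        "Improve this code and explain your changes:",
    ],
}


def build_runs(cfg, code_input):
    """Return one run per (model, prompt) pair, with the input appended to the prompt."""
    return [
        {"model": model, "prompt": prompt, "text": f"{prompt}\n\n{code_input}"}
        for model, prompt in product(cfg["models"], cfg["prompts"])
    ]


# Placeholder input; the real config carries the full function under `input`.
runs = build_runs(config, "def get_user_data(user_id): ...")
print(len(runs))  # 2 models x 3 prompts = 6 runs
```

Every prompt is sent to every model, so the number of API calls grows as models times prompts, before any judge calls are counted.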

Scoring Mechanism

The evaluation combines rule-based checks and AI-generated scores. An automatic judge, implemented via a secondary model, rates outputs on predefined criteria using a 1-10 scale. Additionally, rule-based checks evaluate aspects such as output length, structural coherence, repetition, and formatting adherence. These scores are then aggregated into a composite score that facilitates straightforward comparison.
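A minimal sketch of how such a composite score could be computed is shown below. The specific checks, the word-count bounds, and the 70/30 weighting are assumptions chosen for illustration; the tool's actual rules and weights may differ.

```python
from statistics import mean


def rule_score(output, min_words=30, max_words=400):
    """Hypothetical rule-based checks, each pass/fail, averaged and scaled to 0-10."""
    words = output.split()
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    checks = [
        min_words <= len(words) <= max_words,                   # length within bounds
        len(set(lines)) == len(lines),                          # no line repeated verbatim
        any(ln.startswith(("-", "*", "1.")) for ln in lines),   # some list structure
    ]
    return 10 * sum(checks) / len(checks)


def composite_score(judge_scores, output, ai_weight=0.7):
    """Blend the mean judge score (1-10 per criterion) with the rule-based score."""
    ai = mean(judge_scores.values())
    return ai_weight * ai + (1 - ai_weight) * rule_score(output)


sample_output = (
    "- SQL injection via f-string\n"
    "- connection never closed\n"
    + " ".join(["word"] * 30)
)
judge_scores = {"correctness": 8, "thoroughness": 7, "clarity": 9}
print(round(composite_score(judge_scores, sample_output), 2))
```

Keeping the rule-based part separate means it can still produce a score when AI scoring is switched off.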

Observed Insights

During testing with a code review scenario involving three distinct prompts across two models, some notable patterns emerged:

The vague prompt “What’s wrong with this code?” consistently received lower scores than more specific prompts.
Specific prompts like “Review this code and list any bugs or security issues” prompted models to perform a more thorough analysis—identifying SQL injection vulnerabilities, missing connection closures, and redundant queries.
Both models identified SQL injection risks regardless of prompt, but the more detailed prompts elicited deeper examinations.

Interestingly, when models were allowed to judge their own outputs, they tended to give themselves inflated scores. Excluding self-judgment (the exclude_self_judge option in the configuration) made the scores more objective, which makes it a useful feature to have built into the tool.
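The self-judging exclusion can be sketched in a few lines. The `judges_for` function is a hypothetical helper, not the tool's actual code; it only illustrates the idea of removing the candidate model from its own judge list.

```python
def judges_for(candidate, judge_models, exclude_self_judge=True):
    """Return the judges allowed to score `candidate`'s output."""
    if not exclude_self_judge:
        return list(judge_models)
    # Drop the candidate itself so it never scores its own answer.
    return [judge for judge in judge_models if judge != candidate]


judge_models = ["openai/gpt-5-mini", "anthropic/claude-sonnet-4"]
for model in judge_models:
    print(model, "is judged by", judges_for(model, judge_models))
```

With the two models from the example configuration, each output ends up scored only by the other model. If a model is the sole judge listed, the filtered list is empty, so a real implementation would need a fallback such as rule-based scoring only.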

Additional Features

AI scoring is optional: a command-line flag (--no-ai-scoring) turns it off for faster, rule-based-only evaluations.
Model sets can be overridden dynamically with a command-line argument (--models).
Exporting results as JSON enables further analysis or visualization.
Compatibility with any OpenAI-compatible API makes the tool versatile. In my setup, I use ZenMux, an aggregator platform that offers access to over 100 models with a single API key, simplifying large-scale testing.

Repository and Future Work

The project is open-source and hosted on GitHub: superzane477/prompt-tuner. The current focus involves extending evaluation to translation prompts, investigating whether the observed trend—that specific prompts outperform casual, vague ones—applies across different tasks.

Conclusion

Automating prompt comparison not only accelerates experimentation but also provides clearer insights into prompt design strategies. By leveraging such tools, AI practitioners can better understand how prompt phrasing influences model behavior and output quality, ultimately leading to more effective AI applications.

Interested in exploring or customizing this tool? Check out the repository and contribute to the ongoing development of smarter prompt evaluation methods.
