I accidentally benchmarked multiple AI models while trying to fix a comic speech bubble

When a Small Fix Turns into an Unexpected Benchmark: Exploring AI Model Performance in Creative Tasks

In AI-assisted creative work, even a seemingly simple task can show what current models can and cannot do. During a routine edit of an AI-generated comic, I ended up running an informal benchmark of several prominent AI models. The failure modes and stubborn behaviors I ran into say a lot about where these models are strong and where they still have gaps.

The Task: A Minor Adjustment with Unexpected Complexity

The task was straightforward: rotate the tail of a speech bubble in panel 3 so that it points toward the speaking character on the left. This is a common editing step, and it seems like it should be trivial for models with visual and contextual understanding.
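
The pass/fail criterion for this task can be stated precisely. Here is a minimal sketch, assuming pixel x-coordinates for the bubble center, the tail tip, and the speaker are known. The function name and the sample numbers are hypothetical and not taken from any tool I used:

```python
def tail_points_at_speaker(bubble_center_x: float,
                           tail_tip_x: float,
                           speaker_x: float) -> bool:
    """Return True when the tail tip and the speaker are on the same
    side of the bubble's vertical center line.

    Both offsets are measured from the bubble center; their product is
    positive only when they share a sign (both left or both right).
    A tail tip exactly on the center line counts as a failure.
    """
    tail_offset = tail_tip_x - bubble_center_x
    speaker_offset = speaker_x - bubble_center_x
    return tail_offset * speaker_offset > 0


# Hypothetical panel: bubble centered at x=200, speaker on the left at x=50.
wanted = tail_points_at_speaker(200, tail_tip_x=120, speaker_x=50)  # tail on the left
got = tail_points_at_speaker(200, tail_tip_x=280, speaker_x=50)     # tail on the right
```

With these made-up coordinates, the first call describes the edit I asked for and the second describes a tail that still points to the RIGHT.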

The Models Tested

I experimented with the following AI models:

Grok
Qwen
Gemini Pro
GPT-5.4
GPT-5.3
GPT-5.2

The list covers four model families (Grok, Qwen, Gemini, and GPT), including three GPT versions, which makes for a broad if informal comparison on a creative editing task.

The Surprising Outcome

Despite the simplicity of the instruction, every model stubbornly rendered the speech bubble tail pointing to the RIGHT, the opposite of the requested direction. The models appeared to default to a habitual or “safe” solution and ignore the specific prompt.

Notable Failure Modes

The experiment revealed several interesting failure patterns:

Duplication of the tail: Qwen duplicated the tail instead of repositioning it.
Incorrect panel editing: Grok applied the change to the wrong panel, which suggests it misread which panel the instruction referred to.
Prolonged deliberation: The extended version of GPT-5.4 “thought” for nearly 30 minutes and still did not produce the correct adjustment.
Inability to change direction: Gemini Pro kept pointing the tail to the RIGHT no matter how clearly the instruction was worded.

These behaviors highlight that many AI models, despite their sophistication, can struggle with nuanced visual prompts, especially when spatial reasoning or directional adjustments are involved.

The Final Solution

Faced with these challenges, I opted for a manual fix. Opening the comic in Photoshop, I resized the panel and removed the tail altogether, leaving the speech bubble without a tail in panel 3. This simple edit sidestepped AI limitations and preserved the visual clarity of the comic.
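
For comparison, flipping the tail is a one-line reflection when the tail exists as editable geometry. This is a hypothetical sketch, not what I did in Photoshop. It assumes the tail is available as a list of polygon vertices, which may not be true for a flattened raster comic, and the coordinates are invented for illustration:

```python
from typing import List, Tuple

Point = Tuple[float, float]


def mirror_tail(tail: List[Point], bubble_center_x: float) -> List[Point]:
    """Reflect each tail vertex across the vertical line x = bubble_center_x.

    A vertex d pixels to the right of the center ends up d pixels to the
    left of it; y-coordinates are unchanged.
    """
    return [(2 * bubble_center_x - x, y) for x, y in tail]


# Hypothetical tail attached to the bubble's lower edge, pointing right.
tail_right = [(220.0, 150.0), (250.0, 150.0), (290.0, 200.0)]
tail_left = mirror_tail(tail_right, bubble_center_x=200.0)
# tail_left == [(180.0, 150.0), (150.0, 150.0), (110.0, 200.0)]
```

Mirroring twice returns the original polygon, which makes for a quick sanity check on the transform.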

Reflection: What This Tells Us About AI Creativity and Understanding

This experience points to a broader pattern: modern AI models, even those with impressive language and image generation capabilities, can falter on specific, detail-oriented tasks. Tasks that require precise spatial orientation, contextual understanding, or nuanced interpretation can remain beyond their reach, at least with current training and architectures.

While AI continues to improve rapidly, this experiment is a reminder that manual interventions remain a valuable part of creative workflows, especially when accuracy and nuance are critical.

Conclusion

In AI-assisted artistic editing, even the simplest adjustments can reveal complex limitations. For AI developers and users alike, recognizing these failure modes clarifies what the tools can do, leads to more effective workflows, and shows where improvement is needed. For now, a quick manual fix is sometimes the best solution, a small but important lesson about how human creativity and artificial intelligence work together.
