I got tired of “it works on my machine” being the entire QA process for my voice agent. So I built Decibench.

Addressing the Limitations of Traditional QA in Voice Agent Development: The Launch of Decibench

The rapid advancement of voice AI technology has led to a highly competitive landscape, where companies race to deploy increasingly sophisticated voice agents. With infrastructure options like Vapi, Retell, LiveKit, and raw WebRTC making real-time communication more accessible than ever, developers face a new challenge: ensuring consistent quality and performance across iterations.

The Problem with Conventional Testing Approaches

Despite this progress, many teams still rely on manual testing to validate their voice agents. Ask how they verify a change, and the answers tend to be:
– “We call it manually.”
– “We have a dedicated tester.”
– “We noticed issues in production.”

These approaches do catch some issues, but often too late: regressions surface only after deployment, where they are more expensive to fix and more likely to frustrate users. This “it works on my machine” mentality offers no systematic, repeatable way to track quality over time.

The Invisible Pitfalls in Voice Agent Testing

The complexity of voice interaction models presents unique challenges:
– A minor prompt adjustment might unexpectedly disrupt intent recognition.
– Optimizations aimed at reducing latency could inadvertently produce more terse or unnatural responses.
– Traditional testing frameworks lack mechanisms tailored specifically for voice AI, leaving gaps in detecting regressions early.

Without dedicated testing frameworks, teams risk deploying subtle bugs that degrade user experience over time.
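To make the first pitfall concrete, here is a minimal sketch of an intent-regression check. It assumes you already have the intent label the agent produced for a fixed set of test utterances, once before and once after a prompt change. The function names (`accuracy`, `intent_regressions`) and the sample utterances are hypothetical and are not part of Decibench's API. The idea is simply that a fixed utterance set plus a before/after comparison catches intent flips that a handful of manual calls would likely miss.

```python
# Hypothetical sketch, not Decibench's API. Each dict maps an utterance
# to an intent label: either the expected label or the label the agent produced.


def accuracy(expected, predicted):
    """Fraction of utterances whose predicted intent matches the expected one."""
    if not expected:
        return 0.0
    hits = sum(1 for utt, want in expected.items() if predicted.get(utt) == want)
    return hits / len(expected)


def intent_regressions(expected, baseline, candidate):
    """Utterances the baseline labeled correctly but the candidate now gets wrong."""
    return sorted(
        utt
        for utt, want in expected.items()
        if baseline.get(utt) == want and candidate.get(utt) != want
    )


if __name__ == "__main__":
    expected = {
        "I want to reschedule my appointment": "reschedule",
        "Cancel my booking": "cancel",
        "What time do you open?": "hours",
    }
    # Labels recorded before and after a (hypothetical) prompt tweak.
    before = dict(expected)
    after = dict(expected, **{"I want to reschedule my appointment": "cancel"})
    print(accuracy(expected, before), accuracy(expected, after))
    print(intent_regressions(expected, before, after))
```

Reporting the specific flipped utterances, not just the accuracy number, tells you which phrasing the prompt change broke.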

Introducing Decibench: A Solution for Voice AI Benchmarking

To address these challenges, I built Decibench, an open-source benchmarking framework designed specifically for voice AI agents. Its goal is to help teams move beyond manual, ad-hoc testing to a continuous, automated evaluation process.

Key Features of Decibench

Open-Source & Community-Driven: Licensed under Apache-2.0, ensuring no vendor lock-in or licensing costs.
Flexible and Extensible: Designed to integrate seamlessly with existing development workflows.
Scenario-Based Evaluation: Allows definition of realistic user call scenarios to assess agent performance.
Regression Detection: Automatically identifies regressions in intent recognition, response quality, and latency.
Easy Integration: Supports importing call data, defining diverse test scenarios, and running evaluations with minimal setup.
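As an illustration of scenario-based evaluation, the sketch below assumes a scenario is a small record of user turns, an expected intent, a per-turn latency budget, and phrases the agent must mention. The `Scenario`, `CallResult`, and `evaluate` names and their fields are hypothetical, chosen for this example only, and do not describe Decibench's actual scenario format.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """Hypothetical scenario schema for illustration; not Decibench's format."""
    name: str
    user_turns: list[str]
    expected_intent: str
    max_latency_ms: float  # latency budget for each agent turn
    must_mention: list[str] = field(default_factory=list)


@dataclass
class CallResult:
    """What one simulated or imported call produced (hypothetical shape)."""
    detected_intent: str
    turn_latencies_ms: list[float]
    agent_text: str


def evaluate(scenario: Scenario, result: CallResult) -> dict[str, bool]:
    """Score one call against one scenario; every check is pass/fail."""
    text = result.agent_text.lower()
    return {
        "intent": result.detected_intent == scenario.expected_intent,
        "latency": all(ms <= scenario.max_latency_ms for ms in result.turn_latencies_ms),
        "content": all(phrase.lower() in text for phrase in scenario.must_mention),
    }


if __name__ == "__main__":
    reschedule = Scenario(
        name="reschedule-appointment",
        user_turns=["Hi, I need to move my appointment to Friday."],
        expected_intent="reschedule",
        max_latency_ms=800,
        must_mention=["Friday"],
    )
    result = CallResult("reschedule", [620.0, 710.0], "Sure, I can move it to Friday at 10am.")
    print(evaluate(reschedule, result))
```

Keeping each check pass/fail makes a failing scenario easy to read in a report; a real tool might prefer graded scores for response quality.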

Current Status and Future Roadmap

Decibench v0.1.0 is now available. It is still early and has some rough edges, but the core workflow works end to end: import calls, define scenarios, run evaluations, and detect regressions before they reach users.
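The last step of that workflow, regression detection, can be sketched as comparing summary metrics from a baseline run against a candidate run. This is a minimal sketch, assuming each call has already been scored for intent correctness, end-to-end latency in milliseconds, and response length in words; the dict keys and tolerance values are hypothetical placeholders, not Decibench defaults. Tracking response length alongside latency is what catches the "faster but terser" regression mentioned in the pitfalls above.

```python
from statistics import mean

# Hypothetical tolerances; tune them for your own agent.
MAX_ACCURACY_DROP = 0.02     # absolute drop allowed in intent accuracy
MAX_P95_LATENCY_RISE = 0.10  # relative rise allowed in p95 latency
MAX_WORDS_DROP = 0.25        # relative drop allowed in mean response length


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def summarize(run):
    """Aggregate per-call dicts with 'intent_ok', 'latency_ms', and 'words' keys."""
    if not run:
        raise ValueError("run must contain at least one call")
    return {
        "intent_accuracy": mean(1.0 if call["intent_ok"] else 0.0 for call in run),
        "p95_latency_ms": p95([call["latency_ms"] for call in run]),
        "mean_words": mean(call["words"] for call in run),
    }


def find_regressions(baseline, candidate):
    """Return (metric, baseline_value, candidate_value) for each metric that regressed."""
    b, c = summarize(baseline), summarize(candidate)
    problems = []
    if b["intent_accuracy"] - c["intent_accuracy"] > MAX_ACCURACY_DROP:
        problems.append(("intent_accuracy", b["intent_accuracy"], c["intent_accuracy"]))
    if c["p95_latency_ms"] > b["p95_latency_ms"] * (1 + MAX_P95_LATENCY_RISE):
        problems.append(("p95_latency_ms", b["p95_latency_ms"], c["p95_latency_ms"]))
    if c["mean_words"] < b["mean_words"] * (1 - MAX_WORDS_DROP):
        problems.append(("mean_words", b["mean_words"], c["mean_words"]))
    return problems


if __name__ == "__main__":
    baseline = [
        {"intent_ok": True, "latency_ms": 600, "words": 30},
        {"intent_ok": True, "latency_ms": 650, "words": 28},
        {"intent_ok": True, "latency_ms": 700, "words": 35},
        {"intent_ok": True, "latency_ms": 720, "words": 32},
    ]
    # A "latency optimization": faster, but shorter answers and one wrong intent.
    candidate = [
        {"intent_ok": True, "latency_ms": 400, "words": 12},
        {"intent_ok": False, "latency_ms": 420, "words": 14},
        {"intent_ok": True, "latency_ms": 450, "words": 10},
        {"intent_ok": True, "latency_ms": 480, "words": 15},
    ]
    for metric, before, after in find_regressions(baseline, candidate):
        print(f"{metric}: {before} -> {after}")
```

A CI job could fail the build whenever `find_regressions` returns a non-empty list, which turns "we noticed issues in production" into a failing check before merge.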

Looking ahead, v1 is planned to bring richer scenario definitions, more detailed reporting, and integrations with popular CI/CD pipelines. The goal is to make Decibench a standard part of the workflow for teams committed to keeping their voice agents high quality.

Join the Conversation

I believe in building tools alongside the community. If you’re developing voice AI agents and have insights into what constitutes effective testing, I’d love to hear from you. What are your biggest pain points? How do you currently validate your models? Your feedback will shape the future of Decibench.

Get Involved

Decibench is available on GitHub: https://github.com/unforkopensource-org/decibench. I encourage developers, researchers, and product teams to explore, contribute, and collaborate.

Conclusion

Automated, scenario-based testing is essential for sustaining quality in voice AI development. With Decibench, I want to help teams move from unreliable manual validation to a robust, repeatable process, and ultimately to better user experiences and more reliable voice agents.

Interested in discussions or collaborations? Reach out and let’s shape the future of voice AI testing together.
