Alibaba has launched Qwen3-ASR-Flash, a speech recognition model built on the Qwen3-Omni framework and trained on tens of millions of hours of voice data. The company says the model is designed to stay accurate in noisy conditions and with complex language patterns.
In performance tests conducted in August 2025, the model posted lower error rates than its rivals. For standard Chinese, Qwen3-ASR-Flash achieved an error rate of 3.97 percent, compared with 8.98 percent for Gemini-2.5-Pro and 15.72 percent for GPT4o-Transcribe. On accented Chinese it scored 3.48 percent, and in English it recorded 3.81 percent, against Gemini’s 7.63 percent and GPT4o’s 8.45 percent.
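The article does not say how these error rates were computed. Speech recognition results are commonly reported as word error rate (WER) for English and character error rate (CER) for Chinese: the edit distance between the reference transcript and the model output, divided by the reference length. The sketch below shows that calculation under that assumption; the function names `edit_distance` and `error_rate` are illustrative and not taken from any Alibaba tooling.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref to each hyp prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (token missing from hyp)
                curr[j - 1] + 1,     # insertion (extra token in hyp)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="word"):
    """Edit distance divided by reference length.

    unit="word" splits on whitespace (WER);
    unit="char" compares non-space characters (CER).
    """
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    elif unit == "char":
        ref = [c for c in reference if not c.isspace()]
        hyp = [c for c in hypothesis if not c.isspace()]
    else:
        raise ValueError("unit must be 'word' or 'char'")
    if not ref:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words, giving 2/6. Published benchmarks usually normalize text (case, punctuation, number formats) before scoring, which this sketch omits.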
The widest gap comes in music transcription. The model posted an error rate of 4.51 percent when recognizing song lyrics, a task notoriously difficult for speech models. In full-song tests, it delivered a 9.96 percent error rate, compared with Gemini’s 32.79 percent and GPT4o’s 58.59 percent, a margin that could open up new creative applications for AI transcription.
Beyond accuracy, Qwen3-ASR-Flash offers flexible contextual biasing. Users can supply background text in any format, whether keyword lists, long documents, or mixed notes, and the model adapts without requiring complex preprocessing. Relevant context helps it refine its transcription, and the model maintains strong baseline performance even if the supplied text is irrelevant.
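Because the context is described as free-form text, preparing it can be as simple as joining whatever material is on hand. The sketch below is only an illustration of that idea: `build_context` is a hypothetical helper, and how the resulting string is actually passed to the model depends on the service's API, which the article does not describe.

```python
def build_context(keywords=(), documents=(), notes=""):
    """Merge mixed background material into one free-form text blob.

    Hypothetical helper for illustration; it does not call any real API.
    """
    parts = []
    if keywords:
        # A keyword list becomes a single comma-separated line.
        parts.append(", ".join(keywords))
    # Longer documents are included as-is, minus surrounding whitespace.
    parts.extend(doc.strip() for doc in documents if doc.strip())
    if notes.strip():
        parts.append(notes.strip())
    return "\n".join(parts)


context = build_context(
    keywords=["Qwen3-ASR-Flash", "Minnan", "Sichuanese"],
    documents=["Agenda: quarterly review of dialect coverage."],
    notes="Speaker names may include Li Wei and Chen Jing.",
)
```

No deduplication, formatting, or filtering is applied here, in keeping with the article's claim that the model does not require complex preprocessing.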
The model supports 11 languages and multiple dialects. Its Chinese coverage spans Mandarin and dialects such as Cantonese, Sichuanese, Minnan, and Wu. For English, it handles British, American, and other regional accents. The remaining nine languages are French, German, Spanish, Italian, Portuguese, Russian, Japanese, Korean, and Arabic. The model can also automatically detect which language is being spoken and filter out silence and background noise for cleaner output.
With its combination of accuracy, flexibility, and wide language coverage, Qwen3-ASR-Flash positions Alibaba as a strong contender in the competitive AI transcription market.