
Alibaba Releases Qwen3-Omni, an Open-Source Omni-Modal AI Model
October 21, 2025
editorial_staff
Alibaba has launched Qwen3-Omni, an open-source large language model it calls the first end-to-end “omni-modal” system that natively takes text, images, audio, and video as inputs and produces text and speech as outputs. Unlike closed models such as OpenAI’s GPT-4o and Google’s Gemini 2.5 Pro, Qwen3-Omni is free to download, modify, and deploy under the Apache 2.0 license, making it suitable for commercial use. Google’s open Gemma 3n also accepts video and audio but only outputs text, while Qwen3-Omni adds natural speech.
Qwen3-Omni arrives in three versions. Instruct pairs a “Thinker” for reasoning with a “Talker” for speech, enabling real-time voice replies from audio, video, or text inputs. Thinking focuses on deep reasoning and long answers, with text-only output. Captioner is tuned for accurate audio captioning. All three are available now on Hugging Face and GitHub, and Alibaba’s API offers a faster Flash variant.
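For developers, access through Alibaba’s hosted API typically follows the OpenAI-compatible pattern the company uses for its Model Studio/DashScope service. The sketch below is illustrative only: the endpoint URL, the model identifier qwen3-omni-flash, and the environment variable name are assumptions based on Alibaba’s existing DashScope conventions, not details from the announcement.

```python
# Minimal sketch: calling the hosted Flash variant through an
# OpenAI-compatible endpoint. The base_url, model name, and env var
# are assumptions based on Alibaba's DashScope conventions.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],  # assumed variable name
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# Stream the reply so the first tokens arrive quickly, matching the
# low first-packet latencies the article cites.
stream = client.chat.completions.create(
    model="qwen3-omni-flash",  # assumed model identifier
    messages=[{"role": "user", "content": "Summarize Qwen3-Omni in one sentence."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

Multimodal inputs (images, audio, video) and speech output would be passed through the same chat-completions interface; the exact request fields depend on Alibaba’s documentation for the omni models.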
The Thinker–Talker design keeps responses quick and lets safety or retrieval tools review content before speech is generated. Alibaba cites first-packet latencies of about 234 milliseconds for audio and 547 milliseconds for video. The model supports 119 languages in text, 19 for speech input, and 10 for speech output, and context windows reach 65,536 tokens in Thinking mode. New users get a 90-day, one-million-token free quota. API pricing is usage-based, per 1,000 tokens: text input $0.00025, image/video input $0.00046, text output $0.00096 (or $0.00178 when inputs include images or audio), and audio output $0.00876, with the accompanying text output free whenever audio output is selected.
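To make the usage-based rates concrete, here is a small worked example applying the per-1,000-token prices quoted above; the request sizes are hypothetical.

```python
# Worked example of the per-1,000-token rates quoted above.
# Request sizes are hypothetical; rates are USD per 1,000 tokens.
RATES = {
    "text_in": 0.00025,
    "image_video_in": 0.00046,
    "text_out": 0.00096,             # when inputs are text-only
    "text_out_multimodal": 0.00178,  # when inputs include images or audio
    "audio_out": 0.00876,            # text alongside audio output is free
}

def cost(tokens: int, rate: float) -> float:
    """Cost in USD for a token count at a per-1,000-token rate."""
    return tokens / 1000 * rate

# A hypothetical request: 2,000 text tokens and 1,500 video tokens in,
# 500 tokens of spoken reply out (the accompanying text is free).
total = (
    cost(2000, RATES["text_in"])
    + cost(1500, RATES["image_video_in"])
    + cost(500, RATES["audio_out"])
)
print(f"${total:.5f}")  # $0.00557
```

Even a multimodal request with a spoken reply comes to a fraction of a cent, which is the point of the per-token pricing model.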
Training combined large-scale pre-training and post-training. A 0.6-billion-parameter Audio Transformer was trained on 20 million hours of audio, and the LLM was pre-trained on roughly two trillion tokens spanning text, audio, images, and video, then refined with supervised tuning, distillation, and other post-training steps to reduce errors and improve speech quality.
Across 36 public benchmarks, Alibaba reports state-of-the-art results on 22 and leadership among open models on 32, with strong scores in math and logic (AIME25 65.0), vision and video (MLVU 75.2), and word-error rates on speech tasks lower than GPT-4o’s. Alibaba highlights uses ranging from multilingual transcription and translation to OCR, music tagging, video understanding, and live troubleshooting via phone or webcam. With open licensing, real-time performance, and flexible tooling, Qwen3-Omni aims to speed enterprise adoption of multimodal AI.