Cohere has introduced its latest AI model, Aya Vision, a multimodal system designed for visual and text-based tasks such as captioning images, answering questions about photos, translating text, and generating summaries, across 23 languages.
The company aims to close the gap in AI performance across languages, particularly in multimodal applications that combine text and images. Aya Vision comes in two versions: Aya Vision 32B and Aya Vision 8B. The more powerful Aya Vision 32B surpasses larger models such as Meta's Llama 3.2 90B Vision on certain benchmarks, while Aya Vision 8B outperforms models up to ten times its size.
Cohere is making Aya Vision freely available via WhatsApp and the AI development platform Hugging Face under a Creative Commons Attribution Non-Commercial 4.0 (CC BY-NC 4.0) license, which prohibits commercial use. The model was trained on a diverse pool of English datasets that Cohere translated and paired with synthetic, AI-generated annotations to help the model interpret the data during training.
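For readers who want to experiment with the released weights, the sketch below shows one plausible way to query the 8B checkpoint from Hugging Face using the `transformers` library. The repository ID, the example image URL, and the French prompt are illustrative assumptions rather than Cohere's documented interface; consult the official model card for the exact usage and the license terms.

```python
# Minimal sketch (assumptions flagged inline): querying Aya Vision 8B via
# Hugging Face's `transformers` library.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "CohereForAI/aya-vision-8b"  # assumed repository name

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype=torch.float16,
)

# A multimodal, multilingual prompt: ask for a one-sentence caption in French,
# one of the 23 supported languages. The image URL is a placeholder.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},
            {"type": "text", "text": "Décris cette image en une phrase."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
# Strip the prompt tokens and decode only the newly generated continuation.
answer = processor.tokenizer.decode(
    generated[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```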
This approach aligns with a growing industry trend toward synthetic data as real-world training datasets become scarcer. Major AI players, including OpenAI, have adopted similar methods; Gartner estimates that synthetic data accounted for 60% of AI training data last year.
Aya Vision represents a notable advance in multimodal AI, delivering strong visual-understanding performance from comparatively small, efficient models. While AI-generated annotations raise concerns about potential biases and inaccuracies, Cohere is positioning Aya Vision as a tool to make frontier research more accessible. The company's decision to release the model under an open, non-commercial license further supports its stated goal of democratizing AI development and making technical breakthroughs more widely available.