AI2’s Molmo AI Models Outperform GPT-4 and Claude on Key Benchmarks

Today, the Allen Institute for AI (AI2) unveiled Molmo, an open-source, cutting-edge multimodal AI model. On several benchmarks, it surpasses top competitors like OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5, offering advanced image analysis capabilities comparable to proprietary models.

AI2’s Molmo (Multimodal Open Language Model) is not a conventional chatbot like ChatGPT or Gemini; it’s a powerful visual understanding engine. Notably, it offers no API or website for user queries, and it isn’t designed for enterprise integration.

Instead, Molmo excels at image comprehension, analyzing visuals and providing insightful answers to user queries. You can experience its abilities in a public demo.

Why is Molmo outperforming GPT-4 and Claude?

Molmo comes in four versions: Molmo-72B (based on Alibaba Cloud’s Qwen2-72B), Molmo-7B-D (the demo model, based on Qwen2-7B), Molmo-7B-O (based on AI2’s OLMo-7B), and MolmoE-1B (a mixture-of-experts model). It excels at visual tasks like identifying vegan menu options or explaining a coffee maker, but what truly sets Molmo apart is its efficiency in performing these tasks.

These models outperform many proprietary alternatives on third-party benchmarks and are available under the Apache 2.0 license for research and commercialization. Molmo-72B leads with top scores on 11 key benchmarks and ranks second in user preference, just behind GPT-4o.

Vaibhav Srivastav from Hugging Face praised Molmo on X, calling it a strong alternative to closed systems.

Ted Xiao from Google DeepMind highlighted Molmo’s use of pointing data as a game-changer for robotics, improving visual grounding and interaction with physical environments—an area where most multimodal models fall short.

These high-performing, open-source models give researchers and developers access to cutting-edge technology. Molmo also breaks the “bigger is better” trend in AI by using far less data of much higher quality: instead of billions of images, AI2 trained Molmo on just 600,000 carefully annotated ones.

This smaller, smarter model performs on par with GPT-4o and Claude 3.5 Sonnet despite being roughly a tenth their size. Molmo’s standout feature is its ability to “point” at relevant parts of images, enabling precise zero-shot answers to tasks like counting objects or navigating web interfaces.

Additionally, because Molmo is completely free and open source, available on Hugging Face, and small enough to run locally, it empowers developers to create AI-driven experiences without relying on big tech.
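
For developers who want to try that locally, the sketch below loads a Molmo checkpoint through Hugging Face Transformers and asks a question about an image. Treat it as a minimal sketch rather than official usage: the model identifier (allenai/Molmo-7B-D-0924), the processor.process call, and generate_from_batch are assumptions drawn from typical Molmo model-card examples, so check the actual Hugging Face model card before relying on them.

```python
# Minimal sketch of running a Molmo checkpoint locally via Hugging Face Transformers.
# The identifier and the custom generation helpers below are assumptions based on
# the model card (hence trust_remote_code=True); verify against the card itself.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

MODEL_ID = "allenai/Molmo-7B-D-0924"  # assumed identifier; verify on Hugging Face

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Ask a question about an image (here, one fetched from the web).
image = Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)
inputs = processor.process(images=[image], text="How many objects are on the table?")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

# generate_from_batch comes from Molmo's remote code per the model card (assumed here).
output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
answer = processor.tokenizer.decode(
    output[0, inputs["input_ids"].size(1):], skip_special_tokens=True
)
print(answer)
```

On consumer hardware, the 7B and 1B variants are the realistic choices for local inference; the 72B model generally needs multi-GPU or server-class memory.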

Molmo is also making its mark on efficiency: AI2 announced on X that Molmo uses “1000x less data” than its proprietary competitors, thanks to innovative training techniques detailed in a recent technical report.

On the benchmarks themselves, Molmo-72B takes the lead with top scores, achieving 96.3 on DocVQA and 85.5 on TextVQA and outperforming Gemini 1.5 Pro and Claude 3.5 Sonnet. It also surpasses GPT-4o on AI2D and excels at visual grounding on RealWorldQA, showcasing its strength in robotics and complex multimodal tasks.

AI2 CEO Ali Farhadi:

“There are a dozen different benchmarks that people evaluate on. I don’t like this game, scientifically… but I had to show people a number. Our biggest model is a small model, 72B, it’s outperforming GPTs and Claudes and Geminis on those benchmarks. Again, take it with a grain of salt; does this mean that this is better than them or not? I don’t know. But at least to us, it means that this is playing the same game.”

This release highlights AI2’s dedication to open research by providing high-performing models with open weights and data for the community and businesses seeking customizable solutions. It follows the recent launch of OLMoE, a cost-effective “mixture of experts” model.

Molmo’s Architecture 

Molmo’s architecture is all about efficiency and performance. It utilizes OpenAI’s ViT-L/14 336px CLIP model as its vision encoder, transforming multi-scale, multi-crop images into vision tokens. These tokens are projected into the language model’s input space through a multi-layer perceptron (MLP) connector that also reduces their dimensionality.

At its core, Molmo features a decoder-only Transformer, drawing from the OLMo, Qwen2, and Mistral series, each tailored for varying capacities and openness.
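
To make that data flow concrete, here is a schematic of the vision-encoder → MLP connector → decoder-only LLM pattern described above. It is an illustrative PyTorch sketch with made-up dimensions and toy modules, not AI2’s implementation.

```python
# Schematic of the pipeline described above: a CLIP-style ViT produces patch tokens,
# an MLP connector maps them into the language model's embedding space, and a
# decoder-only Transformer consumes the combined vision + text sequence.
# Dimensions and module choices are illustrative only.
import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    """Projects vision tokens (e.g. 1024-d ViT features) to the LLM hidden size."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens):          # (batch, num_patches, vision_dim)
        return self.proj(vision_tokens)        # (batch, num_patches, llm_dim)

class ToyMultimodalLM(nn.Module):
    """Vision encoder -> MLP connector -> decoder-only LM, as described above."""
    def __init__(self, vision_encoder, language_model, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder   # stand-in for a CLIP ViT-L/14 backbone
        self.connector = MLPConnector(vision_dim, llm_dim)
        self.language_model = language_model   # stand-in for a Qwen2/OLMo-style decoder

    def forward(self, images, text_embeddings):
        vision_tokens = self.vision_encoder(images)       # patch features
        vision_embeds = self.connector(vision_tokens)     # mapped to LLM space
        # Vision embeddings are placed ahead of the text embeddings before decoding.
        combined = torch.cat([vision_embeds, text_embeddings], dim=1)
        return self.language_model(combined)
```

The connector is the only genuinely new glue; the encoder and decoder are reused, pre-existing components, which is part of why the recipe is data-efficient.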

Its training process is a two-step powerhouse:

  1. Multimodal Pre-training: Models learn to generate captions using high-quality, detailed image descriptions from human annotators, forming a robust dataset called PixMo.
  2. Supervised Fine-Tuning: Here, models refine their skills on a mix of standard benchmarks and innovative datasets, preparing them for real-world tasks like document reading and visual reasoning.

What sets Molmo apart? Unlike many models today, it skips reinforcement learning from human feedback (RLHF), instead relying on a meticulously crafted training pipeline that optimally tunes all model parameters. The result? A highly capable AI ready to tackle complex challenges!
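
For intuition, the sketch below compresses that recipe into two plain supervised stages with every parameter unfrozen. The toy model, random "datasets," and hyperparameters are placeholders for illustration, not AI2’s training configuration.

```python
# Runnable toy sketch of the two-stage, RLHF-free recipe described above:
# both stages are ordinary supervised training with all parameters trainable.
import torch
import torch.nn as nn

toy_model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))

def make_toy_loader(num_batches):
    # Stand-in for PixMo captions (stage 1) or the fine-tuning mixture (stage 2).
    return [(torch.randn(8, 16), torch.randn(8, 16)) for _ in range(num_batches)]

def train_stage(model, dataloader, lr):
    # All parameters stay trainable in both stages; no frozen parts, no RLHF step.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()  # stand-in for the next-token cross-entropy loss
    model.train()
    for inputs, targets in dataloader:
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Stage 1: multimodal pre-training on dense, human-written captions (PixMo).
train_stage(toy_model, make_toy_loader(100), lr=1e-4)
# Stage 2: supervised fine-tuning on benchmarks plus task datasets
# (document reading, visual reasoning, pointing, and so on).
train_stage(toy_model, make_toy_loader(100), lr=5e-5)
```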

In a rapidly evolving AI landscape, Molmo stands out as a strong alternative, challenging the dominance of big players. Its robust capabilities and open-source nature empower developers to innovate without heavy costs, signaling that true progress often lies beyond exclusivity. The future of AI may well depend on how we embrace this potential.
