Meta’s experimental AI model, Llama 4 Maverick, has fallen behind its competitors on LM Arena, a chat benchmark that ranks models based on human-rated conversations. Earlier this week, Meta faced criticism for using an unreleased, experimental version of Maverick to achieve high scores on the leaderboard. In response, the LM Arena team revised its policies and scored the unmodified, publicly released version of Maverick, which has not fared well against rivals such as OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5 Pro.
As of April 11, 2025, the unmodified release version, “Llama-4-Maverick-17B-128E-Instruct,” ranked below these established models. The experimental variant that had topped the leaderboard was specifically optimized for conversational interactions, which gave it an advantage with LM Arena’s human raters; the standard release version, designed for broader applications, did not fare as well in the same evaluation.
Meta acknowledged the situation, explaining that “Llama-4-Maverick-03-26-Experimental” had been optimized for conversational performance, which also helped it score well on LM Arena. The company emphasized that it regularly experiments with different variants of its models to gather feedback and improve performance. Meta has now released the open-source version of Llama 4 and looks forward to seeing how developers customize it for diverse use cases.
Despite the lower ranking, Meta remains committed to refining Llama 4 and exploring how it can be adapted to meet different demands in AI development.