Microsoft promises to fix AI hallucinations with its new service, “Correction”. The tool flags potentially erroneous AI-generated text, such as a misattributed quote from a quarterly earnings call, and fact-checks it against reliable sources, such as uploaded transcripts. While concerns remain, the tool shows real potential to make AI-generated output more reliable.
Correction is part of Microsoft’s Azure AI Content Safety API (currently in preview) and can be used with text-generating AI models such as Meta’s Llama and OpenAI’s GPT-4o.
Microsoft:
“Correction is powered by a new process of utilizing small language models and large language models to align outputs with grounding documents. We hope this new feature supports builders and users of generative AI in fields such as medicine, where application developers determine the accuracy of responses to be of significant importance.”
Although Microsoft has addressed a major pain point with “Correction”, experts remain skeptical, with one warning that “it’s not addressing the root cause of hallucination”.
Os Keyes, a PhD candidate at the University of Washington studying the ethical impact of emerging technologies, argues that removing hallucinations from generative AI is like trying to remove hydrogen from water—it’s an intrinsic part of its nature, making complete elimination unlikely.
Text-generating models often hallucinate because they don’t actually “know” anything; they analyze patterns in words and predict what comes next based on their training data. This results in inaccuracies, as shown by a study finding that ChatGPT gets medical questions wrong about half the time.
To combat this, Microsoft has created a two-part solution. The first model identifies potentially incorrect or fabricated text, while the second model corrects these hallucinations using specified “grounding documents” for greater accuracy.
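For developers curious what this two-step flow might look like in practice, below is a minimal Python sketch against the groundedness detection endpoint of the Azure AI Content Safety API, where Correction surfaces. The endpoint path, API version, payload fields (such as `groundingSources` and `correction`), and response keys (`ungroundedDetected`, `correctionText`) are assumptions drawn from the preview documentation and may differ in the shipped service; depending on configuration, the correction step may also require a connected Azure OpenAI resource.

```python
import os
import requests

# Assumed environment setup for an Azure AI Content Safety resource.
ENDPOINT = os.environ["CONTENT_SAFETY_ENDPOINT"]  # e.g. https://<resource>.cognitiveservices.azure.com
API_KEY = os.environ["CONTENT_SAFETY_KEY"]


def check_and_correct(llm_output: str, grounding_documents: list[str]) -> dict:
    """Step 1: flag ungrounded claims in llm_output; step 2: ask the service to
    rewrite them so they agree with the supplied grounding documents."""
    # NOTE: path, API version, and field names follow the public preview docs
    # as best recalled and are assumptions; check the current documentation.
    url = f"{ENDPOINT}/contentsafety/text:detectGroundedness?api-version=2024-09-15-preview"
    payload = {
        "domain": "Generic",
        "task": "Summarization",
        "text": llm_output,                      # the AI-generated text to verify
        "groundingSources": grounding_documents,  # e.g. an uploaded transcript
        "correction": True,                      # also return a corrected rewrite
    }
    resp = requests.post(
        url,
        json=payload,
        headers={"Ocp-Apim-Subscription-Key": API_KEY},
    )
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    result = check_and_correct(
        "Revenue grew 40% year over year, the CFO said.",
        ["Transcript: ... revenue grew 14% year over year, the CFO noted ..."],
    )
    if result.get("ungroundedDetected"):
        print("Flagged ungrounded text; corrected version:")
        print(result.get("correctionText"))
```

Note that the grounding documents play the role of the uploaded transcript mentioned above: the service checks and rewrites text only against the sources the developer supplies, not against the open web.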
Microsoft Spokesperson:
“Correction can significantly enhance the reliability and trustworthiness of AI-generated content by helping application developers reduce user dissatisfaction and potential reputational risks. It is important to note that groundedness detection does not solve for ‘accuracy,’ but helps to align generative AI outputs with grounding documents.”
Keyes, however, wasn’t satisfied with that framing:
“It might reduce some problems,” they said, “But it’s also going to generate new ones. After all, Correction’s hallucination detection library is also presumably capable of hallucinating.”
In response to a detailed question about the Correction model, a spokesperson referenced a paper describing its pre-production architectures. However, the paper lacked critical details, such as which datasets were used to train the models.
In addition to Keyes, Mike Cook, an AI researcher at Queen Mary University, argued that even if the models work as Microsoft claims, they could foster a false sense of trust in AI models among users, with potentially harmful consequences.
Cook:
“Microsoft, like OpenAI and Google, have created this issue where models are being relied upon in scenarios where they are frequently wrong. What Microsoft is doing now is repeating the mistake at a higher level. Let’s say this takes us from 90% safety to 99% safety — the issue was never really in that 9%. It’s always going to be in the 1% of mistakes we’re not yet detecting.”
Cook also pointed out a cynical business angle to Microsoft’s Correction feature: the tool is free on its own, but the “groundedness detection” needed to flag hallucinations in the first place is free only up to 5,000 text records per month.
After that, it costs 38 cents per 1,000 text records. And with Microsoft spending nearly $19 billion on AI this past quarter, the pressure is on to demonstrate value to customers and shareholders.
At the same time, concerns persist about Microsoft’s long-term AI strategy, as highlighted by a recent stock downgrade from a Wall Street analyst.
The Information reports that many early adopters have discontinued deployments of Microsoft 365 Copilot over performance and cost concerns.
For instance, one client found that the AI created fictitious meeting attendees and misrepresented discussion topics during Microsoft Teams calls. Accuracy and the risk of hallucinations have become major concerns for businesses piloting AI tools, according to a KPMG poll.
Cook:
“If this were a normal product lifecycle, generative AI would still be in academic R&D, and being worked on to improve it and understand its strengths and weaknesses. Instead, we’ve deployed it into a dozen industries. Microsoft and others have loaded everyone onto their exciting new rocket ship, and are deciding to build the landing gear and the parachutes while on the way to their destination.”