The technical reality: LLMs aren’t as brilliant as they seem!

Beyond the marketing hype about AI’s spectacular advances, large language models (LLMs) have significant technical limitations that are rarely mentioned in corporate presentations.

Hallucinations: the persistent Achilles’ heel

One of the most serious problems is that LLMs can “hallucinate” information with complete confidence. These systems invent data, quotes, or facts that sound perfectly credible but are completely false. A Vectara study found that the most accurate models, GPT-4 and GPT-4 Turbo, hallucinate about 3% of the time when summarizing texts, while other models reach error rates as high as 27%.

In customer service, this has real and costly consequences. In February 2024, a Canadian tribunal ordered Air Canada to compensate a customer after its chatbot fabricated a bereavement-fare policy that didn’t exist. The bot confidently claimed that customers could request retroactive discounts up to 90 days after ticket issuance, which contradicted the company’s actual policy.

Other notable cases include DPD, a European logistics company, which had to disable part of its chatbot after it started insulting customers and describing the company as “the worst delivery service in the world.” Virgin Money was forced to apologize after its chatbot reprimanded a user for using the word “virgin.” And Cursor, an American tech startup, had to contain the damage when its support chatbot informed customers of a radical change to its usage policy that was entirely fictitious.

The paradox of advanced reasoning models

Paradoxically, more advanced reasoning models, which use “chain of thought” approaches to break complex problems into smaller steps, appear to hallucinate more often than conventional LLMs, according to Vectara’s analysis. OpenAI acknowledged in a performance report on its latest reasoning models that o1 hallucinated 16% of the time when answering questions about public figures, while its newer models o3 and o4-mini hallucinated 33% and 48% of the time, respectively.
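To make the “chain of thought” idea concrete, here is a minimal sketch in Python. The `ask_llm` function, the prompt wording, and the example question are our own illustrative assumptions, not the API of any provider mentioned above.

```python
# A minimal sketch contrasting a direct prompt with a "chain of thought" prompt.
# `ask_llm` is a hypothetical stand-in for your LLM client, not a real API.

def ask_llm(prompt: str) -> str:
    """Placeholder: replace the body with a call to your LLM provider."""
    print(f"--- prompt sent ---\n{prompt}\n")
    return "(model answer would appear here)"

question = (
    "A customer bought 3 items at $19.99 each with a 15% discount. "
    "How much do they owe?"
)

# Direct prompt: the model must produce the answer in a single step.
direct_answer = ask_llm(question)

# Chain-of-thought prompt: the model is asked to decompose the problem
# into intermediate steps before committing to a final answer.
cot_answer = ask_llm(
    "Solve this step by step: list each intermediate calculation, "
    "then state the final amount on its own line.\n\n" + question
)
```

The paradox reported above is that the second style, despite producing more intermediate reasoning, does not necessarily produce fewer fabrications.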

Basic mathematics and logical reasoning

Ironically, while companies market these systems as “superintelligences,” LLMs struggle with tasks any elementary school student could solve. Basic mathematical reasoning remains a weak point, which is problematic when customers ask about discounts, warranty dates, or cost calculations.
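One common safeguard, shown as a sketch below, is to keep arithmetic out of the model entirely and compute it in deterministic code. This is our own illustration, not a method from the sources cited above; the function name and parameters are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

def discounted_total(unit_price: str, quantity: int, discount_pct: str) -> Decimal:
    """Compute an order total deterministically, so the LLM never does the math.

    Prices are passed as strings and handled with Decimal to avoid
    binary floating-point rounding surprises.
    """
    price = Decimal(unit_price)
    discount = Decimal(discount_pct) / Decimal("100")
    total = price * quantity * (Decimal("1") - discount)
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Example: 3 items at $19.99 each with a 15% discount.
print(discounted_total("19.99", 3, "15"))  # 50.97
```

In a customer-service bot built this way, the model’s role shrinks to extracting `unit_price`, `quantity`, and `discount_pct` from the conversation; the figure the customer sees always comes from the deterministic function.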

How can we manage this risk and have complete confidence in our AI-powered tools?

We have identified precautions to take and methods to follow to make the most of AI’s capabilities, in customer service as in every area that handles critical information, and we will share them in the last of the five articles we are publishing on this subject.

Stay tuned!