AI Struggles with Historical Knowledge, Study Reveals

Artificial Intelligence (AI) has shown prowess in areas such as coding and podcast creation, but a recent study has uncovered its limitations when it comes to tackling complex historical questions.

Researchers have developed a novel benchmark, Hist-LLM, to evaluate how three leading large language models (LLMs) handle historical queries: OpenAI’s GPT-4, Meta’s Llama, and Google’s Gemini. The benchmark checks the accuracy of their answers against the Seshat Global History Databank, a vast repository of historical information named after the ancient Egyptian goddess of wisdom.
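To make the evaluation setup concrete, here is a minimal sketch of how a benchmark like Hist-LLM might score a model's answers against ground-truth records. The data format, the `HistoricalFact` class, and the `ask_llm` callable are illustrative assumptions, not the study's actual code or the Seshat schema.

```python
# A minimal sketch (not the authors' code) of benchmark-style accuracy scoring:
# compare an LLM's answer to each question against a ground-truth record.
# HistoricalFact and ask_llm are hypothetical names for illustration only.
from dataclasses import dataclass
from typing import Callable

@dataclass
class HistoricalFact:
    question: str  # e.g. "Did polity X have scale armor in period Y?"
    answer: str    # ground truth, as it might be drawn from a databank

def accuracy(facts: list[HistoricalFact],
             ask_llm: Callable[[str], str]) -> float:
    """Fraction of questions the model answers correctly (exact match)."""
    correct = sum(
        1 for f in facts
        if ask_llm(f.question).strip().lower() == f.answer.strip().lower()
    )
    return correct / len(facts)

if __name__ == "__main__":
    # Toy usage: a stub "model" that always answers "yes" gets one of two right.
    toy_facts = [
        HistoricalFact("Did ancient Egypt have scale armor in this period?", "no"),
        HistoricalFact("Did ancient Persia maintain a standing army?", "yes"),
    ]
    always_yes = lambda q: "yes"
    print(f"accuracy: {accuracy(toy_facts, always_yes):.0%}")  # prints 50%
```

The real benchmark's questions and scoring are surely richer than this exact-match toy, but the core idea, a per-question comparison aggregated into an accuracy figure, is what the reported percentages below refer to.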

The findings, presented at the prestigious AI conference NeurIPS, were underwhelming: researchers from the Complexity Science Hub (CSH), an Austrian research institute, report that even the best-performing model tested, GPT-4 Turbo, scored only around 46% accuracy, barely above random chance.

“Despite their capabilities, LLMs lack the in-depth comprehension necessary for advanced historical studies,” commented Maria del Rio-Chanona, a co-author of the study and an associate professor at University College London’s computer science department. “They can handle basic historical facts, but when it comes to more intricate, doctoral-level historical analysis, they fall short.”

The researchers provided TechCrunch with examples of historical questions that the LLMs mishandled. For instance, GPT-4 Turbo incorrectly affirmed the presence of scale armor in ancient Egypt during a specific epoch, when in fact the technology emerged 1,500 years later.

Why do LLMs falter on detailed historical questions when they can adeptly answer complex coding questions? Del Rio-Chanona told TechCrunch that LLMs tend to extrapolate from prominent historical data and struggle to retrieve less common historical facts.

For example, when asked whether ancient Egypt had a professional standing army during a particular period, the LLM incorrectly answered yes, likely because of the abundance of information about standing armies in other ancient empires, such as Persia.

“Imagine being told A and B repeatedly, and C only once; when asked about C, you might default to what you remember about A and B and extrapolate from there,” explained del Rio-Chanona.

The study also found that certain models, including OpenAI’s GPT-4 and Meta’s Llama, performed worse on questions about regions such as sub-Saharan Africa, suggesting possible biases in their training data.

The study’s leader, Peter Turchin, a faculty member at CSH, emphasized that LLMs are not yet ready to replace humans in certain fields.

Nevertheless, the researchers remain optimistic about the potential of LLMs to assist historians. They are refining their benchmark by incorporating data from underrepresented regions and introducing more sophisticated questions.

“The study’s results, while highlighting areas for improvement in LLMs, also indicate their potential to contribute to historical research,” the paper concludes.