AI may excel at certain tasks, like coding or generating a podcast. But it struggles to pass a high-level history exam, a new paper has found.
A team of researchers created a new benchmark to test three top large language models (LLMs) on historical questions: OpenAI's GPT-4, Meta's Llama, and Google's Gemini. The benchmark, called Hist-LLM, tests the correctness of answers against the Seshat Global History Databank, a vast database of historical knowledge named after the ancient Egyptian goddess of wisdom.
According to researchers affiliated with the Complexity Science Hub (CSH), a research institute based in Austria, the results, presented last month at the high-profile AI conference NeurIPS, were disappointing. The best-performing LLM was GPT-4 Turbo, but it only achieved about 46% accuracy, not much better than random guessing.
“The main takeaway from this study is that LLMs, while impressive, still lack the depth of understanding required for advanced history. When it comes to more nuanced, doctoral-level historical research, they’re not yet up to the task,” said Maria del Rio-Chanona, one of the paper’s co-authors and an associate professor of computer science at University College London.
The researchers shared with TechCrunch a sample of historical questions that LLMs got wrong. For example, GPT-4 Turbo was asked whether scale armor existed in a certain period of ancient Egypt. The LLM said yes, but the technology didn’t appear in Egypt until 1,500 years later.
Why are LLMs good at answering very complex questions about things like coding, but bad at answering technical historical questions? Del Rio-Chanona told TechCrunch that it’s likely because LLMs tend to extrapolate from historical data that is very prominent, which makes it difficult for them to retrieve more obscure historical knowledge.
For example, researchers asked GPT-4 whether ancient Egypt had a professional standing army during a specific historical period. The correct answer is no, but the LLM incorrectly answered yes. This is likely because there is abundant public information about other ancient empires, like Persia, having standing armies.
“If you’re told A and B 100 times, and C once, and then asked a question about C, you might just remember A and B and try to guess from there,” said del Rio-Chanona.
The researchers also identified other trends, such as OpenAI and Llama models performing worse in certain regions, like sub-Saharan Africa, suggesting potential biases in their training data.
Peter Turchin, a CSH faculty member who led the study, said, “These results show that LLMs cannot yet replace humans in certain areas.”
But the researchers still hope LLMs can help historians in the future. They are working to refine the benchmark by including more data from underrepresented regions and adding more complex questions.
“Overall, our results highlight areas where LLMs need improvement, while also underscoring the potential of these models to aid historical research,” the paper says.