Conclusion
Several broader conclusions emerge from this test case. The two models that drew on curated databases of experimental literature, NotebookLM and our custom-built tool, outperformed the LLMs trained on unfiltered internet data. In particular, models relying on open web sources tended to mix established theories with highly speculative ones.
The evaluated LLMs (accessed in December 2024) also showed weaknesses in temporal and contextual understanding. For example, they often failed to recognize when a proposed hypothesis had later been disproved. They also frequently omitted relevant papers whose text did not explicitly contain the exact language used in the initial query.
Our results broadly highlight the need for LLMs to better understand tables and images, given how heavily scientific papers rely on these formats. While two of the models consistently referenced images, they often drew on image captions rather than on visual analysis of the figures themselves. Enhancing visual reasoning capability, including interpreting images, plots, and scale bars, is a major direction for future improvement.

