Conclusion
Several broader conclusions emerge from this test case. The two models that drew on curated databases of experimental literature, NotebookLM and our custom-built tool, outperformed the LLMs trained on unfiltered internet data. In particular, models relying on open web sources tended to mix established theories with highly speculative ones.
The evaluated LLMs (accessed in December 2024) also showed weaknesses in temporal and contextual understanding. For example, they often failed to recognize when a proposed hypothesis had later been disproved. They also frequently omitted relevant papers whose text did not explicitly contain the exact language used in the initial query.
Our results broadly highlight the need for LLMs to better understand tables and images, given how heavily scientific papers rely on these formats. While two of the models consistently referenced images, they often drew on image captions rather than on visual analysis of the figures themselves. Enhancing visual reasoning capability, including interpreting images, plots, and scale bars, is a major direction for future improvement.

