New paper questions “LLM library learning” gains - accepted for publication at EACL
The paper “Is This LLM Library Learning? Evaluation Must Account For Compute and Behaviour” by Ian Berlot-Attwell, Tobias Sesterhenn, Frank Rudzicz, and Xujie Si has been accepted for publication at the Conference of the European Chapter of the Association for Computational Linguistics (EACL). The work critically examines recent in-context “library learning” systems, which claim improved performance by learning and reusing tools or lemmas without fine-tuning. Across three published systems (including an in-depth analysis of LEGO-Prover), the authors argue that many reported gains largely disappear once computational cost is properly controlled, and they find little evidence that learned libraries are actually reused as intended. The paper concludes with recommendations for stronger evaluation standards, including compute-matched baselines and behavioural analysis.
Abstract:
The in-context learning (ICL) coding, reasoning, and tool-using ability of LLMs has spurred interest in library learning (i.e., the creation and exploitation of reusable and composable functions, tools, or lemmas). Such systems often promise improved task performance and computational efficiency by caching reasoning (i.e., storing generated tools), all without fine-tuning. However, we find strong reasons to be skeptical. Specifically, we identify a serious evaluation flaw present in a large number of ICL library learning works: these works do not correct for the difference in computational cost between baseline and library learning systems. Studying three separately published ICL library learning systems, we find that all of them fail to consistently outperform the simple baseline of prompting the model: improvements in task accuracy often vanish or reverse once computational cost is accounted for. Furthermore, we perform an in-depth examination of one such system, LEGO-Prover, which purports to learn reusable lemmas for mathematical reasoning. We find no evidence of the direct reuse of learned lemmas, and find evidence against the soft reuse of learned lemmas (i.e., reuse by modifying relevant examples).
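To make the compute-matching argument concrete, here is a minimal sketch (not from the paper; all numbers, rates, and token costs are invented for illustration) of a compute-matched comparison. Instead of comparing methods at an equal number of attempts, each problem is given a fixed token budget, and a cheaper baseline is allowed as many attempts as fit within that budget:

```python
import random

def accuracy_at_budget(attempts, budget):
    """attempts: per-problem list of (solved, tokens_used) tuples, in sample order.
    A problem counts as solved if any attempt that still fits within the
    per-problem token budget succeeds (pass@compute rather than pass@k)."""
    solved = 0
    for problem in attempts:
        spent = 0
        for ok, tokens in problem:
            spent += tokens
            if spent > budget:
                break  # this attempt exceeds the compute budget; stop here
            if ok:
                solved += 1
                break
    return solved / len(attempts)

random.seed(0)

# Hypothetical data: 100 problems, up to 5 sampled attempts each.
# Plain prompting: cheap attempts (500 tokens) with a lower success rate.
baseline = [[(random.random() < 0.30, 500) for _ in range(5)]
            for _ in range(100)]

# Hypothetical library-learning system: each attempt costs ~3x more
# (retrieval, library maintenance) for a higher per-attempt success rate.
library = [[(random.random() < 0.45, 1500) for _ in range(5)]
           for _ in range(100)]

# At equal per-problem budgets, the baseline gets proportionally more
# attempts; an apparent per-attempt advantage can shrink or reverse.
for budget in (1500, 3000, 7500):
    print(budget,
          accuracy_at_budget(baseline, budget),
          accuracy_at_budget(library, budget))
```

The point of the sketch is only the evaluation shape: at a matched budget the cheaper method is credited with every retry the budget affords, which is the control the paper argues many ICL library learning evaluations omit.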