
Learning an Effective Premise Retrieval Model for Efficient Mathematical Formalization

Published in the 2nd AI for Math Workshop @ ICML 2025. Honorable Mention Award.

Published 2025-07-18 · Updated 2025-07-18 · Premise Selection, Lean, Information Retrieval · arXiv

Abstract

Premise selection is a crucial yet challenging step in mathematical formalization, especially for users with limited experience. Because few formalization projects are available, existing approaches that leverage language models often suffer from data scarcity. In this work, we introduce a method for training a premise retriever to support the formalization of mathematics. Our approach employs a BERT model to embed proof states and premises into a shared latent space. The retrieval model is trained within a contrastive learning framework and incorporates a domain-specific tokenizer along with a fine-grained similarity computation method. Experimental results show that our model is competitive with existing baselines, achieving strong performance while requiring fewer computational resources. Performance is further enhanced by integrating a re-ranking module. To streamline the formalization process, we release a search engine that lets users query Mathlib theorems directly using proof states, improving accessibility and efficiency. Code is available here.
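To make the retrieval setup in the abstract concrete, here is a minimal sketch of contrastive training and retrieval in a shared embedding space. It is not the paper's implementation. It assumes one vector per text, plain cosine similarity (not the fine-grained similarity the abstract mentions), an InfoNCE-style loss with in-batch negatives, and small hand-written vectors standing in for BERT outputs. The names `cosine`, `info_nce_loss`, `retrieve`, and the `temperature` value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(state_embs, premise_embs, temperature=0.05):
    """In-batch contrastive loss (assumed form, not taken from the paper).

    premise_embs[i] is treated as the positive premise for state_embs[i];
    every other premise in the batch serves as a negative.
    """
    total = 0.0
    for i, state in enumerate(state_embs):
        logits = [cosine(state, p) / temperature for p in premise_embs]
        # Numerically stable log-sum-exp over all premises in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the positive premise.
        total += log_denom - logits[i]
    return total / len(state_embs)


def retrieve(state_emb, premise_embs, k=1):
    """Return the indices of the k premises most similar to the proof state."""
    scores = [cosine(state_emb, p) for p in premise_embs]
    order = sorted(range(len(premise_embs)), key=lambda j: scores[j], reverse=True)
    return order[:k]


if __name__ == "__main__":
    # Toy embeddings standing in for encoder outputs.
    states = [[1.0, 0.0], [0.0, 1.0]]
    premises = [[1.0, 0.0], [0.0, 1.0]]
    print("loss (aligned):", info_nce_loss(states, premises))
    print("top-1 premise for state 0:", retrieve(states[0], premises, k=1))
```

Under these assumptions, the loss is small when each proof state lies closest to its own premise and large when the pairing is wrong, which is the signal that pulls matching pairs together during training. At query time, retrieval reduces to ranking premises by similarity to the embedded proof state.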

Links

Model & Code:

Research Paper: arXiv Preprint (https://arxiv.org/abs/2501.13959)

Citation

@inproceedings{
  tao2025learning,
  title={Learning an Effective Premise Retrieval Model for Efficient Mathematical Formalization},
  author={Yicheng Tao and Haotian Liu and Shanwen Wang and Hongteng Xu},
  booktitle={2nd AI for Math Workshop @ ICML 2025},
  year={2025},
  url={https://arxiv.org/abs/2501.13959},
}