AI does not (yet) beat humans at new mathematical problems: the results of the “First Proof” test

Artificial intelligence was beaten by humans at solving 10 difficult mathematical problems in the “First Proof” project. The project was born from a question: can AI replace mathematicians? To try to answer it, a group of researchers from several European and US universities set out to evaluate the ability of AI to solve complex mathematical problems whose proofs have never been published.

The First Proof team, which had already organized a first, less “official” test in February, submitted 10 unpublished problems in extremely advanced mathematics to four different AI systems. The answers were then evaluated by experts in the field, just as in a normal review of scientific research. The results, published on the First Proof website on June 10, show that, as of now, AI is still far from the level of the best human researchers. The best-performing system, created by ETH Zurich, managed to prove only 6 of the 10 problems. The worst, that of Princeton University, proved only 2.

How the test was structured and what its purpose is

Evaluating the actual mathematical capabilities of AI is not easy. One of the main problems is that models are trained on huge amounts of text, scientific articles and material available online. If a question is too similar to something already in the training data, the system may solve it simply by reproducing a known solution, and so appear more capable than it is.

To really test these capabilities, the “First Proof” team asked researchers around the world to submit problems, questions and theorems that they had solved in the course of their research but had not yet published, either in scientific articles or online. From all the proposals, ten were selected, spanning different branches of mathematics, from geometry to algebra to probability.

The AI systems had to tackle these problems completely autonomously, without human assistance. The solutions were then examined by a group of about thirty mathematicians, following a process similar to the peer review of scientific articles.

Four different AIs were tested and ETH Zurich won

Four AI systems took part in the challenge. The only large company directly present was OpenAI, with ChatGPT 5.5 Pro. The other three systems were developed by research groups from the University of California, Los Angeles (UCLA), Princeton University in New Jersey and the Swiss Federal Institute of Technology (ETH) in Zurich.

All three universities developed so-called “harnesses”, i.e. AI systems in which one model (for example ChatGPT) produces a solution and other models (for example Gemini and Claude) check it, critique it and improve it over a series of successive steps. The winning system from ETH Zurich worked exactly like this: ChatGPT generated a possible proof, which was then verified and improved with the contribution of Gemini and Claude.
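To make the idea of a harness concrete, here is a minimal sketch in Python of a generate, critique and revise loop. It is not the code of any of the three teams: the names `run_harness`, `Generator`, `Critic` and `HarnessResult`, the `max_rounds` limit and the toy stand-in functions are all assumptions made for illustration, and no real model is called.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical interfaces (assumptions, not any team's real API):
# a generator drafts or revises a proof given the objections raised so far;
# a critic returns an objection, or None if it finds nothing to object to.
Generator = Callable[[str, List[str]], str]
Critic = Callable[[str, str], Optional[str]]


@dataclass
class HarnessResult:
    proof: str       # last draft produced
    rounds: int      # number of generate/critique rounds used
    accepted: bool   # True if every critic was satisfied


def run_harness(problem: str, generate: Generator, critics: List[Critic],
                max_rounds: int = 5) -> HarnessResult:
    """Alternate between drafting a proof and collecting objections."""
    feedback: List[str] = []
    proof = ""
    for round_number in range(1, max_rounds + 1):
        # One model drafts (or revises) the proof using earlier objections.
        proof = generate(problem, feedback)
        # The other models each review the draft.
        feedback = []
        for critic in critics:
            objection = critic(problem, proof)
            if objection is not None:
                feedback.append(objection)
        if not feedback:
            return HarnessResult(proof, round_number, True)
    # Round budget exhausted with objections still open.
    return HarnessResult(proof, max_rounds, False)


# Toy stand-ins so the sketch runs without any model behind it.
def toy_generator(problem: str, feedback: List[str]) -> str:
    notes = "".join(f" [addressed: {item}]" for item in feedback)
    return f"Proof of {problem}.{notes}"


def toy_critic(problem: str, proof: str) -> Optional[str]:
    return None if "addressed" in proof else "step 2 is unjustified"


def stubborn_critic(problem: str, proof: str) -> Optional[str]:
    return "still not convinced"


result = run_harness("Lemma A", toy_generator, [toy_critic, toy_critic])
failed = run_harness("Lemma B", toy_generator, [stubborn_critic], max_rounds=3)
```

In this sketch every round calls the generator once and each critic once, so the number of model calls grows with the number of rounds. A real harness would replace the toy functions with calls to actual models and would probably use a more elaborate stopping rule.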

As noted, the ETH Zurich harness was the best, correctly solving and proving 6 out of 10 problems. The UCLA team ranked second, followed by ChatGPT 5.5 Pro. Last was the Princeton system, which, with a harness based mainly on Gemini 3.1 Pro, managed to prove only two.

All this, however, came at a cost. The systems created by the three universities, precisely because of their continuous verification mechanisms, were extremely expensive. The ETH system spent as much as $950 on a single problem, and still failed to solve it. For comparison, ChatGPT 5.5 Pro used alone required just $144 to tackle the entire set of ten problems, or about $14 per problem on average.

One of the 10 “First Proof” test problems. This geometry question was not solved by any AI, and the proof attempts cost more than $1,100 in total.

Why AI still can’t replace mathematicians

After the competition, the First Proof team tried to understand why some of the proposed problems remained unsolved. The main conclusion is that these questions required ideas or strategies very different from those found in the existing mathematical literature. According to Johannes Schmitt, a member of the ETH Zurich team, the AI often lacked “a critical and unexpected idea”, the creative step needed to complete the reasoning and arrive at the proof.

Another limitation that emerged concerns the way in which the models construct proofs. The AIs tended to develop the more procedural and mechanical parts, those that humans consider most tedious, with great precision, but they were much less rigorous in the more conceptually complex passages. In some cases they took for granted results that would themselves require proof, without providing any justification. In others they cited articles that did not actually contain the result in question.

This does not mean that artificial intelligence is useless for mathematical research. On the contrary, the test shows that it can already be a useful tool for checking steps and supporting work on proofs. In August, the First Proof team will begin preparing a new edition of the test, scheduled for October 2026. According to the organizers, future editions could help to better understand in which contexts artificial intelligence can become truly useful to mathematicians, from verifying proofs to finding new strategies for tackling problems that are still open.