Relevant if you build with AI tools, APIs, or coding agents.
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
Quick editorial signal
- Track this as an OpenAI update, not just a standalone headline.
- Useful for builders who need to understand API, coding, or workflow changes.
- Likely worth revisiting after people have used the release in practice.
October 10, 2024
Publication
MLE-bench
Evaluating Machine Learning Agents on Machine Learning Engineering
We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. To this end, we curate 75 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup, OpenAI's o1-preview with AIDE scaffolding, achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. In addition to our main results, we investigate various forms of resource-scaling for AI agents and the impact of contamination from pre-training. We open-source our benchmark code to facilitate future research in understanding the ML engineering capabilities of AI agents.
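The headline number works as follows: an agent's best submission to each competition is scored against that competition's archived Kaggle leaderboard, and the benchmark reports the fraction of competitions where the score clears the bronze-medal bar. Below is a minimal sketch of that grading logic, assuming a flat top-40% bronze rule for illustration; Kaggle's actual medal thresholds vary with leaderboard size, and these class and function names are hypothetical, not the benchmark's real API.

```python
from dataclasses import dataclass


@dataclass
class Competition:
    name: str
    leaderboard: list[float]       # archived historical scores, best first
    higher_is_better: bool = True  # some metrics (e.g. RMSE) are lower-is-better


def bronze_cutoff(comp: Competition) -> float:
    # Simplification: treat "bronze" as placing in the top 40% of the
    # historical leaderboard. Real Kaggle medal rules depend on field size.
    k = max(1, int(len(comp.leaderboard) * 0.40))
    return comp.leaderboard[k - 1]


def earns_bronze(comp: Competition, agent_score: float) -> bool:
    cutoff = bronze_cutoff(comp)
    if comp.higher_is_better:
        return agent_score >= cutoff
    return agent_score <= cutoff


def medal_rate(results: list[tuple[Competition, float]]) -> float:
    # Headline metric: fraction of competitions where the agent's best
    # submission scores at or above the bronze cutoff.
    return sum(earns_bronze(c, s) for c, s in results) / len(results)


if __name__ == "__main__":
    comp = Competition(
        name="toy-competition",
        leaderboard=sorted([0.85, 0.81, 0.80, 0.79, 0.77], reverse=True),
    )
    print(earns_bronze(comp, 0.82))    # True: beats the top-40% cutoff of 0.81
    print(medal_rate([(comp, 0.82)]))  # 1.0
```

Grading against a frozen historical leaderboard is what makes the human baseline cheap: no fresh human runs are needed, only the agent's score and the archived standings.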
Authors
Chan Jun Shern, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, Aleksander Madry
* o1
* Software & Engineering
* Learning Paradigms
* Reasoning & Policy