The old tests are getting too easy. Tejal Patwardhan leads OpenAI’s frontier evals team, which is finding new ways to measure and forecast progress as models become more capable. She and host Andrew Mayne discuss why evals matter for research, how benchmarks can break or get gamed, and what models need to be judged on next.
Chapters
00:00:24 Growing up at OpenAI
00:03:10 Why reasoning changed everything
00:06:28 What made o1 surprising
00:11:20 Why old benchmarks stopped working
00:14:45 What makes a good benchmark
00:17:35 Why evals are getting harder
00:22:09 Measuring voice and vision models
00:24:48 Testing models on real science
00:33:23 How OpenAI tracks frontier progress
00:40:47 What AI means for work