When will AI take my job?
Benchmarks used to evaluate AI systems tend to favor questions that are easy to present to an AI and easy to grade.
This favors benchmarks that resemble written exams of the kind you would expect to find in a classroom or a university exam hall. Note that while the structure of such tests can be simple (“ask a question, check whether the answer is correct”), the questions themselves can be anything but. Below are two sample questions from the benchmark “Humanity’s Last Exam”.

Over the past few years, many AI benchmarks have become saturated—AI systems now achieve near-perfect scores, making these benchmarks ineffective for measuring further progress. This raises an important question: if AI performs so well on benchmarks, why hasn’t it had a larger impact on the labor market?
For example, AI systems can pass medical licensing exams and bar exams for lawyers, yet we don’t see doctors and lawyers being replaced by AI en masse in the real world. This gap between benchmark performance and real-world impact suggests that these tests may not capture what actually matters for job performance.
One explanation is that most jobs involve more than completing isolated tasks—the very thing benchmarks typically measure. Real work requires coordinating multiple activities, adapting to unexpected situations, and integrating various skills over time.
Some newer benchmarks attempt to address this limitation, and I’d like to discuss two of them.
METR - Measuring AI Ability to Complete Long Tasks

METR (Model Evaluation & Threat Research) is a non-profit AI research group. Their most impactful paper, “Measuring AI Ability to Complete Long Tasks”, proposes a new way to evaluate AI models: measuring how complex a task they can reliably complete.
METR measures the difficulty of a task by how long it would take a human expert to complete it. A model’s “time horizon” is then the longest task, measured in human-expert time, that it can complete reliably, where reliably means a success rate of 50% (or, in a stricter variant, 80%). Their research shows that this time horizon has been growing exponentially as newer, more capable AI models are released. The current best model (GPT-5) has a 50% time horizon of 2 hours and 17 minutes.
If this trend continues at the rate observed since GPT-2 in 2019, METR projects that AI models will be able to complete full-day tasks (8 hours) within about 14 months. Models capable of handling tasks that take a human a full work week (40 hours) could arrive roughly 20 months after that.
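To make the arithmetic behind such a projection concrete, here is a minimal sketch in Python. It assumes the current 2 hour 17 minute horizon and a doubling time of roughly 7 months, which is my reading of the 2019–2025 trend rather than a number quoted above; the projected dates shift by several months if you assume a slightly longer or shorter doubling time.

```python
import math

def months_until(target_minutes, current_minutes=137, doubling_months=7.0):
    """Months until an exponentially growing time horizon reaches target_minutes.

    current_minutes: today's 50% time horizon (2 h 17 min for GPT-5).
    doubling_months: assumed doubling time of the trend (an assumption
    on my part, not a figure quoted in this post).
    """
    doublings_needed = math.log2(target_minutes / current_minutes)
    return doublings_needed * doubling_months

print(f"8-hour tasks:  ~{months_until(8 * 60):.0f} months from now")
print(f"40-hour tasks: ~{months_until(40 * 60):.0f} months from now")
```

With a doubling time closer to eight months, the same arithmetic lands near the figures above: about 14 months to 8-hour tasks and roughly a year and a half more to 40-hour tasks.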
OpenAI - GDPval

Another way of estimating how large a share of economically valuable work AI models can perform is proposed in the paper “GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks”, written by a team of researchers at OpenAI.
The researchers behind this study recruited human experts from 44 professions across the top 9 sectors of the American economy. Each expert defined an economically valuable task from their profession and provided their own solution. This created a dataset of 1,320 tasks.
The researchers then prompted several AI systems to solve each task. Another set of human experts evaluated all solutions—both human and AI-generated—and selected the best one for each task, without knowing which was which.
While human solutions were generally preferred, the margin was surprisingly small. AI solutions were either preferred or rated equally good in 47.6% of tasks. Notably, a model from Anthropic (one of OpenAI’s main competitors) performed best overall.
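For concreteness, here is a minimal sketch of how a win-or-tie rate like the 47.6% above can be computed from blind pairwise verdicts. The verdict labels and the toy data are my own illustration, not GDPval’s actual grading format.

```python
from collections import Counter

# One verdict per task from a blind grader: "ai" if the AI solution was
# preferred, "human" if the expert's was, "tie" if they were judged equal.
# Toy data for illustration only, not results from the paper.
verdicts = ["human", "ai", "tie", "human", "ai", "human", "tie", "ai"]

def win_or_tie_rate(verdicts):
    """Share of tasks where the AI solution was preferred or rated equally good."""
    counts = Counter(verdicts)
    return (counts["ai"] + counts["tie"]) / len(verdicts)

print(f"AI preferred or tied: {win_or_tie_rate(verdicts):.1%}")
```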
The average human completion time for these tasks was 7 hours, significantly longer than METR’s observed time horizon of 2 hours and 17 minutes for the best model. The difference is that GDPval provided the AI systems with detailed reference materials and context, demonstrating the value of proper prompt engineering. With the right context, AI can tackle much longer tasks than METR’s benchmark suggests.
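As a rough illustration of what providing reference materials as context can look like in practice, here is a small sketch that assembles a task prompt from a task description and its associated reference files. The function name and the plain-text file handling are my own simplifications; real GDPval tasks also involve spreadsheets, slides, and other formats.

```python
from pathlib import Path

def build_task_prompt(task_description: str, reference_paths: list[str]) -> str:
    """Pair a task with its reference material in a single prompt,
    roughly in the spirit of how GDPval supplies context to the models."""
    sections = [f"Task:\n{task_description}\n"]
    for path in reference_paths:
        # Simplification: assumes plain-text reference files.
        sections.append(f"Reference ({path}):\n{Path(path).read_text()}\n")
    sections.append("Produce the deliverable described in the task.")
    return "\n".join(sections)
```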

The paper also tracks how the frontier models from one specific company (OpenAI) have performed over time, noting that the flagship model from last summer (GPT-4o) scores about 20% lower than the flagship model from this summer (o3 at high reasoning effort, released before GPT-5). A naive extrapolation would put whichever model is OpenAI’s best next summer on par with human experts.
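To spell out what such a naive extrapolation looks like, here is a tiny sketch. The two scores are placeholders I made up to match the shape of the argument (a roughly 20-point gain over one year), not figures from the paper, and treating a 50% win-or-tie rate as “parity” is my own rough proxy.

```python
# Naive linear extrapolation of a GDPval-style score for successive flagship models.
# The scores below are placeholders, not numbers from the paper.
score_last_summer = 0.25    # hypothetical GPT-4o win-or-tie rate
score_this_summer = 0.45    # hypothetical o3-high win-or-tie rate, ~20 points higher
parity_with_experts = 0.50  # rough proxy: AI preferred or tied in half of the tasks

one_year_gain = score_this_summer - score_last_summer
score_next_summer = score_this_summer + one_year_gain

print(f"Projected score next summer: {score_next_summer:.0%}")
print("At or above expert parity:", score_next_summer >= parity_with_experts)
```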

Finally, the researchers observe that some professions are much harder for AI to produce acceptable solutions for than others: industrial engineers prefer the human solution in 83% of tasks, whilst software developers actually favor the AI solution in 76% of tasks.
So, when will AI take my job?
Geoffrey Hinton, the 2024 Nobel Prize laureate, famously predicted in 2016 that we should stop training radiologists, since AI would obviously be better at their job within the next decade. Nine years later, the profession is more in demand than ever, despite increasing AI capabilities.
A job is obviously more than just a collection of tasks to solve, as Hinton’s incorrect prediction makes evident. Andrej Karpathy recently stated that he believes we are a decade away from being able to fully replace software developers with AI agents¹.
That said, AI models are improving all the time (in a very predictable manner according to some metrics, like the METR evaluation). In parallel, the industry is making progress in understanding how best to support AI agents with scaffolding, such as well-curated context, so that they can perform ever more complex and useful tasks.
The next few years will be interesting indeed.