AI Fails at 96% of Jobs (New Study)

By ColdFusion

Summary

Topics Covered

  • AI Fails 96% on Real Jobs
  • AI Excels in Narrow Creative Tasks
  • Benchmarks Hide Real-World Failures
  • LLMs Mimic, Don't Understand World

Full Transcript

In the absence of AI and robotics, we're actually totally screwed.

>> We are working to build tools that one day could help us make new discoveries and address some of humanity's biggest challenges like climate change and curing cancer.

>> Hi, welcome to another episode of ColdFusion. Here's a question: how can AI be disrupting the job market but also be losing billions of dollars at the same time? Well, this video will answer that.

The truth is, while AI helps make some jobs easier, when compared to a human it performs worse a whopping 96.25% of the time. That basically means: give an AI 10 tasks and it will perform at least nine of them worse than a human. That's at least according to a new study. It's such an interesting finding, and it begs the question: why has no one systematically compared how well AI does versus a human who's done exactly the same job? All previous benchmarks have been simulated human work, not real generalized work.
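As a quick sanity check on that headline figure, here's a minimal sketch of the arithmetic. The 3.75% success rate is the best model score quoted in the video; the 10-task framing is just illustration:

```python
# Sketch of the headline arithmetic from the video. The 3.75% figure is
# the best reported success rate; the 10-task framing is illustrative.
best_success_rate = 0.0375           # best model's success rate, per the video
failure_rate = 1 - best_success_rate

tasks = 10
expected_worse = failure_rate * tasks  # ~9.6 of 10 tasks done worse than a human

print(f"failure rate: {failure_rate:.2%}")
print(f"of {tasks} tasks, roughly {expected_worse:.1f} come out worse")
```

Which is why "at least nine out of ten" is the fair way to round it.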

The results from the team of researchers who did the study make one think that maybe the true value of consumer AI isn't hundreds of billions of dollars, but orders of magnitude less. I'm not saying that all AI sucks; this study is just a general reminder that AI is a time-saving tool and not a replacement. Maybe the economy is valuing its near-term capabilities too highly.

In this episode, we'll take a look at the study in detail and discuss what it all means.

>> You are watching ColdFusion TV.

>> So, the synopsis of the study was straightforward enough: give paid jobs already completed by real people to AI models, and then see how well the results compare. Once the AI completes the tasks, humans evaluate the results. The researchers called this method the Remote Labor Index, or RLI. It's so simple. Most of us use a computer to do modern work, right? So why not just directly compare how well AIs compete on a professional, computer-based job? The jobs to be completed were real ones from the freelancing site Upwork, a site where you pay remote workers to complete any given task. The jobs varied from video creation, computer-aided design, graphic design, game development, audio work, and architecture, to more. Both humans and AI were given the same brief and any attached files that were necessary for the job, for example, an Excel spreadsheet of data or instructional images.

The AI models were tested on 240 jobs, each paying $630 on average. So, how did they perform? The performance was abysmal. The best AI was Claude Opus 4.5, with a 3.75% success rate when it came to producing work of an acceptable quality. You heard that right: a 96.25% failure rate was the best performer. Interestingly, Gemini was the loser, with a 1.25% success rate. Now, Claude Opus 4.6 might score 5% better, but that's still a roughly 91% failure rate. When these scores get to 35% or 40%, then we can talk.

So, a couple of things to note. The original paper used AI models that were six months or so old, but their website has up-to-date results, which are the scores that I'm referring to in this episode. I'll leave a link for the website below.
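To make the RLI-style scoring concrete, here's a minimal sketch. The job names and verdicts are invented for illustration, and the scoring shown is just the simple fraction described above, not the paper's exact evaluation procedure:

```python
# Hypothetical RLI-style tally: a human evaluator marks each AI
# deliverable as acceptable (at or above human level) or not.
# Job IDs and verdicts are invented for illustration only.
verdicts = {
    "logo-design-014": True,
    "cad-floorplan-102": False,
    "video-edit-230": False,
    "game-prototype-077": False,
}

def success_rate(verdicts: dict[str, bool]) -> float:
    """Fraction of jobs whose deliverable was judged acceptable."""
    return sum(verdicts.values()) / len(verdicts)

rate = success_rate(verdicts)
print(f"success: {rate:.2%}, failure: {1 - rate:.2%}")  # success: 25.00%
```

Run over 240 real jobs, the same fraction is what produces the 1.25% to 3.75% figures above.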

So, where exactly did the AI systems fail? Well, first we need to define exactly what failure means. Failure counts as not performing a task at or better than a human level. This is specifically in the context of a freelancing environment, an environment where people actually pay money directly for the work. With that in mind, the paper lists four main failure points for AI systems.

Number one, sometimes the AI would produce, quote, "corrupt or empty files" or deliver work in incorrect or unusable formats. Number two, AI, quote, "frequently submitted incomplete work characterized by missing components, truncated videos, or absent source assets", for example, a video of 8 seconds when an 8-minute video was required. Number three, quality issues. Quote: "Even when agents produce a complete deliverable, the quality of work is frequently poor and does not meet professional standards." End quote. And finally, number four, inconsistencies with AI-generated work. This includes a house's appearance changing across different 3D views, or digital floor plans that don't match the supplied sketches.

It's all very interesting. So for years now, we've been told that AI is going to replace humans everywhere. But the truth is, we are nowhere near that point. At least not yet, anyway.

So then, where did the AI succeed? Success would mean that the AI does the same work at the same quality as, or better quality than, human output. They note that AI was proficient in creative areas like audio- and image-related work, along with writing, data retrieval, and web scraping. And that kind of checks out.

The success of OpenClaw attests to that latitude, and AI images and audio are already good enough to fool a lot of people. Advertisement and logo creation was another successful area. It's also no surprise that AI was good at report writing and generating simple code for an interactive data visualization. Competent video generation is coming very shortly; just take a look at Seedance 2.0.

So, the main takeaway is that AI is pretty good at some things, but horrendous for general work.

But what else do we learn? This paper exposes a lot, much of it negative, but it does show that the RLI format is a very useful measure of AI performance in the real world. The reason being, current-day benchmarks aren't reflective of real-world performance. As the paper puts it, quote: "While AI systems have saturated many existing benchmarks, we find that state-of-the-art AI agents perform near the floor on RLI." End quote. I found the study to be very robust, by the way, so I'll leave a link to it below.

According to this study, AI may impact jobs with lots of language requirements, audio, simple advertising, or data retrieval, but human oversight is still needed. A PwC report found that the majority of CEOs see no financial returns from AI. Upper management and CEOs just command workers to use AI and expect it all to work. For AI to work within a corporation, there needs to be a planned and skilled implementation of the technology, with knowledge of its shortcomings. And that doesn't happen a lot of the time. Gartner predicts that by next year, half of the companies that fired workers for AI are going to hire them back. Also, nine months ago, Microsoft proudly proclaimed that 30% of their code was written by AI, and since then, we've seen some of the worst software issues in the company's history.

Now, it's obvious that AI is disruptive and some jobs will be lost to the technology. For example, diffusion models are proficient in the visual arts, as you saw earlier. But as for LLMs and the general workforce, this study indicates that job losses could be a lot less. The AI space does move fast, so I could be wrong, but that's how things are looking today in early 2026. To sum up the job prognosis in one line: if you're a software engineer, set up a business that fixes vibe-coded apps and you'll make a lot of money.

>> I think the thing is, artificial intelligence really is going to transform the world in ways we can't even imagine. But it's not going to do it now, not with this technology. My favorite example of this is that one trains them on the whole internet, so they get access to a lot of written rules of chess and lots of games of chess, and they still make illegal moves. They never really abstract the model of how chess works. That's just so damning. You would not be able to learn chess after seeing a million games and reading the rules on Wikipedia and chess.com. Just making it bigger is not going to solve these problems. We need to do foundational research. That's what I've been saying for the last 5 years. What is intelligence? The problem is to understand your world, and reinforcement learning is about understanding your world, whereas large language models are about mimicking people, doing what people say you should do. They're not about figuring out what to do. Just to mimic what people say is not really to build a model of the world at all, I don't think.

>> So, I'm not saying that AI will never work, or that it's not genuinely useful already. There will be some narrow AI products that work really well. I'm just warning that there's a significant financial risk in the current AI space.

The investment ethos and the rollout of AI everywhere might be misallocating hundreds of billions of dollars. Even in the medical field, Reuters has just reported that the FDA has received 100 reports of AI malfunctions, botched surgeries, and misidentified body parts. In a few cases, a lawsuit alleges that the AI misinformed the surgeons on the locations of their instruments, causing one to mistakenly puncture the base of a patient's skull, and causing strokes from the damage to a major artery in two others. We don't need to put AI in every field. It's just not ready yet. Again, in some fields like coding, higher maths, and writing, AI is pretty good and can make jobs a lot easier, but we can't pretend like it's going to replace everyone perfectly right now.

Now, I was going to stop the video here, but just a couple of personal thoughts. Back in 2016, when I started covering AI, it was fun and fascinating to see how these things worked. But ever since the big money started coming in, the hype has just gone off the charts. CNBC just reported that companies like Anthropic, Google, and Microsoft have paid individual content creators $400,000 to half a million dollars each to promote their AI models. Now, brand deals are fine, but if the current generation of AI was as revolutionary as advertised, they wouldn't need to spend so much money to convince us. It's a jarring disconnect. One last thing.

>> We're fooled into thinking those machines are intelligent because they can manipulate language, and we're used to the fact that people who can manipulate language very well are implicitly smart. But we're being fooled. Now, they're useful, there's no question. They're great tools, like, you know, computers have been for the last five decades. But let me make an interesting historical point, and this is maybe due to my age. There's been generation after generation of AI scientists since the 1950s claiming that the technique that they just discovered was going to be the ticket for human-level intelligence. You see declarations from Marvin Minsky, Newell and Simon, you know, Frank Rosenblatt, who invented the perceptron, the first learning machine, in the 1950s, saying, like, within 10 years we'll have machines that are as smart as humans. They were all wrong. This generation, with LLMs, is also wrong. I've seen three of those generations in my lifetime. Okay. So, you know, it's just another example of being fooled.

>> That's Yann LeCun, the creator of convolutional neural networks. He's been outspoken in saying that the current AI architecture is reaching its peak. He thinks that throwing more data and power at the problem isn't going to solve it. And I think that's what the early data is showing us. It's called the scaling problem, and it's a large part of my upcoming video about how OpenAI is in big trouble. When it's complete, I'll leave a link for that episode below, so be sure to check it out after this.

Anyway, that's about it from me. You've been watching ColdFusion. Let me know your thoughts. I'm sure the comment section will be very full of very good discussion. Anyway, that's it. My name's Dagogo, and I'll see you again soon for the next episode. Cheers guys. Have a good one.

ColdFusion. It's new thinking.
