
Where Have Humanoid Robots Gotten To in 2025? | Robotics Series | Embodied Intelligence Year-End Review

By 硅谷101

Summary

## Key takeaways

- **Unitree Shatters Price Floor**: Unitree Robotics released the $5,900 R1 humanoid robot, shattering the industry's previous cost expectation of $20,000 to $30,000. [00:59], [01:04]
- **Figure AI Valuation Explodes**: Figure AI's valuation skyrocketed from $2.6 billion in 2024 to $39 billion, backed by investors like Microsoft, OpenAI, Nvidia, Bezos, Intel, and Samsung. [01:14], [01:22]
- **Tesla Halts Optimus Production**: Tesla's promise to produce 5,000 Optimus units resulted in only about 1,000 assembled before halting for a redesign, making Musk's claim that 80% of Tesla's value will come from Optimus look awkward. [01:30], [01:46]
- **VLA Enables Adaptive Robots**: Vision-Language-Action models unify seeing the scene, understanding tasks, and outputting actions, allowing robots like Dyna's to improvise when folding wrinkled towels, unlike traditional scripted robots. [02:56], [03:37]
- **Dyna Folds 700 Towels Daily**: Dyna's robots fold 700 towels in 24 hours with a 99.4% success rate, achieving real productivity in hotels and laundries. [07:41], [07:50]
- **Home Robots Face Zero Error Tolerance**: Housework demands zero tolerance for errors in unstructured home environments with varying lighting and item placements, unlike factories, where the loss from a broken part is controllable. [09:13], [09:32]

Topics Covered

  • VLA Enables Improvising Robots
  • Three Forces Ignite 2025 Boom
  • Home Robots Face Fatal Hurdles
  • System 1+2 Mimics Brain Efficiency
  • Embodied Intelligence at GPT-2

Full Transcript

Is embodied intelligence the biggest "bubble" of 2025?

This year, robots seem to have suddenly progressed from simply walking to quickly learning to do housework, run, play basketball, soccer, and table tennis, dance, perform kung fu, and more.

The progress is astonishing.

I've heard there's a lot of competition in China too, and Unitree Robotics is about to go public.

So, are humanoid robots really coming to my house soon?

Sorry, we might have to pour some cold water on this for you. Let's calm down.

Embodied intelligence: look at this term. The first part is the robot's "hardware," and the second part is its "brain."

Each of these directions has its own difficulties, and combining them while demanding accuracy, stability, and commercialization makes it even more challenging.

In this video, we interviewed some of Silicon Valley's most prominent robotics companies, visited several cutting-edge laboratories, and talked with industry leaders about whether the robotics industry is just a bubble of capital speculation or a genuine technological breakthrough.

In 2025, several astonishing things happened in the humanoid robot industry. At the beginning of the year, Unitree Robotics suddenly released the $5,900 R1 humanoid robot.

Just a year ago, the industry generally believed that the cost floor for humanoid robots was $20,000 to $30,000. Unitree's move essentially shattered the entire industry's price expectations.

Then Figure AI's valuation skyrocketed from $2.6 billion in 2024 to $39 billion, a 15-fold increase.

The list of investors reads like the Oscars of the tech world: Microsoft, OpenAI, Nvidia, Bezos, Intel, Samsung, and others.

The capital market is betting wildly. At the same time, however, Tesla's ambitious promise to produce 5,000 Optimus units resulted in only about 1,000 being assembled before production was halted for a redesign.

Musk's boast that "80% of Tesla's value will come from the Optimus" seems somewhat awkward in the face of reality.

This contrast is quite perplexing.

Where exactly does embodied intelligence stand today?

This video will explore this from the perspectives of algorithms, hardware, data, and capital. We'll delve into the key areas, including the major players' strategies, and provide a detailed analysis.

The conclusion is that by 2025, embodied intelligence is transitioning from a "pioneering debut" of demos to a more pragmatic phase of "prudent exploration and advancement."

The industry is recognizing reality and moving beyond grandiose promises to verify the boundaries of its capabilities in real-world scenarios.

Demo videos remain impressive, but implementation is becoming more cautious.

Technological breakthroughs are exciting, but every step towards commercialization involves cost calculations.

The good news is that this shift is precisely the sign that robots are becoming more practical.

Next, we'll break down embodied intelligence in detail.

Before discussing the current state of the industry, let's clarify what embodied intelligence is.

If ChatGPT is "talking" AI, then embodied intelligence is "action-oriented" AI.

Its core is VLA (Vision-Language-Action). The VLA model unifies these three elements into a single neural network: Vision (seeing the current scene), Language (understanding the task objective and common sense), and Action (outputting specific control commands).

Simply put, it has three capabilities: understanding the environment, comprehending commands, and performing actions.

How does this differ from traditional robots? To illustrate, traditional industrial robots are like actors who can only memorize fixed lines; you program them, and they execute the script step-by-step.

However, embodied intelligent robots are more like actors who improvise; they can understand environmental changes and make autonomous decisions.

For example, if you ask a robot to fold a towel, a traditional robot needs to place the towel in the exact same position every time.

But an embodied intelligent robot can recognize that the towel is wrinkled or crooked, adjust its movement trajectory, and still fold it correctly.

Dyna Robotics, a hot Silicon Valley embodied intelligence company, was founded just a year ago and has already raised $120 million in its Series A funding round, valuing the company at $600 million.

Investors include Nvidia.

The "towel folding" task was precisely the demo that first made Dyna famous.

In simple terms, VLA uses a VLM (vision-language model) from the large-model domain as its backbone, but at the final output stage it converts the result into actions usable in the robotics domain.

Intuitively, an action can be translated into commands, such as moving an arm to a specific coordinate point.
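To make this concrete, here is a minimal sketch of how a VLA-style policy could be wired together. The module names, tensor shapes, and the seven-dimensional action (a 6-DoF end-effector delta plus a gripper command) are illustrative assumptions, not any particular company's implementation.

```python
# Minimal sketch of a VLA-style policy (illustrative assumptions only; not a
# specific company's architecture).
import torch
import torch.nn as nn

class VLAPolicy(nn.Module):
    def __init__(self, vision_dim=768, text_dim=768, hidden=1024, action_dim=7):
        super().__init__()
        # Vision: project camera-frame patch tokens into the shared space.
        self.vision_proj = nn.Linear(vision_dim, hidden)
        # Language: project instruction tokens ("fold the towel") likewise.
        self.text_proj = nn.Linear(text_dim, hidden)
        # Backbone: a transformer that fuses both modalities (stands in for a VLM).
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True),
            num_layers=4,
        )
        # Action head: maps the fused representation to a low-level command,
        # e.g. a 6-DoF end-effector delta plus a gripper open/close value.
        self.action_head = nn.Linear(hidden, action_dim)

    def forward(self, image_tokens, text_tokens):
        tokens = torch.cat(
            [self.vision_proj(image_tokens), self.text_proj(text_tokens)], dim=1
        )
        fused = self.backbone(tokens)
        # Pool over tokens, then decode one action for the low-level controller.
        return self.action_head(fused.mean(dim=1))

# Usage: at each control step, feed the latest camera frame and the task text,
# then send the predicted end-effector delta to the robot's controller.
policy = VLAPolicy()
action = policy(torch.randn(1, 196, 768), torch.randn(1, 16, 768))
print(action.shape)  # torch.Size([1, 7])
```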

The most common criticism of VLA is: why do we need the L (Language)?

Because many traditional robot algorithms were purely vision-based. But if you think about it carefully, your brain actually generates something like language to tell you what to do first and what to do next in a long-term task.

The role of the L (Language) is that, for some very complex tasks, the model can draw on a lot of the logic learned by the large language model it was trained from. For example, if you want to drink water, you need to find a cup or a bottle.

This is something that the large language model can directly give you.

The main purpose of using VLA is to better combine language and vision.

Otherwise, if you only have vision, the tasks you can do are probably all short-term.

You can't do any long-term tasks that require reasoning.

So this is the main reason why we focus so much on introducing language.

This is a qualitative leap.

Robots are no longer just robotic arms executing fixed programs; through the combination of vision, language, and action, they can understand and plan, becoming adaptive intelligent agents.

Embodied intelligence is not a new concept.

Why did it suddenly explode in 2025?

There are three main factors.

First, large-scale models themselves have matured.

Whether at OpenAI or other companies, the recent improvements in large models have been more incremental than leaps like the one from GPT-3.5 to GPT-4.

Against this backdrop, the overall capabilities of large models are stabilizing and are sufficient to serve as a reliable foundation for embodied intelligence systems. ChatGPT-level capability proves that large language models can understand complex instructions and perform planning and reasoning.

This capability can be transferred to robots.

If you say, "Make me breakfast," it can plan a multi-step sequence like "first get the eggs, then crack the eggs, then fry them."

Second, computing power prices have plummeted.

Overall computing power levels are continuously improving.

As chip manufacturers continuously release new generations of chips with stronger performance, the unit cost of equivalent computing power shows a long-term downward trend.

Often, the cost of obtaining the same computing power drops to half of what it was every few years.

In 2023, renting an NVIDIA H100 GPU was still exorbitantly expensive.

The price war in cloud computing power is intensifying, significantly reducing the cost of training large models.

What was once a game only leading companies could afford is now accessible to startups.

Third, the hardware supply chain has matured; the overall maturity of robot hardware components is relatively high.

Especially driven by the humanoid robot boom of the past year, a large amount of capital and engineering resources have been invested in the research and development of core components, including motors, reducers, and other key components.

This has led to continuous technological maturation and cost reduction.

Unitree has directly lowered its price to $5,900.

Previously, the industry generally believed that even with large-scale production, prices would stay in the $20,000-$30,000 range.

The sharp drop in the cost curve makes commercialization no longer a pipe dream.

These three forces combined have propelled embodied intelligence from the laboratory to the verge of commercialization. This is not blind optimism, but a rational judgment based on the maturity of the technology.

So where are the current capabilities of embodied intelligence, and what can it do?

Let's first talk about what it can do.

There are already practical applications in industrial and commercial scenarios.

Folding towels and clothes sounds simple, but Dyna's robots can fold 700 towels in 24 hours with a success rate of 99.4%. This is already real productivity in hotels and laundries.

Moreover, their base models contain data from various scenarios, such as chopping vegetables and fruit, preparing food, cleaning up after breakfast, and logistics sorting.

In BMW Group's factories, Figure's robots are doing simple assembly and material handling.

Agility Robotics' Digit is moving boxes in warehousing and logistics scenarios.

1X will also deliver up to 10,000 Neo humanoid robots to the Swedish giant EQT, primarily for industrial scenarios such as manufacturing, warehousing, and logistics.

Not to mention, Amazon has already deployed 1 million dedicated robots, a figure approaching its 1.56 million human employees.

These aren't demos; they are real, running commercial projects.

This is "rational progress": not seeking omnipotence, but practicality.

So what tasks are currently impossible for leading companies to tackle?

For example, medium-difficulty tasks like making breakfast.

This is a "long-term task" requiring planning across multiple steps: getting ingredients, chopping vegetables, plating, turning on the heat, and stir-frying.

Each step requires precise execution and force control: avoiding crushing the eggs or cutting your hand while chopping vegetables.

Dyna's latest demo has already shown it handling the long-term task of making breakfast, and Figure has also shown two robots working collaboratively: one robot passes tools while the other performs the operation.

This would be very useful in home scenarios, but its stability still needs refinement.

The most difficult aspect is housework.

Why?

Because every home environment is different.

Changes in lighting, the placement of items, and the movement of family members all present challenges in an "unstructured environment."

In contrast, factories are "structured environments" with fixed lighting, fixed item positions, and standardized processes.

However, the home is a completely different story. Moreover, housework has a fatal requirement: zero tolerance for error.

In a factory, if a robot breaks a part, the loss is controllable.

But at home, if a bowl is broken or someone is injured, it's an accident.

For example, when your robot is performing a task, there might be a small crease on your tablecloth, your cup might be unstable, or a transparent object might reflect light and interfere with your camera.

These tiny physical changes are easy for humans to handle; we adapt instantly based on intuition and rich experience. Robots, however, rely heavily on data-driven AI models, and when facing these new situations they may not be able to truly perceive and handle them.

Therefore, the technological threshold for robots entering homes is much higher than for entering factories.

But this doesn't mean it's unattainable.

We believe the initial focus will be on markets like those we are currently exploring, such as commercial services and scenarios where robots work alongside humans to complete tasks.

However, we believe the home market isn't that far off.

A complete, universal AGI isn't necessary; you might only need a few tasks to enter the home environment.

First, let the robot do a few chores at home, and then gradually develop more capabilities through model iteration. Of course, we'll reduce the hardware cost to a level affordable for ordinary families. Within that scope, we might prioritize, for example, selling the clothes-folding function to families first, and then gradually expand to other functions.

So this timeline shouldn't be too far off, probably around 1-2 years.

This is "rational progress": it doesn't mean waiting until robots become the all-powerful butlers of science fiction movies before bringing them to market, but rather starting with a clearly defined function that users truly need and iterating gradually.

Next, let's talk about the technological breakthroughs in 2025.

Why has this industry suddenly exploded this year?

Although there are many challenges, there were indeed several noteworthy technological breakthroughs in 2025.

Industry insiders frankly told us that none of these breakthroughs is revolutionary on its own, but they all represent real progress.

The first advance is in architecture: many companies are starting to adopt the so-called "System 1 + System 2" design. It consists of two parts: System 1, or "fast thinking," responsible for reflexive actions like grasping and moving, with a small number of parameters (perhaps only 80 million); and System 2, or "slow thinking," responsible for complex planning tasks like "making breakfast," with a large number of parameters (perhaps 7 billion).

This division of labor is similar to the human brain.

Reaching out to catch a ball is an instinctive reaction , but planning a meal requires careful thought.
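As a rough illustration of that division of labor, here is a hedged sketch of a dual-frequency control loop; the function names, the latent-plan interface, and the replanning and control rates are assumptions for illustration, not Figure's or anyone else's actual design.

```python
# Sketch of a "System 1 + System 2" control loop. The function bodies, the
# latent-plan interface, and the loop rates are illustrative assumptions.
import time

def system2_plan(observation, instruction):
    """Slow thinking: a large model (billions of parameters) that breaks a
    task like "make breakfast" into a subgoal / latent plan. Runs at a few Hz."""
    return {"subgoal": "pick up the egg", "latent": [0.0] * 64}  # placeholder output

def system1_act(observation, latent_plan):
    """Fast thinking: a small reactive policy (tens of millions of parameters)
    that maps the current frame plus the latent plan to a motor command."""
    return {"joint_velocities": [0.0] * 7, "gripper": 0.0}  # placeholder output

PLAN_PERIOD_S = 0.5      # System 2 replans a couple of times per second (assumed)
CONTROL_PERIOD_S = 0.01  # System 1 issues commands ~100 times per second (assumed)

latent_plan, last_plan_time = None, 0.0
for _ in range(1000):  # a real controller would run this loop indefinitely
    obs = {"image": None, "proprioception": None}  # read sensors here
    now = time.monotonic()
    if latent_plan is None or now - last_plan_time > PLAN_PERIOD_S:
        latent_plan = system2_plan(obs, "make breakfast")  # slow, infrequent
        last_plan_time = now
    command = system1_act(obs, latent_plan)  # fast, every tick
    # send `command` to the robot's low-level motor controllers here
    time.sleep(CONTROL_PERIOD_S)
```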

Figure AI's Helix model is a prime example of this architecture.

Within two weeks of parting ways with OpenAI, they rapidly launched this self-developed model.

The success of this architecture suggests that robot foundation models may not follow the same scaling law as large language models; bigger isn't always better.

It's about finding the right parameter allocation strategy.

Next, let's talk about data breakthroughs.

Why is robot data so expensive?

The reason is simple : humans only have 24 hours a day.

Collecting real-world data is too slow and expensive.

Nvidia's solution is to generate synthetic data using simulators.

They've demonstrated generating 780,000 operational trajectories in 11 hours, equivalent to 6,500 hours or nine consecutive months of human demonstration data.
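A quick sanity check of those figures, using only the numbers quoted above:

```latex
\frac{6{,}500\ \text{h} \times 3600\ \text{s/h}}{780{,}000\ \text{trajectories}} \approx 30\ \text{s per trajectory},
\qquad
\frac{6{,}500\ \text{h}}{24\ \text{h/day}} \approx 271\ \text{days} \approx 9\ \text{months}.
```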

Although synthetic data differs from real data, it at least solves the urgent "data shortage."
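Conceptually, the simulator-driven pipeline looks something like the sketch below; the `sim` interface, parameter names, and ranges are hypothetical placeholders rather than NVIDIA's actual Isaac or Genesis APIs.

```python
# Sketch of synthetic-data generation with domain randomization in a simulator.
# The `sim` and `expert_policy` objects and the parameter names are hypothetical
# placeholders, not a real simulator's API.
import random

def randomize_scene(sim):
    """Vary physical and visual parameters so trajectories cover many
    conditions the real world might present."""
    sim.set("friction", random.uniform(0.3, 1.2))
    sim.set("towel_stiffness", random.uniform(0.1, 1.0))
    sim.set("light_intensity", random.uniform(0.5, 2.0))
    sim.set("object_pose_noise_cm", random.uniform(0.0, 5.0))

def collect_trajectories(sim, expert_policy, n_trajectories):
    dataset = []
    for _ in range(n_trajectories):
        randomize_scene(sim)
        obs = sim.reset()
        trajectory, done = [], False
        while not done:
            action = expert_policy(obs)        # scripted or learned demonstrator
            next_obs, done = sim.step(action)
            trajectory.append((obs, action))   # store (observation, action) pairs
            obs = next_obs
        dataset.append(trajectory)
    return dataset  # later mixed with a smaller set of real-robot data
```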

However, there's a key technical trade-off here. We've talked to many people working on large language models before, and they've mentioned data in the language field.

They've found that even with a lot of low-quality data, like a chunk of text with an ad inserted in the middle followed by more text, it's still possible to train a relatively good model.

Because once a model has enough data, it can automatically filter out ads.

However, for robots, we believe the scaling law depends more on high-quality data. If the data is too messy and mixed, the robot model might not know where to focus its attention, so the final result isn't as good.

Another breakthrough in 2025 is cross-robot generalization. Physical Intelligence's π0 model and the open-source OpenVLA model can control multiple different robots.

The same model or strategy can work effectively on robots with different shapes and hardware configurations without needing to be retrained for each robot.

This is called cross-robot generalization ability, which is very important.

Previously, each robot required separate model training, which was costly.

Now, a single model can be adapted to multiple robots, and data can be shared, which can significantly reduce costs.

However, the technical difficulties are also obvious.

Different robots have huge differences in motion space, arm length, and number of joints.

So how can a single model control them all?

This ability to work in completely unfamiliar environments is not 100% perfect , but it is already a substantial improvement.
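One common recipe for this, sketched below under stated assumptions, is to pad every robot's action vector to a shared maximum width and condition the policy on an embodiment descriptor; this illustrates the general idea, not the actual design of π0 or OpenVLA.

```python
# Sketch of one cross-embodiment policy recipe: a padded action space plus an
# embodiment embedding. Dimensions and robot IDs are illustrative assumptions.
import torch
import torch.nn as nn

MAX_DOF = 16  # shared action width; each robot uses only its first `dof` slots

class CrossEmbodimentPolicy(nn.Module):
    def __init__(self, obs_dim=512, embodiment_dim=32, hidden=512):
        super().__init__()
        self.embodiment_embed = nn.Embedding(8, embodiment_dim)  # one id per robot type
        self.net = nn.Sequential(
            nn.Linear(obs_dim + embodiment_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, MAX_DOF),  # padded action vector
        )

    def forward(self, obs, embodiment_id, dof):
        emb = self.embodiment_embed(embodiment_id)
        padded_action = self.net(torch.cat([obs, emb], dim=-1))
        # Zero out the action slots this particular robot does not have.
        mask = torch.arange(MAX_DOF, device=obs.device) < dof
        return padded_action * mask

# Usage: the same weights serve robots with different joint counts.
policy = CrossEmbodimentPolicy()
obs = torch.randn(1, 512)
action_arm = policy(obs, torch.tensor([0]), dof=7)        # e.g. a 7-joint arm
action_humanoid = policy(obs, torch.tensor([1]), dof=12)  # e.g. a 12-DoF upper body
```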

The final breakthrough is multi-robot collaboration.

Figure shows the use of a single neural network to coordinate the cooperation of two robots.

Innovatively, a single neural network is used to control the entire upper body's 35 degrees of freedom while simultaneously controlling two robots to cooperate.

It sounds simple, but it is actually very difficult.

The two robots need to cooperate with each other, and their timing, force, and positions need to be precisely synchronized.

This will be very useful in future factory scenarios, but it is still in the early verification stage.

So none of these technological breakthroughs is disruptive on its own, but each one represents solid progress. This is precisely the defining characteristic of 2025: no longer pursuing flashy demos, but steadily advancing in directions that are verifiable, quantifiable, and reproducible.

I mentioned at the beginning that I wanted to pour cold water on this because there are still many core problems to be solved in the field of embodied intelligence.

Technological breakthroughs are one side of the story, but the industry still faces several major hurdles.

Clearly recognizing these challenges is precisely the prerequisite for "rational progress," and it's also what has brought embodied intelligence to the eve of a major explosion.

First, there's the data difficulty.

ChatGPT's training used trillions of tokens, essentially feeding it the entire text of the internet, but robot operation data is extremely scarce.

Google spent 17 months training its RT-2 model, collecting 130,000 data points in a real kitchen, yet its generalization ability is still limited.

Why is robot data so difficult to collect?

Because it requires real robots operating in real environments.

Every piece of data costs money and time, and errors can damage equipment.

This is unlike text data, which can be obtained by web crawlers.

Therefore, most robot foundation models still rely on a small amount of real data plus a large amount of simulated synthetic data, combined with reinforcement learning or self-supervised methods.

We then interviewed Physical Intelligence, a star startup in the robot "brain" field in Silicon Valley. A researcher there made a bold prediction: if we assume a person's life is 100 years, roughly speaking, that's about 1 million hours.

I think that, to my knowledge or within the scope of publicly available information, no one has a dataset of 1 million hours.

I speculate that we might only begin our exploration when we receive data equivalent to 1 million hours of physical experience in a lifetime.
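The rough arithmetic behind that 1-million-hour figure:

```latex
100\ \text{years} \times 365\ \text{days/year} \times 24\ \text{h/day} \approx 876{,}000\ \text{h} \approx 10^{6}\ \text{h}.
```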

If data is the "oil" for robots, then that well hasn't been drilled yet.

The second problem is the gap between the virtual and real worlds.

Training robots in the virtual world is cheap, and tens of thousands of simulators can run simultaneously.

However, the virtual world is never the real world.

Just like being good at racing games doesn't mean you can actually drive an F1 car.

Real-world friction, softness, and lighting variations are too complex; simulations can only reproduce some of the real physical properties.

What remains unreproduced is the root cause of the robot's "culture shock" when it moves from the simulator to the real world. Nvidia's Genesis and Isaac simulators are working hard to narrow this gap, but completely eliminating it will take time.

The third unsolved problem is called the embodiment gap.

A human hand has 27 joints and can sense pressure, temperature, and texture, while a robot's dexterous hand typically has only 15-22 joints.

The sensors are also not as precise.

Even if a robot perfectly mimics human movement patterns, the results will differ.

A human can gently pick up an egg, but a robot might crush it with a little force.

First, the human hand and the robot hand need to be very similar to each other if you want their abilities to transfer well.

This is why many people are now working on highly dexterous hands that closely approximate human degrees of freedom; this is inherently very difficult.

Second, even if you get close, it's not exactly the same.

Therefore, there will still be a gap between robot data and human data, which we call the embodiment gap.

This gap is widely recognized in both academia and industry as a difficult problem to solve, and it results in low efficiency when transferring data.

Imagine collecting a lot of data: if only 30% or 50% of it is usable, the total amount has to be multiplied by that probability factor.
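Put as a back-of-the-envelope formula (the 10,000-hour figure below is purely illustrative, not from the transcript): if r is the fraction of collected data that actually transfers across the embodiment gap, the effective dataset shrinks accordingly.

```latex
D_{\text{effective}} = r \times D_{\text{collected}}, \qquad
\text{e.g. } r = 0.3\text{--}0.5 \;\Rightarrow\; 10{,}000\ \text{collected hours yield only } 3{,}000\text{--}5{,}000\ \text{effective hours}.
```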

This is a real limitation, meaning Tesla faces significant technical challenges in training Optimus's policy on massive amounts of human video from YouTube.

This is why Tesla halted production after about 1,000 units for a redesign.

The ideal is beautiful , but reality is harsh.

The fourth challenge is reliability.

If GPT answers incorrectly, users might just laugh it off; if a robot makes a mistake, it could damage things or injure people. That's the qualitative difference.

Embodied intelligence must achieve extremely high reliability to truly enter factories and homes; this standard is much more stringent than large language models.

The fifth challenge is the cost dilemma, a chicken-and-egg problem.

Currently, humanoid robots need to be priced around $20,000 to be attractive enough for logistics and other scenarios.

However, price reduction requires mass production, which in turn requires large orders, and large orders require sufficiently low prices. This creates a vicious cycle; someone needs to break the deadlock, but will that trigger a price war?

Or could it drive cost reduction across the entire industry? Observing and recognizing these challenges is not pessimism but rationality, because startups are realistically acknowledging these bottlenecks.

Embodied intelligence is on the verge of explosive growth.

Finally, let's talk about the main players in embodied intelligence and the paths they have chosen.

Faced with these challenges, different companies have chosen different paths.

One group includes Tesla and Figure.

Their strategy is to integrate hardware and software to create a data closed loop.

Tesla leverages its accumulated FSD autonomous driving technology to transfer visual perception and path planning capabilities to Optimus.

It can also accumulate data from factory production lines.

Former engineering director Milan Kovac put it bluntly: "We're just going from robots on wheels to robots with legs."

However, reality is more complex than expected.

The target of 5,000 units was only one-fifth completed, and production had to be halted for a redesign.

This shows that even a giant like Tesla has to bow to the embodiment gap.

Figure, after "parting ways" with OpenAI, independently developed the Helix model and took control of its own technology roadmap, demonstrating that they do have technical capabilities.

The 15-fold increase in valuation also proves the capital market's recognition of this path.

However, they have only deployed a few dozen units commercially.

The demo is impressive, but scaling up is still in progress.

The second group includes Physical Intelligence and Skild, which we just mentioned. Unlike many robotics startups that simultaneously invest in hardware, these companies prioritize model-driven, cross-platform adaptability.

Physical Intelligence's π0 model is not tied to specific hardware and can adapt to multiple robots.

Their logic is to first strengthen the model's capabilities, which allows the optimal hardware solution to be chosen later.

Another company is Skild AI, a software company focused on building robot foundation models.

Skild AI's core direction is also to create a universal foundation model independent of specific robot forms, one that can be adapted and customized for different robot platforms and application scenarios.

In July of this year, Skild AI released its universal robot system, Skild Brain, and publicly demonstrated its ability to perform tasks such as picking up tableware and climbing stairs.

Recently, SoftBank and Nvidia have reportedly been planning to invest $1 billion in it, raising its valuation to $14 billion.

The third category is platforms that focus on ecosystems. Nvidia provides simulators and computing-power infrastructure and has open-sourced GR00T N1, but you have to adopt the entire NVIDIA ecosystem to use it.

Google, on the other hand, continuously invests in academic research.

The RT series models have influenced the entire academic community.

They provide the "water, electricity, and gas" for the entire industry.

Whoever can set industry standards controls the ecosystem.

All three paths are progressing, and no one has gained an absolute advantage.

Everyone is trying, iterating, and adjusting.

Returning to the initial question: is embodied intelligence a bubble or the future?

The answer is that by 2025, embodied intelligence is shifting from "pioneering debut" to "rational progress."

Technically, the combination of large models and robots has been validated, but it is far from mature; core problems such as data, generalization, and reliability have not yet been solved.

If we use the "GPT moment" as an analogy, robotics CTO Wang Hao believes that we are currently at the GPT-2 level.

I think we are at the GPT-2 stage now. In fact, we basically know that scaling is the only reliable path.

The key at this stage is to frantically accumulate data, scale up the model, and build the infrastructure for real-world applications.

Therefore, I predict that within one to two years, we can fully reach the level of GPT-3.

Note that it's GPT-3 , not GPT-4.

This is a straightforward assessment.

Researchers have seen the improvements brought by this scaling, making the path and goals clearer and more focused.

Commercially, pilot projects are beginning in industrial settings, with implementation cases in warehousing, manufacturing, and service industries.

However, large-scale commercialization may still require two to three years.

Our own goal is to at least achieve large-scale deployment in commercial scenarios next year. For home use, we will consider the opportunity later.

So this timeline shouldn't be too far off, probably around one to two years.

In terms of investment, there are both bubbles and opportunities.

Some companies have seen their valuations soar, some have suspended production, and some have gone bankrupt after burning through their cash. Open-source robotics company K-Scale Labs failed to secure funding and went under; Figure AI, on the other hand, is flush with cash.

The simultaneous existence of these two extremes indicates a market differentiation.

While the long-term trend of embodied intelligence is steady, short-term fluctuations are drastic.

What will be the first "killer application" for embodied intelligence?

It could be household chores, warehousing and logistics, or restaurant cleaning services. Regardless of the scenario, heavyweight players are already making moves.

Embodied intelligence is no longer a question of "whether it will happen," but rather "when it will happen."

In 2025, we stand at the starting point of this revolution.

The industry is no longer just showcasing cool demos, but is beginning to practically validate technology, refine products, and find application scenarios.

Tesla's production halt was not a failure, but a redesign in search of a more reliable path. Figure AI's soaring valuation is not just capital speculation; it reflects tangible results they have delivered, such as Helix.

Dyna's choice of towel folding as an entry point is not a sign of limited vision, but a way to accumulate data and build a flywheel that cultivates the model's learning capabilities. And Physical Intelligence's open-sourcing of π0 isn't about being insufficiently open; rather, it's about finding a balance between commercial interests and technology sharing.

This steady improvement on the existing foundation is precisely a sign of the industry maturing.

By 2025, the embodied intelligence industry has evolved from drawing "pie in the sky" to rolling up its sleeves and getting to work. And that pie is gradually, steadily becoming a reality.

Thank you for watching this episode of "Silicon Valley 101."

I'm co-founder Chen Qian, and I also thank you all for your support and companionship in 2025.

I hope we can create even more exciting content in 2026.

See you in the next video!

Bye!
