Physical Intelligence, the two-year-old San Francisco robotics startup, has published new research showing that its latest model can direct robots to perform tasks they were never explicitly trained on, a capability that surprised even the company's own researchers.
The model, called π0.7, represents what the company describes as an early but meaningful step toward the long-sought goal of a general-purpose robot brain: one that can be pointed at an unfamiliar task, coached in plain language, and complete it. If the findings hold up under scrutiny, they suggest robotic AI may be approaching an inflection point like the one seen in large language models, where capabilities begin to compound in ways that outstrip what the underlying data would predict.
The paper's central claim is compositional generalization: the ability to combine skills learned in different contexts to solve problems the model has never seen before. Traditional robot training has relied on something closer to rote memorization: collect data for a specific task, train a specialized model on that data, then repeat the process for every new task. Physical Intelligence argues that π0.7 breaks from that pattern.
Sergey Levine, a co-founder of Physical Intelligence and a UC Berkeley professor specializing in AI for robotics, explains, “Once it crosses that threshold where it goes from only doing exactly the stuff that you collect the data for to actually remixing things in new ways, the capabilities are going up more than linearly with the amount of data.” He adds, “That much more favorable scaling property is something we’ve seen in other domains, like language and vision.”
One of the paper's most striking demonstrations involved an air fryer, an appliance the model had almost never seen in training. When the team went looking, they found only two relevant examples in the entire training set: one in which a different robot simply pushed an air fryer closed, and another, from an open-source dataset, in which a robot placed a plastic bottle inside an air fryer on instruction. Somehow the model combined those fragments, along with broader web-based pretraining, into a working understanding of how the appliance operates.
“It’s very hard to track down where the knowledge is coming from, or where it will succeed or fail,” says Ashwin Balakrishna, a research scientist at Physical Intelligence and a Stanford computer science PhD student. Even so, without any prior coaching, the model made a credible attempt at using the appliance to cook a sweet potato. Given step-by-step verbal instructions, the way one might walk a new employee through a task, it succeeded.
That coachability matters: it suggests robots could be deployed in new environments and improved on the spot, without extensive additional data collection or model retraining.
For all that, the researchers are candid about the model's current limitations. In at least one case, they blame a failure mode on themselves.
“Sometimes the failure mode is not on the robot or on the model,” Balakrishna says. “It’s on us. Not being good at prompt engineering.” He recounts an early air fryer experiment that initially succeeded only 5% of the time. After roughly thirty minutes spent refining how the task was phrased to the model, the success rate climbed to 95%.
The model also cannot yet carry out complex multi-step tasks from a single high-level command. “You can’t tell it, ‘Hey, go make me some toast’,” Levine says. “But if you walk it through — ‘for the toaster, open this part, push that button, do this’ — then it actually tends to work pretty well.”
The team also acknowledged that robotics lacks standardized benchmarks, which makes external validation of the claims difficult. Instead, Physical Intelligence compared π0.7 against its own earlier specialist models, systems built and trained for individual tasks, and found that the generalist matched their performance across a range of complex jobs, including making coffee, folding laundry, and assembling boxes.
Perhaps the most striking thing about the research, if the researchers' accounts hold up, is not any single demonstration but how much the results surprised them: people whose job is to know the training data intimately, and therefore to know exactly what a model should and should not be able to do.
As Balakrishna puts it: “My experience has always been that when I deeply know what’s in the data, I can kind of just guess what the model will be able to do. I’m rarely surprised. But the last few months have been the first time where I’m genuinely surprised. I just bought a gear set randomly and asked the robot, ‘Hey, can you rotate this gear?’ And it just worked.”
Levine draws a parallel to the moment researchers first saw GPT-2 spontaneously generate a story about unicorns in the Andes. “Where the heck did it learn about unicorns in Peru?” he muses. “That’s such a weird combination. And I think that seeing that in robotics is really special.”
Critics may point to an obvious asymmetry: large language models had the entire internet to learn from, robots do not, and no amount of clever prompting fully closes that gap. Asked where he expects skepticism, though, Levine points somewhere else entirely.
“The criticism that can always be leveled at any robotic generalization demo is that the tasks are kind of boring,” he says. “The robot is not doing a backflip.” Levine pushes back on that framing, arguing that the real distinction is between an impressive robot demo and a robotic system that genuinely generalizes. Generalization, he contends, will always look less dramatic than a carefully choreographed stunt, and be far more useful.
The paper itself is careful and measured in its language, describing π0.7 as showing “early signs” of generalization and “initial demonstrations” of new capabilities. These are presented as research results, not a deployed product, and Physical Intelligence has been restrained about commercialization timelines since its founding.
Asked when a system based on these findings might be ready for real-world deployment, Levine declined to speculate. “I think there’s good reason to be optimistic, and certainly it’s progressing faster than I expected a couple of years ago,” he says. “But it’s very hard for me to answer that question.”
Physical Intelligence has raised more than $1 billion to date and was most recently valued at $5.6 billion. Part of the investor enthusiasm traces to co-founder Lachy Groom, who spent years as one of Silicon Valley's most respected angel investors, backing companies such as Figma, Notion, and Ramp, before deciding that Physical Intelligence was the venture he had been looking for. That pedigree has helped attract serious institutional money even as the startup has declined to give investors a commercialization timeline.
The company is reportedly in discussions for a new funding round that could nearly double its valuation to $11 billion. The team declined to comment on these reports.