Jun 25, 2026

World Models Enter the Lab: AI Is Proposing Hypotheses and Designing Experiments

World models are letting robots learn from human videos, and AI agents are screening millions of crystals for superconductors. Hypothesis generation and experimental validation are splitting apart, and the structure of research is changing.

On June 17, OpenAI published a study in which GPT-5.4 found a way to improve the Chan-Lam coupling reaction, raising yields from 16.6% to 25.2% with 88% of substrates showing better performance¹. This reaction has stumped medicinal chemists for decades. The carbon-nitrogen bond formation between sulfonamides and boronic acids is a persistent bottleneck in synthesis. The AI did not merely analyze existing data. It proposed a new hypothesis: add TEMPO as a mild oxidant to the reaction mixture. Then Molecule.one’s automated lab ran 10,080 reactions to test it¹.
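The two yield figures quoted above can be read as either an absolute or a relative gain, and the difference matters when comparing results. The short calculation below is a sketch using only the numbers reported in this article; the variable names are illustrative.

```python
# Yield figures reported in the article for the Chan-Lam study.
baseline_yield = 16.6  # percent, without TEMPO
improved_yield = 25.2  # percent, with TEMPO

# Absolute gain is measured in percentage points.
absolute_gain = improved_yield - baseline_yield

# Relative gain compares the improvement to the starting yield.
relative_gain = absolute_gain / baseline_yield

print(f"Absolute gain: {absolute_gain:.1f} percentage points")  # 8.6
print(f"Relative gain: {relative_gain:.1%}")                    # 51.8%
```

In other words, a gain of 8.6 percentage points corresponds to roughly half again as much product as before.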

Two months earlier, in April, a separate AI agent screened 2.4 million crystals in 28 GPU-hours, identified 68,000 superconductor candidates, and guided human researchers to synthesize four entirely new superconducting materials². The same month, the robot foundation model π0.7 displayed emergent capabilities. It figured out things no one had explicitly taught it³.

The common thread: AI is moving from understanding the physical world to doing experiments inside it.

World Models: From Video Generation to Physical Understanding

The idea of world models is not new, but its meaning has shifted over the past two years. Early world models were mostly about video generation: feed in a scene, predict the next frame. Today’s world models do far more. They model forces, spatial relationships, and cause and effect, then output actions.

Physical Intelligence’s π0 marks this transition. The company open-sourced π0’s weights and code in February 2025⁴. By April 2026, π0.7 showed “emergent capabilities” in scenarios that never appeared in its training data³. The model simply learned new manipulation strategies on its own. Its backers include Jeff Bezos, Sequoia Capital, and OpenAI⁵.

Figure AI takes a different path. Rather than building a model in isolation, it controls the full stack of hardware and data. In September 2025, Figure launched Project Go-Big: a partnership with Brookfield to collect human behavior data across 100,000 residential units, then train the Helix model on that data⁶. The result: robots trained on 100% human video and zero robot demonstrations can navigate real homes and respond to commands like “walk to the refrigerator”⁶. This marks the first end-to-end transfer from human video to robot behavior in a humanoid platform.

Table 1: Key World Model Advances ³⁴⁶⁷

Company	Product	Core Capability	Date
Physical Intelligence	π0.7	Steerable robot foundation model with emergent capabilities	2026.04
Figure AI	Helix + Go-Big	Zero-shot human-video-to-robot-behavior transfer	2025.09
ACE Robotics	Kairos 3.0	Open-source 4B world model, 72x faster than NVIDIA Cosmos 2.5	2026.03
World Labs	Marble	Text/image-to-3D-world generation	2025-2026

ACE Robotics, based in China, pursues an open-source strategy. In March 2026 it released Kairos 3.0-4B under Apache 2.0, available on both Hugging Face and ModelScope⁷. The 4-billion-parameter model runs in 23.5 GB of VRAM. On an A800 GPU, its inference speed is 72 times that of NVIDIA Cosmos 2.5⁷. It also runs on the NVIDIA Jetson Thor edge platform, generating output 1.5 times faster than real time⁷.
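To put the VRAM figure in context, the sketch below estimates how much memory the weights alone would occupy at a few common numeric precisions. The article does not state which precision Kairos uses, so the precision list is an assumption, and GB is taken to mean 10^9 bytes. Whatever is left over would go to activations, caches, and other runtime buffers; that breakdown is also an assumption, not something the source reports.

```python
PARAMS = 4_000_000_000    # 4 billion parameters, per the article
REPORTED_VRAM_GB = 23.5   # reported VRAM requirement, per the article


def weight_memory_gb(params: int, bytes_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


# Assumed precisions; the source does not say which one the model ships in.
for label, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    weights = weight_memory_gb(PARAMS, nbytes)
    remainder = REPORTED_VRAM_GB - weights
    print(f"{label:>9}: weights {weights:4.1f} GB, remainder {remainder:4.1f} GB")
```

At 16-bit precision the weights come to 8 GB, which would leave most of the reported footprint for everything other than the parameters themselves.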

Most significant is cross-embodiment transfer. The same Kairos model controls three distinct robots, Agilex PIPER, Unitree G1, and Galaxy General G1, without retraining for each platform⁷.

AI Starts Running Experiments

World models give AI a grasp of physical reality. The more surprising development is that AI has begun proposing experiments inside that reality. Not analyzing existing datasets, but generating new hypotheses, designing protocols, and having humans validate the results.

Superconductors. ElementsClaw is an agent framework that pairs a Large Atomic Model (LAM) with a Large Language Model (LLM)². Its 1-billion-parameter Elements model handles atomic-level numerical computation, while the LLM manages high-level semantic reasoning. On the superconductor discovery task, ElementsClaw screened 2.4 million stable crystals in 28 GPU-hours and found 68,000 high-confidence candidates². Human researchers then synthesized four novel superconducting materials under its guidance: Zr₃ScRe₈ (Tc = 6.5 K), HfZrRe₄ (Tc = 5.9 K), Zr₄VRe₇ (Tc = 3.5 K), and Hf₂₁Re₂₅ (Tc = 2.5 K)².
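The screening numbers above imply a throughput and a hit rate that are easy to miss in prose. The following sketch derives both from the figures quoted in this article; nothing here comes from the paper beyond those three numbers.

```python
screened = 2_400_000   # stable crystals screened, per the article
gpu_hours = 28         # compute spent on screening, per the article
candidates = 68_000    # high-confidence candidates, per the article

throughput = screened / gpu_hours                     # crystals per GPU-hour
gpu_seconds_per_crystal = gpu_hours * 3600 / screened
hit_rate = candidates / screened                      # fraction flagged

print(f"{throughput:,.0f} crystals per GPU-hour")               # 85,714
print(f"{gpu_seconds_per_crystal:.3f} GPU-seconds per crystal")  # 0.042
print(f"{hit_rate:.2%} of screened crystals flagged")            # 2.83%
```

Roughly one crystal in 35 was flagged as a candidate. The four synthesized materials say nothing yet about how the remaining candidates would fare in the lab.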

Optics. The Qiushi Discovery Engine is an end-to-end autonomous scientific discovery system⁸. Operating on a real optical platform, it consumed 1.459 million tokens, made 3,242 LLM calls, and issued 1,242 tool calls to autonomously propose and experimentally verify a new physical mechanism: optical bilinear interaction⁸. The mechanism structurally resembles the core operation of Transformer attention, and it may open paths toward high-speed, low-power optical hardware⁸. According to the authors, this is the first time an AI agent system has independently discovered and experimentally verified a previously unknown physical mechanism⁸.
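To make the attention comparison concrete: the raw attention score between two tokens is a bilinear function of their two input vectors, meaning it is linear in each input when the other is held fixed. The sketch below checks that property on a toy example in plain Python. It illustrates only the mathematical meaning of "bilinear" for attention scores (scaling and softmax omitted); it does not model the optical mechanism, which this article does not describe in detail. All matrices and vectors are made-up values.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_score(w_q, w_k, x, y):
    """Unnormalized attention score: (W_q x) . (W_k y)."""
    return dot(matvec(w_q, x), matvec(w_k, y))


# Toy projection matrices and inputs (illustrative values only).
W_Q = [[1.0, 2.0], [0.0, 1.0]]
W_K = [[0.5, 0.0], [1.0, -1.0]]
x1, x2 = [1.0, 2.0], [3.0, -1.0]
y1, y2 = [2.0, 1.0], [0.0, 1.0]

# Linear in the first input: score(2*x1 + 3*x2, y1) == 2*score(x1, y1) + 3*score(x2, y1)
mixed_x = [2 * a + 3 * b for a, b in zip(x1, x2)]
lhs_x = attention_score(W_Q, W_K, mixed_x, y1)
rhs_x = 2 * attention_score(W_Q, W_K, x1, y1) + 3 * attention_score(W_Q, W_K, x2, y1)

# Linear in the second input: score(x1, y1 + y2) == score(x1, y1) + score(x1, y2)
summed_y = [a + b for a, b in zip(y1, y2)]
lhs_y = attention_score(W_Q, W_K, x1, summed_y)
rhs_y = attention_score(W_Q, W_K, x1, y1) + attention_score(W_Q, W_K, x1, y2)

print(lhs_x, rhs_x)  # 14.0 14.0
print(lhs_y, rhs_y)  # 5.0 5.0
```

Any physical process that multiplies two signals in this way can, in principle, compute the same kind of score.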

Catalysts. The MASTER system uses hierarchical LLM reasoning to drive catalyst discovery, cutting the required atomic simulations by 90%⁹. It does not search at random. It thinks like a chemist: reason about which directions are worth trying, then validate with simulation.
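The general pattern of reasoning first and simulating second can be sketched as a two-stage screen: rank every candidate with a cheap prior, then spend the expensive simulation budget only on the shortlist. This is a hypothetical toy, not the MASTER implementation. The function names, the scoring functions, and the 10% budget are all assumptions chosen to mirror the 90% reduction quoted above, and the cheap prior merely stands in for LLM reasoning.

```python
from typing import Callable, Hashable, Sequence, Tuple


def screen_with_prior(
    candidates: Sequence[Hashable],
    prior_score: Callable[[Hashable], float],
    simulate: Callable[[Hashable], float],
    budget_fraction: float = 0.1,
) -> Tuple[Hashable, int]:
    """Simulate only the top-ranked fraction of candidates; return the best one."""
    budget = max(1, round(len(candidates) * budget_fraction))
    # Cheap step: rank every candidate by the prior (stand-in for LLM reasoning).
    shortlist = sorted(candidates, key=prior_score, reverse=True)[:budget]
    # Expensive step: run the simulation on the shortlist only.
    results = {c: simulate(c) for c in shortlist}
    best = max(results, key=results.get)
    return best, len(shortlist)


# Toy setup: 100 hypothetical candidate IDs; the true optimum is candidate 73.
calls = []


def toy_simulation(c: int) -> float:
    calls.append(c)  # record each expensive call
    return -float((c - 73) ** 2)


def toy_prior(c: int) -> float:
    return -abs(c - 70)  # imperfect guess: "somewhere near 70"


best, n_simulated = screen_with_prior(range(100), toy_prior, toy_simulation)
print(best, n_simulated, len(calls))  # 73 10 10
```

The saving depends entirely on the quality of the prior. If the guess had been centered far from the true optimum, the shortlist would have missed it.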

Table 2: Verified Cases of AI-Driven Scientific Discovery ¹²⁸⁹¹⁰

Domain	System	Discovery	Validation Method	Date
Medicinal chemistry	GPT-5.4 + Molecule.one	TEMPO improves Chan-Lam reaction	10,080 experiments, 14 control groups manually verified	2026.06
Superconductors	ElementsClaw	4 novel superconducting materials	Experimental synthesis + magnetization measurement	2026.04
Optics	Qiushi Engine	Optical bilinear interaction mechanism	Real optical platform experiment	2026.04
Catalysts	MASTER	Efficient catalyst screening	90% reduction in atomic simulations	2026.05
Proteins	ProteinMPNN	Protein sequence design	52.4% sequence recovery (Rosetta 32.9%)	Verified

One Architecture, Two Applications

The technical foundation of world models and scientific discovery is converging. Both use transformers. Both learn physical laws. The only real difference is training data.

The Kairos paper puts it plainly: world models are shifting from “passive video generators” to “infrastructure for physical AI,” which means understanding space, predicting the future, and outputting actions¹¹. The same description applies to scientific discovery. AI needs to understand molecular structure, predict reaction outcomes, and propose experimental protocols.

OpenAI appears on Physical Intelligence’s investor list⁵. This is no coincidence. The same company building language models is also backing robot world models. The logic is that understanding the physical world and understanding language will eventually converge on a single architecture.

AlphaFold offered an early preview of this convergence. In 2020 it solved the protein structure prediction problem. In 2024 its creators shared the Nobel Prize in Chemistry. By 2025 it had already spawned AI drug discovery companies such as Isomorphic Labs¹⁰. In its five-year retrospective, Google DeepMind described AlphaFold as a “template for AI-accelerated science”¹⁰.

Hypothesis Generation and Experimental Validation Are Splitting Apart

These cases reveal a shift already underway. What AI does well and what humans do well are separating into distinct roles.

What AI does well: Search high-dimensional spaces, whether that means 2.4 million crystals, 10,080 reactions, or 1.459 million tokens of experimental reasoning. It can connect knowledge scattered across different papers, such as TEMPO’s role in copper-catalyzed oxidation and the sulfonamide yield problem, then propose trials to run in massive parallel.

What humans still own: Designing experiments, physically executing them, judging what the results mean, and deciding whether a question is worth asking in the first place.

The DiscoverPhysics benchmark makes this distinction concrete. The strongest AI agents passed only 50% of 22 “non-standard physics” worlds¹². In these worlds the physical laws had been deliberately altered. They were not standard Newtonian mechanics. The agents had to discover new laws from experimental data. AI could find answers, but it did not necessarily understand why those answers were right. Predictive accuracy is not the same as understanding.

GPT-5.4 found TEMPO, but the reaction mechanism and industrial applicability were confirmed by Molecule.one chemists who manually ran 14 control experiments¹. The Qiushi engine discovered optical bilinear interaction, yet whether the mechanism truly holds still awaits replication by other labs⁸.

The role shift: Scientists are becoming reviewers of AI-generated proposals and executors of validation, rather than the people who run every experiment themselves. A single chemist can review more hypotheses in a day than they could personally test in a lifetime. Human judgment is being amplified, not replaced.

The Structure of Research Is Changing

Traditional science is bottlenecked by human time. A principal investigator mentors a handful of graduate students and runs a few dozen experiments per year. That bottleneck is shifting.

Compute becomes infrastructure. ElementsClaw screened 2.4 million crystals in 28 GPU-hours, a volume equivalent to decades of traditional database accumulation². Labs now need GPU clusters the way they once needed NMR machines.

Data becomes the moat. The competition between Physical Intelligence and Figure is not about model architecture. It is about data. Figure secured 100,000 residential units through Brookfield for data collection⁶. Physical Intelligence open-sourced its model but keeps its data private. Exclusive data is the real barrier to entry.

The front end accelerates; the back end does not. Hypothesis screening has compressed from months to days. But experimental validation, clinical trials, and industrial scale-up still move at human speed. AI accelerates thinking, not doing.
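This is the situation Amdahl’s law describes: when only one stage of a pipeline speeds up, the overall gain is capped by the stages that do not. The sketch below applies the formula to a hypothetical project. The 90-day screening phase, the 630-day validation phase, and the 30x screening speedup are invented for illustration and do not come from any of the studies cited here.

```python
def overall_speedup(accelerated_fraction: float, stage_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when one stage gets faster.

    accelerated_fraction: share of the original total time spent in that stage.
    stage_speedup: how many times faster that stage becomes.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if stage_speedup <= 0:
        raise ValueError("stage_speedup must be positive")
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / stage_speedup)


# Hypothetical project: 90 days of screening, 630 days of validation.
screening_days, validation_days = 90, 630
fraction = screening_days / (screening_days + validation_days)  # 0.125

with_ai = overall_speedup(fraction, stage_speedup=30)  # screening: 90 days -> 3 days
ceiling = 1.0 / (1.0 - fraction)                        # limit if screening took no time

print(f"Overall speedup: {with_ai:.3f}x")  # 1.137x
print(f"Upper bound:     {ceiling:.3f}x")  # 1.143x
```

Under these assumed numbers, a 30-fold faster front end shortens the whole project by under 14%, which is why validation capacity becomes the constraint.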

Talent is restructuring. The people you need are no longer “chemists who can run experiments.” They are “domain experts who can judge whether AI output is correct.” The core skill is not programming. Code is increasingly written by AI anyway. It is physical intuition: knowing where an AI-proposed hypothesis might fail, knowing whether a result matches expectations, knowing which anomalies are worth chasing and which are noise. That intuition comes from decades of hands-on experience, and no model can replicate it.

When AI can screen in 28 GPU-hours a volume of material that traditional databases took decades to accumulate, scientific competition shifts partly from “who has the best scientists” to “who has the best AI, the best data, and the best validation platform.” But the final verification still requires human hands. That step cannot be skipped.

What to Watch

Can other labs replicate OpenAI and Molecule.one’s TEMPO discovery? Is the reaction mechanism fully understood?
Of ElementsClaw’s 68,000 superconductor candidates, how many will survive experimental validation?
How well does cross-embodiment transfer work in real commercial settings?
As AI generates hypotheses faster, will experimental validation become the new bottleneck?

Conclusion

AI is crossing from simulation into experimentation. World models give robots the ability to learn from human video. Scientific agents propose hypotheses that human researchers then validate in physical labs. The division of labor is becoming clear. AI searches vast spaces and connects distant ideas. Humans design, execute, and judge the experiments that matter.

The infrastructure of science is adjusting accordingly. Compute clusters replace traditional instruments as the core lab resource. Data access, not model architecture, determines who leads. The people who thrive will be those who can read AI output with the skeptical eye of someone who has spent years at the bench. The experiments themselves still belong to us.

References

1. TechsCurrent. OpenAI’s AI Chemist Finds a Lab-Tested Way to Improve Drug Discovery Chemistry. https://techscurrent.com/2026/06/openai-ai-chemist-drug-discovery-chan-lam-reaction/
2. arXiv. Agentic Fusion of Large Atomic and Language Models to Accelerate Superconductor Discovery. https://arxiv.org/abs/2604.23758
3. Physical Intelligence. π0.7: a Steerable Model with Emergent Capabilities. https://www.pi.website/blog/pi07
4. Physical Intelligence. Open Sourcing π0. https://www.pi.website/blog/openpi
5. Physical Intelligence. About / Investors. https://www.pi.website/
6. Figure AI. Project Go-Big: Internet-Scale Humanoid Pretraining and Direct Human-to-Robot Transfer. https://www.figure.ai/news/project-go-big
7. ACE Robotics / GitHub. Kairos 3.0: A Native World Model Stack for Physical AI. https://github.com/kairos-agi/kairos-sensenova
8. arXiv. End-to-end autonomous scientific discovery on a real optical platform. https://arxiv.org/abs/2604.27092
9. npj Computational Materials. Hierarchical Multi-agent Large Language Model Reasoning for Autonomous Heterogeneous Catalyst Discovery. https://www.nature.com/articles/s41524-026-02139-1
10. Google DeepMind Blog. AlphaFold: Five Years of Impact. https://deepmind.google/blog/alphafold-five-years-of-impact/
11. arXiv. Kairos: A Native World Model Stack for Physical AI. https://arxiv.org/html/2606.16533v2
12. arXiv. DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking. https://arxiv.org/html/2605.26087v1