Archived issue · 05-25-2026

The Frontier · Issue 05-25-2026

Gemini 3.5 lands, Co-Scientist gets real, and agent runtimes harden

Google I/O dominated the week: Gemini 3.5 Flash, Omni, and Antigravity 2.0 reposition Gemini as both a consumer AI and an agent runtime, with Google claiming roughly 4x faster output for Flash than other frontier models at under half the cost. Co-Scientist moved from demo to deployment: labs at Calico, Stanford, MIT, Cambridge, and Edinburgh are now publishing leads generated with it, including a successful cellular rejuvenation result. On the agent infrastructure side, the stack is differentiating rapidly: LangSmith Engine and SmithDB for trace-driven CI, Codex going to Windows/Dell on-prem and into mobile, and Notion exposing an External Agents API. OpenAI also disclosed that an internal model disproved an 80-year-old discrete geometry conjecture, a real, verifiable mathematical contribution rather than a benchmark stunt. Policy was quiet on AI specifically; the supply-chain story (TanStack npm "Mini Shai-Hulud") is the more relevant security item for AI shops shipping JS/TS code through Codex-style agents.

Published: Monday, May 25, 2026
Entries: 11
Cadence: Weekly · Sundays
Curator: Brad Anderson

Wire

arxiv.org New paper on tool-use generalization across model families ·

huggingface.co Trending: open-weights vision-language model passes 70% on MMMU ·

anthropic.com MCP server registry surpasses 1,200 published servers ·

deepmind.google Gemini Robotics paper updates with new manipulation benchmarks ·

figure.ai Figure publishes monthly humanoid uptime telemetry ·

arxiv.org Mech-interp finding: refusal vector universal across families ·

whitehouse.gov New EO draft on federal agency AI procurement circulating ·

eu.europa.eu AI Act guidance v3 published — focus on systemic-risk thresholds ·

01

Frontier Models

releases · benchmarks · weights

1 entry

blog.google 2mo

▲ headline

Gemini 3.5 launches with Flash, Omni, and Antigravity 2.0 agent stack

Google announced Gemini 3.5 Flash alongside Gemini Omni (multimodal generation including video) and Antigravity 2.0. Per Google, Flash benchmarks at 76.2% on Terminal-Bench 2.1, 1656 Elo on GDPval-AA, 83.6% on MCP Atlas, and 84.2% on CharXiv Reasoning, sitting in the top-right quadrant of Artificial Analysis's Intelligence Index. Google reports roughly 4x faster output tokens/sec than other frontier models at under half the cost.

Fruition take

Flash is the new default for agentic loops where latency dominates cost — Google's 4x throughput claim plus the sub-half-cost framing changes the cost-per-task math more than the headline benchmark scores. If you're still routing everything through Pro tier, rerun the numbers this week.
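To make "rerun the numbers" concrete, here is a minimal sketch of the cost-per-task arithmetic. Every price and throughput figure below is a placeholder chosen to mirror the "4x faster, under half the cost" framing, not a published rate; the tier names, ModelProfile, cost_per_task, and generation_seconds are hypothetical, and the sketch counts output tokens only, ignoring input tokens, network latency, and tool-call time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Pricing and speed for one model tier. All values used here are placeholders."""
    name: str
    usd_per_million_output_tokens: float
    output_tokens_per_second: float


def cost_per_task(model: ModelProfile, tokens_per_step: int, steps: int) -> float:
    """Dollar cost of the output tokens one multi-step agent task generates."""
    return tokens_per_step * steps / 1_000_000 * model.usd_per_million_output_tokens


def generation_seconds(model: ModelProfile, tokens_per_step: int, steps: int) -> float:
    """Time spent generating output, ignoring network and tool latency."""
    return tokens_per_step * steps / model.output_tokens_per_second


# Hypothetical tiers: the second is 4x faster at 40% of the price.
baseline = ModelProfile("baseline-tier", 10.0, 50.0)
fast = ModelProfile("fast-tier", 4.0, 200.0)

for model in (baseline, fast):
    cost = cost_per_task(model, tokens_per_step=500, steps=20)
    seconds = generation_seconds(model, tokens_per_step=500, steps=20)
    print(f"{model.name}: ${cost:.4f} per task, {seconds:.0f}s generating")
```

Swap in your own measured tokens per step, step counts, and contract prices; the point is that a long agent loop multiplies both differences by the number of steps.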

02

Agents & Tooling

protocols · SDKs · runtime

1 entry

news.smol.ai 2mo

Agent runtime layer consolidates: LangSmith Engine, SmithDB, Notion External Agents API

LangChain shipped LangSmith Engine (CI/CD loops for agents) and SmithDB (12–15x faster trace querying), and launched LangChain Labs to feed production traces back into training. Notion opened an External Agents API integrating Claude and Codex. VS Code and GitHub Copilot CLI added multi-agent workflow windows. The week marked a clear shift from "chat UX" to durable, inspectable, long-running agent state.

Fruition take

Trace-driven evals are becoming table stakes — if your agent stack still relies on prompt tweaks plus vibes, you're about to fall behind teams that close the loop on production traces weekly. Pick a trace store and an eval harness now.
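For teams starting from zero, a trace-driven eval can be sketched in a few dozen lines before committing to a product. The example below is a sketch under stated assumptions, not the LangSmith API: the trace shape (a dict with "output", "tool_calls", and "error" keys), the three checks, and the 0.9 threshold are all hypothetical.

```python
from typing import Callable

Trace = dict  # assumed shape: {"output": str, "tool_calls": list, "error": str | None}
Check = Callable[[Trace], bool]


def no_error(trace: Trace) -> bool:
    """The run finished without a recorded error."""
    return trace.get("error") is None


def non_empty_output(trace: Trace) -> bool:
    """The agent produced some non-whitespace final output."""
    return bool((trace.get("output") or "").strip())


def bounded_tool_calls(trace: Trace, limit: int = 10) -> bool:
    """The agent did not loop through more tool calls than the limit."""
    return len(trace.get("tool_calls") or []) <= limit


CHECKS: dict[str, Check] = {
    "no_error": no_error,
    "non_empty_output": non_empty_output,
    "bounded_tool_calls": bounded_tool_calls,
}


def evaluate(traces: list[Trace], checks: dict[str, Check]) -> dict[str, float]:
    """Return the pass rate of each check across all traces."""
    if not traces:
        raise ValueError("no traces to evaluate")
    return {
        name: sum(check(t) for t in traces) / len(traces)
        for name, check in checks.items()
    }


def failing_checks(pass_rates: dict[str, float], threshold: float) -> list[str]:
    """Names of checks whose pass rate is below the threshold, sorted."""
    return sorted(name for name, rate in pass_rates.items() if rate < threshold)


# Made-up traces standing in for a week of production runs.
sample_traces = [
    {"output": "Refund issued.", "tool_calls": ["lookup", "refund"], "error": None},
    {"output": "Order found.", "tool_calls": ["lookup"], "error": None},
    {"output": "", "tool_calls": ["lookup"], "error": None},
    {"output": "Partial answer.", "tool_calls": [], "error": "timeout"},
]

rates = evaluate(sample_traces, CHECKS)
print(rates)
print("failing:", failing_checks(rates, threshold=0.9))
```

In CI, a non-empty list from failing_checks would fail the build; the weekly loop is then a matter of exporting fresh traces and adding a check each time a new failure mode shows up.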

03

Robotics & Embodied

humanoids · manipulation · field deployments

0 entries

no entries this week

04

Research

papers · interp · alignment · scaling

4 entries

▲ headline

OpenAI model disproves 80-year-old unit-distance conjecture

An OpenAI model produced a construction that disproves a central conjecture in discrete geometry — the unit distance problem, open since the 1940s. The result is verifiable by human mathematicians and represents a concrete mathematical contribution rather than a benchmark score, though debate continues over how much credit belongs to the model vs. its human collaborators.

Fruition take

Verifiable math is the cleanest signal we have that reasoning models are doing real work — no rubric, no judge model, just a proof that checks out. Worth tracking which classes of problems this generalizes to before assuming it transfers to your domain.

allenai.org 2mo

Ai2 launches AIMIP, an open benchmark for AI weather and climate models

Ai2 (the Allen Institute for AI) introduced AIMIP, an open benchmark and dataset for evaluating AI weather and climate models against conventional physics-based models. Early results show AI models match or beat conventional models on historical climate metrics but still generalize poorly to long-term warming trends and unseen scenarios, a negative finding the team published alongside the positive ones.

Fruition take

The framing here — "good on in-distribution, weak on out-of-distribution" — is the template more domain benchmarks should adopt. If you're evaluating AI for scientific or industrial forecasting, AIMIP's methodology is worth borrowing.

deepmind.google 2mo

DeepMind Co-Scientist moves from preview to deployed across multiple labs

DeepMind detailed Co-Scientist, a multi-agent Gemini-based research partner, with concurrent case studies from Calico (aging), Stanford (liver fibrosis), MIT (ALS and cellular rejuvenation RNA approaches), Cambridge (infectious disease genetics), and Edinburgh (liver disease). One MIT lab reports successfully identifying novel factors that rejuvenate human cells in vitro.

Fruition take

The pattern here — a generalist agent paired with domain experts who validate leads in wet labs — is the realistic shape of "AI for science." If you're building research-assistant products, the case studies are a better spec than any benchmark.

allenai.org 2mo

Artificial Analysis adopts Ai2's IFBench for instruction-following evaluation

Artificial Analysis added Ai2's open IFBench to its public model evaluations. IFBench measures whether models reliably follow complex, multi-part user instructions — a capability most leaderboards underweight despite its outsized impact on real product behavior.

Fruition take

Instruction-following is what users actually notice when an agent "feels dumb," and it's poorly correlated with reasoning benchmarks. Add IFBench (or a domain-specific variant) to your model selection process — the rankings will shuffle.
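One way to read "a domain-specific variant" is a set of programmatic constraint checks run over model responses. The sketch below is an illustration of that idea only and does not reproduce IFBench's tasks or scoring; the constraint helpers, the sample responses, and the all-or-nothing scoring rule are assumptions made for the example.

```python
from typing import Callable

Constraint = Callable[[str], bool]


def max_words(limit: int) -> Constraint:
    """Response must contain at most `limit` whitespace-separated words."""
    return lambda text: len(text.split()) <= limit


def includes(phrase: str) -> Constraint:
    """Response must mention `phrase`, case-insensitively."""
    return lambda text: phrase.lower() in text.lower()


def bullet_count(expected: int) -> Constraint:
    """Response must contain exactly `expected` lines starting with '- '."""
    return lambda text: sum(
        line.lstrip().startswith("- ") for line in text.splitlines()
    ) == expected


def follows_all(response: str, constraints: list[Constraint]) -> bool:
    """True only if the response satisfies every constraint."""
    return all(constraint(response) for constraint in constraints)


def strict_accuracy(responses: list[str], constraints: list[Constraint]) -> float:
    """Fraction of responses that satisfy every constraint at once."""
    if not responses:
        raise ValueError("no responses to score")
    return sum(follows_all(r, constraints) for r in responses) / len(responses)


# Hypothetical instruction: "Summarize in exactly two bullets, under 20 words, mention the refund."
constraints = [max_words(20), includes("refund"), bullet_count(2)]
responses = [
    "- Refund issued\n- Ticket closed",
    "Refund issued and the ticket is closed.",
]

print(strict_accuracy(responses, constraints))
```

Scoring a response as a pass only when every part of the instruction is met is a deliberate choice here: multi-part instructions tend to fail on one clause, and averaging per-constraint scores hides that.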

05

Policy & Governance

enforcement · frameworks · safety

1 entry

OpenAI discloses TanStack npm supply-chain attack response

OpenAI detailed its response to the TanStack "Mini Shai-Hulud" npm supply chain attack, including rotated signing certificates and a mandatory macOS app update by June 12, 2026. The disclosure is unusually specific about which internal systems were touched and how OpenAI is hardening its signing pipeline.

Fruition take

Codex and similar agents are now a primary vector for installing compromised npm packages at scale. Lock down agent egress to package registries, pin dependencies, and treat agent-generated PRs the same way you treat external contributions.
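As a starting point for "pin dependencies," the sketch below flags any package.json entry that is not an exact version. It is a minimal check under stated assumptions: the manifest content and package names are made up, only the dependencies and devDependencies sections are inspected, and anything that is not a plain x.y.z version (ranges, tags, git URLs) is treated as unpinned.

```python
import json
import re

# Exact semver such as 1.2.3, optionally with pre-release or build metadata.
EXACT_VERSION = re.compile(
    r"^\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?$"
)


def unpinned_dependencies(package_json_text: str) -> dict[str, str]:
    """Return {package: spec} for every dependency that is not an exact version."""
    manifest = json.loads(package_json_text)
    loose: dict[str, str] = {}
    for section in ("dependencies", "devDependencies"):
        for name, spec in manifest.get(section, {}).items():
            if not EXACT_VERSION.match(str(spec)):
                loose[name] = str(spec)
    return loose


# Made-up manifest for illustration.
sample_manifest = json.dumps({
    "name": "example-app",
    "dependencies": {"pkg-exact": "18.3.1", "pkg-caret": "^1.3.0"},
    "devDependencies": {"pkg-tilde": "~5.4.0", "pkg-tag": "latest"},
})

for name, spec in unpinned_dependencies(sample_manifest).items():
    print(f"unpinned: {name} -> {spec}")
```

A check like this fits as a required status on agent-generated PRs. It only covers direct dependencies, so it complements rather than replaces a committed lockfile installed with npm ci.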

06

Field Deployments

what actually shipped in production

4 entries

Virgin Atlantic ships mobile app rebuild with Codex on fixed deadline

Virgin Atlantic used Codex to rebuild its mobile app under a hard pre-holiday deadline, reporting near-total unit test coverage and zero P1 defects at launch. OpenAI's case study frames the engagement around deadline pressure rather than headcount savings.

Fruition take

The interesting metric is P1 count at launch, not lines of code generated. If you're pitching coding agents internally, deadline-bound projects with measurable defect baselines are the right wedge — not steady-state productivity claims.

OpenAI and Dell bring Codex to on-prem and hybrid enterprise

OpenAI announced a partnership with Dell to deploy Codex in hybrid and on-premises environments, targeting regulated enterprises that can't ship code or data to OpenAI's cloud. The announcement pairs with Codex's recent expansion to Windows sandboxes and remote SSH/programmatic token support.

Fruition take

On-prem Codex is the gating item for financial-services and defense buyers who've been stuck on internal LLM gateways. Expect procurement cycles that were stalled on data-residency to restart in Q3.

deepmind.google 2mo

WeatherNext used operationally by NHC for Hurricane Melissa Jamaica landfall

DeepMind detailed how the U.S. National Hurricane Center used WeatherNext to improve track and intensity forecasts for Hurricane Melissa's historic Jamaica landfall, giving communities additional preparation time. This is one of the clearest operational uses of an AI weather model by a national forecasting authority to date.

Fruition take

Operational adoption by NHC is a real bar — these forecasts go into emergency-management decisions, not dashboards. Worth citing when buyers ask whether AI models are "trusted in production" outside tech.

Databricks adopts GPT-5.5 for enterprise agent workflows after OfficeQA Pro SOTA

Databricks is integrating GPT-5.5 into its enterprise agent stack after the model set a new state of the art on the OfficeQA Pro benchmark, which measures multi-step reasoning over real business documents and spreadsheets.

Fruition take

OfficeQA Pro is closer to what enterprise agents actually do than MMLU or HumanEval — it's worth adding to your eval suite even if you're not using Databricks. The deltas there predict end-user satisfaction more reliably.