😺 Watch: Elorian wants to fix AI's toddler vision

Toddlers see better than AI... here's why that's a problem

P.S: If you saw an earlier version of this email, Grant got a little button happy and it went out before it was ready. Sorry for the double-send!

Welcome, humans.

AI is getting weirdly good at programming. It can write functioning code, read docs, run tests, and reliably do the whole “please build me a little app” thing on auto mode while you work your day job.

But the road to actually “solving” programming probably runs through something more basic: visual reasoning.

Why? Because, speaking from experience, when I look at an app built by AI, and my AI agent looks at the app it built, we see two different things. It doesn’t see what I see. Because, frankly, it can’t see what I see. And this is the biggest barrier holding back agentic engineering and agent-driven software development.

Of course, a coding agent can eventually be architected to hold more files in memory. It can also get better at staying coherent as it builds and navigates the architecture of a whole codebase.

But when it opens the browser, looks at the interface it made, and misses the obviously broken layout you spotted in half a second? That’s a critical flaw in the design of the system. Even worse, it lies to your face about how “it looks totally fixed now!”

Bro. You can’t actually SEE it. You’re gaslighting me right now.

That is the bottleneck Andrew Dai is chasing. Andrew spent years at Google Brain and Google DeepMind, and his paper trail runs straight through the middle of modern AI: pre-training, instruction tuning, sparse mixture-of-experts models, PaLM, and the O.G. Gemini.

Now he is co-founder and CEO of Elorian, a new research lab building models that can reason through images, diagrams, designs, charts, floor plans, engineering parts, and the physical world.

In our latest podcast episode, Andrew explains why today’s AI can identify objects but still struggles with problems kids solve by pointing, counting, folding, tracing, or just looking longer.

Our favorite moments:

  • (0:00) The grade-schooler test: Andrew opens with the claim that an elementary school kid can beat frontier models at certain visual reasoning tasks.

  • (4:17) Why image description fails: A London metro map or tangled USB cord can be obvious to your eyes and nearly impossible to fully explain in text.

  • (7:26) The pattern matching ceiling: Models can recognize flowers and dog breeds, then struggle when many visual relationships have to be tracked at once.

  • (11:02) Visual chain of thought: Andrew explains why models may need a visual scratchpad, the same way you picture a sofa in your living room before buying it.

  • (13:27) Why kids point at things: Counting three balls is instant. Counting twenty means tracking what you already counted, and that training data barely exists online.

  • (17:05) What Gemini revealed: Andrew says language and coding capabilities advanced rapidly, while visual reasoning moved slower.

  • (26:13) When mixture-of-experts gets weird: Andrew explains why routing images through specialized sub-models behaves differently than routing text.

  • (28:20) Benchmarks need expiration dates: Andrew makes the case for fresher evaluations, because old visual reasoning tests leak into training data fast.

  • (39:19) The 300-hour engineering bottleneck: Andrew describes a mechanical engineering team spending 200 to 300 hours modifying one component in design software.

The episode lands because Andrew makes the problem feel obvious. You can see a busted UI, a tangled cord, a crowded table, or a folded sheet of paper. Your AI can often describe it. The missing part is the slow, deliberate visual work your brain does before you even think to call it reasoning.

Why watch this? Because if you use ChatGPT, Claude Code, Codex, Gemini, Cursor, Lovable, or any AI tool that has to build or inspect something visual, this episode explains why the tool can sound right while looking wrong.

Watch and/or listen now: YouTube | Spotify | Apple Podcasts

P.S. Jump to (39:19) for the part that should make every product team perk up: the physical world still has huge pockets of work that software-style automation barely touches.

Keep scrolling for Andrew’s paper trail, what Elorian is actually building, and a few recent episodes you should catch up on.

Real quick: Want to see your AI-adjacent product or service show up right here, below these podcast promos? Click the button below to advertise to our 700K+ readers!

THIS EPISODE WAS BROUGHT TO YOU BY…

Today’s Episode is Sponsored by Dell Technologies and NVIDIA

Plenty of companies can launch an AI pilot. Far fewer know how to make it stick. Explore this resource hub, sponsored by Dell AI Factory with NVIDIA, for strategies, decisions, and real-world lessons on turning AI into something scalable, useful, and worth the investment.

Elorian Is Building AI's Visual Reasoning Layer

Elorian describes its mission as building systems that “natively understand and reason through the visual medium,” with a focus on spatial relationships, physical constraints, design intent, and the structure of the physical world.

The clean version: most AI vision tools still behave like they are translating pictures into words, then reasoning over the words. That works for “what breed is this dog?” It gets shaky when the answer depends on relationships that are hard to write down.

What that could unlock:

  • Design review: An AI that can actually inspect a UI, chart, layout, or floor plan instead of politely hallucinating about it.

  • Engineering: Faster iteration on mechanical parts, batteries, wings, devices, and other physical products where every angle matters.

  • Robotics: Better models for machines that need to understand messy rooms, occluded objects, and physical constraints in real time.

  • Satellite analysis: Turning images into usable insights for weather, disaster response, agriculture, and infrastructure monitoring.

  • Science and medicine: Helping researchers reason through visual evidence instead of flattening everything into text descriptions.

Elorian says it has $55M in funding from Striker Ventures, Menlo Ventures, and Altimeter, with participation from 49 Palms and prominent AI researchers including Jeff Dean.

Here’s why Andrew is a big deal…

Check out just a few of the papers Andrew worked on:

  • Semi-supervised Sequence Learning: one of the earlier papers showing how unsupervised pretraining can make later supervised learning more stable.

  • GLaM: a sparse mixture-of-experts model, meaning it activates specialized parts of the model instead of using the whole network for every token.

  • PaLM: Google’s 540B-parameter language model that helped define the scaling era before Gemini.

  • Scaling Instruction-Finetuned Language Models: the Flan paper that helped show how instruction tuning makes models more useful across tasks.

  • Gemini: the multimodal model family built to work across text, image, audio, and video.

…And much more.
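
For the curious, the “activates specialized parts of the model” idea from the GLaM bullet above can be sketched in a few lines of Python. This is a toy illustration and not GLaM's actual implementation: the four experts are stand-in functions, the router scores are hard-coded (a real model learns them from each token), and top_k=2 is an illustrative choice.

```python
import math

def softmax(scores):
    """Turn raw router scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_scores, experts, token, top_k=2):
    """Run only the top_k highest-scoring experts and blend their outputs."""
    weights = softmax(router_scores)
    # Pick the indices of the top_k experts; every other expert is skipped.
    chosen = sorted(range(len(experts)), key=lambda i: weights[i], reverse=True)[:top_k]
    # Renormalize so the chosen experts' weights sum to 1.
    norm = sum(weights[i] for i in chosen)
    output = sum(weights[i] / norm * experts[i](token) for i in chosen)
    return output, chosen

# Hypothetical stand-ins for expert sub-networks.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x ** 2]
# Hard-coded for illustration; a real router computes these per token.
router_scores = [0.1, 2.0, -1.0, 1.5]

output, chosen = route_token(router_scores, experts, token=3.0)
print(chosen)  # [1, 3]
```

Only two of the four experts ever run for this token, which is the point: the model can hold a lot of total capacity while spending only a fraction of it on any single input.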

🔑 The bottom line: AI coding agents will keep getting better at code. The next leap is giving them enough visual understanding to inspect the thing they built, notice what is broken, and reason through the fix the way a human would.

👀 Learn AI From Scratch Via Our Live Stream Guide

Click here to go straight to our Live Library!

We’ve been doing a lot of AI for Total Beginners streams this week. Here’s the order to watch them in to get the most out of them:

  1. Start with The 5-Step Framework to Learn AI in 2026 for the map: projects, prompting, skills, automations, and agents (article version).

  2. Watch The AI Starter Kit: What to Try...and What to Ignore to pick the right first tools and skip the “which subscription?” analysis paralysis.

  3. Watch AI Skills vs Agents vs GPTs: Which One Do I Use? to understand the containers: projects, GPTs / Gems, skills, and agents (article version).

  4. Watch Learn Agents in 2026 With This Total Beginner’s Guide to AI Agents & Automation for context, tools, triggers, approvals, and human-in-the-loop guardrails (article version).

  5. Watch OpenAI Workspace Agents 101: Build, Run, and Scale AI Workflows for the build stage: connections, scheduled runs, templates, and useful agents built directly inside GPT, no terminal required.

  6. Finish with Building Real-Time AI Voice Agents with LiveKit’s Ben Cherry, the advanced track on phone agents, screen sharing, speech-to-text, text-to-speech, real-time models, debugging, observability, and product infrastructure (article version).

By the end, you’ll know when to use each one.

FROM OUR PARTNER

Are you hitting the limits of siloed AI? Just as humans once transformed society by sharing intent, knowledge, and innovation, AI faces a similar inflection point. To achieve distributed superintelligence, we must move beyond scaling up. We need to scale out, too.

Outshift by Cisco is building the Internet of Cognition: an open infrastructure enabling agents and humans to collaborate in real time.

ICYMI: Three Recent Podcasts

1. Worried the internet is getting taken over by bots? You should be. Watch: The Internet is Facing a Human Verification Crisis

TL;DW: Tiago Sada, Chief Product Officer at Tools for Humanity, explains why World ID may become a trust layer for proving someone is a real, unique human online without exposing their identity everywhere.

Why you should watch: AI agents, deepfakes, ticket bots, fake accounts, and identity fraud all point to the same problem: the web needs better proof that a real person is on the other side.

This is the episode to send anyone who still thinks the bot problem is mostly about annoying comments. The scarier version is agents making decisions, purchases, and social connections on behalf of accounts nobody can verify.

2. Tracking the superintelligence race? Microsoft just entered the chat. Watch: AI Superintelligence Is Closer Than You Think

TL;DW: Mustafa Suleyman, CEO of Microsoft AI, joins Corey at Microsoft Build to talk through Microsoft’s new MAI model family, Humanist Superintelligence, and why the company is building more models in house.

Why you should watch: The useful part is the framing. Mustafa talks about superintelligence in terms of products, control, healthcare, agents, and human goals instead of turning the whole topic into a fog machine.

It is also a clean snapshot of how Microsoft wants to talk about AI now: less abstract AGI race, more “what does this actually do for people?”

3. Want to see agents turn into actual work? Watch: Inside Genspark

TL;DW: Wen Sang, co-founder and COO of Genspark, walks through how the company went from AI search startup to autonomous agent platform, including Workspace 4.0, Claw, a DoorDash demo, and a custom agent built live.

Why you should watch: This is one of the clearest examples of the new “AI employee” pitch: software becomes the infrastructure, and the agent becomes the interface you actually manage.

If visual reasoning is the missing perception layer, Genspark is a useful companion episode for the interface layer: what happens when agents move from answering questions to actually doing the boring steps.

Last thing: if you haven’t subscribed yet, please do! Click the image below to go to our channel and hit “subscribe” to get notified right when new videos go live.

We have a goal to hit 50K subscribers by the end of the year (if not 100K), and we’re less than 30K away! If you like learning about AI and already watch some of our videos, do us a favor and click here to subscribe today.

Stay curious,

The Neuron Team

That’s all for today. For more AI treats, check out our website.

P.P.S: Love the newsletter, but don’t want to receive these podcast announcement emails? Don’t unsubscribe — adjust your preferences to opt out of them here instead.