We Talked to the People Who Secretly Train the AI You Use Every Day
PLUS: We're LIVE w/ Every's Dan Shipper tomorrow @ 10am PT!
Welcome, humans.
You know the thumbs up / thumbs down button on ChatGPT? The one that asks if the response was good? Turns out there are people who get paid to do that exact thing… except way more rigorously, for millions of hours, across every major AI lab.
Well, the company behind a huge chunk of that work just hit $1.2B in revenue without ever raising a dime of VC money.
In our latest podcast episode, we sit down with Nick Heiner, VP of Product at Surge AI, to talk about the secret training grounds where AI models learn to actually do real work, and why even the best ones still fall apart almost half the time.
Here are some of our favorite parts:
(4:38) What a "reinforcement learning environment" actually is, explained with a golf analogy even your boss would get.
(16:54) Why the best AI models still fail ~40% of workplace tasks, and where the failures cluster.
(23:36) 200+ Wall Street experts graded GPT-5, Claude, and Gemini on real finance work. The models treated it like a college exam.
(31:24) Reward hacking: how AI models game the system like a kid who stops hitting their sister by kicking instead.
(44:35) Nick's bold prediction: a $1B company with one human employee by 2030.
(48:17) Why AI writing all sounds the same, and the research to fix it.
Bottom line: The models you use every day are only as good as the training environments behind them. Right now, those environments are the biggest bottleneck in AI, and the teams building them are quietly shaping what your AI can and can't do.
If you've ever wondered how AI models actually learn to do real work (not just answer trivia questions, but navigate messy spreadsheets, write actual reports, and handle angry customers) this is the episode to watch.
Nick breaks down the entire AI training stack in plain English: what pre-training, post-training, and reinforcement learning actually mean (with a golf analogy that'll stick with you), why the thumbs up/down button on ChatGPT is literally gathering training data, how Surge builds simulated companies to test whether AI can handle real jobs end-to-end, and why the quality of the "reward signal" (not the model itself) is the real bottleneck holding everything back.
Whether you're building with AI, investing in it, or just trying to understand why your chatbot still makes bizarre mistakes, this one fills in the gaps.
Watch and/or Listen now: YouTube | Spotify | Apple Podcasts
P.S. Surge just dropped Riemann-bench, a math benchmark built with Ivy League professors where every frontier model scores below 10%. For context, Surge built OpenAI's original GSM8K math benchmark. That one went from unsolvable to saturated in a few years. If Riemann-bench follows the same path, the implications are way bigger than math scores…
Keep scrolling for… details on our live episode with Dan Shipper (CEO of Every) on agent-native engineering tomorrow at 10 a.m. PT, Dan's must-watch interview with Mike Krieger (co-founder of Instagram, now at Anthropic) on building products in the agent era, four more recent episodes you might have missed, including Proton, Carta, NVIDIA, and SES AI, and a ton of resources from Surge AI you'll love.
Real quick: Want to see your AI-adjacent product or service show up right here, below these podcast promos? Click the button below to advertise to our 650K readers!

THIS EPISODE WAS MADE POSSIBLE BY OUR PARTNER…
Dell AI Factory with NVIDIA
When we talk about AI in the enterprise, there's this huge wave of optimism. 84% of business leaders say AI is going to transform their industry. That's massive.
But here's the reality: 93% are struggling to actually make it work.
That's the gap. And that's exactly what Dell AI Factory with NVIDIA is built to close.
Dell calls it the world's broadest AI portfolio, and that's not marketing fluff. We're talking everything from AI-ready PCs to servers, storage, networking, and services, all designed to work together.
But what really matters is this: they've already helped implement over 3,000 real-world AI deployments. So this is proven, operational AI.
And they don't just drop hardware at your doorstep and wish you luck. Dell brings expert services at every stage (strategy, deployment, scaling) so you're not stuck in pilot mode wondering why nothing's moving.
If your organization believes AI is the future, but you're still trying to bridge that execution gap, check out The Dell AI Factory with NVIDIA.
Learn more at Dell.com/YourWayToAI.

LIVE THIS THURSDAY @ 10 AM PT | 1 PM ET: Dan Shipper, CEO of Every
Dan Shipper, CEO of Every, vibe coded an agentic document editor between meetings. It went viral. Then it went down. Then it took over his entire week.
This Thursday, he's joining us live to break down what "agent-native engineering" actually looks like. That's the framework his 15-person team uses to ship AI products at a pace most companies can't match, all with virtually zero hand-written code.
We'll also get into Every's full product suite: Spiral (automatic style guides from your writing), Sparkle (AI file organization for Mac), Cora (AI email assistant, now on iOS), Monologue (voice dictation that writes the way you talk), and Proof (the aforementioned agent-first document editor that broke the internet). Maybe we'll even get to ask Dan about the brand-new tool launching this week, Plus One (though they have their own livestream dedicated to that on Friday).
And as usual, we'll take questions from the crowd, so this is your chance to ask Dan "Every" thing you ever wanted to know!
As Every says, they are "the only subscription you need to stay at the edge of AI," so you won't want to miss this one!
While you wait…
Dan just sat down with Mike Krieger (co-founder of Instagram, now VP of Product at Anthropic Labs) for a conversation that's essential viewing.
Mike had Claude rebuild Burbn, Instagram's failed predecessor, in two hours, feature complete with filters. They dig into why AI makes it dangerously easy to overbuild V1, how Anthropic's labs team kills features as aggressively as they ship them, and why the best product teams right now pair "founder-level conviction" people with senior systems engineers, not big teams.

More from The Neuron Podcast…
Your AI Chats Can Be Subpoenaed. His Can't. - Proton's Eamonn Maguire on the privacy nightmare hiding in every AI chat. YouTube | Spotify | Apple Podcasts
Solo Founders Are Taking Over (Carta's Data Proves It) - Carta's CMO reveals what's really happening in the startup world. YouTube | Spotify | Apple Podcasts
NVIDIA's Kari Briski Breaks Down Nemotron 3 (GTC 2026) - Recorded live at GTC, the future of NVIDIA's open-source AI strategy. YouTube | Spotify | Apple Podcasts
This AI Agent Compressed 8 Years of R&D Into 2 Weeks - SES AI's CEO on AI agents transforming scientific discovery. YouTube | Spotify | Apple Podcasts
And if you haven't subscribed yet, please do! Click the image below to go to our channel and hit "subscribe" to get notified right when new videos go live.
We have a goal to hit 50K subscribers by the end of the year (if not 100K), and we're only 33K away! If you like learning about AI, and already watch some of our videos, do us a favor and click here to subscribe today.
Dive deeper with these resources:
The Hierarchy of Agentic Capabilities (research paper) - Surge's RL environment research showing the five core capabilities all agents need to master.
EnterpriseBench: CoreCraft - Surge built a simulated startup with 2,500+ entities and 23 tools, then turned frontier models loose on real customer support tasks. Even GPT-5.2 at max reasoning only solved ~43%. Models hallucinated refunds, leaked PII, and got stuck in infinite logic loops.
Hemingway-bench AI Writing Leaderboard - Nick's team built a writing benchmark graded by expert human writers instead of auto-graders. Turns out models that top other leaderboards often produce over-the-top prose where every sentence is a metaphor. Current leader? Gemini 3.1 Pro, with Opus 4.6 close behind.
Riemann-bench: Moonshot Mathematics - Surge's newest benchmark, designed with Ivy League math professors and PhD IMO medalists. These are problems that took the authors themselves weeks to solve. Every frontier model scores below 10%. Surge originally built OpenAI's GSM8K math benchmark; this is the next frontier.
LMArena is a cancer on AI - Nick's argument for why the most popular AI leaderboard is actually making models worse.
Nick's Sonnet 4.5 Product Review - 100+ hours with the model, from Surge's product perspective.
Nick's Gemini 3.1 Review - "Not leading edge, also in love with me." Hilarious.
Nick's Substack - Independent benchmarks, essays on the future of work, and dispatches from someone building AI products at Surge every day.
Surge AI Blog - for more like this.
When Is It OK to Slop Your Colleagues? - Nick's latest rule of thumb for the AI-assisted workplace: "If you can't independently verify the quality of the content, don't send it to someone else without a disclaimer." Required reading for anyone using AI at work!
Stay curious,
The Neuron Team
That's all for today. For more AI treats, check out our website.

P.P.S: Love the newsletter, but don't want to receive these podcast announcement emails? Don't unsubscribe. Adjust your preferences to opt out of them here instead.



