The Neuron
😺 AI's data dilemma

PLUS: AI beauty pageant? Seriously???

Noah Edelman
June 10, 2024

Welcome, humans.

The AI builder community is filled with some of the brightest minds we’ve met. These people are literally building the future: self-driving cars, computer vision, talking robots. Oh, and also Miss AI beauty pageants. We wish we were kidding.

NPR first covered the "Miss AI" contest, a beauty pageant among AI models, with an article headline that could’ve easily been mistaken for The Onion: “Fake beauty queens charm judges at the Miss AI pageant.”

If the robots do end up becoming the dominant species on Earth, wouldn’t we deserve it just a tiny bit? A teensy tiny bit?

Here’s what you need to know about AI today:

AI companies might be running out of training data for their AI models.
4 ways companies are improving their models beyond public text data.
AI chatbots are getting caught spewing incorrect election info.
There’s a new Chinese competitor to Sora called Kling.

On the podcast: US Government Cranks Up the Heat, FTC vs. Big Tech, Microsoft’s Inflection Deal (Apple Podcasts, Spotify, YouTube).

Will AI models plateau once they exhaust their training data?

There’s one school of thought that AI models like ChatGPT improve exponentially—they keep getting better and better, faster and faster.

There’s another school of thought that says, actually, AI models aren’t getting exponentially better. In fact, they might be plateauing.

For instance, GPT-4o is marginally more intelligent than GPT-4, but GPT-4 is significantly more intelligent than GPT-3.5.

Part of the logic in school #2 is that these models are running out of high-quality training data, which has historically been one way to enhance chatbot IQ. Feed them more data → they get smarter. Wash, rinse, repeat.

However, the global reservoir of public human text data isn’t infinite, and models like ChatGPT might have nearly drained it. A new study predicts that AI companies will “exhaust the available stock of public human text data” between 2026 and 2032.

This could be a “serious bottleneck,” warns one of the paper’s authors.

On the other hand: Models don’t improve just by eating more public human training data. Simply put, you don't advance from GPT-4o to AGI by feeding it a few extra Wikipedia pages.

For instance, even though Llama 3 70B was trained on twice as many words as GPT-4, it performs noticeably worse.

So what alternative methods are researchers turning to for building the next generation of AI models?

Here are a few:

Private, not public, datasets. Think proprietary datasets, like Reddit’s data archives, which Google is already paying $60M/year for.
Using non-text data. Think video data. Think YouTube video data. Think a million hours of YouTube video data (cough cough OpenAI). Food for thought: GPT-4o may only slightly edge out GPT-4 in IQ, but it has noticeably stronger vision and voice know-how.
Synthetic data: training AI with data generated by AI. There are pros and cons to this, but all the top AI firms, Zuck’s Meta included, are dabbling with synthetic data as we speak.
Making each data point smarter. Basically squeezing out more intelligence from the same amount of data.

FROM OUR PARTNERS

Jurny’s AI multi-agents are disrupting hospitality, and you can invest!

Six months back, we gave you a sneak peek at Jurny, an AI startup revolutionizing hospitality.

They help property managers on platforms from Airbnb to Booking.com automate everything from reservations to pricing, a $1 trillion inefficiency.

Like…everything. Their new AI agents can provide informed, precise answers anytime a property manager or guest has a question or issue.

After 5x customer growth and processing $35M+ in bookings, Jurny is giving The Neuron readers an opportunity to invest.

And if you’re bullish, invest as little as $499 in Jurny on StartEngine. The round is closing soon!

Around the Horn.

One study discovered that AI chatbots give incorrect election answers 27% of the time.
Anthropic outlined its approach to mitigating election-related risks.
Adobe clarified that it does not train its AI models on users’ work.

Treats To Try.

*Brilliant offers bite-sized AI lessons so you stay competitive at work. Join 10M people around the world and start your 30-day free trial today.
Harvey is an AI platform for lawyers (it’s preferred by lawyers 97% of the time over GPT-4). They’re targeting a $2B valuation.
Interviews provides real-time suggestions during job interviews.
NotebookLM is an AI notetaking tool gaining popularity for extracting information from lengthy PDFs.
Ultravox is a speech-to-speech AI model that recognizes non-textual speech elements.

*This is sponsored content. Advertise in The Neuron here.