The Neuron

😺 Benchmarking

PLUS: 1 wild ChatGPT hack to control response length

Noah Edelman
April 24, 2024

Welcome, humans.

People are already raving about our new podcast! From YouTube: “This is great, listening to you explain things at length like the different tests was helpful and filled a gap in knowledge. Definitely a great compliment to The Neuron newsletter!”

There’s still time to enter our $500 Visa gift card giveaway by subscribing to the pod or sharing it on social media!

Here’s what you need to know about AI today:

Benchmarking, LMSYS, and “vibe” are three strategies to compare AI chatbots.
Microsoft released the best new small AI model: Phi 3.
Perplexity might be raising $250M+ at a valuation >$2.5B!
We share a hack to control the length of ChatGPT responses.

Three ways to compare all these AI models.

The sheer velocity of new AI model releases lately is dizzying.

Two weeks ago, it was Gemini 1.5 Pro and Grok 1.5. Last week, it was Llama 3. And this week, it’s Phi 3 (try it on Hugging Face here).

Phi 3 is Microsoft’s latest gem and the top small model on the market. In layman’s terms, smaller AI models are designed to be faster, cheaper, and more efficient than beefier models like GPT-4 (still the champ btw). As a result, small models excel at simpler tasks, like summarizing text.

Microsoft

With all these new models flooding the market, folks are scrambling to figure out which are the crème de la crème.

One approach is called benchmarking. Here is Pete from yesterday’s pod breaking it down:

Benchmarks “are a bunch of standardized tests that researchers make their AI models take. They're basically the SAT for AI. There's one test called the MMLU where Anthropic's Claude scored 79.0, Google's Gemini scored 81.9, and Meta's Llama 3 scored 82.0. Another one called HumanEval where Claude scored 73.0, Gemini scored 71.9, and Llama 3 scored 81.7.”
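To see why a single "top model" claim can be slippery, here is a minimal Python sketch that ranks the three models on each test separately, using only the scores Pete quoted. The names SCORES and rank_by are ours, made up for illustration.

```python
# Scores exactly as quoted in the podcast excerpt above.
# SCORES and rank_by are hypothetical names used only for this illustration.
SCORES = {
    "Claude": {"MMLU": 79.0, "HumanEval": 73.0},
    "Gemini": {"MMLU": 81.9, "HumanEval": 71.9},
    "Llama 3": {"MMLU": 82.0, "HumanEval": 81.7},
}


def rank_by(benchmark: str) -> list[str]:
    """Return model names sorted from highest to lowest score on one benchmark."""
    return sorted(SCORES, key=lambda model: SCORES[model][benchmark], reverse=True)


if __name__ == "__main__":
    for benchmark in ("MMLU", "HumanEval"):
        print(benchmark, rank_by(benchmark))
```

With these numbers, Llama 3 leads both lists, but Claude and Gemini swap places depending on which test you look at.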

Still, there's debate over whether benchmarks truly reflect a chatbot's usefulness for the Average Joe or if they're just metrics that AI firms tune their chatbots to hit so they can boast things like “[x model] is now the top-performing chatbot according to benchmarks, yada yada.”

Another gauge, the LMSYS Chatbot Arena Leaderboard, has hundreds of thousands of people compare the models' answers head-to-head and then vote on which is best.

GPT-4 Turbo is #1, followed by Claude 3 Opus and Gemini 1.5 Pro.
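Curious how a pile of votes can become a ranking? One common approach is an Elo-style rating, the same idea used to rank chess players. The Python sketch below is a generic illustration and not LMSYS's actual pipeline. The model names, the starting rating of 1000, and the step size k=32 are all assumptions we picked for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift ratings after one vote. k (assumed 32) controls the step size."""
    r_win = ratings.get(winner, 1000.0)  # assumed starting rating
    r_lose = ratings.get(loser, 1000.0)
    gain = k * (1.0 - expected_score(r_win, r_lose))
    ratings[winner] = r_win + gain
    ratings[loser] = r_lose - gain


def leaderboard(votes: list) -> list:
    """Process (winner, loser) votes in order and return names, best first."""
    ratings: dict = {}
    for winner, loser in votes:
        update_elo(ratings, winner, loser)
    return sorted(ratings, key=ratings.get, reverse=True)


if __name__ == "__main__":
    # Made-up votes between placeholder models, not real leaderboard data.
    votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
    print(leaderboard(votes))
```

An upset win against a higher-rated model moves the ratings more than an expected win, which is why a few thousand votes can sort models fairly quickly.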

Our solution? Test all the chatbots and see which is the most useful for your daily work tasks. It’s more about the “vibe” than anything else!

FROM OUR PARTNERS

Have an AI idea but don’t know how to build it?

Hire world-class AI experts from Harvard, Stanford and MIT.

If you know AI should be a part of your business but don’t know how to implement it, you need AE Studio.

Their team is stacked with elite software creators who can turn any AI concept into reality for your business—think automated reports, NLP, custom chatbots, and beyond.

The secret to their success is treating your project as if it were their own startup.

Tell AE Studio about your business challenge today. They’ll listen to you, take the time to “get” it, and do the work you need to get done.

How to make an AI actually respond in a specific # of words.

Here’s an all-too-familiar problem many of us run into with chatbots:

You ask ChatGPT to write a 100-word email → it coughs up 72 words.
Then you say, “Make it longer” → it overshoots to 128 words.
Finally, you say, “Not that long, go shorter” → it drops to 58 words.

Frustrating, right?

Thankfully, someone on Reddit experimented with how ChatGPT’s response length varies based on the adjective you use to describe your desired word count.

Check this out:

We’re not sure who would ever use the word voluminous (sounds like something Voldemort would say, right?), but keep this chart of adjectives in mind next time you need to fine-tune the # of words an AI churns out.

Another idea: One Redditor noted that another way to control an AI’s length is to tell it to simply ignore its context limit and answer over multiple messages. When it cuts off, just prompt it to “continue” to complete the idea.
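If you'd rather measure than eyeball it, a few lines of Python can count the words in a reply and tell you how far off it is. This is a rough sketch under our own assumptions: words are split on whitespace, a reply within 10% of the target counts as close enough, and the function names are made up. Whether pasting the exact gap back into the chatbot works better than an adjective is something you'd have to test yourself.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def length_feedback(text: str, target: int, tolerance: float = 0.10) -> str:
    """Compare a reply's word count with the target and suggest a fix.

    tolerance is the allowed fraction above or below target (assumed 10%).
    """
    n = word_count(text)
    if n < target * (1 - tolerance):
        return f"Too short: {n} words. Add about {target - n} words."
    if n > target * (1 + tolerance):
        return f"Too long: {n} words. Cut about {n - target} words."
    return f"OK: {n} words."


if __name__ == "__main__":
    # Stand-in for the 72-word email from the example above.
    reply = "word " * 72
    print(length_feedback(reply, target=100))
```

For the 72-word email from earlier, this reports a 28-word shortfall, which you could paste straight into your follow-up prompt.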

Around the Horn.

Watch this for a laugh!

BREAKING: ChatGPT officially has memory across chats, meaning it’ll remember what you said in previous convos.
Pete is hosting a Maven workshop on all things AI and content creation tomorrow at 4PM EST. Sign up here!
SAP’s CFO told Bloomberg that AI is driving record growth for its cloud revenues.
Adobe announced its latest text-to-image model, Firefly 3, which is available in beta in Photoshop.
OpenAI published new research on how to build stronger models that resist being tricked into unsafe actions.

Wednesday Wirings.

Perplexity knows about our podcast!

*webAI empowers businesses to build and launch powerful AI applications fast and securely—minimal data requirements, local training, and options for no-code OR full-code IDE. Get early access to webAI here.
Perplexity, an AI search engine, bagged another $62.7M and launched Perplexity Enterprise Pro for businesses. It might raise another $250M at a $2.5-3B valuation soon.
Upstage, a South Korean AI startup serving enterprises, snagged another $72M.
Neubird, which uses AI to help IT operations teams monitor cloud systems, detect issues, and quickly find solutions, raised $22M.
AI Squared, helping businesses deploy AI into their operations, secured $13.8M.
Langdock, an internal AI assistant that lets you connect multiple LLMs with your data, raised $3M.