
Large language models like GPT-3 are giving chatbots an uncanny ability to produce human-like responses to our probing questions. But how smart are they, really? A new study from psychologists at the University of California, Los Angeles, published this week in the journal Nature Human Behaviour, found that the language model GPT-3 has better reasoning skills than the average college student, an arguably low bar.

The study found that GPT-3 outperformed a group of 40 UCLA undergraduates on a series of questions like those on standardized exams such as the SAT, which require applying solutions from familiar problems to new ones.

“The questions ask users to select pairs of words that share the same type of relationships. (For example, in the problem: ‘Love’ is to ‘hate’ as ‘rich’ is to which word? The solution would be ‘poor.’)” according to a press release. Another set of analogy questions drew on a passage from a short story and asked about information within it. The press release points out: “That process, known as analogical reasoning, has long been thought to be a uniquely human ability.”
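
To make the test format concrete, here is a minimal sketch in Python of how a word-analogy item like the one above could be posed to a language model as a plain-text prompt and checked against an answer key. The wording, the answer choices, and the `ask_model` helper are hypothetical placeholders, not the materials used in the UCLA study.

```python
# Hypothetical illustration of a word-analogy item posed as a text prompt.
# The wording and the ask_model() helper are placeholders, not the study's materials.

def build_analogy_prompt(a: str, b: str, c: str, choices: list[str]) -> str:
    """Format a four-term analogy (a : b :: c : ?) as a multiple-choice prompt."""
    options = ", ".join(f"'{w}'" for w in choices)
    return (
        f"'{a}' is to '{b}' as '{c}' is to which word? "
        f"Choose one of: {options}. Answer with a single word."
    )

def ask_model(prompt: str) -> str:
    """Placeholder for a call to a language model (e.g., via an API)."""
    raise NotImplementedError("Swap in a real model call here.")

if __name__ == "__main__":
    prompt = build_analogy_prompt("love", "hate", "rich", ["poor", "famous", "greedy"])
    print(prompt)
    # answer = ask_model(prompt)
    # print("Correct!" if answer.strip().lower() == "poor" else "Incorrect.")
```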

In fact, GPT-3's scores were better than the average SAT scores of college applicants. GPT-3 also did just as well as the human subjects on logical reasoning, tested through a set of problems called Raven's Progressive Matrices.

It’s no surprise that GPT-3 excels at the SAT. Previous studies have tested the model’s logical aptitude by having it take a series of standardized exams, including AP tests, the LSAT, and even the MCAT, and it passed with flying colors. The latest version of the language model, GPT-4, which has the added ability to process images, performs even better. Last year, Google researchers found that they could improve the logical reasoning of such language models through chain-of-thought prompting, which breaks a complex problem down into smaller steps.
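
As a rough illustration of what chain-of-thought prompting looks like in practice, the sketch below contrasts a direct prompt with one that includes a worked, step-by-step example for the model to imitate. The example problems and exact wording are illustrative assumptions, not the prompts used in the Google research.

```python
# Illustrative sketch of chain-of-thought prompting: instead of asking for the
# answer outright, the prompt shows the model intermediate reasoning steps.
# The problems and wording here are made up for illustration.

QUESTION = (
    "A cafeteria had 23 apples. It used 20 for lunch and bought 6 more. "
    "How many apples does it have now?"
)

# Direct prompt: ask for the answer in one shot.
direct_prompt = f"Q: {QUESTION}\nA:"

# Chain-of-thought prompt: include a worked example with explicit steps so the
# model reasons step by step before giving its final answer.
chain_of_thought_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. "
    "The answer is 11.\n"
    f"Q: {QUESTION}\n"
    "A:"
)

if __name__ == "__main__":
    print(direct_prompt)
    print()
    print(chain_of_thought_prompt)
    # Either string would be sent to a language model; the chain-of-thought
    # version tends to elicit more reliable multi-step reasoning.
```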

[Related: ChatGPT’s accuracy has gotten worse, study shows]

Even as today's AI is pushing computer scientists to rethink basic benchmarks for machine intelligence, like the Turing test, the models are far from perfect.

For example, a study published this week by a team from UC Riverside found that language models from Google and OpenAI delivered imperfect medical information in response to patient queries. Other studies from scientists at Stanford and Berkeley earlier this year found that ChatGPT's answers to coding and math problems were getting sloppier, for reasons unknown. And while ChatGPT is fun and popular among regular folks, it's not yet very practical for everyday use.

And it still performs dismally at visual puzzles and at understanding the physics and spatial layout of the real world. To address this, Google is trying to combine multimodal language models with robots.

It’s hard to tell whether these models are thinking the way we do, that is, whether their cognitive processes resemble our own. An AI that’s good at test-taking is not generally intelligent the way a person is. It’s also hard to tell where these models’ limits lie and what their potential might be. Answering that would require opening them up and exposing their software and training data, which touches on a fundamental criticism experts have of how closely OpenAI guards its LLM research.