OpenAI’s GPT-4 exhibits “human-level performance” on professional benchmarks

Ars Technica

On Tuesday, OpenAI announced GPT-4, a large multimodal model that accepts text and image inputs and returns text output, and that “exhibits human-level performance across a variety of professional and academic benchmarks,” according to OpenAI. Also on Tuesday, Microsoft announced that Bing Chat has been running on GPT-4 all along.

If it performs as claimed, GPT-4 may represent the opening of a new era in artificial intelligence. “It passes a simulated bar exam with a score around the top 10% of test takers,” OpenAI writes in its announcement. “GPT-3.5’s score, on the other hand, was around the bottom 10%.”

OpenAI plans to release GPT-4’s text capability via ChatGPT and its commercial API, but with a waiting list initially. GPT-4 is currently available to ChatGPT Plus subscribers. The company is also testing GPT-4’s image input capability with a single partner, Be My Eyes, an upcoming smartphone app that can recognize and describe a scene.

Along with the introductory website, OpenAI has also released a technical paper detailing the capabilities of GPT-4 and a system card detailing its limitations.

A screenshot of the introduction of GPT-4 for ChatGPT Plus customers starting March 14, 2023.

Benj Edwards / Ars Technica

GPT stands for “generative pre-trained transformer” and GPT-4 is part of a set of foundational language models dating back to the original GPT in 2018. After the original release, OpenAI announced GPT-2 in 2019 and GPT-3 in 2020. A further refinement called GPT-3.5 came in 2022. In November, OpenAI released ChatGPT, which at the time was a refined conversation model based on GPT-3.5.

AI models in the GPT series are trained to predict the next token (a fragment of a word) in a series of tokens using a large amount of text, mostly taken from the internet. During training, the neural network builds a statistical model that represents relationships between words and concepts. Over time, OpenAI has increased the size and complexity of each GPT model, resulting in better overall performance, model over model, at completing text the way a human would in the same scenario, although this varies by task.
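To make the next-token objective concrete, here is a deliberately tiny sketch (not GPT-4 itself, which uses a neural network over billions of parameters): a bigram model that "predicts the next token" by counting which token most often follows each token in a toy training corpus.

```python
from collections import Counter, defaultdict

# Toy training corpus; real GPT models train on a vast internet text corpus
# and on subword tokens rather than whole words.
corpus = "the cat sat on the mat the cat ran".split()

# Count, for each token, which token follows it and how often.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(token):
    """Return the most frequently observed next token, or None if unseen."""
    if token not in follows:
        return None
    return follows[token].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" twice, "mat" once -> "cat"
```

A transformer replaces these raw counts with learned statistical relationships over long contexts, but the training signal is the same: given the tokens so far, predict the next one.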

In terms of tasks, GPT-4’s performance is remarkable. Like its predecessors, it can follow complex natural language instructions and generate technical or creative works, but it can do so with more depth: it can generate and process up to 32,768 tokens (approximately 25,000 words of text), which allows it to create much longer content and analyze much longer documents than previous models.
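The "approximately 25,000 words" figure follows from a common rule of thumb (a heuristic, not an exact tokenizer count) that one token corresponds to roughly 0.75 English words:

```python
# Rough arithmetic behind the context-window claim. The 0.75 words-per-token
# ratio is a commonly cited heuristic for English text; actual counts vary
# with the tokenizer and the text itself.
context_tokens = 32_768
words_per_token = 0.75
approx_words = int(context_tokens * words_per_token)
print(approx_words)  # 24576, i.e. roughly 25,000 words
```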

When analyzing the capabilities of GPT-4, OpenAI had the model take tests such as the Uniform Bar Exam, the Law School Admission Test (LSAT), the Graduate Record Examination (GRE) Quantitative section, and various AP subject exams. It scored at a human level on many of these tests. That means that if GPT-4 were a person judged solely on test-taking, it could get into law school, and probably many universities as well.

As for its multimodal capabilities (which are still limited to a research preview), GPT-4 can analyze and understand the content of multiple images, such as understanding a multi-image joke or extracting information from a diagram. Microsoft and Google have both been experimenting with similar multimodal capabilities lately. Specifically, Microsoft thinks a multimodal approach will be needed to achieve what AI researchers call “artificial general intelligence,” or AI that performs common tasks at a human level.

Riley Goodside, staff prompt engineer at Scale AI, referenced “AGI” in a tweet as he explored GPT-4’s multimodal capabilities, and OpenAI contributor Andrej Karpathy expressed his surprise that GPT-4 could solve a test he proposed in 2012 for an AI vision model: understanding why an image is funny.

OpenAI has stated that its goal is to develop AGI that can replace humans in any intellectual task, although GPT-4 is not there yet. Shortly after the GPT-4 announcement, OpenAI CEO Sam Altman tweeted, “It’s still flawed, still limited, and it still looks more impressive on first use than after you’ve spent more time with it.”

And it’s true: GPT-4 is far from perfect. It still reflects biases in its training dataset, hallucinates (invents plausible-sounding falsehoods), and can potentially generate misinformation or harmful advice.
