OpenAI’s deep research can complete 26% of ‘Humanity’s Last Exam’: What is it and what does it mean?


Artificial intelligence critics argue that the technology may soon outsmart humans, which could lead to a ‘Terminator’-style situation for humanity. For some, AI is already on its way to turning this future into a reality.

OpenAI’s deep research may soon become more intelligent than humans. (Reuters)

The deep research AI model, launched by ChatGPT-maker OpenAI earlier this month, has shown a nearly two-fold jump in performance over the next-best AI model on one of the world’s toughest exams for large language models (LLMs) – Humanity’s Last Exam.

What is the exam about?

Humanity’s Last Exam is a recently released benchmark for AI models, also called large language models, such as ChatGPT, Grok-2 and deep research. It is used to judge how well a model performs on a fixed set of challenging questions.

According to the people behind the exam, it was created because AI models were already scoring around 90% accuracy on existing benchmarks, which meant those tests were no longer hard enough to tell the best models apart. Humanity’s Last Exam was designed as a tougher yardstick.

The exam consists of 2,700 challenging questions, most of them publicly released, spanning more than a hundred subjects.

Did OpenAI prove its dominance?

The Sam Altman-led company’s AI models performed with varying accuracy on the exam. Its weakest performer was GPT-4o, which managed 3.1% accuracy with a calibration error of 92.3%.

OpenAI’s o1 model scored 8.8% accuracy with a 92.8% calibration error, while o3-mini (medium) and o3-mini (high) scored 11.1% and 14% accuracy with calibration errors of 91.5% and 92.8% respectively.

OpenAI’s newest model, deep research, scored a staggering 26.6% accuracy on Humanity’s Last Exam, nearly double the accuracy of the next-best performer, OpenAI’s own o3-mini (high) model.

How did other models fare?

According to the exam’s website, which was last updated on February 11, xAI’s Grok-2, Elon Musk’s ambitious AI model, scored a meagre 3.9% accuracy with a 90.8% calibration error. Another competitor, Anthropic’s Claude 3.5 Sonnet, scored 4.8% accuracy with an 88.5% calibration error.

Google’s Gemini Thinking scored 7.2% accuracy with a 90.6% calibration error. Chinese firm DeepSeek’s R1 model, which triggered a global sell-off in technology stocks after its launch last month, scored higher than every other non-OpenAI competitor, but could not outperform even OpenAI’s o3-mini (medium) model. It scored 8.6% accuracy with an 81.4% calibration error.
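
For readers who want to sanity-check the gap, the figures quoted in this article can be lined up in a few lines of Python. The numbers below are simply those reported above (deep research’s calibration error was not quoted here), not live leaderboard values, which are updated over time.

```python
# Humanity's Last Exam results as quoted in this article (accuracy %, calibration error %).
# Illustrative only: the official leaderboard changes as new models are added.
scores = {
    "OpenAI deep research":        (26.6, None),  # calibration error not quoted above
    "OpenAI o3-mini (high)":       (14.0, 92.8),
    "OpenAI o3-mini (medium)":     (11.1, 91.5),
    "OpenAI o1":                   (8.8, 92.8),
    "DeepSeek R1":                 (8.6, 81.4),
    "Google Gemini Thinking":      (7.2, 90.6),
    "Anthropic Claude 3.5 Sonnet": (4.8, 88.5),
    "xAI Grok-2":                  (3.9, 90.8),
    "OpenAI GPT-4o":               (3.1, 92.3),
}

# Print a simple leaderboard sorted by accuracy.
ranked = sorted(scores.items(), key=lambda kv: kv[1][0], reverse=True)
for name, (acc, calib) in ranked:
    calib_str = f"{calib:.1f}%" if calib is not None else "n/a"
    print(f"{name:<30} accuracy {acc:>5.1f}%   calibration error {calib_str}")

# Gap between the top model and the runner-up: 26.6 / 14.0 is roughly 1.9x, i.e. close to double.
top, runner_up = ranked[0], ranked[1]
print(f"{top[0]} scores {top[1][0] / runner_up[1][0]:.2f}x the runner-up's accuracy")
```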

What does deep research’s score mean?

Deep research’s performance shows that the model can answer a wide range of analytical, subjective and objective questions more accurately than any of its competitors. It also suggests the model is better placed than other AI models to deliver well-rounded answers.

This is likely because the model was built primarily to help people research any topic of their choice without the usual legwork. According to its creators, deep research can carry out multi-step research on the internet for complex tasks in tens of minutes, work that would otherwise take a human many hours.



