Ticker

6/recent/ticker-posts

ChatGPT Passes the Turing Test… and That’s Not Saying Much

ChatGPT Passes the Turing Test… and That’s Not Saying Much

GPT-4.5 and LLaMa-3.1, the major language models from OpenAI and Meta, recently successfully passed an updated version of the famous Turing Test, which measures a model's ability to pass itself off as human in a text conversation... but beware of misinterpretations.

The original version of the test, devised by the illustrious Alan Turing in 1950, is based on interactions between a human interrogator who simultaneously talks with two "witnesses"—a machine and another human. These two witnesses have the same goal: to convince their interlocutor that they are the real human. If the assessor fails to consistently identify the machine (in more than 50% of cases), then the latter can be considered to have passed the test.

In this work conducted by the University of San Diego, in the United States, the researchers opted for a different version of the original test. As is often the case with these modern variants, the researchers provided both models with text queries (or prompts) telling them to adopt a “personality” as human as possible, including using fairly familiar language and incorporating what the researchers call “socio-emotional cues” intended to confuse the issue.

In their study, published on the preprint server ArXiv, the authors concluded that both models passed the test with flying colors. LLaMa 3.1 was judged to be more convincing than its human opponent in 56% of cases, while GPT-4.5 even achieved an impressive score of 73%.

The Turing Test tests humans more than machines

On social media, many Internet users and content creators were quick to claim that this was a major turning point in the history of this technology, and a clear sign that we are entering the era of so-called “general” artificial intelligence. Yet this is a highly sensationalist interpretation, quite disconnected from the true implications of this study.

What's important is that even though modern versions of the Turing Test are much more methodologically sound than the original, the goal was never to compare the intellectual abilities of an AI model and a human. It remains an imitation game whose sole purpose is to test the ability of these tools to pass themselves off as humans—an absolutely crucial distinction in this context. In practice, it's more about testing human gullibility than the model's 'intellectual' abilities.

This point becomes particularly evident when we remove these famous 'personalization prompts' from the equation. Without them, GPT-4.5's score dropped to 36%, for example. This proves once again that its success in the first experiment is not a sign of intelligence per se. These results simply show that, once properly configured, modern LLMs are extremely competent when it comes to extracting linguistic markers of human identity, and distilling them effectively into a conversation.

It is also worth remembering that this is not the first time that a major language model has managed to fool real-life interlocutors in this way, far from it. The first documented example (ELIZA, a rudimentary chatbot designed by MIT engineers), already managed to fool a few people... as early as 1965!

It is also interesting to note that this same ELIZA still obtained a score of 23% in this new study, even though its 'reasoning' abilities are light years ahead of those of modern LLMs. According to the authors, this is explained by the fact that the dialogues generated by this prehistoric chatbot did not correspond to the idea that today's humans have of an AI model. In other words, this shows once again that the Turing Test remains primarily a way of evaluating humans, rather than a true AI benchmark.

Testing the “intelligence” of AI models: a real technical challenge

This brings us to the other implication of this work. In their paper, the study's authors emphasize that intelligence is a "complex and protean" phenomenon that no unified test, and certainly not Turing's, is currently capable of rigorously quantifying.

To determine whether an LLM will one day reach the stage of general artificial intelligence, with reasoning abilities superior to those of humans, it will therefore be necessary to develop new types of tests... and probably exclude our species from the equation. Indeed, there is little chance that we will still be able to objectively judge the situation if we are one day confronted with such superhuman AI.

It will therefore be very interesting to follow the projects of researchers working on AI benchmarks. In the current climate, where many experts believe that artificial general intelligence could emerge within a few years, they will have to be extra ingenious in finding ways to evaluate different models while excluding human bias from the equation, and the process of achieving this will undoubtedly be quite fascinating.

The text of the study is available here.

Post a Comment

0 Comments