LoginGet Started

GPT-4.5 Fooled 73% of People: What the Turing Test Result Actually Means

Thu Nghiem

Thu

AI SEO Specialist, Full Stack Developer

GPT-4.5 Turing test

GPT-4.5 “fooled 73% of people” and suddenly the internet is doing that thing where a single number turns into a prophecy.

You’ve probably seen the headline version: first model to pass a real Turing test, humans can’t tell anymore, it’s basically human now. And yeah, it’s a spicy stat. It’s also very easy to misunderstand.

So let’s slow it down and get specific. What was actually tested. What “fooled” means in this context. Why a simple persona prompt changed the outcome. And what the result is useful for, in real life, like product UX, trust and safety, customer support, and how creators should think about using AI in workflows.

I’ll link to the reporting as we go, but the goal here is the same either way. Less hype. More clarity.

The headline claim, in plain English

The claim floating around is basically this:

In a Turing style chat test, GPT-4.5 was judged to be human 73% of the time.

That figure comes from an experiment where human participants chatted (briefly) and then had to decide which conversation partner was human versus AI. GPT-4.5, when prompted with a certain persona, was picked as the human most of the time.

Coverage tends to make it sound like “people can’t tell anymore, period.” But the correct interpretation is narrower:

  • In this setup, with these instructions, with these participants, in short text chats, a model using a particular strategy got a high rate of being selected as human.

Still meaningful. Just not the sci fi leap some posts imply.

If you want the mainstream writeup, LiveScience has a readable summary here: coverage of the GPT-4.5 Turing test result.

What “fooled people” actually means (and what it does not)

“Fooled 73% of participants” sounds like 73% of people were helplessly deceived by raw intelligence.

That’s not really what’s happening.

In these tests, “fooled” usually means something like:

  • Participants chatted for a limited time.
  • They made a forced choice judgment.
  • The model’s goal was not to be helpful, but to be plausibly human.

And that last bullet matters a lot. A normal chatbot is optimized to answer questions, be polite, be safe, be structured. A Turing style chatbot is optimized to create a vibe.

So the result is not “the model is human level.” It’s closer to:

  • The model can generate human sounding conversational cues well enough to win a perception game, under constraints.

Which is… honestly what modern LLMs are already best at. Language.

Also, remember what the original Turing Test was meant to probe: whether a machine could imitate human conversational behavior well enough that you can’t reliably distinguish it. It was never a clean measurement of intelligence, knowledge, reasoning, morality, consciousness, or anything else people like to smuggle into the word “human.”

A model can be great at passing as human and still fail at basic planning, consistency over time, or factual reliability. Those are different axes.

What the study design likely looked like (the important moving parts)

Most of the reporting describes a fairly standard “imitation game” setup:

  • A participant interacts via text with two entities, one human and one AI, or sometimes they interact with one entity and guess whether it’s human.
  • The conversation is time boxed or message limited.
  • After the chat, the participant makes a judgment.

The details matter because small design choices swing results hard.

Here are the levers that usually change the outcome in these tests:

1. Time limits and message limits

Short chats favor surface level cues. A long, meandering conversation gives you more chances to detect patterns, repetition, over helpfulness, weird hedging, or the model’s tendency to “answer like a policy document.”

So if the test is short, it’s basically a sprint of first impressions.

2. Who the participants are

If participants are:

  • not especially AI literate,
  • taking the test casually,
  • or not incentivized to detect deception,

then yes, they’ll miss more signals.

Also, some people over index on the wrong tells. Like “typos = human.” Which leads nicely to the persona thing.

3. The instruction given to the model

This is the big one. If you prompt a model to be maximally helpful and correct, it can sound like a bot. If you prompt it to be a bit messy, uncertain, even slightly wrong, it can read as more human.

The model isn’t “becoming human.” It’s selecting a different writing strategy.

4. What counts as “passing”

A single number like 73% hides the baseline.

For example:

  • How often were real humans judged to be human?
  • How often were other models judged to be human?
  • Were participants guessing near random for some conditions?

In many setups, humans also get judged as bots at a surprising rate. Modern online writing has its own weird uniformity. People write like templates. Customer support agents paste macros. Everyone is tired. So a “human sounding” benchmark is not what it was in 1950.

The persona prompt: why “pretending to be dumber” can work

One of the more interesting angles in coverage is that GPT-4.5 did best when it used a persona prompt that made it… less polished.

This is the part social media doesn’t want to hear, because it ruins the magic. But it’s arguably the real story.

The Decoder’s reporting frames it clearly: GPT-4.5 fooled 73 percent of people by pretending to be dumber.

Why would that help?

Because a lot of people don’t associate “human” with perfect grammar, perfectly structured answers, and instant competence. In casual chat, humans:

  • answer a bit indirectly
  • misunderstand once in a while
  • ask clarifying questions late
  • drop a thought mid sentence
  • hedge in ways that sound emotional rather than formal
  • have small inconsistencies

A default LLM assistant is too clean. Too eager. Too comprehensive. Too fast to be right.

So when you instruct the model to be a little uncertain, to have minor mistakes, to be less “assistant-y,” you’re not increasing its intelligence. You’re improving its camouflage. It’s performance, not progress.

This is also why “AI detection by vibe” is failing. If what you detect is polish, and the model can simply lower the polish, then your detector is basically measuring formatting preferences.

Passing a narrow conversational test is not general intelligence

This is where a skeptical explainer has to be blunt.

A model can win a Turing style chat game and still:

  • hallucinate citations
  • fail at multi step reasoning
  • contradict itself across turns
  • give confident wrong answers
  • struggle with long horizon planning
  • miss physical commonsense in edge cases
  • be manipulable via prompting

General intelligence is not “people thought it was a guy in a chat window.”

What this result does show is something more practical, and maybe more unsettling:

  • Human trust is easy to trigger with the right conversational cues.
  • People use tone as a proxy for authenticity.
  • And persona prompting is now a first class capability, not a cute trick.

That matters.

Why this matters anyway: trust, UX, and the new “default human”

Even if you treat the 73% as a narrow lab result, it still points at real product implications.

Because most users do not evaluate AI like researchers do. They don’t run consistency checks. They don’t probe for failure modes. They respond to:

  • warmth
  • confidence
  • speed
  • social presence
  • and small human-like imperfections

So here’s where the Turing-ish result becomes practical.

Implication 1: Chat interfaces will feel more “alive,” for better or worse

If a model can reliably adopt human conversational texture, then chat products can become more engaging. That sounds good. It also increases the risk of:

  • users oversharing
  • users trusting the output too much
  • users forming stronger parasocial attachments
  • users assuming there is accountability behind the voice

A human-ish interface pushes people toward human-ish expectations. That’s the trap.

If you’re designing chat UX, you have to decide what you want the user to feel:

  • Do you want it to feel like a tool?
  • A teammate?
  • A person?

The more it feels like a person, the higher the burden for disclosure, boundaries, and clear failure handling. Because people will assume intent where there is none.

Implication 2: Social manipulation gets cheaper

A model that can sound human in short chats is basically a multiplier for:

  • scam attempts
  • astroturfing
  • political persuasion
  • fake customer testimonials
  • phishing that adapts in real time
  • “friendly” DMs that slowly extract information

This is not new, but persona prompting pushes it further. You can spin up a tone that matches a target community. Local slang, plausible ignorance, the right level of confidence. Not too perfect. Just… believable.

And the uncomfortable part is that “make it slightly worse” can increase believability. That defeats a lot of naive moderation heuristics that look for high fluency, repeated structure, or typical bot phrasing.

So the safety question shifts from “can it write like a human” to “how do platforms verify identity and intent when language is no longer a differentiator.”

Implication 3: Customer support is going to blur, fast

Customer support is basically a Turing test environment already. Short chats. High volume. Low context. Users want empathy and resolution.

If GPT-4.5 class models can reliably pass as human in that context, then:

  • companies will deploy more AI agents
  • users will suspect they’re talking to AI even when they’re not
  • and trust will degrade unless disclosure is handled well

Here’s the irony: when AI agents become good enough to be indistinguishable, the honest brands will disclose it… and the dishonest ones won’t. So the brands that do the right thing may take the trust hit first.

If you run support, the takeaway is not “replace your team.” It’s:

  • use AI to handle routine flows,
  • keep humans for escalations,
  • and be explicit about what the system is.

Also, log everything. AI that sounds human can still be wrong, and wrong in a very convincing way. Support errors cost money.

Implication 4: Creator workflows will change, because the “human voice” is now programmable

For writers, marketers, and content teams, this is less about deception and more about craft.

If persona prompting can shift perceived humanness, then voice becomes a control knob. You can tune for:

  • casual blog tone
  • technical documentation voice
  • founder voice
  • community manager voice
  • “smart but tired” voice
  • “confident but not salesy” voice

Which is powerful. But it also creates a new problem: voice without substance.

A lot of AI content that “sounds human” still fails on:

  • factual grounding
  • original insight
  • real examples
  • coherent argument
  • honest caveats

So the win is not “my content can pass as human.” The win is “my content is clear, correct, useful, and worth ranking.”

If you’re building a durable content engine, you want systems that push you toward credibility, not just vibes.

(If you’re curious where models are heading for writing workflows generally, Junia has a good overview here: GPT 5.4 for writing.)

A quick reality check: what could inflate a “pass rate” like 73%

Even without accusing anyone of bad science, there are normal reasons a number like 73% might pop.

  • Forced choice guessing: If participants are uncertain, they guess. If the model is “good enough,” guesses tilt toward it.
  • Participants using bad heuristics: People think bots are polite and perfect, so they pick the messy one as human.
  • Short duration: You don’t get enough time to notice deeper inconsistencies.
  • Text only: No voice, latency cues, or richer context.
  • Persona advantage: The model is explicitly coached to win the game.

None of this makes the result meaningless. It just makes it specific.

And that’s the right way to hold it. Specific.

So what should you do with this information?

Here’s the grounded interpretation I’d keep:

  1. LLMs are now extremely good at social mimicry in short conversations.
    Not just fluent. Strategically human-like.
  2. Persona prompting is not fluff. It changes outcomes.
    In safety, in UX, in persuasion, in support, in content.
  3. The Turing test is a test of perception, not a certification of intelligence.
    Passing it does not equal AGI. It does not equal reliability. It does not equal truth.
  4. Trust is now the main battleground.
    If users can’t tell who or what they are talking to, platforms and brands need stronger disclosure and identity systems.
  5. For creators, “human-sounding” is table stakes. Credibility is the moat.
    The internet will get louder. Clear, sourced, well structured writing will matter more, not less.

Practical takeaways by audience

If you build AI products

  • Decide what “human-like” is allowed to mean in your UI.
  • Add explicit disclosure where it matters, especially in sensitive domains.
  • Test user trust calibration, not just satisfaction. People love confident wrong answers.
  • Monitor persona prompts. They can be a safety surface, not just a style feature.

If you run marketing or comms

  • Treat AI voice as editable, but don’t confuse voice with authority.
  • Add a lightweight verification habit: sources, examples, dates, and a human review pass for claims.
  • Write like a person, sure. But also write like someone who can be held accountable.

If you work in security or policy

  • Plan for persuasion at scale, with adaptive dialogue.
  • Push for provenance tools, identity verification, and platform level friction where needed.
  • Update internal training. “Bots sound like bots” is not a valid mental model anymore.

Wrapping it up (the non clickbait version)

GPT-4.5 “fooling 73% of people” is not the moment machines became human.

It’s the moment a lot of people should admit that humans judge humanness with shallow cues, and that those cues are now easy to manufacture.

The real headline is something like:

With the right persona, LLMs can reliably trigger human trust in short text conversations.

That’s useful for better interfaces and smoother support. It’s also a risk multiplier for manipulation. Same capability, different intent.

If you’re publishing about this stuff, or you’re the person in your team who has to translate AI news into something everyone can actually understand, you need a workflow that optimizes for clarity and credibility, not panic.

If that’s you, try Junia AI. It’s built for turning messy topics into clean, search optimized long form content, with tools for structure, brand voice, and publishing, without losing the human readable flow. You can check it out at Junia.ai.

Frequently asked questions
  • The phrase means that in a specific experiment, human participants chatted briefly with GPT-4.5 and had to decide if their partner was human or AI. GPT-4.5, when prompted with a particular persona, was judged as human 73% of the time. This doesn't imply the model is human-level intelligent but that it can generate human-like conversational cues well enough to win a perception game under certain conditions.
  • This claim refers to a narrow setup where, with specific instructions, participants, and short text chats, GPT-4.5 using a particular strategy achieved high rates of being identified as human. It doesn't mean humans can't tell AI from humans in all contexts or that the model possesses general human intelligence or consciousness.
  • The instruction or persona prompt changes the model's writing strategy. When prompted to be maximally helpful and correct, the model can sound like a bot. However, when asked to be slightly messy, uncertain, or even imperfect—essentially 'pretending to be dumber'—it reads as more human because it mimics natural human conversational quirks and imperfections.
  • Key factors include time and message limits (short chats favor surface-level impressions), participant characteristics (AI literacy and motivation affect detection), instructions given to the model (which shape its conversational style), and what counts as "passing" (baseline rates for humans and other models). Small changes in these levers can significantly impact results.
  • No. Passing a Turing test mainly measures how well an AI can imitate human conversational behavior under certain constraints. It does not assess intelligence, knowledge depth, reasoning skills, morality, consciousness, or factual reliability. A model can sound human yet still fail at planning, consistency over time, or providing accurate information.
  • Understanding how GPT-4.5 can generate plausible human-like conversation helps improve product user experience (UX), trust and safety measures, customer support interactions, and guides creators on integrating AI effectively into workflows without hype—focusing on clarity about AI capabilities and limitations.