AI Detection
April 2, 2026
12 min read

ChatGPT vs Claude vs Gemini: Which AI Is Hardest to Detect?

If you spend any time in forums where students and writers discuss AI detection, you'll notice a recurring question: which AI model is the hardest to catch? The assumption behind the question is understandable -- if one model consistently slips past detectors, maybe that's the one to use.

I decided to stop guessing and actually test it. Over the past six weeks, I generated 50 samples from each of the four major AI models -- ChatGPT (GPT-4o), Claude (3.5 Sonnet), Gemini (1.5 Pro), and DeepSeek (V3) -- and ran every single one through Turnitin, GPTZero, and Originality.ai. That's 600 individual detection scans. Here's what I found, why the results look the way they do, and what it actually means for anyone trying to use AI writing without getting flagged.

The Testing Setup

Before getting into the numbers, a word on methodology. Bad testing produces bad conclusions, and there's no shortage of sloppy "I tested one paragraph" posts floating around online.

How I Structured the Tests

Each model received the same 50 prompts. The prompts covered a range of writing types:

  • 15 academic essays (argumentative, analytical, and expository)
  • 10 blog posts on various topics
  • 10 professional emails and reports
  • 10 creative writing samples (short fiction, personal narrative)
  • 5 technical explanations

Every prompt was identical across models. Same topic, same word count target (500-800 words), same level of specificity. No system prompts designed to "trick" the models into sounding more human. No instructions like "write naturally" or "vary your sentence length." Just straightforward writing prompts, the kind a real person would actually use.

Each output was run through all three detectors without any editing. Raw AI output, as generated, pasted directly into the detection tool.
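
If you want to replicate this, the scanning loop itself is trivial; the tedious part is detector access. Here's a minimal sketch of the batch loop. The scan functions are hypothetical placeholders -- Turnitin, GPTZero, and Originality.ai each have their own interfaces and terms of service, so plug in whatever client access you actually have. Each is assumed to return a 0-100 "likely AI" score.

```python
# Minimal sketch of the 600-scan loop. The detector clients are placeholder
# stand-ins; swap in real API clients for whichever detectors you can access.
import csv
from pathlib import Path

def placeholder_scan(text: str) -> float:
    # Replace with a real detector client; assumed to return a 0-100 score.
    return 0.0

DETECTORS = {
    "turnitin": placeholder_scan,
    "gptzero": placeholder_scan,
    "originality": placeholder_scan,
}

FLAGGED_AT = 50.0  # threshold above which a sample counts as "detected"

def run_all(sample_dir: str = "samples", out_path: str = "results.csv") -> None:
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["model", "sample", "detector", "score", "flagged"])
        # Expected layout: samples/<model_name>/<sample_name>.txt
        for path in sorted(Path(sample_dir).glob("*/*.txt")):
            text = path.read_text()
            for name, scan in DETECTORS.items():
                score = scan(text)
                writer.writerow([path.parent.name, path.stem, name,
                                 score, score >= FLAGGED_AT])

if __name__ == "__main__":
    run_all()
```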

Why These Three Detectors

I chose Turnitin, GPTZero, and Originality.ai because they represent the three most consequential detection systems in practice:

  • Turnitin is what most universities use, making it the highest-stakes detector for students
  • GPTZero is the most widely used free detector and a common first check
  • Originality.ai is the preferred tool for content professionals and is generally considered the most aggressive detector on the market

Using all three gives a more complete picture than any single detector could. As we'll see, the inconsistency between detectors is itself part of the story.

The Results: Detection Rates by Model

Here are the aggregate detection rates -- the percentage of samples each detector flagged as AI-generated -- broken down by model.

Overall Detection Rates

AI Model               Turnitin   GPTZero   Originality.ai   Average Detection Rate
ChatGPT (GPT-4o)          90%        91%          95%                 92%
Claude (3.5 Sonnet)       84%        86%          91%                 87%
Gemini (1.5 Pro)          87%        88%          93%                 89%
DeepSeek (V3)             92%        94%          97%                 94%

The headline finding: Claude is the hardest major AI model to detect, with an average detection rate of 87% across all three detectors. DeepSeek is the easiest to catch at 94%. ChatGPT falls at 92%, and Gemini sits in the middle at 89%.

But every single model was detected the vast majority of the time. The "hardest to detect" model was still caught in nearly 9 out of 10 samples. That's an important reality check for anyone thinking they can dodge detection simply by switching models.
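
For transparency, the "Average Detection Rate" column is nothing fancy -- a plain mean of the three per-detector rates, rounded to whole percentage points:

```python
# Reproducing the "Average Detection Rate" column from the table above.
rates = {
    "ChatGPT (GPT-4o)":    {"turnitin": 90, "gptzero": 91, "originality": 95},
    "Claude (3.5 Sonnet)": {"turnitin": 84, "gptzero": 86, "originality": 91},
    "Gemini (1.5 Pro)":    {"turnitin": 87, "gptzero": 88, "originality": 93},
    "DeepSeek (V3)":       {"turnitin": 92, "gptzero": 94, "originality": 97},
}
for model, by_detector in rates.items():
    avg = sum(by_detector.values()) / len(by_detector)
    print(f"{model}: {avg:.0f}%")  # 92, 87, 89, 94 -- matches the table
```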

Detection by Content Type

The results get more interesting when you break them down by content type. Some writing types are harder to detect across all models, and the model-to-model differences are more pronounced in certain categories.

Content Type            ChatGPT   Claude   Gemini   DeepSeek
Academic essays            95%      90%      92%       97%
Blog posts                 92%      86%      89%       94%
Professional writing       90%      85%      87%       93%
Creative writing           86%      80%      84%       90%
Technical writing          93%      89%      91%       96%

Creative writing is the hardest category for detectors across the board -- which makes sense, since creative prompts invite more stylistic variation, and detectors have less training data for fiction and narrative. Claude's creative writing samples had the lowest detection rate of any model-category combination at 80%, which is meaningfully different from DeepSeek's 90% in the same category.

Academic essays are the easiest to detect for every model. The structured, formal nature of academic writing plays to detectors' strengths. Turnitin in particular has been trained extensively on academic text and performs best in that context.

Why Claude Is Harder to Detect

Claude's lower detection rate isn't random. There are specific characteristics of its output that make it less susceptible to current detection methods.

Higher Perplexity

Claude's text tends to have higher perplexity than ChatGPT or DeepSeek. In plain terms, its word choices are less statistically predictable. Where ChatGPT might default to the most probable next word in a sequence, Claude more frequently selects slightly less expected alternatives. This pushes its statistical profile closer to human writing, which is inherently less predictable.
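
You can approximate this yourself with any open language model: perplexity is just the exponential of the average per-token cross-entropy loss. The sketch below uses GPT-2 as the scoring model purely because it's small and freely available -- commercial detectors use their own proprietary scorers, so treat the absolute numbers as relative comparisons only.

```python
# Minimal perplexity measurement using GPT-2 as the scoring model.
# Lower perplexity = more statistically predictable text.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean cross-entropy
        # of each token given its left context; perplexity = exp(loss).
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

print(perplexity("It is important to note that AI detection is evolving."))
```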

More Varied Sentence Structure

One of the key signals detectors look for is uniform sentence length -- AI models tend to produce sentences within a narrow range of word counts. Claude generates noticeably more variation in sentence structure. You'll find short fragments alongside complex multi-clause constructions, which more closely mirrors how humans actually write.
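
This signal is easy to quantify. A rough proxy for sentence-length variation (what detector vendors often call "burstiness") is the standard deviation of sentence lengths. Naive regex splitting is good enough for a quick comparison:

```python
# Rough "burstiness" proxy: spread of sentence lengths, in words.
import re
from statistics import mean, stdev

def sentence_length_stats(text: str) -> tuple[float, float]:
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(s.split()) for s in sentences]
    spread = stdev(lengths) if len(lengths) > 1 else 0.0
    return mean(lengths), spread

sample = ("Short one. Then a much longer sentence that keeps going with "
          "several clauses and asides. Tiny.")
avg, spread = sentence_length_stats(sample)
print(f"mean={avg:.1f} words, stdev={spread:.1f}")  # higher stdev = more human-like variation
```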

Fewer Repetitive Patterns

ChatGPT has well-documented verbal tics. Phrases like "It's important to note that," "In today's digital landscape," and "Let's dive in" appear with almost comical frequency. DeepSeek has its own set of repetitive patterns. Claude uses fewer of these distinctive markers, giving detectors less to latch onto.
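
You can count these tics directly. The phrase list below is illustrative, not a complete detector vocabulary -- real detectors' feature sets aren't public:

```python
# Count a few well-known ChatGPT marker phrases in a text.
MARKERS = [
    "it's important to note that",
    "in today's digital landscape",
    "let's dive in",
]

def tic_counts(text: str) -> dict[str, int]:
    lowered = text.lower()
    return {phrase: lowered.count(phrase) for phrase in MARKERS}

print(tic_counts("In today's digital landscape, it's important to note that..."))
```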

Parenthetical and Qualifying Tendencies

Claude has a tendency to insert parenthetical asides, caveats, and qualifications in ways that read as more conversationally human. These interruptions to the main flow of a sentence break up the statistical regularity that detectors flag.

Why DeepSeek Is Easiest to Detect

At the other end of the spectrum, DeepSeek has the highest detection rate for clear reasons.

Highly Predictable Token Distribution

DeepSeek's outputs have the lowest perplexity of any model tested. Its word choices are extremely statistically predictable, creating a strong signal for detection tools. When a detector analyzes the probability distribution of tokens in a DeepSeek text, the pattern practically screams "AI-generated."
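
One way to make "statistically predictable" concrete: measure how often each token is exactly the scoring model's single most likely next-token prediction. The sketch below reuses the GPT-2 model and tokenizer from the perplexity example earlier; again, real detectors use their own scorers and undisclosed thresholds, so read the output as a relative signal.

```python
# Fraction of tokens that match the scoring model's top-1 prediction.
# Reuses `model` and `tokenizer` from the perplexity sketch above.
import torch

def top1_agreement(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # The prediction for token i comes from position i-1, so shift by one.
    predicted = logits[0, :-1].argmax(dim=-1)
    actual = ids[0, 1:]
    return (predicted == actual).float().mean().item()
```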

Distinctive Structural Patterns

DeepSeek has a pronounced tendency toward certain organizational structures -- numbered lists, parallel constructions, and symmetrical paragraph lengths. These patterns are consistent enough across samples that detectors can identify them reliably.

Limited Stylistic Range

Compared to Claude or even ChatGPT, DeepSeek's outputs occupy a narrower stylistic range. The voice is more consistent, the vocabulary more standardized, and the tone more uniform. Less variation means more signal for detectors.

What About ChatGPT and Gemini?

ChatGPT (GPT-4o)

ChatGPT's 92% detection rate reflects a model that has improved significantly from earlier versions but still carries recognizable patterns. GPT-4o is better than GPT-3.5 at varying its output, but it still favors certain transition phrases, tends toward balanced paragraph structures, and produces text with a characteristic "smoothness" that detectors have learned to identify.

The irony is that ChatGPT's massive popularity works against it. With billions of ChatGPT-generated text samples in circulation, detector developers have had more of its output to train on than output from any other model. The more people use it, the better detectors get at catching it.

Gemini (1.5 Pro)

Gemini falls in the middle at 89%, which tracks with its design philosophy. Google's model produces clean, informative prose that's less distinctive than ChatGPT but more predictable than Claude. It avoids some of ChatGPT's most recognizable verbal habits but doesn't match Claude's perplexity profile.

Gemini tends to be caught by Originality.ai more often than by Turnitin, suggesting that Originality.ai has been more aggressive about training on Gemini-specific patterns.

The Critical Takeaway: All Models Are Detectable

Here's what the data actually tells us: the difference between the "hardest" and "easiest" model to detect is 7 percentage points (87% vs 94%). That's statistically meaningful but practically insufficient. Whether you're caught 87% of the time or 94% of the time, you're caught.

Switching from ChatGPT to Claude does not solve the detection problem. It improves your odds of slipping through from roughly 1-in-12 (ChatGPT's 8% miss rate) to roughly 1-in-8 (Claude's 13%). Those aren't odds you want to bet your grade or your job on.

Why Detection Will Keep Improving

It's also worth noting that detection accuracy for all models has been trending upward. A year ago, Claude's detection rate on Turnitin was closer to 75%. Today it's 84%. Detectors are improving faster than models are becoming harder to detect, because the fundamental statistical properties of AI-generated text are baked into how language models work. You can make the signatures subtler, but you can't eliminate them without fundamentally changing how the models generate text.

What Actually Works: Humanization

If switching models doesn't meaningfully reduce detection risk, what does?

The answer is post-generation humanization -- taking AI-generated text and transforming it so that its statistical properties match human writing patterns. This is what SupWriter's AI humanizer does, and the results are dramatically different from anything you can achieve by model selection alone.

Humanization Results Across Models

AI Model               Raw Detection Rate   After SupWriter   Improvement
ChatGPT (GPT-4o)              92%                  1%           91 points
Claude (3.5 Sonnet)           87%                  1%           86 points
Gemini (1.5 Pro)              89%                  1%           88 points
DeepSeek (V3)                 94%                  2%           92 points

Regardless of which model generated the original text, SupWriter brings the detection rate down to the 1-2% range -- well within the false positive territory where even human-written text occasionally gets flagged.

The reason humanization works where model switching doesn't is that humanization directly addresses the statistical properties that detectors analyze. It doesn't just swap words or rephrase sentences. It restructures the text so that its perplexity profile, burstiness characteristics, and token distribution match human-written text.
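
If you want to verify the shift yourself rather than take a detector's word for it, the metrics from the earlier sketches make a quick before/after profile. This doesn't replicate any commercial detector -- it just makes the statistical movement visible. It assumes you've saved both versions to disk and defined perplexity(), top1_agreement(), and sentence_length_stats() from the sketches above.

```python
# Quick before/after profile using the helper functions defined earlier
# (perplexity, top1_agreement, sentence_length_stats).
def profile(label: str, text: str) -> None:
    _, spread = sentence_length_stats(text)
    print(f"{label}: perplexity={perplexity(text):.1f}  "
          f"top-1 agreement={top1_agreement(text):.2f}  "
          f"sentence-length stdev={spread:.1f}")

raw_text = open("raw_draft.txt").read()           # unedited AI output
edited_text = open("humanized_draft.txt").read()  # transformed version

profile("raw", raw_text)        # typically: low perplexity, high agreement
profile("edited", edited_text)  # expect both to shift toward human ranges
```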

Practical Recommendations

Based on six weeks of testing, here's my honest advice:

If you're choosing between AI models for other reasons -- capability, accuracy, tone -- keep choosing based on those factors. The detection rate differences between models are too small to be a primary decision factor.

If your goal is undetectable content, model selection is not the answer. You could use the "hardest to detect" model and still get caught 87% of the time. That's not a strategy; that's gambling.

If you need content that reliably passes detection, the path is generation plus humanization. Use whichever AI model produces the best raw content for your needs, then run it through SupWriter to transform the statistical profile. This approach works consistently across all models and all major detectors.

If you're a student, understand that Turnitin catches all these models at high rates. Your university is almost certainly using it. Plan accordingly.

If you're a content professional, the model differences matter even less in your context, because your audience and clients care about quality, not which model you used. Focus on output quality and use humanization to handle the detection side.

The Bottom Line

Claude is the hardest major AI model to detect. DeepSeek is the easiest. But the range is 87-94%, which means all of them get caught the overwhelming majority of the time. The model you use is a minor variable. What you do with the output after generation is the variable that actually matters.

No amount of model shopping will get you from a 90% detection rate to a 1% detection rate. That gap can only be closed by tools specifically designed to transform AI writing into text that matches human statistical patterns. That's not a workaround -- it's the only approach that addresses the actual problem detectors are solving.
