What Do AI Detectors Look For? The Patterns That Flag Your Writing
Every piece of text you write has a fingerprint. Not a literal one, but a statistical signature made up of word choices, sentence rhythms, and structural patterns that reveal something about how the text was created.
AI detectors are trained to read that fingerprint. They analyze your writing for specific mathematical properties that differ, on average, between human-written and AI-generated text. Understanding what those properties are gives you a real advantage, whether you are trying to avoid false flags on your original work or evaluating whether a detector's results are trustworthy.
This is not speculation. The detection methods we cover here are based on published research and the documentation of tools like GPTZero, Originality.ai, and SupWriter's AI detector. Let us walk through exactly what AI detectors measure and why certain writing gets caught.
The Two Core Metrics: Perplexity and Burstiness
Almost every modern AI detector, regardless of brand, relies on some variation of two fundamental measurements: perplexity and burstiness. Think of these as the backbone of AI detection.
Perplexity: How Predictable Is Your Writing?
Perplexity measures how surprising each word in a sentence is given the words that came before it. Technically, it is the exponentiated average negative log-likelihood of a sequence of tokens under a language model. In practical terms, it answers the question: "Could a language model have easily predicted this next word?"
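For readers who want it spelled out, here is that definition for a sequence of N tokens, where p(x_i | x_<i) is the probability the reference model assigns to token i given everything before it:

```latex
\mathrm{PPL}(x_1, \ldots, x_N) = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid x_{<i}) \right)
```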
Here are two examples.
Low perplexity: "The weather today is sunny and warm."
Every word in that sentence is highly predictable. If you gave a language model "The weather today is..." it would very likely suggest "sunny," "cold," "nice," or similar common completions. This sentence has low perplexity.
High perplexity: "The weather today tastes like forgotten algebra."
That sentence is weird. No language model would predict "tastes" after "weather today," and "forgotten algebra" is a genuinely surprising combination. This has high perplexity.
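To make the measurement concrete, here is a minimal sketch of the computation, using the open GPT-2 model from Hugging Face's transformers library as a stand-in reference model. Commercial detectors use their own models and calibration, so treat this as an illustration of the arithmetic, not any vendor's implementation:

```python
# Minimal perplexity scorer. Assumes the transformers and torch packages
# are installed; GPT-2 is a stand-in for a detector's reference model.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the average negative
        # log-likelihood (cross-entropy) over the sequence.
        out = model(**enc, labels=enc["input_ids"])
    # Exponentiate the average negative log-likelihood to get perplexity.
    return torch.exp(out.loss).item()

print(perplexity("The weather today is sunny and warm."))             # low
print(perplexity("The weather today tastes like forgotten algebra.")) # high
```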
AI-generated text tends to have consistently low perplexity because language models work by predicting the most probable next token. Choosing words that are statistically likely given the context is literally what they are designed to do.
Human writing, on the other hand, tends to have higher and more variable perplexity. We use unexpected metaphors, make unusual word choices, go on tangents, and sometimes write sentences that are deliberately awkward for emphasis or style.
Here is the critical insight: The detector is not just looking at whether your perplexity is low. It is looking at whether your perplexity is consistently low across the entire text. A human might write several predictable sentences and then throw in something unexpected. AI rarely does that.
Burstiness: The Rhythm of Your Sentences
Burstiness measures variation in sentence structure, particularly sentence length. It captures the rhythm of writing.
Read any skilled human writer and you will notice their sentences vary dramatically. A long, complex sentence with multiple clauses and parenthetical asides might be followed by a short punch. Then a medium one. Then another long one. This variation is natural. It reflects how humans think, which is in bursts.
AI text tends to be more uniform. Language models default to a comfortable middle range of sentence length. They rarely write a three-word sentence followed by a fifty-word sentence. Their paragraphs tend to have similar structures: topic sentence, supporting detail, supporting detail, concluding thought. Over and over.
Here is what burstiness looks like in practice:
High burstiness (human-like):
She opened the door. The hallway stretched ahead of her, dim and impossibly long, the kind of corridor you only find in old municipal buildings where the architects seemed to believe that intimidation was a public service. Nothing moved. She stepped inside anyway, because what else was she going to do, and that question, she realized later, was the whole problem.
Sentence lengths: 4, 33, 2, 22. That is burstiness.
Low burstiness (AI-like):
She opened the door and stepped into the dimly lit hallway. The corridor stretched ahead of her with an imposing sense of length. The old municipal building had clearly been designed to create a feeling of authority. She moved forward despite her uncertainty about what lay ahead.
Sentence lengths: 11, 12, 14, 10. Nearly uniform. Detectors notice this.
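Burstiness has no single canonical formula, but one simple proxy a detector could use is the coefficient of variation of sentence lengths: standard deviation divided by mean. A sketch using only Python's standard library, with deliberately naive sentence splitting:

```python
# A toy burstiness proxy: coefficient of variation of sentence lengths.
# One illustrative formulation, not any specific detector's formula.
import re
import statistics

def burstiness(text: str) -> float:
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.stdev(lengths) / statistics.mean(lengths)

# Applied to the sentence lengths of the two passages above:
# [4, 33, 2, 22]   -> ~0.97 (bursty, human-like)
# [11, 12, 14, 10] -> ~0.15 (uniform, AI-like)
```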
Why These Metrics Matter Together
Neither perplexity nor burstiness alone is enough to reliably classify text. It is the combination that matters.
| Pattern | Perplexity | Burstiness | Likely Classification |
|---|---|---|---|
| Standard AI output | Low, uniform | Low | AI-generated |
| Creative human writing | High, variable | High | Human-written |
| Technical human writing | Low, somewhat uniform | Moderate | Often falsely flagged |
| Edited AI text | Mixed | Mixed | Uncertain |
This table explains why technical writing and academic papers get falsely flagged so often. If you are writing about a well-defined topic using standard terminology, your perplexity is naturally low. Your sentence structure may also be more uniform because you are following disciplinary conventions. To a detector, that looks like AI.
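To see how the combination might work mechanically, here is a deliberately simplified decision rule that mirrors the table. The thresholds are invented for illustration; real detectors learn their weightings from labeled data rather than applying hard cutoffs:

```python
# Hypothetical rule combining the two metrics from the table above.
def classify(perplexity: float, burstiness: float) -> str:
    LOW_PPL, LOW_BURST = 20.0, 0.3  # invented thresholds, for illustration
    if perplexity < LOW_PPL and burstiness < LOW_BURST:
        return "likely AI-generated"
    if perplexity >= LOW_PPL and burstiness >= LOW_BURST:
        return "likely human-written"
    # Mixed signals: technical human writing and edited AI both land here.
    return "uncertain"
```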
Token Probability Distribution
Beyond perplexity and burstiness, more sophisticated detectors examine the full distribution of token probabilities across a text.
Every language model assigns a probability to each possible next token (word or sub-word unit). When the model generates text, it typically selects tokens from the high-probability end of that distribution. The result is that AI text has a characteristic probability distribution: most tokens are high-probability choices.
Human writing has a flatter, more varied distribution. We regularly choose words that would not be in a model's top predictions, not because we are trying to be unpredictable but because human communication is driven by personal experience, emotion, context, and stylistic preference rather than statistical optimization.
Detectors like GPTZero and Originality.ai analyze this distribution pattern. They look at what percentage of tokens in a text fall within the top-k most probable tokens according to a reference language model. If 90% of your tokens are top-10 predictions, that is a strong AI signal. If only 60% are, it looks more human.
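Here is a sketch of that top-k check, again with GPT-2 standing in for the detector's reference model (the exact percentages will differ by model):

```python
# What fraction of a text's tokens fall inside the reference model's
# top-k predictions? Assumes transformers and torch are installed.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def top_k_fraction(text: str, k: int = 10) -> float:
    ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits
    # The k most probable next-token ids at each position...
    topk = logits[0, :-1].topk(k, dim=-1).indices
    # ...checked against the token that actually appears next.
    actual = ids[0, 1:].unsqueeze(-1)
    return (topk == actual).any(dim=-1).float().mean().item()
```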
The Log-Probability Curve
Some advanced detectors plot the log-probability of each token across the length of the text. For AI text, this curve tends to be smooth and consistent. For human text, it tends to be jagged with spikes and dips.
Think of it like a heart rate monitor. A flat, steady line is not a good sign. Healthy variation is what you want. Similarly, text that maintains a steady, predictable probability profile across its entire length raises flags.
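A sketch of how the raw data behind that curve could be extracted, once more using GPT-2 as the reference model:

```python
# Per-token log-probabilities. Printed or plotted in order, these form
# the curve described above: smooth for AI text, jagged for human text.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def token_log_probs(text: str) -> list[float]:
    ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)
    # Log-probability the model assigned to each token that actually occurred.
    return log_probs.gather(1, ids[0, 1:].unsqueeze(-1)).squeeze(-1).tolist()
```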
Stylometric Analysis
Stylometry is the statistical analysis of writing style. It has been used in literary scholarship for decades to attribute authorship, and now it is being applied to AI detection.
Detectors examine features including the following (the first two are sketched in code after the list):
- Vocabulary richness: The ratio of unique words to total words. AI tends to use a narrower vocabulary band, favoring common words over rare ones.
- Function word distribution: How frequently you use words like "the," "of," "and," "however." These small words carry strong stylometric signals because humans use them in individually distinctive patterns that AI averages out.
- Syntactic patterns: The grammatical structures you prefer. Do you use passive voice? How often do you start sentences with subordinate clauses? AI has default preferences that differ from most human writers.
- Paragraph structure: AI paragraphs tend to follow predictable templates. Human paragraphs are more varied in their internal organization.
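Here is a toy computation of the first two features. The function-word list is illustrative; production stylometric systems track hundreds of features:

```python
# Vocabulary richness (type-token ratio) and function-word rates.
# Standard library only; the word list below is illustrative.
import re
from collections import Counter

FUNCTION_WORDS = ["the", "of", "and", "however"]

def stylometric_features(text: str) -> dict:
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid division by zero on empty input
    counts = Counter(words)
    return {
        # Unique words divided by total words.
        "type_token_ratio": len(counts) / total,
        # Occurrences per 1,000 words for each function word.
        "function_word_rates": {w: 1000 * counts[w] / total for w in FUNCTION_WORDS},
    }
```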
The Averaging Problem
Language models are trained on enormous datasets of human text. Their output is essentially a statistical average of all that training data. This means AI writing tends toward the mean on virtually every stylometric dimension.
Real human writers are idiosyncratic. They have pet phrases, unusual preferences, consistent quirks. A detective novel writer might start 40% of their sentences with short declarative statements. An academic might use semicolons three times more than average. These individual deviations from the mean are part of what detectors interpret as "human."
This is also why AI detectors sometimes struggle with corporate communications, legal writing, and government documents. These types of writing are deliberately de-personalized. They suppress individual voice in favor of institutional consistency, which makes them look statistically similar to AI output.
What Detectors Look For in Specific Text Types
Different detectors weigh these signals differently, and some add specialized checks for particular content types.
Academic and Educational Text
Tools like Turnitin focus heavily on:
- Consistency of writing quality throughout a document (sudden jumps in sophistication can indicate AI-assisted sections)
- Comparison against the student's previous submissions
- Patterns specific to academic AI prompts like "Write an essay about..."
- Overuse of hedging language ("It is important to note that..." and "While there are many perspectives...")
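As a toy illustration of that last signal, a checker could simply count stock hedge phrases per thousand words. The phrase list below is invented for illustration and is not Turnitin's actual method:

```python
# Count stock hedge phrases per 1,000 words. Hypothetical phrase list.
import re

HEDGES = [
    "it is important to note that",
    "while there are many perspectives",
]

def hedge_rate(text: str) -> float:
    lowered = text.lower()
    words = len(lowered.split())
    hits = sum(len(re.findall(re.escape(p), lowered)) for p in HEDGES)
    return 1000 * hits / words if words else 0.0
```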
Marketing and SEO Content
Detectors tuned for content marketing look for:
- Formulaic structures (listicles with identical paragraph formats)
- Keyword stuffing patterns that AI tools produce when given SEO instructions
- Lack of original data, quotes, or specific examples
- Generic conclusions that could apply to any topic
Creative Writing
Creative writing is harder for detectors because it can legitimately have either high or low perplexity. Detectors look for:
- Consistency of narrative voice (AI often shifts voice subtly across long texts)
- Depth of sensory detail (AI tends toward visual descriptions and underuses other senses)
- Emotional authenticity (AI emotional descriptions often feel catalogued rather than felt)
Why Human Writing Gets Falsely Flagged
Understanding what detectors look for also explains the false positive problem. Certain types of perfectly genuine human writing trigger the same patterns that detectors associate with AI.
Technical and Scientific Writing
If you are writing about a well-established topic using standard terminology, your perplexity is naturally low. Describing how photosynthesis works does not leave much room for surprising word choices. This is why SupWriter's grammar checker is designed to help you maintain accuracy while introducing enough stylistic variation to avoid false flags.
Non-Native English Writing
As documented in the Stanford ESL study, non-native speakers tend to use simpler vocabulary and more predictable sentence structures. They often learned English from textbooks that emphasize correct, standard patterns, which are exactly the patterns AI models also default to. Detectors read this as AI rather than what it is: competent second-language communication.
Formulaic Professional Writing
Emails, reports, and business documents follow templates and conventions. "Please find attached the quarterly report" is low-perplexity text, but it was written by a human following a professional norm.
Heavily Edited Writing
If you edit your work extensively for clarity and conciseness, you may inadvertently smooth out the natural burstiness and high-perplexity moments that signal human authorship. Polished writing can look more AI-like than rough drafts, which is an irony that has real consequences.
How to Write Authentically Without Triggering Detectors
If you are worried about your genuine writing being flagged, these practices help, and they also make your writing better regardless.
Vary your sentence structure deliberately. Mix short and long sentences. Use fragments occasionally. Start sentences in different ways.
Include specific personal details. AI cannot invent genuine personal experiences. References to specific places, conversations, or observations signal human authorship.
Use unusual word choices where appropriate. You do not need to be bizarre, but reaching for a less common synonym occasionally raises your perplexity in a natural way.
Let your voice come through. Opinions, humor, frustration, enthusiasm: these create the kind of stylometric signature that distinguishes individual humans from statistical averages.
Do not over-edit for uniformity. Some roughness and variation are actually signals of authenticity. Perfect consistency is a flag, not a virtue, in this context.
If you want to check how your writing appears to detectors, you can run it through SupWriter's AI detector for a transparent assessment. And if specific passages are getting flagged, SupWriter's paraphraser can help you rework them while preserving your meaning, adding the kind of natural variation that detectors recognize as human.
The Arms Race: Why Detection Is Getting Harder
Detection methods are not static. As AI models improve, the statistical differences between AI and human text shrink. GPT-3 text was relatively easy to detect because it had strong, consistent patterns. GPT-4 is significantly harder. Future models will be harder still.
This is because detection relies on the assumption that AI text is statistically different from human text. As models get better at mimicking human statistical patterns, that assumption weakens. OpenAI acknowledged as much when it shut down its own AI classifier in 2023 after it correctly identified only 26% of AI-written text, essentially admitting that even with direct access to its own models, reliable detection was not possible.
The response from detection companies has been to develop more sophisticated approaches: ensemble models that combine multiple detection methods, fine-tuning on newer model outputs, and behavioral analysis that looks at editing patterns rather than just final text.
But the fundamental challenge remains. The same neural networks that generate text are increasingly good at generating text that is statistically indistinguishable from human writing. Detection will always be a probabilistic judgment, never a certainty.
FAQ
What is the most important thing AI detectors measure?
Perplexity, which measures how predictable each word is in context, is the single most important metric. AI-generated text tends to have consistently low perplexity because language models are designed to select statistically probable tokens. Human writing has higher and more variable perplexity. However, no single metric is sufficient for reliable detection. Modern detectors combine perplexity with burstiness, token distribution analysis, and stylometric features.
Can I make my writing less likely to be falsely flagged?
Yes. Varying your sentence lengths, including specific personal details and opinions, using occasional unexpected word choices, and allowing some natural imperfection in your writing all help. The key is to avoid the statistical uniformity that detectors associate with AI. Ironically, highly polished and heavily edited human writing is more likely to be flagged than rougher drafts, so do not over-smooth your text.
Do all AI detectors use the same methods?
They use similar foundational approaches, particularly perplexity and burstiness analysis, but they differ in how they weight various signals, what additional features they analyze, and what training data they use. This is why different detectors can give different results on the same text. Running your text through multiple detectors gives you a more reliable picture than trusting any single one.
Why does technical writing get flagged as AI more often?
Technical writing uses specialized but predictable vocabulary, follows established structural conventions, and often avoids the personal voice and stylistic variation that detectors associate with human authorship. The perplexity of a sentence explaining a well-known scientific concept is naturally low because there are limited ways to express it accurately. Detectors read this low perplexity as an AI signal, even though it is simply the nature of precise technical communication.