How accurate are AI detectors?
AI text detectors, also known as AI checkers, are now common in education, SEO, and professional writing. They’re built to spot whether text was written by a human or by an AI model like ChatGPT. With so much AI-generated content showing up in schools, universities, and even government agencies, authenticity has become more important than ever.
That’s why I wanted to see for myself – are AI detectors accurate, or do they give results that can’t really be trusted? To figure this out, I ran multiple tests on the top AI detectors. I experimented with sentence structure, writing style, grammar edits, paraphrasing, and even translation to check how results changed.
The results showed strengths, flaws, and plenty of surprises. This article breaks down what happened, the final results, and answers the question: how accurate are AI detectors in practice?
How we tested AI detector accuracy
The goal of my testing was simple. I wanted to see just how reliable AI detectors are today and how they perform when analyzing both human writing and text created entirely by AI.
First, I had to make sure that every test was done on equal footing. To achieve this, I fed every AI checker the exact same text in each test. That way, I could see not just whether a detector flagged something as AI, but also how sensitive it was to different kinds of changes, such as style, grammatical edits, or paraphrasing.
I chose six of the top AI checkers: Undetectable AI, Grammarly, GPTZero, QuillBot, Humalingo, and Ahrefs.
Each was subjected to the same line of tests:
- First, I entered guaranteed human-written text (court transcripts)
- Then, I tried fully AI-generated text with no editing
- Lastly, to find out what exactly works and what doesn’t, I ran a series of small edits, including paraphrasing, grammatical changes, sentence rearrangement, and even back translation, to see which AI detectors were consistent and which could be tricked
This setup allowed for an equal comparison of performance across each AI detector. The results show where detectors succeed, where they fail, and how much trust you can realistically place in them when testing.
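The setup above amounts to scoring every text variant with every detector on identical input. Here is a minimal sketch of that idea in Python. The `Detector` type, `run_matrix`, and the two toy detectors are hypothetical stand-ins for illustration; the real tools are web apps, and I ran them by hand.

```python
from typing import Callable, Dict

# Hypothetical stand-in: a detector is any function that maps text to a percent-AI score.
Detector = Callable[[str], float]

def run_matrix(detectors: Dict[str, Detector],
               variants: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Score every text variant with every detector, using identical input for all."""
    return {
        name: {label: detect(text) for label, text in variants.items()}
        for name, detect in detectors.items()
    }

# Toy detectors for illustration only; they do not model any real product.
def always_85(text: str) -> float:
    return 85.0

def word_count_score(text: str) -> float:
    return min(100.0, float(len(text.split())))

variants = {
    "human": "Short human sample.",
    "ai": "A longer sample that stands in for the AI-generated article text.",
}

results = run_matrix({"fixed": always_85, "length": word_count_score}, variants)
for name, row in results.items():
    print(name, row)
```

Keeping the variants in one place is what makes the comparison fair: no detector ever sees a slightly different version of the text.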
- AI checkers' accuracy varies a lot between tools. The most consistent was GPTZero, while QuillBot, Humalingo, Grammarly, and Undetectable AI changed their scores drastically with small edits. Ahrefs stayed oddly fixed around 85% in almost every test.
- Small edits can change scores, but leave AI markers. Paraphrasing and punctuation changes lowered results in Grammarly, QuillBot, and Undetectable AI, but GPTZero kept flagging 100% AI.
- Some AI detectors focus on surface-level analysis. Grammarly and QuillBot are heavily influenced by sentence rhythm, grammar, and casual style changes, which seems to make them easier to trick.
- Other checkers use deeper statistical signals. GPTZero almost never changed its score, even after heavy edits (a big plus, since AI was used in a lot of the tests), which suggests it relies more on deeper structural patterns.
- Human-written text can be verified correctly, but not always. Overly technical documents that follow a rigid structure can be flagged as AI, and so can royalty-free classics or texts like the Bible, likely because they appear in training data. However, human spoken court transcripts were correctly verified as human by almost every AI checker.
- Never rely on a single detector. Because scores vary so widely, the best approach is to use more than one tool if accuracy matters (school, legal, compliance, professional writing).
- Detector results are indicators, not proof. Treat scores as signals, not evidence.
Test 1 – human text AI detection test
Finding text that can be proven human-written is surprisingly difficult today. First, I tried classics like Shakespeare and the Bible. They immediately got flagged by AI detectors, likely because their style appears in AI training data.
After that failed first attempt, I tried government reports, but all detectors consistently misclassified them, either because they sound too technical or because they may already include AI-assisted drafting.
The solution I finally settled on was courtroom transcripts from recent hearings. These are verifiably human-authored and provide a solid baseline. When I fed one of the most recent TikTok court hearing transcripts into the AI checkers, most (except Ahrefs) successfully identified the material as human-written, showing that AI checkers can perform correctly when faced with authentic human text (at least spoken). Here are the results:
| Detector | Result (human text) |
| Undetectable AI | Human (0% AI) |
| Grammarly | Human (0% AI) |
| GPTZero | Human (0% AI) |
| QuillBot | Human (0% AI) |
| Humalingo | Human (14% AI) |
| Ahrefs | AI (80% AI) |
Test 2 – AI generated text detection
For the second round, I asked an AI to write a 350-word article explaining AI text detectors, how they work, the techniques they use, and their limitations. The output was pure, untouched 100% AI writing with no personal input to humanize it. Here are the results:
| Detector | Result (100% AI text) |
| Undetectable AI | 89% AI |
| Grammarly | 42% AI |
| GPTZero | 100% AI |
| QuillBot | 95% AI |
| Humalingo | 83% AI |
| Ahrefs | 85% AI |
Most detectors caught it, though the scores weren’t all maxed out. GPTZero nailed it at 100%, but Grammarly only gave 42%, which is low considering the text was entirely machine-written.
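One way to read Tests 1 and 2 together is as a pair of yes/no calls per detector. The sketch below uses the scores from the two tables above and an assumed 50% cut-off for "flagged as AI". That threshold is my own illustration, not a setting published by any of these tools.

```python
# Assumed cut-off: a score at or above this counts as "flagged as AI".
THRESHOLD = 50

# Percent-AI scores from the Test 1 table (verified human transcript).
human_text_scores = {
    "Undetectable AI": 0, "Grammarly": 0, "GPTZero": 0,
    "QuillBot": 0, "Humalingo": 14, "Ahrefs": 80,
}

# Percent-AI scores from the Test 2 table (unedited AI article).
ai_text_scores = {
    "Undetectable AI": 89, "Grammarly": 42, "GPTZero": 100,
    "QuillBot": 95, "Humalingo": 83, "Ahrefs": 85,
}

def verdicts(human_scores, ai_scores, threshold=THRESHOLD):
    """Mark false positives (human text flagged) and false negatives (AI text missed)."""
    return {
        name: {
            "false_positive": human_scores[name] >= threshold,
            "false_negative": ai_scores[name] < threshold,
        }
        for name in human_scores
    }

errors = verdicts(human_text_scores, ai_text_scores)
for name, result in errors.items():
    print(name, result)
```

Under this assumed threshold, Ahrefs produces the only false positive and Grammarly the only false negative, which matches the reading of the two tables.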
Test 3 – testing AI checkers across multiple scenarios
To find out whether AI detectors actually work in real-world situations, I tested them across a variety of scenarios that people commonly try when hiding AI writing. From light edits and paraphrasing to adding human segments, style changes, and even translation, these tests reveal how accurate AI detectors are when pushed in different directions.
1. Random human changes
By hand, I made random changes across the entire AI-generated text, such as simplifying some phrases, swapping words, shortening and restructuring sentences, and adjusting tone slightly. Overall, I estimate about 20-25% of the text was altered. The core meaning stayed the same, but the flow became less polished, more human-like. Here are the results:
| Detector | Before (original 100% AI text) | After (edited by human text) |
| Undetectable AI | 89% | 87% |
| Grammarly | 42% | 33% |
| GPTZero | 100% | 100% |
| QuillBot | 89% | 59% |
| Humalingo | 83% | 99% |
| Ahrefs | 85% | 85% |
Looking at the numbers, Undetectable AI dropped only slightly, from 89% to 87%, so it barely registered the edits. Grammarly dropped from 42% to 33%, while QuillBot saw the most significant drop, from 89% to 59%, so my changes made the text read as more human to these checkers, QuillBot most of all.
However, the biggest surprise came from GPTZero. While it correctly identified both human-written and fully AI-generated texts in previous tests, it showed no change here, still flagging 100% of the text as AI-written. Humalingo moved in the opposite direction: it scored the original AI text at 83% AI but the hand-edited version at 99% AI.
In a real-world context, this wouldn’t necessarily be problematic since the text was AI-generated, and some reviewers might naturally classify the whole text that way. But for the purpose of this test, accuracy mattered, and GPTZero failed to recognize any of the adjustments I made by hand.
Lastly, continuing the trend of underperformance, Ahrefs remained unchanged at 85%, which makes me think it may be somewhat unreliable.
2. Paraphrasing and rewording
This time, using an AI paraphrasing tool, I paraphrased the entire original AI-generated text, restructuring most sentences and rewording nearly everything. People often do this by pasting ChatGPT output into a paraphrasing app, hoping it tricks AI detectors. Around 40-45% of the text changed, with shifts in tone, phrasing, and flow. Here are the results:
| Detector | Before (original AI text) | After (paraphrased text) |
| Undetectable AI | 89% | 95% |
| Grammarly | 42% | 33% |
| GPTZero | 100% | 100% |
| QuillBot | 89% | 59% |
| Humalingo | 83% | 7% |
| Ahrefs | 85% | 85% |
Interestingly, Undetectable AI shot up from 89% to 95%, showing how paraphrasing can sometimes make text even more suspicious.
On the other hand, Humalingo, Grammarly, and QuillBot all dropped (Humalingo most sharply, from 83% to 7%), which may mean their training does not account well for paraphrased text.
GPTZero didn’t budge, still flagging 100% AI-written despite the heavy rewrite. That may seem like an error, but considering I used an AI tool to make the adjustments, it’s a huge win for GPTZero. It also makes sense, since GPTZero has integrated a paraphrasing-detection mechanism into its algorithm.
Lastly, Ahrefs stayed at a seemingly bugged 85% and showed absolutely no change. Given that AI detection is a new feature for the tool, it may lack the more advanced detection metrics that competitors offer.
Overall, paraphrasing AI text helps against certain detectors, but it’s not perfect. Some checkers are easier to trick, while others, like GPTZero, have integrated solutions to detect such attempts.
3. Grammar and punctuation adjustments
For this test, I used ChatGPT itself. I asked it to adjust grammar and punctuation, tighten commas and hyphens, and change sentence breaks without changing meaning.
The edits were minor, mostly punctuation and small stylistic tweaks, so the content stayed the same. I estimate about 5-10% of tokens changed. They should barely affect detector readings or SEO.
Here’s how the detectors reacted after these lighter edits:
| Detector | Before (original AI text) | Results after grammar and punctuation changes |
| Undetectable AI | 89% | 1% |
| Grammarly | 42% | 16% |
| GPTZero | 100% | 100% |
| QuillBot | 89% | 26% |
| Humalingo | 83% | 99% |
| Ahrefs | 85% | 80% |
Looking at the numbers, the most striking result is Undetectable AI, which dropped dramatically from 89% to just 1%. Grammarly and QuillBot also dropped significantly, from 42% to 16% and from 89% to 26% respectively, showing that whatever their AI detectors were trained on needs more work.
Ahrefs, finally, moved, although not in the correct direction. The detection fell from 85% to 80%. Humalingo moved from 83% to 99%. Lastly, GPTZero once again correctly flagged the entire thing at 100%, because AI was used in the editing and reordering.
Overall, grammar and punctuation changes made with AI can make text appear more human to some detectors, but they don’t work on others. If the underlying content is AI-generated, tools like GPTZero can still detect AI patterns and flag the entire text as AI.
4. Adding human-written segments
This round involved manually rewriting portions of the text and adding entirely human-written segments to the original AI content. To create a bigger challenge for the AI checkers, I edited the text from the previous grammar and punctuation test. Roughly 25-30% of the text was changed or newly introduced.
The meaning stayed largely the same, but the flow, examples, and sentence variety now had a clearer human touch. Here’s how the detectors reacted to this update:
| Detector | Previous test results (grammar and punctuation edit) | New text (handwritten segments added) |
| Undetectable AI | 1% | 1% |
| Grammarly | 16% | 0% |
| GPTZero | 100% | 96% |
| QuillBot | 26% | 9% |
| Humalingo | 99% | 99% |
| Ahrefs | 80% | 80% |
This time, Undetectable AI stayed the same at 1%, which means it still picks up a faint trace of the AI skeleton but treats the text as essentially human. QuillBot fell dramatically to 9%, and Grammarly dropped to 0% (the first time during testing). Unfortunately, slightly humanizing the text and adding a bit of nuance fooled their detection entirely.
GPTZero dropped slightly from 100% to 96%. It still flags most of the text as AI, which is fair because the majority of the content remains AI-generated. Humalingo remained unchanged at 99%.
And lastly, yet again, Ahrefs stayed stuck (at 80%), but this time it actually hit the correct result, even if inconsistently.
5. Reordering information
For this test, I went back to the original 100% AI-generated text and only changed the structure. The content, tone, grammar, and meaning remained untouched.
I simply reordered sentences and moved the ending to the beginning. Here are the results:
| Detector | Before (original AI text) | After structural changes |
| QuillBot | 89% | 100% |
| Undetectable AI | 89% | 89% |
| Grammarly | 42% | 50% |
| GPTZero | 100% | 100% |
| Humalingo | 83% | 99% |
| Ahrefs | 85% | 85% |
QuillBot and GPTZero both fully flagged the text as AI, with QuillBot even jumping from 89% to 100% and Humalingo from 83% to 99%. Undetectable AI and Ahrefs stayed exactly the same, showing no sensitivity to structural shuffling. Grammarly also rose slightly, from 42% to 50%, which means that the changes actually made the text feel more robotic.
The key takeaway is that AI detectors aren’t fooled by sentence order. The stylistic DNA of AI writing (predictability, uniformity, and probability patterns) stays intact, no matter how much the structure is reshuffled.
6. Content expansion/shortening
For this round, I deliberately varied the burstiness of the sentences. I again went back to the original 100% AI-generated text and used another AI to expand or shorten sentences. Some were expanded into longer, winding statements, while others were cut down to just a few words.
The goal was to see if manipulating sentence rhythm alone could change how AI detectors scored the piece. Here’s how the detectors responded compared to the original baseline:
| Detector | Before (original AI text) | After burstiness changes |
| Undetectable AI | 89% | 70% |
| Grammarly | 42% | 0% |
| QuillBot | 89% | 0% |
| GPTZero | 100% | 100% |
| Humalingo | 83% | 99% |
| Ahrefs | 85% | 85% |
The changes made a big difference. Grammarly dropped all the way to 0% for the second time in these tests, and QuillBot also fell to 0%, which suggests that burstiness directly affects systems that lean heavily on surface-level signals. Undetectable AI decreased from 89% to 70%, another significant dip, though it still flagged the text as mostly AI.
Ahrefs remained unchanged at 85%, which reinforces the suspicion that it’s either calibrated poorly or uses a very rigid statistical model. It has sat at 85% through most variations and still scored fully human text at 80% AI. So, Ahrefs’ reliability is questionable at this point.
GPTZero, meanwhile, stayed firm at 100%, though it did flag some individual sentences as less AI-like. However, its final verdict was still correct: the text is AI-generated, even if the burstiness variations tricked some signals. Lastly, Humalingo increased from 83% to 99%.
That human-like irregularity seems to have tricked some detectors into lowering scores, but the underlying probability patterns are still intact, which is why GPTZero didn’t move.
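To make "burstiness" concrete, here is a rough sketch that measures how much sentence length varies in a passage. There is no single agreed definition; using the coefficient of variation of sentence length is my own simplification, and it is not how any of the tested detectors is known to compute its score.

```python
import re
from statistics import mean, pstdev

def sentence_lengths(text):
    """Split on sentence-ending punctuation and count the words in each sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def burstiness(text):
    """Coefficient of variation of sentence length: standard deviation divided by mean."""
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0
    return pstdev(lengths) / mean(lengths)

uniform = "The tool checks text. The tool scores text. The tool flags text."
varied = (
    "It failed. The detector, after several rounds of edits and rewrites, "
    "still returned the same score as before. Odd."
)

print(sentence_lengths(uniform), burstiness(uniform))
print(sentence_lengths(varied), burstiness(varied))
```

Evenly sized sentences give a value of zero, while mixing very short and very long sentences pushes the value up. That is the kind of surface signal the expand/shorten edits in this test were designed to change.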
7. Style changes
For this test, I changed the style of the text. The goal was to make it sound much more casual and conversational, the way someone might type in a chat or blog. As before, I used the original AI-generated text and altered it using ChatGPT, so the final text is still 100% AI.
The idea was simple. If AI is often flagged for being too clean and predictable, then maybe making the style messy, relaxed, and more human would throw off the detectors. Here are the results:
| Detector | Before (original AI text) | After the style change |
| Undetectable AI | 89% | 1% |
| QuillBot | 89% | 0% |
| Grammarly | 42% | 40% |
| GPTZero | 100% | 100% |
| Humalingo | 83% | 99% |
| Ahrefs | 85% | 85% |
The outcome is very surprising. QuillBot completely failed, dropping to 0% and treating the casual style as fully human-written. Undetectable AI again fell to just 1%, which is effectively also a failure, while Grammarly moved down only slightly, to 40%.
Only GPTZero held firm at 100%, still recognizing the AI origin even though it marked certain sentences as human-like. Humalingo consistently scored the text as 99% AI. Ahrefs, predictably, didn’t move from 85%, reinforcing its unreliability across every single test.
What this shows is that style matters a lot to some AI checkers. GPTZero, however, proves harder to fool because it leans on deeper probability markers that style alone can’t erase.
8. Translation and back translation
For this experiment, I took AI-generated text, ran it through Google Translate into Japanese, and then translated it back into English. Japanese uses subject-object-verb (SOV) word order, unlike English’s standard subject-verb-object (SVO), so this process naturally rearranged sentences and phrasing.
It’s worth noting that Google Translate itself relies on AI similar to the technology behind LLMs, so you can still say that the text is 100% AI-written. Here are the results:
| Detector | Before (original AI text) | After translation with Google Translate |
| Ahrefs | 85% | 85% |
| GPTZero | 100% | 100% |
| Grammarly | 42% | 42% |
| QuillBot | 89% | 87% |
| Humalingo | 83% | 99% |
| Undetectable AI | 89% | 88% |
Unsurprisingly, translating the AI text into Japanese and back had almost no effect on the final detection scores; only Humalingo moved noticeably, from 83% to 99%. Using translation alone is not an effective way to hide AI-generated text from detectors. The method slightly alters surface-level phrasing but won’t fool the AI checkers of 2026.
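To compare stability across all eight scenarios, the "after" scores from the tables above can be collected per detector and reduced to a simple range (highest score minus lowest). This is a rough sketch; the range is my own choice of summary, and scenario 4 started from the grammar-edited text rather than the original.

```python
# "After" scores in scenario order: hand edits, paraphrasing, grammar and punctuation,
# human segments, reordering, burstiness, style change, back translation.
after_scores = {
    "Undetectable AI": [87, 95, 1, 1, 89, 70, 1, 88],
    "Grammarly":       [33, 33, 16, 0, 50, 0, 40, 42],
    "GPTZero":         [100, 100, 100, 96, 100, 100, 100, 100],
    "QuillBot":        [59, 59, 26, 9, 100, 0, 0, 87],
    "Humalingo":       [99, 7, 99, 99, 99, 99, 99, 99],
    "Ahrefs":          [85, 85, 80, 80, 85, 85, 85, 85],
}

def score_range(scores):
    """Difference between the highest and lowest score, in percentage points."""
    return max(scores) - min(scores)

ranges = {name: score_range(scores) for name, scores in after_scores.items()}
ranking = sorted(ranges, key=ranges.get)  # most stable first

for name in ranking:
    print(f"{name}: {ranges[name]} points")
```

A narrow range only shows consistency, not accuracy: GPTZero and Ahrefs both barely move, but only GPTZero also scored the human transcript at 0% AI.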
FAQ
Can AI detectors identify content generated by multiple AI models in one piece?
AI detectors might still flag certain segments where stylistic or structural traces of AI remain. However, they may struggle to identify content generated by multiple AIs since each model introduces different stylistic and structural patterns. Current tools rely on statistical signals like perplexity and burstiness, but overlapping model outputs blur distinctions, making reliable attribution highly challenging as LLMs rapidly evolve.
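For readers unfamiliar with perplexity, here is a minimal sketch of the standard formula: the exponential of the average negative log-probability a language model assigns to each token. The probability lists are made-up illustrative values, not output from any real model or detector.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability across tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

# Hypothetical per-token probabilities, for illustration only.
predictable = [0.5, 0.5, 0.5, 0.5]    # every next word was easy to guess
surprising = [0.5, 0.01, 0.5, 0.02]   # some words were very unexpected

print(perplexity(predictable))
print(perplexity(surprising))
```

Lower perplexity means the text was easier for the model to predict, which detectors that rely on this signal tend to read as more AI-like.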
Are free AI detectors as accurate as paid ones?
No, free AI detectors are less accurate than paid ones, though performance varies widely. Paid tools often provide higher accuracy and extra features, but both free and paid detectors can produce false positives, so results should be treated only as indicators.
Do AI detectors keep a copy of the text I test?
Some do, but AI detectors generally do not keep a permanent copy of your text, though temporary storage is common during analysis. Some tools may use anonymized text to improve their models. Since privacy policies vary, always review them to understand retention and data protection practices.
Will AI detectors ever be 100% accurate?
No, AI detectors will likely never reach 100% accuracy because both AI and human writing share overlapping patterns, leading to false positives and negatives. As models evolve, signals blur, making it best to compare results across multiple AI detectors rather than relying on one.
How do AI detectors perform against the latest generative AI models?
According to my testing, AI detectors perform inconsistently against the latest generative AI models, often producing false positives and negatives. Since accuracy varies by tool and model, the most reliable way is to combine multiple detectors with human judgment.
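Combining several detectors can be as simple as taking the median score and flagging wide disagreement for a human to review. The sketch below uses the scores from the style-change test; the 30-point disagreement threshold and the `combine` helper are assumptions for illustration, not a recommended standard.

```python
from statistics import median

def combine(scores, disagreement_threshold=30):
    """Summarize several detector scores and flag cases where they disagree widely."""
    values = list(scores.values())
    spread = max(values) - min(values)
    return {
        "median": median(values),
        "spread": spread,
        "needs_human_review": spread >= disagreement_threshold,
    }

# Scores after the style change (test 7), where the tools disagreed the most.
style_change_scores = {
    "Undetectable AI": 1, "QuillBot": 0, "Grammarly": 40,
    "GPTZero": 100, "Humalingo": 99, "Ahrefs": 85,
}

summary = combine(style_change_scores)
print(summary)
```

Here the detectors span the full 0-100 range, so no single number is trustworthy and the text should go to a human reviewer.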