How accurate are AI detectors?


AI text detectors, also known as AI checkers, are now common in education, SEO, and professional writing. They’re built to spot if text was written by a human or an AI model like ChatGPT. With so much AI-generated content showing up in schools, universities, and even governmental agencies, authenticity has become more important than ever.

That’s why I wanted to see for myself – are AI detectors accurate, or do they give results that can’t really be trusted? To figure this out, I ran multiple tests on the top AI detectors. I experimented with sentence structure, writing style, grammar edits, paraphrasing, and even translation to check how results changed.

The results showed strengths, flaws, and plenty of surprises. This article breaks down what happened, the final results, and answers the question: how accurate are AI detectors in practice?

How we tested AI detector accuracy

The goal of my testing was simple. I wanted to see just how reliable AI detectors are today and how they perform when analyzing both human writing and text created entirely by AI.

First, I had to make sure that every test was done on equal footing. To achieve this, I fed every AI checker the exact same text in each test. That way, I could see not just whether a detector flagged something as AI, but also how sensitive it was to different kinds of changes, such as style, grammatical edits, or paraphrasing.

I chose six of the top AI checkers: Undetectable AI, Grammarly, GPTZero, QuillBot, Humalingo, and Ahrefs.

Each was subjected to the same line of tests:

  1. First, I entered guaranteed human-written text (court transcripts)
  2. Then, I tried fully AI-generated text with no editing
  3. Lastly, to find out what exactly works and what doesn’t, I ran a series of small edits, including paraphrasing, grammatical changes, sentence rearrangement, and even back translation, to see which AI detectors were consistent and which could be tricked

This setup allowed for an equal comparison of performance across each AI detector. The results show where detectors succeed, where they fail, and how much trust you can realistically place in them when testing.

Key takeaways:

Test 1 – human text AI detection test

Finding text that can be proven human-written is surprisingly difficult today. First, I tried classics like Shakespeare and the Bible. They immediately got flagged by AI detectors, likely because their style appears in AI training data.

Next, I tried government reports, but all detectors consistently misclassified those too, either because they sound too technical or because they may already include AI-assisted drafting.

The solution I finally settled on was courtroom transcripts from recent hearings. These are verifiably human-authored and provide a solid baseline. When I fed one of the most recent TikTok court hearing transcripts into the AI checkers, most (all except Ahrefs) successfully identified the material as human-written, showing that AI checkers can perform correctly when faced with authentic human text (at least transcribed speech). Here are the results:

| Detector | Result (human text) |
| --- | --- |
| Undetectable AI | Human (0% AI) |
| Grammarly | Human (0% AI) |
| GPTZero | Human (0% AI) |
| QuillBot | Human (0% AI) |
| Humalingo | Human (14%) |
| Ahrefs | Human (80% AI) |

Screenshots: fully human-written text test results from Grammarly, Undetectable AI, GPTZero, QuillBot, Humalingo, and Ahrefs.

Test 2 – AI generated text detection

For the second round, I asked an AI to write a 350-word article explaining AI text detectors, how they work, the techniques they use, and their limitations. The output was pure, untouched 100% AI writing with no personal input to humanize it. Here are the results:

| Detector | Result (100% AI text) |
| --- | --- |
| Undetectable AI | 89% AI |
| Grammarly | 42% AI |
| GPTZero | 100% AI |
| QuillBot | 95% AI |
| Humalingo | 83% AI |
| Ahrefs | 85% AI |

Screenshots: 100% AI-written text test results from Grammarly, Undetectable AI, GPTZero, QuillBot, Humalingo, and Ahrefs.

Most detectors caught it, though the scores weren’t all maxed out. GPTZero nailed it at 100%, but Grammarly only gave 42%, which is low considering the text was entirely machine-written.
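To make the first two tests easier to compare, here is a minimal Python sketch that turns the scores from the two tables above into a pass/fail check. The 50% cutoff and the helper names (`verdict`, `both_correct`, `gap`) are my own assumptions for illustration; none of the detectors publishes this as its decision rule.

```python
# Scores (% AI) copied from the Test 1 and Test 2 tables above.
human_text = {
    "Undetectable AI": 0, "Grammarly": 0, "GPTZero": 0,
    "QuillBot": 0, "Humalingo": 14, "Ahrefs": 80,
}
ai_text = {
    "Undetectable AI": 89, "Grammarly": 42, "GPTZero": 100,
    "QuillBot": 95, "Humalingo": 83, "Ahrefs": 85,
}

THRESHOLD = 50  # assumption: a score of 50% or more counts as an "AI" verdict


def verdict(score, threshold=THRESHOLD):
    """Map a percentage score to a binary label."""
    return "AI" if score >= threshold else "human"


def both_correct(detector):
    """True if the detector labels the transcript human and the AI article AI."""
    return (verdict(human_text[detector]) == "human"
            and verdict(ai_text[detector]) == "AI")


# Detectors that get both baseline texts right at the assumed cutoff.
passed = [name for name in human_text if both_correct(name)]

# How far apart each detector scores the AI text and the human text.
gap = {name: ai_text[name] - human_text[name] for name in human_text}

print(passed)
print(gap)
```

At this cutoff, Grammarly misses the AI article (42%) and Ahrefs misses the human transcript (80%). The gap is arguably the more telling number: Ahrefs separates the two texts by only 5 points, while GPTZero separates them by 100.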

Test 3 – testing AI checkers across multiple scenarios

To find out whether AI detectors actually work in real-world situations, I tested them across a variety of scenarios that people commonly try when hiding AI writing. From light edits and paraphrasing to adding human segments, style changes, and even translation, these tests reveal how accurate AI detectors are when pushed in different directions.

1. Random human changes

By hand, I made random changes across the entire AI-generated text, such as simplifying some phrases, swapping words, shortening and restructuring sentences, and adjusting tone slightly. Overall, I estimate about 20-25% of the text was altered. The core meaning stayed the same, but the flow became less polished, more human-like. Here are the results:

| Detector | Before (original 100% AI text) | After (text edited by a human) |
| --- | --- | --- |
| Undetectable AI | 89% | 87% |
| Grammarly | 42% | 33% |
| GPTZero | 100% | 100% |
| QuillBot | 89% | 59% |
| Humalingo | 83% | 99% |
| Ahrefs | 85% | 85% |

Screenshots: test results after random text changes by a human from Grammarly, Undetectable AI, GPTZero, QuillBot, Humalingo, and Ahrefs.

Looking at the numbers, Undetectable AI dropped only slightly, from 89% to 87%, so it barely registered the edits. Grammarly dropped from 42% to 33%, while QuillBot saw the most significant drop, from 89% to 59%, so my changes made the text look more human to these three checkers, though to very different degrees.

However, the biggest surprise came from GPTZero. While it correctly identified both human-written and fully AI-generated texts in previous tests, it showed no change here, still flagging 100% of the text as AI-written. Humalingo moved in the opposite direction: it scored the original AI text at 83% AI but the hand-edited version at 99% AI.

In a real-world context, this wouldn’t necessarily be problematic since the text was AI-generated, and some reviewers might naturally classify the whole text that way. But for the purpose of this test, accuracy mattered, and GPTZero failed to recognize any of the adjustments I made by hand.

Lastly, continuing the trend of underperformance, Ahrefs remained unchanged at 85%, which makes me think it may be somewhat unreliable.

2. Paraphrasing and rewording

This time, using an AI paraphrasing tool, I paraphrased the entire original AI-generated text, restructuring most sentences and rewording nearly everything. People often do this by pasting ChatGPT output into a paraphrasing app, hoping it tricks AI detectors. Around 40-45% of the text changed, with shifts in tone, phrasing, and flow. Here are the results:

| Detector | Before (original AI text) | After (paraphrased text) |
| --- | --- | --- |
| Undetectable AI | 89% | 95% |
| Grammarly | 42% | 33% |
| GPTZero | 100% | 100% |
| QuillBot | 89% | 59% |
| Humalingo | 83% | 7% |
| Ahrefs | 85% | 85% |

Screenshots: test results after paraphrasing the text using AI from Grammarly, Undetectable AI, GPTZero, QuillBot, Humalingo, and Ahrefs.

Interestingly, Undetectable AI shot up from 89% to 95%, showing how paraphrasing can sometimes make text even more suspicious.

On the other hand, Humalingo, Grammarly, and QuillBot dropped sharply, which may mean that their training does not account for paraphrased text as well as it does for raw AI output.

GPTZero didn’t budge, still flagging 100% AI-written despite the heavy rewrite. That may seem like an error, but considering I used an AI tool to make the adjustments, it’s a huge win for GPTZero. This makes perfect sense, since it has integrated a paraphrasing-detection mechanism into its algorithm.

Lastly, Ahrefs stayed at a seemingly bugged 85% and showed absolutely no change. Given that AI detection is a new feature for the tool, it may lack the more advanced detection metrics that competitors offer.

Overall, paraphrasing AI text helps against certain detectors, but it’s not perfect. Some checkers are easier to trick, while others, like GPTZero, have integrated solutions to detect such attempts.

3. Grammar and punctuation adjustments

Again, to stay on track and test AI checkers’ capabilities, for this test, I used ChatGPT itself. I asked it to adjust grammar and punctuation, tighten commas and hyphens, and change sentence breaks without changing meaning.

The edits were minor, mostly punctuation and small stylistic tweaks, so the content stayed the same. I estimate about 5-10% of tokens changed. They should barely affect detector readings or SEO.
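An estimate like the 5-10% figure above can be made reproducible with a token-level diff. This is only a sketch of one way to do it, using Python's standard `difflib` module. The `tokenize` and `fraction_changed` helpers and the sample sentences are hypothetical, and this is not how I arrived at the estimates in this article.

```python
import difflib
import re


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def fraction_changed(original, edited):
    """Share of tokens not covered by a matching block in the diff."""
    a, b = tokenize(original), tokenize(edited)
    if not a and not b:
        return 0.0
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1 - matched / max(len(a), len(b))


# Hypothetical sample: only the punctuation differs between the two versions.
original = "AI detectors estimate how likely it is that a text was machine written."
edited = "AI detectors estimate how likely it is that a text was machine-written!"

print(f"{fraction_changed(original, edited):.0%} of tokens changed")
```

Dividing by the longer of the two token lists keeps the result between 0 and 1 even when the edit adds or removes a lot of text. A word-level diff like this counts moved sentences as changed, so it tends to overstate edits that only reorder content.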

Here’s how the detectors reacted after these lighter edits:

| Detector | Before (original AI text) | After grammar and punctuation changes |
| --- | --- | --- |
| Undetectable AI | 89% | 1% |
| Grammarly | 42% | 16% |
| GPTZero | 100% | 100% |
| QuillBot | 89% | 26% |
| Humalingo | 83% | 99% |
| Ahrefs | 85% | 80% |

Screenshots: test results after adjusting grammar and punctuation using AI from Grammarly, Undetectable AI, GPTZero, QuillBot, Humalingo, and Ahrefs.

Looking at the numbers, the most striking result is Undetectable AI, which dropped dramatically from 89% to just 1%. Grammarly and QuillBot also dropped significantly, from 42% to 16% and from 89% to 26% respectively, showing that whatever their AI detectors were trained on needs more improvement.

Ahrefs finally moved, although not in the correct direction: its score fell from 85% to 80%. Humalingo moved from 83% to 99%. Lastly, GPTZero once again correctly flagged the entire thing at 100%, because AI was used in the editing and reordering.

Overall, grammar and punctuation changes with AI can make text appear more human to some detectors, but they may no longer work on others. If the underlying content is AI-generated, tools like GPTZero can still detect AI patterns and flag the entire text as AI.

4. Adding human-written segments

This round involved manually rewriting portions of the text and adding entirely human-written segments to the original AI content. To create a bigger challenge for the AI checkers, I edited the text from the previous grammar and punctuation test. Roughly 25-30% of the text was changed or newly introduced.

The meaning stayed largely the same, but the flow, examples, and sentence variety now had a clearer human touch. Here’s how the detectors reacted to this update:

| Detector | Previous test results (grammar and punctuation edit) | New text (handwritten segments added) |
| --- | --- | --- |
| Undetectable AI | 1% | 1% |
| Grammarly | 16% | 0% |
| GPTZero | 100% | 96% |
| QuillBot | 26% | 9% |
| Humalingo | 99% | 99% |
| Ahrefs | 80% | 80% |

Screenshots: test results after adding human-written segments from Grammarly, Undetectable AI, GPTZero, QuillBot, Humalingo, and Ahrefs.

This time, Undetectable AI stayed the same at 1%, so it still picks up a faint trace of the AI skeleton and leaves a bit of doubt, while QuillBot fell dramatically to 9% and Grammarly dropped to 0% (for the first time during testing). Unfortunately, slightly humanizing the text and adding a bit of nuance fooled their detection entirely.

GPTZero dropped slightly from 100% to 96%. It still flags most of the text as AI, which is fair because the majority of the content remains AI-generated. Humalingo remained unchanged at 99%.

And lastly, yet again, Ahrefs stayed stuck (at 80%), but this time it actually hit the correct result, even if inconsistently.

5. Reordering information

For this test, I went back to the original 100% AI-generated text and only changed the structure. The content, tone, grammar, and meaning remained untouched.

I simply reordered sentences and moved the ending to the beginning. Here are the results:

| Detector | Before (original AI text) | After structural changes |
| --- | --- | --- |
| QuillBot | 89% | 100% |
| Undetectable AI | 89% | 89% |
| Grammarly | 42% | 50% |
| GPTZero | 100% | 100% |
| Humalingo | 83% | 99% |
| Ahrefs | 85% | 85% |

Screenshots: test results after reordering information from Undetectable AI, Grammarly, GPTZero, QuillBot, Humalingo, and Ahrefs.

QuillBot and GPTZero both fully flagged the text as AI, with QuillBot even jumping from 89% to 100% and Humalingo from 83% to 99%. Undetectable AI and Ahrefs stayed exactly the same, showing no sensitivity to structural shuffling. Grammarly also rose slightly, from 42% to 50%, which suggests that the changes actually made the text feel more robotic.

The key takeaway is that AI detectors aren’t fooled by sentence order. The stylistic DNA of AI writing (predictability, uniformity, and probability patterns) stays intact, no matter how much the structure is reshuffled.

6. Content expansion/shortening

For this round, I deliberately varied the burstiness of the sentences, meaning how much sentence length and rhythm vary. I again went back to the original 100% AI-generated text and used another AI to expand or shorten sentences. Some were expanded into longer, winding statements, while others were cut down to just a few words.

The goal was to see if manipulating sentence rhythm alone could change how AI detectors scored the piece. Here’s how the detectors responded compared to the original baseline:

| Detector | Before (original AI text) | After burstiness changes |
| --- | --- | --- |
| Undetectable AI | 89% | 70% |
| Grammarly | 42% | 0% |
| QuillBot | 89% | 0% |
| GPTZero | 100% | 100% |
| Humalingo | 83% | 99% |
| Ahrefs | 85% | 85% |

Screenshots: test results after content expansion/shortening from Undetectable AI, Grammarly, GPTZero, QuillBot, Humalingo, and Ahrefs.

The changes made a big difference. Grammarly dropped all the way to 0% for the second time in these tests, and QuillBot also fell to 0%, which suggests that burstiness directly affects systems that lean heavily on surface-level features. Undetectable AI decreased from 89% to 70%, another significant dip, though it still flagged the text as mostly AI.

Ahrefs remained unchanged at 85%, which reinforces the suspicion that it’s either calibrated poorly or uses a very rigid statistical model. In fact, it has stayed between 80% and 85% through every variation, even when tested against fully human text (80%). So, Ahrefs’ reliability is questionable at this point.

GPTZero, meanwhile, stayed firm at 100%, though it did flag some individual sentences as less AI-like. However, its final verdict was still correct: the text is AI-generated, even if the burstiness variations tricked some signals. Lastly, Humalingo increased from 83% to 99%.

That human-like irregularity seems to have tricked some detectors into lowering scores, but the underlying probability patterns are still intact, which is why GPTZero didn’t move.
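For readers who want a feel for what "burstiness" measures, here is a rough Python sketch that scores a text by how much its sentence lengths vary (standard deviation divided by mean). This is a common simple proxy and purely my assumption. None of the tested detectors documents its scoring this way, and the `sentence_lengths` and `burstiness` helpers and the sample texts are made up for illustration.

```python
import re
import statistics


def sentence_lengths(text):
    """Word count of each sentence, splitting after '.', '!' or '?'."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]


def burstiness(text):
    """Coefficient of variation of sentence length (0.0 = perfectly uniform)."""
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0
    return statistics.pstdev(lengths) / statistics.mean(lengths)


# Hypothetical samples: three even sentences vs. a short-long-short rhythm.
uniform = "Detectors read the text. Models score each word. Tools report a number."
varied = ("Detectors read. Then, after scoring every single word in the passage, "
          "they report one number. Done.")

print(round(burstiness(uniform), 2))
print(round(burstiness(varied), 2))
```

The uniform sample scores 0.0 because every sentence has four words, while the mixed sample scores much higher. A detector that weighs a signal like this heavily would be easy to move with the kind of expand-and-shorten edits used in this test, which is consistent with the Grammarly and QuillBot results above.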

7. Style changes

For this test, I changed the style of the text. The goal was to make it sound much more casual and conversational, the way someone might type in a chat or blog. As before, I used the original AI-generated text and altered it using ChatGPT, so the final text is still 100% AI.

The idea was simple. If AI is often flagged for being too clean and predictable, then maybe making the style messy, relaxed, and more human would throw off the detectors. Here are the results:

| Detector | Before (original AI text) | After the style change |
| --- | --- | --- |
| Undetectable AI | 89% | 1% |
| QuillBot | 89% | 0% |
| Grammarly | 42% | 40% |
| GPTZero | 100% | 100% |
| Humalingo | 83% | 99% |
| Ahrefs | 85% | 85% |

Screenshots: test results after changing the style to casual using AI from Undetectable AI, Grammarly, GPTZero, QuillBot, Humalingo, and Ahrefs.

The outcome is very surprising. QuillBot completely failed, dropping to 0% and treating the casual style as fully human-written. Undetectable AI again fell to just 1%, which is effectively also a failure, while Grammarly only moved down slightly, to 40%.

Only GPTZero held firm at 100%, still recognizing the AI origin even though it marked certain sentences as human-like. Humalingo consistently scored the text as 99% AI. Ahrefs, predictably, didn’t move from 85%, reinforcing its unreliability across every single test.

What this shows is that style matters a lot to some AI checkers. GPTZero, however, proves harder to fool because it leans on deeper probability markers that style alone can’t erase.

8. Translation and back translation

For this experiment, I took AI-generated text, ran it through Google Translate into Japanese, and then translated it back into English. Japanese uses subject-object-verb (SOV) word order, unlike English’s standard subject-verb-object (SVO), so this process naturally rearranged sentences and phrasing.

It’s worth noting that Google Translate itself relies on neural machine translation, a close relative of the models behind LLMs, so you can still say that the text is 100% AI-written. Here are the results:

| Detector | Before (original AI text) | After translation with Google Translate |
| --- | --- | --- |
| Ahrefs | 85% | 85% |
| GPTZero | 100% | 100% |
| Grammarly | 42% | 42% |
| QuillBot | 89% | 87% |
| Humalingo | 83% | 99% |
| Undetectable AI | 89% | 88% |

Screenshots: test results after translation and back translation using Google Translate from Undetectable AI, Grammarly, GPTZero, QuillBot, Humalingo, and Ahrefs.

Unsurprisingly, translating the AI text into Japanese and back caused basically no change in the final detection scores. Using translation alone is not an effective way to hide AI-generated text from detectors. The method slightly alters surface-level phrasing but won’t fool the AI checkers of 2026.
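To compare how easily each detector's score could be moved, the sketch below collects the "After" columns from the Test 3 tables and computes the spread (highest minus lowest score) per detector. Round 4 is left out because it started from the grammar-edited text rather than the original. The `spread` helper is my own naming, and spread is just one simple way to summarize the data.

```python
# Scores (% AI) copied from the "After" columns of the Test 3 tables.
# Order: human edits, paraphrasing, grammar, reordering,
#        burstiness, style change, back translation.
after_scores = {
    "Undetectable AI": [87, 95, 1, 89, 70, 1, 88],
    "Grammarly": [33, 33, 16, 50, 0, 40, 42],
    "GPTZero": [100, 100, 100, 100, 100, 100, 100],
    "QuillBot": [59, 59, 26, 100, 0, 0, 87],
    "Humalingo": [99, 7, 99, 99, 99, 99, 99],
    "Ahrefs": [85, 85, 80, 85, 85, 85, 85],
}


def spread(scores):
    """Difference between the highest and lowest score, in percentage points."""
    return max(scores) - min(scores)


spreads = {name: spread(scores) for name, scores in after_scores.items()}

# Print detectors from most stable to most easily moved.
for name, value in sorted(spreads.items(), key=lambda item: item[1]):
    print(f"{name}: {value} points")
```

GPTZero and Ahrefs come out as the most stable, but for different reasons: a small spread only means the score is hard to move, not that it is right. Ahrefs also gave the human transcript 80%, so its stability looks more like insensitivity than accuracy.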

FAQ