Can Undetectable AI Bypass ZeroGPT? I Tested 100 Rewrites, and the Answer Is More Complicated Than the Hype

A high detector score can look impressive on a dashboard. But students do not submit dashboards—they submit sentences. In this test, Undetectable AI often pushed ZeroGPT toward a more human-looking result. At the same time, many rewrites became longer, clumsier, and sometimes stranger. The big takeaway is simple: detector evasion and good writing are not the same thing.

How This Test Was Set Up

I worked with a dataset of 100 original passages. Each one was rewritten with Undetectable AI and then checked in ZeroGPT. ZeroGPT gives an AI score, but for clarity I converted that into a human score. That means:

100% = ZeroGPT treated the passage as fully human.
0% = ZeroGPT treated the passage as fully AI.
Higher scores are better if your goal is to make AI text look human to ZeroGPT.

This is an important limit to keep in mind: the study measures how ZeroGPT reacts, not whether the rewrite is actually strong, accurate, or worth turning in. That difference matters a lot for students.

Also Read: [STUDY] Can Undetectable AI Bypass GPTZero? A 100-Sample Reality Check

What Stands Out Right Away

Average human score: 79.2% across all 100 rewrites.
Median human score: 86%. The median is the middle value after sorting all scores, so it helps show the typical result without being skewed by a few extremes.
High-end results: 43 out of 100 samples landed in the 91% to 100% range.
Weak results still existed: 11 samples scored 50% human or lower.
Length inflation: the rewrites were about 23% longer on average, which suggests the tool often adds words to sound more human.

The First Pattern: Undetectable AI Often Helps, but It Is Not a Guaranteed Pass

The overall score distribution leans positive. Most of the dataset sits above the halfway mark, and a large chunk lands very high. That tells us Undetectable AI is often effective against ZeroGPT, at least in a one-detector test like this one.

Histogram showing the distribution of ZeroGPT human scores across 100 Undetectable AI rewrites

The overall spread is strong, but not clean. A lot of samples scored high, while a noticeable minority still scored poorly.

The bucket view makes that even clearer. Nearly half the dataset scored above 90% human, which is a very strong result. But the bottom of the chart matters too: some rewrites still looked obviously AI-like to ZeroGPT.

Bar chart showing how many samples fell into each human score bucket

43 samples reached 91% to 100% human, but 11 samples remained at 50% human or below.

If you think in terms of practical thresholds, the story becomes easier to read:

90% of samples cleared 50% human.
60% cleared 75% human.
44% cleared 90% human.

That is good enough to call the tool useful, but not good enough to call it reliable in every case.

Also Read: [STUDY] Can Undetectable AI Bypass Originality AI? A 100-Sample Reality Check

Bar chart showing the percentage of samples that scored above common human-score thresholds

The tool performs well often, but a perfect one-click escape from detection is not what the data shows.

The Second Pattern: The More Aggressively It Rewrites, the Better It Usually Scores

I also compared each original passage with its rewritten version to see how much wording stayed the same. Put simply, I measured word overlap. Lower overlap means a more aggressive rewrite. Higher overlap means the tool stayed closer to the original phrasing.

That relationship mattered a lot. The most aggressive quarter of rewrites averaged about 90.3% human, while the quarter that stayed closest to the original averaged only about 59.8% human.

Bar chart comparing average human score by rewrite aggressiveness

The rewrites that changed the wording most heavily usually earned the best ZeroGPT scores.

In simple terms, ZeroGPT was often harder to fool with light editing, but easier to fool when the text was heavily reshaped. That sounds like a win for the humanizer—until you actually read some of the rewritten passages.

Here Is the Catch: Better Scores Often Came with Worse Writing

This was the most interesting part of the CSV. A close read of the rewrites showed that strong detector scores did not always mean strong output. In fact, some of the highest-scoring passages were among the messiest to read.

Across the dataset, I repeatedly found several rewrite problems:

Meaning drift: the rewrite kept the topic but changed the claim, softened it, exaggerated it, or added details that were not in the source.
Formatting damage: headings, numbering, and punctuation were sometimes changed in awkward ways.
Grammar breaks: some passages looked like English, but not natural English.
Gibberish or broken phrasing: a few lines were clearly worse than the original even when they scored very high.
Wordy inflation: the tool often added filler rather than clarity.

Examples of Rewrite Problems Found in the CSV

Some examples were hard to ignore. One passage about the Golden Ratio produced the line “She is not only found in art”, which turns a mathematical concept into a person. Another rewrite about soundproofing produced “Weather your home through the edges of your doors and windows”, which is not what the original meant at all.

“A the railways...” / “The in the home is in the kitchen” / “high high water marks and low low water marks”

Those are not isolated slips. I also found duplicated headings, random capitalization, repeated words, and lines that read like a machine trying to imitate a human voice one clause at a time. In a few places the rewrite added new details—for example extra descriptive wording or stronger claims—without clear support from the original.

For students, that matters. A detector might be fooled by surface-level variety, but a teacher still reads for clarity, logic, and accuracy. If a rewritten sentence sounds unnatural or drifts away from the source idea, the detector score stops being the most important part of the assignment.

What the Full Score Curve Suggests

When the scores are sorted from lowest to highest, the shape of the curve tells a useful story. The top half of the dataset is strong. The bottom section, however, drops sharply. That means Undetectable AI was not simply “good” or “bad.” It was inconsistent.

Line chart of sorted human scores across the dataset

The upper half of the dataset performed well, but the lower tail shows that failure cases were real and not rare.

What the Screenshots Show

The screenshots below match the broader pattern in the numbers. ZeroGPT often reacts favorably to the rewritten versions, and the Undetectable AI interface clearly pushes the text toward a more detector-friendly form. But once you read closely, you can see the trade-off: the wording is often more artificial, more inflated, or less precise than the original.

ZeroGPT Detection Results

Example ZeroGPT result 1

Example ZeroGPT result 2

Example ZeroGPT result 3

Example ZeroGPT result 4

Example ZeroGPT result 5

Undetectable AI Rewrite Screens

Undetectable AI rewrite 1

Undetectable AI rewrite 2

Undetectable AI rewrite 3

Undetectable AI rewrite 4

Undetectable AI rewrite 5

The Verdict

Undetectable AI was fairly effective at pushing ZeroGPT toward higher human scores. On the numbers alone, that much is clear. The average result was strong, and many samples scored in the 90% to 100% human range.

But the writing itself tells the more useful story. The tool often improved detector performance by making bigger wording changes, and those changes sometimes introduced awkward grammar, strange phrasing, extra filler, and small shifts in meaning. In other words, it could beat the detector while still harming the paragraph.

For students, that is the real lesson. If your standard is only “Can this get past ZeroGPT?” the tool often works. If your standard is “Will this still sound clear, accurate, and natural to a human reader?” the answer is much less comfortable.