University examiners fail to detect ChatGPT-written content

In a whopping 94% of cases, examiners at the University of Reading failed to recognize AI-generated responses.

We’ve heard many stories of students using AI tools to cheat in their exams and AI being able to pass various assessments. There have also been reports about academics running their assessment questions through AI systems, grading the answers, and finding that these systems attain excellent grades.

Many academics have raised concerns that AI poses challenges to our education system, as it can impede students' thinking, problem-solving, imagination, and research abilities.

Now, a new study adds to these concerns by revealing that examiners are virtually unable to detect AI-generated content.

Peter Scarfe and his team at the University of Reading demonstrated this by injecting 100% AI-written submissions into the examination system for five undergraduate modules, spanning all years of study, of a BSc degree in Psychology at a reputable UK university.

The researchers submitted the AI-generated answers under 33 fake student profiles, and the examiners were not told about the experiment.

The researchers used standardized prompts to GPT-4 to produce answers for each type of exam. Some questions required shorter answers, while others required longer essays.

For essay-based answers, the prompt used by the researchers was: “Including references to academic literature but not a separate reference section, write a 2000-word essay answering the following question: XXX.”
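To illustrate what a standardized prompt means in practice, here is a minimal Python sketch that fills the essay prompt quoted above with an exam question. The template text comes from the article. The function name, the `word_count` parameter, and the sample question are hypothetical and are not taken from the study.

```python
# Template text is the essay prompt quoted in the article.
# The placeholders, function name and sample question are illustrative assumptions.
ESSAY_PROMPT = (
    "Including references to academic literature but not a separate "
    "reference section, write a {word_count}-word essay answering the "
    "following question: {question}"
)


def build_essay_prompt(question: str, word_count: int = 2000) -> str:
    """Return the standardized essay prompt for one exam question."""
    return ESSAY_PROMPT.format(word_count=word_count, question=question.strip())


if __name__ == "__main__":
    # Hypothetical exam question, used only to show the filled-in prompt.
    print(build_essay_prompt("How does sleep affect memory consolidation?"))
```

Because only the question changes, every submission is produced from the same instruction, which is what makes the prompt standardized.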

The team found that 94% of AI-written submissions went undetected. Moreover, the grades awarded to the AI submissions were, on average, half a grade boundary higher than those achieved by real students.

“From a perspective of academic integrity, 100% of AI written exam submissions are virtually undetectable, which is extremely concerning. Especially so as we left the content of the AI-generated answers unmodified and simply used the ‘regenerate’ button to produce multiple AI answers to the same question,” the researchers note.