AI trained on human DNA can read it as a story


Researchers have developed an AI language model that can decode the human genome and read it as a text.

Although the complete sequence of the human genome was determined in 2022, the genome itself is not yet fully understood. Researchers at the Biotechnology Center (BIOTEC) of Dresden University of Technology have taken a step toward closing that gap with a newly developed AI model that can read genes as a story.

The AI language model GROVER, which stands for Genome Rules Obtained via Extracted Representations, was trained on human DNA to treat the genome as a linguistic structure. According to the scientists, human DNA resembles language: it is written in four letters (A, T, G, and C), and those letters form sequences, such as genes, that convey meaning.

To train GROVER, the team first created a DNA dictionary using a technique borrowed from compression algorithms. The researchers analyzed the whole genome to find the multi-letter combinations that occur most often. This process fragmented the DNA into ‘words’ that GROVER could process.
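
The dictionary-building step can be illustrated with a small sketch. It assumes the compression-derived technique works like byte-pair encoding, in which the most frequent pair of adjacent tokens is merged repeatedly into a new, longer token. The function names, the toy sequence, and the number of merges are hypothetical choices made for illustration, not details taken from the GROVER paper.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common pair of adjacent tokens, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return None
    return pairs.most_common(1)[0][0]


def merge_pair(tokens, pair):
    """Replace each left-to-right occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def build_vocabulary(sequence, num_merges):
    """Build a toy DNA 'dictionary' by repeatedly merging the most frequent pair.

    `num_merges` is an illustrative parameter, not a value from the paper.
    Returns the vocabulary and the sequence split into 'words'.
    """
    tokens = list(sequence)            # start from single nucleotides
    vocabulary = sorted(set(tokens))   # e.g. ['A', 'C', 'G', 'T']
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:               # nothing left to merge
            break
        tokens = merge_pair(tokens, pair)
        new_word = pair[0] + pair[1]
        if new_word not in vocabulary:
            vocabulary.append(new_word)
    return vocabulary, tokens


if __name__ == "__main__":
    # Toy input; a real run would use the whole genome.
    vocab, words = build_vocabulary("ATATGCGCATATGCGC", num_merges=3)
    print(vocab)
    print(words)
```

Each merge adds one longer ‘word’ to the dictionary, so frequent stretches of DNA end up as single tokens while rare ones stay split into shorter pieces.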

“GROVER learned the rules of DNA. In terms of language, we are talking about grammar, syntax, and semantics. For DNA, this means learning the rules governing the sequences, the order of the nucleotides and sequences, and the meaning of the sequences. Like GPT models learning human languages, GROVER has basically learned how to ‘speak’ DNA,” said Dr. Melissa Sanabria.

The team demonstrated that GROVER can accurately predict DNA sequences and extract biologically meaningful information, such as identifying gene promoters or protein binding sites. GROVER also learns about epigenetic processes, which are regulatory activities occurring on top of the DNA rather than being encoded within it.

The findings have been published in Nature Machine Intelligence. The work could advance genomics and personalized medicine by offering a deeper understanding of human biology and disease.