Predictive language models like ChatGPT are powerful artificial intelligence (AI) systems trained on extensive examples of human communication. They generate new text based on the patterns and structures observed in their training data, producing responses to human prompts that are contextually relevant and often indistinguishable from human writing. Like our writing systems, many important classes of molecules in the living world, such as proteins, RNA, and DNA, are composed of sequences of individual molecular units, akin to letters. The widespread use of sequencing platforms has by now generated data representing the entire genomes of myriad species, and this vast collection of sequences makes excellent training data for biological language models. Here, we discuss some of the training methods and attributes of these powerful AI systems.

One way in which engineers train these systems is through masking. The AI system is provided with a large amount of DNA sequence data, but with portions of it hidden, and is asked to predict what should be there. Its attempts at reconstructing the missing data then serve as the signal for refining how it generates its predictions. This approach, self-supervised learning with Masked Language Modeling (MLM), builds up the system's predictive ability without requiring labeled data.
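To make the idea concrete, here is a minimal sketch of MLM on DNA, assuming PyTorch; the vocabulary, mask rate, and model sizes are illustrative choices, not those of any published genomic model.

```python
import torch
import torch.nn as nn

# Four nucleotides plus a mask token (positional encodings omitted for brevity).
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "[MASK]": 4}

def mask_sequence(ids: torch.Tensor, mask_rate: float = 0.15):
    """Hide a random subset of positions; the model must reconstruct them."""
    hidden = torch.rand(ids.shape) < mask_rate
    masked = ids.clone()
    masked[hidden] = VOCAB["[MASK]"]
    targets = ids.clone()
    targets[~hidden] = -100          # CrossEntropyLoss ignores these positions
    return masked, targets

class TinyDnaMlm(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(len(VOCAB), d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, len(VOCAB))  # per-position token logits

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(self.embed(ids)))

# One self-supervised step on a toy batch: no labels, the sequence is its own target.
seqs = ["ACGTACGTACGTACGT", "TTGACAGCTAGCTAGG"]
ids = torch.tensor([[VOCAB[base] for base in seq] for seq in seqs])
masked, targets = mask_sequence(ids)

model = TinyDnaMlm()
logits = model(masked)               # shape: (batch, length, vocab)
loss = nn.CrossEntropyLoss(ignore_index=-100)(
    logits.reshape(-1, len(VOCAB)), targets.reshape(-1)
)
loss.backward()                      # gradients refine future predictions
print(f"MLM reconstruction loss: {loss.item():.3f}")
```

In real genomic models the same loop runs over billions of nucleotides, but the masking objective itself is unchanged.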

Another valuable training strategy is to help a system recognize conserved regions. When sequence data from many different species are included, similarities become apparent in the DNA regions encoding basic functions necessary for survival, and the system can learn to recognize them even in fragmentary form. It can also grasp species-specific variations in the regulatory code (Karollus et al., 2024). The non-coding regions where regulatory elements often reside have historically been challenging for alignment-based methods, because their sequence structure can be highly variable. Training with MLM especially helps to overcome this barrier: as a language model's grasp of the overall sequence context grows, it becomes more effective at identifying conserved motifs within regulatory elements, despite the overall variability.
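One concrete way to expose a single model to many genomes at once, in the spirit of the species-aware models of Karollus et al. (2024), is to tag each training sequence with its source species. Below is a minimal sketch of that input construction; the species names and token ids are hypothetical placeholders, not the paper's actual vocabulary.

```python
# Hypothetical vocabularies: four nucleotides plus per-species tokens.
DNA_VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}
SPECIES_VOCAB = {"[S_cerevisiae]": 4, "[C_albicans]": 5}  # placeholder ids

def species_aware_input(seq: str, species: str) -> list[int]:
    """Prepend a species token, so one model can share statistics of conserved
    regions across the whole corpus while still learning species-specific rules."""
    return [SPECIES_VOCAB[species]] + [DNA_VOCAB[base] for base in seq]

# The same motif drawn from two genomes becomes two distinct training examples.
print(species_aware_input("TTGACA", "[S_cerevisiae]"))  # [4, 3, 3, 2, 0, 1, 0]
print(species_aware_input("TTGACA", "[C_albicans]"))    # [5, 3, 3, 2, 0, 1, 0]
```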

Advances in this field are accelerating rapidly. Newer models now outperform older AI systems at the specialized tasks for which those systems were originally designed, such as identifying and reconstructing regulatory motifs. Among the most impressive advances are rapid gains in contextual understanding of the broader biological environment, enabling predictive leaps such as estimating RNA half-life or how essential a given gene is for the organism (Nguyen et al., 2024). Half a century ago, it was argued that a protein randomly exploring its possible conformations would need longer than the current age of the universe to find its folded state, a puzzle known as Levinthal's paradox, and predicting the three-dimensional structure of a single amino acid sequence seemed correspondingly intractable. Science has made an unimaginable level of progress in this field within less than a single lifetime.

References

  1. Karollus, A., Hingerl, J., Gankin, D., Grosshauser, M., Klemon, K., & Gagneur, J. (2024). Species-aware DNA language models capture regulatory elements and their evolution. Genome Biology, 25(1), 83.
  2. Nguyen, E., Poli, M., Durrant, M. G., et al. (2024). Sequence modeling and design from molecular to genome scale with Evo. bioRxiv. https://doi.org/10.1101/2024.02.27.582234