Can genome-based Large Language Models predict gene expression?

Sofiane Sadat; Arnaud Ferré; Guillaume Kon Kam King; Sofia Lotfi

Abstract

Large Language Models (LLMs), primarily based on the transformer architecture [1], such as BERT [2], have performed impressively across a range of Natural Language Processing (NLP) tasks. They have markedly improved accuracy in sentiment analysis, machine translation, information extraction, document categorization, and chatbot interactions, and have set new benchmarks in the field. This success has encouraged researchers to extend these models to biological sequences, which bear some formal similarities to natural languages but also differ in fundamental ways. Studies suggest that DNA, especially in non-coding regions, displays numerous linguistic characteristics similar to those of natural languages, such as alphabets, lexicons, grammar, and phonetics [3].

In this study, we repurpose DNABERT2 [4], an LLM trained extensively on genetic sequences, to predict human median gene expression. Much as BERT does for natural language, DNABERT2 learns representations of those sequences by predicting masked parts of a sequence from the surrounding context, a process known as Masked Language Modeling (MLM). We then adapt the model to our specific task, a process typically referred to as fine-tuning. We hypothesize that DNABERT2 encodes meaningful representations of DNA sequences that are also relevant for predicting gene expression. The authors of DNABERT2 introduced a new benchmark dataset for genome-based LLMs, called Genome Understanding Evaluation (GUE), which relies on the classification of certain genomic sequences. They applied their method and other state-of-the-art genome-based LLMs to GUE, demonstrating that DNABERT2 achieved performance comparable to these advanced models.
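The masking step behind MLM can be illustrated with a minimal sketch. It is not DNABERT2's actual pipeline: single nucleotides stand in for the model's real tokens, and the function name `mask_tokens`, the 15% masking rate, and the `[MASK]` symbol are assumptions chosen for illustration. A model trained with MLM would then be asked to recover the hidden tokens from the visible context.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, chosen for illustration


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens.

    Returns (masked_tokens, labels). labels holds the original token at
    each masked position and None elsewhere, so a model would only be
    scored on the positions it could not see.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Mask at least one position so that there is always a target.
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return masked, labels


# Toy sequence: one token per nucleotide (a simplification).
tokens = list("ACGTTGCAAGCTTAGGCTAA")
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

Production MLM setups usually add refinements, such as occasionally replacing a selected token with a random one instead of the mask symbol; these are omitted here for brevity.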
In the original paper, the model was evaluated exclusively on classification tasks, namely: core promoter detection, epigenetic marks prediction, promoter detection, transcription factor prediction, splice site prediction, COVID variant classification, enhancer-promoter interaction, and species classification. To understand the added value of using an LLM representation of sequences for the regression task of predicting gene expression, rather than using the raw DNA sequences, we compare the performance of DNABERT2 with DExTER, a supervised method specifically designed for gene expression prediction by identifying long regulatory elements [5], which has achieved significant results. DExTER is trained to select relevant motifs based on their predictive power and, at inference time, uses their abundance in the sequence of interest.

For this comparison, we used exactly the dataset used in the DExTER paper for the evaluation of pituitary expression. This dataset, originally from GTEx and published on June 5, 2017, contains data on more than 50,000 human genes and their median expression measured in Transcripts Per Million (TPM). It specifically includes protein-coding genes and their expression in pituitary tissue, resulting in a subset of 22,410 genes. To ensure the reliability of our findings, we used the same test set for both models, dividing the remaining data into training and validation sets.

Our results show that, for this gene expression prediction task, DNABERT2 outperformed DExTER. They demonstrate the potential of LLMs to provide robust representations of DNA sequences. Specifically, our findings underscore the ability of DNABERT2 to predict gene expression levels accurately, without the need to design complex algorithms like those in DExTER to capture the intricacies of gene expression. This not only highlights the promise of LLMs in gene expression prediction but also opens avenues for improving various other genetic prediction tasks using LLMs.
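The data protocol described in the abstract, a test set shared by both models with the remaining genes divided into training and validation sets, can be sketched as follows. This is an illustrative sketch only: the function name `split_with_fixed_test`, the gene identifiers, the 10% validation fraction, and the seed are assumptions, since the abstract does not state the actual split ratios.

```python
import random


def split_with_fixed_test(gene_ids, test_ids, val_fraction=0.1, seed=0):
    """Hold out a predefined test set, then split the rest at random.

    Returns (train, val, test). The test genes are fixed in advance so
    that every model is evaluated on exactly the same examples.
    """
    test_set = set(test_ids)
    test = [g for g in gene_ids if g in test_set]
    rest = [g for g in gene_ids if g not in test_set]
    rng = random.Random(seed)
    rng.shuffle(rest)  # shuffle only the non-test genes
    n_val = int(len(rest) * val_fraction)
    return rest[n_val:], rest[:n_val], test


# Toy example with hypothetical gene identifiers.
genes = [f"gene_{i}" for i in range(100)]
held_out = genes[:20]  # stands in for the shared test set
train, val, test = split_with_fixed_test(genes, held_out)
print(len(train), len(val), len(test))
```

Fixing the test set before any shuffling means that differences in test scores reflect the models and not the particular genes each one happened to be evaluated on.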

Origin	Files produced by the author(s)
License	HAL authorization
