PeTriBERT : Augmenting BERT with tridimensional encoding for inverse protein folding and design - INRAE - Institut national de recherche pour l’agriculture, l’alimentation et l’environnement
Preprint, Working Paper. Year: 2022

PeTriBERT : Augmenting BERT with tridimensional encoding for inverse protein folding and design

Abstract

Proteins are biology's workhorses. Since the recent breakthrough of novel folding methods, the amount of available structural data has been increasing, closing the gap between data-driven sequence-based and structure-based methods. In this work, we focus on the inverse folding problem, which consists in predicting an amino-acid primary sequence from a protein's 3D structure. For this purpose, we introduce a simple Transformer model from Natural Language Processing, augmented with 3D structural data. We call the resulting model PeTriBERT: Proteins embedded in tridimensional representation in a BERT model. We train this small 40-million-parameter model on more than 350,000 protein sequences retrieved from the newly available AlphaFoldDB database. Using PeTriBERT, we are able to generate, in silico, totally new proteins with a GFP-like structure. Nine out of ten of these GFP structural homologues show no resemblance to known sequences when blasted against the whole entry proteome database. This shows that PeTriBERT indeed captures protein folding rules and becomes a valuable tool for de novo protein design.
Main file
DumortierB.-et al-bioRxiv-2022.pdf (5.57 MB)
Origin: Publisher files allowed on an open archive
License: CC BY - Attribution

Dates and versions

hal-03759515, version 1 (24-08-2022)

Cite

Baldwin Dumortier, Antoine Liutkus, Clément Carré, Gabriel Krouk. PeTriBERT : Augmenting BERT with tridimensional encoding for inverse protein folding and design. 2022. ⟨hal-03759515⟩