Integration of pre-trained protein language models into geometric deep learning networks

Communications Biology volume 6, Article number: 876 (2023)


Geometric deep learning has recently achieved great success in non-Euclidean domains, and learning on 3D structures of large biomolecules is emerging as a distinct research area. However, its efficacy is largely constrained by the limited quantity of structural data. Meanwhile, protein language models trained on substantial 1D sequences have shown burgeoning capabilities with scale in a broad range of applications. Several preceding studies consider combining these different protein modalities to promote the representation power of geometric neural networks but fail to present a comprehensive understanding of their benefits. In this work, we integrate the knowledge learned by well-trained protein language models into several state-of-the-art geometric networks and evaluate them on a variety of protein representation learning benchmarks, including protein-protein interface prediction, model quality assessment, protein-protein rigid-body docking, and binding affinity prediction. Our findings show an overall improvement of 20% over baselines. Strong evidence indicates that the incorporation of protein language models’ knowledge enhances geometric networks’ capacity by a significant margin and can be generalized to complex tasks.

Macromolecules (e.g., proteins, RNAs, or DNAs) are essential to biophysical processes. While they can be represented using lower-dimensional representations such as linear sequences (1D) or chemical bond graphs (2D), a more intrinsic and informative form is the three-dimensional geometry1. 3D shapes are critical to not only understanding the physical mechanisms of action but also answering a number of questions associated with drug discovery and molecular design2. Consequently, tremendous efforts in structural biology have been devoted to deriving insights from their conformations3,4,5.

With the rapid advances of deep learning (DL) techniques, it has become an attractive challenge to represent and reason about macromolecules’ structures in 3D space. In particular, different sorts of 3D information, including bond lengths and dihedral angles, play an essential role. To encode them, a number of 3D geometric graph neural networks (GGNNs) or CNNs6,7,8,9 have been proposed that simultaneously preserve crucial properties of Euclidean geometry such as E(3) or SE(3) equivariance and symmetry. Notably, they are essential constituents of geometric deep learning (GDL), an umbrella term for approaches that generalize networks to Euclidean or non-Euclidean domains10.

Meanwhile, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. The abundance of 1D amino acid sequences has spurred increasing interest in developing protein language models at the scale of evolution, such as the ESM series11,12,13 and ProtTrans14. These protein language models can capture information about secondary and tertiary structure and generalize across a broad range of downstream applications. Specifically, they have recently demonstrated strong capabilities in uncovering protein structures12, predicting the effect of sequence variation on function11, learning inverse folding15, and many other general purposes13.

With the fruitful progress in protein language models, more and more studies have considered enhancing GGNNs’ ability by leveraging the knowledge of those protein language models12,16,17. This is nontrivial because, compared to sequences, 3D structures are much harder to obtain and thus less prevalent. Consequently, learning on protein structures suffers from a reduced amount of training data. For example, the SAbDab database18 contains merely 3K non-redundant antibody-antigen structures. The SCOPe database19 has 226K annotated structures, and the SIFTS database20 comprises around 220K annotated enzyme structures. These numbers are orders of magnitude lower than the dataset sizes that have inspired major breakthroughs in the deep learning community. In contrast, while the Protein Data Bank (PDB)21 holds approximately 182K macromolecule structures, databases like Pfam22 and UniParc23 contain more than 47M and 250M protein sequences, respectively.

In addition to the data size, the benefit of protein sequences to structure learning also has solid evidence and theoretical support. Remarkably, the idea that biological function and structure are documented in the statistics of protein sequences selected through evolution has a long history24. The unobserved variables that determine a protein’s fitness, including structure, function, and stability, leave a record in the distribution of observed natural sequences25. Protein language models use self-supervision to unlock the information encoded in these sequence variations, which is also beneficial for GGNNs. Accordingly, in this paper, we comprehensively investigate how the knowledge learned by protein language models can promote GGNNs’ capability (see Fig. 1). The improvements come from two major sources. Firstly, GGNNs can benefit from the information that emerges in the learned representations of those protein language models regarding fundamental properties of proteins, including secondary structure, contacts, and biological activity. This kind of knowledge may be difficult for GGNNs to acquire within a specific downstream task. To confirm this claim, we conduct a toy experiment demonstrating that conventional graph connectivity mechanisms prevent existing GGNNs from recognizing residues’ absolute and relative positions in the protein sequence. Secondly, and more intuitively, protein language models serve as an alternative way of enriching GGNNs’ training data, exposing GGNNs to more protein families and thereby greatly strengthening their generalization capability.

The protein sequence is first forwarded into a pretrained protein language model to extract per-residue representations, which are then used as node features in 3D protein graphs for GGNNs.

We examine our hypothesis across a wide range of benchmarks, covering model quality assessment, protein-protein interface prediction, protein-protein rigid-body docking, and ligand binding affinity prediction. Extensive experiments show that incorporating pretrained protein language models’ knowledge significantly improves GGNNs’ performance on various problems that require distinct domain knowledge. By utilizing the unprecedented view into the language of protein sequences provided by powerful protein language models, GGNNs promise to augment our understanding of a vast database of poorly understood protein structures. We hope our work sheds more light on how to bridge the gap between thriving geometric deep learning and mature protein language models and better leverage the different modalities of proteins.

Our toy experiments illustrate that existing GGNNs are unaware of the positional order inside protein sequences. Taking a step further, we show in this section that incorporating knowledge learned by large-scale protein language models can robustly enhance GGNNs’ capacity in a wide variety of downstream tasks.

Model Quality Assessment (MQA) aims to select the best structural model of a protein from a large pool of candidate structures and is an essential step in structure prediction26. For each recently solved but unreleased target, structure generation programs produce a large number of candidate structures. MQA approaches are evaluated by their capability to predict the global distance test (GDT-TS) score of a candidate structure relative to the experimentally solved structure of that target. The database is composed of all structural models submitted to the Critical Assessment of Structure Prediction (CASP)27 over the last 18 years, and the data is split temporally by competition year. MQA is similar to the Protein Structure Ranking (PSR) task introduced by Townshend et al.2.

Protein-protein Rigid-body Docking (PPRD) computationally predicts the 3D structure of a protein-protein complex from the individual unbound structures. It assumes that no conformational change occurs within the proteins during binding. We use Docking Benchmark 5.5 (DB5.5)28 as the database; it is a gold-standard dataset in terms of data quality and contains 253 structures.

Protein-protein Interface (PPI) investigates whether two amino acids will come into contact when their respective proteins bind. It is an important problem in understanding how proteins interact with each other; for example, antibodies recognize pathogens by binding to antigens. We use the Database of Interacting Protein Structures (DIPS), a comprehensive dataset of protein complexes mined from the PDB29, and randomly select 15K samples for evaluation.

Ligand Binding Affinity (LBA) is an essential task for drug discovery applications. It predicts the strength of a candidate drug molecule’s interaction with a target protein. Specifically, we aim to predict \(pK = -\log_{10} K\), where K is the binding affinity in molar units. We use the PDBbind database30,31, a curated database containing protein-ligand complexes from the PDB and their corresponding binding strengths. The protein-ligand complexes are split such that no protein in the test dataset has more than 30% or 60% sequence identity with any protein in the training dataset.
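As a quick illustration of the target quantity, the short snippet below converts a dissociation constant in molar units to pK; the value is hypothetical and only shows the arithmetic.

```python
import math

# Convert a binding affinity K (in molar units) to pK = -log10(K).
# A hypothetical 1 nM binder:
K = 1e-9
pK = -math.log10(K)
print(pK)  # 9.0
```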

We evaluate our proposed framework on several state-of-the-art geometric networks, implemented with PyTorch32 and PyG33, on four standard protein benchmarks. For MQA, PPI, and LBA, we use GVP-GNN, EGNN, and Molformer as backbones. For PPRD, we utilize a deep learning model, EquiDock34, as the backbone; it approximates the binding pockets and obtains the docking poses using keypoint matching and alignment. For more experimental details, please refer to Supplementary Note 3.

For MQA, we document First Rank Loss, Spearman correlation (RS), Pearson’s correlation (RP), and Kendall rank correlation (KR) in Table 1. The introduction of protein language models has brought a significant average increase of 32.63% and 55.71% in global and mean RS, of 34.66% and 58.75% in global and mean RP, and of 43.21% and 63.20% in global and mean KR, respectively. With the aid of language models, GVP-GNN achieves the best global RS, global RP, and KR of 84.92%, 85.44%, and 67.98%, respectively.
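For reference, these correlation metrics can be computed with SciPy as in the sketch below. The predicted and true GDT-TS scores are made-up placeholders, and the First Rank Loss is written under a common definition (the GDT-TS gap between the truly best candidate and the candidate ranked first by the predictor), which may differ in detail from the exact evaluation protocol used here.

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

# Hypothetical predicted vs. experimental GDT-TS scores for one CASP target.
pred = np.array([0.61, 0.72, 0.55, 0.80, 0.43])
true = np.array([0.58, 0.70, 0.60, 0.83, 0.40])

rp, _ = pearsonr(pred, true)    # Pearson's correlation (RP)
rs, _ = spearmanr(pred, true)   # Spearman correlation (RS)
kr, _ = kendalltau(pred, true)  # Kendall rank correlation (KR)

# First Rank Loss (assumed definition): GDT-TS of the best candidate minus
# the GDT-TS of the candidate ranked first by the predictor.
first_rank_loss = true.max() - true[pred.argmax()]
```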

Apart from that, we provide a full comparison with existing approaches in Table 2. We select RWplus35, ProQ3D36, VoroMQA37, SBROD38, 3DCNN2, 3DGNN2, 3DOCNN39, DimeNet40, GraphQA41, and GBPNet42 as baselines. Performance is recorded in Table 2, where the second best is underlined. It can be concluded that even though GVP-GNN is not the best architecture, once enhanced by the protein language model it largely outperforms existing methods, including the state-of-the-art no-pretraining method of Aykent and Xia42 (i.e., GBPNet) and the state-of-the-art pretraining results of Jing et al.43.

For PPRD, we report three metrics in Table 3: the complex root mean squared deviation (RMSD), the ligand RMSD, and the interface RMSD. The interface is defined by a distance threshold of 8 Å. It is noteworthy that, unlike the EquiDock paper, we do not apply the Kabsch algorithm to superimpose the receptor and the ligand; instead, the receptor protein is kept fixed during evaluation. All three metrics decrease considerably, with improvements of 11.61%, 12.83%, and 31.01% in the complex, ligand, and interface median RMSD, respectively. Notably, we also report the result of EquiDock that is first pretrained on DIPS and then fine-tuned on DB5. DIPS-pretrained EquiDock still performs worse than EquiDock equipped with pretrained language models, strongly suggesting that structural pretraining may benefit GGNNs less than pretrained protein language models do.
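A minimal sketch of the three RMSD metrics follows, assuming the receptor is kept fixed and taking the interface as ligand residues within 8 Å of the receptor in the ground-truth complex; the exact EquiDock evaluation may define the interface over both chains, and all coordinates here are random placeholders.

```python
import numpy as np

def rmsd(pred, true):
    """Root mean squared deviation between two (N, 3) coordinate arrays."""
    return float(np.sqrt(((pred - true) ** 2).sum(axis=1).mean()))

def interface_mask(rec, lig, cutoff=8.0):
    """Ligand residues within `cutoff` angstroms of any receptor residue (ground truth)."""
    dist = np.linalg.norm(rec[:, None, :] - lig[None, :, :], axis=-1)
    return dist.min(axis=0) < cutoff

# Placeholder coordinates: fixed receptor, ground-truth ligand, perturbed prediction.
rec = np.random.rand(50, 3) * 20.0
lig_true = np.random.rand(40, 3) * 20.0
lig_pred = lig_true + np.random.randn(40, 3)

ligand_rmsd = rmsd(lig_pred, lig_true)
complex_rmsd = rmsd(np.vstack([rec, lig_pred]), np.vstack([rec, lig_true]))
mask = interface_mask(rec, lig_true)
interface_rmsd = rmsd(lig_pred[mask], lig_true[mask])
```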

For PPI, we report AUROC as the metric in Fig. 2. AUROC increases by 6.93%, 14.01%, and 22.62% for GVP-GNN, EGNN, and Molformer, respectively. It is worth noting that Molformer originally falls behind EGNN and GVP-GNN in this task, but after injecting knowledge learned by protein language models, it achieves performance competitive with or even better than EGNN and GVP-GNN. This indicates that protein language models can realize the full potential of GGNNs and greatly narrow the gap between different geometric deep learning architectures. These results are remarkable because, unlike MQA, PPRD and PPI study the geometric interactions between two proteins. Although existing protein language models are all trained on single protein sequences, our experiments show that the evolutionary information hidden in unpaired sequences can also be valuable for analyzing multi-protein environments.

a Results of PPI with and without PLMs. b Performance of GGNNs on MQA with ESM-2 at different scales.

For LBA, we compare RMSD, RS, RP, and KR in Table 4. The incorporation of protein language models produces a remarkable average decline of 11.26% and 6.15% in RMSD, an average increase of 51.09% and 9.52% in RP, of 66.60% and 8.90% in RS, and of 68.52% and 6.70% in KR for the 30% and 60% identity splits, respectively. The improvements under the 30% sequence identity split are higher than those under the less restrictive 60% split. This confirms that protein language models benefit GGNNs more when the unseen samples belong to different protein domains. Moreover, in contrast to PPRD and PPI, LBA studies how proteins interact with small molecules. Our results demonstrate that the rich protein representations encoded by protein language models can also contribute to analyzing proteins’ interactions with non-protein, drug-like molecules. The result of a different data split is reported in Supplementary Table 1.

In addition, we compare thoroughly with existing approaches for LBA in Table 5, where the second best is underlined. We select a broad range of models including DeepAffinity44, Cormorant45, LSTM46, TAPE47, ProtTrans14, 3DCNN2, GNN2, MaSIF48, DGAT49, DGIN49, DGAT-GCN49, HoloProt50, and GBPNet42 as baselines. It is clear that even though EGNN is only a mid-tier architecture, it achieves the best RMSD and the best Pearson’s correlation when enhanced by protein language models, beating a group of strong baselines including HoloProt50 and GBPNet42.

It has been observed that as the size of the language model increases, there are consistent improvements in tasks like structure prediction12. Here we conduct an ablation study to investigate the effect of protein language model size on GGNNs. Specifically, we explore ESM-2 variants with 8M, 35M, 150M, 650M, and 3B parameters and plot the results in Fig. 2, which verifies that scaling the protein language model is advantageous for GGNNs. Additional results can be found in Supplementary Note 4. We also compare the influence of different sorts of PLMs in Supplementary Table 2 and investigate the difference in PLMs’ effectiveness with and without MSA in Supplementary Table 3.
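In practice, the scale ablation only swaps the ESM-2 checkpoint that produces the node features. A hedged sketch of such a loader is below; the checkpoint names follow the released fair-esm models, and the per-residue embedding widths are our assumption about those releases rather than values taken from this paper.

```python
import esm  # fair-esm package (assumed installed via `pip install fair-esm`)

# ESM-2 variants explored in the ablation, mapped to their (assumed) per-residue
# embedding widths; the GGNN input layer must match the chosen width.
ESM2_VARIANTS = {
    "esm2_t6_8M_UR50D": 320,
    "esm2_t12_35M_UR50D": 480,
    "esm2_t30_150M_UR50D": 640,
    "esm2_t33_650M_UR50D": 1280,
    "esm2_t36_3B_UR50D": 2560,
}

def load_plm(name: str = "esm2_t33_650M_UR50D"):
    """Load an ESM-2 variant and its alphabet, returning the embedding width as well."""
    model, alphabet = getattr(esm.pretrained, name)()
    return model.eval(), alphabet, ESM2_VARIANTS[name]
```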

Despite our successful confirmation that PLMs can promote geometric deep learning, several limitations and extensions of our framework are left open for future investigation. For instance, our 3D protein graphs are residue-level. We believe atom-level protein graphs would also benefit from our approach, but the magnitude of the improvement requires further exploration.

In this study, we investigate a problem that has long been ignored by existing geometric deep learning methods for proteins: how to employ the abundant protein sequence data for 3D geometric representation learning. To answer this question, we propose to leverage the knowledge learned by existing advanced pre-trained protein language models and use their amino acid representations as the initial features. We conduct a variety of experiments, such as protein-protein docking and model quality assessment, to demonstrate the efficacy of our approach. Our work provides a simple but effective mechanism to bridge the gap between 1D sequential models and 3D geometric neural networks, and we hope it sheds light on how to combine the information encoded in different protein modalities.

It is commonly acknowledged that protein structures contain much more information than their corresponding amino acid sequences, and for decades it has been an open challenge for computational biologists to predict a protein’s structure from its amino acid sequence. Though the advent of AlphaFold (AF)51 and RoseTTAFold52 has made a huge step toward alleviating the limitation imposed by the number of available experimentally determined protein structures, neither AF nor its successors such as AlphaFold-Multimer53, IgFold54, and HelixFold55 is a panacea. Their predicted structures can be severely inaccurate when the protein is an orphan and lacks a multiple sequence alignment (MSA) to serve as a template. Consequently, it is hard to conclude that protein sequences can be perfectly transformed into the structure modality by current tools and then used as extra training resources for GGNNs.

Moreover, we argue that even though the conformation is a higher-dimensional representation, the prevailing learning paradigm may prevent GGNNs from capturing the knowledge that is uniquely preserved in protein sequences. Recall that GGNNs mainly differ in how they employ 3D geometries; the input features include distances56, angles40, torsions, and terms of other orders57. The position index hidden in protein sequences, however, is usually neglected when constructing 3D graphs for GGNNs. Therefore, in this section, we design a toy experiment to examine whether GGNNs can succeed in recovering this kind of positional information.

Here the structure of a protein can be represented as an atom-level or residue-level graph \(\mathcal{G} = (\mathcal{V}, \mathcal{E})\), where \(\mathcal{V}\) and \(\mathcal{E} = (e_{ij})\) correspond to the sets of N nodes and M edges, respectively. Nodes have 3D coordinates \(\mathbf{x} \in \mathbb{R}^{N \times 3}\) and initial \(\psi_h\)-dimensional roto-translationally invariant features \(\mathbf{h} \in \mathbb{R}^{N \times \psi_h}\) (e.g., atom types and electronegativity, or residue classes). Normally, there are three options for constructing connectivity for molecules: r-ball graphs, fully-connected (FC) graphs, and K-nearest-neighbor (KNN) graphs. In our setting, nodes are linked to their K = 10 nearest neighbors for KNN graphs, and edges include all atom pairs within a distance cutoff of 8 Å for r-ball graphs.
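A small sketch of the two distance-based connectivity schemes on residue-level (C-alpha) coordinates, using the K = 10 and 8 Å settings above; the coordinates are random placeholders, and a KD-tree is used here purely for convenience.

```python
import numpy as np
from scipy.spatial import cKDTree

coords = np.random.rand(120, 3) * 40.0  # placeholder C-alpha coordinates, shape (N, 3)

def knn_edges(coords, k=10):
    """KNN graph: link each node to its K nearest neighbours."""
    tree = cKDTree(coords)
    _, nbrs = tree.query(coords, k=k + 1)  # the first neighbour is the node itself
    return [(i, int(j)) for i, row in enumerate(nbrs) for j in row[1:]]

def rball_edges(coords, r=8.0):
    """r-ball graph: link all pairs of nodes within an r angstrom cutoff."""
    tree = cKDTree(coords)
    return sorted(tree.query_pairs(r))

knn = knn_edges(coords)      # directed edge list, N * K entries
rball = rball_edges(coords)  # undirected pairs (i, j) with i < j
```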

Since most prior studies establish 3D protein graphs based purely on geometric information and ignore sequential identities, this raises the following position identity question:

Can existing GGNNs identify the sequential position order only from geometric structures of proteins?

To answer this question, we formulate two categories of toy tasks (see Fig. 3). The first is absolute position recognition (APR), a classification task in which models are asked to directly predict the position index ranging from 1 to N, where N is the number of residues in the protein. This task adopts accuracy as the metric and expects models to discriminate the absolute position of each amino acid within the whole protein sequence. The distribution of protein sequence lengths is given in Supplementary Fig. 1.

a Protein residue graph construction. Here we draw graphs in 2D for better visualization but study 3D graphs for GGNNs. b Two sequence recovery tasks. The first requires GGNNs to predict the absolute position index for each residue in the protein sequence. The second aims to forecast the minimum distance of each amino acid to the two ends of the protein sequence.

In addition, we propose a second task named relative position estimation (RPE), which focuses on the relative position of each residue. Models are required to predict the minimum distance of each residue to the two ends of the protein sequence, and the root mean squared error (RMSE) is used as the metric. This task examines the capability of GGNNs to distinguish which segment an amino acid belongs to (i.e., the center of the protein or its ends). The targets of both toy tasks can be generated directly from the sequence length, as sketched below.
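A minimal sketch of the target construction for the two toy tasks; the 1-based indexing convention is our assumption for illustration.

```python
import numpy as np

def toy_targets(n_residues: int):
    """Targets for the two sequence-recovery toy tasks."""
    apr = np.arange(1, n_residues + 1)           # APR: absolute position index, 1..N
    rpe = np.minimum(apr - 1, n_residues - apr)  # RPE: minimum distance to either sequence end
    return apr, rpe

apr, rpe = toy_targets(8)
# apr -> [1 2 3 4 5 6 7 8], rpe -> [0 1 2 3 3 2 1 0]
```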

We adopt three technically distinct and broadly accepted GGNN architectures for empirical verification. To be specific, GVP-GNN7,43 extends standard dense layers to operate on collections of Euclidean vectors, performing both geometric and relational reasoning on efficient representations of macromolecules. EGNN58 is a translation, rotation, reflection, and permutation equivariant GNN without expensive spherical harmonics. Molformer9 employs the self-attention mechanism for 3D point clouds while guaranteeing SE(3) equivariance.

We use a small non-redundant subset of high-resolution structures from the PDB. Specifically, we use only X-ray structures with resolution < 3.0 Å and enforce a 60% sequence identity threshold. This results in a total of 2643, 330, and 330 PDB structures for the train, validation, and test sets, respectively. Experimental details, the summary of the database, and the description of these GGNNs are elaborated in Supplementary Notes 1 and 2.

Table 6 documents the overall results, where metrics are labeled with ↑/↓ if higher/lower is better, respectively. It can be found that all GGNNs fail to recognize either the absolute or the relative positional information encoded in the protein sequences with an accuracy lower than 1% and an extremely high RMSE.

This phenomenon stems from the conventional ways of building graph connectivity, which usually exclude sequential information. Specifically, unlike common applications of GNNs such as citation networks59, social networks60, and knowledge graphs61, molecules do not have explicitly defined edges or adjacency. On the one hand, r-ball graphs use a cutoff distance, usually set as a hyperparameter, to determine the particle connections, but it is hard to guarantee that a single cutoff properly includes all crucial node interactions for complicated and large molecules. On the other hand, FC graphs that consider all pairwise distances cause severe redundancies, dramatically increasing the computational complexity, especially when proteins consist of thousands of residues; GGNNs are also easily confused by the excessive noise, leading to unsatisfactory performance. As a remedy, KNN graphs have become a more popular choice for establishing graph connectivity for proteins34,62,63. However, none of these schemes takes sequential information into account, requiring GGNNs to learn the original sequential order during training.

The lack of sequential information can cause several problems. To begin with, residues are unaware of their relative positions in the protein. For instance, two residues can be close in 3D space but distant in the sequence, which can mislead models attempting to recover the correct backbone chain. Secondly, by the nature of the message-passing (MP) mechanism, two residues in a protein with the same neighborhood are expected to share similar representations. Nevertheless, the roles of those two residues can differ significantly64 when they are located in different segments of the protein. Thus, GGNNs may be incapable of differentiating two residues with the same 1-hop local structures. This restriction has already been noted by several works6,65, but none of them provides a strict and thorough investigation. Admittedly, sequential order may only be necessary for certain tasks, but this toy experiment strongly indicates that the knowledge monopolized by amino acid sequences can be lost if GGNNs learn only from protein structures.

As discussed before, learning on 3D structures cannot directly benefit from large amounts of sequence data. As a result, the model sizes of GGNNs are limited; otherwise, overfitting may occur66. In contrast, comparing the number of protein sequences in the UniProt database67 to the number of known structures in the PDB, there are over 1700 times more sequences than structures. More importantly, the availability of new protein sequence data continues to far outpace the availability of experimental protein structure data, only increasing the need for accurate protein modeling tools.

Therefore, we introduce a straightforward approach to assist GGNNs with pretrained protein language models. To this end, we feed amino acid sequences into those protein language models, where ESM-212 is adopted in our case, and extract the per-residue representations, denoted as \(\mathbf{h}' \in \mathbb{R}^{N \times \psi_{PLM}}\). Here \(\psi_{PLM} = 1280\). Then \(\mathbf{h}'\) can be added or concatenated to the per-atom features \(\mathbf{h}\). For residue-level graphs, \(\mathbf{h}'\) directly replaces the original \(\mathbf{h}\) as the input node features.
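A minimal sketch of this feature-extraction step with the fair-esm package; the sequence is hypothetical, and we assume the 650M ESM-2 checkpoint, whose final (33rd) layer yields 1280-dimensional per-residue embeddings matching \(\psi_{PLM}\) above.

```python
import torch
import esm  # fair-esm package

# Load ESM-2 (650M) and its tokenizer.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVK"  # hypothetical
_, _, tokens = batch_converter([("protein", sequence)])

with torch.no_grad():
    out = model(tokens, repr_layers=[33])

# Drop the BOS/EOS tokens to obtain one 1280-d vector per residue.
h_plm = out["representations"][33][0, 1 : len(sequence) + 1]  # shape (N, 1280)

# h_plm can then be concatenated to (or, for residue-level graphs, replace) the
# original node features h of the 3D protein graph.
```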

Notably, there can be a mismatch between the experimental structure and its original amino acid sequence. Structures stored in PDB files are usually incomplete, and some strings of residues are missing due to unavoidable experimental issues68; they therefore do not perfectly match the corresponding sequences (i.e., the FASTA sequences). There are two ways to address this mismatch. On the one hand, we can simply use the fragmentary sequence as a substitute for the full amino acid sequence and forward it into the protein language models. On the other hand, we can leverage a dynamic programming algorithm provided by Biopython69 to perform pairwise sequence alignment and discard residues that do not exist in the PDB structure. We empirically find little difference between the two, so we adopt the former for simplicity.
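A hedged sketch of the alignment-based alternative with Biopython's PairwiseAligner is given below; the two sequences and the gap-scoring parameters are purely illustrative, not the values used in the paper.

```python
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "global"
aligner.open_gap_score = -10     # illustrative gap penalties
aligner.extend_gap_score = -0.5

fasta_seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGL"  # hypothetical full (FASTA) sequence
pdb_seq = "MKTAYIAKQISFVKSHFSRQLGL"         # hypothetical fragment resolved in the PDB file

alignment = aligner.align(fasta_seq, pdb_seq)[0]

# Keep only FASTA positions that align to a resolved residue in the structure.
kept_positions = [
    i
    for start, end in alignment.aligned[0]
    for i in range(start, end)
]
```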

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

The data for model quality assessment, protein-protein interface prediction, and ligand binding affinity prediction are available at https://www.atom3d.ai/. The data for protein-protein rigid-body docking can be downloaded directly from the official repository of EquiDock, https://github.com/octavian-ganea/equidock_public. Source data for figures can be found in the Supplementary Data.

The code repository is stored at https://github.com/smiles724/bottleneck. It is also deposited in ref. 70.

Xu, M. et al. Geodiff: a geometric diffusion model for molecular conformation generation. In International Conference on Learning Representations (ICLR, 2022).

Townshend, R. J. et al. Atom3d: tasks on molecules in three dimensions. 35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Wu, Z. et al. Moleculenet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).

Lim, J. et al. Predicting drug–target interaction using a novel graph neural network with 3d structure-embedded graph representation. J Chem. Inf. Model. 59, 3981–3988 (2019).

Liu, Y., Yuan, H., Cai, L. & Ji, S. Deep learning of high-order interactions for protein interface prediction. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, 679–687 (ACM, 2020).

Ingraham, J., Garg, V., Barzilay, R. & Jaakkola, T. Generative models for graph-based protein design. In Advances in neural information processing systems 32 (NeurIPS, 2019).

Jing, B., Eismann, S., Suriana, P., Townshend, R. J. & Dror, R. Learning from protein structure with geometric vector perceptrons. arXiv preprint arXiv:2009.01411 (2020).

Strokach, A., Becerra, D., Corbi-Verge, C., Perez-Riba, A. & Kim, P. M. Fast and flexible protein design using deep graph neural networks. Cell Syst. 11, 402–411 (2020).

Wu, F. et al. Molformer: Motif-based transformer on 3d heterogeneous molecular graphs. In Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 37 (2023).

Atz, K., Grisoni, F. & Schneider, G. Geometric deep learning on molecular representations. Nat. Mach. Intell. 3, 1023–1032 (2021).

Meier, J. et al. Language models enable zero-shot prediction of the effects of mutations on protein function. Adv. Neural Inf. Process. Syst. 34, 29287–29303 (2021).

Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).

Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. 118, e2016239118 (2021).

Elnaggar, A. et al. Prottrans: towards cracking the language of life’s code through self-supervised deep learning and high performance computing. IEEE. Trans. Pattern. Anal. Mach. Intell. 44, 7112–7127 (2021).

Hsu, C. et al. Learning inverse folding from millions of predicted structures. In Proceedings of the 39th International Conference on Machine Learning. Vol. 162, 8946–8970 (PMLR, 2022).

Boadu, F., Cao, H. & Cheng, J. Combining protein sequences and structures with transformers and equivariant graph neural networks to predict protein function. Preprint at https://www.biorxiv.org/content/10.1101/2023.01.17.524477v1 (2023).

Chen, C., Chen, X., Morehead, A., Wu, T. & Cheng, J. 3d-equivariant graph neural networks for protein model quality assessment. Bioinformatics 39, btad030 (2023).

Dunbar, J. et al. Sabdab: the structural antibody database. Nucleic Acids Res. 42, D1140–D1146 (2014).

Chandonia, J.-M., Fox, N. K. & Brenner, S. E. Scope: classification of large macromolecular structures in the structural classification of proteins-extended database. Nucleic Acids Res. 47, D475–D481 (2019).

Velankar, S. et al. Sifts: structure integration with function, taxonomy and sequences resource. Nucleic Acids Res. 41, D483–D489 (2012).

Berman, H. M. et al. The protein data bank. Nucleic Acids Res. 28, 235–242 (2000).

Mistry, J. et al. Pfam: the protein families database in 2021. Nucleic Acids Res. 49, D412–D419 (2021).

Bairoch, A. et al. The universal protein resource (uniprot). Nucleic Acids Res. 33, D154–D159 (2005).

Yanofsky, C., Horn, V. & Thorpe, D. Protein structure relationships revealed by mutational analysis. Science 146, 1593–1594 (1964).

Göbel, U., Sander, C., Schneider, R. & Valencia, A. Correlated mutations and residue contacts in proteins. Proteins 18, 309–317 (1994).

Cheng, J. et al. Estimation of model accuracy in casp13. Proteins 87, 1361–1377 (2019).

Kryshtafovych, A., Schwede, T., Topf, M., Fidelis, K. & Moult, J. Critical assessment of methods of protein structure prediction (casp)-round xiii. Proteins 87, 1011–1020 (2019).

Vreven, T. et al. Updates to the integrated protein–protein interaction benchmarks: docking benchmark version 5 and affinity benchmark version 2. J. Mol. Biol. 427, 3031–3041 (2015).

Townshend, R., Bedi, R., Suriana, P. & Dror, R. End-to-end learning on 3d protein structure for interface prediction. In Advances in Neural Information Processing Systems 32 (NeurIPS, 2019).

Wang, R., Fang, X., Lu, Y. & Wang, S. The pdbbind database: collection of binding affinities for protein- ligand complexes with known three-dimensional structures. J. Med. Chem. 47, 2977–2980 (2004).

Liu, Z. et al. Pdb-wide collection of binding data: current status of the pdbbind database. Bioinformatics 31, 405–412 (2015).

Paszke, A. et al. Pytorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32 (NeurIPS, 2019).

Fey, M. & Lenssen, J. E. Fast graph representation learning with pytorch geometric. In Workshop of International Conference on Learning Representations (ICLR, 2019).

Ganea, O.-E. et al. Independent se (3)-equivariant models for end-to-end rigid protein docking. In International Conference on Learning Representations (ICLR, 2022).

Zhang, J. & Zhang, Y. A novel side-chain orientation dependent potential derived from random-walk reference state for protein fold selection and structure prediction. PloS one 5, e15386 (2010).

Uziela, K., Menéndez Hurtado, D., Shu, N., Wallner, B. & Elofsson, A. Proq3d: improved model quality assessments using deep learning. Bioinformatics 33, 1578–1580 (2017).

Olechnovič, K. & Venclovas, Č. Voromqa: Assessment of protein structure quality using interatomic contact areas. Proteins: Structure, Function, and Bioinformatics 85, 1131–1145 (2017).

Karasikov, M., Pagès, G. & Grudinin, S. Smooth orientation-dependent scoring function for coarse-grained protein quality assessment. Bioinformatics 35, 2801–2808 (2019).

Pagès, G., Charmettant, B. & Grudinin, S. Protein model quality assessment using 3d oriented convolutional neural networks. Bioinformatics 35, 3313–3319 (2019).

Klicpera, J., Groß, J. & Günnemann, S. Directional message passing for molecular graphs. In International Conference on Learning Representations (ICLR, 2020).

Eismann, S. et al. Hierarchical, rotation-equivariant neural networks to select structural models of protein complexes. Proteins 89, 493–501 (2021).

Aykent, S. & Xia, T. Gbpnet: universal geometric representation learning on protein structures. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 4–14 (ACM, 2022).

Jing, B., Eismann, S., Soni, P. N. & Dror, R. O. Equivariant graph neural networks for 3d macromolecular structure. Preprint at https://arxiv.org/abs/2106.03843 (2021).

Karimi, M., Wu, D., Wang, Z. & Shen, Y. Deepaffinity: interpretable deep learning of compound–protein affinity through unified recurrent and convolutional neural networks. Bioinformatics 35, 3329–3338 (2019).

Anderson, B., Hy, T. S. & Kondor, R. Cormorant: covariant molecular neural networks. In Advances in neural information processing systems 32 (NeurIPS, 2019).

Bepler, T. & Berger, B. Learning protein sequence embeddings using information from structure. Preprint at https://arxiv.org/abs/1902.08661 (2019).

Rao, R. et al. Evaluating protein transfer learning with tape. Adv Neural Inf. Process. Syst. 32, 9689–9701 (2019).

Gainza, P. et al. Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nat. Methods 17, 184–192 (2020).

Nguyen, T. et al. Graphdta: Predicting drug–target binding affinity with graph neural networks. Bioinformatics 37, 1140–1147 (2021).

Somnath, V. R., Bunne, C. & Krause, A. Multi-scale representation learning on proteins. Adv. Neural Inf. Process. Syst. 34, 25244–25255 (2021).

Jumper, J. et al. Highly accurate protein structure prediction with alphafold. Nature 596, 583–589 (2021).

Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).

Evans, R. et al. Protein complex prediction with alphafold-multimer. Preprint at https://www.biorxiv.org/content/10.1101/2021.10.04.463034v2 (2022).

Ruffolo, J. A. & Gray, J. J. Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies. Biophys. J. 121, 155a–156a (2022).

Wang, G. et al. Helixfold: an efficient implementation of alphafold2 using paddlepaddle. Preprint at https://arxiv.org/abs/2207.05477 (2022).

Schütt, K. et al. Schnet: a continuous-filter convolutional neural network for modeling quantum interactions. In Advances in neural information processing systems 30 (NeurIPS, 2017).

Liu, Y. et al. Spherical message passing for 3d molecular graphs. In International Conference on Learning Representations (ICLR, 2021).

Satorras, V. G., Hoogeboom, E. & Welling, M. E (n) equivariant graph neural networks. In International conference on machine learning, 9323–9332 (PMLR, 2021).

Sen, P. et al. Collective classification in network data. AI Mag. 29, 93–93 (2008).

Hamilton, W., Ying, Z. & Leskovec, J. Inductive representation learning on large graphs. In Advances in neural information processing systems. 30 (NeurIPS, 2017).

Carlson, A. et al. Toward an architecture for never-ending language learning. In Twenty-Fourth AAAI conference on artificial intelligence (AAAI, 2010).

Fout, A., Byrd, J., Shariat, B. & Ben-Hur, A. Protein interface prediction using graph convolutional networks. In Advances in neural information processing systems, 30 (NeurIPS, 2017).

Stärk, H., Ganea, O., Pattanaik, L., Barzilay, R. & Jaakkola, T. Equibind: geometric deep learning for drug binding structure prediction. In International Conference on Machine Learning, 20503–20521 (PMLR, 2022).

Murphy, R., Srinivasan, B., Rao, V. & Ribeiro, B. Relational pooling for graph representations. In International Conference on Machine Learning, 4663–4673 (PMLR, 2019).

Zhang, Z. et al. Protein representation learning by geometric structure pretraining. In International Conference on Learning Representations (ICLR, 2023).

Hermosilla, P. & Ropinski, T. Contrastive representation learning for 3d protein structures. Preprint at https://arxiv.org/abs/2205.15675 (2022).

Consortium, U. Uniprot: a hub for protein information. Nucleic Acids Res. 43, D204–D212 (2015).

Djinovic-Carugo, K. & Carugo, O. Missing strings of residues in protein crystal structures. Intrinsically Disord. Proteins 3, e1095697 (2015).

Cock, P. J. et al. Biopython: freely available python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423 (2009).

Wu, F. Code for Paper ’Integration of pre-trained protein language models into geometric deep learning networks’. Zenodo https://doi.org/10.5281/zenodo.8022149 (2023).

This work is supported in part by the Institute of AI Industry Research at Tsinghua University and the Molecule Mind.

AI Research and Innovation Laboratory, Westlake University, 310030, Hangzhou, China

Fang Wu, Lirong Wu & Stan Z. Li

Department of Computer Science, Yale University, New Haven, CT, 06511, USA

Dragomir Radev

Institute of AI Industry Research, Tsinghua University, Haidian Street, 100084, Beijing, China

Jinbo Xu

Toyota Technological Institute at Chicago, Chicago, IL, 60637, USA

Jinbo Xu

F.W. and J.X. led the research. F.W. contributed technical ideas. F.W. and Y.T. developed the proposed method. F.W., D.R., and Y.T. performed the analysis. J.X. and D.R. provided evaluation and suggestions. All authors contributed to the manuscript.

Correspondence to Stan Z. Li.

The authors declare no competing interests.

Communications Biology thanks Jianzhao Gao, Arne Elofsson, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Yuedong Yang and Gene Chong.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Wu, F., Wu, L., Radev, D. et al. Integration of pre-trained protein language models into geometric deep learning networks. Commun Biol 6, 876 (2023). https://doi.org/10.1038/s42003-023-05133-1

Received: 23 March 2023

Accepted: 11 July 2023

Published: 25 August 2023

DOI: https://doi.org/10.1038/s42003-023-05133-1
