Overview of DPFunc
DPFunc is a deep learning-based method for protein function prediction that uses domain-guided structure information. The overall architecture of DPFunc is shown in Fig. 1. It consists of three modules: (1) a residue-level feature learning module, based on a pre-trained protein language model and graph neural networks, which propagates features between residues through protein structures; these structures can be native structures from the PDB database50 or structures predicted51,52 by AlphaFold235; (2) a protein-level feature learning module, which extracts whole-structure features from the residue-level features under the guidance of domain information derived from sequences; and (3) a protein function prediction module, which annotates proteins with functions based on the protein-level features.
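The structure-graph propagation in the first module can be sketched as follows. This is a minimal illustration, not the paper's implementation: the 10 Å contact cutoff, the symmetric normalization, and the ReLU-plus-skip layer are illustrative assumptions.

```python
import numpy as np

def contact_map(ca_coords, cutoff=10.0):
    """Binary residue-residue contact map from C-alpha coordinates.
    `cutoff` (in Angstroms) is an illustrative choice."""
    d = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)
    return (d < cutoff).astype(float)

def residual_gcn_layer(A, H, W):
    """One GCN layer with a ResNet-style skip connection (sketch).
    A: (L, L) adjacency, H: (L, d) residue features, W: (d, d) weights."""
    A_hat = A + np.eye(len(A))                        # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    H_new = d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W   # normalized propagation
    return H + np.maximum(H_new, 0.0)                 # ReLU, then residual add
```

Stacking several such layers lets each residue's representation absorb information from its spatial neighbors while the skip connection preserves the initial language-model features.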
The residue-level feature learning module takes protein sequences and structures as input. From the sequences, it first generates initial features for each residue using the pre-trained protein language model ESM-1b53. Simultaneously, it constructs contact maps from the corresponding protein structures. These contact maps and residue-level features are then treated as graphs with node features and fed into several GCN layers, which update them into the final residue-level features. Inspired by ResNet54, this module also applies a residual learning framework to the GCNs.

The protein-level feature learning module contains the key component that transforms residue-level insights into a comprehensive representation of the entire protein structure. It first uses InterProScan55 to scan the target protein sequences against background databases and detect the domains they contain, each represented by a unique entry. Since domains are functional units responsible for specific functions, they serve in this module as a guide for discovering significant residues, based on the residue-level features generated by the first module. Specifically, the domain entries are fed into an embedding layer to produce dense domain-level representations that capture their unique characteristics, which are then summed into protein-level domain information. To assess the importance of different residues, an attention mechanism inspired by the transformer architecture interweaves the protein-level domain features with the residue-level features and scores the importance of each residue. Protein-level features are then obtained as the weighted sum of the residue-level features with their corresponding importance scores.
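The domain-guided attention pooling described above can be sketched as a minimal single-head version. The dot-product score function and the square-root scaling are assumptions borrowed from the transformer literature, not necessarily the paper's exact design.

```python
import numpy as np

def domain_guided_pooling(residue_feats, domain_embedding):
    """Pool residue features into one protein-level vector, weighting each
    residue by its affinity to the summed domain embedding (sketch).
    residue_feats: (L, d); domain_embedding: (d,)."""
    d = residue_feats.shape[1]
    scores = residue_feats @ domain_embedding / np.sqrt(d)  # scaled dot-product
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                # softmax over residues
    return weights @ residue_feats                          # (d,) protein feature
```

The softmax weights double as per-residue importance scores, which is what later sections use to read out candidate functional sites.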
The protein function prediction module then combines the protein-level features and the initial residue-level features to annotate proteins with functions through several fully connected layers. Finally, the prediction results undergo a common post-processing procedure that enforces consistency with the hierarchical structure of GO terms. These modules are integrated into an automatic function prediction framework. The details of each module can be found in “Methods”.
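One common form of such post-processing raises each ancestor's score to at least the maximum score among its descendants, so predictions respect the ontology's true-path rule. A minimal sketch (the exact rule used by DPFunc may differ):

```python
def true_path_consistency(scores, parents):
    """Push each term's score up to its ancestors so that no parent ends up
    scoring lower than any of its children (sketch).
    scores: {term: score}; parents: {term: [direct parent terms]}."""
    out = dict(scores)
    changed = True
    while changed:                 # simple fixpoint iteration; fine for small DAGs
        changed = False
        for term, ps in parents.items():
            for p in ps:
                if out.get(p, 0.0) < out.get(term, 0.0):
                    out[p] = out[term]
                    changed = True
    return out
```

The fixpoint loop avoids needing a topological order of the GO DAG, at the cost of repeated passes.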
DPFunc outperforms existing state-of-the-art methods
To evaluate the performance of DPFunc, we first compare our method to three sequence-only baseline methods (i.e., Naive16, Blast6, and DeepGO15) and two structure-based methods (i.e., DeepFRI37 and GAT-GO38). For a fair comparison, we use the same dataset as previous studies37,38, which contains experimentally validated PDB structures and their confirmed functions (see “Methods” for dataset details). We adopt two metrics commonly used in CAFA: Fmax and AUPR (see “Methods” for details). Fmax is the maximum F-measure, i.e., the harmonic mean of paired precision and recall, over all thresholds; a higher Fmax indicates better performance. AUPR is the area under the precision-recall curve across different cut-off thresholds; likewise, a larger AUPR signifies better performance.
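A protein-centric Fmax can be computed as follows. This sketch follows the usual CAFA convention (precision averaged over proteins with at least one prediction, recall over all proteins) and may differ in detail from the evaluation script actually used.

```python
import numpy as np

def fmax(y_true, y_score, thresholds=np.linspace(0.01, 0.99, 99)):
    """Protein-centric Fmax (sketch). y_true: (N, T) bool matrix of
    protein-term annotations; y_score: (N, T) prediction scores."""
    best = 0.0
    for t in thresholds:
        pred = y_score >= t
        precisions, recalls = [], []
        for yt, yp in zip(y_true, pred):
            tp = np.logical_and(yt, yp).sum()
            if yp.sum() > 0:                       # precision only over predicted
                precisions.append(tp / yp.sum())
            recalls.append(tp / max(yt.sum(), 1))  # recall over all proteins
        if not precisions:
            continue
        p, r = np.mean(precisions), np.mean(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return float(best)
```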
The results are illustrated in Table 1. Even without the post-processing procedure, DPFunc (w/o post) outperforms the other methods in MF, CC, and BP, and post-processing enlarges these improvements further. Specifically, compared to GAT-GO, DPFunc (w/o post) improves Fmax by 8%, 5%, and 8% in MF, CC, and BP, respectively; with post-processing, these improvements reach 16%, 27%, and 23%. Similar trends hold for AUPR: DPFunc (w/o post) consistently achieves the highest performance, improving AUPR by at least 7%, 23%, and 42% in MF, CC, and BP, respectively, and post-processing improves it by a further 8%, 26%, and 19%. We further test the effect of sequence identity on these methods. As illustrated in Supplementary Fig. 1, DPFunc achieves better performance under all sequence identity cut-offs. Notably, although GAT-GO also uses protein structures and the same ESM-1b residue-level features, our model outperforms it, indicating that the domain information contained in protein sequences provides valuable insights for protein function prediction. Meanwhile, the post-processing procedure makes the predictions more logically consistent and helps improve model performance.
To enable a more comprehensive comparison with other methods6,14,15,16,26,29,33, we construct a large-scale dataset. Following the CAFA challenge, we partition it into training, validation, and test sets based on distinct time stamps (see “Methods” for details). Unlike the PDB dataset used above, this large-scale dataset encompasses more proteins and additional information, such as PPI networks and GO structures, making it possible to compare our method with other state-of-the-art (SOTA) methods. Specifically, we compare our method against two baseline methods (BlastKNN6 and Diamond14), three sequence-based methods (DeepGOCNN16, TALE33, and ATGO33), two PPI network-based methods (DeepGO15 and DeepGraphGO26), and three composite methods that integrate the results of baseline methods with their own predictions (DeepGOPlus16, TALE+33, and ATGO+29). Moreover, we choose two web servers as additional competitors, NetGO3.056 and COFACTOR57,58: NetGO3.0 is the current state-of-the-art method in the CAFA24,25 challenge, and COFACTOR is an effective structure-based tool for predicting protein functions, serving as a component of I-TASSER-MTD59 in the CASP60 challenge.
Table 2 shows the predictive performance of the various methods over five repetitions of the experiment. Notably, to ensure a fair comparison, the post-processing procedure is applied to all methods; this standardization potentially benefits those that do not inherently incorporate such processing. Despite this, DPFunc consistently outperforms all other methods in terms of Fmax and AUPR, with particularly significant improvements in AUPR: at least 9.6%, 9.3%, and 8.8% for MF, CC, and BP, respectively. Similar conclusions can be drawn from Supplementary Table 1. DPFunc also surpasses the two web servers, NetGO3.0 and COFACTOR, in the vast majority of cases, except for Fmax in BP. These comparisons further demonstrate the ability of DPFunc in protein function prediction. For a more comprehensive evaluation, based on the results in Table 2, we choose four approaches (BlastKNN, ATGO, DeepGraphGO, and ATGO+) as representatives of the baseline, sequence-based, PPI network-based, and composite methods, respectively.
DPFunc learns protein features and infers GO terms effectively even for unseen proteins with low sequence identity to known proteins. To verify this capability, we construct several protein sets from the test set, each with a distinct sequence identity threshold relative to the training proteins. As shown in Fig. 2a, DPFunc consistently outperforms the other methods in nearly all cases, except at the 50% threshold in BP, where its performance is comparable to ATGO+. Notably, the improvements of DPFunc remain stable as the identity threshold increases. This advantage is more pronounced in CC, where the rankings of ATGO+ and DeepGraphGO change with identity. The result persists when compared to all other SOTA methods (see Supplementary Fig. 2).
Beyond its overall performance, DPFunc excels at predicting informative GO terms, i.e., those with high IC values. These terms present a greater challenge because they occur rarely and have limited training samples. As illustrated in Fig. 2c, DPFunc consistently outperforms the other methods when predicting GO terms with fewer samples, and the improvement persists for more specific GO terms (IC ≥ 3). Notably, some methods, such as TALE, fail to accurately predict these informative GO terms (see Supplementary Fig. 3). Additionally, Fig. 2b, d, e show performance in terms of IC-weighted AUPR (see “Methods” for details), which, unlike plain AUPR, accounts for the informativeness of GO terms. DPFunc surpasses the other methods on this metric as well, indicating its great potential for predicting informative functions. The detailed data are provided in Supplementary Table 2.
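A standard way to derive IC values from annotation frequencies is the negative log of a term's relative frequency in the annotation corpus; the exact IC formula used in the paper may differ.

```python
import math

def information_content(term_counts, total_annotations):
    """IC of each GO term as -log of its annotation frequency (sketch).
    term_counts: {term: number of annotated proteins};
    total_annotations: corpus size used as the frequency denominator."""
    return {term: -math.log(count / total_annotations)
            for term, count in term_counts.items()}
```

Under this definition, rarely annotated (more specific) terms receive higher IC, which is why high-IC terms are the harder, more informative prediction targets.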
Since functions form a loosely hierarchical structure, functions at deeper depths are more specific, and predicting them is more meaningful. Figure 2f shows the performance of these methods on deeper GO terms (depth ≥ 8 in MF and BP, and depth ≥ 6 in CC, since the maximum depth in CC is 7), evaluating each selected GO term individually and averaging the AUPR values as the final metric. DPFunc still achieves the best performance, except for being slightly weaker than BlastKNN in BP. Notably, although ATGO+ achieves comparable scores, it can only predict part of the known functions (66.3%), as shown in Fig. 2g and Supplementary Table 3. Overall, DPFunc demonstrates a distinct advantage over SOTA methods, particularly in handling unseen proteins with low sequence identity, informative GO terms with high IC values, and specific GO terms at deeper depths.
Domain information improves the performance of DPFunc
To demonstrate the pivotal role of domain information in DPFunc, we replace the domain attention block with a mean pooling layer, a strategy commonly used in previous studies such as DeepFRI and GAT-GO. As illustrated in Fig. 3a, b, adding the guidance of domain information yields substantial improvements in both Fmax and AUPR over DPFunc w/o domain in MF, CC, and BP.
To comprehensively evaluate the differences between these two models, we examine the prediction results for each individual GO term, as shown in Fig. 3c, d. Figure 3d shows the number of perfectly predicted GO terms (AUPR = 1); DPFunc with domain annotations achieves better performance. For the remaining GO terms, Fig. 3c shows that DPFunc with domain insights achieves a median AUPR improvement of 12.0%, 14.7%, and 16.3% for MF, CC, and BP, respectively. These results substantiate the value of incorporating domain information for protein function prediction.
Moreover, to assess model reliability, we focus on predictions with high confidence scores. Specifically, we evaluate the results with the top k prediction scores of the two models, where k is determined by the average number of GO terms per protein (approximately 7 for MF, 11 for CC, and 30 for BP). As shown in Fig. 3e, DPFunc achieves better performance after incorporating domain information, with mean F-measure improvements of 1.6%–3.1% for MF, 1.9%–3.3% for CC, and 5.5%–6.7% for BP. Similar conclusions can be drawn from Fig. 3f, which shows the distribution of predictions at specific k values (5 for MF, 9 for CC, and 24 for BP). In summary, our model predicts protein functions more accurately when incorporating domain information, with the improvements being most striking in CC and BP.
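The top-k evaluation described above can be sketched with a hypothetical helper: keep each protein's k highest-scoring terms as its prediction set, compute the F-measure per protein, and average.

```python
import numpy as np

def topk_fmeasure(y_true, y_score, k):
    """Mean per-protein F-measure over the top-k predictions (sketch).
    y_true: (N, T) bool matrix; y_score: (N, T) scores; k: cutoff size."""
    f_values = []
    for yt, ys in zip(y_true, y_score):
        topk = np.argsort(ys)[::-1][:k]            # indices of k highest scores
        pred = np.zeros_like(yt, dtype=bool)
        pred[topk] = True
        tp = np.logical_and(yt, pred).sum()
        p = tp / k
        r = tp / max(yt.sum(), 1)
        f_values.append(0.0 if p + r == 0 else 2 * p * r / (p + r))
    return float(np.mean(f_values))
```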
DPFunc effectively distinguishes structure motifs and sequence identities
Since protein structures are closely related to their functions, in this section we evaluate the ability of DPFunc to discern structural motifs and their associated functions. To this end, we first select protein pairs with low sequence similarity and assess the similarity of their learned structure features against structural similarity measured by the TM-score, a metric widely adopted in structure prediction. As illustrated in Fig. 4a, DPFunc, under the guidance of domain knowledge, clearly distinguishes between these protein pairs, exhibiting a higher correlation with structural similarity (TM-score). In contrast, without domain information, the model struggles to differentiate structure features, producing consistently high feature similarities exceeding 88% and failing to capture the nuances of dissimilar structures.
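The comparison above can be sketched as follows; cosine similarity for the learned features and Pearson correlation against TM-scores are illustrative choices, not necessarily the measures used in the paper.

```python
import numpy as np

def similarity_correlation(features, tm_scores, pairs):
    """Correlate learned-feature similarity with structural similarity (sketch).
    features: list of protein-level feature vectors; pairs: list of (i, j)
    index pairs; tm_scores: TM-score for each pair, in the same order."""
    sims = []
    for i, j in pairs:
        fi, fj = features[i], features[j]
        sims.append(fi @ fj / (np.linalg.norm(fi) * np.linalg.norm(fj)))
    return float(np.corrcoef(sims, tm_scores)[0, 1])
```

A model whose feature similarities track TM-score will yield a correlation near 1; a model that assigns uniformly high similarities regardless of structure will not.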
To further illustrate the potential of DPFunc in detecting similar structural motifs even in the absence of sequence similarity, we conduct two case studies: P0C617 and Q8NGY0, two pivotal plasma membrane proteins that separate the cell from its external environment7. Despite their dissimilar sequences, these proteins share strikingly similar structures and the same function of maintaining plasma membrane integrity (GO:0005886); the details are shown in Fig. 4d, e. By scanning their sequences, DPFunc first extracts the same domain information, and these domain properties are all related to membrane functions, as validated in the UniProt database7. Fig. 4b, c then show the similar contact maps generated from their structures, and Fig. 4f, g show similar attention maps, indicating that domain-guided insight enables DPFunc to learn similar features from similar structures. These findings demonstrate the ability of DPFunc to capture structural resemblance and accurately predict functions even for disparate sequences, underscoring its potential in protein function prediction.
Additionally, there are also scenarios where proteins with high sequence identity have different structures and functions, and models must distinguish these proteins and their corresponding functions. We therefore present three proteins to evaluate DPFunc in this scenario (PDB ID: 5JZV-A, 3WG8-A, and 5Z9R-A; see Fig. 4h). As illustrated in Supplementary Tables 6 and 7, these proteins have high sequence identity but different functions. For instance, the sequence identity between 5JZV-A and 3WG8-A is 87.8%, yet they share only 5 functions. For these proteins, DPFunc predicts the functions with 100% accuracy, as shown in Supplementary Fig. 4, demonstrating the ability of our model on proteins with high sequence identity but distinct structures.
DPFunc holds the potential for annotating bacteria
As more and more protein sequences are determined, many sequenced organisms have been discovered whose protein functions remain unknown, especially bacteria and viruses61. Accurately annotating these functions is critical to understanding the roles of the corresponding proteins and their associations with disease61. In general, proteins from newly sequenced organisms lack other information, such as protein-protein interactions and gene expression, which poses challenges for existing computational approaches that rely on multiple types of biological knowledge18. Consequently, it is meaningful to annotate these proteins from sequences alone62,63.
In this study, to further explore the performance of our method, we re-divide the dataset, selecting a specific bacterium, Bacillus subtilis64, as the test data and removing all data of associated species from the training data (see Supplementary Table 4 for details). Additionally, S2F18, a network-propagation method proposed for annotating sequenced organisms, is chosen as a representative competitor. Figure 4i, j illustrate the performance of the two methods. Figure 4i shows that DPFunc performs better on the vast majority of proteins, underperforming S2F on only 3 of 47 proteins. It is worth noting that S2F obtains an F-measure near 0 on the majority of proteins, whereas DPFunc achieves significant improvements on the same proteins. Additionally, Fig. 4j shows the PR curves of the two methods, demonstrating that DPFunc yields a large improvement in AUPR and proving its potential for annotating bacteria.
DPFunc effectively detects significant active sites for enzyme functions
DPFunc can also detect significant residues in proteins that are highly correlated with functions (see “Methods” for details). For instance, in enzyme reactions, the catalytic process is carried out by specific active residues65,66,67. In this section, we provide several cases showing the capability of DPFunc in active site detection. Specifically, Q9M1Y0 and Q8S929 are two cysteine proteases involved in both the proteolytic activation and the delipidation of ATG8 family proteins68. Previous literature68 indicates that each protein has three active sites: residues 173, 368, and 370 for Q9M1Y0, and residues 170, 364, and 366 for Q8S929. Figure 5a details the prediction results for Q9M1Y0 and Q8S929. The red positions mark the key residues detected by DPFunc (CYS-170 and PRO-305 for Q8S929; CYS-173 and PRO-369 for Q9M1Y0) and the validated residues from previous literature (CYS-170, ASP-364, and HIS-366 for Q8S929; CYS-173, ASP-368, and HIS-370 for Q9M1Y0). DPFunc not only accurately predicts their functions but also highlights significant sites that overlap with known active sites. Notably, DPFunc exhibits a remarkable ability to identify potential functional hotspots between closely spaced active sites. For example, for Q8S929, where residues 364 and 366 are active sites, DPFunc identifies the intermediate residue 365 as a potential functional hotspot. This ability is attributed to the graph neural networks, which aggregate information from the two neighboring active sites.
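Reading candidate sites out of the per-residue attention weights can be sketched with a hypothetical helper; the choice of k and the simple ranking rule are assumptions for illustration.

```python
import numpy as np

def top_attention_residues(attention, k=3):
    """Return the 1-based positions of the k residues with the highest
    attention weights, as candidate functional sites (sketch).
    attention: (L,) array of per-residue importance scores."""
    top_idx = np.argsort(attention)[::-1][:k]      # k highest-scoring residues
    return sorted(int(i) + 1 for i in top_idx)     # 1-based residue numbering
```

Comparing these positions against curated active-site annotations (e.g., from literature or a catalytic-site database) gives the kind of overlap reported for Q9M1Y0 and Q8S929.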
Furthermore, we examine four proteins from the same species (Arabidopsis thaliana) with the same function (wax ester synthase/diacylglycerol acyltransferase): Q93ZR6 (WSD1), Q9M3B1 (WSD6), Q94CK0 (WSD7), and Q5KS41 (WSD11). All four are significant enzymes involved in cuticular wax biosynthesis69,70,71,72, as shown in Fig. 5b. Each of the four proteins has a known active site: HIS-147 for WSD169,70, HIS-163 for WSD670, HIS-135 for WSD770, and HIS-144 for WSD1171,72. Moreover, we align these sequences using Clustal Omega73. As illustrated in Supplementary Fig. 5, the four positions align as expected, further supporting the co-evolutionary conservation of these residues. DPFunc accurately detects all of these active sites.
Additionally, we compare our method with another SOTA method74 in the field of functional site prediction. As illustrated in Supplementary Table 8, we test the two approaches on three common proteins that appear in both our study and ref. 74; the known active sites are obtained from the M-CSA database75. The results in Supplementary Fig. 6 show that our method achieves performance comparable to the method proposed in ref. 74, further supporting the effectiveness of DPFunc in active site detection. Notably, although DPFunc detects significant active sites effectively, finding active sites in disordered regions remains a challenge to be explored in future models (see Supplementary Fig. 7).