Elnaggar, A. et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).
Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).
Meier, J. et al. Language models enable zero-shot prediction of the effects of mutations on protein function. in Advances in Neural Information Processing Systems (eds Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P. S. & Vaughan, J. W.) Vol. 34, 29287–29303 (Curran Associates, Inc., 2021).
Detlefsen, N. S., Hauberg, S. & Boomsma, W. Learning meaningful representations of protein sequences. Nat. Commun. 13, 1914 (2022).
Freschlin, C. R., Fahlberg, S. A. & Romero, P. A. Machine learning to navigate fitness landscapes for protein engineering. Curr. Opin. Biotechnol. 75, 102713 (2022).
Bepler, T. & Berger, B. Learning the protein language: evolution, structure, and function. Cell Syst. 12, 654–669.e3 (2021).
Hie, B. L., Yang, K. K. & Kim, P. S. Evolutionary velocity with protein language models predicts evolutionary dynamics of diverse proteins. Cell Syst. 13, 274–285.e6 (2022).
Madani, A. et al. Large language models generate functional protein sequences across diverse families. Nat. Biotechnol. 41, 1099–1106 (2023).
Ferruz, N., Schmidt, S. & Höcker, B. ProtGPT2 is a deep unsupervised language model for protein design. Nat. Commun. 13, 4348 (2022).
Ferruz, N. & Höcker, B. Controllable protein design with language models. Nat. Mach. Intell. 4, 521–532 (2022).
Hie, B., Zhong, E. D., Berger, B. & Bryson, B. Learning the language of viral evolution and escape. Science 371, 284–288 (2021).
Hie, B. L. et al. Efficient evolution of human antibodies from general protein language models. Nat. Biotechnol. (2023).
Georgiev, A. G. Interpretable numerical descriptors of amino acid space. J. Comput. Biol. 16, 703–723 (2009).
Sandberg, M., Eriksson, L., Jonsson, J., Sjöström, M. & Wold, S. New chemical descriptors relevant for the design of biologically active peptides. A multivariate characterization of 87 amino acids. J. Med. Chem. 41, 2481–2491 (1998).
van Westen, G. J. et al. Benchmarking of protein descriptor sets in proteochemometric modeling (part 1): comparative study of 13 amino acid descriptor sets. J. Cheminform. 5, 41 (2013).
Hie, B. L. & Yang, K. K. Adaptive machine learning for protein engineering. Curr. Opin. Struct. Biol. 72, 145–152 (2022).
Brookes, D. H., Aghazadeh, A. & Listgarten, J. On the sparsity of fitness functions and implications for learning. Proc. Natl Acad. Sci. USA 119, (2022).
Miton, C. M., Buda, K. & Tokuriki, N. Epistasis and intramolecular networks in protein evolution. Curr. Opin. Struct. Biol. 69, 160–168 (2021).
Sailer, Z. R. & Harms, M. J. High-order epistasis shapes evolutionary trajectories. PLoS Comput. Biol. 13, e1005541 (2017).
Castro, E. et al. Transformer-based protein generation with regularized latent space optimization. Nat. Mach. Intell. 4, 840–851 (2022).
Aghazadeh, A. et al. Epistatic Net allows the sparse spectral regularization of deep neural networks for inferring fitness functions. Nat. Commun. 12, 5225 (2021).
Spence, M. A., Kaczmarski, J. A., Saunders, J. W. & Jackson, C. J. Ancestral sequence reconstruction for protein engineers. Curr. Opin. Struct. Biol. 69, 131–141 (2021).
Trudeau, D. L. & Tawfik, D. S. Protein engineers turned evolutionists-the quest for the optimal starting point. Curr. Opin. Biotechnol. 60, 46–52 (2019).
Thomson, R. E. S., Carrera-Pacheco, S. E. & Gillam, E. M. J. Engineering functional thermostable proteins using ancestral sequence reconstruction. J. Biol. Chem. 298, 102435 (2022).
Hendrikse, N. M., Charpentier, G., Nordling, E. & Syrén, P.-O. Ancestral diterpene cyclases show increased thermostability and substrate acceptance. FEBS J. 285, 4660–4673 (2018).
Ishida, C. et al. Reconstruction of hyper-thermostable ancestral L-amino acid oxidase to perform deracemization to D-amino acids. ChemCatChem 13, 5228–5235 (2021).
Joho, Y. et al. Ancestral sequence reconstruction identifies structural changes underlying the evolution of Ideonella sakaiensis PETase and variants with improved stability and activity. Biochemistry 62, 437–450 (2023).
Schulz, L. et al. Evolution of increased complexity and specificity at the dawn of form I Rubiscos. Science 378, 155–160 (2022).
Islam, M. I. et al. Ancestral reconstruction of the MotA stator subunit reveals that conserved residues far from the pore are required to drive flagellar motility. Microlife 4, uqad011 (2023).
Sugiura, S. et al. Catalytic mechanism of ancestral L-lysine oxidase assigned by sequence data mining. J. Biol. Chem. 297, 101043 (2021).
Gamiz-Arco, G. et al. Heme-binding enables allosteric modulation in an ancient TIM-barrel glycosidase. Nat. Commun. 12, 380 (2021).
Araseki, H. et al. Definition of an index parameter to screen highly functional enzymes derived from a biochemical and thermodynamic analysis of ancestral meso-diaminopimelate dehydrogenases. ChemBioChem 24, e202200727 (2023).
Kajimoto, S. et al. Enzymatic conjugation of modified RNA fragments by ancestral RNA ligase AncT4_2. Appl. Environ. Microbiol. 88, e0167922 (2022).
Johnson, S. R. et al. Computational scoring and experimental evaluation of enzymes generated by neural networks. Nat. Biotechnol. (2024).
Clifton, B. E. et al. Evolution of cyclohexadienyl dehydratase from an ancestral solute-binding protein. Nat. Chem. Biol. 14, 542–547 (2018).
Kaczmarski, J. A. et al. Altered conformational sampling along an evolutionary trajectory changes the catalytic activity of an enzyme. Nat. Commun. 11, 5945 (2020).
Clifton, B. E. & Jackson, C. J. Ancestral protein reconstruction yields insights into adaptive evolution of binding specificity in solute-binding proteins. Cell Chem. Biol. 23, 236–245 (2016).
Buda, K., Miton, C. M., Fan, X. C. & Tokuriki, N. Molecular determinants of protein evolvability. Trends Biochem. Sci. 48, 751–760 (2023).
Meger, A. T. et al. Rugged fitness landscapes minimize promiscuity in the evolution of transcriptional repressors. Cell Syst. 15, 374–387.e6 (2024).
Joy, J. B., Liang, R. H., McCloskey, R. M., Nguyen, T. & Poon, A. F. Y. Ancestral reconstruction. PLoS Comput. Biol. 12, e1004763 (2016).
Minh, B. Q. et al. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol. 37, 1530–1534 (2020).
Kozlov, A. M., Darriba, D., Flouri, T., Morel, B. & Stamatakis, A. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics 35, 4453–4455 (2019).
Shimodaira, H. An approximately unbiased test of phylogenetic tree selection. Syst. Biol. 51, 492–508 (2002).
Toledo-Patiño, S., Pascarelli, S., Uechi, G.-I. & Laurino, P. Insertions and deletions mediated functional divergence of Rossmann fold enzymes. Proc. Natl Acad. Sci. USA 119, e2207965119 (2022).
Burnim, A. A., Xu, D., Spence, M. A., Jackson, C. J. & Ando, N. Analysis of insertions and extensions in the functional evolution of the ribonucleotide reductase family. Protein Sci. 31, e4483 (2022).
Miton, C. M. & Tokuriki, N. Insertions and deletions (indels): a missing piece of the protein engineering jigsaw. Biochemistry 62, 148–157 (2023).
Emond, S. et al. Accessing unexplored regions of sequence space in directed enzyme evolution via insertion/deletion mutagenesis. Nat. Commun. 11, 3469 (2020).
Jackson, C. J. et al. Conformational sampling, catalysis, and evolution of the bacterial phosphotriesterase. Proc. Natl Acad. Sci. USA 106, 21631–21636 (2009).
Afriat-Jurnou, L., Jackson, C. J. & Tawfik, D. S. Reconstructing a missing link in the evolution of a recently diverged phosphotriesterase by active-site loop remodeling. Biochemistry 51, 6047–6055 (2012).
Yang, G., Hong, N., Baier, F., Jackson, C. J. & Tokuriki, N. Conformational tinkering drives evolution of a promiscuous activity through indirect mutational effects. Biochemistry 55, 4583–4593 (2016).
Campbell, E. et al. The role of protein dynamics in the evolution of new enzyme function. Nat. Chem. Biol. 12, 944–950 (2016).
Lu, H. et al. Machine learning-aided engineering of hydrolases for PET depolymerization. Nature 604, 662–667 (2022).
Tournier, V. et al. An engineered PET depolymerase to break down and recycle plastic bottles. Nature 580, 216–219 (2020).
Son, H. F. et al. Rational protein engineering of thermo-stable PETase from Ideonella sakaiensis for highly efficient PET degradation. ACS Catal. 9, 3519–3526 (2019).
Austin, H. P. et al. Characterization and engineering of a plastic-degrading aromatic polyesterase. Proc. Natl Acad. Sci. USA 115, E4350–E4357 (2018).
Vongsouthi, V. et al. Ancestral reconstruction of polyethylene terephthalate degrading cutinases reveals a rugged and unexplored sequence-fitness landscape. Preprint at bioRxiv (2024).
Pokusaeva, V. O. et al. An experimental assay of the interactions of amino acids from orthologous sequences shaping a complex fitness landscape. PLoS Genet. 15, e1008079 (2019).
Vaswani, A. et al. Attention is all you need. in Advances in Neural Information Processing Systems (eds Guyon, I. et al.) Vol. 30 (Curran Associates, Inc., 2017).
Tokuriki, N. et al. Diminishing returns and tradeoffs constrain the laboratory optimization of an enzyme. Nat. Commun. 3, 1257 (2012).
Miton, C. M. et al. Origin of evolutionary bifurcation in an enzyme. Preprint at bioRxiv (2023).
Kaltenbach, M., Jackson, C. J., Campbell, E. C., Hollfelder, F. & Tokuriki, N. Reverse evolution leads to genotypic incompatibility despite functional and active site convergence. eLife 4, e06492 (2015).
Miton, C. M., Chen, J. Z., Ost, K., Anderson, D. W. & Tokuriki, N. Statistical analysis of mutational epistasis to reveal intramolecular interaction networks in proteins. Methods Enzymol. 643, 243–280 (2020).
Buda, K., Miton, C. M. & Tokuriki, N. Pervasive epistasis exposes intramolecular networks in adaptive enzyme evolution. Nat. Commun. 14, 8508 (2023).
D’Costa, S., Hinds, E. C., Freschlin, C. R., Song, H. & Romero, P. A. Inferring protein fitness landscapes from laboratory evolution experiments. PLoS Comput. Biol. 19, e1010956 (2023).
Chicco, D., Warrens, M. J. & Jurman, G. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput. Sci. 7, e623 (2021).
Hayes, T. et al. Simulating 500 million years of evolution with a language model. Preprint at bioRxiv (2024).
Luo, Y. et al. ECNet is an evolutionary context-integrated deep learning framework for protein engineering. Nat. Commun. 12, 5743 (2021).
Hsu, C., Nisonoff, H., Fannjiang, C. & Listgarten, J. Learning protein fitness models from evolutionary and assay-labeled data. Nat. Biotechnol. 40, 1114–1122 (2022).
Freschlin, C. R., Fahlberg, S. A., Heinzelman, P. & Romero, P. A. Neural network extrapolation to distant regions of the protein fitness landscape. Nat. Commun. 15, 6405 (2024).
Castro, E., Benz, A., Tong, A., Wolf, G. & Krishnaswamy, S. Uncovering the folding landscape of RNA secondary structure using deep graph embeddings. in 2020 IEEE International Conference on Big Data (Big Data) 4519–4528 (2020).
Daković, M., Stanković, L. & Sejdić, E. Local smoothness of graph signals. Math. Probl. Eng. 2019, 14 (2019).
Reidys, C. M. & Stadler, P. F. Combinatorial landscapes. SIAM Rev. 44, 3–54 (2002).
Biyikoğlu, T., Leydold, J. & Stadler, P. F. Laplacian Eigenvectors of Graphs (Springer, 2007).
Shuman, D. I., Narang, S. K., Frossard, P., Ortega, A. & Vandergheynst, P. The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Process. Mag. 30, 83–98 (2013).
Repecka, D. et al. Expanding functional protein sequence spaces using generative adversarial networks. Nat. Mach. Intell. 3, 324–333 (2021).
Shin, J.-E. et al. Protein design and variant prediction using autoregressive generative models. Nat. Commun. 12, 2403 (2021).
Anishchenko, I. et al. De novo protein design by deep network hallucination. Nature 600, 547–552 (2021).
Park, Y., Metzger, B. P. H. & Thornton, J. W. Epistatic drift causes gradual decay of predictability in protein evolution. Science 376, 823–830 (2022).
Lunzer, M., Golding, G. B. & Dean, A. M. Pervasive cryptic epistasis in molecular evolution. PLoS Genet. 6, e1001162 (2010).
Bridgham, J. T., Ortlund, E. A. & Thornton, J. W. An epistatic ratchet constrains the direction of glucocorticoid receptor evolution. Nature 461, 515–519 (2009).
Starr, T. N., Flynn, J. M., Mishra, P., Bolon, D. N. A. & Thornton, J. W. Pervasive contingency and entrenchment in a billion years of Hsp90 evolution. Proc. Natl Acad. Sci. USA 115, 4453–4458 (2018).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at (2018).
Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
Almagro Armenteros, J. J. et al. SignalP 5.0 improves signal peptide predictions using deep neural networks. Nat. Biotechnol. 37, 420–423 (2019).
Rozewicki, J., Li, S., Amada, K. M., Standley, D. M. & Katoh, K. MAFFT-DASH: integrated protein sequence and structural alignment. Nucleic Acids Res. 47, W5–W10 (2019).
Kalyaanamoorthy, S., Minh, B. Q., Wong, T. K. F., von Haeseler, A. & Jermiin, L. S. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat. Methods 14, 587–589 (2017).
Le, S. Q. & Gascuel, O. An improved general amino acid replacement matrix. Mol. Biol. Evol. 25, 1307–1320 (2008).
Hoang, D. T., Chernomor, O., von Haeseler, A., Minh, B. Q. & Vinh, L. S. UFBoot2: improving the ultrafast bootstrap approximation. Mol. Biol. Evol. 35, 518–522 (2018).
Yang, Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586–1591 (2007).
Paradis, E., Claude, J. & Strimmer, K. APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20, 289–290 (2004).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at (2014).
Tukey, J. W. Comparing individual means in the analysis of variance. Biometrics 5, 99–114 (1949).
Domingos, J. & Moura, J. M. F. Graph Fourier transform: a stable approximation. Preprint at (2020).
Matthews, D. & Spence, M. A. RSCJacksonLab/Local-Ancestral-Sequence-Embeddings: local-ancestral-sequence-embeddings. Zenodo (2024).