Rapid traversal of vast chemical space using machine learning-guided docking screens

Docking library preparation

The Enamine REAL Database (November 2019 version, 12.3 billion compounds) was reduced to a rule-of-four chemical subspace by excluding compounds with a molecular weight above 400 Da or a cLogP above 4 using RDKit42. The rule-of-four subspace contained 3,541,746,925 compounds. A random sample of 15 million compounds (0.4%) was drawn from this library after shuffling the Simplified Molecular Input Line Entry System (SMILES) strings with Terashuf43. Molecules were prepared for docking using standard DOCK3.7 protocols44. CXCalc (ChemAxon Marvin package, version 18.10.0) was used to calculate the predominant protomers at relevant pH levels (6.9, 7.4 and 7.9). Conformational ensembles were generated with OMEGA (OpenEye, version 2020.2) and were capped at 400 conformations with an inter-conformer root mean squared deviation (r.m.s.d.) diversity threshold of 0.25 Å. In-house preparation of these rule-of-four molecules took approximately 18 CPU-seconds per compound, and preparation of the 1 million molecules in our training set for docking was completed in approximately 5,000 CPU-hours. In the prospective screens for dual-target compounds, the Enamine REAL Database (November 2022 version, 33.5 billion compounds) was reduced to a subspace of 3,137,276,984 lead-like molecules (ZINC Database definition: 20 ≤ heavy atom count ≤ 25 and −5 ≤ cLogP ≤ 3.5) using a procedure similar to that described above. The conformational ensembles of lead-like molecules were capped at 200 conformations with an inter-conformer r.m.s.d. diversity threshold of 0.5 Å. A random sample of 1 million rule-of-four WuXi GalaXi molecules (March 2024 version) was prepared using the same protocol; the WuXi GalaXi rule-of-four space contained 1,371,598,090 compounds.
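The rule-of-four filter described above can be sketched with RDKit. This is an illustrative sketch, not the production pipeline: the exact descriptor implementations used in-house (e.g., average molecular weight versus exact mass, the cLogP method) are assumptions here.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors

def passes_rule_of_four(smiles: str) -> bool:
    """Keep molecules with molecular weight <= 400 Da and cLogP <= 4;
    compounds exceeding either limit are excluded from the subspace."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # unparsable SMILES are discarded
        return False
    return Descriptors.MolWt(mol) <= 400.0 and Crippen.MolLogP(mol) <= 4.0
```

Applied as a streaming filter, a predicate like this reduces a multi-billion-compound SMILES file to the rule-of-four subspace before any docking preparation is attempted.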

Molecular descriptors

Canonical SMILES were used to generate three different molecular descriptors as input data for the machine learning classifiers. Morgan2 descriptors were generated using RDKit29,42. Continuous data-driven descriptors were generated using the CDDD Python library31. The RoBERTa model generated descriptors directly from the SMILES. We used a pretrained RoBERTa model45 to generate the internal encoded representation of each molecule using the simpletransformers Python library46.
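Of the three descriptor types, the Morgan2 fingerprints are the simplest to reproduce. A minimal RDKit sketch follows; the bit-vector length (2,048) is an assumption, as the original work does not restate it here.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def morgan2_fingerprint(smiles: str, n_bits: int = 2048):
    """Morgan fingerprint of radius 2 ('Morgan2') as a fixed-length bit
    vector, generated from a canonical SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
```

The resulting bit vectors can be fed directly to the classifiers as binary feature arrays.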

Preparation of proteins for docking

Experimental structures of the selected protein targets were extracted from the PDB. Details regarding preparation of crystal structures for molecular docking are provided in Supplementary Table 1. Unless stated otherwise, water molecules and other solutes were removed from the experimental structures. The N- and C-termini were capped with acetyl and methyl groups, respectively, using PyMOL47. The atoms of the bound ligands were used to generate matching spheres in the binding site. DOCK3.7 uses a flexible ligand algorithm that superimposes rigid segments of a molecule’s pre-calculated conformational ensemble on top of the matching spheres44. Histidine protonation states were assigned manually after visual inspection of the hydrogen-bonding network. The remainder of the protein structure was protonated by REDUCE48 and assigned AMBER49 united atom charges. The dipole moments of key residues involved in recognition of the bound ligands were increased to favor interactions with these (Supplementary Fig. 2). This technique is common practice for users of DOCK3.7 to improve docking performance and has been used in previous virtual screens50. The atoms of the co-crystallized ligands were used to create two sets of sphere layers on the protein surface (referred to as thin spheres). One set of thin spheres described the low dielectric region of the binding site. A second set of thin spheres was used to calibrate ligand desolvation penalties. Scoring grids were pre-calculated using QNIFFT51 for Poisson–Boltzmann electrostatic potentials, SOLVMAP52 for ligand desolvation, and CHEMGRID53 for AMBER van der Waals potentials. For each protein target, known ligands were retrieved from the ChEMBL database or previous studies, followed by generation of property-matched decoys according to the procedure described in ref. 54. 
Control sets of actives and decoys were docked to the protein structures to evaluate the influence of different docking grid parameters (the radii of the electrostatic and desolvation thin spheres). Finally, ligand enrichment and predicted binding poses were used to select the optimal grid parameters50.

Homology modeling of active-state D2 dopamine receptor

Alignment of the human D2 and D3 dopamine receptor sequences was performed using the GPCRdb (https://gpcrdb.org/)55. One hundred homology models of the activated D2R bound to agonist ligand (PD128907) were constructed using MODELLER56 version 10.2 based on a cryo-EM structure of the D3 dopamine receptor complexed with the Gi protein57 (PDB accession code: 7CMV) as a template. Residues 5 to 11 were restrained to form an alpha helix and the residue pairs Cys77–Cys152 and Cys235–Cys237 were set to form disulfide bridges. Molecular docking grids of homology models were constructed using the same protocols as for the antagonist-bound D2 dopamine receptor crystal structure (PDB accession code 6CM4; Supplementary Table 1). The resulting docking grids were then evaluated for their ability to reproduce the modeled binding mode of PD128907 in the corresponding homology models and enrich 25 known D2R agonists among a set of in-house-generated decoys54. A final receptor model was selected based on agonist enrichment calculations and the presence of the conserved salt bridge interaction with Asp1143.32 in docking screens.

Molecular docking calculations

The orientational sampling parameter was set to 5,000 matches for both the rule-of-four benchmarking set and molecules selected based on machine learning predictions. Molecules in the ultralarge docking screens (235 million lead-like molecules from the ZINC15 database32) were docked using 1,000 matches. During the generation of the benchmarking dataset, for each docked compound, 18,652 orientations were calculated on average, and for each orientation, an average of 1,654 conformations were sampled. The best-scoring pose of each ligand was optimized using a simplex rigid-body minimizer.

Compound selection from docking screens

To bias the multi-billion-compound virtual screen targeting the D2R toward identification of novel chemotypes, we selected compounds that were dissimilar to known dopamine receptor ligands (>11,000 ChEMBL compounds). Molecular diversity was increased by clustering the 100,000 top-scoring docked compounds (0.003% of the entire library) by topological similarity. The best-scoring molecule from each cluster was visually inspected for its complementarity with the binding site, and a set of 31 compounds was selected from the 1,000 top-ranking clusters (Supplementary Table 10 and Supplementary Data). The make-on-demand compounds were available in the Enamine REAL Database and were successfully synthesized in 4–5 weeks.
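Clustering top-scoring compounds by topological similarity can be sketched with RDKit's Butina algorithm on Tanimoto distances between Morgan2 fingerprints. The distance cutoff and fingerprint settings are illustrative assumptions, not the parameters used in the study.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

def cluster_by_topology(smiles_list, dist_cutoff=0.5):
    """Butina clustering on Tanimoto distances between Morgan2 fingerprints.
    Returns tuples of molecule indices; the first index in each tuple is the
    cluster centroid. The 0.5 distance cutoff is an assumed value."""
    fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
           for s in smiles_list]
    dists = []  # condensed lower-triangle distance matrix
    for i in range(1, len(fps)):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        dists.extend(1.0 - s for s in sims)
    return Butina.ClusterData(dists, len(fps), dist_cutoff, isDistData=True)
```

The best-scoring member of each cluster can then be taken forward for visual inspection, as described above.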

Machine learning-accelerated virtual screening pipeline

Our workflow for combining machine learning and molecular docking (Fig. 1) consists of the following consecutive steps, which are described in detail in this section and Supplementary Fig. 1.

Step 1 Preparation and docking of the training set

A set of randomly selected molecules from an ultralarge chemical library is docked to the target protein structure (Fig. 1a–d). We recommend a training set of 1 million molecules in virtual screens of multi-billion-scale libraries.

Step 2 Generation and labeling of the training set

A docking score threshold (Fig. 1e) is selected to label each compound in the training set as either virtual active (a score better than the selected threshold) or virtual inactive (a score equal to or worse than the selected threshold). As our CP approach is based on aggregating predictions made by several classifiers, multiple independent training sets are generated. Our recommendation is to label the top-scoring 1% of the training set as virtual active and to generate 5 independent training sets.
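The labeling step can be sketched in a few lines of NumPy. Lower (more negative) DOCK3.7 scores are better, so the virtual actives are the compounds below the top-fraction quantile of the score distribution.

```python
import numpy as np

def label_by_docking_score(scores, top_fraction=0.01):
    """Binary labels for a docking-scored training set: 1 (virtual active)
    for scores better (more negative) than the top_fraction quantile,
    0 (virtual inactive) otherwise."""
    scores = np.asarray(scores, dtype=float)
    threshold = np.quantile(scores, top_fraction)
    return (scores < threshold).astype(int)
```

Running this with several random subsamples of the docked set yields the multiple independent training sets recommended above.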

Step 3 Molecule featurization and training of the classifier

Molecular descriptors of each molecule in the training set are generated as input for the classifier. Each of the training sets is used to train an independent classification model to distinguish virtual actives from inactives (Fig. 1f,g).

Step 4 Conformal prediction for the ultralarge library

The trained classification models are used to evaluate compounds from the ultralarge chemical library (Fig. 1h). Mondrian conformal predictors provide class-specific confidence levels, which allow compounds to be categorized into one of the following four sets based on a selected significance level (ε): virtual active, virtual inactive, both (virtual active or inactive) and null (no class assignment). The significance level can be tuned to control the size of the virtual active set, which is predicted to contain compounds with a docking score better than the selected threshold.
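A minimal sketch of the Mondrian conformal predictor follows, using a scikit-learn random forest as a stand-in for the classifiers used in the study, the predicted probability of the active class as the conformity score, and median aggregation of per-model p-values as described under 'Training and evaluation of machine learning classifiers'. The specific nonconformity measure is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def mondrian_p_values(clf, X_cal, y_cal, X_new):
    """Class-conditional (Mondrian) p-values for new samples: for each class,
    the fraction of same-class calibration samples that conform at least as
    poorly as the new sample. Returns an (n_new, 2) array for classes 0, 1."""
    cal = clf.predict_proba(X_cal)[:, 1]
    new = clf.predict_proba(X_new)[:, 1]
    p = np.empty((len(new), 2))
    for cls in (0, 1):
        cal_cls = cal[np.asarray(y_cal) == cls]
        for i, s in enumerate(new):
            # a sample conforms to class 1 if its score is high, to class 0 if low
            conforming = (cal_cls <= s) if cls == 1 else (cal_cls >= s)
            p[i, cls] = (conforming.sum() + 1) / (len(cal_cls) + 1)
    return p

def aggregate_p_values(p_per_model):
    """Median-aggregate (P0, P1) pairs across independent models."""
    return np.median(np.asarray(p_per_model), axis=0)

def prediction_set(p0, p1, eps):
    """Mondrian CP class assignment at significance level eps."""
    if p1 > eps and p0 > eps:
        return "both"
    if p1 > eps:
        return "active"
    if p0 > eps:
        return "inactive"
    return "null"
```

Lowering eps enlarges the prediction sets (more 'both' assignments); raising it shrinks the virtual active set, which is how its size is tuned in practice.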

Step 5 Post-processing and compound selection

The database reduction level achieved by the workflow is target dependent, and additional post-processing steps can be applied to identify the most promising compounds. The compounds assigned to the virtual active set are ranked by the quality of information (that is, prioritizing the predictions in which the classification model has the highest confidence), and a subset of these is docked to the target. Top-scoring molecules are clustered by chemical similarity and representative compounds are visually inspected (Fig. 1i,j), followed by synthesis and experimental evaluation of selected compounds (Fig. 1k,l). We recommend docking a set of 1 million to 5 million molecules selected based on the quality of information.
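The exact quality-of-information metric is not restated in this section; as an illustrative assumption, the sketch below ranks predicted virtual actives by the margin between their aggregated p-values (larger P1 − P0 taken as higher model confidence).

```python
import numpy as np

def rank_virtual_actives(p_values, eps=0.2):
    """Rank compounds in the predicted virtual active set by a confidence
    proxy. p1 - p0 is used here as an assumed stand-in for the quality of
    information (higher = more confident virtual active). `p_values` is an
    (n, 2) array of aggregated (P0, P1) pairs."""
    p0, p1 = p_values[:, 0], p_values[:, 1]
    active = np.where((p1 > eps) & (p0 <= eps))[0]  # virtual active set only
    order = np.argsort(p1[active] - p0[active])[::-1]
    return active[order]
```

The top 1 million to 5 million compounds from such a ranking would then be passed to docking.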

Training and evaluation of machine learning classifiers

CP is a QSAR method in which an ensemble of models is used to classify molecules present in an objective set (Supplementary Fig. 1). The docking scores of the datasets were used to label molecules as virtual actives (top 1%) and virtual inactives (bottom 99%), unless stated otherwise. The scikit-learn 0.24.2 package58 was used to perform a stratified split of the datasets into proper training sets (80% of the training set), calibration sets (20% of the training set) and test sets. The ratio between virtual actives and inactives was maintained in all sets. This procedure was repeated using different random seeds to obtain independent sets. The CatBoost 0.26 Python package was used for building and training the corresponding classifiers. The PyTorch 1.7.1 package59 combined with the RangerLars optimizer60 was used for training the deep neural networks. The RoBERTa classifier was implemented with the simpletransformers 0.61.6 package46. The Skorch 0.10.0 package61 was used to connect the scikit-learn and PyTorch frameworks. A detailed description and analyses of the hyperparameters used in each classifier are provided in Supplementary Table 2 and Supplementary Figs. 3–5. The compounds in the test set were assigned normalized P values (the confidence that the sample belongs to class 1, P1, or to class 0, P0) by each individual classification model and its corresponding calibration set. The resulting sets of P1 and P0 values were aggregated into a single pair of P values by taking the respective medians of the predictions made by the individual models (Supplementary Fig. 1). On the basis of the aggregated P values and the selected significance level (ε), the Mondrian CP framework was used to divide the compounds into virtual active, virtual inactive, both (meaning either virtual active or inactive) and null (no class assignment) sets. Several metrics were used to assess the performance of the conformal predictors. The sensitivity was defined as:

$${{\mathrm{Sensitivity}}}=\frac{{{\mathrm{TP}}}}{{{\mathrm{AP}}}}$$

(1)

where TP (true positives) were true active molecules correctly classified by the CP framework (that is, in the predicted virtual active and both sets). AP (all positives) were all molecules with a score better than the threshold used to define the virtual actives. The precision was defined as:

$${{\mathrm{Precision}}}=\frac{{{\mathrm{TP}}}}{{{\mathrm{TP}}}+{{\mathrm{FP}}}}$$

(2)

where FP (false positives) were true inactive molecules incorrectly classified by the CP framework (that is, in the predicted virtual active and null sets). The efficiency was defined as:

$${{\mathrm{Efficiency}}}=\frac{\left\{1\right\}+\left\{0\right\}}{{{\mathrm{AP}}}+{{\mathrm{AN}}}}$$

(3)

where {1} are the predicted virtual actives and {0} the predicted virtual inactives. AN (all negatives) were all molecules with a score worse than or equal to the threshold used to define the virtual actives. The overall error rate was defined as:

$${{\mathrm{Overall}}\, {\mathrm{error}}\,{\mathrm{rate}}}=\frac{{{\mathrm{FP}}}+{{\mathrm{FN}}}}{{{\mathrm{AP}}}+{{\mathrm{AN}}}}$$

(4)

where FN (false negatives) are true virtual active molecules incorrectly classified by the CP framework (that is, in the predicted virtual inactives and null sets). The error rate for the virtual actives was defined as:

$${{\mathrm{Actives}}\, {\mathrm{error}}\, {\mathrm{rate}}}=\frac{{{\mathrm{FN}}}}{{{\mathrm{AP}}}}$$

(5)

The error rate for the virtual inactives was defined as:

$${{\mathrm{Inactives}}\, {\mathrm{error}}\, {\mathrm{rate}}}=\frac{{{\mathrm{FP}}}}{{{\mathrm{AN}}}}$$

(6)

Computational costs and hardware specifications

To train a conformal predictor on 1 million Morgan2 fingerprints, approximately 4 GB of random access memory was required. Training and prediction were performed on 2× Intel Xeon Gold 6130 CPUs @ 2.10 GHz. The RoBERTa models were trained on a single Nvidia Tesla T4 graphics processing unit, using 12 CPU cores and 1,844 GiB of memory. The times (in seconds) required to train a conformal predictor on 1 million molecules with different architectures and descriptors, or to predict 1 million molecules, are reported in Supplementary Table 9.

Binding assays

Screening compounds (Supplementary Tables 10 and 12) were purchased from Enamine (compound purity >90%, which was confirmed by liquid chromatography−mass spectrometry and 1H NMR spectroscopy for identified ligands; Supplementary Figs. 19–30). Human D2R competition binding experiments were carried out in polypropylene 96-well plates. Each well contained 20 µg of membranes from a CHO-D2R #S20 cell line (protein concentration of 4,322 µg ml−1), 1.5 nM [3H]-spiperone (54.3 Ci mmol−1, 1 mCi ml−1, PerkinElmer NET1187250UC) and the compound studied. Non-specific binding was determined in the presence of 10 µM sulpiride (Sigma Aldrich S8010). The reaction mixture (250 µl per well) was incubated at 25 °C for 120 min, after which 200 µl was transferred to a GF/C 96-well plate (Millipore) pretreated with 0.5% PEI and treated with binding buffer (50 mM Tris-HCl, 1 mM EDTA, 5 mM MgCl2, 5 mM KCl, 120 mM NaCl, pH 7.4). The reaction mixture was filtered and washed 4 times with 250 µl wash buffer (50 mM Tris-HCl, 0.9% NaCl, pH 7.4) before the addition of 30 μl Universol. The final measurement was performed in a microplate beta scintillation counter (Microbeta Trilux, PerkinElmer). Human A2AR competition binding experiments were carried out in a multiscreen GF/C 96-well plate (Millipore) pretreated with binding buffer (Tris-HCl 50 mM, EDTA 1 mM, MgCl2 10 mM, 2 U ml−1 adenosine deaminase, pH 7.4). Each well was incubated with 5 μg of membranes from the HeLa-A2A cell line and prepared in our laboratory (lot A003/14-04-2019, protein concentration 2,058 μg ml−1), 1 nM [3H]-ZM241385 (50 Ci mmol−1, 1 mCi ml−1, ARC-ITISA 0884), and the compounds studied and standards. Non-specific binding was determined in the presence of NECA 50 μM (Sigma E2387). 
The reaction mixture (total volume of 200 μl per well) was incubated at 25 °C for 30 min, then filtered and washed 4 times with 250 μl wash buffer (Tris-HCl 50 mM, EDTA 1 mM, MgCl2 10 mM, pH 7.4), and measured in a microplate beta scintillation counter (Microbeta Trilux, PerkinElmer). Unless stated otherwise, three independent experiments were carried out to calculate Ki values.
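Ki values from competition binding assays like these are conventionally derived from fitted IC50 values with the Cheng–Prusoff correction. The Methods do not state the conversion explicitly, so this sketch reflects standard practice rather than the authors' documented procedure; the radioligand Kd is an additional required input not given above.

```python
def cheng_prusoff_ki(ic50_nM: float, radioligand_nM: float, kd_nM: float) -> float:
    """Cheng-Prusoff relation: Ki = IC50 / (1 + [L] / Kd), where [L] is the
    free radioligand concentration and Kd its dissociation constant."""
    return ic50_nM / (1.0 + radioligand_nM / kd_nM)
```

For example, with 1.5 nM [3H]-spiperone and a hypothetical Kd of 1.5 nM, the correction halves the fitted IC50.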

Functional assays

Human D2R activity was measured by determining the amount of cAMP produced. Human D2R functional experiments were carried out in a CHO-D2 #S20 cell line. Five thousand cells were seeded in 30 μl of Optimem (Invitrogen 11058) + 500 μM IBMX (Sigma 17018) on a 96-well black-and-white isoplate (PerkinElmer 6005030). Test compounds and dopamine were added to their corresponding wells and incubated for 10 min at 37 °C with gentle stirring (150 r.p.m.). Then, 10 µM forskolin (Sigma 17018) was added and incubated for 5 min at 37 °C with gentle stirring (150 r.p.m.). Reagents from the kit (#CISBIO 62AM4PEC) were added and incubated for 1 h at room temperature with gentle stirring (90 r.p.m.), protected from light. The HTRF signal (excitation wavelength: 320 nm; emission wavelengths: 620 and 665 nm) from each well was measured using a Tecan Infinite M1000 Pro. Two independent experiments were carried out to calculate EC50 and Emax values.
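EC50 and Emax values from concentration-response data of this kind are typically obtained by nonlinear regression to a Hill equation. The fitting software and exact model used by the authors are not stated, so the following SciPy sketch (fitting log10 EC50 for numerical stability, on synthetic noiseless data) is illustrative only.

```python
import numpy as np
from scipy.optimize import curve_fit

def hill_logec50(log_conc, log_ec50, emax, n):
    """Hill equation parameterized by log10(EC50): response at log10
    agonist concentration `log_conc`, with Hill slope `n`."""
    return emax / (1.0 + 10 ** (n * (log_ec50 - log_conc)))

# synthetic concentration-response curve with a known EC50 of 100 nM (1e-7 M)
log_conc = np.linspace(-10.0, -5.0, 12)
response = hill_logec50(log_conc, -7.0, 100.0, 1.0)

# recover the parameters from the data by least-squares fitting
popt, _ = curve_fit(hill_logec50, log_conc, response, p0=(-8.0, 80.0, 1.2))
log_ec50_fit, emax_fit, n_fit = popt
```

On real HTRF data the response would first be converted to cAMP (or percent of the dopamine maximum) via the kit's standard curve before fitting.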

Statistics and reproducibility

Compounds for which no molecular descriptors could be generated were excluded from the analyses. No statistical method was used to predetermine the sample size. Training and test sets were generated by taking random samples of the determined size from the virtual libraries. Proper training and calibration sets were constructed by performing a stratified split on the parent training sets, maintaining the ratio of samples belonging to the minority and majority classes. Unless stated otherwise, three independent calculations (training and predictions) were performed to derive statistics on model performance. Unless stated otherwise, three independent experiments were performed to derive statistics on the activities of the designed compounds.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
