Molecular fingerprints are the workhorse in ligand-based drug discovery. In recent years, an increasing number of research papers reported fascinating results on using deep neural networks to learn 2D molecular representations as fingerprints. It is anticipated that the integration of deep learning would also contribute to the prosperity of 3D fingerprints. Here, we unprecedentedly introduce deep learning into 3D small molecule fingerprints, presenting a new one we termed as the three-dimensional force fields fingerprint (TF3P). TF3P is learned by a deep capsular network whose training is in no need of labeled datasets for specific predictive tasks. TF3P can encode the 3D force fields information of molecules and demonstrates the stronger ability to capture 3D structural changes, to recognize molecules alike in 3D but not in 2D, and to identify similar targets inaccessible by other 2D or 3D fingerprints based on only ligands similarity. Furthermore, TF3P is compatible with both statistical models (e.g. similarity ensemble approach) and machine learning models. Altogether, we report TF3P as a new 3D small molecule fingerprint with a promising future in ligand-based drug discovery. All codes are written in Python and available at https://github.com/canisw/tf3p.
Keywords: Deep Learning; Capsular Network; 3D Molecular Fingerprint; Force Fields; Ligand-based Target Prediction.
It is a primary principle in ligand-based drug discovery that similar ligands bind to similar targetsThis statement, in essence, is to infer target similarity from only ligand information. Over the past decades of endeavor to improve this inference with more precision, molecular representation is a useful tool
and can be divided intuitively into two categories, namely 2D and 3D representations.
Since the 1970s, 2D molecular fingerprints have been developed maturely and can be classified into four types, namely substructure keys (e.g. MACCSKey
), circular fingerprints (e.g. ECFP
), topological fingerprints (e.g. atom pairs
, topological torsions
, Avalon fingerprint
) and pharmacophores. In addition to these classical molecular fingerprints designed by chemists, with the advent of deep learning in recent years, an increasing number of research papers reported on using deep neural networks (DNN) to fingerprint 2D molecules
. Of particular note is the recent Attentive FP
which achieves the state-of-art performance on several datasets in the field of ligand-based bioactivity prediction.
Compared to 2D representation, using molecular 3D representationis expected to enhance the performance of predictive models, especially in the prediction of biological targets for small molecule drugs. USR/USRCAT
is moment-based molecular 3D representations that encode molecular shape (and color for USRCAT). E3FP
, as another 3D molecular fingerprint, was inspired by ECFP and presents higher precision-recall performance than ECFP when integrated with the similarity ensemble approach (SEA)
. Besides the above rotation and translation invariant fingerprints, Gaussian-based representation (used in ROCS
and SimG
) is another example that strengthens this argument. When it further comes to voxelized representation
, it is a common practice to integrate this representation with supervised convolutional DNN and promising results have been demonstrated on the prediction of the biological target or binding affinity for small molecules
. Despite all the above, there is no 3D fingerprint learned by DNN reported thus far.
Whereas 2D molecular fingerprints generated by DNN showed excellent ability in many predictive models, most of them reported up to now have two flaws: i) Such fingerprinting models need individual training for each predictive tasks, resulting in that none of generated fingerprints contain the unbiased information within an intact molecule; ii) Similarity calculation of such fingerprints is a complication to their compatibility with similarity-based statistical models such as SEA. In 2019, Winter et al. introduced a DNN-based translation model between two equivalent chemical representations (e.g. IUPAC name and SMILES) to deal with these two flaws
. While this model can be trained as a general fingerprinter with such chemical representations and simultaneously nine more molecular properties (i.e. log P, etc.), the fact that the generated representation must be normalized over the dataset of interest before similarity calculation defines its deficiency. Our research presented here is also aimed to overcome these limitations and further push forward with the neural network to fingerprint 3D molecules.
Fingerprinting 3D molecule is to encode 3D object. We can learn a lot about the encoding method from computer graphics. Hinton et al. recently introduced CapsNetto learn to output the pose matrix of 3D objects, which can be intuitively interpreted as the orientation and position of objects in 3D space. The idea behind CapsNet is to surmount one deficiency of the convolutional neural network (CNN), as shown in Figure 1. Each capsule in CapsNet is capable to learn the instantiation parameters (i.e. values for orientation, size, etc.) of each feature and output a vector with its length as the possibility indicating the presence of that feature. Taken all features together, CapsNet outputs a matrix that represents the input. It is worth noting that capsules do not guarantee provable rotation/translation equi-/in-variance intrinsically. In 2018 and late 2019, P. Libuschewski et al.
and R. R. Sarma et al.
advanced the equivariance of capsular networks, but there is still no remedy for invariance.
Figure 1. The CNN cannot detect relationships between features and then fails to detect deformed samples, but CapsNet can.
In this paper, we construct a deep capsular neural network to fingerprint 3D small molecule, which can be trained without targeting specific predictive tasks. Molecular force fields grids, as a kind of voxelized representation employed in 3D field-based QSAR, were adopted as the inputs of our neural network. This was in the very consideration that one intended application of our fingerprint is ligand-based target prediction and that the binding of small molecules to biological targets is dominated by physical forces. All in all, molecular force fields grids can be compressed with the well-trained model into a real-valued matrix, termed as the three-dimensional force fields fingerprint (TF3P). We further demonstrated TF3P’s strong ability to capture 3D structural changes and recognize similar targets which are inaccessible by other fingerprints based on only ligands similarity. Additionally, TF3P is compatible with both statistical models and machine learning models.
Model Architecture
Our model is a modified version of CapsNet for 3D convolutions, composed of 15 layers. The first 8 layers are encoder and the last 7 layers are decoder. The overall architecture is shown in Figure 2.
Figure 2. The architecture of our model. The arrows indicate the data flow. TF3P: the three-dimensional force fields fingerprint.
Inputs. Our model takes the force fields grids of a molecule’s conformer as inputs, which have two channels, namely van der Waals potential and electrostatic potential. The grids calculation is implemented using open source force field MMFF94in Python with alkyl carbon (MMF type 1) and proton as probes without solvents for each channel, respectively. The box size is set to 20
20
20 ˚A so that it can hold common lead-like small molecule drugs. The grids’ size is set to 50
50 points with a resolution of 0
. In summary, the input tensor for each conformer has the shape of 2
Encoder. This part of the network takes the input force fields grids and learns to encode them into a 166 8 matrix as the fingerprint. The row vector dimension of 8 is chosen as the same value used by Hinton et al. The encoder consists of six 3D convolutional layers and two capsular layers. The convolutional layers serve as a feature extractor to reduce the sparse inputs’ dimension into a smaller size suitable for the subsequent capsular layers since capsular layer has hundred times of trainable parameters than convolutional layer under the same scale. Every convolutional layer has kernels with a size of 3
3 and stride 1, followed by batch normalization and LeakyReLU activation. Besides, in every two convolutional layers is inserted a max-pooling layer. After convolution, the inputs’ shape is transformed into 128
5. PrimaryCaps layer is in nature a convolutional layer with its scalar outputs chunked into vectors and then squashed. The squashing function is the novel non-linear activation function designed for capsules. DigitCaps layer receives input from all the capsules in the PrimaryCaps layer with routing by agreement and produces a matrix with a size of 166
Decoder. The decoder part takes the outputs of DigitCaps, the fingerprint, to reconstruct the input force fields grids. This part is used as a regularizer, whose job is to encourage the digit capsules to encode the pose information into each digit. Therefore, we simply use a nearly symmetric architecture consisting of a linear layer, transposed 3D convolutional layers with upsampling to decode the outputs of DigitCaps.
Model Training
Our model was implemented with Pytorch 1.3and trained with Adam
optimizer. The learning rate was set to 0.001, and all other parameters were default. The loss function we used was the same as that of CapsNet, which has two parts: Margin loss and reconstruction loss.
where the two decomposed losses use similar forms as Hinton et al.
where k is digital capsule index; = 0.9 and
= 0
= 1 if a digit of class k is present;
5. The margin loss here was calculated with 2D molecular fingerprints as labeled digit classes. Both MACCSKey and ECFP4 were tested here for training. The final model for evaluation was trained on 1% random samples of the full prepared data set (~ 3 million compounds, further explained below) with MACCSKey over 10 epochs with a reconstruction weight of 0.005. All fingerprints were calculated and stored for training in advance, but the force fields grids were calculated on the fly. The training process for the final model took up to about one week on three NVIDIA TITAN RTX GPUs.
Data set
ZINC15 3D lead-like (logP <= 3.5, 250 < MW < 500) subset was retrieved to train our model. A total number of ~286 million molecules with pre-generated conformers were downloaded from the ZINC database. Approximately 25 thousand molecules (less than 0.01%) that cannot be parameterized by the MMFF94 force field or cannot be sanitized by RDKit (v2019.09.2)
were filtered out and excluded from the subsequent training process. However, given the huge calculation cost of force fields grids and the large size of the full data set, not all these data were actually used in the study. We randomly sampled various proportions of the full data set and split it randomly into a training set and test set by 9:1 for model training.
ChEMBL25 were retrieved for the evaluation. The distribution of the number of rotatable bonds was analyzed for all molecules within ChEMBL25. For the assessment of fingerprints’ sensitivity to 3D structural changes, 1000 molecules were randomly sampled for each number of rotatable bonds ranging from 0 to 15. To obtain a relatively wide range of RMSD values, 100 conformers were generated for each molecule, and then 1 conformer was selected as the reference and another 99 were aligned to the reference with RMSD calculated and ranked. Finally, 10 out of 99 were selected with equal RMSD intervals from high to low and calculated similarities to the reference with various fingerprints. Conformation generation and alignment were implemented with RDKit (v2019.09.2)
. RMSD was calculated using rmsdcalc utility from Schrodinger Suites 2018-1
PDBbind v2018 General Setwas retrieved for the evaluation. Only the complexes with pActivity (pK
, pIC
, etc.) greater than 6 were kept as “active” samples and used subsequently. As a conformation-dependent fingerprint without rotation invariance, ligand pairs need to be aligned to each other before fingerprinting and similarity calculation. We investigated the performance of our model in three situations: i) Query co-crystallized conformation aligned to co-crystallized conformation (X2X) with RDKit; ii) Query low-energy conformation aligned to co-crystallized conformation (LE2X) with RDKit; iii) Query flexibly aligned to co-crystallized conformation with Omega
. The first one was aimed to rule out bad effects resulting from conformation sampling strategy; the latter two were designed to match the real target prediction scenario.
The dataset for solubility and malaria bioactivity prediction was benchmarked by Duvenaud et al.
and Kearnes et al.
Fingerprints calculation
Except that 166-bits ECFP4 was used only in the model training, 1024-bits one was utilized in both training and all the evaluations. Both ECFP4 and MACCSKey were computed using RDKit. 1024-bits E3FP was computed using the default parameters and codes provided by the author in his GitHub repo (https://github.com/keiserlab/e3fp-paper/tree/1.1/e3fp_paper). USR/USRCAT were also evaluated in this study, calculated with RDKit.
Similarity calculation for two fingerprints
Tanimoto coefficient was used for MACCSKey, ECFP4, and E3FP. USRScore implemented in RDKit was used for USR/USRCAT. As for real-valued TF3P, since it is a matrix of which each row vector has its independent structural meaning, it is intuitive to calculate the similarity row-wisely. We designed a weighted mean of the cosine similarity of each row vector to calculate the similarity between two TF3Ps, M, as the following form:
where p
Training performance of models with different super parameters
As each digit represents a certain 2D structural feature, 2D fingerprints were used as digit class labels during the training process. Under this circumstance, each digit capsule had to learn to output an eight-dimensional vector containing the 3D force fields information in regard to each 2D structural feature.
We first investigated what proportion and epochs for training could make a compromise between model performance and time cost for training. As shown in Figure 3 A, the more data was fed into the model, the lower did the loss value achieve. An interesting point is that the loss value decreases to almost the same after the same amount of the total fed data, no matter what proportion of the full dataset sampled. This indicates a great structural redundancy of the ZINC database. In summary, 1% of the full dataset (~ 3 million compounds) is sufficient for the convergence of the loss function after 10 epochs of training with MACCSKey. The grid size (GS) and reconstruction weight (W) were set to 50 and 0.005, respectively.
Figure 3. Training performance with different super parameters. A) The loss value of test set when training with super parameters as MACCSKey-b166-GS50-W5. B) The decomposed loss values of test set when training with indicated super parameters. b166/b1024, 166-bit/1024-bit version of indicated fingerprint; GS50/GS64, grid size set as 50 64; W1/W5, reconstruction set as 1/0.005. For loss curves of train/test set during training, see Figure S1.
After the optimal proportion and training epochs determined, we assessed how much the training performance dependent on other super parameters of the model, i.e. W, GS, and 2D fingerprints. As shown in Figure 3 B, increasing the W to 1 would make the reconstruction loss dominate the margin loss during training (MACCSKey-b166-GS50-W5 vs. MACCSKey-b166-GS50-W1) but the reconstruction is expected to play as a regularizer and less important than margin loss, which was mentioned by Hinton et al.We next found that increasing GS to 64
64 contributed a lot to the decrease of the reconstruction loss but a little to the margin loss (MACCSKey-b166-GS50-W5 vs. MACCSKey-b166-GS64-W5). Furthermore, another fingerprint, ECFP4, was evaluated to show how much different types of fingerprints impacted on the training performance. To make the results comparable, two versions, i.e. 166-bit one and 1024-bit one, were included. Compared with MACCSKey, the loss value convergent pretty slowly when training with 166-bit ECFP4 (MACCSKey-b166-GS50-W5 vs. ECFP4-b166-GS50-W5), but 1024-bit version demonstrated comparable results with the model with MACCSKey, GS = 64, and W = 0.005 (ECFP4-b1024-GS50-W5 vs. MACCSKey-b166-GS64-W5).
The capsular networks are often constrained by computation efficiency. Our models would certainly slow down when the number of network parameters grew due to either the increase of grid size or the increase of the number of digital capsules (i.e. the number of bits of 2D fingerprints). Results from a simple test (Table 1) shows additional non-trivial time costs when more complex models used, e.g. MACCSKey-b166-GS64-W5 and ECFP4-b1024-GS50-W5, which retarded further training with other super parameters, e.g. ECFP4-b1024-GS64-W5. All in all, three out of five models were retained to conduct the next evaluation task, namely MACCSKey-b166-GS50-W5, MACCSKey-b166-GS64-W5, and ECFP4-b1024-GS50-W5, which were referred to as TF3P-50, TF3P-64, and TF3P-1024, respectively.
Table 1. Time costs of training and inference for models with different super parameters.
: 1% proportion of the full dataset over 10 epochs on three NVIDIA TITAN RTX GPUs.
samples from ChEMBL on one NVIDIA GeForce 2080 Ti GPU.
Capturing 3D structural changes of molecules
The basic virtue of 3D fingerprint is the capability to discriminate different conformers of a molecule. To assess fingerprints’ ability to capture conformational changes, we sampled 1000 molecules with the number of rotatable bonds ranging from 0 to 15 (This range was determined by the distribution of the number of rotatable bonds in ChEMBL25, Figure S2). The RMSD values and similarity by fingerprints between different conformers of each molecule were calculated. As the number of rotatable bonds increases, the overall RMSD rises, denoting the growing conformational space of molecule (Figure 4). 2D fingerprints unable to encode 3D conformational information produce similarity values of one for all entries. Similarities by three versions of TF3P demonstrated comparably strong correlations with RMSD, measured by Pearson’s r coefficients (-0.71 ~ -0.72) (Figure 4 and Table S1). E3FP, USR, and USRCAT do have the ability to apprehend 3D conformational changes but yield lower coefficients than TF3P. Taken together, TF3P is a better predictor, i.e. more sensitive, to the actual 3D structural changes because TF3P can provide similarity rankings in better accordance with the 3D structural changes, which is actually we need when quantifying chemical similarity, despite the fact that TF3P yields generally higher similarity values. In addition, it is unreasonable that some pairs of conformers with pretty low RMSD (< 1.0 Å) yield low similarities by E3FP (Down left points in Figure 4 D, a typical example shown in Figure S3).
Figure 4. The correlation between RMSD and similarity by TF3P-50 (A), TF3P-64 (B), TF3P-1024 (C), E3FP (D), USR (E), and USRCAT (F). Each point represents a pair of conformers belonging to a molecule, colored by its number of rotatable bonds. The density plots are colored in the same way. TF3P-50, TF3P-64, and TF3P-1024 refer to MACCSKey-b166-GS50-W5, MACCSKey-b166-GS64-W5, and ECFP4-b1024-GS50-W5, respectively.
Two examples illustrated what discussed above. CHEMBL4116653 can undergo conformational isomerization because of its freely rotatable single bonds (Figure 5 A). Among three conformers of CHEMBL4116653, conf_1 is very similar to conf_3 but far dissimilar to conf_2, indicated by the divergence in RMSD values (1.89 Å vs. 6.63 Å). The similarities by TF3P are consistent with this divergence (0.88 vs. 0.52), but not by E3FP (0.28 vs. 0.28). Another case is a series of natural products with four chiral carbons, a pair of enantiomers plus a structural isomer (Figure 5 B). 2D fingerprints encounter bafflement when distinguishing between a pair of enantiomers, Anti-hh and Anti-hh’, indicated by the similarities of one. Since they are isomers, the RMSD values among them can be also calculated to indicate the 3D conformational deviation (2.89 Å vs. 3.81 Å, for Anti-hh & Anti-hh’ and Anti-hh & Anti-ht, respectively). Similar situations occur for TF3P and E3FP, respectively: The similarity by E3FP hardly changes (0.30 vs. 0.29), whereas the similarity by TF3P shifts accordingly with the RMSD (0.78 vs. 0.73). Besides, there are also counter examples where TF3P failed to capture some important conformational changes. As shown in Figure 5 C, CHEMBL1789181 undergoes conformational isomerization with part of the structure flipped over, the aromatic ring moiety and the aliphatic chain moiety swapping position with each other. Unlike USRCAT and E3FP, TF3P showed its deficiency in apprehending this kind of change.
Figure 5. A, B) Two promising examples with RMSD, similarities, 2D structures and 3D aligned conformers. C) One counter example with RMSD, similarities, 2D structures and 3D aligned conformers. TF3P here refers to TF3P-50.
Finding similar pockets by similar ligands
The primary intuition of using molecular 3D force fields grids has two aspects. On the one hand, force fields contain the raw information dominating the interaction between drugs and targets; on the other hand, force fields are degenerate of atom types and bonds of molecular graph representation. TF3P, as a fingerprint derived from force fields, therefore, was expected to be capable to recognize molecule pairs that are dissimilar in 2D topology but resemble each other in 3D shape and electrostatics. Given the 3D complementation between the ligands and the pockets they bind to, this is deemed to improve the precision of the targets similarity inference based on ligands similarity, which is the essence of similarity-based target prediction. In this study, targets similarity was represented by FuzCav pockets similarity, with particular emphasis on the binding. PDBbind v2018 General Set was selected to study the inference further with three situations (i.e. X2X, LE2X, and Flex) evaluated to meet the concerns on conformation generation, as mentioned previously in Methods & Materials. Given the time costs and performances discussed before, we only included TF3P-50 out of three models in this part and referred to as simply as TF3P.
As shown in Table 2, the overall (Top 100%) Pearson’s r coefficients of FuzCav similarity vs. the similarity by all fingerprints in all situations are low (0.08-0.20), suggesting the toughness to perfectly infer target similarity from only ligands information. Nevertheless, it is noteworthy that the correlations are generally higher in top-ranked samples by similarity than bottom-ranked ones with no exception. The same trend occurs in Spearman’s rho (Table 3) and Kendall’s tau coefficients (Table 4). This probably results from that subtle differences between similar ligands correspond to a similar degree of conformational changes between pockets and thus presenting correlations easy to fit for top-ranked pairs by fingerprints similarity. However, in contrast to similar ligands binding to pockets alike, every ligand dissimilar to others binds to its target in its own way, causing complications to the regression for dissimilar pairs. As for top-ranked pairs, good correlations only mean similar orders in two arrays of samples, not the exact values. We further analyzed the average values of FuzCav similarities of samples with incrementing ranking thresholds. As shown in Figure 6, top-ranked pairs by TF3P similarity present significantly higher average FuzCav pocket similarities than those by USRCAT, E3FP and 2D fingerprints under the same situation. Taken together, compared to other fingerprints, similar ligands found by TF3P do bind to similar targets and their ligand similarity is a better indicator of the pocket similarity.
Table 2. Pearson correlation coefficients of FuzCav similarity vs. different fingerprints similarity for all active pairs from PDBbind v2018 General Set.
X2XKey ECFP4TF3P E3FP USR
: Sorted by the indicated fingerprint similarity.
: Co-crystallized conformation aligned to co-crystallized conformation.
: Low-energy conformation aligned to co-crystallized conformation.
: Flexibly aligned to co-crystallized conformation.
Table 3. Spearman rank correlation coefficients of FuzCav similarity vs. different fingerprints similarity for all active pairs from PDBbind v2018 General Set.
X2XKey ECFP4TF3P E3FP USR
: Sorted by the indicated fingerprint similarity.
: Co-crystallized conformation aligned to co-crystallized conformation.
: Low-energy conformation aligned to co-crystallized conformation.
: Flexibly aligned to co-crystallized conformation.
Table 4. Kendall rank correlation coefficients of FuzCav similarity vs. different fingerprints similarity for all active pairs from PDBbind v2018 General Set.
X2XKey ECFP4TF3P E3FP USR
: Sorted by the indicated fingerprint similarity.
: Co-crystallized conformation aligned to co-crystallized conformation.
: Low-energy conformation aligned to co-crystallized conformation.
: Flexibly aligned to co-crystallized conformation.
Figure 6. Average pocket similarity for samples with incrementing cutoffs in ranking (Top 1% is equivalent to top 10
In fact, we only pay attention to top-ranked outputs in the real scenarios of ligand-based drug discovery. Like the early-stage enrichment is of vital importance for virtual screening, we also need a stronger “early-stage enrichment” for target prediction based on only ligands similarity. This is to say, a good fingerprint for ligand-based target prediction need have the ability to rank the targets whose pockets are similar to the true target in the top of the targets pool and with an order that is well correlated to their pockets similarity. To this end, TF3P outperforms the existing 3D and 2D fingerprints, indicated by the higher average FuzCav similarities and higher correlations coefficients under all three situations in top 1% shown in Table 2, 3, and 4, and Figure 6. This is not that mention that TF3P-LE2X can achieve comparable results with all fingerprints under Flex situation (Figure 6).
Additionally, another noteworthy point is that the correlation for top ranked ones can be generally ranked as: X2X > Flex > LE2X. This is of great significance because many reported that 2D fingerprints outperformed 3D descriptors under the condition of using generated low-energy conformers. Our results suggest that those barely satisfactory performance of 3D fingerprints are likely to be attributed to not use co-crystallized conformer (i.e. bioactive conformer), which urges more endeavors to develop new computational tools meeting this demand.
Here, we also show three promising examples among the most similar pairs by TF3P that are not alike by other fingerprints (Figure 7 A, B, and C), and one counter example (Figure 7 D). Each pair binds to the same target. As the similarity values calculated by different fingerprints over the same dataset show divergent distributions (Figure S4), the percentile ranks were computed and attached below the exact values of similarity to demonstrate how similar different fingerprints regard them. As shown in the aligned crystal structures (Figure 7 A, B, and C), all three pairs have obviously similar 3D force fields but dissimilar in 2D topology. This leads to the difficulty to recognize their resemblance for the existing fingerprints based on topological structure but not for TF3P that is derived from 3D force fields. Therefore, T3FP can rank these pairs at the top places (Top 1%) based on ligands similarity but others cannot. However, TF3P can miss under some circumstances. Figure 7 D shows a typical case where two ligands bind to the same target but have distinct structures. Since the multi-benzene moiety and the aliphatic chain both serve as spacer, only the shape of the spacer counts here but not the charges, providing a possible explanation for the behavior of TF3P that is based on both vdW potential and electrostatic potential.
Figure 7. A, B, C) Three top-ranked similar pairs by TF3P but are inaccessible by other fingerprints. In terms of receptors, GluA2 for 4O3B and 5NG9, Protein Tyrosine Phosphatase 1B for 2CMB and 2CNE, HIV-1 Protease for 2WKZ and 1B6K. D) One counter example TF3P yielded divergent results against FuzCav. Human carbonic anhydrase II for 5E2R and 3IBL.
SEA analysis of PDBbind database
SEA is a widely used statistical model that quantitatively groups and relates proteins based on the chemical similarity of their ligands. We utilized SEA to analyze all targets that annotated no less than 10 active complexes within PDBbind v2018 General Set with five fingerprints, namely TF3P, USRCAT, E3FP, MACCSKey and ECFP4. The results demonstrate a similar trend as the above study: TF3P can enrich more pairs of the targets that resemble each other by pocket similarity in the targets pool (Figure 8 A), yielding higher average FuzCav similarity in top-ranked pairs. This indicates TF3P as a better choice for SEA-based target prediction, compared to other fingerprints.
One primary application of SEA is to cluster proteins using only ligands similarity to find some links unexpected. First, it is not beyond anticipation that TF3P outputs a clustering of targets closer to FuzCav similarity, measured by adjusted Rand score (Figure 8 B; for specific explanation of adjusted Rand score, see Supplementary File 1). Moreover, some interesting links indeed emerged when we conducted SEA analysis with TF3P (Figure 8 C). The first one is a cluster of HIV relevant proteins, including HIV protease, HIV integrase, and HIV reverse transcriptase, which are highly pharmacologically related. The second one is endothiapepsin (Uniprot ID: P11838) from Cryphonectria parasitica and plasmepsin-2 (Uniprot ID: P46925) from Plasmodium falciparum, which function similarly as aspartic-type endopeptidase although they are from different organisms. The third one is a pair of nuclear receptors, retinoic acid receptor RXR-alpha (Uniprot ID: P19793) and peroxisome proliferator-activated receptor gamma (Uniprot ID: P37231), which are not only biologically related but also have a very conserved 3D structure of the ligand binding domain (Figure 8 D). Notably, the latter two did not emerge when using other four fingerprints (Figure S5).
Figure 8. SEA analysis of the PDBbind database. A) Average pocket similarity for targets pairs with incrementing cutoffs in the ranking. B) The distance of FuzCav clustering of targets to clustering by five fingerprints, measured by adjusted Rand score. Higher means closer. C) Heatmap of E-value of indicated targets calculated by SEA. Each point is colored by its E-value with lower values as green and higher as yellow. For high-resolution images, see Supplementary File 2. D) Aligned crystal structures of ligand binding domains of two indicated targets.
Application with machine learning models
Aside from calculating the similarity between two molecules, another important application of fingerprints is to be used as the inputs for machine learning models. To this end, we used a very simple model, single linear layer neural network (i.e. linear regression), to demonstrate TF3P’s compatibility and performance on several tasks. TF3P achieves the lowest MSE value in 5-fold cross-validation among all of the five fingerprints on solubility prediction and outperforms other 3D fingerprints on malaria bioactivity prediction (Table 5), making itself a promising alternative when integrated with machine learning models.
Table 5. Prediction of solubility and malaria bioactivity with different fingerprints.
Fingerprint Solubility log10 (mol/L) Malaria bioactivity
ln (
Fingerprint Solubility log10 (mol/L) Malaria bioactivity
ln (
: Evaluation matrix: mean ± std. of MSE
In summary, we developed a new fingerprint of 3D molecule with the deep capsular network, which learned to encode the 3D force fields information into each digit of 2D fingerprint without labeled data for specific predictive tasks. Ligands similarity can be intuitively calculated with the fingerprint produced by our model, TF3P, and thus making it compatible with statistical models. Also, TF3P could be applied with machine learning models as the inputs and present promising results. Since TF3P is derived from molecular 3D force fields, it demonstrated its ability to recognize molecule pairs that are dissimilar in 2D topology but resemble each other in 3D shape and electrostatics, which improves its ability to infer pockets similarity based on only ligands similarity, detecting pairs of ligands with similar targets inaccessible by other fingerprints. TF3P is anticipated to be a better choice of fingerprints for ligand-based drug discovery in the future. Further development and benchmarking studies of TF3P-based target prediction software are still going on.
Supplementary Information
Supplementary File 1: Supplementary figures and tables.
Supplementary File 2: Supplementary files.
This research was supported by the National Key Research and Development Project (Grant numbers 2019YFC1708900), the National Natural Science Foundation of China (Grant numbers 81872730, 81673279, 21772005), the National Major Scientific and Technological Special Project for Significant New Drugs Development (2019ZX09204-001, 2018ZX09735001-003) and the Beijing Natural Science Foundation (7202088, 7172118).
ECFP, Extended Connectivity Fingerprint; E3FP, Extended 3-Dimensional Fingerprint; DNN, Deep Neural Network; SEA, Similarity Ensemble Approach; ROCS, Rapid Overlay of Chemical Structures; QSAR, Quantitative Structure-Activity Relationship; MMFF, Merck Molecular Force Field; ReLU, Rectified Linear Unit; ZINC, ZINC Is Not Commercial; MSE, Mean Squared Error; RMSD, Root-Mean-Square Deviation of Atomic Positions, IUPAC, International Union of Pure and Applied Chemistry; SMILES, Simplified Molecular-Input Line-Entry System.
The authors report no conflicts of interest. The authors alone are responsible for the content and writing of this article.
1. Boström, J.; Hogner, A.; Schmitt, S., Do Structurally Similar Ligands Bind in a Similar Fashion? J. Med. Chem. 2006, 49, 6716-6725.
2. Karelson, M.; Lobanov, V. S.; Katritzky, A. R., Quantum-Chemical Descriptors in Qsar/Qspr Studies. Chem. Rev. 1996, 96, 1027-1044.
3. Todeschini, R.; Consonni, V., Handbook of Molecular Descriptors. Wileyvch, Weinheim. 2000; Vol. 11.
4. Jiang, P.; Saydam, S.; Ramandi, H. L.; Crosky, A.; Maghrebi, M. Deep Molecular Representation in Cheminformatics. In Handbook of Deep Learning Applications, Balas, V. E.; Roy, S. S.; Sharma, D.; Samui, P., Eds.; Springer International Publishing: Cham, 2019, pp 147-159.
5. Li, X.; Li, Z.; Wu, X.; Xiong, Z.; Yang, T.; Fu, Z.; Liu, X.; Tan, X.; Zhong, F.; Wan, X.; Wang, D.; Ding, X.; Yang, R.; Hou, H.; Li, C.; Liu, H.; Chen, K.; Jiang, H.; Zheng, M., Deep Learning Enhancing Kinome-Wide Polypharmacology Profiling: Model Construction and Experiment Validation. J. Med. Chem. 2019, In press.
6. Durant, J. L.; Leland, B. A.; Henry, D. R.; Nourse, J. G., Reoptimization of Mdl Keys for Use in Drug Discovery. J. Chem. Inf. Comput. Sci. 2002, 42, 1273-1280.
7. Rogers, D.; Hahn, M., Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742-754.
8. Carhart, R. E.; Smith, D. H.; Venkataraghavan, R., Atom Pairs as Molecular Features in Structure-Activity Studies: Definition and Applications. J. Chem. Inf. Comput. Sci. 1985, 25, 64-73.
9. Nilakantan, R.; Bauman, N.; Dixon, J. S.; Venkataraghavan, R., Topological Torsion: A New Molecular Descriptor for Sar Applications. Comparison with Other Descriptors. J. Chem. Inf. Comput. Sci. 1987, 27, 82-85.
10. Gedeck, P.; Rohde, B.; Bartels, C., Qsar - How Good Is It in Practice? Comparison of Descriptor Sets on an Unbiased Cross Section of Corporate Data Sets. J. Chem. Inf. Model. 2006, 46, 1924-1936.
11. Huo, H.; Rupp, M., Unified Representation of Molecules and Crystals for Machine Learning. arXiv preprint arXiv:1704.06439 2017.
12. Sch tt, K. T.; Arbabzadah, F.; Chmiela, S.; M ller, K. R.; Tkatchenko, A., Quantum-Chemical Insights from Deep Tensor Neural Networks. Nat. Commun. 2017, 8, 13890.
13. Lubbers, N.; Smith, J. S.; Barros, K., Hierarchical Modeling of Molecular Energies Using a Deep Neural Network. J. Chem. Phys. 2018, 148, 241715.
14. Sch tt, K. T.; Sauceda, H. E.; Kindermans, P. J.; Tkatchenko, A.; M ller, K. R., Schnet C a Deep Learning Architecture for Molecules and Materials. J. Chem. Phys. 2018, 148, 241722.
15. Xiong, Z.; Wang, D.; Liu, X.; Zhong, F.; Wan, X.; Li, X.; Li, Z.; Luo, X.; Chen, K.; Jiang, H.; Zheng, M., Pushing the Boundaries of Molecular Representation for Drug Discovery with the Graph Attention Mechanism. J. Med. Chem. 2019, In press.
16. Duvenaud, D. K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; Adams, R. P. Convolutional Networks on Graphs for Learning Molecular Fingerprints. In Advances in Neural Information Processing Systems, 2015; 2015; pp 2224-2232.
17. Kearnes, S.; McCloskey, K.; Berndl, M.; Pande, V.; Riley, P., Molecular Graph Convolutions: Moving Beyond Fingerprints. J. Comput. Aided Mol. Des. 2016, 30, 595-608.
18. Shin, W.-H.; Zhu, X.; Bures, G. M.; Kihara, D., Three-Dimensional Compound Comparison Methods and Their Application in Drug Discovery. Molecules 2015, 20, 12841-12862.
19. Schreyer, A. M.; Blundell, T., Usrcat: Real-Time Ultrafast Shape Recognition with Pharmacophoric Constraints. J. Cheminform. 2012, 4, 27.
20. Ballester, P. J.; Richards, W. G., Ultrafast Shape Recognition to Search Compound Databases for Similar Molecular Shapes. J. Comput. Chem. 2007, 28, 1711-1723.
21. Axen, S. D.; Huang, X.-P.; C ceres, E. L.; Gendelev, L.; Roth, B. L.; Keiser, M. J., A Simple Representation of Three-Dimensional Molecular Structure. J. Med. Chem. 2017, 60, 7393-7409.
22. Keiser, M. J.; Roth, B. L.; Armbruster, B. N.; Ernsberger, P.; Irwin, J. J.; Shoichet, B. K., Relating Protein Pharmacology by Ligand Chemistry. Nat. Biotechnol. 2007, 25, 197-206.
23. Hawkins, P. C. D.; Skillman, A. G.; Nicholls, A., Comparison of Shape-Matching and Docking as Virtual Screening Tools. J. Med. Chem. 2007, 50, 74-82.
24. Cai, C.; Gong, J.; Liu, X.; Gao, D.; Li, H., Simg: An Alignment Based Method for Evaluating the Similarity of Small Molecules and Binding Sites. J. Chem. Inf. Model. 2013, 53, 2103-2115.
25. Fontaine, F.; Bolton, E.; Borodina, Y.; Bryant, S. H., Fast 3d Shape Screening of Large Chemical Databases through Alignment-Recycling. Chem. Cent. J. 2007, 1, 12.
26. Koes, D. R.; Camacho, C. J., Shape-Based Virtual Screening with Volumetric Aligned Molecular Shapes. J. Comput. Chem. 2014, 35, 1824-1834.
27. Wallach, I.; Dzamba, M.; Heifets, A., Atomnet: A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-Based Drug Discovery. arXiv preprint arXiv:1510.02855 2015.
28. Gomes, J.; Ramsundar, B.; Feinberg, E. N.; Pande, V. S., Atomic Convolutional Networks for Predicting Protein-Ligand Binding Affinity. arXiv preprint arXiv:1703.10603 2017.
29. Jim nez, J.; Škalič, M.; Mart nez-Rosell, G.; De Fabritiis, G., Kdeep: Protein CLigand Absolute Binding Affinity Prediction Via 3d-Convolutional Neural Networks. J. Chem. Inf. Model. 2018, 58, 287-296.
30. Ragoza, M.; Hochuli, J.; Idrobo, E.; Sunseri, J.; Koes, D. R., Protein CLigand Scoring with Convolutional Neural Networks. J. Chem. Inf. Model. 2017, 57, 942-957.
31. Skalic, M.; Jim nez, J.; Sabbadin, D.; De Fabritiis, G., Shape-Based Generative Modeling for De Novo Drug Design. J. Chem. Inf. Model. 2019, 59, 1205-1214.
32. Golkov, V.; Skwark, M. J.; Mirchev, A.; Dikov, G.; Geanes, A. R.; Mendenhall, J.; Meiler, J.; Cremers, D., 3d Deep Learning for Biological Function Prediction from Physical Fields. arXiv preprint arXiv:1704.04039 2017.
33. Winter, R.; Montanari, F.; No , F.; Clevert, D.-A., Learning Continuous and Data-Driven Molecular Descriptors by Translating Equivalent Chemical Representations. Chem. Sci. 2019, 10, 1692-1701.
34. Hinton, G. E.; Krizhevsky, A.; Wang, S. D. Transforming Auto-Encoders. In Artificial Neural Networks and Machine Learning C ICANN 2011, Berlin, Heidelberg, 2011; Honkela, T.; Duch, W.; Girolami, M.; Kaski, S., Eds. Springer Berlin Heidelberg: Berlin, Heidelberg, 2011; pp 44-51.
35. Sabour, S.; Frosst, N.; Hinton, G. E. Dynamic Routing between Capsules. In Advances in Neural Information Processing Systems, 2017; 2017; pp 3856-3866.
36. Sabour, S.; Frosst, N.; Hinton, G. Matrix Capsules with Em Routing. In International Conference on Learning Representations, 2018; 2018; pp 1-15.
37. Lenssen, J. E.; Fey, M.; Libuschewski, P. Group Equivariant Capsule Networks. In Advances in Neural Information Processing Systems, 2018; 2018; pp 8844-8853.
38. Venkatraman, S.; Balasubramanian, S.; Sarma, R. R., Building Deep, Equivariant Capsule Networks. arXiv preprint arXiv:1908.01300 2019.
39. Cramer, R. D.; Patterson, D. E.; Bunce, J. D., Comparative Molecular Field Analysis (Comfa).
1. Effect of Shape on Binding of Steroids to Carrier Proteins. J. Am. Chem. Soc. 1988, 110, 5959-5967.
40. Klebe, G.; Abraham, U.; Mietzner, T., Molecular Similarity Indices in a Comparative Analysis (Comsia) of Drug Molecules to Correlate and Predict Their Biological Activity. J. Med. Chem. 1994, 37, 4130-4146.
41. Halgren, T., Merck Molecular Force Field. Iii. Molecular Geometries and Vibrational Frequencies for Mmff94. J. Comput. Chem. 1996, 17, 553-586.
42. Halgren, T., Merck Molecular Force Field. V. Extension of Mmff94 Using Experimental Data, Additional Computational Data, and Empirical Rules. J. Comput. Chem. 1996, 17, 616-641.
43. Halgren, T. A., Merck Molecular Force Field. I. Basis, Form, Scope, Parameterization, and Performance of Mmff94. J. Comput. Chem. 1996, 17, 490-519.
44. Halgren, T. A., Merck Molecular Force Field. Ii. Mmff94 Van Der Waals and Electrostatic Parameters for Intermolecular Interactions. J. Comput. Chem. 1996, 17, 520-552.
45. Tosco, P.; Stiefl, N.; Landrum, G., Bringing the Mmff Force Field to the Rdkit: Implementation and Validation. J. Cheminform. 2014, 6, 37.
46. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L. Pytorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems, 2019; 2019; pp 8024-8035.
47. Kingma, D. P.; Ba, J., Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980 2014.
48. Sterling, T.; Irwin, J. J., Zinc 15 C Ligand Discovery for Everyone. J. Chem. Inf. Model. 2015, 55, 2324-2337.
49. Landrum, G. Rdkit: Open-Source Cheminformatics.
50. Lovrić, M.; Molero, J. M.; Kern, R., Pyspark and Rdkit: Moving Towards Big Data in Cheminformatics. Mol. Inform. 2019, 38, 1800082.
51. Davies, M.; Nowotka, M.; Papadatos, G.; Dedman, N.; Gaulton, A.; Atkinson, F.; Bellis, L.; Overington, J. P., Chembl Web Services: Streamlining Access to Drug Discovery Data and Utilities. Nucleic Acids Res. 2015, 43, W612-W620.
52. Gaulton, A.; Hersey, A.; Nowotka, M.; Bento, A. P.; Chambers, J.; Mendez, D.; Mutowo, P.; Atkinson, F.; Bellis, L. J.; Cibri n-Uhalte, E.; Davies, M.; Dedman, N.; Karlsson, A.; Magariños, M. P.; Overington, J. P.; Papadatos, G.; Smit, I.; Leach, A. R., The Chembl Database in 2017. Nucleic Acids Res. 2016, 45, D945-D954.
53. Schrödinger Suites 2018-1, Schrödinger, LLC, New York, NY, 2018.
54. Liu, Z.; Su, M.; Han, L.; Liu, J.; Yang, Q.; Li, Y.; Wang, R., Forging the Basis for Developing Protein CLigand Interaction Scoring Functions. Acc. Chem. Res. 2017, 50, 302-309.
55. Hawkins, P. C. D.; Skillman, A. G.; Warren, G. L.; Ellingson, B. A.; Stahl, M. T., Conformer Generation with Omega: Algorithm and Validation Using High Quality Structures from the Protein Databank and Cambridge Structural Database. J. Chem. Inf. Model. 2010, 50, 572-584.
56. Gao, P.; Wang, L.; Zhao, L.; Zhang, Q.-y.; Zeng, K.-w.; Zhao, M.-b.; Jiang, Y.; Tu, P.-f.; Guo, X.-y., Anti-Inflammatory Quinoline Alkaloids from the Root Bark of Dictamnus Dasycarpus. Phytochemistry 2020, 172, 112260.
57. Weill, N.; Rognan, D., Alignment-Free Ultra-High-Throughput Comparison of Druggable Protein-Ligand Binding Sites. J. Chem. Inf. Model. 2010, 50, 123-135.
58. Hu, G.; Kuang, G.; Xiao, W.; Li, W.; Liu, G.; Tang, Y., Performance Evaluation of 2d Fingerprint and 3d Shape Similarity Methods in Virtual Screening. J. Chem. Inf. Model. 2012, 52, 1103-1113.
59. Awale, M.; Reymond, J.-L., Atom Pair 2d-Fingerprints Perceive 3d-Molecular Shape and Pharmacophores for Very Fast Virtual Screening of Zinc and Gdb-17. J. Chem. Inf. Model. 2014, 54, 1892-1907.