Applications of Support Vector Machines as a Robust Tool in High Throughput Virtual Screening

Received Jan 10th, 2012; Revised Jun 5th, 2012; Accepted Jun 20th, 2012

Chemical space is enormous, but not all of it is pertinent to drug design. Virtual screening methods act as knowledge-based filters for discovering novel lead molecules possessing the desired pharmacological properties. The Support Vector Machine (SVM) is a reliable virtual screening tool for prioritizing molecules with the required biological activity and minimum toxicity. Its inherent advantages include tolerance of noisy data, which mainly come from varied high-throughput biological assays, high sensitivity, specificity and prediction accuracy, and a reduction in false positives. SVM-based classification methods can efficiently discriminate inhibitors from non-inhibitors, actives from inactives, toxic from non-toxic and promiscuous from non-promiscuous molecules. As the principles of drug design are also applicable to agrochemicals, SVM methods are being applied to virtual screening for pesticides too. This review discusses the basic kernels and models used for binary discrimination, as well as the features used for developing SVM-based scoring functions, which will enhance our understanding of molecular interactions. Many researchers have compared SVM modeling with other statistical methods such as Artificial Neural Networks, k-nearest neighbour (kNN), decision trees and partial least squares; such studies are also discussed in this review. Moreover, a case study on the use of the SVM method for screening molecules for cancer therapy has been carried out, and the preliminary results presented here indicate that the SVM is an excellent classifier for screening molecules.


INTRODUCTION
Virtual screening is a broad term encompassing various tools, techniques and methods for obtaining molecules with desired properties [1]. The vast amount of data generated by the "omics" technologies, together with the computational possibility of generating innumerable compounds, calls for a virtual screening (VS) tool capable of identifying different types of potential inhibitors from large compound libraries with high yields and low false-hit rates, similar to high throughput screening (HTS). The right choice of method depends upon the amount of knowledge available. Target-based virtual screening methods such as docking and pharmacophore modeling are routinely used when well-resolved X-ray crystallographic data are available or a good homology model can be generated. In the absence of both, the ligand-based techniques that can be used include Quantitative Structure Activity Relationship (QSAR) modeling, pharmacophore searching, similarity searching, database screening, etc. [2]. Recently, several machine learning methods have been reported; among them, the Support Vector Machine (SVM) and its regression analogue, Support Vector Regression (SVR), have emerged as the most powerful techniques in computational chemistry [3]. Support vector machines have been used on million-dimensional data sets and, in other cases, on data sets with more than a million examples [4]. An SVM-based virtual screening method is a rapid computational tool for predicting potential lead molecules for any drug discovery program [5]. The SVM and SVR techniques have been extensively employed in bioinformatics and chemical engineering (see e.g. [6][7][8][9][10][11][12]).
Compared with the SVM, its regression analogue SVR has been used sparingly in virtual screening efforts. Accordingly, this review mainly considers the SVM as a powerful technique for speeding up virtual screening. Although 3D structural experimental data are lacking for many known expressed proteins, SVM models can still be employed as filters to obtain efficient lead molecules in virtual screening cycles. The SVM builds predictive models for regression and classification of molecules, thus offering medicinal chemists a rapid way to identify novel leads with minimal experimental effort. Its predictive ability is underscored by its successful application to test sets of diverse and multi-targeted compounds. SVM models have been reported to give an estimated sensitivity of greater than 83% and a specificity of greater than 99%, and have thus been used to screen millions of compounds in free and proprietary chemical databases [13]. They are capable of identifying novel inhibitors and of distinguishing inhibitors from structurally similar non-inhibitors. In comparison with other virtual screening tools, the SVM model is found to possess a broad applicability domain and a low false positive rate, which makes it suitable for the virtual screening of chemical libraries. SVM models exhibit good accuracy in cross-validation and independent testing. The predictive ability of these models in virtual screening is evaluated by parameters such as the numbers of true positives, true negatives, false positives and false negatives, sensitivity and specificity, error rate and the Matthews correlation coefficient (MCC); finally, experimental validation is conducted to substantiate the model predictions. Target-based SVM models can correctly indicate interactions with residues, as well as hydrogen bonding between key residues, which is highly relevant for identifying the important substructures and functional groups involved in protein-ligand interactions in structure-activity relationship (SAR) studies. Hence, these models provide guidelines for enhancing activity in novel candidate molecules, the ultimate objective of drug design. In this review, we analyze three major types of SVM-based virtual screening approaches, as depicted in Figure 1.
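The evaluation parameters listed above can all be derived from the four confusion-matrix counts. The following is a minimal sketch in Python; the function name `screening_metrics` and the example counts are hypothetical and chosen only for illustration, they are not taken from any of the cited studies.

```python
import math

def screening_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    # Sensitivity: fraction of true actives that were predicted active.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of true inactives that were predicted inactive.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Error rate: fraction of all compounds that were misclassified.
    error_rate = (fp + fn) / total if total else 0.0
    # Matthews correlation coefficient, defined as 0 when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "error_rate": error_rate,
        "mcc": mcc,
    }

# Hypothetical counts for a screen of 100 actives and 1000 inactives.
metrics = screening_metrics(tp=83, fp=10, tn=990, fn=17)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The MCC is often preferred for screening libraries because it remains informative when inactives vastly outnumber actives, a situation in which accuracy alone can be misleading.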

SVM theory: An overview
The SVM is a nonlinear model based on statistical learning theory, developed by Vapnik et al. [3], that discriminates between data points of distinct classes (binary SVM) such that the margin between the two classes is maximized. This margin defines the linear decision hyperplane. The basic idea of the SVM is to map the training data nonlinearly into a higher dimensional feature space via a mapping function and to construct a separating hyperplane with maximum margin there. This yields a nonlinear decision boundary in the input space. The final position and orientation of the hyperplane are defined by a subset of the training vectors, the so-called support vectors. SVM approaches are usually used with a radial basis function (RBF) as the kernel function, although other kernel functions such as dot, polynomial, sigmoid, multi-quadratic, Gaussian combination, etc. have also been explored.

IJCB
SVMs use an implicit mapping, $\phi$, of the input data into a high-dimensional feature space defined by a kernel function, i.e., a function returning the inner product $\langle \phi(x), \phi(x') \rangle$ between the images of two data points x, x' in the feature space. The learning then takes place in the feature space, and the data points only appear inside dot products with other points [14]. This is often referred to as the "kernel trick" [15]. Owing to this trick, it becomes unnecessary to perform any computations in the high-dimensional feature space, since all the requisite computations can be carried out directly in the input space.
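In the standard soft-margin formulation, the SVM is trained by solving

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i \left( \langle w, \phi(x_i) \rangle + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0,$$

where C refers to the regularization parameter, and $\xi_i$ refers to the slack variables. Intuitively, the kernel function k computes the similarity between two given examples. The most commonly employed kernel functions are:

RBF kernels: $k(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^{2}\right)$, with width parameter $\gamma > 0$

Polynomial kernels: $k(x, x') = \left(\langle x, x' \rangle + c\right)^{d}$, with degree d and offset c

The SVM finds a large-margin separation between the training examples, and previously unseen examples will often lie close to the training examples. The large margin then ensures that these examples are correctly classified as well (high generalization ability). The SVM training algorithm constructs models that are adequately complex yet, unlike artificial neural networks, simple enough to be analysed mathematically. The SVM is capable of handling multiple continuous as well as categorical variables.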
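The kernel trick and the two kernel families named above can be illustrated with a short sketch. It assumes the usual textbook forms of the RBF and polynomial kernels; the parameter names `gamma`, `degree` and `coef0`, and the helper `explicit_quadratic_map`, are illustrative choices rather than notation from the cited references. For the homogeneous degree-2 polynomial kernel in two dimensions, the explicit feature map is small enough to write out, so the kernel value computed in the input space can be compared with the inner product computed in the feature space.

```python
import math

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def polynomial_kernel(x, y, degree=2, coef0=0.0):
    """Polynomial kernel: (<x, y> + coef0) ** degree."""
    return (dot(x, y) + coef0) ** degree

def explicit_quadratic_map(x):
    """Explicit feature map of the homogeneous degree-2 kernel for 2D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

x, y = (1.0, 2.0), (3.0, -1.0)

# Kernel evaluated directly in the input space (the "kernel trick").
k_implicit = polynomial_kernel(x, y, degree=2, coef0=0.0)
# Same quantity computed as an inner product in the 3D feature space.
k_explicit = dot(explicit_quadratic_map(x), explicit_quadratic_map(y))

print(k_implicit, k_explicit)
print(rbf_kernel(x, y, gamma=0.1))
```

For the RBF kernel the corresponding feature space is infinite-dimensional, so only the implicit evaluation is available in practice, which is one reason the kernel formulation is so convenient.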

Ligand-based virtual screening using SVM
Since relatively few protein structures have been solved experimentally by X-ray crystallography or NMR, ligand-based approaches serve as a valuable tool for virtual screening. The ligand-based SVM approach can be used even in the absence of receptor information, and a number of researchers have used it successfully to obtain inhibitors for a range of pharmacologically important targets. Two kernels are commonly used in ligand-based virtual screening, viz. the 2D Tanimoto kernel and the 3D pharmacophore kernel. While a number of graph kernels are used in chemoinformatics, the Tanimoto kernel is the most common. Given a feature map $\phi_d$ and the corresponding kernel $k_d$, the Tanimoto kernel $k^{t}_{d}$ is defined as [17]

$$k^{t}_{d}(u, v) = \frac{k_d(u, v)}{k_d(u, u) + k_d(v, v) - k_d(u, v)}$$

where u and v denote two molecules and d is an integer. The initial reports on virtual screening used Binary Kernel Discrimination (BKD), a fingerprint-based machine learning method built on the Tanimoto concept of similarity. Willett et al. [18] compared BKD with other ligand fingerprint-based virtual screening methods and found BKD to be much superior in drug and pesticide discovery, but lower in performance than the SVM. Byvatov et al. [19] trained an SVM for the prediction of D3 and D2 receptor selective ligands. The hit compounds were synthesized and shown to possess nanomolar affinity. This approach was further refined by Chen et al. [20], who examined the effect of noisy data on the model using the MDDR database and noticed that its predictive ability depended on the number of false positives in the training set. Wilton et al. [21] obtained similar results in their work on pesticide data from the Syngenta corporate database.

Pharmacophores are important structural features of a molecule that help it bind to a receptor, thus giving rise to its biological activity. The pharmacophoric features are limited in number and mainly include ring centroids, aromatic rings, hydrophobic regions, hydrogen bond acceptors, hydrogen bond donors, etc. An example of a three-point pharmacophore is shown in Figure 4. Pharmacophores serve as very important parameters in the virtual screening of chemical libraries. SVM-based pharmacophore modeling has been performed using the pharmacophore kernel, which, for two three-point pharmacophores p and p' with atom coordinates $x_i$, $x'_i$ and atom labels $l_i$, $l'_i$, is defined as follows [22]:

$$K(p, p') = K_I(p, p') \times K_S(p, p'), \qquad K_I(p, p') = \prod_{i=1}^{3} K_{\mathrm{Feat}}(l_i, l'_i), \qquad K_S(p, p') = \prod_{i=1}^{3} K_{\mathrm{Dist}}\left(\lVert x_i - x_{i+1} \rVert, \lVert x'_i - x'_{i+1} \rVert\right)$$

where ||•|| corresponds to the Euclidean distance, the index i+1 is taken modulo 3, $K_{\mathrm{Feat}}$ and $K_{\mathrm{Dist}}$ are the kernel functions introduced to compare pairs of atom labels and pairs of distances, $K_I$ corresponds to the intrinsic similarity and $K_S$ to the spatial similarity of the pharmacophores. Franke et al. [23] have demonstrated the effectiveness of this strategy by applying it to the screening of COX-2 inhibitors.
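Both ligand-based kernels discussed in this section can be sketched in a few lines. The example below is illustrative only and rests on stated assumptions: fingerprints are represented as Python sets of "on" bit indices, so the underlying kernel is the count of shared bits; the label kernel is taken to be a Dirac (exact match) kernel and the distance kernel a Gaussian with a hypothetical width `sigma`. The cited works may use different choices.

```python
import math

def tanimoto_kernel(fp_u, fp_v):
    """Tanimoto kernel for binary fingerprints given as sets of 'on' bit indices."""
    common = len(fp_u & fp_v)                 # k(u, v): number of shared bits
    union = len(fp_u) + len(fp_v) - common    # k(u, u) + k(v, v) - k(u, v)
    return common / union if union else 0.0

def pharmacophore_kernel(p, q, sigma=1.0):
    """Kernel between two three-point pharmacophores.

    Each pharmacophore is a sequence of three (label, (x, y, z)) tuples.
    """
    k_intrinsic = 1.0
    k_spatial = 1.0
    for i in range(3):
        j = (i + 1) % 3  # index i+1 taken modulo 3
        # Assumed label kernel: 1 if the feature labels match, else 0.
        k_intrinsic *= 1.0 if p[i][0] == q[i][0] else 0.0
        # Assumed distance kernel: Gaussian on the difference of edge lengths.
        d_p = math.dist(p[i][1], p[j][1])
        d_q = math.dist(q[i][1], q[j][1])
        k_spatial *= math.exp(-((d_p - d_q) ** 2) / (2.0 * sigma ** 2))
    return k_intrinsic * k_spatial

# Hypothetical fingerprints and pharmacophores.
fp_a = {1, 5, 9, 12}
fp_b = {1, 5, 7}
print(tanimoto_kernel(fp_a, fp_b))

ph_a = [("donor", (0.0, 0.0, 0.0)), ("acceptor", (3.0, 0.0, 0.0)), ("aromatic", (0.0, 4.0, 0.0))]
ph_b = [("donor", (1.0, 1.0, 0.0)), ("acceptor", (4.0, 1.0, 0.0)), ("aromatic", (1.0, 5.5, 0.0))]
print(pharmacophore_kernel(ph_a, ph_b))
```

A kernel between two whole molecules is typically obtained by combining such pairwise values over the pharmacophores each molecule contains; that aggregation step is omitted here.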
Chen et al. [24] used Atom Pair (AP) structure descriptors and physicochemical (PC) descriptors of compounds to generate SVM-AP (Support Vector Machine - Atom Pairs) and SVM-PC (Support Vector Machine - Physico-Chemical) models, and used them to develop "LigSeeSVM", a tool for ligand-based virtual screening that was validated on five different datasets. In another study [25], SVM modeling was performed to screen inhibitors of Lck, a target implicated in auto-immune diseases. The model had an estimated sensitivity of greater than 83% and a specificity of greater than 99%; when used to screen 168014 compounds in the MDDR database, it gave a yield of 45.8% and a false positive rate of 0.52%. Ma et al. [26] assessed the performance of the SVM using a sparse data set of active compounds for six target classes in the MDDR database, namely muscarinic M1 receptor agonists, NMDA receptor antagonists, thrombin inhibitors, HIV protease inhibitors, cephalosporins and renin inhibitors. They found the SVM method to be superior to Tanimoto coefficient-based similarity search methods in giving low false-hit rates. The predicted compounds were verified by a cellular activity assay, proving the SVM to be an efficient method. Jorissen and Gilson [27] developed a modified version of the SVM, not only to classify molecular data but also to enrich actives, using a novel method for identifying descriptors and cross-validating them as parameters in training the SVM. The results obtained were better than those based on fingerprint methods such as binary kernel discrimination. Han et al. [28] improved the performance of the SVM for virtual screening of huge libraries by including a large number of diverse non-actives in the training dataset. The hit rates and enrichment factors increased dramatically for datasets of HIV protease inhibitors, DHFR inhibitors, dopamine antagonists and CNS active agents, and were better than those obtained with other ligand-based virtual screening methods.
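The performance measures quoted in these studies (yield, false-hit rate, hit rate and enrichment factor) follow from simple counts. The sketch below uses commonly used definitions, which may differ in detail from those in the cited papers; the function name `screening_summary` and all counts are hypothetical.

```python
def screening_summary(n_actives, n_inactives, actives_found, inactives_flagged):
    """Summarize a virtual screen from library composition and predicted hits."""
    n_hits = actives_found + inactives_flagged
    # Yield: fraction of known actives recovered by the screen.
    yield_fraction = actives_found / n_actives
    # False-hit rate: fraction of inactives wrongly predicted as active.
    false_hit_rate = inactives_flagged / n_inactives
    # Hit rate: fraction of predicted hits that are truly active.
    hit_rate = actives_found / n_hits if n_hits else 0.0
    # Enrichment factor: hit rate relative to the active fraction of the library.
    base_rate = n_actives / (n_actives + n_inactives)
    enrichment = hit_rate / base_rate if base_rate else 0.0
    return {
        "yield": yield_fraction,
        "false_hit_rate": false_hit_rate,
        "hit_rate": hit_rate,
        "enrichment_factor": enrichment,
    }

# Hypothetical library: 100 actives hidden among 9900 inactives.
summary = screening_summary(n_actives=100, n_inactives=9900,
                            actives_found=46, inactives_flagged=50)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Even a false-hit rate well below 1% can translate into many false hits in absolute terms when millions of compounds are screened, which is why the hit rate and enrichment factor are usually reported alongside it.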

Sequence based virtual screening using SVM
In a recent report by Wang et al. [29], a novel SVM method was developed that uses only the sequence data of targets and information on the 2D structures of small molecules. This SVM model was based on 15,000 ligand-protein interactions derived from 626 proteins and 10,000 active compounds. The methodology was used to identify nine active compounds for four targets, viz. GPR40, SIRT1, p38 and GSK-3β, and can also be extended to other proteins.

Target-based virtual screening using SVM
Target-based virtual screening methods have been introduced more recently than the conventional ligand-based methods. The latter have several drawbacks, the most important being the neglect of the receptor, with no consideration of its flexibility, and, secondly, a lack of adequate scoring functions. Currently, a number of research groups are focussing on target-based SVM methods for success in virtual screening. Li et al. [30] have reported SVM-SP, an exclusively target-based scoring function that yielded better enrichment than the Glide docking score, as evidenced by ROC-AUC (Receiver Operating Characteristic - Area Under the Curve) plots. They evaluated its performance on 41 targets, mostly from the Directory of Useful Decoys (DUD), and obtained the best results with kinases. The strategy worked well with a homology model and also succeeded when only a few structures were available as the training set. Virtual screening was performed against 1125 compounds for two targets, namely EGFR and CaMKII, wherein three out of the 25 hit compounds showed good inhibitory activity in vitro. In another report by the same group [31], the support vector regression method was applied to the rank ordering and virtual screening of chemical libraries using the Community Structure-Activity Resource (CSAR) datasets. Here, two new scoring functions were developed based on knowledge of pair-wise potentials and physicochemical properties. These scoring functions outperformed seven well-established scoring functions, namely Glide, Vina, GoldScore, DOCK, ChemScore, PMF and X-Score. Li et al. [31] also developed a new function, SVR_KBD, employing a target-specific strategy, and the enrichment results were found to be comparable to those of their previously reported SVM-SP scoring function.
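The ROC-AUC used above to compare scoring functions can be computed directly from a ranked list of scored compounds. The sketch below uses the rank-statistic interpretation of the AUC, i.e. the probability that a randomly chosen active is scored higher than a randomly chosen decoy; the scores and labels are hypothetical and unrelated to the SVM-SP benchmark.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of actives and decoys.

    labels: 1 for an active, 0 for a decoy. Ties between an active and a
    decoy count as half a win.
    """
    actives = [s for s, lab in zip(scores, labels) if lab == 1]
    decoys = [s for s, lab in zip(scores, labels) if lab == 0]
    if not actives or not decoys:
        raise ValueError("Both actives and decoys are required to compute the AUC.")
    wins = 0.0
    for a in actives:
        for d in decoys:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(actives) * len(decoys))

# Hypothetical scores from a scoring function (higher means more likely active).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
print(f"ROC-AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of actives from decoys. The pairwise loop is quadratic in the number of compounds, so a sort-based formulation would be preferable for large libraries.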
Wassermann and Bajorath [32] have discussed at length various SVM-based target-selective searching strategies for virtual screening. They describe an approach that is superior in terms of enrichment factor, in which the SVM is trained on data comprising more than two classes, viz. selective, promiscuously active and non-active compounds, rather than on the commonly employed binary classification. Further, they present a modified preference ranking strategy leading to a higher recall of selective compounds. Combinatorial support vector machines (C-SVM) have been used as virtual screening tools [33] for searching dual inhibitors of 11 combinations of 9 anticancer kinase targets (EGFR, VEGFR, PDGFR, Src, FGFR, Lck, CDK1, CDK2, GSK3). In this study, the C-SVM was found to be either comparable to or slightly better than other conventional methods such as Surflex, Dock, Blaster, kNN and PNN. Plewczynski et al. [34] performed an exclusively target-specific supervised SVM analysis of compounds retrieved from the MDDR database and related to five targets, including cyclooxygenase-2, dihydrofolate reductase, thrombin, HIV reverse transcriptase and antagonists of the estrogen receptor. The SVM model was based only on two-dimensional topological descriptors related to atom pairs. The sensitivity and classification for all the protein targets were 80% and 100%, respectively. The literature is replete with further examples of the successful application of the SVM with both ligand-based and target-based approaches. Table 1 shows a few representative ligands and their corresponding targets; SVM modeling has thus been applied across a wide range of chemical diversity and target space in virtual screening.

Feature based library selection
Although machine learning methods, including the SVM, yield good classification and regression results, they are considered "black-box" methods, since the parameters of the developed model cannot be interpreted meaningfully to gain insight into the classification or regression process. In the context of virtual screening, for example, the influence of an individual molecular feature on the classification performance is not easily discernible. An interesting attempt to circumvent this drawback was made by Byvatov and Schneider [35], who reported an SVM-based algorithm for the feature-based selection of libraries of enzyme inhibitors, aimed at a better understanding of protein-ligand interactions. The study included a comparison with the classical model-independent Kolmogorov-Smirnov (KS)-based feature selection, and the SVM was found to be faster than the KS-based classifiers. This algorithm showed that the SVM can be made more intelligent and comprehensible, and that it is no longer a complete black box.

Comparison of SVM with other virtual screening methods
A number of researchers have used the SVM alongside other commonly utilized machine learning methods in virtual screening, and the comparative results of some of these studies are briefly discussed here. An SVM in combination with docking produced better virtual screening results, as observed by Li and co-workers [36], who identified a novel series of benzimidazole derivatives for the EGFR, VEGFR and PDGFR kinases. Similar results were reported by Xie et al. [37], who discovered new and potent inhibitors of c-Met, a membrane receptor required for embryonic development and wound healing. Supplementing the SVM results with docking results led to a higher number of hits and an enrichment of actives in the dataset. Some of the SVM-predicted actives were also validated experimentally in assays. Ren et al. [38] used a hierarchical methodology applying three virtual screening techniques in succession, viz. SVM, pharmacophore modeling and docking, to predict potent inhibitors of Pim-1. These techniques were used to screen a large number of databases, including PubChem, Specs and Enamine, and fifteen hits showing nanomolar activity were finally obtained. Luan and co-workers [39] used the same strategy of combining SVM and molecular docking to discover 9-aminoacridine scaffolds as multi-target inhibitors of VEGFR-2 and Src kinase, and confirmed the results experimentally.
Apart from biological activity, the toxicity of drugs is another major concern for pharmaceutical companies, and relatively few accurate prediction methods are available. Kumar et al. [40] used SVM methods to correctly estimate the genotoxicity levels of the compounds in their training set. They studied in depth the effect of training set size and noise level on the performance of the SVM in analyzing genotoxic and non-genotoxic compounds from large virtual screening libraries. The predictions made by the SVM model were on par with those made by TOPKAT, a commercial toxicity prediction tool from Accelrys. In another study [41], SVM in combination with pharmacophore modeling yielded good results in the development of non-competitive antagonists of mGluR1, a glutamate receptor implicated in nervous disorders. As no X-ray structure is available, ligand-based virtual screening is the only approach for this target. Using the MDDR data set, it was shown that a multi-step virtual screening approach involving both techniques is superior to using either of them individually. Mballo and Makarenkov [42] made a comparative study of six known machine learning methods, viz. binary decision trees, neural networks, SVM, linear discriminant analysis, k-nearest neighbours and partial least squares, by analysing a test assay from the McMaster University Data Mining and Docking Competition. They evaluated the methods on the basis of various parameters such as sensitivity, enrichment factor and the numbers of false positives and false negatives. Finally, they devised a variable selection procedure and applied it to the polynomial SVM. Yang and co-workers [43] used random forest (RF) and SVM learning techniques to design inhibitors of gamma secretase, an important target in Alzheimer's disease. They observed that the RF model marginally outperformed the SVM method. Virtual screening using the model resulted in three hits in the ZINC database. Abl is an important target for cancer therapy, and considerable efforts have been made to develop inhibitors using in silico methods such as docking and pharmacophore modeling. Liu et al. [44] found the SVM approach to be better than these techniques for identifying Abl inhibitors, as it led to lower false-hit rates and enabled the searching of huge libraries. Tropsha et al. [45] used a combined Quantitative Structure Activity Relationship - Virtual Screening (QSAR-VS) approach involving kNN and SVM to develop inhibitors of human histone deacetylase (HDAC). Highly predictive models with good r² values were obtained and were rigorously validated on external datasets. The model gave forty-five unique hits when searching a huge in-house database. In another comparative study by Bajorath and co-workers [46], the ranking provided by the SVM proved to be far superior to the rankings provided by nearest neighbour and centroid similarity search methods, even when a smaller data set was used for training. The explanation given was that, during the learning phase, the SVM uses information about database molecules in addition to the known active compounds. In a systematic study, Byvatov and co-workers [47] compared ANN and SVM methods as binary classifiers for discriminating drugs from non-drugs. They used three sets of descriptors, viz. 120 Ghose-Crippen fragments, a wide range of 180 descriptors from the Molecular Operating Environment (MOE) package, and 225 topological pharmacophore (CATS) descriptors. In general, the SVM performed marginally better than the ANN, with minimum error regardless of the choice of descriptors. However, the authors concluded that the two methodologies are complementary, as the results were similar but not identical. Melagraki et al. [48] did extensive work on developing inhibitors of the MCH1 receptor by using a number of ligand-based virtual screening techniques in tandem. First, a linear QSAR model was developed using multiple linear regression (MLR), for which the most suitable input variables were selected using the Elimination Selection-Stepwise Regression (ES-SWR) method. Finally, an SVM was used to categorize the molecules into actives and non-actives. Considerable effort was expended to select the optimum scaffold, and the activities of the actives predicted by the SVM were estimated using the MLR model. The Jørgensen group at the University of Copenhagen developed many in silico models based on different classification methods, such as binary QSAR, kNN, SVM, decision trees, etc., for developing inhibitors of P450 1A2, an important enzyme in drug metabolism. Here, SVM, kNN and random forest were found to be the best methods, delivering models with high prediction accuracy and a Matthews correlation coefficient of 0.5 [49]. Similar results were obtained by Khandelwal et al. [50] in their work on predicting pregnane X receptor activators using machine learning methods coupled with a docking protocol. They observed that docking combined with regression yielded inferior results when compared with the SVM and RF methods. Plewczynski et al. [51] have conducted extensive studies to assess a host of machine learning techniques, such as SVM, random forest, ANN, k-nearest neighbour (kNN) classification with genetic-algorithm-optimized feature selection, trend vectors, naive Bayesian classification and decision trees, for their capacity to recognize ligands in a large collection of molecules. Interestingly, they obtained varying results from these methods; while some were good at retrieving actives, others yielded high enrichment scores. However, all the methods could correctly predict the recently reported ligands. It was concluded that no single method is the most consistent; rather, a combination of methods is essential for better results in virtual screening.
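One simple way of combining several classifiers, in the spirit of the conclusion above, is majority voting over their binary predictions. The sketch below is only an illustration of that idea; the three prediction lists are hypothetical stand-ins for the outputs of, say, an SVM, a random forest and a kNN model, and none of the cited studies is claimed to have used this exact scheme.

```python
def majority_vote(predictions):
    """Combine binary predictions (1 = active, 0 = inactive) from several models.

    predictions: list of per-model label lists, all of equal length.
    A compound is labelled active when more than half of the models say so.
    """
    n_models = len(predictions)
    consensus = []
    for votes in zip(*predictions):
        consensus.append(1 if sum(votes) * 2 > n_models else 0)
    return consensus

# Hypothetical predictions for five compounds from three different models.
svm_pred = [1, 0, 1, 1, 0]
rf_pred = [1, 0, 0, 1, 0]
knn_pred = [0, 0, 1, 1, 1]

consensus = majority_vote([svm_pred, rf_pred, knn_pred])
print(consensus)
```

With an odd number of models there are no ties; more elaborate consensus schemes weight each model by its validated performance or fuse the ranked lists instead of the binary labels.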
A case study: SVM-based binary classification of molecules for their potential as anticancer agents
A number of molecules tested as anticancer lead compounds against the human breast cancer cell line MCF-7 were downloaded from the PubChem BioAssay and NCI databases and used for building an SVM-based classifier. A dataset of 54 molecules containing actives and inactives was constructed and split into a training set (41 molecules) and a test set (13 molecules) (see Figure 5). BCUT descriptors closely encoding the activity were selected. RapidMiner 5.1, an open-source data-mining software tool, was used to generate the SVM classifier model [53]. The objective of the study was to categorize a given molecule as "active" or "inactive" depending upon the descriptors encoding the activity. This is a supervised classification task in which example data containing each molecule's descriptors (model inputs) and its class (active or inactive, the model output) are available. RapidMiner provides various types of SVM modules for performing supervised classification. Among these, the Support Vector Machine - Evolutionary (SVME) [54] yielded the best classification results. The SVME uses an evolutionary algorithm (ES) to solve the dual optimization problem of an SVM. On many datasets, the SVME performs as fast and as accurately as the usual SVM implementations. Additionally, it is capable of learning with kernels that are not positive semi-definite. The parameter values used in the classification were: kernel type: ANOVA, kernel gamma = 1.0, kernel degree = 3, C = 0.β5, ε = 0.1, maximum number of generations = 10,000, population size = 5, selection type: Tournament, and crossover probability = 1.0. This combination of SVME parameters yielded excellent classification results, wherein all the molecules in both the training and test sets were classified with 100% accuracy (no false positives or false negatives). For the training set of 41 input-output patterns, the SVME method identified 24 support vectors. The classification accuracy of 100% on the test set molecules indicates that the developed SVME model is capable of accurately generalizing the learned classification to new molecules. A detailed study comparing the classification results from a number of SVM- and ANN-based classifiers for data on other cell lines is currently in progress.
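The workflow of the case study (train a kernel SVM on labelled descriptors, count the support vectors, then check the accuracy on held-out molecules) can be sketched in plain Python. This is not the RapidMiner SVME operator: instead of an evolutionary algorithm, the dual problem is solved here by simple coordinate ascent, the bias term is omitted for brevity, an RBF kernel replaces the ANOVA kernel, and the two-dimensional points are made-up stand-ins for BCUT descriptors. All function names and values are hypothetical.

```python
import math

def rbf(x, y, gamma=1.0):
    """RBF kernel between two descriptor vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def train_svm(X, y, C=1.0, gamma=1.0, epochs=100):
    """Bias-free kernel SVM trained by coordinate ascent on the dual problem."""
    n = len(X)
    K = [[rbf(X[i], X[j], gamma) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    for _ in range(epochs):
        for i in range(n):
            # Current decision value for training point i.
            f_i = sum(alpha[j] * y[j] * K[i][j] for j in range(n))
            # Newton step on the dual objective along coordinate i,
            # clipped to the box constraint 0 <= alpha_i <= C.
            step = (1.0 - y[i] * f_i) / K[i][i]
            alpha[i] = min(C, max(0.0, alpha[i] + step))
    return alpha

def predict(x, X, y, alpha, gamma=1.0):
    """Classify x as +1 (active) or -1 (inactive)."""
    f = sum(a * yi * rbf(xi, x, gamma) for a, yi, xi in zip(alpha, y, X))
    return 1 if f >= 0 else -1

# Made-up 2D "descriptors": actives (+1) near the origin, inactives (-1) near (2, 2).
X_train = [(0.0, 0.0), (0.2, 0.1), (0.1, -0.2), (-0.1, 0.1),
           (2.0, 2.0), (2.1, 1.9), (1.8, 2.2), (2.2, 2.1)]
y_train = [1, 1, 1, 1, -1, -1, -1, -1]
X_test = [(0.1, 0.1), (1.9, 2.0)]
y_test = [1, -1]

alpha = train_svm(X_train, y_train, C=1.0, gamma=1.0)
# Support vectors are the training points with a non-zero dual coefficient.
n_support = sum(1 for a in alpha if a > 1e-8)
y_pred = [predict(x, X_train, y_train, alpha, gamma=1.0) for x in X_test]
accuracy = sum(p == t for p, t in zip(y_pred, y_test)) / len(y_test)
print(f"support vectors: {n_support}, test accuracy: {accuracy:.2f}")
```

With so few, well-separated points, perfect test accuracy is expected and says little about generalization; on small real datasets, repeated cross-validation gives a more reliable estimate than a single train/test split.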

CONCLUSION
SVM-based screening is flexible and fast, and it significantly increases the speed and accuracy of prediction in virtual screening experiments. To some extent, these advantages are offset by limitations such as a low hit rate and a high number of false positives. Consequently, there exists a need to speed up SVM and kernel methods, which will surely benefit virtual screening efforts at large. As rightly pointed out by Schneider [55], it might be wise to try out several predictive methods in parallel to find the right solution in virtual screening. To sum up, a consensus approach would be ideal for the large, disparate datasets generally available for computational biology studies.

Figure 2: The mapping function $\phi$ maps the training data nonlinearly into a higher dimensional feature space, wherein a separating hyperplane with maximum margin is constructed, yielding a nonlinear decision boundary in the input space.

Figure 3: (a) A number of hyperplanes can separate the two classes; (b) the SVM algorithm constructs the hyperplane that maximizes the separation between the two classes.

Figure 5: 2D structures of some diverse molecules used in the training and test sets.