TpPred : A Tool for Hierarchical Prediction of Transport Proteins Using Cluster of Neural Networks and Sequence Derived Features

Received Feb 2nd, 2012; Revised Feb 29th, 2012; Accepted Mar 6th, 2012

A top-down predictor, called TpPred, has been developed. It consists of three levels of hierarchical classification using a cascade of neural networks trained on sequence-derived features. The 1st layer of the prediction engine identifies a query protein as a transport protein or not; the 2nd layer predicts the main functional class; and the 3rd layer predicts the sub-functional class. The overall success rates for all three layers are higher than 65%. These rates were obtained through rigorous cross-validation tests on very stringent benchmark datasets in which none of the proteins has 30% or more sequence identity with any other protein in the same class or subclass. TpPred achieved good prediction accuracies and could complement experimental approaches for the identification of transport proteins. TpPred is freely available for in-house use as a standalone version and is accessible at http://www.juit.ac.in/attachments/tppred/Home.html.


INTRODUCTION
Transport proteins are biologically important and play indispensable roles in the fundamental cellular processes of all organisms. They transport ions and molecules across the membrane and play essential roles in cellular metabolism and activities. They mediate the entry of nutrients into the cytoplasm and the extrusion of metabolic wastes, maintain a stable internal environment inside the cell by regulating the uptake and efflux of ions, protect cells from environmental insults, and enhance communication between cells through the secretion of proteins, carbohydrates and lipids [1][2][3]. Specific transporters have been explored as therapeutic targets [4][5][6]. A variety of transporters are responsible for the absorption, distribution and excretion of drugs within the human body, which must be factored into pharmacological studies [7,8]. Different transport systems differ in their putative membrane topology, energy coupling mechanism and substrate specificities [9]. The importance of transport proteins and the volume of data available on them warrant their systematic annotation and classification, which helps to elucidate the functional mechanisms of proteins and biological processes.
Transport proteins have been identified by experimental approaches such as absorbance spectroscopy, gel electrophoresis, metal-affinity columns and shift assay, chromatography, mass spectroscopy, and combined spectroscopic studies. However, some of these methods require a purified or semi-purified target of interest, do not facilitate identification of unknown targets from complex protein mixtures, or require multi-step processes and very specialized equipment, which limits their range of application. Therefore, there is a need to explore other methods, including computational approaches, that facilitate the identification of transport proteins and complement these experimental methods. With the explosion of protein sequences entering the databanks, it is highly desirable to explore the feasibility of classifying newly found protein sequences into their respective transport protein classes by means of an automated method [10,11]. This is important because knowing which class a protein belongs to may help to deduce its catalytic mechanism and specificity, giving clues to the relevant biological function. The primary sequences of these proteins are readily available. A method using sequence-derived features would therefore be a valuable and cost-effective way of classifying these proteins, first as transporter or non-transporter and then into the major classes and subclasses defined by the Transport Classification (TC) system (http://www.tcdb.org/browse.php) [12].

So far, sequence alignment and clustering have been the primary methods for predicting the TC family, as well as the function, of transporters [13,14]. Some transporters are known to have no or low homology to other proteins of known function [15][16][17][18]. A substantial portion of transporters in different TC families have been found to have very low sequence identity to other family members. For instance, a member of the multidrug transporter family, bmr3, has only 7% sequence identity and 17% similarity to another family member, blt [18]. The potassium channel TASK-2 has 18-22% sequence identity to other members of the two-pore domain K+ channel family, such as TWIK-1, TREK-1, TASK-1, and TRAAK [19]. Two members of the major facilitator family, GlpT and LacY, are 21% identical to each other [21]. Thus, the function of some of these transporters may be difficult to assign based solely on homology [21,22], and methods that predict protein function without the use of sequence similarity are needed.
This work explored a machine learning method, the artificial neural network (NN), that predicts transport proteins directly from sequence or sequence-derived properties. The sequence-derived features used are amino acid composition, pseudo amino acid composition and physicochemical properties. Using these parameters and their combinations, we have developed a cluster of neural networks for the hierarchical classification of transport proteins in a "top-down" approach.

Preparation of dataset
All transport proteins used in this study were taken from the Transport Classification Database (http://www.tcdb.org/), in which the proteins are classified on the basis of their function [12]. A total of 5,359 transport protein sequences have been classified into seven major classes: channels/pores (1139), electrochemical potential-driven transporters (1456), primary active transporters (2045), group translocators (107), transmembrane electron carriers (106), accessory factors involved in transport (129) and incompletely characterized transport systems (377). To avoid biased learning in the networks, we scaled the datasets to reduce the inequality in the number of protein sequences per class. We reduced the proteins in each class with a similarity cutoff of 30% using BLASTClust [23]. A negative dataset consisting of 2,907 protein sequences, representing non-transport members, was also created from the PDB database. These datasets were divided into separate training, testing and independent evaluation sets (Table 1).

Feature vector construction
The following three types of discrete feature vectors were constructed for each protein sequence.

1. Amino acid composition: Given the sequence of a protein, its amino acid composition was computed and used to generate a set of 20 features representing the composition of the 20 standard amino acids in the protein sequence: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y. These features have been widely used in predicting different structural classes and subcellular localization of proteins [10,11,24]. The amino acid composition is calculated from AA(i), the frequency of the i-th amino acid in the sequence.

2. Physicochemical properties: Twelve sequence-derived properties were calculated for each protein sequence using the EMBOSS (EBI) package [25]. The parameters are: molecular weight, total charge, isoelectric point, and the mole percentages of tiny (A+C+G+S+T), small (A+B+C+D+G+N+P+S+T+V), aliphatic (I+L+V), aromatic (F+H+W+Y), non-polar (A+C+F+G+I+L+M+P+V+W+Y), polar (D+E+H+K+N+Q+R+S+T+Z), charged (B+D+E+H+K+R+Z), acidic (B+D+E+Z) and basic (H+K+R) amino acids.

3. Pseudo amino acid composition (PseAA):
This class of descriptor consists of a set of 37 features, 20 of which are weighted amino acid compositions; the remaining 17 are correlation factors calculated among amino acids for each protein sequence [26].
A protein sequence $P$ with $L$ amino acid residues can be represented as:

$$P = R_1 R_2 R_3 \cdots R_L \tag{1}$$

where $R_1$ represents the 1st residue of the protein $P$, $R_2$ the 2nd residue, and so forth. According to the simplest discrete model, the amino acid composition of the protein $P$ based on equation (1) can be expressed as:

$$P = [f_1, f_2, \ldots, f_{20}]^T \tag{2}$$

where $f_1, f_2, \ldots, f_{20}$ are the normalized occurrence frequencies of the 20 native amino acids in $P$ and $T$ is the transposing operator. The additional 17 features are a series of rank-different correlation factors along the protein chain and were calculated as follows.

With these correlation factors, the protein $P$ can be represented as:

$$P = [p_1, p_2, \ldots, p_{20}, p_{20+1}, \ldots, p_{20+\lambda}]^T \tag{3}$$

$$p_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \tau_k}, & 1 \le u \le 20 \\[2ex] \dfrac{w\,\tau_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \tau_k}, & 20+1 \le u \le 20+\lambda \end{cases} \tag{4}$$

where $w$ is the weight factor and $\tau_k$ is the $k$-th tier correlation factor that reflects the sequence-order correlation between all the $k$-th most contiguous residues, as formulated by

$$\tau_k = \frac{1}{L-k} \sum_{i=1}^{L-k} J_{i,i+k}, \qquad J_{i,i+k} = \frac{1}{\Gamma} \sum_{\xi=1}^{\Gamma} \left[ \Phi_\xi(R_{i+k}) - \Phi_\xi(R_i) \right]^2$$

where $\Phi_\xi(R_i)$ is the $\xi$-th function of the amino acid $R_i$, and $\Gamma$ is the total number of functions considered. $\Phi_1(R_i)$, $\Phi_2(R_i)$ and $\Phi_3(R_i)$ represent respectively the hydrophobicity value [27], the hydrophilicity value [28], and the side chain mass of amino acid $R_i$ (Table 2), while $\Phi_1(R_{i+k})$, $\Phi_2(R_{i+k})$ and $\Phi_3(R_{i+k})$ are the corresponding values for the amino acid $R_{i+k}$. Therefore, the total number of functions considered is $\Gamma = 3$.

It can be seen from equation (3) that the first 20 components, i.e. $p_1, p_2, \ldots, p_{20}$, are associated with the conventional amino acid composition of the protein, while the remaining components $p_{20+1}, \ldots, p_{20+\lambda}$ are the correlation factors that reflect the 1st tier, 2nd tier, ..., and $\lambda$-th tier sequence-order correlation patterns. It is through these additional $\lambda$ factors that the important sequence-order information is incorporated.
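The pseudo amino acid composition described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the TpPred implementation: the three scales below are toy placeholders standing in for the hydrophobicity, hydrophilicity and side chain mass values of Table 2, the weight factor `w = 0.05` is an assumed value that the text does not specify, and all function names are hypothetical.

```python
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Toy placeholder scales (NOT the Table 2 values); one dict per function Phi_xi.
TOY_SCALES: List[Dict[str, float]] = [
    {a: i / 19 for i, a in enumerate(AMINO_ACIDS)},
    {a: (i / 19) ** 2 for i, a in enumerate(AMINO_ACIDS)},
    {a: 1 - i / 19 for i, a in enumerate(AMINO_ACIDS)},
]


def amino_acid_composition(seq: str) -> List[float]:
    """Normalized occurrence frequency f_u of each of the 20 standard residues."""
    length = len(seq)
    return [seq.count(a) / length for a in AMINO_ACIDS]


def correlation_factor(seq: str, k: int, scales: List[Dict[str, float]]) -> float:
    """k-th tier correlation factor tau_k, averaged over all residue pairs (i, i+k)."""
    length = len(seq)
    total = 0.0
    for i in range(length - k):
        a, b = seq[i], seq[i + k]
        # J_{i,i+k}: mean squared difference over the Gamma = len(scales) functions.
        total += sum((s[b] - s[a]) ** 2 for s in scales) / len(scales)
    return total / (length - k)


def pseudo_aac(seq: str, scales: List[Dict[str, float]],
               lam: int = 17, w: float = 0.05) -> List[float]:
    """Return the 20 + lam component vector; w is an assumed weight factor."""
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lam residues")
    f = amino_acid_composition(seq)
    tau = [correlation_factor(seq, k, scales) for k in range(1, lam + 1)]
    denom = sum(f) + w * sum(tau)
    return [x / denom for x in f] + [w * t / denom for t in tau]


vector = pseudo_aac(AMINO_ACIDS * 2, TOY_SCALES)
print(len(vector))
```

With `lam = 17` the vector has 37 components, matching the feature count given in the text, and by construction the components sum to 1.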

System architecture and component of NN topology
The overall classification system consists of three layers of successive multilayer feed-forward (acyclic) artificial NNs (Fig. 1), each with a single hidden layer at which the computation takes place. All NNs share the following features:
1. There is full connectivity: every node in each network layer is connected to every node in the adjacent forward layer.
2. There is a small number of nodes in the hidden layer, which are responsible for the actual learning process carried out by each component network.
3. The activation function on each node is a nonlinear sigmoid (logistic) function of the weighted sum of the node's inputs (plus a constant bias).
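The three shared properties above (full connectivity, one small hidden layer, sigmoid activation with bias) can be illustrated with a minimal forward pass. This is a sketch under stated assumptions, not the authors' code: the hidden layer size of 5, the single output node, the weight range and the function names are all hypothetical choices for illustration.

```python
import math
import random
from typing import List

Layer = List[List[float]]  # one row per node: input weights followed by a bias


def sigmoid(x: float) -> float:
    """Logistic activation applied at every node."""
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_in: int, n_out: int, rng: random.Random, scale: float = 0.1) -> Layer:
    """Fully connected layer with small random weights; the last entry of each row is the bias."""
    return [[rng.uniform(-scale, scale) for _ in range(n_in + 1)] for _ in range(n_out)]


def layer_forward(layer: Layer, inputs: List[float]) -> List[float]:
    """Sigmoid of the weighted sum of the inputs plus bias, for every node in the layer."""
    outputs = []
    for row in layer:
        if len(row) - 1 != len(inputs):
            raise ValueError("input size does not match layer weights")
        weighted_sum = sum(w * x for w, x in zip(row[:-1], inputs)) + row[-1]
        outputs.append(sigmoid(weighted_sum))
    return outputs


def forward(hidden: Layer, output: Layer, features: List[float]) -> List[float]:
    """Single-hidden-layer feed-forward pass: input -> hidden -> output."""
    return layer_forward(output, layer_forward(hidden, features))


rng = random.Random(0)
hidden_layer = init_layer(37, 5, rng)   # 37 inputs, e.g. one PseAA vector; 5 hidden nodes (assumed)
output_layer = init_layer(5, 1, rng)    # one output node for a binary decision (assumed)
score = forward(hidden_layer, output_layer, [0.1] * 37)
print(score)
```

Each output lies strictly between 0 and 1, so a binary classifier such as NN1 could threshold it; the threshold itself is not specified in the text.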
NN1 is a binary classifier which classifies an input protein sequence as a transport protein or a non-transport protein. If the input protein sequence is classified as a transport protein, it is processed by NN2, which classifies it into one of the seven main classes of transport proteins (channels/pores, electrochemical potential-driven transporters, primary active transporters, group translocators, transmembrane electron carriers, accessory factors involved in transport and incompletely characterized transport systems). Each class (except electrochemical potential-driven transporters) has an independent NN [channels/pores (NN3), primary active transporters (NN4), group translocators (NN5), transmembrane electron carriers (NN6), accessory factors involved in transport (NN7) and incompletely characterized transport systems (NN8)] that classifies the input protein sequence into its functional sub-class.

We have used three categories of sequence-derived features for training the NNs: physicochemical properties, amino acid composition and pseudo amino acid composition. Using these parameters independently and in combination, we have developed seven neural network clusters: NN_AAcomp, NN_pseAA, NN_prop, NN_AAcomp+pseAA, NN_AAcomp+prop, NN_pseAA+prop, and NN_AAcomp+pseAA+prop. Before the learning process, all network synaptic weights are initialized to small random values, which are then optimized to their final values during learning by the backpropagation algorithm [29].

An important issue in the design of a NN classification system is the network's generalization, that is, its ability to give correct predictions when it is presented with unseen examples. With a small number of training samples and a relatively large number of synaptic weights, there is always the possibility that the network's free parameters will adapt to the special features of the training data (overfitting). A straightforward way to overcome this problem is to use a sufficient number of training examples (usually more than 30 times the number of adjustable network parameters). However, the protein classes are unbalanced and it is not possible to obtain that many examples. Therefore, to control overfitting in our application, we employed a nonconvergent criterion (the early stopping method): the training process is stopped before the optimization procedure has finished. We follow the common practice of withholding part of the training data (20%) as an internal validation set. Training is stopped at the point at which the classification error on the holdout subset begins to rise.
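The early stopping rule described above (hold out 20% of the training data and stop when the holdout error begins to rise) can be sketched as follows. The function names, the `patience` parameter and the callback interface are assumptions made for illustration; the text does not describe how the original training loop was organized.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_holdout(samples: Sequence[T], holdout_fraction: float = 0.2,
                  seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle the samples and withhold a fraction as an internal validation set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def train_with_early_stopping(train_epoch: Callable[[], None],
                              validation_error: Callable[[], float],
                              max_epochs: int = 1000,
                              patience: int = 1) -> Tuple[int, float]:
    """Run training epochs until the holdout error stops improving.

    Returns the epoch with the lowest holdout error and that error.
    An epoch whose error merely ties the best so far counts as no improvement.
    """
    best_error = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch in range(1, max_epochs + 1):
        train_epoch()                # one backpropagation pass over the training part
        error = validation_error()   # classification error on the holdout part
        if error < best_error:
            best_error, best_epoch = error, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, best_error


# Simulated holdout errors: improvement for three epochs, then a rise.
simulated_errors = iter([0.5, 0.4, 0.3, 0.35, 0.2])
result = train_with_early_stopping(lambda: None, lambda: next(simulated_errors))
print(result)
```

With `patience=1` the loop stops at the first epoch whose holdout error fails to improve, which is the closest reading of "begins to rise". In practice one would also keep a copy of the weights from the best epoch, a detail the text does not specify.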
In the prediction phase, as in the forward pass during learning, the network weights are fixed (at the values obtained at the end of the training process) and the NN is presented with an unknown example for classification. In the same hierarchical manner, the input signal propagates once in the forward direction, and the output value constitutes the network's decision based on the training examples already studied. The prediction accuracy of the models has been validated using self-consistency, jackknife and independent data set tests. For the jackknife test, we randomized the test set 100 times and recorded the average performance accuracy.
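The hierarchical prediction path can be summarized as a small dispatch routine. This is a hedged sketch of the control flow only: the classifiers are passed in as plain callables, the stub classifiers in the usage example are invented for illustration, and the function name `predict_hierarchy` is hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Features = List[float]
Prediction = Tuple[str, Optional[str], Optional[str]]


def predict_hierarchy(features: Features,
                      is_transporter: Callable[[Features], bool],
                      main_class_net: Callable[[Features], str],
                      subclass_nets: Dict[str, Callable[[Features], str]]) -> Prediction:
    """Top-down cascade: layer 1 (transporter or not), layer 2 (main class), layer 3 (sub-class)."""
    # Layer 1: binary decision; non-transporters leave the cascade here.
    if not is_transporter(features):
        return ("non-transporter", None, None)
    # Layer 2: one of the main functional classes.
    main_class = main_class_net(features)
    # Layer 3: only classes with a dedicated network get a sub-class prediction.
    sub_net = subclass_nets.get(main_class)
    sub_class = sub_net(features) if sub_net is not None else None
    return ("transporter", main_class, sub_class)


# Stub classifiers standing in for trained networks (illustrative only).
nn1 = lambda x: x[0] > 0.5
nn2 = lambda x: ("channels/pores" if x[1] > 0.5
                 else "electrochemical potential-driven transporters")
layer3 = {"channels/pores": lambda x: "1.A"}

print(predict_hierarchy([0.9, 0.9], nn1, nn2, layer3))
```

A main class without an entry in `subclass_nets`, as with electrochemical potential-driven transporters in the text, simply yields no sub-class prediction.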

RESULTS AND ANALYSIS
Neural networks have previously been used successfully for predicting the functional classes of proteins from sequence-derived structural and physicochemical properties, irrespective of sequence similarity [30][31][32]. However, transport proteins span a substantially more diverse spectrum of proteins than most other protein classes, which poses a more critical test for constructing a NN prediction system. To assess the performance of TpPred, we applied several tests. We created a new independent test set with well-characterized protein sequences from all levels of classes and sub-classes (Table 1) to evaluate the performance of the new integrated system. In addition, we performed a sub-sampling test (self-consistency test) and a jackknife test. These validation tests are commonly used for measuring the accuracy of a classifier [10][33][34][35]. Neural networks with combined features (especially the one combining all three types of features) tend to perform better than those using a single type of feature or fewer types of features.

Performance of 1st layer of neural network
The performance and validation results of NN1 are given in Table 3. The network achieved an overall accuracy of 97.3% and 88.4% for the training set and test set data, respectively, using the combination of sequence-derived features: amino acid composition, pseudo amino acid composition and physicochemical properties. With validation using an independent data set, the self-consistency test and the jackknife test, the overall accuracy of the 1st layer of TpPred is 85.2%, 88.0% and 81.4%, respectively. The details of the performance accuracy and validation results based on different types of sequence-derived features are presented in Table 5.

Performance of 2nd layer of neural network
The overall success rate in assigning the transport proteins to their seven major functional classes is 97.5% (using the training set) and 75.0% (using the test set) (Table 4). Similarly, the overall performance accuracy based on the three types of validation tests was found to be 79.8% (using the independent data set), 84.2% (using the self-consistency test) and 68.5% (using the jackknife test). The corresponding results by TpPred on the data set for the seven major classes of transport proteins using different types of sequence-derived features are given below.

The transport proteins are classified at two levels (TC class and TC subclass) as indicated by a specific TC number TC I.X. Here I = 1, ..., 9 represents each of the 9 TC classes, and X = A, B, C, D, E, ... represents each of the TC subclasses that belong to a TC class.

a Hydrophobicity values are from reference [27].
b Hydrophilicity values are from reference [28].

Table 3
Performance accuracy and validation results of 1st layer of TpPred based on combination of pseudo amino acid composition, amino acid composition and physicochemical properties.

CONCLUSION
From a practical point of view, the most important aspect of a prediction model is its ability to make correct predictions. To date, most of the available methods use the 3-D structure of the protein to predict and classify transport proteins. This is tedious and costly. The sequence of a protein is an important determinant of its detailed molecular function and should consequently also be useful for the prediction of transport proteins and their classes. Encouraging results have also been obtained using sequence-derived features. Therefore, a more accurate and reliable method would be one that predicts the transport proteins and their classes based on both strategies. The cascade of neural networks used in this study appears to be a potentially useful tool for the prediction of transport proteins of different classes. The prediction accuracy may be further enhanced by expanding our knowledge about transport proteins, particularly for the small transport classes, by a more refined representation of the structural and physicochemical properties of proteins, and by improving the prediction algorithms, for example through better treatment of imbalanced datasets.
www.ijcb.in

Table 1
Number of transport proteins according to their class and subclass used for training and validation of TpPred.

Table 2
Hydrophobicity, hydrophilicity and mass of side chain scales for 20 amino acids used in calculating pseudo amino acid composition (PseAA).

Table 4
Performance accuracy and validation results of 2nd layer of TpPred based on combination of pseudo amino acid composition, amino acid composition and physicochemical properties.

Table 5
Performance accuracy and validation results of 3rd layer of TpPred based on combination of pseudo amino acid composition, amino acid composition and physicochemical properties.