Abstract
Pregnane X receptor (PXR) regulates drug metabolism and is involved in drug-drug interactions. Prediction of PXR activators is important for evaluating drug metabolism and toxicity. Computational pharmacophore and quantitative structure-activity relationship models have been developed for predicting PXR activators. Because of the structural diversity of PXR activators, more effort is needed to explore methods applicable to a broader spectrum of compounds. We explored three machine learning methods (MLMs) for predicting PXR activators, which were trained and tested by using a significantly larger number of compounds (128 PXR activators, 98 of which activate human PXR, and 77 PXR nonactivators) than in previous studies. The recursive feature-selection method was used to select molecular descriptors relevant to PXR activator prediction, and the selected descriptors are consistent with conclusions from other computational and structural studies. In a 10-fold cross-validation test, our MLM systems correctly predicted 81.2 to 84.0% of PXR activators, 80.8 to 85.0% of hPXR activators, 61.2 to 70.3% of PXR nonactivators, and 67.7 to 73.6% of hPXR nonactivators. Our systems also correctly predicted 73.3 to 86.7% of 15 newly published hPXR activators. MLMs seem to be useful for predicting PXR activators and for providing clues to physicochemical features of PXR activation.
Pregnane X receptor (PXR) is a nuclear receptor known to be activated by structurally diverse xenobiotics and endogenous compounds (Lehmann et al., 1998; Jones et al., 2000; Ekins, 2004). PXR plays important roles in the metabolism of xenobiotics and drug-drug interactions by regulating the expression of metabolizing enzymes such as cytochrome P450 enzymes (CYP3A4, CYP2B6, and CYP2C8/9), and glutathione-S-transferases (Kliewer et al., 2002). It also regulates the expression of important drug transporters such as P-glycoprotein and multidrug resistance proteins (Ekins, 2004; Xie et al., 2004). Therefore, drugs capable of activating PXR may have significant impact on their own metabolism, transport, and interaction with other drugs. Identification of PXR activators is important for analyzing metabolism and pharmacokinetic profiles of drug candidates and for detecting potential drug-drug interactions.
Most of the drug metabolism prediction efforts have been directed at the development of tools for predicting cytochrome P450 substrates and inhibitors (Ekins et al., 2000; Doniger et al., 2002). However, significantly fewer works have been devoted to the development of tools for identifying PXR activators. So far, experimental high-throughput screening assays have been used for detecting PXR binding ligands (Jones et al., 2000), and computational pharmacophore (Ekins and Erickson, 2002; Schuster and Langer, 2005) and quantitative structure and activity relationship (QSAR) (Jacobs, 2004) models have been developed for predicting PXR activators. Because of the importance of PXR in drug metabolism and drug-drug interactions, more efforts are needed to explore additional methods for predicting a broader spectrum of PXR activators than those covered by existing studies.
We explored machine learning methods (MLMs) for predicting PXR and human PXR (hPXR) activators. PXRs show a high amount of sequence diversity in their ligand-binding domains (Moore et al., 2002), resulting in marked differences in ligand selectivity of PXRs across species, which is likely to have evolutionary significance in cross-species difference in adaptation to toxic compounds (Krasowski et al., 2005). Some compounds are known to activate mouse but not human PXR and vice versa. Therefore, it is more relevant to develop prediction systems for hPXR activators. Nonetheless, prediction systems for PXR and hPXR activators were developed in this work for facilitating the search of broader spectrum of activators, particularly those of species frequently used in drug toxicity tests.
MLMs have been used for predicting compounds of different pharmacological classes (Doniger et al., 2002; Xue et al., 2004b; Yap and Chen, 2005). The most widely used MLMs in these studies are support vector machines (SVMs) (Burges, 1998), probabilistic neural network (PNN) (Specht, 1990), and k nearest neighbor (k-NN) (Johnson and Wichern, 1982). These methods have consistently exhibited good prediction performance for compounds of diverse structures. Moreover, a feature selection method can be incorporated into these methods for selecting molecular descriptors most relevant to the prediction of compounds with specific pharmacological property (Xue et al., 2004a, b; Li et al., 2005a, b).
PXR activators are structurally diverse partly because PXR ligand binding domain is highly flexible (Watkins et al., 2001). Nonetheless, certain common physicochemical characteristics can be found at the binding site. For instance, the binding site is largely hydrophobic but contains a few polar residues capable of both donating and accepting hydrogen bonds (Watkins et al., 2001). These and other distinguished binding-site features probably define the common structural and physicochemical properties of the compounds that can bind and activate PXR, which can be exploited by using MLMs to distinguish PXR activators and nonactivators. Several molecular descriptors of PXR activators have been used for deriving QSAR (Jacobs, 2004) and pharmacophore models (Ekins and Erickson, 2002; Schuster and Langer, 2005). It is likely that not all of the molecular descriptors related to PXR activation have been included in previous studies because of the limited coverage of compounds and the number of other relevant descriptors. Therefore, feature selection methods (Xue et al., 2004a, b; Li et al., 2005a, b) may be applied for finding additional molecular descriptors relevant to PXR activation. The use of a higher number of relevant molecular descriptors also serves to improve the performance of MLMs.
In this work, PXR and hPXR activator prediction systems were developed by using SVM, PNN, and k-NN, which were trained and tested by using a significantly larger number of compounds than those used in previous studies. A comprehensive literature search was conducted to collect a diverse set of literature-reported PXR activators and nonactivators. A popular feature selection method, recursive feature elimination (RFE) (Guyon et al., 2002; Xue et al., 2004a, b; Li et al., 2005a, b), was used to extract molecular descriptors associated with PXR activation. The performance of these systems was tested by using 10-fold cross-validation and an independent set of 15 newly published experimental PXR activators (Lemaire et al., 2006).
Materials and Methods
Collection of PXR Activators and Nonactivators
Figure 1 illustrates the procedure for searching and selecting PXR activators, hPXR activators, and the corresponding nonactivators. PXR activators were selected based on the criterion that they have been reported to show potent activation to at least one PXR ortholog regardless of its effect on other PXR orthologs. A total of 128 PXR activators were collected from literature, which were used as the activator data set for predicting PXR activators irrespective of host species. There are 98 PXR activators reported to activate hPXR, which were used as the activator data set for predicting hPXR activators. The first data set is of higher statistical significance because of the higher number of compounds included. Compared with the largest data set of 53 compounds used in the previous studies (Ekins and Erickson, 2002; Jacobs, 2004; Schuster and Langer, 2005), our data sets contain a significantly higher number of compounds and are more diverse in structures, as shown by the computed structural diversity index as will be described.
PXR nonactivators include known PXR antagonists and PXR non-binders reported in the literature. Moreover, compounds explicitly reported not to activate PXR-regulated gene expression of CYP3A4 were further considered to be implicated PXR nonactivators if they satisfied the additional criterion that they have not been reported to induce the expression of other PXR-regulated drug-metabolizing enzyme genes such as CYP2B6 and CYP2C8/9. These PXR nonactivators and implicated PXR nonactivators were used as the nonactivator data set for predicting PXR activators irrespective of host species. The hPXR nonactivator data set includes all compounds in the PXR nonactivator data set plus known nonhuman PXR activators.
The 2D and 3D structures of each compound were generated by using ChemDraw (http://www.cambridgesoft.com/) and DS Viewer-Pro 5.0 (http://www.accelrys.com/), respectively, and geometry optimization was then performed. The optimized 3D structure of each compound was manually inspected to ensure that the chirality of each chiral agent was properly generated and consistent with that described in the literature. For compounds with transactivation activities but without a reported active enantiomer, the default enantiomer structure in a chemical database such as PubChem (http://pubchem.ncbi.nlm.nih.gov/) or ChemFinder (http://www.chemfinder.com/) was used directly.
Determination of Structural Diversity
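Structural diversity of a collection of compounds can be measured by using the diversity index (DI) value, which is the average value of the similarity between pairs of compounds in a data set (Perez, 2005):
$$\mathrm{DI} = \frac{\sum_{i=1}^{N} \sum_{j=1,\, j \neq i}^{N} \mathrm{sim}(i,j)}{N(N-1)}$$
where sim(i,j) is a measure of the similarity between compounds i and j, and N is the number of compounds in the data set. The structural diversity of a data set increases with decreasing DI value. In this work, sim(i,j) is computed by using the Tanimoto coefficient (Willett et al., 1998):
$$\mathrm{sim}(i,j) = \frac{\sum_{d=1}^{l} x_{di}\, x_{dj}}{\sum_{d=1}^{l} x_{di}^{2} + \sum_{d=1}^{l} x_{dj}^{2} - \sum_{d=1}^{l} x_{di}\, x_{dj}}$$
where l is the number of descriptors computed for the molecules in the data set, and $x_{di}$ is the value of descriptor d for compound i.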
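As an illustration of the diversity index, the following minimal Python sketch averages the Tanimoto similarity over all ordered pairs of distinct compounds. The function names and the toy descriptor vectors are hypothetical and are not taken from this work; real descriptor vectors would have one entry per computed descriptor.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two real-valued descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a)
    norm_b = sum(y * y for y in b)
    denom = norm_a + norm_b - dot
    # Assumption: two all-zero vectors are treated as identical.
    return dot / denom if denom else 1.0


def diversity_index(vectors):
    """Average pairwise similarity over all ordered pairs i != j."""
    n = len(vectors)
    if n < 2:
        raise ValueError("need at least two compounds")
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += tanimoto(vectors[i], vectors[j])
    return total / (n * (n - 1))


# Hypothetical toy data: two identical compounds and one orthogonal compound.
toy_vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_di = diversity_index(toy_vectors)
```

A lower value of `toy_di` would indicate a more diverse set, in line with the statement that diversity increases with decreasing DI.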
Construction of Training and Testing Sets
PXR and hPXR activators and nonactivators were divided into training and testing sets in a manner suitable for conducting a 10-fold cross-validation study. For instance, the 128 PXR activators and 77 PXR nonactivators were each randomly divided into 10 subsets of approximately equal size. For both activators and nonactivators, nine of the subsets were used as the training set, and the remaining subset was used as the testing set. This process was repeated 10 times such that every subset was used as the testing set once. The same procedure was applied to the 98 hPXR activators and 77 hPXR nonactivators for constructing the training and testing sets of the hPXR activator prediction systems. An additional set of 15 experimentally determined PXR activators (14 of which are structurally dissimilar to the compounds in our data set by visual inspection) obtained from a newly published article (Lemaire et al., 2006) was used as the independent set for further evaluation of the performance of our prediction systems.
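The splitting procedure can be sketched as follows. This is only an illustration under stated assumptions: the compound identifiers are placeholders, the random seed is arbitrary, and the helper names are hypothetical. Activators and nonactivators are shuffled and partitioned separately, so each fold keeps roughly the same class ratio.

```python
import random


def split_into_folds(items, n_folds=10, seed=0):
    """Shuffle the items and deal them into n_folds subsets of near-equal size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]


def cross_validation_sets(activators, nonactivators, n_folds=10, seed=0):
    """Yield (training set, testing set) pairs; each item is (compound, label)."""
    act_folds = split_into_folds(activators, n_folds, seed)
    non_folds = split_into_folds(nonactivators, n_folds, seed + 1)
    for k in range(n_folds):
        test = [(c, 1) for c in act_folds[k]] + [(c, -1) for c in non_folds[k]]
        train = []
        for m in range(n_folds):
            if m != k:
                train += [(c, 1) for c in act_folds[m]]
                train += [(c, -1) for c in non_folds[m]]
        yield train, test


# Placeholder identifiers standing in for the 128 activators and 77 nonactivators.
activators = [f"act_{i}" for i in range(128)]
nonactivators = [f"non_{i}" for i in range(77)]
folds = list(cross_validation_sets(activators, nonactivators))
```

Every compound appears in exactly one testing set and in the nine other training sets.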
Molecular Descriptors
Molecular descriptors are quantitative representations of structural and physicochemical features of molecules, which have been extensively used in the structure-activity relationship (Fang et al., 2001), QSAR (Jacobs, 2004) and other machine learning studies of pharmaceutical agents (Doniger et al., 2002; Zernov et al., 2003; Xue et al., 2004b; Yap and Chen, 2004). A total of 199 molecular descriptors were used in this work. These descriptors were selected from more than 1000 descriptors described in the literature by eliminating those descriptors that are obviously redundant or unrelated to the prediction of pharmaceutical agents (Xue et al., 2004b; Li et al., 2005b). The resulting 199 molecular descriptors include 18 descriptors in the class of simple molecular properties, 28 descriptors in the class of molecular connectivity and shape, 97 descriptors in the class of electrotopological state, 31 descriptors in the class of quantum chemical properties, and 25 descriptors in the class of geometrical properties. They were computed from the 3D structure of each compound by using our own designed molecular descriptor computing program. A feature selection method, recursive feature elimination (described below), was used for eliminating those descriptors that are redundant or have no significant contribution to PXR activator prediction (Guyon et al., 2002).
Feature Selection Method
The RFE method (Guyon et al., 2002) was used in this work as the feature selection method for selecting molecular descriptors associated with PXR activation. RFE has gained popularity due to its effectiveness for improving prediction performance and for discovering informative features associated with drug activity (Guyon et al., 2002), pharmacokinetic, and toxicological properties (Xue et al., 2004a, b). Each of the compounds studied is represented by a vector xi, with its molecular descriptors (or features) as its components. The task of selecting appropriate molecular descriptors to a particular compound classification problem can be conducted by ranking and selecting those with meaningful contributions to the classification of the studied compounds.
Descriptor ranking in RFE is based on the magnitude of the change of an objective function of an MLM model upon removing each descriptor (which roughly measures the extent of contribution of each feature to the prediction capability of the model) (Kohavi and John, 1997). A greater change in the objective function indicates a larger effect on the prediction capability of the model, and the corresponding descriptor is therefore ranked higher. To improve the efficiency of training, this objective function is represented by a cost function J computed from the training set only. When a given feature is removed or its weight is brought to 0, the change $\Delta J(i)$ in the cost function J is computed by $\Delta J(i) = \frac{1}{2} \frac{\partial^2 J}{\partial w_i^2} (\Delta w_i)^2$, where $w_i$ is the weight of feature i, and the change in weight $\Delta w_i = w_i$ corresponds to the removal of descriptor i. One or more of the descriptors with the smallest $\Delta J(i)$ can be eliminated in each iteration (Guyon et al., 2002).
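For a linear model whose cost function is $J = \frac{1}{2}\|w\|^2$, the second derivative with respect to each weight is 1, so the ranking criterion reduces to $\Delta J(i) = \frac{1}{2} w_i^2$. The sketch below shows the elimination loop under that assumption. The weight estimator used here is a deliberately simple stand-in (difference of class means), not an SVM and not the program used in this work; the function names and toy data are hypothetical.

```python
def mean_difference_weights(X, y, active):
    """Stand-in linear model: w_i = class +1 mean minus class -1 mean of feature i."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == -1]
    weights = {}
    for i in active:
        mean_pos = sum(r[i] for r in pos) / len(pos)
        mean_neg = sum(r[i] for r in neg) / len(neg)
        weights[i] = mean_pos - mean_neg
    return weights


def rfe(X, y, n_keep, fit_weights=mean_difference_weights):
    """Recursive feature elimination: refit, rank by 0.5 * w_i**2, drop the lowest."""
    active = list(range(len(X[0])))
    eliminated = []
    while len(active) > n_keep:
        w = fit_weights(X, y, active)                 # retrain on surviving features
        delta_j = {i: 0.5 * w[i] ** 2 for i in active}
        worst = min(active, key=lambda i: delta_j[i])
        active.remove(worst)
        eliminated.append(worst)
    return active, eliminated


# Hypothetical toy data: feature 0 separates the classes, feature 2 is constant.
X_toy = [[5.0, 1.0, 0.3], [4.0, 1.2, 0.3], [1.0, 1.0, 0.3], [0.0, 0.8, 0.3]]
y_toy = [1, 1, -1, -1]
kept, dropped = rfe(X_toy, y_toy, n_keep=1)
```

With a real SVM passed as `fit_weights`, the weights would change after each removal, which is why the model is refitted inside the loop rather than ranked once.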
Machine Learning Methods
SVM. SVM is illustrated in Fig. 2. A linear SVM constructs a hyperplane separating two different classes of feature vectors with a maximum margin (Vapnik, 1995). This hyperplane is constructed by finding a vector w and a parameter b that minimize $\|w\|^2$ subject to the following conditions: $w \cdot x_i + b \geq +1$ for $y_i = +1$ (PXR activators as positive class) and $w \cdot x_i + b \leq -1$ for $y_i = -1$ (PXR nonactivators as negative class). Here $x_i$ is a feature vector, $y_i$ is the group index, w is a vector normal to the hyperplane, $|b|/\|w\|$ is the perpendicular distance from the hyperplane to the origin, and $\|w\|$ is the Euclidean norm of w. A nonlinear SVM projects feature vectors into a high-dimensional feature space by using a kernel function such as the Gaussian kernel $K(x_i, x_j) = e^{-\|x_j - x_i\|^2 / 2\sigma^2}$. The linear SVM procedure is then applied to the feature vectors in this feature space. After the determination of w and b, a given vector x can be classified by using $\mathrm{sign}[(w \cdot x) + b]$; a positive or negative value indicates that the vector x belongs to the positive or negative class, respectively.
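The decision rule and the margin conditions can be checked numerically with a short sketch. The weight vector, offset, and data points below are hypothetical values chosen by hand, not a trained model, and the Gaussian kernel is shown only as one common choice of kernel function.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def decision_value(w, b, x):
    """Signed score w . x + b of a feature vector x."""
    return dot(w, x) + b


def classify(w, b, x):
    """+1 (activator) if the score is non-negative, otherwise -1 (nonactivator)."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def satisfies_margin(w, b, X, y):
    """Check y_i * (w . x_i + b) >= 1 for every training vector."""
    return all(label * decision_value(w, b, x) >= 1 for x, label in zip(X, y))


def margin_width(w):
    """Distance between the two margin hyperplanes, 2 / ||w||."""
    return 2.0 / math.sqrt(dot(w, w))


def gaussian_kernel(xi, xj, sigma):
    """Gaussian kernel exp(-||xj - xi||^2 / (2 sigma^2)); assumed example kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


# Hand-picked hyperplane x_1 = 2 with one point on each margin.
w_toy, b_toy = [1.0, 0.0], -2.0
X_toy = [[3.0, 0.0], [1.0, 5.0]]
y_toy = [1, -1]
```

Here `satisfies_margin(w_toy, b_toy, X_toy, y_toy)` holds with equality for both points, so both would act as support vectors for this hyperplane.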
k-NN. k-NN is illustrated in Fig. 3. k-NN measures the Euclidean distance between a to-be-classified vector x and each individual vector $x_i$ in the training set (Johnson and Wichern, 1982). The Euclidean distances for the vector pairs are calculated using the following formula:
$$D = \sqrt{\|x - x_i\|^2}$$
The k vectors nearest to the vector x are used to determine its class, f(x):
$$\hat{f}(x) \leftarrow \underset{v \in V}{\arg\max} \sum_{i=1}^{k} \delta(v, f(x_i))$$
where $\delta(a,b) = 1$ if $a = b$ and $\delta(a,b) = 0$ if $a \neq b$, argmax returns the element v of V that maximizes the sum, V is the finite set of classes $\{v_1, \ldots, v_s\}$, and $\hat{f}(x)$ is an estimate of f(x). Here, the estimate refers to the class of the majority of the k nearest neighbors.
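A minimal sketch of this majority-vote rule is given below. The training vectors and labels are hypothetical toy values, and the function names are illustrative only.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def knn_predict(train_X, train_y, x, k):
    """Return the majority class among the k training vectors nearest to x."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], x))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training set: -1 = nonactivator, +1 = activator.
train_X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
train_y = [-1, -1, 1, 1, 1]
```

With two classes, an odd k avoids tied votes; k = 1 simply assigns the class of the single nearest training compound.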
PNN. As illustrated in Fig. 4, PNN is a form of neural network designed for classification through the use of Bayes' optimal decision rule (Specht, 1990):
$$h_i c_i f_i(x) > h_j c_j f_j(x)$$
where $h_i$ and $h_j$ are the prior probabilities, $c_i$ and $c_j$ are the costs of misclassification, and $f_i(x)$ and $f_j(x)$ are the probability density functions for classes i and j, respectively. An unknown vector x is classified into population i if the product of all three terms is greater for class i than for any other class j (not equal to i). In most applications, the prior probabilities and costs of misclassification are treated as being equal. The probability density function for each class for a univariate case can be estimated by using Parzen's nonparametric estimator
$$g(x) = \frac{1}{n\sigma} \sum_{i=1}^{n} W\!\left(\frac{x - x_i}{\sigma}\right)$$
where n is the sample size, σ is a scaling parameter defining the width of the bell curve that surrounds each sample point, W(d) is a weight function that has its largest value at d = 0, and $(x - x_i)$ is the distance between the unknown vector and a vector in the training set. Parzen's nonparametric estimator was later expanded by Cacoullos for the multivariate case:
$$g(x_1, \ldots, x_p) = \frac{1}{n \sigma_1 \cdots \sigma_p} \sum_{i=1}^{n} W\!\left(\frac{x_1 - x_{1,i}}{\sigma_1}, \ldots, \frac{x_p - x_{p,i}}{\sigma_p}\right)$$
The Gaussian function is frequently used as the weight function because it is well-behaved, easily calculated, and satisfies the conditions required by Parzen's estimator. Thus the probability density function for the multivariate case becomes
$$g(x) = \frac{1}{n} \sum_{i=1}^{n} \exp\!\left(-\sum_{j=1}^{p} \left(\frac{x_j - x_{j,i}}{\sigma_j}\right)^{2}\right)$$
where p is the number of descriptors and $x_{j,i}$ is the value of descriptor j for training compound i.
The network architecture of a PNN is determined by the number of compounds and descriptors in the training set. There are four layers in a PNN. The input layer provides input values to all neurons in the pattern layer and has as many neurons as the number of descriptors in the training set. The number of pattern neurons is determined by the total number of compounds in the training set. Each pattern neuron computes a distance measure between the input and the training case represented by that neuron and then subjects the distance measure to Parzen's nonparametric estimator. The summation layer has a neuron for each class, and each of these neurons sums the outputs of all pattern neurons corresponding to members of its class to obtain the estimated probability density function for that class. The single neuron in the output layer then estimates the class of the unknown vector x by comparing the probability density functions from the summation neurons and choosing the class with the highest value.
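The four layers can be mirrored in a few lines of code. The sketch below assumes equal prior probabilities and misclassification costs, a Gaussian weight function, and a single smoothing parameter σ shared by all descriptors; the toy data and function names are hypothetical.

```python
import math


def class_density(x, samples, sigma):
    """Pattern + summation layers: Gaussian Parzen estimate for one class."""
    total = 0.0
    for s in samples:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, s))
        total += math.exp(-sq_dist / sigma ** 2)   # one pattern neuron per compound
    return total / len(samples)


def pnn_predict(x, train_X, train_y, sigma):
    """Output layer: choose the class with the largest estimated density."""
    by_class = {}
    for row, label in zip(train_X, train_y):
        by_class.setdefault(label, []).append(row)
    densities = {label: class_density(x, rows, sigma)
                 for label, rows in by_class.items()}
    best = max(densities, key=densities.get)
    return best, densities


# Hypothetical toy training set: -1 = nonactivator, +1 = activator.
train_X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
train_y = [-1, -1, 1, 1, 1]
label_a, _ = pnn_predict([0.1, 0.1], train_X, train_y, sigma=0.3)
label_b, _ = pnn_predict([5.5, 5.5], train_X, train_y, sigma=0.3)
```

A small σ makes each training compound influence only its immediate neighborhood, whereas a large σ smooths the density estimate across the whole class.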
Evaluation of Prediction Performance
As in the case of all discriminative methods (Baldi et al., 2000), the performance of MLMs can be evaluated by the quantity of true positives (TP; true PXR activators), true negatives (TN; true nonactivators), false positives (FP; false PXR activators), and false negatives (FN; false nonactivators). Sensitivity [SE = TP/(TP + FN)] and specificity [SP = TN/(TN + FP)] are the prediction accuracy for PXR activators and nonactivators, respectively. The overall prediction accuracy (Q) and Matthews' correlation coefficient (C) (Matthews, 1975) are used to measure the overall prediction performance:
$$Q = \frac{TP + TN}{TP + TN + FP + FN}$$
$$C = \frac{TP \times TN - FN \times FP}{\sqrt{(TP + FN)(TP + FP)(TN + FN)(TN + FP)}}$$
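These four measures follow directly from the confusion-matrix counts, as the short sketch below shows. The counts used in the example are made-up toy numbers, not results from this study, and the function name is hypothetical.

```python
import math


def performance(tp, tn, fp, fn):
    """Sensitivity, specificity, overall accuracy, and Matthews correlation."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    q = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    # Assumption: C is reported as 0 when the denominator vanishes.
    c = (tp * tn - fn * fp) / denom if denom else 0.0
    return {"SE": se, "SP": sp, "Q": q, "C": c}


# Made-up counts: 10 activators (8 found) and 10 nonactivators (6 found).
toy = performance(tp=8, tn=6, fp=4, fn=2)
```

Unlike Q, the coefficient C stays near 0 for a classifier that assigns every compound to the majority class, which makes it more informative when the two classes differ in size.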
Computational Parameters and Performance Evaluation
There is only one parameter to be optimized in training each of the SVM, k-NN, and PNN classification systems. The classification speed of these MLM-based prediction systems is in the order of a few thousands to hundreds of thousands of compounds per second (Li et al., 2005a). The classification speed of SVM is usually 25 to 55% faster than that of k-NN and PNN because SVM typically uses 45 to 75% of the training set as support vectors for classification, whereas k-NN and PNN use the whole training set.
MLMs generally require a sufficient number of samples to develop a classification system. Irrelevant molecular descriptors may reduce the performance of these classification systems (Kohavi and John, 1997; Xue et al., 2004a, b; Li et al., 2005a). SVM has been found to be the least sensitive to data over-fitting, even in cases when a large number of redundant and overlapping molecular descriptors are used (Vapnik, 1995). This is because SVM is based on the structural risk minimization principle, which minimizes both training error and generalization error simultaneously.
SVM, k-NN, and PNN do not explicitly provide information about the importance of each molecular descriptor. For SVM, this problem is further compounded when a kernel function is used because there is no simple method to inversely map the solution back into the input space. Feature selection methods (Li et al., 2005b; Yap and Chen, 2005) and regression methods (Yap and Chen, 2004) have frequently been used for extracting important molecular descriptors from these machine learning-based prediction systems.
Results
Promiscuous Nature of PXR Activator Structures and the Selected Molecular Descriptors for Classifying PXR Activators. Table 1 gives the computed DI value of PXR activators and those of several groups of compounds possessing various activities or properties. PXR activators are structurally more diverse not only than some of the well-known promiscuous binder groups such as estrogen receptor agonists and P-glycoprotein substrates, but also than some of the compound groups involved in multiple mechanisms such as human intestine absorbing agents. Figure 5 shows the structures of selected PXR activators, which are indicative of the extent of structural diversity of PXR activators. The DI value of our data set is 0.535, which is smaller than the value of 0.605 for the largest data set used in other studies of PXR activators (Schuster and Langer, 2005). Therefore, our data set is structurally more diverse than those of other studies of PXR activators.
A total of 83 molecular descriptors, listed in Supplementary Table S2, were selected by the RFE method from a set of 199 molecular descriptors. These descriptors include simple molecular descriptions such as count of atom types (nhyd, nhal, nhet, ncocl, nnitro), ring (nring) and rotatable bonds (nrot), molecular connectivity and geometry (3χC, 4χPC, 5χCH, 6χCH, 1χv, 2χv, 3χvP, 3χvC, 4χvPC, 6χvCH, dis1, dis2, dis3, etc.), molecular flexibility (φ), electrotopological states or Estates [S car, S het, S hal, S(1), S(5), S(12), S(13), S(16), S(18), Tcent, Tradi, Tdiam, Tiwie, etc.], molecular surface area (polar molecular surface area, Sapc, Sanc, Sapcw, Sancw, Svpc, etc.), molecular shape (Rugty, Gloty), hydrophobicity (Shpl, Shpb, Hiwpl, Hiwpb, Hiwpa), and quantum chemical descriptors (ϵa, ϵb, μ, η, SN, IP, A, μ cp, χ en, ω, etc.).
Some features of these RFE-selected descriptors such as hydrophobicity, hydrogen bond acceptors, molecular globularity, and some Volsurf descriptors are also consistent with the structural features or descriptors described or used in the previous pharmacophore and QSAR studies of PXR activators. Pharmacophore models have shown that hydrophobic and hydrogen bond acceptors (HBAs) are important features for PXR activators (Ekins and Erickson, 2002; Schuster and Langer, 2005). In a QSAR study (Jacobs, 2004), hydrogen bond acceptors, dispersion forces, molecular globularity, and some VolSurf descriptors were found to be the key positive correlated variables for constructing the PXR QSAR model for predicting PXR activators.
The number of selected descriptors in this study is substantially larger than the 22 to 39 molecular descriptors selected in the prediction of compounds of various other drug activities or properties (Xue et al., 2004a, b; Li et al., 2005a, b). An examination of the selected descriptors shows that most of the “extra” set of descriptors is from the electrotopological, connectivity, and quantum chemical classes. As shown in Fig. 5, apart from the usual chemical structures, a substantial number of PXR activators contain highly complex multiaromatic rings, or highly flexible chain-like structures, or halogen-rich structures. These structural features coupled with highly diverse structural frameworks are probably the primary reasons for the need of the “extra” set of electrotopological, connectivity, and quantum mechanical descriptors in distinguishing PXR activators.
Performance of MLMs for Predicting PXR Activators. Table 2 gives the prediction performance of the three MLMs, with and without the use of the RFE feature-selection method, for predicting PXR and hPXR activators and nonactivators based on a 10-fold cross-validation study. The parameters of the PXR SVM, k-NN, and PNN systems are σ = 1, k = 1, and σ = 0.3, respectively. Those of the hPXR systems are σ = 1, k = 3, and σ = 0.2, respectively. The use of the RFE feature-selection method improves the overall prediction accuracy of the PXR MLM systems from the range of 72.6 to 74.0% to the range of 75.4 to 77.4%, and that of the hPXR systems from the range of 72.5 to 74.9% to the range of 75.0 to 79.6%. All of the MLM systems seem to show good performance. When considering overall prediction accuracies, PNN and SVM perform better than k-NN.
Our classification systems were further evaluated by using 15 newly published hPXR activators (Lemaire et al., 2006) whose structures are shown in Fig. 6. These include five herbicides (pretilachlor, metolachlor, oxadiazon, alachlor, and isoproturon), six fungicides (bupirimate, fenarimol, propiconazole, fenbuconazole, prochloraz, and imazalil), and four insecticides (toxaphene, permethrin, fipronil, and diflubenzuron). As shown in Table 3, 86.7, 73.3, and 73.3% of these activators were correctly predicted by the SVM, PNN, and k-NN PXR prediction systems, and 66.7, 66.7, and 53.3% were correctly predicted by the corresponding hPXR prediction systems, respectively. One possible reason for the lower accuracies of the hPXR systems is that they were trained by using compounds structurally more different from the newly published hPXR activators than some PXR activators in the training set of the PXR prediction systems. As shown in Supplementary Table S3, the 15 newly published hPXR activators are closer, in Euclidean distance, to the 28 PXR activators outside the hPXR data set than to the 98 hPXR activators. One activator, fenbuconazole, was incorrectly predicted by all of our PXR and hPXR systems. One possible reason for misclassifying this compound is that it contains a cyano group (-C≡N) that may not be adequately represented by existing molecular descriptors.
Discussion
Our selected descriptors are consistent with the molecular binding features derived from the study of the binding site of the ligand-free and drug-bound PXR receptor structures (Watkins et al., 2001). It has been reported (Watkins et al., 2001) that molecular flexibility, surface area, geometry, and connectivity are important for characterizing molecular recognition between the PXR ligand-binding site and activators. The solved crystal structure of the hPXR ligand-binding domain shows high mobility and flexibility in a largely hydrophobic site that incorporates a few polar residues capable of forming hydrogen bonds with a binding ligand (Watkins et al., 2001, 2003a; Chrencik et al., 2005). Hydrogen bonds are important in determining the specificity of molecular recognition. Upon binding to the PXR ligand-binding site, a PXR activator adopts a specific orientation stabilized by hydrogen bonds and causes a conformational change of the PXR ligand-binding domain that recruits coactivators. On the other hand, connectivity is important not only for discriminating active from inactive analogs but also for representing important molecular topological features involved in PXR activation. Moreover, electrotopological states, hydrophobicity, and quantum chemical descriptors describe the polarity and charge of molecules that contribute to hydrogen bonding, polar, and salt-bridge interactions between PXR activators and the amino acid residues in the ligand-binding cavity of PXR.
PXR activators generally show a higher content of halogen atoms, especially chlorine atoms, than nonactivators, as can be seen from the higher mean values of halogen atom count (nhal) (1.16 versus 0.80), chlorine atom count (ncocl) (1.02 versus 0.27), and atom-type Estate sum for chlorine S(60) (6.33 versus 1.63). Moreover, PXR activators contain fewer nitrogen atoms (nnitro) than nonactivators (0.80 versus 1.79), and have lower values of several descriptors including the mean values of atom-type electrotopological state (Estate) sum for >NH, S(5) (0 versus 0.45); atom-type Estate sum for =N-, S(34) (0.31 versus 1.15); atom-type Estate sum for >N-, S(36) (0.15 versus 0.94); and atom-type Estate sum for -N≪, S(37) (0.05 versus 0.41). In addition, polar interactions and salt bridges between PXR ligand-binding domain residues and activators, and π-π stacking between aromatic rings of activators and the ligand-binding domain, are also important for PXR activation. The descriptors for sums of solvent-accessible surface areas of positively charged atoms (Sapc, Sapcw, Svpc), negatively charged atoms (Sanc, Sancw), and ionization potential (IP) are associated with salt-bridge interactions. Those of atom-type Estate sum for CHn unsaturated, S(13), and atom-type Estate sum for :CH: aromatic, S(21), are relevant to π-π stacking.
Although PXR activators generally contain fewer hydrogen bond donors (ϵa) and acceptors (ϵb) than nonactivators, hydrogen bonding nonetheless plays a role in activator binding to PXR. It was found that, on average, PXR activators have a higher number of HBAs (ϵb) than hydrogen bond donors (ϵa), which is consistent with the results from QSAR and pharmacophore studies (Ekins and Erickson, 2002; Jacobs, 2004; Schuster and Langer, 2005). A higher number of HBAs for PXR activators may result from the existence of the hydrogen bond donor-containing residues His327, His407, and Arg410 in the interior region of the PXR ligand-binding site. These features are captured by the RFE-selected descriptors Svpcw and Svncw for the sum of weighted van der Waals surface areas of positive and negative atoms, respectively. For PXR activators, the mean value of Svncw (36.25 versus 21.84) is larger than that of Svpcw, suggesting that charge complementarity between activators and the PXR ligand-binding site may contribute to entry into the binding site and to stable binding.
The computed mean values for the number of rotatable bonds (nrot) (4.45 versus 5.99), Kier molecular flexibility index (φ) (5.84 versus 6.24), and polar molecular surface area (61.90 versus 69.53) of PXR activators are smaller than those of nonactivators, which is consistent with the view that PXR activators are generally smaller in size (Handschin and Meyer, 2005). The smaller size and lower number of rotatable bonds enable better access to the ligand-binding site. Although how a ligand gains access to the PXR ligand-binding cavity remains unclear, it has been hypothesized that the flexible α2 (residues 192-205) that is unique to PXR may be a critical component of the ligand entry and exit site (Watkins et al., 2003a). The flexible region may operate like a trapdoor, allowing ligands to enter the center of the ligand-binding site. In addition, Leu209, located near the C terminus of α2, shifted in position by up to 7.7 Å when bound by different ligands (Watkins et al., 2003b). Binding by coactivators further stabilizes the bound orientations of ligands. Taken together, the large and flexible ligand-binding pocket of PXR explains the promiscuous nature of PXR in binding a variety of endogenous and xenobiotic compounds.
Although some aspects of activator binding to PXR can be revealed by analyzing the selected descriptors, these descriptors are quantitative representations of structural and physicochemical features. Therefore, analysis of these descriptors without consideration of the receptor site structure is insufficient for providing a molecular-level picture of the connection between a descriptor and the predicted activity. In the protein 3D structure database PDB (http://www.rcsb.org/pdb/Welcome.do), there are four entries of ligand-bound PXR structures. Analysis of some of these structures provides useful information about the atomic-level interactions represented by some of our selected descriptors. Figures 7 and 8 show the binding site structure of PXR bound by the activators SR12813 (Watkins et al., 2001, 2003a; Chrencik et al., 2005) and hyperforin (Watkins et al., 2003b), respectively. Both activators form hydrophobic contacts with two hydrophobic residues, and they form hydrogen bonds with two polar residues. These structures provide a clear molecular picture of the connection between our selected hydrophobic and hydrogen bond descriptors and activator binding to PXR.
k-NN is based on a nearest neighbor algorithm that works best when activators and nonactivators tend to cluster in different regions or pockets of chemical space. SVM and PNN are based on nonlinear algorithms that are generally effective for all types of distributions. SVM has fewer parameters than PNN, which makes it easier to derive an optimal prediction system. MLMs are subject to some degree of error because of such factors as data-set quality and the inherent limitation in predicting biological activities solely on the basis of structure-derived molecular descriptors.
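The difference between a neighbor vote and a kernel density estimate can be illustrated with a minimal sketch. The code below is not the implementation used in this study; the two-descriptor data points, the function names, and the values of k and sigma are hypothetical and chosen only for illustration. The k-NN function takes a majority vote among the closest training compounds, whereas the PNN-style function assigns the class with the largest mean Gaussian kernel density at the query point.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training points closest to the query."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def pnn_predict(train_X, train_y, query, sigma=0.5):
    """Class with the largest mean Gaussian kernel density at the query.

    sigma is the single smoothing parameter (an assumed value here).
    """
    sums = {}
    counts = Counter(train_y)
    for x, label in zip(train_X, train_y):
        d2 = sum((a - b) ** 2 for a, b in zip(x, query))
        sums[label] = sums.get(label, 0.0) + math.exp(-d2 / (2 * sigma ** 2))
    return max(sums, key=lambda label: sums[label] / counts[label])


# Made-up, well-separated toy data: two scaled descriptors per compound.
X = [(0.1, 0.2), (0.2, 0.1), (0.0, 0.0),
     (1.0, 1.1), (0.9, 1.0), (1.1, 0.9)]
y = ["activator"] * 3 + ["nonactivator"] * 3

print(knn_predict(X, y, (0.15, 0.15)))
print(pnn_predict(X, y, (0.95, 1.0)))
```

With clusters this well separated, both methods agree. The two approaches are expected to diverge mainly when the classes interleave in descriptor space, which is the situation described above in which k-NN is less well suited.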
From a chemistry point of view, the molecular structure of a compound is the key to understanding its physicochemical properties and ultimately its biological activity and physiological effect (Johnson and Maggiora, 1990). Although hydrophobic interactions and hydrogen bonding are known to play important roles in molecular recognition, from ligand-protein and protein-protein interactions up to macromolecular assemblies, there are many ways to describe these interactions chemically, as expressed by various molecular descriptors. Which descriptions are more relevant to a given activity, however, has to be further characterized by various means, such as the feature selection methods used with machine learning methods.
Current representations of molecular physicochemical properties by molecular descriptors are still far from complete. Further refinement to develop a more sophisticated set of molecular descriptors is an important task. Moreover, it is essential to include more PXR activators and nonactivators from future experimental work. In this study, we used a set of 199 molecular descriptors. As the data set grows, we believe that a more complete set of molecular descriptors will be required. Furthermore, the biological activity of a compound is an induced response that is influenced by numerous factors dictated by many levels of biological complexity. The relationship between structure and activity is thus more implicit and requires more thorough investigation and rigorous validation (Tong et al., 2004). Hence, the choice of better descriptors is still under investigation.
Conclusion
Identification of novel PXR activators from structurally diverse compounds is important for the discovery of drugs with desired metabolic and toxicological profiles. This study shows that MLMs, especially SVM, are useful for in silico prediction of the activators of highly promiscuous proteins such as PXR and for characterizing the molecular features of PXR activation. By incorporating feature-selection methods such as RFE into MLMs, molecular descriptors relevant to PXR activators can be identified. Most of these selected molecular descriptors are consistent with those used in previous pharmacophore and QSAR studies and with the findings from X-ray crystallography studies. Further work on the improvement and refinement of feature selection methods and molecular descriptors is needed to improve the capability of MLMs for accurately identifying PXR activators and the related molecular characteristics.
Footnotes
- ABBREVIATIONS: PXR, pregnane X receptor; MLM, machine learning method; QSAR, quantitative structure and activity relationship; h, human; SVM, support vector machine; PNN, probabilistic neural network; k-NN, k nearest neighbor; RFE, recursive feature elimination; DI, diversity index; CAS, Chemical Abstracts Service; SR12813, 4-[2,2-bis(diethoxyphosphoryl)ethenyl]-2,6-di-tert-butyl-phenol.
- The online version of this article (available at http://molpharm.aspetjournals.org) contains supplemental material.
- Received June 6, 2006.
- Accepted September 26, 2006.
- The American Society for Pharmacology and Experimental Therapeutics