J. Med. Chem. 2001, 44, 2432-2437
Discriminating between Drugs and Nondrugs by Prediction of Activity Spectra for Substances (PASS)
Soheila Anzali,*,† Gerhard Barnickel,† Bertram Cezanne,† Michael Krug,† Dmitrii Filimonov,‡ and Vladimir Poroikov‡
Bio- and Chemoinformatics Department, Merck KGaA, Darmstadt D-64271, Germany, and Institute of Biomedical Chemistry of Russian Academy of Medical Sciences, Pogodinskaya Street, 10, Moscow 119832, Russia
* Correspondence: Soheila Anzali, Ph.D., Merck KGaA, Bio- and Chemoinformatics Department, Frankfurter Str. 250, D-64271 Darmstadt, Germany. Tel: +49-6151-724863. Fax: +49-6151-7233299.
‡ Institute of Biomedical Chemistry of Russian Academy of Medical Sciences.
Using the computer system PASS (prediction of activity spectra for substances), which simultaneously predicts several hundreds of biological activities, a training set for discriminating between drugs and nondrugs is created. For the training set, two subsets of databases of drugs and nondrugs (a subset of the World Drug Index, WDI, vs the Available Chemicals Directory, ACD) are used. The high value of prediction accuracy shows that the chemical descriptors and algorithms used in PASS provide highly robust structure-activity relationships and reliable predictions. Compared to other methods applied in this field, the direct benchmark undertaken with this paper showed that the results obtained with PASS are in good accordance with these approaches. In addition, it has been shown that the more specific the drug information used in the training set of PASS, the more specific the discrimination between drug and nondrug that can be obtained.
Introduction
In the past decade the drug discovery process has changed dramatically. The challenge to identify novel leads has driven the need for automated systems that can rapidly perform selection of compounds at the beginning of the drug discovery process, namely in the analysis and the extension of the high throughput screening (HTS) pool. The number of discovered hits depends on the cutoff level, e.g., 10 μM. First of all, the activity needs to be confirmed, followed by selectivity and functional assays.
An important task is the rejection of false hits and a focus on the promising molecules. The lead molecule plays the pivotal role for the initiation of a lead optimization project. A promising lead compound with a desired pharmacological activity may have undesirable side effects, characteristics that limit its bioavailability, or structural features which adversely influence its metabolism and excretion from the body.
Therefore biological activity has to be balanced with “drug-like” properties, and the closer we get to a candidate compound, the more important drug-likeness becomes. Despite the many attempts1-11 to classify compounds into the “drug” and “nondrug” categories, there is no unambiguous definition for drug and nondrug. In particular, it may vary depending on the indications or diseases considered.12 Reagent databases such as ACD,13 for example, are often used as model databases for nondrug compounds, while CMC,14 WDI,15 and MDDR16 could be seen as databases for drugs. Certainly, if one could consider the fate of some compounds in the ACD database they may become drugs in the future, whereas a few compounds from MDDR and WDI will ...
Because of the lack of discrimination among structural features for drug and nondrug compounds, different approaches have to be applied to compensate. As concluded by Walters et al.,17 “future work is likely to include additional approaches and more robust attempts ...”
The PASS program,18-22 which is based on a regression approach applied to noncongeneric chemical series, provides highly robust predictions for more than 500 biological activities. Since PASS is trained to recognize drugs with activities on various targets, the approach may have potential use to discriminate drugs from nondrugs. The purpose of this work is to evaluate the ability of the PASS approach in discriminating between drugs and nondrugs.
Materials and Methods
PASS Approach. The computer system PASS (prediction of activity spectra for substances)18-21 predicts several hundreds of biological activities (pharmacological main and side effects, mechanisms of action, mutagenicity, carcinogenicity, teratogenicity, and embryotoxicity).
Biological activity results from the interaction of chemical compounds with biological entities. In clinical studies, the biological entity is the whole human organism. In preclinical testing it is the experimental animal (in vivo) and/or the experimental model (in vitro). Biological activity depends on peculiarities of the compound (structure and physicochemical properties), the biological entity (species, gender, age, etc.), and the mode of treatment (dose, route of administration, etc.).
The majority of biologically active compounds often reveal a wide spectrum of different effects. Some of them are useful in treatment of definite diseases; others cause various side and toxic effects. The whole complex of activities caused by the compound in biological entities is called the “biological activity spectrum”.
The biological activity spectrum of a compound presents all its activities despite the difference in essential conditions of
their experimental determination. If the difference in species, gender, age, dose, route, etc., is neglected, the biological activity can be identified only qualitatively. Thus, “the biological activity spectrum” is defined as the “intrinsic” property of a compound depending only on its structure and physicochemical properties.
The prediction of this spectrum by PASS is based on SAR analysis of a training set containing more than 30 000 compounds which reveal more than 500 kinds of biological activities. Therefore, PASS once trained is able to predict for a test compound all likely biological activities, which are ...
It was shown that the mean accuracy of prediction with PASS is about 86% in leave-one-out cross-validation.21 PASS prediction accuracy exceeds by more than three times the expert’s guesswork for an independent set of 33 different compounds studied as pharmacological agents, which are not included in the PASS training set.22 Recently PASS was tested in a blind mode by nine scientists from eight countries on a heterogeneous set of 118 compounds having 138 activities, and the mean accuracy of prediction was shown to be 82.6%.23 The PASS prediction is relatively successful even in the case of rather new compounds which have nontraditional structures and/or belong to new chemical classes. Like any other ligand-based design approach, PASS cannot predict the affinity for new targets, but even in that case PASS points to possible side effects which may also prevent the application of a drug ...
Besides this SAR-base available in PASS, it is also possible to create other SAR-bases or to enlarge it.
Activities Description. In this work, the investigated activity is “drug”, so the compounds from WDI and the Cipsline DB were described as drugs and the compounds from the other ...
Chemical Structure Description. We described in detail the substructure descriptors called “multilevel neighborhoods of atoms” (MNA) in a paper published recently.24 MNA descriptors of a molecule are based on the 2D representation of its structure. According to the valences and partial charges of the atoms, hydrogens are included, whereas bond types are not explicitly specified. An MNA descriptor set is subdivided into levels and generated recursively. A zero-level MNA descriptor describes the atom itself. Any next-level MNA descriptor is the substructure notation A(D1D2...), where A is the atom A descriptor, and Di is the previous-level MNA descriptor of the ith neighbor atom of atom A. For example, for carbon(3) in Figure 1, the MNA descriptors are as follows: first, “C”; second, “C(CCCC)”; third, “C(C(HHHC)C(HHHC)C(HHCN)C(HH...”
Figure 1.
Different stereoisomers of a molecule have identical MNA descriptors and are considered as equivalent molecules in PASS. The use of MNA descriptors in PASS for prediction is described in the Appendix. In the present version of PASS, up to second-level MNA descriptors are used.
Databases Used for the Training and Evaluation of PASS. To compare the PASS ability in discriminating drug-like compounds and nondrugs with the recently published results of Sadowski and Kubinyi,3 we used the same subsets of WDI and ACD compounds for the training of PASS. These subsets include 5000 compounds from WDI (“drugs”) and 5000 compounds from ACD (“nondrugs”). This data set was also ...
To evaluate the method we prepared several test sets. As a sample of drug compounds we extracted two data sets from the Cipsline database,25 which is a subset of MDDR.16 The first subset includes all launched, registered, and investigated compounds (LRID). At the second stage, in order to focus on real drug compounds, we extracted the subset of Cipsline with just launched and registered compounds (LRD). If the PASS approach could provide a reasonable discrimination between drugs/nondrugs, the expected results should be better for the ...
As an example for a nondrug data set, we prepared 9737 compounds (ND) from a supplier database of approximately 57 000 commercially available compounds. A compound was identified as nondrug by the analysis of 60 different functional groups/fragments. Most of them are reactive groups, which are unfavorable for drugs. Some examples of such groups are listed in Table 1. In addition, all compounds with a molecular weight less than 150 Da were classified as nondrugs.
Table 1. Functional Groups Describing Nondrug Compounds(a)
(a) Minimum frequency of a certain functional group is indicated in parentheses; in all other cases it is 1. Compounds with MW < 150 were also classified as nondrugs.
As an independent evaluation set of drugs (TOP-100), we use a list of top-100 prescription pharmaceuticals26 (Table 2). Twelve of these entries are biopolymers and were not included in the evaluation (Table 3).
Computation Time. The calculation time on a PC (Pentium 2; 300 MHz; 128 Mb RAM) for the prediction of one compound is 4 ms, which demonstrates the ability of PASS to handle huge data sets, as they are used, for example, in the analysis of virtual libraries or supplier databases.
Results and Discussion
Training of PASS. The results of a leave-one-out cross-validation (LOO), which characterizes the quality of the obtained structure-property relationships, are shown in Table 4, no. 1. The quality of the prediction is described by the percentage of false classification. During model building (including LOO cross-validation), the quality is expressed as the maximal error of prediction (MEP). The mean accuracy of prediction in the LOO cross-validation is about 80%, which is slightly less than in the current version of PASS applied for the prediction of the biological activity spectra, but which is still satisfactory to discriminate between drugs and nondrugs. Such accuracy of prediction is comparable to the results obtained by Sadowski and Kubinyi.3
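Returning to the Chemical Structure Description in Materials and Methods, the recursive construction of MNA descriptors can be illustrated with a minimal Python sketch. It is not the PASS implementation: the function name `mna_descriptors`, the atom/bond input format, and the lexicographic ordering of neighbor descriptors are assumptions made for illustration (the paper's own example appears to order neighbors differently, so the strings will not match PASS output character for character). As in the text, hydrogens must be listed explicitly and bond types are ignored.

```python
from typing import Dict, List, Tuple


def mna_descriptors(
    atoms: List[str], bonds: List[Tuple[int, int]], max_level: int = 2
) -> Dict[int, List[str]]:
    """Map each atom index to its descriptors for levels 0..max_level."""
    neighbors: Dict[int, List[int]] = {i: [] for i in range(len(atoms))}
    for a, b in bonds:  # bond types are deliberately ignored
        neighbors[a].append(b)
        neighbors[b].append(a)

    current = list(atoms)  # level 0: the atom itself
    levels = {i: [current[i]] for i in range(len(atoms))}
    for _ in range(max_level):
        # Level k of atom A is A(D1D2...), with Di the level k-1
        # descriptor of the ith neighbor. Sorting is an assumed
        # canonical order, not necessarily the one PASS uses.
        current = [
            f"{atoms[i]}({''.join(sorted(current[j] for j in neighbors[i]))})"
            for i in range(len(atoms))
        ]
        for i, descriptor in enumerate(current):
            levels[i].append(descriptor)
    return levels


# Ethane with explicit hydrogens: C0-C1, three H on each carbon.
ethane_atoms = ["C", "C", "H", "H", "H", "H", "H", "H"]
ethane_bonds = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 5), (1, 6), (1, 7)]
ethane_levels = mna_descriptors(ethane_atoms, ethane_bonds)
print(ethane_levels[0])  # descriptors of the first carbon, levels 0 to 2
```

Because each level is built only from the previous level of the direct neighbors, the cost grows linearly with the number of bonds per level, which is in keeping with the short per-compound prediction times reported under Computation Time.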
Table 2. Evaluation Set Based on the List of Top-100 Drugs(a)
(a) Pa scores represent the probability of belonging to this therapeutic class.
Evaluation of PASS vs “Drugs”. Formally the first test set (LRID) includes 7468 presumed drug compounds. Their structures were checked for being present in the training set, yielding 632 compounds. These compounds were eliminated from the test set, as were 688 compounds which had no connection table fields or had errors in structural formulas (invalid compounds). After filtering, the final test set contained 6148 compounds. A total of 4514 (73.4%) compounds were predicted as drugs and 1634 (26.6%) compounds as nondrugs (Table 4; no. 2).
There exist no independent criteria to be sure that some compounds predicted as nondrug will not become drugs in the future; therefore we eliminated all the investigated compounds from the LRID set. The remaining 1184 compounds were launched and registered compounds (LR) and represent real drugs. Their molecular structures were again checked for presence in the training set (111 compounds), and 208 compounds were removed as being invalid. A total of 864 structures were calculated, and 678 (78.5%) compounds were predicted as drugs and 186 (21.5%) as nondrugs (Table 4; no. 3). It is obvious that the fraction of compounds classified as drugs is higher in comparison with the first test set. This can be explained by a more objective definition of drug and nondrug for the second test set, which provides better recognition of real drugs from the ...
Evaluation of PASS vs “Nondrugs”. The third evaluation set (ND) included 9737 compounds from different sources carefully selected as nondrugs according to the criteria discussed above. After the same filtering procedure, 9484 compounds were left for prediction. A total of 7950 compounds (83.8%) were predicted as nondrugs and 1534 (16.2%) compounds as drugs (Table 4; no. 4). These results show that cleaning of the test set gave a higher prediction accuracy.
Evaluation of PASS vs Drugs from the Top-100 List. As we suggested that most of the drugs from this list may also be included in the WDI set, all predictions were carried out under exclusion of the equivalent compounds from the training set. For 88 compounds remaining from the list of top-100 prescription pharmaceuticals, 77 compounds (87.5%) were predicted as drugs and 11 (12.5%) were predicted as nondrugs (Table 4; no. 5).
Figure 2. Distribution of predicted scores Pa for drugs (black) and nondrugs (white): a, WDI/ACD training set and LRID test set (Table 4, no. 2); b, WDI/ACD training set and LR test set (Table 4, no. 3); c, WDI/ACD training set and ND test set (Table 4, no. 4); d, WDI/ACD training set and TOP-100 test set (Table 4, no. 5); e, LR/ND training set and TOP-100 test set (Table 4, no. 11).
Table 3. Entries Excluded from Evaluation (Biologicals)
Table 4. Quality of Discriminating between Drugs and Nondrugs by Different Methods
(a) LOO c-v: leave-one-out cross-validation. (b) MEP: maximal error of prediction in LOO cross-validation.
Evaluation of PASS with the Cleaned Training Set. It was interesting to see if the cleaning of the training set could also increase the accuracy of the PASS
prediction. Therefore, we trained PASS with a new drug/nondrug SAR-base represented by the test sets LR and ND. The results of the LOO cross-validation are listed in Table 4; no. 10. The accuracy of prediction is about 90%, which is significantly higher than in the WDI/ACD training procedure used in the ...
The results of prediction for the 88 compounds from the list of top-100 prescription pharmaceuticals were even better than in the LOO cross-validation. A total of 84 compounds (95.5%) were predicted as drugs, while only four compounds (4.5%) were predicted as nondrugs.
In Figure 2 the distributions of the numbers of drugs/nondrugs predicted with different training and test sets are presented versus the value of the PASS score Pa, which represents the estimated probability of a compound belonging to the class of “drugs”. It is clear that the discriminating ability of PASS is significantly higher in the case of the cleaned training set, as was demonstrated for the test set of the top-100 prescription pharmaceuticals.
Conclusions
The discrimination between drug and nondrug faces three problems: (i) not well-defined databases, (ii) choice of a method to discriminate, and (iii) the selection of appropriate descriptors.
The widely used databases for the discrimination between drugs and nondrugs are relatively noisy: some compounds assigned as drugs are nondrugs in reality and vice versa. Since this problem lies in the nature of the complex term “drug-likeness”, there seems to be no simple way to overcome the underlying problem.
Our experiments provide evidence that information-guided selection of the data sets gives higher accuracy in discrimination between the classes of drug-like compounds and nondrugs. The high value of prediction accuracy shows that the chemical descriptors and algorithms used in PASS provide highly robust structure-activity relationships and reliable predictions on this basis. Compared to other methods applied in the field, the direct benchmark undertaken with this paper showed that the results obtained with PASS are in good accordance with these approaches.
Since no specific adaptation of the prediction scheme implemented in the PASS program was required, the advantage of the PASS approach lies in the fact that only two annotated data pools for drug and nondrug cases are necessary to allow a reliable prediction of discrimination of given features. So the PASS methodology opens the door to include more specific drug information in order to get a more specific discrimination. This may also be extended to physical-chemical properties as well as the interplay of those properties with dedicated pharmacological properties.
Acknowledgment. We are sincerely grateful to Jens Sadowski (AstraZeneca) for providing us with subsets from WDI and ACD, which were used as an initial ...
Appendix: Mathematical Method
n is the total amount of compounds in the training set.
ni is the amount of compounds containing descriptor i.
nj is the amount of compounds revealing activity j.
nij is the amount of compounds containing descriptor i and revealing activity j.
nj/n is the estimate of the a priori probability of activity j.
nij/ni is the estimate of the conditional probability of the activity j for the descriptor i.
m is the number of descriptors for the compound under prediction.
... 0.5/m) is the regulating factor.
Prj is the initial estimate of the probability of the activity j for the compound under prediction.
LOO is the leave-one-out procedure: for each compound in the training set, the values n, ni, nj, nij are changed for n - 1, ni - 1, and nj - 1, nij - 1 when one is active, and the estimates Prj are calculated.
For the compound under prediction, the structure descriptors are generated.
For each activity, the following values are calculated:
... s0j)/(1 - sj s0j))/2
For each compound in the training set, the LOO ...
EFj(CP) is the estimate of the first kind of error
ESj(CP) is the estimate of the second kind of error
The first kind of error is fixed when the compound under prediction actually is active but Pr < ...
The second kind of error is fixed when the compound ...
For each activity, the estimates of EFj(CP) and ESj(CP) ...
The cutting points CPj*, which give the equality EFj(CPj*) = ESj(CPj*), are calculated.
The maximal error of prediction MEP is as follows: EFj(CPj*) = ESj(CPj*)
The probability to be active is Paj
The probability to be inactive is Pij
Pa (Pi) can be considered as the probability of the first (second) kind of errors for the compound under prediction or as the probability of the compound belonging to classes of active (inactive) compounds, respectively.
References
(1) Cummins, D. J.; Andrews, C. W.; Bentley, J. A.; Cory, M. Molecular Diversity in Chemical Databases: Comparison of Medicinal Chemistry Knowledge Bases and Databases of Commercially Available Compounds. J. Chem. Inf. Comput. Sci. 1996, 36, 750-763.
(2) Ajay; Walters, W. P.; Murcko, M. A. Can We Learn To Distinguish between “Drug-like” and “Nondrug-like” Molecules? J. Med. Chem. 1998, 41, 3314-3324.
(3) Sadowski, J.; Kubinyi, H. A Scoring Scheme for Discriminating between Drugs and Nondrugs. J. Med. Chem. 1998, 41, 3325-
(4) Gillett, V. J.; Willett, P.; Bradshaw, J. Identification of Biological Activity Profiles Using Substructural Analysis and Genetic Algorithms. J. Chem. Inf. Comput. Sci. 1998, 38, 165-179.
(5) Ghose, A. K.; Viswanadhan, V. N.; Wendolowski, J. J. A Knowledge-Based Approach in Designing Combinatorial and Medicinal Chemistry Libraries for Drug Discovery. 1. A Qualitative and Quantitative Characterization of Known Drug Databases. J. Comb. Chem. 1999, 1, 55-68.
(6) Blake, J. F. Chemoinformatics - predicting the physicochemical properties of “drug-like” molecules. Curr. Opin. Biotechnol. 2000,
(7) Teague, S. J.; Davis, A. M.; Leeson, P. D.; Oprea, T. The design of leadlike combinatorial libraries. Angew. Chem., Int. Ed. 1999,
(8) Oprea, T. I. Property distribution of drug-related chemical databases. J. Comput.-Aided Mol. Des. 2000, 14, 251-264.
(9) Wagener, M.; van Geerestein, V. J. Potential drugs and nondrugs: Prediction and identification of important structural features. J. Chem. Inf. Comput. Sci. 2000, 40, 280-292.
(10) Clark, D. E.; Picket, S. D. Computational methods for the prediction of “drug-likeness”. Drug Discovery Today 2000, 5, 49-
(11) Frimurer, Th.; Bywater, R.; Naerum, L.; Lauritsen, L. N.; Brunak, S. Improving the odds in discriminating “drug-like” from “non drug-like” compounds. J. Chem. Inf. Comput. Sci. 2000, 40, 1315-1324.
(12) Ajay; Bemis, G. W.; Murcko, M. A. Designing Libraries with CNS Activity. J. Med. Chem. 1999, 42, 4942-4951.
(13) ACD: Available Chemicals Directory, Version 2/97, MDL Information
(14) CMC: Comprehensive Medicinal Chemistry, Version 1/97, MDL
(15) WDI: World Drug Index, Version 2/96; Derwent Information,
(16) MDDR: MDL Drug Report, Version 2/97; MDL Information
(17) Walters, W. P.; Ajay; Murcko, M. A. Recognizing Molecules with Drug-Like Properties. Curr. Opin. Chem. Biol. 1999, 3, 384-
(18) Filimonov, D. A.; Poroikov, V. V.; Karaicheva, E. I.; Kazaryan, R. K.; Boudunova, A. P.; Mikhailovsky, E. M.; Rudnitskih, A. V.; Goncharenko, L. V.; Burov, Yu. V. Computer-Aided Prediction of Biological Activity Spectra of Chemical Substances on the Basis of Their Structural Formulae: Computerized System PASS. Exp. Clin. Pharmacol. (Rus) 1995, 58, 56-62.
(19) Filimonov, D. A.; Poroikov, V. V. PASS: Computerized prediction of biological activity spectra for chemical substances. In Bioactive Compound Design: Possibilities for Industrial Use; BIOS Scientific Publishers: Oxford, 1996, 47-56.
(20) Poroikov, V. V.; Filimonov, D. A.; Stepanchikova, A. V.; Boudunova, A. P.; Shilova, E. V.; Rudnitskih, A. V.; Selezneva, T. M.; Goncharenko, L. V. Optimization of synthesis and pharmacological testing of new compounds based on computerized prediction of their biological activity spectra. Chim.-Pharm. J. (Rus) 1996, 30, 20-23.
(21) Web site: http://www.ibmh.msk.su/PASS.
(22) Poroikov, V. V.; Filimonov, D. A.; Boudunova, A. P. Comparison of the Results of Prediction of the Spectra of Biological Activity of Chemical Compounds by Experts and the PASS System. Autom. Doc. Math. Linguist. 1993, 27, 40-43.
(23) Website: http://www.vei.co.uk/chemweb/library/lecture17/slide-
(24) Filimonov, D. A.; Poroikov, V. V.; Borodina, Y.; Gloriozova, T. Chemical Similarity Assessment through Multilevel Neighborhoods of Atoms: Definition and Comparison with the Other Descriptors. J. Chem. Inf. Comput. Sci. 1999, 39, 666-670.
(25) Cipsline, Correlates in Pharmacostructures Online, Version
(26) Pharma Business 1996, July/August, 18-53.
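As a supplementary illustration of the Appendix, the frequency estimates (nj/n and nij/ni, with the leave-one-out adjustment) and the search for a cutting point at which the two kinds of error are equal can be sketched in Python. This is a hedged sketch only. The function names, the data layout, and the toy data are hypothetical. The way PASS combines the per-descriptor estimates into Prj is not reproduced, because that formula is not recoverable from the text. The sketch further assumes that the first kind of error means an active compound scored below the cutting point, that the second kind means an inactive compound scored at or above it, and that each error is expressed as a fraction of its own class.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence, Set, Tuple

Compound = Tuple[Set[str], bool]  # (descriptor set, reveals activity j?)


def frequency_estimates(
    training: List[Compound],
    query_descriptors: Set[str],
    leave_out: Optional[int] = None,
) -> Tuple[float, Dict[str, float]]:
    """Return (nj/n, {descriptor i: nij/ni}) for a single activity j."""
    n = len(training)
    n_j = sum(1 for _, active in training if active)
    n_i: Counter = Counter()
    n_ij: Counter = Counter()
    for descriptors, active in training:
        for d in descriptors:
            n_i[d] += 1
            if active:
                n_ij[d] += 1
    if leave_out is not None:
        # LOO: remove the compound's own contribution from every count.
        descriptors, active = training[leave_out]
        n -= 1
        n_j -= 1 if active else 0
        for d in descriptors:
            n_i[d] -= 1
            n_ij[d] -= 1 if active else 0
    prior = n_j / n
    conditional = {d: n_ij[d] / n_i[d] for d in query_descriptors if n_i[d] > 0}
    return prior, conditional


def cutting_point(
    scores: Sequence[float], labels: Sequence[bool]
) -> Tuple[float, float, float]:
    """Return (CP, EF, ES) for the observed score where |EF - ES| is smallest."""
    actives = [s for s, a in zip(scores, labels) if a]
    inactives = [s for s, a in zip(scores, labels) if not a]
    if not actives or not inactives:
        raise ValueError("both classes are needed to balance the two errors")
    candidates = []
    for cp in sorted(set(scores)):
        ef = sum(s < cp for s in actives) / len(actives)       # assumed first kind
        es = sum(s >= cp for s in inactives) / len(inactives)  # assumed second kind
        candidates.append((abs(ef - es), cp, ef, es))
    _, cp, ef, es = min(candidates)
    return cp, ef, es


# Toy data (hypothetical): four training compounds, two of them "drugs".
toy = [
    ({"C(CHHH)", "H(C)"}, True),
    ({"C(CHHH)", "O(CH)"}, True),
    ({"H(C)"}, False),
    ({"O(CH)", "H(C)"}, False),
]
prior, cond = frequency_estimates(toy, {"C(CHHH)", "H(C)"})
loo_prior, loo_cond = frequency_estimates(toy, toy[0][0], leave_out=0)
cp, ef, es = cutting_point([0.9, 0.8, 0.4, 0.6, 0.3, 0.1],
                           [True, True, True, False, False, False])
print(prior, cond, loo_prior, loo_cond, cp, ef, es)
```

With real data the two error curves rarely cross exactly at an observed score, so picking the closest candidate is a simplification; interpolating between neighboring scores would be a natural refinement.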