Applying inductive logic programming to predicting gene function
Applying Inductive Logic Programming to Predicting Gene Function
■ One of the fastest advancing areas of modern sci-
Has this anything to do with AI? I believe
ence is functional genomics. This science seeks to
yes. The relationship between machines and
understand how the complete complement of
living organisms is central to AI and was a core
molecular components of living organisms (nucle-
interest of the founding fathers (Alan Turing,
ic acid, protein, small molecules, and so on) inter-
John von Neuman, and Norbert Wiener). Mod-
act together to form living organisms. Functionalgenomics is of interest to AI because the relation-
ern biology is also an instructive and fun do-
ship between machines and living organisms is
main to apply and sharpen our tools and ideas;
central to AI and because the field is an instructive
it requires complex knowledge representation,
and fun domain to apply and sharpen AI tools and
reasoning, learning, and so on. Because the
ideas, requiring complex knowledge representa-
concrete generally precedes the abstract, real-
tion, reasoning, learning, and so on. This article
world applications catalyze new research areas.
describes two machine learning (inductive logic
The sequencing of a genome is only a first
programming [ILP])–based approaches to the
step to fully understanding how a living organ-
bioinformatic problem of predicting protein func-tion from amino acid sequence. The first approach
isms works. In computer science terms, knowl-
is based on using ILP as a way of bootstrapping
edge of a genome is only equivalent to obtain-
from conventional sequence-based homology
methods. The second approach used protein-func-
complication that the function and language
tional ontologies to provide function classes and a
of the program are unknown and the code un-
hybrid ILP method to predict function directly
documented and badly written. The key cur-
from sequence. Both ILP approaches were success-
rent scientific challenge in biology is to unrav-
ful in producing accurate prediction rules that
el and understand this code. AI tools will be
could biologically be interpreted. The work was al-
so of interest to machine learning research becauseit highlighted the flexibility of ILP systems in deal-
ing with heterogeneous data, the importance ofproblems where classes are related hierarchically,
Perhaps the most important discovery from
and problems where examples have more than
the sequenced genomes is that the functions of
only approximately 40 to 70 percent of thepredicted genes are typically known with anyconfidence. For example, in bakers’ yeast (S.
We live in interesting scientific times. cerevisiae),one of the most intensely studied of
all organisms, of the approximately 6000 pre-
dicted protein-encoding genes, the function of
living organisms. These genomes provide the
only approximately 70 percent can be assigned
complete specification of the parts and pro-
with any confidence. The new science of func-
grams to create living organisms. This knowl-
tional genomics (Bussey 1997; Hieter and Bo-
guski 1997) is dedicated to determining the
Copyright 2004, American Association for Artificial Intelligence. All rights reserved. 0738-4602-2004 / $2.00
function of genes of unassigned function and
simonious. In practical applications, there is
to further detailing the function of genes with
the additional problem that there is usually
purported function. Most functional genomics
noise present (that is, some E+ are false, and
is concerned with the development of new ex-
some E– are true), which means that the previ-
perimental techniques for elucidating gene
ous conditions need to be relaxed such that the
function (Blackstock and Weir 1999; DeRisi, Iy-
best hypothesis maximizes its coverage of E+
er, and Brown 1997; Oliver et al. 1998). Bio-
and minimizes its coverage of E–. These two
informatics is playing an essential role in ana-
values give a cost ratio that represents the di-
vergence from the ideal. The problem of learn-
genomics data (Bork et al. 1998; Brent 1999;
ing H is typically designed as a search problem
through the space of models meeting the pre-
A key bioinformatic problem in functional
genomics is the prediction of protein function
ILP has shown its value in many scientific
from sequence. Such predictions both provide
problems such as drug design and toxicology
important initial information about newly se-
(for example, Dzeroski et al. [1999]; Finn et al.
quenced genomes and aid “wet” experimental
[1998]; King, Srinivasan, and Dehaspe [2001];
determination of function. Such predictions
King et al. [1996, 1992]) and molecular biology
are usually done by utilizing sequence similar-
(Badea 2003; Donescu et al. 2002; King et al.
ity methods to find an evolutionary related
1994; Muggleton, King, and Sternberg 1992;
(homologous) protein in the database that has
Sternberg et al. 1994; Turcotte et al. 2001). ILP
a known function (Altschul et al. 1997; Pear-
has been particularly well suited to problems
son and Lipman 1988). The function of the
dealing with molecular structure. In such prob-
new sequence is then inferred to be the same as
lems, ILP has often found solutions not acces-
the homologous protein because it is assumed
sible to standard statistical, neural network,
to have been conserved over evolution. This
propositional machine learning, or genetic al-
inference is a kind of nearest-neighbor type in
gorithms (King et al. 1996). The theories pro-
sequence space. Unfortunately, with this ap-
duced by ILP have also been generally more
proach, only around 50 percent of possible ho-
comprehensible than those using proposition-
mologies are identified (Park et al. 1998), and
al methods because they are more compact and
little biological insight is obtained.
closer to natural language (King et al. 1992;Turcotte, Muggleton, and Sternberg 2001).
that utilize first-order predicate logic (FOPL) torepresent background knowledge and theories
are generally described as coming from the
If a problem can satisfactorily be represented
field of inductive logic programming (ILP)
and solved using propositional methods, then
(Lavrac and Dzeroski 1994; Muggleton 1992,
there is no need to apply ILP techniques. Prop-
1990) or relational data mining (RDM) (Dze-
ositional methods are generally better devel-
roski and Lavrac 2001), depending on the re-
oped and more computationally and statisti-
search emphasis. For simplicity in this article, I
cally efficient. ILP methods do not necessarily
refer to all such algorithms as coming from ILP.
default to efficient propositional learners when
In ILP, background knowledge (B), examples
given wholly propositional data. Use of ILP
(E), and hypotheses (H) are represented as logic
therefore needs to be justified. My rationale is
programs. It is first assumed that there is a re-
based on the following features of the problem
quirement for an inductive hypothesis; that is,
B ç/ E (prior necessity). The core ILP problem is
Relational descriptors: Functional genom-
to find hypotheses H such that B ٙ H ç E. Mat-
ics naturally involves many relationships in
ters are complicated slightly by the fact that ev-
the data: phylogenic hierarchies (the tree of life),
idence in ILP is usually divided into two types:
homologies (genes sharing a common ancestor),
(1) E+ data that are consistent with the con-
directed graphs relating functions, and so on.
joined hypothesis and background informa-
Traditional propositional methods (statistical,
tion and (2) E– data that are contrary to the
neural network, machine learning, genetic al-
conjoined hypothesis and background infor-
gorithms, and so on) cannot efficiently repre-
mation. Therefore, H is required to meet the
sent these relations in inductive inference.
following criterion: B ٙ H ç E+ and B ç H ç/ E. Data heterogeneity: The relevant data come
It is typical that H is also restricted to necessar-
from many different sources and are necessari-
ily meet other requirements such as being non-
ly stored in multiple tables of relational data-
trivial (not, for example, E, or B ⇒ E) and par-
bases. To use a conventional data-mining algo-
rithm, the tables would have to be joined to
form a single prohibitively large table for
Homology induction uses background knowl-
analysis, which is impractical. In addition,
edge, together with the protein’s amino acid
ILP/RDB allows the direct analysis of the mul-
sequence, to induce homology. The idea is to
collect as much information as possible for a
Comprehensible results: It is important
protein and then infer homology using dis-
that the prediction rules are understandable.
criminatory ILP. The homology induction ap-
Biologists generally require that the rules are
understandable so that they can suggest new
First is the collection of possible homolo-
biological ideas and have confidence in them.
gous proteins using an existing method of se-
In some bioinformatic applications, this re-
quence similarity search (SSS). Aberystwyth re-
quirement is not necessary, for example, in
searchers use PSI-BLAST (Altschul et al. 1997)
predicting protein secondary structure. How-
which is essentially an iterative nearest-neigh-
ever, given a choice, comprehensible results are
bor method (in sequence space). The result of
a PSI-BLAST search is a list of possible homolo-
At Aberystwyth, researchers have developed
gous proteins sorted by probability. Proteins
two approaches to applying ILP to predicting
where the probability of homology is ambigu-
protein function. The first is called homology
ous are termed to be in the “twilight zone.”
induction and is based on utilizing machine
Second is the accumulation of all available
information for these proteins. We developed a
large multitable database of DATALOG (Ullman
and King 2002, 2001). The second method uses
1988) facts to describe the proteins from a wide
a hybrid ILP–propositional machine learning
variety of bioinformatic sources. This informa-
method to predict protein functional class di-
tion was selected for relevance to the detection
rectly from sequence (King, Srinivasan, and
of homology. For each protein, we collected
Dehaspe 2001; King et al. 2001a, 2001b, 2000).
bioinformatic database keywords, the organ-ism’s classification (family tree), bioinformatic
database references (PROSITE, HSSP, EMBL, PIR—ex-cluding SCOP [structural classification of pro-
The identification of evolutionary-related (ho-
tein] classifications), predicted secondary struc-
mologous) proteins is a key problem in compu-
tational molecular biology. Knowledge of a ho-
distribution for singlets and pairs of residues,
mologous relationship between two proteins,
one of known function and the other of un-
Third is the induction of rules. We used the
known function, allows the probabilistic infer-
ILP ALEPH algorithm to learn rules that were
ence that the protein with unknown function
true for proteins of very high probability of be-
has the same function as that of the known
one (because evolution generally conserves
ilarity)1 and false for proteins with close to zero
function). Such inferences are the basis of most
probability of being homologous (also based
of our knowledge about sequenced genomes.
Protein homology is typically inferred by using
Fourth is the application of the rules. We ap-
computer programs to measure the similarity
plied the rules to the set of twilight zone pro-
of two or more proteins. This inference is gen-
teins to predict whether they were homologous.
erally done by comparing the two amino acid
To assess the accuracy of homology induc-
strings of the proteins and measuring the char-
tion, it was necessary to have a “gold standard”
acterwise similarity between them. Such pro-
set of known homologies. We used the syste-
grams probably consume more processing time
matic approach of Park et al. (1997), which
than all other bioinformatic programs put to-
used a subset of the SCOP database (Murzin et al.
gether. These methods perform well for closely
1995). The SCOP database is a classification
related homologous sequences. However, the
database of proteins of known structure; most
results for more distantly related proteins are
also have known function. Over evolutionary
less reliable (Park et al. 1998), detecting only
time, protein structure changes more slowly
about 50 percent of all possible homologies,
than sequence; therefore, structure can be used
given an acceptable false-positive rate.
to identify more remote homologies than se-
In learning problems, all relevant informa-
quence. At the family level of SCOP, the struc-
tion should be used. The idea behind homolo-
tures are so similar that homology is inferred,
gy induction is to exploit additional sequence
which is not to be confused with protein fold
information to bootstrap on the performance
recognition, where there is no necessary expec-
A perfect prediction method would be able
mology induction rules, I use the protein C-
to detect all homologous relationships in SCOP.
Phycocyanin (1CPC). Figure 2 shows the HIall
However, in practice, unrelated nonhomolo-
and HIseq rules learned for C-Phycocyanin
gous proteins are predicted (errors of commis-
both in their original Prolog form and in Eng-
sion), and evolutionary related proteins are
lish translation. Phycocyanins are light har-
missed (errors of omission). The cost of these
vesting proteins. Applying PSI-BLAST to the data
different types of errors depends on the biolog-
produced three proteins in the twilight zone:
ical problem. Therefore, we used receiver-oper-
(1) allophycocyanin alpha-b chain (Anabena),
ating characteristic (ROC) curves to compare
(2) erythroid transcription factor (gata-1 Musmusculus), and (3) oryzain gamma chain pre-cursor (Oryza sativa). The rules in the HIall and
HIseq rule sets correctly identified the allophy-
Homology induction induced rules for 1,015
proteins of known structure (PDB40D). The
convincing experimental evidence for this ho-
original PSI-BLAST results were used for the se-
mology and note that PSI-BLAST does not use
quences where no rules could be induced. In
protein names! No rule in any rule set identi-
fied the other two twilight zone sequences as
rules. The most commonly used predicate of
homologous, which would appear to be cor-
the single predicate rules was db_ref, used by
rect (no structures exist to be certain). Furtherevidence for the power of the homology in-
651 rules. These rules consisted mainly of ref-
duction rules is that the homology induction
erences to the bioinformatic database PROSITE.2
analysis was applied to version 37.0 of the
This result was expected because this database
bioinformatic database SWISS-PROT,3 and each
contains patterns designed to cluster homolo-
of the 13 positive examples not covered by
gous families of proteins together. In the
this rule have had the keyword phycobilisome
PDB40D database, there are 8,022 true homol-
added to their annotation since version 38.0
ogy relationships and 2,046,900 false ones. The
of SWISS-PROT. It is particularly intriguing that
accuracy for PSI-BLAST was 99.69 percent and for
the most characteristic feature of the amino
homology induction 99.70 percent. Although
acid–type rules is low-histidine and -trypoto-
the accuracy of homology induction is margin-
phan content and that both amino acids have
ally higher than PSI-BLAST alone, it is not clear at
nitrocyclic aromatic rings, which can be ex-
first sight if it is significantly higher. To test sig-
plained chemically. Phycocyanins have cova-
nificance, researchers therefore performed a
lently linked bilin prosthetic groups that con-
two-sample χ2 test to compare the actual fre-
sist of linked nitrocyclic aromatic rings.
quency of a prediction with the estimated fre-
Aberystwyth researchers hypothesize that evo-
quency of the prediction. The critical value of
lution has selected for low-histidine and -try-
χ2 for 1 degree of freedom and 99.995-percent
potophan content in phycocyanins to reduce
confidence is 7.879, which indicates that ho-
electron-transport interference. The require-
mology induction was significantly better than
ment for a high number of leucine-arginine
pairs is also structurally significant because
For comparison over all linear costs, I per-
these arginines form salt bridges with the pros-
formed a ROC analysis. We compared the area
thetic groups. The structural rule s2 is also
under the ROC (AUROC) for homology induc-
consistent with the known structure of phyco-
tion and PSI-BLAST. The AUROC value for homol-
cyanins, which are well known to have an all
ogy induction was 0.65, and the AUROC for PSI-
BLAST 0.61. The ROC curve for PSI-BLAST, alongwith the ROC curves for the standard homolo-gy induction (HIall) and a version of homology
induction based on the subset of descriptors di-
rectly calculable from only sequence informa-tion (HIseq) is shown in figure 1. The dominat-
Perhaps the most important recent advance in
ing curve is that of HIall, to the left of the other
bioinformatics has been the development of
two curves. The ROC curve of HIseq does not en-
good ontologies to describe protein function,
tirely dominate the PSI-BLAST ROC curve, and for
for example, GO and RILEY.4,5 These ontologies
large sections of the false positive axis, the two
take the form of hierarchies or directed acyclic
curves have a similar true positive rate. Howev-
graphs. Figure 3 illustrates part of the Riley hi-
er, for the false positive rate interval of 0.38 to
erarchy for the bacteria E. coli, one of the best-
0.5, the ROC curve produced by HIseq does
established functional ontologies. The cre-
To illustrate the biological utility of the ho-
possibility of directly predicting protein func-
Maxmimum ROC curves for PSI-BLAST and the Two HI methods
Figure 1. The Three Receiver-Operating Characteristic (ROC) Curves Produced by
PSI-BLAST, HIALL and HIseq, for Predictions in the Twilight Zone.
Although the ROC curve for PSI-BLAST results from applying ROC analysis directly to the results produced, the ROC curves for both homol-ogy induction methods are maximized using a cross-validated value for resorting. The ROC curve for HIall dominates over the other twocurves at all times, but the curves for PSI-BLAST and HIseq oscillate around each other. HIseq dominates the PSI-BLAST curve between ~0.38and ~0.5.
tional class from sequence. Abstractly, what is
required is a discrimination function that
We selected the E. coli genome to test the idea
maps sequence to biological functional class.
of using machine learning to learn predictive
The existing sequence homology recognition
mappings between protein sequence and func-
methods (see earlier discussion) can be viewedas examples of such functions: Methods based
tion. E. coli is probably the best characterized
on direct sequence similarity can be consid-
extant genome and is the “model” bacteria. It
ered as nearest-neighbor–type functions (in se-
has an estimated 4,289 identified proteins
quence space), and the more complicated ho-
(Blattner et al. 1997). Of these proteins, ap-
mology recognition methods based on motifs
and profiles resemble case-based learning
tion. To predict functional class, we collected
similar data to that used in homology induc-
PDB 1CPC C-Phycocyanin HI Prolog
desc(A,chain),amino_acid_ratio_rule(A,h,1).
it has the word “chain” in its SWISS-PROT description line andit has a level 1 histidine content in the residue chain and
it has the word “phycobilisome” as a SWISS-PROT keyword. HI Prolog
amino_acid_ratio_rule(A,w,1),amino_acid_ratio_rule(A,h,1),amino_acid_pair_ratio_rule(A,l,r,10).
mol_wt_rule(A,3),sec_struc_distribution_rule(A,a,10).
it has a level 1 tryptophan content andit has a level 1 histidine content andit has a level 10 leucine-arginine pair content.
it has a level 3 molecular weight andit has a level 10 predicted α-helix content. Figure 2. The Homology Induction Rules Learned to Identify 1CPC (C-Phycocyanin) Are Illustrated First in Their Original Prolog Form and Then in English Translation.
Two sets of rules are shown, those using HIall and those learned from HIseq.
All numbers were discretized into 10 levels for ease of symbolic induction (1 low to 10 high).
tion (see earlier discussion). We formed a DATA-
from SWISS-PROT; SWISS-PROT protein keywords
LOG database containing all the data we could
find on the protein sequences. The most com-
molecular weight of the protein, and its pre-
dicted secondary structure using PROF (Ouali
about a sequence is to run a sequence similari-
and King 2000). In total, 10,097,865 DATALOG
ty search, which was used as the starting point
facts were generated for the E. coli genome.
in forming descriptions (we used PSI-BLAST). For
To analyze this database, we used a hybrid
each protein in the genome, we formed a de-
combination of ILP and propositional tree
scription based on the frequency of singlets
learning (figure 4). The ILP data-mining pro-
and pairs of residues in the protein, the phy-
gram WARMR (Dehaspe and Toivonen 1999) was
logeny (family tree) of the organism from
first used to identify frequent patterns (con-
which each homologous protein was obtained
junctive queries) in the databases. WARMR is a
Figure 3. An Example Subset of the Riley Group Protein Functional Ontology in E. coli.
No. predicting more than one homology class
Table 1. Learning Results for E. coli.
The number of rules found are those selected on the validation set. A rule predicts more than one homologyclass if there is more than one sequence similarity cluster in the correct test predictions. A rule predicts a newhomology class if there is a sequence similarity cluster in the test predictions that has no members in the train-ing data. Average test accuracy is the accuracy of the predictions on the test proteins of assigned function (ifconflicts occurred; the prediction with the highest a priori probability was chosen). Default test accuracy is theaccuracy that could be achieved by always selecting the most populous class. New functions assigned is the num-ber of proteins of unassigned function predicted. The test accuracy estimates might be too pessimistic becauseproteins can have more than one functional class, but only one of these is considered correct.
general-purpose data-mining algorithm that
can discover knowledge in structured data. It
can learn patterns reflecting one-to-many and
These frequent patterns were converted into
many-to-many relationships over several ta-
Boolean (indicator) attributes for propositional
bles. No standard data-mining program can do
rule learning. An attribute has value 1 for a spe-
this because they are restricted to simple asso-
cific gene if the corresponding query succeeds
ciations in single tables. WARMR uses a first-or-
for that gene and 0 if the query fails. The
der version of the efficient levelwise a priori al-
propositional machine learning algorithm C5
gorithm (Agrawal and Srikant 1994), which
(Quinlan 1993) was then used to induce rules
allows it to be used on very large databases.
that predict function from these Boolean at-
The WARMR levelwise search algorithm is based
tributes. Good rules were selected on a valida-
on a breadth-first search of the pattern space.
tion set and the unbiased accuracy of these
The application of WARMR can be considered as
rules estimated on a test set. Rules were select-
a way of identifying the most important struc-
ed to balance accuracy with unidentified gene
ture in a database. In the E. coli database, WARMR
coverage. The prediction rules were then ap-
Figure 4. Flowchart of the Data-Mining Program Methodology.
This inductive logic programming (ILP)/propositional hybrid approach has proved successful in the past on other scientific discovery tasks. Itis powerful because the clustering improves the representation for learning (using the expressive power of ILP), and the discrimination stepefficiently exploits the prelabeled examples. Good rules were selected on a validation set and the unbiased accuracy of these rules estimatedon the test set. The unbiased accuracy of these rules was estimated on the 1/3 test set. The selection criteria for good rules was that on the val-idation data, they covered at least two correct examples, had an accuracy of at least 50 percent, and had an estimated deviation of ≥ 1.64.
plied to genes that have not been assigned a
rules that were more general than is possible
function to predict their functions. Note: We
did not aim for a general model of the relation-
Using the same prediction rules, we also pre-
ship between sequence and function; we were
dicted the functional class of 1,309 proteins of
satisfied with finding good rules to cover part
were made publicly available at the gene-pre-dictions web site.6 Statistical theory, and the
design of our machine methodology, gave us
In 2000 to 2001, we published this data-min-
confidence in these predictions. However,
ing prediction approach to predicting protein
doubts remained: It seemed a priori unlikely
functional class from sequence (King, Srini-
that protein function could be predicted from
vasan, and Dehaspe 2001; King et al. 2001a,
sequence (predicting protein structure from se-
2000b, 2000). By using a held-out test set of
quence has proved intractable, and predicting
proteins with annotated function, data-mining
function from structure is an unsolved prob-
prediction had an estimated accuracy of 50 to
lem), and it was possible that the proteins of
90% percent (depending on the position of
unknown function came from a significantly
class in the function ontology). A summary of
different distribution from those of known
these results is given in table 1. A key finding
function—which would invalidate a key statis-
was that the method could learn predictive
The ORF is not predicted to have a β-strand length ≤ 3 and
a homologous protein from class Chytridiomycetes was found
Then its functional class is “Cell processes, Transport/binding proteins”
This rule is based on predicted structural and phylogenic features. In the original test set, this rule was 12/13(86 percent) correct. The default accuracy for this class is 21 percent. Twenty-four genes of unknown functionwere predicted by the rule. Of these 24 predictions, 2 have been confirmed by experiment, 7 have been (inde-pendently) annotated to have the predicted function, and 1 has been annotated to have a nonpredicted func-tion. The rule also has a possible biological explanation. We hypothesize that cytochrome c oxidase in Chytrid-iomycetes is a “molecular living fossil” that has retained features of an ancestral protein that radiated into awide variety of transport proteins, which has allowed the protein to be used to identify very remote homolo-gous that would otherwise be missed.
In the period since these predictions, biolog-
sample of functions confirmed). See table 2 for
ical knowledge has advanced greatly. Some pro-
details of the results for level 3 of the function
teins in E. coli have had their function deter-
hierarchy. An example prediction rule is shown
mined by wet biology. Function determination
in figure 5. It was gratifying that the rules illus-
has also occurred in other organisms, allowing
trated in previous publications were found to
better homology-based function predictions.
have accuracies consistent with prediction on
quences have been determined, allowing se-quence-similarity methods to predict function
with greater accuracy. This new biologicalknowledge allows us to test our predictions di-
In this article, I described two applications of
rectly. We used two ways to test the predictions.
the first-order machine learning methodology,
First, we compared our predictions to the
inductive logic programming, to the problem
of predicting protein function from sequence.
group annotation, which has the advantage of
I consider biology first. Are the results of any
practical use? On this question, I believe the ju-
Second, we examined the scientific literature
ry is still out. The results of homology induc-
for the direct experimental derivation of pro-
tion (although a statistically significant im-
tein functions for our predictions. This test has
provement on PSI-BLAST) are perhaps too small a
the advantage of directly testing the predic-
step-change to make biologists use the system
en masse. However, the functional class results
To test for the probability of our predictions
are probably more significant. These I believe
occurring by chance, Aberystwyth researchers
constitute a step-change in protein function
used a binomial test, with the probability of
prediction methodology. Although initially,
success being the probability of the most popu-
many biologists were skeptical, this reaction
lous class. This test has the advantage of being
seems to be slowly changing, and interest is
simple to calculate, makes few assumptions,
growing in the approach. The new evidence of
and is guaranteed to give an overestimate.
the results of the blind-test predictions (table
The results for the new Riley group annota-
2) should help the acceptance of this method-
tion were statistically highly significant (< 1e-
ology. One important limitation of the data-
15), with prediction accuracies of approximate-
mining–prediction approach is that although
ly 90 percent for the cases where more than
the rules are presented in a symbolic form,
one rule agreed on a prediction. It should also
their meaning is often obscure. It is certainly
be stressed again that these accuracies are likely
possible to find biological justification for
to be underestimates because they are based on
some of the rules (see, for example, figure 5),
the assumption that the Riley annotation is
but in many cases, the biological meaning is
obscure, even when the rules are empirically
The results for the function predictions that
successful. Much more work is needed on the
have either been confirmed, or not, by wet bi-
design of learning systems that produce se-
ological experiments were also highly signifi-
cant, although at lower accuracy than for the
A number of interesting machine learning is-
annotations (probably because of bias in the
sues were brought into focus by the applications:
Predicted Class Confirmed Function Table 2. Predictions of Classes in Level 3 of the E. coli Gene Ontology That Now Have Wet Biological Evidence.
ORF is the Blattner identifier for the protein. The predictions are ordered by result and ID. The rule numbers are identifiers for the specificrule predicting the gene. C = Correct, W = Wrong, NM = Near Miss. There are 10 correct predictions and 14 wrong ones. The probabilityof obtaining this accuracy on newly determined functions occurring by chance is estimated at less than 4.8e-10.
approaches have time after time been the most
chine learning applications, perhaps most, it is
successful in blind trials.7 However, in the pre-
not necessary to use ILP/relational data mining
diction of protein function, it is very hard to
because propositional methods are sufficient
see how the crucial relational aspects of the
because there has been orders-of-magnitude
The functional classes for proteins exist in
and ILP methods do not necessarily act as effi-
hierarchies or directed acyclic graphs, which
cient propositional learners when given wholly
means that the classes are not independent of
propositional data. For example, in bioinfor-
each other. Problems with this characteristic
matics, propositional methods would empiri-
are relatively common in the real world (for ex-
cally seem sufficient to predict protein sec-
ample, in text classification) but have been lit-
tle considered by the statistical or machine
Quantitative and Physical Mapping of Cellular Pro-
teins. TIBTECH 17(1): 121–127.
It is possible for proteins to have more than
Blattner, F. R.; Plunkett, G., 3d.; Block, C. A.; Perna,
one function, that is, to have more than one
N. T.; Burland, V.; Riley, M.; Collado-Vides, J.; Glas-
class value. Such problems are also common in
ner, J. D.; Rode, C. K.; Mayhew, G. F.; Gregor, J.;
the real world and little studied (for example,
Davis, N. W.; Kirkpatrick, H. A.; Goeden, M. A.; Rose,
Clare and King [2001]; Schapire and Singer
D. J.; Mau, B.; and Shao, Y. 1997. The CompleteGenome Sequence of Escherichia coli K–12. Science
[2000]). Of course, it is always possible to create
disjoint classes, but this can distort the problem
Bork, P.; Dandekar, T.; Diaz-Lazcoz, Y.; Eisenhaber, F.;
and create large numbers of artificial classes.
Huynen, M.; and Yuan, Y. P. 1998. Predicting Func-
In conclusion, the application of AI to deci-
tion: From Genes to Genomes and Back. Journal ofMolecular Biology 283(4): 707–725.
beginning. Enormous challenges exist in data
Bradley, A. P. 1995. The Use of Area under ROC
integration, the analysis of data from micro-
Curve in the Evaluation of Learning Algorithms. Pat-
arrays, proteomics, and so on. The dream for
tern Recognition 30(6): 1145–1159.
the future is to be able to develop models of
Brent, R. 1999. Functional Genomics: Learning to
cells, development, tissues, and even whole or-
Think about Gene Expression Data. Current Biology
ganisms. AI has the potential to contribute sub-
stantially to this enterprise. In turn, AI will
Bussey, H. 1997. 1997 Ushers in an Era of Yeast Func-
tional Genomics. Yeast 13(16): 1501–1503.
Clare, A. J., and King, R. D. 2001. Knowledge Discov-ery in Multilabel Phenotype Data. In Proceedings of
I would like to thank Amanda Clare, Luc De-
the Fifth European Conference on Principles and Practice
haspe, and Andreas Karwath who worked with
of Knowledge Discovery in Databases (PKDD’01), eds. L.
me on homology induction and data- mining
De Raedt and A. Siebes, 42–53. Lecture Notes in Arti-
prediction as well as Douglas Kell, Walter
ficial Intelligence 2168. Heidelberg, Germany:
Luyten, and Mike Young for helpful comments
on the work. The work was partly funded by
Dehaspe, L., and Toivonen, H. 1999. Discovery of
Engineering and Physical Science Research
Frequent DATALOG Patterns. Data Mining and Knowl-
DeRisi, J. L.; Iyer, V. R.; and Brown, P. O. 1997. Ex-ploring the Metabolic and Genetic Control of Gene
Expression on a Genomic Scale. Science 278(5338):
1. web.comlab.ox.ac.uk/oucl/research/areas/mach-
Donescu, A.; Waissman, J.; Richard, G.; and Roux, G.
2002. Characterization of Bio-Chemical Signals byInductive Logic Programming. Knowledge-Based Sys-
Dzeroski, S., and Lavrac, N. 2001. Relational DataMining. Berlin: Springer-Verlag.
6. www.aber.ac.uk/compsci/Research/bio/Protein-
Dzeroski, S.; Blockeel, H.; Kompare, B.; Kramer, S.;
Pfahringer, B.; and Van Laer, W. 1999. Experiments
in Predicting Biodegradability. In Proceedings of theNinth International Workshop on Inductive Logic Pro-gramming, 80–91. Lecture Notes in Artificial Intelli-
gence 1634. New York: Springer-Verlag.
Agrawal, R., and Srikant, R. 1994. Fast Algorithms
Finn, P.; Muggleton, S.; Page, D.; and Srinivasan, A.
for Mining Association Rules in Large Databases. Pa-
1998. A. Pharmacophore Discovery Using the Induc-
per presented at the Twentieth International Con-
tive Logic Programming System PROGOL. Machine
ference on Very Large Databases (VLDB), 12–15 Sep-
Hieter, P., and Boguski, N. 1997. Functional Ge-
Altschul, S. F.; Madden, T. L.; Schaffer, A. A.; Zhang,
nomics: It’s All How You Read It. Science 278(5338):
J.; Zhang, Z.; Miller, W.; and Lipman, D. J. 1997.
Gapped BLAST and PSI-BLAST: A New Generation of Pro-
Karwath, A., and King, R. D. 2002. Homology Induc-
tein Database Search Programs. Nucleic Acid Re-
tion: The Use of Machine Learning to Improve Se-
quence Similarity Searches. BMC Bioinformatics 3(1):
Badea, L. 2003. Functional Discrimination of Gene
Expression Patterns in Terms of the Gene Ontology.
Karwath, A., and King, R. D. 2001. An Automated ILP
Paper presented at the Pacific Symposium on Bio-
Server in the Field of Bioinformatics. In Proceedings of
computing, 3–7 January, Lihue, Hawaii. the Eleventh International Conference on Inductive Logic
Blackstock, W. P., and Weir, M. P. 1999. Proteomics:
Programming (ILP’01), eds. C. Rouveirol and M. Se-
bag, 91–103. Lecture Notes in Artificial Intelligence
Murzin, A. G.; Brenner, S. E.; Hubbard, T. J. P.; and
Chothia, C. 1995. SCOP: A Structural Classification ofProtein Database for the Investigation of Sequences
Kell, D., and King, R. D. 2000. On the Optimization
and Structures. Journal of Molecular Biology 247(4):
of Classes for the Assignment of Unidentified Read-
ing Frames in Functional Genomics Programmes:The Need for Machine Learning. Trends in Biotechnol-
Oliver, S. G.; Winson, M. K.; Kell, D. B.; and Baganz,
F. 1998. Systematic Functional Analysis of the YeastGenome. Trends in Biotechnology 16(10): 373–378.
King, R. D.; Clark, D. A.; Shirazi, J.; and Sternberg, M. J. E. 1994. On the Use of Machine Learning to Iden-
Ouali, M., and King, R. D. 2000. Cascaded MultipleClassifiers for Secondary-Structure Prediction. Protein
tify Topological Rules in the Packing of Beta-Strands. Protein Engineering 7(11): 1295–1303.
Park, J.; Teichmann, S. A.; Hubbard, T.; and Chothia,
King, R. D.; Srinivasan, A.; and Dehaspe, L. 2001.
C. 1997. Intermediate Sequences Increase the Detec-
WARMR: A Data-Mining Tool for Chemical Data. Jour-
tion of Homology between Sequences. Journal ofnal of Computer-Aided Molecular Design 15(2): 173–181. Molecular Biology 273(1): 349–354.
King, R. D.; Karwath, A.; Clare, A.; and Dehapse, L.
Park, J.; Karplus, K.; Barrett, C.; Hughey, R.; Haussler,
2001. The Utility of Different Representations of Pro-
D.; Hubbard, T.; and Chothia, C. 1998. Sequence
tein Sequence for Predicting Functional Class. Bioin-
Comparisons Using Multiple Sequences Detect Three
Times as Many Remote Homologues as Pairwise
King, R. D.; Karwath, A.; Clare, A.; and Dehapse, L.
Methods. Journal of Molecular Biology 284(4):
2000a. Accurate Prediction of Protein Class in the M.tuberculosis and E. coli Genomes Using Data Mining.
Pearson, W. R., and Lipman, D. J. 1988. Improved
Yeast (Comparative and Functional Genomics) 17(4):
Tools for Biological Sequence Comparison. Proceed-ings of the National Academy of Science 85(8):
King, R. D.; Karwath, A.; Clare, A.; and Dehapse, L.
2000b. Genome-Scale Prediction of Protein Func-
Quinlan, R. 1993. C4.5: Programs for Machine Learn-
tional Class from Sequence Using Data Mining. In
ing. San Francisco, Calif.: Morgan Kaufmann.
Proceedings of the Sixth ACM SIGKDD International
Schapire, R., and Singer, Y. 2000. BOOSTEXTER: A
Conference on Knowledge Discovery and Data Min-
Boosting-Based System for Text Categorization. Ma-
ing, eds. R. Ramakrishnan, S. Stolfo, R. Bayardo, and
chine Learning 39(2): 135–168.
I. Parsa, 384–389. New York: Association for Comput-ing Machinery.
Sternberg, M. J. E.; King, R. D.; Lewis, R. A.; and Mug-gleton, S. 1994. Application of Machine Learning to
King, R. D.; Muggleton, S. H.; Srinivasan, A.; and
Structural Molecular Biology. Philosophical Transac-
Sternberg, M. J. E. 1996. Structure-Activity Rela-
tions of the Royal Society of London Series B—Biological
tionships Derived by Machine Learning: The Use of
Atoms and Their Bond Connectivities to Predict Mu-
Turcotte, M.; Muggleton, S. H.; and. Sternberg, M. J.
tagenicity by Inductive Logic Programming. Pro-
E. 2001. Automated Discovery of Structural Signa-
ceedings of the National Academy of Sciences 93(1):
tures of Protein Fold and Function. Journal of Molec-ular Biology 306(3): 591–605.
King, R. D.; Muggleton, S.; Lewis, R. A.; and Stern-
Ullman, J. D. 1988. Principles of Databases and Knowl-
berg, M. J. E. 1992. Drug Design by Machine Learn-
edge-Based Systems, Volume 1. Rockville, Md.: Com-
ing—The Use of Inductive Logic Programming to
Model the Structure-Activity-Relationships ofTrimethoprim Analogs Binding to Dihydrofolate-Re-
Ross D. King completed his Ph.D.
ductase. Proceedings of the National Academy of Sci-
Kohler, D., and Sahami, M. 1997. Hierarchically Clas-
sifying Documents Using Very Few Words. Paper pre-
sented at the Fourteenth International Conference of
ment, Brainware GmbH, Berlin. He returned to the United King-
Machine Learning, 8–12 July, Nashville, Tennessee.
Lavrac, N., and Dzeroski, S. 1994. Inductive Logic Pro-gramming: Techniques and Applications. Chichester,
work in the biomolecular structure group of the Im-
perial Cancer Research Fund. In 1997, he moved to
Mitchell, T. M. 1982. Generalization as Search. Arti-
the Department of Computer Science at the Univer-
ficial Intelligence 18(2): 203–226.
sity of Wales, Aberystwyth. He was awarded his chair
Muggleton, S. H. 1992. Inductive Logic Programming.
in 2002. His e-mail address is [email protected].
Muggleton, S. H. 1990. Inductive Logic Program-ming. New Generation Computing 8(4): 295–318.
Muggleton, S.; King, R. D.; and Sternberg, M. J. E. 1992. Protein Secondary-Structure Prediction UsingLogic-Based Machine Learning. Protein Engineering5(7): 647–657.
International Journal of RESEARCH ARTICLE Effect of Different Levels of Dietary Bole (Lake Soil) Inclusion on Feed Intake, Milk Yield and Composition of Holstein Friesian Cows *Shewangzaw Addisu1, Firew Tegegne2 and Zeleke Mekuriaw3 1Alage ATVET College, P.O.Box 77, Zeway; 2Bahir Dar University, P.O.Box 26, Bahir Dar; 3LIVES, ILRI-Bahir Dar A R T I C L E I N F O A B S T R A
Irritable Bowel Syndrome What is Irritable Bowel Syndrome? Irritable Bowel Syndrome is a condition that primarily involves the large intestine (colon) and it is thought to be the result of abnormal muscle contractions. Irritable bowel syndrome is also known as IBS, spastic colon, irritable colon and mucous colitis. Normally, regular and coordinated muscular contractions in the wal