Applying inductive logic programming to predicting gene function

Medical Science

Tadalafil appartiene alla classe degli inibitori selettivi della fosfodiesterasi di tipo 5, con un profilo farmacocinetico caratterizzato da un’emivita terminale di circa diciotto ore. Dopo somministrazione orale viene assorbito rapidamente e raggiunge concentrazioni plasmatiche massime in due ore. La biotrasformazione avviene principalmente tramite CYP3A4 con formazione di metaboliti inattivi, escreti in prevalenza con le feci. L’elevato legame alle proteine plasmatiche (>90%) assicura una distribuzione stabile. Nei confronti delle altre molecole della stessa classe, cialis compresse italia è noto per la durata prolungata dell’attività farmacologica.

Applying inductive logic programming to predicting gene function

Applying Inductive
Logic Programming
to Predicting
Gene Function
■ One of the fastest advancing areas of modern sci- Has this anything to do with AI? I believe ence is functional genomics. This science seeks to yes. The relationship between machines and understand how the complete complement of living organisms is central to AI and was a core molecular components of living organisms (nucle- interest of the founding fathers (Alan Turing, ic acid, protein, small molecules, and so on) inter- John von Neuman, and Norbert Wiener). Mod- act together to form living organisms. Functionalgenomics is of interest to AI because the relation- ern biology is also an instructive and fun do- ship between machines and living organisms is main to apply and sharpen our tools and ideas; central to AI and because the field is an instructive it requires complex knowledge representation, and fun domain to apply and sharpen AI tools and reasoning, learning, and so on. Because the ideas, requiring complex knowledge representa- concrete generally precedes the abstract, real- tion, reasoning, learning, and so on. This article world applications catalyze new research areas. describes two machine learning (inductive logic The sequencing of a genome is only a first programming [ILP])–based approaches to the step to fully understanding how a living organ- bioinformatic problem of predicting protein func-tion from amino acid sequence. The first approach isms works. In computer science terms, knowl- is based on using ILP as a way of bootstrapping edge of a genome is only equivalent to obtain- from conventional sequence-based homology methods. The second approach used protein-func- complication that the function and language tional ontologies to provide function classes and a of the program are unknown and the code un- hybrid ILP method to predict function directly documented and badly written. The key cur- from sequence. Both ILP approaches were success- rent scientific challenge in biology is to unrav- ful in producing accurate prediction rules that el and understand this code. AI tools will be could biologically be interpreted. The work was al- so of interest to machine learning research becauseit highlighted the flexibility of ILP systems in deal- ing with heterogeneous data, the importance ofproblems where classes are related hierarchically, Perhaps the most important discovery from and problems where examples have more than the sequenced genomes is that the functions of only approximately 40 to 70 percent of thepredicted genes are typically known with anyconfidence. For example, in bakers’ yeast (S. We live in interesting scientific times. cerevisiae),one of the most intensely studied of all organisms, of the approximately 6000 pre- dicted protein-encoding genes, the function of living organisms. These genomes provide the only approximately 70 percent can be assigned complete specification of the parts and pro- with any confidence. The new science of func- grams to create living organisms. This knowl- tional genomics (Bussey 1997; Hieter and Bo- guski 1997) is dedicated to determining the Copyright 2004, American Association for Artificial Intelligence. All rights reserved. 0738-4602-2004 / $2.00 function of genes of unassigned function and simonious. In practical applications, there is to further detailing the function of genes with the additional problem that there is usually purported function. Most functional genomics noise present (that is, some E+ are false, and is concerned with the development of new ex- some E– are true), which means that the previ- perimental techniques for elucidating gene ous conditions need to be relaxed such that the function (Blackstock and Weir 1999; DeRisi, Iy- best hypothesis maximizes its coverage of E+ er, and Brown 1997; Oliver et al. 1998). Bio- and minimizes its coverage of E–. These two informatics is playing an essential role in ana- values give a cost ratio that represents the di- vergence from the ideal. The problem of learn- genomics data (Bork et al. 1998; Brent 1999; ing H is typically designed as a search problem through the space of models meeting the pre- A key bioinformatic problem in functional genomics is the prediction of protein function ILP has shown its value in many scientific from sequence. Such predictions both provide problems such as drug design and toxicology important initial information about newly se- (for example, Dzeroski et al. [1999]; Finn et al.
quenced genomes and aid “wet” experimental [1998]; King, Srinivasan, and Dehaspe [2001]; determination of function. Such predictions King et al. [1996, 1992]) and molecular biology are usually done by utilizing sequence similar- (Badea 2003; Donescu et al. 2002; King et al.
ity methods to find an evolutionary related 1994; Muggleton, King, and Sternberg 1992; (homologous) protein in the database that has Sternberg et al. 1994; Turcotte et al. 2001). ILP a known function (Altschul et al. 1997; Pear- has been particularly well suited to problems son and Lipman 1988). The function of the dealing with molecular structure. In such prob- new sequence is then inferred to be the same as lems, ILP has often found solutions not acces- the homologous protein because it is assumed sible to standard statistical, neural network, to have been conserved over evolution. This propositional machine learning, or genetic al- inference is a kind of nearest-neighbor type in gorithms (King et al. 1996). The theories pro- sequence space. Unfortunately, with this ap- duced by ILP have also been generally more proach, only around 50 percent of possible ho- comprehensible than those using proposition- mologies are identified (Park et al. 1998), and al methods because they are more compact and little biological insight is obtained.
closer to natural language (King et al. 1992;Turcotte, Muggleton, and Sternberg 2001).
that utilize first-order predicate logic (FOPL) torepresent background knowledge and theories are generally described as coming from the If a problem can satisfactorily be represented field of inductive logic programming (ILP) and solved using propositional methods, then (Lavrac and Dzeroski 1994; Muggleton 1992, there is no need to apply ILP techniques. Prop- 1990) or relational data mining (RDM) (Dze- ositional methods are generally better devel- roski and Lavrac 2001), depending on the re- oped and more computationally and statisti- search emphasis. For simplicity in this article, I cally efficient. ILP methods do not necessarily refer to all such algorithms as coming from ILP.
default to efficient propositional learners when In ILP, background knowledge (B), examples given wholly propositional data. Use of ILP (E), and hypotheses (H) are represented as logic therefore needs to be justified. My rationale is programs. It is first assumed that there is a re- based on the following features of the problem quirement for an inductive hypothesis; that is, B ç/ E (prior necessity). The core ILP problem is Relational descriptors: Functional genom-
to find hypotheses H such that B ٙ H ç E. Mat- ics naturally involves many relationships in ters are complicated slightly by the fact that ev- the data: phylogenic hierarchies (the tree of life), idence in ILP is usually divided into two types: homologies (genes sharing a common ancestor), (1) E+ data that are consistent with the con- directed graphs relating functions, and so on.
joined hypothesis and background informa- Traditional propositional methods (statistical, tion and (2) E– data that are contrary to the neural network, machine learning, genetic al- conjoined hypothesis and background infor- gorithms, and so on) cannot efficiently repre- mation. Therefore, H is required to meet the sent these relations in inductive inference.
following criterion: B ٙ H ç E+ and B ç H ç/ E.
Data heterogeneity: The relevant data come
It is typical that H is also restricted to necessar- from many different sources and are necessari- ily meet other requirements such as being non- ly stored in multiple tables of relational data- trivial (not, for example, E, or B ⇒ E) and par- bases. To use a conventional data-mining algo- rithm, the tables would have to be joined to form a single prohibitively large table for Homology induction uses background knowl- analysis, which is impractical. In addition, edge, together with the protein’s amino acid ILP/RDB allows the direct analysis of the mul- sequence, to induce homology. The idea is to collect as much information as possible for a Comprehensible results: It is important
protein and then infer homology using dis- that the prediction rules are understandable.
criminatory ILP. The homology induction ap- Biologists generally require that the rules are understandable so that they can suggest new First is the collection of possible homolo- biological ideas and have confidence in them.
gous proteins using an existing method of se- In some bioinformatic applications, this re- quence similarity search (SSS). Aberystwyth re- quirement is not necessary, for example, in searchers use PSI-BLAST (Altschul et al. 1997) predicting protein secondary structure. How- which is essentially an iterative nearest-neigh- ever, given a choice, comprehensible results are bor method (in sequence space). The result of a PSI-BLAST search is a list of possible homolo- At Aberystwyth, researchers have developed gous proteins sorted by probability. Proteins two approaches to applying ILP to predicting where the probability of homology is ambigu- protein function. The first is called homology ous are termed to be in the “twilight zone.” induction and is based on utilizing machine Second is the accumulation of all available information for these proteins. We developed a large multitable database of DATALOG (Ullman and King 2002, 2001). The second method uses 1988) facts to describe the proteins from a wide a hybrid ILP–propositional machine learning variety of bioinformatic sources. This informa- method to predict protein functional class di- tion was selected for relevance to the detection rectly from sequence (King, Srinivasan, and of homology. For each protein, we collected Dehaspe 2001; King et al. 2001a, 2001b, 2000).
bioinformatic database keywords, the organ-ism’s classification (family tree), bioinformatic database references (PROSITE, HSSP, EMBL, PIR—ex-cluding SCOP [structural classification of pro- The identification of evolutionary-related (ho- tein] classifications), predicted secondary struc- mologous) proteins is a key problem in compu- tational molecular biology. Knowledge of a ho- distribution for singlets and pairs of residues, mologous relationship between two proteins, one of known function and the other of un- Third is the induction of rules. We used the known function, allows the probabilistic infer- ILP ALEPH algorithm to learn rules that were ence that the protein with unknown function true for proteins of very high probability of be- has the same function as that of the known one (because evolution generally conserves ilarity)1 and false for proteins with close to zero function). Such inferences are the basis of most probability of being homologous (also based of our knowledge about sequenced genomes.
Protein homology is typically inferred by using Fourth is the application of the rules. We ap- computer programs to measure the similarity plied the rules to the set of twilight zone pro- of two or more proteins. This inference is gen- teins to predict whether they were homologous.
erally done by comparing the two amino acid To assess the accuracy of homology induc- strings of the proteins and measuring the char- tion, it was necessary to have a “gold standard” acterwise similarity between them. Such pro- set of known homologies. We used the syste- grams probably consume more processing time matic approach of Park et al. (1997), which than all other bioinformatic programs put to- used a subset of the SCOP database (Murzin et al.
gether. These methods perform well for closely 1995). The SCOP database is a classification related homologous sequences. However, the database of proteins of known structure; most results for more distantly related proteins are also have known function. Over evolutionary less reliable (Park et al. 1998), detecting only time, protein structure changes more slowly about 50 percent of all possible homologies, than sequence; therefore, structure can be used given an acceptable false-positive rate. to identify more remote homologies than se- In learning problems, all relevant informa- quence. At the family level of SCOP, the struc- tion should be used. The idea behind homolo- tures are so similar that homology is inferred, gy induction is to exploit additional sequence which is not to be confused with protein fold information to bootstrap on the performance recognition, where there is no necessary expec- A perfect prediction method would be able mology induction rules, I use the protein C- to detect all homologous relationships in SCOP.
Phycocyanin (1CPC). Figure 2 shows the HIall However, in practice, unrelated nonhomolo- and HIseq rules learned for C-Phycocyanin gous proteins are predicted (errors of commis- both in their original Prolog form and in Eng- sion), and evolutionary related proteins are lish translation. Phycocyanins are light har- missed (errors of omission). The cost of these vesting proteins. Applying PSI-BLAST to the data different types of errors depends on the biolog- produced three proteins in the twilight zone: ical problem. Therefore, we used receiver-oper- (1) allophycocyanin alpha-b chain (Anabena), ating characteristic (ROC) curves to compare (2) erythroid transcription factor (gata-1 Mus musculus), and (3) oryzain gamma chain pre-cursor (Oryza sativa). The rules in the HIall and HIseq rule sets correctly identified the allophy- Homology induction induced rules for 1,015 proteins of known structure (PDB40D). The convincing experimental evidence for this ho- original PSI-BLAST results were used for the se- mology and note that PSI-BLAST does not use quences where no rules could be induced. In protein names! No rule in any rule set identi- fied the other two twilight zone sequences as rules. The most commonly used predicate of homologous, which would appear to be cor- the single predicate rules was db_ref, used by rect (no structures exist to be certain). Furtherevidence for the power of the homology in- 651 rules. These rules consisted mainly of ref- duction rules is that the homology induction erences to the bioinformatic database PROSITE.2 analysis was applied to version 37.0 of the This result was expected because this database bioinformatic database SWISS-PROT,3 and each contains patterns designed to cluster homolo- of the 13 positive examples not covered by gous families of proteins together. In the this rule have had the keyword phycobilisome PDB40D database, there are 8,022 true homol- added to their annotation since version 38.0 ogy relationships and 2,046,900 false ones. The of SWISS-PROT. It is particularly intriguing that accuracy for PSI-BLAST was 99.69 percent and for the most characteristic feature of the amino homology induction 99.70 percent. Although acid–type rules is low-histidine and -trypoto- the accuracy of homology induction is margin- phan content and that both amino acids have ally higher than PSI-BLAST alone, it is not clear at nitrocyclic aromatic rings, which can be ex- first sight if it is significantly higher. To test sig- plained chemically. Phycocyanins have cova- nificance, researchers therefore performed a lently linked bilin prosthetic groups that con- two-sample χ2 test to compare the actual fre- sist of linked nitrocyclic aromatic rings.
quency of a prediction with the estimated fre- Aberystwyth researchers hypothesize that evo- quency of the prediction. The critical value of lution has selected for low-histidine and -try- χ2 for 1 degree of freedom and 99.995-percent potophan content in phycocyanins to reduce confidence is 7.879, which indicates that ho- electron-transport interference. The require- mology induction was significantly better than ment for a high number of leucine-arginine pairs is also structurally significant because For comparison over all linear costs, I per- these arginines form salt bridges with the pros- formed a ROC analysis. We compared the area thetic groups. The structural rule s2 is also under the ROC (AUROC) for homology induc- consistent with the known structure of phyco- tion and PSI-BLAST. The AUROC value for homol- cyanins, which are well known to have an all ogy induction was 0.65, and the AUROC for PSI- BLAST 0.61. The ROC curve for PSI-BLAST, alongwith the ROC curves for the standard homolo-gy induction (HIall) and a version of homology induction based on the subset of descriptors di- rectly calculable from only sequence informa-tion (HIseq) is shown in figure 1. The dominat- Perhaps the most important recent advance in ing curve is that of HIall, to the left of the other bioinformatics has been the development of two curves. The ROC curve of HIseq does not en- good ontologies to describe protein function, tirely dominate the PSI-BLAST ROC curve, and for for example, GO and RILEY.4,5 These ontologies large sections of the false positive axis, the two take the form of hierarchies or directed acyclic curves have a similar true positive rate. Howev- graphs. Figure 3 illustrates part of the Riley hi- er, for the false positive rate interval of 0.38 to erarchy for the bacteria E. coli, one of the best- 0.5, the ROC curve produced by HIseq does established functional ontologies. The cre- To illustrate the biological utility of the ho- possibility of directly predicting protein func-

Maxmimum ROC curves for PSI-BLAST and the Two HI methods Figure 1. The Three Receiver-Operating Characteristic (ROC) Curves Produced by PSI-BLAST, HIALL and HIseq, for Predictions in the Twilight Zone. Although the ROC curve for PSI-BLAST results from applying ROC analysis directly to the results produced, the ROC curves for both homol-ogy induction methods are maximized using a cross-validated value for resorting. The ROC curve for HIall dominates over the other twocurves at all times, but the curves for PSI-BLAST and HIseq oscillate around each other. HIseq dominates the PSI-BLAST curve between ~0.38and ~0.5.
tional class from sequence. Abstractly, what is required is a discrimination function that We selected the E. coli genome to test the idea maps sequence to biological functional class.
of using machine learning to learn predictive The existing sequence homology recognition mappings between protein sequence and func- methods (see earlier discussion) can be viewedas examples of such functions: Methods based tion. E. coli is probably the best characterized on direct sequence similarity can be consid- extant genome and is the “model” bacteria. It ered as nearest-neighbor–type functions (in se- has an estimated 4,289 identified proteins quence space), and the more complicated ho- (Blattner et al. 1997). Of these proteins, ap- mology recognition methods based on motifs and profiles resemble case-based learning tion. To predict functional class, we collected similar data to that used in homology induc- PDB 1CPC C-Phycocyanin
HI
Prolog
desc(A,chain),amino_acid_ratio_rule(A,h,1).
it has the word “chain” in its SWISS-PROT description line andit has a level 1 histidine content in the residue chain and it has the word “phycobilisome” as a SWISS-PROT keyword.
HI
Prolog
amino_acid_ratio_rule(A,w,1),amino_acid_ratio_rule(A,h,1),amino_acid_pair_ratio_rule(A,l,r,10).
mol_wt_rule(A,3),sec_struc_distribution_rule(A,a,10).
it has a level 1 tryptophan content andit has a level 1 histidine content andit has a level 10 leucine-arginine pair content.
it has a level 3 molecular weight andit has a level 10 predicted α-helix content.
Figure 2. The Homology Induction Rules Learned to Identify 1CPC (C-Phycocyanin) Are Illustrated First in Their Original Prolog Form and Then in English Translation. Two sets of rules are shown, those using HIall and those learned from HIseq. All numbers were discretized into 10 levels for ease of symbolic induction (1 low to 10 high).
tion (see earlier discussion). We formed a DATA- from SWISS-PROT; SWISS-PROT protein keywords LOG database containing all the data we could find on the protein sequences. The most com- molecular weight of the protein, and its pre- dicted secondary structure using PROF (Ouali about a sequence is to run a sequence similari- and King 2000). In total, 10,097,865 DATALOG ty search, which was used as the starting point facts were generated for the E. coli genome. in forming descriptions (we used PSI-BLAST). For To analyze this database, we used a hybrid each protein in the genome, we formed a de- combination of ILP and propositional tree scription based on the frequency of singlets learning (figure 4). The ILP data-mining pro- and pairs of residues in the protein, the phy- gram WARMR (Dehaspe and Toivonen 1999) was logeny (family tree) of the organism from first used to identify frequent patterns (con- which each homologous protein was obtained junctive queries) in the databases. WARMR is a Figure 3. An Example Subset of the Riley Group Protein Functional Ontology in E. coli. No. predicting more than one homology class Table 1. Learning Results for E. coli. The number of rules found are those selected on the validation set. A rule predicts more than one homologyclass if there is more than one sequence similarity cluster in the correct test predictions. A rule predicts a newhomology class if there is a sequence similarity cluster in the test predictions that has no members in the train-ing data. Average test accuracy is the accuracy of the predictions on the test proteins of assigned function (ifconflicts occurred; the prediction with the highest a priori probability was chosen). Default test accuracy is theaccuracy that could be achieved by always selecting the most populous class. New functions assigned is the num-ber of proteins of unassigned function predicted. The test accuracy estimates might be too pessimistic becauseproteins can have more than one functional class, but only one of these is considered correct.
general-purpose data-mining algorithm that can discover knowledge in structured data. It can learn patterns reflecting one-to-many and These frequent patterns were converted into many-to-many relationships over several ta- Boolean (indicator) attributes for propositional bles. No standard data-mining program can do rule learning. An attribute has value 1 for a spe- this because they are restricted to simple asso- cific gene if the corresponding query succeeds ciations in single tables. WARMR uses a first-or- for that gene and 0 if the query fails. The der version of the efficient levelwise a priori al- propositional machine learning algorithm C5 gorithm (Agrawal and Srikant 1994), which (Quinlan 1993) was then used to induce rules allows it to be used on very large databases.
that predict function from these Boolean at- The WARMR levelwise search algorithm is based tributes. Good rules were selected on a valida- on a breadth-first search of the pattern space.
tion set and the unbiased accuracy of these The application of WARMR can be considered as rules estimated on a test set. Rules were select- a way of identifying the most important struc- ed to balance accuracy with unidentified gene ture in a database. In the E. coli database, WARMR coverage. The prediction rules were then ap- Figure 4. Flowchart of the Data-Mining Program Methodology. This inductive logic programming (ILP)/propositional hybrid approach has proved successful in the past on other scientific discovery tasks. Itis powerful because the clustering improves the representation for learning (using the expressive power of ILP), and the discrimination stepefficiently exploits the prelabeled examples. Good rules were selected on a validation set and the unbiased accuracy of these rules estimatedon the test set. The unbiased accuracy of these rules was estimated on the 1/3 test set. The selection criteria for good rules was that on the val-idation data, they covered at least two correct examples, had an accuracy of at least 50 percent, and had an estimated deviation of ≥ 1.64.
plied to genes that have not been assigned a rules that were more general than is possible function to predict their functions. Note: We
did not aim for a general model of the relation- Using the same prediction rules, we also pre- ship between sequence and function; we were dicted the functional class of 1,309 proteins of satisfied with finding good rules to cover part were made publicly available at the gene-pre-dictions web site.6 Statistical theory, and the design of our machine methodology, gave us In 2000 to 2001, we published this data-min- confidence in these predictions. However, ing prediction approach to predicting protein doubts remained: It seemed a priori unlikely functional class from sequence (King, Srini- that protein function could be predicted from vasan, and Dehaspe 2001; King et al. 2001a, sequence (predicting protein structure from se- 2000b, 2000). By using a held-out test set of quence has proved intractable, and predicting proteins with annotated function, data-mining function from structure is an unsolved prob- prediction had an estimated accuracy of 50 to lem), and it was possible that the proteins of 90% percent (depending on the position of unknown function came from a significantly class in the function ontology). A summary of different distribution from those of known these results is given in table 1. A key finding function—which would invalidate a key statis- was that the method could learn predictive The ORF is not predicted to have a β-strand length ≤ 3 and
a homologous protein from class Chytridiomycetes was found Then its functional class is “Cell processes, Transport/binding proteins”
This rule is based on predicted structural and phylogenic features. In the original test set, this rule was 12/13(86 percent) correct. The default accuracy for this class is 21 percent. Twenty-four genes of unknown functionwere predicted by the rule. Of these 24 predictions, 2 have been confirmed by experiment, 7 have been (inde-pendently) annotated to have the predicted function, and 1 has been annotated to have a nonpredicted func-tion. The rule also has a possible biological explanation. We hypothesize that cytochrome c oxidase in Chytrid-iomycetes is a “molecular living fossil” that has retained features of an ancestral protein that radiated into awide variety of transport proteins, which has allowed the protein to be used to identify very remote homolo-gous that would otherwise be missed.
In the period since these predictions, biolog- sample of functions confirmed). See table 2 for ical knowledge has advanced greatly. Some pro- details of the results for level 3 of the function teins in E. coli have had their function deter- hierarchy. An example prediction rule is shown mined by wet biology. Function determination in figure 5. It was gratifying that the rules illus- has also occurred in other organisms, allowing trated in previous publications were found to better homology-based function predictions.
have accuracies consistent with prediction on quences have been determined, allowing se-quence-similarity methods to predict function with greater accuracy. This new biologicalknowledge allows us to test our predictions di- In this article, I described two applications of rectly. We used two ways to test the predictions. the first-order machine learning methodology, First, we compared our predictions to the inductive logic programming, to the problem of predicting protein function from sequence. group annotation, which has the advantage of I consider biology first. Are the results of any practical use? On this question, I believe the ju- Second, we examined the scientific literature ry is still out. The results of homology induc- for the direct experimental derivation of pro- tion (although a statistically significant im- tein functions for our predictions. This test has provement on PSI-BLAST) are perhaps too small a the advantage of directly testing the predic- step-change to make biologists use the system en masse. However, the functional class results To test for the probability of our predictions are probably more significant. These I believe occurring by chance, Aberystwyth researchers constitute a step-change in protein function used a binomial test, with the probability of prediction methodology. Although initially, success being the probability of the most popu- many biologists were skeptical, this reaction lous class. This test has the advantage of being seems to be slowly changing, and interest is simple to calculate, makes few assumptions, growing in the approach. The new evidence of and is guaranteed to give an overestimate.
the results of the blind-test predictions (table The results for the new Riley group annota- 2) should help the acceptance of this method- tion were statistically highly significant (< 1e- ology. One important limitation of the data- 15), with prediction accuracies of approximate- mining–prediction approach is that although ly 90 percent for the cases where more than the rules are presented in a symbolic form, one rule agreed on a prediction. It should also their meaning is often obscure. It is certainly be stressed again that these accuracies are likely possible to find biological justification for to be underestimates because they are based on some of the rules (see, for example, figure 5), the assumption that the Riley annotation is but in many cases, the biological meaning is obscure, even when the rules are empirically The results for the function predictions that successful. Much more work is needed on the have either been confirmed, or not, by wet bi- design of learning systems that produce se- ological experiments were also highly signifi- cant, although at lower accuracy than for the A number of interesting machine learning is- annotations (probably because of bias in the sues were brought into focus by the applications: Predicted Class
Confirmed Function
Table 2. Predictions of Classes in Level 3 of the E. coli Gene Ontology That Now Have Wet Biological Evidence. ORF is the Blattner identifier for the protein. The predictions are ordered by result and ID. The rule numbers are identifiers for the specificrule predicting the gene. C = Correct, W = Wrong, NM = Near Miss. There are 10 correct predictions and 14 wrong ones. The probabilityof obtaining this accuracy on newly determined functions occurring by chance is estimated at less than 4.8e-10.
approaches have time after time been the most chine learning applications, perhaps most, it is successful in blind trials.7 However, in the pre- not necessary to use ILP/relational data mining diction of protein function, it is very hard to because propositional methods are sufficient see how the crucial relational aspects of the because there has been orders-of-magnitude The functional classes for proteins exist in and ILP methods do not necessarily act as effi- hierarchies or directed acyclic graphs, which cient propositional learners when given wholly means that the classes are not independent of propositional data. For example, in bioinfor- each other. Problems with this characteristic matics, propositional methods would empiri- are relatively common in the real world (for ex- cally seem sufficient to predict protein sec- ample, in text classification) but have been lit- tle considered by the statistical or machine Quantitative and Physical Mapping of Cellular Pro- teins. TIBTECH 17(1): 121–127.
It is possible for proteins to have more than Blattner, F. R.; Plunkett, G., 3d.; Block, C. A.; Perna, one function, that is, to have more than one N. T.; Burland, V.; Riley, M.; Collado-Vides, J.; Glas- class value. Such problems are also common in ner, J. D.; Rode, C. K.; Mayhew, G. F.; Gregor, J.; the real world and little studied (for example, Davis, N. W.; Kirkpatrick, H. A.; Goeden, M. A.; Rose, Clare and King [2001]; Schapire and Singer D. J.; Mau, B.; and Shao, Y. 1997. The CompleteGenome Sequence of Escherichia coli K–12. Science [2000]). Of course, it is always possible to create disjoint classes, but this can distort the problem Bork, P.; Dandekar, T.; Diaz-Lazcoz, Y.; Eisenhaber, F.; and create large numbers of artificial classes.
Huynen, M.; and Yuan, Y. P. 1998. Predicting Func- In conclusion, the application of AI to deci- tion: From Genes to Genomes and Back. Journal of Molecular Biology 283(4): 707–725.
beginning. Enormous challenges exist in data Bradley, A. P. 1995. The Use of Area under ROC integration, the analysis of data from micro- Curve in the Evaluation of Learning Algorithms. Pat- arrays, proteomics, and so on. The dream for tern Recognition 30(6): 1145–1159.
the future is to be able to develop models of Brent, R. 1999. Functional Genomics: Learning to cells, development, tissues, and even whole or- Think about Gene Expression Data. Current Biology ganisms. AI has the potential to contribute sub- stantially to this enterprise. In turn, AI will Bussey, H. 1997. 1997 Ushers in an Era of Yeast Func- tional Genomics. Yeast 13(16): 1501–1503.
Clare, A. J., and King, R. D. 2001. Knowledge Discov-ery in Multilabel Phenotype Data. In Proceedings of I would like to thank Amanda Clare, Luc De- the Fifth European Conference on Principles and Practice haspe, and Andreas Karwath who worked with of Knowledge Discovery in Databases (PKDD’01), eds. L.
me on homology induction and data- mining De Raedt and A. Siebes, 42–53. Lecture Notes in Arti- prediction as well as Douglas Kell, Walter ficial Intelligence 2168. Heidelberg, Germany: Luyten, and Mike Young for helpful comments on the work. The work was partly funded by Dehaspe, L., and Toivonen, H. 1999. Discovery of Engineering and Physical Science Research Frequent DATALOG Patterns. Data Mining and Knowl- DeRisi, J. L.; Iyer, V. R.; and Brown, P. O. 1997. Ex-ploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale. Science 278(5338): 1. web.comlab.ox.ac.uk/oucl/research/areas/mach- Donescu, A.; Waissman, J.; Richard, G.; and Roux, G.
2002. Characterization of Bio-Chemical Signals byInductive Logic Programming. Knowledge-Based Sys- Dzeroski, S., and Lavrac, N. 2001. Relational Data Mining. Berlin: Springer-Verlag.
6. www.aber.ac.uk/compsci/Research/bio/Protein- Dzeroski, S.; Blockeel, H.; Kompare, B.; Kramer, S.; Pfahringer, B.; and Van Laer, W. 1999. Experiments in Predicting Biodegradability. In Proceedings of theNinth International Workshop on Inductive Logic Pro-gramming, 80–91. Lecture Notes in Artificial Intelli- gence 1634. New York: Springer-Verlag.
Agrawal, R., and Srikant, R. 1994. Fast Algorithms Finn, P.; Muggleton, S.; Page, D.; and Srinivasan, A.
for Mining Association Rules in Large Databases. Pa- 1998. A. Pharmacophore Discovery Using the Induc- per presented at the Twentieth International Con- tive Logic Programming System PROGOL. Machine ference on Very Large Databases (VLDB), 12–15 Sep- Hieter, P., and Boguski, N. 1997. Functional Ge- Altschul, S. F.; Madden, T. L.; Schaffer, A. A.; Zhang, nomics: It’s All How You Read It. Science 278(5338): J.; Zhang, Z.; Miller, W.; and Lipman, D. J. 1997.
Gapped BLAST and PSI-BLAST: A New Generation of Pro- Karwath, A., and King, R. D. 2002. Homology Induc- tein Database Search Programs. Nucleic Acid Re- tion: The Use of Machine Learning to Improve Se- quence Similarity Searches. BMC Bioinformatics 3(1): Badea, L. 2003. Functional Discrimination of Gene Expression Patterns in Terms of the Gene Ontology.
Karwath, A., and King, R. D. 2001. An Automated ILP Paper presented at the Pacific Symposium on Bio- Server in the Field of Bioinformatics. In Proceedings of computing, 3–7 January, Lihue, Hawaii.
the Eleventh International Conference on Inductive Logic Blackstock, W. P., and Weir, M. P. 1999. Proteomics: Programming (ILP’01), eds. C. Rouveirol and M. Se-

bag, 91–103. Lecture Notes in Artificial Intelligence Murzin, A. G.; Brenner, S. E.; Hubbard, T. J. P.; and Chothia, C. 1995. SCOP: A Structural Classification ofProtein Database for the Investigation of Sequences Kell, D., and King, R. D. 2000. On the Optimization and Structures. Journal of Molecular Biology 247(4): of Classes for the Assignment of Unidentified Read- ing Frames in Functional Genomics Programmes:The Need for Machine Learning. Trends in Biotechnol- Oliver, S. G.; Winson, M. K.; Kell, D. B.; and Baganz, F. 1998. Systematic Functional Analysis of the YeastGenome. Trends in Biotechnology 16(10): 373–378.
King, R. D.; Clark, D. A.; Shirazi, J.; and Sternberg, M.
J. E. 1994. On the Use of Machine Learning to Iden- Ouali, M., and King, R. D. 2000. Cascaded MultipleClassifiers for Secondary-Structure Prediction. Protein tify Topological Rules in the Packing of Beta-Strands.
Protein Engineering 7(11): 1295–1303.
Park, J.; Teichmann, S. A.; Hubbard, T.; and Chothia, King, R. D.; Srinivasan, A.; and Dehaspe, L. 2001.
C. 1997. Intermediate Sequences Increase the Detec- WARMR: A Data-Mining Tool for Chemical Data. Jour- tion of Homology between Sequences. Journal of nal of Computer-Aided Molecular Design 15(2): 173–181.
Molecular Biology 273(1): 349–354.
King, R. D.; Karwath, A.; Clare, A.; and Dehapse, L.
Park, J.; Karplus, K.; Barrett, C.; Hughey, R.; Haussler, 2001. The Utility of Different Representations of Pro- D.; Hubbard, T.; and Chothia, C. 1998. Sequence tein Sequence for Predicting Functional Class. Bioin- Comparisons Using Multiple Sequences Detect Three Times as Many Remote Homologues as Pairwise King, R. D.; Karwath, A.; Clare, A.; and Dehapse, L.
Methods. Journal of Molecular Biology 284(4): 2000a. Accurate Prediction of Protein Class in the M. tuberculosis and E. coli Genomes Using Data Mining.
Pearson, W. R., and Lipman, D. J. 1988. Improved Yeast (Comparative and Functional Genomics) 17(4): Tools for Biological Sequence Comparison. Proceed- ings of the National Academy of Science 85(8): King, R. D.; Karwath, A.; Clare, A.; and Dehapse, L.
2000b. Genome-Scale Prediction of Protein Func- Quinlan, R. 1993. C4.5: Programs for Machine Learn- tional Class from Sequence Using Data Mining. In ing. San Francisco, Calif.: Morgan Kaufmann.
Proceedings of the Sixth ACM SIGKDD International Schapire, R., and Singer, Y. 2000. BOOSTEXTER: A Conference on Knowledge Discovery and Data Min- Boosting-Based System for Text Categorization. Ma- ing, eds. R. Ramakrishnan, S. Stolfo, R. Bayardo, and chine Learning 39(2): 135–168.
I. Parsa, 384–389. New York: Association for Comput-ing Machinery.
Sternberg, M. J. E.; King, R. D.; Lewis, R. A.; and Mug-gleton, S. 1994. Application of Machine Learning to King, R. D.; Muggleton, S. H.; Srinivasan, A.; and Structural Molecular Biology. Philosophical Transac- Sternberg, M. J. E. 1996. Structure-Activity Rela- tions of the Royal Society of London Series B—Biological tionships Derived by Machine Learning: The Use of Atoms and Their Bond Connectivities to Predict Mu- Turcotte, M.; Muggleton, S. H.; and. Sternberg, M. J.
tagenicity by Inductive Logic Programming. Pro- E. 2001. Automated Discovery of Structural Signa- ceedings of the National Academy of Sciences 93(1): tures of Protein Fold and Function. Journal of Molec- ular Biology 306(3): 591–605.
King, R. D.; Muggleton, S.; Lewis, R. A.; and Stern- Ullman, J. D. 1988. Principles of Databases and Knowl- berg, M. J. E. 1992. Drug Design by Machine Learn- edge-Based Systems, Volume 1. Rockville, Md.: Com- ing—The Use of Inductive Logic Programming to Model the Structure-Activity-Relationships ofTrimethoprim Analogs Binding to Dihydrofolate-Re- Ross D. King completed his Ph.D.
ductase. Proceedings of the National Academy of Sci- Kohler, D., and Sahami, M. 1997. Hierarchically Clas- sifying Documents Using Very Few Words. Paper pre- sented at the Fourteenth International Conference of ment, Brainware GmbH, Berlin.
He returned to the United King- Machine Learning, 8–12 July, Nashville, Tennessee.
Lavrac, N., and Dzeroski, S. 1994. Inductive Logic Pro- gramming: Techniques and Applications. Chichester, work in the biomolecular structure group of the Im- perial Cancer Research Fund. In 1997, he moved to Mitchell, T. M. 1982. Generalization as Search. Arti- the Department of Computer Science at the Univer- ficial Intelligence 18(2): 203–226.
sity of Wales, Aberystwyth. He was awarded his chair Muggleton, S. H. 1992. Inductive Logic Programming. in 2002. His e-mail address is [email protected].
Muggleton, S. H. 1990. Inductive Logic Program-ming. New Generation Computing 8(4): 295–318.
Muggleton, S.; King, R. D.; and Sternberg, M. J. E.
1992. Protein Secondary-Structure Prediction UsingLogic-Based Machine Learning. Protein Engineering5(7): 647–657.

Source: http://www.aistudy.co.kr/paper/aaai_journal/AIMag25-01-006.pdf

Microsoft word - ijab 181 composed

International Journal of RESEARCH ARTICLE Effect of Different Levels of Dietary Bole (Lake Soil) Inclusion on Feed Intake, Milk Yield and Composition of Holstein Friesian Cows *Shewangzaw Addisu1, Firew Tegegne2 and Zeleke Mekuriaw3 1Alage ATVET College, P.O.Box 77, Zeway; 2Bahir Dar University, P.O.Box 26, Bahir Dar; 3LIVES, ILRI-Bahir Dar A R T I C L E I N F O A B S T R A

dubuqueinternalmed.com

Irritable Bowel Syndrome What is Irritable Bowel Syndrome? Irritable Bowel Syndrome is a condition that primarily involves the large intestine (colon) and it is thought to be the result of abnormal muscle contractions. Irritable bowel syndrome is also known as IBS, spastic colon, irritable colon and mucous colitis. Normally, regular and coordinated muscular contractions in the wal