The authors have declared that no competing interests exist.
Conceived and designed the experiments: CP FMC. Performed the experiments: CP. Analyzed the data: CP. Contributed reagents/materials/analysis tools: CP. Wrote the paper: CP FMC.
Developing and extending a biomedical ontology is a very demanding task that can never be considered complete, given our ever-evolving understanding of the life sciences. Extension in particular can benefit from the automation of some of its steps, thus freeing experts to focus on harder tasks. Here we present a strategy to support the automation of change capturing within ontology extension, where the need for new concepts or relations is identified. Our strategy is based on predicting areas of an ontology that will undergo extension in a future version, by applying supervised learning over features of previous ontology versions. We used the Gene Ontology as our test bed and obtained encouraging results, with average f-measure reaching 0.79 for a subset of biological process terms. Our strategy was also able to outperform state-of-the-art change capturing methods. In addition, we have identified several issues concerning the prediction of ontology evolution, and have delineated a general framework for ontology extension prediction. Our strategy can be applied to any biomedical ontology with versioning, to help focus either manual or semi-automated extension methods on areas of the ontology that need extension.
Biomedical knowledge is complex and in constant evolution and growth, making it difficult for researchers to keep up with novel discoveries. Ontologies have become essential to help with this issue, since they provide a standardized format to describe knowledge that facilitates its storage, sharing and computational analysis. However, keeping a biomedical ontology up-to-date is a demanding and costly task involving several experts. Much of this effort is dedicated to the addition of new elements to extend the ontology to cover new areas of knowledge. We have developed an automated methodology to identify areas of the ontology that need extension, based on past versions of the ontology as well as external data such as references in the scientific literature and ontology usage. This can be a valuable help to semi-automated ontology extension systems, since they can focus on the subdomains of the identified ontology areas, thus reducing the amount of information to process, which in turn frees ontology developers to focus on more complex ontology evolution tasks. By contributing to a faster rate of ontology evolution, we hope to positively impact ontology-based applications such as natural language processing, computer reasoning, information integration and semantic querying of heterogeneous data.
Despite the last decade's efforts to structure and organize the deluge of biomedical data brought on by high throughput techniques, there are still many issues that challenge biomedical knowledge discovery and management
On one hand, most scientific knowledge is still present only in natural language text in the form of scientific publications, whose number grows nearly exponentially, making it necessary to employ text mining techniques if we are ever to aspire to keep up. However, the natural ambiguity and subjectivity of natural language hinder the automated processing of scientific publications. On the other hand, although there is a large number of databases to store biomedical data, the effort to achieve interoperability between them is still lagging behind, given that most resources, particularly the older ones, were developed in a completely independent fashion, and the efforts to connect them to other resources are still insufficient.
One very important breakthrough for both areas was the development of biomedical ontologies (bio-ontologies). They address both issues by providing unequivocal and structured models of specific domains, which is fundamental to resolving semantic ambiguities in text mining and to serving as a common background for biomedical databases.
The development of a biomedical ontology, or any other domain ontology, is a very demanding process that requires expertise both in the domain to be modeled and in ontology design. This means that people from different backgrounds, such as biology, philosophy and computer science, should be involved in the process of creating an ontology. However, specific biomedical ontologies are usually built by small teams of life sciences researchers with little experience in ontology design. They are responsible for, first, agreeing on the precise limits of the domain to model; second, defining the structure and complexity of the model; and finally, building the ontology itself by creating the concepts, relations and other axioms it might contain
Several methodologies have been developed to help build ontologies
This ontology evolution
A relevant process of ontology evolution is the addition of new elements, i.e. ontology extension. Ontology extension is particularly relevant in fast-growing domains such as biomedicine, where new knowledge is created every day. The first step in this process is to identify the changes that need to be performed: change capturing. This differs crucially from a general ontology learning process, which handles the whole domain at once, in that it focuses on specific areas within the domain of the ontology to be extended.
In this paper we present a methodology that addresses change capturing by predicting ontology extension. The fact that these changes can in principle be semi-automatically discovered by analyzing the ontology data and its usage motivated the present work. Ours is a supervised learning based strategy that predicts the areas of the ontology that will undergo extension in a future version, based on previous versions of the ontology. By pinpointing which areas of the ontology are more likely to undergo extension, this methodology can be integrated into both manual and semi-automated ontology extension approaches, providing a focus for extension efforts and thus helping to ease the burden of keeping an ontology up-to-date.
The primary goal of our methodology is to function as a first step in automated ontology learning or extension systems. Ontology learning systems usually rely on the analysis of a manually constructed corpus of documents pertaining to the domain of interest, and their performance is closely coupled to the relevance of these documents. The challenge of focusing the ontology given a heterogeneous corpus in ontology learning has been identified
Our main contribution for ontology developers lies in speeding up the extension process in these areas, thus freeing experts to focus on more complex ontology evolution issues. We have chosen to evaluate our approach using the Gene Ontology, since it provides many versions spanning a number of years, and is perhaps the best known and most widely used biomedical ontology.
In the remainder of this section we will introduce some basic concepts, present related work and describe the Gene Ontology.
Ontology evolution can be defined as the process of modifying an ontology in response to a certain change in the domain or its conceptualization
Ontology evolution comprises several different processes, based on the type of change transformations they employ over ontology elements: add, remove or modify. While adding new elements is mostly employed in response to a change of the first or third type, removing elements is often related to the first, second and fourth types. Modifying existing elements can belong to any of the four kinds and ultimately be seen as a compound change of removing one element and adding a slightly different new one. In this work we are only concerned with change transformations that add new elements to the ontology, thereby extending it.
Although
Thus, ontology extension is concerned with elementary changes of the addition type. Many reasons can motivate such a change (new discoveries, access to previously unavailable information sources, a change in the viewpoint or usage of the ontology, or a change in its level of refinement), but they all rely on the discovery of new knowledge. Ontology extension encompasses both ontology refinement and ontology enrichment.
Before these changes are actually performed, the need for the change must be identified. This is the first step of any ontology evolution process, the change capturing phase, and it can be based on explicit or implicit requirements
Although there is a large body of work on ontology evolution (for a review see
Browsing-based measures are based on the user's browsing of links between ontology concepts. They define the usage of two concepts
Another usage-driven strategy was proposed by
Also relevant for our work is the investigation of ontology evolution in biomedical ontologies.
In
On a previous study we delineated a framework to analyze ontology extension and used it as a background for investigating the feasibility of predicting ontology extension based on a set of rules
A concept with many instances is a candidate for being split into subconcepts and its instances distributed among newly generated concepts.
If a class has only one direct subclass, there may be a modeling problem or the ontology may be incomplete.
If there are more than a dozen subclasses for a given class, then additional intermediate categories may be necessary.
Based on these we created a set of rules for predicting the extension of the Gene Ontology:
Application of these rules to several versions of the Gene Ontology yielded very poor prediction results, highlighting the need for more complex approaches to model this issue.
The strategy we present here is unlike previously described works, since we use metrics of previous ontology versions to support prediction, whereas change capturing approaches are based on manually derived rules and ontology evolution approaches analyze evolution of existing ontology versions.
The Gene Ontology (GO) is currently the most successful case of ontology application in bioinformatics and provides a controlled vocabulary to describe functional aspects of gene products under three distinct ontologies: biological process, molecular function and cellular component. GO terms are structured in a directed acyclic graph with its hierarchical backbone being composed of
GO is used to annotate gene products, and these annotations are compiled by the Gene Ontology Annotation project (GOA). GO annotations are assigned an evidence code which identifies the kind of evidence supporting the annotation. Although over a dozen evidence codes exist, the most relevant distinction between them is whether they are manually assigned by a curator or inferred electronically. Electronic annotations are generally considered to be of lower quality than manual ones, but comprise the vast majority of present GO annotations (over 97%). Another relevant aspect of annotations is whether they are direct, i.e. made precisely to that GO term, or indirect, i.e. made to a subconcept of that GO term, from which we can deduce that there is also an annotation to all of its superconcepts.
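The deduction of indirect annotations described above can be sketched as follows; the toy graph and annotation counts are hypothetical, not real GO data.

```python
# Propagate direct annotation counts up a toy is_a DAG: every direct
# annotation to a term also counts as an indirect annotation to each
# of its superconcepts. Example data is hypothetical.
from collections import defaultdict

def total_annotations(parents, direct):
    """parents: term -> list of direct superterms; direct: term -> count."""
    memo = {}

    def ancestors(term):
        if term not in memo:
            acc = set()
            for p in parents.get(term, []):
                acc.add(p)
                acc |= ancestors(p)
            memo[term] = acc
        return memo[term]

    total = defaultdict(int, direct)
    for term, count in direct.items():
        for anc in ancestors(term):
            total[anc] += count
    return dict(total)

# Toy DAG: 'response_to_heat' is_a 'response_to_stress' is_a 'biological_process'.
parents = {"response_to_stress": ["biological_process"],
           "response_to_heat": ["response_to_stress"]}
direct = {"biological_process": 1, "response_to_stress": 2,
          "response_to_heat": 5}
print(total_annotations(parents, direct))
# 'biological_process' accumulates 1 + 2 + 5 = 8 total annotations
```

Because ancestors are collected as a set, a term reachable through multiple paths in the DAG is still counted only once per annotation.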
GO also provides a cut-down version of the GO ontologies, GO Slims, which contain a subset of the terms in the whole GO to give a broad overview of the ontology content without the detail of the specific fine grained terms.
There are about one hundred contributors to GO between the GO Consortium and GO Associates, and they are expected to contribute regularly towards the content of GO. Other GO users can also contribute by suggesting new terms via Sourceforge.net; however, the majority of content requests are made by GO team members
working closely with the reference genome annotation group to ensure that areas that are known to undergo intense annotation in the near future are updated
listening to the biological community
ensuring that emerging genomes have the necessary classes to support their needs
Although some steps have been taken in the direction of automating some aspects of GO evolution, namely the extension of GO with computable logical definitions including cross-references to other ontologies
Following our previous work
ontology version | n. terms | n. relations | max depth | avg depth | deletions | insertions | total annotations | manual annotations
Jan 2005 | 17K | 26K | 17 | 6.8 | N/A | N/A | 6.0 M | 0.50 M |
Jul 2005 | 18K | 28K | 19 | 7.0 | 111 | 885 | 7.1 M | 0.62 M |
Jan 2006 | 19K | 30K | 18 | 7.0 | 42 | 1311 | 7.3 M | 0.56 M |
Jul 2006 | 20K | 31K | 18 | 7.0 | 20 | 578 | 9.0 M | 0.56 M |
Jan 2007 | 22K | 35K | 18 | 7.2 | 97 | 2079 | 10.4 M | 0.62 M |
Jun 2007 | 23K | 38K | 18 | 6.9 | 131 | 1454 | 12.4 M | 0.66 M |
Jan 2008 | 24K | 40K | 18 | 4.9 | 153 | 1674 | 19.0 M | 0.73 M |
Jul 2008 | 25K | 44K | 18 | 4.9 | 104 | 807 | 23.0 M | 0.78 M |
Jan 2009 | 27K | 47K | 18 | 4.9 | 17 | 1415 | 24.7 M | 0.79 M |
Aug 2009 | 28K | 51K | 18 | 5.0 | 77 | 1487 | 33.0 M | 0.87 M |
Jan 2010 | 29K | 54K | 19 | 4.9 | 61 | 1476 | 33.5 M | 0.91 M |
Jul 2010 | 32K | 57K | 15 | 3.9 | 31 | 1302 | 60.5 M | 1.06 M |
Jan 2011 | 33K | 60K | 15 | 4.01 | 106 | 2698 | 54.4 M | 1.23 M |
Jul 2011 | 34K | 63K | 15 | 4.03 | 48 | 1208 | 63.8 M | 1.35 M |
Jan 2012 | 36K | 65K | 15 | 4.05 | 32 | 1113 | 77.8 M | 1.41 M |
Deletions and insertions are counted with respect to the version in the line above.
The intuition behind our proposed strategy is that information encoded in the ontology or its annotation resources can be used to support the prediction of ontology areas that will be extended in a future version. This notion is inspired by change capturing strategies that are based on implicit requirements. However, in existing change capturing approaches these requirements are manually defined based on expert knowledge. Our system attempts to go beyond this by trying to learn these requirements from previous extension events using supervised learning.
In our test case using GO, we use as attributes for learning a series of ontology features based on structural, annotation or citation data. These are calculated for each GO term and then used to train a model able to capture whether a term would be extended in a following version of GO.
Structural features give information on the position of a term and the surrounding structure of the ontology, such as height (i.e. distance to a leaf term) and the number of sibling or children terms. A term is considered a direct child if it is connected to its parent by an
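The structural features just described can be computed directly from the ontology graph. The sketch below is illustrative, not the authors' implementation; it uses a hypothetical term-to-parents encoding and takes height as the distance to the nearest descendant leaf.

```python
# Per-term structural features: number of children, height (distance
# to the closest descendant leaf), and number of siblings.
def structural_features(parents):
    # Invert the term -> parents map into a term -> children map.
    children = {}
    for term, ps in parents.items():
        children.setdefault(term, [])
        for p in ps:
            children.setdefault(p, []).append(term)

    heights = {}
    def height(term):
        if term not in heights:
            kids = children[term]
            heights[term] = 0 if not kids else 1 + min(height(k) for k in kids)
        return heights[term]

    features = {}
    for term in children:
        sibs = {s for p in parents.get(term, []) for s in children[p]} - {term}
        features[term] = {"children": len(children[term]),
                          "height": height(term),
                          "siblings": len(sibs)}
    return features

# Toy hierarchy: b and c are children of a; d is a child of b.
feats = structural_features({"b": ["a"], "c": ["a"], "d": ["b"]})
print(feats["a"])  # {'children': 2, 'height': 1, 'siblings': 0}
```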
(Table: the thirteen single features, grouped by type (Structural, Annotation, Citation, Hybrid), and their membership in each of the feature sets: all, simple structure, uniformity, annotations, direct and indirect.)
Due to the complexity of ontology extension, we have established a framework for outlining ontology extension in an application scenario. This framework defines the following parameters:
Extension type:
Extension mode:
Term set:
terms at a given
terms at a given distance to
Time parameters:
By clearly describing the ontology extension process according to this framework, we are able to accurately circumscribe our ontology extension prediction efforts.
The datasets used for classification were then composed of vectors of attributes followed by a boolean class value corresponding to extension in the version to be predicted, according to the chosen parameters. To compose the datasets we need not only the parameters but also an initial set of ontology versions from which to calculate the features, and the ontology version from which to calculate the extension outcome (i.e., class labels). So given a set of sequential ontology versions
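Under these definitions, assembling a training set from two sequential versions can be sketched as follows. This is a minimal illustration, assuming per-term feature vectors and per-term subgraph sets are already available for each version; all names are illustrative, not the authors' code.

```python
# Build (feature vector, label) pairs: features come from version t,
# labels from whether each term's subgraph gained new terms by
# version t+1 (indirect extension).
def build_dataset(term_set, features_t, subgraph_t, subgraph_t1):
    """features_t: term -> feature vector computed on version t;
    subgraph_t / subgraph_t1: term -> set of terms in its subgraph."""
    rows = []
    for term in term_set:
        extended = len(subgraph_t1[term] - subgraph_t[term]) > 0
        rows.append((features_t[term], extended))
    return rows

# Toy example with two hypothetical versions: t1's subgraph gains a
# term, t2's does not.
data = build_dataset(
    ["t1", "t2"],
    {"t1": [3, 0.5], "t2": [1, 0.1]},
    {"t1": {"t1", "a"}, "t2": {"t2"}},
    {"t1": {"t1", "a", "new"}, "t2": {"t2"}})
print(data)  # [([3, 0.5], True), ([1, 0.1], False)]
```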
We tested several supervised learning algorithms, namely Decision Tables, Naive Bayes, SVM, Neural Networks and Bayesian Networks, using their WEKA implementations
To evaluate our Ontology Extension Prediction strategy we employed a simple approach: compare our predictions to the actual extension of the Gene Ontology in a future version. To this end we employ another time parameter:
This time parameter is used to create the test set, by shifting the ontology versions according to
This approach allows us to compare the set of proposed extensions to real ones that actually took place in a future version of the ontology. We can calculate precision, recall and f-measure, using the real extension events observed in the more recent ontology version as our test case. These metrics are based on the numbers of true positives, false positives, true negatives and false negatives. A true positive is an ontology class that our supervised learning strategy identified as a target for extension and that was indeed extended in the test set, whereas a false positive, although also identified as a target for extension, was not actually extended. Likewise, a false negative is an ontology class that was not identified as a target for extension but was in fact extended, whereas a true negative was neither identified as a target nor extended in the test set. Precision corresponds to the fraction of classes identified as extension targets that were actually extended, while recall is the fraction of real extensions that were identified as extension targets. F-measure is the harmonic mean of precision and recall.
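In code, these metrics reduce to standard formulas over the confusion counts defined above (the example counts are arbitrary):

```python
# Precision, recall and f-measure from the confusion counts;
# f-measure is the harmonic mean of precision and recall.
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

p, r, f = prf(tp=60, fp=20, fn=40)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.75 0.6 0.67
```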
When trying to predict ontology extension we are not just focusing on which features are best predictors, but also on how to design the learning process to best support the prediction. Consequently, we are not only trying to find the best prediction set up in terms of features and machine learning algorithms, but also in terms of our strategy's parameters.
A first step in our experiments was to determine the best term set to use, and to investigate if this was influenced by different parameters. To this end, we tested the following term sets within each GO ontology: all terms, all terms with a depth of 3, 4 and 5, all GO Slim general terms, all GO Slim general leaf terms, all terms at a depth of 1 from the GO Slim general leaf terms, under the same sets of parameters (see
Average term set size:
Term Sets | Biological Process | Cellular Component | Molecular Function
all | 15928 | 2272.8 | 8265.6 |
depth = 3 | 97.07 | 21 | 154 |
depth = 4 | 374 | 112.47 | 495.33 |
depth = 5 | 849 | 178.47 | 1093.67 |
GOSlim | 65.27 | 31.67 | - |
GOSlim leaves | 54.07 | 26.07 | - |
GOSlim leaves, depth = 1 | 1189.93 | 758.73 | -
To provide a simple basis for our first analysis we focused on the biological process hierarchy and chose a single feature
Before comparing term sets, we need to analyze the trends between parameter sets. First we focused on extension types and modes (see
Term Sets | refinement direct | refinement indirect | enrichment indirect | extension indirect
all | 0.0999 | 0.4919 | 0.2009 | 0.4674
depth = 3 | 0.2704 | 0.7896 | 0.2955 | 0.7495
depth = 4 | 0.2176 | 0.7083 | 0.3429 | 0.6790
depth = 5 | 0.2313 | 0.6348 | 0.2898 | 0.6268
GOSlim | 0.2024 | 0.8637 | 0.1722 | 0.6530
GOSlim leaves | 0.1635 | 0.8553 | 0.1003 | 0.6470
GOSlim leaves, depth = 1 | 0.1523 | 0.6529 | 0.3168 | 0.6243
Values are average and standard deviation f-measure for all runs using the 15 ontology versions and a Decision Table algorithm, in the biological process hierarchy. Time parameters:
To clarify this difference, we calculated the average extended proportion for each extension type (see
Ontology | refinement | enrichment | extension
biological process | 0.293 | 0.103 | 0.292 |
cellular component | 0.122 | 0.027 | 0.124 |
molecular function | 0.076 | 0.013 | 0.077 |
Values are averaged over all GO terms at depth = 4 for the 15 ontology versions with an indirect extension mode.
As for the time parameters (see
Term Sets | | | |
all | 0.4919 | 0.5301 | 0.4890 | 0.5301
depth = 3 | 0.7896 | 0.8177 | 0.8152 | 0.8005
depth = 4 | 0.7083 | 0.7520 | 0.7267 | 0.7437
depth = 5 | 0.6348 | 0.6962 | 0.6526 | 0.7101
GOSlim | 0.8637 | 0.9020 | 0.8264 | 0.8869
GOSlim leaves | 0.8553 | 0.9004 | 0.8378 | 0.9046
GOSlim leaves, depth = 1 | 0.6529 | 0.6748 | 0.6624 | 0.7021
Values are average and standard deviation f-measure for all runs using the 15 ontology versions and a Decision Table algorithm, in the biological process hierarchy. Extension mode: refinement, indirect.
In general, when comparing term sets considering the best sets of parameters (
The next step in our experiment was to compare different features and feature sets.
Type | Features | depth = 4 | GOSlim leaves, depth = 1
Single | dirChildren | 0.6723 | 0.6662
 | allChildren | 0.7437 | 0.7021
 | height | 0.7426 | 0.6854
 | sibsUniformity | 0.5814 | 0.5283
 | parentsUniformity | 0.6336 | 0.5430
 | childrenUniformity | 0.6469 | 0.5899
 | dirAnnots | 0.4857 | 0.4964
 | dirManAnnots | 0.4838 | 0.4748
 | allAnnots | 0.7335 | 0.6821
 | allManAnnots | 0.7452 | 0.6965
 | PubMed | 0.5960 | 0.6552
 | ratioAll | 0.6850 | 0.6192
 | ratioDir | 0.5735 | 0.5856
Sets | all | 0.7459 | 0.7801
 | structure | 0.7431 | 0.6906
 | uniformity | 0.6523 | 0.5727
 | annotations | 0.7396 | 0.6949
 | direct | 0.6661 | 0.6569
 | indirect | 0.7641 | 0.6883
 | bestA | 0.7415 | 0.7704
 | bestB | 0.7550 | 0.7750
Values are average and standard deviation f-measure for all runs using the 15 ontology versions and a Decision Table algorithm. Time parameters:
When using single features, the best performers are
So far we have focused on predicting refinement within the biological process ontology.
Type | Features | depth = 4 | GOSlim leaves, depth = 1
Single | allManAnnots | 0.7085 | 0.6068
 | allChildren | 0.6800 | 0.5650
 | ratioAll | 0.6604 | 0.4636
 | height | 0.6450 | 0.5248
Sets | bestB | 0.7210 | 0.5174
 | bestA | 0.7155 | 0.4758
 | annotations | 0.7046 | 0.6198
 | all | 0.6916 | 0.4367
 | structure | 0.6890 | 0.5985
Values are average and standard deviation f-measure for all runs using the 15 ontology versions and a Decision Table algorithm. Time parameters:
Type | Features | depth = 4
Single | allChildren | 0.6650
 | allManAnnots | 0.5898
 | height | 0.5633
 | dirChildren | 0.5577
 | allAnnots | 0.5572
Sets | bestA | 0.6441
 | indirect | 0.6395
 | bestB | 0.6285
 | all | 0.6218
 | structure | 0.6168
Values are average and standard deviation f-measure for all runs using the 15 ontology versions and a Decision Table algorithm. Time parameters:
Although average f-measure is generally lower for both molecular function and cellular component than for biological process,
In addition to Decision Tables, chosen due to their simplicity, we also tested several other commonly used supervised learning algorithms, namely Naive Bayes, SVM, Neural Networks (Multilayer Perceptron) and Bayesian Networks, using their WEKA implementations.
When applying different learning algorithms, we still see that overall biological process has the best performance, followed by molecular function and cellular component. Likewise, the general performance in the
Looking in more detail at the biological process results, the difference between feature sets is small, so we will not distinguish between them in our analysis. Naive Bayes gives the top precision values (0.87–0.90) but the lowest recall (0.48–0.57), whereas Bayesian Networks have the highest recall (0.78–0.79) with precision between 0.74 and 0.79, corresponding to average f-measures between 0.76 and 0.79. SVM, Decision Tables and Multilayer Perceptron have performances in between, with both recall and precision values clustered around 0.70.
In molecular function, the highest precision is given by Multilayer Perceptron at 0.70 for
In cellular component, there is a marked difference between the performance in the depth term set and in the
To provide a basis for comparison, we implemented Stojanovic's browsing uniformity measures
For plotting our strategy, instead of relying on the binary labels output by the classifier we used the probability of each instance being true (i.e. refined), so that the generated plots are directly comparable to those produced by the uniformity strategy and support a more granular, threshold-based calculation of precision at different recall levels. Consequently, the presentation of the results of our strategy in these plots differs from the presentation in previous tables.
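Turning per-instance probabilities into a precision/recall curve amounts to sweeping a decision threshold over the scores. A minimal sketch, not tied to any particular classifier output format:

```python
# Compute (recall, precision) points by lowering the probability
# threshold one instance at a time, highest score first.
def pr_curve(scores, labels):
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)
    tp = fp = 0
    points = []
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        points.append((tp / total_pos, tp / (tp + fp)))
    return points

# Hypothetical refinement probabilities for five terms (1 = refined).
curve = pr_curve([0.9, 0.8, 0.6, 0.4, 0.2], [1, 1, 0, 1, 0])
print(curve[-1])  # (1.0, 0.6): full recall at precision 0.6
```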
The prediction results for all ontologies were combined in the Precision/Recall plots to provide a better visualization of results. As is evident in the plots, our strategy has a considerably improved performance in all three GO ontologies, with curves closer to the top right corner, indicative of both higher precision and recall. The uniformity strategy performed worse in all cases, except at higher recall values in molecular function.
The other uniformity strategies (parents and siblings) have an even lower performance than that of children uniformity.
Change capturing through prediction of ontology extension is a complex issue, due to the inherently complex nature of ontology extension itself. Ontology extension can be motivated by implicit or explicit requirements, which have very different mechanisms. Implicit requirements are in principle easier to predict since they do not change between ontology versions, whereas explicit requirements, which are created by experts to adapt the ontology to a novel conceptualization or change in the domain, are much harder to predict. Our strategy, by virtue of being based on learning from past extension events, cannot distinguish between these two types, and thus attempts to predict extension regardless of whether it is motivated by implicit or explicit requirements. To capture both kinds of requirements we use a set of ontology-based features that contemplate not only intrinsic features, such as structural ones, but also extrinsic ones, such as annotations and citations.
The assumption that extension can be predicted based on existing knowledge, either in the form of the ontology itself or its usage, is acceptable regarding the more common extension events, but is not applicable to extension events that are the result of deep restructuring or revision of existing knowledge. These extension events are part of a complex ontology change that also includes deletions and modifications. As such, these more complex changes are not the object of our strategy. In fact, one of our strategy's goals is to speed up the process of accomplishing the simpler extensions, to give experts more time and resources to focus on the more complex events.
One very relevant aspect of our evaluation strategy is that we compare our results to the real extension events that occurred in more recent versions of the ontology. This means that although some of our predictions may be conceptually correct, they may not yet have been included in the ontology version used for testing, and will thus be considered incorrect. This impacts precision values, since we might be capturing needed but still unperformed extensions and counting them as incorrect in our evaluation. Following this line of thought, we might then give preference to strategies that increase recall even at the cost of precision. However, this could have the negative effect of including many incorrect predictions in our output, which is not desirable in a semi-automated ontology extension system. We have therefore chosen to base our evaluation on f-measure, to balance precision and recall.
A basic requirement of our strategy is to be able to access several versions of the ontology to consider. The minimum set of ontology versions it requires is two: one which will be used to calculate the features, and a second one, more recent than the first, from which we will extract the class labels to train the model. It then becomes crucial to define the interval between the versions to use. In our test case using the Gene Ontology we decided on versions with an interval of at least 6 months, based on the intuition that a smaller interval would not provide us with sufficient extension examples to be able to train a model. This intuition was shown to be a good approximation, since as seen in
Due to the complexity of ontology extension, particularly in such a large ontology as the Gene Ontology, our prediction strategy has to account for several parameters that help circumscribe our effort. One such parameter, extension type, was designed to capture the different types of extension: refinement and enrichment. We have found that refinement is considerably easier to predict than enrichment, with refinement achieving an average f-measure higher by between 0.3 and 0.7. There are two likely explanations for this difference: on one hand, there are many more refinement events between ontology versions than there are enrichment events (see
Another parameter related to extension is its mode, direct or indirect. Predicting direct extension, i.e. exactly which terms will be extended in a future version, should be the ultimate goal of an ontology extension prediction strategy. However, this proved to be a difficult task, which is unsurprising given the multitude of different processes that can lead to extension, and also the fact that on average new terms correspond to about 5% of all terms in an ontology version (see
To address this issue we focused our prediction efforts in slices of the ontology, and defined the extension that happens within the subgraphs rooted in terms within these slices as indirect extension. Focusing only on the term sets thus defined greatly improved the performance of our strategy (
Predicting for a subset of the ontology is supported by our previous finding
We chose distance to root for its simplicity in creating a middle layer of GO terms. However, since terms at the same distance to the root do not always have the same degree of specificity, we also used GO Slim general as a basis for our other strategy. By using GO Slim general we were attempting to capture a similar degree of specificity among terms, specific enough to provide a useful prediction and general enough to allow for branch extension prediction. We tested three different sets within each approach, each yielding different term set sizes. Since molecular function does not have a GO Slim general, we only tested distance to root (
For both approaches, the smaller the data set, the better the results. This may indeed be due to the fact that smaller data sets have a better balance of positive and negative instances, which, despite our use of SMOTE to balance the training sets, still has an impact on training the models. However, we are not interested in very small term sets, since they would not provide enough specificity for change capturing for ontology extension. Considering this, we focused on the term set defined by terms at a distance of one from GO Slim leaf terms, which corresponds to an average term set size of 1189 for biological process and 758 for cellular component, and on the term set defined by terms at a distance of four to the root, which corresponds to sizes around 370, 460 and 100 for biological process, molecular function and cellular component respectively.
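SMOTE's core idea, synthesizing minority-class examples by interpolating between a minority instance and one of its minority-class nearest neighbours, can be illustrated as below. This is a bare-bones sketch of the technique, not the implementation used in our experiments.

```python
# Minimal SMOTE-style oversampling: create synthetic minority points
# by interpolating a sampled minority instance toward one of its k
# nearest minority-class neighbours.
import random

def smote(minority, n_synthetic, k=2, seed=0):
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: sum((a - b) ** 2
                                              for a, b in zip(x, p)))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment x -> nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Three hypothetical minority-class feature vectors.
minority = [(1.0, 1.0), (1.2, 0.8), (2.0, 2.0)]
new_points = smote(minority, n_synthetic=3)
print(len(new_points))  # 3
```

Each synthetic point lies on the segment between two real minority instances, so the oversampled region stays inside the minority class's feature-space neighbourhood.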
The final parameters in our strategy are those related to time:
Although the parameters previously discussed represent the basis of our strategy, by defining exactly what the prediction focuses on, it is the features used to support prediction that are essential for capturing extension events. Using the best parameter setup, we investigated a set of thirteen single features, also arranged into eight sets, and found some interesting trends. In the
One of the most obvious patterns in these results is that terms with many child terms or many total annotations tend to be extended. It is arguable that for larger subgraphs the probability of an extension event occurring is greater, simply because they contain more terms. However, for the number of terms in the subgraph (i.e., allChildren) to be the only factor involved, the probability of extension would have to be equal for any given term. Intuitively, this does not appear to be a valid assumption, since it would mean that the extension of GO does not follow any particular direction. Nevertheless, we investigated this possibility by comparing the distribution of real refinement events for
Furthermore, the total number of annotations is influenced by the total number of children, since the annotations of the children contribute to the total number of annotations of the parent. To take this into account, we created the feature ratioAll to mitigate the influence of the number of children on the annotation data. Although this resulted in a decrease in f-measure of around 6% compared to either feature separately, it still performs better than most other features. This lends further support to the notion that areas which attract greater interest (in this case, evident in the number of annotations) become the object of more refinement events.
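The relationship between these features can be sketched on a toy subgraph. The term IDs and counts below are invented, and the definitions of allChildren and ratioAll are assumed (descendant count, and total annotations per descendant). GO is a DAG, so a real implementation would deduplicate shared descendants rather than recurse over a tree as this sketch does.

```python
# toy subgraph: term -> direct children (hypothetical GO-style IDs)
children = {"GO:A": ["GO:B", "GO:C"], "GO:B": ["GO:D"], "GO:C": [], "GO:D": []}
# direct annotations per term (invented counts)
annotations = {"GO:A": 2, "GO:B": 5, "GO:C": 1, "GO:D": 4}

def all_children(term):
    """allChildren: count every descendant of a term."""
    direct = children.get(term, [])
    return len(direct) + sum(all_children(c) for c in direct)

def total_annotations(term):
    """Annotations of the term plus those of all its descendants."""
    return annotations.get(term, 0) + sum(
        total_annotations(c) for c in children.get(term, []))

def ratio_all(term):
    """ratioAll: total annotations per descendant, mitigating subgraph size."""
    n = all_children(term)
    return total_annotations(term) / n if n else 0.0

print(all_children("GO:A"), total_annotations("GO:A"), ratio_all("GO:A"))  # → 3 12 4.0
```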
Although these simple notions appear quite intuitive, and we could in principle derive a simple generic rule based on the number of children, supporting automated change capturing requires the best possible separation between targets and non-targets for refinement, which is best achieved by employing supervised learning.
The results discussed so far were all based on Decision Tables, a simple supervised learning algorithm. We also tested other algorithms and found that, although SVMs, Neural Networks and Bayesian Networks could provide better performance (and SVMs and Neural Networks in particular can be parameterized to privilege either precision or recall), Decision Tables still provided comparatively good results without requiring parameter optimization.
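The appeal of Decision Tables lies in their simplicity: memorize the majority class for each observed combination of (discretized) feature values. The following is a minimal sketch of that idea, not Weka's DecisionTable (which additionally searches for the best feature subset); the toy feature values and labels are invented.

```python
from collections import Counter, defaultdict

class DecisionTable:
    """Majority-class lookup table over feature-value combinations."""
    def fit(self, X, y):
        cells = defaultdict(Counter)
        for row, label in zip(X, y):
            cells[tuple(row)][label] += 1
        self.table = {k: c.most_common(1)[0][0] for k, c in cells.items()}
        self.default = Counter(y).most_common(1)[0][0]  # global majority class
        return self

    def predict(self, X):
        # unseen feature combinations fall back to the global majority
        return [self.table.get(tuple(r), self.default) for r in X]

# toy discretized features, e.g. (manyChildren, manyAnnotations)
X = [(1, 1), (1, 0), (0, 0), (1, 1)]
y = ["extend", "extend", "keep", "extend"]
model = DecisionTable().fit(X, y)
print(model.predict([(1, 1), (0, 0), (0, 1)]))  # → ['extend', 'keep', 'extend']
```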
We were particularly interested in the performance of Bayesian Networks, since our attributes are not independent, but are in fact temporally related when we consider multiple ontology versions for feature extraction. For instance the value of
Another particularly interesting aspect is that most machine learning algorithms, including the ones we used, assume that instances are independent and identically distributed. However, our dataset instances correspond to GO terms, which are hierarchically related through the GO structure. Although the features describing the neighboring area (e.g., siblings and the uniformity features) attempt to capture this aspect, we believe it was not fully accounted for by the proposed setup. The hierarchical relations between instances may be affecting the experiments over the full set of terms, since they are not captured by the representation. In the subset-of-terms dataset their influence would be weaker, since there are fewer hierarchical relations between instances.
To complete our evaluation, we compared our strategy to the one proposed by Stojanovic
The output of our extension prediction methodology is a list of ontology classes: the roots of subgraphs corresponding to ontology areas predicted as good candidates for extension. Our methodology is applicable to the simplest yet most frequent type of ontology change, the addition of new elements. It is not suited to predicting more complex changes, such as the reorganization of an entire branch of the ontology. As such, ontology extension prediction can speed up the process of extension in these simpler cases, by allowing ontology developers and/or ontology learning systems to focus on smaller areas of the domain. This frees experts to spend more time on the more complex changes that cannot be predicted.
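Collapsing per-term predictions into such a candidate list can be sketched as keeping only the predicted terms that have no predicted ancestor. This is a simplified illustration; the parent links and term IDs below are hypothetical.

```python
def subgraph_roots(predicted, parents):
    """Return the predicted terms with no predicted ancestor: the roots of
    the candidate extension areas handed to curators or learning systems."""
    def has_predicted_ancestor(term):
        stack = list(parents.get(term, []))
        while stack:
            p = stack.pop()
            if p in predicted:
                return True
            stack.extend(parents.get(p, []))
        return False
    return {t for t in predicted if not has_predicted_ancestor(t)}

# hypothetical is-a links: child -> list of parents
parents = {"GO:B": ["GO:A"], "GO:C": ["GO:A"], "GO:D": ["GO:B"]}
print(sorted(subgraph_roots({"GO:B", "GO:C", "GO:D"}, parents)))  # → ['GO:B', 'GO:C']
```

Here GO:D is dropped because its ancestor GO:B is itself predicted, so the GO:B subgraph already covers it.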
Automated ontology learning systems can also use the list to focus their efforts on the identified areas. For instance, most ontology learning systems employ a corpus of scientific texts as input, and their performance is tightly coupled to the quality of such corpora. If our candidate list is used to guide the creation of specific corpora for the areas to extend, it can have a positive impact on the performance of such strategies.
We have chosen to highlight three examples of the results given by our ontology extension prediction system, two successful ones (
Extension was predicted for the root term and occurred at a distance of two edges, in every subclass.
Extension was predicted for the root term and occurred at a distance of one edge, with the addition of a whole new branch.
Extension was predicted for the root term and although it did not occur in the version for which it was predicted (January 2010), it did in fact occur in later versions, with the addition of one new sub-subclass in July 2010 and another in January 2011.
In
In
Ontologies are crucial to handling the challenges of an increasingly data-driven world. However, ontologies themselves face this challenge, since the effort to keep them updated in the face of the new knowledge produced on a daily basis is never complete. To support this effort, some of the processes involved in ontology evolution can be automated, reducing the time and resources invested by expert curators.
In this work we presented such a strategy for the first step of ontology evolution: change capturing. Our strategy is based on predicting areas of the ontology that will undergo extension in a future version, by applying supervised learning over features of previous ontology versions. We applied our strategy to the Gene Ontology, where we obtained encouraging results, with average f-measure reaching 0.79 for the prediction of refinement for a subset of relevant biological process GO terms.
In addition, we defined a framework to better characterize extension in an application context, which can be applied to ontologies with versioning, as is the case for OBO ontologies and many of their candidates. This framework is crucial to providing a better understanding of the various nuances of ontology extension, and thus to supporting ontology extension prediction efforts.
We find that two particular characteristics of our strategy can be improved, namely the selection of ontology versions to use and the selection of the term set. Both of these can benefit from recent works on ontology evolution
Although we applied our strategy to the Gene Ontology, it is applicable to any ontology with multiple available versions, which are becoming increasingly prevalent as ontologies in biomedicine mature. The performance of our strategy on other ontologies remains to be tested, and the next logical testing ground for the proposed methodology is smaller ontologies that lack the maturity and funding of larger ontologies such as GO. Several ontologies would be interesting to explore, such as the Pathway Ontology or the Ontology of Physics for Biology, which provide several versions but are much more recent and considerably smaller than GO. The success of our strategy on GO using simple structural data is encouraging, since most ontologies lack an annotation corpus as rich as GO's, but all provide structural data that can be explored.
Predicting the extension of an ontology can have a positive impact on ontology evolution processes, be they manual or automated, by focusing efforts and reducing the amount of new information that needs to be processed. Moreover, OBO's principles of maintenance and orthogonality strongly advocate for the existence of a single ontology for each domain that is progressively enhanced, rather than a myriad of niche ontologies. Consequently, strategies that aid in the evolution of existing ontologies, such as the one proposed here, are relevant contributions to the end goal of ontologies in biomedicine.
F-measure for refinement prediction for separate ontology versions using Decision Tables with the
(EPS)
Percentage of positive examples for training models for refinement prediction for separate ontology versions using Decision Tables with the
(EPS)
Relation between number of
(EPS)
In the supplemental text we present two additional studies: one on using consecutive monthly versions of the ontologies instead of versions separated by six months, and another on the evolution of prediction, investigating whether prediction performance is comparable over time.
(PDF)