Introduction to ontologies¶
Based on CL editors training by David Osumi-Sutherland
Why do we need ontologies¶
We face an ever-increasing deluge of biological data and analysis. Ensuring that this data and analysis is Findable, Accessible, Interoperable and Re-usable (FAIR) is a major challenge. Findability, Interoperability and Reusability can all be enhanced by standardising metadata. Well-standardised metadata can make it easy to find data and analyses despite variations in terminology ('Clara cell' vs 'nonciliated bronchiolar secretory cell' vs 'club cell') and precision ('bronchial epithelial cell' vs 'club cell'). Understanding what entities are referred to in metadata and how they relate to the annotated material can help users work out whether the data or analysis they have found is of interest to them, and can aid in its re-use and interoperability with other data and analyses. For example, does an annotation of sample data with a term for breast cancer refer to the health status of the patient from which the sample was derived, or does it mean that the sample itself comes from a breast cancer tumor?
We can't find what we're looking for¶
Given variation in terminology and precision, annotation with free text alone is not sufficient for findability. One very lightweight solution is to rely on user-generated keywords, combined with a system that allows users to choose from previously used keywords. This can produce some degree of annotation alignment, but it also results in fragmented annotation and varying levels of precision with no clear way to relate annotations.
For example, trying to refer to feces, in NCBI BioSample:
|Stool NOT faeces||21,798|
|Stool NOT feces||18,314|
We don't know what we're talking about¶
Terminology alone can be ambiguous. The same term may be used for completely unrelated or vaguely analogous structures. An insect femur and a mammalian femur are neither evolutionarily related nor structurally similar. Biologists often like to use abbreviations to annotate data, but these can be extremely ambiguous. Drosophila biologists use DA1 to refer to structures in the tracheal system, musculature and nervous system. Outside of Drosophila biology it is used to refer to many other things, including a rare disease and a neuron type in C. elegans.
Some extreme examples of this ambiguity come from terminological drift in fields with a long history. For example, in the male genitalia of a gasteruptiid wasp, five different structures have each been labeled "paramere" by different people, each studying different hymenopteran lineages. How do we know what "paramere" means when it is referred to?
This striking example shows that even precise context is not always sufficient for disambiguation.
Controlled vocabulary (CV)¶
Rather than rely on users to generate lists of re-usable keywords, we can instead pre-specify a set of terms to use in annotation. This is usually referred to as a controlled vocabulary or CV.
- Terms are not usually defined
- Relationships between the terms are not usually defined
- Simplest form is a list
Example using wines¶
- Pinot noir
Hierarchical controlled vocabulary¶
Any controlled vocabulary that is arranged in a hierarchy.
- Terms are arranged in a hierarchy, typically from general (top) to specific (bottom) with each term having only one parent.
- Terms are not usually defined.
- Relationships between the terms are not usually named or defined
Example using wines (Taxonomy of wine)¶
- Pinot Noir
- Pinot Gris
Taxonomy: a hierarchical CV in which hierarchy = classification, e.g. 'Merlot' is classified as a 'Red' (wine). Not all hierarchical CVs are classifications. For example, anatomical atlases often have hierarchical CVs representing parthood.
Support for Grouping and varying levels of precision¶
The use of a hierarchical CV in which general terms group more specific terms allows for varying precision ('glial cell' vs some specific subtype) and simple grouping of annotated content.
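The grouping behaviour described above can be sketched in a few lines of Python. This is only an illustration: the terms, the `PARENT` table, the sample names and the helper functions are hypothetical, and the sketch assumes a single-parent hierarchy.

```python
# Hypothetical single-parent hierarchical CV: each term maps to its one parent.
PARENT = {
    "glial cell": "cell",
    "neuron": "cell",
    "astrocyte": "glial cell",
    "oligodendrocyte": "glial cell",
}

# Hypothetical annotations made at different levels of precision.
ANNOTATIONS = {
    "sample_1": "astrocyte",
    "sample_2": "glial cell",
    "sample_3": "neuron",
}


def ancestors(term):
    """Return the chain of parents from `term` up to the root."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def annotated_with(group, annotations):
    """Return items annotated with `group` or with any term beneath it."""
    return sorted(
        item
        for item, term in annotations.items()
        if term == group or group in ancestors(term)
    )


print(annotated_with("glial cell", ANNOTATIONS))  # ['sample_1', 'sample_2']
```

A query for the general term finds both the sample annotated with it and the sample annotated with a more specific subtype.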
From hierarchical CVs to ontologies¶
Hierarchical CVs tend to increase in complexity in particular ways.
To support findability, the developers of hierarchical CVs often need to associate synonyms or closely related terms with terms in their CV.
CV content is often driven by requests from annotators, so expansion is not driven by any unified vision or scheme. This often leads to pressure for hierarchies to support terms having multiple parents, reflecting either multiple relationship types or multiple types of classification. For example, in a CV with the terms 'retinal bipolar cell', 'retina', 'bipolar neuron' and 'glutamatergic neuron', one could reasonably put 'retinal bipolar cell' under 'retina' based on location and under the other two terms based on classification.
Developers of hierarchical CVs often come to realise that multiple relationship types are represented in the hierarchy and that it can be useful to name these. For example, a heart glial cell is a (type of) glial cell, but is 'part of' the heart.
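One way to picture named relationship types is as labelled edges between terms. The sketch below is illustrative only: the `EDGES` list and the `parents` helper are hypothetical, and the relation names `is_a` and `part_of` are used informally rather than taken from any particular ontology file.

```python
# Hypothetical labelled edges: (child term, relationship name, parent term).
EDGES = [
    ("heart glial cell", "is_a", "glial cell"),
    ("heart glial cell", "part_of", "heart"),
    ("retinal bipolar cell", "is_a", "bipolar neuron"),
    ("retinal bipolar cell", "is_a", "glutamatergic neuron"),
    ("retinal bipolar cell", "part_of", "retina"),
]


def parents(term, relation=None):
    """Return the parents of `term`, optionally restricted to one relationship."""
    return sorted(
        obj
        for subj, rel, obj in EDGES
        if subj == term and (relation is None or rel == relation)
    )


print(parents("retinal bipolar cell", "is_a"))
# ['bipolar neuron', 'glutamatergic neuron']
print(parents("heart glial cell", "part_of"))
# ['heart']
```

Naming the edges lets a term keep several parents while still distinguishing classification from location.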
What is an ontology?¶
Definitions of ontologies can be controversial. Rather than attempt a comprehensive definition, this tutorial will emphasise ontologies as:
- Queryable stores of knowledge
Key features of well-structured ontologies:¶
- Terms are arranged in a classification hierarchy
- Terms are defined
- Terms are richly annotated:
- Textual definitions, references, synonyms, links, cross-references
- Relationships between terms are defined, allowing logical inference and sophisticated queries as well as graph representations.
- Expressed in a knowledge representation language such as RDFS, OBO, or OWL
- Examples: Gene Ontology, Uberon, Cell Ontology, EFO, SNOMED
Non-logical parts of ontologies¶
Terminology can be ambiguous, so text definitions, references, synonyms and images are key to helping users understand the intended meaning of a term.
Using nonmeaningful identifiers¶
Identifiers that do not hold any inherent meaning are important to ontologies. If you ever need to change the names of your terms, you're going to need identifiers that stay the same when the term name changes.
A microglial cell is also known as: Hortega cell, microglia, microgliocyte and brain resident macrophage.
In the Cell Ontology, however, it is referred to by a unique identifier: CL:0000129
These identifiers are short ways of referring to IRIs (e.g., CL:0000129 = http://purl.obolibrary.org/obo/CL_0000129)
This IRI is a unique, resolvable identifier on the web.
A group of ontologies, loosely co-ordinated through the OBO Foundry, have standardised their IRIs (e.g. http://purl.obolibrary.org/obo/CL_0000129 for a term in the Cell Ontology; http://purl.obolibrary.org/obo/cl.owl for the Cell Ontology itself).
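The mapping between short identifiers and IRIs, and between changeable names and stable identifiers, can be sketched as follows. This is a minimal illustration that assumes the OBO pattern shown above (base IRI, then prefix, underscore, local ID). The `LABELS` and `SYNONYMS` tables and all function names are hypothetical; a real application would read them from the ontology itself.

```python
# Assumption: OBO-style short identifiers expand to <base><PREFIX>_<local id>.
OBO_BASE = "http://purl.obolibrary.org/obo/"

# Hypothetical lookup tables for a single term.
LABELS = {"CL:0000129": "microglial cell"}
SYNONYMS = {
    "hortega cell": "CL:0000129",
    "microglia": "CL:0000129",
    "microgliocyte": "CL:0000129",
    "brain resident macrophage": "CL:0000129",
}


def curie_to_iri(curie):
    """Expand a short identifier such as 'CL:0000129' into its full IRI."""
    prefix, local_id = curie.split(":", 1)
    return f"{OBO_BASE}{prefix}_{local_id}"


def iri_to_curie(iri):
    """Contract an OBO IRI back into its short identifier."""
    if not iri.startswith(OBO_BASE):
        raise ValueError(f"not an OBO IRI: {iri}")
    prefix, local_id = iri[len(OBO_BASE):].split("_", 1)
    return f"{prefix}:{local_id}"


def resolve(name):
    """Map a label or synonym (case-insensitive) to an identifier, or None."""
    key = name.strip().lower()
    for curie, label in LABELS.items():
        if label == key:
            return curie
    return SYNONYMS.get(key)


print(curie_to_iri("CL:0000129"))  # http://purl.obolibrary.org/obo/CL_0000129
print(resolve("Microglia"))        # CL:0000129
```

Because every name resolves to the same identifier, the label can be changed later without breaking existing annotations.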
IRIs? URIs? URLs?¶
- URI: Uniform Resource Identifier - a string of characters, following a standard specification, that unambiguously identifies a particular (web) resource.
- IRI: Internationalised Resource Identifier - a URI that can use characters in multiple languages
- URL: Uniform Resource Locator - a web-resolvable URI
Building scalable ontologies¶
OBO ontologies are mostly written in OWL2 or OBO syntax. The latter is a legacy format that maps completely to OWL.
For a more in-depth explanation of formats (OWL, OBO, RDF etc.), refer to the explainer on OWL format variants. In the examples below we will use OWL Manchester syntax, which allows us to express formal logic in English-like sentences.
An ontology as a classification¶
Ontology terms refer to classes of things in the world. For example, the class of all wings.
Below you will see a classification of parts of the insect and how it is represented in a simple ontology.
We use SubClassOf (or is_a in OBO format) to represent that one class fully subsumes another. For example:
OWL: hindwing SubClassOf wing
OBO: hindwing is_a wing
In English we might say: "a hindwing is a type of wing" or, more specifically, "all instances of hindwing are instances of wing." 'Instance' here refers to a single wing of an individual fly.
In the previous section, we talked about different types of relationships. In OWL we can define specific relations (known as object properties). One of the commonest is 'part of' which you can see used below.
English: all (insect) legs are part of a thoracic segment
OWL: 'leg' SubClassOf part_of some 'thoracic segment'
OBO: 'leg'; relationship: part_of thoracic segment
It might seem odd at first that OWL uses SubClassOf here too. The key to understanding this is the concept of an anonymous class: in OWL, we can refer to classes without giving them names. In this case, the anonymous class is the class of all things that are 'part of' (some) 'thoracic segment' (in insects). A vast array of different anatomical structures are subclasses of this anonymous class, some of which, such as wings, legs and spiracles, are visible in the diagram.
Note the existential quantifier some in OWL format: it is interpreted as "there exists", "there is at least one", or "some".
The quantifier is important to the direction of relations.
'wing' SubClassOf part_of some 'thoracic segment' is correct
'thoracic segment' SubClassOf has_part some 'wing' is incorrect as it implies all thoracic segments have wings as a part.
'claw' SubClassOf connected_to some 'tarsal segment' is correct
'tarsal segment' SubClassOf connected_to some 'claw' is incorrect as it implies all tarsal segments are connected to claws (for example some tarsal segments are connected to other tarsal segments)
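The effect of the quantifier on direction can be made concrete with a small check over a made-up set of individuals. This is only a sketch: `TYPES`, `FACTS` and `satisfies` are hypothetical names, and the function merely tests whether one tiny dataset is compatible with an axiom of the form "X SubClassOf R some Y". It is not how an OWL reasoner works, since an axiom is a claim about all possible instances.

```python
# Hypothetical individuals and the class each one belongs to.
TYPES = {
    "prothorax_1": "thoracic segment",
    "mesothorax_1": "thoracic segment",
    "forewing_1": "wing",
}

# Hypothetical relationships between those individuals.
FACTS = {
    ("forewing_1", "part_of", "mesothorax_1"),
    ("mesothorax_1", "has_part", "forewing_1"),
}


def satisfies(subclass, relation, filler):
    """Check 'subclass SubClassOf relation some filler' against the data.

    True only if every individual of `subclass` has at least one
    `relation` link to some individual of `filler`.
    """
    return all(
        any(
            s == individual and r == relation and TYPES[o] == filler
            for s, r, o in FACTS
        )
        for individual, cls in TYPES.items()
        if cls == subclass
    )


print(satisfies("wing", "part_of", "thoracic segment"))   # True
print(satisfies("thoracic segment", "has_part", "wing"))  # False
```

The second check fails because the prothorax carries no wing, which is why the has_part axiom would be too strong.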
There are many ways to classify things. For example, a neuron can be classified by structure, electrophysiology, neurotransmitter, lineage, etc.
Manually maintaining these multiple inheritances (that occur through multiple classifications) does not scale.
Problems with maintaining multiple inheritance classifications by hand
- Doesn’t scale
- When adding a new class, how are human editors to know
- all of the relevant classifications to add?
- how to rearrange the existing class hierarchy?
- It is bad for consistency
- Reasons for existing classifications often opaque
- Hard to check for consistency with distant superclasses
- Doesn’t allow for querying
- A formalized ontology can be queried for classes with arbitrary sets of properties. A manual classification cannot.
The knowledge an ontology contains can be used to automate classification. For example:
English: Any sense organ that functions in the detection of smell is an olfactory sense organ
OWL: 'olfactory sense organ' EquivalentTo 'sense organ' that capable_of some 'detection of smell'
If we then have an entity nose that is SubClassOf 'sense organ' and capable_of some 'detection of smell', it will be automatically classified as an olfactory sense organ.
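The idea behind automated classification can be illustrated with a toy matcher. This sketch is not an OWL reasoner: it only compares asserted superclasses and restrictions directly, and it ignores subclass hierarchies among the fillers. The `ASSERTED` and `DEFINED` tables, the 'eye' entry and the function name are hypothetical.

```python
# Hypothetical asserted axioms:
# class -> (named superclasses, set of (relation, filler) restrictions).
ASSERTED = {
    "nose": ({"sense organ"}, {("capable_of", "detection of smell")}),
    "eye": ({"sense organ"}, {("capable_of", "detection of light")}),
}

# Hypothetical equivalence definitions:
# defined class -> (genus, set of (relation, filler) restrictions).
DEFINED = {
    "olfactory sense organ": (
        "sense organ",
        {("capable_of", "detection of smell")},
    ),
}


def infer_superclasses(cls):
    """Return defined classes whose genus and restrictions `cls` satisfies."""
    named, restrictions = ASSERTED[cls]
    return sorted(
        defined
        for defined, (genus, required) in DEFINED.items()
        if genus in named and required <= restrictions
    )


print(infer_superclasses("nose"))  # ['olfactory sense organ']
print(infer_superclasses("eye"))   # []
```

No editor had to place nose under olfactory sense organ by hand; the placement follows from the definition.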
- David Osumi-Sutherland (original creator of slides)
- Nicole Vasilevsky (OHSU), Alex Diehl (Buffalo), Nico Matentzoglu, Matt Brush, Matt Yoder, Carlo Torniai, Simon Jupp
- Chris Mungall (LBNL), Melissa Haendel (OSU), Jim Balhoff (RENCI), James Overton - slides, ideas & discussions
- Terry Meehan - who edited CL more than anyone
- Helen Parkinson (EBI)
- Michael Ashburner