Learning Outcomes for Critical Path Tutorial
Understand the value of URIs as global identifiers, and their potential shortcomings.
Have a basic picture of the flagship efforts of the Semantic Web.
Be aware of some of the central Semantic Web applications in the biomedical domain.
Have a cursory understanding of how linked data can help to power your Critical Path data analysis problems.
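To make the first outcome concrete: the point of a URI is that the same identifier denotes the same concept for everyone. A minimal sketch, using the real OBO Foundry PURL pattern (the example CURIE is illustrative):

```python
# Expanding a compact identifier (CURIE) into a globally unique OBO PURL,
# and back. Two databases that use the same PURL are, by construction,
# talking about the same term -- no out-of-band agreement on local IDs needed.

OBO_PURL_BASE = "http://purl.obolibrary.org/obo/"

def curie_to_uri(curie: str) -> str:
    """Expand a CURIE like 'MONDO:0000001' into an OBO PURL."""
    prefix, local_id = curie.split(":")
    return f"{OBO_PURL_BASE}{prefix}_{local_id}"

def uri_to_curie(uri: str) -> str:
    """Invert the expansion: recover the compact identifier from the URI."""
    prefix, local_id = uri[len(OBO_PURL_BASE):].split("_", 1)
    return f"{prefix}:{local_id}"

print(curie_to_uri("MONDO:0000001"))
# → http://purl.obolibrary.org/obo/MONDO_0000001
```

A shortcoming is visible even here: the scheme only works if everyone agrees on the base URL and the prefix registry, which is exactly the coordination problem efforts like the OBO Foundry exist to solve.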
Interesting Case Studies to talk about:
The Experimental Factor Ontology: from controlled vocabulary to integrated application ontology driving drug target identification.
From barely structured data via data dictionaries to semantic data integration:
International HundredK+ Cohorts Consortium (IHCC) data harmonization case study: how to get from messy, individual data dictionaries for cohort data to an integrated resource for browsing and grouping.
The EJPRD story:
EFO case study
Build controlled vocabulary
Look a bit at the anatomy of a term. So what happens now?
The story of scientific database curation
The integrator hub with the killer use case comes along. Now the vocabulary is getting "forced" onto other databases that want to be part (and have to be part).
The number of terms needed shoots up exponentially; external ontologies need to be integrated.
Why Mondo and not DO?
Finally: better, more specialised hierarchies
It's hard to re-use (the measurement story).
Output data of the integrator hub can now be integrated at an even higher level (e.g. disease-to-gene networks).
Individual sources can also be integrated individually
Stories like this happen all the time: The SCDO story
First started building a vocab
Then using ROBOT
Then linking OBO terms
Then applying for OBO membership
Then using OBO purls and re-using OBO terms
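The ROBOT step is worth making concrete: ROBOT's template command turns a curated spreadsheet into OWL. A sketch of what such a template might look like; the second-row template strings (ID, LABEL, A IAO:0000115, SC %) are ROBOT's real syntax, while the SCDO term and its parent are purely illustrative:

```
ID            Label          Definition                    Parent
ID            LABEL          A IAO:0000115                 SC %
SCDO:0000123  example term   An illustrative definition.   MONDO:0000001
```

Running something like `robot template --template terms.tsv --output terms.owl` then converts the sheet into an OWL file; in practice you also declare any non-standard prefixes so the IDs expand to proper PURLs.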
More to come
Cohort data are scattered and there is no easy way to group data across cohorts
Even just finding the right cohort can be difficult
Data dictionaries are often just spreadsheets on someone's computer.
Data dictionaries do not have rich metadata (you don't know whether a data dictionary category or value pertains to a disease).
What to do:
Build controlled vocabulary
Map data dictionaries to a controlled vocabulary
Build ontological model from controlled terms rich enough to group the data for the use cases at hand
Design a process that makes the above scalable
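The mapping and grouping steps above can be sketched in a few lines. Everything here (cohort names, variable IDs, the "CV:" vocabulary terms) is invented for illustration; the point is that once disparate local labels map to shared controlled-vocabulary terms, grouping across cohorts becomes a plain lookup:

```python
# Hypothetical data dictionaries from two cohorts: same underlying
# variables, incompatible local labels.
cohort_a = {"var_01": "Systolic BP", "var_02": "smoker y/n"}
cohort_b = {"Q17": "systolic blood pressure", "Q23": "Tobacco use"}

# Curated mapping of local labels to controlled-vocabulary terms
# (in practice this is curation work, not exact string matching).
mappings = {
    "Systolic BP": "CV:0001",
    "systolic blood pressure": "CV:0001",
    "smoker y/n": "CV:0002",
    "Tobacco use": "CV:0002",
}

def group_across_cohorts(cohorts):
    """Group variables from all cohorts by their shared vocabulary term."""
    groups = {}
    for cohort_name, dictionary in cohorts.items():
        for var_id, label in dictionary.items():
            term = mappings.get(label)
            if term is not None:
                groups.setdefault(term, []).append((cohort_name, var_id))
    return groups

groups = group_across_cohorts({"A": cohort_a, "B": cohort_b})
print(groups["CV:0001"])  # → [('A', 'var_01'), ('B', 'Q17')]
```

Making this scalable (step 4) is exactly the hard part: the mappings table grows with every new cohort, which is why a repeatable curation pipeline matters more than any one mapping.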
So now, we want to enable the discovery of data across these cohorts (see GECKO for examples):
Assign data dictionary elements to IDs and publish them as "Linked Data" (browse here)
Build a mapping pipeline (google sheet)
Link IDs to ontology terms
These links can now be used to group the metadata for identifying cohorts
The EJPRD story
Rare disease registries are scattered across the web and there is no easy way to search across all of them.
EJPRD is developing two metadata schemas:
At the registry level, they are building the metadata model, which reuses some standard vocabularies such as DCAT. There is not that much "semantics" here; it really is a metadata model.
At the record level, they are building the Clinical Data Elements (CDE) Semantic Model (see, for example, the core model). The idea is that registries publish their metadata (and eventually their data) as linked data that can be easily queried using the above models. One of the biggest problems is the size of the project and the competing voices ("If it's not RDF, it's not FAIR"), but also the sheer scale of the technical issue: many of the so-called registries are essentially Excel spreadsheets on an FTP server.
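To make the registry-level idea concrete, here is a minimal, purely illustrative sketch of a registry described with DCAT. The registry URI and values are invented; dcat:Dataset, dct:title, dcat:keyword, and dcat:landingPage are standard terms from the DCAT and Dublin Core vocabularies:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

<https://example.org/registry/cf-registry> a dcat:Dataset ;
    dct:title "Example cystic fibrosis patient registry" ;
    dct:description "Illustrative registry-level metadata record." ;
    dcat:keyword "rare disease", "cystic fibrosis" ;
    dcat:landingPage <https://example.org/registry/cf-registry/about> .
```

Even this small fragment shows the "not much semantics" point: it describes the registry as a findable dataset, but says nothing machine-interpretable about the patients or diseases inside it; that is the job of the record-level CDE Semantic Model.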