Learning Outcomes for Critical Path Tutorial
Understand the value of URIs as global identifiers, and their potential shortcomings.
Have a basic picture of the flagship efforts of the Semantic Web.
Be aware of some of the central Semantic Web applications in the biomedical domain.
Have a cursory understanding of how linked data can help to power your Critical Path data analysis problems.
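To make the first outcome concrete: the point of a URI is that the same identifier denotes the same concept for everyone. A minimal sketch, using the real OBO Foundry PURL pattern (the example CURIE is illustrative):

```python
# Expanding a compact identifier (CURIE) into a globally unique OBO PURL,
# and back. Two databases that use the same PURL are, by construction,
# talking about the same term -- no out-of-band agreement on local IDs needed.

OBO_PURL_BASE = "http://purl.obolibrary.org/obo/"

def curie_to_uri(curie: str) -> str:
    """Expand a CURIE like 'MONDO:0000001' into an OBO PURL."""
    prefix, local_id = curie.split(":")
    return f"{OBO_PURL_BASE}{prefix}_{local_id}"

def uri_to_curie(uri: str) -> str:
    """Invert the expansion: recover the compact identifier from the URI."""
    prefix, local_id = uri[len(OBO_PURL_BASE):].split("_", 1)
    return f"{prefix}:{local_id}"

print(curie_to_uri("MONDO:0000001"))
# → http://purl.obolibrary.org/obo/MONDO_0000001
```

A shortcoming is visible even here: the scheme only works if everyone agrees on the base URL and the prefix registry, which is exactly the coordination problem efforts like the OBO Foundry exist to solve.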
Interesting Case Studies to talk about:
The Experimental Factor Ontology: from controlled vocabulary to integrated application ontology driving drug target identification.
From barely structured data via data dictionaries to semantic data integration:
International HundredK+ Cohorts Consortium (IHCC) data harmonization case study: how to get from messy, individual data dictionaries for cohort data to an integrated resource for browsing and grouping.
The EJPRD story:
EFO case study
Build controlled vocabulary
Look a bit at the anatomy of a term. So what happens now?
The story of scientific database curation
The integrator hub with the killer use case comes along. Now the vocabulary is getting "forced" onto other databases that want to be part (and have to be part).
The number of terms needed shoots up exponentially; external ontologies need to be integrated.
Why Mondo and not DO?
Finally: better, more specialised hierarchies
It's hard to re-use (the measurement story).
Output data of the integrator hub can now be integrated at an even higher level (e.g. disease-to-gene networks).
Individual sources can also be integrated individually
Stories like this happen all the time: The SCDO story
First started building a vocab
Then using ROBOT
Then linking OBO terms
Then applying for OBO membership
Then using OBO purls and re-using OBO terms
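The ROBOT step is worth making concrete: ROBOT's template command turns a curated spreadsheet into OWL. A sketch of what such a template might look like; the second-row template strings (ID, LABEL, A IAO:0000115, SC %) are ROBOT's real syntax, while the SCDO term and its parent are purely illustrative:

```
ID            Label          Definition                    Parent
ID            LABEL          A IAO:0000115                 SC %
SCDO:0000123  example term   An illustrative definition.   MONDO:0000001
```

Running something like `robot template --template terms.tsv --output terms.owl` then converts the sheet into an OWL file; in practice you also declare any non-standard prefixes so the IDs expand to proper PURLs.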
More to come
Cohort data are scattered and there is no easy way to group data across cohorts
Even just finding the right cohort can be difficult
Data dictionaries are often just spreadsheets on someone's computer.
Data dictionaries do not have rich metadata (you don't know whether a data dictionary category or value pertains to a disease).
What to do:
Build controlled vocabulary
Map data dictionaries to a controlled vocabulary
Build ontological model from controlled terms rich enough to group the data for the use cases at hand
Design a process that makes the above scalable
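The mapping and grouping steps above can be sketched in a few lines. Everything here (cohort names, variable IDs, the "CV:" vocabulary terms) is invented for illustration; the point is that once disparate local labels map to shared controlled-vocabulary terms, grouping across cohorts becomes a plain lookup:

```python
# Hypothetical data dictionaries from two cohorts: same underlying
# variables, incompatible local labels.
cohort_a = {"var_01": "Systolic BP", "var_02": "smoker y/n"}
cohort_b = {"Q17": "systolic blood pressure", "Q23": "Tobacco use"}

# Curated mapping of local labels to controlled-vocabulary terms
# (in practice this is curation work, not exact string matching).
mappings = {
    "Systolic BP": "CV:0001",
    "systolic blood pressure": "CV:0001",
    "smoker y/n": "CV:0002",
    "Tobacco use": "CV:0002",
}

def group_across_cohorts(cohorts):
    """Group variables from all cohorts by their shared vocabulary term."""
    groups = {}
    for cohort_name, dictionary in cohorts.items():
        for var_id, label in dictionary.items():
            term = mappings.get(label)
            if term is not None:
                groups.setdefault(term, []).append((cohort_name, var_id))
    return groups

groups = group_across_cohorts({"A": cohort_a, "B": cohort_b})
print(groups["CV:0001"])  # → [('A', 'var_01'), ('B', 'Q17')]
```

Making this scalable (step 4) is exactly the hard part: the mappings table grows with every new cohort, which is why a repeatable curation pipeline matters more than any one mapping.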
So now, we want to enable the discovery of data across these cohorts (see GECKO for examples):
Assign data dictionary elements to IDs and publish them as "Linked Data" (browse here)
Build a mapping pipeline (google sheet)
Link IDs to ontology terms
These links can now be used to group the metadata for identifying cohorts
The EJPRD story
Rare disease registries are scattered across the web and there is no easy way to search across all of them.
EJPRD is developing two metadata schemas:
At the registry level, they are building the metadata model, which reuses some standard vocabularies such as DCAT. There is not that much "semantics" here; it really is a metadata model.
At the record level, they are building the Clinical Data Elements (CDE) Semantic Model (see, for example, the core model). The idea is that registries publish their metadata (and eventually their data) as linked data that can be easily queried using the above models. One of the biggest problems is the size of the project and the competing voices ("If it's not RDF, it's not FAIR"), but also the sheer scale of the technical issue: many of the so-called registries are essentially Excel spreadsheets on an FTP server.
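To make the registry-level idea concrete, here is a minimal, purely illustrative sketch of a registry described with DCAT. The registry URI and values are invented; dcat:Dataset, dct:title, dcat:keyword, and dcat:landingPage are standard terms from the DCAT and Dublin Core vocabularies:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

<https://example.org/registry/cf-registry> a dcat:Dataset ;
    dct:title "Example cystic fibrosis patient registry" ;
    dct:description "Illustrative registry-level metadata record." ;
    dcat:keyword "rare disease", "cystic fibrosis" ;
    dcat:landingPage <https://example.org/registry/cf-registry/about> .
```

Even this small fragment shows the "not much semantics" point: it describes the registry as a findable dataset, but says nothing machine-interpretable about the patients or diseases inside it; that is the job of the record-level CDE Semantic Model.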