Tables and Triples
Tables and triples seem very different. Tables are familiar and predictable. Triples are weird and floppy. SQL is normal, SPARQL is bizarre, at least at first.
Tables are great, and they're the right tool for a lot of jobs, but they have their limitations. Triples shine when it comes to merging heterogeneous data. It turns out that there's a clear path from tables to triples, which should help RDF make more sense.
Tables are great! Here's a table!
You won't be surprised to find out that tables have rows and columns. Often each row corresponds to some thing that we want to talk about, such as a fictional character from Star Wars. Each column usually corresponds to some sort of property that those things might have. Then the cells contain the values of those properties for their respective row. We take some sort of complex information about the world, and we break it down along two dimensions: the things (rows) and their properties (columns).
Tables are great! We can add another name to our table:
Hmm. That's a perfectly good table, but it's not capturing the information that we wanted. It turns out (Spoiler Alert!) that Anakin Skywalker is Darth Vader! We might have thought that the rows of our table were describing individual people, but it turns out that they're just describing individual names. A person can change their name or have more than one name.
We want some sort of identifier that lets us pick out the same person, and distinguish them from all the other people. Sometimes there's a "natural key" that we can use for this purpose: some bit of information that uniquely identifies a thing. When we don't have a natural key, we can generate an "artificial key". Random strings and numbers can be good artificial keys, but sometimes a simple incrementing integer is good enough.
The main problem with artificial keys is that it's our job to maintain the link between the thing and the identifier that we gave it. We prefer natural keys because we just have to inspect that thing (in some way) to figure out what to call it. Even when it's possible, sometimes that's too much work. Maybe we could use a DNA sequence as a natural key for a person, but it probably isn't practical. We do use fingerprints and facial recognition for similar purposes, though.
(Do people in Star Wars even have DNA? Or just midi-chlorians?)
Let's add a column with an artificial key to our table:
This is our table of names, allowing a given person to have multiple names. But what we thought we wanted was a person table with one row for each person, like this:
In SQL we could assert that the "sw_id" column of the person table is a PRIMARY KEY. This means it must be unique. (It probably shouldn't be NULL either!)
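Here's a minimal sketch of that constraint, using Python's built-in sqlite3 module and an in-memory database. The column layout and the sample rows are assumptions made up for illustration, not the original tables.

```python
import sqlite3

# In-memory database; the schema and rows below are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE person (
        sw_id      INTEGER PRIMARY KEY NOT NULL,  -- artificial key: unique, never NULL
        first_name TEXT,
        last_name  TEXT
    )
    """
)
conn.execute("INSERT INTO person VALUES (1, 'Luke', 'Skywalker')")
conn.execute("INSERT INTO person VALUES (3, 'Darth', 'Vader')")

# Reusing sw_id 3 for a second row violates the primary key.
try:
    conn.execute("INSERT INTO person VALUES (3, 'Anakin', 'Skywalker')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]
print(duplicate_rejected, row_count)
```

The database refuses the second row for sw_id 3, which is exactly what we want from a person table: one row per person, however many names they have.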
The names in the person table could be the primary names that we use in our Star Wars database system, and we could have another alternative_name table:
Tables are great! We can add more columns to our person table:
The 2D pattern of a table is a strong one. It not only provides a "slot" (cell) for every combination of row and column, it also makes it very obvious when one of those slots is empty. What does it mean for a slot to be empty? It could mean many things.
For example, in the previous table in the row for Darth Vader, the cell for the "occupation" column is empty. This could mean that:
- we don't know whether he has an occupation
- we know that he has an occupation, but we don't know which occupation it is
- we might know, but we haven't bothered to write it down yet
- we might know, but it doesn't fit nicely into the New Republic Standard Registry of Occupations; in other words, we know what his occupation is, but including it here would violate a constraint on our database
- we specifically know that he doesn't have an occupation; we triple-checked
- we know more generally (Spoiler Alert!!) that he's dead, and dead people can't have an occupation.
I'm sure I haven't captured all the possibilities. The point is that there are a lot of possible reasons why a cell would be blank. So what can we do about it?
If our table is stored in a SQL database, then we have the option of putting a NULL value in the cell. NULL is pretty strange. It isn't TRUE and it isn't FALSE, and comparing anything to NULL (even with =) just gives NULL again. As a result, rows with NULL values usually drop out of SQL query results unless you are careful to ask for them with IS NULL.
The way that NULL works in SQL eliminates some of the possibilities above. SQL uses the "closed-world assumption", which is the assumption that if a statement is true then it's known to be true, and conversely that if it's not known to be true then it's false. So if Anakin's occupation is NULL in a SQL database, then as far as SQL is concerned, we must know that he doesn't have an occupation. That might not be what you were expecting!
The Software Carpentry module on Missing Data has more information.
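To see how NULL behaves in queries, here's a small sketch using Python's built-in sqlite3 module. The occupations are invented for illustration, and the `names` helper is just a convenience for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (sw_id INTEGER PRIMARY KEY, first_name TEXT, occupation TEXT)"
)
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [
        (1, "Luke", "Jedi"),       # occupations are illustrative assumptions
        (3, "Darth", None),        # Python None is stored as SQL NULL
        (4, "Han", "Smuggler"),
    ],
)

def names(where):
    """Return the first names matching a WHERE clause, ordered by sw_id."""
    sql = f"SELECT first_name FROM person WHERE {where} ORDER BY sw_id"
    return [row[0] for row in conn.execute(sql)]

not_jedi = names("occupation <> 'Jedi'")   # NULL <> 'Jedi' is NULL, so Darth is skipped
equals_null = names("occupation = NULL")   # comparing with NULL never yields TRUE
is_null = names("occupation IS NULL")      # the explicit way to ask for empty cells

print(not_jedi, equals_null, is_null)
```

Notice that Darth shows up neither as a Jedi nor as a non-Jedi. He only appears when we ask for NULL directly.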
Tables are great! Let's add even more information to our table:
We're trying to say that Darth Vader is the enemy of everybody else in our table. We're using the primary key of the person in the enemy column, which is good, but we've ended up with multiple values in the "enemy" column for Darth Vader.
In any table or SQL database you could make the "enemy" column a string, pick a delimiter such as the comma, and concatenate your values into a comma-separated list. This works, but not very well.
In some SQL databases, such as Postgres, you could give the "enemy" column an array type, so it can contain multiple values. You get special operators for querying inside arrays. This can work pretty well.
The usual advice is to break this "one to many" information into a new "enemy" table:
Then you can JOIN the person table to the enemy table as needed.
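As a rough sketch of that JOIN, again with Python's built-in sqlite3 module: the two-column layout of the enemy table and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (sw_id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE enemy (
        sw_id INTEGER REFERENCES person(sw_id),  -- the person
        enemy INTEGER REFERENCES person(sw_id)   -- one of their enemies
    );
    INSERT INTO person VALUES (1, 'Luke'), (2, 'Leia'), (3, 'Darth'), (4, 'Han');
    -- One row per enemy, instead of several values crammed into one cell.
    INSERT INTO enemy VALUES (3, 1), (3, 2), (3, 4);
    """
)

# Join person to enemy, then back to person to get the enemies' names.
enemy_names = [
    row[0]
    for row in conn.execute(
        """
        SELECT foe.first_name
        FROM person
        JOIN enemy ON enemy.sw_id = person.sw_id
        JOIN person AS foe ON foe.sw_id = enemy.enemy
        WHERE person.first_name = 'Darth'
        ORDER BY foe.sw_id
        """
    )
]
print(enemy_names)
```

Each cell holds exactly one value again, at the cost of an extra table and a couple of JOINs.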
Tables are great! Let's add even more information to our table:
A bunch of these columns only apply to a few rows. Now we've got a lot more NULLs to deal with. As the number of columns increases, this can become a problem.
Tables are great! If sparse tables are a problem, then let's try to apply the same solution that worked for the "one to many" problem in the previous section.
Hmm. Yeah, that will work. But every query we write will need some JOINs. It feels like we've lost something.
Entity, Attribute, Value
Tables are great! But there's such a thing as too many tables. We started out with a table with a bunch of rows and a bunch of columns, and ended up with a bunch of tables with a bunch of rows but just a few columns.
I have a brilliant idea! Let's combine all these property tables into just one table, by adding a "property" column!
It turns out that I'm not the first one to think of this idea. People call it "Entity, Attribute, Value" or "EAV". People also call it an "anti-pattern", in other words: a clear sign that you've made a terrible mistake.
There are lots of circumstances in which one big, extremely generic table is a bad idea. First of all, you can't do very much with the datatypes for the property and value columns. They kind of have to be strings. It's potentially difficult to index. And tables like this are miserable to query, because you end up with all sorts of self-joins to handle.
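To get a feel for the self-join problem, here's a sketch that rebuilds a two-column view of people from an EAV table. The table name `eav` and its rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eav (sw_id INTEGER, property TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO eav VALUES (?, ?, ?)",
    [
        (1, "first_name", "Luke"),
        (1, "last_name", "Skywalker"),
        (4, "first_name", "Han"),
        (4, "last_name", "Solo"),
    ],
)

# One copy of the table per property we want back as a column.
rows = conn.execute(
    """
    SELECT f.sw_id, f.value AS first_name, l.value AS last_name
    FROM eav AS f
    JOIN eav AS l ON l.sw_id = f.sw_id
    WHERE f.property = 'first_name'
      AND l.property = 'last_name'
    ORDER BY f.sw_id
    """
).fetchall()
print(rows)
```

Two properties need one self-join. Every additional property we want as a column means joining the table to itself one more time.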
But there's at least one use case where it turns out to work quite well...
Tables are great! Until they're not.
The strong row and column structure of tables makes them great for lots of things, but not so great for merging data from different sources. Before you can merge two tables you need to know all about:
- how the columns are structured
- what the rows mean
- what the cells mean
So you need to know the schemas of the two tables before you can start merging them together. But if you happen to have two EAV tables then, as luck would have it, they already have the same schema!
You also need to know that you're talking about the same things: the rows have to be about the same things, you need to be using the same property names for the same things, and the cell values also need to line up. If only there was an open standard for specifying globally unique identifiers...
Yes, you guessed it: URLs (and URNs and URIs and IRIs)! Let's assume that we use the same URLs for the same things across the two tables. Since we're a close-knit community, we've come to an agreement on a Star Wars data vocabulary.
URLs are annoyingly long to use in databases, so let's use a standard "sw" prefix to shorten them. Now we have table 1:
and table 2:
To merge these two tables, we simply concatenate them. It couldn't be simpler.
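In SQL terms, concatenation can be sketched as a UNION ALL. The table names `table_1` and `table_2`, the `sw:enemy` property, and the rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table_1 (sw_id TEXT, property TEXT, value TEXT);
    CREATE TABLE table_2 (sw_id TEXT, property TEXT, value TEXT);
    INSERT INTO table_1 VALUES ('sw:1', 'sw:first_name', 'Luke');
    INSERT INTO table_1 VALUES ('sw:3', 'sw:first_name', 'Darth');
    INSERT INTO table_2 VALUES ('sw:3', 'sw:enemy', 'sw:1');
    -- Same schema on both sides, so merging is just stacking the rows.
    CREATE TABLE triple_table AS
        SELECT * FROM table_1
        UNION ALL
        SELECT * FROM table_2;
    """
)

merged = conn.execute(
    "SELECT sw_id, property, value FROM triple_table ORDER BY sw_id, property"
).fetchall()
print(merged)
```

Using UNION instead of UNION ALL would also drop exact duplicate rows, which is closer to how an RDF graph treats a repeated triple.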
Wait, this looks kinda familiar...
These tables are pretty much in RDF format. You just have to squint a little!
- sw_id == subject
- property == predicate
- value == object
Each row of the table is a subject-predicate-object triple. Our subjects, predicates, and some objects are URLs. We also have some literal objects. We could turn this table directly into Turtle format with a little SQL magic (basically just concatenating strings):
SELECT '@prefix sw: <http://example.com/sw_> .'
UNION ALL
SELECT ''
UNION ALL
SELECT sw_id || ' ' || property || ' ' ||
  CASE
    WHEN INSTR(value, ':') > 0 THEN value  -- CURIE: emit as-is
    ELSE '"' || value || '"'               -- literal: wrap in double quotes
  END || ' .'
FROM triple_table;
The first few lines will look like this:
@prefix sw: <http://example.com/sw_> .

sw:1 sw:first_name "Luke" .
sw:2 sw:first_name "Leia" .
sw:3 sw:first_name "Darth" .
sw:4 sw:first_name "Han" .
Two things we're missing from RDF are language tagged literals and typed literals. We also haven't used any blank nodes in our triple table. These are easy enough to add.
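As a rough sketch of what adding them could look like, suppose the triple table had two extra, hypothetical columns for a language tag and a datatype. The `turtle_literal` helper below is an assumption for illustration. It handles only simple single-line strings, and a datatype such as `xsd:integer` would need its own @prefix line.

```python
def turtle_literal(value, lang=None, datatype=None):
    """Format a literal for Turtle, optionally with a language tag or a datatype CURIE."""
    if lang and datatype:
        raise ValueError("a literal takes a language tag or a datatype, not both")
    # Escape backslashes and double quotes so the quoted string stays valid.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    quoted = f'"{escaped}"'
    if lang:
        return f"{quoted}@{lang}"        # language-tagged literal
    if datatype:
        return f"{quoted}^^{datatype}"   # typed literal
    return quoted                        # plain literal

plain = turtle_literal("Luke")
tagged = turtle_literal("Luke", lang="en")
typed = turtle_literal("19", datatype="xsd:integer")
print(plain, tagged, typed)
```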
The biggest thing that's different about RDF is that it uses the "open-world assumption", so something may be true even though we don't have a triple asserting that it's true. The open-world assumption is a better fit than the closed-world assumption when we're integrating data on the Web.
Tables are great! We use them all the time, they're strong and rigid, and we're comfortable with them.
RDF, on the other hand, looks strange at first. For most common data processing, RDF is too flexible. But sometimes flexibility is the most important thing.
The greatest strength of tables is their rigid structure, but that's also their greatest weakness. We saw a number of problems with tables, and how they could be overcome by breaking tables apart into smaller tables, until we got down to the most basic pattern: subject-predicate-object. Step by step, we were pushed toward RDF.
Merging tables is particularly painful. When working with data on the Web, merging is one of the most common and important operations, and so it makes sense to use RDF for these tasks. If self-joins in SQL are the worst problem for EAV tables, then SPARQL solves it.
These examples show that it's not really very hard to convert tables to triples. And once you've seen SPARQL, the RDF query language, you've seen one good way to convert triples to tables: SPARQL SELECT results are just tables!
Since it's straightforward to convert tables to triples and back again, make sure to use the right tool for the right job. When you need to merge heterogeneous data, reach for triples. For most other data processing tasks, use tables. They're great!