Wednesday, 5 March 2014

SPARQL and RDF introduction for technical persons

This is a quick introduction to SPARQL and RDF for developers. I believe SPARQL and RDF are such profound innovations that all IT people and developers should know about them. So here's me spreading the word.

In this blog post I will show some examples of basic SPARQL queries, but first a short intro to RDF.

Typical relational databases are organized into tables that look something like this:

Table: Persons
Id  Name         Date of birth  Department Id
2   Roy Lachica  22.12.1975     5
4   John Doe     3.4.1973       6

Table: Departments
Id  Name
5   Development
6   Human relations


In RDF this would be represented something like this:
subject predicate object
http://vocab.org/ns/roy_lachica http://schema.org/dateOfBirth 1975-12-22
http://vocab.org/ns/roy_lachica http://ourCorpVocab.com/department http://ourCorpVocab.com/dep/dev
http://vocab.org/ns/john_doe http://schema.org/dateOfBirth 1973-04-03
http://vocab.org/ns/john_doe http://ourCorpVocab.com/department http://ourCorpVocab.com/dep/hr

This could also be drawn as a graph, with subjects and objects as nodes and predicates as the edges between them.
In the example RDF data above we use URIs as identifiers instead of the primary key ID columns used in relational databases. These URIs live in namespaces that belong to predefined schemas, typically defined by other organizations. This is where the big innovation lies.
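
To make the triple model concrete, here is a minimal Python sketch that holds the example triples in a plain list and looks them up by pattern. The `match` and `to_graph` helpers are hypothetical names invented for this illustration; a real triplestore does the storing and indexing for you.

```python
# The example triples from above as (subject, predicate, object) tuples.
TRIPLES = [
    ("http://vocab.org/ns/roy_lachica", "http://schema.org/dateOfBirth", "1975-12-22"),
    ("http://vocab.org/ns/roy_lachica", "http://ourCorpVocab.com/department", "http://ourCorpVocab.com/dep/dev"),
    ("http://vocab.org/ns/john_doe", "http://schema.org/dateOfBirth", "1973-04-03"),
    ("http://vocab.org/ns/john_doe", "http://ourCorpVocab.com/department", "http://ourCorpVocab.com/dep/hr"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def to_graph(triples):
    """Group triples by subject: {subject: [(predicate, object), ...]}."""
    graph = {}
    for s, p, o in triples:
        graph.setdefault(s, []).append((p, o))
    return graph


# Everything we know about Roy: two outgoing edges in the graph.
roy_facts = match(TRIPLES, s="http://vocab.org/ns/roy_lachica")
```

Leaving a position as a wildcard is roughly what a variable such as ?s or ?o does in the SPARQL queries further down.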

The problem RDF solves 

In the relational database example we have defined the schema ourselves. Others have no way of knowing what the columns mean, although they might guess from looking at the data. If we were to expose the data through a REST API, the API consumers could not be sure what the data means unless they read the documentation. They would have to manually map properties and parse and transform the data. In short, they would have to make sense of the data and its structure themselves.

The innovation

In a triplestore (RDF database) you can store any kind of statement and change the shape of the data whenever you like. This makes you more agile: you are not restricted as in a relational database development stack, where a change in the schema forces changes in all the layers above it.

You also don't have to worry about SQL and database engine quirks and performance issues when designing the schema. The RDF model is extremely simple and universal: a directed graph. You don't need to normalize or denormalize, set up indexes and keys, or decide whether to use stored procedures. There is also no notion of a NULL value, since a missing fact is simply an absent triple, and this removes a common source of bugs.
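
As a rough illustration of the two points above (no schema migration, no NULL), the following Python sketch uses a set of triples with made-up `ex:` identifiers. The `values` helper is hypothetical and only stands in for what a triplestore query would do.

```python
# Hypothetical data; the "ex:" names are placeholders, not a real vocabulary.
triples = {
    ("ex:roy", "ex:department", "ex:dev"),
    ("ex:john", "ex:department", "ex:hr"),
}

# Adding a new kind of fact for one resource needs no ALTER TABLE:
# it is just one more triple.
triples.add(("ex:roy", "ex:nickname", "Roy"))


def values(triples, subject, predicate):
    """Return all objects stated for (subject, predicate), sorted."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


# John has no nickname. There is no NULL cell to check for,
# the lookup simply finds nothing.
john_nicknames = values(triples, "ex:john", "ex:nickname")
roy_nicknames = values(triples, "ex:roy", "ex:nickname")
```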

In RDF, anyone (also computers, not just humans) can make sense of the data, as long as you use known RDF schemas. These schemas are also called vocabularies and ontologies (ontologies are just more advanced vocabularies). There are even some basic semantics inherent in RDF so for simple structures you might not even need to define or reuse a schema.
The URIs tell us which schemas are used, and the schema is not hard-coded into the database but openly defined. (A schema could also be company internal, but that would sort of defeat the purpose of enabling a global database or web of data by connecting disparate data sources.)
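
One way to see how a URI points back to its schema is to split it into a namespace and a local name. The sketch below assumes the common convention of cutting at the last `#` or `/`; that convention is an assumption for illustration, and `split_uri` is a hypothetical helper, not part of any RDF library.

```python
def split_uri(uri):
    """Split a URI into (namespace, local name) at the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[:cut + 1], uri[cut + 1:]


# Predicates taken from the example triples earlier in this post.
predicates = [
    "http://schema.org/dateOfBirth",
    "http://ourCorpVocab.com/department",
]

# The set of namespaces shows which vocabularies the data relies on.
namespaces = sorted({split_uri(p)[0] for p in predicates})
```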

SPARQL examples

RDF triplestores have SPARQL endpoints for querying the data. Fortunately, the University of Mannheim has made such an endpoint openly available. This endpoint exposes a CIA World Factbook example database.

The endpoint is located at:
http://wifo5-03.informatik.uni-mannheim.de/factbook/snorql/
Feel free to check out the endpoint and click around.

This endpoint (with a typical SPARQL user interface) lets you query the triplestore and return the results as JSON, XML or HTML for reading on screen.
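
If you ask for JSON, the answer normally follows the W3C SPARQL JSON results layout: a "head" with the variable names and a list of "bindings". Here is a small Python sketch that flattens such a document into plain dictionaries. The sample document and its values are made up for illustration and are not real output from the Mannheim endpoint.

```python
import json

# Made-up sample in the SPARQL JSON results layout (values are placeholders).
SAMPLE = """
{
  "head": {"vars": ["country", "population"]},
  "results": {"bindings": [
    {"country": {"type": "literal", "value": "Norway"},
     "population": {"type": "literal", "value": "123"}},
    {"country": {"type": "literal", "value": "Algeria"}}
  ]}
}
"""


def rows(doc):
    """Turn a SPARQL JSON results document into a list of {variable: value} dicts."""
    data = json.loads(doc)
    names = data["head"]["vars"]
    # A variable that is unbound in a row is simply left out of that row.
    return [
        {name: binding[name]["value"] for name in names if name in binding}
        for binding in data["results"]["bindings"]
    ]


result = rows(SAMPLE)
```

Note that all values arrive as strings, so numbers have to be converted on the client side.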

You may also type in a query directly in your browser address bar:
http://wifo5-03.informatik.uni-mannheim.de/factbook/snorql/?query=SELECT+%3Fcountry+%3Fpopulation+%3Fgrowthrate+%3Fcapital_city+%3Farea%0D%0AWHERE+%7B%0D%0A%3Fx++factbook%3Aname+%3Fcountry+%3B%0D%0Afactbook%3Apopulation_total+%3Fpopulation+%3B%0D%0Afactbook%3Apopulationgrowthrate+%3Fgrowthrate+%3B%0D%0Afactbook%3Acapital_name+%3Fcapital_city+%3B%0D%0Afactbook%3Aarea_total+%3Farea+.%0D%0A%7D
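
You don't have to encode such a URL by hand. The Python sketch below builds one with the standard library, using the Snorql address given above as the base. `query_url` is a hypothetical helper name, and the sketch only constructs the URL, it does not send any request.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base address of the Snorql page mentioned above.
SNORQL = "http://wifo5-03.informatik.uni-mannheim.de/factbook/snorql/"


def query_url(sparql, base=SNORQL):
    """Return the base URL with the SPARQL text percent-encoded as ?query=..."""
    return base + "?" + urlencode({"query": sparql})


sparql = "SELECT * WHERE { ?s ?p ?o } LIMIT 10"
url = query_url(sparql)

# Decoding the query string again gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

Spaces become `+` and characters such as `?` and `{` become `%3F` and `%7B`, which is the same encoding you can see in the long URL above.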


Various SPARQL queries

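Try copying these into the endpoint's SPARQL text field. The queries use the db: and factbook: prefixes without declaring them, which works here because the Snorql interface supplies them.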

SELECT * WHERE { ?s ?p ?o} limit 10
This will get the first 10 triples (each triple is a subject-predicate-object set).


SELECT COUNT(*)  { ?s ?p ?o  }
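Get the number of triples in the database. Note that this unnamed COUNT(*) form is a shorthand that only some endpoints accept; standard SPARQL 1.1 requires the aliased form used in the next query.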


SELECT (COUNT(*) AS ?triples_count) { ?s ?p ?o  }
Same as above but with a name for the result, making the output more human readable.


SELECT DISTINCT ?value WHERE { db:Algeria ?value ?o}
Get all distinct properties (the predicate part) of the triples that have Algeria as their subject.


SELECT DISTINCT ?value WHERE { db:Algeria factbook:climate ?value }
Gets the climate in Algeria.


SELECT DISTINCT * WHERE { db:Norway factbook:landboundary ?borderingcountry }
ORDER BY ?borderingcountry
Get the countries that border Norway.


SELECT DISTINCT ?country ?literacypercentage
WHERE {
  ?country factbook:literacy_totalpopulation ?literacypercentage .
  FILTER ( ?literacypercentage > 70 )
}
ORDER BY DESC(?literacypercentage)
Get all countries with a literacy percentage over 70, highest first.


SELECT DISTINCT ?country ?literacypercentage ?populationcount
WHERE {
  ?country factbook:literacy_totalpopulation ?literacypercentage .
  ?country factbook:population_total ?populationcount .
  FILTER ( ?literacypercentage > 50 && ?populationcount > 5000000 )
}
ORDER BY ASC(?literacypercentage)
Get all countries with a literacy percentage over 50 and a population of more than 5 million. Sort the results so that the countries with the lowest literacy come first.
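
If FILTER and ORDER BY feel unfamiliar, it may help to see the same logic in Python. This is only a sketch over invented rows (the country names A to D and all numbers are placeholders, not Factbook data), and `filter_and_sort` is a hypothetical helper.

```python
# Invented rows standing in for the query's variable bindings.
rows = [
    {"country": "A", "literacy": 47.0, "population": 12_000_000},
    {"country": "B", "literacy": 99.0, "population": 4_000_000},
    {"country": "C", "literacy": 82.5, "population": 60_000_000},
    {"country": "D", "literacy": 61.0, "population": 9_000_000},
]


def filter_and_sort(rows, min_literacy=50, min_population=5_000_000):
    """Keep rows passing both conditions, then sort by literacy ascending."""
    kept = [
        r for r in rows
        if r["literacy"] > min_literacy and r["population"] > min_population
    ]
    return sorted(kept, key=lambda r: r["literacy"])


# A fails on literacy, B fails on population; D and C remain, lowest first.
result = [r["country"] for r in filter_and_sort(rows)]
```

The `and` in the list comprehension plays the role of `&&` in the FILTER, and `sorted` plays the role of ORDER BY ASC.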


SELECT ?country ?population ?growthrate ?capital_city ?area
WHERE {
  ?x factbook:name ?country ;
     factbook:population_total ?population ;
     factbook:populationgrowthrate ?growthrate ;
     factbook:capital_name ?capital_city ;
     factbook:area_total ?area .
}
Get country name, population, growth rate, capital city and country area size of all countries.
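
The semicolons in this query mean that all the patterns share the same subject ?x, so a country only shows up in the result if it has every one of the listed properties. The Python sketch below mimics that behaviour over a few invented triples. `select_rows` is a hypothetical helper, and it assumes at most one value per property per subject, which is a simplification (SPARQL would return one row per combination of values).

```python
def select_rows(triples, predicates):
    """Return one tuple per subject that has a value for every predicate."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, {})[p] = o
    return [
        tuple(props[p] for p in predicates)
        for _, props in sorted(by_subject.items())
        if all(p in props for p in predicates)
    ]


# Invented triples; "ex:xx" deliberately lacks a capital_name.
triples = [
    ("ex:no", "name", "Norway"),
    ("ex:no", "capital_name", "Oslo"),
    ("ex:xx", "name", "Nowhere"),
]

# Only subjects matching ALL patterns produce a row.
result = select_rows(triples, ["name", "capital_name"])
```

So if a country seems to be missing from the output, a likely reason is that one of the requested properties is not stated for it.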



Further reading

A Relational View of the Semantic Web