
Wednesday, 5 March 2014

SPARQL and RDF introduction for technical people

This is a quick introduction to SPARQL and RDF for developers. I believe SPARQL and RDF are such profound innovations that every IT professional and developer should know about them. So here's me spreading the word.

In this blog post I will show some examples of basic SPARQL queries, but first a short intro to RDF.

Typical relational databases are organized into tables that look something like this:

Table: Persons
Id  Name         Date of birth  Department Id
2   Roy Lachica  22.12.1975     5
4   John Doe     3.4.1973       6

Table: Departments
Id  Name
5   Development
6   Human relations


In RDF this would be represented something like this:
subject                          predicate                           object
http://vocab.org/ns/roy_lachica  http://schema.org/dateOfBirth       1975-12-22
http://vocab.org/ns/roy_lachica  http://ourCorpVocab.com/department  http://ourCorpVocab.com/dep/dev
http://vocab.org/ns/john_doe     http://schema.org/dateOfBirth       1973-04-03
http://vocab.org/ns/john_doe     http://ourCorpVocab.com/department  http://ourCorpVocab.com/dep/hr

This could also be represented as a node graph.
In the example RDF data above we use URIs as identifiers instead of the primary-key ID columns used in relational databases. These URIs belong to namespaces of predefined schemas, typically defined by other organizations. This is where the big innovation lies.
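For illustration, the same triples could be written in Turtle, a compact text serialization of RDF (the prefix choices below are my own assumptions, not part of the original data):

```turtle
@prefix schema: <http://schema.org/> .
@prefix corp:   <http://ourCorpVocab.com/> .
@prefix person: <http://vocab.org/ns/> .

person:roy_lachica
    schema:dateOfBirth "1975-12-22" ;
    corp:department    <http://ourCorpVocab.com/dep/dev> .

person:john_doe
    schema:dateOfBirth "1973-04-03" ;
    corp:department    <http://ourCorpVocab.com/dep/hr> .
```

The semicolon groups several predicate-object pairs under the same subject, which makes the graph structure easy to read.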

The problem RDF solves 

In the relational database example we defined the schema ourselves. There is no way for others to know what the columns mean, although you might guess from looking at the data. If we were to expose the data through a REST API, there would be no way for API consumers to be sure what the data means unless they read the documentation. API consumers have to manually couple properties, parse and transform data. In short, they have to make sense of the data and its structure.

The innovation

A triplestore (RDF database) lets you put anything in and change the data however you like. This makes you more agile: you are not restricted as in a relational database stack, where a schema change forces you to change all the layers above it.

You also don't have to worry about SQL and database-engine quirks or performance issues when designing the schema. The extremely simple, universal RDF model is a directed graph: you don't need to normalize or denormalize, set up indexes and keys, or decide whether to use stored procedures. There is no notion of a NULL value, which removes a whole class of potential bugs.

In RDF, anyone (computers as well as humans) can make sense of the data, as long as you use known RDF schemas. These schemas are also called vocabularies and ontologies (ontologies are just more advanced vocabularies). There are even some basic semantics inherent in RDF itself, so for simple structures you might not even need to define or reuse a schema.
The URIs tell us which schemas are used, and the schema is not hard-coded into the database but openly defined. (A schema could also be company-internal, but that would somewhat defeat the purpose of enabling a global database, or web of data, by connecting disparate data sources.)

SPARQL examples

RDF triplestores have SPARQL endpoints for querying the data. Fortunately, the University of Mannheim has made such an endpoint openly available. It exposes an example database built from the CIA World Factbook.

The endpoint is located at:
http://wifo5-03.informatik.uni-mannheim.de/factbook/snorql/
Feel free to check out the endpoint and click around.

This endpoint (with a typical SPARQL user interface) lets you query the triplestore and return the results as JSON, XML or HTML for reading on screen.

You may also type in a query directly in your browser address bar:
http://wifo5-03.informatik.uni-mannheim.de/factbook/snorql/?query=SELECT+%3Fcountry+%3Fpopulation+%3Fgrowthrate+%3Fcapital_city+%3Farea%0D%0AWHERE+%7B%0D%0A%3Fx++factbook%3Aname+%3Fcountry+%3B%0D%0Afactbook%3Apopulation_total+%3Fpopulation+%3B%0D%0Afactbook%3Apopulationgrowthrate+%3Fgrowthrate+%3B%0D%0Afactbook%3Acapital_name+%3Fcapital_city+%3B%0D%0Afactbook%3Aarea_total+%3Farea+.%0D%0A%7D
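The long string after ?query= is just the SPARQL text percent-encoded. As a sketch (endpoint URL as above, query shortened for readability), Python's standard library can build such a URL:

```python
from urllib.parse import urlencode

# The snorql endpoint shown above.
ENDPOINT = "http://wifo5-03.informatik.uni-mannheim.de/factbook/snorql/"

query = "SELECT ?country WHERE { ?x factbook:name ?country } LIMIT 5"

# urlencode percent-encodes spaces, braces and '?' so the query
# survives as a single URL parameter.
url = ENDPOINT + "?" + urlencode({"query": query})
print(url)
```

Pasting the printed URL into a browser address bar has the same effect as typing the query into the endpoint's form.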


Various SPARQL queries

Try copying these into the endpoint's SPARQL text field.

SELECT * WHERE { ?s ?p ?o } LIMIT 10
This returns the first 10 triples (subject-predicate-object statements).


SELECT COUNT(*) { ?s ?p ?o }
Get the number of triples in the database. (A bare COUNT(*) works on this endpoint; strict SPARQL 1.1 requires an alias, as in the next query.)


SELECT (COUNT(*) AS ?triples_count) { ?s ?p ?o  }
Same as above, but with a name for the result, making the output more human-readable.


SELECT DISTINCT ?property WHERE { db:Algeria ?property ?o }
Get all distinct properties (the predicate part) of triples where Algeria is the subject.


SELECT DISTINCT ?value WHERE { db:Algeria factbook:climate ?value }
Get the climate of Algeria.


SELECT DISTINCT * WHERE { db:Norway factbook:landboundary ?borderingcountry }
ORDER BY ?borderingcountry
Get the countries that border Norway.


SELECT DISTINCT ?country ?literacypercentage
WHERE {
  ?country factbook:literacy_totalpopulation ?literacypercentage .
  FILTER ( ?literacypercentage > 70 )
}
ORDER BY DESC(?literacypercentage)
Get all countries with a literacy percentage over 70, highest first.


SELECT DISTINCT ?country ?literacypercentage ?populationcount
WHERE {
  ?country factbook:literacy_totalpopulation ?literacypercentage .
  ?country factbook:population_total ?populationcount .
  FILTER ( ?literacypercentage > 50 && ?populationcount > 5000000 )
}
ORDER BY ASC(?literacypercentage)
Get all countries with a literacy percentage over 50 and a population of more than 5 million, sorted with the lowest literacy on top.


SELECT ?country ?population ?growthrate ?capital_city ?area
WHERE {
  ?x factbook:name ?country ;
     factbook:population_total ?population ;
     factbook:populationgrowthrate ?growthrate ;
     factbook:capital_name ?capital_city ;
     factbook:area_total ?area .
}
Get the country name, population, growth rate, capital city and total area of all countries.
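Note that a country missing any one of the five properties drops out of the results entirely, since every pattern in the group must match. A variation (assuming the same factbook predicates) uses OPTIONAL so that countries without a recorded capital still appear, with the capital column left unbound:

```sparql
SELECT ?country ?population ?capital_city
WHERE {
  ?x factbook:name ?country ;
     factbook:population_total ?population .
  OPTIONAL { ?x factbook:capital_name ?capital_city }
}
```

This is the SPARQL analogue of a left outer join in SQL.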



Further reading

A Relational View of the Semantic Web




Monday, 5 September 2011

A Hebbian adaptive semantic triplestore

While reading the book Making Things Work: Solving Complex Problems in a Complex World by Yaneer Bar-Yam, and in particular his chapter on networks and collective memory, I got the idea of mixing Hebbian theory with RDF. I have played with similar thoughts before with Topic Maps technology in my research paper Quality, Relevance and Importance in Information Retrieval with Fuzzy Semantic Networks. This time my thoughts were more on adaptive knowledge.

So here's my idea for a Hebbian triplestore.
Each triple in the triplestore has an array of, say, up to 10 rows, each holding a datetime value. When a SPARQL query touches a triple, the current date is appended to its array as a new row. The array acts as a FIFO queue, so the oldest date is removed when the array is full. Whenever triples are added, the database checks whether it is full; if so, it deletes the triples that are least used. When the database is idle, it performs routine checks to find the least-used triples. The database could then be fed new triples regularly and would, over time, automatically adapt to the domain where it is used (queried). So the Hebbian adaptive semantic triplestore is a knowledge store that evolves and becomes more relevant in the environment where it is being used.
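A minimal sketch of this usage-tracking idea in Python (all names here are hypothetical; a real triplestore would hook this into its query engine rather than call touch by hand):

```python
from collections import deque
from datetime import datetime

class HebbianStore:
    """Toy in-memory triplestore that remembers the last 10 access times per triple."""

    MAX_TIMESTAMPS = 10  # the "array of up to 10 rows" from the post

    def __init__(self, capacity):
        self.capacity = capacity
        # (subject, predicate, object) -> FIFO of access datetimes;
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.triples = {}

    def add(self, subject, predicate, obj):
        if len(self.triples) >= self.capacity:
            self._evict_least_used()
        self.triples[(subject, predicate, obj)] = deque(maxlen=self.MAX_TIMESTAMPS)

    def touch(self, triple):
        """Record an access; a query engine would call this per matched triple."""
        self.triples[triple].append(datetime.now())

    def _evict_least_used(self):
        # Fewest recorded accesses = least used. A real store might also
        # weight by how recent the recorded timestamps are.
        least_used = min(self.triples, key=lambda t: len(self.triples[t]))
        del self.triples[least_used]

store = HebbianStore(capacity=2)
store.add("ns:roy", "ex:knows", "ns:john")
store.touch(("ns:roy", "ex:knows", "ns:john"))  # this triple gets queried once
store.add("ns:roy", "ex:dept", "ex:dev")        # never queried
store.add("ns:john", "ex:dept", "ex:hr")        # store is full: evicts the unused triple
```

The deque with maxlen gives the FIFO behaviour for free, and eviction by access count is the crudest possible "least used" measure; correlating the stored timestamps across triples, as described below, would be the next step.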

Another interesting feature of this triplestore is that it would know which triples are most used. By correlating the dates in the arrays of different triples, it could suggest relevant or extended SPARQL queries.

Saturday, 18 June 2011

Sparql.us online RDF SPARQL query builder

I've just made a simple site for testing SPARQL queries.
With it you don't have to install triplestores, RDF engines or other Semantic Web stack components before testing your SPARQL queries.

Check it out at http://sparql.us

The tool has restrictions on RDF graph size; it's only meant for learning SPARQL and for testing/debugging SPARQL queries on the fly.