Converting tabular data to RDF: Generate IRIs

  • open_withSample RDF input
    @prefix :     <http://example.com/> .
    @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
    
    [ a skos:Concept ;
      skos:prefLabel "Alajärvi"@fi ;
      skos:notation "005" ;
      skos:inScheme <https://linked.opendata.cz/resource/ec.europa.eu/eurostat/lau/2016> ;
      :AREA "1056740000"^^xsd:decimal ;
      :CHANGE false ;
      :LAU1_NAT_CODE "146" ;
      :NUTS_3 "FI194" ;
      :POP 10006 ;
      :sheet_name "FI" ] .
    
    [ a skos:Concept ;
      skos:prefLabel "Eura"@fi ;
      skos:notation "050" ;
      skos:inScheme <https://linked.opendata.cz/resource/ec.europa.eu/eurostat/lau/2016> ;
      :AREA "630200000"^^xsd:decimal ;
      :CHANGE false ;
      :LAU1_NAT_CODE "041" ;
      :NUTS_3 "FI196" ;
      :POP 12128 ;
      :sheet_name "FI" ] .
    
    [ a skos:Concept ;
      skos:prefLabel "Petäjävesi"@fi ;
      skos:notation "592" ;
      skos:inScheme <https://linked.opendata.cz/resource/ec.europa.eu/eurostat/lau/2016> ;
      :AREA "495420000"^^xsd:decimal ;
      :CHANGE false ;
      :LAU1_NAT_CODE "131" ;
      :NUTS_3 "FI193" ;
      :POP 4008 ;
      :sheet_name "FI" ] .

Tabular outputs blank nodes by default. Blank nodes are not stable identifiers, so they do not persist across queries or updates, which makes them difficult to manipulate in data transformations. Since blank nodes are usually “a convenience for the content provider and a burden on the content consumer” (source), we convert them to IRIs to make it easier to consume the data. We mint IRIs within a domain under our control; in this case https://linked.opendata.cz. This avoids clashes with others creating IRIs in their namespaces and allows us to make the IRIs dereferenceable.

Since we want to avoid using the same IRI for multiple resources, we need to ensure that we base it on a unique key. A safe choice of a unique identifier is UUID, which can be conveniently generated by the struuid() function in SPARQL, but UUID-based IRIs are less readable and prevent linking data by conveniently generating IRIs based on shared keys. Instead, whenever possible we generate IRIs using simple or compound shared keys, such as the codes of local administrative units.

When you choose a key to use in IRIs, you should verify that it is available for all the resources that you want to create IRIs for and that the keys are unique. To check if these requirements are violated, you can run a query with an OPTIONAL graph pattern matching the keys that groups by the key values and checks that each group contains exactly one solution.

GROUP BY ?key1 ?key2 # more keys...
HAVING (COUNT(*) != 1)

If such query returns non-empty results, then your key is non-unique and you need to change it. For example, we can check whether LAU level 1 codes overlap between countries by using a similar query.

PREFIX : <http://example.com/>

SELECT ?lau1Notation
WHERE {
  [] :LAU1_NAT_CODE ?lau1Notation ;
    :sheet_name ?country .
}
GROUP BY ?lau1Notation
HAVING (COUNT(DISTINCT ?country) > 1)

This query returns non-empty results indicating that some countries share the same LAU1 codes. Therefore, we need to include country codes in IRIs of LAU1. In this step we also map LAU1 labels, provided by the :LAU1_NAME property, and link to the concept scheme in which LAU1 codes belong.

PREFIX :     <http://example.com/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

DELETE {
  ?lau2 :LAU1_NAT_CODE ?lau1Notation ;
    :LAU1_NAME ?lau1PrefLabel .
}
INSERT {
  ?lau2 skos:broaderTransitive ?lau1 .
  ?lau1 a skos:Concept ;
    skos:prefLabel ?lau1PrefLabel ;
    skos:inScheme <https://linked.opendata.cz/resource/ec.europa.eu/eurostat/lau/2016> .
}
WHERE {
  ?lau2 :LAU1_NAT_CODE ?lau1Notation ;
    :sheet_name ?country .
  BIND (iri(concat("https://linked.opendata.cz/resource/ec.europa.eu/eurostat/lau/2016/",
                   lcase(?country),
                   "/",
                   ?lau1Notation)) AS ?lau1)
  OPTIONAL {
    ?lau2 :LAU1_NAME ?lau1PrefLabel .
  }
}

Next, we generate IRIs for local administrative units on the level 2. This step is more convoluted because not all LAU2 have codes and when their codes are available they may not be unique per country. For example, the LAU2 code 32010 in the column LAU2_NAT_CODE is shared among several countries, such as Belgium, Bulgaria, and Ireland:

NUTS_3 LAU1_NAT_CODE LAU2_NAT_CODE CHANGE NAME_1 NAME_2_LAT POP AREA
BE252 n.a. 32010 no Koekelare Koekelare 8761 39185258
BG423 PAZ19 32010 no Ивайло Ivaylo 2750 14078000
IE011 01 32010 no Mullagh Mullagh n.a 2440000

In order to address these limitations, we generate UUID-based IRIs for LAU2 without codes and include LAU1 codes in LAU2 IRIs to disambiguate the administrative units within countries. However, not all LAU2 codes link a LAU1 code, so its inclusion is optional.

PREFIX :     <http://example.com/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

DELETE {
  ?_lau2 ?p ?o .
}
INSERT {
  ?lau2 ?p ?o .
}
WHERE {
  {
    SELECT ?_lau2 ?lau2
    WHERE {
      ?_lau2 :sheet_name ?country .
      OPTIONAL {
        # There are some LAU2 without a code.
        ?_lau2 skos:notation ?code .
      }
      OPTIONAL {
        ?_lau2 skos:broaderTransitive/skos:notation ?lau1 .
      }
      BIND (if(bound(?code),
               concat(lcase(?country),
                      "/",
                      if(bound(?lau1), concat(encode_for_uri(lcase(?lau1)), "/"), ""),
                      encode_for_uri(lcase(?code))),
               struuid()) AS ?key)
      BIND (iri(concat("https://linked.opendata.cz/resource/ec.europa.eu/eurostat/lau/2016/", ?key))
            AS ?lau2)
    }
  }
  ?_lau2 ?p ?o .
}

Some local administrative units refer to codes of their parent NUTS regions via the :NUTS_3 or :NUTS3_13 properties. NUTS (Nomenclature of territorial units for statistics) divides the EU into hierarchically organized regions that serve the purpose of aggregating statistics. There are unofficial versions of NUTS in RDF that correspond to older versions of NUTS. However, our dataset links the current version of NUTS from 2013. Hence, we cannot reuse the existing RDF versions. Instead, we mint our own IRIs for NUTS regions on the level 3. Besides providing NUTS regions with IRIs we link them as parents to LAU2 via skos:broaderTransitive; either directly or indirectly through their parent LAU1, if present.

PREFIX :     <http://example.com/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

DELETE {
  ?lau2 ?nutsProperty ?notation .
}
INSERT {
  ?narrower skos:broaderTransitive ?nuts3 .
  ?nuts3 a skos:Concept ;
    skos:notation ?notation ;
    skos:inScheme <https://linked.opendata.cz/resource/ec.europa.eu/eurostat/nuts/2013> .
}
WHERE {
  VALUES ?nutsProperty {
    :NUTS_3
    :NUTS3_13
  }
  ?lau2 ?nutsProperty ?notation ;
    :sheet_name ?country .
  BIND (if(strstarts(?notation, ucase(?country)), ?notation, concat(ucase(?country), ?notation))
        AS ?key)
  BIND (iri(concat("https://linked.opendata.cz/resource/ec.europa.eu/eurostat/nuts/2013/", ?key))
        AS ?nuts3)
  OPTIONAL {
    ?lau2 skos:broaderTransitive ?lau1 .
  }
  BIND (coalesce(?lau1, ?lau2) AS ?narrower)
}

The remaining blank nodes in the data after these steps identify the deprecated LAU from 2015. These LAU are explicitly linked via the dcterms:replaces property, which we added in one of the previous steps. We generate code-based IRIs for the deprecated LAU, both on the level 1 and 2.

PREFIX :        <http://example.com/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>

DELETE {
  ?newLau2 dcterms:replaces ?_oldLau2 .
  ?_oldLau2 ?p ?o .
}
INSERT {
  ?newLau2 dcterms:replaces ?oldLau2 .
  ?oldLau2 a skos:Concept ;
    skos:inScheme <https://linked.opendata.cz/resource/ec.europa.eu/eurostat/lau/2015> ;
    skos:notation ?oldLau2Notation ;
    skos:broaderTransitive ?oldLau1 .
  ?oldLau1 a skos:Concept ;
    skos:inScheme <https://linked.opendata.cz/resource/ec.europa.eu/eurostat/lau/2015> ;
    skos:notation ?oldLau1Notation .
}
WHERE {
  ?newLau2 dcterms:replaces ?_oldLau2 ;
    :sheet_name ?country .
  OPTIONAL {
    ?_oldLau2 skos:notation ?oldLau2Notation .
  }
  BIND ("https://linked.opendata.cz/resource/ec.europa.eu/eurostat/lau/2015/" AS ?ns)
  OPTIONAL {
    ?_oldLau2 skos:broaderTransitive/skos:notation ?oldLau1Notation .
    BIND (iri(concat(?ns,
                     lcase(?country),
                     "/",
                     ?oldLau1Notation)) AS ?oldLau1)
  }
  BIND (if(bound(?oldLau2Notation),
           concat(lcase(?country),
                  "/",
                  if(bound(?oldLau1Notation), concat(encode_for_uri(lcase(?oldLau1Notation)), "/"), ""),
                  encode_for_uri(lcase(?oldLau2Notation))),
           struuid()) AS ?key)
  BIND (iri(concat(?ns, ?key)) AS ?oldLau2)
  ?_oldLau2 ?p ?o .
}
  • open_withSample RDF output
    @prefix :     <http://example.com/> .
    @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
    
    <https://linked.opendata.cz/resource/ec.europa.eu/eurostat/lau/2016/fi/005> a skos:Concept ;
      skos:prefLabel "Alajärvi"@fi ;
      skos:notation "005" ;
      skos:broaderTransitive <https://linked.opendata.cz/resource/ec.europa.eu/eurostat/lau/2016/fi/146> ;
      skos:inScheme <https://linked.opendata.cz/resource/ec.europa.eu/eurostat/lau/2016> ;
      :AREA "1056740000"^^xsd:decimal ;
      :CHANGE false ;
      :POP 10006 ;
      :sheet_name "FI" .
    
    <https://linked.opendata.cz/resource/ec.europa.eu/eurostat/lau/2016/fi/050> a skos:Concept ;
      skos:prefLabel "Eura"@fi ;
      skos:notation "050" ;
      skos:broaderTransitive <https://linked.opendata.cz/resource/ec.europa.eu/eurostat/lau/2016/fi/041> ;
      skos:inScheme <https://linked.opendata.cz/resource/ec.europa.eu/eurostat/lau/2016> ;
      :AREA "630200000"^^xsd:decimal ;
      :CHANGE false ;
      :POP 12128 ;
      :sheet_name "FI" .
    
    <https://linked.opendata.cz/resource/ec.europa.eu/eurostat/lau/2016/fi/592> a skos:Concept ;
      skos:prefLabel "Petäjävesi"@fi ;
      skos:notation "592" ;
      skos:broaderTransitive <https://linked.opendata.cz/resource/ec.europa.eu/eurostat/lau/2016/fi/131> ;
      skos:inScheme <https://linked.opendata.cz/resource/ec.europa.eu/eurostat/lau/2016> ;
      :AREA "495420000"^^xsd:decimal ;
      :CHANGE false ;
      :POP 4008 ;
      :sheet_name "FI" .