Geocoding with Nominatim
Data often refers to places in the real world. Geographic location provides an important context for many datasets, even when it is not referred to explicitly. Locations are commonly described in terms of postal addresses. In order to compute with them or place them on a map, they have to be converted to geographic coordinates. Lookup of geographic coordinates for postal addresses is known as geocoding.
OpenStreetMap is a premier source of open geographic data. For each location it provides both its address and its geographic coordinates. To convert between these two kinds of location descriptions, it offers Nominatim, a geocoding web service. In this tutorial, we show how LinkedPipes ETL (LP-ETL) can be used for geocoding with Nominatim. We cover how to generate queries for Nominatim and how to map their JSON results to RDF.
Geocoding starts with postal addresses. For example, let's take the addresses of three universities in Prague, Czech Republic, described as instances of PostalAddress from schema.org:
Usually, much effort is spent cleaning the addresses before geocoding. Normalization of addresses, such as expansion of common abbreviations or trimming extraneous characters, has a crucial impact on the quality of geocoding. However, in this tutorial we skip the cleaning of the addresses so that we can focus on geocoding. That is why the addresses in our simplified example are already well-structured.
In real cases, you would typically retrieve the addresses to geocode via a SPARQL query or from a data dump. In order to be self-contained, our tutorial instead loads the addresses from the Text holder component into which we paste the example data above. We convert it to RDF by using the Files to RDF single graph component, so that we can process it further.
Generate query
For each address we need to generate a query to Nominatim. The query is expressed via parameters sent to Nominatim's endpoint URL http://nominatim.openstreetmap.org/search. The endpoint supports both unstructured and structured queries. Since our data is structured, we turn it into structured queries. Structured search maps elements of postal addresses to specific query parameters. The parameters roughly correspond to the properties of Schema.org's PostalAddress, which simplifies generating the queries. The properties can be mapped in the following way:
Schema.org property | Nominatim's parameter |
---|---|
schema:streetAddress | street |
schema:addressLocality | city |
schema:postalCode | postalcode |
schema:addressCountry | countrycodes |
We use either city or postal code, since both usually identify the same level of postal addresses and are therefore to some degree interchangeable. Postal code is preferred as it is more standardized and exhibits less variety. We map country to countrycodes instead of country because the former allows you to specify a precise ISO 3166-1 alpha-2 code, which is what we have in our sample addresses. We ask only for the best match via limit=1, since we do not have other ways to assess the matches. We request the response to be in JSON via format=json. URLs of queries to Nominatim can be generated as RDF configuration for the HTTP GET list component via the SPARQL CONSTRUCT component:
PREFIX :         <http://localhost/>
PREFIX httpList: <http://plugins.linkedpipes.com/ontology/e-httpGetFiles#>
PREFIX schema:   <http://schema.org/>

CONSTRUCT {
  :config a httpList:Configuration ;
    httpList:reference ?postalAddress .

  ?postalAddress a httpList:Reference ;
    httpList:fileUri ?url ;
    httpList:fileName ?fileName .
}
WHERE {
  ?postalAddress a schema:PostalAddress ;
    schema:streetAddress ?streetAddress ;
    schema:addressCountry ?country .
  OPTIONAL {
    ?postalAddress schema:postalCode ?postalCode .
  }
  OPTIONAL {
    ?postalAddress schema:addressLocality ?city .
  }
  # Prefer the postal code; fall back to the city if no postal code is given.
  BIND (if(bound(?postalCode), concat("postalcode=", encode_for_uri(?postalCode)),
        if(bound(?city), concat("city=", encode_for_uri(?city)), "")) AS ?cityParam)
  BIND (concat("http://nominatim.openstreetmap.org/search?format=json&limit=1&street=",
               encode_for_uri(?streetAddress),
               "&", ?cityParam,
               "&countrycodes=",
               lcase(?country)
              ) AS ?url)
  # Use the trailing number of the address IRI as the file name.
  # The reluctant ".*?" keeps all trailing digits, not just the last one.
  BIND (replace(str(?postalAddress), "^.*?(\\d+)$", "$1") AS ?fileName)
}
Configuration for the HTTP GET list component is an instance of httpList:Configuration that refers to one or more instances of httpList:Reference, each of which is a resource retrievable from the URL given by the httpList:fileUri property. Moreover, in order to pair the obtained geo-coordinates with the geocoded postal addresses, we pass in the postal address identifier via the httpList:fileName property to serve as the file name of the response produced by the HTTP GET list component.
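The URL construction performed by the query above can be mirrored outside of LP-ETL, for instance to check what a generated request looks like. The following is a minimal Python sketch, not a part of the pipeline. The helper names build_query_url and file_name_for are hypothetical, and so is any address you pass in; the parameters follow the mapping described in this section.

```python
import re
from urllib.parse import quote, urlencode

# Endpoint URL as given in the tutorial.
NOMINATIM_SEARCH = "http://nominatim.openstreetmap.org/search"


def build_query_url(street, country, postal_code=None, city=None):
    """Build a structured Nominatim query URL (hypothetical helper)."""
    params = {"format": "json", "limit": "1", "street": street}
    # Prefer the postal code; fall back to the city when it is missing.
    if postal_code:
        params["postalcode"] = postal_code
    elif city:
        params["city"] = city
    params["countrycodes"] = country.lower()
    # quote_via=quote percent-encodes spaces as %20 instead of "+".
    return NOMINATIM_SEARCH + "?" + urlencode(params, quote_via=quote)


def file_name_for(address_iri):
    """Return the trailing digits of an address IRI, or None if there are none."""
    match = re.search(r"(\d+)$", address_iri)
    return match.group(1) if match else None
```

Calling build_query_url with a street, a country code and a postal code yields a URL containing the street, postalcode and countrycodes parameters, while file_name_for keeps only the trailing number of an address IRI, which is the value the pipeline passes on as the file name.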
Note that Nominatim has a restrictive usage policy, so you should use it only for geocoding a few addresses. If you need to geocode many addresses in bulk, there are better solutions available. Hence, when querying Nominatim via the HTTP GET list component, it is advisable to use only a single thread, which is the default setting, to avoid overloading the service.
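To illustrate what single-threaded, polite querying amounts to, here is a small Python sketch that issues requests one at a time with a pause between them. It is an illustration only: the function fetch_sequentially is hypothetical, the fetch argument is a caller-supplied function that retrieves one URL, and the one-second default pause is an assumption that you should check against Nominatim's current usage policy.

```python
import time


def fetch_sequentially(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL in order, pausing between requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(delay_seconds)
        results[url] = fetch(url)
    return results
```

The sleep parameter exists only so that the pause can be replaced in tests; in normal use the default time.sleep applies.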
Map JSON to RDF
Nominatim responds with JSON containing matches found for the generated queries. For instance, this is a response to one of our queries:
In order to map the JSON responses to RDF we can use JSON-LD, a JSON-based syntax for RDF. The JSON to JSON-LD component allows us to interpret any JSON as RDF given a JSON-LD context. A JSON-LD context maps attributes in JSON to RDF properties. Let's have a look at the context that we use in our example:
Nominatim's response contains a lot of data. However, we are interested only in a few parts of the response, namely the latitude and longitude. Consequently, we map the lat and lon attributes to properties from Schema.org, namely schema:latitude and schema:longitude. Since Nominatim outputs geo-coordinates as strings, we cast the values of these attributes to numbers by setting the @type of the RDF properties to xsd:decimal. A JSON-LD context allows us to use compact IRIs like xsd:decimal if we declare their prefixes, so we define xsd as http://www.w3.org/2001/XMLSchema# and schema as http://schema.org/. In order to produce valid RDF, every other JSON attribute is interpreted as a local name of a property in the namespace http://localhost/ set via @vocab. The choice of namespace is insignificant in this case because we filter out this part of the data further on.
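The mapping described above can be sketched in Python. The CONTEXT dictionary below is a reconstruction from the description in this section, not necessarily the verbatim context used in the pipeline, and extract_coordinates is a hypothetical helper that mimics the effect of the xsd:decimal coercion on the first match of a response. It relies only on the fact that Nominatim returns a JSON array of matches with lat and lon given as strings.

```python
import json
from decimal import Decimal

# Sketch of a JSON-LD context matching the description above
# (an assumption, not the verbatim context of the pipeline).
CONTEXT = {
    "@vocab": "http://localhost/",
    "schema": "http://schema.org/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "lat": {"@id": "schema:latitude", "@type": "xsd:decimal"},
    "lon": {"@id": "schema:longitude", "@type": "xsd:decimal"},
}


def extract_coordinates(response_text):
    """Return (latitude, longitude) of the first match as Decimals, or None."""
    matches = json.loads(response_text)
    if not matches:
        return None  # Nominatim found nothing for the query
    best = matches[0]
    # The strings are converted to decimals, as the context's @type requests.
    return Decimal(best["lat"]), Decimal(best["lon"])
```

Serializing CONTEXT with json.dumps and wrapping it in an object under the "@context" key gives a JSON-LD context of the kind the JSON to JSON-LD component expects.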
We use the Hydra Core Vocabulary to wrap Nominatim's response. Each returned match is treated as a hydra:member of a hydra:Collection. We therefore configure the JSON to JSON-LD component to use hydra:member as the data predicate and hydra:Collection as the root entity type. Note that the component does not support compact IRIs, so we need to refer to the terms from the Hydra Core Vocabulary by using their absolute IRIs. Additionally, the JSON to JSON-LD component can include the file name of its input file in the JSON-LD it produces. We use the dbo:filename property from the DBpedia ontology to associate Nominatim's response with the file name identifying the geocoded postal address. Let's see the configuration of the JSON to JSON-LD component:
Since JSON-LD is an RDF serialization, it can be read as RDF. We employ the Files to RDF chunked component to convert each JSON-LD file to an RDF chunk. Chunks split RDF data into smaller parts, each of which can be handled separately for efficient processing. We transform each RDF chunk via a SPARQL CONSTRUCT query executed by the SPARQL CONSTRUCT chunked component. The query throws away all data besides the geo-coordinates, which it wraps as an instance of schema:GeoCoordinates. In line with Schema.org, geo-coordinates are linked to a schema:Place that also links the geocoded postal address. We use the following SPARQL CONSTRUCT query to extract the geo-coordinates:
PREFIX :      <http://schema.org/>
PREFIX dbo:   <http://dbpedia.org/ontology/>
PREFIX hydra: <http://www.w3.org/ns/hydra/core#>

CONSTRUCT {
  ?place :address ?postalAddress ;
    :geo ?geoCoordinates .

  ?geoCoordinates a :GeoCoordinates ;
    :latitude ?latitude ;
    :longitude ?longitude .
}
WHERE {
  [] dbo:filename ?fileName ;
    hydra:member [
      :latitude ?latitude ;
      :longitude ?longitude
    ] .
  BIND ("http://example.com/resource/" AS ?ns)
  BIND (iri(concat(?ns, "place/", ?fileName)) AS ?place)
  BIND (iri(concat(?ns, "geo-coordinates/", ?fileName)) AS ?geoCoordinates)
  BIND (iri(concat(?ns, "postal-address/", ?fileName)) AS ?postalAddress)
}
We reconstruct the IRIs of the geocoded postal addresses from the file names given to Nominatim's responses. schema:GeoCoordinates assumes that the coordinates follow the WGS 84 coordinate reference system, which matches the system used by Nominatim. If we obtained the geo-coordinates in a different coordinate reference system, we would be able to reproject them to a desired reference system by using the Geotools component.
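The way the query derives the three IRIs from a file name and attaches the coordinates can also be followed in a short Python sketch that emits N-Triples lines. This is only an illustration of the IRI scheme used in the query above; the function place_triples is hypothetical and any coordinate values passed to it are made up by the caller.

```python
from decimal import Decimal

NS = "http://example.com/resource/"  # namespace used in the query above
SCHEMA = "http://schema.org/"
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def place_triples(file_name, latitude, longitude):
    """Return N-Triples lines linking a place, its address and its coordinates."""
    place = f"<{NS}place/{file_name}>"
    geo = f"<{NS}geo-coordinates/{file_name}>"
    address = f"<{NS}postal-address/{file_name}>"

    def decimal_literal(value):
        # Fixed-point formatting avoids exponent notation in the literal.
        return f'"{format(Decimal(value), "f")}"^^<{XSD_DECIMAL}>'

    return [
        f"{place} <{SCHEMA}address> {address} .",
        f"{place} <{SCHEMA}geo> {geo} .",
        f"{geo} <{RDF_TYPE}> <{SCHEMA}GeoCoordinates> .",
        f"{geo} <{SCHEMA}latitude> {decimal_literal(latitude)} .",
        f"{geo} <{SCHEMA}longitude> {decimal_literal(longitude)} .",
    ]
```

Because all three IRIs share the same file name, the generated postal address IRI only matches a source address if the source data uses the same namespace and numbering.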
Finally, we merge the RDF chunks via the Chunked merger component. Subsequently, you can use the Union component to combine the geo-coordinates with the source addresses or you can push them to an RDF store. If we merge the geo-coordinates with the input addresses, we get this data:
The example pipeline for geocoding with Nominatim can be found here.