In this tutorial, we will walk through the process of preparing a schedulable pipeline for loading data from an external data source to a Wikibase instance such as Wikidata, using LinkedPipes ETL.
This was made possible thanks to the Wikimedia Foundation Project Grant Wikidata & ETL.
We will demonstrate the process on a pipeline loading basic data about Czech remarkable trees to Wikidata.
Note that the tutorial also provides pipeline examples, which can be easily imported into a LinkedPipes ETL instance.
0. Prerequisites
Before we start creating the pipeline, we will go through the prerequisites for the process.
For the creation of the pipeline, we need the following; the values used in this tutorial are listed with each item:
URL of the target Wikibase instance and its Wikibase API (api.php)
https://www.wikidata.org/w/api.php
URL of the SPARQL endpoint of the query service of the target Wikibase instance
https://query.wikidata.org/sparql
An account with write access to that instance, preferably with a bot flag and a bot password associated with that account
In our case, we will work with data from the Digital Register of the Nature Conservancy Central Register, accessible as a web portal and through a rather user-unfriendly API that returns a CSV file.
The exact process of getting to the CSV file is out of the scope of this tutorial, but it can be seen in our first pipeline fragment, together with the Tabular component, which transforms the data from CSV to RDF according to the CSV on the Web standard.
Header and one row of the CSV file
Kód;Starý kód;Typ objektu;Název;Datum vyhlášení;Datum zrušení;Ochranné pásmo - Typ;Ochranné pásmo - Popis;Počet vyhlášený;Počet skutečný;Poznámka;Souřadnice;Způsob určení souřadnic;Původ souřadnic;Kraj;Okres;Obec s rozšířenou působností;Dat. účinnosti nejnovějšího vyhl. předpisu;Dat vydání nejnovějšího vyhl. předpisu
"100001";"811054";"Jednotlivý strom";"Klen nad Českou Vsí";"22.09.2004";"";"ze zákona";"";"1";"1";"Na svažité louce při rozhraní obecní zástavby a lučních porostů, cca 150 m jihovýchodně od kostela";"{X:1048550,08, Y:541804,66} ";"určena poloha všech jednotlivých stromů";"doměřeno, odvozeno z mapy nebo leteckých ortofoto snímků";"Olomoucký";"Olomoucký - Jeseník";"Olomoucký - Jeseník - Jeseník";"22.09.2004";"30.08.2004"
Representation of the one row in RDF
2. Linked Open Data representation of the data
The next step is to create a Linked Open Data (LOD) representation of the input data.
Data in this form could be, with some additional effort, published as LOD on the Web of Data.
As all structured data published on the Web should ideally be published as LOD, this will be our starting point for the process of loading this data to Wikidata.
Here, we add a SPARQL construct query, transforming the data to the target LOD vocabularies.
In addition, we use the GeoTools component, which facilitates the conversion of geocoordinates from one projection to another.
Simply put, we transform the source representation of the coordinates ("{X:1048550,08, Y:541804,66} ") using SPARQL to the correct EPSG:5514 representation ("-541804.66 -1048550.08") and then, using GeoTools, to the WGS84 (EPSG:4326) representation ("50.25096013863794 17.224497787399326") required by Wikibase.
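For illustration, the string manipulation part of this step could look roughly like the following SPARQL CONSTRUCT sketch. The predicates urn:sourceCoordinates and urn:coordinatesEpsg5514 are placeholders invented for this example; the query actually used is part of the pipeline fragment.

CONSTRUCT {
  ?tree <urn:coordinatesEpsg5514> ?coords5514 .
}
WHERE {
  ?tree <urn:sourceCoordinates> ?raw .
  # "{X:1048550,08, Y:541804,66} " -> "1048550.08" and "541804.66"
  BIND(REPLACE(STR(?raw), "^\\{X:([0-9]+),([0-9]+), Y:.*$", "$1.$2") AS ?x)
  BIND(REPLACE(STR(?raw), "^\\{X:.*, Y:([0-9]+),([0-9]+)\\}\\s*$", "$1.$2") AS ?y)
  # swap the order and add the minus signs, as in the example above: "-541804.66 -1048550.08"
  BIND(CONCAT("-", ?y, " -", ?x) AS ?coords5514)
}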
In this step, we take the representation of the data about the remarkable trees as LOD and transform it to the Wikibase vocabulary.
In this case, we use the Wikidata properties mentioned in the prerequisites.
We use them to create Items, Statements, Qualifiers, References and Values according to the Wikibase RDF Dump Format.
The transformation is, again, done using a SPARQL construct query.
Note that we mark Items and Statements with a urn:fromSource class, and we store Item labels and descriptions in urn:nameFromSource and urn:descriptionFromSource, respectively.
This will help us in the next steps to determine which data is already in Wikidata and which needs to be added.
Note that some geocoordinates could be missing, because they were missing in the source.
To avoid creating empty geocoordinate values later, we remove those with an additional SPARQL update.
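A heavily simplified sketch of such a CONSTRUCT query is shown below. The predicate urn:usopCode, the example.org IRI pattern and the Czech language tag are assumptions made for this sketch; the complete query is in the attached fragment.

PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p:    <http://www.wikidata.org/prop/>
PREFIX ps:   <http://www.wikidata.org/prop/statement/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

CONSTRUCT {
  ?item a <urn:fromSource> ;
        <urn:nameFromSource> ?label ;
        p:P677 ?statement .
  ?statement a wikibase:Statement, <urn:fromSource> ;
             ps:P677 ?code .
}
WHERE {
  ?tree <urn:usopCode> ?code ;
        rdfs:label ?name .
  BIND(STRLANG(STR(?name), "cs") AS ?label)
  # deterministic placeholder IRIs derived from the ÚSOP code; they are replaced
  # by existing Wikidata IRIs later if a matching Item is found
  BIND(IRI(CONCAT("https://example.org/tree/", STR(?code))) AS ?item)
  BIND(IRI(CONCAT("https://example.org/tree/", STR(?code), "/statement/P677")) AS ?statement)
}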
SPARQL Construct query used for transformation
Representation of one tree using Wikidata RDF Dump Format
4. Querying Wikibase for existing data
In this step, we query the Wikidata Query Service for existing data about Czech remarkable trees.
To identify all Czech remarkable trees, we query for all Items having the property P677 (ÚSOP code) - a unique code used in the source registry to identify the remarkable trees.
In addition, we query for all data (statements) about the remarkable trees that we work with in this pipeline so that we can determine which are already in Wikidata and which need to be created.
Note that we mark Items and Statements with a urn:fromWikiBase class, and we store Item labels and descriptions in urn:nameFromWiki and urn:descriptionFromWikibase, respectively.
This will help us in the next steps to determine which data is already in Wikidata and which needs to be added.
Note that this code is in fact also used for other types of objects, but this does not matter, because we use the result of the query later only to determine which items from the source are already in Wikidata.
Therefore, it does not matter that there are also some additional items in the query result.
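A condensed sketch of such a query against the Wikidata Query Service follows. Restricting labels and descriptions to Czech is an assumption of this sketch; the complete query is part of the pipeline.

PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wdt:    <http://www.wikidata.org/prop/direct/>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <http://schema.org/>

CONSTRUCT {
  ?item a <urn:fromWikiBase> ;
        <urn:nameFromWiki> ?label ;
        <urn:descriptionFromWikibase> ?description ;
        ?p ?statement .
  ?statement a wikibase:Statement, <urn:fromWikiBase> ;
             ?ps ?value .
}
WHERE {
  ?item wdt:P677 ?code .                              # every Item with an ÚSOP code
  OPTIONAL { ?item rdfs:label ?label . FILTER(LANG(?label) = "cs") }
  OPTIONAL { ?item schema:description ?description . FILTER(LANG(?description) = "cs") }
  ?item ?p ?statement .
  ?statement ?ps ?value .
  ?property wikibase:claim ?p ;                       # restrict ?p and ?ps to statement properties
            wikibase:statementProperty ?ps .
}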
We have the data from the data source transformed to the Wikidata RDF Dump Format, and we also have all the data already in Wikidata in the same format from the Query Service.
The next step towards being able to load the updated data into Wikibase is to match the Items from our source data to existing Items from the Query Service.
We do that in three steps.
First, we create owl:sameAs links between Items from the data source and corresponding Items from Wikidata.
Second, we replace the IRIs of the source Items with IRIs of the found corresponding ones from Wikidata.
Third, for those Items not found in Wikidata, we add the loader:New class, marking the Item to be created in Wikidata.
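The core of the matching can be sketched as follows, assuming the ÚSOP code (P677) is the matching key on both sides; the complete queries are in the attached fragments below.

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX p:   <http://www.wikidata.org/prop/>
PREFIX ps:  <http://www.wikidata.org/prop/statement/>

CONSTRUCT {
  ?sourceItem owl:sameAs ?wikidataItem .
}
WHERE {
  # assumes the ÚSOP code literals are directly comparable on both sides
  ?sourceItem   a <urn:fromSource> ;   p:P677/ps:P677 ?code .
  ?wikidataItem a <urn:fromWikiBase> ; p:P677/ps:P677 ?code .
}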
SPARQL Construct query generating owl:sameAs links between corresponding Items
SPARQL Update query attaching create tags to Items to be created in Wikidata
6. Resolving existing Wikidata Statements
For the Items already existing in Wikibase (now tagged with both urn:fromWikiBase and urn:fromSource classes) we need to determine which of their Statements are already in Wikibase and which need to be created.
We do this again in three steps, the same as with Items.
First, we create owl:sameAs links between Statements from the data source and corresponding Statements from Wikidata.
In this instance, we say that two Statements are the same when they use the same Property and have the same Value.
Second, we replace the IRIs of the source Statements with IRIs of the found corresponding ones from Wikidata.
Third, for those Statements not found in Wikidata, we add the loader:New class, marking the Statements to be created in Wikidata.
Note that for P625 - coordinate location we adopt a simpler technique.
We simply check whether there is already a Statement with a set of coordinates in Wikidata for a given Item.
If there already is one, we do not want to overwrite it in Wikidata at this time, so we remove the wikibase:Statement class from the Statement, making it ignored.
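A minimal sketch of this guard as a SPARQL Update, relying on the marker classes introduced earlier, could look like this:

PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p: <http://www.wikidata.org/prop/>

DELETE { ?sourceStatement a wikibase:Statement . }
WHERE {
  ?item a <urn:fromSource>, <urn:fromWikiBase> ;    # Item present both in the source and in Wikidata
        p:P625 ?sourceStatement , ?wikidataStatement .
  ?sourceStatement   a <urn:fromSource> .           # our new coordinate Statement
  ?wikidataStatement a <urn:fromWikiBase> .         # a coordinate Statement already in Wikidata
}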
To deal with labels, we proceed in four steps.
First, we pass the labels found in the Wikidata Query Service, so that we can work with them.
Second, we remove (and therefore ignore) those that are the same as the ones from the data source - there is nothing to do in this case.
Third, following Wikidata best practices, if the source label differs from the one found in Wikidata, we attach the source label as an alias.
Finally, for Items which have no label in Wikidata, we add one.
Note that we follow a similar approach for descriptions, except there are no aliases in descriptions.
Therefore, we add a description only if one is missing in Wikidata.
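For example, the third step - turning a differing source label into an alias - could be sketched as the following SPARQL Update; skos:altLabel is the alias property of the Wikibase RDF Dump Format, and the urn: marker predicates are the ones described above. The queries actually used are in the attached fragments below.

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

DELETE { ?item <urn:nameFromSource> ?sourceLabel . }
INSERT { ?item skos:altLabel ?sourceLabel . }
WHERE {
  ?item <urn:nameFromSource> ?sourceLabel ;
        <urn:nameFromWiki>   ?wikiLabel .
  FILTER (STR(?sourceLabel) != STR(?wikiLabel))
}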
SPARQL Construct query passing Item labels from Wikidata
SPARQL Update query removing labels already present in Wikidata
SPARQL Update query resolving labels which are different from those in Wikidata
SPARQL Update query resolving labels which are missing in Wikidata
8. Loading data into Wikibase
Now we have the data ready to be loaded to Wikidata.
Before we do that, we remove all remaining owl:sameAs links and similar helper tags, although this is not strictly necessary.
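A minimal cleanup sketch in SPARQL Update, assuming the helper vocabulary used throughout this tutorial:

PREFIX owl: <http://www.w3.org/2002/07/owl#>

DELETE WHERE { ?s owl:sameAs ?o } ;
DELETE WHERE { ?s a <urn:fromSource> } ;
DELETE WHERE { ?s a <urn:fromWikiBase> }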
We load the data using the configured Wikibase loader component.
Finally, in the Report output of the Wikibase loader, we can check for any errors encountered while loading.
In addition, in the Output of the loader, we can see the Q IDs of the newly created Items for future use.
The contributions of the bot should also be checked.
Tips, Tricks and Experiences
Here we list some observations we gathered during our project, which may be helpful to others.
Side effects of resuming pipeline executions
Note that LP-ETL provides debugging support in the form of the ability to resume a failed or cancelled pipeline execution.
When loading data to Wikidata, this may have an undesirable side-effect.
When we determine which Items and Statements we want to create, we do so based on the current state of the Wikibase instance.
Once those Items have been created in the Wikibase, we cannot simply change something in the pipeline and run it again without querying the Wikibase again.
If we did, we would create the already created Items once more, producing undesirable duplicates.
Generating complex data values
According to the Wikibase RDF Dump Format, complex values such as wikibase:GlobecoordinateValue consist of more properties than are usually present in the source data available on the Web.
This leads to the need to generate these additional properties, such as wikibase:geoPrecision and wikibase:geoGlobe, as constants, e.g. in SPARQL.
In cases where there is no geocoordinate data in the source, the constants are then generated anyway, leading to undesired empty values.
In that case, an additional SPARQL Update query can be used to clean up the representations with no actual values.
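Such a cleanup can be sketched as the following SPARQL Update, which removes wikibase:GlobecoordinateValue nodes carrying only the generated constants and no actual latitude; the query actually used is attached below.

PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX psv: <http://www.wikidata.org/prop/statement/value/>

DELETE {
  ?statement psv:P625 ?value .
  ?value ?vp ?vo .
}
WHERE {
  ?statement psv:P625 ?value .
  ?value a wikibase:GlobecoordinateValue ;
         ?vp ?vo .
  FILTER NOT EXISTS { ?value wikibase:geoLatitude ?lat }
}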
Sample of wikibase:GlobecoordinateValue in RDF
SPARQL CONSTRUCT query for generating geocoordinates
Sample of wikibase:GlobecoordinateValue with missing properties
SPARQL Update query for cleaning up missing values
Wikidata Query Service lag
When running a Wikidata loading pipeline multiple times, one has to be aware of the current Wikidata Query Service lag.
This is the time required for the updates done to the Wikibase instance to be propagated to the Blazegraph triplestores on top of which the Wikidata Query Service is running.
It may happen that, for example, an Item is created in the Wikibase, but it is not yet propagated to the Blazegraph instance.
When a pipeline runs and queries the Wikidata Query Service, the Item is still missing there, even though it has already been created in the Wikibase.
This would lead to the creation of duplicates.
Therefore, the current maximum lag should be respected as the minimum time interval between making updates to Wikidata and querying the Wikidata Query Service for the updates.
It may also happen that the Wikibase and the Wikidata Query Service are so overloaded that they stop accepting requests from bots completely.
This can then be seen in the logs of the Wikibase loader like this:
2019-11-20 23:32:01,641 [pool-6-thread-1] INFO o.w.w.w.WbEditingAction - We are editing too fast. Pausing for 1396 milliseconds.
2019-11-20 23:32:03,258 [pool-6-thread-1] WARN o.w.w.w.WbEditingAction - [maxlag] Waiting for all: 5.2166666666667 seconds lagged. -- pausing for 5 seconds.
2019-11-20 23:32:08,499 [pool-6-thread-1] WARN o.w.w.w.WbEditingAction - [maxlag] Waiting for all: 5.2166666666667 seconds lagged. -- pausing for 5 seconds.
2019-11-20 23:32:13,707 [pool-6-thread-1] WARN o.w.w.w.WbEditingAction - [maxlag] Waiting for all: 5.2166666666667 seconds lagged. -- pausing for 5 seconds.
2019-11-20 23:32:19,177 [pool-6-thread-1] WARN o.w.w.w.WbEditingAction - [maxlag] Waiting for all: 5.2166666666667 seconds lagged. -- pausing for 5 seconds.
2019-11-20 23:32:24,411 [pool-6-thread-1] WARN o.w.w.w.WbEditingAction - [maxlag] Waiting for all: 5.2166666666667 seconds lagged. -- pausing for 5 seconds.
2019-11-20 23:32:29,412 [pool-6-thread-1] ERROR o.w.w.w.WbEditingAction - Gave up after several retries.
Last error was: org.wikidata.wdtk.wikibaseapi.apierrors.MaxlagErrorException:
[maxlag] Waiting for all: 5.2166666666667 seconds lagged.
In that case, the loader will wait until the lag has disappeared.
Use cases
Using the same method, the following use cases were addressed: