SPARQL endpoint chunked list

Extractor; allows the user to extract RDF triples from a SPARQL endpoint using a series of CONSTRUCT queries. It is especially suitable for querying multiple SPARQL endpoints. It is also suitable for bigger data, as it queries the endpoints for descriptions of a limited number of entities at a time, creating a separate RDF data chunk from each query.

Number of threads to use
The total number of threads to be used for querying.
Query time limit in seconds (-1 for no limit)
Some SPARQL endpoints may hang on a query for a long time. It is therefore sometimes desirable to limit the time spent waiting for an answer, so that the whole pipeline execution does not get stuck.
Encode invalid IRIs
Some SPARQL endpoints, such as DBpedia, contain invalid IRIs, which are then sent in the results of SPARQL queries. Some libraries, such as RDF4J, can crash on those IRIs. If this is the case, choose this option to encode such invalid IRIs.
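For illustration (the resource IRI below is hypothetical), this is the kind of fix the option applies: characters not allowed in IRIs, such as spaces, get percent-encoded.

<http://dbpedia.org/resource/Some entity>    # invalid: contains an unencoded space
<http://dbpedia.org/resource/Some%20entity>  # after encoding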
Fix missing language tags on rdf:langString literals
Some SPARQL endpoints, such as DBpedia, contain rdf:langString literals without language tags, which is invalid in RDF 1.1. Some libraries, such as RDF4J, can crash on those literals. If this is the case, choose this option to fix the problem by replacing the rdf:langString datatype with xsd:string on such literals.
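For example, the first literal below lacks the language tag that the rdf:langString datatype requires; with this option enabled, it is rewritten to the second form:

"Renewable energy"^^rdf:langString   # invalid in RDF 1.1: no language tag
"Renewable energy"^^xsd:string       # after the fix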


Inputs and outputs

Configuration (input): RDF single graph
Files (input): lists of values in CSV files for each task
Tasks (input): RDF single graph
Output: RDF chunked
Report: RDF single graph

The SPARQL endpoint chunked list component queries a list of remote SPARQL endpoints using SPARQL CONSTRUCT queries. Typical scenarios include discovery tasks, such as determining which classes are used in which endpoints.

On the Tasks input, the component expects a list of tasks specifying endpoints and queries. This chunked version of the component is suitable for bigger data, which needs to be queried in parts. The typical use case is getting descriptions of a large number of entities of the same type, which would be too big to retrieve in one query.

On the Files input, the component expects CSV files containing columns with headers. The column headers are the names of variables to be used in the SPARQL queries, and the rows contain the lists of values assigned to those variables. Each list is split into pieces according to the Chunk size parameter of the task specification, and each piece is inserted into the query as a VALUES clause in place of the ${VALUES} placeholder, forming one RDF data chunk on the output. The input list of values can be created either manually or using the SPARQL endpoint select, SPARQL select multiple, or SPARQL endpoint select scrollable cursor components.
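As a sketch of the substitution (the variable name and values are illustrative), consider a CSV file with a single Class column and a Chunk size of 2:

Class
http://xmlns.com/foaf/0.1/Person
http://www.w3.org/2004/02/skos/core#Concept
http://xmlns.com/foaf/0.1/Document

The first two rows form one chunk, so for the first query the ${VALUES} placeholder is replaced by a clause along these lines (the exact serialization generated by the component may differ slightly):

VALUES (?Class) {
  (<http://xmlns.com/foaf/0.1/Person>)
  (<http://www.w3.org/2004/02/skos/core#Concept>)
}

The remaining row then forms a second chunk and is queried for separately.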

The Output contains the collected results as chunks; one task produces multiple chunks, according to the runtime configuration and the corresponding input CSV file. The Report output contains potential error messages encountered when querying the SPARQL endpoints.

Tasks specification

Below you can see a sample task specification for the component. This task queries a given endpoint for a list of resources of specific classes. The list of class IRIs is given in the specified CSV file.

@prefix sel: <> .

<urn:uuid:0b6d0abb-5040-4511-8a05-f73d95bff5fc> a sel:Task;
  sel:chunkSize "1";
  sel:endpoint "http://sparql.reegle.info/";
  sel:fileName "http___sparql_reegle_info_.csv";
  sel:query """
PREFIX adhoc: <>
CONSTRUCT {
  ?subj adhoc:resource ?resource ;
        adhoc:class ?Class ;
        adhoc:endpointUri "http://sparql.reegle.info/" .
}
WHERE {
  {
    SELECT ?resource ?Class
    WHERE {
      ${VALUES}
      ?resource a ?Class .
    }
    LIMIT 5
  }
  BIND(UUID() AS ?subj)
}
""" .

The provided CSV file contains a single Class column listing the class IRIs; only the Class column/variable is used in the query.
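For illustration (the class IRIs below are hypothetical), such a file could look like this:

Class
http://www.w3.org/ns/dcat#Dataset
http://www.w3.org/2004/02/skos/core#ConceptScheme

With sel:chunkSize "1", each row forms its own chunk, so every query execution receives a VALUES clause with exactly one class IRI.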