Components

A key building block of a pipeline is a component. There is a library of ready-made components, and this section goes through their configuration. Of course, anyone can add their own component; the easiest way to do that is to copy one of ours and change it. There are five basic types of components. Extractors bring data from outside into the ETL pipeline, e.g. by downloading a file from the Web. Transformers transform data already in the pipeline, and the result stays in the pipeline. Quality Assessment components verify that data in the pipeline meets defined criteria. Loaders take data produced by the pipeline and place it outside, e.g. by uploading it to a web server or database. Special purpose components do something else, e.g. run an external process.

Component types

The icons next to the components in the component list represent their type. Currently, we have five types of components.

Component list

  • Extractors

    Bring data from the outside into the pipeline

  • Transformers

    Take data from the pipeline, change it and pass it on

  • Quality Assessment

    Check data passing through the pipeline and stop execution in case of problems

  • Loaders

    Take data from the pipeline and load it to an external place

  • Special

    They do something else, e.g. execute a remote command

List of available components

DCAT-AP Dataset

Provides a form to fill in DCAT-AP v1.1 Dataset metadata

Extractor, provides a form to fill in DCAT-AP v1.1 Dataset metadata.

DCAT-AP Distribution

Provides a form to fill in DCAT-AP v1.1 Distribution metadata

Extractor, provides a form to fill in DCAT-AP v1.1 Distribution metadata.

Files from local

Allows files in the local file system to be used in the pipeline.

Extractor, allows files in the local file system to be used in the pipeline.

Path to a local file or directory
Copies a file or directory on a local file system to be accessible by the pipeline
Extract from FTP

Downloads files from the Web using FTP.

Extractor, allows the user to download a file from the web using FTP.

Use passive mode
Determines whether the component should use the passive mode
HTTP get

Downloads files from the Web using HTTP.

Extractor, allows the user to download a file from the web.

File URL
URL from which the file should be downloaded
File name
File name under which the file will be visible in the pipeline
Force to follow redirects
The component will follow HTTP 3XX redirects
User agent
The user agent string that should be present in the HTTP get headers. Defaults to the default Java user agent string.
HTTP get list

Downloads a list of files from the Web using HTTP.

Extractor, allows the user to download a list of files from the web.

Number of threads used for download
This component supports parallel download using multiple threads. This property sets the thread pool size.
Force to follow redirects
The component will follow HTTP 3XX redirects
Skip on errors
When enabled and one of the files in the list fails to download, the component skips it and downloads the rest of the list instead of failing
Pipeline input

Passes files from an HTTP POST pipeline execution request

Extractor, allows the user to pass files from an HTTP POST pipeline execution request.

SPARQL Endpoint

Extracts RDF triples from a SPARQL endpoint.

Extractor, allows the user to extract RDF triples from a SPARQL endpoint using a CONSTRUCT query.

Endpoint
URL of the SPARQL endpoint to be queried
MIME type
MIME type to be used in the Accept header of the HTTP request made by RDF4J to the remote endpoint
SPARQL CONSTRUCT query
Query for extraction of triples from the endpoint
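For illustration, a minimal query for this field might look like the following sketch; the SKOS vocabulary and the chosen properties are only an example, nothing the component requires:

    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

    CONSTRUCT {
      ?concept a skos:Concept ;
        skos:prefLabel ?label .
    }
    WHERE {
      ?concept a skos:Concept ;
        skos:prefLabel ?label .
    }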
SPARQL Endpoint chunked

Extracts RDF triples from a SPARQL endpoint based on a list of entities to extract. Suitable for bigger data.

Extractor, allows the user to extract RDF triples from a SPARQL endpoint using a series of CONSTRUCT queries based on an input list of entity IRIs. It is suitable for bigger data, as it queries the endpoint for descriptions of a limited number of entities at once, creating a separate RDF data chunk from each query.

Endpoint URL
URL of the SPARQL endpoint to be queried
MimeType
Some SPARQL endpoints (e.g. DBpedia) return invalid RDF data with the default MIME type and it is necessary to specify another one here. This is sent in the SPARQL query HTTP request Accept header.
Chunk size
Number of entities to be queried for at once. This is done by using the VALUES clause in the SPARQL query, so this number corresponds to the number of items in the VALUES clause.
Default graph
IRIs of the graphs to be passed as default graphs for the SPARQL query
SPARQL CONSTRUCT query with ${VALUES} placeholder
Query for extraction of triples from the endpoint. At the end of its WHERE clause, it should have the ${VALUES} placeholder. This is where the VALUES clause listing the current entities to query is inserted.
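A sketch of such a query follows; the SKOS vocabulary is only an example, and the variable bound by the inserted VALUES clause (assumed here to be ?s) has to match the one used in the query:

    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

    CONSTRUCT {
      ?s skos:prefLabel ?label .
    }
    WHERE {
      ?s skos:prefLabel ?label .
      ${VALUES}
    }

At execution time, ${VALUES} is replaced for each chunk by a clause such as VALUES ?s { <http://example.org/entity/1> <http://example.org/entity/2> }.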
SPARQL endpoint chunked list

Extracts RDF triples from SPARQL endpoints based on a list of endpoints and queries to execute, and a list of entities to query for.

Extractor, allows the user to extract RDF triples from a SPARQL endpoint using a series of CONSTRUCT queries. It is especially suitable for querying multiple SPARQL endpoints. It is also suitable for bigger data, as it queries the endpoints for descriptions of a limited number of entities at once, creating a separate RDF data chunk from each query.

Number of threads to use
Number of threads to be used for querying in total.
Query time limit in seconds (-1 for no limit)
Some SPARQL endpoints may hang on a query for a long time. Sometimes it is therefore desirable to limit the time waiting for an answer so that the whole pipeline execution is not stuck.
Encode invalid IRIs
Some SPARQL endpoints such as DBpedia contain invalid IRIs which are sent in results of SPARQL queries. Some libraries like RDF4J then can crash on those IRIs. If this is the case, choose this option to encode such invalid IRIs.
Fix missing language tags on rdf:langString literals
Some SPARQL endpoints such as DBpedia contain rdf:langString literals without language tags, which is invalid RDF 1.1. Some libraries like RDF4J then can crash on those literals. If this is the case, choose this option to fix this problem by replacing rdf:langString with xsd:string datatype on such literals.
SPARQL endpoint construct scrollable cursor

Extracts RDF triples from a SPARQL endpoint according to a SPARQL CONSTRUCT query, using the scrollable cursor technique for OpenLink Virtuoso.

Extractor, allows the user to extract RDF triples from a SPARQL endpoint using a CONSTRUCT query, using the scrollable cursor technique for OpenLink Virtuoso.

Endpoint URL
URL of the SPARQL endpoint to be queried
Page size
Specifies the size of one page to be returned
Default graph
IRIs of default graphs used for the SPARQL query
Query prefixes
Prefixes to be defined for the scrollable cursor query
Outer construct clause
The CONSTRUCT clause of the outer paging query. It should use a subset of the variables returned by the SELECT clause of the inner query
Inner select query
Ordered SELECT query to be used as a nested query in the outer paging query
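To illustrate how the query fields fit together, here is a sketch with example values; the rdfs:label property is only an example, and the exact syntax each field expects should be checked against the component dialog. The component wraps the ordered inner query as a subquery of an outer paging query and pages through it using the configured page size:

    Query prefixes:
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    Outer construct clause:
      CONSTRUCT { ?s rdfs:label ?label . }

    Inner select query:
      SELECT ?s ?label
      WHERE { ?s rdfs:label ?label . }
      ORDER BY ?s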
SPARQL Endpoint list to single graph

Extracts RDF triples from SPARQL endpoints based on a list of endpoints and queries to execute.

Extractor, allows the user to extract RDF triples from a SPARQL endpoint using a series of CONSTRUCT queries. It is especially suitable for querying multiple SPARQL endpoints.

Number of threads to use
Number of threads to be used for querying in total.
Query time limit in seconds (-1 for no limit)
Some SPARQL endpoints may hang on a query for a long time. Sometimes it is therefore desirable to limit the time waiting for an answer so that the whole pipeline execution is not stuck.
Encode invalid IRIs
Some SPARQL endpoints such as DBpedia contain invalid IRIs which are sent in results of SPARQL queries. Some libraries like RDF4J then can crash on those IRIs. If this is the case, choose this option to encode such invalid IRIs.
Fix missing language tags on rdf:langString literals
Some SPARQL endpoints such as DBpedia contain rdf:langString literals without language tags, which is invalid RDF 1.1. Some libraries like RDF4J then can crash on those literals. If this is the case, choose this option to fix this problem by replacing rdf:langString with xsd:string datatype on such literals.
Limit number of tasks running in parallel in a group (0 for no limit)
To avoid overloading a single SPARQL endpoint, set a maximum number of tasks running in parallel for a group. Then, place all tasks targeting a single endpoint to a single group. Tasks in different groups will still run in parallel in the number of threads specified.
Query commit size (0 to commit all triples at once)
This is used to control the number of triples committed to a repository at once. Limiting the number may be necessary for large query results.
SPARQL endpoint select

Extracts a CSV table from a SPARQL endpoint according to a SPARQL SELECT query

Extractor, allows the user to extract a CSV table from a SPARQL endpoint using a SELECT query.

Endpoint URL
URL of the SPARQL endpoint to be queried
File name
File name of the resulting CSV table
Default graph
IRIs of default graphs used for the SPARQL query
SPARQL SELECT query
Query for extraction of CSV from the endpoint
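For example, a query of the following shape produces a two-column CSV file, with the projected variables as the column headers; the SKOS vocabulary is only an example:

    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

    SELECT ?concept ?label
    WHERE {
      ?concept a skos:Concept ;
        skos:prefLabel ?label .
    }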
SPARQL endpoint select scrollable cursor

Extracts a CSV table from a SPARQL endpoint according to a SPARQL SELECT query, using the scrollable cursor technique for OpenLink Virtuoso.

Extractor, allows the user to extract a CSV table from a SPARQL endpoint using a SELECT query, using the scrollable cursor technique for OpenLink Virtuoso.

Endpoint URL
URL of the SPARQL endpoint to be queried
File name
File name of the resulting CSV table
Page size
Specifies the size of one page to be returned
Default graph
IRIs of default graphs used for the SPARQL query
Query prefixes
Prefixes to be defined for the scrollable cursor query
Outer select clause
The SELECT clause of the outer paging query. It should be a subset of the SELECT clause of the inner query
Inner select query
Ordered SELECT query to be used as a nested query in the outer paging query
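Analogously to the CONSTRUCT variant above, a sketch with example field values follows; the outer clause projects a subset of the variables returned by the ordered inner query, and the SKOS vocabulary is only an example:

    Query prefixes:
      PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

    Outer select clause:
      SELECT ?concept ?label

    Inner select query:
      SELECT ?concept ?label
      WHERE { ?concept skos:prefLabel ?label . }
      ORDER BY ?concept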
Text holder

Inputs the given text as a file.

Extractor, allows the user to input any text as a file into the pipeline.

Output file name
File name under which the file will be visible in the pipeline
File content
Text content of the file
VoID Dataset

Provides a form to fill in VoID Dataset metadata properties.

Extractor, provides a form to fill in VoID Dataset metadata properties.

Get distribution IRI from input
Gets the distribution IRI from the input; the output of DCAT-AP Distribution is expected.
Distribution IRI
When used independently, distribution IRI can be set here.
Example resource URLs
URLs of example resources.
SPARQL Endpoint URL
URL of a SPARQL endpoint through which the data is accessible.
Copy download URLs to void:dataDump
In DCAT, the link to download a file is dcat:downloadURL. This copies it also to void:dataDump for VoID enabled applications.
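The effect of this option roughly corresponds to the following SPARQL update; this is only an illustration of the copying, not the component's implementation, and the exact shape of the produced metadata may differ:

    PREFIX dcat: <http://www.w3.org/ns/dcat#>
    PREFIX void: <http://rdfs.org/ns/void#>

    INSERT { ?dataset void:dataDump ?downloadUrl . }
    WHERE {
      ?dataset dcat:distribution ?distribution .
      ?distribution dcat:downloadURL ?downloadUrl .
    }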
CouchDB loader

Allows the user to load documents into Apache CouchDB

Loader, allows the user to load documents into Apache CouchDB.

CouchDB server URL
URL of the CouchDB server, e.g. http://127.0.0.1:5984
Database name
Name of the CouchDB database to write to
Clear database (delete and create)
Determines whether the CouchDB database is first deleted and recreated before loading
Maximum batch size in MB
The loader can split the load into batches of the specified size
Use authentication
Use for CouchDB requiring user authentication
User name
User name for CouchDB
Password
Password for CouchDB
DCAT-AP to CKAN

Loads DCAT-AP v1.1 metadata to CKAN

Loader, loads DCAT-AP v1.1 metadata to a CKAN catalog.

CKAN Action API URL
URL of the CKAN API of the target catalog, e.g. https://linked.opendata.cz/api/3/action
CKAN API Key
The API key allowing write access to the CKAN catalog. This can be found on the user detail page in CKAN.
CKAN dataset ID
The CKAN dataset ID to be used for the current dataset
Load language RDF language tag
Since DCAT-AP supports multilinguality and CKAN does not, this specifies the RDF language tag of the language, in which the metadata should be loaded.
Profile
CKAN can be extended with additional metadata fields for datasets (packages) and distributions (resources). If the target CKAN is not extended, choose Pure CKAN.
Generate CKAN resources from VoID example resources
If the input metadata contains void:exampleResource, create a CKAN resource from it.
Generate CKAN resource from VoID SPARQL endpoint
If the input metadata contains void:sparqlEndpoint, create a CKAN resource from it.
Files to local

Allows files from the pipeline to be stored on a local file system.

Loader, allows files from the pipeline to be stored on a local file system.

Target path
Copies a file or directory to a local file system from the pipeline
Permissions to set to files
Linux file system permissions in text form to be set to output files, e.g. rwxrwxr--
Permissions to set to directories
Linux file system permissions in text form to be set to output directories, e.g. rwxrwxr-x
Files to SCP

Loads files via the SCP protocol.

Loader, allows the user to transfer a file to a server via the SCP protocol.

Host address
Address of the target server for the file upload. E.g. obeu.vse.cz
Port number
SSH port to use. Typically 22
User name
User name for the server
Password
Password for the server. Note that this is plain text and not secure in any way.
Target directory
Directory on the server where the file should be uploaded, e.g. ~/upload/
Create the target directory
When checked, tries to create the target directory, when it is not present on the server. Fails otherwise.
Clear target directory
When checked, empties the target directory before loading input files.
Load to FTP

Loads files via the FTP protocol.

Loader, allows the user to transfer a file to a server via the FTP protocol.

Host address
Address of the target server for the file upload. E.g. myftp.com
Port number
FTP port to use. Typically 21.
User name
User name for the server
Password
Password for the server. Note that this is plain text and not secure in any way.
Target directory
Directory on the server where the file should be uploaded, e.g. ~/upload/
Retry count
Number of retries in cases of connection failures.
Graph store protocol

Allows the user to store RDF triples in a SPARQL endpoint using the SPARQL Graph Store HTTP Protocol.

Loader, allows the user to store RDF triples in a SPARQL endpoint using the SPARQL 1.1 Graph Store HTTP Protocol.

Repository
Selection of the target repository type. This is due to different behavior among implementations
Graph Store protocol endpoint
URL of the graph store protocol endpoint
Target graph IRI
IRI of the RDF graph to which the input data will be loaded
Clear target graph before loading
When checked, replaces the target graph by the loaded data
Log graph size
When checked, the target graph size before and after the load is logged
User name
User name to be used to log in to the SPARQL endpoint
Password
Password to be used to log in to the SPARQL endpoint
DCAT-AP to datahub.io

Loads DCAT-AP v1.1 metadata to the datahub.io CKAN, from which the LOD cloud diagram is generated.

Loader, loads DCAT-AP v1.1 metadata to the datahub.io CKAN, from which the LOD cloud diagram is generated. Helps you with all the additional required metadata.

Datahub.io API Key
The API key allowing write access to the datahub.io CKAN catalog. This can be found on the user detail page.
CKAN dataset ID
The CKAN dataset ID to be used for the current dataset
CKAN owner organization ID
ID of your datahub.io CKAN organization, e.g. 9046f134-ea81-462f-aae3-69854d34fc96
CKAN license ID
ID of the CKAN license, which is different from the license URL specified in DCAT-AP metadata
Additional metadata
Follows the additional required metadata document
Solr loader

Allows the user to load documents into Apache Solr

Loader, allows the user to load documents into Apache Solr.

URL of Solr instance
URL of the Solr server, e.g. http://localhost:8983/solr
Name of Solr Core
Name of the Solr Core to write to
Clear target core
Determines whether the Solr core is cleared before loading
Use authentication
Use for Solr requiring user authentication
User name
User name for Solr
Password
Password for Solr
SPARQL endpoint

Allows the user to store RDF triples in a SPARQL endpoint using SPARQL Update.

Loader, allows the user to store RDF triples in a SPARQL endpoint using SPARQL 1.1 Update.

SPARQL Update endpoint
URL of the SPARQL endpoint accepting SPARQL 1.1 Update queries
Target graph IRI
IRI of the RDF graph to which the input data will be loaded
Clear target graph before loading
When checked, executes SPARQL CLEAR GRAPH on the target graph before loading
Commit size
Number of triples sent to triple store at once using the RDF4J RepositoryConnection
User name
User name to be used to log in to the SPARQL endpoint
Password
Password to be used to log in to the SPARQL endpoint
SPARQL endpoint chunked

Allows the user to store chunked RDF data in a SPARQL endpoint using SPARQL Update.

Loader, allows the user to store chunked RDF data in a SPARQL endpoint using SPARQL 1.1 Update.

SPARQL Update endpoint
URL of the SPARQL endpoint accepting SPARQL 1.1 Update queries
Target graph IRI
IRI of the RDF graph to which the input data will be loaded
Clear target graph before loading
When checked, executes SPARQL CLEAR GRAPH on the target graph before loading
User name
User name to be used to log in to the SPARQL endpoint
Password
Password to be used to log in to the SPARQL endpoint
Load to Wikibase

Loads RDF data to Wikibase.

Loader, allows the user to load RDF data using the Wikibase RDF Dump Format to a Wikibase instance. There is a whole tutorial on Loading data to Wikibase available.

Wikibase API Endpoint URL (api.php)
This is the URL of the api.php in the target Wikibase instance. For example https://www.wikidata.org/w/api.php
Wikibase Query Service SPARQL Endpoint URL
This is the URL of the SPARQL Endpoint of the Wikibase Query Service containing data from the target Wikibase instance. For example https://query.wikidata.org/sparql
Any existing property from the target Wikibase instance
e.g. P1921 for Wikidata. This is necessary, so that the used library is able to determine data types of properties used in the Wikibase.
Wikibase ontology IRI prefix
This is the common part of the IRI prefixes used by the target Wikibase instance. For example http://www.wikidata.org/. This can be determined by looking at an RDF representation of the Q1 entity in Wikidata, specifically the wd: and similar prefixes.
User name
User name for the Wikibase instance.
Password
Password for the Wikibase instance.
Average time per edit in ms
This is used by the wrapped Wikidata Toolkit to pace the API calls.
Use strict matching (string-based) for value matching
Wikibase distinguishes among various string representations of the same number, e.g. 1 and 01, whereas in RDF, all those representations are considered equivalent and interchangeable. When enabled, the textual representations are considered different, which may lead to unnecessary duplicates.
Skip on error
When a Wikibase API call fails and this is enabled, the component continues its execution. Errors are logged in the Report output.
Retry count
Number of retries in case of IOException, i.e. not for all errors.
Retry pause
Time between individual requests in case of failure.
Create item message
MediaWiki edit summary message when creating items.
Update item message
MediaWiki edit summary message when updating items.
SPARQL ask

Checks data with a SPARQL ASK query and stops pipeline execution on success or failure.

Quality assessment component, checks data with an ASK query and stops pipeline execution on success or failure.

Fail on ASK success
When checked, the component stops pipeline execution when the ASK query returns true; when unchecked, it stops pipeline execution when the ASK query returns false
SPARQL ASK query
Query for the data check
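For example, with Fail on ASK success checked, a query like the following stops the pipeline whenever some skos:Concept lacks a skos:prefLabel; the vocabulary and the constraint are only an example:

    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

    ASK {
      ?concept a skos:Concept .
      FILTER NOT EXISTS { ?concept skos:prefLabel ?label }
    }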
Bing translator

Translates literals using Bing machine translation.

Transformer, allows the user to automatically translate literals using Bing machine translation.

Subscription key
This is the subscription key needed to access the Bing translator API
Default source language
The component takes input language from the input literal language tags. The default source language will be used for literals without language tags.
Target languages
Languages to which the input literals will be translated
Indicate machine translation in language tag
When switched on, the machine translation will be indicated in the target literal language tag using BCP47 extension T like this: "english string"@en-t-cs-t0-bing instead of simple "english string"@en
Chunked merger

Transforms the RDF Chunked data unit into an RDF single graph data unit.

Transformer, allows the user to transform the RDF Chunked data unit into an RDF single graph data unit.

Excel to CSV

Transforms Excel files to CSV files.

Transformer, allows the user to transform Microsoft Excel files to CSV files.

Output file name template
File name template under which the output CSV file will be visible in the pipeline, including extension. You can use {FILE} and {SHEET} placeholders to generate multiple CSV files from one input file
Sheet filter
Java regular expression matching Excel sheet names to process. If empty, all sheets are processed
Row start
Number of the first row to be transformed
Row end
Number of the last row to be transformed
Column start
Number of the first column to be transformed
Column end
Number of the last column to be transformed
Virtual columns with header
Adds sheet_name column header as this is currently the only supported virtual column
Determine type (date) for numeric cells
Excel stores dates and integers as double values. When checked, the transformer tries to parse these as dates and integers
Skip empty rows
When checked, empty rows are not included in the output CSV. Otherwise, they are full of NULL values
Add sheet name as column
When checked, a sheet_name column is added, with source Excel sheet name as a value
Evaluate formulas
When checked, formulas in Excel are evaluated
File hasher

Computes SHA-1 hash of the input files

Transformer, which computes SHA-1 hash for each input file. The hash can be verified e.g. by the online calculator.

Decode base64

Decodes the input files using the base64 algorithm.

Transformer, decodes the input files using the base64 algorithm.

Files filter

Filters input files by file name.

Transformer, allows the user to filter files by file name using Java regular expressions.

File name filter pattern
Regular expression to filter file names by
Files renamer

Renames files using a regular expression.

Transformer, allows the user to rename files by using Java regular expressions.

Full file name RegExp to match
Regular expression to match file names
Replace RegExp with
Replacement expression used to rewrite the matched file names
Files to RDF

Converts RDF files to RDF data.

Transformer, allows the user to convert RDF files (Files Data Unit) to RDF data (RDF data unit).

Commit size
Number of triples processed before committing to the database
Format
RDF serialization of the input file. Can be determined by file extension.
Files to RDF chunked

Converts RDF files to RDF data in a chunked fashion, to be further processed in smaller pieces.

Transformer, allows the user to convert RDF files (Files Data Unit) to RDF data in chunks, to be processed in smaller pieces.

Format
RDF serialization of the input file. Can be determined by file extension.
Files per chunk
Determines how many input files should be stored in one output chunk
Files to RDF single graph

Converts RDF files to RDF data in a single graph.

Transformer, allows the user to convert RDF files (Files Data Unit) to RDF data in a single graph.

Commit size
Number of triples processed before committing to the database
Format
RDF serialization of the input file. Can be determined by file extension.
Skip file on failure
When a file cannot be loaded, skip it instead of failing
Files to statements

Inserts contents of input files to RDF triple objects.

Transformer, inserts contents of input files to RDF triple objects.

Predicate
The RDF triple predicate used in the triples containing the file contents as objects. The subjects are RDF blank nodes
GeoTools

Provides geographical projection transformation features.

Transformer, provides geographical projection transformation features using the GeoTools library.

Input resource type
IRI of the type of entities containing the geocoordinates, e.g. http://www.opengis.net/ont/gml#Point
Predicate for coordinates
IRI of the predicate connecting the entity of the input resource type to the literal containing the geocoordinates, e.g. the http://www.opengis.net/ont/gml#pos property. The geocoordinates format is "X Y", i.e. two xsd:double numbers separated by a space.
Predicate for coordinate reference system
IRI of the predicate connecting the entity containing geocoordinates to a literal containing the source coordinates reference system, e.g. http://www.opengis.net/ont/gml#srsName
Default coordinate reference system
Coordinates reference system to be used with entities, which do not specify the coordinates reference system themselves, e.g. EPSG:5514
Predicate for linking to newly created points
The new entity containing the transformed geocoordinates will be connected to the original entity using this predicate
Output coordinate reference system
Target coordinates type, e.g. EPSG:4326
Fail on error
Stops execution on error when enabled, continues execution when disabled.
Graph merger

Transforms a Data Unit containing multiple graphs into a Data Unit containing a single graph.

Transformer. There are currently two types of RDF Data Units in LinkedPipes ETL. One of them can contain multiple RDF named graphs; it is currently produced only by Files to RDF, which may transform files containing RDF quads. When the user then wants to work with such data using components that only support triples, Graph merger is needed to do the conversion.

HDT to RDF

Converts an HDT file (binary serialization) to RDF data.

Transformer, allows the user to convert an HDT file (Files Data Unit) to RDF data (RDF data unit).

Commit size
Number of triples processed before committing to the database
HTML CSS ODCS

Parses HTML pages using CSS-like selectors.

Transformer, enables the user to parse HTML pages using jsoup CSS-like selectors.

Class of root subject
IRI of the root output resource type
Default has predicate
Default predicate to be used to connect nested objects
Generate source info
Adds information about the source file into the output data
JSON-LD operations

Performs JSON-LD operations on JSON-LD files.

Transformer, performs JSON-LD operations on input JSON-LD files using a JSON-LD processor.

Operation
Operation to be performed on input JSON-LD files
JSON-LD context (where applicable)
For Flatten and Compact operations, a new JSON-LD context can be specified
JSON to JSON-LD

Adds a JSON-LD context to JSON files.

Transformer, adds a specified JSON-LD context and additional provenance data to input JSON files, making them JSON-LD.

Encoding
Encoding of the input files
Data predicate
IRI of the predicate connecting the root entity to the object representing the original JSON data interpreted as JSON-LD.
Root entity type
IRI of the type of the newly created root entity.
Add file reference
If checked, input file name is added to the output JSON-LD file using the File name predicate.
File name predicate
IRI of the predicate used to attach the input file name to the newly created root entity in the output JSON-LD file.
Context object
JSON object specifying the JSON-LD context.
Modify date

Adds or subtracts days from xsd:date literals.

Transformer, allows the user to add or subtract days from a date represented as xsd:date literal.

Input predicate
The predicate used in triples, where the object is the date to be shifted.
Shift date by
Number of days to shift the date by. Can be a positive or negative integer.
Output predicate
The predicate to be used in output triples to store the shifted date.
Mustache

Generates text files using the {{ mustache }} template.

Transformer, generates text files using the {{ mustache }} template.

Entity class IRI
IRI of the RDF class of the objects, to which the {{ mustache }} template is applied.
Add the mustache:first predicate to first items of input lists
First items of input lists will have the mustache:first predicate added. This is useful, for instance, for generating JSON arrays, where no comma may follow the last item; with this predicate, the comma can be generated before every item except the first.
Mustache template
The {{ mustache }} template
Chunked Mustache

Generates text files using the {{ mustache }} template. Works with chunked RDF data.

Transformer, generates text files using the {{ mustache }} template and works with chunked RDF data.

Create zip archive

Zips input files into an archive.

Transformer, allows the user to zip input files (Files Data Unit) into an output zip archive.

Output zip file name
File name under which the file will be visible in the pipeline, including extension.
RDF to File

Converts RDF data to an RDF file.

Transformer, allows the user to convert RDF data (RDF data unit) to an RDF file (Files Data Unit).

Output file name
File name under which the file will be visible in the pipeline, including extension.
Output format
RDF serialization of the output file.
IRI of the output graph
IRI of the only named graph that will be created from the input triples. Valid only for the RDF Quad formats.
Chunked RDF to File

Converts chunked RDF data to an RDF file.

Transformer, allows the user to convert chunked RDF data to an RDF file.

Output file name
File name under which the file will be visible in the pipeline, including extension.
Output format
RDF serialization of the output file.
IRI of the output graph
IRI of the only named graph that will be created from the input triples. Valid only for the RDF Quad formats.
Used prefixes as Turtle
Allows the user to insert prefixes declaration in Turtle format to be used in Turtle file output.
RDF to HDT

Converts RDF data to an HDT file (binary serialization).

Transformer, allows the user to convert RDF data (RDF data unit) to an HDT file (Files Data Unit).

Output file name
File name under which the file will be visible in the pipeline, including extension.
Output base IRI
Base IRI for relative IRIs in the HDT file.
SHACL

Validates input data using SHACL shapes

Transformer, allows the user to validate input RDF data using SHACL shapes.

Fail on rule violation
When on and a shape validation fails, the whole component, and therefore the pipeline, fails. When off, the failures are in the output report.
Shapes to apply (in Turtle)
An alternative way of providing the shapes graph. This is merged into the Shapes input.
Union

Represents a union of RDF data. Useful to make nicer pipelines.

Transformer, represents a union of RDF data. Useful to make nicer pipelines where multiple edges would have to be duplicated otherwise.

SPARQL construct

Transforms RDF data using a SPARQL CONSTRUCT query.

Transformer, allows the user to transform RDF data using a CONSTRUCT query. This means that based on the input data, output data will be generated using this query.

SPARQL CONSTRUCT query
Query for transformation of triples
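For example, a query of the following shape outputs only the constructed triples, here renaming rdfs:label to dct:title; the vocabularies are only an example:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dct:  <http://purl.org/dc/terms/>

    CONSTRUCT { ?s dct:title ?label . }
    WHERE     { ?s rdfs:label ?label . }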
SPARQL construct chunked

Transforms chunked RDF data using a SPARQL CONSTRUCT query.

Transformer, allows the user to transform chunked RDF data using a CONSTRUCT query. This means that based on the input data, output data will be generated using this query, preserving the data chunking.

SPARQL CONSTRUCT query
Query for transformation of triples
SPARQL construct to file list

Creates multiple RDF files, possibly containing multiple named graphs, using a set of SPARQL queries.

Transformer, allows the user to create multiple RDF files including multiple RDF named graphs using a set of SPARQL CONSTRUCT queries.

Use deduplication
Individual SPARQL query results may contain duplicate triples or quads. This is a result of the way the SPARQL queries are evaluated. For smaller files this causes no issues as the duplicates are correctly resolved when the resulting file is further processed. For larger files, it is possible to turn on deduplication, which processes the results when they are serialized. This causes a slight performance penalty.
SPARQL linker chunked

Transforms chunked RDF data combined with a smaller reference dataset using a SPARQL CONSTRUCT query.

Transformer, allows the user to transform chunked RDF data in combination with a smaller reference dataset using a CONSTRUCT query. This means that based on the input data and the reference data, output data will be generated using this query, preserving the data chunking.

SPARQL CONSTRUCT query
Query for transformation of triples
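A sketch of a possible linking query follows; it assumes that both the current chunk and the reference data are visible to the query and produces owl:sameAs links for entities sharing a label. The vocabularies and the matching rule are only an example:

    PREFIX owl:  <http://www.w3.org/2002/07/owl#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

    CONSTRUCT { ?entity owl:sameAs ?reference . }
    WHERE {
      ?entity    rdfs:label     ?label .
      ?reference skos:prefLabel ?label .
    }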
SPARQL select

Transforms RDF data into a CSV file using a SPARQL SELECT query.

Transformer, allows the user to transform RDF data to a CSV file using a SELECT query.

SPARQL SELECT query
Query for transformation of triples
SPARQL select multiple

Transforms RDF data into multiple CSV files using multiple SPARQL SELECT queries.

Transformer, allows the user to transform RDF data into multiple CSV files using multiple SPARQL SELECT queries.

SPARQL update

Transforms RDF data using SPARQL UPDATE queries.

Transformer, allows the user to transform RDF data using SPARQL UPDATE queries. This means that the input data will be changed - added to and/or deleted from.

SPARQL Update query
Query for transformation of triples
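For example, the following update replaces rdfs:label with dct:title in place, deleting the old triples and inserting new ones; the vocabularies are only an example:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dct:  <http://purl.org/dc/terms/>

    DELETE { ?s rdfs:label ?label . }
    INSERT { ?s dct:title  ?label . }
    WHERE  { ?s rdfs:label ?label . }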
SPARQL update chunked

Transforms chunked RDF data using SPARQL UPDATE queries.

Transformer, allows the user to transform chunked RDF data using SPARQL UPDATE queries. This means that the input data will be changed - added to and/or deleted from, preserving the data chunking.

SPARQL Update query
Query for transformation of triples
Stream compression

Compresses input files using gzip or bzip2.

Transformer, compresses input files using gzip or bzip2.

Format
Output compression format. Can be gzip or bzip2.
Tabular

Transforms CSV files to RDF triples.

Transformer. Used to transform CSV files to RDF data according to the Generating RDF from Tabular Data on the Web W3C Recommendation. The first set of parameters is there simply to deal with CSV files not compliant with RFC 4180, which specifies a comma as a delimiter, the double-quote as a quote character, and UTF-8 as the encoding.

Tabular chunked

Transforms CSV files to RDF data chunks.

Transformer. Used to transform CSV files to RDF data chunks according to the Generating RDF from Tabular Data on the Web W3C Recommendation. This version of the Tabular component produces chunked RDF data for processing of a larger number of independent entities with lower memory consumption.

Tabular ODCS

Transforms CSV files to RDF triples (deprecated).

Transformer. Used to transform CSV files to RDF data. It is present only to allow smoother transition from UnifiedViews using the uv2etl tool. In all other cases, the CSV on the Web compatible Tabular component should be used instead.

Templated XLS to CSV

Transforms Excel files to CSV files using a template, which helps with print formatted files.

Transformer, allows the user to transform Microsoft Excel XLS (not XLSX) files to CSV files using a template, which helps with print formatted files.

Template file prefix
Prefix used by template files. The part of the template file name after the prefix should match the original file name.
Decompress archive

Unpacks zip, bzip2, gzip, or 7z archives.

Transformer, allows the user to unpack input zip, bzip2, gzip and 7z files.

Unpack each file into a separate directory
When checked, a directory will be created for each input file. Otherwise, all files will be decompressed into a single directory
Format
Input archive format. Can be zip, bzip2, gzip, 7z or auto, which means the format will be determined by the file extension
Skip file on error
When an error occurs with a file when processing multiple files, continue execution
Decompress zip archive

Unzips a zip archive.

Transformer, allows the user to unzip input zip files.

Unpack each zip into a separate directory
When checked, a directory will be created for each input zip file. Otherwise, all files will be decompressed into a single directory
Value parser

Parses literals containing lists and other structured texts.

Transformer, allows the user to parse lists and other structured texts from a literal, creating multiple literals.

Predicate for input values
IRI of the predicate linking to the literal to be processed
Regular expression with groups
Java regular expression with named groups to match the items of the structured input text, e.g. (?<g>[^\[\],]+)
Preserve language tag/datatype
Sets the language tag or datatype of the input literal on the output literals
Group name
Name of the regular expression group to store as a new literal
Group predicate
IRI of the predicate linking to the output literals created from this regular expression group
XSLT transformer

Transforms XML files using XSLT transformations.

Transformer, allows the user to transform XML files using XSLT.

New file extension
XSLT allows the creation of arbitrary text files from XML. Especially for RDF files, it is important that the new file has the correct extension according to its format
Number of threads used for transformation
The XSLT processor is given this number of threads to use for each transformation
XSLT Template
The XSLT template used to transform the input XML files
Delete directory

Deletes a local file system directory including all its contents.

Special purpose component, allows the user to delete a directory on a local file system.

Directory to delete
Directory to delete
Graph store purger

Uses the SPARQL 1.1 Graph Store HTTP Protocol to remove a list of named graphs from an endpoint.

Special purpose component, allows the user to remove a list of named graphs from a SPARQL 1.1 Graph Store HTTP Protocol endpoint.

Graph store protocol endpoint
URL of the SPARQL 1.1 Graph Store HTTP Protocol endpoint, e.g. http://localhost:8890/sparql-graph-crud-auth
Use authentication
Use for endpoints requiring user authentication
User name
User name for the endpoint
Password
Password for the endpoint
HTTP Request

Issues a set of arbitrary HTTP requests according to configuration.

Special purpose component, issues a set of arbitrary HTTP requests according to configuration.

Number of threads used for execution
Number of threads used for execution of the individual tasks. Network operations tend to be slow, so it is advantageous to process them in parallel.
Threads per group
For cases where there are multiple tasks aimed e.g. at the same host, this sets the number of threads to be used for a single group (e.g. a single host) as specified in the input tasks.
Skip on error
When an HTTP or another error is encountered in one of the tasks, this switch controls whether the execution of the whole pipeline should fail, or whether this is expected and therefore the execution should continue.
Follow redirects
Controls whether the component should follow HTTP 30x redirects.
Encode input URL
Percent-encodes input URLs before using them as URIs in HTTP. Useful for cases where the target web server does not handle IRIs well or the input URLs may contain invalid characters.
Pipeline executor

Executes LinkedPipes ETL pipelines

Special purpose component, allows the user to execute LP-ETL pipelines.

URL of LinkedPipes instance
The instance on which a pipeline should be executed
Pipeline IRI
IRI of the pipeline to be executed
Save debug data
Determines whether outputs and inputs of all components are saved. Choosing not to save them might increase execution performance, but in case of failures, debug data will not be available.
Delete working data, including saved debug data, after successful execution
On successful execution of a pipeline, debug data and logs can be deleted, to save space.
Execution log level
Determines the log level of the pipeline execution.
Virtuoso loader

Instructs Virtuoso to load a file.

Special purpose component, allows the user to instruct the OpenLink Virtuoso triplestore to load a file from the remote file system. Uses the Virtuoso Bulk Loader functionality.

Virtuoso JDBC connection string
JDBC database connection string to connect to the Virtuoso iSQL port, e.g. jdbc:virtuoso://localhost:1111/charset=UTF-8/
Virtuoso user name
User name to Virtuoso (default in Virtuoso is dba)
Virtuoso password
Password to Virtuoso. Note that this is plain text and in no way secured. (default in Virtuoso is dba)
Remote directory with source files
Directory on the remote file system (local to Virtuoso) where the files to be loaded are located
Filename pattern to load
A filename pattern of files to load. This can be either a single file like data.ttl or a pattern such as data*.ttl
Target graph IRI
Target RDF graph for input triple files and default graph in quad files
Clear target graph before loading
When checked, executes SPARQL CLEAR GRAPH on the target graph before loading
Execute checkpoint
When checked, executes the checkpoint; command after loading. Otherwise, checkpoint will be executed as configured in the Virtuoso instance. If that instance crashes before checkpoint; is executed, the loaded data will be lost.
Number of loaders to use
Virtuoso can load multiple files in parallel. This specifies how many and it should correspond to the available number of cores on the server.
Status update interval in seconds
Interval for polling Virtuoso for load status update
Virtuoso extractor

Instructs Virtuoso to dump a graph.

Special purpose component, allows the user to instruct the OpenLink Virtuoso triplestore to dump a named graph to the remote file system. Uses the Virtuoso Bulk Loader functionality.

Virtuoso JDBC connection string
JDBC database connection string to connect to the Virtuoso iSQL port (typically 1111)
User name
User name to Virtuoso
Password
Password to Virtuoso. Note that this is plain text and in no way secured
Source graph IRI
IRI of the RDF graph to be dumped
Output path
Directory on the remote file system (local to Virtuoso) where the files will be dumped