How to process large RDF data

You want to process large RDF data.

Problem

When you transform large RDF data in LinkedPipes ETL (LP-ETL), your SPARQL queries and update operations can take a long time or get stuck, or the transformation may run out of memory.

Solution

LP-ETL allows you to divide and conquer large RDF data by splitting it into smaller chunks. Each chunk can be transformed separately, which requires fewer compute resources in total than processing the same RDF data in bulk.

Processing data in chunks means the data need not be loaded into memory as a whole, which reduces the memory footprint of the transformation. Moreover, chunks can be processed in parallel, leveraging machines with multiple CPU cores that can work on several chunks at a time. By utilizing compute resources more efficiently, parallel processing can speed up a pipeline's execution.

Chunking comes with a major caveat. It can be used only when the processed data can be partitioned into chunks such that each chunk contains all the data required by the applied transformations. For example, joins across the complete dataset are infeasible when the dataset is split into chunks. Consequently, chunking is suitable for data processing tasks that are embarrassingly parallel. Fortunately, many tasks in ETL of RDF data still fit this description.
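For illustration, here is a minimal sketch of a chunk-friendly transformation. The vocabulary and property names (foaf:Person, foaf:name, schema:name) are assumptions chosen for the example, not something prescribed by LP-ETL. Because every triple pattern matches within a single resource's description, the query yields the same result whether it runs over the whole dataset or over each chunk separately.

  PREFIX foaf:   <http://xmlns.com/foaf/0.1/>
  PREFIX schema: <http://schema.org/>

  # Derive schema:name from foaf:name for each person.
  # No triple pattern reaches beyond the resource being matched,
  # so the transformation is safe to apply chunk by chunk.
  CONSTRUCT {
    ?person schema:name ?name .
  }
  WHERE {
    ?person a foaf:Person ;
            foaf:name ?name .
  }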

LP-ETL provides special versions of many of its components that produce or operate on chunked data. If you want to use chunking, you can replace the components in your pipeline with their variants that support chunks. There are several components that output chunks:

Component                  Chunk size determined by
Files to RDF chunked       number of files
Tabular chunked            number of rows from the input CSV
SPARQL Endpoint chunked    number of resources included in the component's query

Moreover, chunks can be produced whenever you have collections of RDF files, such as those obtained from the XSLT transformer or the HTTP get list components. Once you have chunks, you can consume them with components that transform the chunks, such as the SPARQL construct chunked or the SPARQL update chunked components. They operate the same as their non-chunked counterparts, except that the SPARQL query or update operation from their configuration is executed on each chunk individually.
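As a sketch of what such a per-chunk operation might look like when run by SPARQL update chunked, the following update renames a property; dct:title and schema:name are illustrative assumptions, not properties mandated by LP-ETL. Since the operation reads and writes only the triples of the matched resource, applying it to each chunk individually gives the same result as applying it to the complete data.

  PREFIX dct:    <http://purl.org/dc/terms/>
  PREFIX schema: <http://schema.org/>

  # Rename dct:title to schema:name.
  # Both the deleted and the inserted triples belong to the matched
  # resource, so no data outside the current chunk is needed.
  DELETE {
    ?resource dct:title ?title .
  }
  INSERT {
    ?resource schema:name ?title .
  }
  WHERE {
    ?resource dct:title ?title .
  }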

Discussion

When you are done transforming your chunks, or when you need to pass the data to a component without support for chunks, you can merge the chunks back into a single RDF database via the Chunked merger component.

Note that chunking RDF data and merging it back incurs an overhead. The decision to use chunking should therefore be informed by the trade-off between the savings gained by chunked execution and the cost of this overhead. Simply put, you want to use chunking only when your data is large enough for the savings to outweigh the overhead.

See also

You can find more ways to optimize LP-ETL pipelines described in a part of the tutorial on converting tabular data to RDF. In case the options that LP-ETL offers do not cut it for the volume of data you need to handle, you can look at other methods for distributed processing of large-scale RDF data, such as the Apache Jena Elephas library for working with RDF on a Hadoop back-end.