How to process large RDF data
You want to process large RDF data.
Problem
When you transform large RDF data in LinkedPipes ETL (LP-ETL), your SPARQL queries and update operations can take a long time or get stuck, or you may run out of memory.
Solution
LP-ETL allows you to divide and conquer large RDF data by splitting it into smaller chunks. Each chunk can be transformed separately, which requires fewer compute resources in total than processing the same data in bulk.
Processing data in chunks means the data need not be loaded into memory as a whole, which reduces the memory footprint of the transformation. Moreover, a sequence of chunks can be processed in parallel, leveraging machines with multiple CPU cores that can work on several chunks at a time. By utilizing compute resources efficiently, parallel processing can speed up a pipeline's execution.
Chunking comes with a major caveat. It can be used only when the processed data can be partitioned into chunks such that each chunk contains all the data required by the applied transformations. For example, joins across the complete dataset are infeasible when the dataset is split into chunks. Consequently, chunking is suitable for data processing tasks that are embarrassingly parallel. Fortunately, many tasks in ETL of RDF data still fit this description.
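To make the idea more concrete, here is a minimal sketch of chunked processing outside of LP-ETL, written in Python with rdflib and multiprocessing. The chunk file layout, the property IRIs, and the update operation are made up for illustration; the point is that the update touches only data within a single chunk, so the chunks can be transformed independently and in parallel.

```python
from multiprocessing import Pool
from pathlib import Path

from rdflib import Graph

# A transformation that needs only the data inside one chunk: it renames
# a property, so no joins across chunks are required. The ex: vocabulary
# is hypothetical.
TRANSFORM = """
PREFIX ex: <http://example.com/vocabulary#>
DELETE { ?s ex:oldName ?o }
INSERT { ?s ex:newName ?o }
WHERE  { ?s ex:oldName ?o }
"""

def transform_chunk(chunk_path: str) -> str:
    """Load one chunk, apply the update, and write the result."""
    graph = Graph()
    graph.parse(chunk_path, format="turtle")
    graph.update(TRANSFORM)
    out_path = chunk_path.replace(".ttl", ".out.ttl")
    graph.serialize(destination=out_path, format="turtle")
    return out_path

if __name__ == "__main__":
    # Each chunk is assumed to be a separate Turtle file; several chunks
    # are transformed at the same time, one per worker process.
    chunks = sorted(str(p) for p in Path("chunks").glob("*.ttl"))
    with Pool() as pool:
        outputs = pool.map(transform_chunk, chunks)
    print(f"Transformed {len(outputs)} chunks")
```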
LP-ETL provides special versions of many of its components that produce or operate on chunked data. If you want to use chunking, you can replace the components in your pipeline with their variants that support chunks. Several components output chunks:
| Component | Chunk size determined by |
|---|---|
| Files to RDF chunked | number of files |
| Tabular chunked | number of rows from the input CSV |
| SPARQL Endpoint chunked | number of resources included in the component's query |
Moreover, chunks can be produced whenever you have collections of RDF files, such as those obtained from the XSLT transformer or the HTTP get list components. Once you have chunks, you can consume them with components that transform chunks, such as the SPARQL construct chunked or the SPARQL update chunked components. They work the same as their non-chunked counterparts, except that the SPARQL query or update operation from their configuration is executed on each chunk individually, as the sketch below illustrates.
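The following sketch shows roughly what "executed on each chunk individually" amounts to, again in Python with rdflib rather than in LP-ETL itself. The CONSTRUCT query and the vocabulary are hypothetical; it works per chunk only because all triples about a person are assumed to land in the same chunk.

```python
from pathlib import Path

from rdflib import Graph

# A CONSTRUCT query that is safe to run per chunk: it only combines
# triples describing a single resource. The ex: vocabulary is made up.
CONSTRUCT = """
PREFIX ex: <http://example.com/vocabulary#>
CONSTRUCT { ?person ex:fullName ?fullName }
WHERE {
  ?person ex:firstName ?firstName ;
          ex:lastName  ?lastName .
  BIND (CONCAT(?firstName, " ", ?lastName) AS ?fullName)
}
"""

def construct_per_chunk(chunk_paths):
    """Run the same CONSTRUCT query on every chunk, yielding one result
    graph per chunk, which mirrors how the chunked components behave."""
    for chunk_path in chunk_paths:
        chunk = Graph()
        chunk.parse(str(chunk_path), format="turtle")
        # The result of a CONSTRUCT query is itself an RDF graph.
        yield chunk.query(CONSTRUCT).graph

results = list(construct_per_chunk(sorted(Path("chunks").glob("*.ttl"))))
```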
Discussion
When you are done transforming your chunks, or you need to pass the data to a component without support for chunks, you can merge the chunks back into a single RDF database via the Chunked merger component.
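If you ever need the same merge step outside of LP-ETL, for instance to inspect intermediate results, a rough Python equivalent with rdflib could look like the following; the chunk file layout is an assumption, and inside a pipeline the Chunked merger component handles this for you.

```python
from pathlib import Path

from rdflib import Graph

# Merge all transformed chunks back into one graph, similarly to what
# the Chunked merger component does inside an LP-ETL pipeline.
merged = Graph()
for chunk_path in sorted(Path("chunks").glob("*.out.ttl")):
    merged.parse(str(chunk_path), format="turtle")

merged.serialize(destination="merged.ttl", format="turtle")
print(f"Merged graph contains {len(merged)} triples")
```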
Note that chunking RDF data and merging it back incurs an overhead. The decision to use chunking should therefore be informed by the trade-off between the savings gained by chunked execution and the cost of this overhead. Simply put, use chunking only when your data is large enough for the savings to outweigh the overhead.
See also
You can find more ways to optimize LP-ETL pipelines in a part of the tutorial on converting tabular data to RDF. In case the options LP-ETL offers do not cut it for the volume of data you need to handle, you can look at other methods for distributed processing of large-scale RDF data, such as the Apache Jena Elephas library for working with RDF on a Hadoop back-end.