How to compress and decompress data

Your data takes too long to transfer or occupies too much disk space.

Problem

RDF syntaxes, especially the line-oriented ones such as N-Triples or N-Quads, tend to be verbose. Unless namespace prefixes are used diligently, namespace substrings are frequently repeated in IRIs. Redundancies may appear at the structural level too. For example, the size of data increases when it is denormalized, which is often the case with data in CSV. Consequently, transferring uncompressed data between servers can take a long time and the data files can occupy too much disk space.

Solution

LinkedPipes ETL (LP-ETL) provides components for compressing and decompressing data. It supports GZip, BZip2, Zip, and 7-zip, although 7-zip archives can only be read, not written.

The Stream compression component compresses to GZip or BZip2, with GZip being the default. You can switch the compression method with the Format option. Since these formats cannot bundle multiple files into one archive, the component produces a single output file for each input file. Each output file is named after its input file, with a suffix indicating the compression method appended, i.e. .gz or .bz2.
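The naming and one-file-per-input behaviour described above can be illustrated outside of LP-ETL. The following is a minimal sketch using the Python standard library, not the component's actual implementation. The function name stream_compress and the format labels "gzip" and "bzip2" are assumptions made for this example.

```python
import bz2
import gzip
import shutil
import tempfile
from pathlib import Path

# Map a format label to (opener, suffix). The labels are chosen for this
# sketch only; they are not LP-ETL option values.
FORMATS = {
    "gzip": (gzip.open, ".gz"),
    "bzip2": (bz2.open, ".bz2"),
}


def stream_compress(source: Path, fmt: str = "gzip") -> Path:
    """Compress one file into one output file named <input name><suffix>."""
    opener, suffix = FORMATS[fmt]
    target = source.with_name(source.name + suffix)
    with open(source, "rb") as src, opener(target, "wb") as dst:
        # Copy in chunks so that large files are never fully loaded in memory.
        shutil.copyfileobj(src, dst)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "data.nt"
        line = '<http://example.com/s> <http://example.com/p> "o" .\n'
        data.write_text(line * 1000)
        for fmt in FORMATS:
            out = stream_compress(data, fmt)
            print(out.name, out.stat().st_size, "bytes")
```

Given an input named data.nt, the sketch writes data.nt.gz or data.nt.bz2 next to it, mirroring the suffix convention described above.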

The Create zip archive component compresses its input files into a Zip archive. You can specify the name of the output with the Output zip file name option. Use this component if you need to combine multiple files into a single archive file.
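For comparison, combining several files into one Zip archive can be sketched outside of LP-ETL with Python's zipfile module. This is an illustration only; the function name create_zip_archive and the file names are hypothetical and do not reflect how the component is implemented.

```python
import tempfile
import zipfile
from pathlib import Path


def create_zip_archive(inputs: list[Path], output: Path) -> Path:
    """Pack all input files into a single compressed Zip archive."""
    with zipfile.ZipFile(output, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in inputs:
            # Store each file under its bare name rather than its full path.
            archive.write(path, arcname=path.name)
    return output


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        files = []
        for name in ("people.ttl", "places.ttl"):
            path = folder / name
            path.write_text("# example content\n")
            files.append(path)
        archive_path = create_zip_archive(files, folder / "output.zip")
        with zipfile.ZipFile(archive_path) as archive:
            print(archive.namelist())
```

Unlike the GZip and BZip2 sketch, a single output file here holds all inputs, which is the reason to prefer Zip when files must travel together.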

Conversely, there are components for decompression. The Decompress archive component can read the GZip, BZip2, Zip, and 7-zip formats. The format of the input can either be detected automatically from its file extension or selected with the Format option. You can also enable the Unpack each file into separated directory option, or switch on the Skip file on error option if your pipeline may encounter an invalid file. Besides this component, there is also the Decompress zip archive component, which supports only Zip archives. Since Zip is also handled by the Decompress archive component, the Zip-specific component exists mostly for legacy reasons.
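The idea of detecting the format from the file extension and optionally skipping invalid files can be sketched as follows. This is a simplified illustration in Python, not the logic of the Decompress archive component. The names detect_format, decompress, and skip_on_error are assumptions made for this example, and 7-zip is left out because the Python standard library cannot read it.

```python
import bz2
import gzip
import shutil
import tempfile
import zipfile
import zlib
from pathlib import Path
from typing import Optional

# Extension-to-format mapping assumed for this sketch.
EXTENSIONS = {".gz": "gzip", ".bz2": "bzip2", ".zip": "zip"}


def detect_format(path: Path) -> Optional[str]:
    """Guess the compression format from the last file extension."""
    return EXTENSIONS.get(path.suffix.lower())


def decompress(path: Path, out_dir: Path, fmt: Optional[str] = None,
               skip_on_error: bool = False) -> list[Path]:
    """Decompress one file into out_dir and return the files written."""
    fmt = fmt or detect_format(path)
    out_dir.mkdir(parents=True, exist_ok=True)
    try:
        if fmt == "zip":
            with zipfile.ZipFile(path) as archive:
                archive.extractall(out_dir)
                return [out_dir / name for name in archive.namelist()]
        if fmt in ("gzip", "bzip2"):
            opener = gzip.open if fmt == "gzip" else bz2.open
            target = out_dir / path.stem  # data.ttl.gz becomes data.ttl
            with opener(path, "rb") as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            return [target]
        raise ValueError(f"Unsupported or undetected format: {path.name}")
    except (OSError, EOFError, ValueError, zlib.error, zipfile.BadZipFile):
        # A failed stream may leave a partial output file behind;
        # cleanup is omitted here for brevity.
        if skip_on_error:
            return []
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        valid = folder / "data.ttl.gz"
        with gzip.open(valid, "wb") as f:
            f.write(b"# example content\n")
        invalid = folder / "broken.gz"
        invalid.write_bytes(b"this is not gzip data")
        for item in (valid, invalid):
            written = decompress(item, folder / "out", skip_on_error=True)
            print(item.name, "->", [p.name for p in written])
```

With skip_on_error left at False, the invalid file raises an exception and stops processing, which corresponds to the choice a pipeline author makes when deciding whether one bad input should fail the whole run.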

Discussion

We can illustrate the compression ratios of the supported compression methods with an example. The following table shows the relative file sizes of a 653 MB file in the Turtle syntax when compressed by the respective methods or reserialized to other RDF syntaxes. All sizes are relative to the uncompressed Turtle file.

Relative file sizes of an example RDF file per syntax and compression

RDF syntax   Uncompressed   GZip   BZip2   Zip    7-zip
Turtle       1              0.18   0.13    0.18   0.11
N-Triples    2.6            0.23   0.18    0.23   0.15
RDF/XML      1.57           0.18   0.13    0.19   0.12

While the results may vary for different kinds of data, they suggest that Turtle compressed with 7-zip yields the smallest files. However, LP-ETL can only read 7-zip, not write it. The second best results are achieved with BZip2. Note, however, that the more efficient compression methods take more time, so there is a trade-off between the resulting file size and the time it takes to produce it. When deciding on a compression method, consider how often the compressed data will be transferred or read. When data is transferred or read many times, use a more compact compression method. When data is read only once, use a faster compression method.
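To judge the trade-off between size and time on your own data, you can measure both outside of LP-ETL. The sketch below uses the Python standard library and is not how the figures in the table were produced. The function name compare is hypothetical, and the lzma module serves only as a rough stand-in for 7-zip, whose default compression method belongs to the same LZMA family.

```python
import bz2
import gzip
import lzma
import time

# Compression functions to compare. "lzma" approximates 7-zip only loosely.
CODECS = {
    "gzip": gzip.compress,
    "bzip2": bz2.compress,
    "lzma": lzma.compress,
}


def compare(data: bytes) -> dict[str, tuple[float, float]]:
    """Return {codec: (relative size, seconds taken)} for the given data."""
    results = {}
    for name, compress in CODECS.items():
        start = time.perf_counter()
        compressed = compress(data)
        elapsed = time.perf_counter() - start
        results[name] = (len(compressed) / len(data), elapsed)
    return results


if __name__ == "__main__":
    # Synthetic N-Triples-like sample; replace it with bytes read from a
    # real file to obtain meaningful numbers.
    sample = b"".join(
        b'<http://example.com/s%d> <http://example.com/p> "value %d" .\n' % (i, i)
        for i in range(50000)
    )
    for name, (ratio, seconds) in compare(sample).items():
        print(f"{name:6s} relative size {ratio:.3f}  time {seconds:.3f} s")
```

The absolute numbers depend on the data, the machine, and the compression level, so treat them as a guide for choosing between methods rather than as a benchmark.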

See also

The benefits of compression can also be attained by using binary serialization formats. LP-ETL can export RDF to BinaryRDF, a custom binary format for RDF supported by RDF4J. Serialization to this format is provided by the RDF to File component. Outside of LP-ETL, RDF HDT is an established binary RDF format that also maintains a limited capability to query the data. For illustration purposes, the compression ratio of RDF HDT for the example used above is 0.24, which makes it roughly on par with the evaluated compression methods.