XML is in widespread use across many domains. For example, it is likely the dominant format for geographic data, as evidenced by the standards of the Open Geospatial Consortium or the INSPIRE specifications for spatial data infrastructure in the EU. Since a lot of RDF data refers to places in the real world, combining it with geographic datasets is often useful. If that geographic data is in XML, it must first be converted to RDF before the two can be combined.
Assume you have geospatial XML data about the Žižkov Television Tower, allegedly the second ugliest building in the world. The data is published as open data through the web services maintained by the Czech State Administration of Land Surveying and Cadastre, from which you can retrieve it live. Since it is quite verbose, here is a subset of it:
The data is described using the INSPIRE specification for data about buildings and includes snippets in the Geography Markup Language (GML) that define geospatial geometries. In fact, the initial version of GML had a profile that serialized data to RDF, and its object-property structure has retained some likeness to RDF to this day. Constructing RDF out of the example data is thus relatively straightforward.
Solution
XSL Transformations (XSLT) serve to transform XML data into other XML data. While this standard can produce any textual format, it was designed to create XML output in particular. You can leverage this support by producing RDF/XML, an XML-based RDF syntax. Unfortunately, RDF/XML has perhaps done more harm than good to RDF, because superficial flaws of this syntax were mistaken for fundamental flaws of RDF. Serving as a bridge between XML and RDF is likely one of the few uses it has left.
LinkedPipes ETL (LP-ETL) provides the XSLT transformer component to run XSL transformations. This component is based on the open-source version of the Saxon XSLT processor.
A few options are available to configure this component. The Transformed file extension option sets the extension that is appended to the input file name to form the output file name. If you produce RDF, use the extension rdf, which indicates RDF/XML and allows the output to be parsed subsequently without specifying its format. The Number of threads used for transformation option lets you parallelize the transformation of multiple files. When processing many XML files, you would typically set it to the number of CPU cores available on the machine running LP-ETL. The Skip on error switch instructs LP-ETL to skip files that cause errors during their transformation. In general, it is recommended to keep this switch turned off and handle the expected errors in the XSL stylesheet.
Finally, the most important parameter of the component is the XSL stylesheet that defines the transformation. For instance, to transform the data about the Žižkov Television Tower, you can use the following simplified stylesheet, which takes only a few source elements into account:
Applying this stylesheet to the input data produces the following result, shown here reserialized in the more readable Turtle syntax:
The data in RDF describes the Žižkov Television Tower as an instance of dbo:Building from the DBpedia ontology. Its identifier is preserved both as the value of the schema:identifier property from Schema.org and as a part of the building's IRI. The building links to the shape of its ground plan, captured as an instance of schema:GeoShape that is described by a series of coordinates in the WGS 84 system. These coordinates allow you to plot the building on a map.
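To make the last point concrete, here is a minimal Python sketch of reading such a series of coordinates before handing it to a mapping library. It assumes, purely for illustration, that the shape is serialized as whitespace-separated points with latitude and longitude separated by a comma; the actual serialization depends on the stylesheet. The coordinates in EXAMPLE_POLYGON are made up and do not come from the cadastral data, and the function names are hypothetical.

```python
# Assumed serialization: "lat,lon lat,lon ..." with a closed ring
# (first point repeated at the end). The values below are invented.
EXAMPLE_POLYGON = (
    "50.0810,14.4510 50.0810,14.4514 50.0812,14.4514 "
    "50.0812,14.4510 50.0810,14.4510"
)


def parse_polygon(value: str) -> list[tuple[float, float]]:
    """Parse whitespace-separated 'lat,lon' points into (lat, lon) tuples."""
    points = []
    for token in value.split():
        lat_text, lon_text = token.split(",")
        lat, lon = float(lat_text), float(lon_text)
        # Reject values outside the valid WGS 84 ranges.
        if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
            raise ValueError(f"Coordinate out of WGS 84 range: {token}")
        points.append((lat, lon))
    return points


def bounding_box(points: list[tuple[float, float]]) -> tuple[float, float, float, float]:
    """Return (min_lat, min_lon, max_lat, max_lon), e.g. to centre a map view."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return (min(lats), min(lons), max(lats), max(lons))


if __name__ == "__main__":
    ring = parse_polygon(EXAMPLE_POLYGON)
    print(len(ring), "points, bounding box:", bounding_box(ring))
```

The bounding box is only one possible next step; a mapping library would typically accept the list of points directly.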
The pipeline transforming the discussed XML data to RDF is available here.
Discussion
If you have multiple input XML files, you can treat each output file as a chunk to be processed separately in parallel with other chunks. To do so, you can use the Files to RDF chunked component instead of the usual Files to RDF single graph component. Splitting RDF into smaller chunks is an optimization that reduces memory use and improves the speed of pipeline execution, especially when processing larger datasets. Building pipelines operating on RDF chunks is explained in the tutorial about processing large RDF data.
See also
Thanks to the XML and RDF libraries available in many programming languages, you can transform XML to RDF by implementing the conversion directly on top of these libraries. Nevertheless, in doing so you would lose many of the benefits that the declarative specification in XSL stylesheets gives you, such as its high-level formulation.
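For comparison, here is what such a hand-written conversion might look like, as a rough sketch using only the Python standard library. The input is a deliberately simplified, hypothetical XML structure, not the INSPIRE data discussed above, and the base IRI is an assumed placeholder. The output is serialized as N-Triples by plain string formatting; a real implementation would more likely rely on an RDF library.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Hypothetical, heavily simplified input; NOT the actual INSPIRE/GML structure.
SAMPLE_XML = """<buildings>
  <building id="B-001">
    <name>Example Tower</name>
  </building>
</buildings>"""

BASE_IRI = "http://example.com/resource/building/"  # assumed namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DBO_BUILDING = "http://dbpedia.org/ontology/Building"
SCHEMA_IDENTIFIER = "http://schema.org/identifier"
SCHEMA_NAME = "http://schema.org/name"


def escape_literal(value: str) -> str:
    """Escape characters that may not appear raw in an N-Triples literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def xml_to_ntriples(xml_text: str) -> list[str]:
    """Convert each <building> element into a handful of N-Triples lines."""
    root = ET.fromstring(xml_text)
    triples = []
    for building in root.iter("building"):
        local_id = building.get("id")
        if not local_id:
            continue  # without an identifier there is no stable IRI to mint
        subject = f"<{BASE_IRI}{quote(local_id, safe='')}>"
        triples.append(f"{subject} <{RDF_TYPE}> <{DBO_BUILDING}> .")
        triples.append(f'{subject} <{SCHEMA_IDENTIFIER}> "{escape_literal(local_id)}" .')
        name = building.findtext("name")
        if name:
            triples.append(f'{subject} <{SCHEMA_NAME}> "{escape_literal(name.strip())}" .')
    return triples


if __name__ == "__main__":
    print("\n".join(xml_to_ntriples(SAMPLE_XML)))
```

Even in this small example, the mapping is spread across imperative code: IRI minting, escaping, and the handling of missing values all have to be written by hand, which is the trade-off described above.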
An alternative approach to converting XML to RDF is Tripliser. Like XSLT, it is based on a declarative mapping written in XML, but it offers more features and may give you better leverage when processing large or volatile data.