HTML CSS ODCS

Transformer that enables users to parse HTML pages using jsoup CSS-like selectors.

Class of root subject
IRI of the root output resource type
Default has predicate
Default predicate to be used to connect nested objects
Generate source info
Adds information about the source file into the output data

Characteristics

ID
t-htmlcssuv
Type
transformer
Inputs
Files
Outputs
RDF single graph
Look in pipeline
HTML CSS ODCS
Sample pipeline
available

The main parsing configuration happens in the Actions list. The actions are applied to each input HTML document sequentially, starting with the action named webPage. They produce tree-like RDF data whose resources are connected using the Default has predicate. The tree is rooted in an entity of type Class of root subject, which represents the processed web page. The inputs and outputs are named groups of HTML elements, lists of HTML elements, or texts. Note the difference between a list of HTML elements held as a single object and a list of objects, each of which is an HTML element. To get from the former to the latter, use the Unlist action.
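The distinction between a list held as a single object and a list of individual objects can be illustrated with a small Python model. This is only a sketch of the idea, not the component's implementation: the Element class and the unlist function are hypothetical stand-ins introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical stand-in for an HTML element."""
    tag: str
    text: str


def unlist(group):
    """Expand every list-valued item of a group into individual items."""
    result = []
    for item in group:
        if isinstance(item, list):
            result.extend(item)  # one list object becomes many objects
        else:
            result.append(item)  # non-list items pass through unchanged
    return result


rows = [Element("li", "first"), Element("li", "second")]

# A group holding ONE object, which happens to be a list of elements.
as_single_object = [rows]

# A group holding TWO objects, each of which is an element.
as_individual_items = unlist(as_single_object)
```

In this model, an action applied to as_single_object would run once on the whole list, while an action applied to as_individual_items would run once per element.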

Actions

Each action has the following attributes:

Name
Name of the input for the action. The first action must have the name webPage.
Type
Type of the action to be performed. The possible action types are listed below.
Data
Configuration parameter whose meaning depends on the selected action type.
Output
Name of the output of the current action, to be used as the input name for other actions.
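The way actions chain through their Name and Output attributes can be sketched as plain data. The Action class, the selector, the predicate IRI, and the intermediate names below are all hypothetical, and the empty strings for unused Data and Output values are an assumption rather than documented behavior. The helper only checks that every input name is either webPage or the Output of some action.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str    # name of the input this action reads
    type: str    # action type, e.g. Query, Unlist, Text, Output
    data: str    # type-specific parameter (selector, IRI, ...)
    output: str  # name under which the result is published


# Hypothetical action list: select list items, split them, emit their text.
ACTIONS = [
    Action("webPage", "Query", "ul.items > li", "items"),
    Action("items", "Unlist", "", "item"),
    Action("item", "Text", "", "itemText"),
    Action("itemText", "Output", "http://example.com/ns#label", ""),
]


def undefined_inputs(actions):
    """Return input names that no action produces (webPage is predefined)."""
    produced = {"webPage"} | {a.output for a in actions if a.output}
    return [a.name for a in actions if a.name not in produced]
```

Calling undefined_inputs(ACTIONS) returns an empty list for the configuration above. An action whose Name is misspelled would show up in the result.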

Action types

Query
Applies the CSS selector specified in Data to the input specified by Name. The result can be either a single HTML element or a list of HTML elements.
Unlist
Turns a list of HTML elements held as a single object into a list of individual HTML elements.
Text
Accesses the text content of the input HTML elements and produces Text output.
HTML
Accesses the HTML element and outputs it as Text.
Attribute
Accesses the attribute of the input HTML element(s) and outputs it as Text(s).
Subject
For each item in the input, creates an RDF resource and connects it to the current parent RDF resource.
Class
Generates a type triple for the current RDF resource, with the IRI in Data as the object.
Output
Outputs the input texts as literals connected to the current RDF resource using the predicate IRI specified in Data.
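The triples that the Subject, Class and Output actions describe can be sketched as N-Triples lines. This is an illustration under assumptions: the function names and all example.com IRIs are hypothetical, and how the component actually mints resource IRIs is not stated here.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text):
    """Format a plain string literal, escaping N-Triples special characters."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def subject_triple(parent, has_predicate, child):
    """Subject: connect a new resource to its parent resource."""
    return f"{iri(parent)} {iri(has_predicate)} {iri(child)} ."


def class_triple(resource, class_iri):
    """Class: give the current resource a type."""
    return f"{iri(resource)} {iri(RDF_TYPE)} {iri(class_iri)} ."


def output_triple(resource, predicate, text):
    """Output: attach a text as a literal to the current resource."""
    return f"{iri(resource)} {iri(predicate)} {literal(text)} ."


# Hypothetical IRIs, for illustration only.
PAGE = "http://example.com/page"
ITEM = "http://example.com/page/item/1"
HAS = "http://example.com/ns#has"

triples = [
    subject_triple(PAGE, HAS, ITEM),
    class_triple(ITEM, "http://example.com/ns#Item"),
    output_triple(ITEM, "http://example.com/ns#title", 'Say "hi"'),
]
```

Here HAS plays the role of the Default has predicate, and PAGE stands for the root resource typed with Class of root subject.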