Transformer that enables users to parse HTML pages using jsoup CSS-like selectors.
- Class of root subject: IRI of the root output resource type
- Default has predicate: Default predicate to be used to connect nested objects
- Generate source info: Adds information about the source file into the output data
- RDF single graph
- Look in pipeline
- Sample pipeline
The main parsing configuration happens in the Actions list.
The actions are performed on each input HTML document sequentially, starting with the action named webPage. They produce tree-like RDF data whose resources are connected using the Default has predicate and whose root is an entity of type Class of root subject, representing the processed web page.
The inputs and outputs are named groups of either HTML elements, lists of HTML elements or texts.
Note that there is a difference between a list of HTML elements as a single object, and a list of objects, which are HTML elements.
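To pass from the first to the second, one needs to use the action type that creates a list of individual HTML elements from a list of HTML elements held as a single object.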
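The difference between the two kinds of lists can be pictured with a small Python sketch. It is only an illustrative model, not the component's implementation: the `unlist` helper, the representation of a named group as a Python list, and the strings standing in for HTML elements are all assumptions made for the example.

```python
# Illustrative model only: a named group is a Python list of objects.
# Plain strings stand in for HTML elements.

# A group holding ONE object, which is itself a list of three elements.
list_as_single_object = [["li-1", "li-2", "li-3"]]


def unlist(group):
    """Hypothetical helper: expand each list-valued object into individual objects."""
    result = []
    for obj in group:
        result.extend(obj)
    return result


# A group holding THREE objects, each of them a single element.
list_of_objects = unlist(list_as_single_object)

print(len(list_as_single_object), len(list_of_objects))
```

An action that works per object sees one item in the first group and three items in the second, which is why the conversion step matters before per-element processing.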
Each action has the following attributes:
- Name of the input for the action (the Name attribute). The first action needs to have the name webPage.
- Type of the action to be performed. The list of possible action types is below.
- Configuration parameter (the Data attribute). Its meaning depends on the selected action type.
- Name of the output from the current action. It is used as the input name for other actions.
- Applies the CSS selector specified in Data on the input specified by Name. The result can be either a single HTML element or a list of HTML elements.
- Creates a list of individual HTML elements from a list of HTML elements as a single object
- Accesses the text content of HTML elements, produces Text output
- Accesses the HTML element and outputs it as Text
- Accesses the attribute of the input HTML element(s) and outputs it as Text(s)
- For each item on the input creates an RDF resource and connects it to the current parent RDF resource
- Generates a type triple for the current RDF resource, with the IRI in Data as the object
- Outputs the input texts as literals connected to the current RDF resource using the predicate IRI specified in Data
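How a sequence of actions turns a page into tree-like RDF can be approximated with the following Python sketch. It is a rough model under several assumptions, not the component's code: the action type names (`query`, `unlist`, `text`, `subject`, `subjectClass`, `predicate`) are hypothetical labels chosen for the example, `xml.etree.ElementTree` path expressions are swapped in for jsoup CSS selectors, the input is assumed to be well-formed XHTML, and the `example.org` IRIs and blank node labels are made up.

```python
import itertools
import xml.etree.ElementTree as ET

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def run_actions(html, actions, root_class, has_predicate):
    """Apply actions in order; each group item is a (value, current RDF resource) pair."""
    ids = itertools.count()

    def new_node():
        return f"_:b{next(ids)}"

    root = new_node()
    triples = [(root, RDF_TYPE, root_class)]
    # The first input group is always named webPage.
    groups = {"webPage": [(ET.fromstring(html), root)]}
    for name, kind, data, output in actions:
        out = []
        for value, subject in groups[name]:
            if kind == "query":
                # Stand-in for a CSS selector; always yields a list as a single object.
                out.append((value.findall(data), subject))
            elif kind == "unlist":
                out.extend((item, subject) for item in value)
            elif kind == "text":
                out.append(("".join(value.itertext()).strip(), subject))
            elif kind == "subject":
                node = new_node()
                triples.append((subject, has_predicate, node))
                out.append((value, node))
            elif kind == "subjectClass":
                triples.append((subject, RDF_TYPE, data))
            elif kind == "predicate":
                triples.append((subject, data, value))
        if output:
            groups[output] = out
    return triples


HTML = """<html><body>
<div class="item"><a href="/a">First</a></div>
<div class="item"><a href="/b">Second</a></div>
</body></html>"""

# Each action: (Name, Type, Data, Output). All values are hypothetical.
ACTIONS = [
    ("webPage", "query", ".//div[@class='item']", "items"),
    ("items", "unlist", "", "item"),
    ("item", "subject", "", "itemNode"),
    ("itemNode", "subjectClass", "http://example.org/Item", ""),
    ("itemNode", "query", "a", "links"),
    ("links", "unlist", "", "link"),
    ("link", "text", "", "title"),
    ("title", "predicate", "http://example.org/title", ""),
]

triples = run_actions(HTML, ACTIONS, "http://example.org/WebPage", "http://example.org/has")
for triple in triples:
    print(triple)
```

In this model every item carries the RDF resource it currently belongs to, so literals produced late in the chain attach to the resource created for their own element rather than to the page root.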