HTML CSS ODCS
Transformer that enables users to parse HTML pages using jsoup CSS-like selectors.
- Class of root subject
- IRI of the root output resource type
- Default has predicate
- Default predicate to be used to connect nested objects
- Generate source info
- Adds information about the source file into the output data
Characteristics
- ID
- t-htmlcssuv
- Type
- transformer
- Inputs
- Files
- Outputs
- RDF single graph
- Look in pipeline
- Sample pipeline
- available
The main parsing configuration happens in the Actions list.
The actions are applied to each input HTML document sequentially, starting with the action named webPage. They produce tree-like RDF data whose resources are connected using the Default has predicate, rooted in an entity of type Class of root subject that represents the processed web page.
The inputs and outputs are named groups of either HTML elements, lists of HTML elements or texts.
Note that there is a difference between a list of HTML elements as a single object, and a list of objects, which are HTML elements.
To pass from the first to the second, use the Unlist action.
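The distinction can be illustrated with a small Python sketch. It does not use the component or jsoup: a named group is modelled as a plain Python list of objects, the HTML elements are stand-in strings, and the function name unlist is hypothetical, chosen only to mirror the Unlist action.

```python
# Stand-ins for three HTML elements matched by a selector.
elements = ["<li>a</li>", "<li>b</li>", "<li>c</li>"]

# A group holding ONE object, which is itself a list of elements.
as_single_object = [elements]


def unlist(group):
    """Turn every list-object in the group into individual objects."""
    return [element for list_object in group for element in list_object]


# A group holding THREE objects, each a single element.
as_individual_objects = unlist(as_single_object)

print(len(as_single_object))       # number of objects before unlisting
print(len(as_individual_objects))  # number of objects after unlisting
```

Under this model, an action that runs once per object would run once on the first group and three times on the second.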
Actions
Each action has the following attributes:
- Name
- Name of the input for the action. The first action must have the name webPage
- Type
- Type of the action to be performed. The list of possible action types is below
- Data
- Configuration parameter, depends on the selected action type
- Output
- Name of the output of the current action, to be used as the input name of other actions
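As an illustration of how Name and Output chain actions together, the four attributes can be modelled as a small record. This is a sketch only: the Action class, the helper unresolved_inputs, the selector, the group names itemList, item and itemText, and the empty Data value for Unlist and Text are all hypothetical and do not reflect the component's actual configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str    # name of the input group the action reads
    type: str    # action type, e.g. "Query"
    data: str    # type-specific parameter (selector, IRI, ...)
    output: str  # name of the group the action produces


# Hypothetical chain: select list items, split them, read their text.
actions = [
    Action("webPage", "Query", "ul.items > li", "itemList"),
    Action("itemList", "Unlist", "", "item"),
    Action("item", "Text", "", "itemText"),
]


def unresolved_inputs(actions):
    """Return input names that are neither webPage nor any action's output."""
    known = {"webPage"} | {action.output for action in actions}
    return [action.name for action in actions if action.name not in known]


print(unresolved_inputs(actions))
```

A check like this catches a misspelled input name before the pipeline runs; an empty result means every action reads either webPage or a group produced by another action.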
Action types
- Query
- Applies the CSS selector specified in Data to the input specified by Name. The result can be either a single HTML element or a list of HTML elements
- Unlist
- Creates a list of individual HTML elements from a list of HTML elements as a single object
- Text
- Accesses the text content of HTML elements, produces Text output
- HTML
- Accesses the HTML element and outputs it as Text
- Attribute
- Accesses the attribute of the input HTML element(s) and outputs it as Text(s)
- Subject
- For each item in the input, creates an RDF resource and connects it to the current parent RDF resource
- Class
- Generates a type triple for the current RDF resource, with the IRI in Data as the object
- Output
- Outputs the input texts as literals connected to the current RDF resource using the predicate IRI specified in Data
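The shape of the RDF produced by the Subject, Class and Output actions can be sketched in Python with triples as plain tuples. This is an illustrative model, not the component's implementation: all example.com IRIs, the function names and the way resource IRIs are minted (base plus index) are assumptions, and literals are represented as bare Python strings.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def subject(parent, items, has_predicate, base):
    """Create one resource per input item and link it from the parent."""
    resources, triples = [], []
    for index, _ in enumerate(items):
        resource = f"{base}/{index}"  # assumed IRI minting scheme
        resources.append(resource)
        triples.append((parent, has_predicate, resource))
    return resources, triples


def rdf_class(resource, class_iri):
    """Type triple for the current resource (Class action)."""
    return [(resource, RDF_TYPE, class_iri)]


def output(resource, texts, predicate):
    """One literal per input text on the current resource (Output action)."""
    return [(resource, predicate, text) for text in texts]


# Hypothetical values standing in for the component configuration.
root = "http://example.com/page"
HAS = "http://example.com/ontology/has"  # plays the Default has predicate
names = ["Alice", "Bob"]                 # texts extracted earlier

people, triples = subject(root, names, HAS, root + "/person")
for person, name in zip(people, names):
    triples += rdf_class(person, "http://example.com/ontology/Person")
    triples += output(person, [name], "http://example.com/ontology/name")

for triple in triples:
    print(triple)
```

Each input item yields three triples here: a link from the parent, a type triple, and a literal, which gives the tree-like structure described earlier.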