Inscriptis module documentation

Parse HTML content and convert it into a text representation.

Inscriptis provides support for

  • nested HTML tables

  • basic Cascading Style Sheets

  • annotations

The following example provides the text representation of https://www.fhgr.ch.

import urllib.request
from inscriptis import get_text

url = 'https://www.fhgr.ch'
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)

print(text)

Use the method get_annotated_text() to obtain text and annotations. The method requires annotation rules as described in annotations.

import urllib.request
from inscriptis import get_annotated_text
from inscriptis.model.config import ParserConfig

url = "https://www.fhgr.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

# annotation rules specify the HTML elements and attributes to annotate.
rules = {'h1': ['heading'],
         'h2': ['heading'],
         '#class=FactBox': ['fact-box'],
         'i': ['emphasis']}

output = get_annotated_text(html, ParserConfig(annotation_rules=rules))
print("Text:", output['text'])
print("Annotations:", output['label'])

The method returns a dictionary with two keys:

  1. text which contains the page’s plain text and

  2. label with the annotations in JSONL format that is used by annotators such as doccano.

Annotations in the label field are returned as a list of triples with start index, end index and label as indicated below:

{"text": "Chur\n\nChur is the capital and largest town of the Swiss canton
          of the Grisons and lies in the Grisonian Rhine Valley.",
 "label": [[0, 4, "heading"], [6, 10, "emphasis"]]}

inscriptis.get_annotated_text(html_content: str, config: ParserConfig = None) → Dict[str, Any]

Return a dictionary of the extracted text and annotations.

Notes

  • the text is stored under the key ‘text’.

  • annotations are provided under the key ‘label’ which contains a list of Annotations.

Examples

{"text": "EU rejects German call to boycott British lamb.",
 "label": [ [0, 2, "strong"], … ]}

{"text": "Peter Blackburn",
 "label": [ [0, 15, "heading"] ]}

Returns:

A dictionary of text (key: ‘text’) and annotations (key: ‘label’).
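
Because the returned dictionary already uses the text/label layout, a JSONL file can be produced by writing one JSON object per line. The following is a minimal sketch that uses only the standard library; the helper name to_jsonl is hypothetical and not part of Inscriptis, and the records are the two examples above (only the first annotation of the first example is reproduced).

```python
import json


def to_jsonl(records):
    """Serialize an iterable of {'text': ..., 'label': ...} dicts to JSONL."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


records = [
    {"text": "EU rejects German call to boycott British lamb.",
     "label": [[0, 2, "strong"]]},
    {"text": "Peter Blackburn", "label": [[0, 15, "heading"]]},
]

# One JSON document per line, as expected by JSONL consumers.
jsonl = to_jsonl(records)
print(jsonl)
```

In practice the records would come from get_annotated_text() rather than being written by hand.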

inscriptis.get_text(html_content: str, config: ParserConfig = None) → str

Provide a text representation of the given HTML content.

Parameters:
  • html_content (str) – The HTML content to convert.

  • config – An optional ParserConfig object.

Returns:

The text representation of the HTML content.

Inscriptis model

Inscriptis HTML engine

The HTML Engine is responsible for converting HTML to text.

class inscriptis.html_engine.Inscriptis(html_tree: HtmlElement, config: ParserConfig = None)

Translate an lxml HTML tree to the corresponding text representation.

Parameters:
  • html_tree – the lxml HTML tree to convert.

  • config – an optional ParserConfig configuration object.

Example:

from lxml.html import fromstring
from inscriptis.html_engine import Inscriptis

html_content = "<html><body><h1>Test</h1></body></html>"

# create an HTML tree from the HTML content.
html_tree = fromstring(html_content)

# transform the HTML tree to text.
parser = Inscriptis(html_tree)
text = parser.get_text()

get_annotations() → List[Annotation]

Return the annotations extracted from the HTML page.

get_text() → str

Return the text extracted from the HTML page.

Inscriptis HTML properties

Provide properties used for rendering HTML pages.

Supported attributes:
  1. Display properties.

  2. WhiteSpace properties.

  3. HorizontalAlignment properties.

  4. VerticalAlignment properties.

class inscriptis.html_properties.Display(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Specify whether content will be rendered as inline, block or none.

Note

A display attribute of none indicates that the content should not be rendered at all.

class inscriptis.html_properties.HorizontalAlignment(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Specify the content’s horizontal alignment.

center = '^'

Center the block’s content.

left = '<'

Left alignment of the block’s content.

right = '>'

Right alignment of the block’s content.
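
The member values coincide with the alignment characters of Python's format specification mini-language, so they can be passed directly to format(). The sketch below declares a stand-in enum (AlignSketch, a hypothetical name) instead of importing Inscriptis and shows one way such values could be used to pad a line; it is not necessarily how the library applies them internally.

```python
from enum import Enum


class AlignSketch(Enum):
    """Stand-in mirroring the values documented above."""
    left = "<"
    right = ">"
    center = "^"


def align_line(text: str, width: int, align: AlignSketch) -> str:
    # Build a format spec such as '^8' and let str formatting do the padding.
    return format(text, f"{align.value}{width}")


for align in AlignSketch:
    print(align.name, repr(align_line("cell", 8, align)))
```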

class inscriptis.html_properties.VerticalAlignment(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Specify the content’s vertical alignment.

bottom = 3

Align all content at the bottom.

middle = 2

Align all content in the middle.

top = 1

Align all content at the top.

class inscriptis.html_properties.WhiteSpace(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Specify the HTML element’s whitespace handling.

Inscriptis supports the following handling strategies outlined in the Cascading Style Sheets specification.

normal = 1

Collapse multiple whitespaces into a single one.

pre = 3

Preserve sequences of whitespaces.
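
The difference between the two strategies can be illustrated with a small stand-alone sketch. WhiteSpaceSketch and apply_whitespace are hypothetical names used for illustration only; the regular expression is an assumption about what "collapsing" means and is not taken from the Inscriptis source.

```python
import re
from enum import Enum


class WhiteSpaceSketch(Enum):
    """Stand-in mirroring the values documented above."""
    normal = 1
    pre = 3


def apply_whitespace(text: str, mode: WhiteSpaceSketch) -> str:
    if mode is WhiteSpaceSketch.pre:
        # pre: keep spaces, tabs and newlines exactly as written.
        return text
    # normal: collapse every run of whitespace into a single space.
    return re.sub(r"\s+", " ", text)


sample = "first   line\n    second line"
print(repr(apply_whitespace(sample, WhiteSpaceSketch.normal)))
print(repr(apply_whitespace(sample, WhiteSpaceSketch.pre)))
```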

Inscriptis CSS model

Implement basic CSS support for inscriptis.

  • The HtmlElement class encapsulates all CSS properties of a single HTML element.

  • CssParse parses CSS specifications and translates them into the corresponding HtmlElements used by Inscriptis for rendering HTML pages.

class inscriptis.model.css.CssParse

Parse CSS specifications and apply them to HtmlElements.

The attribute display: none, for instance, is translated to HtmlElement.display=Display.none.

static attr_display(value: str, html_element: HtmlElement)

Apply the given display value.

static attr_horizontal_align(value: str, html_element: HtmlElement)

Apply the provided horizontal alignment.

static attr_margin_after(value: str, html_element: HtmlElement)

Apply the provided bottom margin.

static attr_margin_before(value: str, html_element: HtmlElement)

Apply the given top margin.

static attr_margin_bottom(value: str, html_element: HtmlElement)

Apply the provided bottom margin.

static attr_margin_top(value: str, html_element: HtmlElement)

Apply the given top margin.

static attr_padding_left(value: str, html_element: HtmlElement)

Apply the given left padding.

static attr_padding_start(value: str, html_element: HtmlElement)

Apply the given left padding.

static attr_style(style_attribute: str, html_element: HtmlElement)

Apply the provided style attributes to the given HtmlElement.

Parameters:
  • style_attribute – The attribute value of the given style sheet. Example: display: none

  • html_element – The HtmlElement to which the given style is applied.
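
To illustrate how a style attribute could be broken down and mapped to the attr_* handlers listed in this class, here is a minimal stand-alone sketch. The helpers parse_style and handler_name are hypothetical; the sketch only assumes that handler names are derived from CSS property names by replacing hyphens with underscores, as the method names above suggest, and does not reproduce the library's actual parsing code.

```python
def parse_style(style_attribute: str) -> dict:
    """Split 'display: none; white-space: pre' into a property dict."""
    properties = {}
    for declaration in style_attribute.split(";"):
        if ":" not in declaration:
            continue  # skip empty or malformed declarations
        name, value = declaration.split(":", 1)
        properties[name.strip().lower()] = value.strip().lower()
    return properties


def handler_name(css_property: str) -> str:
    # 'white-space' -> 'attr_white_space', matching the naming used above.
    return "attr_" + css_property.replace("-", "_")


style = parse_style("display: none; white-space: pre;")
print(style)
print([handler_name(name) for name in style])
```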

static attr_vertical_align(value: str, html_element: HtmlElement)

Apply the given vertical alignment.

static attr_white_space(value: str, html_element: HtmlElement)

Apply the given white-space value.

Inscriptis canvas model

Classes used for rendering (parts of) the canvas.

Every parsed HtmlElement writes its textual content to the canvas which is managed by the following three classes:

  • Canvas provides the drawing board on which the HTML page is serialized and annotations are recorded.

  • Block contains the current line to which text is written.

  • Prefix handles indentation and bullets that prefix a line.

class inscriptis.model.canvas.Canvas

The text Canvas on which Inscriptis writes the HTML page.

margin

the current margin to the previous block (this is required to ensure that the margin_after and margin_before constraints of HTML block elements are met).

current_block

A Block which merges the input text into a block (i.e., line).

blocks

a list of strings containing the completed blocks (i.e., text lines). Each block spans at least one line.

annotations

the list of recorded Annotations.

_open_annotations

a map of open tags that contain annotations.

close_block(tag: HtmlElement) → None

Close the given HtmlElement by writing its bottom margin.

Parameters:

tag – the HTML Block element to close

close_tag(tag: HtmlElement) → None

Register that the given tag is closed.

Parameters:

tag – the tag to close.

flush_inline() → bool

Attempt to flush the content in self.current_block into a new block.

Notes

  • If self.current_block does not contain any content (or only whitespaces) no changes are made.

  • Otherwise the content of current_block is added to blocks and a new current_block is initialized.

Returns:

True if the attempt was successful, False otherwise.
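
The flushing behavior described in the notes can be sketched with a toy canvas. CanvasSketch is a hypothetical, heavily simplified stand-in: it stores the current block as a plain string and ignores margins, prefixes and annotations, all of which the real Canvas handles.

```python
class CanvasSketch:
    """Toy canvas: a list of finished blocks plus the block being written."""

    def __init__(self):
        self.blocks = []          # completed text lines
        self.current_block = ""   # content collected for the current line

    def write(self, text: str) -> None:
        self.current_block += text

    def flush_inline(self) -> bool:
        # Nothing (or only whitespace) collected: leave the canvas untouched.
        if not self.current_block.strip():
            return False
        # Otherwise store the content and start a fresh block.
        self.blocks.append(self.current_block)
        self.current_block = ""
        return True


canvas = CanvasSketch()
print(canvas.flush_inline())
canvas.write("Hello world")
print(canvas.flush_inline(), canvas.blocks)
```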

get_text() → str

Provide a text representation of the Canvas.

property left_margin: int

Return the length of the current line’s left margin.

open_block(tag: HtmlElement) → None

Open an HTML block element.

open_tag(tag: HtmlElement) → None

Register that a tag is opened.

Parameters:

tag – the tag to open.

write(tag: HtmlElement, text: str, whitespace: WhiteSpace = None) → None

Write the given text to the current block.

write_unconsumed_bullet() → None

Write unconsumed bullets to the blocks list.

Representation of a text block within the HTML canvas.

class inscriptis.model.canvas.block.Block(idx: int, prefix: Prefix)

The current block of text.

A block usually refers to one line of output text.

Note

If pre-formatted content is merged with a block, it may also contain multiple lines.

Parameters:
  • idx – the current block’s start index.

  • prefix – prefix used within the current block.

merge(text: str, whitespace: WhiteSpace) → None

Merge the given text with the current block.

Parameters:
  • text – the text to merge.

  • whitespace – whitespace handling.

merge_normal_text(text: str) → None

Merge the given text with the current block.

Parameters:

text – the text to merge

Note

If the previous text ended with a whitespace and text starts with one, both will automatically collapse into a single whitespace.
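
The collapsing rule from the note can be sketched as a pure function. This is an illustration under simplifying assumptions, not the library's code: the block content is modeled as a plain string, the function name and signature are hypothetical, and leading whitespace at the very start of a block is dropped.

```python
def merge_normal_text(content: str, text: str) -> str:
    """Append text to content, collapsing whitespace as in 'normal' mode."""
    # Collapse internal runs of whitespace and strip both ends.
    normalized = " ".join(text.split())
    if not normalized:
        # Whitespace-only input contributes at most one separating space.
        if content and not content.endswith(" "):
            return content + " "
        return content
    # Keep one leading space unless the content already ends with one
    # (this is where two adjacent whitespaces collapse into a single one).
    if text[0].isspace() and content and not content.endswith(" "):
        normalized = " " + normalized
    # Keep one trailing space so that following text stays separated.
    if text[-1].isspace():
        normalized += " "
    return content + normalized


print(repr(merge_normal_text("Hello ", " world")))
print(repr(merge_normal_text("Hello", "   world  ")))
```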

merge_pre_text(text: str) → None

Merge the given pre-formatted text with the current block.

Parameters:

text – the text to merge

new_block() → Block

Return a new Block based on the current one.

Manage the horizontal prefix (left-indentation, bullets) of canvas lines.

class inscriptis.model.canvas.prefix.Prefix

Class Prefix manages paddings and bullets that prefix an HTML block.

current_padding

the number of characters used for the current left-indentation.

paddings

the list of paddings for the current and all previous tags.

bullets

the list of bullets in the current and all previous tags.

consumed

whether the current bullet has already been consumed.

property first: str

Return the prefix used at the beginning of a tag.

Note

A new block needs to be prefixed by the current padding and bullet. Once this has happened (i.e., consumed is set to True) no further prefixes should be used for a line.

pop_next_bullet() → str

Pop the next bullet to use, if any bullet is available.

register_prefix(padding_inline: int, bullet: str) → None

Register the given prefix.

Parameters:
  • padding_inline – the number of characters used for padding.

  • bullet – an optional bullet.

remove_last_prefix() → None

Remove the last prefix from the list.

property rest: str

Return the prefix used for new lines within a block.

This prefix is used for pre-text that contains newlines. The lines need to be prefixed with the right padding to preserve the indentation.
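
The interplay of the two prefixes can be sketched as follows. The helper prefix_pre_text is hypothetical, and the way the bullet is placed inside the padding (right-aligned, directly before the text) is an assumption made for this illustration rather than a statement about the library's exact layout.

```python
def prefix_pre_text(text: str, first: str, rest: str) -> str:
    """Prefix the first line with `first` and all following lines with `rest`."""
    lines = text.split("\n")
    return "\n".join([first + lines[0]] + [rest + line for line in lines[1:]])


padding = 4
bullet = "* "
# Assumption: the bullet occupies the last characters of the padding on the
# first line; continuation lines are indented by plain spaces.
first = " " * (padding - len(bullet)) + bullet
rest = " " * padding

rendered = prefix_pre_text("def f():\n    return 1", first, rest)
print(rendered)
```

Using the same width for both prefixes keeps continuation lines aligned with the text that follows the bullet.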

property unconsumed_bullet: str

Yield any yet unconsumed bullet.

Note

This function yields the previous element’s bullets, if they have not been consumed yet.

Inscriptis table model

Classes used for representing Tables, TableRows and TableCells.

class inscriptis.model.table.Table(left_margin_len: int, cell_separator: str)

An HTML table.

rows

the table’s rows.

left_margin_len

length of the left margin before the table.

cell_separator

string used for separating cells from each other.

add_cell(table_cell: TableCell)

Add a new TableCell to the table’s last row.

Note

If no row exists yet, a new row is created.

add_row()

Add an empty TableRow to the table.

get_annotations(idx: int, left_margin_len: int) → List[Annotation]

Return all annotations in the given table.

Parameters:
  • idx – the table’s start index.

  • left_margin_len – length of the left margin (required for adapting the position of annotations).

Returns:

A list of all Annotations present in the table.

get_text() → str

Render and return the text of the given table.

class inscriptis.model.table.TableCell(align: HorizontalAlignment, valign: VerticalAlignment)

A table cell.

line_width

the original line widths per line (required to adjust annotations after a reformatting)

vertical_padding

vertical padding that has been introduced due to vertical formatting rules.

get_annotations(idx: int, row_width: int) → List[Annotation]

Return a list of all annotations within the TableCell.

Returns:

A list of annotations that have been adjusted to the cell’s position.

property height: int

Compute the table cell’s height.

Returns:

The cell’s current height.

normalize_blocks() → int

Split multi-line blocks into multiple one-line blocks.

Returns:

The height of the normalized cell.
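
The normalization step can be sketched as a stand-alone function. Unlike the method above, which works on the cell itself and returns only the height, this hypothetical helper takes the blocks as an argument and returns both the new list and its length.

```python
def normalize_blocks(blocks):
    """Split multi-line blocks so that every block holds exactly one line."""
    normalized = [line for block in blocks for line in block.split("\n")]
    return normalized, len(normalized)


blocks = ["name", "first line\nsecond line"]
normalized, height = normalize_blocks(blocks)
print(normalized, height)
```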

property width: int

Compute the table cell’s width.

Returns:

The cell’s current width.

class inscriptis.model.table.TableRow(cell_separator: str)

A single row within a table.

columns

the table row’s columns.

cell_separator

string used for separating columns from each other.

get_text() → str

Return a text representation of the TableRow.

property width: int

Compute and return the width of the current row.
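
As a rough illustration of how a row of multi-line cells can be serialized with a cell separator, consider the sketch below. The function render_row and its layout decisions (left alignment, top alignment, stripping trailing spaces) are assumptions made for this example; the actual rendering in Inscriptis honors the cells' alignment settings and may differ in detail.

```python
from itertools import zip_longest


def render_row(cells, cell_separator="  "):
    """Render cells (each a list of lines) side by side."""
    widths = [max((len(line) for line in cell), default=0) for cell in cells]
    rendered = []
    # zip_longest pads shorter cells with empty lines (top alignment).
    for lines in zip_longest(*cells, fillvalue=""):
        padded = [line.ljust(width) for line, width in zip(lines, widths)]
        rendered.append(cell_separator.join(padded).rstrip())
    return "\n".join(rendered)


row = [["Chur", "Grisons"], ["capital"]]
print(render_row(row, cell_separator=" | "))
```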

Inscriptis annotations

The model used for saving annotations.

class inscriptis.annotation.Annotation(start: int, end: int, metadata: str)

An Inscriptis annotation which provides metadata on the extracted text.

The start and end indices indicate the span of the text to which the metadata refers, and the attribute metadata contains the tuple of tags describing this span.

Example:

Annotation(0, 10, ('heading', ))

The annotation above indicates that the text span between the 1st (index 0) and 11th (index 10) character of the extracted text contains a heading.

end: int

the annotation’s end index within the text output.

metadata: str

the tag to be attached to the annotation.

start: int

the annotation’s start index within the text output.

inscriptis.annotation.horizontal_shift(annotations: List[Annotation], content_width: int, line_width: int, align: HorizontalAlignment, shift: int = 0) → List[Annotation]

Shift annotations based on the given line’s formatting.

Adjusts the start and end indices of annotations based on the line’s formatting and width.

Parameters:
  • annotations – a list of Annotations.

  • content_width – the width of the actual content

  • line_width – the width of the line in which the content is placed.

  • align – the horizontal alignment (left, right, center) to assume for the adjustment

  • shift – an optional additional shift

Returns:

A list of Annotations with the adjusted start and end positions.
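
The adjustment can be sketched as follows. AnnotationSketch and horizontal_shift_sketch are hypothetical stand-ins that avoid importing Inscriptis; the alignment is passed as the plain characters '<', '^' and '>', and the rounding used for centered content (integer division) is an assumption of this sketch.

```python
from typing import List, NamedTuple, Tuple


class AnnotationSketch(NamedTuple):
    """Stand-in for the Annotation class documented above."""
    start: int
    end: int
    metadata: Tuple[str, ...]


def horizontal_shift_sketch(annotations: List[AnnotationSketch],
                            content_width: int, line_width: int,
                            align: str, shift: int = 0) -> List[AnnotationSketch]:
    free_space = line_width - content_width
    if align == "<":      # left: content starts right after the shift
        offset = shift
    elif align == ">":    # right: all free space precedes the content
        offset = shift + free_space
    else:                 # center: half of the free space precedes it
        offset = shift + free_space // 2
    return [AnnotationSketch(a.start + offset, a.end + offset, a.metadata)
            for a in annotations]


heading = [AnnotationSketch(0, 4, ("heading",))]
for align in "<", "^", ">":
    print(align, horizontal_shift_sketch(heading, content_width=4,
                                         line_width=10, align=align))
```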

Annotation processors

AnnotationProcessors transform annotations to an output format.

All AnnotationProcessors implement the AnnotationProcessor interface by overriding the class’s AnnotationProcessor.__call__() method.

Note

  1. The AnnotationExtractor class must be put into a package with the extractor’s name (e.g., inscriptis.annotation.output.*package*) and be named *PackageExtractor* (see the examples below).

  2. The overridden __call__() method may either extend the original dictionary which contains the extracted text and annotations (e.g., SurfaceExtractor) or may replace it with a custom output (e.g., HtmlExtractor and XmlExtractor).

Currently, Inscriptis supports the following built-in AnnotationProcessors:

  1. HtmlExtractor provides an annotated HTML output format.

  2. XmlExtractor yields an output which marks annotations with XML tags.

  3. SurfaceExtractor adds the key surface to the result dictionary which contains the surface forms of the extracted annotations.

class inscriptis.annotation.output.AnnotationProcessor

An AnnotationProcessor is called for formatting annotations.
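
The callable interface can be sketched without importing Inscriptis. SurfaceSketch is a hypothetical class that imitates the "extend the original dictionary" variant described above by adding a surface key; the exact structure of the entries produced by the built-in SurfaceExtractor is not documented here, so the (label, surface form) pairs are an assumption.

```python
from typing import Any, Dict


class SurfaceSketch:
    """Hypothetical processor that adds surface forms to the result."""

    def __call__(self, annotated_text: Dict[str, Any]) -> Dict[str, Any]:
        text = annotated_text["text"]
        # Extend (rather than replace) the original dictionary.
        annotated_text["surface"] = [
            (label, text[start:end])
            for start, end, label in annotated_text["label"]
        ]
        return annotated_text


result = SurfaceSketch()({"text": "Peter Blackburn",
                          "label": [[0, 15, "heading"]]})
print(result["surface"])
```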