# biome.text.tokenizer Module

# Tokenizer Class


```python
class Tokenizer(config: TokenizerConfiguration)
```

Pre-processes and tokenizes the input text.

Transforms inputs (e.g., a text, a list of texts) into structures containing allennlp.data.Token objects.

Use its arguments to configure the first stage of the pipeline (i.e., pre-processing a given set of text inputs).

Use its methods to tokenize the inputs depending on their shape (e.g., records with multiple fields, lists of sentences).

Parameters

config
A TokenizerConfiguration object
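
A minimal construction sketch follows. The import path for TokenizerConfiguration and the `segment_sentences` keyword are assumptions inferred from this page, not confirmed API.

```python
# Sketch: constructing a Tokenizer.
# Assumptions: TokenizerConfiguration lives in biome.text.configuration
# and exposes a segment_sentences flag; neither is confirmed by this page.
from biome.text.configuration import TokenizerConfiguration
from biome.text.tokenizer import Tokenizer

config = TokenizerConfiguration(segment_sentences=True)  # assumed kwarg
tokenizer = Tokenizer(config)
```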

# tokenize_text Method


```python
def tokenize_text(
    self,
    text: str,
) -> List[List[allennlp.data.tokenizers.token.Token]]
```

Tokenizes a text string, applying sentence segmentation if enabled.

Parameters

text : str
The input text

Returns

A list of lists of Token. If sentence segmentation is disabled, or only one sentence is found in the text, the first-level list will contain a single element: the tokenized text.
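
A usage sketch, reusing the construction above (the TokenizerConfiguration import path is an assumption; `.text` is the standard attribute of an allennlp Token):

```python
from biome.text.configuration import TokenizerConfiguration  # assumed path
from biome.text.tokenizer import Tokenizer

tokenizer = Tokenizer(TokenizerConfiguration())
sentences = tokenizer.tokenize_text("Berlin is a city. It is the capital of Germany.")

# With sentence segmentation enabled, each inner list is one tokenized
# sentence; otherwise the outer list holds a single tokenized text.
for sentence in sentences:
    print([token.text for token in sentence])
```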

# tokenize_document Method


```python
def tokenize_document(
    self,
    document: List[str],
) -> List[List[allennlp.data.tokenizers.token.Token]]
```

Tokenizes a document-like structure containing a list of text inputs.

Use this to account for hierarchical text structures (e.g., a paragraph is a list of sentences).

Parameters

document : List[str]
A list of text inputs, e.g., sentences.

Returns

tokens : List[List[Token]]
A list of token lists, one per text input in the document.
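
A usage sketch under the same assumptions as above (TokenizerConfiguration import path assumed):

```python
from biome.text.configuration import TokenizerConfiguration  # assumed path
from biome.text.tokenizer import Tokenizer

tokenizer = Tokenizer(TokenizerConfiguration())
document = [
    "The first text input of the document.",
    "A second text input, e.g., another paragraph.",
]
# Each text input in the document is tokenized into its own token list(s)
for token_list in tokenizer.tokenize_document(document):
    print([token.text for token in token_list])
```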

# tokenize_record Method


```python
def tokenize_record(
    self,
    record: Dict[str, Any],
    exclude_record_keys: bool,
) -> List[List[allennlp.data.tokenizers.token.Token]]
```

Tokenizes a record-like structure containing text inputs.

Use this to keep information about the record-like data structure as input features to the model.

Parameters

record : Dict[str, Any]
A Dict with arbitrary "fields" containing text.
exclude_record_keys : bool
If enabled, exclude tokens generated from the record's keys (the field names).

Returns

tokens : List[List[Token]]
A list of tokenized fields, one token list per field.
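
A usage sketch under the same assumptions as the previous examples:

```python
from biome.text.configuration import TokenizerConfiguration  # assumed path
from biome.text.tokenizer import Tokenizer

tokenizer = Tokenizer(TokenizerConfiguration())
record = {"title": "biome.text", "description": "A practical NLP library."}

# exclude_record_keys=True tokenizes only the field values; with False,
# tokens derived from the keys ("title", "description") are kept as well.
for field in tokenizer.tokenize_record(record, exclude_record_keys=True):
    print([token.text for token in field])
```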