# biome.text.tokenizer Module

# Tokenizer Class

class Tokenizer (config: TokenizerConfiguration)

Pre-processes and tokenizes the input text.

Transforms inputs (e.g., a text, a list of texts, etc.) into structures containing allennlp.data.Token objects.

Use its arguments to configure the first stage of the pipeline (i.e., pre-processing a given set of text inputs).

Use its tokenization methods depending on the shape of the inputs (e.g., records with multiple fields, lists of sentences).


config : TokenizerConfiguration
A TokenizerConfiguration object

# tokenize_text Method

def tokenize_text (
  text: str,
)  -> List[List[allennlp.data.tokenizers.token.Token]]

Tokenizes a text string, applying sentence segmentation if enabled.


text : str
The input text


A list of lists of Token objects.

If sentence segmentation is disabled, or only one sentence is found in the text, the outer list will contain a single element: the tokenized text.
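To illustrate the return shape (a sketch only, not the biome.text implementation: the `Token` class here is a toy stand-in for `allennlp.data.tokenizers.token.Token`, and sentence segmentation is naively done on periods):

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    # Toy stand-in for allennlp's Token; only a ``text`` attribute is modeled.
    text: str


def tokenize_text(text: str) -> List[List[Token]]:
    """Sketch: split on '.' for sentences, on whitespace for tokens."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [[Token(w) for w in s.split()] for s in sentences]


# With segmentation, each sentence yields its own inner list:
print(tokenize_text("The cat sat. The dog ran."))
# A single sentence yields an outer list with exactly one element:
print(tokenize_text("Just one sentence"))
```

The key point is the nesting: the outer list indexes sentences, the inner lists hold the tokens of each sentence.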

# tokenize_document Method

def tokenize_document (
  document: List[str],
)  -> List[List[allennlp.data.tokenizers.token.Token]]

Tokenizes a document-like structure containing a list of text inputs.

Use this to account for hierarchical text structures (e.g., a paragraph is a list of sentences).


document : List[str]
A list of text inputs, e.g., sentences.


tokens : List[List[Token]]
The tokenized text inputs, one token list per input.
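A minimal sketch of the document case, assuming each element of the input list is tokenized independently (toy whitespace tokenizer and `Token` stand-in, not the library's actual logic):

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    # Toy stand-in for allennlp's Token.
    text: str


def tokenize_document(document: List[str]) -> List[List[Token]]:
    """Sketch: one token list per text input (e.g., per sentence)."""
    return [[Token(w) for w in text.split()] for text in document]


# A paragraph represented as a list of sentences keeps its hierarchy:
paragraph = ["The cat sat on the mat", "Then it fell asleep"]
print(tokenize_document(paragraph))  # two inner lists, one per sentence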

# tokenize_record Method

def tokenize_record (
  record: Dict[str, Any],
  exclude_record_keys: bool,
)  -> List[List[allennlp.data.tokenizers.token.Token]]

Tokenizes a record-like structure containing text inputs.

Use this to keep information about the record-like data structure as input features to the model.


record : Dict[str, Any]
A Dict with arbitrary "fields" containing text.
exclude_record_keys : bool
If enabled, tokens derived from the record keys are excluded.


tokens : List[List[Token]]
A list of tokenized fields, one token list per field.
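A sketch of the record case under stated assumptions: field values are tokenized independently, and when `exclude_record_keys` is disabled the keys are tokenized as additional inputs (how biome.text actually represents key tokens is an assumption here, as is the toy `Token` stand-in):

```python
from typing import Any, Dict, List, NamedTuple


class Token(NamedTuple):
    # Toy stand-in for allennlp's Token.
    text: str


def tokenize_record(
    record: Dict[str, Any], exclude_record_keys: bool
) -> List[List[Token]]:
    """Sketch: tokenize every field value; optionally also tokenize
    the field keys (assumption about how keys are handled)."""
    token_lists: List[List[Token]] = []
    for key, value in record.items():
        if not exclude_record_keys:
            token_lists.append([Token(w) for w in str(key).split()])
        token_lists.append([Token(w) for w in str(value).split()])
    return token_lists


record = {"title": "Invoice 42", "body": "Total due: 100 EUR"}
print(tokenize_record(record, exclude_record_keys=True))   # 2 token lists
print(tokenize_record(record, exclude_record_keys=False))  # 4 token lists
```

Keeping the keys lets the model see the record structure itself as an input feature; excluding them restricts the input to the field values.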