# biome.text.configuration Module

# FeaturesConfiguration Class

class FeaturesConfiguration (
    word: Union[WordFeatures, NoneType] = None,
    char: Union[CharFeatures, NoneType] = None,
    transformers: Union[TransformersFeatures, NoneType] = None,
)

Configures the input features of the Pipeline

Use this for defining the features to be used by the model, namely word and character embeddings.

:::tip
If you do not pass in either of the parameters (word or char), your pipeline will be set up with a default word feature (embedding_dim=50).
:::


```python
word = WordFeatures(embedding_dim=100)
char = CharFeatures(embedding_dim=16, encoder={'type': 'gru'})
config = FeaturesConfiguration(word, char)
```


word : Optional[WordFeatures]
The word feature configuration, see WordFeatures
char : Optional[CharFeatures]
The character feature configuration, see CharFeatures
transformers : Optional[TransformersFeatures]
The transformers feature configuration, see TransformersFeatures. A word-level representation based on transformer models via AllenNLP.

# Ancestors

  • allennlp.common.from_params.FromParams

# from_params Static method

def from_params (
  params: allennlp.common.params.Params,
)  -> FeaturesConfiguration

This is the automatic implementation of from_params. Any class that subclasses FromParams (or Registrable, which itself subclasses FromParams) gets this implementation for free. If you want your class to be instantiated from params in the "obvious" way – pop off parameters and hand them to your constructor with the same names – this provides that functionality.

If you need more complex logic in your from_params method, you'll have to implement your own method that overrides this one.

The constructor_to_call and constructor_to_inspect arguments deal with a bit of redirection that we do. We allow you to register particular @classmethods on a class as the constructor to use for a registered name. This lets you, e.g., have a single Vocabulary class that can be constructed in two different ways, with different names registered to each constructor. In order to handle this, we need to know not just the class we're trying to construct (cls), but also what method we should inspect to find its arguments (constructor_to_inspect), and what method to call when we're done constructing arguments (constructor_to_call). These two methods are the same when you've used a @classmethod as your constructor, but they are different when you use the default constructor (because you inspect __init__, but call cls()).

# Instance variables

var keys : List[str]

Gets the keys of the features

# compile_embedder Method

def compile_embedder (
  vocab: allennlp.data.vocabulary.Vocabulary,
)  -> allennlp.modules.text_field_embedders.text_field_embedder.TextFieldEmbedder

Creates the embedder based on the configured input features


vocab : Vocabulary
The vocabulary for which to create the embedder



# compile_featurizer Method

def compile_featurizer (
  tokenizer: Tokenizer,
)  -> InputFeaturizer

Creates the featurizer based on the configured input features

:::tip
If you are creating configurations programmatically, use this method to check that you provided a valid configuration.
:::


tokenizer : Tokenizer
The tokenizer used for this featurizer


The configured InputFeaturizer

# TokenizerConfiguration Class

class TokenizerConfiguration (
    lang: str = 'en',
    max_sequence_length: int = None,
    max_nr_of_sentences: int = None,
    text_cleaning: Union[Dict[str, Any], NoneType] = None,
    segment_sentences: Union[bool, Dict[str, Any]] = False,
    start_tokens: Union[List[str], NoneType] = None,
    end_tokens: Union[List[str], NoneType] = None,
)

Configures the Tokenizer


lang : str, optional (default="en")
The spaCy model used for tokenization is language dependent. For optimal performance, specify the language of your input data.
max_sequence_length : int, optional
Maximum length in characters for input texts, truncated with [:max_sequence_length] after TextCleaning.
max_nr_of_sentences : int, optional
Maximum number of sentences to keep when using segment_sentences, truncated with [:max_nr_of_sentences].
text_cleaning : Optional[Dict[str, Any]], optional
A TextCleaning configuration with pre-processing rules for cleaning up and transforming raw input text.
segment_sentences : Union[bool, Dict[str, Any]], optional (default=False)
Whether to segment input texts into sentences, using the default SentenceSplitter or a specific SentenceSplitter configuration.
start_tokens : Optional[List[str]], optional
A list of token strings to prepend to the tokenized input text.
end_tokens : Optional[List[str]], optional
A list of token strings to append to the tokenized input text.
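As a sketch, tokenizer settings can be collected in a plain dictionary whose keys mirror the constructor parameters above; the concrete values here are illustrative assumptions, not defaults from the library.

```python
# Hypothetical tokenizer settings as a plain dict; keys mirror the
# TokenizerConfiguration constructor parameters.
tokenizer_config = {
    "lang": "de",                  # spaCy language of the input data
    "max_sequence_length": 512,    # truncate texts after text cleaning
    "segment_sentences": True,     # use the default SentenceSplitter
    "start_tokens": ["[START]"],   # prepended to every tokenized text
    "end_tokens": ["[END]"],       # appended to every tokenized text
}

# Passed on via e.g. TokenizerConfiguration(**tokenizer_config)  # requires biome.text
```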

# Ancestors

  • allennlp.common.from_params.FromParams

# PipelineConfiguration Class

class PipelineConfiguration (
    name: str,
    head: TaskHeadConfiguration,
    features: FeaturesConfiguration = None,
    tokenizer: Union[TokenizerConfiguration, NoneType] = None,
    encoder: Union[Seq2SeqEncoderConfiguration, NoneType] = None,
)

Creates a Pipeline configuration


name : str
The name for our pipeline
features : FeaturesConfiguration
The input features to be used by the model pipeline. We define this using a FeaturesConfiguration object.
head : TaskHeadConfiguration
The head for the task, e.g., a LanguageModelling task, using a TaskHeadConfiguration object.
tokenizer : Optional[TokenizerConfiguration]
The tokenizer, defined with a TokenizerConfiguration object.
encoder : Optional[Seq2SeqEncoderConfiguration]
The core text seq2seq encoder of our model, defined with a Seq2SeqEncoderConfiguration object.

# Ancestors

  • allennlp.common.from_params.FromParams

# from_yaml Static method

def from_yaml(path: str) -> PipelineConfiguration

Creates a pipeline configuration from a config yaml file


path : str
The path to a YAML configuration file
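For illustration, one way to prepare such a file (written from Python here just to keep the sketch self-contained; the YAML keys follow the PipelineConfiguration parameters, while the head section is a hypothetical example):

```python
import pathlib
import tempfile

# Write an illustrative pipeline YAML; the head section is an assumption,
# not a verbatim configuration from the library docs.
yaml_text = """\
name: my-classifier
tokenizer:
  lang: en
features:
  word:
    embedding_dim: 100
head:
  type: TextClassification
  labels: [positive, negative]
"""
path = pathlib.Path(tempfile.mkdtemp()) / "pipeline.yml"
path.write_text(yaml_text)

# config = PipelineConfiguration.from_yaml(str(path))  # requires biome.text
```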



# from_dict Static method

def from_dict(config_dict: dict) -> PipelineConfiguration

Creates a pipeline configuration from a config dictionary


config_dict : dict
A configuration dictionary
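A minimal sketch of such a dictionary; the keys follow PipelineConfiguration's constructor parameters, while the head type and labels are illustrative assumptions.

```python
# Illustrative configuration dictionary; the head "type" and "labels"
# values are assumptions, not taken verbatim from the library docs.
config_dict = {
    "name": "my-classifier",
    "tokenizer": {"lang": "en"},
    "features": {"word": {"embedding_dim": 100}},
    "head": {"type": "TextClassification", "labels": ["positive", "negative"]},
}

# config = PipelineConfiguration.from_dict(config_dict)  # requires biome.text
```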



# as_dict Method

def as_dict(self) -> Dict[str, Any]

Returns the configuration as a dictionary



# to_yaml Method

def to_yaml (
  path: str,
)

Saves the pipeline configuration to a YAML-formatted file


path : str
Path to the output file

# build_tokenizer Method

def build_tokenizer(self) -> Tokenizer

Builds the pipeline tokenizer

# build_featurizer Method

def build_featurizer(self) -> InputFeaturizer

Creates the pipeline featurizer

# build_embedder Method

def build_embedder (
  vocab: allennlp.data.vocabulary.Vocabulary,
)

Builds the pipeline embedder for a given vocabulary

# TrainerConfiguration Class

class TrainerConfiguration (
    optimizer: Dict[str, Any] = <factory>,
    validation_metric: str = '-loss',
    patience: Union[int, NoneType] = 2,
    num_epochs: int = 20,
    cuda_device: int = -1,
    grad_norm: Union[float, NoneType] = None,
    grad_clipping: Union[float, NoneType] = None,
    learning_rate_scheduler: Union[Dict[str, Any], NoneType] = None,
    momentum_scheduler: Union[Dict[str, Any], NoneType] = None,
    moving_average: Union[Dict[str, Any], NoneType] = None,
    batch_size: Union[int, NoneType] = 16,
    data_bucketing: bool = False,
    no_grad: List[str] = None,
)

Creates a TrainerConfiguration

Doc strings mainly provided by AllenNLP


optimizer : Dict[str, Any], default {"type": "adam"}
Pytorch optimizers that can be constructed via the AllenNLP configuration framework
validation_metric : str, optional (default="-loss")
Validation metric to measure for whether to stop training using patience and whether to serialize an is_best model each epoch. The metric name must be prepended with either "+" or "-", which specifies whether the metric is an increasing or decreasing function.
patience : Optional[int], optional (default=2)
Number of epochs to be patient before early stopping: the training is stopped after patience epochs with no improvement. If given, it must be > 0. If None, early stopping is disabled.
num_epochs : int, optional (default=20)
Number of training epochs
cuda_device : int, optional (default=-1)
An integer specifying the CUDA device to use for this process. If -1, the CPU is used.
grad_norm : Optional[float], optional
If provided, gradient norms will be rescaled to have a maximum of this value.
grad_clipping : Optional[float], optional
If provided, gradients will be clipped during the backward pass to have an (absolute) maximum of this value. If you are getting NaNs in your gradients during training that are not solved by using grad_norm, you may need this.
learning_rate_scheduler : Optional[Dict[str, Any]], optional
If specified, the learning rate will be decayed with respect to this schedule at the end of each epoch (or batch, if the scheduler implements the step_batch method). If you use torch.optim.lr_scheduler.ReduceLROnPlateau, this will use the validation_metric provided to determine if learning has plateaued.
momentum_scheduler : Optional[Dict[str, Any]], optional
If specified, the momentum will be updated at the end of each batch or epoch according to the schedule.
moving_average : Optional[Dict[str, Any]], optional
If provided, we will maintain moving averages for all parameters. During training, we employ a shadow variable for each parameter, which maintains the moving average. During evaluation, we backup the original parameters and assign the moving averages to corresponding parameters. Be careful that when saving the checkpoint, we will save the moving averages of parameters. This is necessary because we want the saved model to perform as well as the validated model if we load it later.
batch_size : Optional[int], optional (default=16)
Size of the batch.
data_bucketing : bool, optional (default=False)
If enabled, try to apply data bucketing over training batches.
no_grad : List[str], optional
Freeze a list of parameters. The parameter names have to match those of the Pipeline.trainable_parameter_names.
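The parameters above can likewise be sketched as a plain dictionary; the "+accuracy" metric assumes the model actually reports an "accuracy" metric, and all values are illustrative.

```python
# Illustrative trainer settings as a plain dict; keys mirror the
# TrainerConfiguration constructor parameters.
trainer_config = {
    "optimizer": {"type": "adam", "lr": 0.001},
    "validation_metric": "+accuracy",  # "+" means higher is better
    "patience": 3,                     # stop after 3 epochs without improvement
    "num_epochs": 10,
    "cuda_device": -1,                 # -1 runs on the CPU
    "batch_size": 32,
}

# trainer = TrainerConfiguration(**trainer_config)  # requires biome.text
```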

# to_allennlp_trainer Method

def to_allennlp_trainer(self) -> Dict[str, Any]

Returns a configuration dict formatted for AllenNLP's trainer



# VocabularyConfiguration Class

class VocabularyConfiguration (
    sources: Union[List[DataSource], List[Union[allennlp.data.dataset_readers.dataset_reader.AllennlpDataset, allennlp.data.dataset_readers.dataset_reader.AllennlpLazyDataset]]],
    min_count: Dict[str, int] = None,
    max_vocab_size: Union[int, Dict[str, int]] = None,
    pretrained_files: Union[Dict[str, str], NoneType] = None,
    only_include_pretrained_words: bool = False,
    tokens_to_add: Dict[str, List[str]] = None,
    min_pretrained_embeddings: Dict[str, int] = None,
)

Configures a Vocabulary before it gets created from the data

Use this to configure a Vocabulary using specific arguments from allennlp.data.Vocabulary

See AllenNLP Vocabulary docs


sources : List[Union[DataSource, InstancesDataset]]
List of DataSource or InstancesDataset objects from which to create the vocabulary
min_count : Dict[str, int], optional (default=None)
Minimum number of appearances of a token to be included in the vocabulary. The key in the dictionary refers to the namespace of the input feature
max_vocab_size : Dict[str, int] or int, optional (default=None)
Maximum number of tokens in the vocabulary
pretrained_files : Optional[Dict[str, str]], optional
If provided, this map specifies the path to optional pretrained embedding files for each namespace. This can be used to either restrict the vocabulary to only words which appear in this file, or to ensure that any words in this file are included in the vocabulary regardless of their count, depending on the value of only_include_pretrained_words. Words which appear in the pretrained embedding file but not in the data are NOT included in the Vocabulary.
only_include_pretrained_words : bool, optional (default=False)
Only include tokens present in pretrained_files
tokens_to_add : Dict[str, List[str]], optional
A list of tokens per namespace to add to the vocabulary, even if they are not present in the sources
min_pretrained_embeddings : Dict[str, int], optional
Minimum number of lines to keep from pretrained_files, even for tokens not appearing in the sources.
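A sketch of these settings as a plain dictionary; the "word" key refers to the namespace of the word input feature, and the embeddings path is a hypothetical placeholder.

```python
# Illustrative vocabulary settings; keys mirror the VocabularyConfiguration
# constructor parameters, and the embeddings path is hypothetical.
vocab_config = {
    "min_count": {"word": 2},            # drop tokens seen only once
    "max_vocab_size": {"word": 20_000},  # cap the word namespace
    "pretrained_files": {"word": "/path/to/embeddings.txt"},  # hypothetical path
    "only_include_pretrained_words": False,
}

# vocab = VocabularyConfiguration(sources=[...], **vocab_config)  # requires biome.text
```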

# FindLRConfiguration Class

class FindLRConfiguration (
    start_lr: float = 1e-05,
    end_lr: float = 10,
    num_batches: int = 100,
    linear_steps: bool = False,
    stopping_factor: Union[float, NoneType] = None,
)

A configuration for finding the learning rate via Pipeline.find_lr().

The Pipeline.find_lr() method increases the learning rate from start_lr to end_lr recording the losses.


start_lr : float, optional (default=1e-05)
The learning rate at which to start the search.
end_lr : float, optional (default=10)
The learning rate up to which the search is done.
num_batches : int, optional (default=100)
Number of batches to run the learning rate finder.
linear_steps : bool, optional (default=False)
Increase the learning rate linearly if True, otherwise exponentially.
stopping_factor : Optional[float], optional
Stop the search when the current loss exceeds the best loss recorded by a multiple of stopping_factor. If None, the search proceeds until end_lr.
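The ramp between start_lr and end_lr can be sketched as follows; this is the assumed behavior of a typical learning-rate finder, not the library's exact implementation.

```python
def lr_steps(start_lr: float, end_lr: float, num_batches: int,
             linear_steps: bool = False) -> list:
    """Sketch of the learning-rate ramp used during the search."""
    rates = []
    for i in range(num_batches):
        t = i / max(1, num_batches - 1)  # progress in [0, 1]
        if linear_steps:
            # linear ramp: equal additive steps
            rates.append(start_lr + t * (end_lr - start_lr))
        else:
            # exponential ramp: equal multiplicative steps
            rates.append(start_lr * (end_lr / start_lr) ** t)
    return rates
```

With the defaults above, the rate grows by a constant factor each batch, from 1e-05 at the first batch to 10 at the last.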