# biome.text.configuration Module

# FeaturesConfiguration Class


class FeaturesConfiguration (
    word: Union[WordFeatures, NoneType] = None,
    char: Union[CharFeatures, NoneType] = None,
    transformers: Union[TransformersFeatures, NoneType] = None,
)

Configures the input features of the Pipeline

Use this to define the features to be used by the model, such as word and character embeddings or transformer-based features.

:::tip
If you do not pass in either of the parameters (word or char), your pipeline will be set up with a default word feature (embedding_dim=50).
:::

Example:

word = WordFeatures(embedding_dim=100)
char = CharFeatures(embedding_dim=16, encoder={'type': 'gru'})
config = FeaturesConfiguration(word, char)

Parameters

word
The word feature configurations, see WordFeatures
char
The character feature configurations, see CharFeatures
transformers
The transformers feature configuration, see TransformersFeatures: a word-level representation of transformer models built on AllenNLP (see the sketch below)
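
Example combining word embeddings with a transformer feature. This is a sketch: the model_name argument of TransformersFeatures is an assumption and is not documented in this section.

```python
# Sketch: combine word embeddings with a pretrained transformer feature.
# The TransformersFeatures argument shown here (model_name) is an assumption.
word = WordFeatures(embedding_dim=100)
transformers = TransformersFeatures(model_name="distilbert-base-uncased")

config = FeaturesConfiguration(word=word, transformers=transformers)
```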

# Ancestors

  • allennlp.common.from_params.FromParams

# from_params Static method


def from_params (
  params: allennlp.common.params.Params,
  **extras,
)  -> FeaturesConfiguration

This is the automatic implementation of from_params. Any class that subclasses FromParams (or Registrable, which itself subclasses FromParams) gets this implementation for free. If you want your class to be instantiated from params in the "obvious" way – pop off parameters and hand them to your constructor with the same names – this provides that functionality.

If you need more complex logic in your from_params method, you'll have to implement your own method that overrides this one.

The constructor_to_call and constructor_to_inspect arguments deal with a bit of redirection that we do. We allow you to register particular @classmethods on a class as the constructor to use for a registered name. This lets you, e.g., have a single Vocabulary class that can be constructed in two different ways, with different names registered to each constructor. In order to handle this, we need to know not just the class we're trying to construct (cls), but also what method we should inspect to find its arguments (constructor_to_inspect), and what method to call when we're done constructing arguments (constructor_to_call). These two methods are the same when you've used a @classmethod as your constructor, but they are different when you use the default constructor (because you inspect __init__, but call cls()).

# Instance variables

var keys : List[str]

Gets the keys of the features

# compile_embedder Method


def compile_embedder (
  self,
  vocab: allennlp.data.vocabulary.Vocabulary,
)  -> allennlp.modules.text_field_embedders.text_field_embedder.TextFieldEmbedder

Creates the embedder based on the configured input features

Parameters

vocab
The vocabulary for which to create the embedder

Returns

embedder
The embedder created from the configured input features
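
A minimal usage sketch, assuming a FeaturesConfiguration named features_config and using an empty AllenNLP Vocabulary as a stand-in for one built from your data:

```python
from allennlp.data.vocabulary import Vocabulary

# In practice the vocabulary is created from your training data;
# an empty one is used here only to illustrate the call.
vocab = Vocabulary()
embedder = features_config.compile_embedder(vocab)  # returns a TextFieldEmbedder
```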

# compile_featurizer Method


def compile_featurizer (
  self,
  tokenizer: Tokenizer,
)  -> InputFeaturizer

Creates the featurizer based on the configured input features

:::tip
If you are creating configurations programmatically, use this method to check that you provided a valid configuration.
:::

Parameters

tokenizer
Tokenizer used for this featurizer

Returns

featurizer
The configured InputFeaturizer
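
A usage sketch; the tokenizer is assumed to come from PipelineConfiguration.build_tokenizer() (documented below), since compile_featurizer() expects an already built Tokenizer:

```python
# `pipeline_config` is a PipelineConfiguration that uses `features_config` (see below).
tokenizer = pipeline_config.build_tokenizer()

# Per the tip above, this call can also be used to check that the configuration is valid.
featurizer = features_config.compile_featurizer(tokenizer)
```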

# TokenizerConfiguration Class


class TokenizerConfiguration (
    lang: str = 'en',
    max_sequence_length: int = None,
    max_nr_of_sentences: int = None,
    text_cleaning: Union[Dict[str, Any], NoneType] = None,
    segment_sentences: Union[bool, Dict[str, Any]] = False,
    start_tokens: Union[List[str], NoneType] = None,
    end_tokens: Union[List[str], NoneType] = None,
)

Configures the Tokenizer

Parameters

lang
The spaCy model used for tokenization is language dependent. For optimal performance, specify the language of your input data (default: "en").
max_sequence_length
Maximum length of the input texts in characters. Texts are truncated with [:max_sequence_length] after the TextCleaning step.
max_nr_of_sentences
Maximum number of sentences to keep when using segment_sentences; the list of sentences is truncated with [:max_nr_of_sentences].
text_cleaning
A TextCleaning configuration with pre-processing rules for cleaning up and transforming raw input text.
segment_sentences
Whether to segment the input texts into sentences, either with the default SentenceSplitter or by providing a specific SentenceSplitter configuration.
start_tokens
A list of token strings to add to the sequence before the tokenized input text.
end_tokens
A list of token strings to add to the sequence after the tokenized input text.
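
Example (the values shown are illustrative):

```python
tokenizer_config = TokenizerConfiguration(
    lang="en",
    max_sequence_length=2048,   # truncate long texts after text cleaning
    segment_sentences=True,     # use the default SentenceSplitter
    start_tokens=["[START]"],   # added before the tokenized text
    end_tokens=["[END]"],       # added after the tokenized text
)
```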

# Ancestors

  • allennlp.common.from_params.FromParams

# PipelineConfiguration Class


class PipelineConfiguration (
    name: str,
    head: TaskHeadConfiguration,
    features: FeaturesConfiguration = None,
    tokenizer: Union[TokenizerConfiguration, NoneType] = None,
    encoder: Union[Seq2SeqEncoderConfiguration, NoneType] = None,
)

Creates a Pipeline configuration

Parameters

name
The name for our pipeline
features
The input features to be used by the model pipeline. We define this using a FeaturesConfiguration object.
head
The head for the task, e.g., a LanguageModelling task, using a TaskHeadConfiguration object.
tokenizer
The tokenizer defined with a TokenizerConfiguration object.
encoder
The core text seq2seq encoder of our model, defined with a Seq2SeqEncoderConfiguration object.
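
Example of assembling a pipeline configuration. This is a sketch: the TaskHeadConfiguration arguments (the head type and its labels) are illustrative assumptions, since the available heads are documented elsewhere.

```python
pipeline_config = PipelineConfiguration(
    name="my-first-pipeline",
    # The head type and labels are illustrative; see the TaskHead documentation for real options.
    head=TaskHeadConfiguration(type="TextClassification", labels=["positive", "negative"]),
    features=FeaturesConfiguration(word=WordFeatures(embedding_dim=100)),
    tokenizer=TokenizerConfiguration(lang="en"),
)
```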

# Ancestors

  • allennlp.common.from_params.FromParams

# from_yaml Static method


def from_yaml(path: str) -> PipelineConfiguration

Creates a pipeline configuration from a config yaml file

Parameters

path
The path to a YAML configuration file

Returns

pipeline_configuration
The pipeline configuration created from the YAML file
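
Example (the path is illustrative):

```python
pipeline_config = PipelineConfiguration.from_yaml("configs/pipeline.yml")
```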

# from_dict Static method


def from_dict(config_dict: dict) -> PipelineConfiguration

Creates a pipeline configuration from a config dictionary

Parameters

config_dict
A configuration dictionary

Returns

pipeline_configuration
The pipeline configuration created from the dictionary
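
Example. This is a sketch: the dictionary is assumed to mirror the constructor parameters, and the head section shown is illustrative.

```python
config_dict = {
    "name": "my-first-pipeline",
    "features": {"word": {"embedding_dim": 50}},
    # The head specification is illustrative; adapt it to your task head.
    "head": {"type": "TextClassification", "labels": ["positive", "negative"]},
}
pipeline_config = PipelineConfiguration.from_dict(config_dict)
```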

# as_dict Method


def as_dict(self) -> Dict[str, Any]

Returns the configuration as a dictionary

Returns

config
The configuration as a dictionary

# to_yaml Method


def to_yaml (
  self,
  path: str,
) 

Saves the pipeline configuration to a yaml formatted file

Parameters

path
Path to the output file
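
as_dict() and to_yaml() are the counterparts of from_dict() and from_yaml(), e.g. for inspecting or persisting a configuration (the file name is illustrative):

```python
config_dict = pipeline_config.as_dict()       # plain dict representation
pipeline_config.to_yaml("pipeline_copy.yml")  # write the configuration back out as YAML
```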

# build_tokenizer Method


def build_tokenizer(self) -> Tokenizer

Build the pipeline tokenizer

# build_featurizer Method


def build_featurizer(self) -> InputFeaturizer

Creates the pipeline featurizer

# build_embedder Method


def build_embedder (
  self,
  vocab: allennlp.data.vocabulary.Vocabulary,
) 

Build the pipeline embedder for the given vocabulary
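
The three build methods wire up the components defined by one pipeline configuration. A sketch, again using an empty AllenNLP Vocabulary as a stand-in for one built from your data:

```python
from allennlp.data.vocabulary import Vocabulary

tokenizer = pipeline_config.build_tokenizer()
featurizer = pipeline_config.build_featurizer()

# In practice the vocabulary comes from your training data.
vocab = Vocabulary()
embedder = pipeline_config.build_embedder(vocab)
```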

# TrainerConfiguration Class


class TrainerConfiguration (
    optimizer: Dict[str, Any] = <factory>,
    validation_metric: str = '-loss',
    patience: Union[int, NoneType] = 2,
    num_epochs: int = 20,
    cuda_device: int = -1,
    grad_norm: Union[float, NoneType] = None,
    grad_clipping: Union[float, NoneType] = None,
    learning_rate_scheduler: Union[Dict[str, Any], NoneType] = None,
    momentum_scheduler: Union[Dict[str, Any], NoneType] = None,
    moving_average: Union[Dict[str, Any], NoneType] = None,
    batch_size: Union[int, NoneType] = 16,
    data_bucketing: bool = False,
    no_grad: List[str] = None,
)

Creates a TrainerConfiguration

The docstrings below are mainly provided by AllenNLP.

Attributes

optimizer : Dict[str, Any], default {"type": "adam"}
Pytorch optimizers that can be constructed via the AllenNLP configuration framework
validation_metric : str, optional (default=-loss)
Validation metric to measure for whether to stop training using patience and whether to serialize an is_best model each epoch. The metric name must be prepended with either "+" or "-", which specifies whether the metric is an increasing or decreasing function.
patience : Optional[int], optional (default=2)
Number of epochs to be patient before early stopping: the training is stopped after patience epochs with no improvement. If given, it must be > 0. If None, early stopping is disabled.
num_epochs : int, optional (default=20)
Number of training epochs
cuda_device : int, optional (default=-1)
An integer specifying the CUDA device to use for this process. If -1, the CPU is used.
grad_norm : Optional[float], optional
If provided, gradient norms will be rescaled to have a maximum of this value.
grad_clipping : Optional[float], optional
If provided, gradients will be clipped during the backward pass to have an (absolute) maximum of this value. If you are getting NaNs in your gradients during training that are not solved by using grad_norm, you may need this.
learning_rate_scheduler : Optional[Dict[str, Any]], optional
If specified, the learning rate will be decayed with respect to this schedule at the end of each epoch (or batch, if the scheduler implements the step_batch method). If you use torch.optim.lr_scheduler.ReduceLROnPlateau, this will use the validation_metric provided to determine if learning has plateaued.
momentum_scheduler : Optional[Dict[str, Any]], optional
If specified, the momentum will be updated at the end of each batch or epoch according to the schedule.
moving_average : Optional[Dict[str, Any]], optional
If provided, we will maintain moving averages for all parameters. During training, we employ a shadow variable for each parameter, which maintains the moving average. During evaluation, we backup the original parameters and assign the moving averages to corresponding parameters. Be careful that when saving the checkpoint, we will save the moving averages of parameters. This is necessary because we want the saved model to perform as well as the validated model if we load it later.
batch_size : Optional[int], optional (default=16)
Size of the batch.
data_bucketing : bool, optional (default=False)
If enabled, try to apply data bucketing over training batches.
no_grad
Freeze a list of parameters. The parameter names have to match those of the Pipeline.trainable_parameter_names.
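
Example of a trainer configuration using only the attributes listed above:

```python
trainer_config = TrainerConfiguration(
    optimizer={"type": "adam", "lr": 0.001},
    num_epochs=10,
    patience=3,                 # stop after 3 epochs without improvement
    batch_size=32,
    cuda_device=0,              # first GPU; -1 means CPU
    validation_metric="-loss",  # "-" means lower is better
)
```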

# to_allennlp_trainer Method


def to_allennlp_trainer(self) -> Dict[str, Any]

Returns a configuration dict formatted for AllenNLP's trainer

Returns

allennlp_trainer_config
The configuration dict for AllenNLP's trainer

# VocabularyConfiguration Class


class VocabularyConfiguration (
    sources: Union[List[DataSource], List[Union[allennlp.data.dataset_readers.dataset_reader.AllennlpDataset, allennlp.data.dataset_readers.dataset_reader.AllennlpLazyDataset]]],
    min_count: Dict[str, int] = None,
    max_vocab_size: Union[int, Dict[str, int]] = None,
    pretrained_files: Union[Dict[str, str], NoneType] = None,
    only_include_pretrained_words: bool = False,
    tokens_to_add: Dict[str, List[str]] = None,
    min_pretrained_embeddings: Dict[str, int] = None,
)

Configures a Vocabulary before it gets created from the data

Use this to configure a Vocabulary using specific arguments from allennlp.data.Vocabulary

See AllenNLP Vocabulary docs

Parameters

sources
List of DataSource or InstancesDataset objects from which the vocabulary will be created
min_count : Dict[str, int], optional (default=None)
Minimum number of appearances of a token to be included in the vocabulary. The key in the dictionary refers to the namespace of the input feature
max_vocab_size : Dict[str, int] or int, optional (default=None)
Maximum number of tokens in the vocabulary
pretrained_files : Optional[Dict[str, str]], optional
If provided, this map specifies the path to optional pretrained embedding files for each namespace. This can be used to either restrict the vocabulary to only words which appear in this file, or to ensure that any words in this file are included in the vocabulary regardless of their count, depending on the value of only_include_pretrained_words. Words which appear in the pretrained embedding file but not in the data are NOT included in the Vocabulary.
only_include_pretrained_words : bool, optional (default=False)
Only include tokens present in pretrained_files
tokens_to_add : Dict[str, List[str]], optional
A dictionary mapping feature namespaces to lists of tokens to add to the vocabulary, even if they are not present in the sources
min_pretrained_embeddings : Dict[str, int], optional
Minimum number of lines to keep from pretrained_files, even for tokens not appearing in the sources.
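
Example. This is a sketch: train_ds and valid_ds are assumed to be DataSource or InstancesDataset objects created elsewhere, and "word" is assumed to be the namespace of the word feature.

```python
vocab_config = VocabularyConfiguration(
    sources=[train_ds, valid_ds],  # assumed DataSource/InstancesDataset objects
    min_count={"word": 2},         # drop words that appear only once
    max_vocab_size=25000,
)
```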

# FindLRConfiguration Class


class FindLRConfiguration (
    start_lr: float = 1e-05,
    end_lr: float = 10,
    num_batches: int = 100,
    linear_steps: bool = False,
    stopping_factor: Union[float, NoneType] = None,
)

A configuration for finding the learning rate via Pipeline.find_lr().

The Pipeline.find_lr() method increases the learning rate from start_lr to end_lr, recording the losses.

Parameters

start_lr
The learning rate to start the search.
end_lr
The learning rate up to which the search is done.
num_batches
Number of batches to run the learning rate finder.
linear_steps
Whether to increase the learning rate linearly; if False, it is increased exponentially.
stopping_factor
Stop the search when the current loss exceeds the best loss recorded by a multiple of the stopping factor. If None, the search proceeds until end_lr.
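
Example of sweeping the learning rate over 200 batches up to 1.0:

```python
find_lr_config = FindLRConfiguration(
    start_lr=1e-5,
    end_lr=1.0,
    num_batches=200,
    stopping_factor=4.0,  # stop early if the loss reaches 4x the best recorded loss
)
```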