# Training a sequence tagger for Slot Filling


In this tutorial we will train a sequence tagger for filling slots in spoken requests. The goal is to look for specific pieces of information in the request and tag the corresponding tokens accordingly. The requests will include several intents, from getting weather information to adding a song to a playlist, each requiring its own set of slots. Therefore, slot filling often goes hand in hand with intent classification. In this tutorial, however, we will only focus on the slot filling task.

Slot filling is closely related to named-entity recognition (NER), and the model in this tutorial can also be used to train a NER system.

In this tutorial we will use the SNIPS data set adapted by Su Zhu and our simple data preparation notebook.

When running this tutorial in Google Colab, make sure to install biome.text first:

!pip install -U git+https://github.com/recognai/biome-text.git

Ignore warnings and don't forget to restart your runtime afterwards (Runtime -> Restart runtime).

# Explore the data

Let's take a look at the data before starting with the configuration of our pipeline. For this we create a DataSource instance providing a path to our data.

from biome.text.data import DataSource
train_ds = DataSource(source="https://biome-tutorials-data.s3-eu-west-1.amazonaws.com/token_classifier/train.json")
train_ds.head()
|   | text | labels | intent | path |
|---|------|--------|--------|------|
| 0 | [Find, the, schedule, for, Across, the, Line, ... | [O, O, B-object_type, O, B-movie_name, I-movie... | SearchScreeningEvent | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 1 | [play, Party, Ben, on, Slacker] | [O, B-artist, I-artist, O, B-service] | PlayMusic | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 2 | [play, a, 1988, soundtrack] | [O, O, B-year, B-music_item] | PlayMusic | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 3 | [Can, you, play, The, Change, Is, Made, on, Ne... | [O, O, O, B-track, I-track, I-track, I-track, ... | PlayMusic | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 4 | [what, is, the, forecast, for, colder, in, Ans... | [O, O, O, O, O, B-condition_temperature, O, B-... | GetWeather | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 5 | [What's, the, weather, in, Totowa, WY, one, mi... | [O, O, O, O, B-city, B-state, B-timeRange, I-t... | GetWeather | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 6 | [Play, a, tune, from, Space, Mandino, .] | [O, O, B-music_item, O, B-artist, I-artist, O] | PlayMusic | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 7 | [give, five, out, of, 6, stars, to, current, e... | [O, B-rating_value, O, O, B-best_rating, B-rat... | RateBook | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 8 | [Play, some, chanson, style, music.] | [O, O, B-genre, O, O] | PlayMusic | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 9 | [I, would, give, French, Poets, and, Novelists... | [O, O, O, B-object_name, I-object_name, I-obje... | RateBook | https://biome-tutorials-data.s3-eu-west-1.amaz... |

As we can see we have two relevant columns for our task: text and labels. The intent column will be ignored in this tutorial. The path column is added automatically by the DataSource class to keep track of the source file.

The input already comes pre-tokenized, and each token in the text column has a label/tag in the labels column, which means that both lists always have the same length. The labels are given in the BIO tagging scheme, which is widely used in slot filling/NER systems.
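
To make the scheme concrete, here is a small hypothetical example (made up for illustration, not taken from the data set) of how tokens and BIO tags line up:

tokens = ["add", "the", "song", "to", "my", "summer", "hits", "playlist"]
tags = ["O", "O", "B-music_item", "O", "B-playlist_owner", "B-playlist", "I-playlist", "O"]

# B-<slot> marks the first token of a slot, I-<slot> a continuation, and O a token outside any slot
assert len(tokens) == len(tags)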

When specifying the TokenClassification head (see below), the tokenization step in the pipeline is automatically disabled and the input is expected to be a list of tokens.

The DataSource class stores the data in an underlying Dask DataFrame that you can easily access. For example, let's check the size of our training data:

len(train_ds.to_dataframe())
13084

Or let's check how many different labels/tags we have:

df = train_ds.to_dataframe().compute()
labels_total = df.labels.sum()
len(set(labels_total))
72

and how they are distributed:

import pandas as pd
pd.Series(labels_total).value_counts()
O                               59610
I-object_name                    7400
I-playlist                       3230
B-object_type                    3023
B-object_name                    2778
                                ...  
I-cuisine                          28
I-facility                         14
I-object_part_of_series_type        3
I-object_select                     3
I-playlist_owner                    1
Length: 72, dtype: int64

Tip

The TaskHead of our model (the TokenClassification) will expect a text and a labels column to be present in the dataframe. Since they are already present, there is no need for a mapping in the DataSource.
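
If your columns were named differently, you could map them to the expected fields when creating the DataSource. The column names below are hypothetical, and the exact mapping argument may vary between biome.text versions, so check the DataSource API:

my_ds = DataSource(
    source="path/to/your/data.json",  # hypothetical path
    # map hypothetical column names to the fields expected by the TaskHead
    mapping={"text": "tokens", "labels": "tags"},
)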

# Configure your biome.text Pipeline

A typical Pipeline consists of tokenizing the input, extracting features, optionally applying a language encoder and executing a task-specific head at the end. After training a pipeline, you can use it to make predictions or explore the underlying model via the explore UI.

A biome.text pipeline has the following main components:

name: # a descriptive name of your pipeline

tokenizer: # how to tokenize the input

features: # input features of the model

encoder: # the language encoder

head: # your task configuration

See the Configuration section for a detailed description of how these main components can be configured.

In this tutorial we will create a PipelineConfiguration programmatically, and use it to initialize the Pipeline. You can also create your pipelines by providing a python dictionary (see the text classification tutorial), a YAML configuration file or a pretrained model.
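
For reference, the alternative ways to obtain a Pipeline could look roughly like the sketch below. The YAML file name is made up, and the method names should be checked against the Pipeline API of your biome.text version:

from biome.text import Pipeline

# from a YAML configuration file containing the components listed above (file name is hypothetical)
pl = Pipeline.from_yaml("slot_filling_pipeline.yml")

# from a pretrained model archive (as we will do in the prediction section below)
pl = Pipeline.from_pretrained("path/to/model.tar.gz")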

A pipeline configuration is composed of several other configuration classes, each one corresponding to one of the main components.

# Features

Let us first configure the features of our pipeline. For our word feature we will use pretrained embeddings from fasttext, and our char feature will use the last hidden state of a GRU encoder to represent a word based on its characters. Keep in mind that the embedding_dim parameter for the word feature must be equal to the dimensions of the pretrained embeddings!

Tip

If you do not provide any feature configurations, we will choose a very basic word feature by default.

from biome.text.configuration import FeaturesConfiguration, WordFeatures, CharFeatures

word_feature = WordFeatures(
    embedding_dim=300,
    weights_file="https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip",
)

char_feature = CharFeatures(
    embedding_dim=32,
    encoder={
        "type": "gru",
        "bidirectional": True,
        "num_layers": 1,
        "hidden_size": 32,
    },
    dropout=0.1
)

features_config = FeaturesConfiguration(
    word=word_feature, 
    char=char_feature
)

# Encoder

Next we will configure our encoder that takes as input a sequence of embedded word vectors and returns a sequence of encoded word vectors. For this encoding we will use another larger GRU:

from biome.text.modules.configuration import Seq2SeqEncoderConfiguration

encoder_config = Seq2SeqEncoderConfiguration(
    type="gru",
    bidirectional=True,
    num_layers=1,
    hidden_size=128,
)

# TaskHead

The final configuration belongs to our TaskHead. It reflects the task our problem belongs to and can easily be exchanged with other types of heads while keeping the same features and encoder.

Tip

By exchanging heads you can easily pretrain a model on one task, such as language modelling, and then use its pretrained features and encoder to train the model on another task.

For our task we will use a TokenClassification head that allows us to tag each token individually:

from biome.text.modules.heads import TokenClassificationConfiguration

head_config = TokenClassificationConfiguration(
    labels=list(set(labels_total)),
    label_encoding="BIO",
    feedforward={
        "num_layers": 1,
        "hidden_dims": [128],
        "activations": ["relu"],
        "dropout": [0.1],
    },
)

# Pipeline

Now we can create a PipelineConfiguration and finally initialize our Pipeline.

from biome.text.configuration import PipelineConfiguration

pipeline_config = PipelineConfiguration(
    name="slot_filling_tutorial",
    features=features_config,
    encoder=encoder_config,
    head=head_config,
)
from biome.text import Pipeline

pl = Pipeline.from_config(pipeline_config)

# Create a vocabulary

Before we can start the training we need to create the vocabulary for our model. For this we define a VocabularyConfiguration.

from biome.text import VocabularyConfiguration

Since we use pretrained word embeddings we will also consider the validation data when creating the vocabulary.

valid_ds = DataSource(source="https://biome-tutorials-data.s3-eu-west-1.amazonaws.com/token_classifier/valid.json")

We also get rid of the rarest words by adding the min_count argument and setting it to 2 for the word feature vocabulary. For a complete list of available arguments see the VocabularyConfiguration API.

vocab_config = VocabularyConfiguration(
    sources=[train_ds, valid_ds],
    min_count={"word": 2},
)

We then pass this configuration to our Pipeline to create the vocabulary. Apart from the progress bar for building the vocabulary, there will be two more progress bars corresponding to the weights_file provided in the word feature:

  • the progress of downloading the file (this file will be cached)
  • the progress loading the weights from the file
pl.create_vocabulary(vocab_config)

After creating the vocabulary we can check the size of our entire model in terms of trainable parameters:

pl.trainable_parameters
1989112

# Train your model

Now we have everything ready to start the training of our model:

  • training data set
  • vocabulary

As trainer we will use the default configuration, which has sensible values and works well enough for our experiment. The tip below shows how to provide your own trainer configuration.

Tip

If you want to tune the trainer or use a cuda device, you can pass a trainer = TrainerConfiguration(cuda_device=0, ...) to the Pipeline.train() method. See the TrainerConfiguration API for a complete list of available configurations.
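
For example, a custom trainer configuration could look roughly like the sketch below. The import path and the parameter names other than cuda_device are assumptions based on the TrainerConfiguration API, so check the documentation of your biome.text version:

from biome.text.configuration import TrainerConfiguration

# hypothetical values; adjust them to your hardware and data
trainer_config = TrainerConfiguration(
    optimizer={"type": "adam", "lr": 0.001},
    num_epochs=5,
    batch_size=64,
    cuda_device=0,  # set to -1 to train on the CPU
)

# it would then be passed to the training call: pl.train(..., trainer=trainer_config, ...)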

Apart from the validation data source used to estimate the generalization error, we will also pass in a test data set in case we want to do some hyperparameter optimization and compare different encoder architectures in the end. For this we will create another DataSource pointing to our test data.

test_ds = DataSource(source="https://biome-tutorials-data.s3-eu-west-1.amazonaws.com/token_classifier/test.json")

The training output will be saved in a folder specified by the output argument of the train method. It will contain the trained model weights and the metrics, as well as the vocabulary and a log folder for visualizing the training process with tensorboard.
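
To follow the training process you can point TensorBoard at this output folder; TensorBoard scans it recursively, so the exact name of the log subfolder does not matter:

# inside a notebook (Colab/Jupyter); alternatively run "tensorboard --logdir output" in a shell
%load_ext tensorboard
%tensorboard --logdir output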

When the training has finished it will automatically make a pass over the test data with the best weights to gather the test metrics.

pl.train(
    output="output",
    training=train_ds,
    validation=valid_ds,
    test=test_ds,
)

The model above achieves an overall F1 score of around 0.95, which is not bad compared to published results on the same data set. You could continue the experiment by changing the encoder to an LSTM network, trying out a transformer architecture or fine-tuning the trainer.
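
As an illustration of the first suggestion, swapping the GRU for an LSTM only requires changing the encoder configuration; the rest of the pipeline stays the same (the hidden size below is just an example):

from biome.text.modules.configuration import Seq2SeqEncoderConfiguration

encoder_config = Seq2SeqEncoderConfiguration(
    type="lstm",  # LSTM instead of GRU
    bidirectional=True,
    num_layers=1,
    hidden_size=128,
)

But for now we will go on and make our first predictions with this trained model.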

# Make your first predictions

Now that we trained our model we can go on to make our first predictions. First we must load our trained model into a new Pipeline:

pl_trained = Pipeline.from_pretrained("output/model.tar.gz")

We then provide the input expected by the TaskHead of our model to the Pipeline.predict() method. In our case it is a TokenClassification head that tags a tokenized text input. Remember that the input has to be pre-tokenized!

text = "can you play biome text by backstreet recognais on Spotify".split()
prediction = pl_trained.predict(text)
list(zip(text, prediction["tags"][0]))
[('can', 'O'),
 ('you', 'O'),
 ('play', 'O'),
 ('biome', 'B-track'),
 ('text', 'I-track'),
 ('by', 'O'),
 ('backstreet', 'B-artist'),
 ('recognais', 'I-artist'),
 ('on', 'O'),
 ('Spotify', 'B-service')]

Apart from the most likely tags, the prediction dictionary contains the logits and probs of each label for each input token.
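
You can inspect these entries directly. The key names follow the description above and may differ slightly between biome.text versions:

# list all entries of the prediction dictionary (e.g. tags, logits, probs)
print(prediction.keys())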
