# Training a short text classifier of German business names

In this tutorial we will train a basic short-text classifier for predicting the sector of a business based only on its business name. For this we will use a training dataset with business names and business categories in German.

When running this tutorial in Google Colab, make sure to install biome.text first:

!pip install -U git+https://github.com/recognai/biome-text.git

Ignore warnings and don't forget to restart your runtime afterwards (Runtime -> Restart runtime).

# Explore the training data

Let's take a look at the data we will use for training. For this we create a DataSource instance providing a path to our data.

from biome.text.data import DataSource
train_ds = DataSource("https://biome-tutorials-data.s3-eu-west-1.amazonaws.com/text_classifier/business.cat.train.csv")
train_ds.head(10)
|   | label | text | path |
|---|-------|------|------|
| 0 | Edv | Cse Gmbh Computer Edv-service Bürobedarf | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 1 | Maler | Malerfachbetrieb U. Nee | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 2 | Gebrauchtwagen | Sippl Automobilverkäufer Hausmann | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 3 | Handelsvermittler Und -vertreter | Strenge Handelsagentur Werth | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 4 | Gebrauchtwagen | Dzengel Autohaus Gordemitz Rusch | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 5 | Apotheken | Schinkel-apotheke Bitzer | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 6 | Tiefbau | Franz Möbius Mehrings-bau-hude Und Stigge | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 7 | Handelsvermittler Und -vertreter | Kontze Hdl.vertr. Lau | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 8 | Autowerkstätten | Keßler Kfz-handel | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 9 | Gebrauchtwagen | Diko Lack Und Schrift Betriebsteil Der Autocen... | https://biome-tutorials-data.s3-eu-west-1.amaz... |

As we can see, we have two relevant columns: label and text. The path column is added automatically by the DataSource class to keep track of the source file.

Our classifier will be trained to predict the label given a text.

The DataSource class stores the data in an underlying Dask DataFrame that you can easily access. For example, let's check the size of our training data:

len(train_ds.to_dataframe())
8000

Or let's check the distribution of our labels:

labels = train_ds.to_dataframe().label.compute()
labels.value_counts()
Unternehmensberatungen              632
Friseure                            564
Tiefbau                             508
Dienstleistungen                    503
Gebrauchtwagen                      449
Elektriker                          430
Restaurants                         422
Architekturbüros                    417
Vereine                             384
Versicherungsvermittler             358
Maler                               330
Sanitärinstallationen               323
Edv                                 318
Werbeagenturen                      294
Apotheken                           289
Physiotherapie                      286
Vermittlungen                       277
Hotels                              274
Autowerkstätten                     263
Elektrotechnik                      261
Allgemeinärzte                      216
Handelsvermittler Und -vertreter    202
Name: label, dtype: int64
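
Since labels is a plain pandas Series at this point (compute() materializes the Dask series), we can also get a quick visual impression of the class imbalance with a bar plot. A minimal sketch, assuming matplotlib is available (it is by default in Colab):

# plot the label distribution to eyeball the class imbalance
labels.value_counts().plot.barh(figsize=(8, 6))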

Tip

The TaskHead of our model will expect a text and a label column to be present in the dataframe. Since they are already present, there is no need for a mapping in the DataSource.
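
If your file used different column names, you would provide such a mapping when creating the DataSource. A minimal sketch with hypothetical column names category and name; the exact semantics of the mapping argument are described in the DataSource API:

# hypothetical file whose columns are named "category" and "name"
mapped_ds = DataSource(
    "path/to/your/data.csv",
    mapping={"label": "category", "text": "name"},  # assumed: expected column -> column in the file
)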

# Configure your biome.text Pipeline

A typical Pipeline consists of tokenizing the input, extracting features, optionally applying a language encoder, and executing a task-specific head at the end.

After training a pipeline, you can use it to make predictions or explore the underlying model via the explore UI.

As a first step we must define a configuration for our pipeline. In this tutorial we will create a configuration dictionary and use the Pipeline.from_config() method to create our pipeline, but there are other ways.
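
One alternative, sketched here under the assumption that your configuration lives in a YAML file mirroring the dictionary below, is the Pipeline.from_yaml() method:

# assumes pipeline.yaml contains the same keys as the configuration dictionary below
pl = Pipeline.from_yaml("pipeline.yaml")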

A biome.text pipeline has the following main components:

name: # a descriptive name of your pipeline

tokenizer: # how to tokenize the input

features: # input features of the model

encoder: # the language encoder

head: # your task configuration

See the Configuration section for a detailed description of how these main components can be configured.

Our complete configuration for this tutorial will be the following:

pipeline_dict = {
    "name": "german_business_names",
    "tokenizer": {
        "text_cleaning": {
            "rules": ["strip_spaces"]
        }
    },
    "features": {
        "word": {
            "embedding_dim": 64,
            "lowercase_tokens": True,
        },
        "char": {
            "embedding_dim": 32,
            "lowercase_characters": True,
            "encoder": {
                "type": "gru",
                "num_layers": 1,
                "hidden_size": 32,
                "bidirectional": True,
            },
            "dropout": 0.1,
        },
    },
    "head": {
        "type": "TextClassification",
        "labels": list(labels.value_counts().index),
        "pooler": {
            "type": "gru",
            "num_layers": 1,
            "hidden_size": 32,
            "bidirectional": True,
        },
        "feedforward": {
            "num_layers": 1,
            "hidden_dims": [32],
            "activations": ["relu"],
            "dropout": [0.0],
        },
    },
}

With this dictionary we can now create a Pipeline:

from biome.text import Pipeline
pl = Pipeline.from_config(pipeline_dict)

# Create a vocabulary

Before we can start the training we need to create the vocabulary for our model. For this we define a VocabularyConfiguration.

In our business name classifier we only want to include words with a general meaning in our word feature vocabulary (like "Computer" or "Autohaus", for example), and want to exclude specific names that will not help to classify the kind of business in general. This can be achieved by including only the most frequent words in our training set via the min_count argument. For a complete list of available arguments see the VocabularyConfiguration API.

from biome.text.configuration import VocabularyConfiguration, WordFeatures
vocab_config = VocabularyConfiguration(sources=[train_ds], min_count={WordFeatures.namespace: 20})

We then pass this configuration to our Pipeline to create the vocabulary:

pl.create_vocabulary(vocab_config)

After creating the vocabulary we can check the size of our entire model in terms of trainable parameters:

pl.trainable_parameters
60566

# Configure the trainer

As a next step we have to configure the trainer.

The default trainer comes with sensible defaults and should work well for most cases. In this tutorial, however, we want to tune the learning rate a bit and limit the training time to three epochs only. For a complete list of available arguments see the TrainerConfiguration API.

Tip

In case you have a CUDA device available, you can also specify it here.

from biome.text.configuration import TrainerConfiguration
trainer_config = TrainerConfiguration(
    optimizer={
        "type": "adam",
        "lr": 0.01,
    },
    num_epochs=3,
    # cuda_device=0,
)

# Train your model

Now we have everything ready to start the training of our model:

  • training data set
  • vocabulary
  • trainer

Optionally we can provide a validation data set to estimate the generalization error. For this we will create another DataSource pointing to our validation data.

valid_ds = DataSource("https://biome-tutorials-data.s3-eu-west-1.amazonaws.com/text_classifier/business.cat.valid.csv")

The training output will be saved in a folder specified by the output argument. It contains the trained model weights and the metrics, as well as the vocabulary and a log folder for visualizing the training process with tensorboard.

pl.train(
    output="output",
    training=train_ds,
    validation=valid_ds,
    trainer=trainer_config,
)

After 3 epochs we achieve a validation accuracy of about 0.91. The validation loss seems to be decreasing further, though, so we could probably train the model for a few more epochs without the risk of overfitting.

Tip

If for some reason the training gets interrupted, you can continue where you left off by setting the restore argument of the Pipeline.train() method to True. If you want to train your model for a few more epochs, you can also use the restore argument, but you have to modify the num_epochs argument in your TrainerConfiguration to reflect the total number of epochs you aim for.
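
For example, to train for three additional epochs on top of the three epochs from above, the continuation could look like this sketch (it reuses the same output folder as before):

# raise the total number of epochs from 3 to 6 and restore the previous state
trainer_config = TrainerConfiguration(
    optimizer={
        "type": "adam",
        "lr": 0.01,
    },
    num_epochs=6,  # the total number of epochs, not the additional ones
)
pl.train(
    output="output",
    training=train_ds,
    validation=valid_ds,
    trainer=trainer_config,
    restore=True,  # pick up the state of the interrupted/finished run
)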

# Make your first predictions

Now that we have trained our model, we can go on to make our first predictions. First we must load the trained model into a new Pipeline:

pl_trained = Pipeline.from_pretrained("output/model.tar.gz")

We then provide the input expected by the TaskHead of our model to the Pipeline.predict() method. In our case it is a TextClassification head that classifies a text input:

pl_trained.predict(text="Autohaus biome.text")

The returned dictionary contains the logits and probabilities of all labels (classes). The label with the highest probability is stored under the label key, together with its probability under the prob key.
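
For example, to extract just the predicted label and its probability, using the keys described above:

prediction = pl_trained.predict(text="Autohaus biome.text")
print(prediction["label"], prediction["prob"])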

Tip

When configuring the pipeline in the first place, we recommend checking that it is correctly set up by using the predict method. Since the pipeline is not trained yet at that moment, the predictions will be arbitrary.
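
Such a smoke test could look like this; the output of the untrained pipeline is meaningless, it only verifies that tokenizer, features and head are wired up correctly:

# sanity check on the freshly configured, untrained pipeline
pl.predict(text="Autohaus biome.text")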

# Explore the model's predictions

To check and understand the predictions of the model, we can use the biome.text explore UI. Calling the Pipeline.explore method will open the UI in the output of our cell. We will set the explain argument to True, which automatically visualizes the attribution of each token by means of integrated gradients.

Warning

For the UI to work you need a running Elasticsearch instance. We recommend installing Elasticsearch with docker.
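
A typical way to spin up a local single-node instance with docker looks like this (the image version tag is an assumption, any recent 7.x tag should work):

docker run -p 9200:9200 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.10.2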

pl_trained.explore(valid_ds, explain=True)

Screenshot of the biome.text explore UI

Exploring our model, we can take advantage of the F1 scores of each label to figure out which labels to prioritize when gathering new training data. For example, although "Allgemeinärzte" is the second rarest label in our training data, it still seems relatively easy for our model to classify due to the distinctive words "Dr." and "Allgemeinmedizin".
