# Hyperparameter optimization with Ray Tune

In this tutorial we will optimize the hyperparameters of the short-text classifier from this tutorial. We recommend having a look at it first before going through the following tutorial. For the Hyper-Parameter Optimization (HPO) we rely on the awesome Ray Tune library, which is not a dependency of biome.text and has to be installed separately.

For a short introduction to HPO with Ray Tune you can have a look at this nice talk by Richard Liaw. We will follow his terminology and use the term trial to refer to a training run with one set of hyperparameters.

When running this tutorial in Google Colab, make sure to install biome.text and Ray Tune first:

!pip install -U git+https://github.com/recognai/biome-text.git ray[tune]

You can safely ignore the warnings, but don't forget to restart your runtime afterwards (Runtime -> Restart runtime).

Note

In this tutorial we will use a GPU by default. So when running this tutorial in Google Colab, make sure that you request one (Edit -> Notebook settings).

# Download the data and create the vocabulary

As a first step we will download the training and validation data to our local machine. This will save us some time in the long run, since we will perform the hyperparameter search on our local machine and access the data frequently.

!curl -O https://biome-tutorials-data.s3-eu-west-1.amazonaws.com/text_classifier/business.cat.train.csv
!curl -O https://biome-tutorials-data.s3-eu-west-1.amazonaws.com/text_classifier/business.cat.valid.csv

We will store the absolute paths of the data files to use them later on when creating our DataSources.

import os
train_path = os.path.abspath("business.cat.train.csv")
valid_path = os.path.abspath("business.cat.valid.csv")

# Reuse the vocabulary

To be more efficient and speed things up, we will create the vocabulary beforehand and reuse it in every trial. For this we first have to create a Pipeline, then create the vocabulary from our DataSource and save it to a folder.

Let's start with defining the configuration of our pipeline (for details see the base tutorial):

from biome.text.data import DataSource

# Extract the label column to obtain the list of classes for the classifier head
labels = DataSource(train_path).to_dataframe().label.compute()
pipeline_dict = {
    "name": "german_business_names",
    "tokenizer": {
        "text_cleaning": {
            "rules": ["strip_spaces"]
        }
    },
    "features": {
        "word": {
            "embedding_dim": 64,
            "lowercase_tokens": True,
        },
        "char": {
            "embedding_dim": 32,
            "lowercase_characters": True,
            "encoder": {
                "type": "gru",
                "num_layers": 1,
                "hidden_size": 32,
                "bidirectional": True,
            },
            "dropout": 0.1,
        },
    },
    "head": {
        "type": "TextClassification",
        "labels": list(labels.value_counts().index),
        "pooler": {
            "type": "gru",
            "num_layers": 1,
            "hidden_size": 32,
            "bidirectional": True,
        },
        "feedforward": {
            "num_layers": 1,
            "hidden_dims": [32],
            "activations": ["relu"],
            "dropout": [0.0],
        },
    },
}

We will use this configuration dictionary to create our pipeline:

from biome.text import Pipeline
pl = Pipeline.from_config(pipeline_dict)

Next, we have to define the vocabulary configuration and create our vocabulary.

Note

If you want to optimize the vocabulary configuration in the hyperparameter search (for example, the min_count argument), you have to move the vocabulary creation to the trainable function below. That is, in each trial the vocabulary will be created anew.

from biome.text.configuration import VocabularyConfiguration, WordFeatures
vocab_config = VocabularyConfiguration(sources=[DataSource(train_path)], min_count={WordFeatures.namespace: 20})
pl.create_vocabulary(vocab_config)
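
If you did want to tune vocabulary parameters as mentioned in the note above, a minimal sketch of moving this step into the trainable function could look like the following. Note that the "min_count" key in the config dictionary is an assumed addition for illustration:

def trainable_with_vocab(config):
    # Build the pipeline without a precomputed vocabulary
    pl = Pipeline.from_config(config["pipeline"])
    # Create the vocabulary anew in each trial, with a tunable min_count
    # ("min_count" is a hypothetical entry in the config dictionary)
    vocab_config = VocabularyConfiguration(
        sources=[DataSource(train_path)],
        min_count={WordFeatures.namespace: config["min_count"]},
    )
    pl.create_vocabulary(vocab_config)
    # ... then continue with pl.train() as in the trainable function below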

To be able to reuse the vocabulary in each trial, we have to save it to a folder and store its absolute path:

vocab_absolute_path = os.path.abspath("./vocabulary")
pl.save_vocabulary(vocab_absolute_path)

# Implementing the callback for early stopping

In this tutorial we will use a trial scheduler that adaptively allocates resources to promising hyperparameter configurations by terminating less promising candidates early. The early stopping mechanism requires the reporting of some metric during a trial. For this we use a BaseTrainLogger that defines a method log_epoch_metrics() which is executed after each epoch, and pass it on to the Pipeline.train() method.

Our TuneReport class simply reports some metrics back to Tune, which in turn uses them to identify promising trials during the hyperparameter search.

from biome.text.loggers import BaseTrainLogger
from ray import tune

class TuneReport(BaseTrainLogger):
    def log_epoch_metrics(self, epoch, metrics):
        # Report the validation metrics back to Tune after each epoch,
        # so the scheduler can compare trials and stop unpromising ones early
        tune.report(
            validation_loss=metrics["validation_loss"],
            validation_accuracy=metrics["validation_accuracy"],
        )

tune_report = TuneReport()

# Defining the training loop

For the HPO with biome.text we will use the function-based Trainable API of Ray Tune. Therefore, we have to define a trainable function that takes as input a configuration dictionary and executes a training run.

We will use the configuration dictionary to create a Pipeline and a TrainerConfiguration in order to optimize the parameters of our architecture and the learning rate, respectively. In the Pipeline.train() method we will add our tune_report instance to the epoch callbacks, and completely silence the output of the training by setting quiet=True. This avoids cluttering the output of the hyperparameter search and makes it easier to follow the progress.

from biome.text.configuration import TrainerConfiguration

def trainable(config):
    # Build the pipeline from the sampled config, reusing the saved vocabulary
    pl = Pipeline.from_config(config["pipeline"], vocab_absolute_path)
    trainer_config = TrainerConfiguration(**config["trainer"])

    train_ds = DataSource(train_path)
    valid_ds = DataSource(valid_path)

    # Train silently and report metrics to Tune via the tune_report logger
    pl.train(
        output="output",
        training=train_ds,
        validation=valid_ds,
        trainer=trainer_config,
        loggers=[tune_report],
        quiet=True,
    )

# Random search with a trial scheduler

To perform a random hyperparameter search (as well as a grid search) we simply have to replace the parameters we want to optimize with methods from the Random Distributions API and the Grid Search API, respectively. For a complete description of both APIs and how they interplay with each other, see the corresponding section in the Ray Tune docs.
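
As a minimal, standalone illustration of how the two APIs combine (the values here are arbitrary): tune.grid_search entries are enumerated exhaustively, while distribution entries like tune.loguniform are sampled anew for each trial.

from ray import tune

toy_space = {
    "hidden_size": tune.grid_search([32, 64]),  # every listed value is tried
    "lr": tune.loguniform(1e-4, 1e-1),          # sampled randomly per trial
}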

In our case we will tune 9 parameters:

  • the output dimensions of our word and char features
  • the dropout of our char feature
  • the architecture of our pooler (GRU versus LSTM)
  • the number of layers and hidden size of our pooler, as well as whether it should be bidirectional
  • the hidden dimension of our feed-forward network
  • the learning rate

For most of the parameters we will provide discrete values from which Tune will sample randomly, while for the dropout and learning rate we will provide a continuous linear and logarithmic range, respectively. Since we want to directly compare the outcome of the optimization with the base configuration of the underlying tutorial, we will fix the number of epochs to 3.

Not all of the parameters above are worth tuning, but we want to stress the flexibility that Ray Tune and biome.text offer you.

Tip

Keep in mind that the learning rate "is often the single most important hyper-parameter and one should always make sure that it has been tuned (up to approximately a factor of 2). ... If there is only time to optimize one hyper-parameter and one uses stochastic gradient descent, then this is the hyper-parameter that is worth tuning." (Yoshua Bengio).

configs = {
    "pipeline": {
        "name": "german_business_names",
        "tokenizer": {
            "text_cleaning": {
                "rules": ["strip_spaces"]
            }
        },
        "features": {
            "word": {
                "embedding_dim": tune.choice([32, 64]),
                "lowercase_tokens": True,
            },
            "char": {
                "embedding_dim": 32,
                "lowercase_characters": True,
                "encoder": {
                    "type": "gru",
                    "num_layers": 1,
                    "hidden_size": tune.choice([32, 64]),
                    "bidirectional": True,
                },
                "dropout": tune.uniform(0, 0.5),
            },
        },
        "head": {
            "type": "TextClassification",
            "labels": list(labels.value_counts().index),
            "pooler": {
                "type": tune.choice(["gru", "lstm"]),
                "num_layers": tune.choice([1, 2]),
                "hidden_size": tune.choice([32,64]),
                "bidirectional": tune.choice([True, False]),
            },
            "feedforward": {
                "num_layers": 1,
                "hidden_dims": [tune.choice([32, 64])],
                "activations": ["relu"],
                "dropout": [0.0],
            },
        },
    },
    "trainer": {
        "optimizer": {
            "type": "adam",
            "lr": tune.loguniform(0.001, 0.01),
        },
        "num_epochs": 3,
        "cuda_device": 0,
    },
}

Note

By default we will use a GPU. If you do not have one available, just comment out the line "cuda_device": 0 in the trainer section of the dictionary above.

In this tutorial we will perform a random search together with the Asynchronous Successive Halving Algorithm (ASHA) to schedule our trials. The Ray Tune developers recommend this scheduler as a good starting point because of its aggressive termination of low-performing trials.

To create an instance of the ASHAScheduler we have to specify the decisive metric for terminating low-performing trials and the mode of this metric (is the objective to minimize the metric, min, or to maximize it, max). For a complete description of the configurations, see the ASHAScheduler docs.

from ray.tune.schedulers import ASHAScheduler
asha = ASHAScheduler(metric="validation_loss", mode="min")
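
If you want more control over the early stopping behavior, the scheduler accepts further arguments. Here is a sketch with illustrative values (the defaults used above work fine for this tutorial):

asha = ASHAScheduler(
    metric="validation_loss",
    mode="min",
    max_t=3,             # upper bound on reported iterations per trial (here: epochs)
    grace_period=1,      # train each trial for at least one epoch before stopping it
    reduction_factor=4,  # keep roughly a quarter of the trials at each halving rung
)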

# Following the progress with tensorboard (optional)

Ray Tune automatically logs its results with TensorBoard. We can take advantage of this and launch a TensorBoard instance before starting the hyperparameter search to follow its progress.

%load_ext tensorboard
%tensorboard --logdir ./tune/trainable

Screenshot of TensorBoard with Ray Tune

Now we have everything ready to start our hyperparameter search with the tune.run() method.

The number of trials our search will go through is determined by the num_samples parameter. For a random search like ours it equals the total number of trials, whereas for a grid search the total number of trials is num_samples times the number of grid configurations (see the Tune docs for illustrative examples).

The number of trials running in parallel depends on your resources_per_trial configuration and your local resources. The default value is {"cpu": 1, "gpu": 0} and results, for example, in 8 parallel trials on a machine with 8 CPUs. You can also use fractional values: to share a GPU between 2 trials, for example, pass {"gpu": 0.5}.

The local_dir parameter defines the output directory of the HPO results and will also contain the training results of each trial (that is, the model weights and metrics).

Note

Keep in mind: to run your HPO on GPUs, you have to specify them both in the TrainerConfiguration of the trainable function and in the resources_per_trial parameter when calling tune.run(). If you do not want to use a GPU, just set the value to 0: {"cpu": 1, "gpu": 0}.

analysis = tune.run(
    trainable, 
    config=configs, 
    scheduler=asha, 
    num_samples=50, 
    resources_per_trial={"cpu": 1, "gpu": 0.5},
    local_dir="./tune", 
)

# Checking the results

The results of the analysis object returned by tune.run() can be accessed as a pandas DataFrame.

analysis.dataframe().sort_values("validation_loss")
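
For a quick overview you can also narrow the DataFrame down to the reported metrics; the column names below match what our TuneReport logger sends to Tune:

df = analysis.dataframe()
# Summary statistics of the metrics reported during the search
df[["validation_loss", "validation_accuracy"]].describe()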

Screenshot of the analysis dataframe

Even though with 50 trials we visit just a small part of our possible configuration space, we should have achieved an accuracy of ~0.94, an increase of roughly 3 points compared to the original configuration of the base tutorial.

In a real-life example, though, you should probably increase the number of epochs, since the validation loss in general still seems to be decreasing.

A next step could be to fix some of the tuned parameters to the preferred value, and tune other parameters further or limit their value space.

Tip

To obtain insights about the importance and tendencies of each hyperparameter for the model, we recommend using TensorBoard's HPARAMS section and following Richard Liaw's suggestions at the end of his talk.

# Evaluate the best performing model

The analysis object also provides some convenient methods to obtain the best performing configuration, as well as the logdir where the results of the trial are saved.

best_config = analysis.get_best_config(metric="validation_loss", mode="min")
best_logdir = analysis.get_best_logdir(metric="validation_loss", mode="min")

We can use the best_logdir to create a pipeline with the best performing model and start making predictions.

best_model = os.path.join(best_logdir, "output", "model.tar.gz")
pl_trained = Pipeline.from_pretrained(best_model)
pl_trained.predict(text="Autohaus Recognai")

Or we can use biome.text's explore UI to evaluate the performance of our model in more detail.

Warning

For the UI to work you need a running Elasticsearch instance. We recommend installing Elasticsearch with docker.

pl_trained.explore(DataSource(valid_path), explain=True)

Screenshot of the biome.text explore UI

Note

For an unbiased evaluation of the model you should use a test dataset that was not used during the HPO!
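
As a sketch of such a final evaluation, assuming you have set aside a separate test split (the file business.cat.test.csv is hypothetical, the tutorial data only provides train and validation splits):

# Hypothetical held-out test split, not part of the tutorial data
test_path = os.path.abspath("business.cat.test.csv")
pl_trained.explore(DataSource(test_path))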
