Documentation
This documentation page should give you an overview of how to get started with CODES.
The technical API documentation can be found here.
Setup
First, clone the GitHub repository:
git clone ssh://git@github.com/robin-janssen/CODES-Benchmark
Optionally, you can set up a virtual environment (recommended).
Then, install the required packages:
pip install -r requirements.txt
The installation is now complete. To run and evaluate the benchmark, you first need to set up a configuration YAML file. A default config is provided, but you will likely want to adapt it to your needs. For more information, check the configuration page. There, we also offer an interactive Config-Generator tool with explanations to help you set up your benchmark.
You can also add your own datasets and models to the benchmark to evaluate them against each other or some of our baseline models. For more information on how to do this, please refer to the documentation.
Run the benchmark
The first step in running the benchmark is to train all the different models specified in the configuration. As this usually takes much longer than the actual benchmarking, it is executed as a separate step.
To start the training, run the run_training.py file. To pass in a config file with a filename different from the default config.yaml, use the --config argument when executing from the command line, like this:
/path/to/python3 run_training.py --config MyConfig.yaml
After the training is complete, the benchmark can be run. To start the benchmark, run the run_benchmark.py file. Remember to pass in the same config file as you used for the training.
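For example, a complete run with a custom config file (assuming the same file is used for both steps) could look like this:
/path/to/python3 run_training.py --config MyConfig.yaml
/path/to/python3 run_benchmark.py --config MyConfig.yaml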
Configuring the benchmark
The training and evaluation of different models is mainly configured from a YAML config file in the base directory of the repository. In this file, all of the tweakable run parameters can be set. This includes
- a name for the benchmark run (also used to create a path to store the results)
- the surrogates to train and compare
- the dataset to train and evaluate on
- training parameters like number of epochs, batch sizes, GPUs
- what evaluations to perform and their required parameters
If you don't feel like manually creating or editing your config, check our online config generator. You can configure everything and simply download the YAML config file.
The config file has the following structure (the order of parameters is not important as long as the nesting is correct):
Overall training parameters
training_id: str
The name of the benchmark run.
surrogates: list[str]
The list of surrogates to evaluate. See our surrogates for available options and how to add your own model. The name corresponds to the name of the surrogate's class.
batch_size: int | list[int]
Specifies the batch size for the surrogates during training. Can either be a single integer if all surrogates share a batch size, or a list of batch sizes as long as the list of surrogates.
epochs: int | list[int]
The number of epochs to train the surrogates for. Can be a single integer if all surrogates share the same number of epochs, or a list of epochs as long as the list of surrogates.
dataset:
  name: str
  The dataset to train and evaluate on. See our datasets for available options and how to add your own dataset.
  log10_transform: bool
  Whether to take the logarithm of the dataset. This may be useful for datasets where the values span many orders of magnitude.
  normalise: str
  How to normalise the data. Options are:
  - "minmax" - applies min-max normalization to the data to rescale it to [-1, 1].
  - "standardise" - applies standardization to the data to have a mean of 0 and a standard deviation of 1.
  - "disable" - no normalization is applied.
  If you want to apply another form of normalization to the data, you may have to add your own dataset and normalise the data beforehand.
seed: int
The random seed used to initialize the random seeds for Python, PyTorch and NumPy. In some benchmarks, the seed is altered deterministically, for example to train an ensemble (where each model requires a different seed).
losses: bool
Whether to record the losses in the output files.
verbose: bool
Whether to output additional information about the current processing step to the CLI.
Benchmark parameters
accuracy: bool
dynamic_accuracy: bool
timing: bool
compute: bool
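For orientation, a config.yaml based on the structure above might look roughly like this (the values and surrogate names are illustrative placeholders, and a real config may contain additional benchmark options beyond the flags listed above):
training_id: "my_first_run"
surrogates: ["MySurrogate", "AnotherSurrogate"]
batch_size: [256, 128]
epochs: 100
dataset:
  name: "osu2008"
  log10_transform: true
  normalise: "minmax"
seed: 42
losses: true
verbose: false
accuracy: true
dynamic_accuracy: true
timing: true
compute: true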
Once the configuration is complete, the configuration YAML file needs to be placed into the root directory of the CODES-Benchmark repository. The default filename the training and benchmark look for is config.yaml; however, you can specify any filename with the --config argument of the run_training.py and run_benchmark.py files.
Add your own dataset
Adding your own data to the CODES repository is fairly straightforward using the create_dataset function. You can simply pass your raw (numpy) data to the function along with some additional optional data, and it will create the appropriate file(s) in the datasets directory of the repository. After this, you will not need to interact with the data again, as the benchmark handles the data automatically based on the dataset name provided in the configuration.
A note on dataset availability: The benchmark can be run on local data as soon as you have created the dataset with create_dataset (i.e., the data can be completely offline/local). The actual data.hdf5 file in your new dataset directory is ignored by git and should not be added to the repository. If you want to make your dataset available to others (which we highly encourage), you can upload it to Zenodo and provide the download link in datasets/data_sources.yaml. If you choose to do this, you can push the created dataset directory to the repository, as it will later be used to store visualisations of the data or a surrogate_config.py that contains the hyperparameters for the surrogate models.
You can import the create_dataset
function from the codes
package. It has the following signature:
create_dataset
name: str
The name of the dataset and also the directory in which it will be stored, e.g. a dataset called "MyDataset" will be stored in the datasets/mydataset directory.
train_data: np.ndarray
The array of training data. It should be of the shape (n_trajectories, n_timesteps, n_species).
test_data: np.ndarray | None
The array of test data, optional. Should follow the same shape convention as the training data.
val_data: np.ndarray | None
The array of validation data, optional. Should follow the same shape convention as the training data.
split: tuple[float, float, float] | None
If test and validation data are not provided, the training data array can be split into train, test and validation data based on the split tuple provided. For example, a value of split=(0.8, 0.15, 0.05) will split the data into 80% training, 15% test and 5% validation data.
timesteps: np.ndarray | None
The timesteps array for the data, optional. Can be used if required in the surrogates or to set the time axis in the plots. If not provided, a [0, 1] array will be inferred from the shape of the data.
labels: list[str] | None
The species labels for the evaluation plots.
You can call the create_dataset function like this:
import numpy as np
from data.data_utils import create_dataset
# load your data
train_data = np.load("path/to/train_data.npy")
test_data = np.load("path/to/test_data.npy")
val_data = np.load("path/to/val_data.npy")
timesteps = np.load("path/to/timesteps.npy")
labels = ["species1", "species2", "species3"]
# create the dataset
create_dataset(
    name="MyDataset",
    train_data=train_data,
    test_data=test_data,
    val_data=val_data,
    timesteps=timesteps,
    labels=labels,
)
Alternatively, if you only have a single dataset array and want to split it into train, test and
validation data, you can do this:
import numpy as np
from data.data_utils import create_dataset
# load your data
data = np.load("path/to/data.npy")
timesteps = np.load("path/to/timesteps.npy")
# create the dataset
create_dataset(
    name="MyDataset",
    train_data=data,
    split=(0.8, 0.15, 0.05),
    timesteps=timesteps,
    labels=["species1", "species2", "species3"],
)
After calling the create_dataset function, the dataset will be stored in the datasets/mydataset directory of the repository (the dataset name is not case-sensitive; it will always be stored in lowercase). The benchmark will automatically load the data from there based on the dataset name provided in the configuration.
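To use the new dataset in a benchmark run, reference it by name in the dataset block of your config, for example (illustrative values):
dataset:
  name: "MyDataset"
  log10_transform: false
  normalise: "minmax"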
Add your own model
To be able to compare your own models to each other or to some of the baseline models provided by us, you need to add your own surrogate implementation to the repository. The AbstractSurrogateModel class offers a blueprint as well as some basic functionality like saving and loading models. Your own model needs to be implemented or wrapped in a class that inherits from the AbstractSurrogateModel class.
We recommend structuring your model such that hyperparameters you might want to change and tune in the future are stored in a separate dataclass. This keeps the hyperparameters and the actual code logic separate and easily accessible, and allows you to tune your surrogate without modifying the actual code. Check the Surrogate Configuration section and the tutorial with code examples below on how to do this.
For the integration into the benchmark, you need to implement four methods for your own model class:
-
__init__
The initialization method. In this method you can instantiate any objects you need during training and set attributes required later. The method should also call the superclass constructor and set the model configuration.
Arguments:
self
The required self argument for instance methods.
device: str
The device the model will train/evaluate on.
n_chemicals: int
The dimensionality (i.e. number of chemicals) in the dataset.
n_timesteps: int
The number of timesteps in the dataset.
model_config: dict
The configuration dictionary that is passed to the model upon initialization. This dictionary contains all the parameters from the configuration file that are relevant for the model.
-
prepare_data
This method serves as a helper function which creates and returns the torch dataloaders that provide the training data in a suitable format for your model.
Arguments:
self
The required self argument for instance methods.
dataset_train: np.ndarray
The raw training data as a numpy array. COMMENT ON DATA FORMAT + LINK
dataset_test: np.ndarray | None
The raw test data as a numpy array (optional).
dataset_val: np.ndarray | None
The raw validation data as a numpy array (optional).
timesteps: np.ndarray
The array of timesteps in the training data. If your model does not explicitly use these, you can just ignore this argument.
batch_size: int
The batch size your dataloader should have. This value is read from the configuration and should be passed directly to the DataLoader constructor (see example below).
shuffle: bool
The shuffle argument is set by the benchmark and should be passed directly to the DataLoader constructor.
Return:
The method should return a tuple of three dataloaders in the order train, test, val. If the dataset_test or dataset_val arguments are None, the respective dataloader should also be None instead.
-
fit
This method's purpose is to execute the training loop and train the self.model instantiated in the __init__ method. Optionally, a test prediction can be made on the test dataset to evaluate training progress.
Important: This method should save the training loss (and optionally the test loss and the mean absolute error on the test set) as tensors in the self.train_loss, self.test_loss and self.MAE attributes. See the example below on how to do that.
Arguments:
self
The required self argument for instance methods.
train_loader: torch.utils.data.DataLoader
The training dataloader.
test_loader: torch.utils.data.DataLoader | None
The test dataloader (optional).
epochs: int
The number of epochs to train the model for. This value is read from the configuration and should be used to determine the number of iterations in the training loop.
position: int
Position argument used for the progress bar. See the example below on how to use it.
description: str
Label argument used for the progress bar. See the example below on how to use it.
-
forward
This method should simply call the forward method of the model and return the output together with the targets.
Arguments:
self
The required self argument for instance methods.
inputs: Any
Whatever the dataloader outputs.
Return:
Returns a tuple of predictions and targets
Surrogate Configuration
To keep the hyperparameters of a surrogate (such as model dimensions, activation functions, learning rates, latent space dimensions, etc.) separate from the code of the actual surrogate model, and to make modifying those hyperparameters easy later on, we employ dataclasses as configurators for a surrogate model. Since the optimal parameters for a given surrogate will likely vary between datasets, our architecture enables you to define a configuration per dataset.
Each model comes with a default (or fallback) configuration which will be loaded by default.
Example Implementation
This short tutorial will go over all the required steps to add your own surrogate class to the benchmark and will provide some sample code. The surrogate we will add is just a variant of a fully connected neural network and serves only to demonstrate the process of adding your own implementation.
To get started, add a folder in the surrogates/ directory of the repository, named after your model. For this example, the model we will add is called MySurrogate, so we create the directory surrogates/MySurrogate/.
In this directory, we create the python file which will contain the code for our surrogate, called my_surrogate.py. We will also create a second file, my_surrogate_config.py, where we can define the hyperparameters of our surrogate. If you plan to use several datasets with your surrogate, you can also define a set of hyperparameters per dataset, as the optimal parameters might vary between datasets. Check the dataset section on how to do this.
For this demonstration, we will use the OSU2008 dataset. Our demonstration surrogate will simply take the initial abundances and make a prediction based on those.
Before implementing the surrogate itself, we will define its configuration dataclass. For this,
open the my_surrogate_config.py
file you created and add the hyperparameters you
might want to change in the future. For this example, we will add the width, depth,
activation function and learning rate of our neural network.
from dataclasses import dataclass
from torch.nn import ReLU, Module
@dataclass
class MySurrogateConfig:
    """Model config for MySurrogate for the osu2008 dataset"""

    hidden_layers: int = 2
    layer_width: int = 128
    activation: Module = ReLU()
    learning_rate: float = 1e-3
Next, we will implement a dataset class for our surrogate. You can put this class into the my_surrogate.py file we just created, or alternatively put it in a separate file and import it into my_surrogate.py.
import torch
from torch.utils.data import Dataset
class MyDataset(Dataset):
    def __init__(self, abundances, device):
        # abundances with shape (n_samples, n_timesteps, n_species)
        # cast to float32 so the data matches the default dtype of the model weights
        self.abundances = torch.tensor(abundances, dtype=torch.float32).to(device)
        self.length = self.abundances.shape[0]

    def __getitem__(self, index):
        return self.abundances[index, :, :]

    def __len__(self):
        return self.length
Now we implement the surrogate itself. It is important that the custom surrogate class is derived from the AbstractSurrogateModel class and adheres to its method signatures in order to be compatible with the benchmark.
Let's begin by implementing the __init__ method. All we need to do here is initialize our neural network, call the superclass constructor, and initialize our model config so its parameters are available inside our surrogate class.
from surrogates.surrogates import AbstractSurrogateModel
from torch import nn

from surrogates.MySurrogate.my_surrogate_config import MySurrogateConfig


class MySurrogate(AbstractSurrogateModel):
    def __init__(
        self,
        device: str | None,
        n_chemicals: int,
        n_timesteps: int,
        model_config: dict | None,
    ):
        super().__init__(device, n_chemicals, n_timesteps, model_config)
        model_config = model_config if model_config is not None else {}
        self.config = MySurrogateConfig(**model_config)
        # construct the model according to the parameters in the config
        modules = []
        modules.append(nn.Linear(n_chemicals, self.config.layer_width))
        modules.append(self.config.activation)
        for _ in range(self.config.hidden_layers):
            modules.append(nn.Linear(self.config.layer_width, self.config.layer_width))
            modules.append(self.config.activation)
        modules.append(nn.Linear(self.config.layer_width, n_chemicals * n_timesteps))
        self.model = nn.Sequential(*modules).to(device)
The next step is to implement the prepare_data
method. There, we instantiate and
return the dataloaders for our model using our custom defined dataset.
from torch.utils.data import DataLoader
import numpy as np


class MySurrogate(AbstractSurrogateModel):
    ...

    def prepare_data(
        self,
        dataset_train: np.ndarray,
        dataset_test: np.ndarray | None,
        dataset_val: np.ndarray | None,
        timesteps: np.ndarray,
        batch_size: int,
        shuffle: bool,
    ) -> tuple[DataLoader, DataLoader | None, DataLoader | None]:
        train = MyDataset(dataset_train, self.device)
        train_loader = DataLoader(train, batch_size=batch_size, shuffle=shuffle)
        if dataset_test is not None:
            test = MyDataset(dataset_test, self.device)
            test_loader = DataLoader(test, batch_size=batch_size, shuffle=shuffle)
        else:
            test_loader = None
        if dataset_val is not None:
            val = MyDataset(dataset_val, self.device)
            val_loader = DataLoader(val, batch_size=batch_size, shuffle=shuffle)
        else:
            val_loader = None
        return train_loader, test_loader, val_loader
Finally, we implement the training loop inside the fit function and define the forward function. Note that the fit function should set the train_loss, test_loss and MAE (mean absolute error) attributes of the surrogate to ensure their availability for plotting later. To have access to training durations later on, we wrap the fit function with the time_execution decorator from the utils module.
from torch.optim import Adam

from utils import time_execution


class MySurrogate(AbstractSurrogateModel):
    ...

    def forward(self, inputs):
        targets = inputs
        initial_cond = inputs[..., 0, :]
        outputs = self.model(initial_cond)
        # reshape the flat network output to match the target shape
        # (n_samples, n_timesteps, n_species)
        outputs = outputs.view_as(targets)
        return outputs, targets

    @time_execution
    def fit(
        self,
        train_loader: DataLoader,
        test_loader: DataLoader,
        epochs: int,
        position: int,
        description: str,
    ):
        criterion = nn.MSELoss()
        optimizer = Adam(self.model.parameters(), lr=self.config.learning_rate)
        # initialize the loss tensors
        losses = torch.empty((epochs, len(train_loader)))
        test_losses = torch.empty((epochs))
        MAEs = torch.empty((epochs))
        # setup the progress bar
        progress_bar = self.setup_progress_bar(epochs, position, description)
        # training loop as usual
        for epoch in progress_bar:
            for i, x_true in enumerate(train_loader):
                optimizer.zero_grad()
                x_pred, _ = self.forward(x_true)
                loss = criterion(x_true, x_pred)
                loss.backward()
                optimizer.step()
                losses[epoch, i] = loss.item()
            # set the progress bar output
            clr = optimizer.param_groups[0]["lr"]
            print_loss = f"{losses[epoch, -1].item():.2e}"
            progress_bar.set_postfix({"loss": print_loss, "lr": f"{clr:.1e}"})
            # evaluate the model on the test set
            with torch.inference_mode():
                self.model.eval()
                preds, targets = self.predict(test_loader)
                self.model.train()
                loss = criterion(preds, targets)
                test_losses[epoch] = loss
                MAEs[epoch] = self.L1(preds, targets).item()
        progress_bar.close()
        self.train_loss = torch.mean(losses, dim=1)
        self.test_loss = test_losses
        self.MAE = MAEs
Now that your surrogate class is completely implemented, the last thing left to do is to add it to the surrogate_classes.py file in the surrogates directory of the repository to make it available for the benchmark. In our case this looks like this (other, already existing surrogates are omitted in the code example):
...
from surrogates.MySurrogate.my_surrogate import MySurrogate

surrogate_classes = [
    ...
    # Add any additional surrogate classes here
    MySurrogate,
]
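To include the new surrogate in a benchmark run, list it by its class name under the surrogates entry of your config, e.g.:
surrogates: ["MySurrogate"]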
Now you're all set! You can use your own surrogate model in the benchmark and compare it with any of the other surrogates present.