Introduction
This page is a personal experiment to create a benchmark database for BayesFlow. The goals are:
- each run should be self-contained, containing all information required to run it again
- this includes all the code, so the page also serves as a somewhat less curated collection of code examples
- runs should contain training parameters in a fashion that is easy to browse and compare, to inform future decisions on architectures and hyperparameters
- the performance of different runs should be easy to compare for each benchmark task
The starting point and inspiration for this is the sbi-benchmark repository, which also provides the tasks of the SBI Benchmark suite used in this repository. I want to thank the authors Jan-Matthis Lueckmann, Jan Boelts, David Greenberg, Pedro Goncalves and Jakob Macke for providing those resources in an easy-to-use way under an open license.
The code for building this page can be found in this repository.
Technical Details
Apart from setting up the tasks, the hard part in making a useful benchmark is to store the input (parameters, data, ...) and the output (metrics, ...) of the algorithms in a comparable fashion. Doing so is mostly a problem of creating a consistent structure, which can only partly be enforced by technical means.
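To make this concrete, here is a purely illustrative sketch of the kind of record each run should boil down to; the field names and values are hypothetical, not the exact schema used here:

# Hypothetical sketch of a self-contained, comparable run record.
# Field names and values are illustrative only, not the actual schema.
run_record = {
    "task": "two_moons",                                   # which benchmark task was run
    "params": {"inference_network": "coupling_flow", "epochs": 100},
    "metrics": {"c2st": 0.55},                             # evaluation results
    "artifacts": ["source code", "trained weights", "loss history"],
}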
Backend
Luckily, organizing experiment runs is a common problem. In this project, I rely on MLflow for tracking runs, parameters and metrics. A custom MLflow backend provides basic validation of the supplied configuration, as well as some utilities for logging and a very basic post-run check. The runs can then be copied to the runs repository, which stores each run using git-lfs.
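For a rough idea of what this looks like in code, here is a minimal sketch using the plain MLflow tracking API; the parameter and metric names are placeholders, and the custom backend adds validation and checks on top of calls like these:

# Minimal sketch of MLflow run tracking (plain MLflow API, placeholder names/values).
import mlflow

with mlflow.start_run(run_name="two_moons-example"):
    # training configuration, stored so it is easy to browse and compare
    mlflow.log_param("task", "two_moons")
    mlflow.log_param("num_training_batches", 512)
    # evaluation results
    mlflow.log_metric("c2st", 0.55)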
Frontend
Leveraging the structure provided by the backend, we can post-process the runs, organize them by task, and use that structure to render this website. The website is a static site using:
- Eleventy as a static site generator
- Plotly.js to create plots
- D3 for basic data wrangling
- simple-datatables for interactive tables
This makes the site quite heavy, but enables nice ways to interact with the runs.
Reproducing Runs
You can find a link to the source code of each run in the run details, so you can download it and run the experiment yourself. The experiments are not deterministic (i.e., each run can produce different results), so do not expect identical results if you run the same experiment twice.
Preparations
Due to the chosen backend, there are a few constraints on how the examples can be reproduced. Most importantly, you need a working pyenv installation. If you encounter issues with this, installing it in a virtual machine running Debian might be a good workaround.
Next, set up a new Python virtual environment and install the mlflow-bayesflow-benchmark-plugin:
pip install git+https://codeberg.org/vpratz/mlflow-bayesflow-benchmark-plugin.git@main
Finally, you can run the experiment using:
python -c 'import mlflow; mlflow.run("<path-to-experiment>", backend="benchmark-backend")'
This will produce a run in the mlruns directory, which you can view using the mlflow server command.
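If you prefer to inspect the results programmatically instead of through the web UI, the standard MLflow Python API can also read the local mlruns directory; a minimal sketch (the available metric and parameter columns depend on the experiment):

# Read the runs from the local mlruns directory with the standard MLflow API.
import mlflow

mlflow.set_tracking_uri("file:./mlruns")
runs = mlflow.search_runs(search_all_experiments=True)  # returns a pandas DataFrame
print(runs[["run_id", "status", "start_time"]])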