Tutorials

Installation

You need a Python environment with the following dependencies:

- python 3.13.11
- pytorch 2.8.0
- pyyaml 6.0.3
- h5py 3.15.1
- pydantic 2.12.4
- trueskill 0.4.5

You can use the Dockerfile to create a Docker image.

docker compose build mrl_prod      # Create the image
docker compose run --rm mrl_prod   # Run a container from the image

If you want to use the GUI-based features, you will also need to enable communication between the Docker container and your system display. The required options are already included in the docker-compose.yaml file, but they may need to be adjusted for your specific system.

On macOS systems you will need to install and use XQuartz.

Once the container is running, you can verify that the library is available:

run_game -h
run_alpha_zero -h

Runners overview

Getting examples

You need a YAML configuration file describing the game you want to run.

You can generate the built-in examples with the following command. This will create a directory called examples.

get_examples

NOTE: The container uses the directory /mrl as an alias for the production_space directory inside the repository. You only need to run this command once. The examples will persist across sessions even if you delete the container.

If you are not using Docker, you can still retrieve the examples using the get_examples command, but you must also ensure that the destination directory is included in your PYTHONPATH. The Docker image already adds this directory to PYTHONPATH.

get_examples <desired_destination_path>
export PYTHONPATH=$PYTHONPATH:<desired_destination_path>

Game runner

You can play a game in the terminal:

run_game examples/tic_tac_toe_manual.yaml --mode terminal

If a GUI is available, you can use it instead:

run_game examples/tic_tac_toe_manual.yaml --mode gui

You can run self-play games and collect outcome statistics:

run_game examples/tic_tac_toe_auto.yaml --mode evaluate

Alpha Zero Runner

You can train an AlphaZero model. Note that the default configuration is only intended as a smoke test. To properly train a Tic Tac Toe model you will need to increase the training parameters.

Keep in mind that larger training configurations take longer to run, so it is best to increase the parameters incrementally to get a sense of the overall training time.

run_alpha_zero examples/tic_tac_toe_alpha_zero.yaml --mode train

You can evaluate the trained AlphaZero model against a random policy:

run_alpha_zero examples/tic_tac_toe_alpha_zero.yaml --mode evaluate

You can also play against the trained AlphaZero model:

run_alpha_zero examples/tic_tac_toe_alpha_zero.yaml --mode terminal

You can evaluate the trained AlphaZero model against another policy. In this example, the deterministic version of the trained model is evaluated against its non-deterministic counterpart.

run_game examples/tic_tac_toe_alpha_zero_auto.yaml --mode evaluate

Policy customization

You can create your own hard-coded policy to experiment with.

An example policy class is available in examples/opportunity_policy.

This policy is designed for the Tic Tac Toe game. It implements a __call__ method that takes an observation and an action space as input and returns the action selected by the player.

Depending on the configuration, the policy may:

  • check whether a winning line is available and play the winning move;

  • check whether the opponent has a winning line and block it;

  • otherwise select an action randomly.
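
The checks above can be sketched as a standalone class. Everything in this sketch is illustrative: the class name, the list-of-cells observation format, and the way the player's mark is stored are assumptions for the example, not the actual interface of examples/opportunity_policy.

```python
import random

# Illustrative sketch only: the real class in examples/opportunity_policy may
# differ. The observation is assumed to be a list of 9 cells ("X", "O", or
# None) and the action space a list of free cell indices.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
    (0, 4, 8), (2, 4, 6),              # diagonals
]

class OpportunityPolicy:
    def __init__(self, try_to_win=True, try_to_block=True):
        # Flags like these could be exposed through the YAML configuration.
        self.try_to_win = try_to_win
        self.try_to_block = try_to_block
        self.mark = "X"  # assumption: in the framework, the player is provided automatically

    def _completing_move(self, board, mark, actions):
        # Return a free cell that completes a line for `mark`, if any.
        for line in WIN_LINES:
            values = [board[i] for i in line]
            if values.count(mark) == 2 and values.count(None) == 1:
                cell = line[values.index(None)]
                if cell in actions:
                    return cell
        return None

    def __call__(self, observation, action_space):
        opponent = "O" if self.mark == "X" else "X"
        if self.try_to_win:
            move = self._completing_move(observation, self.mark, action_space)
            if move is not None:
                return move  # play the winning move
        if self.try_to_block:
            move = self._completing_move(observation, opponent, action_space)
            if move is not None:
                return move  # block the opponent's winning line
        return random.choice(action_space)
```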

Variables defined in the __init__ method can be configured through the YAML configuration file. See examples/tic_tac_toe_custom_policy.yaml for an example.

In the configuration you must specify the policy name and the module path relative to a directory included in PYTHONPATH. The player and game parameters are automatically provided by the framework and should not be included in the configuration.
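
A configuration entry along these lines might look as follows. The key names here are hypothetical sketches of the schema described above; refer to examples/tic_tac_toe_custom_policy.yaml for the actual field names.

```yaml
# Hypothetical sketch: the exact keys expected by the framework may differ.
policies:
  - name: OpportunityPolicy              # policy class name
    module: examples.opportunity_policy  # module path relative to PYTHONPATH
    try_to_win: true                     # __init__ parameters of the policy
    try_to_block: true
    # player and game are provided by the framework, not listed here
```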

You can run the example with:

run_game examples/tic_tac_toe_custom_policy.yaml --mode evaluate

Game customization

You can create your own custom game to experiment with.

An example game is provided in examples/coordination.py.

This is a discrete-time game in which two players attempt to coordinate by selecting the same action over a series of attempts. The game keeps track of the history of attempts, and the final payoff is the average number of successful coordinations.

The state includes:

  • the history of previous attempts;

  • the is_final attribute, which indicates whether the current step is the last step of the game.

A perspective represents how a player observes the game. At a minimum, a valid perspective must implement two methods:

  • get_observation: converts the state into the observation seen by the player;

  • get_action_space: returns the set of actions available to the player in the current state.

This is a full-information game, so the observation is identical to the state. If you want to evaluate play statistics using the game runner, you must also implement a get_reward method that returns a numerical value representing the reward obtained upon reaching a state. The final payoff is defined as the sum of all rewards accumulated throughout the game.
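
Under these assumptions, a minimal state and perspective for the coordination game could be sketched as follows. All names are illustrative stand-ins, not the actual classes in examples/coordination.py.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the structures described above; the real classes in
# examples/coordination.py may be organized differently.

@dataclass
class CoordinationState:
    max_steps: int
    history: list = field(default_factory=list)  # list of (action_a, action_b) pairs

    @property
    def is_final(self):
        # True when the current step is the last step of the game.
        return len(self.history) >= self.max_steps

class CoordinationPerspective:
    """How one player observes the full-information game."""

    def __init__(self, player, n_actions=2):
        self.player = player
        self.n_actions = n_actions

    def get_observation(self, state):
        # Full information: the observation is the state itself.
        return state

    def get_action_space(self, state):
        return list(range(self.n_actions))

    def get_reward(self, state):
        # 1/max_steps per successful coordination, so the final payoff (the
        # sum of all rewards) equals the average number of successes.
        if not state.history:
            return 0.0
        a, b = state.history[-1]
        return (1.0 / state.max_steps) if a == b else 0.0
```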

The game class itself represents the game rules. It must implement:

  • make_initial_state to generate the initial state;

  • get_players to list the players in the game;

  • get_perspectives to define the perspective associated with each player;

  • update to update the game state.

The signature of update depends on whether the game is turn-based or discrete-time. In discrete-time games all players act simultaneously, so all actions are available when update is called.
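
A discrete-time version of the game-rules class could be sketched like this. The method names follow the text above, but the exact signatures used by the framework may differ, and the state class here is a minimal stand-in.

```python
from dataclasses import dataclass, field

# Sketch of the game-rules class described above; illustrative only.

@dataclass
class State:
    max_steps: int
    history: list = field(default_factory=list)

    @property
    def is_final(self):
        return len(self.history) >= self.max_steps

class CoordinationGame:
    def __init__(self, max_steps=3):
        self.max_steps = max_steps

    def make_initial_state(self):
        return State(max_steps=self.max_steps)

    def get_players(self):
        return ["player_0", "player_1"]

    def get_perspectives(self):
        # One perspective per player; a real implementation would return
        # proper perspective objects here instead of None.
        return {player: None for player in self.get_players()}

    def update(self, state, actions):
        # Discrete-time game: all players act simultaneously, so `actions`
        # maps every player to the action it chose this step.
        joint = tuple(actions[player] for player in self.get_players())
        state.history.append(joint)
        return state
```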

The example also includes a policy for playing the game in the terminal. This policy derives from the InteractivePolicy class. Its __call__ method prompts the user for input and returns a validated action.

The policy also implements notification methods that communicate the state of the game to the user. For discrete-time games you must implement notify_actions, while turn-based games use notify_action.
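
A terminal policy of this kind could be sketched as follows. InteractivePolicy belongs to the library; a stand-in base class is used here so the example is self-contained, and the input/output functions are injectable so the prompt loop can be exercised without a real terminal.

```python
class InteractivePolicy:  # stand-in for the library base class
    pass

class TerminalCoordinationPolicy(InteractivePolicy):
    def __init__(self, read=input, write=print):
        self.read = read
        self.write = write

    def __call__(self, observation, action_space):
        # Prompt until the user enters a valid action, then return it.
        while True:
            raw = self.read(f"Choose an action {action_space}: ")
            try:
                action = int(raw)
            except ValueError:
                continue
            if action in action_space:
                return action
            self.write(f"{action} is not a valid action.")

    def notify_actions(self, actions):
        # Discrete-time games report all players' actions at once.
        self.write(f"Actions played this step: {actions}")
```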

You can evaluate the game using predefined policies:

run_game examples/coordination_auto.yaml --mode evaluate

You can also play the game in the terminal:

run_game examples/coordination_manual.yaml --mode terminal

MCTS Game customization

The coordination game described above is not suitable for AlphaZero.

The AlphaZero algorithm implemented here only supports turn-based, restorable games with a discrete action space and a reward-observable perspective. The perspective must also be able to encode the state and action space in an array format suitable for neural network processing.

The Centipede game provides an example of such a game.

You must define a perspective whose get_observation method returns an MCTSObservation. This object lazily collects the information required by the algorithm when needed.

The get_core method returns a vector representation of the state. In this example it is a one-dimensional vector containing the reward the player would receive by choosing the TAKE action at the current turn.

The get_action_space method returns the list of actions available in the state.

The game class must also implement a restore method in addition to the usual methods. This method reconstructs a game state compatible with a given observation. Since the example uses MCTSObservation and the game has full information, this method simply returns a copy of the state component.
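
The pieces described above could be sketched as follows. MCTSObservation is a library class; a simplified stand-in is used here, with get_core and get_action_space computed lazily via callables, so the example runs on its own. The state format and class names are assumptions.

```python
import copy

# Illustrative sketch; the real classes in the Centipede example may differ.

class MCTSObservation:  # stand-in for the library class
    def __init__(self, state, core_fn, action_space_fn):
        self.state = state
        self._core_fn = core_fn
        self._action_space_fn = action_space_fn

    def get_core(self):
        # Vector representation of the state, computed on demand.
        return self._core_fn(self.state)

    def get_action_space(self):
        return self._action_space_fn(self.state)

class CentipedePerspective:
    TAKE, PASS = 0, 1

    def get_observation(self, state):
        return MCTSObservation(
            state,
            # one-dimensional vector: the reward for choosing TAKE now
            core_fn=lambda s: [float(s["pot"])],
            action_space_fn=lambda s: [self.TAKE, self.PASS],
        )

class CentipedeGame:
    def restore(self, observation):
        # Full-information game: reconstructing a state compatible with the
        # observation amounts to copying its state component.
        return copy.deepcopy(observation.state)
```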

You can train an AlphaZero model (again, this configuration is only a smoke test):

run_alpha_zero examples/centipede_alpha_zero.yaml --mode train

You can evaluate the trained model against a random policy:

run_alpha_zero examples/centipede_alpha_zero.yaml --mode evaluate

MCTS Oracle customization

You can create a custom oracle for training. An example implementation is provided in simple_mlp.py.

This example shows how to create a custom neural network using building blocks already available in the library.

The SaveLoadModule block extends torch.nn.Module and adds save and load methods for storing and loading parameters as expected by the AlphaZero runner. Together with OracleMixin, this produces a TrainableOracle suitable for AlphaZero training.

The OracleMixin implements the core oracle functionality for computing expected payoff values and action probabilities.

The mixin requires the following components to be defined in the constructor:

  • torso: a neural network that performs the initial processing of the input data. Its output is passed to the other two networks.

  • policy_head: a neural network that computes the logits for the action probabilities (before the softmax operation).

  • value_head: a neural network that computes the expected payoff for the input state.

You must also call the base class constructor, passing the input shape (as a tuple of integers) and the output size (as an integer).
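
The three components could be assembled roughly as follows. This sketch uses a plain torch.nn.Module in place of the library's SaveLoadModule and OracleMixin so it is self-contained; the class name and default sizes are illustrative.

```python
import torch
from torch import nn

# Sketch of the torso / policy_head / value_head structure described above.
# The real TrainableOracle would derive from SaveLoadModule and OracleMixin
# and pass the input shape (tuple) and output size (int) to the base class.

class SimpleMLPOracle(nn.Module):
    def __init__(self, input_shape=(9,), output_size=9, hidden=32):
        super().__init__()
        in_features = 1
        for dim in input_shape:
            in_features *= dim
        # torso: shared initial processing; its output feeds both heads
        self.torso = nn.Sequential(
            nn.Flatten(), nn.Linear(in_features, hidden), nn.ReLU()
        )
        # policy_head: logits for the action probabilities (pre-softmax)
        self.policy_head = nn.Linear(hidden, output_size)
        # value_head: expected payoff for the input state
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, x):
        features = self.torso(x)
        return self.policy_head(features), self.value_head(features)
```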

You can train an AlphaZero model using the custom network:

run_alpha_zero examples/tic_tac_toe_alpha_zero_custom_oracle.yaml --mode train

NOTE: You can also implement a hard-coded oracle that does not require training. In that case you only need to implement the basic Oracle protocol. Such an oracle can be used for evaluation together with policies that require an oracle (such as MCTSPolicy), but it is not sufficient for run_alpha_zero training because it is not a TrainableOracle. See the implementation of RandomRollout for an example of a non-trainable oracle.

What to do next?

Edit tic_tac_toe_alpha_zero.yaml to allow longer training and try playing against the trained model.

For example, you can modify the following parameters:

  • oracle.capacity.nn_width: set to 18 (to increase the model's capacity to learn complex strategies)

  • collector.number_of_processes: set to 4 (to speed up data collection)

  • collector.mcts.number_of_simulations: set to 225 (to obtain better move evaluations)

  • collector.number_of_episodes: set to 100 (to simulate more games for each training epoch)

  • collector.max_buffer_length: set to 10000 (to use more training examples at once during training)

  • trainer.max_training_epochs: set to 100 (to ensure the model is trained enough on each batch of training examples)

  • number_of_epochs: set to 20 (to train the model for longer)
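
Applied to the YAML file, the changes listed above would look roughly like this. The surrounding structure is abbreviated and the nesting simply follows the dotted paths above; check the actual file for the full schema.

```yaml
# Sketch of the modified sections; the full file contains additional fields.
oracle:
  capacity:
    nn_width: 18
collector:
  number_of_processes: 4
  number_of_episodes: 100
  max_buffer_length: 10000
  mcts:
    number_of_simulations: 225
trainer:
  max_training_epochs: 100
number_of_epochs: 20
```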

After applying the changes described above, I obtained the following results in my tests:

  • InMemory strategy: After 16 minutes of training, the model achieved

    • an average payoff of 0.92, with an 87% win rate and a 5% loss rate against the random policy;

    • an average payoff of 0.36, with a 72% draw rate and a 28% loss rate against the optimal AlphaBetaPolicy.

  • HDF5 strategy: After 8 minutes of training, the model achieved

    • an average payoff of 0.85, with an 81% win rate and an 11% loss rate against the random policy;

    • an average payoff of 0.31, with a 62% draw rate and a 38% loss rate against the optimal AlphaBetaPolicy.