###########
Tutorials
###########
**************
Installation
**************
You need a Python environment with the following dependencies:

.. code:: text

   - python 3.13.11
   - pytorch 2.8.0
   - pyyaml 6.0.3
   - h5py 3.15.1
   - pydantic 2.12.4
   - trueskill 0.4.5

You can use the Dockerfile to create a Docker image.

.. code:: bash

   docker compose build mrl_prod    # Create the image
   docker compose run --rm mrl_prod # Run a container from the image

If you want to use the GUI-based features, you will also need to enable
communication between the Docker container and your system display. The
required options are already included in the ``docker-compose.yaml``
file, but they may need to be adjusted for your specific system.
On macOS systems you will need to install and use `XQuartz
<https://www.xquartz.org/>`_.
Once the container is running, you can verify that the library is
available:

.. code:: bash

   run_game -h
   run_alpha_zero -h

******************
Runners overview
******************
Getting examples
================
You need a YAML configuration file describing the game you want to run.
You can generate the built-in examples with the following command. This
will create a directory called ``examples``.

.. code:: bash

   get_examples

NOTE: The container uses the directory ``/mrl`` as an alias for the
``production_space`` directory inside the repository. You only need to
run this command once. The examples will persist across sessions even if
you delete the container.
If you are not using Docker, you can still retrieve the examples using
the ``get_examples`` command, but you must also ensure that your Python
path includes the ``examples`` directory. The Docker image already adds
this directory to ``PYTHONPATH``.

.. code:: bash

   get_examples
   export PYTHONPATH=$PYTHONPATH:

Game runner
===========
You can play a game in the terminal:

.. code:: bash

   run_game examples/tic_tac_toe_manual.yaml --mode terminal

If a GUI is available, you can use it instead:

.. code:: bash

   run_game examples/tic_tac_toe_manual.yaml --mode gui

You can run self-play games and collect outcome statistics:

.. code:: bash

   run_game examples/tic_tac_toe_auto.yaml --mode evaluate

AlphaZero runner
================
You can train an AlphaZero model. Note that the default configuration is
only intended as a smoke test. To properly train a Tic Tac Toe model you
will need to increase the training parameters.
Keep in mind that larger training configurations take longer to run, so
it is best to increase the parameters incrementally to gauge the overall
training time.

.. code:: bash

   run_alpha_zero examples/tic_tac_toe_alpha_zero.yaml --mode train

You can evaluate the trained AlphaZero model against a random policy:

.. code:: bash

   run_alpha_zero examples/tic_tac_toe_alpha_zero.yaml --mode evaluate

You can also play against the trained AlphaZero model:

.. code:: bash

   run_alpha_zero examples/tic_tac_toe_alpha_zero.yaml --mode terminal

You can evaluate the trained AlphaZero model against another policy. In
this example, the deterministic version of the trained model is
evaluated against its non-deterministic counterpart.

.. code:: bash

   run_game examples/tic_tac_toe_alpha_zero_auto.yaml --mode evaluate

**********************
Policy customization
**********************
You can create your own hard-coded policy to experiment with.
An example policy class is available in ``examples/opportunity_policy``.
This policy is designed for the Tic Tac Toe game. It implements a
``__call__`` method that takes an observation and an action space as
input and returns the action selected by the player.
Depending on the configuration, the policy may:

- check whether a winning line is available and play the winning move;
- check whether the opponent has a winning line and block it;
- otherwise select an action randomly.

Variables defined in the ``__init__`` method can be configured through
the YAML configuration file. See
``examples/tic_tac_toe_custom_policy.yaml`` for an example.
In the configuration you must specify the policy name and the module
path relative to a directory included in ``PYTHONPATH``. The ``player``
and ``game`` parameters are automatically provided by the framework and
should not be included in the configuration.
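To make the interface concrete, a hard-coded policy along these lines could be sketched as below. This is a minimal self-contained sketch, not the actual ``examples/opportunity_policy`` code: the board encoding (a flat list of nine cells holding ``"X"``, ``"O"``, or ``None``), the marker values, and the constructor signature are assumptions.

.. code:: python

   import random

   # Index triples forming the winning lines of a 3x3 board.
   WIN_LINES = [
       (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
       (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
       (0, 4, 8), (2, 4, 6),             # diagonals
   ]

   class OpportunityPolicy:
       # `player` and `game` would be injected by the framework;
       # `block_opponent` could come from the YAML configuration.
       def __init__(self, player="X", game=None, block_opponent=True):
           self.player = player
           self.opponent = "O" if player == "X" else "X"
           self.block_opponent = block_opponent

       def __call__(self, observation, action_space):
           board = observation  # assumed: flat list of 9 cells
           # 1) Play a winning move if one is available.
           for action in action_space:
               if self._wins(board, action, self.player):
                   return action
           # 2) Block the opponent's winning move.
           if self.block_opponent:
               for action in action_space:
                   if self._wins(board, action, self.opponent):
                       return action
           # 3) Otherwise pick randomly.
           return random.choice(list(action_space))

       @staticmethod
       def _wins(board, action, marker):
           # Would placing `marker` at `action` complete a line?
           trial = list(board)
           trial[action] = marker
           return any(all(trial[i] == marker for i in line)
                      for line in WIN_LINES)
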
You can run the example with:

.. code:: bash

   run_game examples/tic_tac_toe_custom_policy.yaml --mode evaluate

********************
Game customization
********************
You can create your own custom game to experiment with.
An example game is provided in ``examples/coordination.py``.
This is a discrete-time game in which two players attempt to coordinate
by selecting the same action over a series of attempts. The game keeps
track of the history of attempts, and the final payoff is the average
number of successful coordinations.
The state includes:

- the history of previous attempts;
- the ``is_final`` attribute, which indicates whether the current step
  is the last step of the game.

A *perspective* represents how a player observes the game. At a minimum,
a valid perspective must implement two methods:

- ``get_observation``: converts the state into the observation seen by
  the player;
- ``get_action_space``: returns the set of actions available to the
  player in the current state.

This is a full-information game, so the observation is identical to the
state. If you want to evaluate play statistics using the game runner,
you must also implement a ``get_reward`` method that returns a numerical
value representing the reward obtained upon reaching a state. The final
payoff is defined as the sum of all rewards accumulated throughout the
game.
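A perspective for the coordination game could be sketched as below. The per-step reward follows the description above (each successful coordination contributes ``1 / number_of_attempts``, so the summed rewards equal the average number of coordinations), but the state layout, names, and signatures are assumptions rather than the library's actual API.

.. code:: python

   from dataclasses import dataclass, field

   @dataclass
   class CoordinationState:
       history: list = field(default_factory=list)  # (a0, a1) attempts
       is_final: bool = False

   class CoordinationPerspective:
       def __init__(self, number_of_attempts, actions=(0, 1, 2)):
           self.number_of_attempts = number_of_attempts
           self.actions = actions

       def get_observation(self, state):
           # Full information: the observation is the state itself.
           return state

       def get_action_space(self, state):
           return [] if state.is_final else list(self.actions)

       def get_reward(self, state):
           # Reward of the last attempt; summing over the whole game
           # yields the average number of successful coordinations.
           if not state.history:
               return 0.0
           a0, a1 = state.history[-1]
           return (1.0 if a0 == a1 else 0.0) / self.number_of_attempts
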
The game class itself represents the game rules. It must implement:

- ``make_initial_state`` to generate the initial state;
- ``get_players`` to list the players in the game;
- ``get_perspectives`` to define the perspective associated with each
  player;
- ``update`` to update the game state.

The signature of ``update`` depends on whether the game is turn-based or
discrete-time. In discrete-time games all players act simultaneously, so
all actions are available when ``update`` is called.
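A skeleton of such a discrete-time game class, under assumed details (dict-based states, a fixed number of attempts, hypothetical player names), might look like this:

.. code:: python

   import copy

   class CoordinationGame:
       def __init__(self, number_of_attempts=5):
           self.number_of_attempts = number_of_attempts

       def make_initial_state(self):
           return {"history": [], "is_final": False}

       def get_players(self):
           return ["player_0", "player_1"]

       def get_perspectives(self):
           # Full information: both players can share one perspective.
           shared = "coordination_perspective"  # placeholder object
           return {player: shared for player in self.get_players()}

       def update(self, state, actions):
           # Discrete time: `actions` maps every player to its
           # simultaneous action; all are applied in a single update.
           next_state = copy.deepcopy(state)
           attempt = tuple(actions[p] for p in self.get_players())
           next_state["history"].append(attempt)
           next_state["is_final"] = (
               len(next_state["history"]) >= self.number_of_attempts
           )
           return next_state
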
The example also includes a policy for playing the game in the terminal.
This policy derives from the ``InteractivePolicy`` class. Its
``__call__`` method prompts the user for input and returns a validated
action.
The policy also implements notification methods that communicate the
state of the game to the user. For discrete-time games you must
implement ``notify_actions``, while turn-based games use
``notify_action``.
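An interactive policy along these lines could be sketched as follows. The base class here is a stand-in so that the snippet runs on its own, and the prompt text and validation loop are assumptions.

.. code:: python

   class InteractivePolicy:  # stand-in for the library base class
       pass

   class TerminalCoordinationPolicy(InteractivePolicy):
       def __init__(self, input_fn=input):
           self.input_fn = input_fn  # injectable for testing

       def __call__(self, observation, action_space):
           # Prompt until the user enters a valid action.
           while True:
               raw = self.input_fn(f"Choose one of {action_space}: ")
               try:
                   action = int(raw)
               except ValueError:
                   continue
               if action in action_space:
                   return action

       def notify_actions(self, actions):
           # Discrete-time games report all simultaneous actions at once.
           print(f"Actions played: {actions}")
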
You can evaluate the game using predefined policies:

.. code:: bash

   run_game examples/coordination_auto.yaml --mode evaluate

You can also play the game in the terminal:

.. code:: bash

   run_game examples/coordination_manual.yaml --mode terminal

*************************
MCTS Game customization
*************************
The coordination game described above is not suitable for AlphaZero.
The AlphaZero algorithm implemented here only supports turn-based,
restorable games with a discrete action space and a reward-observable
perspective. The perspective must also be able to encode the state and
action space in an array format suitable for neural network processing.
The Centipede game provides an example of such a game.
You must define a perspective whose ``get_observation`` method returns
an ``MCTSObservation``. This object lazily collects the information
required by the algorithm when needed.
The ``get_core`` method returns a vector representation of the state. In
this example it is a one-dimensional vector containing the reward the
player would receive by choosing the ``TAKE`` action at the current
turn.
The ``get_action_space`` method returns the list of actions available in
the state.
The game class must also implement a ``restore`` method in addition to
the usual methods. This method reconstructs a game state compatible with
a given observation. Since the example uses ``MCTSObservation`` and the
game has full information, this method simply returns a copy of the
state component.
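Putting these pieces together, a minimal stand-in version could look like the sketch below. Here ``MCTSObservation`` is replaced by a plain holder class so the snippet is self-contained, and the pot-based state layout is an assumption.

.. code:: python

   class MCTSObservation:
       # Stand-in for the library class: it keeps the state component
       # and the lazy callbacks used by the algorithm.
       def __init__(self, state, get_core, get_action_space):
           self.state = state
           self.get_core = get_core
           self.get_action_space = get_action_space

   class CentipedePerspective:
       def get_observation(self, state):
           return MCTSObservation(
               state=state,
               # One-dimensional vector: the reward the player would
               # receive by choosing TAKE at the current turn.
               get_core=lambda: [float(state["pot"])],
               get_action_space=lambda: ["TAKE", "PASS"],
           )

   class CentipedeGame:
       def restore(self, observation):
           # Full information: restoring simply returns a copy of the
           # observation's state component.
           return dict(observation.state)
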
You can train an AlphaZero model (again, this configuration is only a
smoke test):

.. code:: bash

   run_alpha_zero examples/centipede_alpha_zero.yaml --mode train

You can evaluate the trained model against a random policy:

.. code:: bash

   run_alpha_zero examples/centipede_alpha_zero.yaml --mode evaluate

***************************
MCTS Oracle customization
***************************
You can create a custom oracle for training. An example implementation
is provided in ``simple_mlp.py``.
This example shows how to create a custom neural network using building
blocks already available in the library.
The ``SaveLoadModule`` block extends ``torch.nn.Module`` and adds
``save`` and ``load`` methods for storing and loading parameters as
expected by the AlphaZero runner. Together with ``OracleMixin``, this
produces a ``TrainableOracle`` suitable for AlphaZero training.
The ``OracleMixin`` implements the core oracle functionality for
computing expected payoff values and action probabilities.
The mixin requires the following components to be defined in the
constructor:

- ``torso``: a neural network that performs the initial processing of
  the input data. Its output is passed to the other two networks.
- ``policy_head``: a neural network that computes the logits for the
  action probabilities (before the softmax operation).
- ``value_head``: a neural network that computes the expected payoff
  for the input state.

You must also call the base class constructor, passing the input shape
(as a tuple of integers) and the output size (as an integer).
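The split into the three components can be illustrated with plain ``torch`` modules. This sketch omits the ``SaveLoadModule``/``OracleMixin`` machinery, and the board shape and layer widths are assumptions chosen for Tic Tac Toe.

.. code:: python

   import math
   import torch
   from torch import nn

   class SimpleMLP(nn.Module):
       # In the library, the base class constructor would receive the
       # input shape (a tuple of ints) and the output size (an int);
       # here they are plain constructor arguments.
       def __init__(self, input_shape=(9,), output_size=9, width=32):
           super().__init__()
           in_features = math.prod(input_shape)
           # Shared initial processing of the input data.
           self.torso = nn.Sequential(
               nn.Flatten(),
               nn.Linear(in_features, width),
               nn.ReLU(),
           )
           # Logits for the action probabilities (before the softmax).
           self.policy_head = nn.Linear(width, output_size)
           # Expected payoff for the input state.
           self.value_head = nn.Linear(width, 1)

       def forward(self, x):
           features = self.torso(x)
           return self.policy_head(features), self.value_head(features)
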
You can train an AlphaZero model using the custom network:

.. code:: bash

   run_alpha_zero examples/tic_tac_toe_alpha_zero_custom_oracle.yaml --mode train

NOTE: You can also implement a hard-coded oracle that does not require
training. In that case you only need to implement the basic ``Oracle``
protocol. Such an oracle can be used for evaluation together with
policies that require an oracle (such as ``MCTSPolicy``), but it is not
sufficient for ``run_alpha_zero`` training because it is not a
``TrainableOracle``. See the implementation of ``RandomRollout`` for an
example of a non-trainable oracle.
What to do next?
================
Edit ``tic_tac_toe_alpha_zero.yaml`` to allow longer training and try
playing against the trained model.
For example, you can modify the following parameters:

- ``oracle.capacity.nn_width``: set to 18 (to increase the model
  capacity to learn complex strategies)
- ``collector.number_of_processes``: set to 4 (to speed up data
  collection)
- ``collector.mcts.number_of_simulations``: set to 225 (to obtain
  better move evaluations)
- ``collector.number_of_episodes``: set to 100 (to simulate more games
  for each training epoch)
- ``collector.max_buffer_length``: set to 10000 (to use more training
  examples at once during training)
- ``trainer.max_training_epochs``: set to 100 (to ensure the model is
  trained enough for each batch of training examples)
- ``number_of_epochs``: set to 20 (to train the model for longer)

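Assuming the configuration file nests these keys in the obvious way (the actual layout of ``tic_tac_toe_alpha_zero.yaml`` may differ), the edited sections would look roughly like:

.. code:: yaml

   number_of_epochs: 20
   oracle:
     capacity:
       nn_width: 18
   collector:
     number_of_processes: 4
     number_of_episodes: 100
     max_buffer_length: 10000
     mcts:
       number_of_simulations: 225
   trainer:
     max_training_epochs: 100
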
After applying the changes described above, the following results were
obtained in my tests:

- InMemory strategy: after 16 minutes of training, the model achieved

  - an average payoff of 0.92, with an 87% win rate and a 5% loss rate
    against the random policy;
  - an average payoff of 0.36, with a 72% draw rate and a 28% loss
    rate against the optimal ``AlphaBetaPolicy``.

- HDF5 strategy: after 8 minutes of training, the model achieved

  - an average payoff of 0.85, with an 81% win rate and an 11% loss
    rate against the random policy;
  - an average payoff of 0.31, with a 62% draw rate and a 38% loss
    rate against the optimal ``AlphaBetaPolicy``.