Quick Start Guide

This quick start guide will help you play a game and run a minimal AlphaZero training loop. By the end, you should be able to run a game in the terminal, see the output, and start experimenting.

1. Installation

You can run the library using Docker (recommended) or a local Python environment.

Using Docker:

# Build the production image
docker compose build mrl_prod

# Start a container interactively
docker compose run --rm mrl_prod

Local Python environment (optional):

Install dependencies:

python -m venv venv
source venv/bin/activate
pip install torch==2.8.0 pyyaml==6.0.3 h5py==3.15.1 pydantic==2.12.4

Install MRL:

pip install .

Test that the library is available:

run_game -h

2. Run a simple game

We will start with TicTacToe and play against a random policy in the terminal.

run_game examples/tic_tac_toe_manual.yaml --mode terminal

You should see a 3x3 grid and be prompted to make moves. Press the key corresponding to the cell where you want to place your symbol.

3. Evaluate a policy match

Run a game automatically and see statistics for policy performance:

run_game examples/tic_tac_toe_auto.yaml --mode evaluate

This will run multiple simulations and show how the players perform.

You will see a report similar to this one:

Total plays as player O: 100
Mean Payoff: 0.665
Payoff distribution in buckets:
(-inf, 0.25): 28 (28%)
(0.25, 0.75): 11 (11%)
(0.75, inf): 61 (61%)

The report indicates that 100 games were simulated. Player O achieved a mean payoff of 0.665. In Tic-Tac-Toe, the payoff is 0 for losses, 1 for wins, and 0.5 for draws. Accordingly, the three buckets above represent losses, draws, and wins, respectively.
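As a sanity check, the mean payoff can be recomputed from the bucket counts, assuming the 0 / 0.5 / 1 payoff scheme described above:

```python
# Recompute the mean payoff from the bucket counts in the sample report.
# Payoffs: loss = 0.0, draw = 0.5, win = 1.0 (the Tic-Tac-Toe convention above).
counts = {"loss": 28, "draw": 11, "win": 61}
payoffs = {"loss": 0.0, "draw": 0.5, "win": 1.0}

total_games = sum(counts.values())
mean_payoff = sum(counts[k] * payoffs[k] for k in counts) / total_games

print(total_games)   # 100
print(mean_payoff)   # 0.665
```

The weighted sum (28·0 + 11·0.5 + 61·1) / 100 reproduces the 0.665 shown in the report.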

4. Train a minimal AlphaZero

Run a smoke test of AlphaZero training:

run_alpha_zero examples/tic_tac_toe_alpha_zero.yaml --mode train

This will perform a few self-play episodes, collect experiences using the NonDeterministicMCTSPolicy, and update a neural network oracle.
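To build intuition for what "collect experiences" means, here is a conceptual sketch of self-play data collection on Tic-Tac-Toe. It does NOT use the MRL API: a random move choice stands in for NonDeterministicMCTSPolicy, and in real AlphaZero training each recorded position would also carry MCTS visit-count targets for the network.

```python
import random

# Winning triples on a 3x3 board indexed 0..8.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if a line is complete, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

def self_play_episode():
    """Play one game and label each visited position with the final outcome
    from the perspective of the player to move (+1 win, -1 loss, 0 draw)."""
    board, player, history = ["."] * 9, "X", []
    while winner(board) is None and "." in board:
        move = random.choice([i for i, c in enumerate(board) if c == "."])
        history.append((tuple(board), player))
        board[move] = player
        player = "O" if player == "X" else "X"
    w = winner(board)
    return [(state, 0 if w is None else (1 if p == w else -1))
            for state, p in history]

episode = self_play_episode()
print(len(episode))  # between 5 and 9 positions
```

A neural network oracle would then be trained to predict these outcome labels (and the move distributions) from the stored positions.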

5. Play against the trained model

Play against the model you just trained in the terminal:

run_alpha_zero examples/tic_tac_toe_alpha_zero.yaml --mode terminal

6. Optional next steps

Once you have succeeded with the minimal workflow, you can explore:

  • Change games: Try StraightFour, Xiangqi, or RockPaperScissors.

  • Experiment with policies: Use MCTS, deterministic, or stochastic oracle policies.

  • Use GUI: Replace --mode terminal with --mode gui to use the built-in Tkinter GUI.

  • Modify AlphaZero parameters: Edit training YAML files to increase the number of simulations, episodes, or epochs.
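For example, a training config might expose knobs like these. The key names below are illustrative assumptions; check examples/tic_tac_toe_alpha_zero.yaml for the actual schema before editing.

```yaml
# Hypothetical key names -- consult the shipped example YAML files
# for the real schema.
training:
  episodes: 50        # self-play episodes per iteration
  simulations: 100    # MCTS simulations per move
  epochs: 10          # training epochs per iteration
```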

7. Architecture in brief

The library is built around a few core abstractions:

  • Game: Produces states and enforces rules.

  • State: Can be any structure, but must include is_final (and active_player for turn-based games).

  • Perspective: Defines what each player sees and optionally provides get_reward(state).

  • Policy: Chooses actions based on observations and action spaces.

  • Oracle: Evaluates states and provides action probabilities (used by MCTS and AlphaZero).

This separation allows new games, policies, and neural networks to be plugged in easily.

8. Troubleshooting

  • If you cannot run Docker GUI apps on macOS, make sure XQuartz is installed and running.

  • Use run_game -h or run_alpha_zero -h to see all command line options.