Quick Start Guide
This quick start guide will help you play a game and run a minimal AlphaZero training loop. By the end, you should be able to run a game in the terminal, see the output, and start experimenting.
1. Installation
You can run the library using Docker (recommended) or a local Python environment.
Using Docker:
# Build the production image
docker compose build mrl_prod
# Start a container interactively
docker compose run --rm mrl_prod
Local Python environment (optional):
Install dependencies:
python -m venv venv
source venv/bin/activate
pip install torch==2.8.0 pyyaml==6.0.3 h5py==3.15.1 pydantic==2.12.4
Install MRL:
pip install .
Test that the library is available:
run_game -h
2. Run a simple game
We will start with TicTacToe and play against a random policy in the terminal.
run_game examples/tic_tac_toe_manual.yaml --mode terminal
You should see a 3x3 grid and be prompted to make moves. Press the key corresponding to the cell where you want to place your symbol.
3. Evaluate a policy match
Run a game automatically and see statistics for policy performance:
run_game examples/tic_tac_toe_auto.yaml --mode evaluate
This will run multiple simulations and show how the players perform.
You will see a report similar to this one.
Total plays as player O: 100
Mean Payoff: 0.665
Payoff distribution in buckets:
(-inf, 0.25): 28 (28%)
(0.25, 0.75): 11 (11%)
(0.75, inf): 61 (61%)
The report indicates that 100 games were simulated. Player O achieved a mean payoff of 0.665. In Tic-Tac-Toe, the payoff is 0 for losses, 1 for wins, and 0.5 for draws. Accordingly, the three buckets above represent losses, draws, and wins, respectively.
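To make the bucketing concrete, here is a small standalone sketch of how such a report could be computed. The function name and the payoff list are illustrative assumptions, not part of the library; the counts simply mirror the example report above.

```python
# Sketch only: payoff_report and the data below are illustrative,
# not the library's actual reporting code.
def payoff_report(payoffs, edges=(0.25, 0.75)):
    """Mean payoff plus counts per bucket: (-inf, 0.25), [0.25, 0.75), [0.75, inf)."""
    buckets = [0] * (len(edges) + 1)
    for p in payoffs:
        i = sum(p >= e for e in edges)  # index of the bucket p falls into
        buckets[i] += 1
    mean = sum(payoffs) / len(payoffs)
    return mean, buckets

# 28 losses (payoff 0), 11 draws (0.5), 61 wins (1), as in the report above.
payoffs = [0.0] * 28 + [0.5] * 11 + [1.0] * 61
mean, buckets = payoff_report(payoffs)
print(mean, buckets)  # 0.665 [28, 11, 61]
```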
4. Train a minimal AlphaZero
Run a smoke test of AlphaZero training:
run_alpha_zero examples/tic_tac_toe_alpha_zero.yaml --mode train
This will perform a few self-play episodes, collect experiences using the NonDeterministicMCTSPolicy, and update a neural network oracle.
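The shape of that loop, self-play, experience collection, then a network update, can be sketched as below. Everything here is a toy stand-in (the episode length, the dummy network object, the function names); it only illustrates the control flow, not the library's actual training code.

```python
# Conceptual sketch of the self-play training loop; all names are
# illustrative stand-ins, not the library's API.
import random

def self_play_episode(net):
    """Toy episode: record (state, target_policy, outcome) for 3 moves."""
    history = []
    for t in range(3):
        state = ("toy-state", t)
        target_policy = [1.0 / 9] * 9        # uniform over 9 cells
        history.append([state, target_policy, None])
    outcome = random.choice([0.0, 0.5, 1.0])  # final payoff of the game
    for record in history:
        record[2] = outcome                   # back-fill payoff into each step
    return history

def train_step(net, batch):
    """Toy stand-in for a gradient update on (state, policy, value) targets."""
    return len(batch)

net = object()                                # stand-in for the neural oracle
replay_buffer = []
for episode in range(5):                      # a few self-play episodes
    replay_buffer.extend(self_play_episode(net))
trained_on = train_step(net, replay_buffer)
print(trained_on)  # 15 experiences: 5 episodes x 3 moves
```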
5. Play against the trained model
Play against the model you just trained in the terminal:
run_alpha_zero examples/tic_tac_toe_alpha_zero.yaml --mode terminal
6. Optional next steps
Once you have succeeded with the minimal workflow, you can explore:
Change games: Try StraightFour, Xiangqi, or RockPaperScissors.
Experiment with policies: Use MCTS, deterministic, or stochastic oracle policies.
Use GUI: Replace --mode terminal with --mode gui to use the built-in Tkinter GUI.
Modify AlphaZero parameters: Edit training YAML files to increase the number of simulations, episodes, or epochs.
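For the last item, the knobs might look roughly like this in a training YAML. The key names below are assumptions for illustration only; check the shipped example files for the schema the library actually uses.

```yaml
# Hypothetical sketch; key names are assumptions, not the real schema.
alpha_zero:
  simulations: 100   # MCTS simulations per move
  episodes: 50       # self-play games per iteration
  epochs: 10         # training epochs over collected experiences
```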
7. Architecture in brief
The library is built around a few core abstractions:
Game: Produces states and enforces rules.
State: Can be any structure but must include is_final (and active_player for turn-based games).
Perspective: Defines what each player sees and optionally provides get_reward(state).
Policy: Chooses actions based on observations and action spaces.
Oracle: Evaluates states and provides action probabilities (used by MCTS and AlphaZero).
This separation allows new games, policies, and neural networks to be plugged in easily.
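To illustrate how these pieces interact, here is a toy sketch wiring a game, a policy, and a perspective together. The class and method names are assumptions made up for this example; the library's real interfaces may differ, so treat this as a shape, not a reference.

```python
# Illustrative only: names and signatures here are assumptions,
# not the library's actual interfaces.
class CoinFlipGame:
    """Toy game: a single move decides the outcome."""
    def initial_state(self):
        return {"is_final": False, "active_player": 0, "winner": None}
    def apply(self, state, action):
        return {"is_final": True, "active_player": 1, "winner": action}

class FirstActionPolicy:
    """Policy: chooses an action given an observation and an action space."""
    def choose(self, observation, actions):
        return actions[0]

class WinnerPerspective:
    """Perspective: defines one player's reward for a final state."""
    def __init__(self, player):
        self.player = player
    def get_reward(self, state):
        return 1.0 if state["winner"] == self.player else 0.0

game = CoinFlipGame()
policy = FirstActionPolicy()
state = game.initial_state()
while not state["is_final"]:
    action = policy.choose(state, actions=[0, 1])
    state = game.apply(state, action)
reward = WinnerPerspective(0).get_reward(state)
print(reward)  # 1.0
```

Because the policy only sees an observation and an action space, swapping in a different game or a learned policy does not require touching the other components, which is the point of the separation described above.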
8. Troubleshooting
If you cannot run Docker GUI apps on Mac, make sure XQuartz is installed.
Use run_game -h or run_alpha_zero -h to see all command line options.