Architecture Overview

The library is organized around a set of abstractions that separate game logic, decision making, and execution. This allows new games, policies, and learning algorithms to be implemented independently.

Core Concepts

Game

A Game encapsulates the rules of the environment. It is responsible for creating the initial state of the game, identifying the players, and defining how the state evolves when players take actions.

Games may follow two different interaction models.

Turn-based games

In turn-based games only one player acts at a time. These games implement the update method with a single action:

def update(self, state: State, action: Action) -> State

The player allowed to act is identified through the active_player attribute of the state.
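A minimal sketch of a turn-based game following this interface (the game, state fields, and rules here are hypothetical, chosen only to illustrate the single-action update):

```python
from dataclasses import dataclass

# Hypothetical state for a two-player counting game; the real State
# structure is defined entirely by each game implementation.
@dataclass(frozen=True)
class CountState:
    total: int
    active_player: int
    is_final: bool = False

class CountToTenGame:
    """Illustrative turn-based game: players alternately add 1 or 2."""

    def update(self, state: CountState, action: int) -> CountState:
        total = state.total + action
        return CountState(
            total=total,
            active_player=1 - state.active_player,  # pass the turn
            is_final=total >= 10,                   # game ends at 10
        )

state = CountState(total=0, active_player=0)
state = CountToTenGame().update(state, 2)
# state.total == 2, state.active_player == 1, state.is_final is False
```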

Discrete-time games

Some games allow players to act simultaneously. These games implement the update method using a tuple of actions, one for each player:

def update(self, state: State, actions: tuple[Action, ...]) -> State

The order of the actions corresponds to the order returned by Game.get_players().
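A simultaneous-move game can be sketched the same way, with update consuming one action per player in get_players() order (again, the game itself is hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RaceState:
    positions: tuple[int, int]
    is_final: bool = False
    active_player: int = -1  # unused in simultaneous games

class RaceGame:
    """Illustrative discrete-time game: both players move every step."""

    def get_players(self):
        return (0, 1)

    def update(self, state: RaceState, actions: tuple[int, ...]) -> RaceState:
        # actions[i] belongs to the i-th player from get_players()
        positions = tuple(p + a for p, a in zip(state.positions, actions))
        return RaceState(positions=positions, is_final=max(positions) >= 5)
```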

State

A State represents a configuration of the game at a particular moment in time.

The framework does not impose a specific structure on state objects. However, every state must expose two attributes:

is_final

Indicates whether the state represents a terminal configuration of the game.

active_player

Identifies the player whose turn it is in turn-based games.

Apart from these conventions, the internal representation of the state is entirely defined by the game implementation.
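These conventions can be illustrated with a hypothetical tic-tac-toe state: only is_final and active_player follow the framework's contract, while everything else is game-specific.

```python
from dataclasses import dataclass

@dataclass
class TicTacToeState:
    board: tuple[str, ...]  # game-defined internal representation
    active_player: int      # whose turn it is (turn-based games)
    is_final: bool          # terminal configuration flag
```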

Perspective

Players do not interact with the full game state directly. Instead, each player interacts with the game through a Perspective.

A perspective defines:

  • the observation available to the player

  • the action space available in the current state

This abstraction allows games with partial information to be implemented, as different players may observe different aspects of the underlying state.

Policies therefore operate on observations rather than the full state.
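As a sketch, a perspective for a hypothetical card game with hidden hands might look like the following (the method names and dict-based state are illustrative assumptions, not the library's exact API):

```python
class HiddenHandPerspective:
    """Each player sees only its own hand plus public information."""

    def __init__(self, player: int):
        self.player = player

    def get_observation(self, state):
        # Expose the player's own hand and the shared discard pile;
        # the opponent's hand stays hidden, modelling partial information.
        return {
            "hand": state["hands"][self.player],
            "discard_pile": state["discard_pile"],
        }

    def get_action_space(self, state):
        # Legal actions: play any card currently in hand.
        return list(state["hands"][self.player])
```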

A perspective may implement a get_reward(state) method, which is used to:

  • Collect statistics when comparing policies during self-play or evaluation.

  • Provide the value estimates needed for AlphaZero training, where the payoff of a state is computed as the sum of discounted rewards and is recorded alongside the action probabilities generated by the MCTS policy.
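The discounted-payoff computation mentioned above can be sketched as a plain helper (the function name is illustrative):

```python
def discounted_payoff(rewards: list[float], gamma: float = 0.99) -> float:
    """Sum of discounted rewards: r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    payoff = 0.0
    for t, r in enumerate(rewards):
        payoff += (gamma ** t) * r
    return payoff

# e.g. a reward of 1.0 arriving two steps from now, with gamma = 0.9,
# contributes approximately 0.81 to the payoff
```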

Policy

A Policy selects actions based on observations produced by a perspective. Policies can implement a wide range of strategies, including:

  • rule-based heuristics

  • random policies

  • search-based methods such as Monte Carlo Tree Search

  • neural-network-based policies

Policies are interchangeable and can be evaluated against each other using the game runner.
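Two of the simplest strategies from the list above can be sketched as follows (the select_action signature is an assumption about the interface, not the library's exact API):

```python
import random

class RandomPolicy:
    """Baseline policy: pick a legal action uniformly at random."""

    def select_action(self, observation, action_space):
        return random.choice(action_space)

class GreedyCountPolicy:
    """Rule-based heuristic for a counting game: always add the maximum."""

    def select_action(self, observation, action_space):
        return max(action_space)
```

Because both classes expose the same method, either can be handed to the game runner, which is what makes policies interchangeable.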

Oracle

Some policies rely on oracles to evaluate game states. For example, Monte Carlo Tree Search policies may use an oracle that performs rollouts, estimating state values through random simulations.

Oracles are defined independently so that they can be reused across multiple policies.
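A rollout oracle of this kind can be sketched as follows, assuming the turn-based Game and Perspective interfaces described earlier (class and method names are illustrative):

```python
import random

class RandomRolloutOracle:
    """Estimates a state's value by playing random moves to the end."""

    def __init__(self, game, num_rollouts: int = 10):
        self.game = game
        self.num_rollouts = num_rollouts

    def evaluate(self, state, perspective) -> float:
        total = 0.0
        for _ in range(self.num_rollouts):
            s = state
            while not s.is_final:
                # Play uniformly random legal moves until the game ends.
                action = random.choice(perspective.get_action_space(s))
                s = self.game.update(s, action)
            total += perspective.get_reward(s)
        # Average terminal reward across rollouts approximates the value.
        return total / self.num_rollouts
```

Because the oracle only depends on the game and perspective interfaces, the same class can back several different policies.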

MCTS Compatibility

Games that are used with Monte Carlo Tree Search must be restorable. This means that the game must be able to reconstruct a valid state from an observation.

Such games implement:

def restore(self, observation: Observation) -> State

This allows the search algorithm to rebuild states encountered during simulation using only the information available to the policy.

Because of this mechanism, state objects do not need to follow a specific structure as long as they can be reconstructed from observations.
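A restorable game can be sketched as follows for a perfect-information case, where the observation carries enough data to rebuild a compatible state (the game, observation layout, and get_observation helper are hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FullState:
    total: int
    active_player: int
    is_final: bool = False

class RestorableCountGame:
    def get_observation(self, state: FullState) -> dict:
        # In a perfect-information game the observation can mirror the
        # state; imperfect-information games would expose less here.
        return {"total": state.total, "active_player": state.active_player}

    def restore(self, observation: dict) -> FullState:
        # Rebuild a state compatible with what the policy observed.
        return FullState(
            total=observation["total"],
            active_player=observation["active_player"],
            is_final=observation["total"] >= 10,
        )
```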

Furthermore, this AlphaZero implementation supports imperfect-information games. The only requirement is that the reconstructed state must be compatible with the observation.

Execution Flow

A game run proceeds as follows:

  1. The game creates an initial state.

  2. Each player receives an observation through its perspective.

  3. Policies select actions based on their observations.

  4. The game applies the action (or actions) and generates the next state.

  5. The process repeats until the state indicates that the game has reached a terminal condition.
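For a turn-based game, the five steps above can be sketched as a single loop (the helper names and signatures are assumptions consistent with the interfaces described earlier, not the library's exact runner):

```python
def run_game(game, perspectives, policies):
    """Minimal turn-based run loop following the five steps above."""
    state = game.create_initial_state()                  # step 1
    while not state.is_final:                            # step 5
        player = state.active_player
        perspective = perspectives[player]
        observation = perspective.get_observation(state)  # step 2
        action_space = perspective.get_action_space(state)
        action = policies[player].select_action(          # step 3
            observation, action_space)
        state = game.update(state, action)               # step 4
    return state
```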

AlphaZero Integration

The library implements an AlphaZero training pipeline.

Experience Collection

Training an AlphaZero model begins with experience collection. An Experience Collector repeatedly plays the game, generating transitions of the form:

  • the state observed by a player

  • the action probabilities used by a policy

  • the resulting payoff

To generate these experiences, the collector uses the NonDeterministicMCTSPolicy, which itself relies on a neural network oracle to evaluate states and guide the search.

Multiple games are played in sequence, and the resulting data is stored for training the neural network.
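One collected transition can be pictured as a simple record holding the three items listed above (the field names and container are illustrative, not the library's storage format):

```python
from dataclasses import dataclass

@dataclass
class Experience:
    observation: object         # the state as observed by the player
    action_probabilities: dict  # visit-count distribution from MCTS
    payoff: float               # discounted return from this point on

buffer: list[Experience] = []
buffer.append(Experience(
    observation={"board": (0,) * 9},
    action_probabilities={0: 0.7, 1: 0.3},
    payoff=1.0,
))
```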

Neural Network Oracle

The oracle used during AlphaZero training is a neural network with two outputs:

  • Policy head: predicts the probability distribution over actions given the current state.

  • Value head: predicts the expected payoff from the current state.

During experience collection, the network is queried at each decision point to guide the MCTS search.
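The two-headed structure can be sketched as a toy NumPy model (this is not the library's real network; sizes, layers, and names are illustrative): a shared layer feeds a policy head that emits a softmax over actions and a value head that emits a tanh-squashed payoff estimate.

```python
import numpy as np

class TwoHeadedNetwork:
    """Toy shared-trunk network with policy and value heads."""

    def __init__(self, obs_dim: int, num_actions: int,
                 hidden: int = 16, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w_shared = rng.normal(size=(obs_dim, hidden)) * 0.1
        self.w_policy = rng.normal(size=(hidden, num_actions)) * 0.1
        self.w_value = rng.normal(size=(hidden, 1)) * 0.1

    def forward(self, obs: np.ndarray):
        h = np.tanh(obs @ self.w_shared)          # shared trunk
        logits = h @ self.w_policy
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                      # policy head: softmax
        value = float(np.tanh((h @ self.w_value).item()))  # value head
        return probs, value
```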

Training Loop

Once sufficient experiences are collected, the network is trained using these samples:

  1. For each experience, the policy head is trained to match the probabilities returned by the MCTS search.

  2. The value head is trained to predict the observed payoff at the end of the game.

After training, the updated network is used in subsequent iterations of self-play, and the process repeats.
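The two training targets above combine into the usual AlphaZero-style objective: cross-entropy between the predicted distribution and the MCTS probabilities, plus squared error on the payoff. A per-sample sketch (function name illustrative):

```python
import numpy as np

def alphazero_loss(predicted_probs: np.ndarray,
                   mcts_probs: np.ndarray,
                   predicted_value: float,
                   observed_payoff: float) -> float:
    # Policy head target: match the MCTS search probabilities.
    policy_loss = -float(np.sum(mcts_probs * np.log(predicted_probs + 1e-12)))
    # Value head target: predict the payoff observed at game end.
    value_loss = (observed_payoff - predicted_value) ** 2
    return policy_loss + value_loss
```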

Parallelization

The experience collection process can be parallelized across multiple threads or processes. Two storage modes are available:

  • InMemory mode: all collected experiences are stored in main memory for fast access during training.

  • HDF5 mode: each collector stores experiences to HDF5 files. This is useful when training across multiple machines or when the dataset is too large to fit in memory.