Neural Chess

2024-01-21

I've been trying to train a neural network to play chess. A few weeks ago I got nerd sniped, nay, inspired by this video titled "AI plays pokemon". It's a lovely exhibit on reinforcement learning and in particular a showcase of coding something with the goal of playing a game without ever actually telling it to play the game. Emergent gameplay. I love that stuff.

You can find the video here.

For what it's worth; I have a degree in AI. "Cognitive artificial intelligence", to be precise. It leans more into philosophy than applied study, but it did touch on all five cornerstones of AI; philosophy "can it be done?", psychology "how do humans do it?", linguistics "how do you communicate with it?", computer science "how do you code it?", and math "... cause numbers". In the end my knowledge in that area was wide, including knowing how to roll and train my own neural network, but not very deep. It did shape my life philosophy very well and gave me a good rucksack of general knowledge. Afterwards I was a bit frustrated with not having mastered a concrete skill. So I picked JS, dug into the spec, and ended up mastering that completely. I haven't really worked with AI in my career, aside from a random encounter with a finite domain constraint solver. So tldr; I know AI, haven't done much with it.

Chess


As you can read in my previous post, I tumbled into the world of chess recently. And being triggered by the Pokemon visualization I kind of wanted to try and achieve a similar thing. But in what? Well, why not chess.

That turned out to be a bigger challenge than I thought. For one, you need to codify the rules of chess. That's when you learn that some of these rules are quite gnarly. In particular, having to do the "does this move leave you in check" check is more cognitive work than I initially expected. Funny how that works.

The thing I really wanted to make, inspired by that Pokemon video, was this big multi-board visualization of the AI playing chess.

(Click here to load a fancy 2 MB multi-board gif)

Full disclosure; that big image is just playing random (but valid) moves. Ironically, I could have saved myself a couple of hours and, instead of a neural network, just Math.random() brute forced my way to this animation. But what's the fun in that :)

Engine


So while I was codifying the rules for chess I tumbled into that rabbit hole and ended up creating a small engine for it, Yacge (code on github). Not highly efficient, but sufficient to cover my needs.

After finalizing the engine I could proceed to the network training part of it.

Neural network


Some preamble on neural networks. I won't bother to explain how they work in the nitty gritty but suffice it to say "shit is hard".

Ultimately a neural network boils down to a network of layers of interconnected "nodes", meaning each node in one layer is connected to each node in the layer before and after it. Each node has a value called "activation", which is zero-to-one. Each connection has a "weight", another floating point number. For output nodes, the activation would be interpreted as some kind of probability. It is the result of repeatedly multiplying activations by weights and summing them, layer by layer, from the input nodes all the way to the output node. Neural networks generally don't give absolute answers.

With "zero-to-one" I mean a floating point number that is between zero and one, inclusive.

You will have a set of input nodes that describes your state, like a game board, image, or prompt text, or whatever. Your input states ultimately need to be converted to some kind of model of zero-to-one numbers.

The output is a set of "predictions", again nodes of zero-to-one.

When you train a network you give it an input and the expected outcome. Then the network adjusts the weights of connections and activation values of the nodes based on what you told it to expect. The math behind this is fairly straightforward; just a lot of multiplications. A lot. The "training" aspect boils down to changing the values such that the activation of the output nodes is closer to the expectation than it was before. Not an absolute match, just a bit closer.

That's not all. There's a bias, a training speed, the number of hidden layers, the size of hidden layers, learning strategies, and so forth. You can probably lose yourself for a month just trying different tactics and getting a result. All without making any changes to the actual neural network. Oof.

Modeling


Since a neural network is a mere set of floating point numbers, you can't just throw a string at it and wait for a string to come out, or feed it an image and get an integer or word out. No, you have to model your problem in terms of normalized floating point values and interpret the output values. Values between zero and one, inclusive.

For example; images as input can be encoded as ones and zeroes (lots of them). A tic-tac-toe board can be encoded as nine inputs that each represent whether the square contains an x, an o, or nothing at all. A classifier over a handful of terms can either have one output node per possible term, or a single output node whose floating point range is divided up into sections that carry meaning.
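
As a concrete (and hypothetical) example, that tic-tac-toe encoding could be as simple as this:

Code:
// One way to encode a tic-tac-toe board as nine zero-to-one inputs:
// 0 = empty, 0.5 = o, 1 = x. Not part of the chess project, just an illustration.
function encodeTicTacToe(board) { // board: array of 9 cells, each 'x', 'o', or ''
  return board.map(cell => cell === 'x' ? 1 : cell === 'o' ? 0.5 : 0);
}

encodeTicTacToe(['x', '', 'o', '', 'x', '', '', '', 'o']);
// -> [1, 0, 0.5, 0, 1, 0, 0, 0, 0.5]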

There are many possibilities and it should be obvious that the design of your model has a big influence on the learning speed of the network. If a neural network can't figure out patterns because your model is bad then it won't be learning anything, or worse, the wrong thing.

Chess model


I tried a few different models for my chess network.

Since Yacge has an internal representation of bitboards (a 64-bit integer where each bit represents the on/off value of one square of the chess board), displayed below, I figured I could send the status of these bitboards as inputs.

The engine has 8 of these bitboards; one for the white pieces, one for the black pieces, and one for each kind of piece. Together, with bitwise AND operations, you can reconstruct the entire game. And so this is what I fed to the network.

Code:

-- game -- --player w-- --opponent-- --kings----- --queens---- --rooks----- --bishops--- --knights--- --pawns-----
abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh
8 ♖♘♗♛♚♗♘♖ 8 8 00000000 8 8 11111111 8 8 00001000 8 8 00010000 8 8 10000001 8 8 00100100 8 8 01000010 8 8 00000000 8
7 ♙♙♙♙♙♙♙♙ 7 7 00000000 7 7 11111111 7 7 00000000 7 7 00000000 7 7 00000000 7 7 00000000 7 7 00000000 7 7 11111111 7
6 6 6 00000000 6 6 00000000 6 6 00000000 6 6 00000000 6 6 00000000 6 6 00000000 6 6 00000000 6 6 00000000 6
5 5 5 00000000 5 5 00000000 5 5 00000000 5 5 00000000 5 5 00000000 5 5 00000000 5 5 00000000 5 5 00000000 5
4 4 4 00000000 4 4 00000000 4 4 00000000 4 4 00000000 4 4 00000000 4 4 00000000 4 4 00000000 4 4 00000000 4
3 3 3 00000000 3 3 00000000 3 3 00000000 3 3 00000000 3 3 00000000 3 3 00000000 3 3 00000000 3 3 00000000 3
2 ♟♟♟♟♟♟♟♟ 2 2 11111111 2 2 00000000 2 2 00000000 2 2 00000000 2 2 00000000 2 2 00000000 2 2 00000000 2 2 11111111 2
1 ♔♖♕♛♚♕♖♔ 1 1 11111111 1 1 00000000 1 1 00001000 1 1 00010000 1 1 10000001 1 1 00100100 1 1 01000010 1 1 00000000 1
abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh
------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
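
To give an idea of the shape of that input, here's a sketch of flattening the 8 bitboards into the 8*64 zero/one inputs. I'm assuming BigInt bitboards and a particular bit order here; Yacge's actual internals may differ:

Code:
// Sketch: flatten 8 bitboards into 8*64 = 512 inputs of 0 or 1.
// Assumes each bitboard is a 64-bit BigInt where bit i marks square i as occupied.
function bitboardsToInputs(bitboards) { // array of 8 BigInts
  const inputs = new Array(8 * 64);
  bitboards.forEach((bb, b) => {
    for (let sq = 0; sq < 64; ++sq) {
      inputs[b * 64 + sq] = Number((bb >> BigInt(sq)) & 1n);
    }
  });
  return inputs;
}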

Another model I tried was to convert the board to 64 numbers and divide the floating point space up into 13 sections, each range representing a particular colored piece or an empty square.
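
A sketch of that second model; the exact bucket ordering here is my illustration, not necessarily the one I ended up using:

Code:
// Sketch: one float per square, with the zero-to-one range split into
// 13 buckets (empty + 6 white piece types + 6 black piece types).
const PIECE_CODES = ['', 'P', 'N', 'B', 'R', 'Q', 'K', 'p', 'n', 'b', 'r', 'q', 'k'];

function encodeSquare(piece) { // piece: '' or one of the codes above
  return PIECE_CODES.indexOf(piece) / (PIECE_CODES.length - 1); // 0 .. 1
}

function encodeBoard(squares) { // 64 piece codes, a1..h8
  return squares.map(encodeSquare);
}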

As output I expected the network to give me a "from" and "to" coordinate. I quickly realized that wouldn't work. So instead I created two sets of 64 numbers, where the first set represents the index of the "from" square and the second the "to" square. Given those two sets I would pick the indices with the highest activation and ask the engine whether that from-to pair was a legit move in the current state. If not, repeat with the next best pair until there are none left. If you can find a valid move this way then make it, otherwise consider the game over.

Consider this table of predictions for the opening move for White. At this point the activations for the White pawns and knights are significantly higher than the other squares. But the "to" squares are still a bit fuzzy, for example a2 is never a valid opening target.

Code:
-- p from -------------------------------------------------------------------------- -- p to ----------------------------------------------------------------------------
a b c d e f g h a b c d e f g h
8 0.000717 0.000720 0.000606 0.000640 0.000646 0.000686 0.000778 0.000822 8 8 0.000617 0.000584 0.000477 0.000654 0.000707 0.000597 0.000623 0.000688 8
7 0.000612 0.000639 0.000566 0.000448 0.000800 0.000533 0.000553 0.000991 7 7 0.000623 0.000878 0.000858 0.000980 0.000844 0.001139 0.000813 0.001283 7
6 0.001180 0.000864 0.000833 0.001359 0.000885 0.000957 0.002028 0.001113 6 6 0.000741 0.001303 0.001018 0.001427 0.003672 0.003032 0.003038 0.004788 6
5 0.002021 0.004815 0.002025 0.003439 0.004112 0.003263 0.006398 0.002641 5 5 0.001858 0.004551 0.008008 0.004045 0.004396 0.005853 0.004450 0.008081 5
4 0.007167 0.006546 0.007606 0.007078 0.013818 0.012468 0.009795 0.010014 4 4 0.006896 0.023927 0.022808 0.029172 0.024975 0.052587 0.016379 0.036477 4
3 0.017303 0.009698 0.014770 0.010137 0.008952 0.009595 0.008527 0.013713 3 3 0.032853 0.062769 0.034909 0.051695 0.036063 0.052277 0.051271 0.033579 3
2 0.088726 0.094049 0.072685 0.088926 0.233186 0.071909 0.101385 0.090959 2 2 0.071619 0.008169 0.008953 0.008444 0.022026 0.039267 0.007543 0.010122 2
1 0.027890 0.077474 0.030337 0.039759 0.043032 0.042445 0.117007 0.023538 1 1 0.009813 0.005936 0.014396 0.008808 0.012505 0.014423 0.013239 0.018353 1
a b c d e f g h a b c d e f g h
------------------------------------------------------------------------------------ ------------------------------------------------------------------------------------

Another output model was to create 64x64 outputs; one output for each combination of "from" and "to" square. The rest is pretty similar insofar as the pair with the highest activation value would be attempted first, and so on until a legal move was found. I would hope that this makes it easier for a neural network to bootstrap the problem since all the outputs have the same semantics, rather than "half of them mean x and half of them mean y". But it obviously takes much longer to train a network to get to a reliable set of outputs. And that's still hampered by the training issues outlined below. Not to mention the visualization.
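
Decoding that 64x64 model is mechanical, since the flat output index encodes both squares. Roughly (a sketch, assuming the from-square is the most significant part of the index):

Code:
// Sketch: rank all 64*64 = 4096 outputs as from/to square pairs (0..63 each),
// highest activation first, so they can be tried against the engine in order.
function rankedFromToPairs(outputs) { // outputs: 4096 floats
  return outputs
    .map((activation, i) => ({ activation, from: Math.floor(i / 64), to: i % 64 }))
    .sort((a, b) => b.activation - a.activation);
}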

Training


In general a network is trained by giving it examples. The more examples the better the training. Beyond the model itself you have overfitting, underfitting, training bias, and a bunch of other things to worry about.

You can train on an existing set of inputs and outputs, like how you can train a network on labeled images and tell it that when you feed the cat.png image into it, you expect the outcome to be the "cat" label. Given enough input samples this should train your network to classify images.

You can also train your network by giving it some inputs, evaluating the prediction, and determining a reward for that prediction. Did the predictions lead to a win, a loss, an advantage, a desired pattern? In chess, you could give rewards for making progress (captures, pawn moves), for taking opponent pieces, for winning the game.

Training chess


At first I thought I would download a chess database and train a model using a bunch of legit games.

I found this massive database (Caissabase) with about 4M chess games at master level. So I wanted to use that as a starting point. This was before I learned how PGN and chess notation actually work.

Spoiler: you need a chess engine because the "standard algebraic notation" used in chess does not tell you where the piece moved from (for example, "Nf3" doesn't say which knight moved), so you have to disambiguate among all valid moves based on the current state of the game, for which you need an entire game engine. You could pre-process the moves to get these coordinates, for which you still need an engine, but then only once and not at training time.
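
Conceptually that pre-processing step looks something like this; parseGame, createEngine, sanToFromTo, and makeMove are hypothetical stand-ins for whatever PGN parser and engine you end up using:

Code:
// Sketch: resolve SAN moves into from/to coordinates once, up front,
// so the training loop never has to disambiguate notation on the fly.
// parseGame(), createEngine(), sanToFromTo() and makeMove() are hypothetical helpers.
function preprocessPgn(pgn) {
  const game = parseGame(pgn);            // starting position + SAN move list
  const engine = createEngine(game.fen);  // you need a real engine to track the state
  return game.sanMoves.map(san => {
    const { from, to } = engine.sanToFromTo(san); // e.g. "Nf3" -> { from: 'g1', to: 'f3' }
    engine.makeMove(from, to);            // keep the state in sync for the next move
    return { from, to };
  });
}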

This led me to sidestep into Yacge as a necessary cornerstone of this project.

PGN


Actually, my first thought was to just feed these pgn strings to a network and create an LLM-esque sort of network. But I wasn't convinced I was going to get that anywhere within a reasonable time.

Then I realized that the thing I found most beautiful about this project was a self-learning, self-playing agent with minimal external input. So I went with that and didn't bother with the database. Maybe I should revisit that idea.

Self-playing


The approach to learning this way is called Reinforcement Learning (RL). If you want to learn more about that I'd suggest you read this article instead.

Tldr; given a neural network, you give it an input state like the state of your game. Then based on that it gives you a prediction, the move to make. You apply this move and apply a heuristic that tells you how much you liked that move. If it won you the game you like it a lot but if it lost you the game you hate the move. You then teach the network by telling it how much you liked the move it predicted.

This update won't immediately force the outputs to match your expectations. That would erase the outputs for other inputs. Instead the model is updated ever so slightly such that the prediction becomes closer to what you were expecting, hopefully without destroying the existing other input-output mappings.
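
In code, one such update could look roughly like this. This is a minimal sketch using TensorFlow.js (which I get to below); the toy model, the updateOnce helper, and the sizes are just for illustration, and I'm assuming tf is the imported module:

Code:
// Sketch of one reinforcement-style update: nudge the output for the action
// we took towards the reward we assigned, and leave the rest as predicted.
const model = tf.sequential();
model.add(tf.layers.dense({ units: 32, inputShape: [8], activation: 'sigmoid' }));
model.add(tf.layers.dense({ units: 4, activation: 'sigmoid' }));
model.compile({ loss: 'meanSquaredError', optimizer: tf.train.adam(0.001) });

async function updateOnce(state, actionIndex, reward) { // state: 8 floats, reward: 0..1
  const input = tf.tensor2d([state]);
  const prediction = model.predict(input);
  const target = await prediction.array();  // start from the current prediction...
  target[0][actionIndex] = reward;          // ...and only move the chosen action toward the reward
  const targetTensor = tf.tensor2d(target);
  await model.fit(input, targetTensor, { epochs: 1, verbose: 0 }); // a small nudge, not a hard overwrite
  input.dispose(); prediction.dispose(); targetTensor.dispose();
}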

Chess


At first I tried to create a pure self-learning network. I would take the moves with the highest activation, throw them against the engine, and see what sticks. If a valid move was found I would train the network with some reward. Otherwise I'd try to teach the network not to predict that move by giving it the worst reward.

While that worked to some degree, in the end it kept predicting the same moves, continuously ending in a threefold repetition draw. It kind of makes sense; a minimal change in the inputs is prone to lead to a very similar output. So if two pieces go back and forth then sooner or later they'll trigger the threefold repetition rule. And overall, it won't end up generating a good game.

I tried to implement logic that would prevent a move if it would trigger a threefold repetition but that didn't help much. I went one step further and disallowed any move that would end in a previously seen game state, and now I was left with extremely long games that triggered the 50-move draw limit because kings kept being moved around with no real goal. Greeaat.

In the end I wasn't really able to solve this reliably. The game would always want to repeat itself, or drag on endlessly. This was a problem because it would not hit the desired end states frequently enough to collect the rewards that would make it want more of them. And if a network never learns to want the good stuff then it'll never favor it. Interesting paradox.

Other things I tried include adjusting the network size, number of nodes, number of hidden layers, learning speed, tweaking the model, tweaking the reward function, playing with activation functions, etc. Sadly, I wasn't able to get a meaningful result.

Now don't get me wrong, sometimes it does get a win. But these are more of a fluke than consistent outcomes.

TensorFlow.js implementation


For my experiment I used TensorFlow.js in Node.js with a simple web interface and some CLI debugging output.

I used chatGPT to help me set up some of the TensorFlow code. It was surprisingly helpful in that. Although when a certain prompt didn't give me what I wanted, it was hard to get a good answer out of it anyway. All in all, most of the time it was more useful than google searches for similar questions. That's refreshing :)

Setup


The installation is straightforward:

Code:
npm install @tensorflow/tfjs-node-gpu

CUDA doesn't need to work on your machine, you can just ignore the warnings. It didn't work for me initially. To get it to work I had to install a few things:

Code:
sudo apt-get install nvidia-cuda-toolkit nvidia-cudnn

After this, nvidia-smi and nvcc would work. But on my machine, CUDA would still not be recognized by Tensorflow.js and I had to do the following to fix that:

Code:
for a in /sys/bus/pci/devices/*; do echo 0 | sudo tee -a $a/numa_node; done

This is necessary after each boot. With that, CUDA would work for me and during training I could see the GPU being utilized a bit through nvidia-smi. It wasn't massive for me, dropping an iteration from 150ms to 110ms. Utilization was about 5-15%. But that's probably because my model and data set are tiny?

Code:
$ node train.js
...
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5236 MB memory: -> device: 0, name: NVIDIA GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5
...

$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| 0% 51C P2 44W / 175W | 7110MiB / 8192MiB | 8% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
...

Anyways, with all of that said and done, I could import TensorFlow like this:

Code:
import * as tfd from '@tensorflow/tfjs-node-gpu';
const tf = tfd.default;

It's common to namespace the import into a tf variable, but somehow the star import wouldn't let me do that directly, so that's why I'm reassigning it. Didn't care enough to investigate further.

Creating a model


This code creates the model. If a model was stored before, it loads that one instead. In my limited experience here, Tensorflow restores my model accurately, which is great.

Code:
// Assumes the tf import from above plus: import fs from 'fs';
async function createModel() {
  let model;
  let history = [];
  const existingModel = fs.existsSync('./model.json') && fs.readFileSync('./model.json', 'utf8');

  if (existingModel) {
    console.log('Using existing model');
    const state = JSON.parse(existingModel);
    const weightData = new Uint8Array(Buffer.from(state.weightData, "base64")).buffer;
    model = await tf.loadLayersModel(tf.io.fromMemory(state.modelTopology, state.weightSpecs, weightData));
    history = JSON.parse(fs.readFileSync('./history.json', 'utf8'));
  } else {
    console.log('Compiling fresh model');
    model = tf.sequential();
    model.add(tf.layers.dense({ units: 50, inputShape: [8*64], activation: 'sigmoid', kernelInitializer: tf.initializers.glorotNormal() }));
    //model.add(tf.layers.dense({ units: 50 }));
    model.add(tf.layers.dense({ units: OUTPUT_COUNT, activation: 'sigmoid' }));
  }
  const optimizer = tf.train.adam(LEARNING_RATE);
  model.compile({loss: 'meanSquaredError', optimizer, metrics: ['accuracy']});
  model.summary();

  return {model: {model, optimizer}, history};
}

Some notes on that function:

The activation "sigmoid" means the values are individual normalized floats, valued zero to one. You will also commonly see "softmax" in some examples, this means the sum of all predicted values sums to 1. There are more values like that but you'll have to search for them yourself, or ask chatGPT.

There are different loss functions and optimizer functions which determine how training is applied. I used "adam" and "meanSquaredError". I feel like I would have come further by experimenting more with these. But it also feels like I could lose myself for a month just by doing so.
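
Swapping those choices is a one-liner, which is exactly why the experimentation space explodes. For example (standard tfjs options, but I haven't validated these particular choices for the chess setup):

Code:
// Same model, different knobs.
model.compile({
  loss: 'binaryCrossentropy',             // instead of 'meanSquaredError'
  optimizer: tf.train.sgd(LEARNING_RATE), // plain SGD instead of adam
  metrics: ['accuracy'],
});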

I restore the history of moves etc as well but obviously that's not required to make the network load and function.

Episodes


Each game we play is called an "episode".

On a high level an episode does the following;

⚬ Create a fresh chess game
⚬ Repeat until the game ends: feed the current game state into the network and pick the best legal move from the resulting predictions. Make and remember that move, then repeat. Stop when the game ends or when there are no valid moves left.
⚬ After the game ends, look at the moves made. Each move should have some meta data available, whatever you need for the reward function. For each move determine the reward. This could be based on the outcome of the game, an evaluation of the position after the move, material advantage after the move, or something as simple as wanting to move the game forward (rewarding a pawn move).
⚬ Then for each move made, add an entry to the inputs with the board state before that move, and add an entry to the outputs where the targeted "from" and "to" indices are somehow updated with the reward value for that move.
⚬ Train the network on the resulting set.

First create a new game. Using Yacge that looks like this:

Code:
let G = parseFen(FEN_NEW_GAME);

Now repeat the logic until the game ends. The game either ends in a draw or because there are no more moves. If there are no more valid moves left the game ended either in a stalemate (also a draw) or a checkmate. The no-more-moves case is unfortunate because it still means testing 64*64=4k move pairs before being able to assert that conclusion. But it's either that or over-testing for a check- or stalemate after every move. I opted for this more implied outcome (most of those 4k tests are cheap since you have at most 16 pieces rather than 64 and the "is this square your piece" test is cheap, but still).

In my code I convert the game state to an input model and then send that to tensorflow to get a result back:

Code:
const normalizedBoardState = toNormalizedBoardStateInput(G);
const rawPredictions = tf.tidy(() => {
  const tensor = tf.tensor([normalizedBoardState], [1, 64*8], 'int32');
  const p = model.predict(tensor);
  return p.arraySync()[0]; // Only has one
});

The .tidy() part is because these "tensors" (an atomic data model) are retained in memory until told otherwise. The tidy callback will destroy any tensors created inside it at the end of the callback. Otherwise you'd have to call .dispose() on them yourself, which is also fine.
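
For comparison, the manual version of that cleanup would look roughly like this:

Code:
// Same prediction as above, but without tf.tidy(): dispose the tensors yourself.
const tensor = tf.tensor([normalizedBoardState], [1, 64*8], 'int32');
const p = model.predict(tensor);
const rawPredictions = p.arraySync()[0];
tensor.dispose();
p.dispose();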

So now you have a "prediction"; in my case a set of 128 floats. In my code I interpret this as two sets of 64 floats, where each float maps to a square on the board. The first set represents the square to move from, the second set represents the square to move to. Together they should provide me with a valid move. This looks somewhat like this:

Code:
--p from, best = b1,a1,c3,e1,h1,d1,b2,e2,g1,c1-------------------------------------- --p to, best = c1,b1,d1,g3,e3,f3,h3,f2,a2,d3----------------------------------------
a b c d e f g h a b c d e f g h
8 0.000466 0.000474 0.000458 0.000480 0.000474 0.000484 0.000449 0.000467 8 8 0.000458 0.000457 0.000469 0.000473 0.000457 0.000484 0.000454 0.000449 8
7 0.000480 0.000464 0.000461 0.000465 0.000472 0.000450 0.000477 0.000465 7 7 0.000474 0.000476 0.000460 0.000462 0.000494 0.000505 0.000479 0.000462 7
6 0.000462 0.000464 0.000465 0.000466 0.000455 0.000462 0.000481 0.000464 6 6 0.000453 0.000458 0.000464 0.000452 0.000486 0.000456 0.000446 0.000462 6
5 0.000457 0.000535 0.000486 0.000945 0.000471 0.000453 0.000475 0.000478 5 5 0.000477 0.000492 0.000505 0.000514 0.000529 0.000458 0.000458 0.000439 5
4 0.001363 0.000537 0.000596 0.000444 0.000469 0.000468 0.000469 0.000484 4 4 0.000490 0.000485 0.000494 0.000505 0.000484 0.000498 0.000480 0.000478 4
3 0.000900 0.001112 0.011174 0.001514 0.001808 0.002027 0.001140 0.001363 3 3 0.000466 0.000521 0.000530 0.000550 0.000800 0.000768 0.000928 0.000606 3
2 0.008865 0.009094 0.008909 0.008958 0.009090 0.008887 0.008944 0.008994 2 2 0.000553 0.000448 0.000468 0.000470 0.000513 0.000557 0.000454 0.000475 2
1 0.018268 0.019505 0.009009 0.009096 0.009101 0.008827 0.009093 0.009096 1 1 0.000481 0.016113 0.018434 0.000995 0.000470 0.000459 0.000531 0.000453 1
a b c d e f g h a b c d e f g h
------------------------------------------------------------------------------------ ------------------------------------------------------------------------------------

In this example, the agent would start by trying to move from b1 to c1, b1 to b1 (:facepalm:), b1 to d1, b1 to g3, etc, until finding that b1 to a3 or c3 is a valid (knight) move.

I think, ultimately, a model with 4k outputs (one for each combination of from-to squares) would be less confusing to the network. But it's also much more time consuming to train and my attempt didn't seem to lead anywhere. And it's a bit hard to visualize in a CLI :)

Having these two sets, you map each value to its square index, sort them by activation, and then try to evaluate moves in descending activation order.

For example:
Code:
[0.999638, 0.983368, 0.999746, 0.999099, ... 0.993414, 0.989944, 0.954925, 0.943932, ...]
// Breaks down into
[0.999638, 0.983368, 0.999746, 0.999099, ...]
[0.993414, 0.989944, 0.954925, 0.943932, ...]

And then we map each value to an [activation, index] pair, and .sort(([ a ], [ b ]) => b - a) them.

These arrays are walked in order, the idea being that the highest activations in the first and second array should yield the most likely correct move. Where it falls apart, I think, is in assuming the network can even learn this association this way. Although training has shown some positive results here. Then again, those may just as well have been coincidence and confirmation bias on my part.

For each generated pair, I would first test whether the "from" square contains a piece of the current player. Likewise, I would test whether the "to" square does not contain a piece of the current player (because you can never move your piece to a square containing your own piece). Passing those two simple tests, we apply the much more expensive canMove check, which confirms that the move is valid for the piece, valid in the context of the current game, and also asserts that the king is not left in check after making the move, since that would make the move illegal as well.

Once it passes all those tests, the move is accepted, applied, and stored for training later.

If a certain "from" square had no valid moves at all then the iteration simply goes to the next "from" square. If none of the "from" cells have a valid move then we have exhausted the possible from-to pairs and the game must have ended with the player being unable to make a move without leaving its king in check. This means a checkmate if it starts checked or a stalemate if it was not checked at the start of the turn.

Alternatively, there are some other reasons why the game may end, but those only apply after making a move. For example, threefold repetition, the fifty-move limit, and insufficient material all lead to draws.

The end result of an "episode" is a list of moves and some meta data for those moves. This meta data includes piece and capture information, board state information, and also network efficiency details.

Rewards heuristic


Once a game has been generated we have to determine the rewards based on the outcome of this game.

There are so many ways to generate rewards that it's actually annoying. For example:

⚬ win / lose
⚬ capturing/losing material
⚬ pawn promotion
⚬ game progression (preventing 50-move draw)
⚬ network efficiency
⚬ game duration (move count)
⚬ invalid moves

And what are the actual values for this reward? Ultimately the goal is to reward the network for things you want to see more of. If you want the game to progress, you reward pawn moves (they can't be reversed). However, certain rewards can introduce a massive bias. For example, a pawn move is only likely at the start and then less likely as the game progresses. Rewarding pawn moves means both players start with a pawn storm like there's no tomorrow. Is that going to teach a network how to play chess?

In the end I wasn't able to reliably come up with a reward heuristic that adequately taught the network to start with a pawn or knight move, ignoring other pieces. And this kind of makes sense; how would a network ever learn a good move when it only flukes into one every once in a while, making a majority of bad moves otherwise? The good learning step is a drop in the bad learning step ocean. So either your bad move reward has to drop significantly to make the good moves more effective, or you have to train the good moves more frequently to "ingrain" the good move reward more thoroughly. Neither approach seemed to work for me.

Of course rewards aren't just a matter of coming up with the right value. It's also about the correct loss function, the correct amount of training on the result, applying a fixed/linear/gradual reward (later moves are more likely to contribute to the game's outcome than opening moves), learning speed, neural network composition, etc etc. So many different parameters to twist and turn. And of course, for a pattern as complex as chess, it may take hours or even days before the desired patterns emerge. How do you know you're on the right path when your feedback loop is that long?

Well, it's frustrating. And I don't have the time for it, unfortunately.

This is roughly what my heuristic code looks like;

Code:
// history is the list of moves made during the episode; each step carries the
// meta data collected while playing (piece, capture, from/to index, winner, etc).
const rewards = history.map((step, i) => {
  if (i % 2 === 1) return; // Train only on white moves?
  let r = 0;
  //if (mated) r += 1; // Prefer a proper ending to a 50-move / threefold draw
  //if (piece === 'P') r += 1; // Promote pawn moves as they prevent certain draws
  if (capture) r += {Q:8, R:5, N:3, B:3, P:1}[capture] || 0; // Promote captures
  //if (promo) r += 1; // Promote pawn promotions
  if (winner === 'white') r = 100; // Reward wins
  else if (winner !== 'black' && pointsWhite > pointsBlack) r = 100; // Consider a material advantage a win too, in case of a draw

  // For each prediction of this particular move, map the square index to its reward
  const moveReward = predictions.map((p, sq) => {
    if (isFromOrTo(sq, fromi, toi)) return r; // Apply reward to picked square pair
    if (sq <= 63 && ownCells[sq]) return 1; // Prevent own pieces from going to zero (even if they have no moves)
    return 0; // This could also return p, leaving the prediction alone
  });

  // Generate one set of input/output to use with training. The input and output sets must be of equal size.
  inputBoards.push(boardStateForMove);
  outputSets.push(moveReward);
});

So we magically come up with some kind of reward. I tried a few things here but ultimately I'm not convinced I did this right. I tried giving a range of values like above, but I also tried giving normalized values (between zero and one). I didn't see a lot of difference between either approach.

Here's an example of what the reward could look like for one opening knight move:

Code:
-- new from: N b1 ---------------------------------- -- new to: N c3 ------------------------------------
a b c d e f g h a b c d e f g h
8 0 0 0 0 0 0 0 0 8 8 0 0 0 0 0 0 0 0 8
7 0 0 0 0 0 0 0 0 7 7 0 0 0 0 0 0 0 0 7
6 0 0 0 0 0 0 0 0 6 6 0 0 0 0 0 0 0 0 6
5 0 0 0 0 0 0 0 0 5 5 0 0 0 0 0 0 0 0 5
4 0 0 0 0 0 0 0 0 4 4 0 0 0 0 0 0 0 0 4
3 0 0 0 0 0 0 0 0 3 3 0 0 1 0 0 0 0 0 3
2 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 2 2 0 0 0 0 0 0 0 0 2
1 0.01 1 0.01 0.01 0.01 0.01 0.01 0.01 1 1 0 0 0 0 0 0 0 0 1
a b c d e f g h a b c d e f g h
---------------------------------------------------- ----------------------------------------------------

Where own "from" squares move closer to 0.01, the valid "from" square goes to 1, the valid "to" square goes to 1, and the other squares go to 0.

Applying the reward


The last step of the episode is to apply the reward to the network. This is the teaching moment, yay.

Code:
// Convert data to tensor objects
const inputBoardTensor = tf.tensor2d(inputBoards);
const expectedOutput = tf.tensor2d(outputSets);
// Shuffle should improve training quality.
// Batch size doesn't matter for our problem.
// Epochs is an arbitrary number of times the samples are trained.
await model.model.fit(inputBoardTensor, expectedOutput, {epochs: 20, verbose: 0, batchSize: 20000, shuffle: true});

You can also teach it one-by-one but it's just so much slower. There's probably a lot of overhead to be paid for preparing the training data. So batching the moves together and running them multiple times is just cheaper than doing it move-by-move.

CLI output


Here's what the CLI shows after each episode. It'll print the game state, prediction, and rewards of the Nth move, in this case the opening move.

It also shows which cells have been asserted to be invalid, like when there's no piece on them or when the engine rejected the move. The final line shows a summary that includes how the game ended, the material points for each side, the time taken to run the engine and the time taken to train, and how often the prediction was wrong before getting a valid move. This was a random sample while training with a pawn storm reward.

Code:
------- move 1 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--game w u-- --player w-- --opponent-- --kings----- --queens---- --rooks----- --bishops--- --knights--- --pawns-----
abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh
8 ♖♘♗♛♚♗♘♖ 8 8 00000000 8 8 11111111 8 8 00001000 8 8 00010000 8 8 10000001 8 8 00100100 8 8 01000010 8 8 00000000 8
7 ♙♙♙♙♙♙♙♙ 7 7 00000000 7 7 11111111 7 7 00000000 7 7 00000000 7 7 00000000 7 7 00000000 7 7 00000000 7 7 11111111 7
6 6 6 00000000 6 6 00000000 6 6 00000000 6 6 00000000 6 6 00000000 6 6 00000000 6 6 00000000 6 6 00000000 6
5 5 5 00000000 5 5 00000000 5 5 00000000 5 5 00000000 5 5 00000000 5 5 00000000 5 5 00000000 5 5 00000000 5
4 4 4 00000000 4 4 00000000 4 4 00000000 4 4 00000000 4 4 00000000 4 4 00000000 4 4 00000000 4 4 00000000 4
3 3 3 00000000 3 3 00000000 3 3 00000000 3 3 00000000 3 3 00000000 3 3 00000000 3 3 00000000 3 3 00000000 3
2 ♟♟♟♟♟♟♟♟ 2 2 11111111 2 2 00000000 2 2 00000000 2 2 00000000 2 2 00000000 2 2 00000000 2 2 00000000 2 2 11111111 2
1 ♔♖♕♛♚♕♖♔ 1 1 11111111 1 1 00000000 1 1 00001000 1 1 00010000 1 1 10000001 1 1 00100100 1 1 01000010 1 1 00000000 1
abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh
------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
--p from, best = a2,d2,b2,h3,c2,e2,d3,f1,b3,g2--------------------------------------
a b c d e f g h
8 _.__2195 _.__1443 _.__1997 _.__1731 _.__1862 _.__1614 _.__1113 _.__1429 8
7 _.__1381 _.__1736 _.__2202 _.__1505 _.__1523 _.__2002 _.__1815 _.__1794 7
6 _.__1631 _.__2232 _.__1279 _.__1614 _.__1673 _.__1571 _.__1827 _.__1952 6
5 _.__1369 _.__1592 _.__1795 _.__1014 _.__1303 _.__1655 _.__2127 _.__1591 5
4 _._11357 _.__6727 _.__1848 _._14977 _.__2664 _.__2028 _.__1814 _.__8190 4
3 _._20234 _._23869 _.__8546 _._29754 _._13037 _.__8765 _.__6992 _._37838 3
2 _.161353 _._59909 _._35909 _._76755 _._32026 _._14384 _._23037 _._21860 2
1 _._10155 _._16474 _._16182 _._18081 _._22670 _._24611 _._12363 _._22117 1
a b c d e f g h
------------------------------------------------------------------------------------
--p to, best = b3,c3,e3,a2,d3,f3,b4,h3,e4,b2----------------------------------------
a b c d e f g h
8 _.__2027 _.__1786 _.__1499 _.__2023 _.__1303 _.___998 _.__1396 _.__2029 8
7 _.__1200 _.__2022 _.__1542 _.__1290 _.__1431 _.__1646 _.__1465 _.__1346 7
6 _.__1351 _.__1454 _.__1818 _.__1176 _.__1810 _.__1528 _.__2420 _.__1155 6
5 _.__1970 _.__1765 _.__2400 _.__1571 _.__3485 _.__1760 _.__1707 _.__1645 5
4 _.__1411 _._19643 _._11618 _.__2896 _._18135 _.__5058 _.__2284 _.__2911 4
3 _._16914 _._80085 _._56358 _._30357 _._51646 _._30339 _._12861 _._19281 3
2 _._33478 _._17652 _.__8320 _.__8741 _.__8156 _.__9200 _.__1591 _.__1473 2
1 _.__3255 _.__2344 _.__1724 _.__2002 _.__1199 _.__1449 _.__1345 _.__1991 1
a b c d e f g h
------------------------------------------------------------------------------------
--game w u-- --player w-- --opponent-- --kings----- --queens---- --rooks----- --bishops--- --knights--- --pawns----- --action-f-- -- action-t--
abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh
8 ♖♘♗♛♚♗♘♖ 8 8 00000000 8 8 11111111 8 8 00001000 8 8 00010000 8 8 10000001 8 8 00100100 8 8 01000010 8 8 00000000 8 8 DDDDDDDD 8 8 8
7 ♙♙♙♙♙♙♙♙ 7 7 00000000 7 7 11111111 7 7 00000000 7 7 00000000 7 7 00000000 7 7 00000000 7 7 00000000 7 7 11111111 7 7 DDDDDDDD 7 7 7
6 6 6 00000000 6 6 00000000 6 6 00000000 6 6 00000000 6 6 00000000 6 6 00000000 6 6 00000000 6 6 00000000 6 6 DDDDDDDD 6 6 6
5 5 5 00000000 5 5 00000000 5 5 00000000 5 5 00000000 5 5 00000000 5 5 00000000 5 5 00000000 5 5 00000000 5 5 DDDDDDDD 5 5 5
4 4 4 00000000 4 4 00000000 4 4 00000000 4 4 00000000 4 4 00000000 4 4 00000000 4 4 00000000 4 4 00000000 4 4 DDDDDDDD 4 4 c c 4
3 ♟ 3 3 10000000 3 3 00000000 3 3 00000000 3 3 00000000 3 3 00000000 3 3 00000000 3 3 00000000 3 3 10000000 3 3 DDDDDDDD 3 3 !ccccc c 3
2 ♟♟♟♟♟♟♟ 2 2 01111111 2 2 00000000 2 2 00000000 2 2 00000000 2 2 00000000 2 2 00000000 2 2 00000000 2 2 01111111 2 2 T 2 2 ssSSSSSS 2
1 ♔♖♕♛♚♕♖♔ 1 1 11111111 1 1 00000000 1 1 00001000 1 1 00010000 1 1 10000001 1 1 00100100 1 1 01000010 1 1 00000000 1 1 T 1 1 SSSSSSSS 1
abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh abcdefgh
------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
--m1, new from, actual: P a2, trained: a2, better: neither -------------------------
a b c d e f g h
8 0 0 0 0 0 0 0 0 8
7 0 0 0 0 0 0 0 0 7
6 0 0 0 0 0 0 0 0 6
5 0 0 0 0 0 0 0 0 5
4 0 0 0 0 0 0 0 0 4
3 0 0 0 0 0 0 0 0 3
2 _.1 _._1 _._1 _._1 _._1 _._1 _._1 _._1 2
1 _._1 _._1 _._1 _._1 _._1 _._1 _._1 _._1 1
a b c d e f g h
------------------------------------------------------------------------------------
--m1, new to, actual: P a3, trained: P a3, -----------------------------------------
a b c d e f g h
8 0 0 0 0 0 0 0 0 8
7 0 0 0 0 0 0 0 0 7
6 0 0 0 0 0 0 0 0 6
5 0 0 0 0 0 0 0 0 5
4 0 0 0 0 0 0 0 0 4
3 0 _.1 0 0 0 0 0 0 3
2 0 0 0 0 0 0 0 0 2
1 0 0 0 0 0 0 0 0 1
a b c d e f g h
------------------------------------------------------------------------------------
Games: 5430 Moves: 50 Reason: 100x 0 0 . 59ms 77ms hundred, 546 canMove checks (10.92/move): 9/(1/11) 8/(1/9) 6/(2/6) 6/(2/6) 13/(2/19) 1/(3/1) 3/(4/3) 2/(3/3) 3/(6/4) 13/(4/19) ...

It looks more fluid when the numbers update after each episode but that's gonna be a bit hard to do here :) And expensive as a gif/movie, so this'll have to do.

Gifs


The gifs in this blog were simply created through gif.js (years old, still works fine in the browser out of the box) and optimized through ezgif.com which cut them in half or more.

Here is the code, given an array of games represented as string snapshots of the board. Just drop it into the HTML:

Code:
<script src="lib/gif.js"></script>
<script>
// boardStrings: one array of board snapshot strings per game
function generateMovieGif(boardStrings) {
  const EW = 10;   // boards per row
  const EH = 5;    // rows of boards
  const PAD = 0;

  const CEW = 80;  // pixel width per board
  const CW = EW * CEW + EW * PAD;
  const CEH = 130; // pixel height per board
  const CH = EH * CEH + EH * PAD;

  var gif = new GIF({
    workers: 2,
    quality: 10
  });
  const canvas = document.createElement('canvas');
  canvas.style.width = (canvas.width = CW) + 'px';
  canvas.style.height = (canvas.height = CH) + 'px';
  const c = canvas.getContext('2d');

  const size = 12;
  c.font = `${size}px monospace`;

  let more = boardStrings.length > 0;
  let framei = 0;
  while (more) {
    more = false;
    c.fillStyle = 'white';
    c.fillRect(0, 0, CW, CH);
    c.fillStyle = 'black';
    for (let ei = 0; ei < boardStrings.length; ++ei) {
      const ex = ei % EW;
      const ey = Math.floor(ei / EW);
      let state = boardStrings[ei][framei];
      if (state) more = true;
      else state = boardStrings[ei][boardStrings[ei].length - 1]; // game over; keep showing its last frame
      state.split('\n').forEach((line, i) => {
        const ox = CEW * ex + PAD * ex;
        const oy = CEH * ey + PAD * ey;
        c.fillText(line, ox + 5, oy + 5 + size + i * size);
      });
    }
    if (more) {
      // At least one game had a new frame to show
      gif.addFrame(canvas, {delay: 150, copy: true, quality: 20});
    }
    ++framei;
  }

  gif.on('finished', function(blob) {
    console.log('finished...')
    const img = document.createElement('img');
    img.src = URL.createObjectURL(blob);
    img.style.border = '1px solid green';
    $gifs.appendChild(img); // $gifs is a container element on my page
    //window.open(URL.createObjectURL(blob));
  });

  gif.render();
}
</script>


Conclusion


I think I've rambled on long enough here.

I got inspired by somebody teaching a neural network to play Pokemon as emergent behavior. I wanted to do the same for chess. Ultimately, I didn't have enough time to bring that over a meaningful finish line. I had my fun and now I need to move on, since I never intended for anything serious to manifest here.

Kudos if you got this far :)