Who is the Master?

Jean-Marc Alliot
2017 ICGA Journal  
There have been debates for years on how to rate chess players living and playing at different periods (see Keene and Divinsky (1989)). Some attempts were made to rank them not on the results of the games they played, but on the moves played in these games, evaluating these moves with computer programs. However, these previous attempts were subject to various criticisms regarding the strength of the programs used, the number of games evaluated, and other methodological problems. In the current study,
26,000 games (over 2 million positions) played at regular time controls by all world champions since Wilhelm Steinitz have been analyzed using an extremely strong program running on a cluster of 640 processors. Using this much larger database, the indicators presented in previous studies (along with some new, similar ones) have been correlated with the outcome of the games. The results of these correlations show that interpreting the strength of players based on the similarity of their moves with those chosen by the computer is not as straightforward as it might seem. To overcome these difficulties, a new Markovian interpretation of the game of chess is then proposed, which makes it possible to create, from the same database, a Markovian matrix for each year a player was active. By applying classical linear algebra methods to these matrices, the outcome of games between any two players can be predicted, and this prediction is shown to be at least as good as the classical ELO prediction for players who actually played against each other.

In the original study by Guid and Bratko (2006), the sample analyzed was small (1,397 games with only 37,000 positions). Guid and Bratko (2011) used different and better engines (such as Rybka 3, rated 3073 ELO at the time). However, the search depth remained low (from 5 to 12), meaning that the real strength of the program was far from 3000 ELO, and the set of games remained small, as they only studied World Chess Championship games. Their results were aggregated (there was no evaluation per year) and not easily reproducible, as the database of evaluations was not put in the public domain. A second problem was that the metrics they used could not be analyzed, since the raw results were not available. A similar effort was made by Charles Sullivan (2008). In total 18,875 games were used (a much larger sample), but the average search depth was only 16 plies, the program used was still Crafty, and the raw data were not made available, which makes the discussion of the metrics used (such as "Raw error and Complexity") difficult. This lack of raw data also precludes testing different hypotheses (the author decided, for example, to evaluate only game turns 8 to 40, which is debatable; Guid and Bratko made the same kind of decisions in their original paper, such as simply excluding results when the score was above +200 or below −200 centipawns, which is also debatable). All these problems were also discussed by Regan (2009) and Fatta (2010).

In this article I present a database of 26,000 games (the set of all games played at regular time controls by all World Champions from Wilhelm Steinitz to Magnus Carlsen), with more than 2 million positions. All games were analyzed at an average of 2 minutes per move (26 plies on average) by what is currently the best, or nearly the best, chess program (Stockfish), rated around 3300 ELO on the CCRL rating list. For each position, the database contains the evaluation of the two best moves and of the move actually played, and for each move the evaluation, the depth, the selective depth, the time used, and the mean and maximum delta between two successive depths. As the database is in PGN format, it can be used and analyzed by anyone, and all kinds of metrics can be computed from it. The study was performed on the OSIRIM cluster (640 HE 6262 AMD processors) at the Toulouse Computer Science Research Institute, and required 61,440 hours of CPU time.
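As a concrete illustration of how such an evaluation-annotated PGN database can be exploited, here is a minimal Python sketch (mine, not taken from the paper; the file name champions.pgn and the eval=<centipawns> comment format are assumptions, since the paper does not specify how the annotations are encoded) that reads games with the python-chess library and extracts a per-move evaluation from the move comments.

```python
# A minimal sketch, assuming evaluations are stored in move comments as
# "eval=<centipawns>"; the actual encoding used in the paper's database
# may differ. Requires the python-chess package (pip install chess).
import re

import chess.pgn

EVAL_RE = re.compile(r"eval=(-?\d+)")  # hypothetical comment format


def move_evaluations(pgn_path):
    """Yield (move number, SAN, evaluation in centipawns) for every
    annotated move of every game in a PGN file."""
    with open(pgn_path, encoding="utf-8") as handle:
        while True:
            game = chess.pgn.read_game(handle)
            if game is None:          # end of file
                break
            board = game.board()      # starting position of the game
            for node in game.mainline():
                match = EVAL_RE.search(node.comment or "")
                move_number = board.fullmove_number
                san = board.san(node.move)
                board.push(node.move)
                if match:
                    yield move_number, san, int(match.group(1))


# Example usage (champions.pgn is a placeholder file name):
for item in list(move_evaluations("champions.pgn"))[:5]:
    print(item)
```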
The exact methodology is described in section 2. In section 3 we present different indicators that can be used to evaluate the strength of a player. Some of them were already presented in other papers or studies, such as tactical complexity indicators (section 3.1) in Sullivan (2008), "quality of play"¹ (section 3.2), mainly introduced by the seminal work of Guid and Bratko (2006), and the distribution of gain (section 3.3) introduced by Ferreira (2012). Last, we introduce in section 3.4 a new indicator based on a Markovian interpretation of chess which overcomes some of the problems encountered with the other indicators². These indicators are then discussed, validated and compared using our database in section 4. The results demonstrate that evaluating a player's strength based on the "quality" of his moves is not as straightforward as it might seem.

¹ Which we will call "conformance" in this paper.

² I consider here that computer programs are now strong enough (see next section) to be considered as "nearly perfect" oracles when evaluating human games. This is absolutely true when considering endgames (at least up to 6 pieces): here the evaluation function can return the distance to mate for each position, and thus gives an exact evaluation of each move. Of course, as chess has not been solved, the evaluation function in the middle game is only an approximation of this exact function, and different chess programs might return (a) different orderings of the best moves, and (b) different evaluations for the same position (Stockfish, for example, is known for returning higher or lower evaluations than its siblings). Point (b) does not change much in the current work: all results and curves would keep exactly the same shape, only the scales would be modified. Point (a) is a more serious objection: would the results be the same if, for example, Komodo were used instead of Stockfish? The two programs have approximately the same strength and sometimes return different move orderings for the same position. This should be the subject of a further study; there has already been work on comparing the output of different engines (Levene and Bar-Ilan, 2005), especially recently as a result of the Rybka controversy (Dailey, Hair, and Watkins, 2013), which shows that programs usually agree on 50% to 75% of the moves. However, such articles concentrate mainly on how many moves are different, and not on how different the moves are.
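To make two of these indicators concrete, the following minimal Python sketch (a simplified reconstruction, not the paper's exact definitions) computes a conformance-style rate, i.e. the fraction of positions where the move played scores as well as the engine's best move, and the per-move gain, i.e. the evaluation difference between the move played and the engine's best move.

```python
# Minimal sketch of two indicators discussed in section 3, assuming that
# for each position the engine evaluation (in centipawns, from the point
# of view of the player to move) of the best move and of the move played
# is already known. The exact definitions and filters used in the paper
# may differ in their details (e.g. which game turns are included).
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Position:
    best_eval: int    # evaluation of the engine's best move
    played_eval: int  # evaluation of the move actually played


def conformance(positions: Iterable[Position]) -> float:
    """Fraction of positions where the move played scores at least as
    well as the engine's best move (the player 'agreed' with the engine)."""
    positions = list(positions)
    if not positions:
        return 0.0
    agreed = sum(1 for p in positions if p.played_eval >= p.best_eval)
    return agreed / len(positions)


def gain_distribution(positions: Iterable[Position]) -> List[int]:
    """Per-move gain: evaluation of the move played minus evaluation of
    the best move (non-positive if the engine's ordering is trusted)."""
    return [p.played_eval - p.best_eval for p in positions]


# Example with made-up numbers: two best moves found, one 50 cp error.
sample = [Position(30, 30), Position(-10, -10), Position(120, 70)]
print(conformance(sample))        # 0.666...
print(gain_distribution(sample))  # [0, 0, -50]
```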
doi:10.3233/icg-160012