Online Rankings

Yup, not 'that' type of ratings. But your face will drop if yours does this.


This post is a detailed dive into the Glicko ratings used by my online Go server of choice.

If you are not at all interested in ratings then I would advise ignoring!

What is the point to ratings?

The primary aim of any ratings system is actually to sustain game enjoyment. In most large games there is a certain degree of one-upmanship at the extreme upper tiers ('I am 10 points higher than you!'), and ratings get used to decide invites to senior tournaments, but the actual objective is to keep people playing. It does this by following the assumption that players enjoy games more when they:


A - are not beaten heavily and repeatedly,

B - have challenging games that stick in the memory (the 'I remember when I fought back from certain defeat to grasp victory' type), and

C - have games that are not so easy as to be instantly forgotten. People rarely boast of destroying someone who cannot 'play', i.e. 'Ha, I took on my four year old at chess last week. He didn't know the rules properly and I crushed him!'


To help with this you need to know how good/bad players are so you can try and ensure games are more enjoyable than not. This prevents players dropping because the game is too hard or even too easy - hence ratings systems.


Problems with ratings

One thing that becomes rapidly obvious with any ratings system is that some players get obsessed with their rating as opposed to playing. My own Glicko2 system for L5R worked well, but it quickly became noticeable that strong players were deliberately conceding matches (in cup tournaments) to players whose decks were perceived as hard to beat, so that their rating would not be adversely affected.


Any player who religiously follows their rating as a crutch to their own self esteem is likely to be crushed when it drops. This can be so prevalent that Starcraft 2 (as an example) introduced a dummy league with time-increasing bonuses to ensure that, as long as you played, it would appear as though you were 'improving' (see the fascinating discussion here).


Ratings can also be a blocker to players actually playing. In one of the Starcraft 2 redesigns they adjusted the interface so that 'Play' was a rated game and 'Custom' (non-rated) was hidden away. They had found that lots of people did not even want to play rated games because they thought they would be too competitive and, regardless of the purpose of ratings, they did not like having one which could drop. Since the point of the ratings is to make the experience more enjoyable this was a bit of an own goal for the designers, hence the attempt to use GUI clues to indicate otherwise.


Finally, some ratings systems can be gamed. A ratings system based on specific metrics such as 'head shots' can cause players who realise this (and they always find out) to play 'differently' to try and game the rating system.


Ratings Systems

Any ratings system worth anything at all therefore needs to focus on expressing a player's strength accurately enough that they get games that are neither too hard nor too easy. The grandfather of most decent ratings systems was a chess player called Arpad Elo (hence ELO). His system assumed skill followed the very common 'normal distribution' (think of a bell curve). Basically the majority of players are in the middle, and extreme skill and extreme incompetence are both less common.

[As an aside, this appears all over the place. As an example, people's height and body part measurements (e.g. leg length) also follow this distribution. When designing furniture the designer designs for around the middle 80% of people and ignores the bottom 10% and top 10%, so basically the smallest and tallest people. Practically this is because if you design a chair (say) that is comfortable for someone 3 foot 4 to sit on then it will not be comfortable for 80% of the population and impossible for the tallest 10%. So they design for the most common denominator.]

Each player gets a rating that can move up and down. If you beat an equal player then it will go up an average amount and if you beat a stronger player then it increases more. Lose to a weaker player and it drops. The amount of the drop/increase depends on the difference between the two ratings.
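To make the 'goes up more against a stronger player' idea concrete, here is a minimal sketch of the standard ELO update. The K factor of 32 is my assumption for illustration (different organisations use different values) and the function names are made up for this post.

```python
def expected_score(rating_a, rating_b):
    """Expected score (between 0 and 1) for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return A's new rating. score_a is 1 for a win, 0.5 for a draw, 0 for a loss.

    k controls how far a single result can move the rating (assumed value).
    """
    return rating_a + k * (score_a - expected_score(rating_a, rating_b))


print(round(elo_update(1500, 1500, 1), 1))  # beat an equal player: 1516.0
print(round(elo_update(1500, 1800, 1), 1))  # beat a stronger player: 1527.2
print(round(elo_update(1500, 1200, 0), 1))  # lose to a weaker player: 1472.8
```

The size of the change depends only on the gap between the two ratings, which is exactly the limitation the later systems try to address.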

ELO Hell - a phrase for when you are at the bottom of the ELO ratings and improve but only gain 1 or 2 points a victory as the system incorrectly thinks you belong there.


ELO is all well and good but suffers from the fact that a single number is not a good descriptor of 'strength'. Hence other systems appeared. The one I wrote for (and the one used on this particular Go server) is the Glicko2 variant. Glicko introduced a ratings deviation (RD) to the rating. This is a plus-or-minus number that expresses how confident the system is of a player's rating. A number smaller than a hundred means the system is confident that the player is in the correct ratings 'zone', whereas a number greater than that basically says the system either has not had enough time to ascertain the correct rating or the results it has seen are not increasing confidence. Say a 1500 player beat an 1800 player and then lost the next game to a 1200 player. This would increase the system's uncertainty in that player. The more confident the system becomes (confidence grows when you beat players you are expected to beat and lose to stronger players) the lower the deviation becomes.


The rating is normally displayed as follows

PlayerName 1150 +-350

The above is actually the 'dummy' rating given to new players on the server and basically says their rating is 1150 with a deviation of plus OR minus 350. The system has a 95% confidence that the player is in a band indicated by double the deviation. So this player could be as low as 450 or as high as 1850 - quite a range. When trying to matchmake this also provides a wide range of potential opponents. 
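The band arithmetic is simple enough to sketch. This is only the 'double the deviation' rule described above; the function name is mine and the second player is a made-up example.

```python
def confidence_band(rating, rd):
    """Return the (lowest, highest) rating the system is roughly 95% sure of."""
    return rating - 2 * rd, rating + 2 * rd


print(confidence_band(1150, 350))  # (450, 1850)
print(confidence_band(1200, 80))   # (1040, 1360), a much tighter band
```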

One 'quirk' of this server is that this is not the player's actual rating. Behind the scenes a new player gets the default Glicko2 starting rating of

PlayerName 1500 +-350

What the developers are trying to do is ensure that genuinely new players do not join the server and then get repeatedly beaten by stronger players. If they actually started new players 'at' 1150 then an experienced player who is merely new to the server would take a while to get to his/her actual strength. So they display one number and keep the other underneath.

Anyway, one of the aims of any ratings system is to drop the deviation as quickly as possible. If this player is actually a 500 but gets matched with a 2000 +-100 (so someone with a confidence band of 1800-2200) then they are likely to get stomped. Most systems try to find someone close to the stated rating (1150) and then widen the search gradually if opponents are not found (this is the Starcraft 2 matchmaking system).
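As a rough illustration of 'start close, then widen', here is a hypothetical sketch. The starting window, growth rate and cap are numbers I invented for the example; I have no knowledge of the real values used by Starcraft 2 or any Go server.

```python
def search_window(seconds_waited, start=50, growth_per_10s=25, cap=700):
    """How far from the player's rating we are willing to look (all values assumed)."""
    return min(start + growth_per_10s * (seconds_waited // 10), cap)


def eligible_opponents(rating, queue, seconds_waited):
    """Ratings in the queue that fall inside the current search window."""
    window = search_window(seconds_waited)
    return [r for r in queue if abs(r - rating) <= window]


queue = [1046, 1400, 2000]  # made-up ratings of players waiting for a game
print(eligible_opponents(1150, queue, 0))    # []
print(eligible_opponents(1150, queue, 30))   # [1046]
print(eligible_opponents(1150, queue, 120))  # [1046, 1400]
```

With these made-up numbers the 2000 player is never offered, because the window is capped well before it reaches that far.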


As that player plays, the system will increase/drop both numbers, hopefully increasing accuracy as time passes and more games are played. When organising games for 'this' player there could be a wide range of acceptable opponents, but as the deviation shrinks so will the prospective pool. Hopefully this increases the number of enjoyable matches and cuts out the massacre type games for both players.


As an example let us assume the above player plays an opponent 1200 +-80


If he wins then his rating might increase by 201 to 1351 and his RD (rating deviation) drop to +-252. His opponent may drop to 1187 with a deviation of +-79. Why so little? Because the system had a wide band for the winning player, so it assumes that the first player was at the strong end of the band and not the weak.
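For anyone who wants to check the numbers, here is a sketch of a single-game update using the original Glicko (Glicko-1) formulas. This is a simplification on my part: it leaves out the Glicko2 volatility step, and I am not claiming it is what the server runs. Under those assumptions it does land on the figures in the example above.

```python
import math

Q = math.log(10) / 400  # scaling constant from the Glicko formulas


def g(rd):
    """Dampening factor: the less certain the opponent's rating, the less a result counts."""
    return 1 / math.sqrt(1 + 3 * Q**2 * rd**2 / math.pi**2)


def expected(rating, opp_rating, opp_rd):
    """Expected score against an opponent, allowing for the opponent's deviation."""
    return 1 / (1 + 10 ** (-g(opp_rd) * (rating - opp_rating) / 400))


def glicko1_update(rating, rd, opp_rating, opp_rd, score):
    """One-game Glicko-1 update. score is 1 for a win, 0 for a loss."""
    e = expected(rating, opp_rating, opp_rd)
    d_squared = 1 / (Q**2 * g(opp_rd) ** 2 * e * (1 - e))
    new_rd_squared = 1 / (1 / rd**2 + 1 / d_squared)
    new_rating = rating + Q * new_rd_squared * g(opp_rd) * (score - e)
    return new_rating, math.sqrt(new_rd_squared)


winner = glicko1_update(1150, 350, 1200, 80, 1)
loser = glicko1_update(1200, 80, 1150, 350, 0)
print([round(x) for x in winner])  # [1351, 252]
print([round(x) for x in loser])   # [1187, 79]
```

Note how the loser's change is small: the g() factor shrinks the impact of a result against someone whose own rating is very uncertain.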


Let us give another couple of examples


Player A 1800 +-80 beats player B 1600 +-100


Since Player A is at the limit of Player B's play ability this is an expected result so ratings change to



Player A 1809 +-79 Player B 1587 +-98


RDs do not change much and the rating hardly drops/increases for either.


Player C 1700 +-80 beats Player D 1800 +-80


This has more of an impact as the system thought Player D was stronger. Result:


Player C 1722 +-79 Player D 1778 +-79



How much the ratings drop depends on the relative differences between the players' ratings and deviations. (Glicko2 also adds in a volatility factor that can mute extreme changes but, if set too low, can also trap people in the really low rank bands. On this server it appears to be the default value.) The system also has an RD time decay included, so if you don't play for a long time then your RD increases as the system grows less certain of you. This is only a tiny amount per period so has minimal impact unless you are away for a long while.
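To show how small the time decay is, here is a sketch of the Glicko2 rule for a player who sits out a rating period: the deviation (on Glicko2's internal scale) grows by the volatility. The 0.06 volatility is the value used in Glickman's own worked example and is my assumption here, as is treating each missed period identically; how long a 'period' is on this server I do not know.

```python
import math

GLICKO2_SCALE = 173.7178  # converts between a displayed RD and Glicko2's internal phi


def rd_after_inactivity(rd, periods, volatility=0.06):
    """RD after sitting out a number of rating periods (volatility is an assumed value)."""
    phi = rd / GLICKO2_SCALE
    for _ in range(periods):
        phi = math.sqrt(phi**2 + volatility**2)
    return phi * GLICKO2_SCALE


print(round(rd_after_inactivity(80, 1), 1))   # 80.7
print(round(rd_after_inactivity(80, 12), 1))  # 87.8
```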

A final point is that Glicko functions best when computing a 'group' of results. Single results tend to cause the ratings to increase/decrease by much larger amounts. Apparently the OGS server uses around 15 games before generating a more accurate rating but appears to do single rating updates to 'indicate' before that point.


GO Server system

The server uses Glicko2 (and a Kyu system I shall look at later). The site provides ratings by both board size (9x9, 13x13, 19x19) and game type (e.g. blitz, normal etc.) so each player has a different rating for each combination. These are then combined into an overall rating. I can't tell if these are just informational or whether all games funnel into a single background rating (which makes more sense than having a different underlying rating for each game type).

Let's look at my first game.

My start rating was 1150 +-350 [450 to 1850] (and ACTUAL 1500 +-350 [800 to 2200]). I am putting the band the system puts me in at 95% probability in square brackets at the end, so the structure I will use moving forward is

RATING +- RATING DEVIATION [ LOWEST RATING to HIGHEST RATING]

My opponent (one 'SlingingStones') was

1046 +-83 [ 880 - 1212]

The server is much more certain SlingingStones is rated correctly. (Plus this rating is the actual rating without any 'beginner' visual adjustment shenanigans.)

Anyway he/she won with a large victory (though this should not impact a decent ratings system only the win/loss should count). What happened to the ratings?

Mine dropped drastically

New rating

742 +-314 [114 - 1370]

but the clever part is that the actual rating decreased as well

1147.48 +- 280.19 [ 587 - 1707]

Ouch. So I dived but the system is happier where I am (but not by much) as my beginner deviation dropped by a whole 36 points. The results make more practical sense when viewed via the actual ratings. Here a 1500 lost to a much lower 1046 opponent. This explains the drastic dive and curiously the RD dropped by a higher 70 points.

My opponent? His/her rating increased by 1 sole point and his/her deviation stayed exactly the same

New rating

1047 +-83 [ 881-1213 ]

I don't know why this occurred. If the win was calculated using my actual rating then, regardless of the RD, this rating should have shot up. It is behaving as if the win was calculated against the fake beginner rating. Perhaps this is the case, as it prevents unnatural swings in more established players' ratings when playing beginners.

All games ratings



Note how the 1150 for all the other board/games types is increasing the 0 - 0 rating (which is the prime site rating displayed against the user name). Considering this is displaying the adjusted beginner rating I don't know how useful this currently is to me.

The site provides a wealth of interesting stats over time which I will enjoy looking into. 

Kyu Rating?

One specific which I should address now appears to be a Kyu-Dan designation which drops to the alarming level of 41k! And increases to 11th dan.

The following bits are guesswork so may turn out to be incorrect.

This also appears to use a deviation number and my start point was 20.6k +- 6.6 and it dropped to 25.0k +-6.8 

Flovo on the forum has stated that the correlation between the rating and rank is rating = 850 * exp(0.032 * rank), where rank 0 is 30k.
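Taking that formula at face value, the conversion can be sketched in both directions. The function names are mine, and I only handle the kyu side (results of zero or below would be dan territory, which I am not attempting to label).

```python
import math


def rank_to_rating(rank):
    """Rating for a rank number, where rank 0 is 30k (formula as quoted by flovo)."""
    return 850 * math.exp(0.032 * rank)


def rating_to_rank(rating):
    """Inverse of rank_to_rating."""
    return math.log(rating / 850) / 0.032


def rating_to_kyu(rating):
    """Kyu value implied by a rating; the site may cap what it actually displays."""
    return 30 - rating_to_rank(rating)


print(round(rating_to_kyu(1150), 1))  # 20.6
print(round(rating_to_kyu(742), 1))   # 34.2
```

The first figure matches my 20.6k start point. The second is what the formula gives for my new beginner rating before any cap on the displayed rank.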

This might be used to provide an indication of 'belt' type progression as with martial arts. GO is regarded as a martial art, but as any practitioner of any of those knows, belts are awarded for knowledge and not taken away for a bad result in a competition. Official GO Kyus also start at 30 kyu for a total beginner whereas this system seems to drop you to 41 Kyu. (Note the OGS server never displays a player below 25 kyu but the rating related to the rank 'can' go lower, and this is therefore apparent in the histogram.)

Anyway, the following lovely chart was created a year ago (May 2018) by one of the forum users and shows the bell curve of games per active rank



The green covers players with established ratings deviations (so less than 100) and the blue those with less certainty. You can see the bell shaped curve with the mass of server players between 15k and 3k (as noted, I am guessing 'Kyu').

The boxes are my addition. The yellow box shows my start rating and the orange the deviation and 95% band. The red box is my new rating and the purple its 95% confidence band.

The key thing to look at though is the percentage figures below the chart. These show the percentage of players on the server you should 'expect' to beat at your level. So at my current rating it says I am still stronger than 3% of server players, though the deviation means this could be as high as 10% or as little as slightly over 0%. Seems about right.

When this was posted a user stated that at any one time usually 2,000 players were online and 150-200 of those were either in games or looking for games. Therefore if I go looking for a game, and assuming the lower ranked players are represented in proportion to their totals, there should be 4-6 players at, or near, my play ability who will also be looking for games, and that game should be easier than my last.
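The back-of-envelope sum behind that 4-6 figure, assuming the 3% share read off the chart applies equally to whoever happens to be active:

```python
def expected_peers(active_players, share_at_level):
    """Rough count of active players at or near a given level."""
    return active_players * share_at_level


low = expected_peers(150, 0.03)
high = expected_peers(200, 0.03)
print(f"{low:.1f} to {high:.1f}")  # 4.5 to 6.0
```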

We will see. Either way it is a great example of how the ratings system is trying to ensure I get a more enjoyable match and thus play the game for longer.

Personally I expect a few more losses before I even get a sniff of winning and I can see me dropping to the 35 kyu zone.

If anyone has any more direct knowledge of some of the internal server workings feel free to message me and I will update the article accordingly. Some adjustments have been made after some helpful comments by flovo but there will still be plenty of errors, all attributable to me.













