‘Adding to the conversation’: ESL’s new Dota 2 rating system.
ESL recently announced the roll-out of a new rating system for Dota 2, an expansion of their existing CS:GO system. Whilst it certainly fulfills some key criteria (a relatively simple concept, objective rules, transparent-ish), some of the language around it seems overly defensive from the outset, and that in itself is telling.

Back when I started doing Dota 2 stats, I collected historic ratings from GosuGamers (GG) and joinDota (JD) and paired this data with the associated matches to see how accurate the various systems were (GG was quite bad, JD was very bad). I then tried out various methods of my own to see which simple base models performed best, and how these compared to GG/JD. With even the slightest tuning, these pretty stock models performed far better, even on a hidden evaluation set.
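For reference, that kind of comparison boils down to scoring each system’s win-probability predictions on matches it hasn’t seen. Below is a minimal sketch of such an evaluation, using hypothetical ratings and a logistic (Elo-style) mapping from rating difference to win probability; it is not a reproduction of the actual models or data I used.

```python
def win_probability(rating_a, rating_b, scale=400.0):
    """Logistic (Elo-style) probability that team A beats team B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

def brier_score(matches, ratings):
    """Mean squared error between predicted win probability and actual outcome.

    `matches` is a list of (team_a, team_b, a_won) tuples from a held-out
    evaluation set; `ratings` maps team name -> rating. Lower is better.
    """
    total = 0.0
    for team_a, team_b, a_won in matches:
        p = win_probability(ratings[team_a], ratings[team_b])
        total += (p - (1.0 if a_won else 0.0)) ** 2
    return total / len(matches)

# Hypothetical ratings and held-out matches, purely for illustration.
ratings = {"Team A": 1650, "Team B": 1500, "Team C": 1420}
held_out = [("Team A", "Team B", True),
            ("Team B", "Team C", False),
            ("Team A", "Team C", True)]

print(f"Brier score: {brier_score(held_out, ratings):.3f}")
```

Run the same scoring over the same held-out matches for each rating system and you get a directly comparable number for each of them.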

It was several months later, at an event, that I was chatting with various people within the scene. Writers for the news sites (like GG and JD) were quite up-front that their ratings were there mostly to drive traffic to their sites, and as a result they didn’t care too much about accuracy provided it “looked right”. More worrying was that various sponsors and organizations used such ratings as a major indicator of which teams to pick up, and that tournaments used them to determine invites, seeds and groups for their events. To me this was scary: people making important decisions based on pretty terrible data.

Now, back to ESL. They seem to be following a similar line, discussing how this provides a ‘new perspective’ or ‘adds to the debate’. What isn’t mentioned, and is quite key, is that this exact system is already used as a seeding metric for their CS:GO events.

It seems odd to admit that a system isn’t the most accurate and is only there to add to the conversation, yet at the same time use this inferior system in a way that directly impacts the income of players in another esport (CS:GO now, and likely Dota 2 soon). In my eyes, using this instead of a top-tier, statistically proven system is intentionally harming the players.
In conclusion, I’d like to identify some key issues with the model itself, which I think make it unsuitable:
- No statistical metrics published alongside the model (how accurate is it, what is its Brier score, how is overfitting combated?).
- Quantifying events by size (the huge/large/medium/tiny criteria). No matter how many teams attend the LAN finals, the count can be irrelevant if the tournament format is poor: winning a 32-team single-elimination bracket shouldn’t mean more than winning an event with 18-team round-robin groups feeding into a 16-team double-elimination bracket (as TI is).
- Quantifying events by skill. Using teams’ model-derived skill to classify future events within the same model introduces an unnecessary feedback loop, so a relatively small error can compound over time.
- Evaluating tournament performance by final placement alone. Imagine a 256-team single-elimination bracket where one team defeated 5 of the top teams in the world before being eliminated, and another team defeated 5 random pub stacks. Despite finishing in the same position, would you say their performances were the same? If you had to start a tournament the next day, would you realistically offer the same odds on both teams just because they placed the same? Obviously not: individual matches describe the actual experiences and performances of individual teams (see the sketch after this list). If there were an accurate “margin-of-victory” metric, that would also be a great addition.
- TI being weighted 50% more. After TI we have a huge break, a moderate patch, and the biggest shuffle of the year, which together create the largest amount of uncertainty in any system. Weighting TI heavily serves one purpose: the rankings look “right” for 5–6 weeks after TI simply because there is no other information. Given how the decay works, this indirectly means that TI results against what are now completely different rosters are used to justify up-seeding teams which did well at TI (generally big names) over up-and-coming newer teams with much better recent performances against the current set of teams.
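To make the strength-of-opposition point concrete, here is a hypothetical sketch using a standard per-match Elo update (not ESL’s actual formula): two teams start at the same rating and both win five games before elimination, but one beats top teams and the other beats pub stacks, and they end up with very different ratings despite identical placements.

```python
def expected_score(rating, opponent_rating, scale=400.0):
    """Standard Elo expected score for a team against an opponent."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / scale))

def update(rating, opponent_rating, won, k=32.0):
    """One Elo update: the gain depends on how surprising the result was."""
    return rating + k * ((1.0 if won else 0.0) - expected_score(rating, opponent_rating))

# Two hypothetical teams, both starting at 1500 and both winning five games
# before elimination, but against very different opposition.
strong_opponents = [1700, 1680, 1660, 1650, 1640]   # top teams
weak_opponents = [1200, 1180, 1150, 1100, 1050]     # pub stacks

team_x = team_y = 1500.0
for opp in strong_opponents:
    team_x = update(team_x, opp, won=True)
for opp in weak_opponents:
    team_y = update(team_y, opp, won=True)

print(f"Beat top teams:  {team_x:.0f}")   # climbs well above 1500
print(f"Beat pub stacks: {team_y:.0f}")   # moves far less
```

A placement-only metric gives both teams the same credit; a match-level system does not, which is precisely the information thrown away here.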
Whilst it’s cool to talk about alternative ways to evaluate performance, this specific problem exists in multiple other fields and a large amount of research has gone into ‘solving’ it. To reject that just for the sake of an alternative perspective seems backwards. I hope ESL addresses these issues before using this system to seed teams; that is, if they truly care about the competitive integrity of their events.
- Noxville
(ESL used “ELO”, so my joke of adding an Electric Light Orchestra song continues.)