Sad but true: Match-up estimations gone wild [Archive]


GoboLord

09-09-2010, 10:21 AM

When you ask 24 Merfolk players to guess what percentage of games (including sideboarded games) Merfolk wins against Belcher combo, you’ll get answers ranging from 40% to 95%. Why is that? Do they play different decks? Probably not. Don’t they know the match-up? I’d say most of them do. This raises the question of whose estimates are accurate. If some players have difficulty guessing the match-ups of “their” decks right, how is it possible for players who have never played Merfolk or Belcher combo to make such estimates? Still, it is common knowledge that the MU in the given example favors Merfolk, and most people seem to know that, since the average estimate was 68%. In the following article I will explain that it takes more than experience alone to make good estimates about certain match-ups.
This article presents the results of an analysis of certain MUs of the most common legacy decks. I will structure the article as follows:

1. Method (how I collected the data)
2. Results
3. Applying the results
4. Contributions and limitations

Before you start reading: please note that I do not want to offend anyone. Neither do I claim that this article holds the “absolute truth” about Legacy. I’m an ambitious player, just like most of you, who favors statistics when it comes to estimating match-ups. I want to give the reader a more detached view on some topics that are commonly discussed among Magic players.

1. Method

In January 2010 I picked 20 decks that “defined Legacy and won’t disappear in the course of the year 2010”. The thought behind that was that I didn’t want to collect data about decks that would soon disappear. I decided to take the following:

- Dragon Stompy
- Goblins
- Merfolk
- Zoo
- Eva Green
- Aggro Loam
- The Rock
- Threshold
- Team America
- Survival
- Counter-Top Bant
- Dreadstill
- Enchantress
- Lands
- Landstill
- White Staxx
- UBx Stormcombo
- Belcher
- Dredge
- Reanimator

Note that this selection is questionable; I will return to that point in the 4th part of the article.
With the banning of Mystical Tutor, Reanimator seemed to vanish, so I cut it from my analysis. I added Painted Stone and Mono Black.

Collecting data

In the following part I will talk about “games” and “matches”. For me 1 match consists of up to 3 games. Thus, 2-1 is one match with 3 games.
What I wanted to do is measure the percentage of games (including sideboarded games) that Deck A wins against Deck B. An example of that would be: Merfolk 40% - Goblins 60%. This means that Goblins wins 6 out of 10 games against Merfolk. I chose this design because it is commonly seen on forums and players express their estimates in this “percentage fashion”.
Therefore I needed pairings and their actual outcomes, e.g. Merfolk 1 - Goblins 2 (meaning that Goblins wins this match 2-1 after 3 games, including sideboarded games).
I collected the data through (1) personal observation of other matches I saw at tournaments, (2) recording my own results and (3) analyzing the spreadsheets of 6 Star City Games Open Legacy tournaments, published by Jared Sylva. Of course the latter provided me with the largest part of the information. Thanks for that, Jared!

Calculating results

After collecting the data I calculated the percentage as shown in the following example:

Let’s take an easy one: Goblins vs. The Rock
The recorded results: 2-0; 2-1; 2-0; 2-1 and 2-1
This makes a total of 5 matches with 13 games
Goblins wins 10 out of 13 games
The Rock wins 3 out of 13 games
Therefore it’s 10/13 = 0.77 --> 77%

The MU is in Goblins’ favor: 77%-23%

By recording and calculating the results in this particular fashion we know how sideboarding changes the results. Another way to calculate the results would be:

Goblins wins 5 out of 5 matches. Thus it’s 5/5 = 100%
Result: the MU is in Goblins’ favor: 100%-0%

This calculation obviously gives a distorted picture of the MU, since it hides the 3 games that The Rock won, and is therefore useless. That’s why I decided to use the first one.
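
The two ways of counting can be sketched in a few lines of Python. This is only a minimal sketch, not the spreadsheet actually used for the article; the function names and the representation of a match as a (games won by A, games won by B) tuple are assumptions made for illustration.

```python
def game_win_rate(results):
    """Share of individual games won by deck A.

    `results` is a list of (games_won_by_a, games_won_by_b) tuples,
    one tuple per match, e.g. (2, 1) for a 2-1 match win.
    """
    won = sum(a for a, _ in results)
    played = sum(a + b for a, b in results)
    return won / played


def match_win_rate(results):
    """Share of matches won by deck A (1-1 draws count as not won)."""
    wins = sum(1 for a, b in results if a > b)
    return wins / len(results)


# Goblins vs. The Rock: the five recorded results from the example
goblins_vs_rock = [(2, 0), (2, 1), (2, 0), (2, 1), (2, 1)]

print(f"games:   {game_win_rate(goblins_vs_rock):.0%}")
print(f"matches: {match_win_rate(goblins_vs_rock):.0%}")
```

For the five results in the example, the game-based rate is 10 of 13 games (about 77%), while the match-based rate is 5 of 5 (100%).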

2. Results

The results are listed in a cross table (http://www.megaupload.com/?d=ELSMSFDA). The table holds only percentages of MUs with 10 or more recorded matches (thus at least 20 games) – in order to make the results more significant.

For everyone who does not want to download the table, here are the results that I found most striking:

Merfolk: 45% - Goblins: 55%
2-0 XXXXX.XX
2-1 XXXXX.XX
1-1
1-2 XXXXX.XX
0-2 XXXXX.XXXXX.XXX
This means that Merfolk wins 45% of the games (including sideboarded games) against Goblins, while Goblins wins most of its matches 2-0. I will report the following examples in the same fashion, only without explanations.

Lands: 40% - Goblins: 60%
2-0 XX
2-1 XX
1-1 XX
1-2 XX
0-2 XXXXX.

UBx Stormcombo: 49% - Goblins: 51%
2-0 XXXXX.
2-1 XXX
1-1
1-2 XXXXX.XXXXX.
0-2 XX

UBx Stormcombo: 29% - Merfolk: 71%
2-0 XXX
2-1 XXXX
1-1
1-2 XXXXX.XXXXX.X
0-2 XXXXX.XXXXX.XXXXX.XXX

Belcher: 38% - Merfolk: 62%
2-0 XXXX
2-1 XX
1-1 X
1-2 XXXXX.XXXXX.XXX
0-2 XXXXX.XXX

Lands: 49% - Zoo: 51%
2-0 XXXXX.
2-1 XXXX
1-1 XXXXX.XX
1-2 XXXXX.XXX
0-2 XXXX

Belcher: 60% - Zoo: 40%
2-0 XXXXX.XXXXX.X
2-1 XXXXX.XXXX
1-1 X
1-2 XXXXX.XXX
0-2 XXX

Mono Black: 83% - Zoo: 17%
2-0 XXXXX.XX
2-1 XX
1-1
1-2 X
0-2

UBx Stormcombo: 43% - CT Bant: 57%
2-0 XXXX
2-1 XX
1-1
1-2 XXXXX.X
0-2 XXXXX.

Belcher: 36% - TES 64%
2-0 X
2-1 XX
1-1
1-2 XXX
0-2 XXXX

Here is what I found striking:

- Goblins and Merfolk seem to be close to even (45%-55%)
- Goblins does better against Lands than against Merfolk (60% vs. 55%)
- Goblins does rather well against UBx Stormcombo (51%-49%)
- Merfolk does worse against Belcher than against TES (62% vs. 71%)
- Lands vs. Zoo ends 1-1 very often compared to the others (7x!)
- Zoo does worse against Mono Black than against Belcher (17% vs. 40%)
- UBx Stormcombo does surprisingly well against CT Bant (43%)
- Belcher does badly against UBx Stormcombo

Some additional information (not listed here, look at table)
The best MU (among the ones I reported) for

- …Goblins is CT Bant (63%)
- …Merfolk is UBx Stormcombo (71%)
- …Zoo is Enchantress (70%)
- …Eva Green is Merfolk (52%)
- …Aggro Loam is Zoo (63%)
- …CT Bant is UBx Stormcombo (57%)
- …Lands is Dredge (66%)
- … UBx Stormcombo is Lands (68%)
- …Belcher is Goblins (78%)
- …Dredge is CTBant (58%)

The worst MU (among the ones I reported) for

- …Goblins is Belcher (22%)
- …Merfolk is Enchantress (33%)
- … Zoo is Mono Black (17%)
- … Eva Green is Zoo (37%)
- …Aggro Loam is UBx Stormcombo (42%)
- …CT Bant is Goblins (37%)
- …Lands is UBx Stormcombo (32%)
- … UBx Stormcombo is Merfolk (29%)
- …Belcher is UBx Stormcombo (36%)
- …Dredge is Lands (34%)

3. Applying the results

Now that we have this amount of information, what comes next? Of course we might want to learn something from it and maybe even apply it to deck construction, sideboard construction, tournament preparation, playtesting etc.
Note that this might again invite criticism, and again I want to point to the 4th part of this article: Limitations.
Before we start I would like to introduce a friend of mine: Mr. Average. Mr. Average has played every MU recorded in my table. Therefore the statistics we are discussing are his personal statistics. Mr. Average is the average player with the average decklist, average playing style and average skill. He will be our guide in the following part.
Let’s look at how to apply the results. Since Goblins is the deck I can pilot best, I will give an easy example from Goblins’ perspective that might work for other decks too.
A common opinion (when it comes to sideboard construction) is to not run any combo hate. We like to reason that “the MU is just too bad and I’d rather make other MUs better by putting other cards in my SB”. With a short glance at the results the Goblins player notices that at least part of our combo MU, namely TES, is actually not as bad as expected. We now take a look at the UBx Stormcombo vs. Goblins tally in part 2: it tells us that Goblins wins most of its matches 2-1. Since Goblins is likely to lose g1, it seems that the percentages rise after sideboarding. A plausible conclusion would be that Goblins has rather effective cards to fight combo in g2 and g3 and that this MU is far from being “just too bad”. As a result Goblins should consider dedicating some SB slots to combo hate. Similar thoughts can be applied to sideboard (and MD) construction of other decks.
I am aware of the fact that MU philosophy is not only about what cards you have in your SB. As someone in this forum put it: “It’s not the decks that have good or bad MUs, it’s players that have MUs.” Or as someone suggested in a survey of mine: “Player can change a big deal of the MUs.” IMO both are right. Sideboarding and deck construction have to fit someone’s skill and playing style. When applying the results to ourselves we should note that they hold information about Mr. Average. To give another example:
The categories of decks I listed are very broad. Goblins can splash the colors W, G and B. Those splashes push the MUs in certain directions. Someone who doesn’t run G has no access to Krosan Grip and is therefore more vulnerable to Moat/Humility.
To determine your distance to Mr. Average it would be useful to record the outcomes of the games you play at tournaments and while playtesting. This might help to determine what your MUs (not those of your deck) are. E.g. 4 months ago I found it quite hard to beat Merfolk with Goblins. My personal statistics told me that I was far away from winning 55% of the games (like Mr. Average). So there was obviously a distance between him and me: my MU against Merfolk was bad, not that of my deck.
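
A personal log of this kind can be compared with the table value in a couple of lines. The sketch below is only an illustration: the function name and the personal record (9 wins in 24 games) are hypothetical, and only the 55% for Goblins against Merfolk comes from the article.

```python
def distance_to_average(my_wins, my_games, average_rate):
    """Difference between a personal game win rate and the table value.

    Positive means doing better than Mr. Average, negative means worse.
    """
    if my_games == 0:
        raise ValueError("no games recorded yet")
    return my_wins / my_games - average_rate


# Hypothetical personal log: 9 wins in 24 games with Goblins against Merfolk,
# compared with the 55% reported for Goblins in the article.
gap = distance_to_average(my_wins=9, my_games=24, average_rate=0.55)
print(f"distance to Mr. Average: {gap:+.1%}")
```

With so few personal games the gap itself is noisy, so it is better read as a hint than as a measurement.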
The results tell us what Mr. Average does and are therefore not directly applicable to your particular decklist, playing style and skill. They are worthless without some interpretation. Or as a user in this forum puts it:

I suspect the real problem is that "Mr. Average" is pretty good with some decks (like goblins), but not all that good with others (like storm combo). (As anecdotal evidence, my TES vs. Goblins matchup improved dramatically after goldfishing a thousand or so times. It takes a lot of practice to make sure you don't lose to yourself, let alone hate. That's not nearly as much of an issue with goblins, where it's pretty straightforward to win if your opponent screws themself.) I suppose that's not necessarily a problem, though, as long as you understand that you're measuring matchups between all players, rather than particularly skilled players with a given deck.
And that is what we need to do when we want to apply them. Let me add something from a psychologist’s point of view at the end of this part: people tend to overestimate themselves in fields they value highly. This effect is known as the “above-average effect”. To demonstrate this effect, researchers interviewed male drivers in hospital after they had been in a car accident. 80% of those participants (all of them male, still injured though recovering from the crash!) rated themselves as better than the average car driver.

4. Contributions and limitations

As I said right at the beginning: This article does not hold absolute truth about legacy. Neither do I claim that all of my methods are 100% perfect. In this last part I will discuss pros and cons of my analysis.

What does the article contribute?

This article is about averages - nothing more and nothing less. It tells us what the MU of certain decks are like according to average decklists, playingstyles and skills. This article reveals some rather surprising results that are far away from actual estimations that players give. This might help to reduce wrong impressions and ratings when discussing strategies against certain decks. It is designed to push thoughts about playtesting, tournaments, deck- and sideboard construction away from subjective opinions and to give a more distant view on MUs in general.

What are limitations of the article?

As I said throughout the article, there are two questionable points that I want to discuss.
First of all: the categorization of decks. I doubt that people will be happy with such broad categories as “Survival”, “Landstill” and “UBx Stormcombo”. While the concept of Belcher is very clear-cut, the decks named before can differ a lot in their decklists. Nevertheless there are reasons for this categorization. It would be quite hard to find enough data about decks like Survival if I were to split them up into their subtypes, and I still wanted to say something about them. I grouped them together because their function and win condition are very similar: a green-based deck that is able to create card advantage and to find flexible answers via Survival of the Fittest. The same is true for Landstill and UBx Stormcombo. All Landstill decks create card advantage via Standstill, have a slow win condition and are blue and control-based. All TES/ANT/DDANT lists are combo decks that share many mana-producing cards, the storm mechanic, tutors and cantrips, and are lethal around turn 2-3. I know that experts on those decks will disagree with me, but this categorization does not mean that those decks are the same; the categorization is functional. Experts on those decks will know how to interpret the results for their particular deck: e.g. Survival Bant is better off against combo than Survival feat. Recurring Nightmare. Once again: results tell nothing without some interpretation.
Second, the application of the results. You might say that the results are not very helpful because they contain data from the lousiest losers as well as from tournament winners and that they therefore can’t be applied to anyone. When looking at the Merfolk (45%) - Goblins (55%) MU you might come up with explanations such as “you just recorded too many lousy Goblins players, this MU must be better”. Maybe you are right, maybe you aren’t. There is always some randomness in statistics. The more games I record, the less likely it is that the result is distorted by randomness. To ensure at least a little significance I only reported MUs for which I have 10 or more recorded matches (thus at least 20 games). One should also take note of what I wrote at the end of part 3: results tell nothing without some interpretation.

Thank you for reading my article.
I would like to hear any helpful comment.
If you have any questions please feel free to ask, I will try to answer them.

Cthuloo

09-09-2010, 12:00 PM

Great work. There's definitely a lack of hard matchup data, and everyone who tries to fill this gap is of great help in making us all understand the meta better. There's a point I would like to discuss, though.

[WARNING : A bit of math will follow]

You mentioned Mr. Average in your article, and all the caveats needed to deal with him. Actually, to be sure we understand what Mr. Average is telling us, we can get the help of Mr. Variance. Let's take a real example. You collected (if I counted correctly) a total of 185 Zoo vs. Merfolk games. It turns out Zoo won 111 of them, while Merfolk won 74. Mr. Average tells us that the matchup is 60-40. Let's do the simplest treatment of this statistic. In principle we can treat each game as a Bernoulli trial, calling

P = Probability that Zoo wins a game
Q = Probability that Merfolk wins a game

with P+Q = 1, P = 1-Q.

Then, the data for a number N of games should follow a binomial distribution centered on N*P. This distribution has a variance of the form

V = NPQ

In your case, the center of the distribution is 111, and the variance is

V = 185*0.6*0.4 = 44.4

Since the number of events is decently large, we can then approximate the distribution with a Gaussian with a standard deviation S = Sqrt(V) = 6.66. Then we can say something about the "true" value of the matchup. Calling M the number of games won by Zoo we see that:

104 < M < 117 with 68% probability
97 < M < 125 with 95% probability
91 < M < 131 with 99% probability

We can then say with a confidence of 95% that the matchup is in zoo's favour (i.e. M/N = 97/185 > 0.5).
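
The calculation above is easy to reproduce. The following is a minimal sketch of the same normal approximation, using the observed win rate as the estimate of P; the function name is an assumption for illustration, and the intervals are printed unrounded, so they differ slightly from the whole-number bounds quoted in the post.

```python
import math


def normal_approx_intervals(wins, games):
    """Return (std_dev, {k: (low, high)}) for k = 1, 2, 3 standard deviations.

    Uses the observed win rate as the estimate of P and the binomial
    variance V = N * P * Q, as in the post.
    """
    p = wins / games
    q = 1.0 - p
    std_dev = math.sqrt(games * p * q)
    intervals = {k: (wins - k * std_dev, wins + k * std_dev) for k in (1, 2, 3)}
    return std_dev, intervals


# Zoo vs. Merfolk: 111 Zoo wins in 185 games
std_dev, intervals = normal_approx_intervals(wins=111, games=185)
print(f"S = {std_dev:.2f}")
for k, (low, high) in intervals.items():
    print(f"{k} sigma: {low:.1f} < M < {high:.1f}  "
          f"({low / 185:.1%} to {high / 185:.1%} of games)")
```

The lower end of the two-sigma interval stays above half of the 185 games, which is the basis for calling the matchup favourable for Zoo at roughly 95% confidence.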

This also means that we still need a lot more data to draw precise conclusions based only on data. This is particularly true in the case of Storm Combo vs Goblins, where you have a total of 53 events, and the variance is roughly

V = 53*0.5*0.5 = 13.25

Then

S = 3.64

Looking at the number of games won by Goblins (which by the measured data is 27/53), called G:

23 < G < 31 with 68% probability
19 < G < 35 with 95% probability
16 < G < 38 with 99% probability

So we can't say with decent certainty that the matchup is not 30 - 23 instead of 26 - 27, and there is more than 5% probability that it is instead something like 34 -19 ( which means 65% in favour of the storm combo player ).
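
To see how much more data would be needed, one can look at how the approximate 95% margin on a win rate shrinks with the number of games. This is a sketch under the same normal approximation; the function name is an assumption, and the sample sizes other than 53 and 185 are hypothetical and only show the trend.

```python
import math


def margin_95(p, games):
    """Approximate 95% margin (two standard errors) on a win rate.

    Normal approximation to the binomial: SE = sqrt(p * (1 - p) / N).
    """
    return 2.0 * math.sqrt(p * (1.0 - p) / games)


# 53 is the number of Storm vs. Goblins games quoted above; 530 and 2000
# are hypothetical sample sizes.
for games in (53, 185, 530, 2000):
    print(f"N = {games:4d}: 50% +/- {margin_95(0.5, games):.1%}")
```

With 53 games the margin is about 14 percentage points either way. It shrinks with the square root of N, so four times as many games are needed to halve it.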

I'm not sure if you followed me, but basically the message is: we need more data, and we need to be careful in interpreting the data we have. Let me remark again how much I appreciated your work, and that this criticism is meant to be constructive and is by no means directed at how you conducted your research.

GoboLord

09-09-2010, 03:33 PM

Humphrey

09-09-2010, 05:38 PM

GoboLord

09-09-2010, 05:48 PM

Awesome work!

Yeah, it would be nice to get more data, maybe every event with more than 100 players in one year.

What I'd like to know is which deck(s) have the best average matchup against the field and which are the most played.

And I don't understand what the XXX after 2:0 means.

It's not useful to draw such a conclusion because the field does not contain rogue decks and other decks that I simply missed, like Faeries, New Horizons, Sneak Attack etc.

Each X stands for one match that ended with the result written in front of it (2-0, 2-1 and so on).
Therefore

Deck A - Deck B

2-0 XXX
2-1 X
1-1
1-2 XXXX
0-2 XXXXX.XX

Deck A wins 3x 2-0, 1x 2-1
Deck B wins 4x 2-1, 7x 2-0
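
For anyone who wants to turn these tallies back into numbers, here is a minimal sketch of a parser. It assumes the layout used in this thread (result, then X marks grouped in fives by dots); the function names are made up for illustration.

```python
def parse_tally(lines):
    """Turn tally lines such as '0-2 XXXXX.XX' into {(a, b): count}.

    Dots only group the X marks in fives and are ignored.
    """
    counts = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        a, b = (int(n) for n in parts[0].split("-"))
        marks = parts[1] if len(parts) > 1 else ""
        counts[(a, b)] = marks.count("X")
    return counts


def game_share(counts):
    """Share of games won by deck A over all tallied matches."""
    won = sum(a * n for (a, b), n in counts.items())
    played = sum((a + b) * n for (a, b), n in counts.items())
    return won / played


# The Deck A - Deck B example from this post
example = ["2-0 XXX", "2-1 X", "1-1", "1-2 XXXX", "0-2 XXXXX.XX"]
counts = parse_tally(example)
print(counts)
print(f"Deck A wins {game_share(counts):.0%} of games")
```

In the example, Deck A wins 12 of the 35 games played, i.e. about 34%.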

caiomarcos

09-09-2010, 06:46 PM

I can't download the table, it says: "The file you are trying to access is temporarily unavailable."

pippo84

09-09-2010, 07:21 PM

@ GoboLord: very interesting info! I will read the article again when I have more time, so I can look at the info in more depth. Good job anyway! I'll probably post some more comments later on. I'd also like to see more MU analysis.

@ Cthuloo: I didn't understand a thing! Prepare yourself to explain something to me next time..

Cthuloo

09-10-2010, 05:00 AM

At university I'm a bit into statistics myself, so I understand the essence of what you wrote (and calculated). I know that, although it took 9 months to gather the data, it takes much more to make the results significant with p < .05
Plus there are many MUs for which I don't even have a single record, so there is much more work waiting... Unfortunately Legacy remains an ever-changing format, so nobody knows what decks we will have next year. I didn't even dream of Reanimator's drop-out when I started the analysis.

Thanks for your comment.

Yes, the one you're doing is definitely a hard job! ;) But it's indeed very valuable. Even when the data are not conclusive, they can still be of great help.

E.g., even if we can't really say the Storm vs Goblins matchup really is even, the data suggest that it is definitely not impossible, and you made correct and interesting deductions about Gobbo's sideboard. I will definitely be interested in seeing updates to the table in the next months! ;)

GoboLord

09-10-2010, 06:08 AM

It would be helpful if you'd send me spreadsheets (like the ones posted on starcitygames by Jared Sylva) when you find any.

klaus

09-10-2010, 06:19 AM

GoboLord

09-10-2010, 06:55 AM

GoboLord, I appreciate your effort.
But as you concluded, what you initiated would have to evolve into a Source collaboration boasting ten times as much data to become meaningful. Goblins having a positive Storm Combo MU in your analysis emphasizes that in bold letters.

Well actually it's a statistical law that results don't change much if you have a sample with N > 30.

This means that if I recorded 30 games Merfolk vs. Zoo and the outcome is e.g. 40% - 60%, it won't be 60% - 40% after 300 recorded games.
In the particular case of UBx Storm combo vs. Goblins I recorded 53 games. Thus the MU percentage won't change dramatically with 530 games, it will stay around 50%-50% +/- 5% maybe.

My N is at least 20 for every MU, so it doesn't take much more data to make the numbers I already reported significant. I rather need more data about the MUs I didn't report yet.

Mostly_Harmless

09-10-2010, 07:54 AM

Well actually it's a statistical law that results don't change much if you have a sample with N > 30.

That's a little ridiculous. For one thing, "statistical law" doesn't mean your answer will never change drastically, it means your answer will probably not change drastically (for some value of probably). It's entirely possible you accidentally picked 40 very unlucky matches. That's not actually what I think happened, though. I suspect the real problem is that "Mr. Average" is pretty good with some decks (like goblins), but not all that good with others (like storm combo). (As anecdotal evidence, my TES vs. Goblins matchup improved dramatically after goldfishing a thousand or so times. It takes a lot of practice to make sure you don't lose to yourself, let alone hate. That's not nearly as much of an issue with goblins, where it's pretty straightforward to win if your opponent screws themself.) I suppose that's not necessarily a problem, though, as long as you understand that you're measuring matchups between all players, rather than particularly skilled players with a given deck.

I'm curious to see what happens if you look at win percentages for particular players in a given matchup. I suspect you'd find several storm players with 80-20 or 90-10 records vs. goblins and a lot of players with 40-60 records.

That said, I really do appreciate the work you put into this. It's nice to see someone actually look at data rather than just make educated guesses about matchups (like I just did =).)

Cthuloo

09-10-2010, 08:15 AM

The interesting thing is that, if we really had a huge amount of data, one could in principle also account for player skill. The binomial distribution will have a peak around the average and then decrease with a precise law on both sides (it should look like a bell for a large amount of data). Then if you e.g. suppose you are in the top 10% of storm combo players, you can locate where on the curve you are, and find the expected matchup average for your skill (supposing the skill distribution of combo players is a Gaussian and not something completely weird for some reason). But this will probably require an amount of data we will never have.

Well actually it's a statistical law that results don't change much if you have a sample with N > 30.

This means that if I recorded 30 games Merfolk vs. Zoo and the outcome is e.g. 40% - 60%, it won't be 60% - 40% after 300 recorded games.
In the particular case of UBx Storm combo vs. Goblins I recorded 53 games. Thus the MU percentage won't change dramatically with 530 games, it will stay around 50%-50% +/- 5% maybe.

My N is at least 20 for every MU, so it doesn't take much more data to make the numbers I already reported significant. I rather need more data about the MUs I didn't report yet.

I agree to some extent. 53 is not a really huge number, but it still tells us that it is highly unlikely that the matchup is better than 65-35 for combo (with 95% certainty), which IMHO is already worth knowing, since the usual wisdom is that the matchup should be more like 80-20.

Psyqo

09-10-2010, 09:00 AM

Great research, thanks!

GoboLord

09-10-2010, 10:48 AM

What you say is absolutely true. That's why I referred to "Mr. Average". If most storm combo players do lousy against Goblins (40-60) and a few are very good (90-10), then it's just not true that this MU is in favor of storm combo, because in most cases it isn't.
Note that this is exactly what I said in the last passage of part 3. I agree with you that if both players are skilled with their decks, combo should win with a greater probability. I repeatedly wrote that I don't compare players but averages. Still, if we take more data it will just contain the same share of bad and good players. Therefore the results probably (95%) won't change much.

I added your post to my article, because it makes clear what I meant.

Mostly_Harmless

09-10-2010, 05:57 PM

Cthuloo

09-13-2010, 07:11 AM

Well I guess we actually agreed, then. That'll teach me to read more carefully.

@Cthuloo: I don't see any particular reason for the distribution of play skill to be normal. If anything, I'd expect a bimodal distribution for any sufficiently difficult deck (be it combo or countertop). The people who pick up a deck and play it for a tournament or two will be pretty lousy (I do this a lot with various tempo decks), while the people who pick a deck and stick with it get pretty good pretty fast. We don't get to use the Central Limit Theorem here (not that you seem to think we can; I just wanted to be clear) because each data point is the outcome of a single game/match, not the average of many.

You make a good point. What is really hard to model, however, is the eventual shape of this bimodal distribution, where the peaks are and how high the "good players" peak is with respect to the other. There could even be more peaks, and in the presence of many different peaks the final distribution may very well look like a Gaussian, but it's hard to tell. I find it difficult to imagine that we will ever have the data to reconstruct the shape of the distribution, so here's where personal experience plays a big role in filling the gaps.

GoboLord

09-13-2010, 08:52 AM

You are right with what you say, but finding distributions of play skill is

a) not possible, because skill must be measured on more than success with a particular deck
b) not what I wanted to show with this analysis.

I would now like to hear some advice before I go on collecting data.
Maybe we could discuss more applications of the data? So far we have only found out what it cannot be applied to and what the limitations are.

Cthuloo

09-13-2010, 09:24 AM

You are right with what you say, but finding distributions of play skill is

a) not possible, because skill must be measured on more than success with a particular deck
b) not what I wanted to show with this analysis.

You're definitely right, I was only answering Mostly_Harmless's remark.

I would now like to hear some advice before I go on collecting data.
Maybe we could discuss more applications of the data? So far we have only found out what it cannot be applied to and what the limitations are.

Just throwing in some ideas, probably not all of them very good:

- Group decks into archetypes (the usual Aggro-Control-Combo for instance), and then try to see how they perform against each other (is it really Combo>Aggro>Control>Combo?). Then one could repeat the process for single deck vs archetype (is Mono Black a good choice in a field full of Aggro?). We don't need to be very precise in the classification, I guess, since the large amount of pooled data should be sufficient to draw some conclusions anyway.

- Try to have a look at the distribution: Win(games) vs. Win(matches). In principle, if you have a probability X to win a single game, you will win a match with a probability Y=X^2+2*(1-X)*X^2 = 3*X^2 - 2*X^3 = X^2*(3-2*X). If the observed match win rate is very different from this value for a matchup, this should be an indication that sideboarding plays a big role.

- Try to see the % of matches that end up as a draw for a given deck. This parameter can play a big role when deciding whether to bring the deck to a big tournament or not.
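
The game-to-match formula from the second idea can be tabulated quickly. This is a minimal sketch that assumes, as the formula does, that every game in a match is won independently with the same probability X; the function name is an assumption.

```python
def match_win_probability(x):
    """Probability of winning a best-of-three match.

    Assumes every game is won independently with the same probability x,
    i.e. Y = x^2 + 2 * (1 - x) * x^2 = 3x^2 - 2x^3.
    """
    return 3 * x**2 - 2 * x**3


for x in (0.40, 0.50, 0.55, 0.60, 0.70):
    print(f"game {x:.0%} -> match {match_win_probability(x):.1%}")
```

For example, a 60% game win rate gives a 64.8% expected match win rate. An observed match win rate far from the expected value would hint that g1 and the sideboarded games do not have the same win probability.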

This is all I can think of for the moment.

1rakete

09-13-2010, 10:20 AM

First: kudos for your comprehensive work on this. But I actually question its usefulness. The differences between the individual builds of certain decks are just too large to compare them in a reasonable way. Just an example: Rhoner Merfolk vs Saito Merfolk and their matchup against a CB Top Bant list Excalibur-style vs a CB Top Bant list with NO. Most of the deck types you compare have huge differences between possible competitive builds; only a few of them are nearly always the same. In my eyes, there is no way to eliminate this problem. Concerning surveys, you could use reference builds. But data from tournaments is just not usable, as the variance in the builds is too high.

Then, as already mentioned, your work has some issues concerning your statistical method. Unfortunately, I can’t do more than complain on this point because my statistics skills are poor.

GoboLord

09-21-2010, 07:31 PM

Here is an update of the MU-Analysis I described in my article.

To make things easier I will present the results in a different fashion: I arrange the MU percentages in blocks of decks. The percentages are listed from best (top) to worst (bottom).
If someone is interested in the updated version of the Excel document, feel free to ask. I'll then either post a link or send it via e-mail.

Goblins
CT Bant - 64%
Merfolk - 60%
Lands - 58%
Landstill - 57%
Mono Black - 54%
Dredge - 52%
Aggro Loam - 50%
Threshold - 46%
ANT/TES - 46%
Zoo - 45%
Dreadstill - 39%
Survival - 31%
Belcher - 20%

Merfolk
ANT/TES - 67%
Dreadstill - 58%
Landstill - 58%
Belcher - 58%
CT Bant - 56%
White Staxx - 56%
Painter - 53%
Threshold - 52%
Eva Green - 49%
Aggro Loam - 48%
Dredge - 46%
Lands - 44%
Enchantress - 42%
Mono Black - 40%
Survival - 40%
Rock - 40%
Goblins - 40%
Zoo - 38%

Zoo
Enchantress - 68%
Painter - 66%
Eva Green - 66%
Merfolk - 62%
Dredge - 61%
CT Bant - 55%
Goblins - 55%
Threshold - 52%
Lands - 51%
Landstill - 51%
Dreadstill - 48%
ANT/TES - 47%
Survival - 46%
Rock - 40%
Belcher - 38%
Aggro Loam - 38%
Mono Black - 23%

Eva Green
Merfolk - 51%
CT Bant - 46%
Dredge - 35%
Zoo - 34%

Aggro Loam
CT Bant - 64%
Zoo - 62%
Merfolk - 52%
Goblins - 50%
Dredge - 43%
ANT/TES - 42%

Rock
Merfolk - 60%
Zoo - 60%
Dredge - 40%

Threshold
Belcher - 82%
ANT/TES - 54%
Goblins - 54%
CT Bant - 50%
Survival - 49%
Merfolk - 48%
Zoo - 48%

Survival
Goblins - 69%
Belcher - 62%
Merfolk - 60%
CT Bant - 58%
Zoo - 54%
Threshold - 51%
Dredge - 50%
ANT/TES - 43%
Dreadstill - 43%

CT Bant
ANT/TES - 57%
Eva Green - 54%
Landstill - 52%
Threshold - 50%
Zoo - 45%
Merfolk - 44%
Dredge - 42%
Lands - 41%
Survival - 43%
Goblins - 36%
Aggro Loam - 36%

Dreadstill
Goblins - 61%
Survival - 57%
Zoo - 52%
Merfolk - 42%

Enchantress
Merfolk - 58%
Dredge - 52%
Goblins - 50%
Zoo - 32%

Lands
Dredge - 63%
CT Bant - 59%
Merfolk - 56%
Zoo - 49%
ANT/TES - 36%

Landstill
Zoo - 49%
CT Bant - 48%
Goblins - 43%

ANT/TES
Belcher - 64%
Aggro Loam - 58%
Goblins - 54%
Zoo - 51%
Dredge - 50%
Merfolk - 33%

Belcher
Goblins - 80%
Zoo - 62%
Dredge - 59%
Merfolk - 42%
Survival - 38%
ANT/TES - 36%
Threshold - 18%

Dredge
Eva Green - 65%
Rock - 60%
CT Bant - 58%
Aggro Loam - 57%
Merfolk - 54%
Survival - 50%
ANT/TES - 50%
Enchantress - 48%
Goblins - 48%
Belcher - 41%
Zoo - 39%
Lands - 37%

Painter
Merfolk - 47%
Zoo - 34%

Mono Black
Zoo - 77%
Merfolk - 60%
Goblins - 46%

As always: If you have questions, feel free to ask.
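One quick sanity check on lists like these: a matchup and its mirror entry should add up to roughly 100%. The sketch below is only an illustration, not part of the original analysis. It copies four pairs from the blocks above into a dictionary; the function name `reciprocal_gaps` and the `tolerance` parameter are invented for this example.

```python
# (deck, opponent) -> win % of deck, copied from the blocks above.
matchups = {
    ("Goblins", "Merfolk"): 60,
    ("Merfolk", "Goblins"): 40,
    ("Zoo", "Mono Black"): 23,
    ("Mono Black", "Zoo"): 77,
    ("CT Bant", "Survival"): 43,
    ("Survival", "CT Bant"): 58,
    ("Threshold", "Belcher"): 82,
    ("Belcher", "Threshold"): 18,
}


def reciprocal_gaps(table, tolerance=0):
    """Return pairs whose two mirrored percentages do not sum to 100.

    Each pair is reported once, keyed in alphabetical order, with the
    sum of both entries as the value.
    """
    gaps = {}
    for (deck, opponent), pct in table.items():
        mirror = table.get((opponent, deck))
        if mirror is None or deck > opponent:
            continue  # no mirror entry, or pair handled from the other side
        total = pct + mirror
        if abs(total - 100) > tolerance:
            gaps[(deck, opponent)] = total
    return gaps


print(reciprocal_gaps(matchups))  # {('CT Bant', 'Survival'): 101}
```

A sum that is off by a point or so may simply come from rounding or from how draws were counted, so a small tolerance is probably sensible when checking the full table.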

stasis

03-30-2011, 07:12 AM

It would be interesting to see which match-ups sideboarding turns around completely.

practical joke

03-30-2011, 09:03 AM

Cenarius

03-30-2011, 09:09 AM

Somehow I get this feeling that either some decklists are not optimal or the players are lacking the ability to play the deck to its full potential.

I mean, there's no reason why ANT would lose to goblins mainboard except:
- T3 goblin kills on the play against an opponent that kept a mediocre hand without a T2 kill (which are pretty rare)
- Ad Nauseam failed (which you should never try against goblins, because ill-gotten gains/tutor chain is a much easier kill)

But... everything I see that seems off could be extremely list-dependent.

I'm very skeptical about the results.

QFT.

I'm also skeptical about these results.

I'm not sure how many games you played with or against certain decks. I believe results become self-explanatory after 75 games preboard and 100 games postboard, but then again it totally depends on the players and on the lists.

Nessaja

03-30-2011, 09:31 AM

He already explained where he got his data from.

The question of "how does goblins even win against ANT game 1" is a very relevant one though. The only way that happens is when you get a turn 3 kill, which requires a very specific set of cards on the goblin side, and I can't imagine that that percentage is higher than 10% (I'll do the exact calculations later). Your sheet says it's 25% of the time and that's just entirely wrong.

Either way, that leaves us with two other games, and the average goblin list also doesn't have enough sideboard cards to deal with combo. As such, just using simple logic, there's something iffy about the results. It's not even randomness or a standard deviation, it's just that something is entirely incorrect about it.

Skeggi

03-30-2011, 09:52 AM

I mean, there's no reason why ANT would lose to goblins mainboard except:
- T3 goblin kills on the play against an opponent that kept a mediocre hand without a T2 kill (which are pretty rare)
- Ad Nauseam failed (which you should never try against goblins, because ill-gotten gains/tutor chain is a much easier kill)
It happens more than you think. I think the reason ANT isn't as popular in America as it is in Europe is that people don't know how to play it and can't watch each other to learn it. Here in Europe people know how to play it and the knowledge spreads. At least, that's a hypothesis I have. I suspect there's a difference if you compare the American data with the European.

Fact is that the American meta differs from the European. That's why I don't really care about 'but your list is bad, the SCG list played this and that' statements.

(nameless one)

03-30-2011, 10:17 AM

AriLax

03-30-2011, 11:52 AM

ANT can also lose if it keeps a 1 lander and gets Lackey->Ported and never sees more land in a timely fashion. Realistically Goblins is probably better off than Zoo in the matchup, but Storm is around an 80% favorite.

On the subject, the Zoo number might be my favorite of these. I once played a set against Cat Sligh Zoo where they were on the play every game and got a free mulligan each game. They got 1 out of 12.

ScatmanX

03-30-2011, 12:20 PM

Somehow I get this feeling that either some decklists are not optimal or the players are lacking the ability to play the deck to its full potential.
But this IS what happens.
Most of the TES/AdN players we face out there are not optimal players, and the lists are not optimal either. That's why the data is like that.
In theory, Combo should beat goblins 80+% of the time, but in real life that's not what really happens.
Of course, if we made a sample where you, AriLax and Bryant were the only combo players, the win percentage of combo against goblins would be way higher than the one posted here...

dahcmai

03-30-2011, 12:31 PM

Kind of hard to get concrete numbers for this due to the player skill level having such a large bearing on it. The Storm vs Lands alone should say a ton. Lands won a game somewhere in there? lol Zoo vs Lands was positive? Same for Goblins vs Lands? Goes to show there are some lousy Storm and Lands players if you're getting numbers like that. Those are practically blowouts with very little chance for one side to win.

Admiral_Arzar

03-30-2011, 01:00 PM

I think all of these considerations are explained by the fact that "the average player, with average skill level, and an average decklist" was used for the testing. Obviously, if Bryant Cook was playing TES for the test there would be very few losses to aggro. However, it's just some average player who probably doesn't have much storm experience. The same goes for Lands.dec and all the other complex lists on there which reward a lot of practice and playskill.

Doomsday

03-30-2011, 01:31 PM

ScatmanX

03-30-2011, 02:30 PM

Are "random player who doesn't know how to play his own deck properly" results worth anything at all though? I don't personally think that they are.
Yes, because that's the average player you'll encounter in tournaments.

Skeggi

03-30-2011, 03:54 PM

I've won against AnT once with Goblins in a game 1. He Duressed me on his turn 1. I happened to draw Lightning Bolt afterwards, and when he Ad Nauseamed himself until he was only at 2, I bolted him. He didn't see it coming.

It's not 100% that AnT will win against Goblins turn 1. He was a pretty competent AnT player too, it just happened that he thought he was safe after seeing my hand full of Goblins. After all, there is still that percentage of games that is based on luck.
I understand what you're trying to do, but you're actually confirming my hypothesis.

How can he be a competent ANT player if he cast Duress on turn 1 without going off? Competent ANT players don't cast Duress on turn 1 (unless they combo off the same turn, of course). And more: why did he try to go off with Ad Nauseam? Against Goblins, you go off with Ill-Gotten Gains 99% of the time. Sounds like a pretty bad ANT player to me.

Doomsday

03-30-2011, 04:29 PM

Yes, because that's the average player you'll encounter in tournaments.

I completely disagree with this. The average tournament TES player is not losing to lands. The average tournament Lands player is not losing 60% of games to Goblins lol.

Skeggi

03-30-2011, 04:55 PM

Yes, because that's the average player you'll encounter in tournaments.
If you're an average player too perhaps. When you usually sit at the higher tables, this is not the case.

TossUsToLions

03-30-2011, 05:24 PM

There is nothing to disagree with. These are the cold, hard statistics. They're not opinions, they are what ACTUALLY happened. There are some bad players piloting decks that require a lot of skill/practice, and there are some really, really good players piloting these decks too. Players of all skill levels MUST be taken into account when compiling this data because these are the people that you will encounter in tournaments. You will see bad players, you will see good players. You will see bad players who play the game of their life, and good players who play the worst that they can possibly play. Everyone makes misplays, and everyone can get runner-runner topdecks to pull out games that they should usually lose.

ScatmanX

03-30-2011, 05:35 PM

If you're an average player too perhaps. When you usually sit at the higher tables, this is not the case.
But in rounds 1-2 you normally don't play the top players yet. And the data take those games into consideration too.
If we take only the Top 8s into consideration, storm probably beats Goblins every time, but that does not happen in the Swiss.

Again: I agree when you say "When you usually sit at the higher tables, this is not the case", but the statistics gathered here took everyone into account, even bad players with poor decks.

Do you only play against professional players at every tournament you attend?

The average tournament Lands player is not losing 60% of games to Goblins lol.
Do you have any data (like GoboLord has) to support this?

Gui

03-30-2011, 05:38 PM

What's up with the huge thread necro?

Anyways, just to point out something relevant: one has to consider that a win percentage isn't exactly synonymous with a huge amount of data, nor with consistent data.

I can agree with the ones arguing that high-level players would lose less than what's stated, but also, you can use these statistics to prove what you already assume, i.e., I assume Belcher beats Gobbos, and hell yeah, the data proves it.

No one should be lazy and use only these statistics anyways...

Zork

03-30-2011, 07:52 PM

If you want a measure of how well an average pilot performs against an average pilot, means/percentages are fine to look at. If you want to look at the performance of a better pilot, you want to look at the shape of confidence intervals and check the upper bounds. Some basic methods for this would be bootstrap/jackknife resampling of the data followed by an examination of the 95% confidence interval.
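As a rough illustration of the bootstrap idea, here is a percentile bootstrap for a single matchup's win rate in plain Python. It is a sketch under stated assumptions, not the method used for the posted numbers: the thread only gives percentages, so the 46 wins in 100 games below are made-up counts, and the name `bootstrap_ci` and its parameters are chosen for this example.

```python
import random


def bootstrap_ci(wins, games, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for a win rate.

    Resamples the observed game results with replacement and reads the
    interval off the sorted resampled win rates.
    """
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    results = [1] * wins + [0] * (games - wins)
    rates = sorted(
        sum(rng.choices(results, k=games)) / games
        for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lower = rates[int(tail * n_resamples)]
    upper = rates[int((1 - tail) * n_resamples) - 1]
    return lower, upper


# Hypothetical sample: 46 wins in 100 games (counts are NOT from the thread).
low, high = bootstrap_ci(wins=46, games=100)
print(f"46% observed, 95% interval roughly {low:.0%} to {high:.0%}")
```

With only around a hundred games the interval is wide, which is one reason a 46% and a 54% matchup may not be distinguishable in practice.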

Now, if you want to start predicting the winner of a given matchup, a general classification of each deck is probably not going to be sufficient for any model. Decklists, die rolls, current standings (maybe), etc. would need to be rolled in there. It would be possible, but really hard to scrape together.

Skeggi

03-31-2011, 04:10 AM

There is nothing to disagree with. These are the cold, hard statistics. They're not opinions, they are what ACTUALLY happened.
Except that the discussion is not about whether the facts are good or not, it's about whether they are relevant. American meta facts are a lot less relevant in Europe. Data about bad players is less relevant at high tables.

But in rounds 1-2 you normally don't play the top players yet. And the data take those games into consideration too.
If we take only the Top 8s into consideration, storm probably beats Goblins every time, but that does not happen in the Swiss.

Again: I agree when you say "When you usually sit at the higher tables, this is not the case", but the statistics gathered here took everyone into account, even bad players with poor decks.

Do you only play against professional players at every tournament you attend?

So I meet the 'average Joes' in rounds 1 and 2. Chances are I can outplay them even if the match-up is bad. So the list is less useful here. Later, when I'm at the higher tables, I have to play better players, but the list also takes bad players into consideration, so still, the list is less useful here. Or I completely flunk and go 0-2 drop; in that case too, the list is less useful.

practical joke

03-31-2011, 04:55 AM

If you talk about match-up percentages you want the ultimate list (mono-red goblins and B/r goblins are different decks, for example, and so are ANT with and without Chant).

With the ultimate list, you want results from competent top players who know how to play the deck without any mistakes. (Sure, in hindsight everyone makes mistakes or takes a bad gamble, but competent players can tell the two apart and know when to gamble.)

At that point you get stone-cold match-up results.

- Bad or mediocre players influence the results (which makes the test/results invalid)
- Good players with no affinity for the deck also influence the results (they make less optimal plays; this changes the result only slightly, but enough to make it invalid)

You want to have the top players play the deck.
For example, I know that Cenarius has been playing Tempo ***** decks for years, getting his results as well.
You want a player like that to play against Bryant Cook (TES) and see what happens after 10 games preboard (at that point you only have a preboard match-up percentage).

If you want the overall match-up, you should take 4 preboard matches and 6 postboard matches.

dahcmai

03-31-2011, 10:45 AM

ScatmanX

03-31-2011, 10:57 AM

Data about bad players is less relevant at high tables.
Now with this I agree.
A scope of just top 8 data would be more useful to us.

Admiral_Arzar

03-31-2011, 11:43 AM

I was just more curious how some Tendrils player failed badly enough to lose to lands. lol Maybe this time, fizzle, maybe this time, fizzle, maybe this time, fizzle, maybe this time, damn targeted myself with the tendrils. That's too funny.

I once lost to lands playing Pact SI by fizzling on draw-4's and losing a lot of life. Around fourth or fifth turn, before I could rebuild to go off again, he cast Scapeshift for Valakut and a bunch of other lands with Prismatic Omen in play.

practical joke

03-31-2011, 12:23 PM

I once lost to lands playing Pact SI by fizzling on draw-4's and losing a lot of life. Around fourth or fifth turn, before I could rebuild to go off again, he cast Scapeshift for Valakut and a bunch of other lands with Prismatic Omen in play.

With a few good hands, lands can win on T4-T5 with Scapeshift (T5 is very possible),
while wasting your manabase etc. in the meantime.

I almost lost once due to bad draws. (drawing into ill-gotten gains + tendrils eventually gave me the win, but it wasn't funny though)

bracer028

03-31-2011, 05:42 PM

How is Mono Black stronger against Zoo than Eva Green? It basically uses the exact same cards.

dahcmai

03-31-2011, 06:17 PM

It's probably talking more about that mono-B control deck. That one is pretty strong against Zoo.

I had forgotten about that Scapeshift version of Lands. That one is actually fairly fast. I can see having a bad enough draw against that one to accidentally get nailed. That and mull to oblivion. I've done that a couple of times.

GoboLord

04-21-2011, 05:30 AM

Hey there,

I'm very glad that you guys took the time to read my article (I didn't even know that it still existed on these boards because it didn't get much feedback).

It's very interesting to follow your discussion, so please go on =). What you should keep in mind is that the data were collected in 2010 (and may therefore be somewhat outdated).
My aim in this article was to give players a more distant (and objective) view on game outcomes. Also, this article is very "hard and cold", as one of you stated. That's in part because it's difficult to interpret the results properly when so many factors (you listed all of them) are not taken into consideration (like skill with the deck, skill as a player, experience with the format, physical and mental condition).

The results tell us what Mr. Average is and are therefore by no means applicable to your particular decklist, playing style and skill. They are worthless without some thought put into interpretation. Or as a user in this forum puts it:

I suspect the real problem is that "Mr. Average" is pretty good with some decks (like goblins), but not all that good with others (like storm combo). (As anecdotal evidence, my TES vs. Goblins matchup improved dramatically after goldfishing a thousand or so times. It takes a lot of practice to make sure you don't lose to yourself, let alone hate. That's not nearly as much of an issue with goblins, where it's pretty straightforward to win if your opponent screws themself.) I suppose that's not necessarily a problem, though, as long as you understand that you're measuring matchups between all players, rather than particularly skilled players with a given deck.
And that is what we need to do when we want to apply them. Let me tell you something from a psychologist's point of view at the end of this part: people tend to overestimate themselves in fields they value highly. This effect is known as the “above-average effect”. To demonstrate this effect, researchers interviewed male drivers in hospital after they had a car accident. 80% of those participants (all of them male, still injured though recovering from the crash!) rated themselves as better than the average car driver.

4. Contributions and limitations

As I said right at the beginning: This article does not hold absolute truth about legacy. Neither do I claim that all of my methods are 100% perfect. In this last part I will discuss pros and cons of my analysis.

What does the article contribute?

This article is about averages - nothing more and nothing less. It tells us what the MUs of certain decks look like for average decklists, playing styles and skills. This article reveals some rather surprising results that are far from the estimates players actually give. This might help to reduce wrong impressions and ratings when discussing strategies against certain decks. It is designed to push thoughts about playtesting, tournaments, deck and sideboard construction away from subjective opinions and to give a more distant view on MUs in general.