Anyone interested in assembling a giant matchup grid as a Source project? [Archive]

lordofthepit

05-12-2010, 12:32 PM

There are a lot of great decks in Legacy, and there are a lot of great innovators on the Source who have taken the time to share their tech and to write great primers. However, one thing that I've noticed which tends to be consistently a problem are the analysis of various matchups. Inevitably, I think people who write primers tend to exaggerate the performance of the deck they are playing, sometimes intentionally, but often unconsciously.

Some of them read like this: "Our matchups are all overwhelmingly in our favor, except for ANT, which is 40/60 in game 1, because we don't run any countermagic or disruption, but that becomes 80/20 in our favor in games 2 or 3 after we side in 4 Mindbreak Traps. Expect this to dominate the format when it catches on."

I was wondering whether it would be a good idea to create a matrix/grid of the 15-20 most played decks in the format, plus sub-variants of each (so, for instance, NO-Pro CounterTop, Supreme Blue, Nassif CounterTop, etc.; Mono-R Goblins, R/b Goblins, R/g Goblins) to indicate whether a particular matchup was very favorable, slightly favorable, even, slightly unfavorable, or very unfavorable. If there is interest in this, we could implement it either with a Wiki, or maybe with a giant thread specifically for posters to comment on matchups. Would you guys be up for it?

Vacrix

05-12-2010, 01:05 PM

This is a great idea. It would significantly help out players looking for 'what to play' in their meta game. If we do this, it should probably be a source exclusive thread.

As far as the objectivity is concerned, how is that to be done? The authorities of these decks get way better results than random people who pick up the deck so naturally they look exaggerated (not that some aren't). Whats the best way to go about fixing these problems?

stuckpixel

05-12-2010, 01:23 PM

The problem in doing this, is that you'd need to get players who are experts with their respective decks to dedicate to playing a large number of games to get some statistical information.

If you just pick two or three random players - they won't get all the nuance of a number of decks, so the stats won't really be valid.

RogueMTG

05-12-2010, 01:33 PM

To have something like this be truly valuable (as in, to not have arguments flare up because "your deck is considered bad") I think we would need to start on the data collection side.

I'm not sure the best way to go about it. You'd have to collate all of the deck archetypes, match-ups, and win-losses and then analyze them over a long period of time, that's an incredible amount of work.

<magicalchristmasland>
...I'm thinking automate it with a web application. Sort of like deck-check except you file decks under pre-existing archetypes (or create a new one if one doesn't exist), and you would have to enter all of the results of each tournament instead of just the top8. Someone would need to collect all of the deck lists for each event, identify & categorize them, and then follow & enter the results through-out the tournament.

Then we'd be able to automagically calculate the percentages of archetypes versus other archetypes based on real tournament data that comes from more than one person.
</magicalchristmasland>

Julian23

05-12-2010, 01:37 PM

Ok, I'll just start with UW Tempo which seems to be 80/20 against the entire field.

No, seriously. I really doubt such a project would be of any value unless you put more time and work into it that anyone, even the pro players, has ever done with regards to Magic. My adivce is to just have three categories for such a project:

> has a real favorable matchup
= is largely on par
< has a real unfavorable matchup

I'm speaking in categories of what other people like to describe as "70/30" (maybe eben 80/20) but surely not "60/40". The later clearly indicates "is largely on par" to me.

Mark Sun

05-12-2010, 01:41 PM

To have something like this be truly valuable (as in, to not have arguments flare up because "your deck is considered bad") I think we would need to start on the data collection side.

I'm not sure the best way to go about it. You'd have to collate all of the deck archetypes, match-ups, and win-losses and then analyze them over a long period of time, that's an incredible amount of work.

<magicalchristmasland>
...I'm thinking automate it with a web application. Sort of like deck-check except you file decks under pre-existing archetypes (or create a new one if one doesn't exist), and you would have to enter all of the results of each tournament instead of just the top8. Someone would need to collect all of the deck lists for each event, identify & categorize them, and then follow & enter the results through-out the tournament.

Then we'd be able to automagically calculate the percentages of archetypes versus other archetypes based on real tournament data that comes from more than one person.
</magicalchristmasland>

Haha, "automagically." That's wonderful.

I think this is a great idea, but it can be difficult, as there's always the likelihood of error, for example forgetting to enter data or missing events completely. So I guess the question we are presented with is how "complete" this matchup grid can be versus how efficiently relevant data can be gathered and entered.

It will be a time consuming project; I am by no means an expert in any of the decks that I play, so it will depend on who wants to volunteer their time to sit down and do the testing.

Tacosnape

05-12-2010, 01:50 PM

Some of them read like this: "Our matchups are all overwhelmingly in our favor, except for ANT, which is 40/60 in game 1, because we don't run any countermagic or disruption, but that becomes 80/20 in our favor in games 2 or 3 after we side in 4 Mindbreak Traps. Expect this to dominate the format when it catches on."

I was wondering whether it would be a good idea to create a matrix/grid of the 15-20 most played decks in the format, plus sub-variants of each (so, for instance, NO-Pro CounterTop, Supreme Blue, Nassif CounterTop, etc.; Mono-R Goblins, R/b Goblins, R/g Goblins) to indicate whether a particular matchup was very favorable, slightly favorable, even, slightly unfavorable, or very unfavorable. If there is interest in this, we could implement it either with a Wiki, or maybe with a giant thread specifically for posters to comment on matchups. Would you guys be up for it?

You've already described why this won't work. Nobody can agree what the actual matchup percentage of a deck is. And there isn't enough data to support it accurately because you can't take into account player skill and the difficulty of a deck to pilot. If ANT goes 2-8 against Merfolk, does that mean Merfolk's 80-20 against ANT? Or did ANT lose 4-5 of those matches due to the fact that half the people who play the deck can't pilot it for shit?

Plus, you have decks like Dragon Stompy and Stax, which have an absolutely insane variation in matchup percentages based on whether or not they win the die roll, so you have to factor the die roll into the match and have two separate categories for it.

lordofthepit

05-12-2010, 01:52 PM

I wasn't planning on having players run 100 samples or conducting a meta-analysis or anything. I think for the most part, we can reach a consensus on whether A is "highly favorable" against B, "somewhat favorable", or "roughly even". It's just that sometimes you'll see conflicting opinions (i.e. ANT vs. Merfolk), which is certainly both player and variant dependent. Maybe if we broke it down by subtype, we'd get more meaningful results (i.e. U/B Saito ANT has a "roughly even" matchup against Merfolk, whereas UBw ANT has a "somewhat unfavorable" matchup, although I'm not sure this is actually the case).

You've already described why this won't work. Nobody can agree what the actual matchup percentage of a deck is. And there isn't enough data to support it accurately because you can't take into account player skill and the difficulty of a deck to pilot. If ANT goes 2-8 against Merfolk, does that mean Merfolk's 80-20 against ANT? Or did ANT lose 4-5 of those matches due to the fact that half the people who play the deck can't pilot it for shit?

I don't intend to be extremely rigorous with gathering statistical data, just relying on the experience of most Source players, who I assume are competent Magic players but not necessarily amazing technical players like Kai Budde. Obviously, "relying on experience" is prone to error, but at least by making this a community project, if someone says "X shits all over Y", this would be a visible statement for Y players to refute that statement, whereas that would not be the case if it were posted only on the X thread.

There will eventually be some disagreement, but at some point, someone who bears responsibility will say "enough is enough, based on my experience and the consensus of most of the Sources, I'm going to say that X has a roughly even matchup with Y". It might sound like I'm volunteering myself by putting this suggestion out there, but I actually don't have enough experience playing in real tournaments, so I wouldn't feel comfortable with that responsibility. If anyone is willing to do this though, I think it would be great, and I don't think it will take a whole lot of time (because it wouldn't need to involve a lot of statistics)--certainly less time than it would take to organize one of the Source tournaments.

kicks_422

05-12-2010, 06:06 PM

Instead of 60/40 and 80/20, just do favorable, even, unfavorable. Maybe even highly favorable or highly unfavorable, if that can be done without bias as well.

Of course the first step would be to take a look at the deck primers to get some information there on match-ups.

Cthuloo

05-13-2010, 04:42 AM

It's not really clear to me if you want to collect actual data or just general opinions/experience. In the first case, I'll be very excited, even though the task looks very hard. A possibility to have a decent average data set is to only collect tournament data, having people involved in the project posting their results (and possibly even if they won the die roll or not).

Jak

05-13-2010, 10:24 AM

Another big problem is variance in lists. I just see a lot of bitching about lists which won't do much good.

I still like the idea though. If it could be recorded too that would be great. A ton of work to collect all of this from different people though.

pi4meterftw

05-13-2010, 10:32 AM

It would be a matchup grid, and people would know about statistical fluctuations. Just compile some tournament data, and if you're worried about the statistically illiterate, just mention that it's not fail-safe.

Forbiddian

05-13-2010, 02:50 PM

Another big problem is variance in lists. I just see a lot of bitching about lists which won't do much good.

I still like the idea though. If it could be recorded too that would be great. A ton of work to collect all of this from different people though.

Correct: An optimized list does as much as 5-10% against the field better than a standard list (or 20% worse). And extremely good pilots can add an extra 10% to that.

The bottom line is what I've learned: No matter what you do, nobody believes your data unless it fits into their own paradigm. Even if you don't have an ulterior motive, e.g. Stephen Menendian's analyses (widely supported by the Vintage community as defining the metagame) are criticized without much data support for the counterclaims here on the Legacy boards. Even stuff for which there's overwhelming data is still sidelined. Sticking with Menendian's analysis:

Dredge: Stick a fucking fork in it. ANT: Overrated. Zoo: Playable.

But then people look at the conclusions, realize the conclusions don't mesh with their perception of reality, and immediately think that their own anecdotal experience is much more accurate than a shitton of empirical data. Instead of changing their perception of reality to reflect the data, they change the data to reflect their perception of reality.

"Oh, you think Dredge did poorly? Yeah, well, your data doesn't account for player skill. Clearly everyone playing dredge is worse than the average player and that accounts for its poor performance better than the ridiculous idea that Dredge is ACTUALLY BAD. You didn't account for the possibility that Dredge players are systematically worse, therefore data is useless.

ANT has no bad matchups. This is fact. So what if it consistently splits or loses to most of the format? It's probably because ANT players are bad and not because the deck has any vulnerabilities other than bad play. You didn't account for the possibility that ANT players are systematically worse, therefore data is useless.

Zoo is terrible. It autoscoops to ANT. So what if data shows that it has actually been splitting/only slightly losing to ANT? It's definitely because ANT players made mistakes. Oh, and if you think that the ANT players will continue to make the same mistakes, thus making Zoo a good bet, think again: ANT players will begin playing perfectly really soon. You didn't account for the fact that ANT players couldawouldashoulda played better, therefore data is useless."

To *SOME* extent, the counterarguments hold water, but the hidden claim is, "You have a lot of data, but I have some feelings that I got because I played this deck in a tournament once or twice (or maybe I didn't even do that). There's a huge discrepancy between the two, but I choose to believe my anecdotal evidence."

It's only when the evidence meshes with their reality that people buy into it. There are just too many autopilot red herrings out there for people to see data as more useful than individual testing in someone's basement. Clearly infallible because there aren't any confounding variables associated with one or two playtests against your circle of friends.

So no, I don't think the matchup grid is going to help many people or even any people, since the open-minded data miners like Jak. Menendian, and others do the mining themselves, and the vast majority of other people wouldn't believe it anyway.

I mean, look at all the arguments after the Attacking Is Miserable article, and that was basically entirely over whether the ANT/Zoo matchup should be written up as 80/20 or 60/40.

"Attacking is miserable. It's 80/20.
Data says 60/40.
I tested it. 80/20.
In tournaments, ANT players get tired. 60/40.
In tournaments, ANT players suck. 80/20.
Boobs ibtl"

Repeat for every single matchup.

Nidd

05-13-2010, 03:24 PM

I don't know what people gain from reading 80/20 instead of 60/40. Maybe I'm not enough of a matchuppercentagewhore (excuse my harsh words, but some people here hold to their matchup-% like to some sort of holy book and act as if it was insulted), but I'm confident with knowing what my opponent packs and whether I'm unfavored, favored or the matchup is even. i don't care whether the MU is 60/40 or even 90/10, I'm favored and bad draws can still fuck this up.
Now, say, you play ANT against Lands, you're like 99/1 G1 but your deck just shits on you and you draw nothing relevant and can't find anything, so Lands just smahes your head in. What did your 99% do that match? Right, nothing. They were about as relevant as knowing whether you're favored or not.

And the time some people waste to argue over the matchup%es could be used much better.
Like, for sleeping, working or even jerking off.

Damn, I just had to get this off my chest.

Cthuloo

05-13-2010, 04:48 PM

Well, it should be quite obvious that almost every percentage thrown in matchup descriptions is made up only form sensation or anedoctal evidence. Nobody has enough data to make a decent statistic. Even supposing no variance on standard deck lists, there aren't definitely enough recorded results to have a precise conclusion.

People tend to think to know extensively a matchup they have played 30 times. Hell, say even 50. Let's say you played 50 games against a deck and end with 30 wins and 20 losses. Is the matchup 60/40? Maybe. But it's within one standard deviation to be unfavorable. To be 99% sure that it's not unfavorable you should play around 1000 games.* Scientific discoveries are usually claimed at 5 standard deviations which means you have to play something like 5000 games. And this is assuming constant deck lists and player skill.

Nevertheless, it think some data taking and analysis could be really interesting.

*It's probably a decent assumption to take the distribution to be gaussian, if the number of games played is big enough

pi4meterftw

05-14-2010, 07:35 AM

I don't know what people gain from reading 80/20 instead of 60/40. Maybe I'm not enough of a matchuppercentagewhore (excuse my harsh words, but some people here hold to their matchup-% like to some sort of holy book and act as if it was insulted), but I'm confident with knowing what my opponent packs and whether I'm unfavored, favored or the matchup is even. i don't care whether the MU is 60/40 or even 90/10, I'm favored and bad draws can still fuck this up.
Now, say, you play ANT against Lands, you're like 99/1 G1 but your deck just shits on you and you draw nothing relevant and can't find anything, so Lands just smahes your head in. What did your 99% do that match? Right, nothing. They were about as relevant as knowing whether you're favored or not.

And the time some people waste to argue over the matchup%es could be used much better.
Like, for sleeping, working or even jerking off.

Damn, I just had to get this off my chest.

So you don't care about the win % just as soon as it's >50%? What's so special about 50.000000000001%?

And yeah, you lose 1% of the time in 99/1. So then you look back and go: I lost, therefore it couldn't be 99/1 lolololol. The way statistics works is unless the probability is 0, it might happen. Your whole post sounds so uneducated it doesn't even make it seem like you tried to pretend you knew what you were talking about.

Concerning standard deviations, Gaussians and stuff:

I don't get why everybody's so anxious to have an excuse to ignore data. It's not perfect, the standard deviation is a bit high. (But this isn't even always the case, as with the testing I've done, as well as perhaps that of Menendian.) The only possible reason I can conceive of that people are so eager to not listen to data is the one provided by Matt.

Like hearing claims about stdevs, etc. is getting to the point of sounding like obsessive nitpicking. Why hasn't anybody mentioned 2nd order probabilities? (Probabilities that your probabilities are correct) or 3rd order probabilities, etc.?

Why hasn't someone suggested that we take an l2 sum of all these nth order probabilities from n=1 to infinity?

Yes, the data can't be trusted with 100% confidence. You get back to me when you have perfect data, but until then I want more than 0 information about the decks in the metagame.

Rico Suave

05-14-2010, 07:40 AM

Nobody is even discussing the actual match-up percentages yet, and already people are jumping down each other's throats. Just imagine what would happen if *gasp* someone disagreed with your match-up grid?

This idea is noble but it's not practical. There are already match-up grids that exist from actual tournament data, and people have a hard enough time looking at those let alone ones that come from sources which aren't nearly as trustworthy.

Nidd

05-14-2010, 08:04 AM

So you don't care about the win % just as soon as it's >50%? What's so special about 50.000000000001%?

And yeah, you lose 1% of the time in 99/1. So then you look back and go: I lost, therefore it couldn't be 99/1 lolololol. The way statistics works is unless the probability is 0, it might happen. Your whole post sounds so uneducated it doesn't even make it seem like you tried to pretend you knew what you were talking about.

Concerning standard deviations, Gaussians and stuff:

I don't get why everybody's so anxious to have an excuse to ignore data. It's not perfect, the standard deviation is a bit high. (But this isn't even always the case, as with the testing I've done, as well as perhaps that of Menendian.) The only possible reason I can conceive of that people are so eager to not listen to data is the one provided by Matt.

Like hearing claims about stdevs, etc. is getting to the point of sounding like obsessive nitpicking. Why hasn't anybody mentioned 2nd order probabilities? (Probabilities that your probabilities are correct) or 3rd order probabilities, etc.?

Why hasn't someone suggested that we take an l2 sum of all these nth order probabilities from n=1 to infinity?

Yes, the data can't be trusted with 100% confidence. You get back to me when you have perfect data, but until then I want more than 0 information about the decks in the metagame.
You say the data won't be complete.
You say the data can't be trusted 100%.
There are more things that influence the MU-% than just the archetypes paired against each other.
Why should we start with that grid, then? It's a waste of time in my eyes. But well, go ahead if you want, can't hold you back I guess.

And yes, my post may have siounded uneducated. Guess what, neither did I study Math nor is it relevant to my interests. So yes, I know shit about Gaussians and stuff - worsened by the fact that english isn't my mothertongue and I don't know what all that stuff we learnt in school means in english.

DrJones

05-14-2010, 08:14 AM

Moreover, people do change their decklists before tourneys to have better chances against certain decks they expect to encounter.

pi4meterftw

05-14-2010, 03:46 PM

Yes, as I said you are acting as if the data is now worthless, but all this means is the data is not a pristine reflection of ultimate knowledge.

I'd rather know something than know nothing. If after 50 games, you choose to pretend you know nothing then that's too bad. I will note that in the way of getting good data, the differential you get for playing the next game in accuracy is less as the number of played games increases. So at 10 games, I'm pretty interested in seeing the next game, to get a better statistical read. At 100, much much less so.