Brainstorm
Force of Will
Lion's Eye Diamond
Counterbalance
Sensei's Divining Top
Tarmogoyf
Phyrexian Dreadnaught
Goblin Lackey
Standstill
Natural Order
TBH I don't care enough about MTG anymore to even bother.
The fact that it needs the community to collect and process the data instead of WotC is not helping.
As a physicist I just (should) know more about statistics, and it bugs me when people use them wrong.
Back on topic, it's striking that even though monke decks are the most played and the meta is tailored for that, at least UR is still at >=50% WR.
?
I still think that is a bad assumption.
There's no reason to assume that any matchup besides the mirror is 50%.
Why even assume anything in the first place?
You just present the matchup data as it is, done.
If you really are a physicist, you should know the definition of the null hypothesis. You always assume no difference and try to prove otherwise. From thermodynamics you should also know the principle of informational entropy: the null has to be the same as assigning labels at random. Otherwise you would conclude that snow basics D&T having a 25-24 head-to-head win rate vs non-snow D&T is telling you something.
H_0 is in the vast majority of cases p = 1/(# of options), in this case 2 (W or L).
Fully agreeing with Zoid here.
You present the data, transformation of said data is useful only if it is more informative.
Here I do not see it:
600-400 is more meaningful than 60-40, itself better than 6-4.
I do not see what statistical treatment you could do that would make it more easily understandable by the audience: Magic players who do understand what a MU is, both in terms of the result and the reliability of said result.
The field of statistics cannot answer that. There are formulas that tell you whether, from some data, you can pin the probability down to a given range with a given confidence.
So here you could say that there is >95% probability that deck 1 has a win rate between 55% and 65% over deck 2.
It would still be reduced data, ie less information than the actual numbers, eg 600-400 (numbers not corresponding to above statement).
It is very useful to do statistical treatment, but only if it gives you a faster, better understanding. I do not think that it is the case here.
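For reference, the "range with a given confidence" statement above can be sketched as follows. This is a minimal sketch, not anything from the thread: it assumes the Wilson score interval (one of several common choices), and the function name `wilson_interval` and the default z = 1.96 (roughly 95%) are picked here for illustration.

```python
from math import sqrt

def wilson_interval(wins, losses, z=1.96):
    """Approximate confidence interval for a win rate (Wilson score interval).

    z = 1.96 corresponds to roughly 95% confidence (an assumption of this sketch).
    """
    n = wins + losses
    p = wins / n                      # observed win rate
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Same 60% win rate, very different interval widths.
for w, l in [(6, 4), (60, 40), (600, 400)]:
    lo, hi = wilson_interval(w, l)
    print(f"{w}-{l}: {lo:.3f} to {hi:.3f}")
```

The interval for 600-400 sits well inside 55% to 65%, while the one for 6-4 spans most of the plausible range, which is the same point the raw W-L numbers make implicitly.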
600-400 is obviously more meaningful than 6-4. But let's look at less extreme cases.
If HomeBrew.dec goes 6-4 vs Delver, does that mean your homebrew is favored against Delver or was that just a lucky streak? Maybe the matchup is about even? Maybe it's unfavored? (Those are the most common matchup classifications players use)
Maybe players would intuitively know that's too few games and they need to test more (although some take a single League 5-0 as proof a deck is good, so you never know). But what if that result was 12-8? 18-12? 24-16? 60-40? That's more than 6-4, but is it enough? At what point is it enough games to be reasonably sure HomeBrew is favored against Delver? That's not easy to intuitively know from looking at the raw results. And that's where a statistical treatment adds value. If you do a 1-tailed test with null 50%, it basically tells you whether you had enough games to conclude the matchup is favorable (technically you're rejecting that the matchup is even or unfavorable, but close enough).
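To make the "at what point is it enough" question concrete, here is a minimal sketch of the exact 1-tailed binomial test with a 50% null, run on the records listed above. The function name `one_tailed_p` is made up for this example, and it uses only the standard library.

```python
from math import comb

def one_tailed_p(wins, losses, null=0.5):
    """Exact one-tailed p-value: P(X >= wins) for X ~ Binomial(wins + losses, null)."""
    n = wins + losses
    return sum(comb(n, k) * null**k * (1 - null)**(n - k)
               for k in range(wins, n + 1))

# Every record below is a 60% win rate; only the number of games changes.
for w, l in [(6, 4), (12, 8), (18, 12), (24, 16), (60, 40)]:
    print(f"{w}-{l}: p = {one_tailed_p(w, l):.3f}")
```

For 6-4 the p-value is about 0.38, so a record like that is entirely expected from an even matchup. For 60-40 it drops below 0.05.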
It shouldn't come at the cost of presenting the real data. Sometimes people report only a p-value without presenting any of the actual data, but that isn't the only way to present it. You can show both.
2 Examples:
1) 60%*
(N=60)
2) 36-24*
That's clean and simple and still has no information loss from the original data. Both contain enough to tell you that 60 matches were played, 60% were wins, 40% were losses, overall result of 36-24, AND that the matchup was favorable at some standard level of statistical significance you can mention outside the table (e.g. alpha=5%, alpha=10%). The statistical treatment adds value to the result. It tells a player that was "enough" data to classify that as favorable, while 6-4 isn't enough.
Or you could color-code the cells
Green = Favorable (statistically significant at X% confidence)
Yellow = About even (not statistically different from 50-50 at X% confidence)
Orange = Unfavorable (statistically significant at X% confidence)
That should be easy to digest and does tell you more than just the matchup data without any further treatment.
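A rough sketch of how the color-coding could be computed per cell. The assumptions are mine: each tail is tested separately against a 50% null at level `alpha`, and the names `tail_at_least` and `classify` are hypothetical.

```python
from math import comb

def tail_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def classify(wins, losses, alpha=0.05):
    """Return the cell color for a W-L record against a 50% null.

    Each direction is checked with its own one-tailed test at level alpha
    (an assumption of this sketch, not a fixed convention).
    """
    n = wins + losses
    if tail_at_least(wins, n) < alpha:
        return "green"    # favorable
    if tail_at_least(losses, n) < alpha:
        return "orange"   # unfavorable
    return "yellow"       # not distinguishable from 50-50

for record in [(6, 4), (60, 40), (40, 60)]:
    print(record, classify(*record))
```

The W-L record can stay in the cell as text, so the color only adds information.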
I wrote "has", not "had"?
But my question stands: how is decreasing the information, and giving a range + confidence for a given MU, a clearer picture than the actual data, i.e. win-loss?
Edit: FTW answered meanwhile. In the example above, I see option 2) as an easier readout than option 1). I still do think that W-L is a better representation, cleaner and simpler, than adding some arbitrarily chosen confidence interval. It is just discretizing the confidence, rather than keeping a continuum.
Because you can't record all data, range + confidence also validates how good the data even is. It also provides insight into the greater population, while the recorded results only provide information about the sample they are a part of.
This is what statistics is. It's about understanding the greater population of matches given a sample.
Recording all data or not has no influence here.
You have recorded data, in the form of W-L.
You do not get more or better data by performing whatever treatment you want on it, you are only modifying the representation of said data.
Settling on a given probability threshold, likely 90% or 95%, is simply discretizing the confidence, which depends on the sample size, which you already see from W-L in a pseudo-continuous fashion.
I presented both because I think players may have different opinions on this when the numbers get messier. For 60-40, it's simple to do the mental math for win% and total number of matches, so the cleaner presentation is sufficient. If it was 47-36, the mental math to get win % is more of a burden on the reader, especially if there are 100+ cells in the table.
The 2nd is cleaner, but the 1st gives easier access to different information. It depends on which is of more interest. But I agree it should be done in a way without information loss.
It's also establishing a consistent benchmark for all cells, based on a fixed probability threshold instead of the difference between W and L. Otherwise this is not intuitive looking at W-L with different numbers of matches played in each cell.
I still don't know why you're so hung up on hypothesis testing.
There is no reason to assume anything.
You just present the data and that's it.
What I was initially suggesting was how to give an uncertainty to the win rates.
Here we either take the frequentist approach or use Bayesian statistics where we need a prior.
That's where you can start to assume things which need to be well motivated and it depends on what you want to show.
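As an illustration of the Bayesian route with an explicit prior, here is a minimal sketch. The assumptions are mine: a Beta prior on the win rate (uniform Beta(1, 1) by default) and simple midpoint integration of the posterior density. The function name `prob_favorable` is hypothetical.

```python
from math import exp, lgamma, log

def prob_favorable(wins, losses, prior_a=1.0, prior_b=1.0, steps=20000):
    """Posterior probability that the true win rate exceeds 50%.

    With a Beta(prior_a, prior_b) prior, the posterior after a W-L record
    is Beta(prior_a + wins, prior_b + losses). The density is integrated
    over (0.5, 1) with the midpoint rule.
    """
    a = prior_a + wins
    b = prior_b + losses
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)  # log of 1/B(a, b)
    width = 0.5 / steps
    total = 0.0
    for i in range(steps):
        x = 0.5 + (i + 0.5) * width   # midpoint, strictly inside (0.5, 1)
        total += exp(log_norm + (a - 1) * log(x) + (b - 1) * log(1 - x))
    return total * width

for w, l in [(6, 4), (36, 24), (60, 40)]:
    print(f"{w}-{l}: P(win rate > 50%) = {prob_favorable(w, l):.3f}")
```

Changing `prior_a` and `prior_b` is where the "well motivated" assumptions come in: a different prior shifts the result most when the number of matches is small.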
There's no reason not to present the data and also test it, showing both. If some don't trust the testing, they can ignore that part, but for those who do they are given more rather than less.
Whether you use hypothesis testing or Bayesian methods, both make similar assumptions (prior or null). 50% is reasonable because players tend to classify matchups as:
Favorable
Even
Unfavorable
A null of 50% allows you to do that. A null of 35% could tell you your deck has >35% win rate against Delver, but that's not how most players want to think about their matchup info, at least not before knowing if it's favorable or not, so the result of that test has less practical value. You can always test different nulls afterwards. 50% makes sense as a starting point.
This is a 2 player 0-sum game with a lot of chance. If neither player has an edge from the deck construction, you expect 50-50 odds by default. If your data contradict that, it tells you one deck is favored over the other.
(Player skill is a more relevant factor if you include LGS weeklies with a lot of new players, but if this is ripped from top tournament results then most players are good at their deck)
Seriously, this is pretty dumb.
On one hand, replacing the W & L numbers with the MLE of the win rate (pmle = W/(W+L)) plus a second value to represent uncertainty (like the width of the 95% confidence interval for pmle, or the quasi-std sqrt(pmle*(1-pmle)/n) with n = W+L (*)) doesn't reduce the available information, since from those two values you can reconstruct both W & L.
On the other hand, you don't need to assume anything to establish those. There is no hypothesis to make or test against. It's simple mle.
(*) I'm saying quasi-std as this is improper; the only actual std is sqrt(p*(1-p)/n) where p is the actual value of the parameter. But:
- this doesn't change the fact that this allows the reconstruction of original W & L numbers if one so desires,
- this still does quite adequately match expectations / will properly represent what the standard deviation of the process is, a) given that real matchups never go outside 0.2-0.8 for p, and b) as long as you don't go out of your way to use it wrong, ie if you have like only 5 matches.
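A quick sketch of the reconstruction claim, using the pmle and quasi-std formulas above. The function names `summarize` and `reconstruct` are made up for this example, and the rounding step assumes the two values are stored with enough precision.

```python
from math import sqrt

def summarize(wins, losses):
    """Return (pmle, quasi_std) for a W-L record."""
    n = wins + losses
    p = wins / n
    return p, sqrt(p * (1 - p) / n)

def reconstruct(p_mle, quasi_std):
    """Recover (wins, losses) from pmle and the quasi-std.

    Solves quasi_std**2 = p*(1-p)/n for n, then wins = p*n.
    Breaks down if wins or losses is 0, since the quasi-std is then 0.
    """
    n = round(p_mle * (1 - p_mle) / quasi_std**2)
    wins = round(p_mle * n)
    return wins, n - wins

p, s = summarize(47, 36)
print(f"pmle = {p:.3f}, quasi-std = {s:.3f}")
print(reconstruct(p, s))
```

Note that the round trip only works from the unrounded values; the 3-decimal figures printed for display are not enough to recover W & L in general.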