Brainstorm
Force of Will
Lion's Eye Diamond
Counterbalance
Sensei's Divining Top
Tarmogoyf
Phyrexian Dreadnought
Goblin Lackey
Standstill
Natural Order
I don't understand how Delver's matchup against Bant is part of the same experiment as Delver's matchup against Death and Taxes. As long as you make sure you aren't double counting, each matchup should be independent. You are testing the null that each deck is 50% against each other deck; and if you compare 15 decks against 15 decks, that is 15*14/2 = 105 separate hypotheses. So your alpha of 5% turns into a FWER of about 99.5%.
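A quick sketch of that multiple-testing inflation (assuming the 105 matchup tests are independent, which is an approximation since decks share games):

```python
# Family-wise error rate for m independent tests at per-test alpha:
# P(at least one false positive) = 1 - (1 - alpha)^m
alpha = 0.05
m = 15 * 14 // 2          # 105 pairwise matchups among 15 decks
fwer = 1 - (1 - alpha) ** m
print(round(fwer, 4))     # -> 0.9954, i.e. roughly 199/200
```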
We have the deck overall win rate graphs elsewhere.
Exactly, and that has a 1-in-20 chance of happening by chance with normal alpha levels/confidence intervals. If you do 105 independent versions of that, you now have a 199/200 possibility that at least one happened by chance.
It has to be the null; otherwise you could say the fact that the mirror wins 50% of its games tells you something significant, which is absurd.
I was thinking last night that the chart I posted might be a little more interesting if each cell had the actual match-up wins and losses, so one could easily see the notional power at a glance.
On the assumption of a 50% win rate: I am no statistician, but this seems OK. Can't we start with the "ideal" notional set-up where each player has a 50% chance to win (there being only two players, and we ideally pretend each deck is 100% the same)? (Here, of course, I am thinking of the way physics might ask us to assume a penguin is a cylinder.) Then, what we are considering are the notional interventions (perhaps in the line of this sort of causal model) where card choices are "interventions" that work on the outcome of a given match. Of course, one major issue is that the "player skill" issue might overdetermine outcomes in some cases (along with the "random" nature of what cards show up in a given game), but we aren't really doing an actual experiment anyway.
But, with this amount of data about players, it might be possible to create a normalized "skill level" (as a sort of baseline win percentage) and then compare how deck/card choices seem to intervene on that. We cannot have a control there, but I guess we can compare it to our hypothetical 50% "ideal" win rate.
Of course, consider that I am not a statistician at all, so maybe I have no idea what I am talking about actually. (Maybe there is no maybe there either.)
"The Ancients teach us that if we can but last, we shall prevail."
—Kaysa, Elder Druid of the Juniper Order
I still don't know where you're trying to go with this.
If you have enough samples, outliers are expected.
In fact it would be weird if you don't have them.
Filling the diagonal in this type of chart doesn't make sense in general because it's 50% by definition.
You can also skip one of the 2 triangle sections since they have to add to 100%.
I agree with Reeplcheep. 50% makes sense as a baseline. What you really want to know is if Bant is favored against Sagavan, unfavored, or the matchup is about even. Using anything else as a null wouldn't tell you that.
I think he's getting at this:
https://xkcd.com/882/
In order to avoid nonsense like that, I think he adjusted his alphas to account for doing 105 tests instead of 1 test. But then alpha is so small that basically nothing is significant, so it doesn't tell you anything useful either.
Ideally you want more data. Or maybe weed down to fewer decks based on number of matches.
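For concreteness, here's a sketch of why the Bonferroni-corrected alpha kills almost everything (the 30-game matchup size is a hypothetical example, not from the posted data):

```python
from math import comb

alpha, m, n = 0.05, 105, 30        # n = hypothetical games in one matchup
alpha_bonf = alpha / m             # Bonferroni-corrected threshold, ~0.000476

def tail_p(k, n):
    """One-sided exact binomial p-value: P(X >= k wins) under a 50/50 null."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

# Smallest record that would still be significant after correction:
k = next(k for k in range(n + 1) if tail_p(k, n) < alpha_bonf)
print(f"{k}-{n - k}")              # -> 25-5
```

So a matchup would need to go at least 25-5 over 30 games to survive the correction, which is why basically nothing clears the bar.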
Yah, that is why it is 15x(15-1)/2=105 cells, not 15x15=225 cells.
That's what I didn't like about the chart. I weeded out all matchups that have a sample size lower than 30, like TES vs. Reanimator. Of the 105 cells, 80 would be blank, and it would only show the 25 most common ones. But that doesn't look as nice.
Indeed, although if the actual win/loss record were in the cell, rather than just a percentage, you could see the notional power somewhat immediately. So, instead of 75% being in a cell, put 3-1, so one would know what the record is, but also that it is a low-data case. I guess it doesn't look as slick like that, though.
I still think the table presentation is too busy. It could be simplified without losing important info.
1) Block out the diagonal. Mirror match is always 50%. Not useful.
2) Round and reformat cells (e.g. 33,33 -> 33%). There aren't enough matches to justify two decimal places; it just clogs up the table with twice as many digits without conveying much more. Rounding to the nearest percent is good enough for most practical use anyway.
3) Label whether it's row beating column or column beating row. You can figure it out based on known MUs, but we shouldn't have to guess.
4) Maybe have the number of matches played in brackets (below each %)
Example:
33% 25%
(9) (8)
Then you can still see the %s and the colored heat map but can also see the number of matches played and figure out the win-loss.
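As a sketch, that cell format is easy to generate straight from raw win/loss counts (the 3-6 record is made up for illustration):

```python
def cell(wins, losses):
    """Format a matchup cell as 'XX%' with the match count (n) underneath."""
    n = wins + losses
    return f"{round(100 * wins / n)}%\n({n})"

print(cell(3, 6))   # prints "33%" with "(9)" on the line below
```

With both the percentage and the count in the cell, the win-loss record can always be recovered (here 33% of 9 games = 3 wins).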
@Reeplcheep: Looking at only those 25 matchups, does that improve the FWER? Which results were significant?
What if you just do a one-tailed test on the winner of each pair? (Is this > 50%, instead of: is it ≠ 50%?)
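A sketch of that one-tailed test (the 22-8 record is made up; note the caveat that picking the winner from the data and then testing one-tailed is effectively a two-tailed test at double the alpha):

```python
from math import comb

def one_tailed_p(wins, games):
    """Exact one-sided binomial test: P(X >= wins) if the matchup were 50/50."""
    return sum(comb(games, i) for i in range(wins, games + 1)) / 2 ** games

# Hypothetical cell: the winner went 22-8 over 30 games.
print(round(one_tailed_p(22, 30), 4))  # -> 0.0081
```

That looks significant against a plain 5% alpha, but it still falls well short of the Bonferroni-corrected threshold of 0.05/105 ≈ 0.000476.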
What are the results of EW? Can we finally get over it and ban something?
Charts are from the Legacy Data Collection Project, you can find the decklists at the URL under each image.
Wasteland (Friday):
Lists @ MTGGoldfish
Bayou (Saturday):
Lists @ MTGGoldfish
Sylvan Library (Sunday):
Lists @ MTGGoldfish
Modern Horizons were a mistake.
I think the play patterns are pretty bad, but the meta seems healthy, as Delver does have many consistent predators (Lands/Elves/D&T/GW Depths). A true tier 0 deck would have even or positive matchups across the board. The reason Delver is always top dog is that its bad matchups are even worse against Doomsday.
It's an RPS metagame, but when it's 70-30 URx vs. DDay, 80-20 DDay vs. fair non-blue, and 60-40 fair non-blue vs. URx, it's easy to see why most spikes choose Delver. The top 16 conversion rate wasn't that insane.
Conversion rate is a bad metric because only so many can make top whatever and an infinite number of people can run.
However, if it's just UR, its predators, and the decks that prey on those, it's pretty boring, and the real challenge is to guess the fractions as the wheel turns.
Here is the data from the legacy data project. Hard non-mirror winrates collected by hand. This was a ton of work so if you like stuff like this please consider helping out with data collection or the Patreon.