Monday, January 16, 2017

There are (Almost) No Obvious Development Mistakes and Complaints are (Almost) Always Hindsight Bias: Issuing the PT AER Metagame Prediction Challenge

(Click here if you just want to go straight to where you can win a Japanese Elspeth vs Kiora for predicting the PT AER metagame. Keep reading if you want to know why I think you - the collective you - aren't any good at identifying broken MtG cards.)

Complaints about WotC R&D’s card balancing/metagame prediction abilities basically all fall into something like this pattern:

Step 1) New cards are spoiled. Something like 10-20 strong standout cards with high competitive potential are identified by the community as cards to watch.

Step 2) Competitive metagame forms over the course of many weeks of high-level competitive play and deck refinement from a playerbase of millions. Eventually a handful of the true standout cards are discovered from among the group identified in Step 1. In some cases these cards might even be broken.

Step 3) “R&D is terrible at their jobs, how did they miss this obviously broken card, I saw that this card was broken the second it was spoiled. Clearly the FFL is no better than a bunch of drunk monkeys.”

What allows people to falsely believe Step 3 is that they did identify the development mistakes when they were spoiled. It’s just that they identified them as part of a whole bunch of possibly good cards. Some of these good cards didn’t quite make it, some of them were actually good, and some of them were broken. Then we forget our misses, zero in with hindsight bias on the hits, and wonder why R&D isn’t as good as we are at identifying the broken cards.

I’m not exaggerating about the level of contempt that is sometimes expressed for R&D, by the way:
I hope we all can see the hindsight bias at play here. Particularly telling is how this poster’s ability to identify development “mistakes” seems to take a nosedive as we approach recent sets. Grim Flayer’s deck just got knocked out of the meta. And Mardu Vehicles is certainly a top contender, but Depala is hardly a card people point to as an OP development mistake. It’s not impossible that Yahenni’s Expertise will turn out to be a development mistake, but at this point it’s just one of many cards that could possibly be OP. Even Sylvan Advocate no longer seems like such a big mistake now that it’s completely fallen out of the metagame.

It’s instructive to go back and read some of the set reviews of sets that contained obvious development mistakes. Yes, we thought Dromoka’s Command and Collected Company might be good. We also thought Sidisi, Undead Vizier and Narset and Thunderbreak Regent and Secure the Wastes might be good. Some of those cards were good, some weren’t, and some were perhaps too good.

It’s Impossible to Evaluate Individual Cards without knowing the Metagame Context (And the Metagame is Impossible for the FFL to Accurately Predict)

As the fate of “obviously broken” cards fading in and out of the meta shows us, the difference between OP, good, fringe, and not-quite-good-enough cards isn’t the cards themselves. Their actual strength is always an emergent property of the metagame in which they exist. Take one of development’s biggest mistakes of recent memory:
There’s no denying that Collected Company was a pretty big miss on R&D’s part, and the eventual Dragons/Origins-BFZ-Shadows standard it dominated was truly quite stale. But even this card’s strength in standard was highly context-dependent! On release CoCo was seen mainly as a great addition to modern, spawning a new but not broken archetype. Meanwhile in standard it was middling, forming part of a good-not-great Green/White aggro deck. It wasn’t until two sets later that enough solid 3-drops were released to create an environment for CoCo to become the oppressively format-warping card that we grew to hate.

CoCo started out fine and became strong, so in hindsight we consider it broken. It’s also instructive to consider a card that had something of the reverse dynamic:
Half a year ago Duskwatch Recruiter was labelled a “development mistake” just as often as Collected Company was. Green getting a recurring card draw effect that’s also an extremely efficient beater that’s also a ramp spell? How ridiculous is that? Of course this is broken, what are the morons in R&D doing? But wait, the meta changed, and these days even when there is a green deck in the format, Recruiter doesn’t make the cut.

There’s been a lot of whining about FNM promos lately, so let’s look at one of the FNM whiffs of last year:
Flaying Tendrils has always been a bulk card. Which intern in R&D do they have picking these Promos anyway?

But wait, what happened the last time they printed a similar effect?
You may not remember, but while it was in standard, this was a $2+ uncommon. Sounds like something that would have been a solid FNM card!

Of course the difference is that Flaying Tendrils is in a standard environment where a mass -2/-2 is pretty useless, and Drown in Sorrow was in a standard environment where a mass -2/-2 was amazing.

So: individual cards are impossible to evaluate absent foreknowledge about how the entire metagame will shape up, and the metagame is the emergent result of the crowdsourced efforts of a horde of highly motivated and intelligent players (and even then it takes us a few months to really shake it out). Given this, I feel confident asserting that there are almost no obvious development mistakes, and if you think you identified some, you’re likely operating under hindsight bias.


You Totally Predicted that CoCo was OP, Though, and can Prove it


Any such assertions made after the fact (and any arguments that rely on such post-hoc assertions) are indistinguishable from hindsight bias and should be discounted. The only way to rigorously test whether you are as good at identifying OP cards as you say you are is to pre-commit, before the tournament results come in. Since we’ve just finished the Aether Revolt prerelease, that means… now. In the vein of the PT EMN Fantasy Draft, I am happy to unveil...


The Pro Tour Aether Revolt Metagame Prediction Challenge - Win a Japanese Elspeth vs Kiora!

Contest is here.

All entries will be individually scored. You receive 1 point every time one of your cards appears in a top-performing standard deck (7 wins/21 points or better). For each sideboard-only appearance, you will receive 0.25 points. To capture the impact of cards that may be format-defining despite not being 4-offs in their decks (such as Emrakul), you will receive the full point value even when your selected card is a 1-off, 2-off, or 3-off in its decks. The top individual entry will win a new Japanese Elspeth vs Kiora.
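For anyone who wants to sanity-check their own entry, here is a rough sketch of how the scoring rule above could be computed. The `Deck` layout, the `score_entry` name, and the idea of identifying a top-performing deck by its win count are all my assumptions for illustration; this is not the actual scoring script.

```python
from __future__ import annotations

from dataclasses import dataclass, field

MAIN_POINTS = 1.0             # full point for any main-deck appearance
SIDEBOARD_ONLY_POINTS = 0.25  # card appears in the sideboard only
MIN_WINS = 7                  # "7 wins/21 points or better"


@dataclass
class Deck:
    """Hypothetical decklist record: card name -> number of copies."""
    wins: int
    main: dict[str, int]
    side: dict[str, int] = field(default_factory=dict)


def score_entry(picks: list[str], decks: list[Deck]) -> float:
    """Score one entry against a list of decklists."""
    total = 0.0
    for deck in decks:
        if deck.wins < MIN_WINS:
            continue  # only top-performing decks count
        for card in set(picks):  # a card picked twice is only counted once
            if deck.main.get(card, 0) > 0:
                # 1-offs count the same as 4-offs
                total += MAIN_POINTS
            elif deck.side.get(card, 0) > 0:
                total += SIDEBOARD_ONLY_POINTS
    return total
```

Note that the number of copies never enters the score: a single main-deck copy is worth the same as a full playset, which is exactly the Emrakul case described above.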

The challenge here is to prove me wrong. If I’m wrong, and there is an obvious development mistake, the community’s picks should concentrate into a few (<3) cards, and those cards should turn out to be OP. If the community’s picks are spread out among a lot of cards, and some of them do turn out to be OP - then I’m right, and there were no obvious development mistakes. If the community’s picks are concentrated into a few cards, and those cards do not turn out to be OP, then I’m still right, and there were still no obvious development mistakes, because the ones we thought were “obviously too good” turned out not to be.
I’m pretty sure I’m right, but hey—maybe you guys will prove me wrong.
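
The three outcomes above boil down to a small decision rule, sketched below. The sketch needs cutoffs that I have not actually defined: `min_share` (how much of the community's picks must land on the top cards to count as "concentrated") is a made-up illustrative threshold, and `op_cards` stands in for whatever list of cards we end up agreeing were OP. Only the "fewer than 3 cards" part comes from the terms above.

```python
from collections import Counter


def verdict(all_picks, op_cards, max_cards=2, min_share=0.5):
    """Classify the challenge outcome from the community's pooled picks.

    all_picks: every pick from every entry, as a flat list of card names.
    op_cards:  set of cards judged OP after the results are in (assumed input).
    max_cards: "a few (<3)" cards, i.e. at most 2.
    min_share: hypothetical cutoff for calling the picks "concentrated".
    """
    counts = Counter(all_picks)
    total = sum(counts.values())
    top = counts.most_common(max_cards)
    share = sum(n for _, n in top) / total if total else 0.0

    if share < min_share:
        # Picks spread across many cards: no consensus, so nothing was obvious.
        return "no obvious mistake (picks spread out)"
    if all(card in op_cards for card, _ in top):
        # Consensus picks that really were OP: I lose.
        return "obvious mistake"
    # Consensus picks that did not pan out.
    return "no obvious mistake (consensus picks not OP)"
```

Different cutoffs will obviously move borderline cases from one bucket to another, which is part of why the verdict will need some judgment rather than a formula.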


Appendix - In Which I Concede a Situation Where I Look Pretty Wrong, but Not Really because Reasons


Funny thing about the terms of the metagame prediction challenge - had I run this challenge for Pro Tour Kaladesh, I would probably have lost. Why? Well, you may recall that this card was recently banned:
And honestly, had I run the prediction challenge pre PT-KLD, I expect a lot of people would have picked Copter. Now, I could submit the small quibble that Smuggler’s Copter is colorless. For a prediction game that’s scored simply on the number of decks in which your picked cards appear, the strategic choice is to load up on powerful colorless cards that can go in many archetypes, rather than powerful archetype-specific cards. Even if you thought, say, Chandra, Torch of Defiance would end up being stronger than Smuggler’s Copter, your incentive would still be to pick Copter. My point is not that Chandra is stronger than Copter, as we’ve learned that she’s not, just that someone *believing* at that time that Chandra is stronger would still have picked Copter, thus misrepresenting the wisdom of the playerbase versus development. It’s for this reason that I’ve segregated the colorless cards from the pick pool in the PT AER prediction challenge.

But that’s just a quibble, and to be completely honest - a lot of people pegged Copter as the defining card of the set shortly after release, head and shoulders above the rest of the set. I don’t believe many people predicted a banworthy level of brokenness from Copter, but it was definitely a case where FFL missed something that was reasonably obvious.

That said, a single exception does not disprove a general rule, and there is an “almost” in my assertion for a reason: “complaints are (almost) always hindsight bias.” So even granting that complaints about FFL missing on Smuggler’s Copter are more reasonable than most MtG balance whining, I still believe overall that obvious development mistakes are extremely rare.
