(Forthcoming in Psychology Today)

THE TL;DR KEY POINTS
PRELIMINARY CLARIFICATION: WHAT IS CALIBRATION AND CALIBRATIONISM?

As I mentioned elsewhere, I recently published a paper arguing for calibrationism: the idea that judgments of probability are trustworthy only if there's evidence they are produced in ways that are calibrated—that is, only if there is evidence that the things one assigns probabilities of, say, 90% to turn out to be true approximately 90% of the time. Below is an example of such evidence; it is a graph which depicts the calibration of a forecaster from the Good Judgment Project—user 3559:

The graph shows how often the things they assign probabilities to turn out to be true. For example, the top right dot represents all the unique events which they assigned a probability of around 97.5% to before those events did or didn't occur: that Mozambique would experience an onset of insurgency between October 2013 and March 2014, that France would deliver a Mistral-class ship to a particular country before January 1st, 2015, and so on for 17 other events. Out of all 19 events which they assigned a probability of about 97% to, roughly 95% occurred. Likewise, if you look at all the events this person assigned a probability of approximately 0% to, about 0% of those events occurred.

However, not all people are like this. Below is a particular individual, user 4566, who assigned probabilities of around 97% to things which were true merely 21% of the time, such as Chad experiencing insurgency by March 2014. This graph shows that this person is poorly calibrated. Studies have shown people can be more or less calibrated in a range of domains, including medical diagnoses and prognoses, general knowledge about past or current affairs, geopolitical topics—virtually anything. Calibrationism then says that judgments of probability are trustworthy only if we have evidence that they are produced by ways of thinking that are well calibrated—that is, by ways of thinking that look more like the first forecaster's calibration graph than the second forecaster's.

A number of studies suggest humans can be miscalibrated to some extent. For example, Philip Tetlock found that some experts assigned 0% probabilities to events that actually happened 19% of the time. Another study found that some students assigned probabilities of about 95% to things that were true only 73% to 87% of the time (depending on which university the students were from). A variety of other studies provide evidence of miscalibration in other contexts, as discussed in my book Human Judgment.

HOW TO MEASURE ACCURACY AND CALIBRATION

So, not only does the evidence suggest that miscalibration is fairly common; various studies also suggest that we can be unaware of just how miscalibrated or inaccurate we are. Because of this, it is worth gathering evidence about how calibrated the judgments we rely on are, and calibrationism says this is necessary before we can fully trust them. Fortunately, gathering such evidence is often entirely possible, and the purpose of this post is to share some thoughts about how to do it.

For a start, we can plug some values into a spreadsheet. Below is an adaptation of a spreadsheet which I've used to track my calibration for almost a year now and which can be downloaded here:

It might seem odd to do this, but I don't think it's irrational.
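For readers who like to see the mechanics, here is a minimal sketch in Python (with made-up example judgments, not the Good Judgment Project's actual data) of how a calibration table like the ones behind these graphs can be computed: group judgments by the probability category assigned to them and compare that probability with the proportion of events in the category that actually occurred.

```python
from collections import defaultdict

# Illustrative (made-up) judgments: (assigned probability, whether the event occurred).
# Probabilities are assumed to come from a fixed menu of categories.
judgments = [
    (0.975, True), (0.975, True), (0.975, False), (0.975, True),
    (0.70, True), (0.70, False), (0.70, True),
    (0.05, False), (0.05, False), (0.05, True),
]

def calibration_table(judgments):
    """For each probability category, compare the assigned probability
    with the observed frequency with which those events occurred."""
    by_category = defaultdict(list)
    for prob, occurred in judgments:
        by_category[prob].append(occurred)
    table = []
    for prob in sorted(by_category):
        outcomes = by_category[prob]
        observed = sum(outcomes) / len(outcomes)
        table.append((prob, observed, len(outcomes)))
    return table

for prob, observed, n in calibration_table(judgments):
    print(f"Assigned {prob:.1%}: true {observed:.0%} of the time (n={n})")
```

This is only an illustration of the comparison a calibration graph depicts; the graphs above were produced from the forecasters' actual tournament data.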
I often find myself making probability judgments about the world anyway, and I found it both fun and insightful to take a moment to make this tool and systematically track the accuracy of those judgments. (Plus, as a cognitive scientist who specializes in judgment accuracy, it makes sense to do something like this, since studying this kind of thing is literally my job!)

In the spreadsheet above, I have anonymized the content so it does not concern other people, but the topics include a broad variety of things: past or current medical diagnoses and prognoses, interpersonal relationships with friends or others, career developments, events in international affairs and many other things. In that sense, it's a truly "domain general" set of judgments, to use a term from my book Human Judgment. Some aspects of the spreadsheet are self-explanatory (e.g. the "Topic", "Description" and "Rationale" columns), while others are explained elsewhere (e.g. the "Brier" and "Resolution" scores, both of which are discussed in the second chapter of my book here). What matters, though, is that the spreadsheet automatically produces a table and graph to depict the calibration of a set of judgments as follows (note that errors like "#DIV/0" show up in the categories where I haven't made any judgments*):

Like the other calibration graphs, the x-axis shows the probabilities I assign to things and the y-axis shows the frequency with which those things turn out to be true. So far, the judgments are pretty well calibrated for some categories: of the things I have been around 97.5% confident in, for example, about 92% are true (although I will say more on this shortly), while the things I have assigned 70% probabilities to are true about 71% of the time. Other categories may appear less well calibrated: for example, all of the things I have assigned a probability of 80% to are true, possibly indicating a degree of underconfidence.
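Since the "Brier" column may be unfamiliar, here is a minimal sketch of the standard Brier score formula (my own illustration with made-up data, not the spreadsheet's internal formula): the mean squared difference between each assigned probability and the outcome, coded 1 if the event occurred and 0 otherwise, so lower scores are better.

```python
def brier_score(judgments):
    """Mean squared difference between assigned probability and outcome,
    where outcome is 1 if the event occurred and 0 otherwise (lower is better)."""
    return sum((prob - float(occurred)) ** 2 for prob, occurred in judgments) / len(judgments)

# Example with made-up judgments: (assigned probability, whether the event occurred)
example = [(0.9, True), (0.7, True), (0.2, False), (0.95, False)]
print(f"Brier score: {brier_score(example):.3f}")
```

For reference, assigning 50% to everything yields a Brier score of exactly 0.25, so scores well below that indicate judgments which are both informative and reasonably accurate.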
THINGS TO PAY ATTENTION TO WHEN MEASURING CALIBRATION

However, the statistics for some of the categories may be misleading, and so there are a few things to pay attention to when measuring calibration—several desiderata, or ideal features, we might say.

Desideratum #1: Multiplicity

The first of these is what I call multiplicity, which concerns the number of judgments we have. When measuring calibration, we typically need multiple judgments that can be assessed for accuracy. For instance, even a person who is perfectly calibrated will assign probabilities of 90% to things which are false 10% of the time, and if we only have one of their 90% judgments for an outcome which happens to be false, then they might look terribly miscalibrated when they are actually not. For example, in the spreadsheet above, I assigned a 95% probability to moving into a particular apartment ahead of time; then, when the day to move in came and I was on my way to get the keys, the apartment owner said an issue had come up and I couldn't move in. That was the one thing I assigned a high probability to in that category which turned out false, and if we focus on just that instance, which (I would think) was objectively improbable, I might look much more miscalibrated than I actually am.

More generally, the so-called "law of large numbers" implies that if, say, the things we are 90% confident in would be true 90% of the time, then our actual judgments will likely converge to this proportion—but only with a sufficiently large number of judgments. It's like how the probability of a coin landing heads is 50%, but the actual numbers reflect this only with multiple coin flips: three coin flips might land heads 100% of the time, or 66% of the time, but many more flips will eventually reveal the true probability of 50%. Consequently, the number of judgments in a given category is depicted in the calibration graph: there are 13 judgments in the 97.5% category and only 2 in the 60% category, meaning that the observed calibration is much more likely to reflect the true calibration for the former category than for the latter.
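To see how much small samples can distort things, here is a small simulation sketch (illustrative only; the sample sizes are arbitrary assumptions): it models a perfectly calibrated judge whose 90% judgments really are true 90% of the time and reports the observed frequency for samples of different sizes.

```python
import random

random.seed(1)  # for reproducibility of this illustration

def observed_frequency(true_probability, n_judgments):
    """Simulate a perfectly calibrated judge: each event they assign
    `true_probability` to occurs with exactly that probability.
    Return the observed proportion of events that occurred."""
    outcomes = [random.random() < true_probability for _ in range(n_judgments)]
    return sum(outcomes) / n_judgments

for n in (3, 10, 100, 1000):
    freq = observed_frequency(0.9, n)
    print(f"n={n:4d}: events assigned 90% were true {freq:.0%} of the time")
```

With only a handful of judgments, the observed frequency can easily land far from 90%; with many judgments, it settles near the true rate, which is why the number of judgments per category matters.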
Desideratum #2: Independence

However, observed calibration is more likely to reflect one's real underlying accuracy when another desideratum is met: namely, that the things one makes judgments about are probabilistically independent of each other. Suppose, for example, that someone judges there to be a 70% probability that the Republican candidate will win the next US presidential election. If their dataset contains only 100 copies of that same 70% judgment, then either all of the predictions will turn out true or none of them will, since they all concern the exact same event. More generally, then, a set of judgments could look unduly inaccurate merely because the judgments are dependent on each other, either because they are about the same topic (e.g. repeating the same forecast in a forecasting tournament) or because they are about closely related topics which affect each other (e.g. the number of COVID deaths and the number of COVID infections in a country). One way to guarantee the right kind of independence, then, is to make sure that the judgments each concern an independent topic: you might have one judgment about the US presidential election, another about the culture of Mexico, another about a question of geography, and so on for whatever else one might care about.

Desideratum #3: Resolvability

Beyond that, we would ideally assess accuracy using questions that are "resolvable", meaning we can eventually know with virtual certainty what the correct answer is. Sometimes we can do this, like when predicting a future outcome which will eventually be knowable, or when using things like DNA evidence to figure out what happened in the past. But sometimes the questions we are concerned with are not resolvable. For example, one study assessed the accuracy of judgments from experts vs. non-experts by comparing the two groups' judgments to estimates from other experts. But if the "other experts" have inaccurate judgments, as experts sometimes do, then this would give a skewed and possibly biased picture of judgment accuracy.

In any case, the upshot is that it's very feasible to measure the calibration of multiple, independent and resolvable judgments using the free spreadsheet tool above. This could potentially help us determine both the trustworthiness of those judgments and, by extrapolation, the trustworthiness of judgments in other contexts where the true outcomes might not be knowable.

*Footnote: In my calibration graph above, I've "flipped" probabilities like 10% in false propositions to become 90% probabilities in true propositions (i.e., probabilities in their negations), so that I have a larger sample of judgments for the categories above 50%.
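For completeness, here is a minimal sketch of the "flipping" described in the footnote (my own rendering of the idea, not the spreadsheet's formula): a probability p assigned to a proposition is equivalent to a probability 1 - p assigned to its negation, so judgments below 50% can be recoded into categories at or above 50%.

```python
def flip_low_probabilities(judgments):
    """Recode each (probability, occurred) pair so the probability is >= 0.5:
    a judgment of p in a proposition equals a judgment of 1 - p in its negation,
    and the negation occurred exactly when the original proposition did not."""
    flipped = []
    for prob, occurred in judgments:
        if prob < 0.5:
            flipped.append((1 - prob, not occurred))
        else:
            flipped.append((prob, occurred))
    return flipped

# Example: a 10% judgment in something false becomes a 90% judgment in something true.
print(flip_low_probabilities([(0.1, False), (0.8, True)]))
# -> [(0.9, True), (0.8, True)]
```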