(Forthcoming in Psychology Today)

THE TL;DR KEY POINTS
THE IMPORTANCE OF TRUSTWORTHY JUDGMENTS

We all make judgments of probability and depend on them for our decision-making. However, it is not always obvious which judgments to trust, especially since a range of studies suggest these judgments can sometimes be more inaccurate than we might hope or expect. For example, scholars have argued that at least 4% of death sentence convictions in the US are false convictions, that tens or even hundreds of thousands of Americans die of misdiagnoses each year, and that experts can sometimes be 100% sure of predictions which turn out to be false 19% of the time. So we want trustworthy judgments, or else bad outcomes can occur.

How, then, can we determine which judgments to trust—either from ourselves or others? In a paper recently published here and freely available here, I argue for an answer called “inclusive calibrationism”—or just “calibrationism” for short. Calibrationism says trustworthiness requires two ingredients—calibration and inclusivity.

THE FIRST INGREDIENT OF TRUSTWORTHINESS: CALIBRATION

The calibrationist part of “inclusive calibrationism” says that judgments of probability are trustworthy only if they are produced by ways of thinking that have evidence of their calibration. Here, “calibration” is a technical term that refers to whether the probabilities we assign to things correspond to the frequency with which those things are true.

Let us consider this with an example. Below is a graph which depicts the calibration of a forecaster from the Good Judgment Project—user 3559:

The graph shows how often the things they assign probabilities to turn out to be true. For example, the top right dot represents all the unique events which they assigned a probability of around 97.5% to before they did or didn’t occur: that Mozambique would experience an onset of insurgency between October 2013 and March 2014, that France would deliver a Mistral-class ship to a particular country before January 1st, 2015, and so on for 17 other events. Out of all of these 19 events to which they assigned a probability of about 97%, it turns out that about 95% occurred. Likewise, if you look at all the events this person assigned a probability of approximately 0%, about 0% of those events occurred.

In this case, this person has a relatively good track record of calibration because they think in ways which mean that the probabilities they assign to things correspond (roughly) to the frequency with which those things are true. And this is the case even when those things concern “unique” events which we might have thought we couldn’t assign probabilities to. After all, a particular insurgency either would or would not occur; it’s a unique event, and so it’s not obvious we can assign it a numerically precise probability, like 67%, that—as it turns out—reflects the objective odds of its occurrence. But it turns out that we can—at least if we think in the right ways.

Yet not all of us are so well calibrated. Below is a particular individual, user 4566, who assigned probabilities of around 97% to things which were true merely 21% of the time, such as Chad experiencing insurgency by March 2014 and so on.

Studies have shown people can be more or less calibrated in a range of domains, including medical diagnoses and prognoses, general knowledge about past or current affairs, geopolitical topics—virtually anything.
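To make the idea of calibration more concrete, here is a minimal sketch, in Python, of the kind of calculation that lies behind a calibration graph. The judgments below are invented for illustration (they are not the Good Judgment Project’s data): each pair is a probability someone assigned and whether the event in question actually occurred. The sketch groups the judgments into probability bins and compares the average probability assigned in each bin with how often those events occurred.

# A minimal sketch of the calculation behind a calibration graph.
# Each pair is (probability assigned, outcome), where outcome is 1 if the
# event occurred and 0 if it did not. The data here are invented.

judgments = [
    (0.97, 1), (0.95, 1), (0.98, 1), (0.96, 0), (0.97, 1),
    (0.50, 1), (0.55, 0), (0.45, 0), (0.50, 1), (0.52, 0),
    (0.05, 0), (0.02, 0), (0.03, 0), (0.05, 1), (0.01, 0),
]

def calibration_table(judgments):
    """Group judgments into ten probability bins (0-10%, ..., 90-100%) and
    report the average probability assigned and the frequency of occurrence."""
    bins = {}
    for prob, outcome in judgments:
        index = min(int(prob * 10), 9)  # which of the ten bins this falls in
        bins.setdefault(index, []).append((prob, outcome))
    table = []
    for index in sorted(bins):
        pairs = bins[index]
        avg_prob = sum(p for p, _ in pairs) / len(pairs)
        frequency = sum(o for _, o in pairs) / len(pairs)
        table.append((avg_prob, frequency, len(pairs)))
    return table

for avg_prob, frequency, n in calibration_table(judgments):
    print(f"Assigned ~{avg_prob:.0%} on average to {n} events; "
          f"{frequency:.0%} of them occurred.")

The closer the reported frequencies are to the probabilities assigned, the more the resulting graph resembles the first forecaster’s rather than the second’s.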
Calibrationism then says that judgments of probability are trustworthy only if we have evidence that they are produced by ways of thinking that are well calibrated—that is, by ways of thinking that look more like the first forecaster’s calibration graph than the second’s.

Of course, for many kinds of judgments, we lack strong evidence of calibration, and we then have little ground for unquestioning trust in those judgments of probability. This is true in some parts of medicine and law, for instance, where we have some evidence of inaccurate judgments which can lead to misdiagnoses or false convictions at least some of the time, even if we have perfectly good judgments the rest of the time. But in other contexts, we simply lack evidence either way: our judgments could be as calibrated as the first forecaster’s or as miscalibrated as the second’s, and we have no firm grounds for deciding which. The calibrationist implication, then, is that in some domains we need to measure and possibly improve our calibration before we can fully trust the judgments of probability which we form in those domains.

The good news, however, is that this is possible, as I discuss in my book elsewhere. For example, some evidence (e.g. here and here) suggests we can improve our calibration by drawing on statistics about the frequency with which things have happened in the past. For instance, we can better predict a future recession if we look at the proportion of the time that recessions have occurred in similar situations in the past. We can similarly use statistics like these to determine the probabilities of election outcomes, wars, a patient having a disease and even more mundane outcomes like whether someone has a crush on you.

The other good news is that some people are well calibrated, meaning that they can provide us with trustworthy judgments about the world. For example, take user 5265 from the Good Judgment Project’s forecasting tournament. In the first year of the tournament, their judgments were well calibrated, as the graph below depicts:

If we had been in the second year of the tournament, about to ask this person another series of questions, we could have inferred the likely calibration and trustworthiness of their judgments about the future. And in fact, that is exactly what we see when we look at their track record of calibration for the second year of the tournament below:

More generally, the evidence demonstrates that track records of accuracy are the best indicator of someone’s accuracy in other contexts—a better indicator than education level, age, experience or anything else that has been scientifically tested. So a track record of calibration is one important ingredient of trustworthiness, but it is not the only one.

THE SECOND INGREDIENT OF TRUSTWORTHINESS: INCLUSIVITY

Another important ingredient is inclusivity: that is, the extent to which we consider all the evidence which is relevant. After all, calibration isn’t everything we care about, since someone could be perfectly well calibrated merely by assigning 50% probabilities to a series of “yes/no” questions. Additionally, some evidence suggests that people who include more of the relevant evidence form more accurate judgments—and it’s obvious how this can improve accuracy when, for example, courts include DNA evidence which vindicates defendants who would otherwise be convicted.
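Before turning to that second ingredient in more detail, a small illustration may help show why calibration alone is not enough. The paper itself does not rely on scoring rules, but a standard one, the Brier score (the average squared difference between forecasts and outcomes), is a convenient way to quantify the point. In the invented example below, a forecaster who always answers 50% to yes/no questions comes out calibrated, yet their forecasts tell us nothing about which events will occur; a forecaster who draws on the evidence is both calibrated and far more informative.

# An invented illustration of why calibration alone is not enough.
# Forecaster A assigns 50% to every yes/no question; Forecaster B draws on
# the available evidence and assigns probabilities closer to 0% or 100%.

outcomes = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]  # 1 = the event occurred
forecaster_a = [0.5] * len(outcomes)        # always answers "50%"
forecaster_b = [0.9, 0.1, 0.8, 0.9, 0.2, 0.1, 0.9, 0.2, 0.8, 0.1]

def brier_score(forecasts, outcomes):
    """Average squared difference between forecasts and outcomes (lower is better)."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)

# Forecaster A is calibrated: of the events assigned 50%, exactly 50% occurred.
# But those forecasts do not help us tell which events will happen, and the
# Brier score reflects that lack of informativeness.
print("Forecaster A (always 50%):", brier_score(forecaster_a, outcomes))    # 0.25
print("Forecaster B (uses evidence):", brier_score(forecaster_b, outcomes))  # 0.022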
What we also care about, then, is whether judgments of probability are informative in the sense that they tell us whether something is true or not in a particular case. This, in turn, is largely a matter of including evidence. For example, one could be well calibrated by assigning 50% probabilities to a series of “yes/no” questions, but in doing so one would likely be omitting relevant evidence and not saying anything particularly informative. Calibrationism then says that in order for judgments to be trustworthy, they must also include all the evidence which we regard as relevant.

GETTING PRACTICAL: HOW TO IMPLEMENT CALIBRATIONISM

So that’s calibrationism in a rough nutshell: our judgments are trustworthy to the extent we have evidence that A) they are produced in ways that are well calibrated and B) they are inclusive of all the relevant evidence. What, then, are the implications of this? Four come to mind:
Practically, I provide more ideas about how we can do these things elsewhere. For example, one can measure calibration by plugging some judgments into a spreadsheet template, as I discuss here. Calibration can also be improved with recommendations which I discuss here. Lastly, if we want to assess the inclusivity and trustworthiness of someone’s thinking, we can list the evidence which we think is relevant (as in the simple checklist sketched below), ask them questions about their reactions to each item, and, if their responses seem to reflect calibrated engagement with all of the evidence, trust their judgments.

Put simply, to decide which judgments to trust, we might want to see more calibration graphs (to assess calibration) and evidence checklists (to assess inclusivity)—at least when the stakes are high. This might help make a world with fewer fatal misdiagnoses, false criminal convictions and other expressions of inaccuracy that compromise the functioning and well-being of our societies.
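Finally, for readers who like to see things spelled out, here is a small, purely hypothetical sketch of the kind of evidence checklist mentioned above. The items and fields are invented for illustration; the point is simply to record which pieces of relevant evidence someone’s thinking has actually engaged with.

# A hypothetical evidence checklist for gauging the inclusivity of someone's
# thinking about a single question. The items below are invented examples.

checklist = [
    {"evidence": "Base rate of this outcome in similar past cases", "addressed": True},
    {"evidence": "The most recent test or poll results",            "addressed": True},
    {"evidence": "Known ways those tests or polls can mislead",     "addressed": False},
]

addressed = sum(1 for item in checklist if item["addressed"])
print(f"Evidence items addressed: {addressed} of {len(checklist)}")
for item in checklist:
    status = "addressed" if item["addressed"] else "NOT addressed"
    print(f"- {item['evidence']}: {status}")

# If relevant items remain unaddressed, calibrationism counsels caution before
# fully trusting the resulting judgment of probability.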
Author: John Wilcox