<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" >

<channel><title><![CDATA[JOHN WILCOX - John's Blog]]></title><link><![CDATA[https://www.johnwilcox.org/johns-blog]]></link><description><![CDATA[John's Blog]]></description><pubDate>Sat, 04 Apr 2026 05:04:32 -0700</pubDate><generator>Weebly</generator><item><title><![CDATA[Comparing the Constitutions of New Zealand and the United States]]></title><link><![CDATA[https://www.johnwilcox.org/johns-blog/comparing-the-constitutions-of-new-zealand-and-the-united-states]]></link><comments><![CDATA[https://www.johnwilcox.org/johns-blog/comparing-the-constitutions-of-new-zealand-and-the-united-states#comments]]></comments><pubDate>Sat, 22 Mar 2025 21:33:33 GMT</pubDate><category><![CDATA[Politics and governance]]></category><guid isPermaLink="false">https://www.johnwilcox.org/johns-blog/comparing-the-constitutions-of-new-zealand-and-the-united-states</guid><description><![CDATA[I have recently been studying the constitutions of the United States and New Zealand. I use the term &ldquo;constitution&rdquo; here to mean the foundational principles upon which their governments are built.&nbsp;The two countries have quite different constitutions. The United States has its constitution built on the law entitled&mdash;quite aptly&mdash;the &ldquo;Constitution of the United States&rdquo;, arguably alongside the &ldquo;Declaration of Independence&rdquo;. In contrast, New Zealand [...] ]]></description><content:encoded><![CDATA[<div class="paragraph"><span style="color:rgb(123, 123, 123)">I have recently been studying the constitutions of the United States and New Zealand. I use the term &ldquo;constitution&rdquo; here to mean the foundational principles upon which their governments are built.&nbsp;</span><br /><br /><span style="color:rgb(123, 123, 123)">The two countries have quite different constitutions. The United States has its constitution built on the law entitled&mdash;quite aptly&mdash;the &ldquo;Constitution of the United States&rdquo;, arguably alongside the &ldquo;Declaration of Independence&rdquo;. In contrast, New Zealand&rsquo;s constitution comprises various statutes, constitutional conventions and other important constitutional sources&mdash;salient among which are the Constitution Act 1986, the New Zealand Bill of Rights Act 1990 and the Treaty of Waitangi.</span><br /><br /><span style="color:rgb(123, 123, 123)">I want to highlight several themes which strike me, starting with the United States&rsquo; constitution.</span></div>  <div>  <!--BLOG_SUMMARY_END--></div>  <div class="paragraph" style="text-align:justify;"><br /><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp;</em><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp; &nbsp;</em><span style="color:rgb(123, 123, 123)">&#8203;</span><em style="color:rgb(123, 123, 123)">&nbsp; U</em><em>S focus on welfare and happiness</em><br /><br />Central to the US constitution are concerns for welfare and happiness. 
This can be seen in several places, as italicized below:<br /><br /><ul><li>The Constitution states that &ldquo;We the People of the United States&hellip; do ordain and establish this Constitution for the United States of America&rdquo; in order to &ldquo;promote the general <em>Welfare</em>&rdquo; (among other ends such as &ldquo;Tranquility&rdquo;, &ldquo;Justice&rdquo; and &ldquo;common defense&rdquo;).</li><li>Again, the Constitution states that Congress has the power to collect taxes in order &ldquo;to pay the Debts and provide for the common Defence and general <em>Welfare </em>of the United States&rdquo;.</li><li>Likewise, the Declaration of Independence states that &ldquo;Governments are instituted among Men&rdquo; to secure &ldquo;certain unalienable Rights&rdquo; including &ldquo;Life, Liberty and the pursuit of <em>Happiness</em>&rdquo;.</li><li>It further declares that &ldquo;whenever any Form of Government becomes destructive of these ends&rdquo;, then the people have the right to &ldquo;alter or abolish&rdquo; that government in favor of a new government predicated on the principles and powers that &ldquo;to them shall seem most likely to effect their Safety and <em>Happiness</em>&rdquo;.</li></ul><br />Interestingly, then, references to welfare and happiness are central to the founding principles of the United States<span style="color:rgb(123, 123, 123)">&mdash;</span>but not so much New Zealand.<br /><br /><br /><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp;</em><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp; &nbsp;</em><span style="color:rgb(123, 123, 123)">&#8203;</span><em style="color:rgb(123, 123, 123)">&nbsp;&nbsp;</em><em>Common focus on human rights</em><br /><br />Nevertheless, both countries have an emphasis on rights. For example, the Bill of Rights Act 1990 enshrines numerous rights, such as freedom of speech (Section 14) and thought (Section 13), and freedom from torture (Section 9) or unjust deprivation of life (Section 8). Other components of New Zealand&rsquo;s constitution&mdash;such as the Constitution Act 1986&mdash;focus less on the ethical dimension of government and instead specify the structure of its administrative machinery.<br /><br /><br /><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp;</em><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp; &nbsp;</em><span style="color:rgb(123, 123, 123)">&#8203;</span><em style="color:rgb(123, 123, 123)">&nbsp;&nbsp;</em><em>Common focus on public service</em><br /><br />Perhaps needless to say, both constitutions explicitly state that the purposes of government are not self-serving. Instead, the government exists to serve the people, such as by securing human rights or happiness. The government does not exist to serve the interests of the governors at the expense of the governed, so to speak.<br /><br /><br /><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp;</em><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp; &nbsp;</em><span style="color:rgb(123, 123, 123)">&#8203;</span><em style="color:rgb(123, 123, 123)">&nbsp;&nbsp;</em><em>NZ focus on indigenous partnership</em><br /><br />One way in which New Zealand differs, however, is that the Treaty of Waitangi&mdash;or its principles&mdash;is often regarded as a foundational component of New Zealand&rsquo;s government and constitution. At least in principle, the Treaty reflects an explicit partnership with, and regard for, the indigenous people of New Zealand&mdash;the Maori people. 
Interestingly, according to the Treaty of Waitangi Act 1975, &ldquo;Maori&rdquo; is statutorily defined as &ldquo;a person of the Maori race of New Zealand&rdquo;, including &ldquo;any descendant of such a person&rdquo;. The implication is that if one has <em>any</em> Maori blood&mdash;no matter how little&mdash;they are legally defined as Maori under that act.&nbsp;<br /><br /><br /><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp;</em><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp; &nbsp;</em><span style="color:rgb(123, 123, 123)">&#8203;</span><em style="color:rgb(123, 123, 123)">&nbsp;&nbsp;</em><em>US focus on grievances</em><br /><br />Unlike New Zealand, however, the constitution of the United States has an explicit emphasis on grievances&mdash;27 in fact&mdash;which, in the eyes of the Founding Fathers, justified the creation of the United States. Understandably, these grievances involved violations of the very purpose of government that is endorsed by the US constitution, such as refusal to assent to laws which the colonies deemed necessary &ldquo;for the public good&rdquo;, or the unpunished murder of two Maryland residents by British marines. That said, a lack of representation in the UK&rsquo;s parliament is only one of these grievances; for that reason, it is not entirely accurate to say American independence was motivated solely by the issue of &ldquo;taxation without representation&rdquo;&mdash;although that issue was undoubtedly an important one.<br /><br /><br /><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp;</em><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp; &nbsp;</em><span style="color:rgb(123, 123, 123)">&#8203;</span><em style="color:rgb(123, 123, 123)">&nbsp;&nbsp;</em><em>US focus on God</em><br /><br />Lastly, unlike New Zealand, the US constitution has an explicit emphasis on God. 
As one example, the Declaration concludes with a statement of reliance on God, especially since signing it was considered an act of treason that was punishable by death:<br />&#8203;</div>  <div><div class="wsite-multicol"><div class="wsite-multicol-table-wrap" style="margin:0 -15px;"> 	<table class="wsite-multicol-table"> 		<tbody class="wsite-multicol-tbody"> 			<tr class="wsite-multicol-tr"> 				<td class="wsite-multicol-col" style="width:8.786231884058%; padding:0 15px;"> 					 						  <div class="wsite-spacer" style="height:50px;"></div>   					 				</td>				<td class="wsite-multicol-col" style="width:82.513796194302%; padding:0 15px;"> 					 						  <div class="paragraph"><em><span style="color:rgb(123, 123, 123)">&ldquo;And for the support of this Declaration, with a firm reliance on the protection of divine Providence, we mutually pledge to each other our Lives, our Fortunes and our sacred Honor.&rdquo;</span></em></div>   					 				</td>				<td class="wsite-multicol-col" style="width:8.6999719216398%; padding:0 15px;"> 					 						  <div class="wsite-spacer" style="height:50px;"></div>   					 				</td>			</tr> 		</tbody> 	</table> </div></div></div>]]></content:encoded></item><item><title><![CDATA[Killingsworth Re-analysis script]]></title><link><![CDATA[https://www.johnwilcox.org/johns-blog/killingsworth-re-analysis-script]]></link><comments><![CDATA[https://www.johnwilcox.org/johns-blog/killingsworth-re-analysis-script#comments]]></comments><pubDate>Sat, 15 Mar 2025 00:44:13 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">https://www.johnwilcox.org/johns-blog/killingsworth-re-analysis-script</guid><description><![CDATA[I've just posted a piece&nbsp;in Psychology Today about money and happiness. For interested researchers,&nbsp;here is the data used in the analysis, and here is the script used to do the analysis itself. [...] ]]></description><content:encoded><![CDATA[<div class="paragraph">I've just posted <a href="https://www.psychologytoday.com/us/blog/rationality-judgment-and-decision-making/202503/does-money-really-make-us-happier" target="_blank">a piece</a>&nbsp;in <em>Psychology Today </em>about money and happiness. For interested researchers,&nbsp;<a href="https://osf.io/qye4a/" target="_blank">here</a> is the data used in the analysis, and <a href="https://www.johnwilcox.org/uploads/6/3/9/4/63943309/analysis_script.rmd">here</a> is the script used to do the analysis itself.</div>]]></content:encoded></item><item><title><![CDATA[What is a “social impact scholar”?]]></title><link><![CDATA[https://www.johnwilcox.org/johns-blog/what-is-a-social-impact-scholar]]></link><comments><![CDATA[https://www.johnwilcox.org/johns-blog/what-is-a-social-impact-scholar#comments]]></comments><pubDate>Tue, 03 Dec 2024 02:19:29 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">https://www.johnwilcox.org/johns-blog/what-is-a-social-impact-scholar</guid><description><![CDATA[(Reposted from the LSE Impact Blog) Having a positive impact beyond academia can often be seen as a requirement, rather than as a personal orientation to research and its potential to create social change.&nbsp;John E. Wilcox&nbsp;and&nbsp;Brandon Reynante&nbsp;reflect on their experience as social impact scholars and what it means for their research. Our adult lives have been largely devoted to research, so it makes sense to ask, &ldquo;Why do we do it?&rdquo;. Our response to this question may h [...] 
]]></description><content:encoded><![CDATA[<div class="paragraph" style="text-align:left;"><em>(Reposted from the <a href="https://blogs.lse.ac.uk/impactofsocialsciences/2024/11/21/what-is-a-social-impact-scholar/" target="_blank">LSE Impact Blog</a>)<br /><br />Having a positive impact beyond academia can often be seen as a requirement, rather than as a personal orientation to research and its potential to create social change.&nbsp;</em><span>John E. Wilcox&nbsp;</span><em>and&nbsp;</em><span>Brandon Reynante&nbsp;</span><em>reflect on their experience as social impact scholars and what it means for their research.</em><br /><br />Our adult lives have been largely devoted to research, so it makes sense to ask, &ldquo;Why do we do it?&rdquo;. Our response to this question may have changed over time, but our current answer is &ldquo;social impact&rdquo;&mdash;that is, to have a positive impact on society (assuming some understanding of that term).<br /><br />As we use the term, then, a &ldquo;social impact scholar&rdquo; is anyone who undertakes research largely with this objective of social impact in mind.&nbsp;<span>For example, we currently do research on how education can better prepare our societies for severe climate change in the future.</span>&nbsp;<span>We also know many others admirably working on social impact through, for instance, research that aims to improve the justice system or adolescent mental health.</span><br /></div>  <div>  <!--BLOG_SUMMARY_END--></div>  <div class="paragraph"><br /><span style="color:rgb(123, 123, 123)">A social impact scholar differs from what we might call a &ldquo;traditional scholar&rdquo; who may pursue research for a range of other reasons, such as pure intellectual curiosity, filling gaps in the literature or solving puzzles, regardless of whether doing so has any social impact in the broader sense.</span><br /><br /><br /><strong style="color:rgb(123, 123, 123)">Why is social impact scholarship important?</strong><br /><br /><span style="color:rgb(123, 123, 123)">While traditional scholarship is valuable and has its place, there are also various reasons why one might be a social impact scholar. The most obvious reason is that it can benefit society: other things being equal, being a successful social impact scholar by definition results in more good for others, and so it is good to pursue it for that reason. However, another reason is more self-centred: a&nbsp;</span><a href="https://doi.org/10.1016/j.jesp.2018.02.014" target="_blank">range of literature</a><span style="color:rgb(123, 123, 123)">&nbsp;suggests humans can derive happiness and fulfilment from helping and improving the lives of others, so social impact scholarship potentially benefits the scholars themselves as well. Then there are also institutional reasons social impact might matter for scholars: for example, many scholars may need to demonstrate social impact to secure funding or academic positions, as is the case with the UK&rsquo;s Research Excellence Framework.</span><br /><br /><br /><strong style="color:rgb(123, 123, 123)">How is social impact scholarship different?</strong><br /><br /><span style="color:rgb(123, 123, 123)">Defined as such, however, how does being a social impact scholar differ more specifically from traditional scholarship? 
Below is a non-exhaustive list of differences.</span><br /><br /><br /><span style="color:rgb(123, 123, 123)"><em>Difference #1: Success metrics</em></span><br /><br /><span style="color:rgb(123, 123, 123)">For a start, the success metrics of social impact scholarship differ. Traditional scholars might base the success of their endeavours on a range of traditional criteria: how many publications or citations they have, or the prestige of their places of publication or affiliation, for example. For social impact scholars, in contrast, the ultimate success criterion is societal good, and if anything else matters, then it is only in virtue of its (eventual) contribution to this goal. They seek to maximise their&nbsp;</span><em style="color:rgb(123, 123, 123)">social impact factor&nbsp;</em><span style="color:rgb(123, 123, 123)">rather than any traditional impact factor, like an&nbsp;</span><em style="color:rgb(123, 123, 123)">h-index</em><span style="color:rgb(123, 123, 123)">. Although such impact might not be straightforwardly measurable, obviously it could include diverse outcomes ranging from lives saved by reducing medical misdiagnoses to emissions offset by developing climate-mitigation technologies, to use some examples in our own fields. As it turns out, then, maximising social impact has a range of implications for other aspects of such scholarship, as follows.</span><br /><br /><br /><span style="color:rgb(123, 123, 123)"><em>Difference #2: Research agenda formation</em></span><br /><br /><span style="color:rgb(123, 123, 123)">For instance, one&rsquo;s research agenda often forms in a specific way: namely, by identifying some societal good one is aiming for, such as reducing fatal misdiagnoses or delivering more accurate and fair criminal convictions, and then working backwards from that goal to determine what research is necessary to achieve it. This problem orientation can then give rise to other differences from traditional scholarship.</span><br /><br /><br /><span style="color:rgb(123, 123, 123)"><em>Difference #3: Accessibility</em></span><br /><br /><span style="color:rgb(123, 123, 123)">In order for one&rsquo;s research to have a positive impact on some aspect of society, it needs to be accessible to those who can deliver that impact. This can have two more specific implications.</span><br /><span style="color:rgb(123, 123, 123)">First, sometimes the research must be&nbsp;</span><em style="color:rgb(123, 123, 123)">accessible as in comprehensible:</em><span style="color:rgb(123, 123, 123)">&nbsp;it should often be written in such a way that non-specialists can understand what the research is saying and how it is important, for instance. This can in turn require non-technical language, clear examples or detailed explanations which considerably lengthen the work. For instance, some of our work aims to carefully explain concepts, like&nbsp;</span><a href="https://www.psychologytoday.com/intl/blog/rationality-judgment-and-decision-making/202410/when-should-we-trust-our-own-and-others" target="_blank">calibration</a><span style="color:rgb(123, 123, 123)">, which are familiar to specialists but not non-specialists.</span><br /><br /><span style="color:rgb(123, 123, 123)">Second, it needs to be&nbsp;</span><em style="color:rgb(123, 123, 123)">accessible as in available</em><span style="color:rgb(123, 123, 123)">&nbsp;to the relevant audiences. 
These requirements can mean that social impact scholarship requires one&nbsp;</span><a href="https://blogs.lse.ac.uk/impactofsocialsciences/2021/10/11/less-prestigious-journals-can-contain-more-diverse-research-by-citing-them-we-can-shape-a-more-just-politics-of-citation/" target="_blank">to publish in outlets that are not the most prestigious</a><span style="color:rgb(123, 123, 123)">&nbsp;if, for instance, more prestigious outlets are behind inaccessible paywalls or require concision in ways which alienate non-specialists.</span><br /><br /><br /><span style="color:rgb(123, 123, 123)"><em>Difference #4: Collaboration</em></span><br /><br /><span style="color:rgb(123, 123, 123)">Another difference of social impact scholarship is that it often benefits from&nbsp;</span><em style="color:rgb(123, 123, 123)">collaboration</em><span style="color:rgb(123, 123, 123)">&nbsp;with others, both so that others can contribute to the quality of the research and also so that others will be more likely to implement it for social good in their own contexts. Such collaboration may not only be with academics; it may be with public servants, volunteers or other impact-makers in the relevant domains. In this way, social impact scholarship is similar to community-engaged research and transdisciplinary research</span><em style="color:rgb(123, 123, 123)">.</em><span style="color:rgb(123, 123, 123)">&nbsp;In some areas, such as psychology, such collaboration is common, but sometimes it conflicts with aims of traditional scholarship. In traditional scholarship, for instance, if one wants to get recognition in their field for their work, then single-authorship is the best indication of their own unique contributions.</span><br /><br /><br /><strong style="color:rgb(123, 123, 123)">How is social impact scholarship&nbsp;<em>not&nbsp;</em>different?</strong><br /><br /><span style="color:rgb(123, 123, 123)">That said, there are also arguably a number of ways in which social impact scholars should&nbsp;</span><em style="color:rgb(123, 123, 123)">not&nbsp;</em><span style="color:rgb(123, 123, 123)">be different to traditional scholars. For a start, they should have no less rigour than other scholars; after all, rigour is often necessary to form accurate judgments about what really drives positive societal impact. Additionally, social impact scholars shouldn&rsquo;t necessarily esteem themselves as being better people than traditional scholars: regardless of the accuracy of such an estimation, arrogance is arguably counter-productive, because if the aim of social impact scholarship is working with others to achieve social good, then a big ego may only alienate others.</span><br /><br /><span style="color:rgb(123, 123, 123)">Of course, some might accuse social impact scholars of merely virtue signalling or pursuing impact for impure reasons, such as to enhance one&rsquo;s own reputation. However, such accusations can be welcomed as tests of one&rsquo;s purity if one still pursues social impact&nbsp;</span><em style="color:rgb(123, 123, 123)">even though doing so invites accusations which might ironically undermine one&rsquo;s reputation</em><span style="color:rgb(123, 123, 123)">. 
After all, what matters is the impact, not one&rsquo;s reputation or what others think of it.</span><br /><span style="color:rgb(123, 123, 123)">&#8203;</span><br /><span style="color:rgb(123, 123, 123)">That, then, is our take on social impact scholarship, one that we hope might help others fruitfully reflect on whether or how they might like to make an impact with their research.</span></div>]]></content:encoded></item><item><title><![CDATA[Why being a JDM scholar is sometimes hard—really hard]]></title><link><![CDATA[https://www.johnwilcox.org/johns-blog/why-being-a-jdm-scholar-is-sometimes-hard-really-hard]]></link><comments><![CDATA[https://www.johnwilcox.org/johns-blog/why-being-a-jdm-scholar-is-sometimes-hard-really-hard#comments]]></comments><pubDate>Sat, 07 Sep 2024 06:26:32 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">https://www.johnwilcox.org/johns-blog/why-being-a-jdm-scholar-is-sometimes-hard-really-hard</guid><description><![CDATA[TL;DR KEY POINTS  Judgment and decision making scholars&mdash;or "JDM scholars", for short&mdash;study how to improve judgment and decision making in diverse domains. Despite their motivation or ability to help others in these domains, life can sometimes be tough for them, and for three reasons: People sometimes think JDM scholars are&nbsp;useless because they are generalists who others might allege lack "specialization" or "experience" in their own&nbsp;specific domain. People sometimes&nbsp;think J [...] ]]></description><content:encoded><![CDATA[<h2 class="wsite-content-title"><span style="color:rgb(0, 0, 0)">TL;DR KEY POINTS</span></h2>  <div class="paragraph"><ul><li><strong><em>Judgment and decision making scholars</em></strong><strong style="color:rgb(123, 123, 123)"></strong><span style="color:rgb(123, 123, 123)">&mdash;</span><strong style="color:rgb(123, 123, 123)"></strong><strong><em>or "JDM scholars", for short</em></strong><strong style="color:rgb(123, 123, 123)"></strong><span style="color:rgb(123, 123, 123)">&mdash;</span><strong style="color:rgb(123, 123, 123)"></strong><strong><em>study how to improve judgment and decision making in diverse domains</em></strong></li><li><strong><em>Despite their motivation or ability to help others in these domains, life can sometimes be tough for them, and for three reasons:</em></strong><ol><li><strong><em>People sometimes think JDM scholars are&nbsp;useless because they are generalists who others might allege lack "specialization" or "experience" in their own&nbsp;specific domain</em></strong></li><li><strong><em>People sometimes&nbsp;think JDM scholars are wrong&nbsp;because those people are psychologically programmed to think in intuitively compelling but demonstrably unreliable ways which JDM scholars are trained to avoid (even though JDM scholars are&nbsp;sometimes wrong too)</em></strong></li><li><strong><em>People sometimes hate JDM scholars because the JDM scholars think those people are wrong, and sometimes in ways with negative consequences</em></strong></li></ol></li><li><strong><em>Nevertheless, JDM scholars might make life easier for themselves in a couple of ways:</em></strong><ol><li><strong><em>&#8203;By finding so-called "JDM champions" who have the time, patience and open-minded humility to advocate JDM science in their domain</em></strong></li><li><strong><em>By aiming to make their communications accessible and tactful<span style="color:rgb(123, 123, 123)">&mdash;specifically by avoiding criticism and by 
fostering&nbsp;positivity</span></em></strong>&#8203;</li></ol></li></ul></div>  <div><div style="height: 20px; overflow: hidden; width: 100%;"></div> <hr class="styled-hr" style="width:100%;"></hr> <div style="height: 20px; overflow: hidden; width: 100%;"></div></div>  <div class="paragraph"><font color="#2a2a2a"><strong><font size="5">THE BACKGROUND</font></strong><br /></font><br /></div>  <div class="paragraph">&#8203;My life is great in a lot of ways: I&rsquo;m lucky to have a wonderful family and many projects and activities which I enjoy, for example. Thankfully, then, I'm very happy with my life overall.&nbsp;<br /><br />But despite that, one thing about my life can at times be very difficult and frustrating: being a judgment and decision making scholar.&nbsp;<br /><br />A judgment and decision making scholar&mdash;or a &ldquo;JDM scholar&rdquo; for short&mdash;is someone who professionally studies judgment and decision making: that is, someone who studies how we do, or how we should, make judgments and decisions.&nbsp; I&rsquo;m a JDM scholar because I want to improve judgments and decisions, both my own and those of others in their domains.<br /><br />And there are many domains where judgment and decision making could be improved: a JDM scholar could reduce <a href="https://www.pnas.org/doi/abs/10.1073/pnas.1306417111" target="_blank">false death sentence convictions</a> in law, <a href="https://jamanetwork.com/journals/jama/article-abstract/1845204#google_vignette" target="_blank">fatal misdiagnoses</a> in medicine or disastrous policies in politics, to take a few of countless examples.&nbsp;<br /><br />But despite both the possibility and promise of improving judgment and decision making in these domains, there are many reasons why this can be difficult or impossible for JDM scholars.<br /><br />I will explore some of them here, as well as why I think they sometimes stem from assumptions that are specious&mdash;that is, superficially plausible but actually wrong.<br /><br />(A caveat, though: while this blogpost talks of "people", it is not written with any specific "people" in mind<span style="color:rgb(123, 123, 123)">&mdash;</span>unless otherwise stated.)<br />&#8203;</div>  <div>  <!--BLOG_SUMMARY_END--></div>  <div class="paragraph"><font color="#2a2a2a">&#8203;<br /><strong><font size="5">REASON #1: PEOPLE MAY THINK YOU ARE USELESS</font></strong></font><br /><br /></div>  <div class="paragraph">&#8203;JDM scholars are often generalists: they study how to improve judgment and decision making in ways that are "domain general"&mdash;that is, applicable across many domains rather than just one.<br /><br />However, judgment and decision making, as it happens in practice, always happens within a specific domain, whether it is medicine, law, politics or something else.&nbsp; There is simply no such thing as &ldquo;applied judgment and decision making&rdquo; without it &ldquo;applying&rdquo; to some specific domain.&nbsp;<br /><br />But because of this, any time a JDM scholar aims to improve JDM in a given domain, those within the domain could criticize the scholar as lacking any specialization in that domain.&nbsp; They might say, &ldquo;She&rsquo;s not a doctor&rdquo;, or &ldquo;a lawyer&rdquo;, or &ldquo;a political analyst&rdquo;, for example.&nbsp;<br /><br />Concomitant with this is often the claim that the JDM scholar lacks &ldquo;experience&rdquo; in the relevant domain, where experience is typically operationalized in terms of the number of years 
working in the given domain.&nbsp;<br /><br />This, then, is one reason why being a JDM scholar is difficult. Sometimes this criticism is right: it's well known that meteorologists, for example, are often accurate in their domain, and it's not obvious to me that JDM scholars generally have much to contribute there.<br /><br />But sometimes this criticism is specious.&nbsp;<br /><br />First, the available evidence suggests years of experience is a weak predictor of judgment accuracy in some domains; for example, <a href="https://press.princeton.edu/books/hardcover/9780691178288/expert-political-judgment" target="_blank">Philip Tetlock</a> found that years of experience in geopolitical forecasting didn&rsquo;t predict accuracy at all, while another study found it had only a mild influence. The problem with some (but not all) domains is that<span style="color:rgb(123, 123, 123)">&mdash;as I argue <a href="https://link.springer.com/book/10.1007/978-3-031-19205-0" target="_blank">elsewhere</a>&mdash;</span>they do not rigorously track judgment or decision quality, nor what correlates with it. It is analogous to how pre-modern physicians practiced bloodletting for millennia without rigorously studying whether it worked. Only with the development of modern medical science did we realize what truly worked and what didn&rsquo;t.<br /><br />The same point holds with judgment and decision making, and that brings me to a second point: the same literature shows that other things aside from domain experience or specialization matter more. For example, the <a href="https://link.springer.com/chapter/10.1007/978-3-031-30085-1_6" target="_blank">literature suggests</a> far stronger predictors of accuracy in domains like geopolitics are past records of accuracy, <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3779404" target="_blank">use of reference classes</a> and other <a href="https://link.springer.com/book/10.1007/978-3-031-19205-0" target="_blank">domain general features</a> that are studied by JDM scholars.&nbsp;<br /><br />So although the JDM scholar often lacks the specialization or personal experience for a specific domain, what they do have is insight into the decades of rigorous scientific research about how to best make judgments and decisions, and sometimes in the specific domains too.<br /><br />Consequently, while it is sometimes correct to claim that a JDM scholar is useless simply because they are unspecialized or inexperienced, as with meteorology, there are other cases where this claim is like a medieval bloodletting doctor claiming that a modern medical student is useless because the doctor has years of experience and specialization in practices whose efficacy has not been rigorously studied.<br /><br />Unfortunately, however, it gets worse than that for another reason&hellip;<br />&#8203;</div>  <div class="paragraph">&#8203;<br /><strong><font color="#2a2a2a" size="5">REASON #2: PEOPLE MAY THINK YOU ARE WRONG</font></strong><br /><br /></div>  <div class="paragraph">&#8203;Not only will people think you are useless, but people will think you are wrong.&nbsp;<br /><br />Of course, sometimes the reason for this is that JDM scholars are wrong, and as JDM scholars, we have to be genuinely open to letting others correct our views through open-minded debate. 
However, a lot of the time people think you are wrong for another reason: namely, that humans are programmed to think in ways which are intuitively compelling but demonstrably unreliable.&nbsp;<br /><br />This comes up in <a href="https://link.springer.com/book/10.1007/978-3-031-19205-0" target="_blank">countless ways</a>&mdash;in availability biases, in representativeness biases and in many others.&nbsp; In my own work, it comes up especially strongly for the so-called <a href="https://www.johnwilcox.org/johns-blog/how-to-recognize-good-evidence-and-find-the-truth-where-others-cannot-identifying-and-overcoming-likelihood-neglect-bias" target="_blank">Monty Hall problem</a>&mdash;a brain-teaser where my experiments found that <em>every </em>untrained participant got the wrong answer. And when I taught logic at Stanford, I clearly remember one of my own students insisting<span style="color:rgb(123, 123, 123)">&mdash;</span><em>so strongly</em>&mdash;on why I was wrong about the correct answer to the very problem which I specialized in!<br /><br />She, however, was one of my better critics: although she thought I was wrong, she was at least willing to engage and argue with me about why that was so.<br /><br />Much of the time, people simply assume you&rsquo;re wrong from a distance, foreclose any possibility of questioning or challenging that assumption, and then they move on.<br /><br />This is problematic, since sometimes seeing the correct answer or reasoning in a given context requires the time, patience and open-mindedness to debate or undertake some training&mdash;and even then it might not work. For example, only a third of the trained participants in my experiments got the right answers to the Monty Hall problem (and some of them probably cheated even then!).<br /><br />Note that this differs from other specializations. For cryptologists or engineers, for instance, it&rsquo;s obvious to the non-specialist that they themselves do not know what the right answer is; the average person would have no idea how to decode a given string of symbols or how to engineer a rocket ship. It&rsquo;s different in JDM research because the non-specialist is psychologically programmed to believe they already know what the right judgment or decision is; as <a href="https://osf.io/preprints/psyarxiv/atjve/download" target="_blank">David Mandel puts it</a>, they already have an intuitive theory of how to make judgments and decisions.<br /><br />This is not to say those theories are never right; sometimes they are and can contribute beneficial insights, but there are also many ways in which they can potentially be wrong as well. Unfortunately, though, the ways in which these theories could be wrong are not all obvious: if they were, then JDM scholars wouldn&rsquo;t need to study them for years or decades at a time.&nbsp;<br /><br />But unfortunately, it gets even worse than that for another reason&hellip;<br />&#8203;</div>  <div class="paragraph">&#8203;<br /><strong><font color="#2a2a2a" size="5">REASON #3: PEOPLE MAY HATE YOU BECAUSE YOU THINK THEY'RE WRONG</font></strong><br /><br /></div>  <div class="paragraph">&#8203;Not only will some people think you are useless and wrong, but they will sometimes hate you because you think they are wrong.&nbsp; For example, if you are a good JDM scholar, then you <em>have</em> to think uninitiated people are wrong when they first encounter the Monty Hall problem, no matter how confident, competent or qualified they themselves think they are. 
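<br /><br />(For readers who want to check that answer for themselves rather than take anyone&rsquo;s word for it, a quick simulation settles the matter. The sketch below is purely illustrative, not the code from my experiments; it shows why switching doors wins about two-thirds of the time.)<br /><pre>
import random

def switch_win_rate(trials=100000):
    """Estimate how often switching wins the Monty Hall game."""
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)     # the car is behind one of three doors
        choice = random.randrange(3)  # the contestant picks a door at random
        # Monty opens a door hiding a goat that is not the chosen door,
        # so switching wins exactly when the first choice missed the car.
        if choice != car:
            wins += 1
    return wins / trials

print(switch_win_rate())  # prints roughly 0.667: switching wins two-thirds of the time
</pre><br />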
And of course, many of them would dislike you for thinking they are wrong&mdash;both because it insults them and because you appear arrogant. They might also be convinced your qualifications contribute only arrogance to your outlook instead of merit to your views.<br /><br />And unfortunately, this same phenomenon can come up in many other domains, because the patterns of thinking that lead to incorrect or correct reasoning in the Monty Hall and other problems are also potentially present in countless other domains, including medicine, law and many more (as I discuss <a href="https://www.johnwilcox.org/johns-blog/how-to-recognize-good-evidence-and-find-the-truth-where-others-cannot-identifying-and-overcoming-likelihood-neglect-bias" target="_blank">here</a>).<br /><br />Yet not only would they hate you for thinking they are wrong, but they might feel that if you are right about their wrongness, then their job security, self-perceived performance or other things might be at risk&mdash;in which case they could hate you even more. What&rsquo;s more, we could see how the hate would only be worse if you think there may be horrible consequences because of their wrongness&mdash;like how innocent people die because of false diagnoses or false death sentence convictions.<br /><br /></div>  <div class="paragraph"><br /><strong><font size="5" color="#2a2a2a">SUMMARY AND SOLUTIONS?</font></strong>&nbsp;<br /><br /></div>  <div class="paragraph">To summarize, then, being a JDM scholar is hard for various reasons: people may think you&rsquo;re useless because you&rsquo;re &ldquo;unspecialized&rdquo; or &ldquo;inexperienced&rdquo; in their domain; they may think you&rsquo;re wrong because they are psychologically programmed to do so; and they may hate you because you think they are wrong.<br />&nbsp;<br />Consequently, being a JDM scholar might entail that if you try to help others by explaining how their judgments and decisions could be improved in their domain, then they might just think you&rsquo;re a "useless, inexperienced, incorrect and arrogant [insert your preferred personal swearword here]"&mdash;far from an outcome that benefits anyone.<br />&nbsp;<br />That, then, brings me to the final part of this blog: what are the solutions?<br /><br />So far, my answer is mainly, &ldquo;Yeah, I don&rsquo;t know&rdquo;.<br /><br />That said, I surmise two things may help when trying to improve judgment and decision making in a given domain.<br />&nbsp;<br />The first is finding people in that domain with three things: 1) the time, 2) the patience and 3) the open-minded humility to explore how JDM science can benefit them. We might call such people the &ldquo;champions&rdquo; of those domains, like how Brian Nosek uses the term to refer to the advocates of &ldquo;open science&rdquo; in specific disciplines. (In fact, the more I think about it, the more I realize all my JDM complaints might apply to metascience scholars too.) Unfortunately, however, many people lack at least one of the qualities to be champions of JDM science in their domain, and sometimes through no fault of their own&mdash;often we are too busy to take on extra work, for example. 
Whether there are JDM champions in a given domain is not something JDM scholars can easily control, however.<br />&nbsp;<br />The second thing that JDM scholars can do<span style="color:rgb(123, 123, 123)">&mdash;</span>and something that they <em>can</em> control<span style="color:rgb(123, 123, 123)">&mdash;</span>is to make their communications as tactful and accessible as possible.<br />&nbsp;<br />Such tact typically requires passing two tests. The first is the "no criticism" test: for example, no one should read your work and feel criticized by it. Sometimes such criticism is appropriate&mdash;like in a typical philosophy department&mdash;but sometimes it only counter-productively alienates people. The second test is the "positivity" test: people should read your work and feel positive afterwards.<br /><br />And additionally, JDM work should often pass an "accessibility" test: a non-specialist should be able to read your work and understand both what you are saying and why it might add value to their domain.<br />&nbsp;<br />However, these things are not easy, and I myself am guilty of violating these requirements at times.<br />&nbsp;<br />But as JDM scholars, perhaps aspiring to these standards might make life easier for us, as well as improving our likelihood of benefitting those we aspire to help.</div>]]></content:encoded></item><item><title><![CDATA[The seven requirements of highly accurate Bayesians]]></title><link><![CDATA[https://www.johnwilcox.org/johns-blog/the-seven-requirements-of-highly-accurate-bayesians]]></link><comments><![CDATA[https://www.johnwilcox.org/johns-blog/the-seven-requirements-of-highly-accurate-bayesians#comments]]></comments><pubDate>Fri, 09 Aug 2024 07:32:45 GMT</pubDate><category><![CDATA[Bayesianism]]></category><category><![CDATA[Philosophy of Science]]></category><category><![CDATA[Probability and statistics]]></category><guid isPermaLink="false">https://www.johnwilcox.org/johns-blog/the-seven-requirements-of-highly-accurate-bayesians</guid><description><![CDATA[TL;DR key points  We all form judgments about the world, and we need these to be accurate in order to make good decisions. In epistemology and philosophy of science, &ldquo;Bayesianism&rdquo; is the dominant theory about how to form rational judgments. However, not all of us are always Bayesians, and not all Bayesians are always accurate.&nbsp;This post then articulates seven requirements to be an accurate Bayesian, some of which are widely known to some (i.e. requirements 1 to 3) while others may  [...] ]]></description><content:encoded><![CDATA[<h2 class="wsite-content-title"><span><span style="color:rgb(0, 0, 0)">TL;DR key points</span></span></h2>  <div class="paragraph"><ul><li><em><strong>We all form judgments about the world, and we need these to be accurate in order to make good decisions</strong></em></li><li><em><strong>In epistemology and philosophy of science, &ldquo;Bayesianism&rdquo; is the dominant theory about how to form rational judgments</strong></em></li><li><em><strong>However, not all of us are always Bayesians, and not all Bayesians are always accurate&nbsp;</strong></em></li><li><em><strong>This post then articulates seven requirements to be an accurate Bayesian, some of which are widely known to some (i.e. requirements 1 to 3) while others may be less so (i.e. 
requirements 4 to 7)</strong></em></li><li><em><strong>The requirements are as follows:</strong></em></li></ul><em><strong>&nbsp;</strong></em><em style="color:rgb(123, 123, 123)"><strong>&nbsp;&nbsp; &nbsp;&nbsp;</strong></em><em><strong> &nbsp; &nbsp; &nbsp; &nbsp; 1. Assign likelihoods to evidence</strong></em><br /><em style="color:rgb(123, 123, 123)"><strong>&nbsp; &nbsp;</strong></em><em style="color:rgb(123, 123, 123)"><strong>&nbsp;&nbsp; &nbsp;&nbsp;</strong></em><em style="color:rgb(123, 123, 123)"><strong> &nbsp; &nbsp; &nbsp;&nbsp;</strong></em><em><strong>2. Assign prior probabilities to the hypotheses</strong></em><br /><em style="color:rgb(123, 123, 123)"><strong>&nbsp; &nbsp; </strong></em><em style="color:rgb(123, 123, 123)"><strong>&nbsp;&nbsp; &nbsp;&nbsp;</strong></em><em style="color:rgb(123, 123, 123)"><strong>&nbsp; &nbsp; &nbsp;&nbsp;</strong></em><em><strong>3. Update using Bayes&rsquo; theorem</strong></em><br /><em style="color:rgb(123, 123, 123)"><strong>&nbsp; &nbsp; </strong></em><em style="color:rgb(123, 123, 123)"><strong>&nbsp;&nbsp; &nbsp;&nbsp;</strong></em><em style="color:rgb(123, 123, 123)"><strong>&nbsp; &nbsp; &nbsp;&nbsp;</strong></em><em><strong>4. Use calibrated probabilities</strong></em><br /><em style="color:rgb(123, 123, 123)"><strong>&nbsp; &nbsp; &nbsp;</strong></em><em style="color:rgb(123, 123, 123)"><strong>&nbsp;&nbsp; &nbsp;&nbsp;</strong></em><em style="color:rgb(123, 123, 123)"><strong> &nbsp; &nbsp;&nbsp;</strong></em><em><strong>5. Recognize auxiliary hypotheses</strong></em><br /><em style="color:rgb(123, 123, 123)"><strong>&nbsp; &nbsp; &nbsp;</strong></em><em style="color:rgb(123, 123, 123)"><strong>&nbsp;&nbsp; &nbsp;&nbsp;</strong></em><em style="color:rgb(123, 123, 123)"><strong> &nbsp; &nbsp;&nbsp;</strong></em><em><strong>6. Recognize consilience</strong></em><br /><em style="color:rgb(123, 123, 123)"><strong>&nbsp; &nbsp; </strong></em><em style="color:rgb(123, 123, 123)"><strong>&nbsp;&nbsp; &nbsp;&nbsp;</strong></em><em style="color:rgb(123, 123, 123)"><strong>&nbsp; &nbsp; &nbsp;&nbsp;</strong></em><em><strong>7. Be cautious about fallible heuristics<br /></strong></em></div>  <div><div style="height: 20px; overflow: hidden; width: 100%;"></div> <hr class="styled-hr" style="width:100%;"></hr> <div style="height: 20px; overflow: hidden; width: 100%;"></div></div>  <div class="paragraph">&#8203;<strong style="color:rgb(123, 123, 123)"><font color="#2a2a2a" size="5">THE IMPORTANCE OF BAYESIANISM&#8203;</font></strong></div>  <div class="paragraph"><br />As discussed elsewhere, in many important contexts, we need to form accurate judgments about the world: this is true of medical diagnosis and treatment, of legal proceedings, of policy analysis and indeed of myriad other domains. And as <a href="https://link.springer.com/book/10.1007/978-3-031-19205-0" target="_blank">discussed elsewhere</a>, more accurate judgments often mean better decisions, including in contexts where they can be a matter of life and death&mdash;such as medicine and law.<br /><br />In analytic epistemology and philosophy of science, &ldquo;Bayesianism&rdquo; is the dominant theory of how we should form rational judgments of probability. 
Additionally, as <a href="https://www.johnwilcox.org/johns-blog/how-to-recognize-good-evidence-and-find-the-truth-where-others-cannot-identifying-and-overcoming-likelihood-neglect-bias" target="_blank">I discuss elsewhere</a>, Bayesian thinking can help us recognize strong evidence and find the truth in cases where others cannot.&nbsp;<br /><br />But there&rsquo;s ample evidence that humans are not Bayesians, and there are ample arguments that Bayesians can still end up with inaccurate judgments if they start from the wrong place (i.e. the wrong &ldquo;priors&rdquo;).<br /><br />So, given the importance of accurate judgments and given Bayesianism&rsquo;s potential to facilitate such accuracy, how can one be an accurate Bayesian?<br /><br />Here, I argue that there are seven requirements of highly accurate Bayesians (somewhat carrying on the Stephen Covey-styled characterization of rationality which I outlined <a href="https://www.johnwilcox.org/johns-blog/the-seven-irrational-habits-of-highly-rational-people" target="_blank">here</a>). Some requirements will be well-known to relevant experts (such as requirements 1 to 3) while others might be less so (such as requirements 4 to 7). In any case, this post is written for both the expert and novice, hoping to say something unfamiliar to both&mdash;while the familiar remainder can be easily skipped.<br /><br />With that caveat, let us consider the first requirement.<br /><br /></div>  <div>  <!--BLOG_SUMMARY_END--></div>  <div class="paragraph" style="text-align:left;"><font size="5"><span style="color:rgb(123, 123, 123)">&#8203;</span><strong style="color:rgb(123, 123, 123)"><font color="#2a2a2a">THE SEVEN REQUIREMENTS OF HIGHLY ACCURATE BAYESIANS&#8203;</font></strong></font></div>  <div class="paragraph"><br /><em>&nbsp; &nbsp;</em><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp; &nbsp;</em>&#8203;<em> &nbsp;Requirement #1: Assigning Likelihoods&nbsp;</em><br /><br />Bayesianism requires us to consider the likelihood of the evidence we have given various hypotheses. Here, a &ldquo;likelihood&rdquo; is a technical term that refers to the probability of the evidence assuming the truth of some hypothesis that we are uncertain about. Examples include the likelihood of an improvement in your symptoms given the uncertain hypothesis that your medication is effective, the likelihood someone would smile at you given the uncertain hypothesis that they have a crush on you, or the likelihood someone&rsquo;s fingerprints would be at the scene of the crime given the uncertain hypothesis that they are guilty.<br /><br />In another <a href="https://www.johnwilcox.org/johns-blog/how-to-recognize-good-evidence-and-find-the-truth-where-others-cannot-identifying-and-overcoming-likelihood-neglect-bias" target="_blank">blogpost</a>, I discuss evidence that humans sometimes don&rsquo;t do this: they &ldquo;neglect&rdquo; the likelihoods in ways which make them oblivious to even strong evidence when they see it. For example, in one experiment, participants are aware that the evidence is 10 times more likely if one hypothesis is true compared to another, but when they receive that evidence, they assign the hypotheses equal probabilities instead of assigning the correct probability of 91% to one of them. 
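<br /><br />To make that arithmetic concrete, here is a minimal sketch in Python, assuming (as in the experiment just described) two mutually exclusive and exhaustive hypotheses with equal priors and a likelihood ratio of 10:<br /><pre>
def posterior(prior_h1, likelihood_h1, likelihood_h2):
    """Bayes' theorem for two mutually exclusive, exhaustive hypotheses."""
    numerator = likelihood_h1 * prior_h1
    denominator = numerator + likelihood_h2 * (1 - prior_h1)
    return numerator / denominator

# Equal priors, and evidence 10 times more likely under h1 than under h2:
print(posterior(0.5, 0.10, 0.01))  # prints roughly 0.91, i.e. the correct 91%
</pre><br />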
There, I also discuss how this evidence about likelihoods can arise in more realistic and natural settings too.&nbsp;<br /><br /><br /><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp;&nbsp; &nbsp; &nbsp; &nbsp;</em><em>Requirement #2: Assigning Priors&nbsp;</em><br /><br />The second requirement is to assign prior probabilities to hypotheses, especially since sometimes multiple implausible hypotheses can make the evidence likely. For example, suppose you are at home, seemingly alone, and you hear the door slam. This could be quite likely if there is an intruder in your house, in which case it should somewhat raise your probability that there is such an intruder in the house.&nbsp;<br /><br />However, if there is a sufficiently low prior probability of an intruder, and if other explanations have a higher prior probability&mdash;like the wind slamming the door shut&mdash;then the evidence won&rsquo;t necessarily make the intruder hypothesis probable all things considered, even though it somewhat raises its probability.<br />&#8203;<br />This illustrates that when we consider the likelihood of the evidence, we also need to consider the prior probability of the relevant explanations in order to know just how confident we can be in those explanations&mdash;as I do in the three "realistic" examples <a href="https://www.johnwilcox.org/johns-blog/how-to-recognize-good-evidence-and-find-the-truth-where-others-cannot-identifying-and-overcoming-likelihood-neglect-bias" target="_blank">here</a>.<br /><br /><br /><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp;&nbsp; &nbsp; &nbsp; &nbsp;</em><em>Requirement #3: Updating Using Bayes&rsquo; Theorem</em><br /><br />But Bayesianism tells us not just to consider likelihoods and to assign priors, but also to update our probabilities in specific ways once we receive our evidence. This often requires us to use Bayes&rsquo; theorem, since&nbsp;<span style="color:rgb(123, 123, 123)">if the priors and likelihoods are accurate, then <em>the only accurate posterior probability is the one prescribed by Bayes' theorem</em></span><em>.</em>&nbsp;Updating with Bayes' theorem can be complicated, but I give some examples of how this can be done <a href="https://www.johnwilcox.org/johns-blog/how-to-recognize-good-evidence-and-find-the-truth-where-others-cannot-identifying-and-overcoming-likelihood-neglect-bias" target="_blank">here</a>, as well as a calculator <a href="https://www.johnwilcox.org/johns-blog/how-to-calculate-probabilities-the-bayesian-calculator" target="_blank">here</a> to easily facilitate these calculations.<br /><br /><br /><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</em><em>Requirement #4: Using Calibrated Probabilities</em><br /><br />So Bayesianism requires us to assign initial probabilities&mdash;including prior probabilities for hypotheses and likelihoods for evidence&mdash;and to then update our probabilities when we receive that evidence.<br /><br />However, we also need to assign initial probabilities that are in some sense &ldquo;correct&rdquo;&mdash;an issue relating to the notorious &ldquo;problem of the priors&rdquo;. 
After all, if we assign likelihoods and prior probabilities, and these turn out to be inaccurate, then the outputs of Bayes&rsquo; theorem will also be inaccurate.&nbsp;<br /><br /><a href="https://www.johnwilcox.org/johns-blog/when-should-we-trust-judgments-from-ourselves-or-others-the-calibrationist-answer" target="_blank">Elsewhere</a>, I argue for a particular solution to this problem called <em>calibrationism</em>. Calibrationism basically says we should trust probabilities<span style="color:rgb(123, 123, 123)">&mdash;</span>including the input probabilities in Bayesian calculations<span style="color:rgb(123, 123, 123)">&mdash;</span>only if we have evidence that the ways we have assigned them are "well calibrated".&nbsp;&nbsp;<br /><br />Probabilities are well calibrated when they correspond to the objective frequency with which things are true; for example, <a href="https://www.johnwilcox.org/johns-blog/how-can-we-measure-the-accuracy-of-judgments-and-determine-which-ones-to-trust" target="_blank">here</a> I discuss examples of forecasters who assign 97% probabilities to events which happen about 95% of the time&mdash;even events that appear &ldquo;unique&rdquo; and to which it might have seemed we couldn&rsquo;t assign numerically precise probabilities. It is in principle easy to get this evidence about how calibrated we are; I explain how to do so in another post <a href="https://www.johnwilcox.org/johns-blog/how-can-we-measure-the-accuracy-of-judgments-and-determine-which-ones-to-trust" target="_blank">here</a>.<br /><br />The implication is that if we have evidence that we assign calibrated probabilities, and we input these probabilities into Bayesian calculations, then we can likewise trust the outputs of these calculations, even in cases where the calculations instruct us to be highly confident or certain in sometimes counter-intuitive ways (like with the new Monty Hall problem <a href="https://www.johnwilcox.org/johns-blog/how-to-recognize-good-evidence-and-find-the-truth-where-others-cannot-identifying-and-overcoming-likelihood-neglect-bias" target="_blank">here</a>).<br /><br />We can also verify the trustworthiness of outputs from Bayesian calculations in the same calibrationist way too: that is, by seeing whether the resulting probabilities similarly correspond to the frequency with which things are true. (And we can mathematically prove that they will be, under certain assumptions.)<br /><br />There are various well-supported ways to get more accurate and calibrated probabilities, as I discuss <a href="https://link.springer.com/book/10.1007/978-3-031-19205-0" target="_blank">here</a>. One is to use statistics or frequencies where relevant: for example, in assigning prior probabilities about whether the wind or an intruder caused the door to slam, we could consider statistics about how often you or others have experienced intruders, how often the wind slams doors shut in your house and so on. 
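<br /><br />As a toy illustration of turning such rough frequencies into probabilities, here is a minimal sketch in Python; every number in it is invented purely for the illustration:<br /><pre>
# Invented rough frequencies, purely for illustration:
prior_intruder = 1 / 10000     # say, an intruder on one day in ten thousand
p_slam_if_intruder = 0.9       # an intruder would very likely slam a door
p_slam_if_no_intruder = 1 / 3  # the wind slams doors fairly often anyway

numerator = p_slam_if_intruder * prior_intruder
posterior = numerator / (numerator + p_slam_if_no_intruder * (1 - prior_intruder))
print(posterior)  # about 0.00027: raised from the prior, yet still very improbable
</pre><br />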
(Of course, we need not meticulously record these statistics, but even our experience can roughly tell us something about what the statistics would be, such as the wind slamming the door every few days when the windows are open.)<br /><br /><br /><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp; &nbsp;&nbsp; &nbsp; &nbsp;</em><em>Requirement #5: Recognizing Auxiliary Hypotheses and Successful Accommodation</em><br /><br />But being an accurate Bayesian also requires us to understand so-called &ldquo;auxiliary hypotheses&rdquo; and the role they play in trying to accommodate evidence. Auxiliary hypotheses are hypotheses that are distinct from the central ones we care about but which nevertheless may have implications for how we interpret the evidence.<br /><br />For example, <a href="https://www.johnwilcox.org/johns-blog/how-to-recognize-good-evidence-and-find-the-truth-where-others-cannot-identifying-and-overcoming-likelihood-neglect-bias" target="_blank">here </a>I use an example where there is a low likelihood of a person sitting at the table in front of you if they do not have a crush on you.<br /><br />To make this likelihood appear higher, however, you might appeal to the auxiliary hypothesis that, say, that particular table just happens to be their favorite table. Now, we could suppose that this likelihood is higher: if the person does not have a crush on you <em>and</em> that table is their favorite table, then there is a 100% likelihood they would sit at that table. In this case, one might argue the evidence isn&rsquo;t so strong when one can appeal to an auxiliary hypothesis like this.<br /><br />In <a href="https://philarchive.org/archive/WILAHA-8?fbclid=IwAR0XcQqblntWdTUdYILhmu_nbduQ4bfewlnYFVWEP-ryd44h_vZBaVh6fb4" target="_blank">another article</a>, I discuss in detail how to make sense of attempts to accommodate evidence with auxiliary hypotheses like this. However, the main message is that an attempt like this fails if the initial probability of the auxiliary hypothesis is sufficiently low and succeeds if it is sufficiently high.<br /><br />For example, if we have no independent evidence or good reason to think that that particular table is their favorite table rather than any of the 64 others, then we can assign it a 1/65 initial probability. And if that is the case, then this accommodation attempt fails because it attempts to raise the likelihood of the evidence only by appealing to an improbable auxiliary hypothesis.<br /><br />We can prove this using a theorem from that paper called the <em>theorem of successful accommodation</em>. 
The theorem basically says that an auxiliary hypothesis (symbolized with <em>a</em>) and central hypothesis (symbolized with <em>h<sub>1</sub></em>) accommodate some evidence (symbolized with <em>e</em>) as successfully as&mdash;or more successfully than&mdash;some alternative hypothesis <em>h<sub>2</sub></em> just in case:</div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="https://www.johnwilcox.org/uploads/6/3/9/4/63943309/ad-hocness_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph">&#8203;In words, this says that the accommodation attempt is successful just in case the likelihood of the evidence given <em style="color:rgb(123, 123, 123)">h<sub>2</sub></em> is less than or equal to the result of multiplying the initial probability of the auxiliary hypothesis (supposing <em style="color:rgb(123, 123, 123)">h<sub>1</sub></em> is true) by the likelihood of the evidence given <em style="color:rgb(123, 123, 123)">h<sub>1</sub></em> and the auxiliary <em>a</em>.&nbsp;<br /><br />In this case, we could suppose the evidence <em>e</em> is that the person sits at the table in front of you, <em>P(e|<em style="color:rgb(123, 123, 123)">h<sub>2</sub></em>)</em> is the likelihood they would sit at that table (denoted with <em>e</em>) if they have a crush on you (denoted with <em style="color:rgb(123, 123, 123)">h<sub>2</sub></em>), <em>P(e|<em style="color:rgb(123, 123, 123)">h<sub>1</sub></em>&amp;a)</em> is the likelihood they would sit at that table if they did not have a crush on you (denoted with <em style="color:rgb(123, 123, 123)">h<sub>1</sub></em>) but that table was still their favorite table (denoted with <em>a</em>) and <em>P(a|<em style="color:rgb(123, 123, 123)">h<sub>1</sub></em>)</em> is the probability that that table is their favorite table (assuming they do not have a crush on you).&nbsp;<br /><br />And if we suppose that we have no reason to think that table is their favorite table, then <em>P(a|<em style="color:rgb(123, 123, 123)">h<sub>1</sub></em>)</em> is low; plugging in some probabilities as below, we can see that the evidence strongly favors the hypothesis that they have a crush on you:</div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="https://www.johnwilcox.org/uploads/6/3/9/4/63943309/ad-hocness-example_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph">In this way, we could use these theorems and principles to determine when appealing to auxiliary hypotheses successfully accommodates the evidence in this and countless other cases.
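<br /><br />To see the theorem in action, here is a minimal Python sketch that checks the success condition using illustrative numbers like those above (the 70% likelihood of sitting there given a crush is an assumed value):<br /><pre><code># Success condition from the theorem: P(e|h2) &lt;= P(a|h1) * P(e|h1 &amp; a)
p_e_given_h2 = 0.70        # assumed likelihood they sit there given a crush (h2)
p_a_given_h1 = 1 / 65      # favorite-table auxiliary with no independent support
p_e_given_h1_and_a = 1.0   # they sit there if no crush but it is their favorite table

threshold = p_a_given_h1 * p_e_given_h1_and_a   # about 0.015
print(p_e_given_h2 &lt;= threshold)   # False: the accommodation attempt fails
</code></pre>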
<br /><br />&#8203;<br /><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp; &nbsp;</em><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp; &nbsp;</em><em>Requirement #6: Recognizing Consilience</em><br /><br />Auxiliary hypotheses play an especially important role when a central hypothesis is potentially supported by multiple kinds of evidence&mdash;in which case we say the central hypothesis &ldquo;consiliates&rdquo; the evidence. I also discuss consilience in detail in that <a href="https://philarchive.org/archive/WILAHA-8?fbclid=IwAR0XcQqblntWdTUdYILhmu_nbduQ4bfewlnYFVWEP-ryd44h_vZBaVh6fb4" target="_blank">article</a>.&nbsp;<br /><br />There, a main takeaway is that the evidence will sometimes strongly support a hypothesis which explains multiple kinds of evidence if the probabilities of the alternative hypotheses are sufficiently low&mdash;even if each alternative hypothesis seems potentially plausible when considered in isolation.<br /><br />I use one example to illustrate this: Darwin&rsquo;s arguments in favor of evolution over special creationism (the idea that God created each species in a separate creative act).<br /><br />Darwin thought evolution was probable because it could explain diverse evidence: <ul><li>on his theory, some so-called &ldquo;non-aquatic&rdquo; birds had webbed feet even though they didn&rsquo;t go in or near water because they evolved from other birds with webbed feet that were in the water; <br /></li><li>many animals (such as frogs and lizards) share similar bone structures because they evolved from common ancestors; <br /></li><li>blind insects inside European caves are more similar to those outside of their caves than to insects in similar caves elsewhere in the world, and this is because they evolved from the insects immediately outside their caves; </li></ul>...and so on and so forth.<br /><br />However, Darwin noted that the special creationist could in principle offer another explanation for each piece of evidence: that God simply desired it to be that way&mdash;that God simply desired the similarities in webbed feet among specific birds, the similarities in bone structures among some animals, and the similarities among some insects&mdash;even though he created them all separately.<br /><br />As I discuss in the article, the problem with the creationist explanation is that, for each piece of evidence, it appeals to an auxiliary assumption about God&rsquo;s desires that&mdash;while potentially plausible in isolation&mdash;is still less than certain. And combining uncertainty with uncertainty results in even more uncertainty.&nbsp;<br /><br />We can use an analogy to illustrate the point: if each auxiliary hypothesis has the probability of a coin flip landing heads&mdash;50%&mdash;then the probability they are all true is <em>much lower</em>, even though they each might seem plausible or possible in isolation. For example, the probability of three coin-flips landing heads is 50%&times;50%&times;50%=12.5%, and one can be 87.5% confident that not all coin flips landed heads.&nbsp;<br /><br />Appealing to multiple independent auxiliary hypotheses to accommodate evidence likewise faces the same problem. If those auxiliary hypotheses are not sufficiently supported, then the appeal will fail, regardless of whether it&rsquo;s auxiliaries about whether creationism is true, or about whether someone does not have a crush on you, or about something else&mdash;even when those auxiliary hypotheses might seem plausible in isolation.
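<br /><br />As a quick illustration of how stacking auxiliaries erodes probability, here is a tiny Python calculation using the coin-flip numbers from above:<br /><pre><code>import math

# Three independent auxiliary hypotheses, each individually 'plausible' at 50%
aux_probs = [0.5, 0.5, 0.5]

p_all_true = math.prod(aux_probs)
print(p_all_true)       # 0.125: only a 12.5% chance that all three hold
print(1 - p_all_true)   # 0.875: 87.5% confident at least one of them fails
</code></pre>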
And when this is the case, then the evidence can favor a consiliating alternative explanation&mdash;like evolution&mdash;that explains multiple kinds of evidence.<br /><br /><br /><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp; &nbsp;</em><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp; &nbsp;</em><em>Requirement #7: Caution about Pervasive but Fallible Heuristics</em><br /><br />Above, I&rsquo;ve discussed many ideas about how to assign probabilities, but there&rsquo;s ample evidence that we often assign probabilities using so-called <em>heuristics</em>&mdash;that is, efficient but often sub-optimal thinking processes. And while sometimes useful, they can cause us to assign inaccurate probabilities.<br /><br />Below, then, is a set of heuristics to be cautious of:<br /><br /></div>  <div><div class="wsite-multicol"><div class="wsite-multicol-table-wrap" style="margin:0 -15px;"> 	<table class="wsite-multicol-table"> 		<tbody class="wsite-multicol-tbody"> 			<tr class="wsite-multicol-tr"> 				<td class="wsite-multicol-col" style="width:12.08519068846%; padding:0 15px;"> 					 						  <div class="wsite-spacer" style="height:50px;"></div>   					 				</td>				<td class="wsite-multicol-col" style="width:77.180234449699%; padding:0 15px;"> 					 						  <div class="paragraph"><br /><strong style="color:rgb(123, 123, 123)">Availability heuristic:</strong><span style="color:rgb(123, 123, 123)">&nbsp;People often assign probabilities or estimates on the basis of how mentally available something is&mdash;that is, how readily relevant instances come to mind. For example, in one study, participants thought more people died of floods than asthma, arguably because dying by floods is more mentally available than dying by asthma and arguably because floods are more prevalent in the media. However, the reverse was true: asthma actually claimed 9 times as many lives as floods.</span><br /><br /><strong style="color:rgb(123, 123, 123)">Representativeness</strong><span style="color:rgb(123, 123, 123)">&nbsp;<strong>heuristic:</strong> People often assign a probability to something based on how similar that thing seems to a typical instance of the relevant category. For example, multiple studies find many participants think a woman named Linda is more likely to be a feminist bank teller than a bank teller in general who may or may not be a feminist. But this is incorrect since Linda cannot be more likely to be in a smaller group of people (such as feminist bank tellers) than in a larger group which contains that smaller group (such as all bank tellers in general). To make it more concrete, imagine there are 30,000 feminist bank tellers and 60,000 bank tellers in general (which comprises the 30,000 feminist bank tellers plus the 30,000 non-feminist bank tellers). Clearly Linda cannot be more likely to be in the smaller group of 30,000 feminist bank tellers than in the group of 60,000 bank tellers which includes that smaller group.</span><br /><br /><strong style="color:rgb(123, 123, 123)">Consistency or chance heuristic:</strong><span style="color:rgb(123, 123, 123)">&nbsp;While perhaps not as widely documented, my experiments suggest participants often evaluate the probability of competing hypotheses based on whether they are each consistent with the evidence: if, in principle, either hypothesis is potentially true given the evidence&mdash;and they are therefore consistent with the evidence in that respect&mdash;then assign them an equal probability, or so the thinking goes.
Participants might further think that if something is possible given chance&mdash;or consistent with it&mdash;then it&rsquo;s just as probable in light of the evidence. This seems to be what people are doing in the <a href="https://www.johnwilcox.org/johns-blog/how-to-recognize-good-evidence-and-find-the-truth-where-others-cannot-identifying-and-overcoming-likelihood-neglect-bias" target="_blank">Monty Hall experiments</a> where they assign the hypotheses equal but inaccurate probabilities: door C being opened is consistent with door A or door B concealing the prize, and Monty Hall could have just happened to open door C by chance, even if there was a 10% likelihood of him doing so given door A concealing the prize, so the hypotheses are equally probable (or so the thinking goes). As we have seen, though, this thinking leads to the wrong answer.</span><br /><br /></div>   					 				</td>				<td class="wsite-multicol-col" style="width:10.734574861842%; padding:0 15px;"> 					 						  <div class="wsite-spacer" style="height:50px;"></div>   					 				</td>			</tr> 		</tbody> 	</table> </div></div></div>  <div class="paragraph"><span style="color:rgb(123, 123, 123)">Consequently, we should be careful in thinking something is probable or improbable merely because we can or can&rsquo;t easily call relevant instances to mind (as per availability) or because it resembles typical instances (as per representativeness). We also need to avoid thinking hypotheses are equally probable merely because they are consistent with the evidence.&nbsp;</span><br /><br /><span style="color:rgb(123, 123, 123)">Instead, if we are cautious about these heuristics and meet the other six requirements, then I think we will be highly accurate Bayesians.</span></div>]]></content:encoded></item><item><title><![CDATA[The seven “irrational” habits of highly rational people]]></title><link><![CDATA[https://www.johnwilcox.org/johns-blog/the-seven-irrational-habits-of-highly-rational-people]]></link><comments><![CDATA[https://www.johnwilcox.org/johns-blog/the-seven-irrational-habits-of-highly-rational-people#comments]]></comments><pubDate>Fri, 09 Aug 2024 05:28:58 GMT</pubDate><category><![CDATA[Heuristics and biases]]></category><category><![CDATA[Probability and statistics]]></category><category><![CDATA[Rationality]]></category><guid isPermaLink="false">https://www.johnwilcox.org/johns-blog/the-seven-irrational-habits-of-highly-rational-people</guid><description><![CDATA[(Forthcoming in&nbsp;Psychology Today)&#8203;  TL;DR key points  We all make judgments about the world and then use these to make important decisionsBut if someone made perfectly accurate judgments and sound decisions, would we recognize it?The science and philosophy of judgment and decision-making suggests that the answer is often &ldquo;no&rdquo;This post identifies 7 habits of highly rational people&mdash;habits which others often label as &ldquo;irrational&rdquo;:&nbsp; &nbsp; &nbsp; &nbsp;  [...] 
]]></description><content:encoded><![CDATA[<div class="paragraph"><span style="color:rgb(123, 123, 123)">(Forthcoming in&nbsp;</span><em style="color:rgb(123, 123, 123)">Psychology Today</em><span style="color:rgb(123, 123, 123)">)<br />&#8203;</span><br /></div>  <h2 class="wsite-content-title"><span><span style="color:rgb(0, 0, 0)">TL;DR key points</span></span></h2>  <div class="paragraph"><ul><li><em><strong>We all make judgments about the world and then use these to make important decisions</strong></em></li><li><em><strong>But if someone made perfectly accurate judgments and sound decisions, would we recognize it?</strong></em></li><li><em><strong>The science and philosophy of judgment and decision-making suggests that the answer is often &ldquo;no&rdquo;</strong></em></li><li><em><strong>This post identifies 7 habits of highly rational people&mdash;habits which others often label as &ldquo;irrational&rdquo;:</strong></em></li></ul><em><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;1. Highly rational people are confident in things despite &ldquo;no good evidence&rdquo; for them</strong></em><br /><em style="color:rgb(123, 123, 123)"><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;</strong></em><em><strong>2. They are confident in things which are outright false</strong></em><br /><em style="color:rgb(123, 123, 123)"><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;</strong></em><em><strong>3. They countenance the &ldquo;impossible&rdquo; and are &ldquo;paranoid&rdquo;</strong></em><br /><em style="color:rgb(123, 123, 123)"><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;</strong></em><em><strong>4. They avoid risks that don&rsquo;t happen</strong></em><br /><em style="color:rgb(123, 123, 123)"><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;</strong></em><em><strong>5. They pursue opportunities that fail</strong></em><br /><em style="color:rgb(123, 123, 123)"><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;</strong></em><em><strong>6. They are often irrational&nbsp;</strong></em><br /><em style="color:rgb(123, 123, 123)"><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;</strong></em><em><strong>7. They do things that are often &ldquo;crazy&rdquo; or &ldquo;unconventional&rdquo;&nbsp;</strong></em><ul><li><em><strong>I lastly conclude with some evidence-based suggestions about how we can distinguish the genuinely rational from the irrational:</strong></em></li></ul> <em style="color:rgb(123, 123, 123)"><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;</strong></em><em><strong>1. Measure calibration</strong></em><br /><em style="color:rgb(123, 123, 123)"><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;</strong></em><em><strong>2. Learn norms of reasoning</strong></em><br /><em style="color:rgb(123, 123, 123)"><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;</strong></em><em><strong>3. 
Think in terms of expected utility theory</strong></em><ul><li><em><strong>This might help us to both recognize and make trustworthy judgments and decisions in our lives</strong></em></li></ul></div>  <div><div style="height: 20px; overflow: hidden; width: 100%;"></div> <hr class="styled-hr" style="width:100%;"></hr> <div style="height: 20px; overflow: hidden; width: 100%;"></div></div>  <div class="paragraph"><strong style="color:rgb(123, 123, 123)"><font color="#2a2a2a" size="5">THE IMPORTANCE OF RECOGNIZING WHAT'S RATIONAL AND WHAT'S NOT&#8203;</font></strong></div>  <div class="paragraph">&#8203;<br />If someone were as rational as could be&mdash;with many accurate and trustworthy judgments about the world, and with sound decisions&mdash;would we recognize it? There are reasons to think the answer is &ldquo;No&rdquo;. In this piece, I aim to challenge prevailing intuitions about rationality: I will argue that the philosophy and science of judgment and decision-making reveal a number of ways in which what <em>appears</em> to be rational diverges from what <em>actually is</em> rational.&nbsp;<br /><br />This piece takes its title from Stephen Covey&rsquo;s well-known book &ldquo;The Seven Habits of Highly Effective People&rdquo;. I will argue that, similarly, there are seven habits of highly rational people&mdash;but these habits can appear so counter-intuitive that others label them as &ldquo;irrational&rdquo;. Of course, the rationality of these habits might be obvious to specialists in judgment and decision-making, but I find they are often not so obvious to the broader audience for whom this piece is written.<br /><br />In any case, not only are these habits potentially interesting in their own right, but recognizing them may also open our minds, help us better understand the nature of rationality and better identify the judgments and decisions we should trust&mdash;or not trust&mdash;in our own lives.<br /><br />Without further ado, then, I present&hellip;</div>  <div class="paragraph"><br /><font size="5">&#8203;<strong style="color:rgb(123, 123, 123)"><font color="#2a2a2a">THE SEVEN "IRRATIONAL" HABITS OF HIGHLY RATIONAL PEOPLE<br /></font></strong></font></div>  <div class="paragraph">&nbsp;<br />&#8203;<em style="color:rgb(123, 123, 123)">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1. Highly rational people are confident in things despite &ldquo;no good evidence&rdquo; for them</em><br /><br /><span style="color:rgb(123, 123, 123)">The first habit of highly rational people is that they are sometimes confident in things when others think there is &ldquo;no good evidence&rdquo; for them.</span></div>  <div>  <!--BLOG_SUMMARY_END--></div>  <div class="paragraph"><br />One case where this shows up extremely clearly is the Monty Hall problem, as I discuss in detail in a blogpost <a href="https://www.johnwilcox.org/johns-blog/how-to-recognize-good-evidence-and-find-the-truth-where-others-cannot-identifying-and-overcoming-likelihood-neglect-bias" target="_blank">here</a>.<br /><br />In the problem, a prize is randomly placed behind one of three doors, you select a door and then the gameshow host&mdash;named Monty Hall&mdash;will open one of the other doors that does not conceal the prize. If you select a door and that door conceals the prize, then Monty Hall will open either of the other two doors with an equal likelihood.
But if the door you select does not conceal the prize, then Monty Hall must open the only other door that does not conceal the prize and which you did not select.<br /><br />In these circumstances, as I explain in the <a href="https://www.johnwilcox.org/johns-blog/how-to-recognize-good-evidence-and-find-the-truth-where-others-cannot-identifying-and-overcoming-likelihood-neglect-bias" target="_blank">blogpost</a>, if you select door A and Monty Hall opens door C, then there&rsquo;s a 2/3 probability that door B conceals the prize. In this case, then, door C being opened constitutes &ldquo;evidence&rdquo; that door B conceals the prize. Furthermore, let us consider an adaptation called the &ldquo;new Monty Hall problem&rdquo;: in this case, door C would be opened with a 10% likelihood if door A conceals the prize, in which case there&rsquo;s provably a 91% probability that door B conceals the prize after door C is opened. In this version, the truly rational response is to be very confident that door B conceals the prize.<br /><br />But despite this, in my experiments, <em>everyone</em> without training who encountered these problems got the wrong answer, and the vast majority thought door B had only a 50% probability of concealing the prize in both versions of the problem. This effectively means they thought there was no good evidence for door B concealing the prize when there in fact was!<br /><br />What&rsquo;s more, <a href="https://www.cambridge.org/core/journals/judgment-and-decision-making/article/likelihood-neglect-bias-and-the-mental-simulations-approach-an-illustration-using-the-old-and-new-monty-hall-problems/B80822FC020A13C4A83E293C6120492E" target="_blank">the studies</a> found that not only did participants not recognize this good evidence, but they were also more confident in their incorrect answers. Compared to participants who were trained and more likely to get the correct answers to these problems, the other participants were on average both more confident in the correctness of their (actually incorrect) answers and more convinced that they understood why their (actually incorrect) answers were correct.<br /><br />What this shows is that truly rational people may recognize objectively good evidence for hypotheses where others think there is none&mdash;leading them to be confident in things in ways which others think are irrational. In the <a href="https://www.johnwilcox.org/johns-blog/how-to-recognize-good-evidence-and-find-the-truth-where-others-cannot-identifying-and-overcoming-likelihood-neglect-bias" target="_blank">blogpost</a>, I also discuss some more realistic scenarios where this could in principle occur<span style="color:rgb(123, 123, 123)">&mdash;</span>including some from medicine, law and daily life.<br /><br /><br /><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</em><em>2. They are confident in things that are outright false</em><br /><br />But even if someone is rationally confident in something, that thing might be false a particular proportion of the time.<br /><br />In fact, according to one norm of trustworthy judgments, a perfectly accurate person would often be 90% confident in things that are false approximately 10% of the time.
In other words, a perfectly accurate person would be &ldquo;well calibrated&rdquo; in the rough sense that, in normal circumstances, anything they assign 90% probabilities to will be true approximately 90% of the time, anything they assign 80% probabilities to will be true approximately 80% of the time and so on.<br /><br />We can see this when we look at well calibrated forecasters who might assign high probabilities to a bunch of unique events, and while most of those will happen, some of them will not&mdash;as I discuss in detail <a href="https://www.johnwilcox.org/johns-blog/how-can-we-measure-the-accuracy-of-judgments-and-determine-which-ones-to-trust" target="_blank">here</a>. Yet if we focus on a small sample of cases, they might look less rational than they are since they will be confident in things that are outright false.<br /><br /><br /><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</em><em>3. They countenance the &ldquo;impossible&rdquo; and are &ldquo;paranoid&rdquo;</em><br /><br />However, <a href="https://link.springer.com/book/10.1007/978-3-031-19205-0" target="_blank">studies</a> suggest many people&mdash;including experts with PhDs about their domain, doctors, jurors and the general public&mdash;are not so well calibrated. One example of this is <em>miscalibrated certainty</em>&mdash;that is, when people are certain (or virtually certain) of things which turn out to be false.<br /><br />For instance, <a href="https://press.princeton.edu/books/hardcover/9780691178288/expert-political-judgment" target="_blank">Philip Tetlock</a> tracked the accuracy of a group of political experts&rsquo; long-term predictions, and he found that out of all the things they were 100% certain would <u><strong>not</strong></u> occur, those things actually <u><strong>did</strong></u> occur 19% of the time. <a href="https://psycnet.apa.org/record/2011-15298-010" target="_blank">Other studies</a> likewise suggest people can be highly confident in things which are false a significant portion of the time.<br /><br />But a perfectly rational person wouldn&rsquo;t be so miscalibrated about these things which others are certain about, and so they would assign higher probabilities to things which others would think are &ldquo;impossible&rdquo;. For example, a perfectly calibrated person would perhaps assign 19% probabilities to the events which Tetlock&rsquo;s experts were inaccurately certain would not happen&mdash;or they might even assign some of them much higher probabilities, like 99%, if they had sufficiently good evidence for them. 
In such a case, the perfectly rational person would look quite &ldquo;irrational&rdquo; from the perspective of Tetlock&rsquo;s experts.<br /><br />But insofar as miscalibrated certainty is widespread among experts or the general public, so too would be the perception that truly rational people are &ldquo;irrational&rdquo; in virtue of them countenancing what others irrationally consider to be &ldquo;improbable&rdquo; at best or &ldquo;impossible&rdquo; at worst.<br /><br />Furthermore, when one has miscalibrated certainty about outcomes that are &ldquo;bad&rdquo;, not only will a rational person look like they believe in the possibility of &ldquo;impossible&rdquo; outcomes, but the rational person will look irrationally &ldquo;paranoid&rdquo; in doing so since the supposedly &ldquo;impossible&rdquo; outcomes are bad.<br /><br /><br /><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;</em><em style="color:rgb(123, 123, 123)">&nbsp;</em><em>4. They avoid risks that don&rsquo;t happen</em><br /><br />But not only will a rational person look &ldquo;irrational&rdquo; or &ldquo;paranoid&rdquo; in virtue of thinking the &ldquo;impossible&rdquo; is possible or even (probably) true, but they will also act to reduce risks which never actually happen.<br /><br />This is because our <a href="https://plato.stanford.edu/entries/rationality-normative-utility/" target="_blank">leading theory of rational decision-making</a> claims that we should make decisions not just based on how probable or improbable outcomes are, but rather based on the so-called <em>expected utility</em> of those outcomes, where the expected utility of an outcome is the product of both its probability and how good or bad it is.<br /><br />This sometimes entails that we should avoid decisions if there is an improbable chance of something really bad happening. For example, it can be rational to avoid playing Russian roulette even if the gun is unlikely to fire a bullet, simply because the off-chance of a bullet firing is <em>so bad</em> that this outcome has a highly negative expected utility.&nbsp;<br /><br />Likewise, for many other decisions in life, it may be rational to avoid decisions if they have improbable outcomes that are sufficiently bad. This consequently means a rational person could often act to avoid many risks that never actually happen.<br /><br />But as is well-known, people often evaluate the goodness of a decision based on its outcome, and if the bad thing does not happen, the average person might evaluate that decision as &ldquo;irrational&rdquo;.&nbsp;<br /><br />This kind of thing could happen quite often too: for every highly negative outcome with a 10% probability which a rational decision-maker avoids, the avoided outcome will simply not happen 90% of the time, potentially making the decision-maker look quite irrational.&nbsp;<br /><br />The situation is even worse if the evaluator has the "miscalibrated certainty" we considered earlier, and the outcome not only does not happen, but rather it looks like it was always &ldquo;impossible&rdquo; from the perspective of the evaluator.
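<br /><br />To make the expected-utility arithmetic concrete, here is a minimal Python sketch of the Russian roulette example (the utility numbers are invented; only their relative sizes matter):<br /><pre><code># Expected utility: sum over outcomes of probability * utility
def expected_utility(outcomes):
    return sum(p * u for p, u in outcomes)

# Playing: a 1-in-6 chance of a catastrophic outcome, else a trivial thrill
play = [(1/6, -1_000_000), (5/6, 10)]
# Declining: nothing happens
decline = [(1.0, 0)]

print(round(expected_utility(play)))   # about -166658: strongly negative
print(expected_utility(decline))       # 0: declining is the rational choice
</code></pre>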
<br /><br /><br /><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</em><em>5. They pursue opportunities that fail</em><br /><br />But the same moral holds not only for decisions that avoid risk, but also for decisions that pursue reward. For example, a rational decision-maker might accept an amazing job offer which has merely a 10% chance of continued employment, provided the prospect of continued employment is sufficiently good. But of course, that decision has a 90% chance of resulting in unemployment, potentially making it seem like a &ldquo;failure&rdquo; if the probable outcome happens.<br /><br />More generally, a rational decision-maker would pursue risky options with 90% chances of failure if the options are sufficiently good all things considered: it is like buying a lottery ticket with a mere 10% chance of winning but with a sufficiently high reward.<br /><br />But again, the rational decision-maker could look highly &ldquo;irrational&rdquo; in the 90% of cases where those decisions lead to the less-than-ideal outcomes.&nbsp;<br /><br />In any case, what both this habit and the preceding one have in common is that rational decision-making requires making decisions that lead to the best outcomes <em>over many decisions in the long-run</em>, but humans often evaluate decision-making strategies based <em>on mere one-off cases</em>.<br /><br /><br /><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp; &nbsp; </em><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp;&nbsp;</em><em>6. They are often irrational</em><br /><br />Despite that, though, arguably any realistic person who is as rational as could be would still be genuinely irrational to some degree. This is because our dominant theory of judgment and decision-making<span style="color:rgb(123, 123, 123)">&mdash;</span><em>dual process theory</em>&mdash;entails that while we often make reflective judgments and decisions, there are countless situations where we do not and simply cannot.&nbsp;<br /><br />Instead, the <a href="https://www.cambridge.org/core/books/heuristics-and-biases/0975337F864379F2729EAD873D804BA8" target="_blank">literature</a> commonly affirms that everyone employs a set of so-called <em>heuristics </em>for judgment and decision-making which&mdash;while often adequate&mdash;also often lead to sub-optimal outcomes. Consequently, even if someone were as rational as could be, they would still make irrational judgments and decisions in countless other contexts where they cannot be expected to rely on their more reflective faculties.<br /><br />If we then focus solely on these unreflective contexts, we would get an inaccurate impression of how rational they are overall.<br /><br /><br /><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp; &nbsp;&nbsp;</em><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp;&nbsp;</em><em>7. They do things that are often &ldquo;crazy&rdquo; or &ldquo;unconventional&rdquo;</em><br /><br />All of the preceding thoughts then entail that rational people may do things that seem &ldquo;crazy&rdquo; or &ldquo;unconventional&rdquo; by common standards: they might believe in seemingly impossible things, or act to reduce risks that never happen, or pursue opportunities that never materialize, and so on. This might express itself in weird habits, beliefs, or in many other ways.<br /><br />But this shouldn&rsquo;t be too surprising. After all, the history of humanity is a history of common practices which later generations appraise as unjustified or irrational.
Large portions of humanity once believed that the earth was flat, that the earth was at the center of the universe, that women were incapable of or unsuited to voting, and so on and so forth.<br /><br />Have we then finally reached the apex of understanding in humanity&rsquo;s evolution, a point where everything we now do and say will appear perfectly rational by future standards? If history is anything to go by, then surely the answer is &ldquo;No&rdquo;. And if that is the case, then perhaps the truly "rational" will be ahead of the rest&mdash;believing or doing things which seem crazy or irrational by our currently common standards.<br />&#8203;</div>  <div class="paragraph"><span style="color:rgb(123, 123, 123)">&#8203;</span><br /><strong style="color:rgb(123, 123, 123)"><font color="#2a2a2a"><font size="5">HOW TO DISTINGUISH THE RATIONAL FROM THE IRRATIONAL</font></font></strong><br /><br /></div>  <div class="paragraph">What I hope to have conveyed, then, is just how frequently our untrained intuitions about what is rational may diverge from what is truly rational: what&rsquo;s rational might appear &ldquo;irrational&rdquo;, and vice versa. In a world where these intuitions might lead us astray, then, how can we really tell rational from irrational, accurate from inaccurate or wisdom from unwisdom?<br /><br />Some common rules of thumb might not work too well. For example, <a href="https://link.springer.com/chapter/10.1007/978-3-031-30085-1_6" target="_blank">sometimes the evidence</a> fails to find that years of experience, age or educational degrees improve accuracy, at least in domains like geopolitical forecasting.&nbsp;<br /><br />What follows, then, are some suggestions which I think are supported by the evidence:<br /><br /><br />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <em>Suggestion #1: Measure calibration</em><br /><br />First, track the calibration of the judgments you care about&mdash;whether yours or others&rsquo;. I provide some tools and ideas for how to do this <a href="https://www.johnwilcox.org/johns-blog/how-can-we-measure-the-accuracy-of-judgments-and-determine-which-ones-to-trust" target="_blank">here</a>. This can help us to put things in perspective, to avoid focusing on single cases and to detect pervasive miscalibration which can afflict our decision-making. And as <a href="https://link.springer.com/chapter/10.1007/978-3-031-30085-1_6" target="_blank">other studies suggest</a>, past accuracy is the greatest predictor of future accuracy.
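<br /><br />For a sense of what tracking calibration can look like, here is a minimal Python sketch (the judgment records are invented for illustration):<br /><pre><code>from collections import defaultdict

# Each record: (stated probability, whether the claim turned out true)
judgments = [(0.9, True), (0.9, True), (0.9, False), (0.9, True),
             (0.6, True), (0.6, False), (0.6, True), (0.6, False)]

# Group judgments by stated probability, then compare each stated
# probability with the observed frequency of truth in that group
bins = defaultdict(list)
for prob, outcome in judgments:
    bins[prob].append(outcome)

for prob in sorted(bins):
    observed = sum(bins[prob]) / len(bins[prob])
    print(f"stated {prob:.0%}: true {observed:.0%} of the time")
# stated 60%: true 50% of the time
# stated 90%: true 75% of the time (overconfident at the 90% level)
</code></pre>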
<br /><br /><br /><span style="color:rgb(123, 123, 123)">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</span><em>Suggestion #2: Learn norms of reasoning</em><br /><br />Additionally, I would suggest learning and practicing various norms of reasoning. These include the evidence-based suggestions for forming more accurate judgments in my book <em><a href="https://link.springer.com/book/10.1007/978-3-031-19205-0" target="_blank">Human Judgment</a></em>, such as practicing active open-minded thinking and thinking in terms of statistics. They also include other norms, such as so-called &ldquo;Bayesian reasoning&rdquo; which can produce more accurate judgments in the Monty Hall problem and potentially other contexts, as I discuss <a href="https://www.johnwilcox.org/johns-blog/how-to-recognize-good-evidence-and-find-the-truth-where-others-cannot-identifying-and-overcoming-likelihood-neglect-bias" target="_blank">here</a>&nbsp;and <a href="https://www.johnwilcox.org/johns-blog/the-seven-requirements-of-highly-accurate-bayesians" target="_blank">here</a>.<br /><br /><br /><span style="color:rgb(123, 123, 123)">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</span><em>Suggestion #3: Think in terms of expected utility</em><br /><br />Finally, when evaluating the rationality of someone&rsquo;s decisions, think in terms of expected utility theory. Expected utility theory is complicated, but <a href="https://plato.stanford.edu/entries/rationality-normative-utility/" target="_blank">here</a> is a potentially helpful introduction to it (and from my former PhD advisor<span style="color:rgb(123, 123, 123)">&mdash;a really awesome person!</span>). In short, though, expected utility theory requires us to ask what probabilities people attach to outcomes, how much they value those outcomes and&mdash;on my preferred version of it&mdash;whether their probabilities are calibrated and their values are in some sense objectively &ldquo;correct&rdquo;. Then, we can ask whether they are making decisions which lead to the best possible outcomes in the long-run.<br /><br />In these ways, I think we can better tell what&rsquo;s rational from what&rsquo;s not in a world where our intuitions can otherwise lead us astray.<br /><br /></div>]]></content:encoded></item><item><title><![CDATA[How to recognize good evidence and find the truth where others cannot: Identifying and overcoming “likelihood neglect bias”]]></title><link><![CDATA[https://www.johnwilcox.org/johns-blog/how-to-recognize-good-evidence-and-find-the-truth-where-others-cannot-identifying-and-overcoming-likelihood-neglect-bias]]></link><comments><![CDATA[https://www.johnwilcox.org/johns-blog/how-to-recognize-good-evidence-and-find-the-truth-where-others-cannot-identifying-and-overcoming-likelihood-neglect-bias#comments]]></comments><pubDate>Fri, 09 Aug 2024 04:59:01 GMT</pubDate><category><![CDATA[Bayesianism]]></category><category><![CDATA[Heuristics and biases]]></category><category><![CDATA[Probability and statistics]]></category><category><![CDATA[Rationality]]></category><guid isPermaLink="false">https://www.johnwilcox.org/johns-blog/how-to-recognize-good-evidence-and-find-the-truth-where-others-cannot-identifying-and-overcoming-likelihood-neglect-bias</guid><description><![CDATA[(Forthcoming in&nbsp;Psychology Today)&#8203;  &#8203;THE TL;DR KEY POINTS  As humans, biases often prevent us from forming accurate judgments about the worldIn a&nbsp;recently published set of experiments, I show how a newly introduced bias can also do this: likelihood neglect biasTo illustrate the bias and how to overcome it, I discuss how it arises in versions of the &ldquo;Monty Hall problem&rdquo;In some versions of the problem, the evidence can objectively favor a hypothesis with a probabi [...] 
]]></description><content:encoded><![CDATA[<div class="paragraph"><span style="color:rgb(123, 123, 123)">(Forthcoming in&nbsp;</span><em style="color:rgb(123, 123, 123)">Psychology Today</em><span style="color:rgb(123, 123, 123)">)<br />&#8203;</span><br /></div>  <h2 class="wsite-content-title">&#8203;THE TL;DR KEY POINTS</h2>  <div class="paragraph"><ul><li><strong><em>As humans, biases often prevent us from forming accurate judgments about the world</em></strong></li><li><strong><em>In a&nbsp;<a href="https://www.cambridge.org/core/journals/judgment-and-decision-making/article/likelihood-neglect-bias-and-the-mental-simulations-approach-an-illustration-using-the-old-and-new-monty-hall-problems/B80822FC020A13C4A83E293C6120492E" target="_blank">recently published set of experiments</a>, I show how a newly introduced bias can also do this: likelihood neglect bias</em></strong></li><li><strong><em>To illustrate the bias and how to overcome it, I discuss how it arises in versions of the &ldquo;Monty Hall problem&rdquo;</em></strong></li><li><strong><em>In some versions of the problem, the evidence can objectively favor a hypothesis with a probability of 91%</em></strong></li><li><strong><em>However, the experiments show untrained participants are oblivious of this, they effectively think the evidence is irrelevant and so they do not recognize objectively strong evidence for the truth when presented with it</em></strong></li><li><strong><em>I then apply these ideas to other more realistic contexts&mdash;including medicine, law and mundane situations&mdash;to show how they can help us recognize good evidence and find the truth where others may not</em></strong></li></ul></div>  <div><div style="height: 20px; overflow: hidden; width: 100%;"></div> <hr class="styled-hr" style="width:100%;"></hr> <div style="height: 20px; overflow: hidden; width: 100%;"></div></div>  <div class="paragraph"><strong style="color:rgb(123, 123, 123)"><font color="#2a2a2a" size="5">THE IMPORTANCE OF RECOGNIZING GOOD EVIDENCE</font></strong></div>  <div class="paragraph"><br />&#8203;We all need to form accurate judgments about the world in many diverse and important contexts. What is the correct diagnosis for someone&rsquo;s medical condition? Does someone have a crush on you? Did the defendant kill the victim? Here, I will discuss how the evidence can reveal the truth about these questions<span style="color:rgb(123, 123, 123)">&mdash;</span>and potentially others which you might care about<span style="color:rgb(123, 123, 123)">&mdash;</span>but only if we think in the right ways.<br /><br />It&rsquo;s well-documented that various biases can hinder us in our quest for truth. In a recently published paper in <em>Judgment and Decision Making</em> (freely available <a href="https://www.cambridge.org/core/journals/judgment-and-decision-making/article/likelihood-neglect-bias-and-the-mental-simulations-approach-an-illustration-using-the-old-and-new-monty-hall-problems/B80822FC020A13C4A83E293C6120492E" target="_blank">here</a>), I introduce a new cognitive bias: likelihood neglect bias. Understanding this bias, and how to overcome it, can help us recognize good evidence and find the truth in numerous cases where others might not.<br /><br />To show this, though, I&rsquo;ll use a well-known brain-teaser which reveals this bias&mdash;the Monty Hall problem&mdash;and then I&rsquo;ll apply the emerging ideas to show how we can find the truth in other realistic cases&mdash;including medicine, law and more mundane topics. 
You might then want to apply these ideas to other cases you care about.</div>  <div>  <!--BLOG_SUMMARY_END--></div>  <div class="paragraph"><br /><strong style="color:rgb(123, 123, 123)"><font color="#2a2a2a" size="5">THE MONTY HALL PROBLEM<br />&#8203;</font></strong></div>  <div class="paragraph">&#8203;In the Monty Hall problem, a prize is randomly placed behind one of three doors. You select a door, which remains closed for the time being. If the prize is behind the door which you selected, then the gameshow host, Monty Hall, will randomly open one of the other two doors with an equal probability. But if the prize is behind one of the doors which you did not initially select, then he will open the other door which does not conceal the prize and which you did not select.<br /><br />In my experiments, participants read a description of the problem and then are asked to imagine that they select door A and that Monty Hall opens door C, like in the very high-quality picture below:</div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="https://www.johnwilcox.org/uploads/6/3/9/4/63943309/mh_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph">&#8203;The question, then, is this: what is the probability that the prize is behind door B after Monty Hall opens door C?<br /><br />Most people think the answer is 1/2 or 50%. But actually, the correct answer is that it has a 2/3 (roughly 67%) probability of concealing the prize.<br /><br />However, virtually no one without training gets it right when they first encounter the problem, including my once-younger self and, for instance, all 50 participants in one group of my experiments.&nbsp;<br /><br />In fact, it is because the right answer is so counter-intuitive that one cognitive scientist has called the Monty Hall problem &ldquo;the most expressive example of cognitive illusions or mental tunnels in which even the finest and best-trained minds get trapped&rdquo; (Piattelli-Palmarini, 1994a, p. 161).<br /><br />To see how the correct solution is indeed correct, though, let us consider the probabilistic explanation before applying the same ideas to some more important real-life problems.<br />&#8203;<br />&#8203;</div>  <div class="paragraph"><strong style="color:rgb(123, 123, 123)"><font color="#2a2a2a" size="5">LIKELIHOOD NEGLECT BIAS AND THE EXPLANATION OF THE MONTY HALL PROBLEM</font></strong></div>  <div class="paragraph"><br />&#8203;From a probabilistic standpoint, the sole reason door B more probably conceals the prize than door A is the likelihood that Monty Hall would open door C if door B conceals the prize. Here, a &ldquo;likelihood&rdquo; is a technical term used in probability theory: it refers to the probability of the evidence that we are certain about if we assume the truth of the hypothesis that we are not certain about. In this case, we are certain about door C being opened, but not about which door conceals the prize, so the likelihood refers to the probability of door C being opened given that a particular door conceals the prize.<br /><br />The likelihood of the evidence differs for whether door A or door B conceals the prize. If door A conceals the prize, then there&rsquo;s a 50% likelihood that Monty Hall would open door C, since he could have opened either door B or C.
But if door B conceals the prize, then there&rsquo;s a 100% likelihood that Monty Hall would open door C, since he cannot open the door that you selected (door A) nor the door that conceals the prize (door B), so he must open the only remaining door (door C).<br /><br />From a probabilistic perspective, the fact that door C being opened is twice as likely if door B conceals the prize (and the fact the prize was randomly placed behind one of the doors) makes door B twice as probable as door A to conceal the prize. Mathematicians calculate this using Bayes&rsquo; theorem as below (although you don&rsquo;t need to know Bayes&rsquo; theorem or anything aside from the importance of likelihoods here):<br /><br /></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="https://www.johnwilcox.org/uploads/6/3/9/4/63943309/yea-boi_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph" style="text-align:center;"><span style="color:rgb(123, 123, 123)">where <em>P(B|c)</em> is the probability that door B conceals the prize given that door C is opened (denoted <em>c</em>),&nbsp;</span><em style="color:rgb(123, 123, 123)">P(A), P(B)</em><span style="color:rgb(123, 123, 123)">&nbsp;and&nbsp;</span><em style="color:rgb(123, 123, 123)">P(C)</em><span style="color:rgb(123, 123, 123)">&nbsp;are all the prior probabilities that the respective doors conceal the prize and&nbsp;</span><em style="color:rgb(123, 123, 123)">P(c|A), P(c|B)</em><span style="color:rgb(123, 123, 123)">&nbsp;and&nbsp;</span><em style="color:rgb(123, 123, 123)">P(c|C)</em><span style="color:rgb(123, 123, 123)">&nbsp;are the likelihoods of door C being opened given the respective hypotheses.</span></div>  <div class="paragraph" style="text-align:left;"><br />Again, the formula isn&rsquo;t too important, but what is important is that door B more probably conceals the prize because of the likelihoods&mdash;because <em><span style="color:rgb(123, 123, 123)">P(c|A)</span></em>=50% and <em><span style="color:rgb(123, 123, 123)">P(c|B)</span></em>=100%.<br /><br />What my experiments showed, however, is that most participants were aware of what the likelihoods were, but they didn&rsquo;t realize that these likelihoods favored one hypothesis over another (and for other explanations of this and the Monty Hall problem, see the Appendix below). This, then, is likelihood neglect bias: by definition, it is being aware that the evidence is more likely given one hypothesis compared to another, all while failing to realize that the evidence therefore raises the probability of the former hypothesis relative to the latter.<br /><br />Technically, likelihood neglect bias is a violation of what is known as the <em>law of likelihood</em><span style="color:rgb(123, 123, 123)">&mdash;</span>a law of probability which states (roughly speaking) that if the evidence is more likely given one hypothesis compared to another, then the evidence necessarily raises the probability of the former hypothesis relative to the latter. The strength which the evidence provides for the hypothesis is measured by the so-called <em>likelihood ratio</em><span style="color:rgb(123, 123, 123)">&mdash;that is, how much more likely the evidence is assuming one hypothesis compared to the other</span>.
In the original Monty Hall problem, the likelihood ratio is 100%/50%=2, meaning that door C is twice as likely to be opened if door B conceals the prize than if door A conceals the prize.&nbsp;<br /><br />Overcoming likelihood neglect bias requires us to apply the law of likelihood in our assessment of the evidence, thereby raising our probabilities for hypotheses and getting closer to the truth in cases where others do not.<br />&#8203;</div>  <div class="paragraph"><strong style="color:rgb(123, 123, 123)"><font color="#2a2a2a" size="5">RECOGNIZING STRONG EVIDENCE WHERE OTHERS DO NOT:<br />&#8203;LIKELIHOOD NEGLECT BIAS AND THE NEW MONTY HALL PROBLEM</font></strong></div>  <div class="paragraph"><br />&#8203;In a sense, then, overcoming likelihood neglect bias can potentially help us recognize strong evidence and find the (probable) truth where others cannot.<br /><br />To see how this is, though, we could consider another innovation of the experiments: they showed that not only do people get incorrect probabilities for the original Monty Hall problem, but they also get incorrect probabilities for an adaptation of the problem where there&rsquo;s objectively strong evidence to favor door B over door A.<br /><br />More specifically, in another experiment, participants were presented with what I call the &ldquo;new Monty Hall problem&rdquo;, a problem where the likelihood of door C being opened given that door A conceals the prize is merely 10%&mdash;not 50% like in the original Monty Hall problem. In this case, the likelihood ratio changes to 100%/10%=10, meaning door C is 10 times more likely to be opened if door B conceals the prize than if door A conceals the prize.<br /><br />Under these conditions, one can prove with mathematics and simulations that the probability that door B conceals the prize after door C is opened is 91%&mdash;as I show in <a href="https://www.cambridge.org/core/journals/judgment-and-decision-making/article/likelihood-neglect-bias-and-the-mental-simulations-approach-an-illustration-using-the-old-and-new-monty-hall-problems/B80822FC020A13C4A83E293C6120492E#article" target="_blank">the paper</a> and its <a href="https://static.cambridge.org/content/id/urn%3Acambridge.org%3Aid%3Aarticle%3AS1930297524000081/resource/name/S1930297524000081sup001.zip" target="_blank">supplementary materials</a>.
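<br /><br />For readers who like to check such claims themselves, here is a minimal Monte Carlo sketch in Python covering both variants (you pick door A, and we condition on Monty opening door C):<br /><pre><code>import random

def simulate(p_open_c_given_a, trials=100_000):
    b_wins = c_opened = 0
    for _ in range(trials):
        prize = random.choice("ABC")   # prize placed randomly
        if prize == "A":               # your door hides it: Monty may open B or C
            opened = "C" if random.random() &lt; p_open_c_given_a else "B"
        elif prize == "B":
            opened = "C"               # he can't open A (yours) or B (the prize)
        else:
            opened = "B"               # he can't open A (yours) or C (the prize)
        if opened == "C":
            c_opened += 1
            b_wins += (prize == "B")
    return b_wins / c_opened

print(round(simulate(0.5), 2))   # original problem: about 0.67
print(round(simulate(0.1), 2))   # new problem: about 0.91
</code></pre>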
<br /><br />But despite this, 84% of participants&mdash;or 21 of 25 of them&mdash;thought the evidence was effectively irrelevant and that door B had merely a 50% probability of concealing the prize. (And the four other participants gave flawed answers for different reasons too.)<br /><br />Importantly, though, most of the participants were aware of the likelihoods&mdash;they knew that door C being opened was much more likely given that door B conceals the prize than that door A conceals the prize&mdash;but remarkably, this did not affect their probabilities for the hypotheses at all.<br /><br />This, again, is likelihood neglect bias: it is realizing the evidence is more likely given one hypothesis than another, all while not realizing that this evidence favors that hypothesis over the other.<br /><br />What this shows, then, is that sometimes the world can present us with strong evidence to favor some hypotheses over others, yet we might be oblivious to this if we neglect the likelihood of the evidence.<br /><br />The evidence could potentially favor hypotheses to the point of virtual certainty too, especially if, say, there are multiple pieces of evidence where we neglect the likelihoods. For instance, if there are just two pieces of evidence as strong as door C being opened in the new Monty Hall problem, then one can be 99% certain of the relevant hypothesis&mdash;and this is in a case where we know humans fail to think the evidence is relevant at all!<br /><br />The evidence could in principle be so strong as to virtually reveal the truth, but we might fail to realize this if we neglect the likelihoods and do not think about the evidence in the right ways.<br />&#8203;</div>  <div class="paragraph"><strong style="color:rgb(123, 123, 123)"><font color="#2a2a2a" size="5">EXAMPLES IN REALISTIC CONTEXTS&nbsp;</font></strong></div>  <div class="paragraph"><br />To see how this could be the case in more important and realistic scenarios, let us consider a few examples, each of which is adapted from real situations where subsequent evidence supported the favored hypothesis (albeit with the details changed to preserve anonymity).<br /><br /><br /><em>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Example #1: Medical diagnosis and treatment</em><br /><br />Suppose Tabatha has a debilitating medical condition but does not know for sure what is causing it. Suppose she furthermore has some hunches and tries a particular therapy with some suggestive but severely limited studies supporting its efficacy. She then assigns the therapy an initial 10% probability of working, but she tries it for a few months, and she starts feeling much better.<br /><br />Now, if the therapy was effective, we can suppose there would be a 100% likelihood she would have a remarkable improvement (and I will discuss how we can assign these values later, although the exact values of the likelihoods do not matter so much as the general ideas).&nbsp;<br /><br />On the other hand, though, if the therapy was ineffective, how do we determine the likelihood she would have recovered by chance? One demonstrably accurate method of assigning probabilities is to look at the frequency with which she has had similar improvements in her condition in the past, since study after study has suggested the past can help us assign probabilities.&nbsp;<br /><br />To do this, then, suppose Tabatha has had the disease for 3 years and has never seen a comparable improvement in her condition before. Then, using this historical information, she could accurately assign a likelihood of her remarkably improving given chance that is 1/156 or approximately 0.6%, since she had not improved remarkably in any of the past 156 weeks (that is, in roughly 3 years).
In this case, we then have a likelihood ratio of 100%/0.6%=166, meaning the recovery is about 166 times more likely if the therapy is effective than if it is not.<br /><br />If this is the case, then using Bayes&rsquo; theorem (as one can in this post <a href="https://www.johnwilcox.org/johns-blog/how-to-calculate-probabilities-the-bayesian-calculator" target="_blank">here</a>), the probability that the therapy is effective given her remarkable recovery is approximately 95%, and she can be confident that the therapy works.&nbsp;<br /><br />However, if she commits likelihood neglect and instead attributes her improvement to mere &ldquo;chance&rdquo;, then she could fail to realize the implications of the evidence which strongly favors one hypothesis over another.&nbsp;<br /><br />Of course, one alternative explanation is that she is improving merely because of a placebo effect: perhaps her trying the treatment gives her the hope or belief that she will improve, and this itself causes the improvement. Placebo effects are real, and that is why there are widespread experimental protocols to rule them out.<br /><br />But in this case, we could suppose she has evidence to discredit this explanation. Perhaps, for example, she has had years of trying alternative therapies, some of which she was even more confident in because they were recommended by experts. But if those therapies did not improve her condition despite their potential for a placebo effect, then the placebo explanation would be undermined by this evidence of historical frequencies.<br /><br />Likewise, when getting into details, there may also be evidence to undermine other alternatives, in which case Tabatha can be virtually certain the therapy is effective if she avoids likelihood neglect.<br /><br /><br /><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</em><em>Example #2: Someone having a crush on you</em><br /><br />Let us use another example which has&mdash;for educational purposes&mdash;turned out to engage my students&rsquo; interest in the past: whether someone has a crush on you.<br /><br />Suppose you and another person notice each other a lot in a dining hall at your college; you both make eye contact a lot and you assign a conservative 30% initial probability to them having a crush on you, since you have historically noticed that at least 30% of people who make eye contact with you like that have turned out to like you.<br /><br />Now suppose that one lunch time, the dining hall is virtually empty, there are 65 free tables around, but when this person comes in to get lunch, they choose to sit at the table directly in front of you.&nbsp;<br /><br />You are wondering, then, whether their sitting in front of you means they like you, and again, the likelihoods could provide an answer. In particular, suppose that if they do not like you, then they could have sat at any of the 65 tables, and so the likelihood of them sitting at the table in front of you is 1/65 or 1.5%&mdash;very unlikely.<br /><br />However, if they like you, then we could suppose the likelihood of them sitting at the table in front of you is about 70%&mdash;and much more likely&mdash;because they might want to sit there to get your attention, to encourage you to approach them and start a conversation or whatever.&nbsp;<br /><br />If that is the case, then the likelihood ratio would be 70%/(1/65)=45.5 and would again mean that the evidence strongly favors the hypothesis that they like you&mdash;in this case with a probability of 95%.
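<br /><br />Both of these 95% figures can be reproduced with a few lines of Python using the odds form of Bayes&rsquo; theorem (a sketch, not the only way to do the calculation):<br /><pre><code># Odds form of Bayes' theorem: posterior odds = prior odds * likelihood ratio
def posterior(prior, likelihood_ratio):
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# Tabatha: 10% prior that the therapy works, recovery ~166x likelier if it does
print(round(posterior(0.10, 166), 2))    # about 0.95

# The dining hall: 30% prior for a crush, sitting there ~45.5x likelier given one
print(round(posterior(0.30, 45.5), 2))   # about 0.95
</code></pre>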
probability of 95%. But again, if we merely attribute the occurrence to &ldquo;chance&rdquo; while neglecting the likelihoods, we might similarly not realize how strong the evidence is.<br /><br />&#8203;<br /><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</em><em>Example #3: Law and the attribution of a crime</em><br /><br />Let us use another example that is based on an actual scenario (albeit with some details modified), this time concerning law. Jemaine is an investigative journalist based in central Africa, and he publishes a news piece that is critical of a local gang. Six months later, his house is burned down. However, it is known that some in the gang did not like the critical news piece, and there have been rumors they have retaliated against others before. What, then, is the probability that Jemaine&rsquo;s house was deliberately burned down?<br /><br />Again, if we rely on well-documented heuristics and assign the &ldquo;accident&rdquo; hypothesis a high probability merely because it seems plausible or consistent with the evidence, then we might commit likelihood neglect and fail to recognize evidence for the truth. To think more carefully about the case, we could assign initial probabilities and then consider likelihoods.<br /><br />There are multiple rumors that the gang has retaliated against critics before, and so suppose we assign an initial 10% probability to the hypothesis that Jemaine would experience retaliation before updating on the evidence about the house burning down. (That said, if the gang is sufficiently powerful and motivated, it could be very difficult to determine the true frequency with which they successfully execute covert arson.)&nbsp;<br /><br />Then, we can think about the likelihood of Jemaine&rsquo;s house burning down given that he was or wasn&rsquo;t retaliated against. Suppose that if he was retaliated against, there&rsquo;s a 20% likelihood he would have his house burned down, since there are also other ways in which he could have been retaliated against&mdash;such as a plane crash, staged suicide and so on. On the other hand, suppose that if he was not retaliated against, then there is a 1/10,000 likelihood his house would burn down, since although house burnings occasionally happen, they are still exceedingly improbable relative to the vastly larger number of cases where house burnings never occur. In this case, the likelihood ratio would be 20%/(1/10,000)=2,000, meaning the house burning down is 2,000 times more likely if he was retaliated against than if he was not.<br /><br />Again, then, using Bayes&rsquo; theorem, we can be 99.5% confident Jemaine&rsquo;s house was burned down by the gang in central Africa. And in the real-life case which this example was adapted from, an informant close to the &ldquo;gang hierarchy&rdquo; did indeed confirm Jemaine was retaliated against despite the gang&rsquo;s attempts to cover it up. 
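<br /><br />For readers who want to check these numbers themselves, here is a minimal sketch in Python of the odds form of Bayes&rsquo; theorem which the examples above rely on. (This is just an illustration using the post&rsquo;s own numbers; it is not the calculator from the post linked above, and the function name is mine.)<br /><pre>
def posterior(prior, likelihood_ratio):
    """Update a prior probability using the odds form of Bayes' theorem:
    posterior odds = prior odds * likelihood ratio."""
    odds = (prior / (1 - prior)) * likelihood_ratio
    return odds / (1 + odds)

# Example #2: 30% prior, likelihood ratio 70% / (1/65) = 45.5
print(posterior(0.30, 0.70 / (1 / 65)))      # ~0.951, i.e. about 95%

# Example #3: 10% prior, likelihood ratio 20% / (1/10,000) = 2,000
print(posterior(0.10, 0.20 / (1 / 10_000)))  # ~0.9955, i.e. about 99.5%
</pre>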
But again, if we neglect likelihoods and rely merely on what seems &ldquo;plausible&rdquo; or &ldquo;consistent&rdquo; with the evidence, we could fail to recognize strong evidence when it can reveal the truth to us.<br /><br /></div>  <div class="paragraph"><font size="5"><strong style="color: rgb(123, 123, 123);"><font color="#2a2a2a" style="">CONCLUDING THOUGHTS</font></strong>&#8203;</font></div>  <div class="paragraph"><br />&#8203;Of course, in the examples above, some people might naturally think the evidence is relevant and not neglect the likelihoods, and it&rsquo;s an open question exactly when people neglect likelihoods. However, from experience, I know that some people can neglect likelihoods in cases like these, even if others do not.<br /><br />In any case, the point is that the experimental evidence suggests we can sometimes neglect likelihoods, potentially in more important contexts too.<br /><br />That said, an important warning: even if we do not neglect likelihoods, there are many ways we might try to reason about them unsuccessfully. For example, we might assign incorrect likelihoods in the first place, or we might not properly take into account so-called &ldquo;auxiliary hypotheses&rdquo;, or we might not use valid tools like Bayes&rsquo; theorem to correctly revise our views given information about likelihoods. I discuss how to avoid these and other errors in another <a href="https://www.johnwilcox.org/johns-blog/the-seven-requirements-of-highly-accurate-bayesians" target="_blank">post here</a>.&nbsp;<br /><br />For now, though, here&rsquo;s an appendix of other explanations for the Monty Hall problem.<br />&#8203;<br /><br /></div>  <div class="paragraph"><strong style="color:rgb(123, 123, 123)"><font color="#2a2a2a" size="5">APPENDIX: MORE EXPLANATIONS OF THE MONTY HALL PROBLEM</font></strong></div>  <div class="paragraph"><br />&#8203;Above, I gave an explanation of the Monty Hall problem in terms of probability theory, but there are other ways to explain the correct answer to the problem too.&nbsp;<br /><br /><br /><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</em><em>Explanation #2: The Many Doors Adaptation</em><br /><br />One style of explanation comes from the genius Marilyn vos Savant.<br /><br />Suppose we adapt the problem: there are 1,000 doors instead of 3, Monty Hall randomly places a prize behind one of the doors, you select one, and then he opens all of the remaining doors except one. Suppose you select, say, door 358, and then Monty Hall opens every remaining door except door 721.&nbsp;<br /><br />Intuitively, it&rsquo;s much, much more probable that door 721 conceals the prize than that the door you happened to select conceals the prize.&nbsp;<br /><br />In fact, the probability is provably 99.9%, and the reason is again because of the likelihoods: if door 721 conceals the prize, there&rsquo;s a 100% likelihood Monty Hall would open every other unselected door except door 721, so <em>P(721 is not opened|door 721 conceals the prize)</em>=100%. 
But if the door you selected conceals the prize, then there&rsquo;s a 1/999 likelihood that Monty Hall would open every unselected door except 721&mdash;simply because there were 999 unselected doors, any one of which he could have left closed&mdash;and so <em>P(721 is not opened|door 358 conceals the prize)</em>=1/999.&nbsp;<br /><br />The point is that this adaptation and others like it show that as we change the number of doors&mdash;and consequently the relevant likelihoods&mdash;we can see how this affects the probability of the relevant hypotheses. And if we decrease the number of doors enough, we will eventually get to just three doors and the 2/3 probability that door B conceals the prize.&nbsp;<br /><br /><br /><em>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Explanation #3: The Mental Simulations Approach</em><br /><br />A third way to explain the problem is with what I call the <em>mental simulations approach</em> in a <a href="https://www.cambridge.org/core/journals/judgment-and-decision-making/article/likelihood-neglect-bias-and-the-mental-simulations-approach-an-illustration-using-the-old-and-new-monty-hall-problems/B80822FC020A13C4A83E293C6120492E" target="_blank">recent article</a>. This approach involves running a number of mental simulations of the Monty Hall problem and then proportioning the mental simulations by the relevant probabilities.&nbsp;<br /><br />So suppose we first run, say, 30 mental simulations&mdash;that is, we imagine that the Monty Hall problem happens 30 times. And since there is a 1/3 probability that a given door conceals the prize at the beginning, we make it so that each door conceals the prize in 1/3 of the simulations, i.e. in 10 of the 30. Then, since the likelihoods mean door C is opened 50% of the time when door A conceals the prize and 100% of the time when door B conceals the prize, we make it so Monty Hall opens door C in 50% and 100% of those simulations respectively. We could depict this as follows:</div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="https://www.johnwilcox.org/uploads/6/3/9/4/63943309/ms_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph">What these mental simulations show is that 10 out of 15 times, or 2 out of 3 times, when door C is opened, it is because door B conceals the prize, so door B conceals the prize with a probability of 2/3.<br /><br />The experiments further showed that training in the mental simulations approach led 32% of participants (16 of 50) in one group of the experiment to get the right probabilities for the Monty Hall problem. Of course, this isn&rsquo;t perfect, but it&rsquo;s at least better than the other group, in which none of the participants got the correct probabilities.<br />&#8203;<br /><br /><em>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Explanation #4: Computer Simulations</em><br /><br />Another explanation of the problem is actually to just simulate the problem many times using a computer. I provide some code and instructions for how you can do this yourself in Appendix C of the paper <a href="https://static.cambridge.org/content/id/urn%3Acambridge.org%3Aid%3Aarticle%3AS1930297524000081/resource/name/S1930297524000081sup001.zip" target="_blank">here</a>. If you run the simulations, you can see that door B will conceal the prize 2 out of 3 times in the original Monty Hall problem and 10 out of 11 times in the new Monty Hall problem. 
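<br /><br />If you want a quick taste without downloading the supplementary materials, here is a minimal sketch in Python of such a simulation for the original three-door problem. (This is an illustration rather than the code from Appendix C, and it assumes, as above, that you select door A and that Monty chooses at random whenever more than one door could be opened.)<br /><pre>
import random

def monty_hall(n_trials=100_000):
    """Estimate P(door B conceals the prize | we picked door A and
    Monty opened door C) by simulating the game many times."""
    c_opened = 0
    b_conceals = 0
    for _ in range(n_trials):
        prize = random.choice("ABC")  # the prize is placed at random
        # Monty opens a door that is neither our selection (A) nor the
        # prize door; if both B and C qualify, he picks one at random.
        opened = random.choice([door for door in "BC" if door != prize])
        if opened == "C":
            c_opened += 1
            if prize == "B":
                b_conceals += 1
    return b_conceals / c_opened

print(monty_hall())  # ~0.667: door B conceals the prize about 2 in 3 times
</pre>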
Apparently, it was only after seeing computer simulations like these that the famous mathematician Paul Erd&#337;s finally accepted the 2/3 solution to the problem and conceded that his initial 1/2 intuition was incorrect.&nbsp;<br /><br /></div>]]></content:encoded></item><item><title><![CDATA[How can we measure the accuracy of judgments and determine which ones to trust?]]></title><link><![CDATA[https://www.johnwilcox.org/johns-blog/how-can-we-measure-the-accuracy-of-judgments-and-determine-which-ones-to-trust]]></link><comments><![CDATA[https://www.johnwilcox.org/johns-blog/how-can-we-measure-the-accuracy-of-judgments-and-determine-which-ones-to-trust#comments]]></comments><pubDate>Sat, 03 Aug 2024 03:05:21 GMT</pubDate><category><![CDATA[Calibrationism]]></category><category><![CDATA[Judgment accuracy]]></category><category><![CDATA[Probability and statistics]]></category><guid isPermaLink="false">https://www.johnwilcox.org/johns-blog/how-can-we-measure-the-accuracy-of-judgments-and-determine-which-ones-to-trust</guid><description><![CDATA[(Forthcoming in&nbsp;Psychology Today)&#8203;  &#8203;THE TL;DR KEY POINTS  I recently published an argument for calibrationism, the idea that judgments about the world are trustworthy only if&mdash;among other things&mdash;there&rsquo;s evidence that they are produced in ways that are &ldquo;well calibrated&rdquo;A set of judgments are well calibrated just in case they assign, say, 80% probabilities to things which are true 80% of the time, 90% probabilities to things which are true 90% of the  [...] ]]></description><content:encoded><![CDATA[<div class="paragraph"><span style="color:rgb(123, 123, 123)">(Forthcoming in&nbsp;</span><em style="color:rgb(123, 123, 123)">Psychology Today</em><span style="color:rgb(123, 123, 123)">)<br />&#8203;</span><br /></div>  <h2 class="wsite-content-title">&#8203;THE TL;DR KEY POINTS</h2>  <div class="paragraph"><ul><li><em><strong>I recently published an argument for <a href="https://link.springer.com/article/10.1007/s13164-024-00724-1" target="_blank">calibrationism</a>, the idea that judgments about the world are trustworthy only if&mdash;among other things&mdash;there&rsquo;s evidence that they are produced in ways that are &ldquo;well calibrated&rdquo;</strong></em></li><li><em><strong>A set of judgments are well calibrated just in case they assign, say, 80% probabilities to things which are true 80% of the time, 90% probabilities to things which are true 90% of the time and so on</strong></em></li><li><em><strong>In this post, I provide a tool for measuring calibration, and I share some ideas about what to do and what not to do when measuring calibration, using my own experience as an example&nbsp;</strong></em></li></ul></div>  <div><div style="height: 20px; overflow: hidden; width: 100%;"></div> <hr class="styled-hr" style="width:100%;"></hr> <div style="height: 20px; overflow: hidden; width: 100%;"></div></div>  <div class="paragraph" style="text-align:left;"><strong><font color="#2a2a2a"><font size="5">PRELIMINARY CLARIFICATION: WHAT IS CALIBRATION AND CALIBRATIONISM?</font></font></strong></div>  <div class="paragraph">&#8203;<br />&#8203;As I mentioned <a href="https://www.johnwilcox.org/johns-blog/when-should-we-trust-judgments-from-ourselves-or-others-the-calibrationist-answer" target="_blank">elsewhere</a>, I recently published <a href="https://link.springer.com/article/10.1007/s13164-024-00724-1" target="_blank">a paper</a> arguing for calibrationism, the idea that judgments of probability are 
trustworthy only if there&rsquo;s evidence they are produced in ways that are calibrated&mdash;that is, only if there is evidence that the things one assigns probabilities of, say, 90% to happen approximately 90% of the time.<br /><br />Below is an example of such evidence; it is a graph which depicts the calibration of a forecaster from the Good Judgment project&mdash;user 3559:</div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="https://www.johnwilcox.org/uploads/6/3/9/4/63943309/picture1-graph_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph">&#8203;The graph shows how often the things they assign probabilities to turn out to be true. For example, the top right dot represents all the unique events which they assigned a probability of around 97.5% to before they did or didn&rsquo;t occur: that Mozambique would experience an onset of insurgency between October 2013 and March 2014, that France would deliver a Mistral-class ship to a particular country before January 1st, 2015 and so on for 17 other events. Now, out of all of these 19 events which they assigned a probability of about 97%, it turns out that about 95% of those events occurred. Likewise, if you look at all the events this person assigned a probability of approximately 0%, it turns out that about 0% of those events occurred.&nbsp;<br /><br />However, not all people are like this. Below is a particular individual, user 4566, who assigned probabilities of around 97% to things which were true merely 21% of the time, such as Chad experiencing insurgency by March 2014 and so on.</div>  <div>  <!--BLOG_SUMMARY_END--></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="https://www.johnwilcox.org/uploads/6/3/9/4/63943309/poor-calibration_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph">This graph shows that this person is poorly calibrated.&nbsp;<br /><br />Studies have shown people can be more or less calibrated for a range of <a href="https://link.springer.com/book/10.1007/978-3-031-19205-0" target="_blank">domains</a>, including medical diagnoses and prognoses, general knowledge about past or current affairs, geopolitical topics&mdash;virtually anything.<br /><br />Calibrationism then says that judgments of probability are trustworthy only if we have evidence that they are produced by ways of thinking that are well calibrated&mdash;that is, by ways of thinking that look more like the first forecaster&rsquo;s calibration graph than like the second forecaster&rsquo;s calibration graph.&nbsp;<br /><br />A number of studies suggest humans can be miscalibrated to some extent. For example, <a href="https://press.princeton.edu/books/hardcover/9780691178288/expert-political-judgment" target="_blank">Philip Tetlock</a> found that some particular experts assigned 0% probabilities to events that actually did happen 19% of the time. <a href="https://psycnet.apa.org/record/2011-15298-010" target="_blank">Another study</a> found that some students assigned probabilities of about 95% to things that were true only 73% to 87% of the time (depending on which university these particular students were from). A variety of other studies provide evidence of miscalibration in other contexts, as discussed in my book <em><a href="https://link.springer.com/book/10.1007/978-3-031-19205-0" target="_blank">Human Judgment</a></em>.<br />&#8203;</div>  
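<div class="paragraph">To make this concrete before turning to measurement in practice, here is a minimal sketch in Python of how the dots on calibration graphs like those above can be computed from a list of probability judgments and outcomes. (The function and the numbers are hypothetical illustrations of mine; this is not the Good Judgment Project&rsquo;s methodology.)</div>  <pre>
from collections import defaultdict

def calibration_table(judgments, bin_width=5):
    """Group (probability %, outcome) pairs into categories of width
    bin_width percentage points, then report how often the things in
    each category turned out to be true, plus the category's size."""
    bins = defaultdict(list)
    for prob, outcome in judgments:
        bins[round(prob / bin_width) * bin_width].append(outcome)
    return {p: (sum(o) / len(o), len(o)) for p, o in sorted(bins.items())}

# Hypothetical judgments: (assigned probability in %, did it come true?)
judgments = [(97, True), (97, True), (97, False), (60, True), (60, False)]
print(calibration_table(judgments))
# -> {60: (0.5, 2), 95: (0.666..., 3)}: the ~97% judgments came true
# 2 out of 3 times, so this (tiny) sample suggests overconfidence
</pre>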
<div class="paragraph"><strong><font size="5" color="#2a2a2a">HOW TO MEASURE ACCURACY AND CALIBRATION</font></strong></div>  <div class="paragraph"><br />&#8203;&#8203;So the evidence suggests not only that miscalibration is fairly common but also that we can be unaware of just how miscalibrated or inaccurate we are.<br /><br />Because of this, it is worth gathering evidence about how calibrated the judgments we rely on are, and calibrationism says this is necessary before we can fully trust them.<br /><br />Gathering this evidence is often entirely possible, though, and the purpose of this post is to share some thoughts about how this is so.<br /><br />For a start, we can plug some values into a spreadsheet. Below is an adaptation from a spreadsheet which I&rsquo;ve used to track my calibration for almost a year now and which can be downloaded <a href="https://www.johnwilcox.org/uploads/6/3/9/4/63943309/accuracy_tracker_-_online.xlsx" target="_blank">here</a>:</div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="https://www.johnwilcox.org/uploads/6/3/9/4/63943309/picture5_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph">&#8203;It might seem odd to do this, but I don&rsquo;t think it&rsquo;s irrational. I find myself often making probability judgments about the world, and I found it both fun and insightful to take a moment to make this tool to systematically track the accuracy of the judgments which I make regardless.&nbsp; (Plus, as a cognitive scientist who specializes in judgment accuracy, it makes sense to do something like this since studying this kind of thing is literally my job!)<br /><br />In the spreadsheet above, I have anonymized the content so it does not concern other people, but the topics include a broad variety of things: examples include past or current medical diagnoses and prognoses, interpersonal relationships with friends or others, career developments, events in international affairs and many other things. In that sense, it&rsquo;s a truly &ldquo;domain general&rdquo; set of judgments, to use a term from my book on <em><a href="https://link.springer.com/book/10.1007/978-3-031-19205-0" target="_blank">Human Judgment</a></em>. Some aspects of the spreadsheet are self-explanatory (e.g. the &ldquo;Topic&rdquo;, &ldquo;Description&rdquo; and &ldquo;Rationale&rdquo; columns) while others are explained elsewhere (e.g. 
&ldquo;Brier&rdquo; and &ldquo;Resolution&rdquo; scores, both of which are discussed in the second chapter of my book <a href="https://link.springer.com/book/10.1007/978-3-031-19205-0" target="_blank">here</a>).<br /><br />What matters, though, is that the spreadsheet automatically produces a table and graph to depict the calibration of a set of judgments as follows (note that errors like &ldquo;#DIV/0&rdquo; show up in the categories where I haven&rsquo;t made any judgments*):</div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="https://www.johnwilcox.org/uploads/6/3/9/4/63943309/picture6_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="https://www.johnwilcox.org/uploads/6/3/9/4/63943309/picture7_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph">Like the other calibration graphs, the <em>x</em>-axis shows the probabilities I assign to things and the <em>y</em>-axis shows the frequency with which those things turn out to be true. So far, the judgments are pretty well calibrated for some categories: if you take the things I have been around 97.5% confident in, for example, about 92% of those things are true (although I will say more on this shortly), while the things I have assigned 70% probabilities to are true about 71% of the time.&nbsp; Other categories may appear less well calibrated: for example, out of the things I assign 80% to, all of those things are true, thus possibly indicating a degree of underconfidence.&nbsp;</div>  <div class="paragraph" style="text-align:left;"><br /><strong><font size="5">THINGS TO PAY ATTENTION TO WHEN MEASURING CALIBRATION<br />&#8203;</font></strong><br /></div>  <div class="paragraph">However, the statistics for some of the categories may be misleading, and so there are a few things to pay attention to when measuring calibration&mdash;there are several desiderata or ideal features, we might say.<br /><br /><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</em><em>Desideratum #1: Multiplicity</em><br /><br />The first is what I call <em>multiplicity</em>, which concerns the number of judgments we have. When measuring calibration, we often need multiple judgments that can be assessed for accuracy.&nbsp;<br /><br />For instance, even a person who is perfectly calibrated will assign probabilities of 90% to things which are false 10% of the time, and if we only have one of their 90%-judgments for an outcome which happens to be false, then they might look terribly miscalibrated when they are actually not.&nbsp;<br /><br />For example, in the spreadsheet above, I assigned a 95% probability to moving into a particular apartment ahead of time; then, when the day to move in came and I was on my way to get the keys, the apartment owner said an issue had come up and I couldn&rsquo;t move in. 
That was the one thing I assigned a high probability to in that category which turned out false, and if we focus on just that instance, which (I would think) was objectively improbable, I might look much more miscalibrated than I actually am.<br /><br />More generally, the so-called &ldquo;law of large numbers&rdquo; implies that if, say, the things we are 90% confident in would be true 90% of the time, then our actual judgments will likely converge to this proportion&mdash;but only with a sufficiently large number of judgments. It&rsquo;s like how the probability of a coin landing heads is 50%, but the actual numbers reflect this only with multiple coin flips: three coin flips might land heads 100% of the time, or 67% of the time, but many more flips will eventually reveal the true probability of 50%.&nbsp;<br /><br />Consequently, the number of judgments in a given category is depicted in the calibration graph. There are 13 judgments in the 97.5% category and only 2 in the 60% category, meaning that the observed calibration is much more likely to reflect the true calibration for the former category than for the latter.<br /><br /><em style="color:rgb(123, 123, 123)">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;</em><em style="color:rgb(123, 123, 123)">&nbsp;</em><em>Desideratum #2: Independence</em><br /><br />However, observed calibration is more likely to reflect one&rsquo;s real underlying accuracy when another desideratum is met: namely, that the things one makes judgments about are probabilistically independent of each other. Suppose, for example, that the probability the Republican candidate will win the next US presidential election is 70%. Then, if a dataset contains only 100 repetitions of the same judgment, each 70% confident in that candidate&rsquo;s success, then either all of the predictions will turn out true or none of them will, since they all concern the exact same thing.<br /><br />More generally, then, a set of judgments could look unduly inaccurate merely because the judgments are dependent on each other, either because they are about the same topic (e.g. repeating the same forecast in a forecasting tournament), or because they are about closely related topics which affect each other (e.g. the number of COVID deaths and the number of COVID infections in a country).<br /><br />One way to guarantee the right kind of independence, then, is to make sure that the judgments each concern an independent topic: you might have one judgment about the US presidential election, another about the culture of Mexico, another about a question of geography or whatever else one might care about.<br /><br /><em>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Desideratum #3: Resolvability</em><br /><br />Beyond that, we would ideally assess accuracy using questions that are &ldquo;resolvable&rdquo;, meaning we can know with virtual certainty what the correct answer is. Sometimes we can do this, like when predicting a future outcome which will eventually be knowable or when using things like DNA evidence to figure out what happened in the past.<br /><br />Sometimes, however, the questions we are concerned with are not resolvable. For example, one study assessed the accuracy of judgments from experts vs. 
non-experts by comparing the two groups&rsquo; judgments to estimates from other experts.&nbsp; But if the &ldquo;other experts&rdquo; have inaccurate judgments, as experts sometimes do, then this would give a skewed and possibly biased picture of judgment accuracy.<br /><br />In any case, the upshot is that it&rsquo;s very feasible to measure the calibration of multiple, independent and resolvable judgments using the free spreadsheet tool above. This could potentially help us determine both the trustworthiness of those judgments and<span style="color:rgb(123, 123, 123)">&mdash;by extrapolation&mdash;</span>the trustworthiness of judgments in other contexts where the true outcomes might not be knowable.<br /><br /><br /><br /><em>*Footnote:&nbsp;<span style="color:rgb(123, 123, 123)">In my calibration graph above, I've "flipped" probabilities like 10% in false propositions to become 90% in true propositions so I have a larger sample of judgments for the categories above 50%.</span><br /></em><br /></div>]]></content:encoded></item><item><title><![CDATA[When should we trust the judgments from ourselves or others? The calibrationist answer]]></title><link><![CDATA[https://www.johnwilcox.org/johns-blog/when-should-we-trust-judgments-from-ourselves-or-others-the-calibrationist-answer]]></link><comments><![CDATA[https://www.johnwilcox.org/johns-blog/when-should-we-trust-judgments-from-ourselves-or-others-the-calibrationist-answer#comments]]></comments><pubDate>Sat, 03 Aug 2024 02:40:51 GMT</pubDate><category><![CDATA[Calibrationism]]></category><category><![CDATA[Probability and statistics]]></category><category><![CDATA[Rationality]]></category><guid isPermaLink="false">https://www.johnwilcox.org/johns-blog/when-should-we-trust-judgments-from-ourselves-or-others-the-calibrationist-answer</guid><description><![CDATA[(Forthcoming in&nbsp;Psychology Today)&#8203;  &#8203;THE TL;DR KEY POINTS  We all make judgments of probability and use these to inform our decision-makingBut it is not obvious which judgments to trust, and bad outcomes occur if we get it wrong&mdash;such as fatal misdiagnoses or false death sentence convictionsHow do we determine which judgments to trust&mdash;both from ourselves or others?I recently argued for inclusive calibrationism, which&nbsp;gives a two-part answer to this questionThe fi [...] 
]]></description><content:encoded><![CDATA[<div class="paragraph"><span style="color:rgb(123, 123, 123)">(Forthcoming in&nbsp;</span><em style="color:rgb(123, 123, 123)">Psychology Today</em><span style="color:rgb(123, 123, 123)">)<br />&#8203;</span><br /></div>  <h2 class="wsite-content-title">&#8203;THE TL;DR KEY POINTS</h2>  <div class="paragraph"><ul><li><strong><em>We all make judgments of probability and use these to inform our decision-making</em></strong></li><li><strong><em>But it is not obvious which judgments to trust, and bad outcomes occur if we get it wrong&mdash;such as fatal misdiagnoses or false death sentence convictions</em></strong></li><li><strong><em>How do we determine which judgments to trust&mdash;whether from ourselves or others?</em></strong></li><li><strong><em>I recently argued for <a href="https://osf.io/preprints/psyarxiv/vm34x" target="_blank">inclusive calibrationism</a>, which&nbsp;gives a two-part answer to this question</em></strong></li><li><strong><em>The first part says judgments of probability are trustworthy only if there&rsquo;s evidence they are produced in ways that are &ldquo;well calibrated&rdquo;&mdash;that is, only if there is evidence that the things one assigns probabilities of, say, 90% to happen approximately 90% of the time</em></strong></li><li><strong><em>The second part says that judgments of probability are trustworthy only if they are also inclusive of all the relevant evidence</em></strong></li><li><strong><em>This blogpost then shares some ideas for implementing calibrationism, such as measuring calibration and creating evidence checklists to figure out how inclusive a judgment is</em></strong></li></ul></div>  <div><div style="height: 20px; overflow: hidden; width: 100%;"></div> <hr class="styled-hr" style="width:100%;"></hr> <div style="height: 20px; overflow: hidden; width: 100%;"></div></div>  <div class="paragraph"><span style="color:rgb(123, 123, 123)">&#8203;</span><strong style="color:rgb(123, 123, 123)"><font color="#2a2a2a" size="5">THE IMPORTANCE OF TRUSTWORTHY JUDGMENTS</font></strong><br /><br />We all make judgments of probability and depend on them for our decision-making.<br /><br />However, it is not always obvious which judgments to trust, especially since a range of studies suggest these judgments can sometimes be more inaccurate than we might hope or expect. For example, <a href="https://www.pnas.org/doi/abs/10.1073/pnas.1306417111" target="_blank">scholars</a> have argued that at least 4% of death sentence convictions in the US are false convictions, that <a href="https://jamanetwork.com/journals/jama/article-abstract/1845204" target="_blank">tens</a> or even <a href="https://europepmc.org/article/NBK/nbk588118" target="_blank">hundreds of thousands</a> of Americans die of misdiagnoses each year and that sometimes <a href="https://press.princeton.edu/books/hardcover/9780691178288/expert-political-judgment" target="_blank">experts can be 100% sure of predictions</a> which turn out to be false 19% of the time. So we want trustworthy judgments, or else bad outcomes can occur.<br /><br />How, then, can we determine which judgments to trust&mdash;either from ourselves or others? In a paper recently published <a href="https://link.springer.com/article/10.1007/s13164-024-00724-1" target="_blank">here</a> and freely available <a href="https://osf.io/preprints/psyarxiv/vm34x" target="_blank">here</a>, I argue for an answer called &ldquo;inclusive calibrationism&rdquo;&mdash;or just &ldquo;calibrationism&rdquo; for short. 
Calibrationism says trustworthiness requires two ingredients&mdash;calibration and inclusivity.&nbsp;<br />&#8203;</div>  <div>  <!--BLOG_SUMMARY_END--></div>  <div class="paragraph" style="text-align:left;"><strong style="color:rgb(123, 123, 123)"><font color="#2a2a2a" size="5">THE FIRST INGREDIENT OF TRUSTWORTHINESS: CALIBRATION</font></strong><br /><strong><span><span style="color:rgb(0, 0, 0)"><font size="4">&#8203;</font></span></span></strong></div>  <div class="paragraph">&#8203;&#8203;&#8203;The calibrationist part of &ldquo;inclusive calibrationism&rdquo; says that judgments of probability are trustworthy only if they are produced by ways of thinking that have evidence of their calibration. Here, &ldquo;calibration&rdquo; is a technical term that refers to whether the probabilities we assign to things correspond to the frequency with which those things are true. Let us consider this with an example.&nbsp;<br /><br />Below is a graph which depicts the calibration of a forecaster from the Good Judgment project&mdash;user 3559:</div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="https://www.johnwilcox.org/uploads/6/3/9/4/63943309/picture1-graph_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph">&#8203;The graph shows how often the things they assign probabilities to turn out to be true. For example, the top right dot represents all the unique events which they assigned a probability of around 97.5% to before they did or didn&rsquo;t occur: that Mozambique would experience an onset of insurgency between October 2013 and March 2014, that France would deliver a Mistral-class ship to a particular country before January 1st, 2015 and so on for 17 other events. Now, out of all of these 19 events which they assigned a probability of about 97%, it turns out that about 95% of those events occurred. Likewise, if you look at all the events this person assigned a probability of approximately 0%, it turns out that about 0% of those events occurred.&nbsp;<br /><br />In this case, this person has a relatively good track record of calibration because they think in ways which mean that the probabilities that they assign to things correspond (roughly) to the frequency with which those things are true.&nbsp;<br /><br />And this is the case even when those things concern &ldquo;unique&rdquo; events which we might have thought we couldn&rsquo;t assign probabilities to. After all, a particular insurgency either would or would not occur; it&rsquo;s a unique event, and so it&rsquo;s not obvious we can assign it a numerically precise number, like 67%, that&mdash;as it turns out&mdash;reflects the objective odds of its occurrence. But it turns out that we can&mdash;at least if we think in the right ways.<br /><br />Yet not all of us are so well calibrated. 
Below is a particular individual, user 4566, who assigned probabilities of around 97% to things which were true merely 21% of the time, such as Chad experiencing insurgency by March 2014 and so on.</div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="https://www.johnwilcox.org/uploads/6/3/9/4/63943309/picture2_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph">&#8203;Studies have shown people can be more or less calibrated for a range of domains, including <a href="https://psycnet.apa.org/record/2003-02858-039" target="_blank">medical diagnoses</a> and prognoses, general knowledge about <a href="https://psycnet.apa.org/record/2011-15298-010" target="_blank">past or current affairs</a>, <a href="https://press.princeton.edu/books/hardcover/9780691178288/expert-political-judgment" target="_blank">geopolitical topics</a>&mdash;virtually anything.<br /><br />Calibrationism then says that judgments of probability are trustworthy only if we have evidence that they are produced by ways of thinking that are well calibrated&mdash;that is, by ways of thinking that look more like the first forecaster&rsquo;s calibration graph than like the second forecaster&rsquo;s calibration graph.&nbsp;<br /><br />Of course, for many kinds of judgments, we lack strong evidence of calibration, and we then have little basis for unquestioning trust in those judgments of probability. This is true in some parts of medicine or law, for instance, where we have some evidence of inaccurate judgments which can lead to misdiagnoses or false convictions at least some of the time, even if we have perfectly good judgments the rest of the time.&nbsp;<br /><br />But in other contexts, we simply lack evidence either way: our judgments could be as calibrated as the first forecaster&rsquo;s or as miscalibrated as the second&rsquo;s, and we have no good way to tell firmly.&nbsp;<br /><br />The calibrationist implication is that in some domains, then, we need to measure and possibly improve our calibration before we can fully trust the judgments of probability which we have in that domain.&nbsp;<br /><br />The good news, however, is that this is possible, as I discuss in my book <a href="https://link.springer.com/book/10.1007/978-3-031-19205-0" target="_blank">elsewhere</a>. For example, some evidence (e.g.&nbsp;<a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3779404" target="_blank">here</a> and <a href="https://psycnet.apa.org/record/2015-00693-001" target="_blank">here</a>) suggests we can improve our calibration by drawing on information about statistics or the frequency with which things have happened in the past. For instance, we can better predict a recession in the future if we look at the proportion of the time that recessions have occurred in similar situations in the past. We can similarly use statistics like these to determine the probabilities of election outcomes, wars, a patient having a disease and even more mundane outcomes like whether someone has a crush on you.<br /><br />The other good news is that some people are well calibrated, meaning that they can provide us with trustworthy judgments about the world. For example, take user 5265 from the Good Judgment Project&rsquo;s forecasting tournament. 
In the first year of the tournament, their judgments were well calibrated, as the below graph depicts:</div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="https://www.johnwilcox.org/uploads/6/3/9/4/63943309/picture3_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph">&#8203;If we had been in the second year of the Good Judgment tournament, about to ask this person another series of questions, we could then have inferred the calibration and trustworthiness of their judgments about the future. And in fact, that is exactly what we see when we look at their track record of calibration for the second year of the tournament below:</div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="https://www.johnwilcox.org/uploads/6/3/9/4/63943309/picture4_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph">&#8203;More generally, <a href="https://link.springer.com/chapter/10.1007/978-3-031-30085-1_6" target="_blank">the evidence demonstrates</a> that track records of accuracy are the best indicator of someone&rsquo;s accuracy in other contexts&mdash;a better indicator than education level, age, experience or anything else that has been scientifically tested.&nbsp;<br /><br />So track records of calibration are one important ingredient of trustworthiness, but they are not the only one.&nbsp;<br />&#8203;</div>  <div class="paragraph" style="text-align:left;"><strong><font color="#2a2a2a" size="5">THE SECOND INGREDIENT OF TRUSTWORTHINESS: INCLUSIVITY</font></strong><br /></div>  <div class="paragraph"><br />&#8203;Another important ingredient is inclusivity: that is, the extent to which we consider all the evidence which is relevant.&nbsp; After all, calibration isn&rsquo;t everything we care about, since someone could be perfectly well calibrated merely by assigning 50% probabilities to a series of &ldquo;yes/no&rdquo; questions. Additionally, <a href="https://www.cambridge.org/core/journals/judgment-and-decision-making/article/role-of-actively-openminded-thinking-in-information-acquisition-accuracy-and-calibration/1D78BE16863F3F6B2D1C8B5307C9C3B3" target="_blank">some evidence</a> suggests people form more accurate judgments when they include more evidence than others do&mdash;and it&rsquo;s obvious how this can improve accuracy when, for example, the included evidence is DNA evidence which vindicates otherwise convicted defendants in law.<br /><br />What we also care about, then, is whether judgments of probability are informative in the sense that they tell us whether something is true or not in a particular case. This in turn is largely a matter of including evidence. 
For example, one could be well calibrated by assigning 50% probabilities to the &ldquo;yes/no&rdquo; questions, but they would likely be omitting relevant evidence and not saying anything particularly informative.<br /><br />Calibrationism then says that in order for judgments to be trustworthy, they must also include all the evidence which we regard as relevant.<br />&#8203;</div>  <div class="paragraph"><font color="#2a2a2a" size="5"><strong>GETTING PRACTICAL: <br />&#8203;HOW TO IMPLEMENT CALIBRATIONISM</strong></font></div>  <div class="paragraph"><br />&#8203;So that&rsquo;s calibrationism in a rough nutshell: our judgments are trustworthy to the extent we have evidence that A) they are produced in ways that are well calibrated and B) they are inclusive of all the relevant evidence. What, then, are the implications of this? Four come to mind:<br />&#8203;<ol><li>We should measure calibration to determine which judgments are trustworthy</li><li>We should trust individuals who have evidence of their calibration&nbsp;</li><li>But only if they are inclusive of all the evidence</li><li>And where necessary, we should aim to improve our calibration</li></ol><br />Practically, I provide more ideas about how we can do these things elsewhere. For example, one can measure calibration by plugging some judgments into a spreadsheet template, as I discuss <a href="https://www.johnwilcox.org/johns-blog/how-can-we-measure-the-accuracy-of-judgments-and-determine-which-ones-to-trust" target="_blank">here</a>. Calibration can also be improved with recommendations which I discuss <a href="https://link.springer.com/book/10.1007/978-3-031-19205-0" target="_blank">here</a>. Lastly, if we want to assess the inclusivity and trustworthiness of someone&rsquo;s thinking, we can list the evidence which we think is relevant, ask them questions about their reactions to each item, and if their responses seem to reflect calibrated engagement with all of the evidence, then we can trust their judgments.<br /><br />Put simply, to determine which judgments to trust, we might want to see more calibration graphs and evidence checklists to assess calibration and inclusivity&mdash;at least when the stakes are high. This might help make a world with fewer fatal misdiagnoses, false criminal convictions and other expressions of inaccuracy that compromise the functioning and well-being of our societies.</div>]]></content:encoded></item><item><title><![CDATA[New book, "Human Judgment", is now published]]></title><link><![CDATA[https://www.johnwilcox.org/johns-blog/new-book-human-judgment-is-now-published]]></link><comments><![CDATA[https://www.johnwilcox.org/johns-blog/new-book-human-judgment-is-now-published#comments]]></comments><pubDate>Mon, 02 Jan 2023 11:22:26 GMT</pubDate><category><![CDATA[Heuristics and biases]]></category><category><![CDATA[Probability and statistics]]></category><guid isPermaLink="false">https://www.johnwilcox.org/johns-blog/new-book-human-judgment-is-now-published</guid><description><![CDATA[&#8203;THE TL;DR KEY POINTS  We all make judgments, and our important life decisions depend&nbsp;on them--but how good are those judgments?My new book "Human Judgment: How Accurate is It, and How Can it Get Better?" investigates this topic (it's now available to purchase here)It has two somewhat newsworthy items: one bad, the other goodThe bad news is that&nbsp;science suggests humans are often much more inaccurate than we might hope or expect--for example,&nbsp;thousands die each year because o [...] 
]]></description><content:encoded><![CDATA[<h2 class="wsite-content-title">&#8203;THE TL;DR KEY POINTS</h2>  <div class="paragraph"><ul><li><em><strong>We all make judgments, and our important life decisions depend&nbsp;on them</strong></em><span style="color:rgb(123, 123, 123)">--<strong><em>but how good are those judgments?</em></strong></span></li><li><strong><em>My new book "Human Judgment: How Accurate is It, and How Can it Get Better?" investigates this topic (it's now available to purchase <a href="https://link.springer.com/book/10.1007/978-3-031-19205-0" target="_blank">here</a>)</em></strong></li><li><strong><em>It has two somewhat newsworthy items: one bad, the other good</em></strong></li><li><strong><em>The bad news is that&nbsp;science suggests humans are often much more inaccurate than we might hope or expect</em></strong><span style="color:rgb(123, 123, 123)">--</span><strong><em>for example,&nbsp;thousands die each year because of&nbsp;misdiagnoses and&nbsp;false death sentence convictions</em></strong></li><li><strong><em>The good news is that&nbsp;science suggests a number&nbsp;of concrete ways to measure and improve the accuracy of our judgments</em></strong></li><li><strong><em>The book then outlines and summarizes those recommendations</em></strong></li></ul></div>  <div><div style="height: 20px; overflow: hidden; width: 100%;"></div> <hr class="styled-hr" style="width:100%;"></hr> <div style="height: 20px; overflow: hidden; width: 100%;"></div></div>  <div class="paragraph">We all make countless judgments, and our important life decisions depend on them.&nbsp;&nbsp;<br /><br />My new book, &ldquo;Human Judgment&rdquo;, investigates these judgments, and it is now available to purchase online <a href="https://link.springer.com/book/10.1007/978-3-031-19205-0" target="_blank">here</a>.<br /><br />The book concerns two topics to do with human judgment, as implied by the subtitle: How accurate is it, and how can it get better?<br /><br />It has two somewhat newsworthy items, one bad and the other good.<br /><br />The bad news is that the science suggests that human judgment is often much more inaccurate than we might hope or expect. For example, some researchers have estimated that as many as 40,000 to 80,000 US citizens die because of preventable misdiagnoses&mdash;and that&rsquo;s <em>each year</em>. If they are right, that&rsquo;s a yearly death toll at least 13 times higher than that of the September 11th terrorist attacks. Unfortunately, medicine is not unique, either: judgmental inaccuracy can afflict a number of other areas in society as well. As another example, some researchers estimate at least 4.1% of death sentence convictions in the US are actually false convictions; this implies that some people are tried, convicted and executed for horrific crimes that they never actually committed. Those are just a few of the numerous studies painting a less than ideal picture of human judgment: we make inaccurate judgments about medical diagnoses, about criminal convictions and about a number of other areas.</div>  <div>  <!--BLOG_SUMMARY_END--></div>  <div class="paragraph"><br /><span style="color:rgb(123, 123, 123)">Of course, we are not always so inaccurate. We often judge with perfect accuracy where we live or whether humans need oxygen to survive, to take a couple of many mundane examples. 
To that extent, the book espouses a&nbsp;</span><em style="color:rgb(123, 123, 123)">context-dependent model of human accuracy</em><span style="color:rgb(123, 123, 123)">: how accurate we are simply depends on the context, thus prohibiting unqualified generalizations.</span><br /><br /><span style="color:rgb(123, 123, 123)">Regardless, we are substantially inaccurate in an important range of other contexts, and then there are many more contexts where we simply do not know how accurate humans are.</span><br /><br /><span style="color:rgb(123, 123, 123)">The good news, however, is that science also suggests some ways to measure and improve accuracy. The hope, then, is that the book can motivate society to capitalize on those ways where possible, and it gives some recommendations to that end.</span><br /><br /><span style="color:rgb(123, 123, 123)">Most of the recommendations are informed by pioneering research funded by the US intelligence community. US intelligence has funded some of the most cutting-edge research in improving the accuracy of human judgment. The book then describes a lot of that research. However, often the research concerns specifically&nbsp;</span><em style="color:rgb(123, 123, 123)">geopolitical topics</em><span style="color:rgb(123, 123, 123)">--</span><span style="color:rgb(123, 123, 123)">topics like the outcomes of wars, elections and the like. That said, the research also supports a set of generalizable recommendations for improving human judgment&mdash;or so I argue in the book.</span><br /><br /><span style="color:rgb(123, 123, 123)">The ultimate long-term aim of the book is that society will use these recommendations to improve our judgments, our decision-making and ultimately our lives&mdash;thereby reducing misdiagnoses, false convictions and many other expressions of judgmental inaccuracy that severely compromise human well-being. Fingers crossed it eventually achieves its aims!</span></div>  
<span class="form-required">*</span></label> 				<div class="wsite-form-input-container"> 					<textarea aria-required="true" id="input-666660850360175062" class="wsite-form-input wsite-input wsite-input-width-500px" name="_u666660850360175062" style="height: 200px"></textarea> 				</div> 				<div id="instructions-666660850360175062" class="wsite-form-instructions" style="display:none;"></div> 			</div></div> 			</ul> 			 		</div> 		<div style="display:none; visibility:hidden;"> 			<input type="hidden" name="weebly_subject" /> 		</div> 		<div style="text-align:left; margin-top:10px; margin-bottom:10px;"> 			<input type="hidden" name="form_version" value="2" /> 			<input type="hidden" name="weebly_approved" id="weebly-approved" value="approved" /> 			<input type="hidden" name="ucfid" value="564518087779446225" /> 			<input type="hidden" name="recaptcha_token"/> 			<input type="submit" role="button" aria-label="Submit feedback" value="Submit feedback" style="position:absolute;top:0;left:-9999px;width:1px;height:1px" /> 			<a class="wsite-button"> 				<span class="wsite-button-inner">Submit feedback</span> 			</a> 		</div> 	</form> 	<div id="g-recaptcha-564518087779446225" class="recaptcha" data-size="invisible" data-recaptcha="0" data-sitekey="6Ldf5h8UAAAAAJFJhN6x2OfZqBvANPQcnPa8eb1C"></div>    </div>]]></content:encoded></item></channel></rss>