Coronavirus Modeling

The Numbers Behind Social Distancing

Here in California, we’re finishing our first week of shelter in place. Governments around the world have implemented a variety of similar policies, from complete quarantines to simple travel advisories. As I wrote Tuesday, I believe swift quarantine measures are the only way to stay the virus’s ascent. This isn’t just opinion. The math says isolation is the fastest way to make this pandemic disappear. Of course, like every optimization problem, there are constraints. People need to shop for food. Families live under the same roof. Society depends on interactions, however minimal. If that’s the case, how dangerous is it for us to cross paths?

More directly: in a group of people, what are the chances that no one has Covid-19? Answering this allows us to understand how safe we are when we’re in groups.

Some Quick Math

First, let’s answer an easier question: in the course of your daily interactions, what are the chances that any one person you interact with has Covid-19?

If you divide the number of active cases in your location by the population in your location you obtain a rough estimate. This is ‘rough’ because most people stay home if they’re sick. However, let’s assume there’s an equally sized group of individuals that are infectious but asymptomatic roaming out there. At the beginning stages of the infection this is a reasonable simplification. The beauty of modeling is that you can decide to adjust this number based on your own beliefs if you disagree.

Let I be the number of local active cases and N be the local population:

    \[P(\text{infected})=\frac{I}{N}\]

To answer our original question, we take the complement (i.e. what’s the probability someone is not infected?):

    \[P(\text{not infected})=1-\frac{I}{N}\]

If you want to know the probability that multiple (independent) events are true at the same time, you have to multiply their individual probabilities. Therefore the probability that no one in a group of k has Covid-19 is:

    \[P(\text{no one infected in a group of } k)=\left(1-\frac{I}{N}\right)^{k}\]

Coast to Coast

Let’s look at San Francisco and New York City as examples. As of March 25th, they had 178 and 20,011 cases and populations of 884k and 8.6M respectively. Taking the equations above, the probability of no one being positive in a group of k follows these curves:

Said differently, if you were in a room of 250 people, the chance that everyone is negative is only 56% in NYC and 95% in San Francisco. Although only two in every 1000 people have coronavirus in NYC, probability works in such a way that your chances of encountering at least one infected person among 250 are staggering. This is the math of why groups are so dangerous: the chances compound as you add people even though individual probabilities are low.
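The calculation behind those figures fits in a few lines. This is a minimal sketch, assuming (as above) that the share of infectious people out and about is roughly I/N and that people are independent of one another. The function name `prob_no_one_infected` is mine, chosen for illustration.

```python
def prob_no_one_infected(active_cases, population, group_size):
    """Probability that nobody in a group of `group_size` is infected,
    assuming each person is independently infected with probability I/N."""
    p_infected = active_cases / population
    return (1 - p_infected) ** group_size

# Case counts and populations quoted above (as of March 25th)
nyc = prob_no_one_infected(20_011, 8_600_000, 250)
sf = prob_no_one_infected(178, 884_000, 250)
print(f"NYC: {nyc:.0%}, San Francisco: {sf:.0%}")  # NYC: 56%, San Francisco: 95%
```

Changing `group_size` shows how quickly the odds erode as a group grows.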

You can also interpret these numbers in a different way. For purposes of illustration, say the average person in NYC has about 250 people in their personal network. This implies that 44% of people know at least one person with the virus today.

If you flip the equation around and solve for I, you can ask how many cases there have to be in NYC for people to have a 90% chance of knowing at least one person in their network who has it. The answer is 78,800 cases. At the current rate, NYC will cross this threshold in the next few days.
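For anyone who wants to check that threshold, here is a small sketch of the rearrangement: set 1 - (1 - I/N)^k equal to the target probability and solve for I. The function name is hypothetical, and the 250-person network is the same illustrative assumption used above.

```python
def cases_for_target(population, network_size, target_prob):
    """Active cases I at which the chance of at least one infected person
    in a network of `network_size` reaches `target_prob`.

    Solves 1 - (1 - I/N)**k = target_prob for I."""
    return population * (1 - (1 - target_prob) ** (1 / network_size))

needed = cases_for_target(8_600_000, 250, 0.90)
print(f"{needed:,.0f} cases")
```

The result lands within rounding of the 78,800 figure.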

How many people can I safely see in a day?

Safe and legal are two different things. First, there are laws – read yours. Second, these equations are helpful, but they shouldn’t be interpreted too precisely, so stay safe.

No matter your situation, it’s not safe to see lots of people right now. Although you have a small chance of interacting with someone positive, there are many people taking that chance every day. Some of those people will be unlucky. Those unlucky few infections compound quickly through this process.

Working the Numbers

Let’s assume you want to have some ‘margin of comfort’ (probability) of knowing you won’t run into anyone with coronavirus in a group of k people. What’s the largest group you could be in?

This is best answered through an example using real data. The following graph shows the maximum safe group size given a margin of comfort using the equations above and real data from NYC cases:

If you wanted to be 90% sure you didn’t run into anyone with the virus you’d stay below the red line. If you were conservative and wanted to be 99% sure, you’d be below the blue line. On March 1st, you could walk freely. Just two weeks later the picture was far more dire.
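The line on that graph comes from solving the group equation for k. Here is a rough sketch, assuming the same I/N simplification and using the March 25th NYC figures quoted earlier. The function name is my own.

```python
import math

def max_safe_group(active_cases, population, comfort):
    """Largest whole group size k with (1 - I/N)**k >= comfort."""
    if active_cases == 0:
        return math.inf  # with no cases, any group size clears the bar
    p_clear = 1 - active_cases / population
    return math.floor(math.log(comfort) / math.log(p_clear))

for comfort in (0.90, 0.99):
    size = max_safe_group(20_011, 8_600_000, comfort)
    print(f"{comfort:.0%} margin of comfort: at most {size} people")
```

With those inputs the 90% margin allows 45 people and the 99% margin only 4.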

You can see that as cases grow, the safe group size falls precipitously – no matter your margin of comfort. This is why it’s important to act quickly with social distancing: the safe group went from 100,000s to 100s in a few days. In fact, the graph compresses so quickly, it’s easiest to see in log scale:

An interesting overlay is what New York University (NYU) decided to do with their classes. NYU has a population of about 51,000 students. If you ran the university and your margin of comfort was 90% of not having any students with COVID-19, you moved classes online at precisely the right time.

Now let’s say you run Starbucks, and you have to decide when to close your stores in NYC. The average Starbucks sees roughly 500 customers per day, so in a busy city let’s double that and assume 1000 customers per day. Using the same equations, you can then answer: “What are the chances no customers enter with coronavirus today?”

While it’s unlikely customer count would stay the same for this period, the graph is instructive. In the days following the Starbucks announcement, the chance of a store having all customers be coronavirus-free went from nearly 100% down to 10%. It’s interesting to me how many of these organizations intuitively made the right call at just the right time.

These are just a handful of examples but they support the same point: any individual is unlikely to be infectious, but as you add them to groups, the chances skyrocket that there’s at least one COVID-19 carrier in the group. This is why social distancing and limiting groups is so critical to stopping the spread.

Side note: this math underscores how heroic it is for any person to step into their job to keep us fed, healthy and safe. A huge thank you to everyone in those roles: doctors, police officers, chefs, and more.


The US just crossed a dangerous threshold

Today, Dr. Deborah Birx, the White House’s coronavirus response coordinator, said something you should hear:

The only data that we all have [is that] the two areas that have moved through their curve [are] China and South Korea. […] Those were 8-10 week curves. Each state and each hot spot in the US will be its own curve because the seeds came in at different times.

So Washington State is on their curve, they’re about two weeks ahead of New York, and so each of these have to be done in a granular way to really understand where we are. It’s the charge of the President […] to really define those issues about where the virus is, where is it going, and what predictions we can make about where we are in that bell-shaped curve.

Dr. Deborah Birx, White House Press Conference, March 23, 2020

I think this is one of the most important dynamics to understand. There are two parts:

  1. Every infected city/area develops independently, tracing its own curve along the way.
  2. Each of these curves started with a ‘seed’ individual at a specific time.

These two facts are important, because they unlock the ability to compare infections that happen at different places at different times. To do this, you need to define the beginning of a local infection and you need to record total cases over time.

Let’s call the beginning the day a region (country, state or city) reaches 100 cases. Now that you have “day zero”, you can plot cases on a common timeline: days since 100 infections. When you do, some startling (and terrifying) things become clear:

At Instagram, we called these ‘snail charts’. We’d use them to compare the growth of a product over time from day of launch (often we’d launch a product in one country before another).
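The re-indexing step is easy to sketch. The snippet below assumes each region is a plain list of cumulative case counts, one entry per day. The data is made up for illustration and the function name is hypothetical.

```python
def days_since_threshold(cumulative_cases, threshold=100):
    """Return the series re-indexed so that index 0 is the first day
    the cumulative count reached `threshold`. Empty if it never did."""
    for day, cases in enumerate(cumulative_cases):
        if cases >= threshold:
            return cumulative_cases[day:]
    return []

# Hypothetical daily cumulative counts for two regions, not real data
region_a = [12, 40, 95, 130, 260, 510]
region_b = [3, 8, 20, 60, 110, 190, 400, 820]

aligned = {
    "region_a": days_since_threshold(region_a),
    "region_b": days_since_threshold(region_b),
}
for name, series in aligned.items():
    print(name, series)
```

Once every series starts at its own "day zero", plotting them on one axis compares growth rather than calendar dates.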

State of the United States

The first conclusion to take in: the United States now has the fastest growing mature infection on record. What does this mean? Since infections start at different times, it’s hard to say which infection is ‘worst’. Presumably, an infection that goes from zero to 100,000 cases faster than another one is both qualitatively and quantitatively more troubling (mortality rates notwithstanding).

By the way, these lines don’t bend easily. When they do bend (China and Korea), it has taken draconian quarantines, mass surveillance and mass testing – none of which exist in the US. Even once these measures are in place, cases have taken over a week to flatten.

Looking at the chart and knowing the US has relatively mild measures, it’s not hard to conclude that cases will soar past China’s and end far higher. Only an act of god (or, more reasonably, a national lockdown that halts all transit and bans all business other than essential health, food and government services) will give the US a fighting chance. Any talk of reopening the economy soon will ensure this line stays straight, up, and to the right.

Flatten that Curve. Now.

Now with just four countries – but I’ve added dotted lines showing a hypothetical country that doubles every 1, 2, 3 and 7 days for reference.
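The dotted reference lines are plain exponentials. Here is a minimal sketch of how one could generate them, assuming each hypothetical country starts at 100 cases on day zero. The function name is mine.

```python
def doubling_reference(days, doubling_time, start=100):
    """Cases on each day for a region that starts at `start` cases
    and doubles every `doubling_time` days."""
    return [start * 2 ** (day / doubling_time) for day in range(days + 1)]

for doubling_time in (1, 2, 3, 7):
    curve = doubling_reference(14, doubling_time)
    print(f"doubles every {doubling_time}d -> {curve[-1]:,.0f} cases by day 14")
```

On a log scale each of these is a straight line, which is what makes them useful as visual guides.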

I mentioned how hard it is to bend these curves. This is the second key lesson. There are only two ways these curves bend: reducing contacts between people and thus reducing transmission, or running out of people to infect. The key is to choose the former before being dealt the latter.

Take China: on January 23rd, they suspended travel and kept anyone from leaving Wuhan. Not everything went well. An estimated 300 thousand people left before the lockdown took place. In response, the government quarantined all of Hubei province’s cities by the 27th. By that point 56 million residents were under quarantine and every non-essential business was closed. The stunning fact about this all? At the moment of quarantine, there were a paltry 830 cases.

Let’s turn to Italy. After a series of cases linked to China, Italy’s cases grew quickly. On March 8th, when Italy had 7,375 cases, Prime Minister Giuseppe Conte quarantined all of Lombardy and 14 other northern provinces. Italy had nearly ten times as many cases as China did when it took similar action. And stories from the time show that there was little enforcement:

“There was no immediate disruption to air travel, either, with scheduled flights still departing and landing in Milan. A sign at Milan’s Linate airport assured passengers that regular service was continuing. Italy’s national carrier, Alitalia, said it would reduce the number of flights in and out of Milan.”

“Italy’s Coronavirus Lockdown Met with Confusion, Questions About Enforcement”
Margherita Stancati,

Realizing they were behind, Italy expanded this quarantine to the entire nation the following day, stopping all commercial activity, and finally closed all non-essential businesses on the 21st, when the latest case count showed a towering 53,578 cases. Now, two weeks after the first quarantine, Italy’s cases have ballooned to 64,000. Sometimes early isn’t early enough. The good news is that for the second straight day the number of new cases in Italy has dropped. The prognosis is complicated, and I plan to write about that separately. But the lesson, if any, from Wuhan is that the most effective action is to lock down when infections are low.

Some US states have had a succession of increasingly restrictive lockdowns. First, San Francisco and surrounding counties mandated residents to shelter in place on March 16th – there were 472 cases. At the same time, New York Gov. Cuomo dismissed the idea of a shelter in place order:

“As a matter of fact, I’m going so far that I don’t even think you can do a state-wide policy.”

Gov. Andrew Cuomo, New York State Governor, CNN

Days later, with cases just topping 1000, California ordered residents statewide to shelter in place. Only then, on the 20th, did Gov. Cuomo announce a stay at home order – a euphemism for shelter in place. By then, cases in New York had reached 8,402. Today, not three days later, they passed 20,875.

I suspect history will show that the early action in California saved countless lives. At the same time, I worry the hesitation–if only for a few days–in New York might be one of the largest public policy mistakes of our generation.

US doesn’t have an infection, States do

The last conclusion, and one that I will revisit again in upcoming posts, is that it’s a mistake to analyze a country as a whole. After all, California has 40 million residents – Italy has 60M. The line between states and countries starts to blur. You can aggregate regions any way you want, but you will always get a clearer picture by analyzing the component parts. In this case, we have states – each of which has a very different trajectory.

The same chart, but now with states. This exercise can be repeated for regions of a country or even cities themselves.

Once you look at this chart, you can’t unsee New York’s line. Not only is it just as mature as Washington State (the state with the first infection, which arguably garnered most of the media attention for the last couple of weeks), but it has an order of magnitude more cases in the same time. New York is currently hugging the ‘doubles every two days’ line – which for a state of nearly 20 million people should give you pause.
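You can sanity-check that doubling rate from two data points. The sketch below uses the New York counts quoted earlier (8,402 on the 20th and 20,875 roughly three days later). Treating the gap as exactly three days is my approximation, and the function name is hypothetical.

```python
import math

def doubling_time(cases_start, cases_end, days_elapsed):
    """Days needed to double, assuming steady exponential growth
    between the two observations."""
    return days_elapsed * math.log(2) / math.log(cases_end / cases_start)

# New York counts quoted earlier; the 3-day gap is an approximation
ny_doubling = doubling_time(8_402, 20_875, 3)
print(f"doubling roughly every {ny_doubling:.1f} days")
```

That works out to a little over two days. Since the text says the gap was under three days, the true figure would be slightly shorter.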

But don’t let the largest states get all your attention. The chart above shows that Michigan (1,328), New Jersey (2,860) and Illinois (1,285) have grown far more quickly in a shorter number of days. At the age each of those reached 1,000, New York was sitting in the hundreds.

Of course, this might be because of increased testing, which surfaces more cases. It’s possible New York missed cases and is now catching up. Regardless, you should watch these states over the next week. They are all bigger and growing faster than New York was at the same age, and that doesn’t bode well.

When forecasting, you don’t always need a complicated model. Sometimes all you need to do is find similar situations and observe how they evolved. If they evolve in predictable ways, ask yourself: why is this time any different? If you don’t have a good answer, you can surely expect more of the same.

Predicting Coronavirus Cases

Of all the first posts I’d write, this was the least likely. But then again, we’ve all been caught off guard by the last two weeks. That’s the thing about probability; rare still means possible. So here we are.

Instagram, in its own way, was a black swan. In 2010 we launched and 25,000 people signed up the first day. One million were using it three months in. By the time I left a year ago, over a billion people used it every month. Watching this growth, I became interested in the science of how things grow. Why was this thing growing so quickly? How quickly would it grow? When might it stop growing? As an investor over the past year, I started asking myself the same questions of other companies as well. Through my research, I found a model called the SIR model that applies to infectious diseases. Some said it applied to company growth, too.

The other week, I thought about whether this model could be applied to coronavirus. On the one hand, some pundits argued cases would whimper out within a few weeks. Alarmists, however, insisted exponential curves they drew fit the data nightmarishly well. Which one was it?

I wanted to try two things. First, I’d study the SIR model and see if I could fit it to the real world data. I also wanted to figure out a way to explain how uncertain I was about that model being right. Since I’m not an epidemiologist, it’s even more important that I explain the model’s uncertainty through ranges of values called credible intervals. No one knows what the future of COVID-19 holds, but a model can provide a picture of probable outcomes.

My goal is to share my process so that you can judge the foundation for my conclusions. I also hope that by seeing how bad this can get, we might collectively avoid the worst predicted outcomes by acting quickly and decisively.

Most importantly: this is a work in progress. I don’t have all the answers, nor do I claim to know the future for certain. That being said, the model has tracked the last week very closely, and I’d personally rather have a model than no model at all. I’m open to feedback and hope that smarter people out there will both build off of this work as well as help improve it.

SIR Model for Viral Growth

Imagine an app with 100 users. At launch, each of those people tells friends about it, each of whom has some probability of becoming an active user. The process then repeats itself with those new users. They tell their friends, and with any luck they stick around for the process to continue with their friends. Successful companies find ways of continuing this process so that every person that joins brings at least one other person onto the service.

A virus isn’t any different. It might start with a sneeze. Invisible droplets, replete with virus, float towards susceptible people. Some people inhale the virus, and with some probability, become infected themselves. The process then repeats itself, but now your friends are the ones sniffling.

The SIR model attempts to explain both of these situations. Assume every individual is in one of three states: susceptible, infected, or resistant. In a simple world, patient zero is infected, everyone else is susceptible and nobody is resistant.

With every step in time, some susceptible people become infected, some infected people recover to be resistant, and resistant people stay resistant. Note that for simplicity, I’ve assumed a constant population size and that in the terrible case that someone dies, they are counted in the resistant population as they cannot spread the virus.

As the infected group grows, you’re more likely to run into someone sick and catch it yourself. Cases grow exponentially. At some point though, enough people have recovered that the chance of a susceptible person meeting an infected one all but disappears. New cases slow, infected people recover, and you end up with most people being resistant. The groups evolve like this:

If you’ve watched the news, the metric they track is ‘total cases’. To calculate this, add the infected and resistant groups (less how many were resistant to start with, if any). New cases per day is the slope of the ‘total cases’ line:

Warning: Lots of math ahead. I try my best to explain what it all means.

Sure enough, the characteristic S-curve emerges. Total cases start slow, ramp exponentially, then linearly and finally taper off at some limit. The SIR model defines equations that produce these graphs. The equations govern the change in each group per unit of time:


    \[\frac{dS}{dt}=-\beta \frac{I}{N} S\]

    \[\frac{dI}{dt}=\beta \frac{I}{N} S-\gamma I\]

    \[\frac{dR}{dt}=\gamma I\]

S, I, and R are the totals of each group. N is the total of all three. The other variables are:

  • 𝜸 (gamma) – the rate of recovery. This is a fancy way of saying: what portion of infected people become healthy and resistant again per unit of time? This also happens to be the inverse of the duration of sickness.
  • 𝛽 (beta) – the transmission rate. This is the number of people an infected person infects per unit of time. While 𝛽 is unknown, it can be estimated given data (more on that soon).
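To make the equations concrete, here is a minimal sketch that steps them forward with simple Euler integration. The integration method, the step size, and the parameter values (100 people, one initial case, beta of 0.5, gamma of 1/9) are my illustrative assumptions, not necessarily what produced the charts above.

```python
def simulate_sir(population, initial_infected, beta, gamma, days, dt=0.1):
    """Integrate the SIR equations with simple Euler steps.
    Returns one (S, I, R) tuple per whole day, starting at day 0."""
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    steps_per_day = round(1 / dt)
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * i / population * s * dt  # -dS
            new_recoveries = gamma * i * dt                  # dR
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((s, i, r))
    return history

# Illustrative values only: 100 people, one initial case
history = simulate_sir(population=100, initial_infected=1,
                       beta=0.5, gamma=1/9, days=60)
total_cases = [i + r for _, i, r in history]           # infected + resistant
new_cases = [b - a for a, b in zip(total_cases, total_cases[1:])]
peak_day = max(range(len(history)), key=lambda d: history[d][1])
print(f"infections peak on day {peak_day}; "
      f"total cases by day 60: {total_cases[-1]:.0f}")
```

`total_cases` traces the S-curve described above, and `new_cases` is its day-over-day slope.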

There’s an issue with 𝛽, however. It’s static. In the real world, 𝛽 should decrease over time as people become more aware of the virus and people avoid gatherings, work, etc. I have to assume 𝛽 shrinks over time because people are smart and distance themselves as they learn about the virus. The traditional model doesn’t include this effect, but there’s no reason why we can’t add it if we assume (and witness) that’s the way the world works. To take this effect into account, I added an additional equation governing the rate of decline of 𝛽 over time. I assume beta shrinks by a factor of δ (delta) at each step.

    \[\frac{d\beta}{dt}=-\beta \delta\]

Below, I’ve run the same model, but this time with various levels of δ. Think of δ as how quickly people distance themselves from others. Remember that 𝛽 is the number of people an infected person infects over time.

If people don’t ‘hunker down’ 𝛿 is zero and the virus infects the full population. With increasing 𝛿 (more social distancing), we reduce total cases and the rate of new infection. When you hear ‘flatten the curve’, this is what they’re talking about.

Imagine we only have ten ICU beds for our population of 100 and all infected people require critical hospitalization. If 𝛿=0.1, infections peak on day 13 but stay under the critical limit of 10 beds. If we don’t act at all, infections peak at 45 so we’re short by 35 beds on day 15. Models like this help us understand if and how our medical system can be overwhelmed depending on specific policy actions that influence 𝛿.
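Here is a rough sketch of that thought experiment. The starting beta of 0.5 and gamma of 1/9 are my own illustrative picks, and I use simple Euler steps, so the exact peak days and sizes will not necessarily match the figures quoted above. The qualitative pattern is what matters.

```python
def simulate_with_distancing(population, initial_infected, beta, gamma,
                             delta, days, dt=0.1):
    """SIR with a transmission rate that decays as d(beta)/dt = -beta * delta.
    Returns the infected count at the end of each day (index 0 is day 0)."""
    s = float(population - initial_infected)
    i = float(initial_infected)
    infected_by_day = [i]
    for _ in range(days):
        for _ in range(round(1 / dt)):
            new_infections = beta * i / population * s * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            beta -= beta * delta * dt  # social distancing erodes beta
        infected_by_day.append(i)
    return infected_by_day

icu_beds = 10  # the hypothetical bed limit from the example above
peaks = {}
for delta in (0.0, 0.05, 0.1, 0.2):
    infected = simulate_with_distancing(100, 1, beta=0.5, gamma=1/9,
                                        delta=delta, days=60)
    peaks[delta] = max(infected)
    status = "over" if peaks[delta] > icu_beds else "within"
    print(f"delta={delta}: peak {peaks[delta]:.1f} infected on day "
          f"{infected.index(peaks[delta])}, {status} the {icu_beds}-bed limit")
```

Setting `delta=0` recovers the plain SIR model, which makes it easy to compare the two side by side.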

Besides working well theoretically, this modified version of SIR describes what we’re seeing in the real world, too. To show you, we need to choose 𝜸, 𝛽, and 𝛿 so that the model fits the data we see in the real world.

Fitting the model to data

The last section described four knowns: S, I, R, N. We also have three unknowns: 𝜸 (gamma, rate of recovery), 𝛽 (beta, rate of transmission) and 𝛿 (delta, social distancing factor) that we need to estimate.

For estimating 𝜸 (gamma, rate of recovery), we need the inverse of the length of infectiousness. If you are infectious for 5 days, 𝜸 is 1/5 because 1/5 of infected people recover every day. In the case of COVID-19, it seems that the incubation period lasts about 5 days. Researchers said they could not grow the virus from specimens taken 8 days after the onset of symptoms. I assume, then, that the average infectious period is about 9 days (5 plus half of 8). Therefore, I assume 𝜸 = 1/9. This may not be perfect, and if we were very concerned we could try different values.

𝛽 and 𝛿 are harder, and likely dependent on the specific population. The easiest way I know to choose 𝛽 and 𝛿 is to run a least-squares regression on the data from a given country. However, this produces a single ‘best guess’ value for 𝛽 and 𝛿. Since we are uncertain about the future, we’d like to know how uncertain we are about the best values of 𝛽 and 𝛿. It’d be nice to have distributions of the two parameters. For this, I decided to use pymc3, a library for probabilistic programming.

Pymc3 allows you to set up a model with knowns and unknowns. After observing real data, it returns distributions for the unknowns. (If you are interested in a more technical explanation, you can read more on the pymc3 site.)
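For contrast, here is what the simpler least-squares route mentioned above might look like. This is not the pymc3 model. It is a crude grid search, standing in for a proper optimizer, that returns a single best guess. The "observations" are synthetic, generated from known parameters so the fit can be checked, and every function name and value here is an assumption for illustration.

```python
def model_total_cases(beta, delta, gamma, population, initial_infected,
                      days, dt=0.1):
    """Cumulative cases (I + R) per day from the SIR model with decaying beta."""
    s = float(population - initial_infected)
    i = float(initial_infected)
    totals = [i]
    for _ in range(days):
        for _ in range(round(1 / dt)):
            new_infections = beta * i / population * s * dt
            s -= new_infections
            i += new_infections - gamma * i * dt
            beta -= beta * delta * dt
        totals.append(population - s)  # everyone no longer susceptible
    return totals

def fit_least_squares(observed, gamma, population, initial_infected,
                      betas, deltas):
    """Grid search for the (beta, delta) pair minimising squared error."""
    best = None
    for beta in betas:
        for delta in deltas:
            predicted = model_total_cases(beta, delta, gamma, population,
                                          initial_infected, len(observed) - 1)
            error = sum((p - o) ** 2 for p, o in zip(predicted, observed))
            if best is None or error < best[0]:
                best = (error, beta, delta)
    return best[1], best[2]

# Synthetic "observations" from known parameters, not real case data
true_beta, true_delta = 0.5, 0.02
observed = model_total_cases(true_beta, true_delta, 1/9, 1_000_000, 100, 20)

betas = [n / 100 for n in range(30, 75, 5)]   # 0.30 .. 0.70
deltas = [n / 100 for n in range(0, 6)]       # 0.00 .. 0.05
fitted_beta, fitted_delta = fit_least_squares(observed, 1/9, 1_000_000, 100,
                                              betas, deltas)
print(fitted_beta, fitted_delta)
```

The search should recover the parameters used to generate the data. What it cannot do is say how uncertain those values are, which is the gap the probabilistic approach fills.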

After running US data through the model, it returned distributions for our parameters 𝛽 and 𝛿:

The model believes 𝛽 (transmission rate) is likely around 0.535 people per day and 𝛿 (transmission rate decay) is close to 0.01 as of March 18th. As discussed, we assume 𝜸 = 1/9 (9 days of infectiousness).

The model’s forecast

The parameters we get back from the model are distributions. This means there are many possible versions of the model. Some are more likely than others, like the red dotted average case below. At the same time, there are some cases that might happen but are less likely. The gray section in the chart shows the range where 90% of models exist.

In the short run, the model is confident. But one month out, the credible interval expands dramatically – we cannot be that confident in the future. This does not mean we cannot draw conclusions, though.

For instance, the model claims there is a 95% chance we will have more than 15.4 million infections in the United States. The best estimates of IFR (infection fatality ratio) are around 2%. Recent tallies imply a 4% IFR, though this is disputed because mild cases go undiagnosed.

Either way, a conservative 1% IFR implies a 95% chance of 154,000 deaths or greater in the US alone. The average scenario of the model implies 1.5 million dead in the US – bested by the now widely cited Imperial College study at 2.2 million deaths. In a typical year, the flu kills just 40,000.
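The arithmetic is simple enough to rerun under whichever IFR you find credible. A small sketch, using the 15.4 million infection figure from above and the three IFR values mentioned in the text (the function name is mine):

```python
def implied_deaths(infections, ifr):
    """Deaths implied by an infection count and an infection fatality ratio."""
    return round(infections * ifr)

infections_lower_bound = 15_400_000  # the model's 95% figure quoted above

for ifr in (0.01, 0.02, 0.04):
    deaths = implied_deaths(infections_lower_bound, ifr)
    print(f"IFR {ifr:.0%}: at least {deaths:,} deaths")
```

The result scales linearly, so doubling the assumed IFR doubles the implied toll.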

It’s important to remember that this is a model that shows what happens if we stay the course. It’s not a prediction of the future because our behavior may change. If we lock down cities, cancel events, etc., 𝛽 will fall far more quickly than the model expects. So far, it’s unclear how much we’ll change, though. In New York, politicians are resisting these measures, while San Francisco implemented them quickly.

I’m hopeful that these predictions push policymakers, local governments and individuals to take extraordinary precautions to reduce transmission rates. Over the coming days and weeks, I’ll provide updates to the model and inferences based on its output. The estimates will change and credible intervals will tighten with new data and we’ll get a clearer picture of what the future holds.

I will analyze specific countries, like Italy, where people are looking to see the effects of a national lockdown. I’ll take a look at the trajectory of various countries and make inferences about how things will go. I’ll try to answer questions like: how many people is it ‘safe’ to socialize with now and in one week?

In almost every case, the conclusions are more dire than people currently believe. Statistician George Box once said that all models are wrong but some are useful. In this case, I sincerely hope he’s right.