Coronavirus Modeling

The Numbers Behind Social Distancing

Here in California, we’re finishing our first week of shelter in place. Governments around the world have implemented a variety of similar policies, from complete quarantines to simple travel advisories. As I wrote Tuesday, I believe swift quarantine measures are the only way to stay the virus’s ascent. This isn’t just opinion. The math says isolation is the fastest way to make this pandemic disappear. Of course, like every optimization problem, there are constraints. People need to shop for food. Families live under the same roof. Society depends on interactions, however minimal. If that’s the case, how dangerous is it for us to cross paths?

More directly: in a group of people, what are the chances that no one has Covid-19? Answering this allows us to understand how safe we are when we’re in groups.

Some Quick Math

First, let’s answer an easier question: in the course of your daily interactions, what are the chances that any one person you interact with has Covid-19?

If you divide the number of active cases in your location by the population in your location you obtain a rough estimate. This is ‘rough’ because most people stay home if they’re sick. However, let’s assume there’s an equally sized group of individuals that are infectious but asymptomatic roaming out there. At the beginning stages of the infection this a reasonable simplification. The beauty of modeling is that you can decide to adjust this number based on your own beliefs if you disagree.

Let I be the number of local active cases and N be the local population:


To answer our original question, we take the complement (i.e. what’s the probability someone is not infected?):


If you want to know the probability that multiple (independent) events are true at the same time, you have to multiply their individual probabilities. Therefore the probability that no one in a group of k has Covid-19 is:


Coast to Coast

Let’s look at San Francisco and New York City as examples. As of March 25th, they had 178 and 20,011 cases and populations of 884k and 8.6M respectively. Taking the equations above, the probability of no one being positive in a group of k follows these curves:

Said differently, if you were in a room of 250 people, the chances that everyone is negative is only 56% in NYC and 95% in San Francisco. Although only two in every 1000 people have coronavirus in NYC, probability works in such a way that your chances of encountering at least one person in 250 are staggering. This is the math of why groups are so dangerous, the chances compound as you add people even though individual probabilities are low.

You can also interpret these numbers in a different way. For purposes of illustration, say the average person in NYC has about 250 people in their personal network. This implies that 44% of people know at least one person with the virus today.

If you flip the equation around and solve for group k, you can ask how many cases there has to be in NYC for people to have a 90% chance of knowing one person in their network who has it. The answer is 78,800 cases. At the current rate, NYC will cross this threshold in the next few days.

How many people can I safely see in a day?

Safe and legal are two different things. First, there are laws – read yours. Second, these equations are helpful, but they shouldn’t be interpreted too precisely, so stay safe.

No matter your situation, it’s not safe to see lots of people right now. Although you have a small chance of interacting with someone positive, there are many people taking that chance every day. Some of those people will be unlucky. Those unlucky few infections compound quickly through this process.

Working the Numbers

Let’s assume you want to have some ‘margin of comfort’ (probability) of knowing you won’t run into anyone with coronavirus in a group of k people. What’s the largest group you could be in?

This is best answered through an example using real data. The following graph shows the maximum safe group size given a margin of comfort using the equations above and real data from NYC cases:

If you wanted to be 90% sure you didn’t run into anyone with the virus you’d stay below the red line. If you were conservative and wanted to be 99% sure, you’d be below the blue line. On March 1st, you could walk freely. Just two weeks later the picture was far more dire.

You can see that as cases grow, the safe group size falls precipitously – no matter your margin of comfort. This is why it’s important to act quickly with social distancing: the safe group went from 100,000s to 100s in a few days. In fact, the graph compresses so quickly, it’s easiest to see in log scale:

An interesting overlay is what New York University (NYU) decided to do with their classes. NYU has a population of about 51,000 students. If you ran the university and your margin of comfort was 90% of not having any students with COVID-19, you moved classes online at precisely the right time.

Now let’s say you run Starbucks, and you have to decide when to close your stores in NYC. The average Starbucks sees roughly 500 customers per day, so in a busy city let’s double that and assume 1000 customers per day. Using the same equations, you can then answer: “What are the chances no customers enter with coronavirus today?”

While it’s unlikely customer count would stay the same for this period, the graph is instructive. In the days following the Starbucks announcement, the chance of a store having all customers be coronavirus-free went from nearly 100% down to 10%. It’s interesting to me how many of these organizations intuitively made the right call at just the right time.

These are just a handful of examples but they support the same point: any individual is unlikely to be infectious, but as you add them to groups, the chances skyrocket that there’s at least one covid19 carrier in the group. This is why social distancing and limiting groups is so critical to stopping the spread.

Side note: this math underscores how heroic it is for any person to step into their job to keep us fed, healthy and safe. A huge thank you to everyone in those roles; doctors, police officers, chefs, and more.

Coronavirus Modeling

Predicting Coronavirus Cases

Of all the first posts I’d write, this was the least likely. But then again, we’ve all been caught off guard by the last two weeks. That’s the thing about probability; rare still means possible. So here we are.

Instagram, in its own way, was a black swan. In 2010 we launched and 25,000 people signed up the first day. One million were using it three months in. By the time I left a year ago, over a billion users used it every month. Watching this growth, I became interested in the science of how things grow. Why was this thing growing so quickly? How quickly would it grow? When might it stop growing? As an investor over the past year, I started asking myself the same question of other companies as well. Through my research, I found a model called the SIR model that applies to infectious diseases. Some said it applied to company growth, too.

The other week, I thought about whether this model could be applied to coronavirus. On the one hand, some pundits argued cases would whimper out within a few weeks. Alarmists, however, insisted exponential curves they drew fit the data nightmarishly well. Which one was it?

I wanted to try two things. First, I’d study the SIR model and see if I could fit it to the real world data. I also wanted to figure out a way to explain how uncertain I was about that model being right. Since I’m not an epidemiologist, it’s even more important that I explain the model’s uncertainty through ranges of values called credible intervals. No one knows what the future of COVID-19 holds, but a model can provide a picture of probable outcomes.

My goal is to share my process so that you can judge the foundation for my conclusions. I also hope that by seeing how bad this can get, we might collectively avoid the worst predicted outcomes by acting quickly and decisively.

Most importantly: this is a work in progress. I don’t have all the answers, nor do I claim to know the future for certain. That being said, the model has tracked the last week very closely, and I’d personally rather have a model than no model at all. I’m open to feedback and hope that smarter people out there will both build off of this work as well as help improve it.

SIR Model for Viral Growth

Imagine an app with 100 users. At launch, each of those people tell friends about it, each of whom have some probability of becoming an active user. The process then repeats itself with those new users. They tell their friends, and with any luck they’d stick around for the process to continue with their friends. Successful companies find ways of continuing this process so that every person that joins brings at least one other person onto the service.

A virus isn’t any different. It might start with a sneeze. Invisible droplets, replete with virus, float towards susceptible people. Some people inhale the virus, and with some probability, become infected themselves. The process then repeats itself, but now your friends are the ones sniffling.

The SIR model attempts to explain both of these situations. Assume every individual is in one of three states: susceptible, infected, or resistant. In a simple world, patient zero is infected, everyone else is susceptible and nobody is resistant.

With every step in time, some susceptible people become infected, some infected people recover to be resistant, and resistant people stay resistant. Note that for simplicity, I’ve assumed a constant population size and that in the terrible case that someone dies, they are counted in the resistant population as they cannot spread the virus.

As the infected group grows, you’re more likely to run into someone sick and catch it yourself. Cases grow exponentially. At some point though, enough people have recovered that the chances of a susceptible and infected person meeting disappears. New cases slow, infected people recover, and you end up with most people being resistant. The groups evolve like this:

If you’ve watched the news, the metric they track is ‘total cases’. To calculate this, add the infected and resistant groups (less how many were resistant to start with, if any). New cases per day is the slope of the ‘total cases’ line:

Warning: Lots of math ahead. I try my best to explain what it all means.

Sure enough, the characteristic S-curve emerges. Total cases start slow, ramp exponentially, then linearly and finally taper off at some limit. The SIR model defines equations that produce these graphs. The equations govern the change in each group per unit of time:


    \[\frac{dS}{dt}=-\beta \frac{I}{N} S\]

    \[\frac{dI}{dt}=\beta \frac{I}{N} S-\gamma I\]

    \[\frac{dR}{dt}=\gamma I\]

S, I, and R are the totals of each group. N is the total of all three. The other variables are:

  • 𝜸 (gamma) – the rate of recovery. This is a fancy way of saying: what portion of infected people become healthy and resistant again per unit of time? This also happens to be the inverse of the duration of sickness.
  • 𝛽 (beta) – the transmission rate. This is the number of people an infected person infects per unit of time. While 𝛽 is unknown, it can be estimated given data (more on that soon).

There’s an issue with 𝛽, however. It’s static. In the real world, 𝛽 should decrease over time as people become more aware of the virus and people avoid gatherings, work, etc. I have to assume 𝛽 shrinks over time because people are smart and distance themselves as they learn about the virus. The traditional model doesn’t include this effect, but there’s no reason why we can’t add if we assume (and witness) that’s the way the world works. To take this effect into account, I added an additional equation governing the rate of decline of 𝛽 over time. I assume beta shrinks by a factor of Ξ΄ (delta) at each step.

    \[\frac{d\beta}{dt}=-\beta \delta\]

Below, I’ve run the same model, but this time with various levels of Ξ΄. Think of Ξ΄ as how quickly people distance themselves from others. Remember that 𝛽 is the number of people an infected person infects over time.

If people don’t ‘hunker down’ 𝛿 is zero and the virus infects the full population. With increasing 𝛿 (more social distancing), we reduce total cases and the rate of new infection. When you hear ‘flatten the curve’, this is what they’re talking about.

Imagine we only have ten ICU beds for our population of 100 and all infected people require critical hospitalization. If 𝛿=0.1, infections peak on day 13 but stay under the critical limit of 10 beds. If we don’t act at all, infections peak at 45 so we’re short by 35 beds on day 15. Models like this help us understand if and how our medical system can be overwhelmed depending on specific policy actions that influence 𝛿.

Besides working well theoretically, this modified version of SIR describes what we’re seeing in the real world, too. To show you, we need to choose 𝜸, 𝛽, and 𝛿 so that the model fits the data we see in the real world.

Fitting the model to data

The last section described four knowns: S, I, R, N. We also have three unknowns: 𝜸 (gamma, rate of recovery), 𝛽 (beta, rate of transmission) and 𝛿 (delta, social distancing factor) that we need to estimate.

For estimating 𝜸 (gamma, rate of recovery), we need the inverse of the length of infectiousness. If you are infectious for 5 days, 𝜸 is 1/5 because 1/5 of infected people recover every day. In the case of COVID-19, it seems that the incubation period lasts about 5 days. Researchers said they could not grow the virus from specimens taken 8 days after the onset of symptoms. I assume, then, that the average infectious period is about 9 days (5 plus half of 8). Therefore, I assume 𝜸 = 1/9. This may not be perfect, and if we were very concerned we could try different values.

𝛽 and 𝛿 are harder, and likely dependent on the specific population. The easiest way I know to choose 𝛽 and 𝛿 is to run a least-squares regression on the data from a given country. However, this produces a single β€˜best guess’ value for 𝛽 and 𝛿. Since we are uncertain about the future, we’d like to know how uncertain we are about the best values of 𝛽 and 𝛿. It’d be nice to have distributions of the two parameters. For this, I decided to use pymc3, a library for probabilistic programming.

Pymc3 allows you to set up a model with knowns and unknowns. After observing real data, it returns distributions for the unknowns. (If you are interested in a more technical explanation can read more on the pymc3 site.)

After running US data through the model, it returned distributions for our parameters 𝛽 and 𝛿:

The model believes 𝛽 (transmission rate) is likely around 0.535 people per day and 𝛿 (transmission rate decay) is close to 0.01 as of March 18th. As discussed, we assume 𝜸 = 1/9 (9 days of infectiousness).

The model’s forecast

The parameters we get back from the model are distributions. This means there are many possible versions of the model. Some are more likely than others, like the red dotted average case below. At the same time, there are some cases that might happen but are less likely. The gray section in the chart shows the range where 90% of models exist.

In the short run, the model is confident. But one month out, the credible interval expands dramatically – we cannot be that confident in the future This does not mean we cannot draw conclusions, though.

For instance, the model claims there is a 95% chance we will have more than 15.4 million infections in the United States. The best estimates of IFR (infection fatality ratio), are around 2%. Recent tallies imply a 4% IFR, though this is disputed because mild cases go undiagnosed.

Either way, a conservative 1% IFR implies a 95% chance of 154,000 deaths or greater in the US alone. The average scenario of the model implies 1.5 million dead in the US – bested by the now widely cited Imperial College study at 2.2 million deaths. In a typical year, the flu kills just 40,000.

It’s important to remember that this is a model that shows what happens if we stay the course. It’s not a prediction of the future because our behavior may change. If we lockdown cities, cancel events, etc. 𝛽 will reduce far more quickly than the model expects. So far, it’s unclear how much we’ll change though. In New York politicians are resisting these measures, while San Francisco implemented them quickly.

I’m hopeful that these predictions push policymakers, local governments and individuals to take extraordinary precautions to reduce transmission rates. Over the coming days and weeks, I’ll provide updates to the model and inferences based on its output. The estimates will change and credible intervals will tighten with new data and we’ll get a clearer picture of what the future holds.

I will analyze specific countries, like Italy, where people are looking to see the effects of a national lockdown. I’ll take a look at the trajectory of various countries and make inferences about how things will go. I’ll try to answer questions like: how many people is it ‘safe’ to socialize with now and in one week?

In almost every case, the conclusions are more dire than people currently believe. Statistician George Box once said that all models are wrong but some are useful. In this case, I sincerely hope he’s right.