Coronavirus Modeling

Predicting Coronavirus Cases

Of all the first posts I’d write, this was the least likely. But then again, we’ve all been caught off guard by the last two weeks. That’s the thing about probability; rare still means possible. So here we are.

Instagram, in its own way, was a black swan. In 2010 we launched and 25,000 people signed up the first day. One million were using it three months in. By the time I left a year ago, over a billion people were using it every month. Watching this growth, I became interested in the science of how things grow. Why was this thing growing so quickly? How quickly would it grow? When might it stop growing? As an investor over the past year, I started asking these same questions of other companies as well. Through my research, I found a model called the SIR model that applies to infectious diseases. Some said it applied to company growth, too.

The other week, I thought about whether this model could be applied to coronavirus. On the one hand, some pundits argued cases would fizzle out within a few weeks. Alarmists, however, insisted the exponential curves they drew fit the data nightmarishly well. Which one was it?

I wanted to try two things. First, I’d study the SIR model and see if I could fit it to the real world data. I also wanted to figure out a way to explain how uncertain I was about that model being right. Since I’m not an epidemiologist, it’s even more important that I explain the model’s uncertainty through ranges of values called credible intervals. No one knows what the future of COVID-19 holds, but a model can provide a picture of probable outcomes.

My goal is to share my process so that you can judge the foundation for my conclusions. I also hope that by seeing how bad this can get, we might collectively avoid the worst predicted outcomes by acting quickly and decisively.

Most importantly: this is a work in progress. I don’t have all the answers, nor do I claim to know the future for certain. That being said, the model has tracked the last week very closely, and I’d personally rather have a model than no model at all. I’m open to feedback and hope that smarter people out there will both build off of this work as well as help improve it.

SIR Model for Viral Growth

Imagine an app with 100 users. At launch, each of those people tells friends about it, each of whom has some probability of becoming an active user. The process then repeats itself with those new users: they tell their friends, and with any luck those friends stick around so the process can continue. Successful companies find ways of sustaining this loop so that every person who joins brings at least one other person onto the service.

A virus isn’t any different. It might start with a sneeze. Invisible droplets, replete with virus, float towards susceptible people. Some people inhale the virus, and with some probability, become infected themselves. The process then repeats itself, but now your friends are the ones sniffling.

The SIR model attempts to explain both of these situations. Assume every individual is in one of three states: susceptible, infected, or resistant. In a simple world, patient zero is infected, everyone else is susceptible and nobody is resistant.

With every step in time, some susceptible people become infected, some infected people recover to be resistant, and resistant people stay resistant. Note that for simplicity, I’ve assumed a constant population size and that in the terrible case that someone dies, they are counted in the resistant population as they cannot spread the virus.

As the infected group grows, you’re more likely to run into someone sick and catch it yourself. Cases grow exponentially. At some point, though, enough people have recovered that the chances of a susceptible and an infected person meeting dwindle. New cases slow, infected people recover, and you end up with most people being resistant. The groups evolve like this:

If you’ve watched the news, the metric they track is ‘total cases’. To calculate this, add the infected and resistant groups (less how many were resistant to start with, if any). New cases per day is the slope of the ‘total cases’ line:

Warning: Lots of math ahead. I try my best to explain what it all means.

Sure enough, the characteristic S-curve emerges. Total cases start slow, ramp exponentially, then linearly and finally taper off at some limit. The SIR model defines equations that produce these graphs. The equations govern the change in each group per unit of time:


    \[\frac{dS}{dt}=-\beta \frac{I}{N} S\]

    \[\frac{dI}{dt}=\beta \frac{I}{N} S-\gamma I\]

    \[\frac{dR}{dt}=\gamma I\]

S, I, and R are the totals of each group. N is the total of all three. The other variables are:

  • 𝜸 (gamma) – the rate of recovery. This is a fancy way of saying: what portion of infected people become healthy and resistant again per unit of time? This also happens to be the inverse of the duration of sickness.
  • 𝛽 (beta) – the transmission rate. This is the number of people an infected person infects per unit of time. While 𝛽 is unknown, it can be estimated given data (more on that soon).
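To make the dynamics concrete, the three equations above can be integrated numerically. Below is a minimal sketch using scipy; the population size and parameter values are illustrative, not fitted to any real data.

```python
# Minimal sketch of the basic SIR model, integrated with scipy.
# beta, gamma, and N are illustrative values, not fitted to real data.
import numpy as np
from scipy.integrate import solve_ivp

def sir(t, y, beta, gamma, N):
    S, I, R = y
    dS = -beta * I / N * S              # susceptible people become infected
    dI = beta * I / N * S - gamma * I   # new infections minus recoveries
    dR = gamma * I                      # infected people recover
    return [dS, dI, dR]

N = 100                  # constant population
y0 = [99, 1, 0]          # patient zero infected, everyone else susceptible
beta, gamma = 0.5, 1 / 9

sol = solve_ivp(sir, (0, 120), y0, args=(beta, gamma, N),
                t_eval=np.linspace(0, 120, 121))
S, I, R = sol.y
total_cases = I + R      # the 'total cases' metric: infected + resistant
```

Plotting `total_cases` against time produces the S-curve described above: a slow start, an exponential ramp, then a taper as the susceptible pool empties.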

There’s an issue with 𝛽, however: it’s static. In the real world, 𝛽 should decrease over time as people become aware of the virus and avoid gatherings, work, etc. The traditional model doesn’t include this effect, but there’s no reason we can’t add it if we assume (and witness) that’s the way the world works. To take this effect into account, I added an additional equation governing the rate of decline of 𝛽 over time, assuming 𝛽 shrinks by a factor of 𝛿 (delta) at each step.

    \[\frac{d\beta}{dt}=-\beta \delta\]

Below, I’ve run the same model, but this time with various levels of Ξ΄. Think of Ξ΄ as how quickly people distance themselves from others. Remember that 𝛽 is the number of people an infected person infects over time.

If people don’t ‘hunker down’, 𝛿 is zero and the virus infects the full population. With increasing 𝛿 (more social distancing), we reduce both total cases and the rate of new infection. When you hear ‘flatten the curve’, this is what people are talking about.

Imagine we only have ten ICU beds for our population of 100 and all infected people require critical hospitalization. If 𝛿=0.1, infections peak on day 13 but stay under the critical limit of 10 beds. If we don’t act at all, infections peak at 45 so we’re short by 35 beds on day 15. Models like this help us understand if and how our medical system can be overwhelmed depending on specific policy actions that influence 𝛿.
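The flattening effect can be checked numerically by running the decaying-𝛽 variant for several values of 𝛿. This is a sketch with illustrative starting values (𝛽 = 0.5, 𝜸 = 1/9, a population of 100), not fitted parameters.

```python
# Sketch of the modified SIR model where beta decays: d(beta)/dt = -delta*beta.
# Starting values are illustrative; delta stands for social distancing.
import numpy as np
from scipy.integrate import solve_ivp

def sir_decay(t, y, gamma, delta, N):
    S, I, R, beta = y               # beta is integrated as a fourth state
    dS = -beta * I / N * S
    dI = beta * I / N * S - gamma * I
    dR = gamma * I
    dbeta = -delta * beta           # social distancing shrinks transmission
    return [dS, dI, dR, dbeta]

N, gamma = 100, 1 / 9
t_eval = np.linspace(0, 120, 121)

peak_infected = {}
for delta in (0.0, 0.05, 0.1):
    sol = solve_ivp(sir_decay, (0, 120), [99, 1, 0, 0.5],
                    args=(gamma, delta, N), t_eval=t_eval)
    peak_infected[delta] = sol.y[1].max()  # peak simultaneous infections
```

Higher 𝛿 pushes the peak of simultaneous infections down, which is exactly what keeps demand under a fixed number of ICU beds.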

Besides working well theoretically, this modified version of SIR describes what we’re seeing in the real world, too. To show you, we need to choose 𝜸, 𝛽, and 𝛿 so that the model fits the data we see in the real world.

Fitting the model to data

The last section described four knowns: S, I, R, N. We also have three unknowns: 𝜸 (gamma, rate of recovery), 𝛽 (beta, rate of transmission) and 𝛿 (delta, social distancing factor) that we need to estimate.

For estimating 𝜸 (gamma, rate of recovery), we need the inverse of the length of infectiousness. If you are infectious for 5 days, 𝜸 is 1/5 because 1/5 of infected people recover every day. In the case of COVID-19, it seems that the incubation period lasts about 5 days. Researchers said they could not grow the virus from specimens taken 8 days after the onset of symptoms. I assume, then, that the average infectious period is about 9 days (5 plus half of 8). Therefore, I assume 𝜸 = 1/9. This may not be perfect, and if we were very concerned we could try different values.

𝛽 and 𝛿 are harder, and likely dependent on the specific population. The easiest way I know to choose 𝛽 and 𝛿 is to run a least-squares regression on the data from a given country. However, this produces a single β€˜best guess’ value for 𝛽 and 𝛿. Since we are uncertain about the future, we’d like to know how uncertain we are about the best values of 𝛽 and 𝛿. It’d be nice to have distributions of the two parameters. For this, I decided to use pymc3, a library for probabilistic programming.
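For concreteness, here is a sketch of that simpler least-squares approach, fitting 𝛽 and 𝛿 to a series of total-case counts. The ‘observed’ series below is synthetic, generated from known parameters, just to show the mechanics.

```python
# Sketch of a least-squares fit of beta and delta to total-case counts.
# The 'observed' data here is synthetic (generated from known parameters).
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

N, gamma = 1e6, 1 / 9    # illustrative population; gamma fixed at 1/9

def total_cases(params, days):
    beta0, delta = params
    def rhs(t, y):
        S, I, R, beta = y
        return [-beta * I / N * S,
                beta * I / N * S - gamma * I,
                gamma * I,
                -delta * beta]
    # tight tolerances so finite-difference gradients are clean
    sol = solve_ivp(rhs, (0, days[-1]), [N - 1, 1, 0, beta0],
                    t_eval=days, rtol=1e-8, atol=1e-8)
    return sol.y[1] + sol.y[2]       # infected + resistant

days = np.arange(0, 40)
observed = total_cases([0.5, 0.01], days)    # stand-in for real case counts

fit = least_squares(lambda p: total_cases(p, days) - observed,
                    x0=[0.4, 0.02], bounds=([0.01, 0.0], [2.0, 1.0]),
                    diff_step=1e-3)
beta_hat, delta_hat = fit.x
```

This yields a single point estimate for each parameter; the probabilistic approach is what replaces those point estimates with full distributions.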

Pymc3 allows you to set up a model with knowns and unknowns. After observing real data, it returns distributions for the unknowns. (If you’re interested in a more technical explanation, you can read more on the pymc3 site.)

After running US data through the model, it returned distributions for our parameters 𝛽 and 𝛿:

The model believes 𝛽 (transmission rate) is likely around 0.535 people per day and 𝛿 (transmission rate decay) is close to 0.01 as of March 18th. As discussed, we assume 𝜸 = 1/9 (9 days of infectiousness).

The model’s forecast

The parameters we get back from the model are distributions. This means there are many possible versions of the model. Some are more likely than others, like the red dotted average case below. At the same time, there are some cases that might happen but are less likely. The gray section in the chart shows the range where 90% of models exist.
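Mechanically, a band like that can be produced by drawing many parameter pairs from their distributions and running the model once per draw. The sketch below uses normal distributions centered on the fitted values reported earlier as a stand-in for a real posterior; the spreads and population size are made up for illustration.

```python
# Sketch of building a 90% forecast band from parameter distributions.
# The normals below stand in for a real posterior; spreads are illustrative.
import numpy as np
from scipy.integrate import solve_ivp

N, gamma = 1e6, 1 / 9
days = np.linspace(0, 90, 91)
rng = np.random.default_rng(0)

def run(beta0, delta):
    def rhs(t, y):
        S, I, R, beta = y
        return [-beta * I / N * S, beta * I / N * S - gamma * I,
                gamma * I, -delta * beta]
    sol = solve_ivp(rhs, (0, days[-1]), [N - 1, 1, 0, beta0], t_eval=days)
    return sol.y[1] + sol.y[2]      # total cases per day

betas = rng.normal(0.535, 0.03, 200)              # centered on fitted beta
deltas = rng.normal(0.01, 0.003, 200).clip(min=0)  # centered on fitted delta
draws = np.array([run(b, d) for b, d in zip(betas, deltas)])

lo, mid, hi = np.percentile(draws, [5, 50, 95], axis=0)
# lo..hi is the 90% band; mid is the median trajectory.
```

The band hugs the median early on and fans out with the horizon, which is the widening credible interval described above.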

In the short run, the model is confident. But one month out, the credible interval expands dramatically; we cannot be that confident about the future. This does not mean we cannot draw conclusions, though.

For instance, the model claims there is a 95% chance we will have more than 15.4 million infections in the United States. The best estimates of IFR (infection fatality ratio) are around 2%. Recent tallies imply a 4% IFR, though this is disputed because mild cases go undiagnosed.

Either way, a conservative 1% IFR implies a 95% chance of 154,000 deaths or more in the US alone. The model’s average scenario implies 1.5 million dead in the US, exceeded only by the now widely cited Imperial College study’s 2.2 million deaths. In a typical year, the flu kills just 40,000.
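The arithmetic behind that floor is simple; the figures below are the ones quoted above.

```python
# The arithmetic behind the death estimate above (figures quoted from the text).
infections_95th = 15_400_000   # 95% chance of at least this many US infections
conservative_ifr = 0.01        # a conservative 1% infection fatality ratio
deaths_floor = infections_95th * conservative_ifr   # about 154,000 deaths
```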

It’s important to remember that this model shows what happens if we stay the course. It’s not a prediction of the future, because our behavior may change. If we lock down cities, cancel events, and so on, 𝛽 will fall far more quickly than the model expects. So far, though, it’s unclear how much we’ll change. In New York, politicians are resisting these measures, while San Francisco implemented them quickly.

I’m hopeful that these predictions push policymakers, local governments and individuals to take extraordinary precautions to reduce transmission rates. Over the coming days and weeks, I’ll provide updates to the model and inferences based on its output. The estimates will change, credible intervals will tighten with new data, and we’ll get a clearer picture of what the future holds.

I will analyze specific countries, like Italy, where people are looking to see the effects of a national lockdown. I’ll take a look at the trajectory of various countries and make inferences about how things will go. I’ll try to answer questions like: how many people is it ‘safe’ to socialize with now and in one week?

In almost every case, the conclusions are more dire than people currently believe. Statistician George Box once said that all models are wrong but some are useful. In this case, I sincerely hope he’s right.