Coronavirus Modeling

The Metric We Need to Manage COVID-19

It’s Easter. A couple weeks ago, this was the target for returning to normal. Hospitals are now full of patients, cities bloom as new hotspots, and politicians wrestle with the balance of human and economic costs. We’re left to wonder if we are equipped with the right metrics to guide our path forward. Add to the confusion that metrics are based on noisy data that changes daily. There’s one metric, however, that has the most promise. It’s called Rt – the effective reproduction number. We can estimate it, and it’s the key to getting us through the next few months.

Most people are more familiar with R0. R0 is the basic reproduction number of an epidemic. It’s defined as the number of secondary infections produced by a single infection. If R0 is greater than one, the epidemic spreads quickly.** If R0 is less than one, the epidemic spreads, but limps along and disappears before everyone becomes infected. The flu has an R0 between one and two while measles sits in the high teens. While R0 is a useful measure, it is flawed in an important way: it’s static.

We’ve all witnessed that humans are adaptable. Our behavior changes, whether mandated or self-prescribed, and that changes the effective R value at any point in time. As we socially distance and isolate, R plummets. Because the value changes so rapidly, epidemiologists have argued that the only true way to combat COVID-19 is to understand and manage by Rt.

I agree, and I’d go further: we not only need to know Rt, we need to know local Rt. New York’s epidemic is vastly different than California’s and using a single number to describe them both is not useful. Knowing the local Rt allows us to manage the pandemic effectively.

States have had a variety of lockdown strategies, but there’s very little understanding of which have worked and which need to go further. Some states like California have been locked down for weeks, while others like Iowa and Nebraska continue to balk at taking action as cases rise. Being able to compare local Rts between different areas and/or watch how Rt changes in one place can help us measure how effective local policies are at slowing the spread of the virus.

Tracking Rt also lets us know when we might loosen restrictions. Any suggestion that we loosen restrictions when Rt > 1.0 is an explicit decision to let the virus proliferate. At the same time, if we are able to reduce Rt to below 1.0, and we can reduce the number of cases overall, the virus becomes manageable. Life can begin to return to ‘normal.’ But without knowing Rt we are simply flying blind.

How We Can Calculate Rt

It’s impossible to measure Rt directly, so we have to estimate it. Fortunately, there are many ways to do this. One particular method was described by Bettencourt & Ribeiro in their 2008 paper, “Real Time Bayesian Estimation of the Epidemic Potential of Emerging Infectious Diseases.” This solution caught my attention because it focuses on the same principles from my first post, Predicting Coronavirus Cases. It uses Bayesian statistics to estimate the most likely value of Rt while also returning a credible interval for the true value of Rt.

What follows is an application of Bettencourt & Ribeiro’s process (with an important modification) to US State COVID-19 data. Note that while this post focuses on the high level concepts, those who want to dig in further can find the details in this Jupyter notebook.

Bettencourt & Ribeiro’s original algorithm to estimate Rt is a function of how many new cases appear each day. The relationship between the number of cases yesterday and the number of cases today gives us a hint of what Rt might be. However, we can’t rely too much on any one day when estimating Rt, as daily case counts are imperfect due to changing testing capacity, lags in data reporting, and random chance. Fortunately, using Bayes’ Theorem, we can take the new information we get from each day’s case count to adjust our expectation of what Rt is, getting closer to the true value as more daily data becomes available.
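To make the update concrete, here is a minimal sketch of one such Bayesian update in Python. The Poisson rate lambda = cases_yesterday × exp(GAMMA × (Rt − 1)) is the link Bettencourt & Ribeiro use between yesterday’s count and today’s; the GAMMA value, the Rt grid, and the toy case series below are illustrative assumptions of mine, not values from the notebook.

```python
import numpy as np
from scipy import stats

# Illustrative choices, not values from the post:
GAMMA = 1 / 7                         # reciprocal of an assumed serial interval
r_t_grid = np.linspace(0, 10, 501)    # candidate values of Rt

def update_posterior(prior, cases_yesterday, cases_today):
    """One Bayesian update: P(Rt | data) is proportional to P(data | Rt) * P(Rt)."""
    # Expected new cases today under each candidate Rt
    # (the Poisson rate used by Bettencourt & Ribeiro):
    lam = cases_yesterday * np.exp(GAMMA * (r_t_grid - 1))
    likelihood = stats.poisson.pmf(cases_today, lam)
    posterior = prior * likelihood
    return posterior / posterior.sum()

# A toy series of daily new case counts
new_cases = [10, 14, 20, 28, 40]
posterior = np.ones_like(r_t_grid) / len(r_t_grid)  # start from a uniform prior
for yesterday, today in zip(new_cases, new_cases[1:]):
    posterior = update_posterior(posterior, yesterday, today)

most_likely_rt = r_t_grid[np.argmax(posterior)]
```

Each day’s count nudges the whole distribution over Rt, so a single noisy day can’t dominate the estimate.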

I applied this algorithm to the data to produce a model for each state’s Rt, and how it changes over time. But I noticed something strange. Over time, all states trended asymptotically to Rt = 1.0, refusing to descend below that value. Somehow, the algorithm wasn’t reflecting the reality that Rt could be < 1.0 as well.

To fix this, I made one significant change to their algorithm, one that maintains the integrity of Bettencourt & Ribeiro’s original work while allowing us to see the real-time picture clearly. The change was simple: instead of considering every previous day of data when estimating Rt, I only use the last seven days. Doing so is mathematically sound and produces more accurate results when the model is compared to actual data, though I admit the change has not been reviewed by anyone. I invite feedback, and I’m sharing these results with that disclaimer well in advance.
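A sketch of what that windowing might look like in code: rather than accumulating every day’s likelihood into the posterior forever, we sum only the last seven days of log-likelihoods. The GAMMA value, the Rt grid, and the declining toy case series are illustrative assumptions, not the notebook’s values.

```python
import numpy as np
from scipy import stats

GAMMA = 1 / 7                         # assumed reciprocal serial interval
r_t_grid = np.linspace(0, 10, 501)
WINDOW = 7                            # only the last seven days inform each estimate

def windowed_rt_posteriors(new_cases):
    """Posterior over Rt for each day, summing only the last WINDOW days of log-likelihoods."""
    log_liks = []
    for yesterday, today in zip(new_cases, new_cases[1:]):
        lam = yesterday * np.exp(GAMMA * (r_t_grid - 1))
        log_liks.append(stats.poisson.logpmf(today, lam))
    log_liks = np.array(log_liks)

    posteriors = []
    for i in range(len(log_liks)):
        window = log_liks[max(0, i - WINDOW + 1): i + 1].sum(axis=0)
        post = np.exp(window - window.max())  # stabilize before exponentiating
        posteriors.append(post / post.sum())
    return posteriors

# Cases declining ~7% per day should yield an estimate below 1.0
declining = [100, 93, 86, 80, 75, 70, 65, 60, 56, 52]
rt_today = r_t_grid[np.argmax(windowed_rt_posteriors(declining)[-1])]
```

Because stale days drop out of the window, the posterior can follow Rt back down below 1.0 instead of being anchored by the epidemic’s early growth.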

The Results

The algorithm produces a most-likely value for Rt over time for each locale. Below you can see select results for Michigan, Washington and New York.

The graphs above show two important things. First, the ‘most likely value’ of Rt for each day, represented by the dots (the more red a dot is, the higher and more dangerous the value of Rt is). Second, the interval of values that Rt might actually be is represented by the gray bands (if there’s a ‘most likely value’, then there are other less likely but still possible values as well).

Because we’re using Bayesian methods, the model produces something called a highest-density interval (HDI). In this case, the model claims that 95% of the possible true values lie in the gray band, that is, in the HDI. This is helpful when looking at states where conclusions might not be as certain. Consider Maine.

Comparing Maine to the previous states is instructive. Maine looks to have reduced Rt below 1.0, but the gray band shows us that we cannot safely conclude we are below the safety threshold, since it’s possible that Rt is still above 1.0. In this case, restrictions should continue. The reason Maine’s HDI is wide has to do with the lack of information. Maine has fewer cases than the previous examples, so the margins are wider. The more data we get and the more consistent that data is, the more confident we can be in our conclusions. The explicit HDI helps make sure we’re not misled by the data and prematurely conclude something that might be a grave error. This is useful when it comes to policy decisions: wait until the high end of the gray band is well below 1.0 before resuming normal activity.
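For readers curious how a highest-density interval can be pulled out of a gridded posterior, here is one common approach: find the narrowest stretch of the grid that contains the required probability mass. This is a sketch, not the notebook’s code, and the bell-shaped toy posterior is purely illustrative.

```python
import numpy as np

def highest_density_interval(grid, pmf, mass=0.95):
    """Narrowest interval on `grid` containing at least `mass` of the probability."""
    cum = np.cumsum(pmf)
    best_lo, best_hi = grid[0], grid[-1]
    for lo in range(len(grid)):
        # mass of grid[lo..hi] inclusive is cum[hi] - cum[lo] + pmf[lo];
        # find the smallest hi that pushes it past `mass`
        hi = np.searchsorted(cum, cum[lo] - pmf[lo] + mass)
        if hi < len(grid) and grid[hi] - grid[lo] < best_hi - best_lo:
            best_lo, best_hi = grid[lo], grid[hi]
    return best_lo, best_hi

# A bell-shaped toy posterior centered at Rt = 2 with sd 0.25
grid = np.linspace(0, 4, 401)
pmf = np.exp(-((grid - 2) ** 2) / (2 * 0.25 ** 2))
pmf /= pmf.sum()
lo, hi = highest_density_interval(grid, pmf)   # roughly (1.51, 2.49)
```

A wide posterior, like Maine’s, produces a wide interval from exactly this kind of calculation; no extra machinery is needed to express uncertainty.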

You can review each state in the main graphic at the top, but a denser view might be helpful, too.

How do States Compare?

In this image, I’ve taken the most recent values for Rt and plotted them along with the highest-density interval (HDI) bands (as of the evening of 4/11). Doing this allows us to see all sorts of interesting things.

Notice how many of the states without a shelter-in-place order—Iowa, Nebraska and South Dakota—have some of the highest Rt values. These states should be aware that their high Rt will lead to exponential growth in cases. As I argued in my previous post, when this happens, we need to lock down, so why not lock down now before cases grow? Doing so will save countless lives.

Now, you might point out that North Dakota looks well below 1.0 and does not have a lockdown. But here it’s important to consider North Dakota’s HDI. North Dakota’s interval is so wide that it includes many possible values above 1.0—therefore we can’t safely conclude that Rt is truly below 1.0. If the actual value lies in the highest portion of the HDI, North Dakota might have one of the highest Rt values. In fact, if we plot the same data, but sort by the high end of the HDI (worst-case scenario) we see this:

Now it’s clear that all non-lockdown states cluster on the right-hand side. While we cannot be sure the true Rt value is that high, this graph should give any policymaker in a non-lockdown state pause.

Instead of worst-case scenarios, let’s look at which states are almost certainly below the 1.0 threshold. These are the states that are clearly doing well.

Here, I’ve plotted states where the high end of their HDI is less than 1.1. This means that even in the worst case scenario, the state likely has the epidemic “under control.” You can see that many of the states in the headlines for the past few weeks are in this group. Surprisingly, Louisiana seems to have turned the corner decisively, even with one of the worst outbreaks. Note, however, that “under control” does not mean good. Louisiana still has far more new cases per day than Washington, but its outbreak is not growing nearly as quickly as Washington’s and it will soon see cases decline even more rapidly. Whatever they’re doing is working. But even in these states—the most “under control” in the nation—Rt is still far from the comfortable value of ~0.5. Even Idaho has a wide HDI, so it’s too early to tell.

Which states have the epidemic least “under control?” To answer this, I plotted states where the best case (i.e., the low end of the HDI) is above 1.0, indicating the true value of Rt is almost certainly above 1.0. Surprisingly, Rhode Island, Maryland and Massachusetts sit at the top. Part of this may be that Rhode Island is earlier in its infection curve, but seeing large states like Massachusetts and Texas above 1.0 is worrisome—especially because none of these states have hit the headlines as being trouble spots.

While the measure of Rt can be abstract, it shows up in the case data. Compare Massachusetts’s new cases per day to Louisiana’s. Louisiana’s cases per day peaked a week ago while Massachusetts’s continue to ascend. Remember that we’re looking at new cases per day, so in both, total cases continue to climb but Louisiana’s rate of increase is declining rapidly. Rt is a measure of direction, not absolute size.

Looking Forward

Without a clear metric on our ability to contain the coronavirus pandemic, it’s difficult to imagine that we’ll manage a return to normalcy anytime soon. I hope that by sharing this work with you, you’ll consider Rt as the metric that can guide our analysis and decision-making. I’ve kept most of the math and theory in the notebook, but I’d highly suggest diving in if you’re so inclined. And as always, you can get in touch by email or by reaching out on Twitter or Instagram @kevin.

** Note: In a previous version of this post I wrote that R0 > 1 meant that the virus spread to everyone in the population. This is not true. While R0 values in the range of 3-4 as shown above mean it will spread to nearly all people in the population, a value of 1.5 might only spread to ~60% (depending on your parameter estimates). In any case, our goal should be to reduce R0 to as small a value as possible. Thank you to Jim Fisher for pointing this out.

Coronavirus Modeling

The Numbers Behind Social Distancing

Here in California, we’re finishing our first week of shelter in place. Governments around the world have implemented a variety of similar policies, from complete quarantines to simple travel advisories. As I wrote Tuesday, I believe swift quarantine measures are the only way to stay the virus’s ascent. This isn’t just opinion. The math says isolation is the fastest way to make this pandemic disappear. Of course, like every optimization problem, there are constraints. People need to shop for food. Families live under the same roof. Society depends on interactions, however minimal. If that’s the case, how dangerous is it for us to cross paths?

More directly: in a group of people, what are the chances that no one has Covid-19? Answering this allows us to understand how safe we are when we’re in groups.

Some Quick Math

First, let’s answer an easier question: in the course of your daily interactions, what are the chances that any one person you interact with has Covid-19?

If you divide the number of active cases in your location by the population in your location you obtain a rough estimate. This is ‘rough’ because most people stay home if they’re sick. However, let’s assume there’s an equally sized group of individuals that are infectious but asymptomatic roaming out there. At the beginning stages of the infection this is a reasonable simplification. The beauty of modeling is that you can decide to adjust this number based on your own beliefs if you disagree.

Let I be the number of local active cases and N be the local population:

    \[P(\text{one person is infected}) = \frac{I}{N}\]
To answer our original question, we take the complement (i.e. what’s the probability someone is not infected?):

    \[P(\text{one person is not infected}) = 1-\frac{I}{N}\]
If you want to know the probability that multiple (independent) events are true at the same time, you have to multiply their individual probabilities. Therefore the probability that no one in a group of k has Covid-19 is:

    \[P(\text{no one in a group of } k \text{ is infected}) = \left(1-\frac{I}{N}\right)^k\]
Coast to Coast

Let’s look at San Francisco and New York City as examples. As of March 25th, they had 178 and 20,011 cases and populations of 884k and 8.6M respectively. Taking the equations above, the probability of no one being positive in a group of k follows these curves:

Said differently, if you were in a room of 250 people, the chances that everyone is negative are only 56% in NYC and 95% in San Francisco. Although only about two in every 1,000 people in NYC have coronavirus, probability works in such a way that your chances of encountering at least one positive person among 250 are staggering. This is the math of why groups are so dangerous: the chances compound as you add people, even though individual probabilities are low.
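These numbers fall straight out of the formula. A quick sketch using the March 25th figures quoted above:

```python
def p_all_negative(active_cases, population, k):
    """Chance that no one in a group of k people is infected: (1 - I/N)^k."""
    return (1 - active_cases / population) ** k

# Figures quoted in the post, as of March 25th
nyc = p_all_negative(20_011, 8_600_000, 250)   # ≈ 0.56
sf  = p_all_negative(178, 884_000, 250)        # ≈ 0.95
```

The exponent k is what does the damage: each additional person in the room multiplies in another chance of an encounter.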

You can also interpret these numbers in a different way. For purposes of illustration, say the average person in NYC has about 250 people in their personal network. This implies that 44% of people know at least one person with the virus today.

If you flip the equation around and solve for the number of cases, you can ask how many cases there have to be in NYC for people to have a 90% chance of knowing at least one person in their network who has it. The answer is 78,800 cases. At the current rate, NYC will cross this threshold in the next few days.
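The flipped equation can be sketched directly, solving 1 − (1 − I/N)^k = p for I:

```python
def cases_for_encounter_chance(population, k, p_at_least_one):
    """Active cases I such that a group of k has probability p of containing at least one case.

    Solves 1 - (1 - I/N)^k = p for I.
    """
    return population * (1 - (1 - p_at_least_one) ** (1 / k))

# NYC's population, a network of 250 people, a 90% chance
needed = cases_for_encounter_chance(8_600_000, 250, 0.90)   # ≈ 78,800
```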

How many people can I safely see in a day?

Safe and legal are two different things. First, there are laws – read yours. Second, these equations are helpful, but they shouldn’t be interpreted too precisely, so stay safe.

No matter your situation, it’s not safe to see lots of people right now. Although you have a small chance of interacting with someone positive, there are many people taking that chance every day. Some of those people will be unlucky. Those unlucky few infections compound quickly through this process.

Working the Numbers

Let’s assume you want to have some ‘margin of comfort’ (probability) of knowing you won’t run into anyone with coronavirus in a group of k people. What’s the largest group you could be in?
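Inverting the group formula answers this directly: the largest safe k is the biggest exponent that keeps (1 − I/N)^k above your margin of comfort. A sketch, using the NYC figures from earlier in the post:

```python
import math

def max_safe_group(active_cases, population, margin_of_comfort):
    """Largest k such that (1 - I/N)^k stays at or above the margin of comfort."""
    p_clean = 1 - active_cases / population   # chance any one person is not infected
    return math.floor(math.log(margin_of_comfort) / math.log(p_clean))

# NYC's March 25th numbers with a 90% margin of comfort
k = max_safe_group(20_011, 8_600_000, 0.90)
```

Because p_clean sits in the denominator as a logarithm, even a modest rise in active cases slashes the safe group size.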

This is best answered through an example using real data. The following graph shows the maximum safe group size given a margin of comfort using the equations above and real data from NYC cases:

If you wanted to be 90% sure you didn’t run into anyone with the virus you’d stay below the red line. If you were conservative and wanted to be 99% sure, you’d be below the blue line. On March 1st, you could walk freely. Just two weeks later the picture was far more dire.

You can see that as cases grow, the safe group size falls precipitously – no matter your margin of comfort. This is why it’s important to act quickly with social distancing: the safe group went from 100,000s to 100s in a few days. In fact, the graph compresses so quickly, it’s easiest to see in log scale:

An interesting overlay is what New York University (NYU) decided to do with its classes. NYU has a population of about 51,000 students. If you ran the university and wanted a 90% chance of not having any students with COVID-19, you moved classes online at precisely the right time.

Now let’s say you run Starbucks, and you have to decide when to close your stores in NYC. The average Starbucks sees roughly 500 customers per day, so in a busy city let’s double that and assume 1000 customers per day. Using the same equations, you can then answer: “What are the chances no customers enter with coronavirus today?”

While it’s unlikely customer count would stay the same for this period, the graph is instructive. In the days following the Starbucks announcement, the chance of a store having all customers be coronavirus-free went from nearly 100% down to 10%. It’s interesting to me how many of these organizations intuitively made the right call at just the right time.

These are just a handful of examples but they support the same point: any individual is unlikely to be infectious, but as you add people to a group, the chances skyrocket that there’s at least one COVID-19 carrier in the group. This is why social distancing and limiting groups is so critical to stopping the spread.

Side note: this math underscores how heroic it is for any person to step into their job to keep us fed, healthy and safe. A huge thank you to everyone in those roles: doctors, police officers, chefs, and more.

Coronavirus Modeling

Predicting Coronavirus Cases

Of all the first posts I’d write, this was the least likely. But then again, we’ve all been caught off guard by the last two weeks. That’s the thing about probability; rare still means possible. So here we are.

Instagram, in its own way, was a black swan. In 2010 we launched and 25,000 people signed up the first day. One million were using it three months in. By the time I left a year ago, over a billion people used it every month. Watching this growth, I became interested in the science of how things grow. Why was this thing growing so quickly? How quickly would it grow? When might it stop growing? As an investor over the past year, I started asking myself the same questions of other companies as well. Through my research, I found a model called the SIR model that applies to infectious diseases. Some said it applied to company growth, too.

The other week, I thought about whether this model could be applied to coronavirus. On the one hand, some pundits argued cases would whimper out within a few weeks. Alarmists, however, insisted exponential curves they drew fit the data nightmarishly well. Which one was it?

I wanted to try two things. First, I’d study the SIR model and see if I could fit it to the real world data. I also wanted to figure out a way to explain how uncertain I was about that model being right. Since I’m not an epidemiologist, it’s even more important that I explain the model’s uncertainty through ranges of values called credible intervals. No one knows what the future of COVID-19 holds, but a model can provide a picture of probable outcomes.

My goal is to share my process so that you can judge the foundation for my conclusions. I also hope that by seeing how bad this can get, we might collectively avoid the worst predicted outcomes by acting quickly and decisively.

Most importantly: this is a work in progress. I don’t have all the answers, nor do I claim to know the future for certain. That being said, the model has tracked the last week very closely, and I’d personally rather have a model than no model at all. I’m open to feedback and hope that smarter people out there will both build off of this work as well as help improve it.

SIR Model for Viral Growth

Imagine an app with 100 users. At launch, each of those people tells friends about it, each of whom has some probability of becoming an active user. The process then repeats itself with those new users. They tell their friends, and with any luck they stick around for the process to continue with their friends. Successful companies find ways of continuing this process so that every person who joins brings at least one other person onto the service.

A virus isn’t any different. It might start with a sneeze. Invisible droplets, replete with virus, float towards susceptible people. Some people inhale the virus, and with some probability, become infected themselves. The process then repeats itself, but now your friends are the ones sniffling.

The SIR model attempts to explain both of these situations. Assume every individual is in one of three states: susceptible, infected, or resistant. In a simple world, patient zero is infected, everyone else is susceptible and nobody is resistant.

With every step in time, some susceptible people become infected, some infected people recover to be resistant, and resistant people stay resistant. Note that for simplicity, I’ve assumed a constant population size and that in the terrible case that someone dies, they are counted in the resistant population as they cannot spread the virus.

As the infected group grows, you’re more likely to run into someone sick and catch it yourself. Cases grow exponentially. At some point, though, enough people have recovered that the chances of a susceptible and an infected person meeting disappear. New cases slow, infected people recover, and you end up with most people being resistant. The groups evolve like this:

If you’ve watched the news, the metric they track is ‘total cases’. To calculate this, add the infected and resistant groups (less how many were resistant to start with, if any). New cases per day is the slope of the ‘total cases’ line:

Warning: Lots of math ahead. I try my best to explain what it all means.

Sure enough, the characteristic S-curve emerges. Total cases start slow, ramp exponentially, then linearly and finally taper off at some limit. The SIR model defines equations that produce these graphs. The equations govern the change in each group per unit of time:


    \[\frac{dS}{dt}=-\beta \frac{I}{N} S\]

    \[\frac{dI}{dt}=\beta \frac{I}{N} S-\gamma I\]

    \[\frac{dR}{dt}=\gamma I\]

S, I, and R are the totals of each group. N is the total of all three. The other variables are:

  • 𝜸 (gamma) – the rate of recovery. This is a fancy way of saying: what portion of infected people become healthy and resistant again per unit of time? This also happens to be the inverse of the duration of sickness.
  • 𝛽 (beta) – the transmission rate. This is the number of people an infected person infects per unit of time. While 𝛽 is unknown, it can be estimated given data (more on that soon).

There’s an issue with 𝛽, however: it’s static. In the real world, 𝛽 should decrease over time because people are smart and distance themselves as they learn about the virus, avoiding gatherings, work, etc. The traditional model doesn’t include this effect, but there’s no reason we can’t add it if we assume (and witness) that’s the way the world works. To take this effect into account, I added an additional equation governing the rate of decline of 𝛽 over time. I assume 𝛽 shrinks by a factor of δ (delta) at each step.

    \[\frac{d\beta}{dt}=-\beta \delta\]

Below, I’ve run the same model, but this time with various levels of δ. Think of δ as how quickly people distance themselves from others. Remember that 𝛽 is the number of people an infected person infects over time.

If people don’t ‘hunker down’, 𝛿 is zero and the virus infects the full population. With increasing 𝛿 (more social distancing), we reduce total cases and the rate of new infection. When you hear ‘flatten the curve’, this is what they’re talking about.

Imagine we only have ten ICU beds for our population of 100 and all infected people require critical hospitalization. If 𝛿=0.1, infections peak on day 13 but stay under the critical limit of 10 beds. If we don’t act at all, infections peak at 45 so we’re short by 35 beds on day 15. Models like this help us understand if and how our medical system can be overwhelmed depending on specific policy actions that influence 𝛿.
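The scenario above can be sketched with a simple Euler-stepped simulation. The population of 100 and 𝜸 = 1/9 follow the text; beta0 = 0.5 is an assumption of mine, so the exact peak days won’t match the figures quoted above, but the qualitative effect of 𝛿 is the same.

```python
def sir_with_distancing(beta0, gamma, delta, n=100, i0=1, days=100):
    """Euler-stepped SIR where beta shrinks by a factor of delta each day."""
    s, i, r, beta = n - i0, float(i0), 0.0, beta0
    infected_per_day = []
    for _ in range(days):
        new_infections = beta * (i / n) * s   # dS/dt = -beta * I/N * S
        new_recoveries = gamma * i            # dR/dt = gamma * I
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        beta -= beta * delta                  # d(beta)/dt = -beta * delta
        infected_per_day.append(i)
    return infected_per_day

# beta0 = 0.5 is illustrative; gamma = 1/9 as in the post
no_action  = sir_with_distancing(beta0=0.5, gamma=1/9, delta=0.0)
distancing = sir_with_distancing(beta0=0.5, gamma=1/9, delta=0.1)
```

Comparing max(no_action) against max(distancing) shows the flattening directly: the same virus, in the same population, produces a much lower peak of simultaneous infections when 𝛿 is positive.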

Besides working well theoretically, this modified version of SIR describes what we’re seeing in the real world, too. To show you, we need to choose 𝜸, 𝛽, and 𝛿 so that the model fits the data we see in the real world.

Fitting the model to data

The last section described four knowns: S, I, R, N. We also have three unknowns: 𝜸 (gamma, rate of recovery), 𝛽 (beta, rate of transmission) and 𝛿 (delta, social distancing factor) that we need to estimate.

For estimating 𝜸 (gamma, rate of recovery), we need the inverse of the length of infectiousness. If you are infectious for 5 days, 𝜸 is 1/5 because 1/5 of infected people recover every day. In the case of COVID-19, it seems that the incubation period lasts about 5 days. Researchers said they could not grow the virus from specimens taken 8 days after the onset of symptoms. I assume, then, that the average infectious period is about 9 days (5 plus half of 8). Therefore, I assume 𝜸 = 1/9. This may not be perfect, and if we were very concerned we could try different values.

𝛽 and 𝛿 are harder, and likely dependent on the specific population. The easiest way I know to choose 𝛽 and 𝛿 is to run a least-squares regression on the data from a given country. However, this produces a single ‘best guess’ value for 𝛽 and 𝛿. Since we are uncertain about the future, we’d like to know how uncertain we are about the best values of 𝛽 and 𝛿. It’d be nice to have distributions of the two parameters. For this, I decided to use pymc3, a library for probabilistic programming.
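Before reaching for pymc3, the least-squares approach itself can be sketched on synthetic data. Everything here — the parameter values, population, and initial case count — is hypothetical; the point is only to show that a regression can recover 𝛽 and 𝛿 from a total-cases curve.

```python
import numpy as np
from scipy.optimize import least_squares

GAMMA = 1 / 9   # fixed recovery rate, as in the post

def total_cases(params, days, n=1_000_000, i0=100):
    """Cumulative cases (I + R) from an SIR model whose beta decays at rate delta."""
    beta0, delta = params
    s, i, r = n - i0, float(i0), 0.0
    out = []
    for t in range(days):
        beta = beta0 * np.exp(-delta * t)
        new_inf = beta * (i / n) * s
        new_rec = GAMMA * i
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        out.append(i + r)
    return np.array(out)

# Synthetic 'observed' curve generated from known parameters,
# so we can check that the regression recovers them
true_params = (0.5, 0.02)
observed = total_cases(true_params, days=30)

fit = least_squares(
    lambda p: total_cases(p, days=30) - observed,
    x0=(0.3, 0.05),
    bounds=([0.01, 0.0], [2.0, 1.0]),
)
beta_hat, delta_hat = fit.x
```

This returns a single best guess; the pymc3 approach described next replaces that point estimate with full distributions over 𝛽 and 𝛿.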

Pymc3 allows you to set up a model with knowns and unknowns. After observing real data, it returns distributions for the unknowns. (If you are interested in a more technical explanation, you can read more on the pymc3 site.)

After running US data through the model, it returned distributions for our parameters 𝛽 and 𝛿:

The model believes 𝛽 (transmission rate) is likely around 0.535 people per day and 𝛿 (transmission rate decay) is close to 0.01 as of March 18th. As discussed, we assume 𝜸 = 1/9 (9 days of infectiousness).

The model’s forecast

The parameters we get back from the model are distributions. This means there are many possible versions of the model. Some are more likely than others, like the red dotted average case below. At the same time, there are some cases that might happen but are less likely. The gray section in the chart shows the range where 90% of models exist.

In the short run, the model is confident. But one month out, the credible interval expands dramatically – we cannot be that confident about the future. This does not mean we cannot draw conclusions, though.

For instance, the model claims there is a 95% chance we will have more than 15.4 million infections in the United States. The best estimates of IFR (infection fatality ratio) are around 2%. Recent tallies imply a 4% IFR, though this is disputed because mild cases go undiagnosed.

Either way, a conservative 1% IFR implies a 95% chance of 154,000 deaths or greater in the US alone. The average scenario of the model implies 1.5 million dead in the US – exceeded only by the now widely cited Imperial College study’s estimate of 2.2 million deaths. In a typical year, the flu kills just 40,000.

It’s important to remember that this is a model that shows what happens if we stay the course. It’s not a prediction of the future because our behavior may change. If we lock down cities, cancel events, etc., 𝛽 will reduce far more quickly than the model expects. So far, it’s unclear how much we’ll change, though. In New York politicians are resisting these measures, while San Francisco implemented them quickly.

I’m hopeful that these predictions push policymakers, local governments and individuals to take extraordinary precautions to reduce transmission rates. Over the coming days and weeks, I’ll provide updates to the model and inferences based on its output. The estimates will change and credible intervals will tighten with new data and we’ll get a clearer picture of what the future holds.

I will analyze specific countries, like Italy, where people are looking to see the effects of a national lockdown. I’ll take a look at the trajectory of various countries and make inferences about how things will go. I’ll try to answer questions like: how many people is it ‘safe’ to socialize with now and in one week?

In almost every case, the conclusions are more dire than people currently believe. Statistician George Box once said that all models are wrong but some are useful. In this case, I sincerely hope he’s right.