Volume 17, December 2016, Pages 19–26

Open Access

Data-driven outbreak forecasting with a simple nonlinear growth model

  • University of Arizona, Tucson, AZ, USA


We present EpiGro, a simple data-driven method to forecast the scope of an ongoing outbreak.

We provide general hypotheses for expected model validity and also discuss model limitations.

We propose an automated parameter estimation method that can be used for forecasting.

We test our approach on 9 different outbreaks and show robustness over multiple systems and over noisy data sets.

In the absence of other information or in conjunction with other models, EpiGro may be useful to public health responders.


Recent events have thrown the spotlight on infectious disease outbreak response. We developed a data-driven method, EpiGro, which can be applied to cumulative case reports to estimate the order of magnitude of the duration, peak and ultimate size of an ongoing outbreak. It is based on a surprisingly simple mathematical property of many epidemiological data sets, does not require knowledge or estimation of disease transmission parameters, is robust to noise and to small data sets, and runs quickly due to its mathematical simplicity. Using data from historic and ongoing epidemics, we present the model. We also provide modeling considerations that justify this approach and discuss its limitations. In the absence of other information or in conjunction with other models, EpiGro may be useful to public health responders.


  • Infectious disease outbreaks;
  • Mathematical model;
  • Surge capacity;
  • Chikungunya virus infection

1. Introduction

As infectious diseases are identified for the first time or emerge in new populations, researchers increasingly use mathematical models to describe observed patterns and to plan and evaluate public health responses (Anderson and May, 1992, Grassly and Fraser, 2008, Keeling and Danon, 2009 and Anderson et al., 2015). These models vary in complexity and scale, from simple compartmental models (Hethcote, 2000) to complex stochastic agent-based and metapopulation approaches that include external information like transportation networks (Rvachev and Longini, 1985, Hufnagel et al., 2004, Eubank et al., 2004, Ferguson et al., 2006, Balcan et al., 2010, Ajelli et al., 2010 and Van den Broeck et al., 2011). The latter have been shown to efficiently capture the real-time spread of epidemics (Tizzoni et al., 2012), but often require large amounts of information. Key parameters need to be estimated from epidemiological data, which may be accomplished by maximum likelihood estimation (Ionides et al., 2006, Bretó et al., 2009 and King et al., 2015) or data assimilation (Rhodes and Hollingsworth, 2009 and Shaman and Karspeck, 2012). However, for newly emerging infections or when estimating the impact of bioterrorism events (Walden and Kaplan, 2004 and Rotz and Hughes, 2004), such information may not always be available. Sometimes, the community is able to quickly compile and share epidemiological parameters, as was for instance the case for the devastating 2014/2015 Ebola outbreak (Van Kerkhove et al., 2015 and Chowell et al., 2014). It is nevertheless expected that model choices reflect the balance between data availability and the needs of the public health community (Keeling and Danon, 2009). Moreover, since the accuracy of predictions depends heavily on modeling assumptions (Keeling and Danon, 2009 and Wearing et al., 2005), it is also important to balance the need for detailed, realistic models against limitations in parameter information (May, 2004).

Knowing how many cases to expect, as well as when they will peak, before an outbreak has run its course is central to preparing a public health response (Flu Activity Forecasting Website Launched, 2016). Entire epidemiological curves can often be fitted with standard functions, such as a logistic curve or the Richards model (Tjørve and Tjørve, 2010, Peleg and Corradini, 2011, Wang et al., 2012 and Ma et al., 2014), but such fits are only effective late in the outbreak. Conversely, time series approaches allow forecasting, but are considered accurate only for short-term prediction. For instance, using only case data and an autoregressive integrated moving average (ARIMA) model, researchers were able to forecast hospital bed utilization during the severe acute respiratory syndrome (SARS) outbreak in Singapore up to three days ahead (Earnest et al., 2005). Additional information is usually required for longer forecasts (see e.g. 3-month dengue forecasting using climate data (Gharbi et al., 2011)), which limits the utility of such approaches for newly emerging diseases, when many associated risk factors are still unknown.

We identify a simple property common to the epidemiological curves of many outbreaks and explore the modeling implications of this finding. In particular, it allows us to describe the course of each outbreak in terms of a very simple model, whose two parameters can be extracted from epidemiological data. This is different from estimating disease transmission rates since, for instance, knowledge of the model discussed in this article is not sufficient to recover the parameters (e.g. R0) of a simulated epidemic that follows the SIR (Susceptible – Infected – Removed) dynamics. We present an automated parameter extraction method that allows us to explore the applicability of the method to a variety of different outbreaks and, more importantly, explain how the model may be used to forecast the scope of ongoing outbreaks, including those of some vector-borne diseases.

2. Methods

Our general methodology is described in Fig. 1. Starting from reported epidemiological data, we consider the cumulative number of cases, C, and numerically produce a smooth interpolation of its evolution (panel 1). Data collection procedures for the examples discussed in this article are given in Technical Appendix 1 in Supplementary Material. We then use these smoothed data to estimate incidence, G, as described in Technical Appendix 2 (see Supplementary Material). The crucial point of our approach is that rather than plotting C as a function of time, we plot the estimated incidence, G, as a function of the cumulative number of cases, C, that is, G(C) (panel 2). For many outbreaks, the graph of G as a function of C has a single “hump” and can, to first order, be approximated by an inverted parabola (panel 3). This inverted parabola, whose equation contains two parameters, defines a simple model for the evolution of the outbreak, which can be used to predict the future number of cases given an initial condition (panel 4). We developed a method, detailed in Technical Appendix 3 in Supplementary Material, that automatically associates a parabola with the available epidemiological data of one-wave outbreaks. It works on partial (for ongoing outbreaks) or full (for outbreaks that have completed their course) data sets and proceeds as follows: rather than estimating the parabola parameters from the cumulative epidemiological curve alone, we simultaneously fit the graph of G(C) to its parabolic approximation and the graph of C(t) to its corresponding time course. Doing so requires that the two unknown parameters describing the parabola be chosen to provide good approximations of two different (albeit related) plots. This approach is easily applicable to ongoing outbreaks for which limited data are available, and can therefore be used for forecasting.
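The idea of plotting incidence against cumulative cases and fitting an inverted parabola can be illustrated with a minimal Python sketch. It is not the procedure of the Technical Appendices, and several choices in it are assumptions made for illustration only: the parabola is taken to pass through the origin and is written as G = r C (1 - C/K), where the symbols r and K are introduced here; incidence is estimated by plain finite differences rather than from a smoothed interpolation; the parameters are obtained from a single least-squares fit of G(C), not from the simultaneous fit of G(C) and C(t) described above; and the data are synthetic. The function names are hypothetical.

```python
import numpy as np

def estimate_incidence(t, C):
    """Crude incidence estimate G = dC/dt by finite differences.
    (Assumption: stands in for the smoothing-based estimate of the paper.)"""
    return np.gradient(C, t, edge_order=2)

def fit_parabola(C, G):
    """Least-squares fit of G ~ a*C + b*C**2, a parabola through the origin.
    Returns (r, K) for the equivalent form G = r*C*(1 - C/K)."""
    A = np.column_stack([C, C**2])
    (a, b), *_ = np.linalg.lstsq(A, G, rcond=None)
    if b >= 0:
        raise ValueError("fitted parabola is not inverted; no finite final size")
    return a, -a / b

def forecast(t, t0, C0, r, K):
    """Closed-form solution of dC/dt = r*C*(1 - C/K) with C(t0) = C0."""
    return K / (1.0 + (K / C0 - 1.0) * np.exp(-r * (t - t0)))

# Synthetic one-wave outbreak (illustrative values, not real data).
t = np.arange(0.0, 40.0)
true_r, true_K = 0.3, 5000.0
C = forecast(t, 0.0, 10.0, true_r, true_K)

# Pretend the outbreak is ongoing: only the first 25 days are observed.
t_obs, C_obs = t[:25], C[:25]
G_obs = estimate_incidence(t_obs, C_obs)

r_hat, K_hat = fit_parabola(C_obs, G_obs)
peak_incidence = r_hat * K_hat / 4.0      # vertex of the parabola, at C = K/2
C_pred = forecast(t, t_obs[-1], C_obs[-1], r_hat, K_hat)

print(f"estimated final size K ~ {K_hat:.0f}")
print(f"estimated peak incidence ~ {peak_incidence:.0f} cases per day")
```

In this form, K plays the role of the ultimate outbreak size and r K / 4 that of the peak incidence, and the forecast is obtained by integrating the model forward from the last observed point. With real, noisy case reports a smoothing step before differentiation would be needed, and a fit of G(C) alone is likely to be less stable than the two-plot fit described in the text.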