Saturday, March 28, 2026

Paper Recommendation 1: How Hidden Markov Models Unmasked the True Scale of COVID-19

This is the first in what will hopefully be an intermittent series of posts appreciating other people's papers. Here it goes:

One of the big statistical problems during the COVID pandemic was that the official case counts were never the whole story. The number reported each day depended not only on how many people were actually infected, but also on how many were tested, how quickly laboratories processed samples, and how public health systems recorded cases. In other words, the observed data were only a partial and noisy picture of the real epidemic.

This is exactly the kind of problem state-space models, or SSMs, are designed to handle. In the 2020 work of Fernández-Fontelo, Moriña, Cabaña, Arratia, and Puig, the main idea was to separate the epidemic into two layers: a hidden layer representing the true number of infections, and an observed layer representing the reported counts. Once those two layers are separated, the model can estimate how much disease activity is missing from the official numbers.

Let \(X_n\) denote the true number of new COVID cases on day \(n\), and let \(Y_n\) denote the number of cases that are actually reported. The important point is that \(X_n\) is not directly observed, while \(Y_n\) is. The goal of the model is to infer the hidden sequence \(X_1, X_2, \dots\) from the reported sequence \(Y_1, Y_2, \dots\).

The hidden epidemic process is written as

\[ X_n = \alpha \circ X_{n-1} + W_n, \qquad W_n \sim \mathrm{Poisson}(\lambda_n). \]

This equation says that today’s true case count is built from two pieces. The first piece, \( \alpha \circ X_{n-1} \), represents dependence on yesterday’s true count. The second piece, \(W_n\), represents newly generated cases on day \(n\).

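The operator \( \circ \) in \( \alpha \circ X_{n-1} \) is called binomial thinning. It means that each of yesterday’s cases is carried forward independently with probability \( \alpha \), so that, given \(X_{n-1}\), the thinned count follows a \( \mathrm{Binomial}(X_{n-1}, \alpha) \) distribution. More explicitly,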

\[ \alpha \circ X_{n-1} = \sum_{j=1}^{X_{n-1}} B_{n,j}, \qquad B_{n,j} \sim \mathrm{Bernoulli}(\alpha). \]

So if yesterday had a large hidden case count, today is also more likely to have a large hidden count. This gives the model memory over time.
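As a minimal sketch of the thinning step, the snippet below draws \( \alpha \circ X_{n-1} \) as a single binomial sample, which is equivalent to summing \(X_{n-1}\) independent Bernoulli(\( \alpha \)) variables. The function name `binomial_thinning` and the values of `alpha` and `x_prev` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def binomial_thinning(alpha, x_prev, rng):
    """Draw alpha ∘ x_prev, the number of yesterday's cases carried forward."""
    # A sum of x_prev independent Bernoulli(alpha) draws is Binomial(x_prev, alpha).
    return int(rng.binomial(x_prev, alpha))

rng = np.random.default_rng(0)   # seed chosen only for reproducibility
alpha = 0.6                      # illustrative value, not an estimate from the paper
x_prev = 1000                    # hypothetical hidden count for yesterday

draws = np.array([binomial_thinning(alpha, x_prev, rng) for _ in range(5000)])
print(draws.mean())              # should be close to alpha * x_prev
```

On average the thinned count is \( \alpha X_{n-1} \), which is where the model's memory comes from: a larger count yesterday shifts today's count upward.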

The second term \(W_n\) is modeled as Poisson with mean \( \lambda_n \). If \( \lambda_n \) were constant, the model would be too simple for an actual epidemic wave. COVID does not produce the same average number of new infections every day. Instead, the epidemic rises, peaks, and falls.

To capture that, the authors let \( \lambda_n \) vary with time, using a curve inspired by the SIR epidemic model (a logistic-growth approximation to the SIR ODEs). They write the cumulative affected population as

\[ A(t) = \frac{M^* A_0 e^{kt}}{M^* + A_0\left(e^{kt}-1\right)}, \qquad k = \beta - \gamma. \]

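Here \( \beta \) is the transmission rate and \( \gamma \) is the recovery rate. From the formula itself, \(A(0) = A_0\) is the initial size of the affected population, and for \(k > 0\) the curve levels off at \(M^*\) as \(t\) grows. The authors then define the expected number of new cases on day \(n\) by taking the daily increment: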

\[ \lambda_n = A(n) - A(n-1). \]

This means the hidden process does not just wander randomly. It is guided by an epidemic-growth structure that allows the true number of cases to rise quickly at first and then slow down later.
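To make the shape of \( \lambda_n \) concrete, here is a small sketch that evaluates the logistic curve \(A(t)\) and its daily increments. The function names and all parameter values (`M_star`, `A0`, `beta`, `gamma`) are made up for illustration and are not the paper's fitted estimates.

```python
import numpy as np

def cumulative_affected(t, M_star, A0, k):
    """Logistic curve A(t) = M* A0 e^{kt} / (M* + A0 (e^{kt} - 1))."""
    growth = np.exp(k * np.asarray(t, dtype=float))
    return M_star * A0 * growth / (M_star + A0 * (growth - 1.0))

def daily_rate(n, M_star, A0, k):
    """lambda_n = A(n) - A(n-1), the expected number of new cases on day n."""
    n = np.asarray(n, dtype=float)
    return cumulative_affected(n, M_star, A0, k) - cumulative_affected(n - 1, M_star, A0, k)

# Illustrative values only.
M_star, A0 = 100_000.0, 10.0
beta, gamma = 0.35, 0.15
k = beta - gamma

days = np.arange(1, 121)
lam = daily_rate(days, M_star, A0, k)
peak_day = int(days[np.argmax(lam)])
print(peak_day, lam.sum())
```

Because the increments telescope, the sum of \( \lambda_1, \dots, \lambda_N \) equals \(A(N) - A(0)\), so the expected total of new arrivals can never exceed \(M^* - A_0\).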

So the hidden epidemic layer becomes

\[ X_n = \alpha \circ X_{n-1} + W_n, \qquad W_n \sim \mathrm{Poisson}\!\bigl(A(n)-A(n-1)\bigr). \]
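Putting the two pieces together, the hidden layer can be simulated forward one day at a time. This is only a sketch of the recursion above under assumed parameter values; the starting count `x0`, the helper names, and the numbers are hypothetical.

```python
import numpy as np

def logistic_increments(n_days, M_star, A0, k):
    """Return lambda_1..lambda_{n_days}, where lambda_n = A(n) - A(n-1)."""
    t = np.arange(0, n_days + 1, dtype=float)
    growth = np.exp(k * t)
    A = M_star * A0 * growth / (M_star + A0 * (growth - 1.0))
    return np.diff(A)  # element i holds lambda_{i+1}

def simulate_hidden(n_days, alpha, lam, x0, rng):
    """Simulate X_n = alpha ∘ X_{n-1} + W_n with W_n ~ Poisson(lambda_n)."""
    x = np.empty(n_days, dtype=np.int64)
    prev = x0
    for i in range(n_days):
        carried = rng.binomial(prev, alpha)   # binomial thinning of yesterday's count
        arrivals = rng.poisson(lam[i])        # newly generated cases today
        x[i] = carried + arrivals
        prev = x[i]
    return x

# Illustrative values only, not estimates from the paper.
n_days, alpha, x0 = 120, 0.5, 0
lam = logistic_increments(n_days, M_star=100_000.0, A0=10.0, k=0.2)

rng = np.random.default_rng(1)
x = simulate_hidden(n_days, alpha, lam, x0, rng)
print(x[:5], x.max())
```

The simulated path rises, peaks, and falls along with \( \lambda_n \), while the thinning term smooths it from one day to the next.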

The reported data \(Y_n\) are not assumed to be equal to the true data \(X_n\). Instead, the model treats reporting as imperfect. A simple way to understand the idea is

\[ Y_n \approx q_n X_n, \qquad 0 < q_n < 1. \]

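Here \(q_n\) is the reporting fraction on day \(n\). If \(q_n = 0.4\), then only about 40% of the true cases appear in the official count, and the remaining 60% are hidden from the data. Read the other way, a reported count of 200 would correspond to roughly \(200 / 0.4 = 500\) true cases.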

The full model is more careful than this rough formula, but the intuition is correct: the observed count is only a partial measurement of the hidden epidemic.

The reporting fraction is allowed to change over time through a logistic function:

\[ q_n = \frac{e^{\eta_n}}{1 + e^{\eta_n}}. \]

This guarantees that \(q_n\) always stays strictly between 0 and 1. In other words, \(q_n\) is the sigmoid (inverse-logit) transform of \( \eta_n \). The quantity \( \eta_n \) can include a time trend and day-of-week effects, which lets the model account for changing testing practice, administrative delays, and weekend reporting effects.
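The sketch below shows one way such a reporting fraction could be built and applied. The linear predictor (intercept, linear trend, single weekend effect), its coefficients, and the constant hidden counts are all hypothetical. Drawing \(Y_n\) as a binomial thinning of \(X_n\) is an assumption chosen to match the rough formula \(Y_n \approx q_n X_n\), not necessarily the paper's exact observation equation.

```python
import numpy as np

def reporting_fraction(eta):
    """q = e^eta / (1 + e^eta), written in the equivalent form 1 / (1 + e^{-eta})."""
    return 1.0 / (1.0 + np.exp(-np.asarray(eta, dtype=float)))

# Hypothetical linear predictor: intercept + time trend + weekend effect.
b0, b_trend, b_weekend = -1.0, 0.02, -0.5
days = np.arange(1, 29)
is_weekend = (days % 7 == 6) | (days % 7 == 0)   # assumes day 1 is a Monday
eta = b0 + b_trend * days + b_weekend * is_weekend
q = reporting_fraction(eta)

# Assumed observation step: each true case is reported with probability q_n.
rng = np.random.default_rng(2)
x = np.full(days.shape, 1000)   # hypothetical constant hidden counts
y = rng.binomial(x, q)
print(np.round(q[:7], 3), y[:7])
```

Even with the hidden count held constant, the reported series drifts upward with the trend and dips on weekends, which is exactly the kind of artifact the reporting layer is meant to absorb.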

Without a state-space model, it is easy to mistake the official case counts for the epidemic itself. But official counts mix together at least two processes: actual transmission and the reporting system. If testing expands, reported cases may rise even if transmission is stable. If testing is restricted, reported cases may look artificially low even while infections are surging.

That is why SSMs were so useful during COVID. They provided a principled way to reconstruct the hidden epidemic process underneath noisy and incomplete surveillance data. In that sense, they helped unmask the true danger of COVID. This is what good SSMs do: they systematically bridge mathematical modeling and data to reveal the known unknown.