The authors have declared that no competing interests exist.
Conceived and designed the experiments: TJ AC XD SC CF NF. Performed the experiments: TJ. Analyzed the data: TJ. Contributed reagents/materials/analysis tools: TJ. Wrote the paper: TJ AC XD SC CF NF. Designed the software ‘outbreaker’: TJ.
Recent years have seen progress in the development of statistically rigorous frameworks to infer outbreak transmission trees (“who infected whom”) from epidemiological and genetic data. Making use of pathogen genome sequences in such analyses remains a challenge, however, with a variety of heuristic approaches having been explored to date. We introduce a statistical method exploiting both pathogen sequences and collection dates to unravel the dynamics of densely sampled outbreaks. Our approach identifies likely transmission events and infers dates of infections, unobserved cases and separate introductions of the disease. It also proves useful for inferring numbers of secondary infections and identifying heterogeneous infectivity and super-spreaders. After testing our approach using simulations, we illustrate the method with the analysis of the beginning of the 2003 Singaporean outbreak of Severe Acute Respiratory Syndrome (SARS), providing new insights into the early stage of this epidemic. Our approach is the first tool for disease outbreak reconstruction from genetic data widely available as free software, the R package
Understanding how infectious diseases are transmitted from one individual to another is essential for designing containment strategies and epidemic prevention. Recently, the reconstruction of transmission trees (“who infected whom”) has been revolutionized by the availability of pathogen genome sequences. Exploiting this information remains a challenge, however, with a variety of heuristic approaches having been explored to date. Here, we introduce a new method which uses both pathogen DNA and collection dates to gain insights into transmission events, and detect unobserved cases and separate introductions of the disease. Our approach is also useful for identifying super-spreaders, i.e., cases which caused many subsequent infections. After testing our method using simulations, we use it to gain new insights into the beginning of the 2003 Singaporean outbreak of Severe Acute Respiratory Syndrome (SARS). Our approach is applicable to a wide range of diseases and available in a free software package called
Statistical methods for analyzing detailed epidemiological data collected during infectious disease outbreaks have seen rapid development in recent years
Integrated analysis of both epidemiological and sequence data clearly would maximize our ability to reconstruct transmission trees, but there are methodological and computational challenges. These challenges center on constructing and evaluating a unified likelihood for both the genetic and epidemiological data. One of the first attempts at integrated analysis
Here we introduce a novel and generic framework for the reconstruction of disease outbreaks based on pathogen genetic sequences and collection dates. We use the distribution of the generation time (
We analysed simulated outbreaks to assess the performance of our method under a variety of conditions, including different basic reproduction numbers (
Parameter | Possible values | Label |
Basic reproduction number (R0) | 1.1 | Low R |
Basic reproduction number (R0) | Base | |
Basic reproduction number (R0) | 4 | High R |
Generation time distribution | short (1.5, 1, 4) | Short generation |
Generation time distribution | Base | |
Generation time distribution | long (6, 3, 20) | Long generation |
Mutation rate | 0 | No mutation |
Mutation rate | Base | |
Mutation rate | 2×10−4 | Fast evolution |
Genome length | [constant across simulations] | |
Rate of imported cases | 0 | No import |
Rate of imported cases | Base | |
Rate of imported cases | 0.2 | Many imports |
Proportion of cases sampled | 0.25 | 75% missing cases |
Proportion of cases sampled | 0.50 | 50% missing cases |
Proportion of cases sampled | 0.75 | 25% missing cases |
Proportion of cases sampled | Base | |
Values indicated in bold correspond to the base simulation. Every other value was changed individually from the base simulation, giving one unique simulation setting. For every setting, 50 independent simulated epidemics were obtained. The minimum outbreak size was set to 10 cases (smaller outbreaks were discarded). Labels are used throughout the text to identify unique simulation settings.
Generation time distribution: the first two values are the mean and standard deviation of the gamma distribution, before discretization; the third value is the date after which the distribution is truncated to zero.
Mutation rate: expressed per site and per generation.
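The generation time settings in the table above can be illustrated with a short sketch. It is not the discretization used by the authors, which is not described in this text. It assumes that the mass assigned to day t is the gamma density integrated over (t − 1, t], that the distribution is truncated after the third value, and that the remaining masses are renormalised. The function names are hypothetical.

```python
import math


def gamma_pdf(x, shape, scale):
    """Density of a gamma distribution at x > 0."""
    return (x ** (shape - 1) * math.exp(-x / scale)
            / (math.gamma(shape) * scale ** shape))


def discretized_gamma(mean, sd, truncate_at, steps=2000):
    """Return [w(1), ..., w(truncate_at)], normalised to sum to 1.

    Assumption (not from the paper): w(t) is the gamma mass on the
    interval (t - 1, t], approximated with the midpoint rule.
    """
    shape = (mean / sd) ** 2      # moment matching: mean = shape * scale
    scale = sd ** 2 / mean        # variance = shape * scale ** 2
    h = 1.0 / steps
    masses = []
    for day in range(1, truncate_at + 1):
        mass = h * sum(gamma_pdf(day - 1 + (k + 0.5) * h, shape, scale)
                       for k in range(steps))
        masses.append(mass)
    total = sum(masses)
    return [m / total for m in masses]


# The "short" and "long" settings listed in the table.
short_generation = discretized_gamma(1.5, 1.0, 4)
long_generation = discretized_gamma(6.0, 3.0, 20)
```

Moment matching is used here only because the table gives a mean and a standard deviation; any other parameterisation of the gamma distribution would serve equally well.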
Transmission trees were overall very well reconstructed, with 70% to 90% of true ancestries being recovered in most simulation settings (
This violin plot represents the proportion of correctly inferred transmissions in the consensus ancestries, obtained by retaining the most frequent infector in the posterior trees for each case. Each colored ‘violin’ represents the density of points for a given simulation setting, indicated on the x-axis (see
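The consensus ancestries described in this caption can be sketched as follows. This is an illustration only: the data layout (one dictionary per posterior tree, mapping each case to its infector, with `None` for an unknown or imported origin) and the function names are assumptions, not the package's interface.

```python
from collections import Counter


def consensus_ancestries(posterior_trees):
    """Most frequent infector of each case, with its posterior support.

    posterior_trees: list of dicts mapping case -> infector
    (None stands for an unknown or imported origin).
    """
    n_trees = len(posterior_trees)
    consensus = {}
    for case in posterior_trees[0]:
        counts = Counter(tree[case] for tree in posterior_trees)
        infector, count = counts.most_common(1)[0]
        consensus[case] = (infector, count / n_trees)
    return consensus


def proportion_correct(consensus, true_tree):
    """Share of cases whose consensus infector matches the true one."""
    hits = sum(1 for case, (infector, _) in consensus.items()
               if true_tree[case] == infector)
    return hits / len(consensus)


# Toy example: three posterior trees for three cases.
trees = [
    {"A": None, "B": "A", "C": "B"},
    {"A": None, "B": "A", "C": "A"},
    {"A": None, "B": "A", "C": "B"},
]
consensus = consensus_ancestries(trees)
truth = {"A": None, "B": "A", "C": "A"}
score = proportion_correct(consensus, truth)
```

In this toy example the consensus infector of case C is B, with support 2/3, although the true infector is A, so two of the three ancestries are recovered.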
The detection of imported cases showed excellent specificity and good sensitivity when results were pooled across the simulated datasets, with a majority of simulations exhibiting perfect results (
This figure shows the specificity and sensitivity of the procedure for detecting imported cases based on the identification of genetic outliers. Colored rectangles represent the percentage of simulations within a given specificity/sensitivity range. All simulation settings were pooled for this analysis.
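For reference, the two quantities shown in this figure can be computed as in the sketch below. The function name and the set-based representation of cases are assumptions made for illustration.

```python
def sensitivity_specificity(all_cases, true_imports, detected_imports):
    """Sensitivity and specificity of imported-case detection.

    Sensitivity: share of truly imported cases that were detected.
    Specificity: share of non-imported cases that were not flagged.
    """
    all_cases = set(all_cases)
    true_imports = set(true_imports)
    detected_imports = set(detected_imports)

    true_pos = len(true_imports & detected_imports)
    false_neg = len(true_imports - detected_imports)
    false_pos = len(detected_imports - true_imports)
    true_neg = len(all_cases - true_imports - detected_imports)

    nan = float("nan")
    sensitivity = (true_pos / (true_pos + false_neg)
                   if true_pos + false_neg else nan)
    specificity = (true_neg / (true_neg + false_pos)
                   if true_neg + false_pos else nan)
    return sensitivity, specificity


# Toy example: ten cases, two true imports, one of them detected.
sens, spec = sensitivity_specificity(range(1, 11), {1, 6}, {1})
```

Here one of the two imports is found and no local case is wrongly flagged, giving a sensitivity of 0.5 and a specificity of 1.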
While our model does not explicitly estimate the effective reproduction number ‘
This violin plot shows the estimates of individual effective reproduction numbers (
To gain a better understanding of disease outbreak dynamics, identifying systematic heterogeneity in
Results showed that our method was able to recover contrasted infectivity between different groups (
This violin plot shows actual and estimated values of effective reproduction numbers (
We analyzed data collected during the beginning of a SARS outbreak which took place in Singapore in 2003
The genetic diversity amongst isolates was limited, with fewer than 15 mutations separating any pair of genomes (
We used
This figure summarizes the reconstruction of the outbreak, showing putative transmissions (arrows) amongst individuals (rows). Arrows represent ancestries with at least 5% of support in the posterior distributions, while boxes correspond to the posterior distributions of the infection dates. Arrows are annotated by number of mutations and posterior support of the ancestries, and colored by numbers of mutations, with lighter shades of grey for larger genetic distances. The actual sequence collection dates are plotted as plain black dots. Bubbles are used to represent the generation time distribution, with larger disks used for greater infectivity. Shades of blue indicate the degree of certainty for inferring the origin of different cases, as measured by the entropy of ancestries (see methods and equation 12): blue represents conclusive identification of the ancestor of the case (low entropy), while grey shades are uncertain (high entropy).
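The entropy of ancestries mentioned in this caption can be illustrated with a small sketch. Equation 12 is not reproduced in this text, so the code assumes the usual Shannon entropy (natural logarithm) of the posterior frequencies of the candidate infectors of a case; the function name is hypothetical.

```python
import math
from collections import Counter


def ancestry_entropy(sampled_infectors):
    """Shannon entropy of the posterior frequencies of infectors.

    sampled_infectors: the infector assigned to one case in each
    posterior sample. Returns 0 when a single infector has all the
    support, and larger values as support is spread more evenly.
    """
    n = len(sampled_infectors)
    counts = Counter(sampled_infectors)
    return sum((c / n) * math.log(n / c) for c in counts.values())


# One infector with full support: no uncertainty.
certain = ancestry_entropy(["Sin2500"] * 100)
# Two infectors with equal support: entropy equals log(2).
split = ancestry_entropy(["Sin2500", "Sin2679"] * 50)
```

Under this definition a case drawn in blue in the figure would have an entropy close to 0, while a case whose support is split evenly between k candidate infectors would have an entropy of log(k).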
This figure indicates the most supported transmission tree reconstructed by
The most recent investigation of this outbreak suggested a dual introduction of the pathogen, with a separate index case (Sin2679) nearly 20 days after the initial index case Sin2500
Building on past work
As in other tree reconstruction methods
Our method relies on several assumptions which can be used to define the scope of its possible applications. The most important element in this respect is the proportion of cases represented in the sampled data, and thus often the scale of the epidemics considered. Our approach aims to reconstruct ancestries in closely related cases. As such, it should be most useful for detailed outbreak investigations. While the reconstruction of transmission trees seems relatively robust to large proportions of unobserved cases (up to 75% of missing cases,
One of the novelties of our approach is the detection of imported cases, which are identified as genetic outliers. While this method should be useful to detect separate introductions of different pathogenic lineages in an epidemic, it may be sensitive to other events prone to creating genetic outliers, such as sequencing errors or recombination. Care should therefore be devoted to ensuring data quality and filtering out polymorphism due to recombination. Moreover, the assumption that imported cases are genetically distinguishable from other cases may not always be true, especially when multiple introductions take place from a closely related lineage. Such cases cannot be detected by genetic data only, and would require other sources of information (e.g. contact tracing) to be considered. In this respect, an interesting feature of
Another important point is that following a previous, widely-used approach for the analysis of outbreaks
More fundamentally, the use of a generation time distribution also implies that our method is less appropriate for diseases in which long periods of asymptomatic carriage are frequent. For instance, bacteria such as
Moreover, carried pathogens are also more likely to cause multiple colonizations of the host, resulting in several lineages coexisting within the same patient. Our model assumes that a single pathogen genome exists within each host, and is therefore not designed to account for multiple infections. A simple workaround would be to duplicate cases of multiple infections into single infections, assuming that multiple infections are made of independent, single colonization events. However, this would not allow multiple infections to be disentangled from mere within-host evolution of a single lineage. A more satisfying approach would be to model explicitly the evolution of isolates within hosts, but this would likely result in a much more complex model and is beyond the remit of our current approach.
A major simplification made in our model, which could be relaxed in future work, is that we do not consider within-host diversity of pathogens. Within-host diversity is particularly prominent in pathogens that infect a host for a long time relative to their within-host replication cycle (e.g. HIV or Hepatitis C Virus), pathogens that can be carried for a long time (e.g.
Finally, we wish to emphasize the importance of including all available prior information in the analysis. Because the estimates of parameters governing an outbreak are often correlated, accurate knowledge of one can be used to refine the estimation of the others. For instance, specifying known transmission chains or imported cases will improve the estimation of the mutation rates, as well as the overall reconstruction of the transmission tree. Conversely, fixing the mutation rate to its ‘true’ value (or a good estimate thereof) is likely to improve the detection of imported cases. As currently implemented, our method allows for fixing any parameter as well as individual ancestries, which are used in the likelihood computations but not changed during the MCMC. This feature should be especially useful for incorporating known transmission events or introductions of the pathogen into the population, based for instance on clinical investigations and contact tracing information. However, results of contact tracing studies should always be considered cautiously, and could be contradicted by the analysis of corresponding sequences, as illustrated by the SARS outbreak in Singapore.
There are other promising avenues for incorporating various streams of information into our approach. The likelihood of our model allows for additional ‘plug-in’ terms for individual transmissions, which could be used to model spatial dispersion processes as well as movement over a contact network. Therefore, we hope that the present method will not only be applied widely, but also motivate further developments for the investigation of infectious disease outbreaks.
We developed a discrete-time stochastic model for reconstructing likely transmission trees of an outbreak based on pathogen genetic sequences and their collection dates (see notations summary in
Symbol | Type | Description |
 | Index | index of cases |
 | Data | number of cases in the sample |
 | Data | sequence of case |
 | Data | collection date of |
 | Function | generation time distribution |
 | Function | time-to-collection distribution |
 | Function | number of mutations between |
 | Function | number of comparable nucleotides between |
 | Augmented data | index of the most recent sampled ancestor of case |
 | Augmented data | number of generations between |
 | Augmented data | date of the infection of |
 | Parameter | mutation rate, per site and per generation of infection |
 | Parameter | proportion of cases of the outbreak sampled |
Our model is embedded within a Bayesian framework. We denote
The likelihood is computed as a product of case-specific terms, in which we assume that all cases are independent conditional on their ancestries:
The general term of the pseudo-likelihood for case
As in
The epidemiological pseudo-likelihood
Imported cases are not explicitly included in the model, but detected using a preliminary run of the model, during which genetic outliers are identified and the corresponding cases classified as imported. The ancestry of these cases is fixed as ‘unknown’ in the second and final run. We use a leave-one-out procedure for detecting cases with outlying genetic log-likelihood which has been used previously in a similar context
Because our model uses a mutation rate expressed per generation of infection, estimated values cannot be readily compared to classical rates of evolution, typically expressed per unit of time. As a workaround, we can re-estimate a classical mutation rate from the distribution of posterior trees. The mutation rate can be inferred from one transmission event as the ratio of the number of mutations from ancestor to descendant to the amount of time separating the infection dates of these cases. For each tree, we compute the average mutation rate across all ancestries, which provides one estimate of the mutation rate for each posterior sample. This procedure is implemented in the function
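The re-estimation described above can be sketched as follows. This is not the package function referred to in the text. The record layout and names are hypothetical, the rate is expressed per genome and per unit of time, and ancestries whose two infection dates coincide are skipped, which is an assumption made here to avoid dividing by zero.

```python
from statistics import mean


def tree_mutation_rate(ancestries):
    """Average mutation rate per unit of time over one posterior tree.

    ancestries: list of dicts with keys 'mutations' (count between
    ancestor and descendant), 't_inf_ancestor' and 't_inf_descendant'
    (infection dates, as numbers).
    """
    rates = [a["mutations"] / (a["t_inf_descendant"] - a["t_inf_ancestor"])
             for a in ancestries
             if a["t_inf_descendant"] > a["t_inf_ancestor"]]
    return mean(rates) if rates else float("nan")


def posterior_mutation_rates(posterior_trees):
    """One mutation rate estimate per posterior sample."""
    return [tree_mutation_rate(tree) for tree in posterior_trees]


# Toy example with two posterior trees of two ancestries each.
tree_1 = [
    {"mutations": 2, "t_inf_ancestor": 0, "t_inf_descendant": 4},
    {"mutations": 1, "t_inf_ancestor": 4, "t_inf_descendant": 6},
]
tree_2 = [
    {"mutations": 3, "t_inf_ancestor": 0, "t_inf_descendant": 6},
    {"mutations": 0, "t_inf_ancestor": 6, "t_inf_descendant": 8},
]
rates = posterior_mutation_rates([tree_1, tree_2])
```

Dividing each value by the number of comparable nucleotides would give a rate per site, which is the scale on which classical rates of evolution are usually reported.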
Our approach is implemented in the R package
Outbreaks were simulated using the function
Mutations are simulated using a single mutation rate, all sites mutating independently. Pathogens of separate introductions of the disease (including the index case) are assumed to all coalesce to the same common ancestor ten generations ago.
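A minimal sketch of such a mutation process is given below. It is not the simulation code of the package. The substitution model (a mutated site takes one of the three other nucleotides with equal probability) and the function name are assumptions made for illustration.

```python
import random

NUCLEOTIDES = "ACGT"


def mutate(sequence, mu, generations=1, rng=random):
    """Evolve a sequence for a number of generations.

    Each site mutates independently with probability mu per generation.
    Assumption: a mutated site is replaced by one of the three other
    nucleotides, chosen uniformly. Returns the new sequence and the
    number of mutation events (not the Hamming distance).
    """
    seq = list(sequence)
    n_events = 0
    for _ in range(generations):
        for pos, base in enumerate(seq):
            if rng.random() < mu:
                seq[pos] = rng.choice([n for n in NUCLEOTIDES if n != base])
                n_events += 1
    return "".join(seq), n_events


rng = random.Random(1)  # fixed seed so that the example is reproducible
ancestor = "ACGT" * 25
unchanged, no_events = mutate(ancestor, 0.0, generations=10, rng=rng)
descendant, events = mutate(ancestor, 0.01, generations=10, rng=rng)
```

With a rate of zero the sequence is returned unchanged, which corresponds to the 'No mutation' setting of the simulations.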
We evaluated the overall performance of the method using a basic scenario, and assessed the impact of different factors on the results by changing one aspect of the simulation at a time. These factors included the shape of the generation time distribution (from peaked to flat), the basic reproduction number (from 1.1 to 4), the mutation rates (from 0 to 2 mutations on average per generation and genome), the proportion of cases observed (from 0.25 to 1), the rate at which external cases are imported (from 0 to 0.2), and the proportion of sampled cases with DNA sequences (from 0.25 to 1). The different values for each element are summarized in
In addition, two other types of simulation were used to test our approach's ability to detect heterogeneous infectivity amongst cases. First, we generated outbreaks where the host population was divided into two groups of equal sizes, one being twice as infectious (equivalent
Thirteen previously published full SARS genomes
We are thankful to Sourceforge (