The NOSTRA model: Coherent estimation of infection sources in the case of possible nosocomial transmission

David J. Pascall; Christopher Jackson; Stephanie Evans; Theodore Gouliouris; Christopher J.R. Illingworth; Stefan G. Piatek; Julie V. Robotham; Oliver Stirrup; Ben Warne; Judith Breuer; Daniela De Angelis

doi:10.1371/journal.pcbi.1012949

Abstract

Nosocomial, or hospital-acquired, infections are a key determinant of patient health in healthcare facilities, leading to longer stays and increased mortality. In addition to the direct effects on infected patients, the burden imposed by nosocomial infections impacts both staff and other patients by increasing the load on the healthcare system. The appropriate infection control response may differ depending on whether the infection was acquired in the hospital or the community. For example, nosocomial outbreaks may require ward closures to reduce the risk of onward transmission, whilst this may not be an appropriate response to repeated importations of infections from outside the facility. Unfortunately, it is often unclear whether an infection detected in a healthcare facility is nosocomial, as the time of infection is unobserved. Given this, there is a strong case for the development of models that can integrate multiple datasets available in hospitals to assess whether an infection detected in a hospital is nosocomial. When assessing nosocomiality, it is beneficial to take into account both whether the timing of infection is consistent with hospital acquisition and whether there are any likely candidates within the hospital who could have been the source of the infection. In this work, we developed a Bayesian model which jointly estimates whether a given infection detected in hospital is nosocomial and whether it came from a set of individuals identified as candidates by hospital staff. The model coherently integrates pathogen genetic information, the timings of epidemiological events, such as symptom onset, and location data on the infected patient and candidate infectors. We illustrated this model on a real hospital dataset, showing both its output and how the impact of the different data sources on the assessed probabilities is contingent on what other data have been included in the model, and validated the calibration of the predictions against simulated data.

Author summary

Nosocomial, or hospital-acquired, infections have important consequences for patients and hospital staff: they worsen patient outcomes and their management stresses already overburdened health systems. Accurate judgments of whether an infection is nosocomial help staff make appropriate choices to protect other patients within the hospital. As such, appropriate models to assess whether an infection is nosocomial are a key public health need. Our assessed probability of nosocomiality should change if the infected patient came into contact with high-risk potential infectors within the hospital, so we should not attempt to judge whether an infection is nosocomial without also considering this factor. Given this, we developed a model that integrates epidemiological, contact and pathogen genetic data to determine how likely an infection is to be nosocomial and the probability of given infection candidates being the source of the infection, and validated this model using simulations from a previously published agent-based hospital outbreak simulation model.


Citation: Pascall DJ, Jackson C, Evans S, Gouliouris T, Illingworth CJ, Piatek SG, et al. (2025) The NOSTRA model: Coherent estimation of infection sources in the case of possible nosocomial transmission. PLoS Comput Biol 21(4): e1012949. https://doi.org/10.1371/journal.pcbi.1012949

Editor: Virginia E. Pitzer, Yale School of Public Health, UNITED STATES OF AMERICA

Received: November 2, 2023; Accepted: March 10, 2025; Published: April 21, 2025

Copyright: © 2025 Pascall et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The code for the analyses and the implementation of the NOSTRA model is available at https://github.com/dpascall/NOSTRA-model. The Cambridge University Hospitals data used to illustrate the model are patient identifiable, and as such are not available. The data required to repeat the simulation study is available at https://doi.org/10.6084/m9.figshare.27172758.

Funding: DJP was funded by a NIHR award to JB (NIHR200652). CI was funded by the Medical Research Council (MC_UU_00034/1). JVR was supported by the National Institute for Health and Care Research (NIHR) Health Protection Research Unit (HPRU) in Modelling and Health Economics, which is a partnership between the UK Health Security Agency (UKHSA), Imperial College London, and the London School of Hygiene and Tropical Medicine (NIHR200908). This work was supported by UKRI through the JUNIPER consortium (MR/V038613/1). This work was supported by the Medical Research Council via the MRC Biostatistics Unit Core Award (MC_UU_00002/11). This work was supported by the NIHR Cambridge Biomedical Research Centre (NIHR203312). The views expressed are those of the authors and not necessarily those of the NIHR or the Department of Health and Social Care. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Nosocomial infections are an important issue facing health systems across the world, impacting both patient survival [1,2] and their willingness to access healthcare [3]. However, when an infection is found in a patient in a healthcare setting it is often unclear whether the infection was genuinely acquired in the facility or just detected there. These two scenarios have very different implications for decision making. High levels of nosocomial infections may require the stepping up of infection control responses both at the ward level, and across the whole hospital to protect patient health [4]. The precise required response may be idiosyncratic, as the ultimate causes of nosocomial infections differ dramatically across pathogens and facilities. For example, poorly ventilated wards may be susceptible to frequent nosocomial respiratory infections, while wards containing particularly susceptible patient groups may be at risk of high rates of nosocomial infection across pathogen groups. The ideal responses to these two cases will be very different, and the set of available responses will be limited by the economic context. But while the responses across scenarios and facilities will differ, in order to know whether they are necessary at all, it is first necessary to determine whether the infections that are being seen truly are healthcare-associated. Given this, methods for assessing both whether an infection is nosocomial, and if it is, whether it is part of a large hospital outbreak, are necessary for coherent decision making in infection control.

Historically, whether an infection was hospital acquired was commonly assessed by the application of heuristic criteria. For example, England’s public health agency, the UK Health Security Agency (then Public Health England), in its last major study of nosocomial infections, defined a hospital-acquired infection as one where symptom onset occurs on at least the 3rd day post-admission [5]. Moving beyond this heuristic approach to one based on specific evidence related to characteristics of the pathogen of interest and host population would be more principled, with the integration of larger amounts of data in a coherent fashion hopefully leading to more accurate assessments.

Tools have been developed explicitly for the assessment of nosocomiality [6–8] or the related task of inferring transmission histories (reviewed in [9]). However, these tools cannot simultaneously answer the questions of “Is the infection nosocomial?” and “What is the most likely source?”. Treating both questions simultaneously should lead to systematically different answers than would be the case if attempting to answer them individually, because the presence or absence of high probability infection candidates within the hospital should impact our assessment of the probability that the infection is nosocomial. Given that the appropriate infection control response depends on whether an infection truly is nosocomial, there is an urgent need for the development of tools that are capable of simultaneously answering both of these questions, which can be used routinely by hospital staff in real time to guide management decisions.

Unfortunately, attributing nosocomiality and assessing infection sources of a given infection is a non-trivial task, as generally not all infection candidates will be identified, and even in cases where every possible infection candidate is known, data may only be available for a subset. Hence, any method to approach these problems together (or, indeed, individually) must decide what assumptions are going to be made about this problem of missing infection sources (and their associated data). This is done by either making assumptions about the nature of the missing data (commonly that data that is available is representative of the missing data), as in [6], or by reducing focus to a circumscribed question that is answerable with just the data observed, as in [10–12].

One model that directly attempted to assess nosocomiality was Hospital-Onset COVID-19 Infections (HOCI) [6]. This Bayesian model was developed at the height of the SARS-CoV-2 pandemic to integrate epidemiological and genetic data to assess whether detected SARS-CoV-2 infections were hospital-acquired. To model the epidemiological information, it used a distribution from infection time to symptom onset to assess whether infection was more likely in the hospital or the community given the observed admission and onset times. The genetics were modelled by testing the consistency of the viral isolate from the patient of interest to sequences from the hospital relative to those from the community. This tool was rolled out across National Health Service (NHS) hospitals in the UK for real time use by infection control teams in a large study assessing the impact of sequencing on clinical outcomes [13]. HOCI, however, did not attempt to make any assessments of the source of the infection if it was determined to have a high posterior probability of being nosocomial.

In contrast, the A2B model [11] focuses exclusively on source attribution: that is, conditional on there having been an infection in hospital, which individuals are consistent with having been the source. A2B works within a frequentist framework, providing p-values on whether the data are more extreme than would be observed under the null model of infection from individual A to individual B, hence the name. Like HOCI, A2B makes use of infection times and sequence information, but it also uses information about the locations of individuals within the hospital in order to rule out some transmission events.

Here, we, a group of authors comprising many of the original developers of the HOCI and A2B models, present a novel Bayesian model, designed as a conceptual integration of these two models [6,11], that we call NOSTRA (NOSocomial TRansmission Assessment), a name inspired by the historical prognosticator Nostradamus. The aims of our model are twofold:

To provide both a probability that a detected infection was acquired within the hospital, and probabilities for the source being within a set of given candidate individuals
To have a low enough runtime to be usable on wards in real-time for clinical decision making

We illustrate the outputs of this model using previously published data [10] collected during the early stages of the COVID-19 pandemic at Cambridge University Hospitals NHS Foundation Trust (CUH) and validate its performance against infection data from an agent-based hospital simulation model.

Materials and methods

Data

The data we used to illustrate our model outputs are fully explained in Illingworth et al. [10], but we briefly re-describe them here. The data were collected during prospective COVID-19 surveillance at CUH between the 22nd March and 14th June 2020. Patients were tested for COVID-19 through targeted patient screening in wards with detected hospital-onset outbreaks. The original data comprised five wards, but we focus only on the ward identified as A in Illingworth et al. Due to a large outbreak on this ward, all individuals on the ward were eventually tested, including those exhibiting no symptoms. Final case sets were generated manually by looking for possible links in a social network diagram generated in FoodChain-Lab [14].

Patient locations on each day and the date of onset of symptoms (or in the case of asymptomatic infections, date of detection) were extracted from the hospital’s electronic records. Location data were available for all but two patients.

Viral sequences were generated from isolates using the modified ARTIC v2 protocol [15]. These sequences represent the string of nucleic acids that make up the viral genome, and differences between them can be informative on how the viral populations within different patients are related to one another. The whole viral population in a patient is summarised as a “consensus sequence,” which can be viewed as the average viral genome in that patient.

For this study, we performed some post-processing of the genetic data: we first reduced the alignment for each pair of individuals to those columns which contain no ambiguities or gaps, then recorded the length of this reduced alignment, and finally calculated the number of single nucleotide polymorphisms (SNPs) between each pair in the reduced alignment.

Model

The goal of the model is to estimate the most likely source of infection for an individual, who we label B, whose infection was discovered in hospital using: data on B’s movements; the genetic sequence of their infecting pathogen; their times of admission to hospital and symptom onset; and the movements and onset times of a set of n candidate infector individuals in the hospital, who we label A₁ to A_n. We partition the possible sources of infection into mutually exclusive groups as follows:

the candidate individuals, labelled A₁, ..., A_n,
any source of infection from the hospital other than these individuals, including visitors, labelled H,
infections outside the hospital, labelled C.

We set this up as a Bayesian inference problem. The unknown source of infection S is a categorical variable with potential values {A₁, ..., A_n, H, C}. The goal is to estimate the posterior distribution of S. This posterior distribution is fully specified by the set of quantities P(S = s|X), with s ∈ {A₁, ..., A_n, H, C}, and expresses our judgment and uncertainty about B’s true infection source after observing the data, X. For example, if we became certain that B was infected outside the hospital, we would have P(S = C|X) = 1, and a probability of 0 for each other potential source. We start with a prior probability distribution and deduce the posterior distribution given this prior and the observed data. All notation is defined in Table 1.


Table 1. A reference for the notation used throughout the paper.

https://doi.org/10.1371/journal.pcbi.1012949.t001

We partition the data X into several components, such that X = {X_H, X_{A_1}, ..., X_{A_n}}. X_{A_z} consists of the onset time of A_z, the distance in terms of SNPs between the pathogen genomes of A_z and B (assumed to be generated from an alignment with no gaps or ambiguities), and whether A_z and B were in the same location on each day. X_H consists of the admission time of B and the onset time of B.

Bayesian analysis: prior

Bayesian analysis requires a prior over S. There are multiple ways that this prior could be justified. Our default prior is a uniform prior over nosocomiality. More precisely, this prior is of the form P(S = C) = 1/2 and P(S = s) = 1/(2(n + 1)), for s ∈ {A₁, ..., A_n, H}, where n is the number of candidate individuals. This prior expresses complete uncertainty over whether a given infection is nosocomial, and, conditional on it being nosocomial, complete uncertainty over its source within the healthcare facility.
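The default prior can be written down in a few lines. The following is a minimal sketch, assuming sources are represented as string labels; the function name `default_prior` and the labels are illustrative choices, not part of the published implementation.

```python
def default_prior(candidates):
    """Uniform prior over nosocomiality.

    Half of the mass goes to the community source "C"; the other half is
    split equally between the unidentified hospital source "H" and each
    candidate individual. Labels are hypothetical.
    """
    hospital_sources = list(candidates) + ["H"]
    share = 0.5 / len(hospital_sources)
    prior = {source: share for source in hospital_sources}
    prior["C"] = 0.5
    return prior


# Three hypothetical candidates: each hospital source receives 1/(2(n + 1)) = 1/8.
prior = default_prior(["A1", "A2", "A3"])
```

With no candidates supplied, the same function places probability 1/2 on each of H and C.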

Bayesian analysis: likelihood

For Bayesian estimation of P(S = s|X), we need to define the likelihood of the data X for each potential value (or “hypothesis”) for the unknown S. We denote this L(X|S = s), and we now define it in turn for each s.

There are two broad classes of hypothesis: when the infection is not from a candidate individual (S ∈ {H, C}) and when it is (S ∈ {A₁, ..., A_n}). Within each class of hypothesis, the likelihood has the same general structure. We will treat them one at a time.

Likelihood of the data given that infection was from a non-candidate in the hospital or in the community, L(X|S ∈ {H, C}).

Under this class of hypothesis, the infection occurred in the community or from an unknown individual in the hospital.

We make the strong simplifying assumption that the components of X are generated independently of one another, and hence the likelihood factorises. This assumption rules out indirect transmission between candidate individuals and the focal individual: in a scenario where B’s infection came indirectly from A_z via a person in H or C, the onset time of A_z would not be independent of that of B.

Under this assumption:

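$$L(X \mid S = H) = L(X_H \mid S = H) \prod_{z=1}^{n} L(X_{A_z} \mid S = H) \tag{1}$$

and

$$L(X \mid S = C) = L(X_H \mid S = C) \prod_{z=1}^{n} L(X_{A_z} \mid S = C) \tag{2}$$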

We will now derive each component of these likelihoods separately.

Likelihood of X_H given that infection was in the community, L(X_H|S = C).

This is the likelihood for the onset time of person B, o_B, given their admission time, a_B. This is obtained by specifying parametric models for B’s (unknown) infection time, i_B, assumed to have density f(i_B), and the incubation time between B’s infection and onset, o_B − i_B, assumed to have density g(o_B − i_B).

Suppose we knew the infection time was t; then the onset time is t plus the incubation time. The probability of observing o_B could then be obtained directly from the model for the incubation time, that is, g(o_B − t). However, we do not know the infection time, so the likelihood for o_B is determined by integrating this probability over the range of values of t compatible with having acquired the infection outside of hospital, that is, infection times between the start of the epidemic, t_0, and the admission time a_B:

$$L(X_H \mid S = C) = \int_{t_0}^{a_B} f(t)\, g(o_B - t)\, dt \tag{3}$$

Likelihood of X_H given that infection was from a non-candidate in the hospital, L(X_H|S = H).

This is as L(X_H|S = C), except that the integral is taken over the range of infection times that are compatible with infection being acquired from an unidentified individual within the hospital, that is, between B’s admission time and B’s onset time.
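The two onset-time likelihoods differ only in their integration window, which a short numerical sketch makes concrete. This sketch assumes a uniform infection-time density over each window and a lognormal incubation density whose values 1.434 and 0.6612 (quoted later in the paper) are treated as log-scale parameters; the function names, the midpoint rule and the example dates are illustrative assumptions rather than the published implementation.

```python
import math


def lognormal_pdf(x, meanlog=1.434, sdlog=0.6612):
    """Incubation-time density; parameters assumed to be on the log scale."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - meanlog) / sdlog
    return math.exp(-0.5 * z * z) / (x * sdlog * math.sqrt(2.0 * math.pi))


def onset_likelihood(onset, lower, upper, steps=2000):
    """Midpoint-rule integral of Uniform(lower, upper) x incubation density.

    `lower` and `upper` bound the infection times allowed by the hypothesis.
    """
    if upper <= lower:
        return 0.0
    width = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        t = lower + (i + 0.5) * width        # candidate infection time
        total += lognormal_pdf(onset - t) * width
    return total / (upper - lower)           # uniform density 1 / (upper - lower)


# Hypothetical timeline in days, measured relative to B's admission.
epidemic_start, admission, onset = -80.0, 0.0, 2.0
lik_community = onset_likelihood(onset, epidemic_start, admission)  # S = C
lik_hospital = onset_likelihood(onset, admission, onset)            # S = H
```

In this made-up example the hospital likelihood is the larger of the two, largely because the community window is long and its uniform density is correspondingly thin.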

Likelihood of X_{A_z} given that infection was from a non-candidate in the hospital or in the community, L(X_{A_z}|S ∈ {H, C}).

X_{A_z} contains the onset time of A_z; the difference in terms of SNPs between the sequenced genomes of the pathogens infecting A_z and B; a vector of 1s and 0s describing on which days A_z and B were in the same location; and, for any unobserved elements of this vector, elicited probabilities that A_z and B were in contact on those days, w(A_z,B). Note that, as none of these data are impacted by whether B was infected in the hospital or the community, the likelihood is identical under both S = H and S = C. Under these hypotheses, A_z was not the source of B’s infection; hence, we assume that the components of A_z’s data (A_z’s onset time, the genetic difference between the viruses affecting A_z and B, and A_z’s co-location with B) are independent of each other. As before, this independence is plausible if there was no intermediate transmission between A_z and B. This implies that, as A_z did not infect B (and vice versa), their co-locations are independent of the other data in X_{A_z}, and the genetic data are independent of the epidemiological data and thus can be modelled separately.

Given this, we have:

(4)

where and are the random variables underlying the observed genetic data and co-location data .

We already have a parametric model for onset time, which we applied to B’s onset time above. The same model can be applied to A_z’s onset time, with the integration for being over the possible infection dates for A_z, that is between and .

For the co-locations, we follow the approach taken in A2B [11]. Individuals either are or are not in contact on any particular day, giving a total of 2^|D| potential contact history vectors for |D| days. Assuming that none of these contact histories is more or less likely than any other given that A_z did not transmit to B, the observed contact history then has probability 2^-|D|.

To specify we make use of the coalescent [16–18]. All the necessary genealogical theory for this section is reviewed in Hudson 1990 [19]. Assume that the viruses are evolving under a Poisson process with rate, M. Under the coalescent, the number of generations to the most recent common ancestor (MRCA), for two randomly chosen individuals, is exponentially distributed with rate given by the inverse of the (effective) population size, N_e. Let represent this random variable. Since this represents the number of generations since the pathogens infecting A_z and B last shared a common ancestor, the pathogens are separated by generations of independent evolution at rate M. Hence, during the time in standard units spanned by these generations, we would expect the number of SNPs generated through evolution to be Poisson distributed with mean .

Our assumption that the alignment has no gaps or ambiguities is unrealistic, so we create an accounting variable for each candidate individual and the focal individual , which corresponds to the effective genome size after ambiguous sites and gaps have been removed. We assume that the alignment of length is comparable to the unrealised complete alignment of length G. We use this new variable to correct the mean to .

As is an exponentially distributed random variable with rate 1/N_e, is also exponentially distributed with rate, . As a Poisson distribution with a Gamma-distributed random rate parameter is equivalent to a Negative Binomial distribution, the number of mutations generated through evolution between the two sequences can be modelled as NB[].

In addition, there would then be differences added by sequencing error in both genomes. We can model the number of sequencing errors as a Binomial random variable with probability E, the per base error probability and number of trials 2G, double the genome size, as this occurs in both genomes. As E will be small and 2G is large, we approximate this Binomial distribution with a Poisson distribution with rate 2EG. Again, we correct for the observed genome length by modifying this rate to .

Therefore, the total number of genetic differences between the two genomes is the sum of the Negative Binomial distribution describing the SNPs generated through mutation and the Poisson approximation to the Binomial distribution describing the SNPs generated through sequencing error (assuming no back mutation). The sum of a Negative Binomial distributed random variable and a Poisson distributed random variable is Delaporte distributed [20]. Hence, the likelihood is:

(5)

Note that if the isolates are collected on different days with time difference in standard units, , the distribution of the time between them would be Exp() instead of just Exp(). This can be accounted for by modifying the parameter of the Delaporte distribution from to .
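The Delaporte construction above can be sketched directly as a convolution. This is a minimal illustration, assuming isolates sampled on the same day (so the sampling-time correction just described is ignored) and using the fact that a Poisson count with an Exponentially distributed mean is geometric, i.e. Negative Binomial with size 1. N_e = 51, g = 5.5 days and the genome scale come from the text; the mutation rate, error rate and effective alignment length below are hypothetical placeholders, and all function and variable names are illustrative.

```python
import math


def poisson_pmf(k, mean):
    """Poisson probability mass, computed on the log scale for stability."""
    if mean == 0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def delaporte_pmf(k, mutation_mean, error_mean):
    """P(K = k) for K = mutation SNPs + sequencing-error SNPs.

    Mutation SNPs: Poisson with an Exponential mean of expectation
    `mutation_mean`, which is geometric with p = 1 / (1 + mutation_mean).
    Error SNPs: Poisson(error_mean). The sum is evaluated by convolution.
    """
    p = 1.0 / (1.0 + mutation_mean)
    return sum(p * (1.0 - p) ** i * poisson_pmf(k - i, error_mean)
               for i in range(k + 1))


N_E = 51                          # effective population size (from the text)
GENERATION_YEARS = 5.5 / 365.25   # generation time of 5.5 days, in years
EFFECTIVE_LENGTH = 29000          # HYPOTHETICAL gap-free alignment length
MUTATION_RATE = 1e-3              # HYPOTHETICAL, per site per year
ERROR_RATE = 1e-6                 # HYPOTHETICAL, per base

# Two lineages, each evolving for the (random) time back to the MRCA.
mutation_mean = 2 * N_E * GENERATION_YEARS * MUTATION_RATE * EFFECTIVE_LENGTH
# Sequencing errors accumulate in both genomes.
error_mean = 2 * ERROR_RATE * EFFECTIVE_LENGTH

likelihood_3_snps = delaporte_pmf(3, mutation_mean, error_mean)
```

Direct convolution is adequate here because observed SNP counts between hospital isolates are small; a closed-form Delaporte mass function would be preferable for large counts.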

Likelihood of the data given that infection was from a candidate individual, L(X|S = A_z).

Under this class of hypothesis, B was infected by one of the candidate individuals A_z. We assume then that the data X_{A_j}, which describe the relationship between B and each other individual A_j, j ≠ z, are independent between each A_j, and independent of the data that describe the relationship between the infecting A_z and B. That is:

$$L(X \mid S = A_z) = L(X_H, X_{A_z} \mid S = A_z) \prod_{j \neq z} L(X_{A_j} \mid S = A_z) \tag{6}$$

L(X_{A_j}|S = A_z) takes the same form as L(X_{A_z}|S ∈ {H, C}). All that remains is to generate a parametric model for L(X_H, X_{A_z}|S = A_z).

Likelihood of X_H and X_{A_z} given that infection was from the candidate individual A_z, L(X_H, X_{A_z}|S = A_z).

We partition the data into event times T, genetic distance , and co-location D components, and rearrange in terms of the conditional probability of B’s data given A_z’s onset time .

(7)

has been defined above.

We then obtain using the same technique as in the A2B model [11]. This involves expanding this term by summing it over B’s unknown time of infection , and assuming that B’s onset time, the genetic distance of B’s pathogen from A_z’s, and B’s co-location with A_z are conditionally independent given this infection time. Specifically:

(8)

Furthermore, we note that the onset time of A_z provides no extra information after conditioning on the infection time of B, so we can simplify as follows:

(9)

(10)

(11)

Any distributional form could be assumed for and this choice should be specific to the pathogen in question and informed from its epidemiological literature.

For the term , we take a similar approach to that used for the genetics in , but the situation is drastically simplified as the infection time is known. We make the assumption of no within-host variation in the pathogen, so that there is no risk of incomplete lineage sorting and the time of the MRCA of B and A_z is exactly the infection time of B. If t is the infection time of B, and and are the sampling times of the pathogen genomes of B and A_z respectively, then there has been time units of independent evolution for the pathogens. Given the Poisson process assumption for mutation being used, we expect the number of SNPs between the genomes to follow a Poisson distribution with mean , which, after accounting for partial observation, becomes . Again, we assume that additional mutations are generated by sequencing error, following a Poisson distribution with mean . As the sum of two Poisson random variables is Poisson, this gives that:

(12)
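The Poisson genetic likelihood for a known infection time can be sketched as follows. This is an illustration under stated assumptions: both genomes are taken to be sampled on or after the infection day, times are in days while the mutation rate is per site per year, and every name and numeric value is a hypothetical placeholder rather than the authors' notation or parameterisation.

```python
import math


def snp_likelihood_given_infection(snps, infection_day, sample_day_b, sample_day_a,
                                   mutation_rate, error_rate, effective_length):
    """Poisson probability of `snps` differences if A_z infected B on `infection_day`.

    Independent evolution runs from the infection time to each sampling time,
    and sequencing errors can occur in both genomes.
    """
    if sample_day_b < infection_day or sample_day_a < infection_day:
        raise ValueError("sketch assumes both samples are taken after infection")
    branch_days = (sample_day_b - infection_day) + (sample_day_a - infection_day)
    mean = (mutation_rate * effective_length * branch_days / 365.25
            + 2 * error_rate * effective_length)
    if mean == 0:
        return 1.0 if snps == 0 else 0.0
    return math.exp(snps * math.log(mean) - mean - math.lgamma(snps + 1))


# Hypothetical inputs: infection on day 10, samples on days 14 and 12.
lik_zero_snps = snp_likelihood_given_infection(
    snps=0, infection_day=10, sample_day_b=14, sample_day_a=12,
    mutation_rate=1e-3, error_rate=1e-6, effective_length=29000)
```

Because only a few days of evolution separate the two genomes under direct transmission, the expected SNP count is small and identical or near-identical sequences receive most of the probability mass.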

Finally, consider the likelihood term for the history of co-location of individuals A_z and B under a hypothesis that A_z infected B at time t. These data consist of a vector of , where or if A_z and B are known to have been co-located, or not co-located, respectively, for each day c. On some occasions is unknown, and we assume we have an elicited probability that they were in contact on that day. In the case of fully observed location data, w(A_z,B) is undefined.

Then, assuming conditional independence of the co-location data on each day, we have:

(13)

where

(14)

at times c when is observed. When contact status is unobserved, this likelihood contribution is defined as a weighted average over the missing contact indicator , using the estimated contact probabilities as weights, giving:

(15)

To specify these likelihood contributions, firstly note that if , then A_z and B must have been in contact at day t, since contact is necessary for infection. Therefore if c = t, then . At any other time c (following Illingworth et al. [11]) we suppose that all observed co-location patterns are equally plausible, implying that .
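The co-location rules above can be summarised in a short sketch. It assumes contact histories are stored as a mapping from day to 1 (co-located), 0 (not co-located) or None (unobserved), with elicited contact probabilities supplied for the unobserved days; this data layout and the function name are illustrative assumptions.

```python
def colocation_likelihood(contacts, infection_day, contact_prob=None):
    """Likelihood of a contact history if A_z infected B on `infection_day`.

    On the infection day contact is required, so P(contact) = 1; on any
    other day contact and no contact are treated as equally plausible (1/2).
    Unobserved days contribute a weighted average over the missing indicator,
    using the elicited contact probability as the weight.
    """
    contact_prob = contact_prob or {}
    likelihood = 1.0
    for day, status in contacts.items():
        p_contact = 1.0 if day == infection_day else 0.5
        if status is None:
            w = contact_prob[day]
            likelihood *= w * p_contact + (1.0 - w) * (1.0 - p_contact)
        elif status == 1:
            likelihood *= p_contact
        else:
            likelihood *= 1.0 - p_contact
    return likelihood


# Hypothetical history: together on day 1, apart on day 2, unknown on day 3.
history = {1: 1, 2: 0, 3: None}
elicited = {3: 0.3}
lik_by_day = {t: colocation_likelihood(history, t, elicited) for t in history}
```

In this example infection on day 2 has likelihood zero, which is how the location data rule out transmission on days when the pair were known to be apart.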

Bayesian analysis: inference

With the likelihood and priors defined as above, the posterior probabilities of the different categorical sources of infection are available analytically, and can simply be calculated directly, without recourse to any kind of numerical integration, such as MCMC. That is:

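$$P(S = s \mid X) = \frac{P(S = s)\, L(X \mid S = s)}{\sum_{s' \in \{A_1, \ldots, A_n, H, C\}} P(S = s')\, L(X \mid S = s')} \tag{16}$$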

This gives us the posterior probability of each of the possible infection sources being the source of the infection. The community probability, P(S = C|X), is the probability the infection was obtained outside the medical facility. The nosocomiality probability is 1 − P(S = C|X), that is, the sum of all the non-community probabilities. The probability of the hospital compartment, P(S = H|X), is the probability that the infection came from a source within the medical facility that had not been a priori identified by the medical staff.

Note that, under NOSTRA, if no candidate infectors are provided the posterior is purely a function of the length of the waiting time between admission and symptom onset/detection:

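$$P(S = H \mid X) = \frac{P(S = H)\, L(X_H \mid S = H)}{P(S = H)\, L(X_H \mid S = H) + P(S = C)\, L(X_H \mid S = C)} \tag{17}$$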
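Because the source is categorical, the posterior is just a normalised product of prior and likelihood. The sketch below assumes sources are held in dictionaries keyed by label; the likelihood values are made-up numbers for illustration only, and the function name is hypothetical.

```python
def posterior(prior, likelihood):
    """Normalise prior x likelihood over the categorical infection sources."""
    unnormalised = {s: prior[s] * likelihood[s] for s in prior}
    total = sum(unnormalised.values())
    return {s: value / total for s, value in unnormalised.items()}


# One hypothetical candidate A1 plus the hospital (H) and community (C) sources.
prior = {"A1": 0.25, "H": 0.25, "C": 0.5}
likelihood = {"A1": 0.08, "H": 0.02, "C": 0.01}   # illustrative values only

post = posterior(prior, likelihood)
p_nosocomial = 1.0 - post["C"]   # sum of all non-community probabilities
```

Dropping the candidate entries from both dictionaries reduces the calculation to the two-source case, where only the onset-time likelihoods matter.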

Model illustration

This section briefly states which distributions and parameters we used for our illustration of the model with the CUH SARS-CoV-2 data. We use the default prior described in the prior section as the prior on S. is given a uniform distribution between and . is given a uniform distribution between and . Following Illingworth et al. [11,21], we give a lognormal distribution with mean = 1.434 and standard deviation = 0.6612, and approximate with 0.404. We follow Ferretti et al. [22] and give a shifted Student’s t-distribution with shift = -0.078, scale = 1.857, and df = 3.345, and set the generation time, g, to 5.5 days. We took the mutation rate from Wang et al. 2022 [23] and set N_e = 51 based on an approximation from their figure. We set to the 30th December 2019, the earliest admission time for a patient on the ward.

Simulation validation

We performed a simulation analysis to validate this method. We used a SARS-CoV-2 calibrated individual-based hospital simulation model previously generated by some of the authors [24] to generate full infection histories of infections detected in hospital, some of which are acquired in the community and some of which are acquired from patients or healthcare workers in the hospital. As we have the full infection histories of these patients, we know what the true infection source is. All data required for NOSTRA other than the genetics are generated during the process of running the model.

The transmission trees implicitly generated by the model are only for infections that occurred within the hospital itself, so the result is a set of disconnected graphs, with each graph corresponding to the hospital outbreak from a single imported SARS-CoV-2 infection. In order to simulate required genetics, we assumed that mutation occurred independently across each transmission tree following a Poisson process with rate yr⁻¹ site⁻¹. We drew the number of SNPs between each transmission tree from a distribution, with N_e = 51, g = 5.5 days, yr⁻¹ site⁻¹, G = 29811, being the starting time of the simulation, and being the time of the first event in the nth transmission tree.

We simulated the data under three broad conditions: high (6% community prevalence, 0.15% of new admissions infected), intermediate (4% community prevalence, 1% of new admissions infected) and low (2% community prevalence, 0.05% of new admissions infected) infection prevalence. Within each condition, simulations were executed for each of 20 calibrated hospital infection parameter sets (as described in Evans et al. 2021 [25], see S1 Table for precise parameter sets used), giving a total of 60 simulations. The hospital in each simulation has 42 wards.

NOSTRA was then run with and without candidate individuals on the data at the ward level on patients with detected infections admitted after the 50th day of the simulation (to allow the model to equilibrate) using the same parameter values reported above for the illustration. We used our advised default prior of P(S = C) = 0.5, with the rest of the probability divided equally between the rest of the sources. This represents complete uncertainty as to whether an infection is nosocomial or not. We assessed the model posterior against three references: the above prior, the (usually unknown) true prevalence of nosocomiality, and a rule of thumb where every detection 96 hours post-admission is assigned a nosocomial probability of 1 and every detection prior to this is assigned a nosocomial probability of 0. The true prevalence condition corresponds to a prior in which P(S = C) is equal to the true proportion of cases after the 50th day of the given simulation that were actually acquired in the community. The remaining probability is divided equally between the rest of the sources.

To quantify the quality of the estimates we used Brier scores [26], which measure the calibration of probabilistic forecasts. This was done on three targets: nosocomiality itself, the true infection source, and individuals within the same transmission chain as the true infection source. This allows us to distinguish skill at assessing whether an infection is nosocomial from skill at identifying where in the hospital it came from. The transmission chain test was performed because we expect, from the structure of the model, that it may inappropriately assign high probability of sourcehood to individuals who are in the same transmission chain as the true infector when that infector is not detected, given that the data for those individuals will look very similar to those of the true infector. As such, the transmission chain test assesses whether the correct chains of transmission are being detected by the model. The tests were done on the full dataset of patients admitted after the 50th day, and for each subset of patients whose time difference between admission and detection was 0 to 9 days and greater than 9 days. These subsets should represent different degrees of challenge for the model, with, for example, those detected on the day they are admitted being easily identifiable as community-acquired, and the hardest cases being those detected a day or two after admission, where both within-hospital and external infection are plausible. We used one-sided Bonferroni-corrected Wilcoxon signed rank tests to compare the performance of NOSTRA to the references used for the Brier scores from the full dataset. The use of the one-sided test is justified as we are only interested in whether NOSTRA performs better than the references. The rule of thumb and NOSTRA without candidate infectors references were only used in the nosocomial assessment comparison.
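For the binary nosocomiality target, the Brier score is the mean squared difference between the forecast probability and the 0/1 outcome. The sketch below covers only that binary case (not the multi-source targets), and the probabilities and outcomes are made-up values used purely to show how a model forecast is compared with the flat-prior and rule-of-thumb references.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes):
        raise ValueError("need one outcome per forecast")
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(outcomes)


# Hypothetical example: 1 = truly nosocomial, 0 = community-acquired.
truth = [1, 0, 1, 0]
model_forecast = [0.9, 0.2, 0.7, 0.1]    # illustrative posterior probabilities
prior_reference = [0.5, 0.5, 0.5, 0.5]   # flat prior on nosocomiality
rule_of_thumb = [1.0, 0.0, 0.0, 0.0]     # hard 0/1 calls from a time cut-off

scores = {
    "model": brier_score(model_forecast, truth),
    "prior": brier_score(prior_reference, truth),
    "rule_of_thumb": brier_score(rule_of_thumb, truth),
}
```

Lower scores are better; a flat 0.5 forecast always scores 0.25, which gives a natural baseline for the comparison.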

As analyses were run at the ward level, and all identified candidate individuals were therefore from the same ward, if the true infection source was in the hospital, but resulted from a cross-ward transmission chain, the truth was set as the H source. Probabilities were converted from the per individual source level to the transmission tree level by summing over the individuals involved in each transmission tree.
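The summation to the transmission tree level can be sketched as follows. This is an illustrative reading only: the labels, the chain identifiers, and the choice to leave sources without a chain assignment (such as the unidentified hospital and community sources) as their own categories are assumptions of the example.

```python
from collections import defaultdict


def chain_level_probabilities(source_probs, chain_of):
    """Sum per-source posterior probabilities within each transmission chain.

    source_probs: dict mapping source label -> posterior probability.
    chain_of: dict mapping candidate label -> transmission chain identifier.
    Sources absent from chain_of are kept as separate categories.
    """
    totals = defaultdict(float)
    for source, p in source_probs.items():
        totals[chain_of.get(source, source)] += p
    return dict(totals)


# Hypothetical posterior for one focal individual.
posterior = {"P1": 0.4, "P2": 0.3, "P3": 0.1, "Hospital": 0.15, "Community": 0.05}
chain_of = {"P1": "chain_1", "P2": "chain_1", "P3": "chain_2"}
print(chain_level_probabilities(posterior, chain_of))
```

Here the two candidates in `chain_1` contribute a combined probability of about 0.7, even though neither individually exceeds 0.4.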

Results

Model illustration

NOSTRA provides an estimate of the posterior probability for all of the possible infection sources (Fig 1), with rows corresponding to focal individuals and columns to infection sources. The probability of nosocomial infection, shown in the final column, is then the sum of the probabilities over the candidate individuals and hospital component.
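Written out, with $A_z$ denoting the $z$-th of $n$ candidate individuals and $H$ the hospital component as elsewhere in the paper, and taking $C$ for the community source and $D$ for the data as labels assumed here for illustration, the final column is:

```latex
P(\text{nosocomial} \mid D) = P(H \mid D) + \sum_{z=1}^{n} P(A_z \mid D) = 1 - P(C \mid D)
```

The second equality holds because the posterior probabilities over all sources sum to one.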

Fig 1. A visualisation of the output of the model applied to the full CUH Ward A dataset.

Each row corresponds to a candidate individual, each column, except the last, to a potential infection source. Sources starting with “CAMP” are identified patients within the hospital. The “Hospital” source represents all unidentified sources within the hospital. The “Community” source is all sources outside the hospital. Cells are coloured by the posterior probability of that infection source. The last column shows the posterior nosocomiality probability, which is $1 - P(\text{Community} \mid \text{Data})$.

https://doi.org/10.1371/journal.pcbi.1012949.g001

As NOSTRA is capable of running with only symptom dates, we can assess the impact of each of the data sources on the posterior by adding one at a time. This is shown in Fig 2, where the change in posterior probability of the infection sources as different data is added is visible. We exclude patients CAMP000676 and CAMP000706, as their admission times and location data were unavailable. In this dataset, the genetics has a very large impact on the generated posterior probabilities. This appears to be driven by the genetic consistency of specific candidate individuals’ viral isolates with the focal individual’s viral isolate, which dramatically reduces the posterior probabilities of the hospital and community compartments. For this dataset, the impact of the location information on the posterior is contingent on the data already in the model, with it only causing a large change in posterior probabilities if the genetics has already been added. This is because the strength of the location data is to rule out transmission by identifying potential transmission pairs who never interacted at the appropriate time and thus are very unlikely to have been linked. Hence, the location data has the largest impact when it removes infection sources that were favoured by the genetics. Another example of the concentration of the posterior around specific sources as more data is added is shown for the analysis of one of the simulation outputs in S1 Fig.

Fig 2. The impact of adding data sources on the assessed probability of each infection source.

Panel a. shows the posterior probabilities of each infection source when only symptom onsets and admission times are provided to NOSTRA. The other panels explore the impact on the posterior probabilities as different data is added. Panel b. shows the change in posterior probability from the estimates in panel a. when location information is added. Panel c. shows the change in posterior probability from the estimates in panel a. when genetic information about the pathogen is added. Panel d. shows the change in posterior probability from the model including onset times, admission times and patient locations, when genetic information about the pathogen is added. Panel e. shows the change in posterior probability from the model including onset times, admission times and genetic information about the pathogen, when patient locations are added.

https://doi.org/10.1371/journal.pcbi.1012949.g002

Model validation

Fig 3 and Table 2 summarise NOSTRA’s calibration for the different targets (nosocomiality, source and transmission chain) compared against a series of references. The simulation prevalence has no consistent impact on the performance, though there is substantial variability among prevalence-parameter set combinations. NOSTRA performs well at nosocomial prediction, with a mean absolute error in probability of 0.102, significantly exceeding the calibration of all references, including the unknown true frequency of nosocomiality. There is a clear gain in skill for nosocomial assessment from the joint estimation of the infection sources and nosocomiality relative to estimating nosocomiality alone. To illustrate this, the NOSTRA model which does not include candidates performs worst among all comparisons, while the NOSTRA model including the candidates performs best. With respect to source attribution, NOSTRA performs significantly better than the comparator references, but only by a small amount. In the transmission chain task, NOSTRA performs comparably to its ability at nosocomial prediction despite the extra difficulty. Given this, it appears that the poor source attribution calibration is predominantly due to putting large amounts of posterior probability on candidate sources within the same transmission chain as the true source.

Fig 3. The calibration of NOSTRA versus references as measured by Brier score.

The calibration of NOSTRA versus references as measured by Brier score for nosocomiality assessment (left), source identification (middle), and transmission chain identification (right). Low scores indicate better calibration. Points are coloured by the prevalence used in that simulation (see methods). The large black points correspond to the mean across simulations. NOSTRA, Candidates is the NOSTRA model run with a full set of candidate individuals and all data. NOSTRA, No Candidates is the NOSTRA model run with no candidate individuals using Eq 17. Prevalence Prior sets the prior probability of nosocomiality to the true probability of nosocomiality in the dataset. Naïve Prior sets the prior probability of community infection to 0.5 and the probability of every source within the hospital to $0.5/(n+1)$, where n is the number of candidate individuals in the hospital. 96hr Categorisation assigns a nosocomiality probability of 0 to anything detected in the first 96 hours post-admission and a nosocomiality probability of 1 to everything else. The brackets show the Bonferroni-corrected p-values of the one-tailed paired Wilcoxon signed rank test that the Brier score of the NOSTRA model run with candidates is lower than each reference.

https://doi.org/10.1371/journal.pcbi.1012949.g003

Table 2. The arithmetic mean, and 0.025 and 0.975 quantiles of the Brier scores of the different models for each of the different estimation targets. NOSTRA, Candidates is the NOSTRA model run with a full set of candidate individuals and all data. NOSTRA, No Candidates is the NOSTRA model run with no candidate individuals using Eq 17. Prevalence Prior sets the prior probability of nosocomiality to the true probability of nosocomiality in the dataset. Naïve Prior sets the prior probability of community infection to 0.5 and the probability of every source within the hospital to $0.5/(n+1)$, where n is the number of candidate individuals in the hospital. 96hr Categorisation assigns a nosocomiality probability of 0 to anything detected in the first 96 hours post-admission and a nosocomiality probability of 1 to everything else.

https://doi.org/10.1371/journal.pcbi.1012949.t002

The waiting time from admission to detection is related to the difficulty of assigning an infection’s nosocomial status. When infection is detected on the day of admission, it is a simple task to determine that the infection was acquired in the community, and, likewise, if the infection is detected several weeks after admission, there is little doubt that an infection is nosocomial. Therefore, the most difficult cases are those that occur in the period consistent with both hospital and community acquisition. Fig 4 summarises the calibration of the model under different times from admission to detection. NOSTRA performs comparably to its average performance on the complete dataset (Fig 3) over this entire time period, with its worst calibration occurring for patients whose infections are detected one day after their admission to hospital, the period in which nosocomiality would be expected to be most challenging to assess.

Fig 4. The calibration of NOSTRA versus references as measured by Brier score by the time between admission and detection of infection.

The calibration of NOSTRA versus references as measured by Brier score for nosocomiality assessment (left), source identification (middle), and transmission chain identification (right) by the time between admission and detection of infection. Low scores indicate better calibration. The large red points correspond to the mean across simulations. NOSTRA - Candidates is the NOSTRA model run with a full set of candidate individuals and all data. NOSTRA - No Candidates is the NOSTRA model run without any candidate individuals, using Eq 17. Prevalence Prior sets the prior probability of nosocomiality to the true probability of nosocomiality in that simulation run. Naïve Prior sets the prior probability of nosocomiality to 0.5. 96hr Categorisation assigns a nosocomiality probability of 0 to anything detected in the first 96 hours post-admission and a nosocomiality probability of 1 to everything else.

https://doi.org/10.1371/journal.pcbi.1012949.g004

Discussion

We have presented NOSTRA, a new model that integrates both epidemiological and genetic data to give a posterior distribution over the potential infection origins of an individual. This new model represents the conceptual unification of the HOCI tool [6], for nosocomiality assessment, and the A2B model [11], for infection source identification. As all the terms in our model are mathematically tractable, we get the posterior distribution over sources in analytic form, allowing us to avoid any numerical integration and keep runtime low. There are few published models designed to estimate whether an infection is nosocomial [6–8], and to our knowledge no previous models were designed to jointly estimate the probability of nosocomiality and infection sources within the hospital.
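Because the set of possible sources is finite, once each likelihood term has a closed form the posterior follows by direct normalisation. The sketch below shows only that final step, with made-up likelihood values standing in for NOSTRA's actual timing, location and genetic terms; the function name, labels and numbers are hypothetical, and multiplying the terms reflects an assumption of independence between data types.

```python
def posterior_over_sources(prior, likelihood_terms):
    """Posterior over discrete infection sources by direct normalisation.

    prior: dict mapping source label -> prior probability.
    likelihood_terms: list of dicts, one per data type, each mapping
        source label -> likelihood of that data type under that source.
    """
    unnormalised = {}
    for source, p in prior.items():
        weight = p
        for term in likelihood_terms:
            weight *= term[source]
        unnormalised[source] = weight
    total = sum(unnormalised.values())
    return {source: w / total for source, w in unnormalised.items()}


# Hypothetical values: one candidate, plus the hospital and community sources.
prior = {"A_1": 0.25, "Hospital": 0.25, "Community": 0.5}
timing = {"A_1": 0.2, "Hospital": 0.2, "Community": 0.1}
genetics = {"A_1": 0.5, "Hospital": 0.1, "Community": 0.1}

print(posterior_over_sources(prior, [timing]))
print(posterior_over_sources(prior, [timing, genetics]))
```

Passing progressively longer lists of likelihood terms mirrors the one-at-a-time addition of data sources used in the illustration.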

NOSTRA provides probabilities, not dichotomised answers to the question of whether an infection is nosocomial. An important question that we have not attempted to answer in this work is how healthcare workers should interpret these probabilities for action. We believe that labelling a particular infection as “nosocomial” (or otherwise) on the basis of the model probability is, in effect, a policy decision. As such, this should be taken with reference to the potential costs and benefits of any action that would be taken in response to that decision, and how these would change if the classification were wrong. Therefore, it is important that thresholds for labelling be determined by individuals embedded within healthcare systems who are aware of the economic consequences of the decision for that particular system, and it is implausible that any one-size-fits-all approach will be appropriate.
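One standard way to see why the threshold depends on local costs is a simple expected-cost rule. This is a textbook decision-theoretic sketch, not part of NOSTRA or a recommendation from this work; the function names and cost values are hypothetical, and correct decisions are assumed to carry no additional cost.

```python
def labelling_threshold(cost_false_positive, cost_false_negative):
    """Probability above which labelling as nosocomial minimises expected cost.

    cost_false_positive: cost of responding to an infection that was in fact
        community-acquired (hypothetical units).
    cost_false_negative: cost of not responding to a truly nosocomial infection.
    Labelling is preferred when p * cost_false_negative exceeds
    (1 - p) * cost_false_positive, which rearranges to the ratio below.
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def should_label_nosocomial(p_nosocomial, cost_false_positive, cost_false_negative):
    """Apply the expected-cost threshold to a posterior probability."""
    return p_nosocomial > labelling_threshold(cost_false_positive, cost_false_negative)


# Hypothetical costs: a missed nosocomial case is four times as costly
# as an unnecessary response.
print(labelling_threshold(1.0, 4.0))
print(should_label_nosocomial(0.3, 1.0, 4.0))
```

Two facilities with different cost ratios would therefore act differently on the same posterior probability, which is why no single threshold is proposed here.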

The results from the CUH dataset give us some evidence on the kinds of data that may be useful to collect for the assessment of nosocomiality, irrespective of the method that is to be applied. In this case, the genetic data were very informative, as they allowed high probability candidates within the hospital to be identified. This suggests that routine sequencing of hospital pathogens may allow better assessments of nosocomiality, as well as being useful for tracking transmission networks. The waiting time between admission and onset provides a great deal of information about where the infection was acquired. If this waiting time is almost always less than five days, and the patient was admitted six days ago, then the probability of nosocomiality assessed by NOSTRA is always going to be high, even if there are no genetically consistent identified candidates. Therefore, the genetic and location information are going to have the most impact on the assessed probability of nosocomiality in “difficult” cases, that is, when the time of onset is consistent with both hospital and community transmission, either because the observed time is right in the middle of the waiting time distribution, or because the waiting time is highly variable. Thus, we theorise that hospital sequencing of isolates may be especially valuable for pathogens with highly variable incubation times.

The simulation results we present help make clear NOSTRA’s strengths and weaknesses. Namely, the Brier scores indicate that, for the agent-based SARS-CoV-2 hospital infection model we used, NOSTRA performed well in absolute terms at estimating whether an infection was nosocomial and identifying the transmission chain the true source was in, but comparatively poorly at identifying the precise identity of that source. Even if the model has been well parameterised for the specific pathogen under study, in real usage we expect a decrease in NOSTRA’s performance relative to what was seen here, because the way that the genetic data was simulated closely matches NOSTRA’s assumptions. While the simulation model generates transmission trees, and hence does not conform to the assumption of independence between different sources we made in the derivation of the likelihood, the mutation process on those transmission trees follows the model that NOSTRA uses to calculate the genetic likelihood, with the only difference being the lack of additional noise from sequencing. The complexity of the substitution process in reality will lead to a degradation in performance of this part of the model. However, as the rest of the data were not simulated under NOSTRA’s assumptions, and thus NOSTRA has already shown some robustness to violations here, we expect less loss of performance in the other components of the model.

While NOSTRA is currently the only real option for joint estimation of nosocomiality and infection sources, it has some important caveats for use. We summarise the caveats potential users should take into account in Box 1.

Box 1: Caveats for the usage of NOSTRA

NOSTRA requires well defined generation and incubation times of the pathogen under study
NOSTRA requires knowledge about the mutation rate and effective population size of the pathogen under study
NOSTRA requires that the pathogen is directly transmitted between individuals
NOSTRA should not be applied to pathogens with high levels of within host variation
NOSTRA should not be applied to focal individuals who are asymptomatic
NOSTRA should not be applied to pathogens where indirect transmission is believed to be important
Infection candidates with high posterior probability of sourcehood should be considered linked infections rather than definite sources

A first caveat regards our handling of the genetics of the pathogen. We implicitly assume that there is a single genetic type at any time in each host. Explicitly, we assume that given data for A_z and B, when A_z was the infection source, the time of the most recent common ancestor was the point of infection. This is only true if A_z had no within host variation in its pathogen. If there is within host variation, this actually represents a lower bound on the time to the common ancestor, due to the potential for incomplete lineage sorting [27]. This is unlikely to cause a large issue in most cases in which we envision that NOSTRA would be applied, given that the short generation times of most respiratory viruses mean that the upper bound on the time to the most recent common ancestor is likely to be close to the lower bound. However, in infections with long generation times, where large amounts of within host diversity may be generated and maintained, this assumption may have a large impact, causing the number of expected SNPs between the infector and infectee to be underestimated. To account for this, a more complicated model allowing for pathogen diversification within hosts after infection would be required. Thus, we advise that users should not use NOSTRA for pathogens with long generation times.

A second caveat is about our handling of missing data. While NOSTRA can run with all data other than the focal individual’s onset time being missing, we are making strong assumptions about the nature of that missingness in order to do this. We assume that the data is missing completely at random. That is, the missing data is a random subset of the full data, and its being missing is independent of the values of both the observed and unobserved data. The degree to which this assumption holds is likely to depend on the specifics of the pathogen and hospital that this is being applied to. For example, samples with low pathogen load are known to fail sequencing more frequently than those with high load, so in a case where failed sequences are not reattempted, a missing genetic sequence may be indicative of someone early or late in their infection course, or it may simply be that their sample was not sent for sequencing for an unrelated reason. A full understanding of the data provenance is necessary to assess whether this assumption is reasonable, and as such whether the model is appropriate to be applied in the case of the user’s specific missing data.

A type of missing data that is worth commenting on specifically is asymptomatic carriage. In the case of asymptomatic carriage, symptom onset times will be unobserved (and, indeed, undefined). While we did this for illustrative purposes with the CUH data, we strongly advise that, in practice, NOSTRA should not be applied to focal individuals who are asymptomatic. This is because there is no well-defined distribution that describes the waiting time between admission and detection for asymptomatic individuals. It is entirely a function of the testing strategy applied by the healthcare facility. This does not mean that NOSTRA is inapplicable to pathogens where asymptomatic carriage is an important factor in the epidemiological dynamics, as long as the focal individual is symptomatic. If one is willing to accept the missing completely at random assumption, candidate infectors who are asymptomatic can be included in the model with their onset times as missing data; this allows the genetics of their pathogen to still inform the likelihood. If one is not willing to make that missingness assumption, these individuals can simply be ignored, which is equivalent to including them in the “Hospital” source.

Another important kind of missing data is that of unrecognised infector candidates in the hospital. Our “Hospital” infection source allows for the true infector in the hospital to be unidentified. However, pathological behaviour is likely to occur if the true infector is unidentified, but there is a consistent infector in the set of identified candidates, likely from the same transmission chain. Under this circumstance, the consistent infector will be likely to be assessed as having high posterior probability at the expense of the “Hospital” source. Given this, the “Hospital” source should be considered a guard against the possibility that there are no likely identified candidates, but the timings of infection strongly suggest nosocomiality. This means that, in cases where the “Hospital” source is identified as the most probable, it does indicate that the detected infection genuinely is unrelated to anything in the candidate set. Therefore, while NOSTRA is not designed for this, with appropriately designed candidate sets (i.e. excluding a class of interest), NOSTRA could potentially be used to explore whether different classes of individuals are involved in within hospital outbreaks. However, we would advise that further validation studies be performed before this is attempted.

Related to missing the true infector and inappropriately assigning an individual from the same transmission chain as the source, is the case where the data are too weak to correctly identify which individual within a transmission chain is the source, even if they are in the set of candidate individuals. We believe that this explains NOSTRA’s relatively poor performance at identifying the true source, but very good performance at identification of sources within the same transmission chain in the simulations. Given a transmission chain that takes place over a week or two in a single ward, the number of SNPs separating the cases will be small and all detected cases are likely to have been co-located. In this case, NOSTRA commonly ends up placing similar probability over all cases. Unlike the case of the true infector being missing and another individual in the transmission chain being assigned high probability, this is not unintended behaviour, as there genuinely is uncertainty about the true source which cannot be resolved by the data. However, this does mean that in many cases a single source will not be identifiable. Given these issues, and the high accuracy achieved for the detection of individuals within the transmission chain seen in the simulations, we advise users to consider the high probability infection sources within the hospital as likely linked cases, rather than the definite source of the infection.

A third thing that users should take into account when using NOSTRA is its prior. A prior that is uniform across infection sources should be avoided, as this has the unfortunate consequence that the induced prior on whether the infection is nosocomial becomes a function of the number of candidate individuals within the hospital. For this reason, in actual usage, we advise firstly defining a prior probability that the infection occurred in the community, and then distributing the rest of the probability equally over the hospital-associated infection components. Our default throughout this work has been to set this prior probability of community infection to 0.5, and this performed well in our simulations, but an informed prior based on knowledge of the approximate frequency of nosocomial infections in the facility where NOSTRA is being used will lead to better performance in general.
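The dependence on the number of candidates can be checked directly. The sketch below compares the nosocomial prior implied by a uniform prior over sources with the community-anchored construction described above; the function names and source labels are hypothetical, and the source set is assumed to be the n candidates plus one hospital and one community source.

```python
def induced_nosocomial_prior_uniform(n_candidates):
    """Prior P(nosocomial) implied by a uniform prior over all sources.

    Sources are the n candidates, the "Hospital" source and the
    "Community" source, so all but one of n + 2 sources are nosocomial.
    """
    return (n_candidates + 1) / (n_candidates + 2)


def community_anchored_prior(n_candidates, p_community=0.5):
    """Fix P(community) first, then split the rest over hospital sources."""
    share = (1.0 - p_community) / (n_candidates + 1)
    prior = {"Community": p_community, "Hospital": share}
    for z in range(1, n_candidates + 1):
        prior[f"A_{z}"] = share
    return prior


for n in (1, 5, 20):
    anchored = community_anchored_prior(n)
    print(n, induced_nosocomial_prior_uniform(n), 1.0 - anchored["Community"])
```

Under the uniform prior the implied nosocomial probability rises from 2/3 with one candidate towards 1 as candidates are added, whereas the anchored prior keeps it at the chosen value regardless of the size of the candidate set.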

The complexity required for NOSTRA to model the multiple data sources that can be input to it means that it has higher requirements for prior knowledge about the biology of the pathogen of interest than many other similar epidemiological models. Specifically, studies must have been performed estimating an effective population size for the pathogen in the recent past, in order for the genetic likelihoods to be calculable. This is not a large problem for well-studied infections of the kind that we believe NOSTRA is likely to be applied to (e.g. SARS-CoV-2, RSV and influenza), where phylogenetic studies are regularly being performed and values will be accessible in the literature. However, it does represent a limitation with respect to understudied or novel pathogens, where effective population size estimates may not be available.

Another potential limitation is that to ensure analyticity of the posterior distribution, allowing its direct calculation, we had to make strong independence assumptions between the different data types. The assumption of independence between the genetic data and the epidemiological data for A_z and B, when A_z was the infection source, is one notable example. In reality, there will be complex interrelations between these two data types, given that B’s epidemiological data and the genetic distance from the isolate from A_z should depend on A_z’s epidemiological data through its influence on the infection time of B. This could lead to over- or under-estimation of the probability of A_z being the infection source of B depending on the precise combination of the data.

One final limitation relates to the use of the coalescent to model “unrelated” genetic sequences. The form of the coalescent we use makes two assumptions that might be problematic. Firstly, that there is no selection. Over short periods of time, where there is one dominant genetic type, this might be approximately true, but over longer time periods, it will definitely not be. Secondly, that there is no population structure. It is likely that there will be some degree of spatial structuring, and that the sequences in the hospital will be more closely related than would be the case if they were drawn at random from the entire population. This means that the modelled time to coalescence is likely to have a mean that is too high, i.e. the true time to coalescence would be shorter than would be expected for two sequences drawn at random, and consequently, the expected number of SNPs between the isolates would be overestimated. Both of these issues could potentially be resolved by modifying the form of the coalescent used, likely at the cost of more prior knowledge being required, but that goes beyond the scope of this work.

Conclusion

We believe that NOSTRA represents a step forward in data integration for nosocomial infection detection. Our tool provides the probability that an infection is nosocomial as well as the probability that certain given candidate individuals were linked to the infectee, something that was not previously available. We have reached the point that there are now multiple models that purport to assess nosocomiality in the literature, but we are limited by the absence of datasets where both the truth is known and the answer is non-trivial, so the accuracy of the assessments could not be quantified or compared. The simulation tools used in the evaluation here would likely be useful for more general comparisons between these methods, so that clinicians and medical statisticians can choose to implement the model that they would expect to perform best for their specific scenarios.

Supporting information

S1 Table. The parameter sets used for the hospital simulations.

Parameters starting with b are transmission rates used in the model. bP2P is the within bay transmission rate between patients. bP2P_hosp is the indirect between patient hospital transmission rate. bH2P is the healthcare worker to patient transmission rate. bP2H is the transmission probability from patients to healthcare workers per timestep. bH2H is the transmission probability from healthcare workers to other healthcare workers per timestep. bH2H_hosp is the indirect between healthcare worker hospital transmission rate. commScale is the scale of the community acquisition rate for HCWs. See the supplementary materials of Evans et al. [24] for a full description of the model.

https://doi.org/10.1371/journal.pcbi.1012949.s001

(XLSX)

S1 Fig. The concentration of posterior mass on specific sources as data is added.

This figure shows an example NOSTRA run from one simulated individual from the simulation analyses. Potential candidate infectors for the individual are labelled 1 to 16. The height of each bar corresponds to the posterior mass placed on that infection source. The top left panel shows the prior probabilities of each infection source. The top right panel shows the posterior probabilities of each infection source after admission and onset times are added (dark) and the prior probabilities of each infection source (light). The bottom left panel shows the posterior probabilities of each infection source after admission times, onset times, and location information are added (dark) and the posterior probabilities of each infection source after admission and onset times are added (light). The bottom right panel shows the posterior probabilities of each infection source after all data are added (dark) and the posterior probabilities of each infection source after admission times, onset times, and location information are added (light).

https://doi.org/10.1371/journal.pcbi.1012949.s002

(TIFF)

References

1. Goto M, Al-Hasan M. Overall burden of bloodstream infection and nosocomial bloodstream infection in North America and Europe. Clin Microbiol Infect. 2013;19(6):501–9.
2. de Kraker ME, Wolkewitz M, Davey PG, Koller W, Berger J, Nagler J, et al. Clinical impact of antimicrobial resistance in European hospitals: excess mortality and length of hospital stay related to methicillin-resistant Staphylococcus aureus bloodstream infections. Antimicrob Agents Chemother. 2011;55(4):1598–605. pmid:21220533
3. Saah FI, Amu H, Seidu AA, Bain LE. Health knowledge and care seeking behaviour in resource-limited settings amidst the COVID-19 pandemic: a qualitative study in Ghana. PLoS One. 2021;16(5):e0250940. pmid:33951063
4. Wong H, Eso K, Ip A, Jones J, Kwon Y, Powelson S, et al. Use of ward closure to control outbreaks among hospitalized patients in acute care settings: a systematic review. Syst Rev. 2015;4:152. pmid:26546048
5. Team PPE. Point prevalence survey of healthcare-associated infections and antimicrobial use in European acute-care hospitals. London, UK: Public Health England; 2016. Available from: https://assets.publishing.service.gov.uk/media/5c4f26a1e5274a491a413823/ECDC_PHE_HAI_AU_PPS_2016_single_codebook.pdf
6. Stirrup O, Hughes J, Parker M, Partridge DG, Shepherd JG, Blackstone J, et al. Rapid feedback on hospital onset SARS-CoV-2 infections combining epidemiological and sequencing data. Elife. 2021;10:e65828. pmid:34184637
7. Gómez-Vallejo H, Uriel-Latorre B, Sande-Meijide M, Villamarín-Bello B, Pavón R, Fdez-Riverola F, et al. A case-based reasoning system for aiding detection and classification of nosocomial infections. Decis Support Sys. 2016;84:104–16.
8. Cohen G, Hilario M, Sax H, Hugonnet S, Pellegrini C, Geissbuhler A. An application of one-class support vector machine to nosocomial infection detection. Stud Health Technol Inform. 2004;107(Pt 1):716–20. pmid:15360906
9. Duault H, Durand B, Canini L. Methods combining genomic and epidemiological data in the reconstruction of transmission trees: a systematic review. Pathogens. 2022;11(2):252. pmid:35215195
10. Illingworth CJ, Hamilton WL, Warne B, Routledge M, Popay A, Jackson C, et al. Superspreaders drive the largest outbreaks of hospital onset COVID-19 infections. Elife. 2021;10:e67308. pmid:34425938
11. Illingworth CJR, Hamilton WL, Jackson C, Warne B, Popay A, Meredith L, et al. A2B-COVID: a tool for rapidly evaluating potential SARS-CoV-2 transmission events. Mol Biol Evol. 2022;39(3):msac025. pmid:35106603
12. Didelot X, Kendall M, Xu Y, White P, McCarthy N. Genomic epidemiology analysis of infectious disease outbreaks using TransPhylo. Current Protocols. 2021;1:e60.
13. Stirrup O, Blackstone J, Mapp F, MacNeil A, Panca M, Holmes A, et al. Effectiveness of rapid SARS-CoV-2 genome sequencing in supporting infection control for hospital-onset COVID-19 infection: multicentre, prospective study. Elife. 2022;11:e78427. pmid:36098502
14. Weiser AA, Thöns C, Filter M, Falenski A, Appel B, Käsbohrer A. FoodChain-Lab: a trace-back and trace-forward tool developed and applied during food-borne disease outbreak investigations in Germany and Europe. PLoS One. 2016;11(3):e0151977. pmid:26985673
15. Quick J. nCoV-2019 sequencing protocol v1. protocolsio. 2020. p. bbmuil6w. https://doi.org/10.17504/protocols.io.bbmuik6w
16. Kingman J. On the genealogy of large populations. J Appl Probab. 1982;19(A):27–43.
17. Kingman J. The coalescent. Stoch Process Their Appl. 1982;13(3):235–48.
18. Tajima F. Evolutionary relationship of DNA sequences in finite populations. Genetics. 1983;105(2):437–60.
19. Hudson R. Gene genealogies and the coalescent process. In: Futuyma D, Antonovics J, editors. Oxford surveys in evolutionary biology. Vol. 7. Oxford: Oxford University Press; 1990. p. 1–44.
20. Schröter K. On a family of counting distributions and recursions for related compound distributions. Scand Actuar J. 1990;161–75.
21. He X, Wu P, Deng X, Wang J, Hao X, Lau Y, et al. Temporal dynamics in viral shedding and transmissibility of COVID-19. Nat Med. 2020;26(5):672–5. pmid:32296168
22. Ferretti L, Ledda A, Wymant C, Zhao L, Ledda V, Abeler-Dörner L, et al. The timing of COVID-19 transmission. medRxiv. preprint. 2020.
23. Wang S, Xuanyu X, Wei C, Li S, Zhao J, Zheng Y, et al. Molecular evolutionary characteristics of SARS-CoV-2 emerging in the United States. J Med Virol. 2022;94(1):310–7. pmid:34506640
24. Evans S, Stimson J, Pople D, White P, Wilcox M, Robotham J. Impact of interventions to reduce nosocomial transmission of SARS-CoV-2 in English NHS Trusts: a computational modelling study. BMC Infect Dis. 2024;24(1):475. pmid:38714946
25. Evans S, Stimson J, Pople D, Bhattacharya A, Hope R, White PJ, et al. Quantifying the contribution of pathways of nosocomial acquisition of COVID-19 in English hospitals. Int J Epidemiol. 2022;51(2):393–403. pmid:34865043
26. Brier G. Verification of forecasts expressed in terms of probability. Mon Weather Rev. 1950;78:1–3.
27. Pamilo P, Nei M. Relationships between gene trees and species trees. Mol Biol Evol. 1988;5(5):568–83.
