The authors have declared that no competing interests exist.
Conceived and designed the experiments: CG SKS CMB MJPW SHvdB CC. Performed the experiments: CG SKS CMB MJPW SHvdB. Analyzed the data: AC JF LL CC. Contributed reagents/materials/analysis tools: AC JF CG SKS CMB MJPW SHvdB. Wrote the paper: AC CG MW CC.
Flow cytometry is the prototypical assay for multi-parameter single cell analysis, and is essential in vaccine and biomarker research for the enumeration of antigen-specific lymphocytes that are often found in extremely low frequencies (0.1% or less). Standard analysis of flow cytometry data relies on visual identification of cell subsets by experts, a process that is subjective and often difficult to reproduce. An alternative and more objective approach is the use of statistical models to identify cell subsets of interest in an automated fashion. Two specific challenges for automated analysis are to detect extremely low frequency event subsets without biasing the estimate by pre-processing enrichment, and the ability to align cell subsets across multiple data samples for comparative analysis. In this manuscript, we develop hierarchical modeling extensions to the Dirichlet Process Gaussian Mixture Model (DPGMM) approach we have previously described for cell subset identification, and show that the hierarchical DPGMM (HDPGMM) naturally generates an aligned data model that captures both commonalities and variations across multiple samples. HDPGMM also increases the sensitivity to extremely low frequency events by sharing information across multiple samples analyzed simultaneously. We validate the accuracy and reproducibility of HDPGMM estimates of antigen-specific T cells on clinically relevant
The use of flow cytometry to count antigen-specific T cells is essential for vaccine development, monitoring of immune-based therapies and immune biomarker discovery. Analysis of such data is challenging because antigen-specific cells are often present in frequencies of less than 1 in 1,000 peripheral blood mononuclear cells (PBMC). Standard analysis of flow cytometry data relies on visual identification of cell subsets by experts, a process that is subjective and often difficult to reproduce. Consequently, there is intense interest in automated approaches for cell subset identification. One popular class of such automated approaches is the use of statistical mixture models. We propose a
Flow cytometry is the prototypical assay for multi-parameter single cell analysis, and is essential in vaccine development, monitoring of T cell-based immune therapies and the search for immune biomarkers. In many clinical research applications, the cell subsets of interest are
There has therefore been increasing interest in the use of objective, automated methods for cell subset identification
We briefly describe three alternative software packages for automated analysis to contrast the approach of HDPGMM. FLOCK 2.0 (FLOw cytometry Clustering without K)
With this in mind, the developments reported here concern the implementation of a hierarchical Gaussian mixture model based on a Dirichlet process prior, and extensions of the basic model to identify and quantify rare cell subsets in flow cytometry data. Simulated data is first used to demonstrate the advantages of hierarchical models over conventional clustering approaches. This is followed by validation of the model on experimental samples, using retrovirally TCR-transduced T cells that are spiked into autologous peripheral blood mononuclear cell (PBMC) samples to give a defined number of antigen-specific T cells
The basic concept in model-based approaches is to consider events in a flow cytometry data set as being random samples drawn from a multi-dimensional probability distribution. The objective of analysis is then to define the probability distribution model and evaluate inferences over the model parameters based on fit to the specific data set. Statistical mixture models are a standard approach for the construction of the underlying distribution, using the sum of many simpler probability distributions (e.g. multivariate Gaussian, Student-t or skewed distributions) to approximate arbitrary multi-dimensional distributions. For biological interpretation, fitted models are then used for clustering, i.e. using statistical properties of individual events to assign them to biological cell subsets. For example, with statistical mixture models, this can be done by grouping events with the highest probability of coming from a specific mixture component together, or merging of multiple components using specified criteria such as having a common mode in the estimated distribution over markers
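As a minimal sketch of the maximum-probability assignment rule described above, the following assigns each event to the mixture component with the highest weighted density. It assumes a univariate mixture for brevity (the models in this paper are multivariate), and all weights, means and standard deviations are hypothetical values chosen for illustration only.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def assign_component(x, weights, means, sds):
    """Index of the component with the highest weighted density at x."""
    scores = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical two-component mixture: one abundant subset and one rare subset.
weights = [0.999, 0.001]
means = [200.0, 800.0]
sds = [50.0, 30.0]

labels = [assign_component(x, weights, means, sds) for x in (180.0, 790.0)]
```

Even with a weight of only 0.001, the rare component claims the event at 790 because the abundant component has essentially no density there.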
Of course, the number of distinguishable cell subsets and Gaussian components necessary to fit the model satisfactorily is not known in advance. To avoid having to specify the number of mixture components needed in the model, we use a Dirichlet process prior in which the number of components necessary is directly estimated from the data
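One common way to work with a Dirichlet process prior in practice is a truncated stick-breaking construction, in which a unit-length stick is broken repeatedly and the pieces become component weights. The sketch below is illustrative only: the concentration parameter `alpha` is a hypothetical value, and the function is not taken from our software.

```python
import random

def truncated_stick_breaking(alpha, n_components, rng):
    """Draw component weights from a stick-breaking prior truncated at n_components."""
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to a component
    for k in range(n_components):
        # The final break takes everything that remains, so the weights sum to one.
        v = 1.0 if k == n_components - 1 else rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights

rng = random.Random(0)
weights = truncated_stick_breaking(alpha=5.0, n_components=128, rng=rng)
```

Most of the mass falls on a small number of components, with the remainder receiving negligible weight. This is how the prior lets the data determine how many components are effectively used.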
Clustering methods applied to data samples independently face two major limitations. The first is that cluster labels are not aligned across data samples, posing a problem for comparing subsets across multiple samples which is usually the purpose of the original experiment. The second is that there are limits to the ability of clustering models to identify very rare event clusters due to
Hierarchical, or multi-level, models represent individual events in flow cytometry data as being organized into successively higher units. For example, individual events belong to a sample, and a sample may belong to a collection of similar samples. The critical idea is that cell subset phenotypes that are common across data samples can be used to inform and hence better characterize events in individual samples. For example, one hierarchical Dirichlet process model formulation partitions components into those common across data samples and those unique to a specific sample
Instead, we model information sharing by placing all data samples under a common prior, such that the mean and covariance in any of the individual sample Gaussian components are shared across all samples, but the weight (proportion) of the component in each sample is unique. As described by Teh et al (2006)
As depicted in the summary schematic of the HDPGMM model shown in
A graphical model provides a declarative representation of the HDPGMM. The figure shows a compact plate representation of the graphical model, in which plates (rounded rectangles) are used to group variables in a subgraph. Each subgraph in a plate is replicated a number of times as indicated by the label within the plate. The
In the context of flow cytometry, a data sample typically consists of an
The hierarchical DP mixture model allows information sharing over data sets. In the hierarchical model, each flow cytometry data sample can be thought of as a representative of the collection of data samples being simultaneously analyzed. The individual data samples then provide information on the properties of the collection, and this information, in turn, provides information on any particular data sample. In this way, an HDPGMM fitted to a single data sample “borrows strength” from all other samples in the collection being analyzed. In other words, if a rare cell subtype is found in more than one of the samples, we share this information across the samples in the collection to detect the subtype even though the frequency in a particular data sample may be vanishingly small. HDPGMM thus increases sensitivity for clustering cell subsets that are of extremely low frequency in one sample but common to many samples or present in high frequency in one or more samples. In principle, there is no lower limit to the size of a cluster that can be detected in a particular sample. In practice, vanishingly small clusters (e.g. 3–5 events out of 100,000) require expert interpretation to distinguish background from signal, but it is not uncommon for biologically significant antigen-specific cells to be present at such frequencies.
We illustrate the ability of hierarchical modeling to simultaneously overcome the problem of masking of rare event clusters and provide an alignment of cell subsets over multiple data samples. Four simulated data sets were created, each with up to 4 bivariate normal clusters in 4 quadrants. Clusters in each quadrant may have different means or covariance matrices, or be absent entirely; see
(Left) Row 1 shows independent fitting of DPGMMs to each data set; row 2 shows the use of the reference posterior distribution from data set 3 to classify events in the other data sets; row 3 shows a DPGMM fitted to pooled data from all data sets; and row 4 shows fitting of an HDPGMM to all 4 data sets.
In
The panels show the estimated frequencies of antigen-specific cells (large red dots) expressed as a percentage of all events (yellow boxes). These percentages were estimated using manual gating by a representative user (left), DPGMM (middle) and HDPGMM (right). Text in red in the first column shows the spiked-in frequency of retrovirally transduced T cells for the data sample in that row. The red polygons in the left panel are gates used for identifying antigen-specific cells by manual gating; the exact shape, sequence and location of these gates are determined by the operator and may vary between different operators depending on their training, experience and expertise. With the DPGMM approach, cell subsets across the samples from top to bottom are not directly comparable as indicated by the event colors, posing a problem for quantification of the same cell subset in different samples. In contrast, with the HDPGMM approach, cell subsets are aligned and directly comparable across all samples. HDPGMM is more sensitive at detecting antigen-specific cells when the frequency is extremely low (first 3 rows). HDPGMM is also more consistent in labeling events across different samples, while DPGMM is prone to detecting likely false positive antigen-specific cells that are CD3-low or negative (arrows in rows 1 and 4 of middle panel). HDPGMM improves on the accuracy and consistency because the model incorporates both sample-specific and group-specific information, in contrast to DPGMM which only has access to sample-specific information. For both DPGMM and HDPGMM, model fitting was done with an MCMC sampler running 20,000 burn-in and 2,000 averaged iterations.
In row 2, we fitted a DPGMM to sample 3 (reference data set), then used the posterior distribution found to classify events in all the other samples. While this ensures that all clusters are aligned across the data sets, it has several limitations. The first issue is the need to choose a specific reference data sample, which introduces an element of subjectivity. A more worrying issue is that differences in distribution across data samples are simply ignored, and this can result in artifacts as shown with data sample 1 and data sample 2, where there is mixing of the red/green clusters because the mean or covariance matrices of those clusters deviated from that of the reference data sample 3. Also, because the small cluster (circled in red) is masked in data sample 3, it is also missed in all the other samples. While another data sample could have been chosen as the reference, it is clear from inspection of the variation across the simulated data samples that no single reference can give a satisfactory result.
In row 3, we fitted a DPGMM to pooled data from all four data samples. Pooled data is problematic because the resulting distribution is for an “averaged” data sample, and may result in the loss of information specific to a particular sample. We observe artifacts from clusters present in the pooled distribution but not in the specific sample in data sample 2 (green events in blue cluster) and data sample 4 (red events in blue cluster). A subtle issue is the over-counting of red cluster events in data sample 3 (9 events circled in red) due to the excessive influence of the red clusters in data samples 1 and 2.
Finally, in row 4, we fitted an HDPGMM to all four data sets simultaneously with the consensus modal clustering approach to identify cell subsets as described in
To evaluate the utility of HDPGMM for identifying rare event clusters in real data, we used reference cell samples containing a predefined number of T cells with known TCR specificity for the NY-ESO-1 cancer-testis antigen. TCR-transduced cells were added to autologous PBMC samples at final concentrations of 0%, 0.013125%, 0.02625%, 0.0525%, 0.105% and 0.21%
DPGMM and HDPGMM models were separately fitted to these six data samples using the FSC, SSC, CD45, CD3 and HLA-multimer channels (5 dimensional), using a truncated Dirichlet process with 128 mixture components, 20,000 burn-in steps and 2,000 identified iterations to calculate the posterior distribution as described in
The MCMC was run for 22,000 iterations, and samples obtained from the final 2,000 iterations were used to calculate and plot the log likelihood at each iteration. The log likelihood appears to vary stochastically about an equilibrium distribution indicating convergence, and the chain traverses its distribution indicating mixing, but the steps tend to be small indicating some degree of autocorrelation. Text in yellow boxes indicates the frequencies of the spiked antigen-specific T cells in the sample being fitted.
Each panel shows a scatter plot of the log (base 10) component proportions ordered by size for the HDP model fitted to each flow cytometry data sample. The largest component has a log proportion of approximately -1, indicating that this single component accounts for about 10% of the total events in the data sample. In contrast, the smallest component has a log proportion of between -5 and -6, indicating that it accounts for only 0.001–0.0001% of the total events in the data sample. Since each sample has 50,000 events, components with log proportions of -5 and below are likely to be empty of events. Hence, the dip at the right of each plot is an indication that the Dirichlet process model has cut back on the number of components used, and provides evidence that the number of components is adequate for a good model fit. If there is no dip in the smallest component proportions, the maximal number of components needs to be increased if rare event clusters are to be adequately modeled. Text in yellow boxes indicates the frequencies of the spiked antigen-specific T cells in the sample being fitted.
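The arithmetic behind this reading of the plot can be sketched as follows. The sketch assumes that the logarithms are base 10, which is consistent with a value of -1 corresponding to about 10% of events; the helper name `expected_events` is hypothetical.

```python
def expected_events(log10_proportion, n_events):
    """Expected number of events in a component with the given log10 proportion."""
    return (10.0 ** log10_proportion) * n_events

n_events = 50_000  # events per sample, as stated in the caption

largest = expected_events(-1, n_events)   # roughly 5,000 events
smallest = expected_events(-5, n_events)  # roughly 0.5 events, i.e. likely empty
```

A component expected to hold fewer than one event is effectively unused, which is why a dip at this level suggests that the truncation limit is not constraining the fit.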
A side-by-side comparison of manually gated, DPGMM and HDPGMM classifications is shown in
The panels show the estimated frequencies of antigen-specific cells (large red dots) expressed as a percentage of all events (yellow boxes). (Left panel) FLOCK detects the antigen-specific cluster at the highest spiked-in frequency but not in the other samples. There are several CD3-negative events included in the detected cluster that are most likely false positive events. As indicated by the color coding of events, FLOCK does not provide any alignment of cell subsets across samples. (Middle panel) Using the default settings, FLAME failed to identify any antigen-specific cell subsets. Cell subsets found were aligned but there were alignment artifacts when the event partitioning was different across samples (arrowed example). (Right panel) Using 64 components and 1000 iterations, flowClust only identified antigen-specific clusters at the highest spiked-in levels and did not provide any methods to align clusters across samples.
In
Finally, to evaluate the robustness of the DPGMM and HDPGMM frequency estimates, the fitting was repeated 10 times for each algorithm using different random number seeds, and compared to manual gating results from 10 users. Manual gating was performed by operators who were instructed to gate using the same sequence of 2D plots (common gating strategy), but were free to set gate boundaries within any given plot. The results are shown in
For gating estimates, frequency estimates from 10 flow cytometry operators were collected. For both DPGMM and HDPGMM, 10 MCMC runs with unique random number seeds were performed to evaluate the reproducibility of antigen-specific cell frequency estimates. Estimates of the antigen-specific frequencies from manual, DPGMM and HDPGMM approaches are shown as open blue circles, with the blue bar representing the mean of all 10 estimates at each spike frequency. The red crosses represent the “true” frequency of antigen-specific cells combining the known spiked-in frequencies and the average background from 10 manual evaluations. As shown in the figure, HDPGMM (right panel) estimates have equal or less variability at every spike dilution when compared with DPGMM (middle panel). A linear regression fit (red line) shows that the standard errors and correlation coefficient of all 3 approaches are comparable. The number in red text above each set of estimates is the absolute value of (median of estimates – “true value”), a measure of accuracy. This shows that HDPGMM is more accurate than manual gating at every spiked-in concentration. The number in blue text below each set of estimates is the coefficient of variation (CV), which is lower for HDPGMM than manual gating for all concentrations except autologous sample only. For both DPGMM and HDPGMM, model fitting was done with an MCMC sampler running 20,000 burn-in and 2,000 averaged iterations.
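The two summary statistics used in the figure can be computed as in the sketch below. The estimates and the "true" value shown are hypothetical numbers for illustration and are not data from this study. The use of the sample standard deviation for the CV is also an assumption.

```python
import statistics

def accuracy_error(estimates, true_value):
    """Absolute difference between the median estimate and the true value."""
    return abs(statistics.median(estimates) - true_value)

def coefficient_of_variation(estimates):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(estimates) / statistics.mean(estimates)

# Hypothetical frequency estimates (percent of all events) from 10 repeated runs.
estimates = [0.050, 0.052, 0.049, 0.051, 0.050, 0.053, 0.048, 0.050, 0.051, 0.052]
true_value = 0.0525  # hypothetical target frequency

err = accuracy_error(estimates, true_value)
cv = coefficient_of_variation(estimates)
```

The median-based error is insensitive to a single outlying run or operator, whereas the CV captures the spread across all 10 estimates.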
We have shown that HDPGMM improves on fitting individual samples with DPGMM in two ways: 1) it aligns clusters, making direct comparison of cluster counts across samples possible, and 2) by sharing information across samples, it can identify biologically relevant cell subsets present at frequencies in the 0.01–0.1% range, since “real” cell subsets would naturally be expected to be present in multiple data samples. The hierarchical model is also preferable to using a reference data sample or pooling the data from all samples, since individual sample characteristics are lost with these alternative strategies.
Unlike HDPGMM, other approaches for automated flow cytometry analysis treat data in the same way as DPGMM, that is, fitting a model to independent samples separately, then using a heuristic or algorithm to match up clusters in one data set with another. However, since the model fitting is performed independently, the way that events are partitioned in individual data sets into clusters may be different even across samples that are otherwise very similar, resulting in poor alignment as seen in the FLAME analysis. We are not aware of any other automated flow cytometry analysis software that directly models contributions from individual and grouped samples to align cell subsets, and believe that the HDPGMM approach fills a useful niche in multi-sample comparisons, especially for the quantification of rare event clusters.
One limitation of the HDPGMM model is that all the data to be fitted need to be simultaneously available. This is not an issue for most studies, but may be limiting for longitudinal studies that collect samples serially over an extended period where interim analyses need to be performed. Even in these cases, it may be useful to batch process cell samples in stages using a hierarchical model, then perform post-processing to align cell subsets over different stages. Because of information sharing, cell subsets that are consistent across data samples will be extremely robust features in the posterior distribution. Hence, it is likely that features across batches will be more consistent and easier to align for HDPGMM-fitted batch samples than if every sample was independently fitted.
As described in the text, HDPGMM achieves alignment by assuming that the cluster locations and shapes are constant across datasets, and only their proportions vary from sample to sample. This is similar to the standard practice of using a gating template common to all samples for manual analysis. However, the HDPGMM approach has several advantages over the use of a common gating template. Because the locations and shapes of the clusters are inferred from the data and not imposed top-down by an expert, there is less risk of a subjective bias and failure to detect novel cell subsets. Since classification of events is done by assignment to the maximum probability cluster, cell subsets are not demarcated by arbitrary (typically polygonal) boundaries. In addition, it is simple to tune for higher sensitivity or specificity depending on experimental context by setting the probability necessary for an event to be included in a cluster; events that fall below this threshold are considered to be indeterminate. However, clusters that are doubly rare in the sense of being found in only a small proportion of the samples, and which also constitute a tiny fraction of the total events in any given sample, risk being masked by other more common and high abundance clusters. In many cases, this limitation can be addressed by the inclusion of appropriate positive controls in the samples. Where such positive controls are not available, a post-processing step to scan for “anomalous” events that are found in extremely low probability regions of the posterior distribution at higher frequencies than predicted, may be effective for identifying these doubly rare events.
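The tunable classification rule described above can be sketched as follows: an event is assigned to its maximum-probability cluster only if that probability reaches a chosen threshold, and is otherwise marked indeterminate. The function name, the posterior probabilities and the threshold values are hypothetical.

```python
def classify_with_threshold(posteriors, threshold):
    """Return the index of the most probable cluster, or None if indeterminate."""
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    return best if posteriors[best] >= threshold else None

# Hypothetical posterior cluster probabilities for three events.
events = [
    [0.98, 0.01, 0.01],
    [0.55, 0.40, 0.05],
    [0.10, 0.15, 0.75],
]

strict = [classify_with_threshold(p, 0.9) for p in events]   # favors specificity
lenient = [classify_with_threshold(p, 0.5) for p in events]  # favors sensitivity
```

Raising the threshold leaves more events unassigned but reduces the risk of including ambiguous events in a rare cluster.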
Technically, our implementation of the HDPGMM integrates several innovations necessary to make such hierarchical models a practical tool for flow cytometry analysis, including the use of a Metropolis-within-Gibbs step for sampling, an identification strategy to maintain consistent component labels across iterations that allows us to calculate the posterior distribution from multiple MCMC iterations, and a consensus modal map to merge components in such a way that non-Gaussian cell subsets are aligned across multiple data sets. To ensure scalability, we have implemented Message Passing Interface (MPI) and Compute Unified Device Architecture (CUDA) optimized code that can take advantage of multiple CPUs and GPUs from a cluster of machines to fit a single HDPGMM model to multiple data sets.
We provide software for HDPGMM fitting to flow cytometry data sets, together with pre-specified robust default parameters and hyper-parameters that make practical usage simple. In our experience, we have never needed to adjust these parameters for data sets ranging from 3-color to 11-color flow cytometry data sets. The only parameters we individually set are the number of burn-ins, the number of iterations to collect for the posterior distribution, and the maximal number of components for the truncated DP algorithm. These parameters are tuned mainly for computational efficiency since conservative defaults that would be expected to be effective in all common use cases can be given, with the trade-off being longer run times. In addition, the use of prior information to set the starting values for component means and covariances (e.g. from fits to previously collected similar data) would reduce the number of iterations necessary to achieve convergence.
The fitting of HDPGMM is computationally demanding but can be accelerated with cheap commodity graphics cards as previously described
Left panel shows time taken to fit HDPGMM to 10 samples with 50,000 to 500,000 events and 10 markers. Middle panel shows time taken to fit HDPGMM to 3 to 30 samples each with 100,000 events and 10 markers. Right panel shows time taken to fit 10 samples each with 100,000 events with the number of markers varying from 5 to 15. In each case, the model was run for 1,000 MCMC steps with an upper limit of 128 mixture components on an NVIDIA GTX 580 GPU, and the times from three replicate runs are shown.
In summary, we describe and provide code for a hierarchical modeling extension to statistical mixture models that improves on the robustness, sensitivity and interpretability of model-based approaches for automated flow cytometry analysis. We demonstrate the consistency of HDPGMM frequency estimates on reference data samples spiked with extremely low frequencies of antigen-specific cells, a scenario directly relevant to many clinical research studies in vaccine development, immune monitoring and immune biomarker discovery where the frequency of rare antigen-specific T cells is of interest.
Assume we observe flow cytometry measurements
An alternative and equivalent representation of (1) is to assume that for each observation
We now generalize DPGMM to simultaneously classify T cells across multiple datasets. Assume we observe
Our interest is in extensions of this basic framework to hierarchical models on the
In summary, within each sample every cell is assumed to come from some unknown cluster, where the number of clusters is learned from the data and the shape of each cluster is unknown. We can make this assumption because we group many parametric Gaussian clusters into flexibly shaped groups; see the consensus modal clustering below. The model is hierarchical in the sense that cluster shapes are shared between samples while their prevalence varies between samples. Information is therefore shared when cells from multiple samples are assigned to the same cluster, giving us more information about the cluster's shape. This is especially valuable when the number of cells in a particular cluster is small for a given sample.
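The structure summarized here, with component locations and shapes shared across samples and weights specific to each sample, can be illustrated with a small generative sketch. All numerical values are hypothetical, and spherical components are assumed purely for brevity. The sketch simulates data from such a model and does not perform posterior inference.

```python
import random

# Component locations and spread shared by every sample (hypothetical values).
shared_means = [(-2.0, -2.0), (2.0, -2.0), (-2.0, 2.0), (2.0, 2.0)]
shared_sd = 0.5  # assumption: spherical components with a common spread

# Sample-specific weights: the same components in different proportions.
sample_weights = [
    [0.70, 0.20, 0.10, 0.00],    # component 3 absent from this sample
    [0.40, 0.30, 0.299, 0.001],  # component 3 present but rare
]

def simulate_sample(weights, n_events, rng):
    """Draw events and their component labels for one sample."""
    events, labels = [], []
    for _ in range(n_events):
        k = rng.choices(range(len(weights)), weights=weights)[0]
        mx, my = shared_means[k]
        events.append((rng.gauss(mx, shared_sd), rng.gauss(my, shared_sd)))
        labels.append(k)
    return events, labels

rng = random.Random(1)
samples = [simulate_sample(w, 5000, rng) for w in sample_weights]
```

Because component 3 has the same location in every sample, the few events it contributes to the second sample can be pooled with events from any other sample where it is abundant when its shape is estimated.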
We perform posterior inference by sampling via a Markov chain Monte Carlo (MCMC) algorithm using the latent classification variable
Since the conditional distributions for
To address the label switching issue, we use the method of Cron and West
In each iteration of the MCMC, the multivariate normal distribution must be evaluated at every event (in every dataset) for each of the
As cell subsets may have non-Gaussian distributions, it is often necessary to merge several mixture components to represent a single cell subtype. An intuitively appealing concept is to cluster components together when the components share a common mode, since the mode is an objective feature of the posterior distribution that links multiple components. Here we adapt the procedure to find a coherent modal assignment across data sets. We first create a consensus Gaussian mixture model distribution whose components have the same means and covariance matrices as the fitted HDPGMM model, but whose component weights are averaged over all data sets. Starting from the location of each component mean, we use a numerically efficient iterative procedure to identify the mode associated with that location as previously described
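To illustrate the idea of merging components that share a mode, the sketch below climbs from each component mean to a mode of a consensus mixture and groups components that arrive at the same mode. It uses a standard fixed-point iteration for Gaussian mixtures in one dimension, which is a stand-in and not necessarily the procedure used in our implementation. Grouping modes by rounding is a further simplification, and all parameter values are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def find_mode(x, weights, means, sds, tol=1e-8, max_iter=1000):
    """Fixed-point iteration that climbs from x to a nearby mode of the mixture."""
    for _ in range(max_iter):
        # Weighted density of each component at x, scaled by its precision.
        r = [w * normal_pdf(x, m, s) / (s * s) for w, m, s in zip(weights, means, sds)]
        x_new = sum(ri * m for ri, m in zip(r, means)) / sum(r)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

def modal_groups(weights, means, sds, digits=3):
    """Group component indices by the mode reached from each component mean."""
    groups = {}
    for k, m in enumerate(means):
        mode = round(find_mode(m, weights, means, sds), digits)
        groups.setdefault(mode, []).append(k)
    return list(groups.values())

# Hypothetical consensus mixture: two overlapping components and one distant one.
weights = [0.45, 0.45, 0.10]
means = [0.0, 0.5, 10.0]
sds = [1.0, 1.0, 1.0]

groups = modal_groups(weights, means, sds)
```

In this example the two overlapping components share a single mode and are merged into one group, while the distant component forms its own group.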
We give posterior computational details only for HDPGMM since details for our implementation of DPGMM have been previously published
For each observation calculate
Given the cluster assignments, sampling each
Updating the cluster weights,
Furthermore,
The conditional distributions for
First, sample
The generation of the standard samples with a defined number of antigen-specific CD8 T cells spiked into autologous PBMC for use in HLA-peptide multimer has been described
Sample preparation conditions were set so that results (i.e. generated FCS files) would be as comparable as possible: Cell staining was performed simultaneously by the same operator, using the same batches of staining reagents, and data acquisition was subsequently done in a single experiment using the same cytometer settings (voltages, compensations) for all samples. The data were generated using a FACSCalibur and CellQuest Pro 6.0, with values ranging from 0 to 1023. No further transformations were performed on the data but standardization to have zero mean and unit standard deviation was performed before fitting the mixture model so all markers would have equal contributions. The standardization was reversed before plotting - i.e. all plots are based on the original 0 to 1023 scale. For gating estimates, frequency estimates from 10 flow cytometry operators using the same gating strategy were collected.
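The standardization and its reversal can be sketched for a single channel as follows. The channel values are hypothetical, and the use of the population standard deviation is an assumption.

```python
import statistics

def standardize(values):
    """Rescale values to zero mean and unit standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values], mean, sd

def unstandardize(z_values, mean, sd):
    """Map standardized values back to the original scale."""
    return [z * sd + mean for z in z_values]

channel = [12.0, 250.0, 511.0, 760.0, 1023.0]  # hypothetical values on the 0 to 1023 scale

z, mean, sd = standardize(channel)
restored = unstandardize(z, mean, sd)
```

Keeping the per-channel mean and standard deviation alongside the fitted model is what allows results to be plotted on the original scale.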
We would also like to thank our ten flow cytometry users for providing the estimates using manual gating, and Dr. T.N. Schumacher (NKI, Amsterdam, The Netherlands) for providing the NY-ESO-1 specific TCR.