RB, ML, MM, WHW, WRS, and CL conceived and designed the experiments. RB performed the experiments. RB, ML, YP, and CL analyzed the data. RB, ML, KH, XZ, LAG, EAF, EPH, IKM, MDH, AD, MAR, and CL contributed reagents/materials/analysis tools. RB, ML, WRS, and CL wrote the paper.
¤ Current address: Novartis Institutes for Biomedical Research, Cambridge, Massachusetts, United States of America
The authors have declared that no competing interests exist.
Loss of heterozygosity (LOH) of chromosomal regions bearing tumor suppressors is a key event in the evolution of epithelial and mesenchymal tumors. Identification of these regions usually relies on genotyping tumor and counterpart normal DNA and noting regions where heterozygous alleles in the normal DNA become homozygous in the tumor. However, paired normal samples for tumors and cell lines are often not available. With the advent of oligonucleotide arrays that simultaneously assay thousands of single-nucleotide polymorphism (SNP) markers, genotyping can now be done at high enough resolution to allow identification of LOH events by the absence of heterozygous loci, without comparison to normal controls. Here we describe a hidden Markov model-based method to identify LOH from unpaired tumor samples, taking into account SNP intermarker distances, SNP-specific heterozygosity rates, and the haplotype structure of the human genome. When we applied the method to data genotyped on 100 K arrays, we correctly identified 99% of SNP markers as either retention or loss. We also correctly identified 81% of the regions of LOH, including 98% of regions greater than 3 megabases. By integrating copy number analysis into the method, we were able to distinguish LOH from allelic imbalance. Application of this method to data from a set of prostate samples without paired normals identified known regions of prevalent LOH. We have developed a method for analyzing high-density oligonucleotide SNP array data to accurately identify regions of LOH and retention in tumors without the need for paired normal samples.
A key event in the generation of many cancers is loss of heterozygosity (LOH) of chromosomal regions containing tumor suppressor genes, whereby one parent's version of the tumor suppressor is lost. As we develop a better understanding of the molecular mechanisms that generate different cancers, a description of the LOH events underlying these cancers is forming an important part of their classification. Generally, detection of LOH relies on comparison of the tumor's genome to the normal genome of the individual. Unfortunately, for many tumors, including most experimental models of cancer, the normal genome is not available. Therefore, the authors have developed a hidden Markov model-based method that evaluates the probability of LOH at all sites throughout the genome, based on high-resolution genotyping of only the tumor. They were able to achieve high levels of accuracy, specifically by taking into account the haplotype block structure of the genome. Application of this method to a set of 34 prostate cancer samples allowed the authors to identify the locations of the known and suspected tumor suppressor genes that are targeted by LOH.
Loss of heterozygosity (LOH) refers to a change from a state of heterozygosity in a normal genome to a homozygous state in a paired tumor genome. LOH is most often regarded as a mechanism for disabling tumor suppressor genes (TSGs) during the course of oncogenesis [
Single nucleotide polymorphisms (SNPs) are the most common genetic variation in the human genome and can be used to search for germline genetic contributions to disease. To that end, oligonucleotide SNP arrays have been developed to simultaneously genotype thousands of SNP markers across the human genome [
Traditionally, LOH analyses require the comparison of the genotypes of the tumor and its normal germline counterpart. However, for cell line, xenograft, leukemia, and archival samples, paired normal DNA is often unavailable. Current generations of SNP arrays provide high enough marker density to make it feasible to identify regions of LOH by the absence of heterozygous loci (which we call inferred LOH), rather than by comparison to the paired normal. For example, the homozygosity mapping of deletions method was developed to use highly polymorphic microsatellite markers to identify regions of hemizygous deletion in unpaired tumor cell lines [
We approached this problem by developing a hidden Markov model (HMM) to infer LOH. HMMs are appropriate for inferring the unobserved underlying states that give rise to an observed chain of data, using multiple sources of information. They have been used to model biological data in diverse applications such as sequence analysis [
The components of an HMM are the unobserved states, the observed measurements, the emission probabilities connecting these two, the transition probabilities between the unobserved states, and the initial probabilities of the states at the beginning of the chain (
Unobserved LOH states (LOSS or RET) of SNP markers generate observed genotype calls via emission probabilities. The solid arrows indicate the transition probabilities between LOH states, and the dashed arrows indicate LD-induced dependencies between consecutive SNP genotypes.
For a SNP under the RET state, we observed Het calls with a probability equal to the heterozygosity rate of each SNP, which we estimated from normal samples (see
These probabilities, denoted by
These probabilities describe the dependence between the LOH states of adjacent markers. For any two adjacent SNP markers, we first defined
The probability of RET at the second marker is one minus these two probabilities. This transition probability model is the same as those used in the “instability-selection” model for LOH analysis [
Empirically determined transition probabilities (circles) between RET loci (top graph) and LOSS loci (bottom graph) are compared to those predicted by
The HMM and these emission, initial, and transition probabilities specify the joint probability of the observed SNP genotypes and the unobserved LOH states in one chromosome of a sample. We applied the forward-backward algorithm [
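To make the inference step concrete, the following Python sketch runs the forward-backward algorithm on a two-state (LOSS/RET) chain and returns the posterior probability of LOSS at each marker. It is an illustration under stated assumptions, not the dChipSNP implementation: the function names (`emission`, `transition`, `posterior_loss`), the prior `p_loss`, the error rate `err`, and the distance-dependent form of `transition` are all hypothetical choices made for this example.

```python
from math import exp

STATES = ("LOSS", "RET")

def emission(state, call, het_rate, err=0.01):
    """P(observed call | LOH state) for one SNP.
    Assumption: a LOSS marker is called Het only through error (rate err);
    a RET marker is called Het at the SNP's heterozygosity rate."""
    p_het = err if state == "LOSS" else het_rate
    return p_het if call == "Het" else 1.0 - p_het

def transition(prev, curr, dist_mb, p_loss=0.1, scale_mb=100.0):
    """Hypothetical distance-dependent transition: with probability theta the
    state is redrawn from the initial distribution, otherwise it is copied."""
    theta = 1.0 - exp(-2.0 * dist_mb / scale_mb)
    init = p_loss if curr == "LOSS" else 1.0 - p_loss
    return theta * init + (1.0 - theta) * (1.0 if prev == curr else 0.0)

def posterior_loss(calls, het_rates, dists_mb, p_loss=0.1, err=0.01):
    """Posterior P(LOSS) per marker given all calls on one chromosome.
    dists_mb[i] is the distance between marker i and marker i + 1."""
    n = len(calls)
    init = {"LOSS": p_loss, "RET": 1.0 - p_loss}

    # Forward pass, normalized at each marker to avoid underflow.
    fwd = []
    for i in range(n):
        row = {}
        for s in STATES:
            e = emission(s, calls[i], het_rates[i], err)
            if i == 0:
                row[s] = init[s] * e
            else:
                row[s] = e * sum(
                    fwd[i - 1][r] * transition(r, s, dists_mb[i - 1], p_loss)
                    for r in STATES
                )
        z = sum(row.values())
        fwd.append({s: v / z for s, v in row.items()})

    # Backward pass, also normalized at each marker.
    bwd = [None] * n
    bwd[n - 1] = {s: 1.0 for s in STATES}
    for i in range(n - 2, -1, -1):
        row = {
            r: sum(
                transition(r, s, dists_mb[i], p_loss)
                * emission(s, calls[i + 1], het_rates[i + 1], err)
                * bwd[i + 1][s]
                for s in STATES
            )
            for r in STATES
        }
        z = sum(row.values())
        bwd[i] = {s: v / z for s, v in row.items()}

    # Combine the two passes into a per-marker posterior for LOSS.
    post = []
    for i in range(n):
        w = {s: fwd[i][s] * bwd[i][s] for s in STATES}
        post.append(w["LOSS"] / (w["LOSS"] + w["RET"]))
    return post
```

In this sketch, a long run of closely spaced Hom calls drives the posterior toward LOSS, whereas a single Het call within the run pulls the nearby markers strongly toward RET, because a Het call is far more probable under RET than under LOSS.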
An alternative inference method for HMM is the Baum-Welch algorithm [
We compared tumor-only inferred LOH to the observed LOH calls determined by paired analysis of tumor and normal genotypes, using 10 K SNP array data from autosomes of 14 lung and breast cancers and EBV-transformed normal cell line pairs (
(A) Results from 10 K SNP array data. Each column represents a sample, with SNP markers from Chromosome 10 displayed from the p terminus (top) to the q terminus (bottom) (not all markers are displayed at this resolution). Tumor/normal observations (left) represent direct comparisons of tumor to normal genotypes. Here, SNP markers observed as having undergone LOH are indicated in blue, retention is shown in yellow, and noninformative SNPs are indicated in grey. Inferences from unpaired tumor data represent the probability of each SNP having undergone LOH, as made by the basic HMM (center) and HC/LD-HMM (right). Here, a high probability of LOH (LOSS) is also indicated in blue, a high probability of retention (RET) is indicated in yellow, and indeterminate SNPs with an almost equal probability of either state are indicated in white. Occasionally, regions that are noninformative in the tumor/normal comparison are falsely inferred as LOH by the basic HMM in the unpaired data (red arrows); some of these false regions are corrected by the HC/LD-HMM (green arrows).
(B) Results from 100 K SNP array data, shown as in (A). Data from Chromosome 21 are shown to highlight the detection of false LOH in the analysis of unpaired tumor data, and are not representative of the frequency of true LOH events in this sample set. Almost all regions falsely inferred as LOH by the basic HMM are correctly inferred by the HC/LD-HMM. The blue arrows indicate a region of true LOH, which is correctly identified by both the basic and HC/LD-HMM.
This initial analysis does not, however, account for the SNPs that are homozygous in both tumor and the paired normal, and thus are noninformative. A string of such homozygous SNPs may be falsely called LOSS in the HMM analyses of unpaired tumors, but not accounted for in the above comparison of observed and inferred LOH states (the red arrows in
With these methods in place, we next applied the basic HMM to 100 K SNP array data from two prostate cancers and two lung cancer cell lines along with paired normal DNA, which were not included in the 10 K dataset. Here, the number of noninformative regions inferred as LOSS increased significantly (
(A) Inferences from the basic HMM applied to 100 K SNP array data are shown for Chromosome 4 in normal samples. Data are shown as in
(B) The genotypes of one region of falsely inferred LOH reveal a region of linkage disequilibrium (dashed red box), also identified by the HapMap project. The sample in column “D” contains one haplotype, the samples in columns “E” through “K” contain another haplotype, and the samples in columns “A” through “C” are heterozygous.
(C) Improved LOH inferences after application of the LD-HMM.
As indicated above, within a region of LD, the observed genotype of any marker depends not only on the underlying LOH state, but also on the genotypes of the adjacent markers (i.e., the two markers are dependent in genotype, indicated by the broken arrows in
We use the same observed Het and Hom genotypes of the tumor sample as in the basic HMM, but expand the unobserved LOH states for a SNP marker from the previous two states (LOSS or RET) to four states: Hom LOSS, Het LOSS, Hom RET, and Het RET. Here Hom and Het represent the SNP marker's genotype in the unobserved normal sample. For example, “Hom LOSS” indicates that the SNP is homozygous in the normal sample and has undergone LOH in the tumor. The states “Hom LOSS”, “Het LOSS”, and “Hom RET” will result in homozygous genotype calls in the tumor unless a genotyping or mapping error occurs, so the emission probability of the Hom genotype from these three states is set to (1 – SNP error rate). The state “Het RET” will result in a heterozygous SNP call in the tumor unless a genotyping or mapping error occurs, so the emission probability of the Hom genotype from this state is set to the SNP error rate. The emission probability of the Het genotype is 1 minus that of the Hom genotype.
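The four-state emission probabilities described above can be written out as a small lookup table. The Python sketch below is illustrative only; the function name `ld_hmm_emissions` and the example error rate of 0.01 are assumptions introduced here, not values or identifiers taken from dChipSNP.

```python
STATES = ("Hom LOSS", "Het LOSS", "Hom RET", "Het RET")

def ld_hmm_emissions(snp_error_rate):
    """Return P(tumor call | hidden state) for the four LD-HMM states.

    Only "Het RET" is expected to yield a Het call in the tumor; the other
    three states yield Hom calls unless a genotyping or mapping error occurs.
    """
    table = {}
    for state in STATES:
        if state == "Het RET":
            p_hom = snp_error_rate          # Hom call arises only by error
        else:
            p_hom = 1.0 - snp_error_rate    # Hom call is the expected outcome
        table[state] = {"Hom": p_hom, "Het": 1.0 - p_hom}
    return table

# Example with an illustrative (assumed) SNP error rate of 1%.
emissions = ld_hmm_emissions(0.01)
```

Each row of the table sums to 1, so it can be used directly as the emission distribution for the corresponding hidden state.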
The transition probabilities now reflect both the probability of a state change from RET to LOSS (LOH state) and the probability of a state change from Het to Hom (genotype state). We estimated genotype dependencies as the probability, for each SNP marker, of the next adjacent SNP marker toward the q-arm being Hom (or Het), given that the current SNP marker is Hom (or Het), in a reference set of normal samples (see
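One way to picture the estimation of genotype dependencies from reference normals is as a simple counting exercise over adjacent marker pairs. The Python sketch below is a hypothetical illustration: the function name `genotype_dependence`, the input layout (one list of "Hom"/"Het" calls per sample, ordered from the p terminus to the q terminus), and the choice to skip missing calls are assumptions made for this example.

```python
def genotype_dependence(reference_calls):
    """Estimate P(next call | current call) for each adjacent marker pair.

    reference_calls: list of per-sample call lists ("Hom" or "Het"; any other
    value, such as a no-call, is skipped). All samples share the same markers.
    Returns one dict per adjacent pair; an entry is None when the conditioning
    genotype was never observed at the current marker.
    """
    n_markers = len(reference_calls[0])
    result = []
    for i in range(n_markers - 1):
        counts = {"Hom": {"Hom": 0, "Het": 0}, "Het": {"Hom": 0, "Het": 0}}
        for sample in reference_calls:
            cur, nxt = sample[i], sample[i + 1]
            if cur in counts and nxt in counts[cur]:
                counts[cur][nxt] += 1
        probs = {}
        for cur, row in counts.items():
            total = row["Hom"] + row["Het"]
            probs[cur] = (
                None if total == 0 else {g: c / total for g, c in row.items()}
            )
        result.append(probs)
    return result

# Toy reference set: four normal samples genotyped at three markers.
reference = [
    ["Hom", "Hom", "Het"],
    ["Hom", "Hom", "Hom"],
    ["Het", "Het", "Het"],
    ["Hom", "Het", "Het"],
]
dependence = genotype_dependence(reference)
```

In a region of strong LD, the estimated probability of a Hom call following a Hom call would be expected to approach 1, which is the signal the LD-HMM uses to avoid treating a run of Hom calls as independent evidence of LOH.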
With the addition of the initial probabilities (which are the same as the basic HMM), the HMM parameters were fully specified and the forward-backward algorithm was used to obtain the probability of the LOH state being LOSS (either “Hom LOSS” or “Het LOSS”) for every SNP, given all the observed SNP calls along one chromosome of a tumor sample. Application of the LD-HMM to the 100 K dataset of normals, in place of the basic HMM, reduced the frequency of loss calls from 4.7% to 1.5% of markers (
We posited that the remaining regions of falsely inferred LOH resulted from three specific deficiencies of the LD-HMM. First, regions of LD might be present in a relatively small subset of patients [
To validate these results, we extended the analysis to a set of 100 K data obtained from two lung cancer cell lines and six gliomas with paired normals that had not been used in any of our prior analyses. Here, the sensitivity and specificity of the HC/LD-HMM were 98.7% and 99.3%, respectively (
Sensitivity and Specificity of the Basic HMM and HC/LD-HMM
Given the importance of taking into account haplotype block structure, which is known to vary between ethnic groups [
Conversely, the haplotype correction method relies on the ability to match specific haplotypes present in the tumors to those same haplotypes in the reference set. One might expect, therefore, that use of the JHC samples as the reference set for haplotype correction would result in poorer specificity than use of a Caucasian reference set. That is in fact the case, with the specificity of the HC/LD-HMM rising to only 98.3% when JHC samples were used for the haplotype correction, rather than the 99.0% when Caucasian samples were used (
The methods described above rely on the empirical estimates of a number of the parameters used in the initial, emission, and transition probabilities of the HMM. To assess whether the tumor-only inference methods were unduly influenced by these estimates, we tested the performance of the basic and LD-HMMs as we varied these parameters. Specifically, the accuracy of the model results, as judged against observed LOH in the paired tumor/normal data, changed by less than 0.3% as the SNP error rate was varied from 0.1% to 1% (10 K array). Moreover, when the SNP-specific heterozygosity rates were replaced by an average heterozygosity rate that was varied from 0.1 to 0.5 (10 K array) or from 0.1 to 0.27 (100 K array), the accuracy of the model results changed by less than 5% and 0.5%, respectively. Likewise, use of 60 versus 89 reference samples (from the JHC reference set) affected model accuracy by less than 0.1%. We also found that varying the scaling factor
If the HMM is robust to parameter specifications, the question naturally arises: why institute an HMM-based approach that requires these parameters, rather than a simpler approach? The most obvious simple approach is to calculate, for all
We applied such an approach (called NumHom) to our training set of 100 K data, and found that it does not match the specificity of the HC/LD-HMM for thresholds at which reasonable sensitivities are reached (
The value of this HMM-based approach appears to be that it can straightforwardly integrate multiple sources of information, including SNP-specific heterozygosity rates, haplotype block structure, and genotyping error rates, to generate a local probability of LOH. It appears that each of these sources of information is necessary to obtain the highest sensitivity and specificity. Undoubtedly, other approaches may be devised to integrate these sources of information, but such approaches are likely to have a similar complexity to the HMM-based methods.
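For comparison, a simple window-based caller of the kind discussed above can be sketched in a few lines of Python. This is only an illustration of the general idea of counting homozygous calls in a sliding window; the function name `numhom_calls`, the default threshold, and the rule of flagging every marker in a qualifying window are assumptions for this example and do not reproduce the exact NumHom procedure.

```python
def numhom_calls(calls, window=33, min_hom=33):
    """Flag markers that lie in any window of `window` consecutive SNPs
    containing at least `min_hom` homozygous calls.

    calls: list of "Hom"/"Het" genotype calls ordered along a chromosome.
    Returns a list of booleans, True where LOH would be called.
    """
    n = len(calls)
    is_hom = [1 if c == "Hom" else 0 for c in calls]
    flagged = [False] * n
    if n < window:
        return flagged

    count = sum(is_hom[:window])  # Hom count in the first window
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window by one marker: add the new one, drop the old.
            count += is_hom[start + window - 1] - is_hom[start - 1]
        if count >= min_hom:
            for j in range(start, start + window):
                flagged[j] = True
    return flagged

# Toy chromosome: a 40-marker homozygous stretch flanked by Het calls.
toy = ["Het"] * 5 + ["Hom"] * 40 + ["Het"] * 5
toy_flags = numhom_calls(toy, window=33, min_hom=33)
```

Unlike the HMM, this sketch treats every SNP as equally informative and ignores intermarker distances, heterozygosity rates, and haplotype structure, which is consistent with the lower specificity reported for the window-based approach.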
The above analyses suggest that the HMM-based methods are robust for inferring LOH on a per marker basis. We next asked whether the HC/LD-HMM was equally effective in detecting regions of LOH and whether detection of such regions was influenced by their size. To this end, we compared the ability of the tumor-only LOH analysis to identify LOH regions observed from comparing paired normal and tumor samples (
Sensitivity of the HC/LD-HMM by Size of LOH Region
As mentioned in the introduction, LOH arises due to complete loss of one allele through hemizygous deletion (copy loss) or through gene duplication (copy neutral). On the other hand, heterozygous loci can erroneously be assigned a homozygous genotype in settings of allele-specific amplification (allelic imbalance). This will occur whether or not LOH is determined using paired normals, and may present paradoxical results, with recurrently amplified oncogenes seen as potential TSGs. To address this issue, we determined the copy number at each SNP locus using the probe-level signal intensity data [
For each inferred copy number (
Models of human cancer, including xenografts and cell lines, are rarely accompanied by paired normal samples. The utility of such models may be enhanced if we can ascertain their patterns of LOH and relate them to those seen in actual human tumors. To this end, we next asked whether the HC/LD-HMM could detect regions of common LOH using 11 K SNP array data from 34 prostate cell lines, xenografts, and metastases where the corresponding normal DNA was unavailable (RB, unpublished data). We first scored each SNP by averaging the probability of LOH over all 34 samples (
The mean LOH probability across 34 prostate cancer samples is plotted along the left for all chromosomes. Peak regions of LOH are noted, and data from Chromosomes 8, 13, and 17 are highlighted on the right. These data are displayed as in
We have developed an HMM-based method to infer the probability of LOH events from tumor samples without matched normals. The method utilizes several sources of information, including intermarker distances, SNP genotyping and mapping error rates, and haplotype information. LOH inferences using only tumor samples agree well with LOH patterns determined by analysis of tumor/normal pairs in two different array types (10 K and 100 K), three different tissue types (lung, glioma, and prostate), and in both cell lines and tumors, in test and in validation datasets. The inferences are robust to model parameter specifications. LOH is resolved to about 3 Mb or 100 SNPs in 100 K array data. This method makes it feasible to use SNP array technology to map LOH in tumor samples for which normal DNA is unavailable. Given that genotyping paired normal samples constitutes up to half the cost of LOH mapping experiments, this method also makes it feasible to perform these experiments at a much lower cost per sample, at the expense of slightly reduced accuracy.
One advantage of a model-based approach over the existing tumor-only LOH inference methods [
At higher SNP densities, where the haplotype structure of the human genome becomes relevant, an approach that considers the dependence among multiple SNPs in a region of LD is necessary in addition to the LD-HMM. We used a haplotype correction that compared regions of putative LOH to a set of reference normal samples to reduce false LOH inferences. This method works best if the reference samples have similar haplotypes to the tumor sample. Use of reference samples from a different ethnic group tends not to decrease the sensitivity of the method, but can substantially decrease its specificity.
False designation of regions of LOH due to allelic imbalance may lead to paradoxical results, with recurrently amplified oncogenes seen as potential TSGs. SNP arrays, by providing signal intensity along with genotyping data, allow such regions to be identified. We can thus integrate these data to exclude regions of putative LOH with high copy numbers as likely due to allelic imbalance. At the interpretive level, our finding that LOH is often copy-neutral suggests that LOH and copy loss should be considered independently when predicting the presence of a TSG, and may best be used in conjoined analyses.
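The integration of copy number with inferred LOH can be pictured as a per-SNP labeling rule. The Python sketch below is a hypothetical illustration: the function name `classify_snp` and all three cutoffs (`loh_cutoff`, `loss_cutoff`, `amp_cutoff`) are assumptions chosen for this example and are not the thresholds used in the analyses described here.

```python
def classify_snp(p_loh, copy_number,
                 loh_cutoff=0.5, loss_cutoff=1.5, amp_cutoff=2.5):
    """Label one SNP from its inferred LOH probability and copy number.

    p_loh: posterior probability of LOH from the tumor-only inference.
    copy_number: inferred copy number at the SNP (2 is the diploid baseline).
    All cutoffs are illustrative assumptions.
    """
    if p_loh < loh_cutoff:
        return "RET"
    if copy_number > amp_cutoff:
        # Apparent homozygosity in an amplified region may reflect
        # allele-specific amplification rather than true LOH.
        return "possible allelic imbalance"
    if copy_number < loss_cutoff:
        return "copy-loss LOH"
    return "copy-neutral LOH"

# Example labels for a few (LOH probability, copy number) pairs.
labels = [classify_snp(p, cn) for p, cn in [(0.9, 1.0), (0.9, 2.0), (0.9, 4.0)]]
```

Keeping the copy-loss and copy-neutral labels separate reflects the point made above that LOH and copy loss carry distinct information when predicting the presence of a TSG.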
The ability to identify regions of LOH in tumors without paired normal DNA allows LOH mapping in the many model systems lacking paired normal DNA, including cell lines and xenografts. As such model systems are the platform for experiments aimed at understanding the biology of human tumors, it is critical that we understand their genetic relationship to real human tumors. As an example, among the prostate cancer samples, LOH at the
SNP array analysis of cancer genomes provides a single platform for copy number and LOH analysis. As these arrays move to higher resolution (500K), accounting for the haplotype structure of the human genome in the analysis of these data will be of greater import. The methods described herein should be readily extensible to both the higher density arrays and to the increasingly detailed information describing the haplotype structure of the human genome. The software package, dChipSNP, is freely available at
We used data from Early Access 10 K, Mapping 10 K, and 100 K SNP arrays (Affymetrix, Santa Clara, California, United States) (referred to as 10 K, 11 K and 100 K arrays, respectively) interrogating, respectively, 10,044, 11,555, and 116,204 SNP loci on all chromosomes except Y, with an average intermarker distance of 210 kb (11 K array) and 23.6 kb (100 K array) and average heterozygosity rate of 0.38 (11 K array) and 0.27 (100 K array) [
The heterozygosity rates for each SNP and the dependence information between the genotypes of neighboring SNPs were estimated from sets of normal samples; the haplotype correction was also performed against separate sets of normal samples (see
dChipSNP [
NumHom was applied using window sizes of 33 (with and without haplotype correction) and 50. Data are displayed as in
The proportion of LOSS and RET markers identified in paired tumor/normal data that were identified correctly in unpaired tumors in the 10 K dataset (A), 100 K training dataset (B), and 100 K validation dataset (C).
Ground truth was considered to be the results of an HMM applied to paired tumor/normal data.
The number and proportion of SNP Markers in the 100 K Validation Dataset with LOSS or RET in Tumor/Normal Pairs, inferred as LOSS or RET by the LD-HMM (A) and HC/LD-HMM (B), using reference samples from alternative ethnicities.
The percentage of LOH regions identified in 10 K data from tumor/normal pairs that were also identified by the HC/LD-HMM applied to the unpaired tumors, according to the size of the region (A) or number of SNPs present (B).
We thank L. J. Wei, M. Freedman, and D. Altshuler for helpful discussions and J. G. Paez and C. Rosenow for training data. K. Pienta and the U. Michigan Prostate SPORE provided tumor tissues, and R. Vessella (U. Washington) and C. Sawyers (UCLA) provided prostate xenograft DNA.
Het: heterozygous
HMM: hidden Markov model
Hom: homozygous
LD: linkage disequilibrium
LOH: loss of heterozygosity
LOSS: loss
Mb: megabase(s)
RET: retention
SNP: single-nucleotide polymorphism
TSG: tumor suppressor gene