The authors have declared that no competing interests exist.
Conceived and designed the experiments: ÖT MD CB SK MK JCC US. Performed the experiments: JdG MW CP CK. Analyzed the data: ML BYR JCC. Contributed reagents/materials/analysis tools: MD. Wrote the paper: ML BYR JCC. Designed the software used in analysis: ML BYR JCC. Implemented the software: ML BYR. Processed the sequencing data: ML.
¶ These authors are joint senior authors on this work.
Next generation sequencing (NGS) has enabled high throughput discovery of somatic mutations. Detection depends on experimental design, lab platforms, parameters and analysis algorithms. However, NGS-based somatic mutation detection is prone to erroneous calls, with reported validation rates near 54% and congruence between algorithms less than 50%. Here, we developed an algorithm to assign a single statistic, a false discovery rate (FDR), to each somatic mutation identified by NGS. This FDR confidence value accurately discriminates true mutations from erroneous calls. Using sequencing data generated from triplicate exome profiling of C57BL/6 mice and B16-F10 melanoma cells, we used the existing algorithms GATK, SAMtools and SomaticSNiPer to identify somatic mutations. For each identified mutation, our algorithm assigned an FDR. We selected 139 mutations for validation, including 50 somatic mutations assigned a low FDR (high confidence) and 44 mutations assigned a high FDR (low confidence). All of the high confidence somatic mutations validated (50 of 50), none of the 44 low confidence somatic mutations validated, and 15 of 45 mutations with an intermediate FDR validated. Furthermore, the assignment of a single FDR to individual mutations enables statistical comparisons of lab and computation methodologies, including ROC curves and AUC metrics. Using the HiSeq 2000, single end 50 nt reads from replicates generate the highest confidence somatic mutation call set.
Next generation sequencing (NGS) has enabled unbiased, high throughput discovery of genetic variations and somatic mutations. However, the NGS platform is still prone to errors resulting in inaccurate mutation calls. A statistical measure of the confidence of putative mutation calls would enable researchers to prioritize and select mutations in a robust manner. Here we present our development of a confidence score for mutation calls and apply the method to the identification of somatic mutations in B16 melanoma. We use NGS exome resequencing to profile triplicates of both the reference C57BL/6 mice and the B16-F10 melanoma cells. These replicate data allow us to formulate the false discovery rate of somatic mutations as a statistical quantity. Using this method, we show that 50 of 50 high confidence mutation calls are correct while 0 of 44 low confidence mutations are correct, demonstrating that the method is able to correctly rank mutation calls.
Next generation sequencing (NGS) has revolutionized our ability to determine genomes and compare, for example, tumor to normal cells to identify somatic mutations. However, the platform is not error free and various experimental and algorithmic factors contribute to the false positive rate when identifying somatic mutations
Given the large discrepancies, one is left wondering which mutations to select, such as for clinical decision making or ranking for follow-up experiments. Ideally, each mutation call would come with a statistical value, such as a p-value, indicating its confidence. Error sources have been addressed by examining bulk sets of mutations, such as by computational methods that estimate the expected number of false positive mutation calls from the transition/transversion ratio of a set of variations
Several NGS mutation identification algorithms do output multiple parameters for each mutation call, such as coverage, genotype quality and consensus quality. However, it is not clear whether and how these metrics indicate that a mutation call is correct. Furthermore, because multiple parameters are generated for each mutation call, one cannot simply rank or prioritize mutations using the values. Instead, researchers often rely on personal experience and arbitrary filtering thresholds to select mutations. In summary, a) there is a low level of congruence between somatic mutations identified by different algorithms and sequencing platforms and b) there is no method to assign a single accuracy estimate to individual mutations.
Here, we develop a methodology to assign a confidence value, a false discovery rate (FDR), to individual identified mutations. This algorithm does not identify mutations but rather estimates the accuracy of each mutation. The method is applicable both to the selection and prioritization of mutations and to the development of algorithms and methods. Using Illumina HiSeq reads and the algorithms GATK, SAMtools and SomaticSNiPer, we identified 4,078 somatic mutations in B16 melanoma cells. We assigned an FDR to each mutation and show that 50 of 50 mutations with low FDR (high confidence) validated while 0 of 44 with high FDR (low confidence) validated.
Somatic mutation discovery involves the determination and comparison of two genomes: the “normal” germline genome from non-cancerous cells and the tumor genome. If one, however, sequences one sample multiple times, such as the normal genome, and then compares the replicates, one should identify no differences. Thus, any mutation detected in this “same versus same comparison” is a false positive. These can be generated during sample extraction, sample preparation, amplification and library construction, NGS sequencing and data analysis.
To determine the false discovery rate (FDR) for each somatic mutation detected in a tumor sample relative to a normal sample (“tumor versus normal comparison”), we first define and assign a quality score Q to each identified mutation. Then, we count the number of “same versus same” mutations (false positives) at the same or greater quality score (
To discover mutations, DNA from tail tissue of three black6 mice, all litter mates, and DNA from three B16 melanoma samples, was extracted and exon-encoding sequences were captured, resulting in six samples. RNA was extracted from B16 cells in triplicate. Single end 50 nt (1×50 nt) and paired end 100 nt (2×100 nt) reads were generated on an Illumina HiSeq 2000 (Supplementary Table S1 in
Somatic mutations were independently identified using the software packages SAMtools
Numbers for the individual steps are given as an example for one B16 sample, compared to one black6 sample. “Exons” refers to the exon coordinates defined by all protein coding RefSeq transcripts.
We want to assign each somatic mutation a single quality score Q that could be used to rank mutations based on confidence. However, it is not straightforward to assign a single value since most mutation detection algorithms output multiple scores, each reflecting a different quality aspect. Thus, we generated a random forest classifier
After defining a relevant quality score, we sought to re-define the score into a statistically relevant false discovery rate (FDR). We determined, at each Q value, the number of mutations with a better Q score in the “same versus same” and the number of mutations with a better Q score in the “tumor versus normal” pair. For a given mutation with quality score Q detected in the “tumor versus normal” comparison, we estimate the false discovery rate by computing the ratio of “same versus same” mutations with a score of Q or better to the overall number of mutations found in the tumor comparison with a score of Q or better.
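The ratio described above can be sketched in a few lines of Python. This is a minimal illustration rather than the published R implementation: the function name `make_fdr_function` and the toy scores are hypothetical, a higher Q is assumed to mean better quality, and the coverage normalization is left out.

```python
from bisect import bisect_left


def make_fdr_function(same_scores, tumor_scores):
    """Build FDR(Q) from two lists of quality scores (higher = better).

    same_scores:  Q scores of calls from a "same versus same" comparison
                  (all assumed to be false positives).
    tumor_scores: Q scores of calls from a "tumor versus normal" comparison.
    """
    same_sorted = sorted(same_scores)
    tumor_sorted = sorted(tumor_scores)

    def fdr(q):
        # Number of calls with a score of q or better in each comparison.
        n_same = len(same_sorted) - bisect_left(same_sorted, q)
        n_tumor = len(tumor_sorted) - bisect_left(tumor_sorted, q)
        if n_tumor == 0:
            return None  # no tumor call at this quality level; ratio undefined
        # Uncapped ratio; values above 1 are possible at low quality scores.
        return n_same / n_tumor

    return fdr


# Toy example with made-up scores.
fdr = make_fdr_function(same_scores=[0.1, 0.2, 0.9],
                        tumor_scores=[0.3, 0.8, 0.9, 0.95])
print(fdr(0.8))   # one same-vs-same call and three tumor calls at Q >= 0.8
```

Because the returned function depends only on the two score lists, it can be built once from a reference comparison and then evaluated for any Q.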
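A potential bias in comparing methods is differential coverage; we thus normalize the false discovery rate for the number of bases covered by NGS reads in each sample:

\[
\mathrm{FDR}(Q) = \frac{N_{\text{same vs. same}}(\geq Q)}{N_{\text{tumor vs. normal}}(\geq Q)} \times \frac{C_{\text{tumor vs. normal}}}{C_{\text{same vs. same}}}
\]

where $N(\geq Q)$ is the number of mutations with a score of Q or better in the respective comparison and $C$ is the common coverage of that comparison.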
We calculate the common coverage by counting all bases of the reference genome which are covered by data of the tumor and normal sample or by both “same versus same” samples, respectively. After assigning our FDR to each mutation, the FDR-sorted list of somatic mutations shows a clear preference of mutations found by three programs in the low FDR region (
We identified 50 mutations with a low FDR (high confidence) for validation, including 41 with an FDR less than 0.05 (
Chromosome | Position | Reference allele | Sample allele(s) | FDR |
8 | 110078987 | G | A/G | 0.006 |
1 | 59540714 | G | G/C | 0.007 |
5 | 124854313 | G | G/T | 0.007 |
10 | 59352802 | C | A/C | 0.007 |
16 | 36919828 | A | A/C | 0.007 |
2 | 144078227 | C | C/T | 0.007 |
8 | 12834637 | G | G/C | 0.007 |
19 | 6121411 | T | C/T | 0.007 |
1 | 58533360 | A | A/C | 0.007 |
15 | 98478052 | A | A/G | 0.007 |
None of these mutations is present in dbSNP (version 128; genome assembly mm9).
We selected 44 mutations identified by at least one detection algorithm, present in only one B16 sample and assigned a high FDR (>0.5) by our algorithm (
Both mutations are predicted by GATK, SomaticSNiPer and SAMtools. The mean coverage is 54 (true positive) and 10 (false positive), respectively. Only four reads are shown for visual clarity. The red box marks the sample in which the three mutation callers wrongly detected an SNV.
To test mutations with less extreme FDRs, we selected 45 somatic mutations distributed evenly across the FDR spectrum from 0.1 to 0.6. Validation using both Sanger sequencing and inspection of the RNA-Seq reads resulted in 15 positive validations (confirmed by either Sanger sequencing or RNA-Seq reads), 22 negative validations (confirmed by neither Sanger sequencing nor RNA-Seq reads) and 8 inconclusive results (failed sequencing reactions and no RNA-Seq coverage). See the
We computed a receiver operating characteristic (ROC) curve for all 131 validated mutations (
ROC curves and the corresponding AUC are useful for comparing classifiers and visualizing their performance
ROC curves and the associated AUC values can be compared across experiments, lab protocols, and algorithms. For the following comparisons, we used all somatic mutations found by any algorithm and in any tumor-normal pairing without applying any filter procedure. We considered only those mutations in target regions (exons).
First, we tested the influence of the reference “same versus same” data on the calculation of the FDRs. Using the triplicate black6 and B16 sequencing runs, we created 18 triplets (combinations of “black6 versus black6” and “black6 versus B16”) to use for calculating the FDR. When comparing the resulting FDR distributions for the sets of somatic mutations, the results are consistent when the reference data sets are exchanged (
Using our definition of a false discovery rate, we have established a generic framework for evaluating the influence of numerous experimental and algorithmic parameters on the resulting set of somatic mutations. We apply this framework to study the influence of software tools, coverage, paired end sequencing and the number of technical replicates on somatic mutation identification.
First, the choice of the software tool has a clear impact on the identified somatic mutations (
The impact of the coverage depth on whole genome SNV detection has been recently discussed
It is straightforward to simulate and rank other experimental settings using the available data and framework (
The 2×100 nt library was used to create 6 libraries: a 2×100 nt library; a 1×100 nt library; a 1×50 nt library using the 50 nucleotides at the 5′ end of the first read; a 1×50 nt library using the nucleotides 51 to 100 at the 3′ end of the first read; a 2×50 nt read using nucleotides 1 to 50 of both reads; and a 2×50 nt library using nucleotides 51 to 100 of both reads. These libraries were compared using the calculated FDRs of predicted mutations (
NGS is a revolutionary platform for detecting somatic mutations. However, the error rates are not insignificant, with different detection algorithms identifying mutations with less than 50% congruence. Other high throughput genomic profiling platforms have developed methods to assign confidence values to each call, such as p-values associated with differential expression calls from oligonucleotide microarray data. Similarly, we developed here a method to assign a confidence value (FDR) to each identified mutation.
From the set of mutations identified by the different algorithms, the FDR accurately ranks mutations based on likelihood of being correct. Indeed, we selected 50 high confidence mutations and all 50 validated; we selected 45 intermediate confidence mutations, of which 15 validated, 22 were not present and 8 were inconclusive; we selected 44 low confidence mutations and none validated. Again, all 139 mutations were identified by at least one of the detection algorithms. Unlike a consensus or majority voting approach, the assigned FDR not only effectively segregates true and false positives but also provides both the likelihood that the mutation is true and a statistical ranking. Also, our method can be adjusted to a desired sensitivity or specificity, which enables the detection of more true mutations than a consensus or majority vote, which report only 50 or 52 of all 65 validated mutations.
We applied the method to a set of B16 melanoma cell experiments. However, the method is not restricted to these data. The only requirement is the availability of a “same versus same” reference dataset, meaning at least a single replicate of a non-tumorous sample should be performed for each new protocol. Our experiments indicate that the method is robust with regard to the choice of the replicate, so that a replicate is not necessarily required in every single experiment. Once done, the derived FDR(Q) function can be reused when the Q scores are comparable (i.e. when the same program for mutation discovery was used). Here, we profiled all samples in triplicate; nevertheless, the method produces FDRs for each mutation from single-run tumor and normal profiles (non-replicates) using the FDR(Q) function. We do show, however, that duplicates improve data quality.
Furthermore, the framework enables one to define best practice procedures for the discovery of somatic mutations. For cell lines, at least 20-fold coverage and a replicate achieve close to the optimum results. A 1×50 nt library resulting in approximately 100 million reads is a pragmatic choice to achieve this coverage.
The possibility of using a reference data set to rank the results of another experiment can also be exploited, for example, to score somatic mutations found in different normal tissues by similar methods. Here, one would expect relatively few true mutations, so an independent set of reference data will improve the resolution of the FDR calculations.
While we define the optimum as the lowest number of false positive mutation calls, this definition might not suffice for other experiments, such as for genome wide association studies. However, our method allows the evaluation of the sensitivity and specificity of a given mutation set and we show application of the framework to four specific questions. The method is by no means limited to these parameters, but can be applied to study the influence of all experimental or algorithmic parameters, e.g. the influence of the alignment software, the choice of a mutation metric or the choice of vendor for exome selection.
In summary, we have pioneered a statistical framework for the assignment of a false discovery rate to detected somatic mutations. This framework allows for a generic comparison of experimental and computational protocol steps on generated quasi ground truth data. Furthermore, it is applicable to diagnostic or therapeutic target selection, as it is able to distinguish true mutations from false positives.
Next-generation sequencing, DNA sequencing: Exome capture for DNA resequencing was performed using the Agilent Sure-Select solution-based capture assay
3 µg purified genomic DNA was fragmented to 150–200 nt using a Covaris S2 ultrasound device. gDNA fragments were end repaired using T4 DNA polymerase, Klenow DNA polymerase and 5′ phosphorylated using T4 polynucleotide kinase. Blunt ended gDNA fragments were 3′ adenylated using Klenow fragment (3′ to 5′ exo minus). 3′ single T-overhang Illumina paired end adapters were ligated to the gDNA fragments using a 10∶1 molar ratio of adapter to genomic DNA insert using T4 DNA ligase. Adapter ligated gDNA fragments were enriched pre capture and flow cell specific sequences were added using Illumina PE PCR primers 1.0 and 2.0 and Herculase II polymerase (Agilent) using 4 PCR cycles.
500 ng of adapter ligated, PCR enriched gDNA fragments were hybridized to Agilent's SureSelect biotinylated mouse whole exome RNA library baits for 24 hrs at 65°C. Hybridized gDNA/RNA bait complexes were removed using streptavidin coated magnetic beads. gDNA/RNA bait complexes were washed and the RNA baits cleaved off during elution in SureSelect elution buffer leaving the captured adapter ligated, PCR enriched gDNA fragments. gDNA fragments were PCR amplified post capture using Herculase II DNA polymerase (Agilent) and SureSelect GA PCR Primers for 10 cycles. Cleanups were performed using 1.8× volume of AMPure XP magnetic beads (Agencourt). For quality controls we used Invitrogen's Qubit HS assay and fragment size was determined using Agilent's 2100 Bioanalyzer HS DNA assay. Exome enriched gDNA libraries were clustered on the cBot using Truseq SR cluster kit v2.5 using 7 pM and sequenced on the Illumina HiSeq2000 using Truseq SBS kit.
Sequence reads were aligned using bwa (version 0.5.8c)
For each sequencing lane, mutations were identified using three software programs: SAMtools pileup (version 0.1.8)
Barcoded mRNA-seq cDNA libraries were prepared from 5 µg of total RNA using a modified version of the Illumina mRNA-seq protocol. mRNA was isolated using Seramag Oligo(dT) magnetic beads (Thermo Scientific). Isolated mRNA was fragmented using divalent cations and heat resulting in fragments ranging from 160–200 bp. Fragmented mRNA was converted to cDNA using random primers and SuperScriptII (Invitrogen) followed by second strand synthesis using DNA polymerase I and RNaseH. cDNA was end repaired using T4 DNA polymerase, Klenow DNA polymerase and 5′ phosphorylated using T4 polynucleotide kinase. Blunt ended cDNA fragments were 3′ adenylated using Klenow fragment (3′ to 5′ exo minus). 3′ single T-overhang Illumina multiplex specific adapters were ligated on the cDNA fragments using T4 DNA ligase. cDNA libraries were purified and size selected at 300 bp using the E-Gel 2% SizeSelect gel (Invitrogen). Enrichment, adding of Illumina six base index and flow cell specific sequences was done by PCR using Phusion DNA polymerase (Finnzymes). All cleanups were performed using 1.8× volume of Agencourt AMPure XP magnetic beads.
Barcoded RNA-seq libraries were clustered on the cBot using Truseq SR cluster kit v2.5 using 7 pM and sequenced on the Illumina HiSeq2000 using Truseq SBS kit.
The raw output data of the HiSeq was processed according to the Illumina standard protocol, including removal of low quality reads and demultiplexing. Sequence reads were then aligned to the reference genome sequence
We selected SNVs for validation by Sanger re-sequencing and RNA. We identified SNVs that were predicted by all three programs, were non-synonymous and were found in transcripts having a minimum of 10 RPKM. Of these, we selected the 50 with the highest SNP quality scores as provided by the programs. As a negative control, 44 SNVs were selected which have an FDR of 0.5 or more, are present in only one cell line sample and are predicted by only one mutation calling program. 45 mutations with intermediate FDR levels were selected. Using DNA, the selected variants were validated by PCR amplification of the regions using 50 ng of DNA (see
Random Forest Quality Score Computation: Commonly-used mutation calling algorithms (
We use the following strategy to achieve a complete ordering. In a first step, we apply a very rigorous definition of superiority by assuming that a mutation has better quality than another if and only if it is superior in all categories. So a set of quality properties S = (s1,…,sn) is preferable to T = (t1,…,tn), denoted by S>T, if si>ti for all i = 1,…,n. We define an intermediate FDR (IFDR) as follows
However, we regard the IFDR only as an intermediate step since in many closely related cases, no comparison is feasible and we are thus not benefitting from the vast amount of data available. Thus, we take advantage of the good generalization property of random forest regression
The resulting regression score is our generalized quality score Q; it can be regarded as a locally weighted combination of the individual quality scores. It allows direct, single value comparison of any two mutations and the computation of the actual false discovery rate:
For the training of the random forest models used to create the results for this study, we calculate the sample IFDR on the somatic mutations of all samples before selecting the random 1% subset. This ensures the mapping of the whole available quality space to FDR values. We used the quality properties “SNP quality”, “coverage depth”, “consensus quality” and “RMS mapping quality” (SAMtools,
To acquire the “same vs. same” and “same vs. different” data when calculating the FDRs for a given set of mutations, we use all variants generated by the different programs without any additional filtering.
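The dominance-based ordering behind the intermediate FDR can be illustrated with a short Python sketch. It rests on several assumptions that are not taken from the text: each mutation is represented as a tuple of quality properties where higher is better, calls that are "at least as good in every property" are counted (so that a mutation counts itself and the denominator is not zero), and the names `at_least_as_good` and `intermediate_fdr` are hypothetical. The subsequent random forest regression that turns IFDR values into Q is not shown.

```python
def at_least_as_good(candidate, reference):
    """True if candidate is not worse than reference in any quality property."""
    return all(c >= r for c, r in zip(candidate, reference))


def intermediate_fdr(props, same_calls, tumor_calls, coverage_ratio=1.0):
    """Dominance-based FDR estimate for one mutation.

    props:          tuple of quality properties of the mutation of interest
    same_calls:     property tuples of "same vs. same" calls (false positives)
    tumor_calls:    property tuples of "same vs. different" (tumor) calls
    coverage_ratio: common coverage of the tumor comparison divided by that
                    of the "same vs. same" comparison (assumed 1.0 here)
    """
    n_same = sum(at_least_as_good(t, props) for t in same_calls)
    n_tumor = sum(at_least_as_good(t, props) for t in tumor_calls)
    if n_tumor == 0:
        return None  # nothing comparable; the estimate is undefined
    return coverage_ratio * n_same / n_tumor


# Toy example with two made-up properties per call, e.g. (SNP quality, depth).
same_calls = [(10, 5), (50, 40)]
tumor_calls = [(60, 50), (55, 45), (20, 10), (30, 3)]
print(intermediate_fdr((20, 10), same_calls, tumor_calls))
```

The sketch also shows the limitation noted above: the call `(30, 3)` is neither better nor worse than `(20, 10)` in all properties, so it cannot be used in the comparison.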
Common coverage computation: The number of possible mutation calls can introduce a major bias in the definition of a false discovery rate. The numbers of called mutations are comparable, and can serve as a basis for a false discovery rate computation, only if the tumor comparison and the “same vs. same” comparison have the same number of possible locations at which mutations can occur. To correct for this potential bias, we use the common coverage ratio. We define the common coverage as the number of bases with a coverage of at least one in both samples used for the mutation calling. We compute the common coverage individually for the tumor comparison and for the “same vs. same” comparison.
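As a minimal sketch of this correction, the Python snippet below counts commonly covered bases from two per-base depth lists and rescales the raw ratio of call counts. The per-base list representation, the function names and the direction of the ratio (tumor comparison over "same vs. same" comparison) are assumptions made for illustration; real pipelines would typically derive coverage from alignment files instead.

```python
def common_coverage(depth_a, depth_b):
    """Count positions covered by at least one read in both samples.

    depth_a, depth_b: per-base read depths over the same reference positions.
    """
    return sum(1 for a, b in zip(depth_a, depth_b) if a >= 1 and b >= 1)


def coverage_normalized_fdr(n_same, n_tumor, cc_tumor, cc_same):
    """Scale the raw ratio of call counts by the common coverage ratio.

    n_same / cc_same is the false positive rate per commonly covered base;
    multiplying by cc_tumor gives the expected number of false positives
    in the tumor comparison, which is then divided by the tumor call count.
    """
    return (n_same / n_tumor) * (cc_tumor / cc_same)


# Toy example: four reference positions, made-up depths.
normal_1 = [0, 3, 5, 1]
normal_2 = [2, 0, 4, 1]
print(common_coverage(normal_1, normal_2))  # positions 3 and 4 are shared
```

If the "same vs. same" comparison covers twice as many bases as the tumor comparison, the raw ratio is halved, since only half of the reference false positives would be expected in the smaller region.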
The estimation of the ROC curves should satisfy the following criteria:
When all calculated FDRs are 0.5, one cannot use these rates to select true positive mutations. This should be reflected by a diagonal line from (0,0) to (1,1) in the ROC plot resulting in a ROC AUC of 0.5, which indicates a completely random prediction.
The normal calculation of ROC curves involves summing up the TP counts and FP counts, respectively, up to a given score threshold. Here, an individual mutation does not add one to either the TP or the FP count; instead, it adds a fraction to each sum, depending on its FDR. The two fractions add up to one.
We start with two conditions;
To obtain an estimated ROC curve, the mutations in the dataset are sorted by FDR and for each mutation a point is plotted at the cumulative TPR and FPR values up to this mutation, divided by the sum of all TPR and FPR values, respectively. The AUC is calculated by summing up the areas of all consecutive trapezoids between the curve and the x-axis.
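The procedure can be sketched in Python as follows. The sketch assumes that a mutation with a given FDR contributes `1 - FDR` to the true positive sum and `FDR` to the false positive sum, which is one natural reading of the fractional counting described above; the function names are hypothetical.

```python
def estimated_roc(fdrs):
    """Return ROC points (FPR, TPR) estimated from per-mutation FDRs."""
    ordered = sorted(fdrs)                 # most confident calls first
    tp_total = sum(1.0 - f for f in ordered)
    fp_total = sum(ordered)
    if tp_total == 0 or fp_total == 0:
        raise ValueError("need both expected true and false positives")
    points = [(0.0, 0.0)]
    tp = fp = 0.0
    for f in ordered:
        tp += 1.0 - f                      # fractional true positive count
        fp += f                            # fractional false positive count
        points.append((fp / fp_total, tp / tp_total))
    return points


def auc(points):
    """Area under the curve as the sum of consecutive trapezoids."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Uninformative FDRs give the diagonal; well separated FDRs give a high AUC.
print(auc(estimated_roc([0.5, 0.5, 0.5, 0.5])))
print(auc(estimated_roc([0.01, 0.01, 0.99, 0.99])))
```

With all FDRs equal to 0.5, every point lies on the diagonal and the AUC is 0.5, which matches the first criterion stated above.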
The program is implemented in R and is available from
12,460 somatic mutations found in triplicate samples.
(XLS)
Validation results for mutations with an intermediate FDR.
(XLS)
Alignment statistics for all samples.
(XLS)
Primer sequences.
(XLS)
Selection of a filtering threshold for SomaticSNiPer, discussion of paired end library results and additional figures (S1, S2, S3, S4, S5, S6, S7, S8) and tables (S1, S2, S3).
(PDF)