Conceived and designed the experiments: AM AZ MAM MH TON SCS DH SPS. Performed the experiments: AM. Analyzed the data: AM. Contributed reagents/materials/analysis tools: AM. Wrote the paper: AM SPS. Assisted with algorithm development: FH. Software development: RG MGFS. Copy number variation analysis: GH. Development of precursor software: MG. Gene expression analysis: AHM. RT-PCR validation: JS. FISH validation: NM MP.
The authors have declared that no competing interests exist.
Gene fusions created by somatic genomic rearrangements are known to play an important role in the onset and development of some cancers, such as lymphomas and sarcomas. RNA-Seq (whole transcriptome shotgun sequencing) is proving to be a useful tool for the discovery of novel gene fusions in cancer transcriptomes. However, algorithmic methods for the discovery of gene fusions using RNA-Seq data remain underdeveloped. We have developed deFuse, a novel computational method for fusion discovery in tumor RNA-Seq data. Unlike existing methods that use only unique best-hit alignments and consider only fusion boundaries at the ends of known exons, deFuse considers all alignments and all possible locations for fusion boundaries. As a result, deFuse is able to identify fusion sequences with demonstrably better sensitivity than previous approaches. To increase the specificity of our approach, we curated a list of 60 true positive and 61 true negative fusion sequences (as confirmed by RT-PCR), and have trained an adaboost classifier on 11 novel features of the sequence data. The resulting classifier has an estimated value of 0.91 for the area under the ROC curve. We have used deFuse to discover gene fusions in 40 ovarian tumor samples, one ovarian cancer cell line, and three sarcoma samples. We report herein the first gene fusions discovered in ovarian cancer. We conclude that gene fusions are not infrequent events in ovarian cancer and that these events have the potential to substantially alter the expression patterns of the genes involved; gene fusions should therefore be considered in efforts to comprehensively characterize the mutational profiles of ovarian cancer transcriptomes.
Genome rearrangements and associated gene fusions are known to be important oncogenic events in some cancers. We have developed a novel computational method called deFuse for detecting gene fusions in RNA-Seq data and have applied it to the discovery of novel gene fusions in sarcoma and ovarian tumors. We assessed the accuracy of our method and found that deFuse achieves substantially better sensitivity and specificity than two other published methods. We have also developed a set of 60 positive and 61 negative examples that will be useful for accurate identification of gene fusions in future RNA-Seq datasets. We have trained a classifier on 11 novel features of the 121 examples, and show that the classifier is able to accurately identify real gene fusions. The 45 gene fusions reported in this study include the first fusions reported in ovarian cancer, as well as novel sarcoma fusions. By examining the expression patterns of the affected genes, we find that many fusions are predicted to have functional consequences and thus merit experimental follow-up to determine their clinical relevance.
Gene fusions are known to play an important role in the development of haematological disorders and childhood sarcomas, while the recent discovery of ETS gene fusions in prostate cancer
Gene fusions are thought to arise predominantly from double stranded DNA breakages followed by a DNA repair error
Large scale, genome-wide efforts to comprehensively identify and characterize genomic rearrangements that lead to gene fusions in human cancers have recently been made possible through next generation sequencing technologies. These technologies provide a deeper level of sequencing than is possible by cytogenetic and Sanger sequencing methods and are poised to reveal a more detailed understanding of the extent and nature of genomic rearrangements in cancer. For example, using low-coverage paired end whole genome (gDNA) shotgun sequencing, Stephens et al.
Next generation sequencing of cDNA (RNA-Seq or whole transcriptome shotgun sequencing) provides an ideal experimental platform for expressed gene fusion discovery. Analogous to genome sequencing, RNA-Seq enables an unbiased and relatively comprehensive view into tumor transcriptomes, and can provide information about the rarest of transcripts. RNA-Seq targets only expressed sequences from protein coding genes and is thus more focused than whole genome sequencing. Maher et al.
Sequence reads that align across a gene fusion boundary (so-called
With the goal of resolving the limitations described above and therefore providing a more accurate method for detecting gene fusions from RNA-Seq, we developed a novel algorithm called
We obtained three sarcomas and 40 ovarian carcinomas from the OvCaRe (Ovarian Cancer Research) frozen tumor bank. Patients provided written informed consent for research using these tumor samples before undergoing surgery, and the consent form acknowledged that a loss of confidentiality could occur through the use of samples for research. Separate approval from the hospital's institutional review board was obtained to permit the use of these samples for RNA-sequencing experiments.
We interrogated the transcriptomes of a cell line derived from a serous borderline tumor, in addition to three sarcomas and 40 ovarian carcinomas obtained from the OvCaRe (Ovarian Cancer Research) frozen tumor bank. Pathology review, sample preparation, RNA extraction, RNA-Seq library construction and RNA-Seq sequence data generation using Illumina GA
Case | Type | Reads (Millions) | Read Length | Fragment Mean | Fragment Std. Dev. | Total Fusions | In-frame | Inter-chr. | Intra-chr. | Read-through | Inversion | Eversion | Deletion |
SBOT | LGS | 28 | 36–42 | 210 | 38 | 24 | 2 | 2 | 22 | 17 | 3 | 1 | 1 |
CCC1 | CCC | 18 | 50 | 282 | 36 | 49 | 6 | 10 | 39 | 30 | 1 | 3 | 5 |
CCC2 | CCC | 38 | 50 | 198 | 29 | 27 | 0 | 5 | 22 | 18 | 3 | 0 | 1 |
CCC3 | CCC | 37 | 50 | 209 | 27 | 34 | 2 | 6 | 28 | 22 | 3 | 0 | 3 |
CCC4 | CCC | 20 | 50 | 249 | 41 | 55 | 7 | 7 | 48 | 33 | 7 | 8 | 0 |
CCC5 | CCC | 32 | 36–42 | 245 | 36 | 26 | 1 | 6 | 20 | 17 | 2 | 0 | 1 |
CCC6 | CCC | 32 | 36–42 | 234 | 38 | 14 | 3 | 0 | 14 | 10 | 1 | 1 | 2 |
CCC7 | CCC | 19 | 50 | 259 | 39 | 48 | 4 | 12 | 36 | 21 | 4 | 6 | 5 |
CCC8 | CCC | 39 | 36–42 | 242 | 38 | 41 | 7 | 12 | 29 | 15 | 2 | 10 | 2 |
CCC9 | CCC | 38 | 50 | 265 | 41 | 62 | 13 | 10 | 52 | 35 | 5 | 6 | 6 |
CCC10 | CCC | 37 | 50 | 278 | 38 | 97 | 10 | 12 | 85 | 75 | 5 | 1 | 4 |
CCC11 | CCC | 53 | 36–42 | 259 | 39 | 64 | 2 | 24 | 40 | 33 | 4 | 2 | 1 |
CCC12 | CCC | 36 | 36–42 | 244 | 31 | 40 | 7 | 10 | 30 | 18 | 8 | 4 | 0 |
CCC13 | CCC | 31 | 50 | 263 | 35 | 74 | 10 | 15 | 59 | 49 | 3 | 6 | 1 |
CCC14 | CCC | 40 | 50 | 250 | 39 | 82 | 8 | 13 | 69 | 56 | 6 | 4 | 3 |
CCC15 | CCC | 40 | 50 | 189 | 29 | 53 | 6 | 16 | 37 | 19 | 5 | 9 | 4 |
CCC16 | CCC | 41 | 50 | 229 | 27 | 80 | 2 | 16 | 64 | 46 | 5 | 8 | 5 |
EMD1 | EMD | 32 | 36–50 | 187 | 35 | 62 | 5 | 9 | 53 | 37 | 9 | 1 | 6 |
EMD2 | EMD | 30 | 42–50 | 208 | 33 | 64 | 7 | 2 | 62 | 50 | 6 | 5 | 1 |
EMD3 | EMD | 33 | 50 | 227 | 31 | 40 | 4 | 6 | 34 | 30 | 4 | 0 | 0 |
EMD4 | EMD | 38 | 50 | 242 | 33 | 58 | 7 | 3 | 55 | 41 | 8 | 3 | 3 |
EMD5 | EMD | 39 | 50–75 | 244 | 29 | 49 | 6 | 3 | 46 | 38 | 4 | 2 | 2 |
EMD6 | EMD | 39 | 50 | 246 | 34 | 85 | 11 | 12 | 73 | 45 | 11 | 11 | 6 |
EMD7 | EMD | 25 | 42–50 | 211 | 33 | 23 | 4 | 3 | 20 | 15 | 3 | 1 | 1 |
EMD8 | EMD | 30 | 50–75 | 189 | 31 | 51 | 7 | 3 | 48 | 40 | 2 | 3 | 3 |
GRC1 | GRC | 58 | 36–50 | 206 | 39 | 105 | 14 | 10 | 95 | 78 | 8 | 3 | 6 |
GRC2 | GRC | 74 | 36–42 | 183 | 39 | 95 | 5 | 15 | 80 | 60 | 12 | 0 | 8 |
GRC3 | GRC | 31 | 36–42 | 196 | 37 | 38 | 3 | 5 | 33 | 29 | 3 | 1 | 0 |
GRC4 | GRC | 34 | 36–42 | 172 | 34 | 46 | 5 | 7 | 39 | 27 | 7 | 1 | 4 |
GRC5 | GRC | 41 | 50–75 | 247 | 31 | 101 | 9 | 8 | 93 | 71 | 16 | 0 | 6 |
HGS1 | HGS | 39 | 50 | 241 | 37 | 73 | 6 | 9 | 64 | 51 | 8 | 2 | 3 |
HGS2 | HGS | 29 | 50 | 278 | 38 | 75 | 12 | 8 | 67 | 58 | 5 | 1 | 3 |
HGS3 | HGS | 26 | 37–42 | 211 | 34 | 80 | 7 | 15 | 65 | 59 | 3 | 2 | 1 |
HGS4 | HGS | 30 | 36–42 | 209 | 33 | 54 | 3 | 11 | 43 | 20 | 8 | 10 | 5 |
HGS5 | HGS | 33 | 50 | 220 | 25 | 92 | 7 | 11 | 81 | 65 | 7 | 6 | 3 |
LGS1 | LGS | 35 | 50 | 242 | 26 | 47 | 8 | 3 | 44 | 34 | 9 | 0 | 1 |
MUC1 | MUC | 42 | 36–50 | 208 | 30 | 66 | 8 | 11 | 55 | 44 | 6 | 3 | 2 |
MUC2 | MUC | 33 | 36 | 224 | 31 | 61 | 9 | 11 | 50 | 37 | 10 | 1 | 2 |
SCH1 | SCH | 24 | 50–75 | 210 | 30 | 43 | 3 | 11 | 32 | 27 | 5 | 0 | 0 |
SCH2 | SCH | 35 | 36–50 | 201 | 31 | 46 | 0 | 6 | 40 | 34 | 4 | 1 | 1 |
YKS1 | YKS | 46 | 50 | 249 | 27 | 44 | 6 | 5 | 39 | 34 | 3 | 1 | 1 |
YKS2 | YKS | 40 | 50 | 252 | 31 | 49 | 5 | 11 | 38 | 32 | 3 | 1 | 2 |
SARC1 | EPS | 19 | 50 | 263 | 35 | 39 | 1 | 6 | 33 | 27 | 3 | 2 | 1 |
SARC2 | EPS | 28 | 36–50 | 333 | 36 | 69 | 10 | 10 | 59 | 51 | 6 | 0 | 2 |
SARC3 | IGMS | 17 | 50 | 233 | 33 | 15 | 2 | 4 | 11 | 10 | 0 | 1 | 0 |
Summary of RNA-Seq statistics and fusion predictions across all samples. LGS: Low Grade Serous, HGS: High Grade Serous, CCC: Clear cell carcinoma, EMD: Endometrioid tumor, MUC: Mucinous tumor, YKS: Yolk sac tumor, GRC: Granulosa cell tumor, SCH: Small cell hypercalcemic, EPS: Epithelioid Sarcoma, IGMS: intermediate grade myofibroblastic sarcoma.
In addition to our internally generated data, we tested deFuse using published paired end RNA-Seq data sets known to contain gene fusions. These datasets were used as positive controls in the evaluation of deFuse. We used the NCI-H660 prostate cell line from the FusionSeq website
In this section, we describe the deFuse algorithm. We begin by defining essential terms. We define a
With these definitions in hand we will now describe how deFuse predicts gene fusions by searching RNA-Seq data for fragments that harbour fusion boundaries. As mentioned previously, the problem of identifying the true genomic origin of a set of RNA-Seq reads is confounded by several factors, and as a result, a proportion of the RNA-Seq reads will have ambiguous alignments to the genome. The deFuse method, outlined schematically in
The method consists of four main steps. The first step is alignment of paired end reads to a reference comprised of the sequences that are expected to exist in the sample, with all relevant alignments considered. We use spliced and unspliced gene sequences as a reference because we have found that fusion genes often produce splice variants that express intronic sequences, and that some of those splice variants are biologically relevant (unpublished data). We define two necessary conditions for considering discordant alignments to have originated from reads spanning the same fusion boundary and use these conditions to cluster discordant alignments representing the same fusion event. The second step resolves ambiguous discordant alignments by selecting the most likely set of fusion events, and the most likely assignment of spanning reads to those events (
deFuse begins with a search for spanning reads as evidence of gene fusion events. We describe two necessary conditions for considering two discordant alignments to have originated from reads spanning the same fusion boundary.
The size selection step of the RNA-Seq library construction protocol results in a collection of cDNA fragments with lengths that we approximate with the inferred fragment length distribution
By definition a spanning read harbours a fusion boundary in the insert sequence (
We define the
Given two reads spanning the same fusion boundary, the difference between the fragment lengths of those reads can be calculated as
The utility of an ambiguously aligned read depends on our ability to select the correct alignment for that read based on the greater context of all paired end alignments in the RNA-Seq dataset. Given that we are considering alignments to spliced and unspliced gene sequences, ambiguous alignments will result from homology between genes and also from the redundant representation of the same exon multiple times for multiple splice variants of the same gene. The true alignment for each read must be inferred, as it will be used to identify, for the first situation, the correct pair of genes involved in the fusion, and for the second situation, the correct pair of splice variants of those genes.
Define a
Computation of the maximum parsimony solution is NP-hard by reduction from the set cover problem, as shown by Hormozdiari et al.
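Because the objective resembles set cover, a greedy heuristic is one common way to approximate it. The sketch below is an illustration under that assumption and is not necessarily the procedure deFuse uses: the function name `greedy_parsimony`, the event labels and the read identifiers are all hypothetical. It repeatedly selects the candidate fusion event supported by the most still-unassigned spanning reads and assigns those reads to it.

```python
def greedy_parsimony(clusters):
    """Greedy set-cover style selection of fusion events (illustrative only).

    clusters maps a candidate fusion event to the set of spanning reads
    that have at least one alignment supporting it. Returns a mapping from
    each selected event to the reads assigned to it.
    """
    remaining = {event: set(reads) for event, reads in clusters.items()}
    assignment = {}
    while remaining:
        # Pick the event explaining the most unassigned reads; sorting the
        # keys first makes tie-breaking deterministic (alphabetical).
        best = max(sorted(remaining), key=lambda event: len(remaining[event]))
        if not remaining[best]:
            break  # every read is already explained
        chosen = remaining.pop(best)
        assignment[best] = chosen
        for event in remaining:
            remaining[event] -= chosen
    return assignment


# Hypothetical example: reads r2 and r3 align ambiguously to two gene pairs.
example = {
    "GENEA-GENEB": {"r1", "r2", "r3"},
    "GENEA-GENEC": {"r2", "r3"},
    "GENED-GENEE": {"r4"},
}
selected = greedy_parsimony(example)
```

In this toy input the ambiguous reads r2 and r3 are assigned to the event with the larger cluster, so the homologous alternative receives no support and is dropped.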
Given fusion events nominated by spanning reads, deFuse does a targeted split read analysis to predict nucleotide level fusion boundaries. For a cluster of discordant alignments, the
Given the alignment of one end of a read, we define the
The maximum parsimony solution will nominate fusions such that the average number of spanning reads per fusion is maximized. However, it is possible that the selected transcript variants do not maximize both spanning and split read evidence. Thus, when searching for candidate split reads, it is necessary to search across all relevant transcripts. In addition to calculating the approximate fusion boundary in the transcript variants proposed by the maximum parsimony solution, we also project those approximate fusion boundaries onto other transcript variants of the same gene. The approximate fusion boundary in transcript variant
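Projecting a boundary from one transcript variant onto another can be pictured as a round trip through genomic coordinates. The sketch below is a minimal illustration under assumptions that are not taken from the text: plus-strand transcripts, exons given as inclusive genomic intervals ordered 5′ to 3′, 0-based transcript offsets, and hypothetical function names.

```python
def transcript_to_genome(exons, position):
    """Map a 0-based transcript offset to a genomic coordinate."""
    for start, end in exons:
        length = end - start + 1
        if position < length:
            return start + position
        position -= length
    raise ValueError("position lies beyond the end of the transcript")


def genome_to_transcript(exons, genomic_position):
    """Map a genomic coordinate to a transcript offset, or None if the
    coordinate is not exonic in this variant."""
    offset = 0
    for start, end in exons:
        if start <= genomic_position <= end:
            return offset + (genomic_position - start)
        offset += end - start + 1
    return None


def project_boundary(source_exons, target_exons, position):
    """Project a boundary from one variant onto another variant of the gene."""
    genomic = transcript_to_genome(source_exons, position)
    return genome_to_transcript(target_exons, genomic)


# Hypothetical variants of one gene: variant_b carries an extra middle exon.
variant_a = [(100, 199), (300, 399)]
variant_b = [(100, 199), (250, 279), (300, 399)]
variant_c = [(100, 199)]

projected = project_boundary(variant_a, variant_b, 150)
missing = project_boundary(variant_a, variant_c, 150)
```

A projection that returns `None` indicates that the boundary falls outside the exons of the target variant, so that variant would not be searched for split reads in this simplified model.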
The split read analysis proceeds by aligning candidate split reads to approximate fusion boundaries. We first align a candidate split read to the two approximate fusion boundaries in the two fused transcripts, then combine the two alignments in a way that maximizes the combined alignment score. Let
We start by aligning
With this method, finding all splits with maximum score does not necessitate backtracking through a dynamic programming matrix (which is a worst case exponential operation). Additionally, we add the constraint that
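The idea of combining two independent alignments without backtracking can be sketched as a linear scan over candidate split positions. The example assumes, hypothetically, that `prefix_scores[i]` holds the best score for aligning the first `i` bases of the read to the 5′ partner and `suffix_scores[i]` the best score for aligning the remaining bases to the 3′ partner; how those arrays are filled is outside the sketch, and the names and scores are illustrative.

```python
def best_splits(prefix_scores, suffix_scores):
    """Return the maximum combined score and every split position achieving it.

    Both lists have one entry per split position 0..read_length, where
    position i means read[:i] comes from the 5' gene and read[i:] from
    the 3' gene.
    """
    if len(prefix_scores) != len(suffix_scores):
        raise ValueError("score arrays must cover the same split positions")
    combined = [p + s for p, s in zip(prefix_scores, suffix_scores)]
    top = max(combined)
    return top, [i for i, score in enumerate(combined) if score == top]


# Hypothetical scores for a 6-base read (7 possible split positions).
prefix = [0, 1, 2, 3, 3, 2, 1]
suffix = [1, 2, 2, 2, 2, 1, 0]
score, positions = best_splits(prefix, suffix)
```

Because each split position is scored by a single addition, all maximum-scoring splits are found in time linear in the read length once the two score arrays exist.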
Multiple splits
We resolve the problem of multiple split alignments as follows. We first cluster together split alignments that corroborate the same fusion boundaries
To test the corroboration between spanning read and split read evidence, we first use the fusion boundary predicted by the split reads to calculate the inferred fragment lengths
To classify paired end reads as concordant we aligned reads to spliced genes, the genome, and UniGene sequences using
Paired end reads not classified as concordant were classified as discordant. Single end mode alignments of discordant paired end reads were then classified as fully aligned or single end anchored. Fully aligned discordant paired end reads were clustered with
Predicted fusion sequences were annotated as open reading frame preserving, 5′ or 3′ UTR exchanges, interchromosomal, inversion, eversion, and between adjacent genes. The translational phase for each coding nucleotide was calculated using the frame column for each exon in the ensembl GTF file. Given nucleotide
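A reading-frame check based on the GTF frame column could look like the following sketch. It relies on the standard GTF convention that the frame value gives the number of bases to skip at the 5′ end of a coding feature before the first complete codon; the function names and the 0-based offsets are assumptions made for illustration, and the exact formula used by deFuse is not reproduced here.

```python
def codon_position(frame, offset):
    """Codon position (0, 1 or 2) of a coding base.

    frame is the GTF frame of the coding exon; offset is the 0-based
    distance of the base from the exon's 5' end in transcript direction.
    """
    return (offset - frame) % 3


def preserves_reading_frame(frame5, offset5, frame3, offset3):
    """True if the first 3' base continues the codon where the 5' base left off.

    (frame5, offset5) describe the last retained base of the 5' gene and
    (frame3, offset3) the first retained base of the 3' gene.
    """
    last_5prime = codon_position(frame5, offset5)
    first_3prime = codon_position(frame3, offset3)
    return first_3prime == (last_5prime + 1) % 3


# Hypothetical boundary: the 5' gene ends on the third base of a codon and
# the 3' gene resumes at the first base of a codon.
in_frame = preserves_reading_frame(frame5=0, offset5=5, frame3=0, offset3=0)
out_of_frame = preserves_reading_frame(frame5=0, offset5=5, frame3=1, offset3=0)
```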
We computed a set of features to better characterize our predicted fusions. The features were calculated for each fusion prediction with the aim of discriminating between true and false positives. We initially lacked a set of positive and negative controls that would have been necessary for a principled machine learning based classification method. Thus initial validation candidates were identified by thresholding these features at levels we suspected would enrich for real fusions (see
Once we had performed a significant number of validations, these validations became the training set for a classifier. We calculated the following 11 features for the examples in our training set (detailed descriptions in Supplementary Methods,
We then used the ada (2.0–2) package in R (2.11.0) to train an adaboost model using the stochastic gradient boosting algorithm with exponential loss, discrete boosting, and decision stumps as the base classifier
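For readers unfamiliar with discrete AdaBoost, the following pure-Python sketch shows boosting of decision stumps with the exponential-loss weight update. It is not the ada package and it omits the stochastic subsampling mentioned above; the one-feature training data are a hypothetical toy example, not the fusion features.

```python
import math


def stump_predict(stump, row):
    """A stump votes +polarity when the feature exceeds the threshold."""
    feature, threshold, polarity = stump
    return polarity if row[feature] > threshold else -polarity


def fit_stump(X, y, weights):
    """Exhaustively choose the stump with the lowest weighted error."""
    best_error, best_stump = None, None
    for feature in range(len(X[0])):
        values = sorted({row[feature] for row in X})
        # One threshold below every value, then midpoints between neighbours.
        thresholds = [values[0] - 1.0]
        thresholds += [(a + b) / 2 for a, b in zip(values, values[1:])]
        for threshold in thresholds:
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                error = sum(w for row, label, w in zip(X, y, weights)
                            if stump_predict(stump, row) != label)
                if best_error is None or error < best_error:
                    best_error, best_stump = error, stump
    return best_error, best_stump


def adaboost(X, y, rounds):
    """Discrete AdaBoost; labels must be +1 or -1."""
    weights = [1.0 / len(X)] * len(X)
    model = []
    for _ in range(rounds):
        error, stump = fit_stump(X, y, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        model.append((alpha, stump))
        # Exponential-loss update: misclassified examples gain weight.
        weights = [w * math.exp(-alpha * label * stump_predict(stump, row))
                   for row, label, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 1 if score > 0 else -1


# Hypothetical toy data: no single stump separates the classes.
X = [(1,), (2,), (3,), (4,), (5,), (6,)]
y = [1, 1, -1, -1, 1, 1]
model = adaboost(X, y, rounds=3)
predictions = [predict(model, row) for row in X]
```

The toy labels cannot be separated by one threshold, which is why the weighted vote of several stumps is needed.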
deFuse is implemented in C++, Perl and R. Analysis of a typical library of 120,000,000 paired end reads completes in approximately 6 hours using a cluster of 100 compute nodes. The human genome (NCBI36) and gene models in GTF format (Ensembl 54) were downloaded from Ensembl
Fusion sequence predictions were obtained for the 44 datasets as detailed in
Next we assembled a set of positive and negative controls by attempting to validate a selection of predictions potentially representing real fusions, and another set of predictions representing systematic artifacts. To select potential positives, we first used the following set of heuristic filters to enrich for real fusions, producing a subset consisting of 268 predictions.
From the 268 filtered predictions we selected 46 predictions, and in doing so attempted to select predictions from libraries with a range of read lengths, such that those predictions covered a large range of values for the spanning and split read counts. Included in this set of 46 predictions were all eight predictions that pass the heuristic filters and involve a cancer associated gene from the cancer gene census (Wellcome Trust Sanger Institute Cancer Genome Project web site,
Next we selected 14 predictions representing potential recurrent artifacts, requiring that each of the 14 fail at least one of our heuristic filters, and also requiring that each was predicted to exist in two or more libraries. None of these predictions validated. Finally, we selected 40 predictions at random from the unfiltered list of 20,327 with the assumption that the majority of them would be negative. Only one of the 40 randomly selected predictions validated as real. In total, 45 predictions were validated by RT-PCR (
library | 5′ gene | 3′ gene | span count | ambig span count | split read count | split pos. p-value | corrob. p-value | min anchor p-value | exon bndry | inter. expr. | prom. exch. | fish valid. | CNV break |
CCC1 | TYW1 | HGSNAT | 41 | 38 | 11 | 0.92 | 1 | 0.65 | |||||
CCC4 | TNS3 | PKD1L1 | 12 | 0 | 4 | 0.82 | 0.39 | 0.68 | |||||
CCC9 | RPN2 | PMEPA1 | 48 | 0 | 11 | 0.83 | 0.88 | 0.34 | |||||
CCC9 | TLX3 | RANBP17 | 10 | 0 | 5 | 0.51 | 0.63 | 0.38 | |||||
CCC12 | ITCH | RALY | 59 | 0 | 6 | 0.99 | 0.88 | 0.78 | |||||
CCC12 | MTHFD1 | C1orf61 | 27 | 8 | 11 | 0.53 | 0.67 | 0.79 | |||||
CCC12 | YTHDF2 | SYTL1 | 53 | 0 | 19 | 0.75 | 0.94 | 0.81 | |||||
CCC13 | PPME1 | MRPL48 | 69 | 0 | 22 | 0.48 | 0.76 | 0.41 | |||||
CCC14 | EPCAM | DLEC1 | 27 | 0 | 17 | 0.98 | 0.72 | 0.91 | |||||
CCC15 | AFF4 | LAMC3 | 5 | 0 | 3 | 0.64 | 0.89 | 0.54 | |||||
CCC15 | ARSB | DMGDH | 103 | 0 | 87 | 0.92 | 0.8 | 0.21 | |||||
CCC15 | KIFC3 | CNGB1 | 14 | 0 | 16 | 0.36 | 1 | 0.45 | |||||
CCC15 | NUMB | ALDH6A1 | 22 | 0 | 12 | 0.98 | 0.85 | 0.81 | |||||
CCC15 | PVRL2 | LMNA | 17 | 2 | 7 | 0.39 | 0.33 | 0.67 | |||||
CCC15 | SLC38A10 | ZCCHC11 | 12 | 0 | 1 | 0.28 | 1 | 0.29 | |||||
CCC15 | TMEM63A | NRD1 | 17 | 0 | 7 | 0.57 | 0.83 | 0.85 | |||||
CCC15 | UBR4 | JMJD2B | 27 | 0 | 18 | 0.5 | 1 | 0.56 | |||||
CCC16 | HPS5 | APOO | 23 | 3 | 11 | 0.67 | 0.49 | 0.35 | |||||
CCC16 | PAPOLA | HIP1R | 44 | 0 | 19 | 0.89 | 0.62 | 0.57 | |||||
CCC16 | PPL | RBKS | 10 | 0 | 14 | 0.81 | 0.7 | 0.43 | |||||
EMD6 | BCAS3 | ARHGAP15 | 10 | 0 | 4 | 0.42 | 0.73 | 0.81 | |||||
EMD6 | CAMK2G | DDX1 | 9 | 0 | 2 | 0.28 | 1 | 0.46 | |||||
EMD6 | CYB5D2 | ANKFY1 | 6 | 0 | 1 | 0.65 | 0.82 | 0.75 | |||||
EMD6 | EIF4G3 | LRRC8D | 7 | 0 | 4 | 0.19 | 0.96 | 0.47 | |||||
EMD6 | ROCK1 | CMKLR1 | 13 | 0 | 8 | 0.62 | 0.31 | 0.82 | |||||
GRC5 | FBXO25 | BET1L | 8 | 5 | 3 | 0.86 | 0.37 | 0.68 | |||||
GRC5 | PCP4L1 | SDHC | 7 | 7 | 10 | 0.71 | 0.18 | 0.68 | |||||
HGS1 | CAPNS1 | WDR62 | 7 | 0 | 11 | 0.55 | 1 | 0.85 | |||||
HGS1 | LETM1 | USP15 | 7 | 1 | 5 | 0.95 | 0.74 | 0.59 | |||||
HGS1 | RAB6A | USP43 | 14 | 9 | 6 | 0.45 | 0.81 | 0.5 | |||||
HGS3 | ELL | CYLN2 | 15 | 0 | 8 | 0.85 | 1 | 0.56 | |||||
HGS3 | FRYL | SH2D1A | 27 | 0 | 7 | 0.9 | 1 | 0.62 | |||||
HGS3 | GTF2I | PGPEP1 | 34 | 0 | 3 | 0.15 | 1 | 0.3 | |||||
HGS3 | PRR12 | FLT3LG | 20 | 0 | 11 | 0.72 | 1 | 0.24 | |||||
HGS4 | FLNB | VPS8 | 95 | 4 | 51 | 0.8 | 0.72 | 0.59 | |||||
HGS4 | LMF1 | UMOD | 15 | 0 | 7 | 0.88 | 1 | 0.4 | |||||
HGS4 | SLC37A1 | ABCG1 | 40 | 0 | 14 | 0.83 | 1 | 0.42 | |||||
HGS4 | STK3 | NPAL2 | 7 | 0 | 3 | 0.69 | 0.79 | 0.13 | |||||
MUC1 | ERBB2 | PERLD1 | 25 | 0 | 11 | 0.84 | 1 | 0.75 | |||||
MUC1 | KIAA0355 | UQCRC1 | 10 | 0 | 6 | 0.82 | 0.76 | 0.44 | |||||
YKS2 | C12orf48 | MYBPC1 | 8 | 0 | 6 | 0.63 | 0.84 | 0.2 | |||||
SARC1 | CMKLR1 | HNF1A | 38 | 0 | 7 | 0.72 | 0.84 | 0.11 | |||||
SARC1 | ERBB3 | CRADD | 103 | 7 | 41 | 0.87 | 0.88 | 0.61 | |||||
SARC2 | SMARCB1 | WASF2 | 16 | 14 | 4 | 0.38 | 0.64 | 0.66 | |||||
SARC3 | RREB1 | TFE3 | 103 | 0 | 28 | 0.93 | 1 | 0.3 |
RNA-Seq evidence, annotation information and validation information are shown for each prediction for which validation by PCR was attempted.
We were interested in building a classifier that could discriminate between real fusions and false positives. As a training set, we compiled a list of all ovarian and sarcoma fusions for which validation was attempted, and added to this list the 11 melanoma fusions, the three K-562 fusions and the TMPRSS2-ERG fusion in NCI-H660. The resulting dataset contained 60 positive and 61 negative predictions (
ROC curve for deFuse annotated with the threshold for the adaboost probability estimate. The threshold corresponds to a false positive rate of 10% and true positive rate of 82%.
Relative importance of each of the 11 features used by deFuse classifier.
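The operating point and the area under the ROC curve reported for the classifier are standard quantities that can be computed directly from classifier scores and validation labels. The sketch below is illustrative only: the scores and labels are hypothetical toy values rather than the 121 training examples, and the function names are not part of deFuse.

```python
def roc_point(scores, labels, threshold):
    """Return (false positive rate, true positive rate) at a score threshold."""
    positives = sum(1 for label in labels if label)
    negatives = len(labels) - positives
    tp = sum(1 for s, label in zip(scores, labels) if label and s >= threshold)
    fp = sum(1 for s, label in zip(scores, labels) if not label and s >= threshold)
    return fp / negatives, tp / positives


def auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation:
    the fraction of positive/negative pairs ranked correctly, with ties
    counted as half."""
    pos = [s for s, label in zip(scores, labels) if label]
    neg = [s for s, label in zip(scores, labels) if not label]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical classifier probabilities and validation outcomes.
toy_scores = [0.95, 0.90, 0.85, 0.60, 0.40, 0.30, 0.20, 0.10]
toy_labels = [1, 1, 0, 1, 1, 0, 0, 0]
fpr, tpr = roc_point(toy_scores, toy_labels, threshold=0.81)
area = auc(toy_scores, toy_labels)
```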
Next we used the adaboost model to classify all remaining ovarian and sarcoma predictions to produce a final set of predictions for the ovarian and sarcoma datasets, thresholding the probability estimates produced by the adaboost model at 0.81. In total we predicted 2,540 gene fusions across all RNA-Seq datasets (
We analyzed CCC15, CCC16 and EMD6 with MapSplice version 1.14.1 and FusionSeq version 0.6.1 in order to compare the sensitivity of these methods with that of deFuse (Supplementary Methods,
We also attempted to establish whether FusionSeq and MapSplice could identify real fusions in our data that deFuse should have been able to identify, but did not. To this end, we identified all MapSplice and FusionSeq predictions for which there were no corresponding deFuse predictions in the set of initial predictions produced by the heuristic filters. We also removed MapSplice and FusionSeq predictions that did not involve ensembl annotated genes because it would be impossible for deFuse to identify those events. From this list we selected 14 MapSplice predictions and eight FusionSeq predictions that we considered to have the highest likelihood of successful validation according to a variety of conservative criteria (Supplementary Methods,
library | 5′ gene | 3′ gene | FusionSeq | deFuse thresholds | deFuse classifier | PCR validated |
CCC15 | ||||||
CCC15 | ||||||
CCC15 | ||||||
CCC15 | ||||||
CCC15 | ||||||
CCC15 | ||||||
CCC15 | ||||||
CCC16 | ||||||
CCC16 | ||||||
CCC16 | ||||||
EMD6 | ||||||
EMD6 | ||||||
EMD6 | ||||||
EMD6 | ||||||
EMD6 | ||||||
EMD6 | ||||||
CCC15 | ||||||
EMD6 | ||||||
CCC15 | ||||||
CCC16 | ||||||
CCC16 |
Comparison of deFuse using heuristic filters (deFuse thresholds) and deFuse using a classifier (deFuse classifier) with FusionSeq.
None of the 14 MapSplice predictions had corresponding deFuse predictions, filtered or unfiltered (blat with
In total there were 21 predictions with PCR results (17 positive and 4 negative) in CCC15, CCC16 and EMD6 upon which a quantitative comparison between FusionSeq and deFuse could be made. We computed the sensitivity and specificity on this data for deFuse-Threshold, deFuse-Classifier and FusionSeq. The sensitivity and specificity values were 82.3% and 100% for deFuse-Threshold; 100% and 94.4% for deFuse-Classifier; and 76.5% and 76.5% for FusionSeq (see
Method | P | N | TP | TN | FP | FN | Sens | Spec |
deFuse-Threshold | 17 | 4 | 14 | 4 | 0 | 3 | 82.3 | 100 |
deFuse-Classifier | 17 | 4 | 17 | 3 | 1 | 0 | 100 | 94.4 |
FusionSeq | 17 | 4 | 13 | 0 | 4 | 4 | 76.5 | 76.5 |
Comparison of accuracy between deFuse and FusionSeq on a subset of events predicted by either method in CCC15, CCC16 and EMD6. There were 21 PCR validations attempted including 17 positives (P) and 4 negatives (N). TP: true positives, TN: true negatives, FP: false positives, FN: false negatives, Sens =
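The tabulated values can be recomputed from the confusion counts. In the sketch below, sensitivity is TP/(TP+FN). The Spec column is reproduced by TP/(TP+FP), the positive predictive value, rather than by TN/(TN+FP); this reading is inferred from the arithmetic and should be treated as an assumption, since the caption's definitions are not available here. The dictionary keys and function names are illustrative.

```python
confusion = {
    "deFuse-Threshold":  {"TP": 14, "TN": 4, "FP": 0, "FN": 3},
    "deFuse-Classifier": {"TP": 17, "TN": 3, "FP": 1, "FN": 0},
    "FusionSeq":         {"TP": 13, "TN": 0, "FP": 4, "FN": 4},
}


def percent(numerator, denominator):
    return 100.0 * numerator / denominator


def summarize(counts):
    """Compute three candidate accuracy measures from confusion counts."""
    return {
        # Fraction of PCR-positive fusions that the method predicted.
        "sensitivity": percent(counts["TP"], counts["TP"] + counts["FN"]),
        # Fraction of the method's predictions that were PCR-positive.
        "ppv": percent(counts["TP"], counts["TP"] + counts["FP"]),
        # Fraction of PCR-negative events that the method rejected.
        "tnr": percent(counts["TN"], counts["TN"] + counts["FP"]),
    }


summary = {method: summarize(counts) for method, counts in confusion.items()}
for method, values in summary.items():
    print(method, {name: round(value, 1) for name, value in values.items()})
```

Under this reading, 17/18 gives 94.4 for deFuse-Classifier and 13/17 gives 76.5 for FusionSeq, matching the table, whereas TN/(TN+FP) would give 75.0 and 0.0 respectively. The sensitivity of deFuse-Threshold is 14/17, about 82.35%, which appears in the table as 82.3.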
We sought to establish the benefit of the maximum parsimony approach for resolving ambiguously aligning reads, and the dynamic programming based approach for aligning split reads to discover fusion boundaries. Each predicted fusion splice was annotated as coincident or not coincident with known ensembl exon boundaries. The fusion splices for eight of the 45 PCR validated fusions were not predicted to coincide with ensembl exon boundaries (
For each PCR validated fusion, we also calculated the number of spanning reads that align to a unique location in the genome, and considered the effect of an analysis restricted to considering only these reads. Such a theoretical analysis would have resulted in four fusions having lower than the threshold of five spanning reads (
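The effect of discarding ambiguously aligned spanning reads can be checked against the table of PCR validated fusions. The sketch assumes that the number of uniquely aligning spanning reads equals the span count minus the ambiguous span count, which is an interpretation of the table columns rather than a stated definition. Only the 11 rows with a non-zero ambiguous count are listed; the remaining 34 rows have no ambiguous reads and a span count of at least five, so they are unaffected.

```python
MIN_SPANNING_READS = 5

# (library, 5' gene, 3' gene, span count, ambiguous span count)
rows_with_ambiguous_reads = [
    ("CCC1", "TYW1", "HGSNAT", 41, 38),
    ("CCC12", "MTHFD1", "C1orf61", 27, 8),
    ("CCC15", "PVRL2", "LMNA", 17, 2),
    ("CCC16", "HPS5", "APOO", 23, 3),
    ("GRC5", "FBXO25", "BET1L", 8, 5),
    ("GRC5", "PCP4L1", "SDHC", 7, 7),
    ("HGS1", "LETM1", "USP15", 7, 1),
    ("HGS1", "RAB6A", "USP43", 14, 9),
    ("HGS4", "FLNB", "VPS8", 95, 4),
    ("SARC1", "ERBB3", "CRADD", 103, 7),
    ("SARC2", "SMARCB1", "WASF2", 16, 14),
]


def below_threshold(rows, minimum=MIN_SPANNING_READS):
    """Fusions whose unique spanning read count falls under the minimum."""
    lost = []
    for library, gene5, gene3, span, ambiguous in rows:
        unique = span - ambiguous
        if unique < minimum:
            lost.append((library, gene5 + "-" + gene3, unique))
    return lost


lost_fusions = below_threshold(rows_with_ambiguous_reads)
for library, fusion, unique in lost_fusions:
    print(library, fusion, unique)
```

With these values the fusions falling below the threshold are TYW1-HGSNAT, FBXO25-BET1L, PCP4L1-SDHC and SMARCB1-WASF2, which agrees with the count of four given in the text.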
An analysis that considered only uniquely aligning spanning reads and considered fusion splices at known exon boundaries would theoretically result in fewer false positives, as is apparent from the high validation rate in previous studies
We evaluated the ability of deFuse to rediscover known gene fusions in publicly available RNA-Seq data. Using deFuse, we searched for the
library | 5′ gene | 3′ gene | span count | split count | corrob. p-value | split pos. p-value | split anchor p-value | deFuse classifier probability | deFuse correct sequence | deFuse thresholds |
NCIH660 | TMPRSS2 | ERG | 19 | 10 | 0.31 | 0.31 | 0.45 | 0.48 | ||
SRR018259 | KCTD2 | ARHGEF12 | 4 | 1 | 0.33 | 0.97 | 0.91 | 0.98 | ||
SRR018260 | ITM2B | RB1 | 19 | 2 | 0.37 | 0.51 | 0.68 | 0.98 | ||
SRR018260 | ANKHD1 | C5orf32 | 2 | 2 | 0.45 | 0.09 | 0.05 | 0.00 | ||
SRR018261 | GCN1L1 | PLA2G1B | 4 | 1 | 0.35 | 0.57 | 0.62 | 1.00 | ||
SRR018265 | WDR72 | SCAMP2 | 3 | 2 | 0.00 | 0.97 | 0.25 | 0.59 | ||
SRR018266 | C1orf61 | CCT3 | 54 | 17 | 0.12 | 0.56 | 0.56 | 0.79 | ||
SRR018266 | MIXL1 | PARP1 | 2 | 1 | 0.18 | 0.25 | 0.19 | 0.03 | ||
SRR018266 | C11orf67 | SLC12A7 | 43 | 24 | 0.92 | 0.55 | 0.75 | 0.99 | ||
SRR018266 | GNA12 | SHANK2 | 29 | 9 | 0.58 | 0.24 | 0.34 | 0.83 | ||
SRR018267 | TLN1 | C9orf127 | 3 | 1 | 0.08 | 0.71 | 0.76 | 0.91 | ||
SRR018267 | ALX3 | RECK | 4 | 6 | 0.72 | 0.25 | 0.45 | 0.99 | ||
SRR018269 | ABL1 | BCR | 91 | 14 | 0.68 | 0.68 | 0.67 | 0.97 | ||
SRR018269 | SLC44A4 | BAT3 | 27 | 6 | 0.44 | 0.74 | 0.67 | 0.99 | ||
SRR018269 | NUP214 | XKR3 | 67 | 15 | 0.90 | 0.91 | 0.28 | 1.00 |
Results for an analysis of existing datasets using deFuse thresholds and deFuse classifier.
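The classifier calls for the known fusions can be recovered by thresholding the probabilities in the table above. The sketch takes the final numeric value in each row as the classifier probability and applies the 0.81 cutoff used for the ovarian and sarcoma predictions; both choices are assumptions made for illustration, and the variable names are hypothetical.

```python
THRESHOLD = 0.81  # cutoff applied to the adaboost probability estimate

# (library, fusion, classifier probability), copied from the table above.
known_fusions = [
    ("NCIH660", "TMPRSS2-ERG", 0.48),
    ("SRR018259", "KCTD2-ARHGEF12", 0.98),
    ("SRR018260", "ITM2B-RB1", 0.98),
    ("SRR018260", "ANKHD1-C5orf32", 0.00),
    ("SRR018261", "GCN1L1-PLA2G1B", 1.00),
    ("SRR018265", "WDR72-SCAMP2", 0.59),
    ("SRR018266", "C1orf61-CCT3", 0.79),
    ("SRR018266", "MIXL1-PARP1", 0.03),
    ("SRR018266", "C11orf67-SLC12A7", 0.99),
    ("SRR018266", "GNA12-SHANK2", 0.83),
    ("SRR018267", "TLN1-C9orf127", 0.91),
    ("SRR018267", "ALX3-RECK", 0.99),
    ("SRR018269", "ABL1-BCR", 0.97),
    ("SRR018269", "SLC44A4-BAT3", 0.99),
    ("SRR018269", "NUP214-XKR3", 1.00),
]

called = [fusion for _, fusion, p in known_fusions if p >= THRESHOLD]
missed = [fusion for _, fusion, p in known_fusions if p < THRESHOLD]
print(len(called), "of", len(known_fusions), "known fusions pass the threshold")
```

Under these assumptions 10 of the 15 known fusions pass the threshold. The five that do not are TMPRSS2-ERG, ANKHD1-C5orf32, WDR72-SCAMP2, C1orf61-CCT3 and MIXL1-PARP1.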
deFuse-Thresholds identifies 7 of the 15 known fusions, whereas deFuse-Classifier identifies 10 of the 15 fusions. Notably,
We sought to understand each fusion's impact on the expression patterns of the fused genes. For a given fusion boundary
Promoter exchanges are characterized by overexpression of the 3′ exons of a gene resulting from the replacement of 5′ regulatory regions
We sought to rule out genomic amplification as a mechanism of overexpression for the seven putative promoter exchanges. Analysis of Affymetrix SNP 6.0 genome data indicates that two of the 3′ partners,
We sought to identify previously described rearrangements in our sarcoma and ovarian carcinoma data. Although generally considered a breast cancer rearrangement, amplification of
Analysis of the two epithelioid sarcomas and one intermediate grade myofibroblastic sarcoma produced five fusion predictions between non-adjacent genes, three involving genes previously described as translocated in cancer. The
The
Finally, the
We have developed a new algorithmic method called deFuse for gene fusion discovery in RNA-Seq data. We evaluated deFuse on 40 ovarian cancer patient samples, one ovarian cancer cell line and three sarcoma patient samples. Using these data, we demonstrate with RT-PCR validated fusions that deFuse exhibits substantially better accuracy than two competing methods and that deFuse is able to discover gene fusions that are not discoverable by simpler methods. deFuse computes a set of 11 quantitative features used to characterize its predicted fusions. In our initial analysis we used heuristic, intuitively chosen thresholds to eliminate false positives, and nominated expected true positive and false positive predictions for RT-PCR validation. This yielded a set of benchmark fusion predictions, 60 true positives and 61 true negatives, that we in turn used to train an adaboost classifier to more robustly and objectively identify real gene fusions from the features. The classifier yielded an area under the ROC curve (AUC) of 0.91. Importantly, the validated fusions in ovarian cancer represent the first reported gene fusions in that tumor type.
The lack of a sufficient number of positive and negative controls for a particular type of event, such as gene fusions, represents a major challenge when evaluating novel algorithms designed for discovery of those events. This challenge is exacerbated when the prediction set contains a much larger proportion of negatives than positives. We attempted to select candidates to enrich for positive examples to provide a balanced set of ground truth events with which to train our classifier. While this selection has inherent biases, only one in 40 randomly chosen predictions validated, indicating that a completely unbiased selection would have yielded too few positives to robustly fit a classifier. We attempted to mitigate the acknowledged biases by using other software to find additional positives and by including the very limited set of published examples from the literature.
The main limitation of deFuse is the requirement of at least five discordant read pairs to nominate a gene fusion to the adaboost classifier. This will certainly miss fusions that have very low expression and may result in insensitivity to fusions from RNA-Seq datasets with minimal sequence generation. This is suggested by the results in “Rediscovery of known gene fusions”. However, sequencing platforms are increasing throughput at exponential rates and it will soon be rare for an RNA-Seq library to under-sample a transcriptome. Another potential limitation of deFuse is its reliance on an annotated set of genes. As such, it will not be able to discover fusions that involve loci that are not annotated as genes. Finally, deFuse relies on alignment to a reference as its primary analytical step. Thus deFuse would miss gene fusions involving completely novel sequences that may exist in a transcriptome library but are not represented in the reference used by the aligner. In such situations, de novo assembly based methods such as Trans-ABySS
Full characterization of the mutational composition of cancer genomes will provide the opportunity to discover drivers of oncogenesis and will aid the development of biomarkers and drug targets for targeted therapy. As production of RNA-Seq data derived from tumor transcriptomes becomes routine, sophisticated techniques such as those used by deFuse will be required to identify the gene fusions that are part of each tumor's mutational landscape. As a first step in this process, we have identified gene fusions as a new class of features of the mutational landscape of ovarian tumor transcriptomes, in addition to discovering novel gene fusions in three sarcoma tumors.
RT-PCR sequence traces.
(5.00 MB GZ)
FISH images.
(4.55 MB GZ)
MapSplice output.
(10.16 MB GZ)
FusionSeq output.
(0.28 MB GZ)
Table of all gene fusion predictions.
(1.97 MB TXT)
Table of predicted interrupted genes.
(0.01 MB TXT)
Table of predicted CNVs.
(0.64 MB TXT)
FISH probe selection table.
(0.00 MB TXT)
Table of Validation Sets and RT-PCR primers.
(0.06 MB TXT)
Ovarian gene expression table.
(0.25 MB TXT)
Sarcoma gene expression table.
(0.00 MB TXT)
Table of positive and negative controls.
(0.10 MB TXT)
UMOD aligned read counts.
(0.00 MB TXT)
Gene names and their ensembl ids.
(0.51 MB TXT)
Supplementary methods and analysis.
(0.58 MB PDF)