JdR, MR, and LW conceived and designed the experiments. JdR performed the experiments. JdR, AU, and JK analyzed the data. JdR wrote the paper.
The authors have declared that no competing interests exist.
Retroviral insertional mutagenesis screens, which identify genes involved in tumor development in mice, have yielded a substantial number of retroviral integration sites, and this number is expected to grow considerably due to the introduction of high-throughput screening techniques. The data of various retroviral insertional mutagenesis screens are compiled in the publicly available Retroviral Tagged Cancer Gene Database (RTCGD). Integrally analyzing these screens for the presence of common insertion sites (CISs, i.e., regions in the genome that have been hit by viral insertions in multiple independent tumors significantly more than expected by chance) requires an approach that corrects for the increased probability of finding false CISs as the amount of available data increases. Moreover, significance estimates of CISs should be established taking into account both the noise, arising from the random nature of the insertion process, and the bias, stemming from preferential insertion sites present in the genome and the data retrieval methodology. We introduce a framework, the kernel convolution (KC) framework, to find CISs in a noisy and biased environment using a predefined significance level while controlling the family-wise error (FWE) (the probability of detecting false CISs). Where previous methods use one, two, or three predetermined fixed scales, our method is capable of operating at any biologically relevant scale. This creates the possibility to analyze the CISs in a scale space by varying the width of the CISs, providing new insights into the behavior of CISs across multiple scales. Our method also features the possibility of including models for background bias. Using simulated data, we evaluate the KC framework using three kernel functions: the Gaussian, triangular, and rectangular kernel functions.
We applied the Gaussian KC to the data from the combined set of screens in the RTCGD and found that 53% of the CISs do not reach the significance threshold in this combined setting. Still, with the FWE under control, application of our method resulted in the discovery of eight novel CISs, which each have a probability less than 5% of being false detections.
A potent method for the identification of novel cancer genes is retroviral insertional mutagenesis. Mice infected with slow transforming retroviruses develop tumors because the virus inserts randomly in their genome and mutates cancer genes. The regions in the genome that are mutated in multiple independent tumors are likely to contain genes involved in tumorigenesis. As the size of these datasets increases, conventional methods to detect these so-called common insertion sites (CISs) no longer suffice, and an approach is required that can control the error independent of the dataset size. The authors introduce a framework that uses a technique called kernel density estimation to find the regions in the genome that show a significant increase in insertion density. This method is implemented over a range of scales, allowing the data to be evaluated at any relevant scale. The authors demonstrate that the framework is capable of compensating for the inherent biases in the data, such as preference for retroviruses to insert near transcriptional start sites. By better balancing the error, they are able to show that from the 361 published CISs, 150 can be identified that have a low probability of being a false detection. In addition, they discover eight novel CISs.
In retroviral insertional mutagenesis experiments, genes involved in the development of cancer are identified by determining the loci of viral insertions from tumors induced by retroviruses in mice [
A tumor develops when an accumulation of oncogenic insertions causes uncontrolled proliferation of a cell. As a result, the tumor tissue contains many copies of the cell bearing the oncogenic insertions that induced the proliferation, but only a few copies of cells carrying non-oncogenic (random, background) insertions. Consequently, when the DNA of the tumor is analyzed, one will encounter the insertion that induced proliferation in larger proportions than insertions that do not. Regions in the genome that are found to carry insertions in multiple independent tumors are called common insertion sites (CISs). As a result, the locations of the CISs are highly correlated with the location of genes involved in tumor development. Cloning the flanking sequences of the inserted virus to determine the insertion loci, and analyzing these data to find significant CISs, therefore enable the discovery of new candidate cancer genes. This is summarized in
Over the last few years an extensive amount of insertional mutagenesis data has been published [
Due to noise in the data, not all insertions are informative. In the idealized case, oncogenic insertions are present in every tumor cell (since these cells are all copies of the cell with the initial oncogenic insertion that induced the tumor), whereas background insertions are only present in a small proportion of the tumor cells. Although this implies that the probability of finding a non-oncogenic insertion is far lower than for an oncogenic insertion, it may still occur that a non-oncogenic insertion is found. This results in non-informative insertions, the noise. Moreover, when a non-oncogenic insertion happens early in the tumor development phase, i.e., co-occurs in the same cell with one or more oncogenic insertions, there will also be many copies of this non-oncogenic insertion in the final tumor. Consequently, the probability of mapping this insertion will increase dramatically. This phenomenon is called
A CIS is defined as a region in the genome that has been hit by viral insertions in multiple independent tumors significantly more frequently than expected by chance (schematically illustrated in
(A) Schematic view of the mapped data of four tumors. Significance is determined by the number of tumors which contain insertions in a particular region. The geometric symbols represent the insertions and are given a different shape for each tumor. The blue regions indicate possible CISs.
(B) When considering a broad region, the number of insertions one would expect to have occurred by chance is higher, and hence the regions need to be hit in more independent tumors than for narrow regions before significance is reached.
(C) Genes (indicated by the green bars) may be affected from various loci around or within the gene, and there does not exist one distance over which viral inserts act on their targets.
For the analysis of the individual screens in the RTCGD, previous methods used one, two, or three windows of fixed size, and obtained an estimate of the number of false CISs by using Monte Carlo simulation [
The definition of a CIS depends on some expectation of the insertion rate associated with non-CIS regions. For this reason, we have to make assumptions about the background insertion distribution, i.e., the distribution of insertions under the assumption that there is no proliferative selection. In current methods, this distribution is assumed to be uniform, i.e., viral inserts show no preference for specific regions in the genome. Various authors suggest, however, that viral inserts do show local biases [
Summarizing, we state that, for the detection of CISs in retroviral insertional mutagenesis data, a framework is needed that 1) evaluates significance at any desired (biologically relevant) scale, 2) does so while keeping control of the error (since in the near future a significant increase in these data is expected), and 3) provides the possibility of including a background distribution, enabling compensation for the background bias. In this study, we propose a kernel convolution (KC) framework that meets the criteria outlined above. We apply this framework to the data of all the screens in the RTCGD combined, as if they originated from one screen. This gives us the opportunity to evaluate the method for large amounts of data. Indeed, the method rejects 53% of the CISs in the RTCGD because they do not reach the significance level in the combined set of screens. While keeping the family-wise error (FWE) under control, the method still revealed eight new CISs that are significant across the different screens. In addition, we provide the nearby putative target genes that may play a role in oncogenesis. Due to its generality, the method can be applied to other types of high-throughput genome-wide data, too; for example, to copy number aberration data or data from insertional mutagenesis screens using transposons [
The steps involved in the application of the KC framework can be summarized as follows (see
The insertions are convolved with a kernel function with a width determined by the scale parameter. In principle any kernel function can be used, but the Gaussian kernel function is depicted. The significance of the peaks is evaluated using a null-distribution computed by means of a random permutation of the data. This is done for a range of scale parameters to obtain the CISs in the scale space.
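The procedure in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this study: the Gaussian kernel has unit peak height, the estimate is evaluated on a coarse grid, the permutation is implemented as a uniform redrawing of all insertion positions, and the threshold is taken at level alpha divided by the number of observed peaks. All names (`smooth`, `peak_heights`, `permutation_threshold`) and all numbers in the toy data are hypothetical.

```python
import math
import random

def smooth(insertions, grid, h):
    """Gaussian-smoothed estimate of the number of insertions at each grid position."""
    return [sum(math.exp(-0.5 * ((g - d) / h) ** 2) for d in insertions) for g in grid]

def peak_heights(profile):
    """Heights of the local maxima of a smoothed profile."""
    return [profile[i] for i in range(1, len(profile) - 1)
            if profile[i - 1] < profile[i] >= profile[i + 1]]

def permutation_threshold(n_insertions, genome_length, grid, h, alpha, n_tests, n_perm, rng):
    """Peak-height threshold at level alpha / n_tests under a uniform null (assumed)."""
    null_peaks = []
    for _ in range(n_perm):
        # Assumption: a permutation redraws every insertion position uniformly.
        permuted = [rng.uniform(0, genome_length) for _ in range(n_insertions)]
        null_peaks.extend(peak_heights(smooth(permuted, grid, h)))
    null_peaks.sort()
    k = min(len(null_peaks) - 1, int((1.0 - alpha / n_tests) * len(null_peaks)))
    return null_peaks[k]

# Toy data: a uniform background plus one dense cluster around position 10,000.
rng = random.Random(0)
genome_length, h = 20000, 200
grid = list(range(0, genome_length + 1, 50))
background = [rng.uniform(0, genome_length) for _ in range(30)]
cluster = [rng.uniform(9950, 10050) for _ in range(12)]
insertions = background + cluster

profile = smooth(insertions, grid, h)
observed_peaks = peak_heights(profile)
threshold = permutation_threshold(len(insertions), genome_length, grid, h,
                                  alpha=0.05, n_tests=len(observed_peaks),
                                  n_perm=20, rng=rng)
# Grid positions whose smoothed estimate exceeds the threshold.
significant = [(g, x) for g, x in zip(grid, profile) if x > threshold]
```

Repeating the last four statements for each value of `h` gives one set of significant regions per scale, which is the input for a scale space diagram.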
The choice of the kernel function is an important design parameter in CIS detection. Various kernel functions (Gaussian, triangular, rectangular, Bartlett-Epanechnikov, etc.) have been proposed for various applications [
Since it is possible for CISs to be present for large scale parameters (broad CISs), but not for small scale parameters (narrow CISs), or vice versa, it is vital to consider the significance of CISs for different scale parameters. We propose a scale space approach in which the scale parameter is varied across a range of values to gain information about the “lifespan” of a CIS, while keeping the FWE at a predefined level. The lifespan is defined as the range of scale parameters for which the CIS is significant (exists). It will be shown that the CISs with a long lifespan (i.e., the CISs that appear for small as well as larger scale parameters) often consist of different narrow CISs that are joined together when increasing the scale parameter.
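To make the notion of a lifespan concrete, the following sketch lists the scale parameters at which a single locus is significant. It assumes that a peak height and a threshold are already available for each scale; the function name `lifespan` and the numerical values are hypothetical, and only the list of scales is taken from the scale space diagrams in this study.

```python
# Scale parameters (in bp) evaluated in the scale space diagrams.
SCALES = [50, 100, 250, 500, 1000, 2500, 5000, 10000, 30000, 50000, 100000, 150000]

def lifespan(peak_by_scale, threshold_by_scale):
    """Return the sorted scale parameters at which the peak exceeds the threshold."""
    return [h for h in sorted(peak_by_scale)
            if peak_by_scale[h] > threshold_by_scale[h]]

# Hypothetical peak heights and thresholds for one locus at four of the scales.
peaks = {500: 3.0, 1000: 4.2, 5000: 4.5, 30000: 5.0}
thresholds = {500: 3.5, 1000: 4.0, 5000: 4.4, 30000: 6.0}
print(lifespan(peaks, thresholds))  # scales at which this locus is a CIS
```

In this toy example the locus is a CIS only at intermediate scales: it is too sparse to be significant at 500 bp and does not stand out from the background at 30 kbp.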
Plotting the CISs versus the scale parameter yields scale space diagrams (see, e.g.,
As mentioned before, MLV favors integration near TSS. Consequently, the location of TSS may be a good predictor for integration hot spots. We therefore explore a background model which uses the locations of the 5′ ends of the genes annotated in ENSEMBL for background bias correction. Although it has been shown that the viral insertions prefer integration near the 5′ end of
The performance and robustness of the KC framework in conjunction with the Gaussian, triangular, or rectangular kernel function are evaluated using artificial data. The artificial data consist of a uniform background distribution and one artificially generated CIS at a predefined locus. The insertions within the CIS are generated using a uniform distribution. In
The following evaluation criteria are defined (for details see the
(A) The blue line represents the estimated number of insertions as a function of position for a certain region. The red line depicts the threshold associated with an
(B) CISs are depicted by means of vertical lines. From top to bottom these represent: the CISs for the current scale (30k), the csCISs, the CISs from the RTCGD, the insertions, and the genes (top and bottom strand separated).
(C) Scale space diagram. The vertical axis of the scale space diagram has a logarithmic scale and indicates the scale for which the CIS was detected (only a subset of scales was actually evaluated: [50 100 250 500 1 k 2.5 k 5 k 10 k 30 k 50 k 100 k 150 k] bp).
(D) Evaluation of the insertion distribution over four small scale CISs, identified by scale space analysis. Per screen we list the number of insertions that fall within the small scale CIS. The screens are labeled consistent with RTCGD nomenclature.
This can be explained from the discrete nature of the null-distribution of peak heights (see
Thus, using the number of peaks to correct for multiple testing proves to keep the FWE below the predefined level of 5%, as can be seen from
It should be noted that FWE is controlled per scale parameter. A range of 12 scale parameters is used so that, if for every scale parameter a
The scale-dependent bias and consequent conservativeness of the RKC and TKC also have repercussions for the TPs (
(A,B) Results for the GKC applied to artificial data.
(C,D) Results for the TKC.
(E,F) Results for the RKC.
The horizontal solid lines in (A), (C), and (E) show the 5% significance level, the dotted lines show the average number of csFPs. The legend shows the different simulated CISs, stating the number of insertions
From
From
The results for a Gaussian distribution of insertions within the CIS are given in
In conclusion, the GKC shows a clear advantage with regard to the performance when applied to artificial data, some advantage with regard to positional accuracy, but most important, it shows a consistent error distribution across the scales. For these reasons, we propose the GKC to be the method of choice to analyze the data from the RTCGD.
(A) Average number of csTPs per artificially generated CIS. Significant errors are made for the borderline cases: the narrow CISs (500 bp), or broad CISs (80k bp) with relatively few insertions. The GKC outperforms the RKC and TKC for all simulated CISs.
(B) Average deviation of the detected CIS center from the actual simulated CIS center, normalized by the simulated CIS width plus the scale parameter under consideration.
Applying the GKC method to real data yields scale space diagrams, such as the one depicted in
The added value of breaking a single large-scale CIS for the
In [
The limitations of the definition from [
(A) Plot of the increase of the error as a function of the screen size, when using the definition from [
(B) Venn diagram comparing three different CIS definitions: a) the definition from [
Number of CISs for Various Scale Parameters (Corrected and Uncorrected), the csCISs, the Background-Corrected csCISs, and the CISs from the RTCGD. Background correction only has effect at larger scales.
(A) Venn diagram comparing the csCISs and the CISs in the RTCGD. For reasons explained in
(B) An example of a CIS that consists of three insertions from three independent screens, and therefore is only detected when integrally analyzing the data.
(C) Venn diagram comparing the csCISs with and without applying background correction.
(D) An example of a csCIS, that was also included in the RTCGD, and is rejected based on the background-corrected threshold. The small vertical bars (red) in the genes denote the 5′ ends of genes, and a star denotes a corrected CIS. Since we are only interested in correcting regions that are putative CISs, a background-corrected threshold is only computed for peaks in the estimated number of insertions. The corrected threshold is given by the horizontal dotted line above the peak.
Overview of the Novel CISs Detected by GKC
As expected, the total number of detected CISs is reduced as a consequence of the control of the FWE. The discarded CISs (53%) are not necessarily all false detections; many of them may be screen-specific CISs that consisted of only a few insertions and did not reach significance when we integrally analyzed the data. Also, some of the CISs in the RTCGD were found using human interpretation of the insertions. The GKC can also be applied to any relevant subset of the data, although a minimum of approximately 800 insertions is required to reliably estimate a null-distribution within a reasonable timeframe.
Additionally, the background bias was removed using the procedure described in the
Detection of CISs in large retroviral insertional mutagenesis screens at acceptable false detection rates necessitates correction for multiple testing and renders manual curation of CISs impractical. Current methods do not control the number of falsely detected CISs without changing the scale of the putative CIS, and fail when applied to large datasets. In this paper, this is solved by introducing a KC framework capable of discovering statistically significant CISs, while controlling the FWE for any biologically relevant scale. Because the KC framework controls the error per scale, it is capable of analyzing the data in the scale space, allowing the discovery of narrow as well as broad CISs.
We evaluated the performance of the KC framework using three often-used kernel functions: the Gaussian, triangular, and rectangular kernel functions. From the results obtained using artificial data, we conclude that the KC framework is capable of keeping the FWE under the desired error level, for a range of different CISs and scale parameters. The GKC, however, performs most robustly, since it is capable of controlling the error in an unbiased fashion across the scales. This is highly desirable when analyzing the data in a scale space. Additionally, the TKC and GKC show better positional accuracy when compared with the RKC.
To test the performance of the method on a large dataset, we used the GKC to integrally analyze the data from the RTCGD. This resulted in the discovery of CISs that are significant across the screens according to a consistent definition, have a low probability of being false detections, and can be analyzed in the scale space. As a consequence, 53% of the CISs previously published in the RTCGD did not reach significance in the combined dataset. Among the discovered CISs are eight novel CISs, of which six could have only been found when we integrally analyzed the data. Three of those might be attributed to the background bias, but this is based on too little evidence. For these novel CISs, the putative targets have been provided.
The KC framework is flexible enough to incorporate a background bias correction. Currently, data on which to base the background model are lacking. For instance, the effect of
As an additional benefit, the KC framework excels in visualizing the results, allowing the biologist to inspect the smoothed insertion estimate around interesting loci in the genome. Plotting the CISs in the scale space by means of scale space diagrams yields a valuable visualization tool for the biologist, showing the lifespan of CISs across a range of values of the scale parameter. This enabled the detection of screen-specific biases toward small-scale CISs. Together with the insertion locus relative to the neighboring genes, this provides useful information in determining the target of the insertional mutations.
Recently, some attention was given to multi-experiment analysis in the detection of significant copy number aberrations across experiments in array-CGH data (STAC algorithm from [
Convolution of the insertion data with a kernel results in a smoothed estimate of the number of insertions
The design of the appropriate kernel function is important. Since we are estimating the
Nondescending kernels that have sharp flanks (e.g., rectangular kernels) can only result in discrete (or even integer) estimates of the number of insertions. In this study we show that, although the error is controlled to be below the
In this study the following well-known kernel functions are used and compared:
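As an illustration of the three kernel shapes, the sketch below gives one possible parameterization in Python together with the resulting smoothed estimate. The unit peak height, the support of the triangular and rectangular kernels, and all function names are assumptions made for illustration; in particular, the normalization factors for the triangular and rectangular kernels are omitted.

```python
import math

def gaussian_kernel(x, h):
    """Gaussian bump with peak height 1; h acts as the scale parameter (assumed form)."""
    return math.exp(-0.5 * (x / h) ** 2)

def triangular_kernel(x, h):
    """Linear decay from 1 at x = 0 to 0 at |x| = h (assumed support)."""
    return max(0.0, 1.0 - abs(x) / h)

def rectangular_kernel(x, h):
    """Constant 1 for |x| <= h and 0 elsewhere (assumed support)."""
    return 1.0 if abs(x) <= h else 0.0

def smoothed_estimate(g, insertions, h, kernel):
    """Sum of the kernel contributions of all insertions at genome position g."""
    return sum(kernel(g - d, h) for d in insertions)

# Hypothetical insertion positions (bp): two close together, one far away.
insertions = [100, 110, 5000]
for kernel in (gaussian_kernel, triangular_kernel, rectangular_kernel):
    print(kernel.__name__, smoothed_estimate(100, insertions, 50, kernel))
```

With this unnormalized rectangular kernel the estimate is simply a count of the insertions inside the window, which illustrates why kernels with sharp flanks can only produce discrete estimates.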
The null-distribution is estimated by a permutation-based analysis of the insertion data (see
(A) The position of the
(B) The convolution method is applied to the resulting permuted insertion profile. The heights of all peaks are recorded.
(C) Steps (A) and (B) are repeated. A distribution of the peaks in random data results.
(D) The threshold is computed by determining the
The KC introduces dependencies between bp. A Bonferroni correction will therefore produce overly conservative results. When this dependence is removed by only evaluating the peaks, applying the Bonferroni correction to the
To determine the position of the csCISs, a single linkage hierarchical clustering algorithm is applied to the CIS center loci of the CISs, for all scale parameters. The resulting dendrogram is thresholded at a linkage distance equal to the highest scale parameter, to ensure good cluster separation (see
Compensating for background bias requires inclusion of local changes of the a priori insertion probability in the null-hypothesis. In the KC framework, the correction of the null-hypothesis is achieved by replacing the permutation of the insertion data with a simulation process that incorporates the background insertion distribution. Analogous to recent literature [
The simulation follows the steps outlined in
(A) The density of TSSs (the 5′ ends of the genes) is computed using a fixed kernel width
(B) A new realization of insertions is generated using the density from step A.
(C) The GKC method is applied to the resulting insertion profile, yielding one realization of the background density estimate. Steps (A) and (B) and applying the GKC are repeated N times to yield a distribution of background realizations. For every position on the genome, a CDF of these realizations is computed and the threshold is determined based on the
(D) The location-dependent threshold is combined with the threshold based on uniform background. Finally, the smoothed insertion estimate of the real data is thresholded with the resulting threshold.
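Steps (A) to (D) can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from this study: the background density is a pure Gaussian-smoothed TSS density, insertions are sampled on a coarse grid, the per-position threshold is an empirical quantile without correction for multiple testing, and the two thresholds are combined by taking their pointwise maximum. All names and numbers are hypothetical.

```python
import math
import random

def gaussian_sum(grid, points, h):
    """Sum of unit-height Gaussian kernels of width h, evaluated on the grid."""
    return [sum(math.exp(-0.5 * ((g - p) / h) ** 2) for p in points) for g in grid]

def background_threshold(tss, n_insertions, grid, h_bg, h, alpha, n_sim, rng):
    """Location-dependent threshold from simulated background realizations."""
    density = gaussian_sum(grid, tss, h_bg)                             # step (A)
    realizations = []
    for _ in range(n_sim):
        sample = rng.choices(grid, weights=density, k=n_insertions)     # step (B)
        realizations.append(gaussian_sum(grid, sample, h))              # step (C)
    k = min(n_sim - 1, int((1.0 - alpha) * n_sim))
    # Empirical (1 - alpha) quantile of the realizations at every grid position.
    return [sorted(r[i] for r in realizations)[k] for i in range(len(grid))]

def combined_threshold(local, uniform):
    """Step (D), assumed here to be the pointwise maximum of both thresholds."""
    return [max(t, uniform) for t in local]

rng = random.Random(1)
grid = list(range(0, 10001, 100))
tss = [2000, 7000]                     # hypothetical 5' ends of two genes
local = background_threshold(tss, n_insertions=20, grid=grid, h_bg=500,
                             h=300, alpha=0.05, n_sim=30, rng=rng)
combined = combined_threshold(local, uniform=3.0)
```

Near the two hypothetical TSSs the local threshold is raised, so a peak there needs more insertions to be called a CIS, while far from any TSS the uniform threshold remains in force.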
To simulate the background, a uniform distribution of
The width and number of insertions of the CIS is varied to evaluate the performance of the method for different types of CISs.
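A minimal sketch of how such an artificial dataset could be generated is given below. The function name `simulate_screen` and all parameter values are hypothetical and chosen only for illustration; they are not the settings used in the experiments.

```python
import random

def simulate_screen(genome_length, n_background, cis_center, cis_width, n_cis, rng):
    """Uniform background insertions plus one CIS with uniformly placed insertions."""
    background = [rng.uniform(0, genome_length) for _ in range(n_background)]
    lo = cis_center - cis_width / 2.0
    cis = [rng.uniform(lo, lo + cis_width) for _ in range(n_cis)]
    return sorted(background + cis)

rng = random.Random(42)
# Hypothetical settings: a narrow (500 bp) CIS with five insertions.
data = simulate_screen(genome_length=1_000_000, n_background=200,
                       cis_center=500_000, cis_width=500, n_cis=5, rng=rng)
inside = [d for d in data if 499_750 <= d <= 500_250]
print(len(data), len(inside))
```

Varying `cis_width` and `n_cis` while keeping the background fixed produces the different types of simulated CISs.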
To evaluate the accuracy and performance of the methods, some criteria are defined. Using artificial data allows us to evaluate the correctness of a detected CIS, since the actual CIS locus is known. A TP is defined as the true detection of the artificially generated CIS, and is accomplished if the method identifies a CIS that has one of its bounds within the bounds of the artificially generated CIS (given by
An FP is defined as the detection of a CIS that does not pass the test for a TP, and hence occurs in the background. It should be noted that the probability of making at least one FP (the FWE) is controlled per scale parameter. If the errors made per scale parameter were mutually exclusive, this could result in an undesirably high overall error. To analyze this behavior, the average number of csFPs is computed, which counts an FP at a locus only once even if it occurred across multiple scales.
The average numbers of TPs and FPs are computed by taking the mean across the 500 simulations. Since FPs only occur in the background, distinguishing between the different simulated artificial CISs makes no sense. Therefore, the average number of csFPs is computed by additionally averaging across all different experiments.
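The TP and csFP criteria can be written down compactly. The sketch below applies the TP test literally (one bound of the detected CIS lies within the bounds of the artificial CIS) and counts FPs across scales once per locus. The rule used to decide that two FPs share a locus (`merge_distance`) is an assumption for illustration, as are all names and example values.

```python
def is_true_positive(detected, truth):
    """TP if at least one bound of the detected CIS lies within the true CIS bounds."""
    lo, hi = truth
    return lo <= detected[0] <= hi or lo <= detected[1] <= hi

def count_cross_scale_fps(detections_by_scale, truth, merge_distance):
    """Count FPs over all scales, counting nearby FP centers only once (assumed rule)."""
    centers = sorted((d[0] + d[1]) / 2.0
                     for detections in detections_by_scale.values()
                     for d in detections
                     if not is_true_positive(d, truth))
    count, last = 0, None
    for c in centers:
        if last is None or c - last > merge_distance:
            count += 1
        last = c
    return count

# Hypothetical detections (start, end) in bp at two scale parameters.
truth = (1000, 1500)
detections = {500: [(1200, 1300), (9000, 9100)],
              5000: [(400, 1400), (8800, 9400)]}
print(count_cross_scale_fps(detections, truth, merge_distance=5000))
```

Averaging these counts over repeated simulations gives the average number of csFPs.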
As a final performance measure, the positional accuracy is evaluated. For this purpose, the deviation of the detected CIS center with respect to the artificial CIS center is normalized to the artificial CIS width plus the scale parameter
(A) The red insertions are oncogenic insertions, i.e., insertions either activating oncogenes or inactivating tumor suppressor genes, causing uncontrolled proliferation (cell division) of that cell, yielding a tumor. The green insertions are non-oncogenic insertions. Note that in every tumor cell we find this red insertion at exactly the same locus, since these cells are copies.
(B) After removal of the tumor from the animal, the tumor cells' DNA is cleaved into small fragments using restriction enzymes. These enzymes cut the DNA at certain nucleotide sequences, resulting in small DNA fragments. An additional property of these enzymes is that they cut exactly within the viral inserts, yielding fragments with viral DNA on one end and host cells' DNA on the other.
(C) The histogram shows the abundance of fragments. Because the restriction enzymes cut frequently, many fragments result. However, since only a limited number of insertions is present in each cell, by far most fragments do not contain an insertion. Due to the proliferation in (A), the fragments with oncogenic inserts will be present more often than the fragments containing the non-oncogenic (random) insertions. Note that in reality these abundances are not known.
(D) With a polymerase chain reaction (PCR), only the fragments containing a viral insert are amplified (multiplied). There exist various types of PCR that may be used for this purpose, e.g., the linker-mediated PCR [
(E) Histogram of the abundances of DNA fragments after the PCR. After PCR amplification, the oncogenic (red) insertions will be more prevalent than others, and should all map to exactly the same locus on the genome (defined as a contig). Due to noise in the following steps (sequencing and mapping), the mapped loci may differ by several hundred bp. Singletons are sequences that individually map to a certain locus on the genome and ideally consist of the non-oncogenic insertions.
(F) Finally, a subset of the fragments is cloned, sequenced, and mapped onto the known Ensembl mouse genome. It is highly probable that an informative (oncogenic) insertion is sequenced and mapped, because DNA fragments containing oncogenic insertions make up a substantial proportion of the total. Nevertheless, a non-oncogenic insertion may occasionally be sequenced and mapped, hence the data contain a certain amount of noise.
(A,B) Results for the GKC applied to artificial data.
(C,D) Results for the TKC.
(E,F) Results for the RKC.
The horizontal solid lines in (A), (C), and (E) show the 5% significance threshold, the dotted lines show the average number of csFPs. The legend shows the different simulated CISs, stating the number of insertions drawn from a normal distribution of standard deviation:
(A) Average number of csTPs per artificially generated CIS. Only the CIS with three insertions with
(B) Average deviation of the detected CIS center from the actual simulated CIS center, normalized by the simulated CIS width plus the scale parameter under consideration.
The threshold corrected for multiple testing is given by the horizontal red line. Note that the threshold should change because when the scale parameter increases the number of tests (the number of peaks) decreases. This is not visible in
The blue line represents the estimation of the number of insertions as a function of position for a certain region. The red line depicts the 0.05 threshold level. In the middle, the CISs are depicted by means of vertical lines. From top to bottom these represent: the CISs for the current scale (30k, green), the csCISs (cyan), the CISs from the RTCGD (magenta), and the insertions (blue). The genes are not shown on this large scale. At the bottom, the CISs are plotted in the scale space. The vertical axis has a logarithmic scale and indicates the scale for which the CIS was detected (only a subset of scales was actually evaluated: [50 100 250 500 1 k 2.5 k 5 k 10 k 30 k 50 k 100 k 150 k] bp).
CISs that fall within one scale parameter from a TSS are candidates for correction. We clearly see that for small scale parameters only a few of these CISs exist, indicating that it is not justified to correct narrow CISs for background bias.
A single linkage dendrogram is built from the CIS centers, and thresholded at a linkage distance equal to the highest scale parameter. The mean center position of the CISs within one of the resulting clusters is defined as the locus of the csCIS. Note that a csCIS arises if a CIS is present for at least one scale parameter. In case a CIS is present for only one scale parameter, the csCIS locus is equal to the CIS center position.
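Because the CIS centers lie on a line, cutting a single linkage dendrogram at a given distance is equivalent to sorting the centers and starting a new cluster wherever the gap between neighbors exceeds that distance. The sketch below uses this shortcut instead of building the dendrogram explicitly; the function name and example positions are hypothetical, and centers from different chromosomes are assumed to be handled separately.

```python
def cross_scale_cis(centers, max_scale):
    """Cluster CIS centers from all scales and return the mean center per cluster."""
    clusters = []
    for c in sorted(centers):
        if clusters and c - clusters[-1][-1] <= max_scale:
            clusters[-1].append(c)      # gap small enough: same csCIS
        else:
            clusters.append([c])        # gap too large: start a new csCIS
    return [sum(cluster) / len(cluster) for cluster in clusters]

# Hypothetical CIS centers (bp) pooled over all scale parameters.
centers = [1000, 1200, 1100, 500000]
print(cross_scale_cis(centers, max_scale=150000))
```

A CIS that is detected at a single scale forms a cluster of size one, so its csCIS locus equals its own center, as in the last cluster of this example.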
The authors thank M. van Uitert for critical reading of the manuscript. Furthermore, the authors are thankful to the reviewers for their insightful comments.
CIS: common insertion site
csCIS: cross scale CIS
csFP: cross scale false positive
FWE: family-wise error
FP: false positive
GKC: Gaussian kernel convolution
KC: kernel convolution
MLV: murine leukemia virus
PCR: polymerase chain reaction
RKC: rectangular kernel convolution
RTCGD: Retroviral Tagged Cancer Gene Database
TKC: triangular kernel convolution
TP: true positive
TSS: transcription start site
normalization factor for rectangular kernel
normalization factor for triangular kernel
mean height of the peaks in the permuted insertion data
mean height of the estimation of the number of insertions in the permuted insertion data
observed height of a peak
position of the nth insertion
base pair position in the genome
genome length in base pair
genome length in artificial data experiment
kernel width
kernel width used for density estimation of the background
kernel function
total number of insertions
number of insertions in artificial data experiment
number of insertions within artificial data experiment
smoothed estimate of the number of insertions
width of the CIS in artificial data experiment