Conceived and designed the experiments: JD RDB ZDZ MBG. Performed the experiments: JD. Analyzed the data: JD RDB ZDZ YK MS MBG. Contributed reagents/materials/analysis tools: JD RDB. Wrote the paper: JD MBG.
The authors have declared that no competing interests exist.
The goal of human genome re-sequencing is to obtain an accurate assembly of an individual's genome. Recently, there has been great excitement in the development of many technologies for this (e.g. medium- and short-read sequencing from companies such as 454 and SOLiD, and high-density oligo-arrays from Affymetrix and NimbleGen), with even more expected to appear. The costs and sensitivities of these technologies differ considerably from each other. As an important goal of personal genomics is to reduce the cost of re-sequencing to an affordable point, it is worthwhile to consider how to optimally integrate these technologies. Here, we build a simulation toolbox that helps us optimally combine different technologies for genome re-sequencing, especially for reconstructing large structural variants (SVs). SV reconstruction is considered the most challenging step in human genome re-sequencing. (It is sometimes even harder than de novo assembly of small genomes because of the duplications and repetitive sequences in the human genome.) To this end, we formulate canonical problems that are representative of issues in reconstruction and are of small enough scale to be computationally tractable and simulatable. Using semi-realistic simulations, we show how different technologies can be combined to optimally solve the assembly at low cost. With mapability maps, our simulations efficiently handle the inhomogeneous repeat-containing structure of the human genome and the computational complexity of practical assembly algorithms. They quantitatively show how combining different read lengths is more cost-effective than using one length, how an optimal mixed sequencing strategy for reconstructing large novel SVs usually also gives accurate detection of SNPs/indels, how paired-end reads can improve reconstruction efficiency, and how adding in arrays is more efficient than sequencing alone for disentangling some complex SVs. Our strategy should facilitate the sequencing of human genomes at maximum accuracy and low cost.
In recent years, the development of high-throughput sequencing and array technologies has enabled the accurate re-sequencing of individual genomes, especially in identifying and reconstructing the variants in an individual's genome compared to a “reference”. The costs and sensitivities of these technologies differ considerably from each other, and even more technologies are expected to appear in the near future. To both reduce the total cost of re-sequencing to an affordable point and remain adaptable to these constantly evolving biotechnologies, we propose to build a computationally efficient simulation framework that can help us optimize the combination of different technologies to perform low-cost comparative genome re-sequencing, especially in reconstructing large structural variants, which is considered in many respects the most challenging step in genome re-sequencing. Our simulation results quantitatively show how much improvement one can gain in reconstructing large structural variants by integrating different technologies in optimal ways. We envision that in the future, more experimental technologies will be incorporated into this simulation framework and that its results can provide informative guidelines for actual experimental design to achieve optimal genome re-sequencing output at low cost.
The human genome comprises approximately 6 billion nucleotides on two sets of 23 chromosomes. The variation between individuals comprises ∼6 million single nucleotide polymorphisms (SNPs), ∼1000 relatively large structural variants (SVs) of ∼3 kb or larger, and many more smaller SVs, which together are responsible for the phenotypic variation among individuals
Over the past decade, the development of high-throughput sequencing technologies has made personal genomics almost a reality by enabling the sequencing of individual genomes
These projects and algorithms, however, mostly relied on a single sequencing technology to perform individual re-sequencing and thus did not take full advantage of all the existing experimental technologies.
| | Long Sequencing | Medium Sequencing | Short Sequencing | CGH array (high/low resolution) |
| Read length (bp) | ∼800 | ∼250 | ∼30 | Tiling step size: ∼85 bp |
| Approximate cost per base (US$) | ∼1E-3 | ∼7E-5 | ∼7E-6 | ∼3E-7 per array |
| Error rate | 0.001–0.002% | 0.3–0.5% | 0.2–0.6% | N/A (detecting signals rather than sequences) |
| Dominant error types | Substitution errors | Insertion/deletion errors (usually caused by homo-polymers) | All error types | Array-specific errors (cross-hybridization effects) |
| Typical applications (single reads) | Identify small/medium SVs; localize SVs close to highly represented genomic regions | Identify small SVs; localize SVs in highly represented ∼100mers | Identify SNPs; localize SNPs in lowly represented genomic regions | Detect large CNVs with relatively low resolution; relatively cheaper than current sequencing technologies |
| Typical applications (paired-end reads) | Detect large indels with relatively low resolution; provide extra information to localize SVs | Detect large indels with relatively low resolution; provide extra information to localize SVs | Link distant SNPs for haplotype phasing | N/A |
Data based on:
1) de la Vega FM, Marth GT, Sutton GG (2008) ‘Computational Tools for Next-Generation Sequencing Applications’, Pacific Symposium on Biocomputing 2008.
2) de Bruin D (2007) UBS Investment Research, Q-Series: DNA Sequencing. UBS, New York.
Due to the existence of reference genome assemblies
Here we present a toolbox and some representative case studies on how to optimally combine different experimental technologies in an individual genome re-sequencing project, especially in reconstructing large SVs, so as to achieve accurate and economical sequencing. An “optimal” experimental design should be an intelligent combination of the long, medium, and short sequencing technologies, as well as array technologies such as CGH. Some of the previous genome sequencing projects
In the following sections, we first briefly describe a schematic comparative genome re-sequencing framework, focusing on the intrinsically most challenging steps of reconstructing large SVs, and then use a set of semi-realistic simulations of these representative steps to optimize the integrated experimental design. Since full simulations of such steps are computationally intractable over the large parameter space of combinations of different technologies, the simulations are carried out in a framework that combines real genomic data with analytical approximations of the sequencing and assembly process. This simulation framework is capable of incorporating new technologies as well as adjusting the parameters of existing ones, and can provide informative guidelines for optimal re-sequencing strategies as the characteristics and cost structures of these technologies evolve and combining them becomes an increasingly important concern. The simulation framework is downloadable as a general toolbox to guide optimal re-sequencing as technology continues to advance.
In the following subsection, we first briefly describe a systematic genome assembly strategy for the different types of sequencing reads and array signals, which integrates different sequence assembly and tiling array data analysis algorithms. After discussing in detail the most difficult steps of this strategy, i.e. the reconstruction of large SVs, and defining the performance metric for such reconstruction, we present a semi-realistic sequencing simulation framework that can guide optimal experimental design, and show the results of simulations of the reconstruction of two types of large SVs.
The hybrid genome assembly strategy incorporates both comparative
The orange line represents the target individual genome, the red bars stand for the SNPs and small SVs compared to the reference, and the green region represents a large SV. (A) After the sequencing experiments, single and paired-end reads with different lengths (long, medium, short, shown in different colors) are generated, which can be viewed as various partial observations of the target genome sequence. The dashed lines represent the links of the paired ends. The horizontal positions of the reads indicate their locations in the genome. (B) After error correction, the reads are mapped back to the reference genome, and the short reads are assembled into longer contigs based on their overlapping information. The red and green regions stand for the mismatches/gaps in the mapping results. (C) The SNPs and small SVs can be inferred directly from the mapping results, and haplotype phasing can also be performed after this step. (D, E) Large SVs can be detected and reconstructed based on the reads without consistent matches in the reference genome, and also based on the results from CGH arrays. This step is explained in more detail in the
Further analysis of the single/paired-end reads is required to reconstruct the large SVs (
The horizontal positions of the reads indicate the mapping locations, and the colors refer to sequences from different genomic regions. (A–C) An example of the reconstruction of a novel insertion. (A) The region A (
It is important to define a reasonable performance metric so that the re-sequencing approach can be designed to optimize its outcome according to that metric. For large SVs, the metric can be defined based on the alignment of the actual variant sequence to the inferred variant sequence. For a large SV due to genomic rearrangements (e.g. deletion, duplication), it is natural to define its recovery rate as either 1 (detected) or 0 (missed). For a large novel insertion, on the other hand, we may want to take into account cases where the insertion is detected but its sequence content is not reconstructed with full accuracy. Hence, we define the recovery rate of such a large novel insertion as follows based on its reconstruction percentage:
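As an illustration only (the exact formula used by the toolbox is not reproduced here), one natural reconstruction-percentage-based metric of this kind is the fraction of the actual inserted sequence that is correctly recovered in the alignment:

    \text{recovery rate} \;=\; \frac{\text{number of correctly aligned bases of the actual insertion}}{\text{total length of the actual insertion}}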
Based on the schematic assembly strategy and the performance measure defined in the previous sections, we can simulate the sequence assembly process in order to obtain an optimal set of parameters for the design of the sequencing experiments (e.g. the amount of long (Sanger), medium (454) and short (Illumina) reads, the amount of single and paired-end reads) and the array experiments (e.g. the incorporation of CGH arrays) to achieve the desired performance with a relatively low cost in the individual genome re-sequencing project.
Here we present the results of a set of simulation case studies on reconstructing large SVs, which are in general much more challenging problems than the detection of small SVs. In order to fully reconstruct a long novel insertion, for instance, one needs not only to detect the insertion boundaries based on the split-reads, but also to assemble the insertion sequence from the spanning- and misleading-reads. For the identification of genomic rearrangements such as deletions/translocations, one may also want to incorporate array data to increase the confidence level of such analysis. The simulations described in this section are based on large (∼10 kb, ∼5 kb and ∼2 kb) novel insertions and deletions discovered by Levy et al.
One major challenge in implementing these simulations is to design them in a computationally realistic way. Brute-force full simulations of whole-genome assembly would be unrealistic in this case: thousands of possible combinations of different technologies would need to be tested, and for each of these combinations hundreds of genome assembly simulations would need to be carried out to obtain the statistical distributions of their performance. Since a full simulation of one round of whole-genome assembly would probably take hundreds of CPU hours to finish, the full simulation exploring the entire space of technology combinations would then require hundreds of millions (∼1E8) of CPU hours, equivalent to ∼10 years with 1000 CPUs. We designed the simulations using analytical approximations of the whole-genome assembly process in order for them to be both time and space efficient, and the gain in efficiency is summarized in
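For concreteness, this order-of-magnitude estimate amounts to the following product (the individual factors are illustrative round numbers, not exact counts):

    \sim\!10^{3}\ \text{combinations} \times \sim\!10^{2}\ \text{simulations each} \times \text{several hundred CPU hours per assembly} \approx 10^{8}\ \text{CPU hours}, \qquad \frac{10^{8}\ \text{CPU hours}}{1000\ \text{CPUs} \times 8760\ \text{hours/year}} \approx 11\ \text{years}.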
Variable | Description | Representative value |
G | Size of the genome | 3E9 bp |
c | Sequencing coverage | 10× |
I | Size of the large novel insertion of interest | 1E4 bp |
r | Average read length | 50 bp |
m | Average mapability values of the sub-sequences in the novel insertion | 3 |
Simulation strategy | Number of reads generated for the reconstruction of a novel insertion | Time to compute read overlaps |
Whole genome sequencing+hybrid (comparative+de novo) assembly | ||
Simulation utilizing pre-computed mapability maps | ||
Approximate reduction in complexity (fold) | ∼1E5 | ∼1.5E7 |
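As a rough illustration of the reduction in the number of reads that must be handled, plugging the representative values from the variable table above into a back-of-the-envelope calculation reproduces the ∼1E5-fold figure (a minimal Python sketch; the exact expressions used in the toolbox are not reproduced here):

    # Back-of-the-envelope estimate of how many reads must be handled when
    # reconstructing one novel insertion, with and without pre-computed
    # mapability maps. Values are the representative ones from the table above.
    G = 3e9   # genome size (bp)
    c = 10    # sequencing coverage
    I = 1e4   # size of the large novel insertion of interest (bp)
    r = 50    # average read length (bp)
    m = 3     # average mapability of the sub-sequences in the insertion

    reads_whole_genome = G * c / r         # all reads from whole-genome sequencing
    reads_with_mapability = I * c * m / r  # reads relevant to the insertion region,
                                           # including mapability-implied misleading reads
    print(reads_whole_genome / reads_with_mapability)   # ~1e5-fold reduction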
The simulation results of the recovery rates of novel insertions when we combine long, medium and short sequencing technologies with a fixed total cost and reconstruct a ∼10 Kb novel insertion region previously identified in the HuRef genome compared to the NCBI reference genome. The total cost is ∼$7 on this novel insertion (i.e. the reads covering this region cost ∼$7), and the total re-sequencing budget is ∼$2.1 M if we scale the cost on this region to the whole genome with the same sequencing depth. (A) The triangle plane corresponds to all the sequencing combinations whose total costs are fixed. The colors on the plane indicate the average recovery rates of the novel insertion with different sequencing combinations, averaged over multiple trials of simulations. (B) The same triangle region as in
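A minimal sketch of how such a fixed-cost triangle of sequencing combinations can be enumerated, assuming for illustration the approximate per-base costs of $1E-3, $7E-5 and $7E-6 for long, medium and short reads from the technology table above; the grid step, cost model and the assembly simulation invoked on each point are placeholders rather than the toolbox's actual procedure:

    # Enumerate long/medium/short mixtures whose total sequencing cost on a
    # ~10 kb target region is fixed (~$7 here); each mixture is one point on
    # the triangle plane of the figure. Per-base costs are assumed values.
    COST_PER_BASE = {"long": 1e-3, "medium": 7e-5, "short": 7e-6}  # assumed $/bp
    REGION = 1e4     # size of the target region (bp)
    BUDGET = 7.0     # total sequencing cost spent on this region ($)
    STEP = 0.05      # granularity of the budget-fraction grid

    combinations = []
    n = int(round(1.0 / STEP))
    for i in range(n + 1):
        for j in range(n + 1 - i):
            f_long, f_medium = i * STEP, j * STEP
            f_short = 1.0 - f_long - f_medium
            # Convert each technology's budget share into coverage of the region.
            coverage = {
                "long":   BUDGET * f_long   / (COST_PER_BASE["long"]   * REGION),
                "medium": BUDGET * f_medium / (COST_PER_BASE["medium"] * REGION),
                "short":  BUDGET * f_short  / (COST_PER_BASE["short"]  * REGION),
            }
            combinations.append(coverage)
    # Each coverage mixture would then be fed to the assembly simulation to
    # estimate the average recovery rate plotted as one colored point.
    print(len(combinations), combinations[0])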
Our simulation here focuses on the reconstruction of large novel SVs, and thus, depending on the actual characteristics of the different sequencing technologies, the optimal combination of these technologies obtained in this simulation may involve a trade-off against the accuracy of detecting SNPs and small indels, i.e., the optimal mixed sequencing strategy for the reconstruction of large novel SVs could lead to a low detection rate of smaller SV events. In this particular example, however, our optimal combination would also guarantee a high recovery rate of SNPs and small indels in the genome, according to the results of an individual genome re-sequencing project described in
Similarly to
(A) The same type of figure as
We also carried out simulations on reconstructing these novel insertion regions (∼10 kb, ∼5 kb, ∼2 kb) using paired-end reads with different insert sizes (10 kb and 3 kb inserts for medium paired-end reads, and a 150 bp insert for short paired-end reads).
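A minimal sketch of why the insert size matters here, under the simplifying assumption that one mate maps uniquely to the flanking reference sequence and the insert size then constrains where its partner lies inside the insertion (an illustration of the idea, not the toolbox's placement procedure; all coordinates below are hypothetical):

    # If one mate of a pair maps uniquely just upstream of an insertion
    # breakpoint, the known insert size localizes its partner inside the
    # insertion even when the partner's own sequence maps ambiguously.
    def partner_start_in_insertion(anchor_start, breakpoint, insert_size, read_len):
        # The fragment spans [anchor_start, anchor_start + insert_size) in the
        # sample genome; the partner read occupies its last read_len bases.
        return insert_size - read_len - (breakpoint - anchor_start)

    for insert in (150, 3000, 10000):   # insert sizes used in this simulation
        print(insert, partner_start_in_insertion(anchor_start=9930,
                                                 breakpoint=10000,
                                                 insert_size=insert,
                                                 read_len=50))
    # Larger inserts place mates deep inside a ~10 kb insertion, whereas a
    # 150 bp insert only reaches the first ~100 bp past the breakpoint.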
(A) The same type of figure as
The second simulation focuses on the identification of genomic rearrangement events, such as deletions and translocations. CNV analysis can be used for this purpose and in this section we simulate its results based on the read-depth and signal intensity analysis of sequencing and CGH array data.
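A minimal sketch of the read-depth side of such an analysis, assuming Poisson-distributed per-window depth and using a log-likelihood ratio between a "deleted" (reduced-depth) and a "translocated" (unchanged-depth) hypothesis as the confidence value; the confidence measure actually used in the simulation is not reproduced here:

    import math
    import numpy as np

    # Score how strongly per-window read depths over a candidate region favor a
    # deletion (reduced depth) over a translocation (depth unchanged), using a
    # Poisson depth model; the log-likelihood ratio serves as the confidence.
    def depth_llr(depths, normal_cov, deleted_cov=None):
        if deleted_cov is None:
            deleted_cov = normal_cov / 2.0          # heterozygous deletion
        def loglik(lam):
            return sum(d * math.log(lam) - lam - math.lgamma(d + 1) for d in depths)
        return loglik(deleted_cov) - loglik(normal_cov)

    rng = np.random.default_rng(0)
    normal_cov, windows = 10, 18                    # ~18 kb region, 1 kb windows
    print(depth_llr(rng.poisson(normal_cov / 2, windows), normal_cov))  # > 0: deletion
    print(depth_llr(rng.poisson(normal_cov, windows), normal_cov))      # < 0: translocation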
Boxplot of the CNV analysis simulation results of a large (∼18 Kb) deletion in the target individual's genome. The values on the x-axis correspond to different sequencing coverage and relative noise level in the CGH arrays. The value on the y-axis indicates the confidence of using different datasets to determine that a deletion event takes place instead of a translocation event.
In order to be adaptive to the fast development of the experimental technologies in personal genomics, our simulation framework is modularized in such a way that it is capable of incorporating new technologies as well as adjusting the parameters for the existing ones. Also, this approach relies on the general concept of mapability data, and can be easily applied to any representative SV for similar analysis. We envision that in the future, more experimental technologies can be incorporated into this sequencing/assembly simulation and the results of such simulations can provide informative guidelines for the actual experimental design to achieve optimal assembly performance at relatively low costs. With this purpose, we have made this simulation framework downloadable at
The simulation results in the previous section are based on three sequencing technologies and an idealized array technology, and assume a specific parameterization of their characteristics and costs. Thus, the particular optimal solutions found may not be immediately applicable to a real individual genome re-sequencing project. However, these results illustrate quantitatively how we can design and run simulations to obtain guidelines for optimal experimental design in such projects.
Since our simulation approach is based on the general concept of mapability map and comparative SV reconstruction instead of on a specific organism, it can also be adapted to the comparative sequencing of a non-human genome with regard to a closely related reference. In such a study, we can first construct an artificial target genome based on estimations of its divergence from the reference, and then compute the mapability maps of those representative SVs as input to the simulation framework to find the optimal combination of technologies. Obviously, the closer the two genomes are, the more informative the simulation result would be. In cases where it is hard to estimate the divergence of the target genome from the reference, a two-step approach can be conducted: First, combined sequencing experiments will be carried out using an optimal configuration obtained from the simulation based on the “best guess”, such as another closely related genome. Second, by using the target genome constructed in the previous step, a new set of simulations can be executed and their results can guide a second round of combined sequencing which can provide a finer re-sequencing outcome when combined with the previous sequencing data. Meanwhile, our simulation framework specifically focuses on the effects of misleading reads in the SV reconstruction process, and it will be the most helpful in cases where the target and reference genome both have complex repetitive/duplicative sequence characteristics which will introduce such reads.
In this paper, we propose to optimally incorporate different experimental technologies in the design of an individual genome-sequencing project, especially for the full reconstruction of large SVs, to achieve accurate output with relatively low costs. We first describe a hybrid genome re-sequencing strategy for detecting SVs in the target genome, and then propose how we can design the optimal combination of experiments for reconstructing large SVs based on the results of semi-realistic simulations with different single and paired-end reads. We also present several examples of such simulations, focusing on the reconstruction of large novel insertions and confirmation of large deletions based on CNV analysis, which are the most challenging steps in individual re-sequencing. The simulations for actual sequencing experimental design can integrate more technologies with different characteristics, and also test the sequencing/assembly performance at different SV levels. By doing so, a set of experiments based on various technologies can be integrated to best achieve the ultimate goal of an individual genome re-sequencing project: accurately detecting all the nucleotide and structural variants in the individual's genome in a cost-efficient way. Such information will ultimately prove beneficial in understanding the genetic basis of phenotypic differences in humans.
The NCBI assembly v36
Since we would be testing thousands of possible combinations of the long, medium and short sequencing technologies, it would be unrealistic (both time and space consuming) to generate for each combination all the reads from the whole target genome and then apply an existing assembler to these reads. We decided to semi-realistically simulate the assembly process of large novel insertions to achieve relatively accurate estimates in an affordable amount of time. Several difficulties need to be addressed by such a simulation: 1) One of the most time-consuming steps in a real assembler is the read overlap-layout step. 2) The whole-genome sequencing experiment introduces large numbers of misleading reads that are partially similar to the reads from the targeted genomic region, which would require a huge amount of storage space in a real assembly process.
In order to both accelerate the simulation of the overlap-layout step and simulate the whole-genome sequencing setting in a space-efficient manner, we pre-computed the mapability
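A minimal sketch of one way such a mapability map can be pre-computed for a fixed subsequence length k, counting how many times the k-mer starting at each position occurs in the whole sequence; the precise definition used by the toolbox (and the lemmas referred to below) is not reproduced here, and a genome-scale implementation would use an index or aligner rather than an in-memory counter:

    from collections import Counter

    # Pre-compute, for every position of a sequence, how many times the k-mer
    # starting there occurs in the whole sequence (the "mapability" of that
    # position). Toy-scale illustration only.
    def mapability_map(genome, k):
        counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
        return [counts[genome[i:i + k]] for i in range(len(genome) - k + 1)]

    toy_genome = "ACGTACGTTTACGTAC"        # hypothetical toy sequence; the real
    print(mapability_map(toy_genome, 4))   # input would be the reference/target genome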
The following lemmas are obvious:
According to the above definition,
First, all the reads from the target insertion region are generated (
(A) A target genome with a large novel insertion. Regions
The generated reads that align to the same genomic starting location are grouped together and the per-position error statistics are computed, resulting in a set of read-groups that start from different locations, each with its position-specific error statistics. These read-groups are then further combined in the de novo reconstruction process described below.
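A minimal sketch of this grouping step, assuming each simulated read is represented by its alignment start position and a per-base error indicator; the toolbox's internal representation may differ:

    from collections import defaultdict

    # Group simulated reads by genomic start position and compute the
    # per-position error rate within each group (position-specific statistics).
    def build_read_groups(reads):
        # reads: list of (start, [True/False per base: was this base an error?])
        groups = defaultdict(list)
        for start, errors in reads:
            groups[start].append(errors)
        stats = {}
        for start, error_lists in groups.items():
            length = max(len(e) for e in error_lists)
            stats[start] = [
                sum(e[i] for e in error_lists if i < len(e)) /
                sum(1 for e in error_lists if i < len(e))
                for i in range(length)
            ]
        return stats          # {start: [error rate at each read position]}

    reads = [(100, [False, False, True]), (100, [False, True, True]),
             (250, [False, False, False])]
    print(build_read_groups(reads))   # {100: [0.0, 0.5, 1.0], 250: [0.0, 0.0, 0.0]}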
Additional reads (same, similar and misleading) are introduced (
In order to simulate such a process in a whole-genome sequencing setting, the mapability data are again utilized, as illustrated in
The misleading-reads are generated in the following way: for a contig
For computational efficiency, we also developed a simplified assembler module to assemble all the generated reads. As illustrated in
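As a rough illustration of what such a simplified assembler does, the sketch below greedily extends a seed contig with reads that overlap its growing end by at least a minimum length; this illustrates the general idea only and is not the actual module:

    # Greedy contig extension: repeatedly append the read with the longest
    # suffix-prefix overlap against the growing contig (illustration only).
    def extend_contig(seed, reads, min_overlap=20):
        contig, remaining = seed, list(reads)
        while True:
            best = None
            for read in remaining:
                for ov in range(min(len(contig), len(read)), min_overlap - 1, -1):
                    if contig.endswith(read[:ov]):
                        if best is None or ov > best[1]:
                            best = (read, ov)
                        break   # longest overlap for this read found
            if best is None:
                return contig
            read, ov = best
            contig += read[ov:]
            remaining.remove(read)

    seed = "ACGTACGTAGGCATTTCGAT"                  # hypothetical boundary contig
    reads = ["TTTCGATGGCAT", "GGCATACCGTTA"]       # hypothetical reads
    print(extend_contig(seed, reads, min_overlap=5))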
The de novo extensions are performed by the simplified assembler described above from both ends of the insertion region, and the combined results are then compared to the actual insertion to obtain the reconstruction rate of the target region, based on the metric described in the
In this simulation, we assume that the boundaries of a large deletion event have already been identified by sequence reads, and we are simulating the process of determining whether this is a deletion or translocation event, based on the short reads alone or on the idealized CGH data. The reads are generated in a similar fashion as described in the previous section, without considering sequencing errors for simplicity. The idealized CGH signal of a corresponding region
MM values and worst case reconstruction examples of a 10 Kb novel insertion.
(0.08 MB PDF)