Research Article

# Integrating Sequencing Technologies in Personal Genomics: Optimal Low Cost Reconstruction of Structural Variants

• Affiliation: Department of Computer Science, Yale University, New Haven, Connecticut, United States of America

X
• Affiliations: Department of Computer Science, Yale University, New Haven, Connecticut, United States of America, Keck Biotechnology Resource Laboratory, Yale University, New Haven, Connecticut, United States of America

X
• Affiliation: Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America

X
• Affiliation: Keck Biotechnology Resource Laboratory, Yale University, New Haven, Connecticut, United States of America

X
• Affiliations: Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America, Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, Connecticut, United States of America

X
• Mark.Gerstein@yale.edu

Affiliations: Department of Computer Science, Yale University, New Haven, Connecticut, United States of America, Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America, Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America

X
• Published: July 10, 2009
• DOI: 10.1371/journal.pcbi.1000432

## Abstract

The goal of human genome re-sequencing is obtaining an accurate assembly of an individual's genome. Recently, there has been great excitement in the development of many technologies for this (e.g. medium and short read sequencing from companies such as 454 and SOLiD, and high-density oligo-arrays from Affymetrix and NimbelGen), with even more expected to appear. The costs and sensitivities of these technologies differ considerably from each other. As an important goal of personal genomics is to reduce the cost of re-sequencing to an affordable point, it is worthwhile to consider optimally integrating technologies. Here, we build a simulation toolbox that will help us optimally combine different technologies for genome re-sequencing, especially in reconstructing large structural variants (SVs). SV reconstruction is considered the most challenging step in human genome re-sequencing. (It is sometimes even harder than de novo assembly of small genomes because of the duplications and repetitive sequences in the human genome.) To this end, we formulate canonical problems that are representative of issues in reconstruction and are of small enough scale to be computationally tractable and simulatable. Using semi-realistic simulations, we show how we can combine different technologies to optimally solve the assembly at low cost. With mapability maps, our simulations efficiently handle the inhomogeneous repeat-containing structure of the human genome and the computational complexity of practical assembly algorithms. They quantitatively show how combining different read lengths is more cost-effective than using one length, how an optimal mixed sequencing strategy for reconstructing large novel SVs usually also gives accurate detection of SNPs/indels, how paired-end reads can improve reconstruction efficiency, and how adding in arrays is more efficient than just sequencing for disentangling some complex SVs. Our strategy should facilitate the sequencing of human genomes at maximum accuracy and low cost.

## Author Summary

In recent years, the development of high throughput sequencing and array technologies has enabled the accurate re-sequencing of individual genomes, especially in identifying and reconstructing the variants in an individual's genome compared to a “reference”. The costs and sensitivities of these technologies differ considerably from each other, and even more technologies are expected to appear in the near future. To both reduce the total cost of re-sequencing to an affordable point and be adaptive to these constantly evolving bio-technologies, we propose to build a computationally efficient simulation framework that can help us optimize the combination of different technologies to perform low cost comparative genome re-sequencing, especially in reconstructing large structural variants, which is considered in many respects the most challenging step in genome re-sequencing. Our simulation results quantitatively show how much improvement one can gain in reconstructing large structural variants by integrating different technologies in optimal ways. We envision that in the future, more experimental technologies will be incorporated into this simulation framework and its results can provide informative guidelines for the actual experimental design to achieve optimal genome re-sequencing output at low costs.

### Introduction

The human genome is comprised of approximately 6 billion nucleotides on two pairs of 23 chromosomes. Variations between individuals are comprised of ~6 million single nucleotide polymorphisms (SNPs) and ~1000 relatively large structural variants (SVs) of ~3 kb or larger and many more smaller SVs are responsible for the phenotypic variation among individuals [1],[2]. Most of these large SVs are due to genomic rearrangements (e.g. duplication and deletion), and a few others contain novel sequences that are not present in the reference genome [3]. The goal of personal genomics is to determine all these genetic differences between individuals and to understand how these contribute to phenotypic differences in individuals.

Making personal genomics almost a reality over the past decade, the development of high throughput sequencing technologies has enabled the sequencing of individual genomes [3],[4]. In 2007, Levy et al. reported the sequencing of an individual's genome based on Sanger [5] whole-genome shotgun sequencing, followed by de novo assembly strategies. Wheeler et al. in 2008 presented another individual's genome sequence constructed from 454 sequencing reads [6] and comparative genome assembly methods. In the mean time, other new sequencing technologies such as Solexa/Illumina sequencing [7] have become available for individual genome sequencing with corresponding, specially-designed sequence assembly algorithm designed [8][12].

These projects and algorithms, however, mostly relied on a single sequencing technology to perform individual re-sequencing and thus did not take full advantage of all the existing experimental technologies. Table 1 gives a summary of the characteristics of several technologies in comparative individual genome sequencing. At one extreme, performing long Sanger sequencing with a very deep coverage will lead to excellent results at high cost. In another, performing only the inexpensive and short Illumina sequencing may generate good and cost-efficient results in SNP detection, but will not be able to either unambiguously locate some of the SVs in repetitive genomic regions or fully reconstruct many of the large SVs. Moreover, array technologies such as the SNP array [1] and the CGH array at different resolutions [13][16] can also be utilized to identify the SVs: the SNP arrays can detect SNPs directly, and the CGH array is able to detect kilobase-(kb) to megabase-(mb) sized copy number variants (CNV) [17], which can be integrated into the sequencing-based SV analysis. It is thus advantageous to consider optimally combining all these experimental techniques into the individual genome re-sequencing framework and to design experiment protocols and computational algorithms accordingly.

Due to the existence of reference genome assemblies [18],[19] and the high similarity between an individual's genome and the reference [3], the identification of small SVs is relatively straightforward in comparative re-sequencing with the analysis of single split-reads covering small SVs. Meanwhile, although there exist algorithms to detect large SVs with paired-end reads [2], the complete reconstruction of a large SV requires the integration of reads spanning a wide region, often involving misleading reads from other locations of the genome. If there were no repeats or duplications in the human genome, the reconstruction of such large SVs would be trivially accomplished by the de novo assembly with a high coverage of inexpensive short reads around these regions. With the existence of repeats and duplications in the human genome, however, a set of longer reads will be required to accurately locate some of these SVs in repetitive regions, and a hybrid re-sequencing strategy with both comparative and de novo approaches will be necessary to identify genomic rearrangement events such as deletions and translocations, and also to reconstruct large novel insertions in individuals. Such steps are thus much harder than the others, and will be the main focus of this paper.

Here we present a toolbox and some representative case studies on how to optimally combine the different experimental technologies in the individual genome re-sequencing project, especially in reconstructing large SVs, so as to achieve accurate and economical sequencing. An “optimal” experimental design should be an intelligent combination of the long, medium, and short sequencing technologies and also some array technologies such as CGH. Some of the previous genome sequencing projects [20],[21] have already incorporated such hybrid approaches using both long and medium reads, although the general problem of optimal experimental design has not yet been systematically studied. While it is obvious that combining technologies is advantageous, we want to quantitatively show the potential savings based on different integration strategies. Also, since the technologies are constantly developing, it will be useful to have a general and flexible approach to predict the outcome of integrating different technologies, including the new ones coming in the future.

In the following sections, we will first briefly describe a schematic comparative genome re-sequencing framework, focusing on the intrinsically most challenging steps of reconstructing large SVs, and then use a set of semi-realistic simulations of these representative steps to optimize the integrated experimental design. Since full simulations are computationally intractable for such steps in the large parameter space of combinations of different technologies, the simulations are carried out in a framework that can combine the real genomic data with analytical approximations of the sequencing and assembly process. Also, this simulation framework is capable of incorporating new technologies as well as adjusting the parameters for existing ones, and can provide informative guidelines to optimal re-sequencing strategies as the characteristics and cost-structures of such technologies evolve, when combining them becomes a more important concern. The simulation framework is downloadable as a general toolbox to guide optimal re-sequencing as technology constantly advances.

### Results

We first briefly describe in the following subsection a systematic genome assembly strategy for the different types of sequencing reads and array signals, which is an integration of different sequence assembly and tiling array data analysis algorithms. With the most difficult steps in the assembly strategy, i.e. the reconstructions of large SVs, discussed in detail and the performance metric for such large SV reconstruction defined, we then present a semi-realistic sequencing simulation framework, which can guide the optimal experimental design, and show the results of simulations in the reconstruction of two types of large SVs.

#### Schematic genome assembly strategy

The hybrid genome assembly strategy incorporates both comparative [22] and de novo methods. On one hand, most of the assembly can be done against the reference, and it will be unnecessary to perform a computationally intensive whole genome de novo assembly. Comparative approaches will be capable of identifying small SVs and large rearrangement events. On the other hand, de novo assembly will sometimes still be useful in reconstructing regions with large and novel SVs.

Fig. 1 shows the schematic steps of SV reconstruction in the context of the genome sequencing/assembly process. The data from different sequencing/array experiments can be processed in the following way: As shown in Fig. 1A and 1B, with errors corrected [23] and short reads combined into “unipaths” [10], all the reads (long/medium/short) from the individual's genome can be mapped back to the reference genome. In Fig. 1C, the SNPs can then be identified immediately based on the reads with single best matches, and the boundaries of deletions or small insertions will be detected by such reads as well (allowing gaps in alignment). Meanwhile, haplotype islands can also be extracted based on the paired-end information [3],[24],[25] and the prior knowledge of the population haplotype patterns revealed by previous work [26].

Further analysis of the single/paired-end reads are required to reconstruct the large SVs (Fig. 1D and 1E), which are by nature more complicated than identifying small SVs. First of all, locations of such SV events need to be detected by analyzing the split-reads (shown in Fig. 2A and 2B) that cover their boundaries. Second, two distinct types of SVs need to be handled separately: de novo assembly is required to reconstruct large novel insertions, and comparative algorithms should be utilized to identify genomic rearrangement events (e.g. segmental duplication/deletion). The homozygosity/heterozygosity of such SVs can be determined based on the existence of the reads that map back to the corresponding reference sequences.

Fig. 2A–C show the overall process of de novo assembly for large novel insertions. While the reconstruction of such regions mostly depends on the spanning-reads from the new inserted sequence, misleading-reads from elsewhere in the genome can often hinder the full reconstruction process. These reads usually comes from the highly represented regions in the genome, which also exist in the insertion. In such cases, reads longer than such regions and appropriate assembly strategies are needed to ensure the unambiguous and correct assembly output. Paired-end reads with an appropriate gap size can also help the unambiguous mapping of the reads inside novel insertions [2].

Fig. 2D illustrates the comparative identification of rearrangements from the reference sequence. CGH array data can be integrated into the reconstruction process of such SVs. For long rearrangements detected by sequencing data, the CGH data can be utilized in both validation and correction of large segmental duplications/deletions. What is more, incorporating the CGH data can also lower the coverage depth requirement of sequencing experiments, since the inner (i.e. not close to SV boundaries) regions of segmental duplications/deletions not covered by sequencing reads can still be identified by CNV results. An example is shown in Fig. 2D: Although the sequence reads can detect the SV event in region A, B and C, they may not be sufficient to distinguish deletions from translocations when the sequencing coverage is relatively low. The copy numbers of the genomic regions inferred from CGH array data can be integrated into the rearrangement analysis, and provide additional evidence of the SV types.

#### Defining a performance metric for large SV reconstructions

It is important for us to define a reasonable performance metric so that the re-sequencing approach can be designed in such a way that its outcome will be optimized according to that metric. For large SVs, the metric can be defined based on the alignment result of the actual variant sequence and the inferred variant sequence. For a large SV due to genomic rearrangements (e.g. deletion, duplication), it is natural to define its recovery rate as either 1 (detected) or 0 (missed). For a large novel insertion, on the other hand, we may want to take into account cases where the insertion is detected but its sequence content is not reconstructed with full accuracy. Hence, we define the recovery rate of such a large novel insertion as follows based on its reconstruction percentage:
in which SV is the actual insertion (in simulations, it is already known; in reality, it will need to be identified in a validation step), SVinf is the insertion sequence inferred by the genome re-sequencing approach, mismatch returns the number of mismatches of two aligned sequences, wflanking returns a sequence with its flanking sequences on both ends, and size returns the size of a sequence. The purpose of introducing flanking sequences is to take into account the accuracy of the predicted location of the SV.

#### Simulations of genome re-sequencing for optimal experimental design

Based on the schematic assembly strategy and the performance measure defined in the previous sections, we can simulate the sequence assembly process in order to obtain an optimal set of parameters for the design of the sequencing experiments (e.g. the amount of long (Sanger), medium (454) and short (Illumina) reads, the amount of single and paired-end reads) and the array experiments (e.g. the incorporation of CGH arrays) to achieve the desired performance with a relatively low cost in the individual genome re-sequencing project.

Here we present the results of a set of simulation case studies on reconstructing large SVs, which are in general much more challenging problems compared to the detection of small SVs. In order to fully reconstruct a long novel insertion, for instance, one needs to not only detect the insertion boundaries based on the split-reads, but also assemble the insertion sequence from the spanning- and misleading-reads. For the identification of genomic rearrangements such as deletion/translocations, one may also want to incorporate array data to increase the confidence level of such analysis. The simulations described in this section are based on large (~10 kb, ~5 Kb and ~2 Kb) novel insertions and deletions discovered by Levy et al. [3], and they perform semi-realistic whole genome assembly representative using the sequence characteristics of both the NCBI reference genome [18] and the target HuRef genome [3]. The sequencing/array technologies considered in these simulations are long, medium and short sequencing methods and CGH arrays, as shown in Table 1. Paired-end reads are also included in these simulations.

One major challenge in implementing these simulations is to design them in a computationally realistic way. Brute-force full simulations of whole-genome assembly in this case would be unrealistic: thousands of possible combinations of different technologies will need to be tested, and for each of these combinations hundreds of genome assembly simulations need to be carried out to obtain the statistical distributions of their performance. Since a full simulation of one round of whole-genome assembly will probably take hundreds of CPU hours to finish, the full simulation to explore the full space of technology combinations will then require hundreds of millions (~108) of CPU hours, equivalent to ~10 years with 1000 CPUs. We designed the simulations using analytical approximations of the whole-genome assembly process in order for them to be both time and space efficient, and the gain in efficiency is summarized in Table 2 and will be described in details later in the Materials and Methods section. We have also made this simulation framework publicly available as a toolbox that can incorporate technology advancements as well as other SV regions.

##### Implementation and Availability.

In order to be adaptive to the fast development of the experimental technologies in personal genomics, our simulation framework is modularized in such a way that it is capable of incorporating new technologies as well as adjusting the parameters for the existing ones. Also, this approach relies on the general concept of mapability data, and can be easily applied to any representative SV for similar analysis. We envision that in the future, more experimental technologies can be incorporated into this sequencing/assembly simulation and the results of such simulations can provide informative guidelines for the actual experimental design to achieve optimal assembly performance at relatively low costs. With this purpose, we have made this simulation framework downloadable at http://archive.gersteinlab.org/proj/ReSe​qSim/ as a general toolbox that can be either used directly or extended easily.

### Discussion

The simulation results in the previous section are based on three sequencing technologies and an idealized array technology, and assume a specific parameterization of their characteristics and costs. Thus, the particular optimal solutions found may not be immediately applicable to a real individual genome re-sequencing project. However, these results illustrate quantitatively how we can design and run simulations to obtain guidelines for optimal experimental design in such projects.

Since our simulation approach is based on the general concept of mapability map and comparative SV reconstruction instead of on a specific organism, it can also be adapted to the comparative sequencing of a non-human genome with regard to a closely related reference. In such a study, we can first construct an artificial target genome based on estimations of its divergence from the reference, and then compute the mapability maps of those representative SVs as input to the simulation framework to find the optimal combination of technologies. Obviously, the closer the two genomes are, the more informative the simulation result would be. In cases where it is hard to estimate the divergence of the target genome from the reference, a two-step approach can be conducted: First, combined sequencing experiments will be carried out using an optimal configuration obtained from the simulation based on the “best guess”, such as another closely related genome. Second, by using the target genome constructed in the previous step, a new set of simulations can be executed and their results can guide a second round of combined sequencing which can provide a finer re-sequencing outcome when combined with the previous sequencing data. Meanwhile, our simulation framework specifically focuses on the effects of misleading reads in the SV reconstruction process, and it will be the most helpful in cases where the target and reference genome both have complex repetitive/duplicative sequence characteristics which will introduce such reads.

In this paper, we propose to optimally incorporate different experimental technologies in the design of an individual genome-sequencing project, especially for the full reconstruction of large SVs, to achieve accurate output with relatively low costs. We first describe a hybrid genome re-sequencing strategy for detecting SVs in the target genome, and then propose how we can design the optimal combination of experiments for reconstructing large SVs based on the results of semi-realistic simulations with different single and paired-end reads. We also present several examples of such simulations, focusing on the reconstruction of large novel insertions and confirmation of large deletions based on CNV analysis, which are the most challenging steps in individual re-sequencing. The simulations for actual sequencing experimental design can integrate more technologies with different characteristics, and also test the sequencing/assembly performance at different SV levels. By doing so, a set of experiments based on various technologies can be integrated to best achieve the ultimate goal of an individual genome re-sequencing project: accurately detecting all the nucleotide and structural variants in the individual's genome in a cost-efficient way. Such information will ultimately prove beneficial in understanding the genetic basis of phenotypic differences in humans.

### Materials and Methods

#### The data and parameters used in the simulation

The NCBI assembly v36 [18] and the HuRef assembly [3] were used as reference and target genomes, respectively. Three sequencing technologies, long (Sanger), medium (454), and short (Illumina) sequencing, were considered with the characteristics summarized in Table 1. We also assumed that the per-base sequencing error rate increases linearly from the start to the end of a read similar to ReadSim [28], and assigned error types (insertion, deletion or substitution) randomly according to the characteristics of the sequencing technique used [6],[7],[28]. The novel SVs used in the novel insertion reconstruction simulation are ~10 Kb, ~5 Kb and ~2 Kb insertion sequences in the HuRef genome [3] with variant IDs 1104685256488, 1104685222085 and 1104685613186, respectively. The deletion used in the CNV analysis simulation is a ~18 Kb sequence in the HuRef genome with variant ID 1104685125828.

#### The simulation of the sequencing/assembly of large novel insertions

Since we would be testing thousands of possible combinations of the long, medium and short sequencing technologies, it would be unrealistic (both time and space consuming) to generate for each combination all the reads from the whole target genome and then apply any existing assembler to these reads. We decided to semi-realistically simulate the assembly process of large novel insertions to achieve relatively accurate estimates in an affordable amount of time. Several difficulties need to be addressed by such a simulation: 1) One of the most time-consuming step in a real assembler is the read overlap-layout step. 2) The whole-genome sequencing experiment introduces large numbers of misleading reads that are partially similar to the reads from the targeted genomic region, which would require an huge storage space in a real assembly process.

##### The mapability data.

In order to both accelerate the simulation of the overlap-layout step and simulate the whole-genome sequencing setting in a space-efficient manner, we pre-computed the mapability [29] values of all the possible sub-sequences in the reads from the inserted region. The mapability value of a sequence is the number of times this particular sequence (allowing the specified number of mismatches) appears in a genome, defined below:

Definition 1 For a given genome G and a given sequence s, the mapability function M(s, G, m) is defined as the total number of occurrences of the elements in S in G, where S = {s′|mismatch(s, s′)< = m}. For simplicity, we also denote that M(s, G) = M(s, G, 0), which is the extract occurrence of s in G.

The following lemmas are obvious:

Lemma 1 Given a genome G and two sequences s and s′, if s contains s′, then M(s, G)< = M(s′, G). M(s, G) = M(s′, G) if and only if all the occurrences of s′ in G are within sequence s. An intuitive interpretation of this lemma is that if a sequence s contains s′, then s must occur at most the same number of times as s′ in a genome.

Lemma 2 Given a genome G, a sequence s, and two non-negative integers m, m′, if m>m′, then M(s, G, m)> = M(s, G, m′). This lemma states that for any given sequence, its mapability value in a genome never decreases with an increasing mismatch threshold.

Definition 2 For a given genome G and a given sequence s, the k-mapability map MM(s, G, k, m) of s with respect to G is a vector sequentially containing the mapability values of all the k-mers in s with a tolerance of m mismatches: MM(s, G, k, m) = [M(sub(s, 0, k), G, m), M(sub(s, 1, k+1), G, m), …], where sub(s, a, b) returns the sub-sequence of s from a to b-1 (0-based index). For simplicity, we denote that MM(s, G, k) = MM(s, G, k, 0), which counts exact occurrence only.

According to the above definition, MM(s, G, k, m) can be viewed as a set of mapability values of all the length-k sub-sequences in s allowing no more than m mismatches.

##### Generation of the split-/spanning-reads and computation of the mapability maps.

First, all the reads from the target insertion region are generated (Fig. 7E) based on the same setting of the long, medium and short sequencing coverages as in the problem being simulated (Fig. 7A and 7B). Second, as shown in Fig. 7D, in order to take into account the effects of the same/similar/misleading-reads from elsewhere in the genome in a whole-genome sequencing experiment, we computed the mapability maps MM(s, G, k, m) of the insertion region s (the ~10 Kb insertion sequence with its 1 Kb up/down-stream flanking sequences), where G = NCBI reference genome, HuRef target genome; k = 25, 26, …, 800; m = 0, 2. For computational efficiency, the “mismatch” function is currently implemented to take into account only the nucleotide mismatches of two sequences with the same length. On one hand, it would be more realistic to include indel mismatches as well to represent such sequencing errors. On the other hand, we would expect that in practice most of such sequencing errors will be corrected in a preprocessing step [23].

The generated reads that align to the same genomic starting locations are grouped together and the per-position error statistics are computed, resulting in a set of read-groups that starts from different locations with their position-specific error statistics computed. These read-groups are then further combined in the de novo reconstruction process describe below.

In order to simulate such a process in a whole-genome sequencing setting, the mapability data are again utilized, as illustrated in Fig. 7D and 7E. For a highly represented region r in the insertion, its corresponding same/similar reads from elsewhere in the target genome are generated based on the pre-computed M(r, TargetG, 0) and M(r, TargetG, 2)−M(r, TargetG, 0) (the maximal allowed mismatch of 2 corresponds to ~6% or lower difference between two short reads). The number of such reads are randomly generated based on the mapability values of r, the sequencing coverage, and the distribution P(n, r) of the number of reads (n) exactly covering a region with the same size of r, which can be either empirically generated based on the previously simulated reads from the inserted region, or constructed based on a theoretical Poisson distribution representing a uniform sequencing process.

The misleading-reads are generated in the following way: for a contig c and a read r that overlaps it, denote the overlapping sequence with s, then according to lemma 1, the unambiguous extension of c based on read r is guaranteed if and only if M(s, TargetG) = M(r, TargetG), which means that the sequence s is always within sequence r in the target genome. When M(s, TargetG)>M(r, TargetG), we introduce the misleading reads based on M(s, TargetG)−M(r, TargetG) and P(n, r).

##### Computing the reconstruction rate of the combined result.

The de novo extensions are performed by the simplified assembler described above from both ends of the insertion region, and the combined results are then compared to the actual insertion to obtain the reconstruction rate of the target region, based on the metric described in the Results section. The flanking sequences are taken into account to measure the accuracy of boundary detection. If the de novo reconstruction result does not cover the insertion boundaries, the reconstructed sequence cannot be localized in the reference genome and the reconstruction rate is set to 0. Fig. 7F shows example output contigs, which contain small sequencing errors, a false extension error due to the misleading-reads introduced by an highly represented region inside the insertion, and a gap due to both the false extension and the low-coverage of sequencing in that particular region.

#### The simulation of CNV analysis

In this simulation, we assume that the boundaries of a large deletion event have already been identified by sequence reads, and we are simulating the process of determining whether this is a deletion or translocation event, based on the short reads alone or on the idealized CGH data. The reads are generated in a similar fashion as described in the previous section, without considering sequencing errors for simplicity. The idealized CGH signal of a corresponding region r is defined as Gaussian variable with mean M(r, TargetG), and noise/standard deviation = 0.05, 0.1, 0.2. For each dataset, the log-ratio of the posterior probability of the deletion event is computed to represent the confidence level provided by each dataset for determining that deletion. These confidence levels are computed according the following formulas:

where sub(s, a, b) returns the sub-sequence of s from a to b-1 (1-based index), l is the length of the short read, SV stands for the deleted region, coveragereads is the sequencing coverage, obs(r) is the number of observed reads that are the same as r, sig(r) is the normalized CGH-array signal of probe r, PDF{D, v} is the probability density/mass function of the distribution D at value v, and RefG/TargetG refers to the reference/target genome.

### Supporting Information

Figure S1.

MM values and worst case reconstruction examples of a 10 Kb novel insertion.

doi:10.1371/journal.pcbi.1000432.s001

(0.08 MB PDF)

### Author Contributions

Conceived and designed the experiments: JD RDB ZDZ MBG. Performed the experiments: JD. Analyzed the data: JD RDB ZDZ YK MS MBG. Contributed reagents/materials/analysis tools: JD RDB. Wrote the paper: JD MBG.

### References

1. 1. Hehir-Kwa JY, Egmont-Petersen M, Janssen IM, Smeets D, van Kessel AG, et al. (2007) Genome-wide copy number profiling on high-density bacterial artificial chromosomes, single-nucleotide polymorphisms, and oligonucleotide microarrays: a platform comparison based on statistical power analysis. DNA Res 14: 1–11.
2. 2. Korbel JO, Urban AE, Affourtit JP, Godwin B, Grubert F, et al. (2007) Paired-end mapping reveals extensive structural variation in the human genome. Science 318: 420–426.
3. 3. Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, et al. (2007) The diploid genome sequence of an individual human. PLoS Biol 5: e254. doi:10.1371/journal.pbio.0050254.
4. 4. Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, et al. (2008) The complete genome of an individual by massively parallel DNA sequencing. Nature 452: 872–876.
5. 5. Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74: 5463–5467.
6. 6. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, et al. (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature 437: 376–380.
7. 7. Bentley DR (2006) Whole-genome re-sequencing. Curr Opin Genet Dev 16: 545–552.
8. 8. Dohm JC, Lottaz C, Borodina T, Himmelbauer H (2007) SHARCGS a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Res 17: 1697–1706.
9. 9. Warren RL, Sutton GG, Jones SJM, Holt RA (2007) Assembling millions of short DNA sequences using SSAKE. Bioinformatics 23: 500–501.
10. 10. Butler J, Maccallum I, Kleber M, Shlyakhter IA, Belmonte MK, et al. (2008) ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res 18: 810–820.
11. 11. Zerbino DR, Birney E (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18: 821–829.
12. 12. Pop M, Salzberg SL (2008) Bioinformatics challenges of new sequencing technology. Trends Genet 24: 142–149.
13. 13. Pinkel D, Segraves R, Sudar D, Clark S, Poole I, et al. (1998) High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat Genet 20: 207–211.
14. 14. Selzer RR, Richmond TA, Pofahl NJ, Green RD, Eis PS, et al. (2005) Analysis of chromosome breakpoints in neuroblastoma at sub-kilobase resolution using fine-tiling oligonucleotide array CGH. Genes Chromosomes Cancer 44: 305–319.
15. 15. Urban AE, Korbel JO, Selzer R, Richmond T, Hacker A, et al. (2006) High-resolution mapping of DNA copy alterations in human chromosome 22 using high-density tiling oligonucleotide arrays. Proc Natl Acad Sci U S A 103: 4534–4539.
16. 16. Perry GH, Ben-Dor A, Tsalenko A, Sampas N, Rodriguez-Revenga L, et al. (2008) The fine-scale and complex architecture of human copy-number variation. Am J Hum Genet 82: 685–695.
17. 17. Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, et al. (2006) Global variation in copy number in the human genome. Nature 444: 444–454.
18. 18. International Human Genome Sequencing Consortium (2004) Finishing the euchromatic sequence of the human genome. Nature 431: 931–945.
19. 19. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, et al. (2001) The sequence of the human genome. Science 291: 1304–1351.
20. 20. Chaisson MJ, Pevzner PA (2008) Short read fragment assembly of bacterial genomes. Genome Res 18: 324–330.
21. 21. Goldberg SMD, Johnson J, Busam D, Feldblyum T, Ferriera S, et al. (2006) A Sanger/pyrosequencing hybrid approach for the generation of high-quality draft assemblies of marine microbial genomes. Proc Natl Acad Sci U S A 103: 11240–11245.
22. 22. Pop M, Phillippy A, Delcher AL, Salzberg SL (2004) Comparative genome assembly. Brief Bioinform 5: 237–248.
23. 23. Pevzner PA, Tang H, Waterman MS (2001) An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci U S A 98: 9748–9753.
24. 24. Lippert R, Schwartz R, Lancia G, Istrail S (2002) Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem. Brief Bioinform 3: 23–31.
25. 25. Bansal V, Halpern AL, Axelrod N, Bafna V (2008) An MCMC algorithm for haplotype assembly from whole-genome sequence data. Genome Res 18: 1336–1346.
26. 26. International HapMap Consortium (2005) A haplotype map of the human genome. Nature 437: 1299–1320.
27. 27. Korbel JO, Urban AE, Grubert F, Du J, Royce TE, et al. (2007) Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome. Proc Natl Acad Sci U S A 104: 10110–10115.
28. 28. Schmid R, Schuster S, Steel M, Huson D (2006) ReadSim - A simulator for Sanger and 454 sequencing
29. 29. Rozowsky J, Euskirchen G, Auerbach R, Zhang Z, Gibson T, et al. (2008) PeakSeq: systematic scoring of ChIP-seq experiments relative to controls. Nature Biotechnol 27: 66–75.
30. 30. Batzoglou S, Jaffe DB, Stanley K, Butler J, Gnerre S, et al. (2002) ARACHNE: a whole-genome shotgun assembler. Genome Res 12: 177–189.

Ambra 2.9.10 Managed Colocation provided
by Internet Systems Consortium.