FB and CD are employees of Novartis Vaccines and Diagnostics. LS and VR were employees of Novartis Vaccines and Diagnostics at the time of the study.
Conceived and designed the experiments: LS ST VR CD. Performed the experiments: LS VR CD. Analyzed the data: LS ST VR CD. Wrote the paper: LS ST VR FB CD.
Advances in high-throughput DNA sequencing technologies have led to an explosion in the number of sequenced bacterial genomes. Comparative sequence analysis frequently reveals evidence of homologous recombination occurring through different mechanisms and at different rates in different species, but the large-scale use of computational methods to identify recombination events is hampered by their high computational cost. Here, we propose a new method to identify recombination events in large datasets of whole genome sequences. By filtering the gene conservation profiles of a test genome against a panel of strains, the algorithm identifies sets of contiguous genes acquired by homologous recombination. The locations of the recombination breakpoints are determined using a statistical test that accounts for the differences in the natural rate of evolution between different genes. The algorithm was tested on a dataset of 75 genomes of
The extent to which recombination occurs in natural populations is either unknown or controversial, but it is widely accepted that recombination plays a crucial role in the evolution of many bacterial species. Numerous methods have been developed for the investigation of recombination events, but most of them are computationally expensive and are applicable only to a limited number of genomes or to short nucleotide sequences. Here we present a new algorithm designed to identify recombination events affecting a group of adjacent genes. The procedure is based on the comparison of gene sequences and requires as input the matrix of gene conservation of a test genome against a group of reference genomes. The method is fast and has minimal computational requirements. It can therefore be applied to datasets composed of a large number of complete genomes, and can be easily adapted to analyze data directly from high-throughput sequencing projects. We applied the algorithm to a dataset of
Recombination, the integration of foreign DNA into the chromosome of an acceptor cell, is one of the major evolutionary forces in bacterial species. Recombination can be mediated by viral infections
Homologous recombination may involve whole genes or even larger segments. Whole genome sequencing has shown that homologous recombination is frequent in
Numerous approaches have been developed to measure the frequency of recombination and to determine the chromosomal locations of the inserted sequences. Parametric methods estimate the recombination rate
Here, we introduce a novel method that applies a discrete filtering procedure to the gene conservation profiles in a panel of unrelated strains and identifies sets of contiguous genes likely acquired by homologous recombination. Because of its modest computational requirements, the method can be applied to large panels of complete genome sequences. The method was able to confirm known recombination events involving genomes of
Clonal complexes were defined by running eBURST
Phylogenetic analysis of the complete genome sequences was performed using MEGA4
The genes of each strain in the collection of 75 genomic sequences of
The purpose of the algorithm is the identification of recombination events affecting groups of adjacent genes in a genomic sequence. In the following, we define as “recombinant” the strain(s) containing the recombinant segment, as “major parent” the strain(s) contributing the genetic backbone of the recombinant strain, and as “minor parent” the strain(s) contributing the sequence that was inserted into the backbone by the recombination event. The identification of recombination events affecting more than one gene faces two obstacles, which are related to the age of the event itself,
The purpose of this step was to identify sets of contiguous genes with an anomalous level of conservation compared to one or more reference genomes, independently of the rates of evolution of the individual genes.
Initially, we selected one test genome, for which we wanted to identify the genes recently acquired by homologous recombination. For this genome, we computed the matrix
For linear genomes
Beside the
To visually identify regions of the target genome showing an anomalous pattern of sequence conservation with the reference genomes, we then converted the filtered matrix
The choice to discretize the data using the gene-dependent distribution of sequence conservation allowed us to correct for differences in the natural rate of evolution between different genes. In this way, while the actual value of the cut-off for sequence conservation was different for each gene and depended on the gene-specific mutation rate, regions of the test genome having an anomalously high level of conservation with the reference sequence were readily identified.
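The gene-dependent discretization described above can be illustrated with a minimal sketch. The function name `discretize_by_gene`, the use of a per-gene quantile as the cut-off, and the example values are assumptions made for illustration; the text does not specify the exact cut-off rule.

```python
from typing import List


def discretize_by_gene(conservation: List[List[float]],
                       quantile: float = 0.9) -> List[List[int]]:
    """Binarize a gene-by-strain conservation matrix, one gene at a time.

    conservation[g][s] is the percent identity of gene g of the test
    genome against reference strain s. Each row gets its own cut-off
    (an assumed per-gene quantile), so fast- and slow-evolving genes
    are treated on the same footing.
    """
    binary = []
    for row in conservation:
        ranked = sorted(row)
        # Index of the assumed quantile within this gene's own values.
        idx = min(len(ranked) - 1, int(quantile * len(ranked)))
        cutoff = ranked[idx]
        binary.append([1 if value >= cutoff else 0 for value in row])
    return binary


# Hypothetical data: a fast-evolving gene and a slow-evolving gene,
# each compared against four reference strains.
profiles = [
    [70.0, 72.0, 71.0, 95.0],   # fast-evolving gene
    [98.0, 98.5, 99.0, 99.9],   # slow-evolving gene
]
flags = discretize_by_gene(profiles, quantile=0.75)
```

With a single global threshold (say 97% identity) the first gene would never be flagged and the second would always be flagged; with the per-gene cut-off, both rows flag only the fourth strain, which is anomalously conserved relative to each gene's own distribution.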
Breakpoints of putative recombination events could be visually identified in the matrix
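As a rough illustration of how a candidate segment in the discretized matrix could be scored statistically, the sketch below runs a one-sided Fisher exact test on a 2x2 table of conserved and non-conserved genes inside and outside the segment. The construction of the table, the one-sided alternative, and the names `segment_table` and `fisher_exact_greater` are assumptions for illustration, not a description of the published procedure.

```python
from math import comb
from typing import List, Tuple


def segment_table(column: List[int], start: int, end: int) -> Tuple[int, int, int, int]:
    """Count flagged (1) and unflagged (0) genes inside and outside
    the half-open gene interval [start, end) for one reference strain."""
    inside = column[start:end]
    outside = column[:start] + column[end:]
    a = sum(inside)
    b = len(inside) - a
    c = sum(outside)
    d = len(outside) - c
    return a, b, c, d


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher exact test: probability of seeing at least `a`
    flagged genes inside the segment, given fixed table margins."""
    row1 = a + b          # genes inside the segment
    col1 = a + c          # flagged genes overall
    n = a + b + c + d
    upper = min(row1, col1)
    numerator = sum(comb(row1, k) * comb(n - row1, col1 - k)
                    for k in range(a, upper + 1))
    return numerator / comb(n, col1)


# Hypothetical binary column for one reference strain over 20 genes:
# genes 5..12 are all flagged, and two flagged genes lie elsewhere.
column = [0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
p_value = fisher_exact_greater(*segment_table(column, 5, 13))
```

Shifting `start` and `end` one gene at a time and keeping the interval with the smallest P-value is one simple way such a test could be used to refine breakpoint positions.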
While many efficient methods exist to identify small recombination events in a set of short aligned sequences, few convenient methods exist to identify, on a genome-wide scale, large events that include several genes or entire operons. The proposed procedure, using a smoothing of the conservation signal over a window of size
a) The recombinant genome was designed to be identical to a major parent (recipient) and to contain a variable number of segments acquired from donor_1 (red) and donor_2 (blue). b) Recombination events detected for different sizes of the sliding window (top to bottom,
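The effect of smoothing a binary conservation signal over a sliding window of genes can be sketched as follows. The function name `smooth_profile`, the centered odd-sized window, the handling of linear versus circular genomes, and the 0.5 calling threshold are illustrative assumptions rather than the published parameters.

```python
from typing import List


def smooth_profile(signal: List[int], window: int,
                   circular: bool = True) -> List[float]:
    """Average a per-gene binary signal over a centered sliding window.

    For circular genomes the window wraps around the origin; for linear
    genomes it is truncated at the ends of the sequence.
    """
    n = len(signal)
    half = window // 2
    smoothed = []
    for i in range(n):
        if circular:
            positions = [(i + k) % n for k in range(-half, half + 1)]
        else:
            positions = [j for j in range(i - half, i + half + 1)
                         if 0 <= j < n]
        smoothed.append(sum(signal[j] for j in positions) / len(positions))
    return smoothed


# Hypothetical signal: a run of three flagged genes plus one isolated hit.
signal = [0, 0, 1, 1, 1, 0, 0, 1, 0, 0]
smoothed = smooth_profile(signal, window=3, circular=False)
called = [i for i, value in enumerate(smoothed) if value > 0.5]
```

The run of adjacent flagged genes survives the smoothing, while the isolated hit at position 7 is averaged down below the threshold. Larger windows suppress more noise at the cost of missing shorter events.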
Strains of Sequence Type ST239 (classified by eBURST into CC8, see
a) Heatmap representation of the percentage of conservation of the genes of
a) P-values of the Fisher test performed on the filtered matrix obtained from the comparison of TW20 (ST239) against all the other
The strain H19 (ST10, CC10) showed two regions, consisting of about 50 ORFs and encoding two phages, that were not conserved in the closely related strain D139 (ST145, CC10) (
CC5 is one of the most prevalent methicillin-resistant (MRSA) lineages of
a) Comparison of
We found two regions of the
a) Comparison of
In order to highlight the specific features of
a) RDP detected four regions acquired from strain P1031. Strain P1031 belongs to CC217 and it could be the putative donor strain for INV104B. b)
We found three regions of the ST15, CC15 strain CGSP14 that were not conserved in INV200, the other CC15 strain present in the collection. The first region (
We found one case of exchange between the antibiotic resistant
a) Comparison of
Population genetic studies on many bacterial species, such as
We have applied the algorithm to two pathogens,
Extending the analysis to the
The rate of homologous recombination varies greatly between different species
We are grateful to our colleagues Antonello Covacci, Guido Grandi and Alessandro Muzzi for scientific support to the study. In addition, we thank Dr. Jacques Schrenzel of Geneva University Hospitals, Central Laboratory of Bacteriology and Genomic Research for providing us