GG and AFAS conceived and designed the experiments. GG, SQ, and MREG performed the experiments. FF, AFS, and JCR analyzed the data. LH contributed reagents/materials/analysis tools. GG and AFAS wrote the paper.
The authors have declared that no competing interests exist.
The identification and characterization of the complete ensemble of genes is a major goal of deciphering the digital information stored in the human genome. Many algorithms for computational gene prediction have been described, ultimately derived from two basic concepts: (1) modeling gene structure and (2) recognizing sequence similarity. Successful hybrid methods combining these two concepts have also been developed. We present a third, orthogonal approach to gene prediction, based on detecting the genomic signatures of transcription, accumulated over evolutionary time. We discuss four algorithms based on this third concept: Greens and CHOWDER, which quantify mutational strand biases caused by transcription-coupled DNA repair, and ROAST and PASTA, which are based on strand-specific selection against polyadenylation signals. We combined these algorithms into an integrated method called FEAST, which we used to predict the location and orientation of thousands of putative transcription units not overlapping known genes. Many of the newly predicted transcriptional units do not appear to code for proteins. The new algorithms are particularly apt at detecting genes that have long introns and lack sequence conservation. They therefore complement existing gene prediction methods and will help identify functional transcripts within many apparent “genomic deserts.”
To date, genes have been identified from genomic sequence using two basic concepts: the identification of specific signals delineating the structure of genes and the detection of similarity to previously known genes. Here the authors describe four novel algorithms based on a third basic concept: the identification and quantification of mutational and selectional effects of transcription. Central to this work is a detailed analysis of interspersed repeats, the “junk DNA” left behind by transposon activity, that is usually discarded when predicting genes even though it amounts to nearly half the human genome. Using the new methodology, the authors identify thousands of potential novel genes, some of which appear not to code for protein products. The new algorithms are particularly apt at detecting genes that have long introns and lack sequence conservation. They therefore complement existing gene prediction methods and will help identify functional transcripts within many “genomic deserts,” regions currently thought to be devoid of genes.
The current annotation of human genes is likely to be incomplete, particularly for genes not coding for proteins. Wong et al. [
Several types of software for gene prediction are currently available. There are two basic concepts underlying these methods: (1) the recognition of gene structure and (2) sequence similarity. Methods that predict gene structure identify the structural and functional requirements for a segment of genomic sequence to be able to be transcribed, spliced, and finally to be translated into a protein sequence, e.g., GenScan [
Here, we develop a third basic concept in gene prediction that is based solely on the analysis of genomic sequence data: the recognition of “transcription footprints.” These are the side effects of sustained transcription on the genomic sequence, leading over evolutionary time to an accumulation of differences between the two DNA strands, which can be detected by appropriate statistical analysis. We present here an integrated suite of four prediction methods exemplifying this third basic concept. Importantly, the method presented here can readily predict transcriptional units that do not code for proteins.
Many of the transcription footprints are buried in interspersed repeats, which comprise almost half of the human genomic sequence, including intronic sequences. Most of these repeats are copies of transposable elements exhibiting various levels of sequence decay and are systematically excluded (“masked”) as a first step in most standard gene prediction methodologies. The present work takes advantage of a detailed analysis of interspersed repeats, as a reference framework for detecting the otherwise overlooked transcription footprints.
In this work, we focus on the detection of transcription footprints in the form of significant differences between the “forward” (coding, sense) and “reverse” (antisense) strands of transcribed regions. Two sources for orientation biases within genes are (1) mutations influenced by the act of transcription and (2) selection against harmful signals that disrupt transcription.
A transcription-associated strand asymmetry has been described [
The identification of ORFs has long been considered the method of choice for identifying genes in genomes with little or no splicing. Where mRNA splicing is prevalent, this method loses power since ORFs can be split into several short segments separated by potentially long introns. In conceptual similarity to ORF identification, we hypothesized the existence of selection against the introduction of any signal that could prematurely interrupt the transcription process, in particular, polyadenylation signals (PASs). Since the PAS is asymmetric, with consensus sequence AATAAA or ATTAAA, selection should lead to orientation biases: introduction of the same signal in the opposite orientation would typically be a neutral event.
We describe here four algorithms for transcript prediction and then present their combination as an integrated method (
The genomic sequence is analyzed using RepeatMasker, yielding a masked sequence (studied for its base composition), a repeat table, and an alignment file, which is used to list mutations in repeats and to produce a “sequence mask.” Both the original sequence and the sequence mask are studied using polyadq, yielding tables of predicted PASs. The nucleotide composition of the unique sequence and the mutations within repeats are tabulated as well. The tables are then analyzed to calculate skews, which are finally used to produce predictive scores, separately for each method (Greens, ROAST, CHOWDER, and PASTA) or in combination (FEAST).
The combined mutational biases yield an excess of G + T over A + C in the forward strand of genes, leading to an equilibrium value of 52.7% G + T [
The Greens algorithm is not applicable to interspersed repeats, most of which have inserted into the genome too recently to have reached equilibrium with respect to G + T composition [
Many retrotransposons have a highly conserved PAS; this suggests that the retroposition of such elements into the genome could be harmful when the orientation of the repeat is the same as the host gene (
We further hypothesized that selective pressure to maintain functionally transcribed regions “open” to uninterrupted transcription might act against mutations making a weak forward-strand PAS stronger and in turn might favor mutations weakening cryptic PAS-like sites within transcripts (
We implemented the four methods using the same conceptual structure, as follows. First, we obtained log-likelihood ratio parameters for each predictor by genomewide training on annotated known genes. Second, we scanned the whole genome and tabulated local frequencies and orientations for each of the predictors. Third, we calculated for each genomic location four scores representing the significance level of the strand bias; these scores indicate each method's support for a claim of transcription at that genomic location. These four scores are then combined, yielding an integrated predictive score (
The FEAST analysis can be used as an independent qualification of known genes and gene predictions, by combining FEAST scores included within the genomic span of each gene. It is important to stress that here we are not comparing two independent sets of transcript predictions but rather calculating FEAST scores for transcribed regions as predicted by other methods, and predicting just the transcript orientation based on our log-likelihood model. We performed this calculation for all known genes in the human genome, excluding those for which an overlapping antisense transcript has been annotated (
Scatterplot of FEAST scores versus gene length for known genes from the UCSC Genome Bioinformatics Site [
Success rates for FEAST reanalysis of known genes (top left), experimental gene annotations (center and bottom left), and gene predictions (right). Gene annotations were stratified by length into three classes: short (<10 kb), medium (10 to 100 kb), and long (>100 kb); the number of genes in each class is given above each bar. FEAST scores were stratified into three groups: nonsignificant (white, −2 < Z < 2), significant for the expected strand (shades of brown, Z > 2), and significant for the wrong strand (shades of red, Z < −2). The Z < −4 and Z > 4 bins include potentially large values as displayed in
We observed some extremely negative scores for some annotated genes, typically indicating errors in the annotation. For example, the AF118089 transcript spanning the range chr1:88,617,147–88,738,864 (q→p strand) was assigned a FEAST score of −37.4. This transcript in fact corresponds to the reverse strand of the first nine exons of the
We similarly calculated FEAST scores for all GenScan, Twinscan, and AceView annotations in the human genome and found them to be largely in agreement (
Finally, we applied the FEAST algorithms to qualify experimentally derived gene structures, as represented in the University of California Santa Cruz (UCSC) Genome Browser [
We implemented a variation on the maximal segment analysis [
We aligned the known genes at their start and end positions and studied the average FEAST values surrounding gene boundaries. As expected, the observed scores outside the transcripts are low while those within transcripts are significant (
The average FEAST scores for known genes (thick black,
We identified 9,237 intergenic segments between consecutive genes on the same strand (“head to tail”) and calculated integrated FEAST values for them. As a general rule, location between two similarly oriented genes appears to be strongly predictive of intergenic transcription on the same strand, e.g., 52% for intergenic segments in the 10- to 100-kb range, versus 28% for the opposite strand (
We identified 5,286 regions that are predicted to be transcribed (Z score ≥ 2) but do not overlap with any previously known genes on the same strand (
Several of the new predictions have been confirmed by gene annotations that were introduced at a later stage, i.e., they did not contribute to the training set. For example, a 112-kb-long prediction in 19q13.42 (Z = 8.8) corresponds to a recently published microRNA (miRNA) cluster [
Since many computational gene prediction strategies start by masking the sequence for repeats, we tested whether the novel FEAST predictions are substantially enriched in repetitive sequences when compared to previously known genes. We found this not to be the case: the repeat content of known genes and that of novel FEAST predictions are comparable (
Finally, the possibility exists that some of the novel predictions correspond to genomic regions that are transcribed for miscellaneous reasons other than gene encoding, e.g., chromatin remodeling [
With the availability of experimental expression data from dense genomic tiling microarray experiments for ten human chromosomes [
We next considered the possibility that the genomic tiling data may include noise in the form of scattered, spurious transfrags. We therefore clustered the transfrags by joining those separated by less than 1 kb of unique sequence (i.e., excluding interspersed repeats) and using maximal linkage. We excluded all resulting clusters less than 5 kb long, which yielded 14,302 clusters including 311,441 transfrags. Ninety percent of the clusters include four to 50 transfrags each, with a typical constituency of ten transfrags per cluster. As for the unfiltered data, the novel FEAST predictions for all chromosomes are significantly enriched in transfrags except for chr19 and chr22. Interestingly, filtering disjoint transfrags from the data set increases the enrichment ratios, particularly for chr7, chr13, chr21, and chrY (
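The clustering and filtering step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `cluster_transfrags` and the half-open `(start, end)` coordinate convention are hypothetical, and, as a simplification, gaps are measured in raw base pairs rather than in unique (non-repeat) sequence as the text specifies.

```python
def cluster_transfrags(transfrags, max_gap=1000, min_cluster_len=5000):
    """Chain (start, end) intervals separated by less than max_gap bp,
    then drop clusters shorter than min_cluster_len.

    Simplifying assumption: the gap is counted in raw base pairs; the text
    counts only unique sequence, i.e., it excludes interspersed repeats.
    """
    clusters = []
    for start, end in sorted(transfrags):
        if clusters and start - clusters[-1]["end"] < max_gap:
            # Close enough to the running cluster: extend it.
            cluster = clusters[-1]
            cluster["end"] = max(cluster["end"], end)
            cluster["members"].append((start, end))
        else:
            # Too far from the previous cluster (or first interval): start anew.
            clusters.append({"start": start, "end": end,
                             "members": [(start, end)]})
    # Exclude clusters less than min_cluster_len long.
    return [c for c in clusters if c["end"] - c["start"] >= min_cluster_len]


example = [(0, 500), (900, 1500), (2000, 6000), (20000, 20100)]
for c in cluster_transfrags(example):
    print(c["start"], c["end"], len(c["members"]))
```

In this toy input the first three transfrags chain into one 6-kb cluster, while the isolated 100-bp transfrag forms a cluster that is too short and is discarded.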
Finally, we compared the enrichment ratios observed when partitioning the transfrags by source: polyA+ versus polyA− and cytoplasmic versus nuclear. We observed the highest enrichment ratios for transfrags derived from polyA− and nuclear samples (
We used multidimensional scaling (MDS) [
The matrix of disagreement measures for all pairs of annotation methods is represented by points in two dimensions using MDS. Filled black circles represent experimentally observed transcripts, the vast majority being in the “RNA” set. Triangles represent methods involving significant manual curation and/or based on the RNA set. “S,” “H,” and “F” represent methods based on gene structure prediction, hybrid methods (gene structure and sequence similarity), and methods measuring footprints of transcription, respectively. The combined FEAST method was excluded from the MDS analysis, and its projected location (squared F) was calculated later (see
We performed an initial experimental validation for some of the newly predicted transcribed regions.
Ceruloplasmin (ferroxidase) is a medically important metalloprotein that evolved by internal tandem triplication [
Standard UCSC Genome Browser view of the
Inset: Phylogenetic analysis of the CP/CPHL1 family rooted using the hephaestin protein sequence as outgroup. Numbers above branches represent percentage bootstrap support over 1,000 replicates; the horizontal bar indicates 10% divergence along each branch.
FEAST analysis of a 2-Mb “gene desert” between the
PASTA, Greens, CHOWDER, and FEAST predictions are displayed for each strand in brown, green, pink, and red, respectively, with lighter shades indicating less significant scores. In the FEAST track, actual scores are indicated in red, and maximal segments are displayed in blue. The
The strongest novel FEAST prediction not overlapping any previously known gene indicated the presence of a novel transcript between the
VISTA and GESTALT analyses of the
Inset on top: Detail on the conserved intronic noncoding sequences, between two nonconserved exons.
Current methods for gene prediction perform well on genes that are “typical” in several respects, including number and length of exons, length of introns, quality of splice sites, and conservation (similarity to known genes). Some divergent genes may be difficult to discover by experimental observation of transcripts if their range of expression is restricted to one or a few cell types or if they are expressed at very low levels. It is significantly more difficult to identify and produce correct models of genes with extremely long introns or short exons or that have diverged extensively from other genes; this is particularly true for genes that do not code for proteins. Such genes would be composed almost entirely of intronic sequence and could be practically “invisible” to current computational gene prediction methods.
We found that transcribed sequences hold significant information about the direction of transcription, in the form of significant orientation biases of (1) nucleotide composition, (2) mutations within interspersed repeats, (3) the interspersed repeats themselves, and (4) PASs. We implemented and integrated four algorithms (
For the purpose of the current work, interspersed repeats have two interesting characteristics. First, since copies generally do not adopt a function within the genome, they accumulate substitutions in a neutral fashion. The availability of sufficient copies allows for a relatively accurate reconstruction of the element sequence at the time of integration, while comparison of the extant copies against these “consensus sequences” gives an accurate account of the frequency spectrum of neutral substitutions. These data have, for example, been used to derive log-odds matrices for comparison of interspersed repeats to a consensus database in the program RepeatMasker as well as for the alignment of genomic sequences of different mammals [
The “transcriptional footprints” described here have some conceptual similarity to “content” methods like coding potential and coding sequence compositional biases. While those are limited to coding exons, the signal of transcriptional footprints can be observed throughout the length of the transcript, the vast majority of which is usually intronic in nature. Furthermore, while the “content” methods detect deviations from the sequence composition expected under a random model, the FEAST methods detect significant strand biases of selected signals, regardless of their absolute frequency. A generalized linear model for transcript detection was published [
The basic model underlying the four FEAST methods assumes the accumulation in introns and UTRs of strand-biased signals arising as side effects of transcription. Several deviations from this model can be postulated: (1) a significant proportion of the genomic region included in a gene might not be transcribed, as is the case for somatic rearranging immune loci; (2) antisense overlapping transcripts may lead to partial signal cancellation; and (3) a gene may have long coding exons, or a large number of exons separated by short introns. Furthermore, the statistical model used assumes independence between the observed signals, which may not be true for (4) arrays of tandemly duplicated sequences or (5) interspersed repeats “homing” into similar repeats, e.g., Alu [
In the current implementation of FEAST, the four algorithms are combined with equal weights, except that Greens and CHOWDER are weighted according to the repeat fraction. This will be improved in future versions by identifying context-dependent optimal weights for the different algorithms. Since the mutation-based methods refer to germline expression but the selection-based methods reflect functional importance at any developmental stage, additional functional information could be obtained by using different combinations of the four algorithms. Finally, a promising area for future development is joint gene prediction on orthologous regions, by collating biases accrued independently in different species lineages.
Sequence-based gene prediction has long been dominated by methods based on modeling gene structure and sequence comparisons, followed by extensive expert curation. It might have appeared impossible to detect genes from genomic sequence without identifying splicing signals or sequence conservation and not even relying on the genomic localization of experimentally observed expressed sequences. We presented here a third basic concept (
By studying various sources of sequence information (pink boxes), genes have been identified using a variety of computational methods based on the identification of gene structure and/or the identification of sequence conservation. The FEAST methods represent a third basic concept, in which sustained transcriptional activity is inferred by its mutational and selective effects on the genomic sequence, the “transcriptional footprints.” Light blue boxes indicate the three basic concepts for gene prediction. The dashed vertical line separates gene prediction (to the left) from gene identification (to the right): the latter is based on the analysis of sequences expressed from the same locus.
In addition to yielding hypotheses for correcting 5′ incomplete gene annotations and novel independent predictions, many of which cannot be detected by gene structure or similarity, the new algorithms are complementary to existing methods (
We analyzed the human genomic sequence [
We obtained the July 2003 freeze of the human genome (hg16, based on NCBI Build 34) and its annotation database from the UCSC Genome Bioinformatics Site [
We stratified the genomes by G + C content into five nearly equal parts by sequence length. Any genomic sequence (and hence the repeats in it) is classified as having low, medium low, medium, medium high, or high G + C content. The optimal cutoffs to obtain such separation are calculated to be 35.9%, 38.5%, 41.3%, and 45.4% G + C for the human genome and 38.0%, 40.0%, 42.3%, and 45.2% G + C for the mouse. When considering sequences in the TRON sets, the forward strand was defined to be the one running 5′ to 3′ in the direction of transcription of the gene. For intergenic sequences (TERG set), both strands are equivalent, and the forward strand is arbitrarily defined to be that which runs from the p telomere to the q telomere of the chromosome.
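The G + C stratification can be sketched as a simple lookup against the four cutoffs quoted above. This is an illustrative sketch only: the names `gc_percent` and `gc_class` are hypothetical, and the handling of a value exactly equal to a cutoff (here assigned to the higher class) is an assumption, since the text does not specify it.

```python
from bisect import bisect_right

GC_CLASSES = ["low", "medium low", "medium", "medium high", "high"]

# Cutoffs (% G + C) as quoted in the text.
GC_CUTOFFS = {
    "human": [35.9, 38.5, 41.3, 45.4],
    "mouse": [38.0, 40.0, 42.3, 45.2],
}


def gc_percent(seq):
    """Percentage of G + C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def gc_class(percent, species="human"):
    """Map a G + C percentage to one of the five classes.

    Assumption: a value equal to a cutoff falls into the higher class.
    """
    return GC_CLASSES[bisect_right(GC_CUTOFFS[species], percent)]


print(gc_class(37.0, "human"))  # medium low
print(gc_class(37.0, "mouse"))  # low
```

The same 37% G + C region lands in different classes for the two species because the cutoffs are derived separately for each genome.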
We identified and classified all SINEs, LINEs, LTR, and DNA elements in the human and mouse genomes, using the standard classification implemented in the RepeatMasker software (version of 23 June 2003, at sensitive settings;
We stratified the repeats by G + C content as described above. We further stratified the repeats by age. The percentage of divergence of each repeat from its consensus sequence was used as an estimate of its age, assuming that an older repeat would have diverged more from its corresponding family consensus than a younger one: we subdivided repeats into nonoverlapping 5% divergence bins.
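As an illustration of the age stratification, the following Python sketch assigns each repeat to a nonoverlapping 5% divergence bin and tallies repeats by a composite key. The record fields (`family`, `gc_class`, `divergence`, `orientation`) and the function names are hypothetical and chosen only for this example.

```python
from collections import Counter


def divergence_bin(percent_divergence, bin_width=5.0):
    """Return the (lower, upper) bounds of the divergence bin, lower inclusive."""
    lower = int(percent_divergence // bin_width) * bin_width
    return (lower, lower + bin_width)


def stratified_counts(repeats):
    """Count repeats per (family, G + C class, divergence bin, orientation)."""
    counts = Counter()
    for rep in repeats:
        key = (rep["family"], rep["gc_class"],
               divergence_bin(rep["divergence"]), rep["orientation"])
        counts[key] += 1
    return counts


# Hypothetical records, for illustration only.
repeats = [
    {"family": "LINE/L1", "gc_class": "low", "divergence": 3.2, "orientation": "+"},
    {"family": "LINE/L1", "gc_class": "low", "divergence": 4.9, "orientation": "+"},
    {"family": "SINE/Alu", "gc_class": "medium", "divergence": 12.7, "orientation": "-"},
]
for key, n in stratified_counts(repeats).items():
    print(key, n)
```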
This analysis resulted in a detailed catalog of repeat counts, stratified by repeat type, repeat family, G + C content, divergence level, and orientation (see below). For example, in the human genome we observed 17.55 LINE/L1 repeats less than 5% divergent from consensus, per Mb of low G + C content sequence in intergenic regions, but 14.86 such repeats per Mb in similar regions that are annotated to be transcribed. The data sets are available at
We implemented a variation on the nucleotide composition method published by Green et al. [
where
When studying a genomic sequence for prediction purposes, the Greens score of a 1-kb bin was calculated as the sum of the scores of 100-bp nonrepetitive windows contained within it and normalized to Z scores in the same way as for ROAST scores (described below).
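The windowed summation can be sketched in Python as follows. The skew is assumed here to be (G + T − A − C)/(G + T + A + C), which matches the stated range of −1 (only A + C) to +1 (only G + T); the mapping from skew to log-likelihood score is a fitted curve that is not reproduced in this text, so it is passed in as a hypothetical callable `score_of_skew`.

```python
def gt_skew(seq):
    """G + T skew, assumed to be (G+T - A-C) / (G+T + A+C); ranges from -1 to +1."""
    seq = seq.upper()
    gt = seq.count("G") + seq.count("T")
    ac = seq.count("A") + seq.count("C")
    if gt + ac == 0:
        return 0.0  # no informative bases in this window
    return (gt - ac) / (gt + ac)


def greens_bin_score(nonrepetitive_seq, score_of_skew, window=100):
    """Sum per-window scores over the full windows of a nonrepetitive sequence.

    score_of_skew is a placeholder for the trained skew-to-score mapping.
    """
    total = 0.0
    for start in range(0, len(nonrepetitive_seq) - window + 1, window):
        total += score_of_skew(gt_skew(nonrepetitive_seq[start:start + window]))
    return total


# A sequence at the reported equilibrium of 52.7% G + T has a skew of about 0.054.
print(round(gt_skew("G" * 527 + "A" * 473), 3))
```

Under this definition, the 52.7% G + T equilibrium composition corresponds to a skew of 0.527 − 0.473 = 0.054.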
We studied the alignment files produced by RepeatMasker to identify all the single-nucleotide differences between each interspersed repeat (with the same exclusions as in ROAST) and its corresponding consensus sequence. These differences can be assumed to represent directional mutations from the consensus sequence state to the extant sequence state (
Interspersed repeat consensus sequences in the RepeatMasker/RepBase Update databases are oriented in the direction of transcription of the transposable element, which is usually recognized by the coding region. This orientation was unknown for a fraction of repeats, primarily LTR sequences without an associated internal sequence or noncoding DNA transposons. We determined orientations for these based on similarity to oriented elements or discovery of internal sequences or coding region in extended consensus sequences, leaving for the human genome only the Mariner-like MADE and MER1-group MER119 DNA transposons nonoriented and excluded from further analysis. A significant fraction of L1 repeats show an inversion, attributed to a “twin priming” mechanism [
When a repeat is located within a transcript, we refer to it as a “forward repeat” if its orientation is the same as that of the transcript; otherwise, we consider it a “reverse repeat.” For each repeat family, stratified by repeat age and regional G + C content, we calculated a score based on log-likelihood ratios, with the null hypothesis claiming the expectation of observing equal numbers of forward and reverse repeats. If F and R represent the observed number of repeats in the forward or reverse orientation relative to the enclosing transcript, the log-likelihood ratio contribution to the ROAST prediction of transcription in the same orientation as such a repeat is given by:
For example, we observed 26,191 mid-aged reverse Alu repeats within transcripts of intermediate G + C content but only 19,976 in the forward strand. Therefore, as only 43.27% were observed in the forward strand (and not the expected 50%), the score contribution for a transcription claim in the same orientation as such an Alu repeat is log2(0.4327/0.5) = −0.209, while the contribution to the claim of transcription in the reverse orientation receives a score of log2[(1 − 0.4327)/0.5] = 0.182. The −0.209 figure for the forward strand is different from the value plotted in
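The arithmetic of the Alu example can be checked with a short Python sketch. The function name `roast_llr` is hypothetical; the computation simply takes the observed forward fraction F/(F + R), compares it with the null expectation of 0.5, and reports base-2 log ratios for both orientations, as in the worked numbers above.

```python
import math


def roast_llr(forward, reverse):
    """Log2 likelihood ratios against the null of equal forward/reverse counts.

    Returns (same, opposite): the contribution of one such repeat to a claim
    of transcription in the same orientation as the repeat, and to a claim of
    transcription in the opposite orientation.
    """
    if forward <= 0 or reverse <= 0:
        raise ValueError("both counts must be positive")
    p_forward = forward / (forward + reverse)
    same = math.log2(p_forward / 0.5)
    opposite = math.log2((1.0 - p_forward) / 0.5)
    return same, opposite


same, opposite = roast_llr(forward=19976, reverse=26191)
print(round(same, 3), round(opposite, 3))  # -0.209 0.182
```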
For testing genomic sequences, we subdivided them into nonoverlapping 1-kb bins and assigned to each bin those interspersed repeats for which the midpoint lies within the bin. The ROAST score for a bin is the sum of the log-likelihood scores of its repeats, normalized to the average and standard deviation of the distribution of scores obtained if the same repeats had been observed in random orientations. Therefore, the ROAST score is expressed as a Z score, or standard deviations away from the mean value (under the null hypothesis), i.e.:
with:
and
where
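The normalization described for ROAST bin scores can be sketched as follows. This is a sketch under stated assumptions rather than the published formulation: it assumes that, under the null hypothesis, each repeat is independently forward or reverse with probability 1/2, so that the null mean and variance of the bin score are sums of per-repeat terms. The function name `roast_z` and the tuple layout are hypothetical.

```python
import math


def roast_z(repeats):
    """Z score of a bin's summed log-likelihood score for forward transcription.

    repeats: list of (is_forward, llr_same, llr_opposite) tuples, where
    llr_same applies when the repeat has the same orientation as the claimed
    transcript and llr_opposite when it has the reverse orientation.

    Assumption: under the null, orientations are independent and equiprobable.
    """
    observed = sum(same if is_fwd else opp for is_fwd, same, opp in repeats)
    null_mean = sum((same + opp) / 2.0 for _, same, opp in repeats)
    null_var = sum(((same - opp) / 2.0) ** 2 for _, same, opp in repeats)
    if null_var == 0.0:
        return 0.0  # no informative repeats in this bin
    return (observed - null_mean) / math.sqrt(null_var)


# Four reverse-oriented repeats carrying the Alu scores from the worked example.
alu = (False, -0.209, 0.182)
print(round(roast_z([alu] * 4), 2))  # 2.0
```

With n identical repeats all in the favorable orientation, each contributes one null standard deviation, so the Z score under these assumptions is the square root of n.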
Based on the RepeatMasker alignment files, we produced a “sequence mask,” which is a version of the genomic sequence in which each interspersed repeat is replaced by the corresponding segment from its consensus sequence (
The FEAST score for a single bin is calculated from the individual method scores as follows:
where
The FEAST score of the range of bins
We identified maximal scoring segments [
We obtained the transcribed fragment (“transfrag”) data from the UCSC Genome Bioinformatics Site [
We computed for each method M its total extent TE_M as the sum of the lengths of all the gene ranges annotated by it (collapsing overlapping ranges on each strand). For each pair of prediction methods x and y (e.g., ROAST and GenScan), we computed the measure of minimal possible disagreement minD_x,y as abs(TE_x − TE_y); i.e., the least disagreement will be observed when one set of predictions is entirely contained in the other, and their extents are exactly the same. We then calculated the maximal possible disagreement maxD_x,y as TE_x + TE_y (i.e., maximal when they are totally disjoint), or 2 × length(chrom) − (TE_x + TE_y) when TE_x + TE_y > length(chrom). The actual observed nucleotide overlap disagreement between the methods, obsD_x,y, is then normalized linearly to the range from minD_x,y to maxD_x,y, yielding a distance measure equal to 0 when the methods yield identical results and equal to 1 when they are maximally disjoint.
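The linear normalization of the disagreement measure can be sketched in Python. The function names are hypothetical, and the observed disagreement is assumed here to be the number of nucleotides annotated by exactly one of the two methods (the symmetric difference), which is one reading of "nucleotide overlap disagreement" consistent with the stated minimum and maximum.

```python
def observed_disagreement(x_positions, y_positions):
    """Assumed definition: positions annotated by exactly one of the methods."""
    return len(set(x_positions) ^ set(y_positions))


def normalized_disagreement(te_x, te_y, obs_d, chrom_len):
    """Rescale obs_d linearly from [min_d, max_d] to [0, 1]."""
    min_d = abs(te_x - te_y)
    total = te_x + te_y
    # Two sets covering more than the chromosome must overlap somewhere.
    max_d = total if total <= chrom_len else 2 * chrom_len - total
    if max_d == min_d:
        return 0.0  # degenerate case: no room for the sets to differ
    return (obs_d - min_d) / (max_d - min_d)


# Toy example on a 1,000-bp "chromosome".
x = range(0, 100)
y = range(50, 150)
d = normalized_disagreement(len(x), len(y), observed_disagreement(x, y), 1000)
print(d)  # 0.5
```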
We then created a two-dimensional visualization of the relationships among the methods from the matrix of pairwise distance measures, using the MDS algorithm ALSCAL as implemented by the SPSS statistical system, while specifying a ratio level of measurement with Euclidean distance, thereby creating a metric scaling solution. The technique of MDS seeks to create a configuration of points in two-dimensional space such that the pairwise Euclidean distances between pairs of points are closest to their respective actual distance measures. The ALSCAL algorithm iteratively seeks to minimize Young's S-stress formula 1, which is defined as the square root of the ratio (sum over all pairs of the squared difference between squared actual and squared Euclidean distances) divided by (sum over all pairs of the fourth power of the scaled Euclidean distance). Further details may be found, for example, in Davidson [
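The S-stress criterion, as described verbally above, can be written out in a few lines of Python. This follows the wording of the text literally and is not taken from the SPSS implementation: the denominator is assumed to use the Euclidean distances of the fitted configuration, and the function name `s_stress` is hypothetical.

```python
import math
from itertools import combinations


def s_stress(points, distances):
    """S-stress of a configuration against a matrix of target distances.

    points: list of coordinate tuples (the fitted configuration).
    distances: square matrix (list of lists) of the actual distance measures.
    """
    numerator = 0.0
    denominator = 0.0
    for i, j in combinations(range(len(points)), 2):
        # Squared Euclidean distance between points i and j.
        d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
        numerator += (distances[i][j] ** 2 - d2) ** 2
        denominator += d2 ** 2  # fourth power of the Euclidean distance
    if denominator == 0.0:
        raise ValueError("all points coincide")
    return math.sqrt(numerator / denominator)


# A 3-4-5 triangle reproduces its own distance matrix exactly.
pts = [(0, 0), (3, 0), (0, 4)]
dist = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
print(s_stress(pts, dist))  # 0.0
```

A value of 0 indicates that the two-dimensional configuration reproduces the distance measures exactly; larger values indicate a poorer fit.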
In order to avoid giving additional weight to the FEAST components in
Selected novel gene predictions were confirmed by PCR amplification from double-stranded cDNA, which was prepared from a mixture of over 30 different human tissues. PCR primers were specifically designed for each gene based on predicted exon sequences. Amplification products were sequenced to confirm their identity and to establish intron-exon boundaries.
Primers were designed using the primer3 software [
The PCR cycle parameters were optimized for a target size of 5 to 9 kb, as recommended by the manufacturer: melting at 95 °C for 1 min; 35 cycles of annealing and extension (95 °C for 30 s, 68 °C for 6 min); hold at 68 °C for 10 min, then 4 °C indefinitely.
The PCR products were visualized under long-wave UV light on a 1.2% agarose gel loaded with a 1-kb DNA ladder and stained with ethidium bromide. The PCR product bands were cut from the gel and purified using the QIAquick gel extraction kit (Qiagen, Valencia, California, United States). The DNA samples eluted from gel extraction columns were used directly for sequencing reactions or further amplified by PCR prior to sequencing.
Sequencing reactions were performed using a 1/16 dilution of Applied Biosystems Big Dye Terminator v3.1 Reaction Mix (Foster City, California, United States). Reactions were performed in 50 cycles on MJ Research PTC-225 thermocycler tetrads (Waltham, Massachusetts, United States) and precipitated with isopropanol and centrifugation. The sequencing ladders were resolved using an Applied Biosystems 3730XL sequencer and the accompanying base-calling and data quality analysis software.
(A) Schematic describing how excessive G + T skews may not be predictive of transcription.
(B) Log-likelihood ratio contribution of different strengths of G + T skew, within known genes. Skews range from −1 (only A + C) to +1 (only G + T). Observed values are shown in blue and an arbitrary fit curve in red. Highly skewed G + T compositions are observed to be less indicative of transcription than more moderate skews.
(49 KB PPT)
(A) Schematic describing the integration of a transposable element, potentially truncated, followed by accumulation of generally neutral substitutions. A comparison of the consensus sequence for many copies of this element, approximating the original sequence (filled), to the extant sequence (striped) yields a list of substitutions. To avoid distortions from alignment artifacts, only substitutions flanked by unchanged nucleotides are considered (thick vertical lines). Mutations involving CpG dinucleotides are also excluded.
(B) The chart indicates the total number (in millions) of directional mutational events observed in interspersed repeats within genes in the entire human genome, from repeat consensus to extant sequence. For each mutation, the upper value represents the mutation in the forward strand of the enclosing gene, while the lower value represents the same mutation in the reverse strand. For example, we observed 4.58 × 10⁶ mutations from an A in the repeat consensus to a G in the forward strand of the extant sequence but only 4.11 × 10⁶ mutations from a reverse-strand A in the repeat consensus to a reverse-strand G in the extant sequence (i.e., a T→C mutation in the forward strand). Bold arrows indicate mutations that are more frequent in the forward strand than in the reverse strand.
(29 KB PPT)
(A) Schematic describing the hypothesis of how the introduction of an interrupting signal (red) tends to be rejected, while the same signal in the opposite strand is not disruptive (white) and therefore is neutral. This process yields a strand bias.
(B) The log-likelihood ratio contribution, of a single repeat, to the claim of transcription in the same orientation as the repeat. Negative log-likelihood ratio values represent the prevalence of repeats in the reverse strand. The values for the various interspersed repeat families in human and mouse are correlated. The correlation was calculated based on repeat families that are significantly biased in both lineages, represented by filled icons. SINEs, LINEs, LTR elements, and DNA repeats are shown in red, green, blue, and black, respectively.
(29 KB PPT)
(A) A statistical analysis of PASs in transcribed nonrepetitive sequence revealed significant orientation biases, after correcting for nucleotide composition skews. In this schematic, PASs are represented by octagons, with color indicating signal strength (darker represents stronger signals). Within repeats, we found biases in PAS strength changes from the repeat consensus (filled repeat icons) to the extant sequence (open repeat icons).
(B) Expected PAS frequency skew as a function of nucleotide composition skew. Sequences enriched in T in the forward strand are expected to have more PASs in the reverse strand. Since the AATAAA signal is more biased than ATTAAA, its expected random skew is stronger.
(31 KB PPT)
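The dependence of expected PAS skew on nucleotide composition in panel B can be illustrated with a simple calculation. The sketch below assumes an independent-nucleotide background model, which may differ from the correction actually applied in PASTA; the base frequencies and function names are hypothetical. A PAS in the reverse strand appears in the forward strand as the reverse complement of the hexamer.

```python
import math

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def expected_frequency(motif, base_freqs):
    """Expected per-position frequency of a motif if bases are independent."""
    p = 1.0
    for base in motif:
        p *= base_freqs[base]
    return p

def expected_log_skew(motif, base_freqs):
    """Log ratio of expected forward-strand to reverse-strand motif frequency."""
    forward = expected_frequency(motif, base_freqs)
    reverse = expected_frequency(reverse_complement(motif), base_freqs)
    return math.log(forward / reverse)

# Hypothetical forward-strand composition, enriched in T relative to A
t_rich = {"A": 0.27, "C": 0.20, "G": 0.20, "T": 0.33}

for motif in ("AATAAA", "ATTAAA"):
    print(motif, expected_log_skew(motif, t_rich))
```

Under this model the log skew of AATAAA equals 4 log(pA/pT) and that of ATTAAA equals 2 log(pA/pT), so a T-enriched forward strand yields negative values (more PASs expected in the reverse strand) and the skew for AATAAA is twice that for ATTAAA.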
We selected the regions between consecutive genes in the same orientation and normalized their orientation to the forward strand. These regions show a prevalence of positive FEAST scores, indicating preference for transcription in the same strand as that of the flanking genes. Legend is the same as for
(30 KB PPT)
FEAST score versus prediction length given in kilobases, logarithmic scale.
(109 KB PPT)
From top to bottom: CpG contrast values, %G + C; predictions by PASTA, Greens, CHOWDER, and ROAST on the top strand, and their integration into FEAST; interspersed repeats color-coded by family; gene annotations; annotations and predictions on the bottom strand; Mb scale from the p telomere of chr19. The novel 43-miRNA cluster appears to be transcribed as a unit from the CpG island at 58.84 Mb. The FEAST prediction on the top strand (score = 8.8) is consistent with the orientation of the miRNA genes. The smaller miRNA cluster (including mir-371, mir-372, and mir-373) appears to be transcribed separately.
(39 KB PPT)
There is a large set of annotated known genes (blue) composed entirely of repetitive sequence. The novel FEAST predictions (Z > 3, red) with greater than 90% repeats are mostly satellite-rich pericentromeric regions and most probably represent false positives.
(190 KB PPT)
We include here the genomewide annotation of known genes (KG), Ensembl genes (ENS), Twinscan (TW), GenScan (GS), Softberry genes (SB), EC genes (EC), GeneID (GID), RNAs (RNA), Mammalian Gene Collection (MGC), pseudogenes (PS), Exoniphy exons (EX), Exoniphy exons bridged when in the same orientation and within 25 kb of each other, and not separated by exons in the opposite strand (EXB), all combinations of FEAST methods (e.g., CP is CHOWDER and PASTA; RCG includes ROAST, CHOWDER, and Greens), and a randomized version of each annotation method (names appended with “.s” for “shuffled”). The randomized versions are distributed along a wide arc that includes the pseudogenes and are clearly distinct from the unshuffled annotations (including the four FEAST methods and all their combinations). Note that, like geographical maps of intercity distances, MDS representations have no axes.
(105 KB PPT)
(A) Lane 3: Two PCR product bands amplified using
(B) The DNA products purified from the lower band (~750 bp) and higher band (~1,350 bp) were further amplified by PCR before sequencing. Lane M: 1-kb DNA ladder. Lanes 2 through 4: PCR products amplified from
(482 KB PDF)
(A) All transfrags in the ten chromosomes listed.
(B) The filtered transfrags after clustering. All FEAST predictions have Z > 3. The columns indicate the chromosome; the total number of transfrags; the number of transfrags in known genes, in FEAST predictions, in FEAST predictions but outside known genes, and the percentage of transfrags outside known genes that were included in FEAST predictions; the effective chromosome length (excluding gaps); the length of sequence included in known genes, in FEAST predictions but outside known genes, and the percentage of sequence outside known genes that was included in FEAST predictions; and the novel/out ratio between number of transfrags and sequence length (enrichment), its standard error, its Z score, and the probability of observing such enrichment under the null hypothesis. Numbers shown in italics are not significant (
(23 KB XLS)
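The enrichment column described above compares the fraction of transfrags outside known genes that fall in FEAST predictions with the fraction of sequence outside known genes that is covered by those predictions. The sketch below shows one way such a ratio and a Z score could be computed. It assumes a binomial null model with a normal approximation, which is not necessarily the standard error calculation used for the table; all input numbers and the function name `enrichment` are hypothetical.

```python
import math

def enrichment(n_out, n_novel, len_out, len_novel):
    """Enrichment of transfrags in novel predictions outside known genes.

    n_out:     transfrags outside known genes
    n_novel:   of those, transfrags inside FEAST predictions
    len_out:   sequence length outside known genes
    len_novel: of that, length covered by FEAST predictions
    Returns (ratio, z, one-sided p-value).
    """
    p_seq = len_novel / len_out   # fraction expected under the null
    p_obs = n_novel / n_out       # fraction observed
    ratio = p_obs / p_seq
    # Assumed null: each transfrag lands in a prediction with probability p_seq
    expected = n_out * p_seq
    sd = math.sqrt(n_out * p_seq * (1.0 - p_seq))
    z = (n_novel - expected) / sd
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of the normal
    return ratio, z, p_value

# Hypothetical chromosome, for illustration only
ratio, z, p_value = enrichment(n_out=1000, n_novel=150,
                               len_out=50_000_000, len_novel=5_000_000)
print(f"enrichment = {ratio:.2f}, Z = {z:.2f}, p = {p_value:.2e}")
```

In this toy case the predictions cover 10% of the sequence outside known genes but contain 15% of the transfrags found there, giving an enrichment of 1.5.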
For each chromosome, and for all chromosomes in combination, we calculated enrichment ratios as in
(17 KB XLS)
The GenBank (
We wish to thank Irit Rubin, Lee Rowen, Nat Goodman, Robert Hubley, Phil Green, Michael Brent, Benno Schwikowski, and Yoav Gilad for helpful discussions.
CHOWDER: CHanges Oriented Within DispErsed Repeats
CNS: conserved noncoding sequence
FEAST: fast empirical algorithms suggesting transcripts
MDS: multidimensional scaling
miRNA: microRNA
PAS: polyadenylation signal
PASTA: PAS transcript analysis
ROAST: repeat orientation analysis suggesting transcripts
UCSC: University of California Santa Cruz