The authors have declared that no competing interests exist.
Conceived and designed the experiments: GR MS HGS SH. Performed the experiments: FM IC. Analyzed the data: KO KG AC CL. Wrote the paper: KO FM KG IC AC GR HGS SH. Designed the analysis algorithm: KO.
Identification of genes that respond to an extracellular cue enables characterization of pathophysiologically crucial biological processes. Deep sequencing technologies provide a powerful means to identify responsive genes, which creates a need for computational methods able to analyze dynamic and multi-level deep sequencing data. To address this need, we introduce a data-driven algorithm, SPINLONG, which is designed to search for genes that match user-defined hypotheses or models. SPINLONG is applicable to various experimental setups measuring several molecular markers in parallel. To demonstrate the SPINLONG approach, we analyzed ChIP-seq data reporting PolII, estrogen receptor
Cellular processes in mammalian cells are tightly regulated to ensure that the cells function properly as a part of an organism. Dysregulation of some of these processes, such as apoptosis, cell proliferation and growth, can lead to cancer. One of the most important regulation mechanisms for cellular processes is the activation of membrane receptors by extracellular stimuli. Such cues trigger signal cascades that lead to altered expression of a number of genes in the cell nucleus; a key challenge in biomedicine is to identify which genes respond to a specific stimulus. These so-called response genes can be investigated on a whole-genome scale with genomic sequencing, a technology that can quantify protein binding to DNA or gene activation. Analysis of such whole-genome data, however, is challenging because the experiments measure billions of data points. Here we introduce SPINLONG, a widely applicable novel computational method that integrates multiple levels of deep sequencing data to produce experimentally testable hypotheses. We applied SPINLONG to breast cancer data, found early responsive genes for estrogen receptor and analyzed their regulation. These analyses identified a gene whose high activity is associated with decreased breast cancer patient survival.
The identification of genes whose expression patterns are altered due to a stimulus is essential as it provides a basis to understand which signaling and metabolic pathways are influenced as a consequence of a stimulus. The majority of approaches to identify stimulus-regulated changes in gene expression rely on the relative abundance of mRNA molecules, either measured with microarrays or with RNA-seq, as an indirect indication of transcriptional initiation
Transcription is a dynamic process that is regulated by transcription factors and is reflected in local histone modifications. A reliable indication of an actively transcribed gene is the presence of RNA polymerase II (PolII) protein complex in the body of the gene. PolII generates the precursors of most mRNA, snRNA and miRNA molecules, and its activity is modulated by histone modifications
Genome-wide PolII activity can be measured with ChIP-seq (chromatin immunoprecipitation combined with massive parallel sequencing)
Estrogen receptor
To demonstrate the utility of SPINLONG, we characterized estradiol-induced early responsive genes in MCF-7 breast cancer cells with SPINLONG analysis based on ChIP-seq data for PolII, H3K4me3 and H2A.Z occupancy at 5+2 time points following estradiol stimulation. Time points 0, 10, 20, 40 and 80 minutes were used to capture early transcriptional responses. Additionally, data from 160 and 640 minutes were used as auxiliary data points to supplement the main analysis. We also used ChIP-seq data for measuring the binding of
SPINLONG is a computational method for ranking genomic regions, such as genes, based on how closely they match a spatio-temporal deep sequencing pattern defined by the user. The overall schematic of the SPINLONG approach is shown in
Red and green boxes denote implementation of SPINLONG and input from a user, respectively. SPINLONG produces tabulated and graphical results that can be interpreted in the context of the hypotheses. (B) Illustration of the pattern matching step with two samples and four segments in the context of a gene
An essential first step in identifying
In addition to PolII ChIP-seq profiles, local histone modifications and substitution by the variant histone H2A.Z correlate with the location of TSSs and thus provide information that can be used in localizing the gene bodies
Regions associated with the presence of PolII are computed from the lengths of PolII body segments assigned by the SPINLONG score optimizer (see Materials and Methods). The scores ranged from 0 to 1.08, and the threshold used here (0.60) corresponds to a gene whose spatial short read coverage matches the evaluated pattern in 60% of genomic locations. The spatial pattern searched from the PolII profile is ‘low-high-low’,
Two high scoring putative estrogen responsive genes,
PolII, H3K4me3 and H2A.Z read counts are shown vertically with sample labels on the right. Sample data are the maximum of bin counts over all time points. The X-axis shows gene position in base-pairs (5′ to 3′ direction) and the Y-axis shows relative bin count. Green horizontal lines indicate Gaussian means of HMM states; the HMM state with lower mean is interpreted as class
Using five time-points of PolII ChIP-seq measurements (0, 10, 20, 40 and 80 minutes after stimulation), in conjunction with a non-specific antibody background sample as a subtraction control and utilizing a conservative score threshold (0.65) for an estrogen regulated gene, SPINLONG analysis characterized 699 genes as induced and 78 as repressed following estradiol stimulus. The results are available at
PolII read counts are shown vertically. The last pseudo-sample (locator) is an aggregate sample containing the maximum value of all time points used for fine-tuning the gene location. Initial locations (gray vertical lines) are obtained from a previous optimizer run with PolII and histone markers. Gray bins are within the ignored area (class
To identify genes that respond rapidly to estradiol stimulus via direct binding
Adjuvant endocrine therapy with selective anti-estrogens or with aromatase inhibitors is used to treat breast cancer patients with
Kaplan-Meier analysis with log-rank test resulted in 19 genes that have a survival effect with a nominal
Vertical ticks represent censoring events. The X and Y axes represent follow-up time in months and the percentage of survival, respectively. The associated log-rank p-value is
In order to find out whether the survival association of
The PolII transcription process consists of initiation, elongation and termination phases, with elongation occurring at 0.5–4 kb/min
We observed that the transcription rate of PolII correlates strongly with gene length (
Interestingly, we noted that genes with an
We have developed a novel data-driven computational approach to facilitate the analysis of cell processes that are characterized by spatio-temporal signals. The large degree of freedom in defining patterns allows SPINLONG to scale to a variety of experimental designs where the user is able to formulate patterns to be identified within the data. Although the need to define the patterns
Methodologically, SPINLONG is a machine learning algorithm that can be used for classification, such as to determine whether a gene is induced, repressed or neither, but also for “sequence labeling” tasks, such as dividing a gene into flanking regions of PolII activity and inactivity. In genomic data analysis, SPINLONG shares some similarities with ChIP-seq peak detection algorithms (
When we applied SPINLONG to time-series data from the MCF-7 breast cancer cell line after
In addition to recent PolII profiling-based efforts, ER responsive genes have been investigated using microarrays, RNA sequencing and low-throughput methods. These resources include ERGDB
Our results show that only 40% of
Kaplan-Meier analysis with 150 primary breast cancer tumors from TCGA resulted in 19 early
The significance of
SPINLONG can be customized according to research hypotheses, as demonstrated here for the propagation of PolII on a gene. This, together with a computationally fast implementation that processed 1.2 billion data points in 8 hours, makes SPINLONG a strong method for analyzing multi-level deep sequencing data. In addition to ChIP-seq data, SPINLONG can be applied to data produced by other deep sequencing technologies, such as GRO-seq, provided that the sequencing depth is sufficient (greater than 100 million reads per sample) to identify patterns reliably. In summary, SPINLONG is a widely applicable novel computational method that integrates multiple levels of deep sequencing data to produce experimentally testable hypotheses.
SPINLONG is implemented as a freely available command line tool as well as a component to the Anduril workflow framework
In the sequencing data import step, the genome is divided into fixed length bins of size
After data import, we focus on a pre-defined genomic region
The core of the SPINLONG approach is pattern matching by score optimization. A pattern encodes a hypothesis, such as “transcription of a gene is initiated close to its starting site after stimulus and transcription progresses from 5′ to 3′ direction.” A score indicates how well the bin counts of the current region match the pattern. The user can define the expected spatial arrangements of bin counts in the samples, a scoring scheme and optional constraints for optimization.
Pattern matching is done in the context of a genomic region and the pattern is evaluated against all defined regions, such as all genes in a genome. A pattern divides the bins of each sample into
Formally, the segments of a pattern are denoted as
The pattern scoring step evaluates each genomic region against the hypothesis and guides the optimization step. The result of the pattern scoring is the primary outcome of SPINLONG.
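To make the idea of scoring a region against a spatial pattern concrete, the following is a minimal sketch and not the SPINLONG implementation. It assumes a fixed count threshold to label each bin as "low" or "high", scores a segmentation as the fraction of bins whose label matches the class of the segment they fall into, and finds the best segment lengths by exhaustive search rather than by the optimizer SPINLONG uses. The function names, the threshold and the bin counts are all hypothetical.

```python
from itertools import combinations


def pattern_score(classes, pattern, lengths):
    """Fraction of bins whose class matches the class of their segment."""
    # Expand the pattern into one expected class per bin.
    expected = [cls for cls, n in zip(pattern, lengths) for _ in range(n)]
    matches = sum(1 for got, want in zip(classes, expected) if got == want)
    return matches / len(classes)


def best_segmentation(counts, pattern, threshold):
    """Exhaustively search segment lengths that maximize the pattern score.

    Assumption: a bin is 'high' if its count exceeds `threshold`, else 'low'.
    Every segment is required to contain at least one bin.
    """
    classes = ["high" if c > threshold else "low" for c in counts]
    n_bins, n_segments = len(classes), len(pattern)
    best_score, best_lengths = -1.0, None
    # Choose the interior boundaries between consecutive segments.
    for cuts in combinations(range(1, n_bins), n_segments - 1):
        bounds = (0,) + cuts + (n_bins,)
        lengths = [b - a for a, b in zip(bounds, bounds[1:])]
        score = pattern_score(classes, pattern, lengths)
        if score > best_score:
            best_score, best_lengths = score, lengths
    return best_score, best_lengths


# Hypothetical bin counts for one sample over one genomic region.
counts = [0, 1, 0, 7, 9, 8, 6, 1, 0, 0]
score, lengths = best_segmentation(counts, ("low", "high", "low"), threshold=3)
print(score, lengths)
```

Exhaustive search is only practical for a handful of bins and segments; with several samples and many bins per region the number of segmentations grows quickly, which is why a heuristic optimizer is the more realistic choice.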
Pattern scoring uses assigned lengths (
All segments belonging to the same sample are scored with the same scorer, but there may be several independent scorers for independent sets of samples. Samples for distinct markers, such as PolII and H3K4me3, use separate scorers because their data ranges are different and segment scorers assume a common range for all associated samples. Segment score for a segment
Segment scoring is done using
Composite pattern scores
Simulated annealing (SA)
The deep sequencing data are available at
The MCF-7 human breast cancer cell line originates from a 69-year-old Caucasian woman and is estrogen receptor (ER) positive, progesterone receptor (PR) positive and HER2 negative. Here MCF-7 cells (a clonal isolate kindly provided by Prof. Edison Liu, Jackson Laboratories, Maine, USA) were grown in 15 cm plates to 80% confluency. Plates were then washed twice with PBS and overlaid with 20 ml of phenol-red free high glucose DMEM (Gibco) containing 2% charcoal stripped FCS (Sigma). After 24 hours of incubation, the cells were again washed with PBS and fresh media containing 2% charcoal stripped FCS was added. This process was repeated over a three-day period to generate cells devoid of estradiol signaling. The time course (10, 20, 40, 80, 160, 320, 640 and 1280 min) was initiated by replacing media with prewarmed media containing 10 nM E2. In addition, an untreated sample was included in the experiment as a zero time point.
For chromatin immunoprecipitation, cells were fixed for 10 minutes at room temperature by the addition of formaldehyde to a final concentration of 1%, after which glycine was added to a concentration of 100 mM. Cells were then washed twice with PBS and collected into 2 ml of lysis buffer (150 mM NaCl, 20 mM Tris pH 8.0, 2 mM EDTA, 1% triton X-100, protease inhibitor [complete EDTA free, Roche, 04 693 132 001], 100 mM PMSF). The lysate was sonicated for 3×30 seconds using a Branson ultrasonicator equipped with a microtip on a power setting of 3 and a duty cycle of 90%. Samples were cooled on ice between rounds of sonication. Alternatively, a Bioruptor sonicator was used (power high, 15 mins total, 30 s on 30 s off; total volume of sample – 1 ml) to fragment chromatin. In either case, the resulting sonicate was centrifuged at 4000×g for 5 minutes, an aliquot of 10% retained for input and the remaining material transferred to a fresh tube.
Four
To compute the associations between
Total RNA was isolated using Trizol (Invitrogen) according to the manufacturer's recommendations, followed by DNazol treatment (QIAGEN). 100–250 ng total RNA was subjected to rRNA depletion with the Ribo-Zero rRNA Removal Kit (Human/Mouse/Rat; cat. no. RZH110424). The rRNA depleted sample was purified by ethanol precipitation. mRNA was fragmented by hydrolysis (5× fragmentation buffer: 200 mM Tris acetate, pH 8.2, 500 mM potassium acetate, and 150 mM magnesium acetate) at
To align all reads, we used the GMS_map software (version 3.2.1) on a Genomatix Mining Station. The reference genome used was human hg19 (NCBI build 37). The software uses a seed-based approach to align reads. Mapping a read to a reference sequence involves two steps. In the first step, seeds for potential mapping positions in the target sequence are identified via a mapping library built of short unique subwords from the reference sequence. In the second step, alignments of the complete sequence read to the previously identified positions in the reference sequence are calculated. Results are ranked by their alignment score. We used the ‘deep’ seed search option allowing for point mutations during seed search. The overall alignment quality threshold was set to 92%, allowing for at most two point mutations. Uniquely and multiply matching reads meeting these thresholds were provided in BED and indexed BAM format.
To estimate PolII activity start and end sites, we combined the time points 0, 10, 20, 40 and 80 min together into a “pseudo-sample” that contains the maximum of bin counts in these time points. This was done individually for each marker (PolII, H3K4me3 and H2A.Z). These pseudo-samples contain information from all time points but also simplify pattern formulation by including only one (pseudo-) sample for each marker, instead of five samples for individual time points.
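The pseudo-sample construction described above can be sketched in a few lines of NumPy. This is an illustrative example only: the function name `make_pseudo_sample` and the bin counts are hypothetical, and the array layout (one row per time point, one column per bin) is an assumption about how the data might be held in memory.

```python
import numpy as np


def make_pseudo_sample(bin_counts):
    """Collapse a (time points x bins) array into one pseudo-sample.

    Each bin of the pseudo-sample holds the maximum count observed
    for that bin over all time points.
    """
    counts = np.asarray(bin_counts, dtype=float)
    if counts.ndim != 2:
        raise ValueError("expected a 2-D array: time points x bins")
    return counts.max(axis=0)


# Hypothetical PolII bin counts for one region at 0, 10, 20, 40 and 80 min.
polii = np.array([
    [0, 1, 0, 0],
    [3, 2, 0, 0],
    [1, 4, 2, 0],
    [0, 2, 5, 1],
    [0, 1, 3, 6],
])
pseudo = make_pseudo_sample(polii)
print(pseudo)
```

The same function would be called once per marker (PolII, H3K4me3 and H2A.Z), so that each marker contributes a single row to the pattern instead of five.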
All pseudo-samples were divided into three segments with classes
Composite scoring was based on
The assumption behind identifying genes whose activity levels increase after a stimulus is that, without stimulation, such a gene is initially transcribed at a low rate and thus there is little PolII activity across the gene. Similarly, we assume that genes whose activity is repressed after a stimulus initially show high PolII activity, which decreases after the stimulus. For genes that are activated by a stimulus, the PolII complex binds to the genomic region at the transcription start site (TSS) of the responder gene. This is indicated by an increased number of reads at the 5′ end of the gene. These assumptions are used to form the hypotheses used in SPINLONG.
Identification of estrogen responder genes is divided into identifying induced and repressed genes. For this analysis, PolII data were used at time points 0, 10, 20, 40 and 80 min; in addition, time points 160 and 640 min were used to compute a pseudo-sample (see below). The hypothesis for induced genes is that they show low signal (class
Patterns for induction are named
The segment configuration of each pattern is visible in the result visualizations (
Estimation of PolII propagation speed is done using the segment lengths obtained in the identification of estrogen responsive genes above. For the patterns
Survival analyses were conducted with the Anduril framework
In the Cox regression model for
The results published here are in part based upon data generated by The Cancer Genome Atlas pilot project established by the NCI and NHGRI. Information about TCGA and the investigators and institutions who constitute the TCGA research network can be found at
In the first
Kaplan-Meier survival plot comparing TCGA patients with overexpression (denoted 1), neutral expression (0) or underexpression (−1) of
(PDF)
Kaplan-Meier survival plot comparing TCGA patients with overexpression (denoted 1), neutral expression (0) or underexpression (−1) of
(PDF)
Kaplan-Meier survival plot comparing TCGA patients with overexpression (denoted 1), neutral expression (0) or underexpression (−1) of
(PDF)
Kaplan-Meier survival plot comparing TCGA patients with overexpression (denoted 1), neutral expression (0) or underexpression (−1) of
(PDF)
Kaplan-Meier survival plot comparing TCGA patients with overexpression (denoted 1), neutral expression (0) or underexpression (−1) of
(PDF)
Kaplan-Meier survival plot comparing TCGA patients with overexpression (denoted 1), neutral expression (0) or underexpression (−1) of
(PDF)
Kaplan-Meier survival plot comparing TCGA patients with overexpression (denoted 1), neutral expression (0) or underexpression (−1) of
(PDF)
Kaplan-Meier survival plot comparing TCGA patients with overexpression (denoted 1), neutral expression (0) or underexpression (−1) of
(PDF)
Kaplan-Meier survival plot comparing TCGA patients with overexpression (denoted 1), neutral expression (0) or underexpression (−1) of
(PDF)
Kaplan-Meier survival plot comparing TCGA patients with overexpression (denoted 1), neutral expression (0) or underexpression (−1) of
(PDF)
Kaplan-Meier survival plot comparing TCGA patients with overexpression (denoted 1), neutral expression (0) or underexpression (−1) of
(PDF)
Kaplan-Meier survival plot comparing TCGA patients with overexpression (denoted 1), neutral expression (0) or underexpression (−1) of
(PDF)
Kaplan-Meier survival plot comparing TCGA patients with overexpression (denoted 1), neutral expression (0) or underexpression (−1) of
(PDF)
Kaplan-Meier survival plot comparing TCGA patients with overexpression (denoted 1), neutral expression (0) or underexpression (−1) of
(PDF)
Kaplan-Meier survival plot comparing TCGA patients with overexpression (denoted 1), neutral expression (0) or underexpression (−1) of
(PDF)
Kaplan-Meier survival plot comparing TCGA patients with overexpression (denoted 1), neutral expression (0) or underexpression (−1) of
(PDF)
Kaplan-Meier survival plot comparing TCGA patients with overexpression (denoted 1), neutral expression (0) or underexpression (−1) of
(PDF)
Kaplan-Meier survival plot comparing TCGA patients with overexpression (denoted 1), neutral expression (0) or underexpression (−1) of
(PDF)
Kaplan-Meier survival plot comparing TCGA patients with overexpression (denoted 1), neutral expression (0) or underexpression (−1) of
(PDF)
PolII elongation speed lower bound estimates compared to gene lengths. The gene set includes induced and repressed estradiol early response genes. Genes are colored according to the time point by which the leading or lagging end reaches the 3′ end; the 80 minute time point includes genes for which the transition remains incomplete. For gene length, transcribed region lengths are used when this estimate is available; otherwise, lengths obtained from Ensembl are used.
(PDF)
Short read counts of
(PNG)
Short read counts of
(PNG)
Venn diagrams for ER responsive gene set intersections. Each number denotes the number of genes in the intersection. Data for the three external gene sets are from ERGDB
(PDF)
Kaplan-Meier survival plot comparing patients from
(PDF)
Kaplan-Meier survival plot comparing patients from
(PDF)
Survival-associated genes in the TCGA cohort predicted to respond to estradiol stimulus by SPINLONG.
(PDF)
Gene Ontology enrichment for induced genes with an
(PDF)
Gene Ontology enrichment for induced genes without an
(PDF)
We thank Ari Ristimäki, Matjaz Barboric, Mikko Frilander, Martijn Huynen, Fiona Nielsen and Ville Rantanen for discussions, and Ping Chen for analyzing RNA-seq data.