I have read the journal's policy and have the following conflicts: JB and UB are part owners of a patent application regarding the use of TAL effectors.
Implemented the software: JG AW. Conceived and designed the experiments: JG AW MR UB SP JB. Performed the experiments: MR. Analyzed the data: JG AW MR SP JB. Wrote the paper: JG SP JB.
Transcription activator-like (TAL) effectors are injected into host plant cells by
While it had already been discovered that transcription activator-like (TAL) effectors from
The DNA-binding domain of transcription activator-like (TAL) effectors is unique in its modular DNA-specificity. Natural TAL effectors are potent virulence proteins from plant-pathogenic
The recognition of signals in nucleic acid sequences, such as transcription factor binding sites, splice sites, or translation initiation sites, has been one of the major fields of computational biology since the seminal work of Berg and von Hippel
Berg and von Hippel already note that the independence assumptions of position weight matrices are most likely not satisfied. First order Markov models or weight array matrix models
Dependencies between non-adjacent binding site positions are represented by Bayesian networks
In principle, all of these models could also be employed for the prediction of TAL effector target sites, where a direct application would require learning distinct parameters for the target sites of each TAL effector. However, the number of validated target sites of individual TAL effectors is currently not sufficient to reliably estimate the parameters of any of these models. More importantly, such an approach would render the target site prediction for TAL effectors with currently unknown targets impossible. The ab-initio prediction of zinc finger transcription factor binding sites poses similar problems, which are addressed by an approach Kaplan
We give an overview of current tools for the prediction of TAL effector target sites and TAL effector nuclease target sites in
| TALgetter | Target Finder | Storyteller | TALVEZ | Paired Target Finder | idTALE |
Prediction of TALE target sites | yes | yes | yes | yes | no | no |
Prediction of TALEN target sites | no | no | no | no | yes | yes |
Web server | | | | | | |
Custom input limit (web-server) | 100 mb | 5 kb | unknown | unknown | 5 kb | N/A |
Stand-alone application | yes | yes | yes | yes | yes | no |
Local web server | yes | no | no | no | no | no |
Access | free | free | on e-mail request | on e-mail request | free | free |
Method published | | yes | no | no | yes | no |
Method/Model | local mixture model | modular PWM | unknown | unknown | modular PWM | unknown |
Adaptable to new data | yes | in source code | unknown | unknown | in source code | no |
Open Source License | | | no | no | | no |
only pre-defined data sets;
this manuscript;
GNU General public license;
Internet Systems Consortium license, only stand-alone application;
Storyteller and TALVEZ are provided as a web-server and stand-alone application as well. However, the methods behind both approaches have not yet been published, and both tools are accessible only on e-mail request. For these reasons, we do not consider Storyteller and TALVEZ in the remainder of this paper. Paired Target Finder and idTALE use RVD-dependent binding specificities to predict target sites of TAL effector nucleases, which function as homo- or hetero-dimers to specifically cut genomic DNA. While Paired Target Finder is available as a web-server and command line application, idTALE is only available as a web-server and can only be applied to pre-defined input data sets. Both approaches are applicable to TAL effector nucleases but not to TAL effectors.
In this paper, we propose a new statistical model for the prediction of TAL effector target sites, which represents
The mechanism of transcriptional activation by TAL effectors is still not fully understood. However, there are indications that the presence of a suitable TAL effector target site in a promoter is not always sufficient to induce transcription of the downstream gene
The remainder of this paper is structured as follows. In the section
In this section, we define the statistical model used by TALgetter, we describe how the parameters of this statistical model are estimated from training data, and we explain how the trained model is used to scan genomes, promoteromes or other input sequences for putative target sites. Subsequently, we describe the gene expression data obtained from microarray experiments and sequence data used in the studies of this paper.
The statistical model employed by TALgetter is defined by its likelihood, which is derived in the following. Let
We can decompose the likelihood
Since a strong preference for nucleotide T at position
As motivated in the introduction, we model binding specificity and importance of an RVD independently. In addition, we impose several independence assumptions:
If an RVD of a repeat of the TAL effector interacts with the DNA, the probability of nucleotide
If that RVD does not interact with the DNA, the probability of nucleotide
The binding specificity of an RVD is independent of the position within the target site and independent of the other RVDs of the TAL effector.
The importance of an RVD is independent of the position of its repeat within the TAL effector and independent of the other RVDs of the TAL effector.
These assumptions may be formulated as a
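The assumptions listed above can be illustrated by a small per-position mixture. The following is a minimal sketch and not the TALgetter implementation: all parameter values are made up, the uniform distribution used when an RVD does not interact with the DNA is an assumption, and the names `SPECIFICITY`, `IMPORTANCE`, `POSITION0`, `FALLBACK`, `position_probability`, and `site_log_likelihood` are hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"

# Hypothetical parameters for illustration only (not the values learned by TALgetter).
SPECIFICITY = {
    "HD": {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    "NG": {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    "NI": {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    "NN": {"A": 0.30, "C": 0.05, "G": 0.60, "T": 0.05},
}
# Probability that an RVD of this type interacts with the DNA (its "importance").
IMPORTANCE = {"HD": 0.9, "NG": 0.8, "NI": 0.8, "NN": 0.9}
# Nucleotide preference at position 0 (made-up values with a preference for T).
POSITION0 = {"A": 0.05, "C": 0.15, "G": 0.05, "T": 0.75}
# ASSUMPTION: nucleotide distribution used when the RVD does not interact.
FALLBACK = {n: 0.25 for n in NUCLEOTIDES}


def position_probability(rvd, nucleotide):
    """Mixture of the RVD-specific and the fallback distribution."""
    importance = IMPORTANCE[rvd]
    return (importance * SPECIFICITY[rvd][nucleotide]
            + (1.0 - importance) * FALLBACK[nucleotide])


def site_log_likelihood(rvds, site):
    """Log-likelihood of a site: position 0 followed by one nucleotide per RVD."""
    if len(site) != len(rvds) + 1:
        raise ValueError("site must have exactly len(rvds) + 1 nucleotides")
    log_likelihood = math.log(POSITION0[site[0]])
    for rvd, nucleotide in zip(rvds, site[1:]):
        log_likelihood += math.log(position_probability(rvd, nucleotide))
    return log_likelihood
```

In this sketch, a low importance pulls the per-position probability towards the fallback distribution, so a mismatch at a low-importance RVD is penalized less than a mismatch at a high-importance RVD.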
Given an input data set
One commonly used principle for estimating the parameters of statistical models is the generative maximum likelihood principle. However, two reasons suggest employing the Bayesian maximum a-posteriori (MAP) learning principle, which additionally imposes a prior on the model parameters. First, we have prior knowledge about binding specificities and importance of binding that is not fully covered by our training data. Second, our training data contains only a very limited number of TAL effectors and corresponding target sites. For this reason, we cannot be certain that binding events which are not observed in the training data can never occur in functional target sites. More critically, some rare but known RVDs are not present in any TAL effector of our training data. Nonetheless, the model should be able to assign likelihoods to putative target sites of TAL effectors containing such rare RVDs.
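The effect of a prior on unobserved events can be illustrated with a deliberately simplified case: a single categorical distribution with fully observed counts and a Dirichlet prior, for which the MAP estimate has a closed form. This is a sketch under stated assumptions and not the estimation procedure of TALgetter, whose mixture structure is not covered by this closed form; the counts, the hyper-parameters, and the function names `map_estimate` and `ml_estimate` are hypothetical.

```python
def ml_estimate(counts, keys):
    """Maximum likelihood estimate: relative frequencies of the observed counts."""
    total = sum(counts.get(k, 0) for k in keys)
    return {k: counts.get(k, 0) / total for k in keys}


def map_estimate(counts, alpha):
    """MAP estimate of categorical probabilities under a Dirichlet(alpha) prior.

    Uses the mode of the Dirichlet posterior, which lies inside the simplex
    only if every hyper-parameter alpha[k] is greater than 1.
    """
    if any(a <= 1.0 for a in alpha.values()):
        raise ValueError("this closed form requires all alpha values > 1")
    numerators = {k: counts.get(k, 0) + a - 1.0 for k, a in alpha.items()}
    total = sum(numerators.values())
    return {k: v / total for k, v in numerators.items()}


# Hypothetical example: nucleotides observed opposite one RVD in ten target sites.
observed = {"C": 9, "T": 1}
prior = {"A": 2.0, "C": 2.0, "G": 2.0, "T": 2.0}
ml = ml_estimate(observed, prior.keys())   # A and G receive probability 0
posterior_mode = map_estimate(observed, prior)  # A and G receive a small positive probability
```

With maximum likelihood, nucleotides that were never observed would make every site containing them impossible, whereas the prior keeps their probability small but positive.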
We use independent Dirichlet priors on each sub-set of parameters defined on a common simplex, which results in a product-Dirichlet prior on the full set of parameters. We denote the prior on the parameters
We estimate the parameters
We estimate the optimal parameters
The training data set used in this study comprises known pairs of TAL effector and target site from
The values of all parameters of the TALgetter model as estimated from these training data are listed in supplementary
Once the parameters have been estimated, we predict target sites of a given TAL effector by scanning input sequences, for instance the promoterome of an organism, using the trained model. Given the RVD sequence of length
Since the absolute value of the likelihood directly depends on the length of the RVD sequence, and, hence, on the width of the sliding window used for scanning, the obtained likelihood values are in general not comparable between different TAL effectors. To overcome this issue, we additionally compute empirical p-values for the putative target sites based on their likelihood values. To this end, we generate a database of random sequences of at least the total length of all scanned input sequences. We generate these random data by drawing sequences from a homogeneous Markov model of order
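The idea of an empirical p-value can be sketched as follows. This is an illustration only: the function names `draw_markov_sequence` and `empirical_p_value` are hypothetical, the Markov order is left as a free parameter, and the p-value is computed as the plain fraction of background scores that are at least as large as the observed score, which is an assumption about the exact definition.

```python
import bisect
import random

NUCLEOTIDES = "ACGT"


def draw_markov_sequence(length, order, cond_probs, rng):
    """Draw one sequence from a homogeneous Markov model of the given order.

    cond_probs maps each context string of `order` nucleotides to a dict of
    nucleotide probabilities; the first `order` symbols are drawn uniformly.
    """
    seq = [rng.choice(NUCLEOTIDES) for _ in range(order)]
    while len(seq) < length:
        context = "".join(seq[len(seq) - order:]) if order > 0 else ""
        probs = cond_probs[context]
        weights = [probs[n] for n in NUCLEOTIDES]
        seq.append(rng.choices(NUCLEOTIDES, weights=weights)[0])
    return "".join(seq[:length])


def empirical_p_value(score, sorted_background_scores):
    """Fraction of background scores that are greater than or equal to `score`.

    sorted_background_scores must be sorted in ascending order, for example the
    scores of all sliding windows in the random sequences.
    """
    first_ge = bisect.bisect_left(sorted_background_scores, score)
    n_ge = len(sorted_background_scores) - first_ge
    return n_ge / len(sorted_background_scores)
```

Because the p-value refers to the score distribution of the same TAL effector on random sequences, it can be compared between TAL effectors of different lengths, unlike the raw likelihood.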
For assessment and predictions in
For finding differentially expressed genes after infection with
PLEXdb ID | GEO ID | Experiments | #arrays | |
OS3 | GSE16793 | 24 hpi, mock | 12 | |
OS38 | GSE19844 | 24 hpi, mock | 9 | |
OS66 | GSE19844 | 24 hpi, mock | 27 |
Depending on the experiment, we extract different lists of differentially expressed genes based on the log-fold change of a target data set versus a control data set. Specifically, target and control data sets are
Name | PLEXdb ID | Target | Control |
BAI3 | OS38 | ||
MAFF311018 | OS66 | mock | |
PXO86 | OS66 | mock | |
PXO99 | OS66 | mock | |
PXO99AME1 | OS66 | ||
PXO99AME2 | OS66 | ||
XOC | OS3 | mock | |
XOO | OS3 | mock |
Although these gene expression data are a valuable source of information about
For this reason, we additionally design a more controlled environment for studying the effects of a single TAL effector as described in the following.
We generate transgenic
In a third study, we predict putative target sites of TAL effectors of
We obtain gene expression data of
We obtain
We obtain
We obtain
In this section, we examine known TAL effector target sites in
In the second setting, we use the final version of TALgetter, where we use the complete training data. This version is available as a web-application and command line program. For testing, we scan regions from 300 bp upstream of the transcription start site (TSS) to 200 bp downstream of the TSS or the start codon, whichever comes first. This choice will be motivated in the section
The ranks of the known TAL effector target sites achieved in these two settings are listed in
TAL effector | Target gene | Locus ID | Reference | TALgetter CV | TALgetter final |
Tal1c/XOCORF_0460 | OsHEN1 | Os07g06970 | | 1 | 1 |
PthXo6 | OsTFX1 | Os09g29820 | | 8 | 2 |
PthXo7 | OsTFIIa | Os01g73890 | | 3 | 2 |
TalC | Os11N3 | Os11g31190 | | 2 | 2 |
PthXo1 | OS8N3 | Os08g42350 | | 20 | 1 |
PthXo3 | Os11N3 | Os11g31190 | | 479 | 240 |
AvrXa7 | Os11N3 | Os11g31190 | | 732 | 324 |
Ranks of known TAL effector target sites among the TALgetter predictions in independent cross validation-like experiments (TALgetter CV) and for the final version used in the web-application and command line program (TALgetter final).
For the known target sites of PthXo3 and AvrXa7, the achieved ranks in both settings are considerably worse. Interestingly, these two TAL effectors contain atypical, long repeats, which might influence the overall binding of the TAL effector. Since such atypical repeats are not specifically modelled by TALgetter, this might explain the poor ranks of the true target sites of PthXo3 and AvrXa7. However, once the impact of long repeats on TAL effector binding is understood, it could be implemented in the modular structure of TALgetter. Notably, the same effect can also be observed for Target Finder, where the known target sites of PthXo3 and AvrXa7 obtain ranks 558 and 1543, respectively. We compare the prediction accuracy of TALgetter and Target Finder in more detail in the next section.
In the first part of this section, we consider public gene expression microarray data of
We compare the prediction accuracy of TALgetter to that of Target Finder of the TALE-NT suite using the public gene expression microarray data of
For each of the experiments, we obtain the RVD sequences of the TAL effectors expressed by the
For the following comparisons, we only consider those genes that are represented by at least one probe set on the Rice 57k microarray, since we have no knowledge about the regulation of genes that are not on the chip. For Target Finder, we basically use the default parameters with two exceptions: First, we switch off the option to scan the reverse complementary sequence, because natural TAL effectors act as transcriptional activators only if the activation domain is oriented towards the downstream gene. Second, we examine the variant of Target Finder filtering for a T at position
In Target Finder, better predictions are assigned lower scores, where the score is basically the negative log-likelihood obtained from the position weight matrix model. By default, Target Finder includes all putative target sites into its predictions that yield a score that is at most 3 times the best possible score for the current TAL effector. However, the number of predicted target sites varies greatly between different TAL effectors. For instance, Target Finder reports only 7 target sites for TAL effector XOO2160_MAFF, whereas more than 1.4 million target sites are predicted for XOCORF_1565 using the default threshold. In contrast, TALgetter predicts between 6 and 314 target sites for the TAL effectors considered using the default threshold of
We use as a measure of accuracy the number of genes with predicted target sites for a specific rank cutoff that are also up-regulated according to the microarray experiment. Since the number of up-regulated genes only depends on the threshold on the log-fold change in the given experiment, this number is proportional to the recall. Since the number of predicted target genes is limited by the rank cutoff and almost equal for the different tools, it is also roughly proportional to the precision. Hence, we consider it a suitable measure of the overall performance of a prediction tool.
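The accuracy measure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `supported_predictions` and the input layout (a ranked list of target genes and a dictionary of log-fold changes) are hypothetical, and "up-regulated" is taken to mean a log-fold change strictly greater than the threshold.

```python
def supported_predictions(ranked_genes, log_fold_change, rank_cutoff, lfc_threshold=1.0):
    """Count distinct genes among the top predictions that are up-regulated.

    ranked_genes: target genes of the predicted sites, best prediction first.
    log_fold_change: gene -> log-fold change of target versus control;
                     genes without expression data are never counted.
    """
    top_genes = set(ranked_genes[:rank_cutoff])
    return sum(
        1 for gene in top_genes
        if log_fold_change.get(gene, float("-inf")) > lfc_threshold
    )


# Hypothetical example with made-up gene identifiers and log-fold changes.
ranking = ["g1", "g2", "g3", "g1", "g4"]
lfc = {"g1": 2.0, "g2": 0.3, "g3": 1.5, "g4": 3.0}
top3 = supported_predictions(ranking, lfc, rank_cutoff=3)
top5 = supported_predictions(ranking, lfc, rank_cutoff=5)
```

Since every tool contributes the same number of predictions at a given rank cutoff, a larger count corresponds to both a higher recall and a higher precision at that cutoff.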
The results of this evaluation procedure are presented in
We consider as performance measure the number of predicted targets that are supported by up-regulation according to gene expression data after
For each rank cutoff (10, 20, 50, 100), we count the number of data sets where a prediction program outperforms the other (bars colored identical to program), or both score equally well (bars colored gray).
First, we compare the prediction accuracy of TALgetter to that of Target Finder filtering for a T at position 0 (Target Finder T) using a rank cutoff of 10. We find that for the data sets BAI3, MAFF311018, PXO86, PXO99, and PXO99AME2, TALgetter (green bars) predicts more target sites that are present in promoters of genes which are also transcriptionally induced in the corresponding microarray studies than Target Finder T (red bars). For two experiments, namely XOC and XOO, the number of recovered genes is equal for both tools. Finally, Target Finder T finds more up-regulated genes than TALgetter for the data set PXO99AME1. We find the corresponding aggregate values in the first column of
This picture is similar for the comparison to Target Finder filtering for a T or a C at position 0 (Target Finder T/C) using the same rank cutoff. For 4 data sets, namely MAFF311018, PXO86, PXO99, and XOC, TALgetter recovers more up-regulated genes than Target Finder T/C. We observe the opposite only for the data set PXO99AME1, while for the remaining three data sets both tools score equally well.
As can be observed from
For a rank cutoff of 100, which for instance already results in a total number of 1600 predictions considered for the 16 TAL effectors of the data set PXO99, TALgetter and Target Finder T achieve a comparable number of up-regulated genes. In contrast, Target Finder T/C still finds fewer up-regulated genes than TALgetter for 3 out of the 8 data sets, whereas the opposite is true for none of the data sets.
Summarizing the results on all data sets considered, we may state that TALgetter shows a slightly improved overall prediction performance compared to Target Finder using the number of predicted targets consistent with up-regulation after
Since the results of the comparison might be specific to the threshold on the fold-changes chosen to determine up-regulated genes, we repeat this analysis for a threshold of 0.5 on the log-fold change. The outcome of this evaluation is presented in
The relevance of a novel approach for TAL effector target site prediction also depends on the number of additional targets that we gain compared to previous approaches. Therefore, we investigate if the predictions of TALgetter and Target Finder are largely overlapping or if both approaches rather predict complementary target sites. We consider the overlap of predicted target genes between TALgetter and the two variants of Target Finder in
Notably, the Venn diagrams reflect that TALgetter on the one hand accepts any nucleotide at position 0, resulting in an exclusive overlap with Target Finder T/C, but on the other hand learned a strong preference for nucleotide T at that position (cf. section
For all data sets, the number of target genes exclusively recovered by TALgetter is equal to or greater than the number of exclusive targets of each of the Target Finder variants. For instance, TALgetter predicts 41 additional putative TAL effector targets for the data set MAFF311018 that would not have been predicted by any of the Target Finder variants. In case of PXO99AME2, TALgetter exclusively predicts 4 target genes, whereas only 1 target gene is predicted by Target Finder but not by TALgetter. From an experimental perspective, these exclusively predicted targets underline the scientific value of TALgetter, since these may include virulence targets that would have been missed using existing approaches.
As an independent validation data set with a reduced number of side-effects of the infection, we consider the microarray experiment of
The results of the evaluations on this data set are presented in
We consider as performance measures the number of predicted targets consistent with up-regulation in gene expression microarrays for different rank cutoffs and thresholds of 1 and 0.5 on the log-fold changes (left panel), and the precision-recall (PR) curve (right panel) using a threshold of 1 on the log-fold changes.
In contrast to the previous microarray data for
Hence, we additionally plot a precision-recall curve of the predictions of the three tools. To this end, we classify all genes with a log-fold change greater than 1 as targets and all genes with an absolute log-fold change of less than 0.5 as non-targets. We plot the precision-recall curves of TALgetter and the two Target Finder variants using a varying threshold on the prediction scores in the right panel of
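The construction of such a precision-recall curve can be sketched as follows. This is an illustration only: the function name `precision_recall_points` and the input layout are hypothetical, and the sketch assumes that larger prediction scores are better, so scores of tools that assign lower scores to better predictions would have to be negated first.

```python
def precision_recall_points(scores, log_fold_change, target_lfc=1.0, nontarget_abs_lfc=0.5):
    """Return (recall, precision) pairs for a decreasing threshold on the scores.

    Genes with log-fold change > target_lfc are targets, genes with absolute
    log-fold change < nontarget_abs_lfc are non-targets, all others are ignored.
    """
    labelled = []
    for gene, score in scores.items():
        lfc = log_fold_change.get(gene)
        if lfc is None:
            continue
        if lfc > target_lfc:
            labelled.append((score, True))
        elif abs(lfc) < nontarget_abs_lfc:
            labelled.append((score, False))
    labelled.sort(key=lambda item: item[0], reverse=True)
    n_targets = sum(1 for _, is_target in labelled if is_target)
    if n_targets == 0:
        return []
    points = []
    true_pos = false_pos = 0
    i = 0
    while i < len(labelled):
        # Process all genes sharing the same score as one threshold step.
        j = i
        while j < len(labelled) and labelled[j][0] == labelled[i][0]:
            if labelled[j][1]:
                true_pos += 1
            else:
                false_pos += 1
            j += 1
        points.append((true_pos / n_targets, true_pos / (true_pos + false_pos)))
        i = j
    return points
```

Genes with an intermediate log-fold change are left out on purpose, so that neither class is contaminated by ambiguous cases.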
Summarizing the comparison of TALgetter to Target Finder, the assessment of prediction accuracy presented in this section demonstrates that TALgetter yields an improved overall prediction performance compared to Target Finder. TALgetter uniquely predicts many targets that are not predicted by Target Finder, which diversifies the types of TAL effector target sites that can now be discovered by computational approaches. We focus on the binding specificities learned by TALgetter and the properties of target sites predicted by TALgetter in the following sections. However, depending on the goal of a study, it might also be worthwhile to use both TALgetter and Target Finder for predicting putative target sites and to combine the predictions of both approaches. For instance, the union of the predictions of both approaches might cover a broader range of target sites, since we observe a considerable number of unique predictions for all approaches.
We visualize the binding specificities and importances of the different RVDs in
(A) The nucleotide preferences for position 0 are visualized as a sequence logo. (B) The binding specificities for the different RVDs are plotted in analogy to sequence logos as well, whereas the probabilities
Turning to the binding specificities of RVDs, we find the highest specificities for HD (C), NG (T), NH (G), NI (A), and NK (G). This is in accordance with the experimentally determined DNA-specificities of these RVDs
For HD, NG, NH, and NI, we also find a high importance. Hence, these RVDs are highly specific and mismatches according to the binding specificities are hardly tolerated. Other RVDs with a notably high importance are HN, NN, NP, NS, and NT, although these RVDs are less specific than the other high-importance RVDs.
As noted in the introduction, the concept of importance of RVDs is related but not identical to the efficiencies of RVDs as proposed by Streubel
The two RVDs classified as strong, namely HD and NN, also receive the highest importance in the TALgetter model. The intermediate RVDs NS and NH are assigned a fairly high importance as well, whereas the remaining intermediate RVDs, namely NP, HN, and NT, receive a lower importance. The RVDs NG and NI are assigned an importance comparable to that of the intermediate RVD NP, although these RVDs are classified as weak according to their efficiency. This result may be an effect of the related but different concepts of efficiency and importance, which we discuss in the following. An RVD with a low efficiency might prevent transcriptional activation in general, whereas a low importance has the effect that the binding specificities modeled by TALgetter for this RVD have a reduced influence on the overall score. An RVD with high efficiency has a strong positive influence on the transcriptional activation, whereas the contribution of an RVD with a high importance to the total score highly depends on the specificity. Hence, importance affects the penalty that is imposed if the binding specificity is not fulfilled, i.e., a nucleotide with a low probability for a specific RVD is present in a target site.
For some RVDs (HN, NN, NS, NT, N*), we observe a preference for more than one nucleotide, where we recognize gradually decreasing specificities. The most prominent example of this class of RVDs is N*, where we find a preference for C with a probability of 0.693, followed by T with a probability of 0.272, and very low probabilities for A and G.
The RVD N* has experimentally been determined to specify for T and C with preference for T
For RVDs NC, YG, HH, IS, SS, NV, and S*, a uniform preference was estimated by TALgetter, since these RVDs are neither present in the training data nor do we have prior knowledge about their binding preference. However, under the assumption that only amino acid 13 of a repeat defines binding preference, we can overcome this issue by estimating a common binding preference for all RVDs with the same 13th amino acid. In this case, the probability
The parameters estimated with these modifications are visualized in
(A) The nucleotide preferences for position 0 are visualized as a sequence logo. (B) The binding specificities given the different RVDs are plotted in analogy to sequence logos as well, whereas the probabilities
To investigate if the modified binding preferences influence the prediction accuracy of TALgetter13 compared to TALgetter, we repeat the assessment of prediction accuracy in complete analogy to the previous comparison to Target Finder. The results of this comparison are presented in supplementary
The variable importance of RVDs according to the parameters learned by TALgetter strengthens the observation that RVDs can differ in their efficiency and contribution to overall TAL effector function
In the following, we investigate if TAL effector target sites are located at a preferred distance from either the start codon or the transcription start site (TSS) of target genes. To this end, we scan broader regions of upstream sequences, which span from 1 kb upstream of the transcription start site to the start codon as described in section
We analyze the collected target site positions by conducting a kernel density estimation with a box kernel and a bandwidth of 100. In
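A kernel density estimation with a box kernel can be sketched as follows. This is a minimal illustration: the function name `box_kernel_density` is hypothetical, and the bandwidth is interpreted as the half-width of the box, which is an assumption about the convention used.

```python
def box_kernel_density(positions, x, bandwidth=100.0):
    """Estimate the density at x with a box (uniform) kernel.

    Every observed position within `bandwidth` of x contributes equally;
    dividing by 2 * bandwidth * n makes the estimate integrate to 1.
    """
    if not positions:
        raise ValueError("at least one position is required")
    inside = sum(1 for p in positions if abs(x - p) <= bandwidth)
    return inside / (2.0 * bandwidth * len(positions))


# Hypothetical distances of predicted target sites to the start codon (in bp).
example_positions = [0, 50, 300]
density_at_0 = box_kernel_density(example_positions, 0)
density_at_300 = box_kernel_density(example_positions, 300)
```

Evaluating this function on a grid of positions separately for the positive and the negative set would yield two curves that can be compared directly.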
The estimated density of positions from the positive set is plotted as a green line, while the density of the negatives is plotted in red. The whiskers indicate the bandwidth of the box kernel used to smooth the curves in a kernel density estimation. The green points at the bottom of the plots represent the distribution of positions from the positive set along the x-axis, where the points are distributed randomly in y-direction to make individual points distinguishable.
Considering the relative position to the start codon, we find a clear enrichment of positive target sites compared to negative target sites in a region reaching from the start codon approximately 300 bp upstream. At regions farther than 400 bp upstream of the start codon, the density of false positive predictions according to the microarray experiments is consistently greater than the density of the true positives.
We also find a pronounced positional preference relative to the transcription start site as can be observed from the right panel of
We repeat this analysis for rank cutoffs of 100 and 500 with highly similar results (data not shown).
In the following, we investigate whether this strong positional preference may be exploited to reduce the number of false positive predictions. To this end, we use TALgetter to predict target sites in two modified sets of upstream sequences. First, we predict target sites in the 300 bp upstream sequences relative to the start codon of all
In supplementary
In
As a first core promoter element, we consider the canonical TATA-box with consensus TATAWA
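Searching a promoter sequence for the consensus TATAWA, where W denotes A or T in the IUPAC nucleotide code, can be sketched as follows. This is an illustration only: the function names `find_tata_boxes` and `overlaps` are hypothetical, positions are 0-based, and the sketch does not restrict the search to any particular region relative to the transcription start site.

```python
import re

# Lookahead so that overlapping occurrences of the consensus are all reported.
TATA_PATTERN = re.compile(r"(?=TATA[AT]A)")
TATA_LENGTH = 6


def find_tata_boxes(sequence):
    """Return the 0-based start positions of all matches to TATAWA."""
    return [match.start() for match in TATA_PATTERN.finditer(sequence.upper())]


def overlaps(site_start, site_length, box_start, box_length=TATA_LENGTH):
    """True if a target site and a TATA-box share at least one position."""
    return site_start < box_start + box_length and box_start < site_start + site_length
```

Given the start and length of a predicted target site, `overlaps` can then be used to decide whether the site covers a putative TATA-box in the same promoter.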
We find a canonical TATA-box in the promoters of 1445 of the 10903 unique predicted target genes among the top 200 predictions for all TAL effectors considered. This is in good accordance with the rate of 14% reported by Bernard
However, TATA-containing genes might be generally enriched in the set of up-regulated genes regardless of TAL effector target sites. Since genes containing a canonical TATA-box are often highly expressed, this could be an effect of the required log-fold change of 1 for experiments 24 hpi. Indeed, the enrichment of TATA-containing genes among all up-regulated genes is highly significant (
Nonetheless, there could be a functional relationship between transcriptional activation by a subset of TAL effectors and the presence of a TATA-box. For instance, some TAL effectors might substitute the TATA binding protein and acquire the transcriptional machinery independently. The latter explanation is supported by known TAL effector target sites that overlap a TATA-box including the known target sites of AvrBs3
In the subset of 142 positive target sites in TATA-containing genes, we find 40 target sites (28%) that overlap the putative TATA-box. In contrast, only 134 of the 1303 (10%) TATA-related negative predictions overlap with the TATA-box. This enrichment is significant, yielding a p-value of
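An enrichment of this kind can be assessed from a 2x2 contingency table. The passage does not state which test was used, so the one-sided Fisher's exact test below is an assumption, and its result need not match the reported p-value; the function name `fisher_exact_greater` is hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns the probability of observing at least `a` counts in the upper-left
    cell under the hypergeometric null model with fixed margins.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denominator = comb(total, col1)
    k_max = min(row1, col1)
    numerator = sum(
        comb(row1, k) * comb(row2, col1 - k) for k in range(a, k_max + 1)
    )
    return numerator / denominator


# Counts from the text: 40 of 142 positive and 134 of 1303 negative predictions
# overlap the TATA-box, which gives the table [[40, 102], [134, 1169]].
p_value = fisher_exact_greater(40, 102, 134, 1169)
```

The margins of the table are kept fixed, so the test asks how surprising 40 or more overlapping sites among the 142 positives would be if overlap were unrelated to up-regulation.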
Interestingly, 26 of the 40 TATA-overlapping target sites directly start with the TATA-box, and 12 overlap the TATA-box with an offset of 2, i.e., start with nucleotides 3 to 6 of the TATAWA consensus. Of the remaining two predicted target sites, one overlaps the TATA-box with an offset of 4, and one contains the TATA-box in the middle of the target site. The large number of TAL effector target sites overlapping TATA-boxes might entail an evolutionary advantage for
In addition to the canonical TATA-box, we also examine the enrichment of TATA-variants
We might suspect that the relationship to core promoter elements, especially the observed overlap of predicted target sites with the canonical TATA-box, is the only reason for the positional preference of TAL effector target sites described in the previous section. However, if we remove all genes that contain a canonical TATA-box, a TATA-variant, or a TC-box from the sets of positive and negative genes, and repeat the kernel density estimation for the remaining sets, the overall picture remains unchanged (cf. supplementary
In
The estimated density of positions from the positive set is plotted as a green line, while the density of the negatives is plotted in red. The green points at the bottom of the plots represent the distribution of positions from the positive set along the x-axis, where the points are distributed randomly in y-direction to make individual points distinguishable.
In summary, we find an enrichment of genes with a promoter containing a canonical TATA-box among the predicted TAL effector targets, but a similar enrichment can be found for all up-regulated genes. Within the subset of TATA-containing genes, the number of target sites that overlap the canonical TATA-box is significantly enriched. The most conclusive explanation of this observation is a functional relationship between transcriptional activation by TAL effectors and the TATA-box.
In the following, we present and discuss a selection of putative target sites of TAL effectors in rice (
No. | TAL effector | Locus | Description | Rank | Support (log-fold change) |
|
|||||
1* | PthXo1 | Os08g42350 | | 1 | PXO99(9.4); PXO99AME2(9.5); XOO(3.3) |
2* | TalC | Os11g31190 | | 1 | BAI3(6.2) |
3 | Tal7b/Tal8b | Os02g30910 | nodulin MtN3 family | 91 | PXO99(4.3) |
|
|||||
4 | AvrXa27/XOO1134_MAFF | Os06g29790 | phosphate transporter 1 | 21 | MAFF311018(2.0); PXO99(2.0); PXO99AME1(1.8) |
5 | Tal6a & XOO2158_MAFF | Os06g29790 | phosphate transporter 1 | 1 | PXO99(2.0); MAFF311018(2.0) |
6 | Tal9d & XOO1132_MAFF | Os10g25310 | | 50;48 | PXO99(3.0); MAFF311018(5.0) |
|
|||||
7* | XOCORF_0460 | Os07g06970 | HEN1 | 1 | XOC(1.9) |
8 | Tal9a & XOO1138_MAFF | Os07g06970 | HEN1 | 1 | MAFF311018(5.2); PXO99(5.1); XOO(2.3) |
|
|||||
9 | Tal7a/Tal8a | Os08g07760 | BRI1-associated receptor kinase | 4 | PXO99(2.8) |
10 | XOO1998_MAFF | Os08g07760 | BRI1-associated receptor kinase | 1 | MAFF311018(1.8) |
11 | XOO2127_MAFF | Os01g50370 | MAPKKK protein kinase | 1 | MAFF311018(4.0) |
|
|||||
12* | PthXo6 | Os09g29820 | bZIP TF domain containing | 2 | PXO99(6.9); PXO99AME2(7.1) |
13* | PthXo7 | Os01g73890 | TFIIA gamma chain | 2 | PXO99(4.6); XOO(1.5) |
14 | Tal9b | Os06g46366 | zinc finger, C3HC4 type | 2 | PXO99(1.3) |
15 | XOO1136_MAFF | Os06g09310 | zinc finger, C3HC4 type | 47 | MAFF311018(2.8) |
16 | Tal2a | Os12g24490 | zinc finger, C3HC4 type | 28 | PXO99(4.0) |
17 | Tal9e/XOO2001_MAFF | Os04g41229 | helix-loop-helix DNA-binding domain | 3 | PXO99(1.6); MAFF311018(2.4) |
18 | XOO2865_MAFF | Os07g48450 | no apical meristem protein | 11 | MAFF311018(3.0) |
19 | Tal5a | Os04g43560 | no apical meristem protein | 2 | PXO99(1.4) |
20 | XOO1996_MAFF | Os04g52810 | no apical meristem protein | 6 | MAFF311018(3.0) |
|
|||||
21 | AvrXa10 | Os08g09040 | Cupin domain containing | 41 | PXO86(4.5) |
22 | AvrXa10 | Os08g09010 | Cupin domain containing | 38 | PXO86(4.1) |
23 | Avrpth3 | Os06g46500 | monocopper oxidase | 7 | XOC(3.0) |
24 | Tal4 & XOO2129_MAFF | Os12g24320 | ATPase 3 | 23;16 | PXO99(2.7); MAFF311018(3.3) |
25 | Tal7b/Tal8b | Os01g40290 | expressed protein | 1 | PXO99(3.8); XOO(1.0) |
26 | Tal9d & XOO1132_MAFF | Os08g05910 | peptide transporter PTR2 | 1 | PXO99(2.3); MAFF311018(2.0) |
List of predicted targets in
No. | TAL effector | Distance to start codon | Distance to TSS | Target site sequence |
| | | | |
1* | PthXo1 | 226 | 79 | |
2* | TalC | 296 | 91 | |
3 | Tal7b/Tal8b | 141 | −35 | |
| | | | |
4 | AvrXa27/XOO1134_MAFF | 290 | 269 | |
5 | Tal6a & XOO2158_MAFF | 332 | 314 | |
6 | Tal9d & XOO1132_MAFF | 241 | 118 | |
| | | | |
7* | XOCORF_0460 | 200 | 10 | |
8 | Tal9a & XOO1138_MAFF | 185 | −1 | |
| | | | |
9 | Tal7a/Tal8a | 257 | −8 | |
10 | XOO1998_MAFF | 255 | −8 | |
11 | XOO2127_MAFF | 142 | 31 | |
| | | | |
12* | PthXo6 | 112 | 31 | |
13* | PthXo7 | 446 | 30 | |
14 | Tal9b | 68 | −47 | |
15 | XOO1136_MAFF | 115 | 70 | |
16 | Tal2a | 82 | −56 | |
17 | Tal9e/XOO2001_MAFF | 53 | −148 | |
18 | XOO2865_MAFF | 355 | 267 | |
19 | Tal5a | 225 | 66 | |
20 | XOO1996_MAFF | 898 | 46 | |
| | | | |
21 | AvrXa10 | 112 | 51 | |
22 | AvrXa10 | 117 | 31 | |
23 | Avrpth3 | 409 | 271 | |
24 | Tal4 & XOO2129_MAFF | 192 | 157 | |
25 | Tal7b/Tal8b | 81 | 32 | |
26 | Tal9d & XOO1132_MAFF | 944 | 56 | |
List of predicted targets in
The first group of targets we consider belongs to the family of nodulin
In addition to these known target sites, we find a novel
The second group of targets also addresses the nutrient supply of the pathogen. We identify several putative TAL effector targets that are related to phosphate metabolism.
The third group of target genes contains
The fourth group of targets is related to signal transmission. The TAL effectors Tal7a/Tal8a and XOO1998_MAFF have overlapping predicted target sites in the promoter of Os08g07760 with rank 4 and 1, respectively, and two different supporting microarray experiments. This target gene encodes a putative brassinosteroid-insensitive1 associated receptor kinase (BAK1) ortholog from rice. BAK1 is a leucine-rich repeat receptor-like kinase that is involved in both brassinosteroid and pathogen signal perception
The fifth and largest group of predicted targets contains genes that are related to transcriptional regulation. AvrBs3 induces the pepper basic helix-loop-helix regulator UPA20 to trigger plant cell enlargement and a hypertrophy phenotype
The sixth group of predicted targets comprises members with diverse functions. AvrXa10 is predicted to induce two target genes (Os08g09040, Os08g09010) and XOO2127_MAFF one target gene (Os08g13440) that encode cupin domain-containing proteins. The cupin superfamily includes functionally diverse proteins that can be involved in transcriptional regulation, seed storage, enzymatic reactions to protect plants from oxidative stresses, and pathogen infection
We also use TALgetter to predict TAL effector target sites for
No. | TAL effector | Locus | Description | Rank | log-fold change |
1 | PthA1 | orange1.1g027210m | LEA hydroxyproline-rich glycoprotein | 10 | 2.2 |
2 | PthA2 | orange1.1g015673m | Tetratricopeptide repeat (TPR)-like superfamily | 17 | 1.4 |
3 | PthA3 | orange1.1g027607m | RAN GTPase 3 | 10 | 1.4 |
4 | PthA4 | orange1.1g026556m | LOB domain-containing 1 | 1 | 5.7 |
List of predicted targets in
No. | TAL effector | Distance from start codon | Target site sequence |
1 | PthA1 | 110 |
2 | PthA2 | 103 |
3 | PthA3 | 700 |
4 | PthA4 | 93 |
List of predicted targets in
We aim to test experimentally whether the target sites predicted by TALgetter are valid targets for the corresponding TAL effector. For this, we analyzed the TAL effector AvrXa10, for which the target specificity has been experimentally verified
(A) RVDs of the TAL effector AvrXa10 and predicted target sites. The optimal box is deduced from the known RVD specificities, while box 6, box 38, box 41, and box 98 are TALgetter AvrXa10 target predictions from rice promoters. Mismatches and non-optimal RVD-base pair combinations are shaded in light and dark grey, respectively. (B) AvrXa10 and Hax3 target boxes are cloned upstream of the minimal
In this paper, we present TALgetter, a new tool for the prediction of TAL effector target sites. TALgetter uses a local mixture model that models binding specificity and importance of RVDs independently. In contrast to previous approaches, the parameters of this model are estimated from training data and, hence, allow for an easy adaptation to new validated target sites.
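The idea of a local mixture that separates binding specificity from RVD importance can be illustrated with a minimal sketch. It assumes that, at each position, the probability of a base is a weighted mix of an RVD-specific distribution and a uniform background, with the RVD's importance as the mixture weight. All parameter values, the uniform background, and the names `SPECIFICITY`, `IMPORTANCE`, and `site_log_score` are hypothetical and chosen for illustration; they are not the parameters or the exact model estimated by TALgetter, and the position preceding the first repeat is ignored.

```python
import math

# Hypothetical binding specificities P(base | RVD); illustrative values only.
SPECIFICITY = {
    "HD": {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    "NG": {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    "NI": {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    "NN": {"A": 0.30, "C": 0.05, "G": 0.60, "T": 0.05},
}

# Hypothetical importance (mixture weight) of each RVD, between 0 and 1.
IMPORTANCE = {"HD": 0.9, "NG": 0.6, "NI": 0.7, "NN": 0.8}

# Assumed uniform background distribution over bases.
BACKGROUND = {base: 0.25 for base in "ACGT"}


def site_log_score(rvds, site):
    """Log-probability of a candidate site under the per-position mixture."""
    if len(rvds) != len(site):
        raise ValueError("RVD sequence and site must have the same length")
    total = 0.0
    for rvd, base in zip(rvds, site):
        weight = IMPORTANCE[rvd]
        # Mix the RVD-specific term with the background term.
        prob = weight * SPECIFICITY[rvd][base] + (1.0 - weight) * BACKGROUND[base]
        total += math.log(prob)
    return total


if __name__ == "__main__":
    rvds = ["HD", "NG", "NI", "NN"]
    for candidate in ("CTAG", "CTAA", "GACT"):
        print(candidate, round(site_log_score(rvds, candidate), 3))
```

Under these assumptions, a mismatch at an RVD with low importance is penalized less than a mismatch at an RVD with high importance, because more of the probability mass comes from the background component.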
We demonstrate that TALgetter is able to identify known TAL effector target sites in rice and we show that TALgetter predicts a greater number of TAL effector targets that are consistent with up-regulation after
Scrutinizing the binding specificities learned by TALgetter, we find that for many RVDs, binding specificities are estimated in accordance with the literature. In addition, we observe gradually decreasing binding specificities for some RVDs, which have also been reported by recent experimental studies. Regarding the concept of RVD importance, we find substantially different parameters for the individual RVDs, which indicates that different RVDs indeed contribute differently to transcriptional activation by TAL effectors.
In subsequent studies using target sites predicted by TALgetter, we discover a strong positional preference of target sites towards the transcription start site (TSS). Most true positive target sites are located within a window from 300 bp upstream to 200 bp downstream of the TSS. We demonstrate that exploiting this positional preference for predicting TAL effector target sites further improves the overall prediction performance of TALgetter. This finding is of general value for the computational prediction of TAL effector target sites, since it may also help to reduce the number of false-positive predictions of other approaches.
We also study the relationship of TAL effector target sites to core promoter elements. We show that a considerable number of target sites overlap with the TATA-box, which indicates that TAL effector binding to the TATA-box, possibly substituting for the TATA binding protein, might constitute one mode of transcriptional activation by TAL effectors. These two findings, positional preference and binding to the TATA-box, reveal new insights into the biology of TAL effector target sites that may aid the understanding of transcriptional activation by TAL effectors. For models to explain this observation, see supplementary
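One simple way to exploit such a positional preference is a hard filter on the distance between a predicted site and the TSS. The sketch below assumes that positive distances denote sites upstream of the TSS and negative distances denote sites downstream of it; this sign convention, the hard cut-off, and the function names are assumptions made for illustration and do not necessarily reflect how TALgetter incorporates positional information.

```python
def within_tss_window(distance_to_tss, upstream=300, downstream=200):
    """True if a site lies within [downstream bp after, upstream bp before] the TSS.

    Assumed sign convention: positive = upstream of the TSS,
    negative = downstream of the TSS.
    """
    return -downstream <= distance_to_tss <= upstream


def filter_predictions(predictions, upstream=300, downstream=200):
    """Keep (label, distance_to_tss) pairs whose distance falls inside the window."""
    return [
        (label, distance)
        for label, distance in predictions
        if within_tss_window(distance, upstream, downstream)
    ]


if __name__ == "__main__":
    # Hypothetical predictions as (label, distance to TSS in bp).
    example = [("site_a", 79), ("site_b", 314), ("site_c", -148), ("site_d", -250)]
    print(filter_predictions(example))
```

A softer alternative would be to down-weight rather than discard predictions outside the window, which keeps distant sites available for manual inspection.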
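Counting how many predicted target sites overlap an annotated TATA-box reduces to an interval-overlap test per promoter. The following minimal sketch assumes half-open coordinates `[start, end)` on a common promoter coordinate system; the gene identifiers, coordinates, and function names are hypothetical and serve only to illustrate the check.

```python
def intervals_overlap(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) share a position."""
    return a_start < b_end and b_start < a_end


def count_tata_overlaps(target_sites, tata_boxes):
    """Count genes whose predicted target site overlaps their annotated TATA-box.

    Both arguments map a gene identifier to a (start, end) tuple;
    genes without a TATA-box annotation are skipped.
    """
    count = 0
    for gene, (site_start, site_end) in target_sites.items():
        if gene not in tata_boxes:
            continue
        box_start, box_end = tata_boxes[gene]
        if intervals_overlap(site_start, site_end, box_start, box_end):
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical coordinates for illustration only.
    sites = {"geneA": (100, 118), "geneB": (40, 60), "geneC": (10, 30)}
    boxes = {"geneA": (110, 116), "geneB": (60, 66)}
    print(count_tata_overlaps(sites, boxes))
```

With half-open intervals, a site that ends exactly where the TATA-box begins is not counted as overlapping.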
Against this background, we discuss predictions of TALgetter in
We make TALgetter available as a web-application at
We thank Heidi Scholze for preparing samples for the