Protocol S1
Position weight matrix error
The identification of binding site loss assumes that the binding site model is correct. While it is difficult to determine if a position weight matrix is correct, a number of observations suggest that inaccuracies in the binding site model cannot explain all of the binding site loss that we observed.
First, if a binding site model is overly specific, this over-specificity is likely limited to a small number of positions in the binding site. As a result, the substitutions that are the basis of the binding site loss prediction will be unequally distributed across the positions of the binding site. For each transcription factor, we compared the distribution of the lineage-specific substitutions used to identify loss to the expected uniform distribution. We limited the analysis to positions with information content greater than 0.5 bits. For 4 of the 91 transcription factors (Phd1, Mot3, Spt2, and Sig1), there is a significant deviation from the uniform expectation (χ² test, p < 0.05). For example, Spt2 is an 11-mer motif with 9 positions of high information content. Positions 2 and 7 account for 57% of the Spt2 loss events, suggesting that these two positions may be over-specified. These four factors show an average of 11.5% loss and account for 124 of the loss events. Removing these four factors from the analysis lowers the estimated rate of loss to 5.5%.
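The positional test described above can be sketched in a few lines. This is an illustrative sketch only, not the code used in the study: the function names and the substitution counts are hypothetical, and the chi-square tail probability is computed with a plain series expansion so that the example has no dependencies.

```python
import math

def information_content(column):
    """Information content in bits of one PWM column given as {base: probability}."""
    return 2.0 + sum(p * math.log2(p) for p in column.values() if p > 0)

def informative_positions(pwm, threshold=0.5):
    """Indices of PWM columns whose information content exceeds the threshold."""
    return [i for i, col in enumerate(pwm) if information_content(col) > threshold]

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with df degrees of freedom,
    via the series for the regularized lower incomplete gamma function."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= z / (a + n)
        total += term
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def uniformity_test(substitution_counts):
    """Chi-square goodness-of-fit test against an equal expected count per position."""
    total = sum(substitution_counts)
    expected = total / len(substitution_counts)
    stat = sum((obs - expected) ** 2 / expected for obs in substitution_counts)
    return stat, chi2_sf(stat, len(substitution_counts) - 1)

# Hypothetical substitution counts at the 9 informative positions of one motif;
# two positions dominate, as in the Spt2 example.
counts = [3, 16, 2, 3, 2, 2, 16, 3, 3]
stat, p_value = uniformity_test(counts)
print(f"chi-square = {stat:.2f}, p = {p_value:.3g}")
```

A factor whose p-value falls below 0.05 would be flagged as possibly over-specified at the positions that carry the excess substitutions.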
Second, we tested for errors in the binding site model by rebuilding the PWM using nucleotide counts from both the conserved and semi-conserved predictions. In this way, we incorporate the loss events as allowable degeneracy into a new binding site model. Because the model is applied to the data from which it was built, there is no sampling error and no over-specification. If the original loss events are the result of inaccuracies in the original model, incorporating the loss events should correct the binding site model. For example, in position 3 of the Msn2/4 model we observed a high substitution rate in semi-conserved sites, with G to A substitutions appearing frequently under the original model. After rebuilding the Msn2/4 position weight matrix, a G to A substitution, while still expected to be rare, is 30 times more likely along the branch leading to S. mikatae under the new matrix. When we re-annotated the original loss events with these models, 41% (11/27) of the Msn2/4 loss events remained annotated as highly significant loss events. An additional loss event is still significant at the less stringent 1% false positive rate. For Ndt80, 50% (7/14) of the loss events are still annotated as loss using the rebuilt PWM, and for Rox1, 64% (7/11) remain. Of the three experimentally confirmed semi-conserved sites, the Rox1 and Msn2 binding sites remain annotated as semi-conserved. The Ndt80 semi-conserved binding site is no longer a significant match to the semi-conserved model. After incorporating the loss events into the binding site model, we conclude that most of the loss events are due to substitutions in positions with stable, and therefore probably correct, nucleotide frequencies.
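The rebuilding step can be illustrated with a small sketch. Everything here is an assumption made for illustration: the site sequences are invented, the pseudocount of 1 is not taken from the study, and the protocol does not specify which species' sequences contribute to the counts.

```python
from collections import Counter

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0):
    """Column-wise nucleotide frequencies from equal-length binding site sequences."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(width):
        counts = Counter(site[i] for site in sites)
        pwm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return pwm

# Hypothetical sites: the semi-conserved ones carry a G to A change at position 3.
conserved = ["AGGGG", "AGGGG", "CGGGG", "AGGGG"]
semi_conserved = ["AGAGG", "AGAGG"]

original = build_pwm(conserved)
rebuilt = build_pwm(conserved + semi_conserved)

# Position 3 of the motif is index 2; A becomes more acceptable in the rebuilt matrix.
print(original[2]["A"], rebuilt[2]["A"])
```

Sites that are still scored as lost under the rebuilt matrix cannot be attributed to degeneracy that the original matrix failed to capture at that position.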
Third, to test the overall effects of the PWM models used in this study, we repeated this analysis using a second set of binding site models [1]. Although these motifs are based on the same underlying chromatin immunoprecipitation dataset, they were defined using a different motif finding algorithm and provide a good test for the effect of the binding site models on our results. In total, using the 101 PWMs and the same statistical thresholds, we identified 21,494 conserved binding sites and 1,758 semi-conserved sites, and estimate that 7.9% of the binding sites have been lost in a lineage-specific manner. The higher frequency of semi-conserved sites can be explained by the slightly higher information content and width of the MacIsaac motifs, which give us additional power to detect binding site loss. Additionally, we find that 51.7% of the loss events have been compensated for by turnover. We have also used these PWMs to reannotate the Yeastract binding site data, and find that 40.3% of the known binding sites are conserved and 1.7% are semi-conserved.
While some of the annotations of individual binding sites do change between datasets, it is clear that the identification of semi-conserved binding sites is not specific to one set of PWMs.
These observations suggest that errors in the PWMs will affect our annotations, but also show that a substantial portion of the semi-conserved sites are unlikely to be explained by noise in the binding site models.
Implementation of semi-conserved algorithm
The equations outlined in the Methods section require a number of parameter choices. First, for the neutral model we used the HKY85 model [2]. We estimated the π parameters from the genomic nucleotide frequencies {A = 0.3, G = 0.2, C = 0.2, T = 0.3}, and used a transition/transversion ratio of 4, which was estimated from synonymous sites in coding sequences using PAML [3]. Additionally, the semi-conserved model requires a step size to integrate over loss events at different positions within the tree. We used a step size of 0.01 substitutions/site. The choice of step size did not affect the results, but does affect the computation time.
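As a rough illustration of these parameter choices, the sketch below builds an HKY85 rate matrix from the stated frequencies and computes substitution probabilities for one 0.01 substitutions/site step. It rests on assumptions that the protocol does not confirm: the ratio of 4 is treated as the HKY85 κ parameter, the matrix is scaled to one expected substitution per site per unit of branch length, and the matrix exponential is a simple scaling-and-squaring routine written for this example.

```python
BASES = "AGCT"
PURINES = {"A", "G"}

def hky85_rate_matrix(pi, kappa):
    """HKY85 rate matrix (rows and columns in BASES order), scaled so that the
    expected rate at equilibrium is 1 substitution/site per unit time."""
    q = {a: {} for a in BASES}
    for a in BASES:
        for b in BASES:
            if a != b:
                is_transition = (a in PURINES) == (b in PURINES)
                q[a][b] = pi[b] * (kappa if is_transition else 1.0)
        q[a][a] = -sum(q[a].values())
    rate = -sum(pi[a] * q[a][a] for a in BASES)
    return [[q[a][b] / rate for b in BASES] for a in BASES]

def mat_mul(x, y):
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def transition_probabilities(q, t, squarings=10, terms=12):
    """P(t) = exp(Q t), by a truncated Taylor series followed by repeated squaring."""
    n = len(q)
    scaled = [[q[i][j] * t / 2 ** squarings for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

pi = {"A": 0.3, "G": 0.2, "C": 0.2, "T": 0.3}  # genomic frequencies from the text
Q = hky85_rate_matrix(pi, kappa=4.0)           # kappa = 4 is an assumed reading of the ratio
P_step = transition_probabilities(Q, 0.01)     # one integration step of 0.01 subs/site
for base, row in zip(BASES, P_step):
    print(base, [round(v, 5) for v in row])
```

Under this parameterization, transitions (A↔G, C↔T) are more probable than transversions over a short branch, and over a very long branch each row converges to the equilibrium frequencies.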
For a given PWM, we calculate the likelihood of the intergenic sequences under the neutral, conserved, or semi-conserved models using the following:
For each of the top 2000 putative binding sites S {
    /* The top binding sites are predefined by ranking all sites by their two best log-odds scores from the 4 species. */
    Neutral = 0;      // log-likelihoods are accumulated as sums, so they start at 0
    Conserved = 0;
    for i = 0 to PWM->width - 1 {
        n = 0;
        c = 0;
        for a in {A,C,G,T} {
            // Neutralmodel is HKY85; the tree (topology and branch lengths) is reached through Root
            n += f(a) * Felsenstein_prune(Root->left, S, i, Neutralmodel, a)
                      * Felsenstein_prune(Root->right, S, i, Neutralmodel, a);
            // TFBSmodel is the scaled rate matrix (Equation 7)
            c += PWMi(a) * Felsenstein_prune(Root->left, S, i, TFBSmodel, a)
                         * Felsenstein_prune(Root->right, S, i, TFBSmodel, a);
        }
        Neutral += log(n);
        Conserved += log(c);
    }
    // Semiconserved already multiplies over all positions, so it is called once per site
    Semi = log(Semiconserved(Tree, S, Neutralmodel, TFBSmodel, PWM));
    report the 3 log-likelihoods (Neutral, Conserved, Semi)
}
sub Semiconserved (Tree T, Alignment S, Model N, Model TFBS, PWM pwm) {
    L = 1;
    for i = 0 to pwm->width - 1 {
        totalP = 0;
        for t = 0 to TotalDistance in increments of Step_Size {
            /* The branches are concatenated. If branch 1 is 0.1 subs/site long, t = 0.11 is the start of branch 2. */
            NewRoot = reroot(T, t);
            p = 0;
            for a in {A,C,G,T} {
                p += PWMi(a) * Felsenstein_prune(NewRoot->left, S, i, TFBS, a)
                             * Felsenstein_prune(NewRoot->right, S, i, N, a);
            }
            totalP += Step_Size/TotalDistance * p;
        }
        L *= totalP;
    }
    // When rerooting the tree, only some of the branches within the original tree actually change their topology. Only those branches affected by rerooting need to be recalculated.
    return L;
}
sub Felsenstein_prune (Node N, Alignment S, int n, Model M, Char a) {
    // Use the pruning algorithm with Equation 2.
    if (N is a tip of the tree) {
        if (a == observed nucleotide at position n of the sequence at node N) {
            return 1;
        } else {
            return 0;
        }
    } else {
        pLeft = 0;
        pRight = 0;
        for b in {A,C,G,T} {
            pLeft += P(b | a, N->left_length, M) * Felsenstein_prune(N->left, S, n, M, b);
            pRight += P(b | a, N->right_length, M) * Felsenstein_prune(N->right, S, n, M, b);
        }
        return pLeft * pRight;
    }
}
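For readers who want to run the recursion, the pruning routine can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the study: the tree is a hypothetical nested tuple (left, left_length, right, right_length) with species names as tips, the branch lengths are invented, and the Jukes-Cantor model stands in for the HKY85 and TFBS models so that the example stays self-contained. Unlike the listing above, the root is handled by a single call that covers both child branches.

```python
import math

BASES = "ACGT"

def jc69(a, b, t):
    """Jukes-Cantor probability of base b after branch length t, starting from base a.
    A stand-in for P(b | a, length, M) in the listing."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def prune(node, column, a, prob=jc69):
    """Likelihood of the bases observed below `node`, given base `a` at `node`."""
    if isinstance(node, str):  # tip: the node is a species name
        return 1.0 if column[node] == a else 0.0
    left, left_len, right, right_len = node
    p_left = sum(prob(a, b, left_len) * prune(left, column, b, prob) for b in BASES)
    p_right = sum(prob(a, b, right_len) * prune(right, column, b, prob) for b in BASES)
    return p_left * p_right

def site_likelihood(root, column, root_freqs, prob=jc69):
    """Sum over root states, weighted by root frequencies (background or a PWM column)."""
    return sum(root_freqs[a] * prune(root, column, a, prob) for a in BASES)

def log_likelihood(root, columns, freqs_per_position, prob=jc69):
    """Sum of per-position log-likelihoods, starting from 0."""
    return sum(math.log(site_likelihood(root, col, freqs, prob))
               for col, freqs in zip(columns, freqs_per_position))

# Hypothetical four-species tree with invented branch lengths (substitutions/site).
tree = ((("Scer", 0.1, "Spar", 0.1), 0.05, "Smik", 0.2), 0.1, "Sbay", 0.3)
background = {b: 0.25 for b in BASES}
columns = [
    {"Scer": "G", "Spar": "G", "Smik": "A", "Sbay": "G"},
    {"Scer": "T", "Spar": "T", "Smik": "T", "Sbay": "T"},
]
print(log_likelihood(tree, columns, [background, background]))
```

Passing a PWM column as the root frequencies and a binding-site substitution model as `prob` would correspond to the conserved model in the listing; the semi-conserved model additionally needs the rerooting step, which is not sketched here.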
1. MacIsaac KD, Wang T, Gordon DB, Gifford DK, Stormo GD, et al. (2006) An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC Bioinformatics 7: 113.
2. Hasegawa M, Kishino H, Yano T (1985) Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol 22: 160-174.
3. Yang Z (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 13: 555-556.
Scott Doniger