Reader Comments

functional neighbourhoods identified in Table S3

Posted by seb951 on 04 Jul 2012 at 16:52 GMT

I was quite surprised by the results in Table S3, which lists all the functional neighbourhoods (FNs) identified in the present study.

First, the authors claim that, on average, FN windows contain about 50 genes (Table S2). But in Table S3, almost none of the significant windows contain 50 genes. In fact (for Arabidopsis at least), the average number of genes in the significant windows is 7, which indicates a strong tendency for statistically significant tests to be biased towards small sample sizes.

Second, I was surprised that windows with as few as 2 genes could show significant over-representation of GO categories. Take Arabidopsis, for example. It has about 25000 genes in total. Imagine an FN with 2 genes (there are many in Table S3), each annotated with one GO term. As an extreme case, say each of those two GO terms occurs only once in the whole dataset.

The contingency table for GO term #1 would then read:
1 1
2 25000

with a Fisher's exact test p-value of 0.00024. Depending on exactly how the correction for multiple hypothesis testing is done, this can remain significant. But is it biologically significant? Any random subset of a few genes will sometimes be significant in such a scheme. I therefore don't think Fisher's exact test is appropriate here.
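For anyone who wants to check the number, here is a minimal sketch of the calculation. It assumes the four values above are read row by row as a 2x2 table and that the test is one-sided (over-representation); the function name `fisher_exact_greater` is mine and is not taken from the paper. It uses only the Python standard library.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Returns the probability of seeing a count of at least `a` in the top-left
    cell, with all row and column totals held fixed (hypergeometric tail).
    """
    row1 = a + b           # total of the first row
    col1 = a + c           # total of the first column
    total = a + b + c + d  # grand total
    upper = min(row1, col1)  # largest possible top-left count
    favourable = sum(
        comb(col1, k) * comb(total - col1, row1 - k)
        for k in range(a, upper + 1)
    )
    return favourable / comb(total, row1)


# Table from the comment, read row by row (an assumption about its layout).
p_value = fisher_exact_greater(1, 1, 2, 25000)
print(f"{p_value:.5f}")
```

With these inputs the p-value is roughly 6/25000, so it is driven almost entirely by the size of the genome rather than by anything specific to the two genes in the window.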

Of course, I agree that not all FNs fit this example. So what about the ones that contain many genes which all have similar functions? There, Fisher's exact test is much more appropriate. However, even though the paper claims that FNs do not mainly result from tandem duplication, Table S3 does not support this. Take Arabidopsis as an example again. I gathered the protein sequences from an FN (Table S3) and aligned them using ClustalW. I repeated this for about 10-15 randomly chosen FNs. All of them contained many closely related paralogs. This left me puzzled about the true impact of tandem-duplicated genes sharing GO terms through homology.

Given the ease with which FNs are identified (owing both to the nature of Fisher's exact test and to tandem duplication), it is also not surprising that such general GO terms as “Response to biotic stimulus, Response to stress and Localization” are conserved across the phylogeny.

No competing interests declared.