The authors have declared that no competing interests exist.
Conceived and designed the experiments: NCI YEM ES. Analyzed the data: NCI YEM AS. Contributed reagents/materials/analysis tools: YR RA LIS. Wrote the paper: NCI YEM AS ES.
Organism cells proliferate and die to build, maintain, renew and repair it. The cellular history of an organism up to any point in time can be captured by a cell lineage tree in which vertices represent all organism cells, past and present, and directed edges represent progeny relations among them. The root represents the fertilized egg, and the leaves represent extant and dead cells. Somatic mutations accumulated during cell division endow each organism cell with a genomic signature that is unique with a very high probability. Distances between such genomic signatures can be used to reconstruct an organism's cell lineage tree. Cell populations possess unique features that are absent or rare in organism populations (e.g., the presence of stem cells and a small number of generations since the zygote) and do not undergo sexual reproduction, hence the reconstruction of cell lineage trees calls for careful examination and adaptation of the standard tools of population genetics. Our lab developed a method for reconstructing cell lineage trees by examining only mutations in highly variable microsatellite loci (MS, also called short tandem repeats, STR). In this study we use experimental data on somatic mutations in MS of individual cells in human and mice in order to validate and quantify the utility of known lineage tree reconstruction algorithms in this context. We employed extensive measurements of somatic mutations in individual cells which were isolated from healthy and diseased tissues of mice and humans. The validation was done by analyzing the ability to infer known and clear biological scenarios. In general, we found that if the biological scenario is simple, almost all algorithms tested can infer it. Another somewhat surprising conclusion is that the best algorithm among those tested is Neighbor Joining where the distance measure used is normalized absolute distance. We include our full dataset in
The history of an organism's cells, from a single cell until any particular moment in time, can be captured by a cell lineage tree. Many fundamental open questions in biology and medicine, such as which cells give rise to metastases, whether oocytes and beta cells renew, and what is the role of stem cells in brain development and maintenance, are in fact questions about the structure and dynamics of that tree. Random mutations that occur during cell division endow each organism cell with an almost unique genomic signature. Distances between signatures capture distances in the cell lineage tree, and can be used to reconstruct that tree. On this basis, our lab developed a method for cell lineage reconstruction utilizing a panel of about 120 microsatellites. In this work, we use a large dataset of microsatellite mutations from many cells that we collected in our lab in the last few years, in order to test the performance of different distance measures and tree reconstruction algorithms. We found that the best method is not the one that gives the most accurate estimates of the mean distance, but rather the one with the lowest variance.
A multi-cellular organism develops from a single cell – the zygote, through cell division and cell death, and displays an astonishing complexity of trillions of cells of different types, residing in different tissues and expressing different genes. The development of an organism from a single cell until any moment in time can be captured by a mathematical entity called a cell lineage tree
Uncovering the human or even the mouse cell lineage tree may help to resolve many open fundamental questions in biology and medicine, as illustrated by our earlier work
In the past few years, our lab developed a method for reconstructing the lineage relations among cells of multi-cellular organisms
Besides the use of MS to reconstruct cell lineage trees which was proposed also by others
While phylogenetic lineage tree reconstruction of cells is similar to that of organisms and species, it also has unique characteristics such as the existence of stem cells (that influence the shape of the tree), a sometimes very shallow tree (in the order of dozens of generations), a dramatic variation in the number of divisions the cells have undergone since the zygote (which is much larger than what exists in species with different evolutionary paces), and the fact that the cells have undergone binary cell divisions. Besides the last feature, which has been widely investigated in population genetics
The goal of this work is to test the existing algorithms on experimental cell data and to validate their use. In addition we want to test which of these methods (even though not developed for the purpose of cell lineage reconstruction) performs best.
In order to accomplish this goal, we took experimental data from clearly known biological expectations and examined whether the reconstructed trees present this knowledge. The data was obtained by isolating cells from different mice and humans, and extracting their genomic signatures (see
As mentioned earlier the goal of this work is to validate and quantify the ability to reconstruct a cell lineage tree utilizing genomic signatures of individual cells that record mutations in microsatellites. Since precise inference of tree topology cannot be accomplished using our current limited number of loci (See
The tree reconstruction algorithms that we used are Neighbor Joining (NJ)
In this section we checked the clustering quality of the tree reconstruction methods. An example for a case where cells from different individuals are clustered distinctively on the tree can be seen in
Two reconstructed lineage trees are shown: (A) A lineage tree containing cells from three mice, M3 (blue), M5 (pink) and M6 (green). (B) A lineage tree containing cells from seven humans, H1 (blue), H2 (red), H3 (orange), H4 (pink), H5 (green), H6 (purple) and H7 (turquoise). The root of the trees (colored in black) is the weighted mean of the organisms' putative zygotes. The trees were reconstructed using the NJ algorithm along with the Normalized-Absolute distance measure.
However, due to the many types of noise existing in the system, such a distinct clustering is not likely to happen in all cases, especially if the individuals are related to each other (as in the case of some of the experimental mice), and their zygotes are genetically close. In such cases, due to the small panel of MS used, cells from different individuals can randomly accumulate mutations that reduce the genetic distance between them, and may become closer to each other than to other cells from the same individual. This effect depends on the ratio between the genetic distance between the zygotes of the mice and the number of divisions the cells in each mouse underwent. We showed via computer simulation (
Our panel contains ∼120 loci but we prefer to ignore loci where there are allelic dropouts- i.e. where there is amplification failure of one of the two heterozygous alleles while the other allele successfully amplifies, which may often be misinterpreted and lead to errors in allele size determination. We thus used an average of about 80 loci per cell. In addition, those 80 loci can be different between distinct cells, so the actual number of loci used for the distance calculation can be even smaller (with a minimum limit of 25). We showed that when the ratio is 0.2 and the mutation rate is 1/100, a separation of 90% can be achieved using at least 200 loci. An example of such a case is shown in
A reconstructed lineage tree containing cells from six mice, M1 (turquoise), M2 (red), M3 (blue), M7 (yellow), and M8 (purple). The root of the trees (colored in black) is the weighted mean of the organisms' putative zygotes. The trees are reconstructed using the NJ algorithm along with the Normalized-Absolute distance measure.
Nevertheless, not all the algorithms and distance measures suffer from this problem to the same degree. This may be due to the fact that some measures describe the mutational process more accurately, or due to some other robustness feature of the algorithm. An example for a different performance between different metrics is shown in
Two reconstructed lineage trees containing the same cells from three different mice are shown: M1 (turquoise), M3 (blue), and M4 (orange). (A) A tree using the NJ algorithm along with the Normalized-Absolute distance measure. (B) A tree using the NJ algorithm along with the Equal or Not distance measure. The root of the trees (colored in black) is the weighted mean of the organisms' putative zygotes. It can be seen that the performance of the Normalized-Absolute distance measure is clearly better.
The difference between methods in the clustering separation of cell groups necessitates the quantification of their performance, in order to determine which method is the best (if any). In order to do so, we used three measures to quantify the clustering quality of distinct groups: the Quality of the Largest Cluster (QLC) from each group, the Tree Entropy (TE), and the probability of getting such a cluster under the assumption of hyper geometric sampling (HS). (A detailed explanation of the measures is given in the
We analyzed the performance of the methods on all the information that we have available: cells from nine mice and seven humans, each containing a few types of cells (
We started with rough wide-scale tests, reconstructing all the optional combinations of the datasets, regardless of whether different individuals share the same types of cells. We then continued with more refined tests, which pay attention to the specific types of cells, and used datasets containing only one type of cell. From the nine mice that we have, we could produce
Examples of the results on a few specific datasets are presented in
# | Name | Score | NJ- | NJ- | NJ- | NJ- | NJ-SMM | NJ-SMM | NJ-SMM | NJ-MMM | NJ-MMM | NJ-MMM |
ABS | NormABS | EqualOrNot | EUC | equal rates | monoDi rates | lenDep rates | equal rates | monoDi rates | lenDep rates | |||
M1M2M3M4M5M6M7M8M9 | QLC | 0.347823 | 0.275611 | 0.37813 | 0.301447 | 0.371728 | 0.350756 | 0.368669 | 0.365139 | 0.30166 | 0.287746 | |
All cell types | TE | 84.40917 | 60.78266 | 101.2351 | 97.113 | 92.00133 | 92.45571 | 122.3109 | 102.0853 | 93.23476 | 112.3511 | |
HS | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||
M1M3M4M5M6 | QLC | 0.682521 | 0.705587 | 0.496068 | 0.487241 | 0.703336 | 0.610997 | 0.633189 | 0.723407 | 0.731626 | 0.674685 | |
All cell types | TE | 70.22079 | 67.6069 | 138.9407 | 83.75034 | 94.94297 | 108.827 | 115.5496 | 98.87133 | 95.21636 | 95.26127 | |
HS | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||
M1M2M6 | QLC | 0.753549 | 0.945355 | 0.553541 | 0.633164 | 0.78515 | 0.74055 | 0.674994 | 0.638117 | 0.642526 | 0.802127 | |
All cell types | TE | 105.9908 | 14.81306 | 201.029 | 250.6545 | 147.167 | 103.9354 | 116.4198 | 170.2311 | 163.8052 | 96.83945 | |
HS | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||
M1M2M3M9 | QLC | 0.522825 | 0.761499 | 0.353106 | 0.364581 | 0.568366 | 0.478402 | 0.520778 | 0.497532 | 0.498754 | 0.522246 | |
All cell types | TE | 234.5802 | 98.08597 | 296.6947 | 333.6574 | 216.7348 | 254.1092 | 202.1498 | 237.8789 | 218.5638 | 205.0351 | |
HS | 0 | 0 | 0 | 0.000288 | 0 | 0 | 0 | 0 | 0 | 0 | ||
M1M4 | QLC | 0.805541 | 0.99087 | 0.676715 | 0.738966 | 0.811159 | 0.811159 | 0.996552 | 0.996552 | 0.811159 | 0.996552 | |
All cell types | TE | 98.5246 | 10.80349 | 366.0483 | 253.7946 | 128.6547 | 97.8306 | 5.655992 | 5.655992 | 102.336 | 5.655992 | |
HS | 0 | 0 | 0.000127 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||
M1M9 | QLC | 1 | 1 | 0.747105 | 0.841311 | 0.988636 | 0.977273 | 1 | 1 | 1 | 1 | |
All cell types | TE | 0 | 0 | 125.1425 | 157.2069 | 4.430817 | 8.501064 | 0 | 0 | 0 | 0 | |
HS | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||
M5M9 | QLC | 1 | 1 | 1 | 0.899543 | 1 | 1 | 1 | 1 | 1 | 1 | |
All cell types | TE | 0 | 0 | 0 | 87.94826 | 0 | 0 | 0 | 0 | 0 | 0 | |
HS | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
The table represents the results of the three clustering measures (QLC, TE and HS) over 10 NJ reconstructed methods. For the measure QLC higher scores mean better performance, whereas for the measures TE and HS lower values mean better performance. It can be seen that in some cases, for example dataset #7, all the distance measures (except for Euclidian) give the same best score (1 for QLC and 0 for TE and HS). For dataset #1, on the other hand, all the distance measures give a rather similar bad score. It is not surprising that the scores of a dataset which contains 9 individuals (#1) will be lower than the scores of a dataset which contains only 2 individuals (#7). There are cases, like #3 and #4, in which the best performance is achieved by one method (NJ-Normalized Absolute) and other cases, like #5, where a few methods receive a very high score (NJ-Normalized Absolute, NJ-MMM length dependent rates, and more).
The results of the 502 datasets are given in
Upper panels – Mouse, Lower panels- Human. Each column presents a different clustering measure (see
As seen, the NJ- Normalized Absolute method is ranked the best more frequently than the other methods in all the three different measures, and the next best performance is associated with the NJ- Absolute method. The absolute distance measure captures the underlying mutational process precisely only if there are no back mutations, namely the addition of a repeat followed by its deletion, or vice versa. However, this is not likely to be the case in general as empirical data shows that MSs accumulate many mutations and yet have small variability over time, hence back mutations must be present, most likely in accordance with a random walk model
We conducted a similar rough test (i.e. including different types of cells) using the human data which contains seven individuals, thus producing 120 datasets. A summary of the results is given in
In order to validate these results we performed lineage tree reconstruction of simulated trees similar to the ones reconstructed using the real data (as described in
In general one cannot assume that the depth is the same for different cell types, and therefore tree reconstruction algorithms that produce ultrametric trees are not applicable. Even though, by using the same type of cells from organisms with the same age, the same depth assumption is reasonable, still, there might be a problem that the zygotes of the different organisms do not have the same depth. However due to the fact that our mice are MMR deficient, their mutation rates after the zygote are a few orders of magnitude higher than before the zygote, and all of them are genetically related, it is reasonable to assume that they have the same depth from the zygote.
By using the same cells from different mice with the same age, we also tested the UPGMA algorithm with all the distance measures that we have, as well as BATWING. In general the NJ and the UPGMA algorithms perform better than the QMC and the BATWING. The NJ has a small advantage over the UPGMA, but since the dataset size is relatively small we cannot provide a convincing conclusion. Among all the measures, using NJ- Normalized Absolute performs the best as shown in
Each column presents a different clustering measure (see
The Bayesian method did not perform well, compared to the other methods. This is not surprising taking into account the fact that its assumptions do not describe well the cell population, both in the reproduction process, and the demographic model. BATWING assumes an exponential growth, while the true demography is much more complicated and includes for instance bottlenecks for the different types of cells
The comprehensive test presented above is not identical to the common cases in cell lineage research. In the above test, we used cells from different organisms, while in the common research questions of cell lineage, cells from the same organism are used. A priori there is no reason to believe that the evolvement of the mutational process inside the individual is different than the one between different individuals. This is because many general features of the MS mutational process (such as the dependency on the unit size, the length of the MS, and the specific letter) show a similar behavior (Chapal-Illani et al, in preparation), and because the MMR-deficient mice are relatives, and thus the genetic distance between them is relatively small and the vast majority of their mutations are somatic. As a result we assume that there is no difference in the performance of the reconstructing algorithms, and since the NJ-Normalized Absolute algorithm performs best between individuals, we can assume it has the best performance inside a single individual. To validate this assumption we reconstructed lineage trees of single cells from the same individual, using all types of methods mentioned above, without knowing what the true separation of the different cell types is. In
We also performed lineage tree reconstruction of simulated trees of single cells of different types from one individual (described in
In order to examine the clustering abilities in a refined way, we produced datasets using only one type of cells from different individuals, and tested the performance of the different methods. This step is important for two reasons: a. The use of different cell types for different mice may lead to an artificial separation between the mice, simply because of the diverse features of different cells (such as division rate, etc.). The use of the same type of cells assures that the methods indeed distinguish between the different organisms. b. We want to check the presence of a correlation between the clustering qualities of the different methods with specific types of cells, i.e. whether there are some methods that succeed in separating some types of cells, while others succeed in separating other types.
As mentioned before, clustering is just one example of a feature of the tree in which we are interested. Another feature is the depth of specific types of cells. Different depths can indicate different biological scenarios, for example whether some types of cells divide only during the embryonic stage or also in the adult stage. In this section our goal is to quantify the performance of the different methods in identifying depth differences between groups of cells.
In order to obtain cases where the depth separation between the cell groups is known, we used the same type of cells from individuals with a substantial gap between their ages. The list of cell types of each individual is given in
Two reconstructed lineage trees containing whole crypts from M9 (52 days) and two reconstructed lineage trees containing whole crypts from M7 (199 days) are shown. The root of the trees (colored in black) is the signature of the tail extracted from each mouse. (A) Two reconstructed trees using the NJ algorithm along with Equal or Not distance measure. (B) Two reconstructed trees using the NJ algorithm along with the Absolute distance measure.
The depth of each group of cells as reconstructed by the NJ varies even if all the cells of this group actually have exactly the same number of divisions. Therefore for each group of cells, the depth is described by a distribution rather than by a single number. In order to quantify the performance of a method in separating between the two groups, quantities that differentiate between distributions are needed.
The most natural choice is the Kolmogorov-Smirnov (KS) test, which measures the similarities between two datasets. However, this test has some disadvantages for our purposes; the most significant one is the ability to determine that two datasets are different even if they have exactly the same average depth, in cases where their standard deviations are substantially different. Therefore in addition to the KS we added two other measures that focus on the separation between the two distributions. The first is the normalized distance between the mean of the two groups, and the second is the overlap percentage between the two distributions (see
A summary of the depth separation tests' results is presented in
Each column presents a different depth quality measure (see
Bootstrap analysis was used in order to evaluate the robustness and reproducibility of the estimated trees, the clustering of the tree and the depth separation according to cell type. We performed this analysis on several mice and human datasets which showed a good clustering or depth separation using the NJ-Normalized Absolute method.
The bootstrap values were obtained by generating 100 trees using MS values extracted from sampling with replacement of the loci from each dataset. The bootstrap showed that the robustness of any particular branch in the tree is low, but the robustness of the clustering results and depth separation according to cell type is high (see
In the preliminary stages of the cell lineage research conducted by our lab, small-scale investigations of the ability to reconstruct cell lineage trees were done. The large amount of information that was gathered during the last few years enabled us to conduct this investigation in a much more comprehensive way. The main outcome of this research is that even though currently only a limited amount of microsatellite loci are available, preventing the reconstruction of the accurate cell-lineage tree, many biological conclusions can still be confidentially drawn. By this we mean that apart from specific noisy cases, almost always we are able to identify the correct biological scenario from the reconstructed cell lineage if the proper tree reconstruction algorithm and distance measure are used.
Among the NJ methods, we have found that the NJ- Normalized Absolute method outperforms the other methods at inferring the clustering of distinct groups. Interestingly, this shows that clear cluster- separation is not necessarily correlated with the most precise description of the mutational process. A result with a similar spirit was obtained previously
However, such a measure cannot describe accurately cell depth because depth information is eliminated in the normalization procedure. It is not unreasonable to assume that a Likelihood (or a Bayesian) method tailored towards cell lineage analysis that will make use of all the cells' information, without summarizing it into distance measure, will enable one to infer simultaneously both the topology and the depth in an accurate way. We hope to follow such a path in the future.
We expect that in the coming years next generation sequencing methods will provide us with a much richer genetic signature, and thus improve our ability to infer the cell lineage tree more accurately. This in turn will enable relying on even fine details of the cell lineage tree and not only its rough features.
All animal husbandry and euthanasia procedures were performed in accordance with the Institutional Animal Care and Use Committee (IACUC) of the Weizmann Institute of Science.
All human patients signed an informed consent; the study has received Helsinki authorization and was approved by the Rambam Hospital IRB committee and by the Bioethics Committee of the Weizmann Institute of Science.
Our aim is to quantify the performance of different tree reconstruction methods in inferring clustering and depth separation. Most of the methods we tested are distance-based algorithms which use a distance measure between the cells to iteratively join close samples together, such as the Neighbor-Joining algorithm (the full list of methods and distance measures is given below). We tested each method on two features:
The first distance based algorithm we used is UPGMA (Unweighted Pair Group Method with Arithmetic Mean algorithm)
The second algorithm we used is NJ (Neighbor-Joining)
The third algorithm we used is QMC (Quartet MaxCut)
Apart from the distance based algorithms, we tested a Bayesian method for inferring the cell lineage tree. We used the computer software, BATWING
The distance-based methods require a distance measure between cells, which ideally should be linear with the actual number of divisions separating any two samples, and should provide the most robust tree reconstruction. We have tested several different distance functions. In these functions
Absolute distance – the distance between two samples,
Normalized Absolute distance – the distance between two samples,
Euclidean distance – the distance between two samples, i and j, is:
“Equal or not” distance – the distance between two samples,
Maximum Likelihood (ML) – each entry in the distance matrix is taken as the maximum likelihood estimate of the number of divisions separating the two cells (more details in
We used three different measures to quantify the quality of the clustering separation ability:
Quality of the Largest Cluster (QLC) – this measure focuses on the existence of one large cluster and ignores the behavior of the rest of the cells, and it is calculated as follows. For each cell type
Tree Entropy (TE) – this measure assesses the amount of mixing on the tree between each pair of cell groups. It is affected only by the number of clusters on the tree that each cell group has, regardless of the sizes of these clusters. The entropy of the whole tree is the average over the entropy matrix containing the entropy between each pair. It is calculated as follows: the number of clusters of each pair of cell types
Hypergeometric score (HS) – this measure detects a statistically significant clustering of a predefined group of cells on the reconstructed lineage tree. According to the method, given a dichotomous classification of
The HS values presented in the figures of this work were taken as the log HS values. Since they range from 0 to 1 and are normalized by the best value, the higher normalized values mean better performance (i.e. better clustering).
All these measures evaluate the quality of the separation of the distinct groups on the tree, but they measure parameters that are slightly different. The QLC focuses on the existence of one large cluster and ignores the behavior of the rest of the cells. The TE on the other hand is determined by the number of distinct clusters of each cell type, and ignores their sizes. Hence, the TE focuses on the global behavior of the tree and not just on one sub-tree. The HS, like the QLC, focuses on the existence of a large cluster, but does not ignore the rest of the cells as it detects a statistically significant clustering of a group of cells on the lineage tree.
We used three different measures to quantify the quality of the depth separation ability:
Kolmogorov Smirnov (KS) test – this measure answers the question of whether two datasets were taken from the same distribution. Even if the datasets have the same average but have very different standard deviations, the KS test will reject the null hypothesis of both having the same distribution. For some features of separation, this test answers the required question, but for other features, other tests are necessary, such as the other two measures presented below.
Normalized Distance (ND) – this measure evaluates the normalized distance between the averages of two datasets, defined as:
Overlap percentage – this measure gives the percentage of overlap between two datasets and is defined as follows. For each data point of distribution
Mlh1+/−C57Bl/6 (obtained from Prof. Michael Liskay)
Peripheral Blood (PB) and Bone Marrow (BM) biopsies at diagnosis and relapse collected from leukemia patients were received from the Rambam Healthcare Campus, Haifa, Israel. All patients signed an informed consent, and the study was approved by the RambamIRB committee.
The system takes as input a set of DNA samples from individual cells and PCR primers for Microsatellites (MSs), and outputs a reconstructed cell lineage tree of the cells from mutations identified in these MSs. This tree provides an inferred topology and a depth estimation representing the inferred number of cell divisions that occurred along each edge of the lineage tree. The system utilizes a programmable laboratory robot augmented with a PCR and capillary electrophoresis machines. The capillary machine histograms are analyzed by a computer program that utilizes a signal processing algorithm to assign each sample a vector, which is a mathematical representation of the mutations the cell acquired in the MS set. A computer program applies a phylogenetic algorithm to the set of vectors and produces a reconstructed cell lineage tree associated with the DNA samples.
Different cell types were sampled from mice and humans (
It is important to note that since all the loci are fully linked, the question of which of the two alleles mutated is not meaningful, as we treat the same locus on the two alleles as two independent loci. This is acceptable because we generally analyzed loci from the chromosome X of males, where there is only one allele. In the few cases where we analyzed loci from other chromosomes (or from chromosome X of a female), we used only loci with very distinct numbers of repeats on each allele, that enable us to identify which of the two alleles underwent mutation, even though we cannot say which copy of the chromosome has which allele.
We should note that aneuploidy, as well as copy number variations (CNV), both of which are frequent in cancer cells, may lead to a situation where even in the male X-chromosome there will be more than one allele. However, in cases with more than one allele the locus was not taken into account for this specific cell. In other chromosomes, the CNV may lead to deletion of one of the alleles which dramatically affects the results, and therefore such loci were not taken into account. Our principle was to analyze only loci in which we detected the amount of alleles expected.
Mice were sacrificed before tissues isolation. Cells were isolated by tissues digestion followed by MACS or FACS. Aliquots of 0.5 µl were spread on a flat bottom 96 well plate and observed under the microscope. Drops that contained single cell were collected into 0.2 ml tubes and subjected to whole genome amplification. Alternatively, cells were isolated by laser micro-dissection as previously demonstrated
WGA was performed using the IllustraGenomiPhi V2 DNA Amplification kit (GE Healthcare Life Sciences, Piscataway, NJ, USA) according to the manufacturer's instructions as described by G. Kumar et al
We simulated trees similar to the ones reconstructed using the real data, with some trees containing 3 individuals with 5 different cell types for each individual, and other trees composed of a single individual. We simulated several kinds of topologies, which were different from each other in branch length. For example, in one topology the distance between the leaves and their MRCA was high, compared to the distance between the root and these MRCAs, and in another topology this distance between the leaves and their MRCA was much lower. We repeated the simulation 1000 times, where in each iteration, we built a random tree and randomly added MS mutations according to a given mutation rate (1/100, 1/1000 and 1/10000) using a binomial distribution. We then reconstructed the tree using all of our methods and compared it with the actual tree that was generated. The topology comparison between the inferred and the actual trees was done using Penny & Hendy's topological distance algorithm
(TIF)
(TIF)
(TIF)
(TIF)
(TIF)
(TIF)
(TIF)
(TIF)
(TIF)
(TIF)
(TIF)
(TIF)
(TIF)
(TIF)
(XLSX)
(XLSX)
(XLSX)
(XLSX)
(XLSX)
(XLSX)
(XLSX)
(XLSX)
(XLSX)
(DOC)
We thank B. Chor and A. Tanay for helpful comments on the manuscript. Ehud Shapiro is the Incumbent of The Harry Weinrebe Professorial Chair of Computer Science and Biology.