Conceived and designed the experiments: JS MK KB NM. Performed the experiments: JS MK NM. Analyzed the data: JS MK TK MH NM. Contributed reagents/materials/analysis tools: JS MK NM. Wrote the paper: JS MK MH NM. Algorithm design and implementation: JS. Contributed to algorithm design: KB.
The authors have declared that no competing interests exist.
Genetic variation on the non-recombining portion of the Y chromosome contains information about the ancestry of male lineages. Because of their low rate of mutation, single nucleotide polymorphisms (SNPs) are the markers of choice for unambiguously classifying Y chromosomes into related sets of lineages known as haplogroups, which tend to show geographic structure in many parts of the world. However, performing the large number of SNP genotyping tests needed to properly infer haplogroup status is expensive and time consuming. A novel alternative for assigning a sampled Y chromosome to a haplogroup is presented here. We show that by applying modern machine-learning algorithms we can infer with high accuracy the proper Y chromosome haplogroup of a sample by scoring a relatively small number of Y-linked short tandem repeats (STRs). Learning is based on a diverse ground-truth data set comprising pairs of SNP test results (haplogroup) and corresponding STR scores. We apply several independent machine-learning methods in tandem to learn formal classification functions. The result is an integrated high-throughput analysis system that automatically classifies large numbers of samples into haplogroups in a cost-effective and accurate manner.
The Y chromosome is passed on from father to son as a nearly identical copy. Occasionally, small random changes occur in the Y DNA sequences that are passed forward to the next generation. There are two kinds of changes that may occur, and they both provide vital information for the study of human ancestry. Of the two kinds, one is a single letter change, and the other is a change in the number of short tandemly repeating sequences. The single-letter changes can be laborious to test, but they provide information on deep ancestry. Measuring the number of sequence repeats at multiple places in the genome simultaneously is efficient, and provides information about recent history at a modest cost. We present the novel approach of training a collection of modern machine-learning algorithms with these sequence repeats to infer the single-letter changes, thus assigning the samples to deep ancestry lineages.
Genetic variation on the non-recombining portion of the Y chromosome (NRY) has become the target of many recent studies with applications in a variety of disciplines, including DNA forensics
Reconstructing relationships among globally dispersed populations or divergent male lineages requires polymorphisms with lower probabilities of back and parallel mutation (i.e., lower levels of homoplasy) and systems for which the ancestral state can be determined. SNPs and small indels, with mutation rates on the order of 2–4×10⁻⁸/site/generation, are best suited for these purposes. Because SNPs and indels are likely to have only two allelic classes segregating in human populations, they are sometimes referred to as binary markers (we refer to both classes of marker as SNPs). The combination of allelic states at many SNPs on a given Y chromosome is known as a haplogroup. A binary tree of NRY haplogroups with a standard nomenclature system (
One of the challenges for geneticists is the cost and time typically needed to genotype an appropriate number of SNPs to assign a given Y chromosome to a haplogroup. Multiplex strategies to type SNPs are also difficult and require a substantial initial investment to implement
We obtained a data set collected by the Hammer laboratory that contains 8,414 globally diverse Y chromosome samples genotyped at 15 Y-STRs. The same samples were also typed with a battery of SNPs to identify the haplogroup of each sample. The SNPs typed and the resulting haplogroup tree are given in
This set of chromosomes, typed at 15 Y-linked STRs, was used as a ground-truth training set (see text for explanation). Haplogroups are named according to the mutation-based nomenclature
Since each sample of STR scores in our data is labeled with a haplogroup, we formalized the problem of classifying a new sample with a haplogroup as a supervised learning task and used our data set of 8,414 samples as a ground-truth set for training. In order to provide high classification accuracy, we combined the results of three disparate types of classifiers: decision trees, generative Bayesian models, and support vector machines. As we describe in
We compared our classification results to an informal nearest neighbor heuristic that labels STR samples with a haplogroup based on the stepwise mutation model
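The stepwise mutation model underlying such a nearest neighbor heuristic can be sketched as follows. This is a minimal illustration with hypothetical function names; the published heuristic's tie-breaking rules and thresholds are not reproduced here.

```python
def stepwise_distance(a, b):
    """Stepwise-mutation distance: sum of absolute differences in
    repeat counts across the Y-STR loci of two samples."""
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest_neighbor_haplogroup(sample, training):
    """Label a sample with the haplogroup of its closest training
    sample. `training` is a list of (repeat_counts, haplogroup) pairs."""
    return min(training, key=lambda t: stepwise_distance(sample, t[0]))[1]
```
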
We evaluated the performance of each classifier individually and in tandem using cross-validation on our 8,414 sample ground-truth training set, and compared the results with the nearest neighbor heuristic previously mentioned. We also performed cross-validation on publicly available data from other published research with Y-STR and haplogroup data. Finally, we tested the classification performance on the public data using our data for training. In brief, the results show the classifiers perform very well with a diverse training set and that the number of loci available in the data set is an important determining factor in their performance.
The cross-validation was accomplished by stochastically partitioning the data sets into
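The stochastic partitioning into folds can be sketched generically as follows (a hypothetical k-fold sketch; the seed and fold-assignment scheme are illustrative assumptions, not the authors' exact procedure).

```python
import random

def kfold_partition(samples, k=5, seed=0):
    """Stochastically partition samples into k roughly equal folds.
    Each fold serves once as the held-out test set while the
    remaining folds are used for training."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]
```
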
We combined the classification output for a sample from the decision trees (J48 and PART), Bayesian models, and support vector machines into a tandem decision. The output haplogroups from the classifiers were compared, and if they were all in agreement, the classification was accepted; otherwise the sample was left unassigned.
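The agreement rule for the tandem decision can be sketched as follows (a minimal illustration, not the production implementation):

```python
def tandem_classify(predictions):
    """Accept a haplogroup only if every classifier agrees.
    `predictions` holds one haplogroup label per classifier.
    Returns None (unassigned) on any disagreement, routing the
    sample to manual expert review."""
    first = predictions[0]
    if all(p == first for p in predictions):
        return first
    return None
```
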
In practice, an unassigned sample for the tandem approach is selected for manual, expert analysis. Experienced personnel examine the haplogroup assignment from the individual classifiers for familiar patterns. The confidence values from the classifiers may also be analyzed to resolve frequently seen disagreements. If the ambiguity cannot be resolved at this stage, SNP testing is done to ensure a correct haplogroup label. The result of the SNP test is then added to the training set to continually improve the classifiers.
For the nearest neighbor heuristic we used the
Classifier | Percent Correct of Assigned | Correct | Incorrect | Unassigned
 | Mean | SE | Mean | SE | Mean | SE | Mean | SE
Tandem | 98.8 | 0.1 | 1426.1 | 1.8 | 17.9 | 0.7 | 238.8 | 1.9 |
SVM | 95.0 | 0.1 | 1598.5 | 1.2 | 84.3 | 1.2 | ||
J48 | 92.7 | 0.1 | 1559.5 | 1.5 | 123.3 | 1.5 | ||
PART | 92.5 | 0.1 | 1556.8 | 1.6 | 126.0 | 1.6 | ||
Bayes | 91.5 | 0.1 | 1540.2 | 1.5 | 142.6 | 1.5 | ||
Nearest | 98.3 | 0.2 | 1227.3 | 2.5 | 21.6 | 0.5 | 433.9 | 2.6 |
The support vector machine has the highest individual accuracy. J48 and PART are decision tree classifiers. Combining the classifiers into the tandem strategy boosts performance to a very high accuracy while maintaining a much lower unassignment rate than the nearest neighbor heuristic.
The average accuracy for each of the classifiers per haplogroup is shown in the top panel of
Standard error bars are shown for each point. The lower panel shows the frequency of the 8,414 samples among haplogroups. The support vector machine has the best overall performance, especially for haplogroups with a smaller number of samples in the training data.
The classification accuracy under the tandem approach was very high.
The average proportion of samples with agreement for all four classification methods is also shown. The haplogroups with the highest rate of tandem disagreement have a low representation in the training data.
Haplogroup | Correct | Incorrect | Unassigned
 | Mean | SE | Mean | SE | Mean | SE
A-P97 | 3.1 | 0.2 | 0.1 | 0.0 | 2.4 | 0.2
B-M60 | 1.2 | 0.1 | 0.0 | 0.0 | 2.3 | 0.3
C-M130,M216 | 1.6 | 0.1 | 0.0 | 0.0 | 6.7 | 0.4
D-M174 | 0.7 | 0.1 | 0.2 | 0.1 | 2.5 | 0.2
E-M96,SRY4064 | 111.4 | 1.3 | 0.9 | 0.1 | 31.9 | 0.8
G-M201 | 114.9 | 1.2 | 2.4 | 0.2 | 39.3 | 0.9
H-M69 | 4.0 | 0.3 | 0.1 | 0.0 | 5.7 | 0.3
I-M170,P19 | 335.7 | 2.2 | 4.7 | 0.4 | 59.4 | 1.1
J-M267 | 125.4 | 1.5 | 2.0 | 0.2 | 9.3 | 0.4
J-M172 | 128.4 | 1.5 | 2.3 | 0.2 | 18.0 | 0.7
K-M70 | 62.9 | 1.0 | 0.9 | 0.1 | 3.8 | 0.3
L-M20 | 6.2 | 0.3 | 1.3 | 0.2 | 2.9 | 0.3
N-M231 | 0.6 | 0.1 | 0.9 | 0.2 | 3.0 | 0.2
O-M175 | 24.0 | 0.6 | 0.4 | 0.1 | 22.7 | 0.6
Q-P36,M242 | 40.5 | 0.8 | 0.4 | 0.1 | 7.3 | 0.4
R-SRY10831.2 | 100.0 | 1.3 | 0.1 | 0.0 | 5.7 | 0.3
R-M343 | 346.7 | 1.7 | 0.6 | 0.1 | 15.5 | 0.6
The far right column gives the number of samples with an unassigned tandem classification, i.e., samples for which the four methods did not all agree.
In addition to testing the performance of the classifiers on our 15-locus data set, we also tested them on published STR data collected from West, South and East Asian populations
The samples were typed with 9 Y-STRs and a battery of Y-linked SNPs. Haplogroup frequencies are statistically significantly different from those in our ground-truth training set (
As before when testing the classifiers on our ground-truth data, we ran 10 iterations of five-fold cross-validation on the public data.
Standard error bars are shown for each point. The lower panel shows the frequency of haplogroups in the 1,527 sample public data set.
9-Locus Public Data
Classifier | Percent Correct of Assigned | Correct | Incorrect | Unassigned
 | Mean | SE | Mean | SE | Mean | SE | Mean | SE
Tandem | 95.6 | 0.5 | 208.6 | 1.1 | 9.5 | 0.3 | 87.3 | 1.1 |
SVM | 87.1 | 0.2 | 265.9 | 0.7 | 39.5 | 0.7 | ||
PART | 84.3 | 0.3 | 257.6 | 0.8 | 47.8 | 0.8 | ||
J48 | 84.3 | 0.3 | 257.5 | 0.8 | 47.9 | 0.8 | ||
Bayes | 79.0 | 0.3 | 241.3 | 0.9 | 64.1 | 0.9 | ||
Nearest | 92.6 | 0.4 | 212.8 | 1.0 | 17.0 | 0.6 | 75.6 | 0.8 |
9-Locus Ground-Truth Data
Classifier | Percent Correct of Assigned | Correct | Incorrect | Unassigned
 | Mean | SE | Mean | SE | Mean | SE | Mean | SE
Tandem | 92.4 | 0.2 | 1,100.2 | 2.4 | 90.0 | 1.1 | 616.7 | 2.4 |
SVM | 82.4 | 0.1 | 1,386.1 | 1.9 | 296.7 | 1.9 | ||
PART | 81.1 | 0.1 | 1,364.2 | 1.9 | 318.6 | 1.9 | ||
J48 | 81.2 | 0.1 | 1,366.5 | 1.9 | 316.3 | 1.9 | ||
Bayes | 76.9 | 0.1 | 1,293.4 | 2.1 | 389.4 | 2.1 | ||
Nearest | 91.8 | 0.4 | 618.6 | 2.9 | 55.6 | 1.1 | 1,008.7 | 3.3 |
The top table shows the average classifier performance for cross-validation on the 9-locus public STR data. The bottom table is the performance for the same test, but on a 9-locus subset of our ground-truth training data. While overall performance is lower than the 15-locus cross-validation test on our ground-truth data (
Compared with the earlier cross-validation results on our ground-truth data (
We tested haplogroup classification for the public STR data using classifiers trained with the 9-locus subset of our data set. The classification accuracy results are reported in
Classifier | Percent CoA | Correct | Incorrect | Unassigned
Tandem | 83.9 | 732 | 140 | 655 |
SVM | 67.2 | 1,026 | 501 | |
PART | 70.5 | 1,077 | 450 | |
J48 | 70.5 | 1,076 | 451 | |
Bayes | 62.8 | 959 | 568 | |
Nearest | 81.9 | 398 | 88 | 1,041 |
A 9-locus subset of our ground-truth data was used to train the classifiers (Percent CoA = % Correct of Assigned).
In this paper we have shown that by using machine learning algorithms and data derived only from a set of Y-linked STRs, it is possible to assign Y chromosome haplogroups to individual samples with a high degree of accuracy. We note that the number of Y-STRs used has a significant impact on the accuracy of haplogroup classification.
Our classification software provides a single turnkey interface to a tandem of machine learning algorithms. It is extensible in that other high-performing classification algorithms can be added to it when they are developed. We have made the software freely available to use for non-commercial purposes and posted it online at
Future work could focus on identifying an optimal set of Y-STRs to obtain the highest accuracy of haplogroup classification. Our preliminary results (data not shown) suggest that different Y-STRs are informative for different haplogroups. Additional work should help to better understand the properties that make different Y-STRs more or less informative for proper haplogroup assignment.
Our Bayesian model assumes that Y-STRs are statistically independent conditioned on the haplogroup. While we observe good performance using this model, this assumption is not realistic given the lack of crossing over among the Y-linked STRs used in our analysis. On the other hand, Y-STRs mutate independently in a stepwise fashion, which may cause particular Y-STRs to be effectively unlinked on some haplogroup backgrounds. As more information about linkage becomes available and our ground-truth data set continues to expand, we may be able to include such information to improve our model.
The software system can be effectively used to construct high throughput SNP test panels, particularly in the case of platforms that restrict the number of SNPs accommodated per panel. Given a corpus of STR data, the classifiers can identify a collection of candidate SNP sites to be placed on the panel to provide maximum coverage over potential haplogroups in a population. In this way the software provides a cost-effective first step in a multi-level process for deep haplogroup identification by facilitating targeted SNP testing.
The 15 Y chromosome STR loci used in this study are: DYS393, DYS390, DYS394 (both copies when duplicated), DYS391, DYS385a, DYS385b, DYS426, DYS388, DYS439, DYS389 I, DYS389 II, DYS392, DYS438, DYS457. These loci are commonly used in the fields of population genetics, forensic science, and commercial genealogical testing
The SNP and STR data for samples utilized to construct the training data set and models for this study were acquired and analyzed over an extended period of time by the Hammer laboratory. The SNPs were identified using a variety of techniques including: DNA sequencing, allele specific PCR scored by agarose electrophoresis, PCR and restriction digest, and TaqMan assays. A test panel comprising the SNPs in
What follows is a brief description of the classifiers we used and how each was adapted and extended to the haplogroup assignment problem. We first introduce some notation shared among all classifier descriptions. Let
In a decision tree classifier, we learn a set of rules for separating samples into hierarchical classification groups according to locus and allele values. The internal nodes of the tree are comprised of locus tests for specific allele values and the terminal nodes represent haplogroup classification. The set of tests from the root node in the tree to a terminal node is the classification rule for a haplogroup. The tree is constructed from a set of training data
The locus tests are constructed using a measure of information gain, which is based on information entropy
Let
Knowing the allele value of a locus may affect the entropy of the data; additional information either does not change or decreases the entropy. When a particular allele at the
We obtain a general conditional entropy for each locus by marginalizing out the allele values. This is equivalent to computing a weighted average of Equation 4, where the weights are given by the probability of each allele.
The difference in variation among haplogroups when a Y-STR allele is both known and unknown is the information gain. It is a measure of how well a locus explains haplogroup membership. Formally, it is defined for the
Given the data set
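The entropy and information-gain computations described above can be sketched as follows. This is a minimal Python illustration under assumed data structures (samples as tuples of repeat counts, labels as haplogroup names); it is not the J48/PART implementation.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of the haplogroup labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(samples, labels, locus):
    """Information gain of one STR locus:
    H(haplogroup) - sum over alleles a of P(a) * H(haplogroup | allele=a)."""
    n = len(samples)
    by_allele = {}
    for s, y in zip(samples, labels):
        by_allele.setdefault(s[locus], []).append(y)
    conditional = sum(len(ys) / n * entropy(ys) for ys in by_allele.values())
    return entropy(labels) - conditional
```

A locus whose alleles perfectly separate the haplogroups has gain equal to the label entropy; a locus whose allele distribution is identical across haplogroups has gain zero.
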
The general decision tree approach has some limitations, including overfitting (by creating too many branches) and locus bias. The former can be handled by introducing thresholds or other heuristics for the amount of information gain required to create a branch. The latter is a more fundamental problem of the approach; by definition, the information gain favors Y-STRs taking many different allele values. We used the PART and J48 implementations
Samples from four haplogroups in data set
In the non-parametric Bayesian model, we define a posterior distribution over the haplogroups conditioned on observed allele values. The posterior is expressed as the normalized product of the data likelihood and model prior. For a given sample of allele values, the posterior gives a probability for each haplogroup it could belong to. It is defined as
The fundamental assumption of our naive Bayes model is the independence of the Y-STRs
Mathematically, the independence assumption leads to defining the likelihood as a product over each Y-STR density function,
For each haplogroup, the density functions
The distribution Equation 11 is defined over all haplogroups, but is not by itself a classifier. To make a decision, we choose the haplogroup that maximizes the posterior, i.e., the maximum a posteriori (MAP) estimate
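A frequency-based naive Bayes classifier with a MAP decision can be sketched as follows. This is illustrative only: the paper's non-parametric density estimates are not reproduced, and the `eps` pseudocount used to avoid zero probabilities is an assumption.

```python
from collections import Counter, defaultdict

def train_naive_bayes(samples, labels):
    """Tally per-haplogroup allele counts at each locus, plus the
    haplogroup prior counts."""
    prior = Counter(labels)
    freq = defaultdict(lambda: defaultdict(Counter))  # freq[h][locus][allele]
    for s, h in zip(samples, labels):
        for locus, allele in enumerate(s):
            freq[h][locus][allele] += 1
    return prior, freq

def map_classify(sample, prior, freq, eps=1e-6):
    """Return the MAP haplogroup: the product of the prior and the
    per-locus likelihoods (independence assumption), maximized over
    haplogroups. Normalization is omitted since it does not change
    the argmax."""
    total = sum(prior.values())
    def score(h):
        p = prior[h] / total
        for locus, allele in enumerate(sample):
            n = sum(freq[h][locus].values())
            p *= (freq[h][locus][allele] + eps) / (n + eps)
        return p
    return max(prior, key=score)
```
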
Support vector machines learn a hyperplane with maximal margin of separation between two classes of data samples in a feature space
For a sample
The goal of training a support vector machine is to find the hyperplane, defined by
Example showing the hyperplane with maximal margin of separation between samples from two different haplogroups. The shaded points lying on the margin define the support vectors.
By noting that the distance of a sample
Without loss of generality, we can rescale
Transforming the problem into its dual shows that the optimization exhibits the Karush-Kuhn-Tucker conditions that
A common and effective kernel to use for SVMs is the Gaussian, which has the form
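Written out directly, the Gaussian kernel compares two samples through their squared Euclidean distance in repeat-count space (the `gamma` width parameter below is a hypothetical choice, normally tuned by cross-validation):

```python
import math

def gaussian_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel: K(x, y) = exp(-gamma * ||x - y||^2).
    Equals 1 for identical samples and decays toward 0 as the
    samples move apart."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```
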
In order to make the SVM approach work on data that may not be perfectly separable, we allow a small amount of the training data to be misclassified. Thus, rather than imposing an infinite penalty on misclassified points (and zero penalty on correctly classified ones), we allow some of the data points to be classified on the wrong side of the separating hyperplane. To accomplish this, we follow the standard treatment of introducing slack variables that act as a penalty with linearly increasing value for the distance from the wrong side
A slack variable
The optimization process is similar to before, but the Lagrange multipliers are now subject to the constraint 0≤
Since SVMs train a binary classifier and we have multiple haplogroups to distinguish between, we trained an SVM for each haplogroup in a one-vs-many fashion. In general, an SVM trained as one-vs-many for a particular haplogroup uses samples in that haplogroup as positive exemplars and samples in other haplogroups we wish to compare against as negative exemplars.
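The one-vs-many relabeling that produces each binary training set can be sketched as follows (illustrative only; the actual SVM training is done with libSVM):

```python
def one_vs_many_datasets(samples, labels, haplogroups):
    """Build one binary training set per haplogroup: samples in the
    target haplogroup become positive exemplars (+1), and samples
    from the comparison haplogroups become negative exemplars (-1)."""
    datasets = {}
    for target in haplogroups:
        y = [1 if h == target else -1 for h in labels]
        datasets[target] = (samples, y)
    return datasets
```
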
We organized the set of binary classifiers into a hierarchy based on the currently known binary haplogroup lineage
Only the top-level haplogroups are shown.
In order to create a high throughput classification system, all samples of STR data needing haplogroup prediction are batch selected from a database at regular intervals and classified. Once selected, our tandem classification software predicts the haplogroup for each sample and updates its record in our laboratory information management system (LIMS). Laboratory technicians and data reviewers can then view the results in a web interface (
The tandem classification software brings together a collection of algorithms implementing naive Bayes, support vector machines, and decision tree classifiers. Where available, we used standard implementations of these algorithms that are open and available to the public.
For support vector machines, we used the freely available software package libSVM
We used the J48 and PART decision tree classifiers from the Weka machine learning software suite
NRY haplogroup SNP tree used to type each sample in our ground-truth training set.
(0.02 MB EPS)
Accuracy of each classifier per haplogroup of predicting the public STR data set using the 9-locus subset of our ground-truth data as training. The lower panel shows the frequency of samples among haplogroups in the public data set.
(1.06 MB EPS)
Tandem classification accuracy and agreement per haplogroup on the public data set using the 9-locus subset of our ground-truth data as training.
(0.03 MB EPS)
Screen capture of the STR score review console from the laboratory information management system. The red dashed box contains STR loci names as headings for the columns of collected loci score data. The pop-up window (in black) shows the quality assessment score for a selected sample. The yellow dashed ellipse highlights our software's haplogroup classification and confidence value for each algorithm in the tandem approach.
(0.44 MB EPS)
We would like to thank Fernando Mendez for helping us find large amounts of publicly available STR data. We would also like to thank Saharon Rosset for his insightful comments and suggestions.