Supplemental Text S2
Training SVMs on raw data as a baseline for performance evaluation
The main algorithm developed in this paper uses a functional relationship network as input to an SVM that predicts gene-phenotype associations, framed as a classification problem. Specifically, for each phenotype we constructed a feature space in which a gene's feature vector consists of its network connection weights to all positive examples of that phenotype.
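The feature construction described above can be illustrated with a minimal sketch. The function name `network_features`, the toy weight matrix, and the choice of positive genes are hypothetical and are not taken from the paper; how a positive gene's weight to itself is handled is also an assumption here (the toy matrix simply uses 1.0 on the diagonal).

```python
from typing import List, Sequence


def network_features(weights: Sequence[Sequence[float]],
                     positives: Sequence[int]) -> List[List[float]]:
    """Return one feature vector per gene.

    Each vector holds that gene's network connection weights to the
    positive examples of a phenotype, in the order given by `positives`.
    """
    return [[row[j] for j in positives] for row in weights]


# Toy symmetric network over 4 genes (hypothetical weights).
W = [
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.6],
    [0.1, 0.2, 1.0, 0.4],
    [0.3, 0.6, 0.4, 1.0],
]
positives = [0, 1]  # genes annotated to the phenotype (hypothetical)

# X has one row per gene and one column per positive example.
X = network_features(W, positives)
```

In this sketch the number of features equals the number of positive examples of the phenotype, so the feature space stays small even when the underlying network was built from many data sources.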
An intuitive alternative to this approach is to use all of the original data that were integrated into our functional relationship network as the feature space, and to train SVMs directly on these features [1]. Our raw data consisted of 13048 features derived from the data sources discussed in Supplemental Text S1. This large number of features presents two major problems for SVM classification. First, the time required to train an SVM is O(nN²), where n is the number of examples and N is the number of features. Training the raw data-based SVM is therefore approximately 2000 to 200,000 times more time-consuming than our network-based approach. Nevertheless, we tested the performance of the raw SVM on predicting the high-level phenotypic terms and compared it to the network-based SVM approach. Second, the larger feature space of the raw data-based SVM is poorly suited to our case, where the number of training examples is much smaller than the number of features. Because the number of training examples available to us is limited, the raw SVM is more likely to overfit and generalize poorly. Accordingly, the raw SVM performed significantly worse than our network-based SVM approach in all but one case, most likely due to this curse of dimensionality, as shown below.
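As a rough worked example of the training-time argument, the sketch below assumes that the number of examples n is the same for both approaches, so that the O(nN²) cost ratio reduces to the squared ratio of feature counts. Under that assumption (which the text does not state explicitly), it back-calculates how many network-based features would correspond to the quoted 2000-fold and 200,000-fold differences. The function names are hypothetical.

```python
import math


def cost_ratio(n_raw_features: int, n_network_features: float) -> float:
    """Ratio of O(n * N^2) training costs when n is held fixed."""
    return (n_raw_features / n_network_features) ** 2


def features_for_ratio(n_raw_features: int, ratio: float) -> float:
    """Network feature count that would give the requested cost ratio."""
    return n_raw_features / math.sqrt(ratio)


N_RAW = 13048  # number of raw features reported in the text

# Feature counts implied by the two ends of the quoted range.
low = features_for_ratio(N_RAW, 200_000)   # roughly 29 features
high = features_for_ratio(N_RAW, 2_000)    # roughly 292 features
```

Read this way, the quoted range would be consistent with phenotypes that have on the order of 30 to 300 positive examples, although the actual counts are not given in this section.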
1. Guan Y, Myers CL, Hess DC, Barutcuoglu Z, Caudy AA, et al. (2008) Predicting gene function in a hierarchical context with an ensemble of classifiers. Genome Biol 9 Suppl 1: S3.
[Figure: raw_1_comparison]