TY  - JOUR
T1  - Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features
A1  - Ghandi, Mahmoud
A1  - Lee, Dongwon
A1  - Mohammad-Noori, Morteza
A1  - Beer, Michael A.
Y1  - 2014/07/17
N2  - Author Summary Genomic regulatory elements (enhancers, promoters, and insulators) control the expression of their target genes and are widely believed to play a key role in human development and disease by altering protein concentrations. A fundamental step in understanding enhancers is the development of DNA sequence-based models to predict the tissue specific activity of regulatory elements. Such models facilitate both the identification of the molecular pathways which impinge on enhancer activity through direct transcription factor binding, and the direct evaluation of the impact of specific common or rare genetic variants on enhancer function. We have previously developed a successful sequence-based model for enhancer prediction using a k-mer support vector machine (kmer-SVM). Here, we address a significant limitation of the kmer-SVM approach and present an alternative method using gapped k-mers (gkm-SVM) which exhibits dramatically improved accuracy in all test cases. While we focus on enhancers and transcription factor binding, our method can be applied to improve a much broader class of sequence analysis problems, including proteins and RNA. In addition, we expect that most k-mer based methods can be significantly improved by simply using the generalized k-mer count method that we present in this paper. We believe this improved model will enable significant contributions to our understanding of the human regulatory system.
JF  - PLOS Computational Biology
JA  - PLOS Computational Biology
VL  - 10
IS  - 7
UR  - https://doi.org/10.1371/journal.pcbi.1003711
SP  - e1003711
EP  - 
PB  - Public Library of Science
M3  - doi:10.1371/journal.pcbi.1003711
ER  - 