Sequence Kernels for Predicting Protein Essentiality
Venue
Proceedings of ICML 2008
Publication Year
2008
Authors
Cyril Allauzen, Mehryar Mohri, Ameet Talwalkar
Abstract
The problem of identifying the minimal gene set required to sustain life is of
crucial importance in understanding cellular mechanisms and designing therapeutic
drugs. This work describes several kernel-based solutions for predicting essential
genes that outperform existing models while using less training data. Our first
solution is based on a semi-manually designed kernel derived from the Pfam
database, which includes several Pfam domains. We then present novel and general
domain-based sequence kernels that capture sequence similarity with respect
to several domains made of large sets of protein sequences. We show how to deal
with the large size of the problem (several thousand domains, with individual
domains sometimes containing thousands of sequences) by representing and
efficiently computing these kernels using automata. We report results of extensive
experiments demonstrating that the domain-based kernels compare favorably with the Pfam kernel in
predicting protein essentiality, while requiring no manual tuning.
