A transcription factor affinity-based code for mammalian transcription initiation
Venue
Genome Research, vol. 19 (2009), pp. 644-56
Publication Year
2009
Authors
M Megraw, F Pereira, ST Jensen, U Ohler, AG Hatzigeorgiou
Abstract
The recent arrival of large-scale cap analysis of gene expression (CAGE) data sets
in mammals provides a wealth of quantitative information on coding and noncoding
RNA polymerase II transcription start sites (TSS). Genome-wide CAGE studies reveal
that a large fraction of TSS exhibit peaks where the vast majority of associated
tags map to a particular location (approximately 45%), whereas other active
regions contain a broader distribution of initiation events. The presence of a
strong single peak suggests that transcription at these locations may be mediated
by position-specific sequence features. We therefore propose a new model for
single-peaked TSS based solely on known transcription factors (TFs) and their
respective regions of positional enrichment. This probabilistic model leads to
near-perfect classification results in cross-validation (auROC = 0.98), and
performance in genomic scans demonstrates that TSS prediction with both high
accuracy and spatial resolution is achievable for a specific but large subgroup of
mammalian promoters. The interpretable model structure suggests a DNA code in which
canonical sequence features such as TATA-box, Initiator, and GC content do play a
significant role, but many additional TFs show distinct spatial biases with respect
to TSS location and are important contributors to the accurate prediction of
single-peak transcription initiation sites. The model structure also reveals that
CAGE tag clusters distal from annotated gene starts have distinct characteristics
compared to those close to gene 5'-ends. Using this high-resolution single-peak
model, we predict TSS for approximately 70% of mammalian microRNAs based on
currently available data.
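
For readers unfamiliar with the auROC figure quoted in the abstract, the following is a minimal sketch of how an area under the ROC curve can be computed from classifier scores. It is not the authors' model or evaluation code. The function name `au_roc` and the score lists are hypothetical and invented purely for illustration. The sketch uses the rank interpretation of auROC: the probability that a randomly chosen positive example (here, a true TSS) scores higher than a randomly chosen negative example, with ties counted as one half.

```python
def au_roc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive
    example scores higher; tied pairs contribute 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical scores, NOT data from the paper.
tss_scores = [0.9, 0.8, 0.7, 0.4]         # assumed scores for true TSS
background_scores = [0.6, 0.3, 0.2, 0.1]  # assumed scores for non-TSS sites

print(au_roc(tss_scores, background_scores))  # 15 of 16 pairs ranked correctly: 0.9375
```

A value of 1.0 would mean every true site outscores every background site, and 0.5 corresponds to random ranking, which is why the reported 0.98 is described as near-perfect.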
