Detection and classification of peaks in 5' cap RNA sequencing data
Journal article   Open access   Peer reviewed

D. Strbenac, N.J. Armstrong and J.Y.H. Yang
BMC Genomics, Vol.14(Suppl 5), S9
2013
Published (Version of Record) Open Access

Abstract

Background: The large-scale sequencing of 5' cap enriched cDNA promises to reveal the diversity of transcription initiation across entire genomes. The process of transcription is noisy, and there is often no single, exact start site. This creates the need for a fast and simple method of identifying transcription start peaks based on this type of data. Due to both biological and technical noise, many of the peaks seen are not real transcription initiation events. Classification of the observed peaks is an essential filtering step in the discovery of genuine initiation locations.

Results: We develop a two-stage approach consisting of a fast and simple algorithm based on a sliding window with Poisson null distribution for detecting the genomic locations of peaks, followed by a linear support vector machine classifier to distinguish between peaks which represent the initiation of transcription and peaks that do not. Comparison of classification performance to the best existing method based on whole genome segmentation showed comparable precision and improved recall. Internal features, which are intrinsic to the data and require no further experiments, had high precision and recall rates. Addition of pooled external data or matched RNA sequencing data resulted in gains of recall with equivalent precision.

Conclusions: The Poisson sliding window model is an effective and fast way of taking the peak neighbourhood into account, and finding statistically significant peaks over a range of transcript expression values. It is orders of magnitude faster than doing whole genome segmentation. The support vector classification scheme has better precision and recall than existing methods. Integrating additional datasets is shown to provide minor gains in recall, in comparison to using only the cap-sequencing data.
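
The first stage described in the abstract, a sliding window with a Poisson null distribution, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `poisson_sf` and `detect_peaks`, the parameters `half_window` and `alpha`, and the choice to estimate the local rate as the mean read count in a symmetric window are all assumptions made for illustration. The abstract does not specify the window size, the rate estimate, or the significance threshold.

```python
import math


def poisson_sf(k, lam):
    """Upper tail P(X >= k) for X ~ Poisson(lam), by direct summation."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):  # accumulate P(X = 0) .. P(X = k - 1)
        cdf += term
        term *= lam / (i + 1)
    return max(0.0, 1.0 - cdf)


def detect_peaks(counts, half_window=50, alpha=1e-3):
    """Flag positions whose read count is unlikely under a local Poisson null.

    counts      -- per-base counts of 5' read ends (hypothetical input format)
    half_window -- bases on each side used to estimate the local rate (assumed)
    alpha       -- p-value cutoff (assumed; no multiple-testing correction here)
    """
    n = len(counts)
    # Prefix sums give each window total in O(1), so the scan is O(n).
    prefix = [0]
    for c in counts:
        prefix.append(prefix[-1] + c)

    peaks = []
    for i, c in enumerate(counts):
        if c == 0:
            continue
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        # Local rate: mean count per base in the window, including position i.
        lam = (prefix[hi] - prefix[lo]) / (hi - lo)
        p = poisson_sf(c, lam)
        if p < alpha:
            peaks.append((i, c, p))
    return peaks
```

Because the rate is estimated from the neighbourhood of each position, the same count can be significant in a quiet region and unremarkable in a highly expressed one, which is the sense in which a window-based null takes the peak neighbourhood into account. Including the candidate position in its own window makes this sketch somewhat conservative. For production use, a library routine for the Poisson tail would be numerically safer than the direct summation shown here.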

Details

UN Sustainable Development Goals (SDGs)

This output has contributed to the advancement of the following goals:

#3 Good Health and Well-Being

Source: InCites

InCites Highlights

These are selected metrics from InCites Benchmarking & Analytics tool, related to this output

Citation topics
1 Clinical & Life Sciences
1.54 Molecular & Cell Biology - Genetics
1.54.100 Epigenetic Regulation
Web Of Science research areas
Biotechnology & Applied Microbiology
Genetics & Heredity
ESI research areas
Molecular Biology & Genetics