Prediction of Poly(A) Sites by Poly(A) Read Mapping

Thomas Bonfert, Caroline C. Friedel, Bin Tian
2017 PLoS ONE  
RNA-seq reads containing part of the poly(A) tail of transcripts (denoted as poly(A) reads) provide the most direct evidence for the position of poly(A) sites in the genome. However, due to reduced coverage of poly(A) tails by reads, poly(A) reads are not routinely identified during RNA-seq mapping. Nevertheless, recent studies for several herpesviruses successfully employed mapping of poly(A) reads to identify herpesvirus poly(A) sites using different strategies and customized programs. To more easily allow such analyses without requiring additional programs, we integrated poly(A) read mapping and prediction of poly(A) sites into our RNA-seq mapping program ContextMap 2. The implemented approach essentially generalizes previously used poly(A) read mapping approaches and combines them with the context-based approach of ContextMap 2 to take into account information provided by other reads aligned to the same location. Poly(A) read mapping using ContextMap 2 was evaluated on real-life data from the ENCODE project and compared against a competing approach based on transcriptome assembly (KLEAT). This showed high positive predictive value for our approach, evidenced also by the presence of poly(A) signals, and considerably lower runtime than KLEAT. Although sensitivity is low for both methods, we show that this is in part due to a high extent of spurious results in the gold standard set derived from RNA-PET data. Sensitivity improves for poly(A) sites of known transcripts or determined with a more specific poly(A) sequencing protocol and increases with read coverage on transcript ends. Finally, we illustrate the usefulness of the approach in a high read coverage scenario by a re-analysis of published data for herpes simplex virus 1. Thus, with current trends towards increasing sequencing depth and read length, poly(A) read mapping will prove to be increasingly useful and can now be performed automatically during RNA-seq mapping with ContextMap 2.
Performance of poly(A) read mapping and identification of poly(A) sites was evaluated on RNA-seq data from the ENCODE project [29] for three cell lines. Predicted poly(A) sites were evaluated against a gold standard set obtained from RNA-PET data, which allows identification of transcript 5' and 3' ends [30]. Our mapping-based approach is furthermore compared against an alternative approach based on transcriptome assembly presented recently (KLEAT [31]).
With default parameters, poly(A) read mapping has a significantly higher positive predictive value (PPV), i.e. a higher fraction of correct predictions, than the assembly-based approach KLEAT, at the cost of lower sensitivity. With alternative parameters, approximately the same sensitivity (and same PPV) can be obtained as for KLEAT. Here, the advantage of our mapping-based approach is the ~3-fold lower runtime compared to KLEAT. While PPV was generally high, sensitivity on all "gold standard" poly(A) sites obtained from RNA-PET was poor for both methods, but improved considerably for poly(A) sites near annotated transcript 3' ends and increased with transcript read coverage. In addition, the frequency of known poly(A) signal sequences within 50 nt upstream of RNA-PET poly(A) sites was lower both than previously reported [32] and than observed for our predictions. Together, these observations suggest that a substantial fraction of the "gold standard" poly(A) sites are actually incorrect and that the sensitivity of poly(A) read mapping is underestimated. Indeed, sensitivity more than tripled if identified poly(A) sites were evaluated on more specific poly(A) sequencing data available for one of the evaluated cell lines [33]. In summary, these results show that poly(A) read mapping can successfully recover poly(A) sites with high precision, in particular if read coverage on the corresponding transcripts is high. While major isoforms of highly expressed genes will always be recovered more confidently, more and more poly(A) sites of lowly expressed and minor isoforms will be detected with increasing sequencing depth. This is further illustrated by a re-analysis of the HSV-1 data where both high PPV and sensitivity are achieved, highlighting the value of poly(A) read mapping for host-pathogen transcriptomics.
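To make the two evaluation measures concrete, the following is a minimal sketch of how PPV and sensitivity could be computed for a set of predicted poly(A) sites against a gold standard. The matching criterion is an assumption made here for illustration only: a site counts as matched if a site on the same chromosome and strand lies within max_dist nucleotides. The function name, the tuple layout and the value of max_dist are hypothetical and are not taken from the evaluation described above.

```python
def ppv_and_sensitivity(predicted, gold, max_dist=10):
    """Return (PPV, sensitivity) for lists of (chrom, strand, pos) sites.

    Assumption (not from the paper): two sites match if they share
    chromosome and strand and lie at most max_dist nucleotides apart.
    """
    def has_match(site, others):
        chrom, strand, pos = site
        return any(
            c == chrom and s == strand and abs(p - pos) <= max_dist
            for c, s, p in others
        )

    # PPV: fraction of predictions supported by a gold standard site.
    correct_predictions = sum(has_match(site, gold) for site in predicted)
    # Sensitivity: fraction of gold standard sites that were recovered.
    recovered_gold = sum(has_match(site, predicted) for site in gold)

    ppv = correct_predictions / len(predicted) if predicted else 0.0
    sensitivity = recovered_gold / len(gold) if gold else 0.0
    return ppv, sensitivity


predicted_sites = [("chr1", "+", 100), ("chr1", "+", 500)]
gold_sites = [
    ("chr1", "+", 105),   # matched by the first prediction
    ("chr1", "-", 500),   # same position but opposite strand: no match
    ("chr2", "+", 900),
    ("chr1", "+", 2000),
]
ppv, sensitivity = ppv_and_sensitivity(predicted_sites, gold_sites)
```

In this toy input one of two predictions is correct and one of four gold standard sites is recovered, which mirrors the situation described above in which spurious gold standard entries lower the measured sensitivity without affecting PPV.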
Thus, by integrating poly(A) read mapping into ContextMap 2, which already supports parallel mapping against both host and pathogen genomes, we additionally extended its suitability for these applications. Moreover, since poly(A) read mapping can now be performed conveniently as part of standard read mapping, without requiring additional software, we expect it to be more commonly applied.
Fig 2. Identification of candidate poly(A) sites. (A) For each alignment, a sliding window of length w_l is shifted along the clipped part of the read sequence and the fraction of A's (or T's, depending on strandedness of sequencing) is calculated within each window. In this example, the fraction is 5/6 = 0.83 for the first two windows and 6/6 = 1 for all subsequent windows. Thus, at least one window has a fraction of A's ≥ c_1 = 1 and none has a fraction < c_2 = 0.7, so this position is used as a candidate poly(A) site. (B) In this example, the clipping length is shorter than w_l. Accordingly, the window approach cannot be used and all clipped nucleotides are required to be A's or T's to predict a candidate poly(A) site, which is the case here. (C) Alignments a_3 and a_4 are considered pairwise overlapping as they are clipped at the same end (dashed lines) and the distance d between the starts of clipping is smaller than the read length.
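The candidate test described in the Fig 2 caption can be sketched as follows. This is an illustrative reimplementation based only on the caption, not the ContextMap 2 source code. The function and parameter names are hypothetical, and the default values (window length 6, thresholds 1 and 0.7) are simply the ones used in the figure's example, not necessarily the program's defaults.

```python
def is_candidate_polya_clip(clipped, window_len=6, c1=1.0, c2=0.7, base="A"):
    """Decide whether the clipped part of a read looks like a poly(A) tail.

    clipped    -- clipped read sequence, starting at the clipping position
    window_len -- sliding window length (6 in the figure's example)
    c1         -- at least one window must reach this fraction of `base`
    c2         -- no window may fall below this fraction of `base`
    base       -- "A" or "T", depending on strandedness of sequencing
    """
    clipped = clipped.upper()
    if not clipped:
        return False

    # Panel (B): clipping shorter than the window, so every clipped
    # nucleotide has to be the expected base.
    if len(clipped) < window_len:
        return all(nt == base for nt in clipped)

    # Panel (A): fraction of `base` in every window along the clipped part.
    fractions = [
        clipped[i:i + window_len].count(base) / window_len
        for i in range(len(clipped) - window_len + 1)
    ]
    return max(fractions) >= c1 and min(fractions) >= c2


def clips_overlap(clip_start_a, clip_start_b, same_end, read_len):
    """Panel (C): two clipped alignments are pairwise overlapping if they
    are clipped at the same end and their clipping starts are closer
    together than the read length."""
    return same_end and abs(clip_start_a - clip_start_b) < read_len


# First two windows contain 5/6 A's, all later windows 6/6, as in panel (A).
panel_a = is_candidate_polya_clip("ACAAAAAAAAA")
# Clipping shorter than the window and consisting of A's only, as in panel (B).
panel_b = is_candidate_polya_clip("AAA")
```

Requiring both thresholds lets a tail tolerate isolated sequencing errors or a short non-A prefix while still rejecting clipped sequences that are merely A-rich in one place.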
doi:10.1371/journal.pone.0170914 pmid:28135292 pmcid:PMC5279776 fatcat:np3cdo6enfftva4suuxlizyg3u