Peer Review #2 of "fluff: exploratory analysis and visualization of high-throughput sequencing data (v0.1)"
[peer_review]
A Lex
2016
unpublished
Here we introduce fluff, a software package that allows for simple exploration, clustering and visualization of high-throughput sequencing experiments. The package contains three command-line tools to generate publication-quality figures in an uncomplicated manner using sensible defaults. Genome-wide data can be aggregated, clustered and visualized in a heatmap, according to different clustering methods. This includes a predefined setting to identify dynamic clusters between different conditions or developmental stages. Alternatively, clustered data can be visualized in a bandplot. Finally, fluff includes a tool to generate genomic profiles. As command-line tools, the fluff programs can easily be integrated into standard analysis pipelines. The installation is straightforward and documentation is available at http://fluff.readthedocs.org. Availability. The fluff tools are implemented in Python and run on Linux. The source code is freely available for download at https://github.com/simonvh/fluff.

PeerJ reviewing PDF

ABSTRACT
Summary: In this application note we describe fluff, a software package that allows for simple exploration, clustering and visualization of high-throughput sequencing data mapped to a reference genome. The package contains three command-line tools to generate publication-quality figures in an uncomplicated manner using sensible defaults. Genome-wide data can be aggregated, clustered and visualized in a heatmap, according to different clustering methods. This includes a predefined setting to identify dynamic clusters between different conditions or developmental stages. Alternatively, clustered data can be visualized in a bandplot. Finally, fluff includes a tool to generate genomic profiles. As command-line tools, the fluff programs can easily be integrated into standard analysis pipelines. The installation is straightforward and documentation is available at http://fluff.readthedocs.org.
Availability: fluff is implemented in Python and runs on Linux. The source code is freely available for download at https://github.com/simonvh/fluff.

... et al., 2010; Akalin et al., 2015), command-line tools (Shen et al., 2014; Giannopoulou and Elemento, 2011), web tools (Ramírez et al., 2014), stand-alone applications (Ramírez et al., 2014; Ye et al., 2011) and tools that depend on other software for visualization (Heinz et al., 2010).
Here, we present fluff, a Python package for visual, reference-based HTS data exploration. It includes command-line applications to both cluster and visualize aggregated signals in genomic regions, as well as to create genome browser-like profiles. The scripts can be included in analysis pipelines and accept commonly used file formats. The fluff applications are pitched at the beginner to intermediate user. They have sensible defaults, yet allow for customizable creation of high-quality, publication-ready figures.

Profiles. Genome browsers are unrivaled for data exploration and visualization in a genomic context. However, it can be useful to create profiles of HTS data in genomic intervals using a consistent command-line tool that can optionally be automated. The fluff profile tool can plot summarized profiles from one or more profiles, together with (gene) annotation from a BED12-formatted file.

Analysis
In short, FASTQ files were downloaded from NCBI GEO (Edgar et al., 2002) and mapped to the human genome (hg19) using bwa (Li and Durbin, 2009). Duplicate reads were marked using bamUtil (http://genome.sph.umich.edu/wiki/BamUtil). All BAM files from replicate experiments were merged. Peaks were called using MACS2 (Zhang et al., 2008) with default settings. See Supplemental Information for specific details and accession numbers.

RESULTS
Demonstrating fluff: dynamic enhancers during macrophage differentiation

Figure 1 (caption, partial): ... 90 percent: light color) of the clusters as identified in Fig. 1A. (C) The H3K27ac ChIP-seq profiles at the CNRIP1 gene locus, which shows a gain of H3K27ac in Mf, LPS-Mf and BG-Mf relative to Mo.

To illustrate the functionality of fluff we visualized previously published ChIP-seq data (Saeed et al., 2014). Here, the epigenomes of human monocytes and in vitro-differentiated naïve, tolerized, and
trained macrophages were analyzed, with the aim of understanding the epigenetic basis of innate immunity. Circulating monocytes (Mo) were differentiated into three macrophage states: to macrophages (Mf), to long-term tolerant cells (LPS-Mf) by exposure to lipopolysaccharide and to trained immune cells (BG-Mf) by priming with β-glucan. We used fluff heatmap to cluster and visualize the signal of histone H3 lysine 27 acetylation (H3K27ac), which is located at active enhancers and promoters (Fig. 1A). The input consisted of a BED file with 7,611 differentially regulated enhancers (Supplemental Table 1) and four BAM files, one each for the monocytes and the three types of macrophages. Using k-means clustering (k = 5) with the Pearson correlation metric, the heatmap recapitulates the H3K27ac dynamics as described (Saeed et al., 2014).

While heatmaps are often used for visualization of signals over genomic features, either clustered or ordered by signal intensity, it can be difficult to distinguish relative levels of individual clusters. Figure 1B shows an alternative visualization of average enrichment profiles in small multiples. The same clusters as in Fig. 1A are plotted using fluff bandplot. Shown are the median (black line), along with the 50th (darker color) and 90th percentile (lighter color) of the data. This allows for more detailed comparisons.

Finally, we illustrate fluff profile, which can visualize one or more genomic regions (Fig. 1C). This figure highlights the CNRIP1 gene from cluster 2, which shows a consistent increase of H3K27ac from Mo to Mf, LPS-Mf and BG-Mf. The signal profiles are directly generated from the BAM files.

Here, we identify two clusters with high enrichment (cluster 3 and cluster 5), a cluster with relatively low, narrow enrichment (cluster 1), and two clusters with broad enhancer domains (clusters 4 and 6).
However, only two strong dynamic clusters are identified: cluster 2, which shows enhancers specifically activated in mesenchymal stem cells, and cluster 7, which shows enhancers specifically activated in trophoblast-like stem cells. Figure 2B shows an alternative clustering approach implemented in fluff heatmap. Here the regions were clustered on the basis of the Pearson correlation of read counts in the center of the region (extended to 2 kb). This shows a completely different picture and we can now identify enhancers specific to H1 ES cells (cluster 5), mesenchymal (cluster 4), mesendoderm (cluster 7), neuronal progenitor (cluster 3) and trophoblast cells (cluster 6). These lineage-specific enhancer dynamics were not visible in the clustering in Figure 2A.

CONCLUSION
The analysis of multi-dimensional genomic data requires methods for data exploration and visualization. We provide fluff, a Python package that contains several command-line tools to generate figures for use in high-throughput sequencing analysis workflows. We aim to fill the gap between powerful, flexible libraries that require programming skills on the one hand, and intuitive, graphical programs with limited customization possibilities on the other hand. These tools were developed based on a need for straight-
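
The Results text mentions k-means clustering with a Pearson correlation metric, and clustering on the Pearson correlation of read counts. The following is a minimal, pure-Python sketch of that general idea, not fluff's actual implementation. The function names (`standardize`, `dot`, `pearson_kmeans`), the deterministic farthest-first initialization, and the toy read counts are all assumptions made for illustration. Each region's profile is centered and scaled to unit length, so that the dot product of two profiles equals their Pearson correlation, and each region is then assigned to the most correlated centroid.

```python
import math


def standardize(values):
    """Center a profile and scale it to unit length (constant profiles become zeros)."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered] if norm > 0 else [0.0] * len(values)


def dot(a, b):
    """Dot product; for standardized profiles this equals the Pearson correlation."""
    return sum(x * y for x, y in zip(a, b))


def pearson_kmeans(rows, k, max_iter=100):
    """Toy k-means that assigns each row to its most correlated centroid.

    Hypothetical helper for illustration only; initialization is a simple
    deterministic farthest-first pick rather than random seeding.
    """
    z = [standardize(r) for r in rows]
    # Start from the first row, then repeatedly add the row that is least
    # correlated with every centroid chosen so far.
    centroids = [z[0]]
    while len(centroids) < k:
        centroids.append(min(z, key=lambda r: max(dot(r, c) for c in centroids)))
    labels = None
    for _ in range(max_iter):
        new_labels = [max(range(k), key=lambda j: dot(r, centroids[j])) for r in z]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(z, labels) if lab == j]
            if members:
                mean_profile = [sum(col) / len(members) for col in zip(*members)]
                centroids[j] = standardize(mean_profile)
    return labels


if __name__ == "__main__":
    # Made-up read counts for four samples per region (not data from the paper).
    regions = [
        [1, 2, 3, 4],      # gain, low signal
        [40, 30, 20, 10],  # loss, high signal
        [10, 20, 30, 40],  # gain, high signal
        [4, 3, 2, 1],      # loss, low signal
    ]
    print(pearson_kmeans(regions, k=2))
```

Because the profiles are standardized first, regions are grouped by the shape of their signal across samples rather than by absolute read depth, which is why a correlation-based metric can reveal dynamic clusters that a magnitude-driven clustering may hide.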
doi:10.7287/peerj.2209v0.1/reviews/2
fatcat:aictcmek3vaztkvr74u2b4u6ku