Modern Data Formats for Big Bioinformatics Data Analytics

Shahzad Ahmed, M. Usman, Javed Ferzund, Muhammad Atif, Abbas Rehman, Atif Mehmood
2017 International Journal of Advanced Computer Science and Applications  
This paper explores the data formats used by different tools and algorithms and also presents modern data formats that are used on Big Data Platform.  ...  It will help researchers and developers in choosing appropriate data format to be used for a particular tool or algorithm.  ...  Some Formats are used for the storage of Bioinformatics data like BAM (Binary Alignment Map), Fastq Format, FASTA Format and VCF (Variant Call Format).  ... 
doi:10.14569/ijacsa.2017.080450 fatcat:4hcci6iunbdnboy5z24m6xzk4q

Communication-Efficient Cluster Scalable Genomics Data Processing Using Apache Arrow Flight [article]

Tanveer Ahmad
2022 bioRxiv   pre-print
As a use case, we use NGS short reads DNA sequencing data for pre-processing and variant calling applications.  ...  Current cluster scaled genomics data processing solutions rely on big data frameworks like Apache Spark, Hadoop and HDFS for data scheduling, processing and storage.  ...  Halvade [7], which uses the Hadoop MapReduce API, while ADAM [12] and SparkGA2 [13] use the Apache Spark framework and HDFS as a distributed file system are few examples of frameworks which use big  ... 
doi:10.1101/2022.04.01.486780 fatcat:6w4yxg3cx5gp7kwxyklxq5jxq4

Benchmarking distributed data warehouse solutions for storing genomic variant information

Marek S. Wiewiórka, Dawid P. Wysakowicz, Michał J. Okoniewski, Tomasz Gambin
2017 Database: The Journal of Biological Databases and Curation  
To investigate the effectiveness of modern columnar storage [column-oriented Database Management System (DBMS)] and query engines, we have developed a prototypic genomic variant data warehouse, populated  ...  with large generated content of genomic variants and phenotypic data.  ...  Supplementary data Conflict of interest. None declared.  ... 
doi:10.1093/database/bax049 pmid:29220442 pmcid:PMC5504537 fatcat:hgwwc2buifbjfj5i77jrxeh6xi

Ameliorating data compression and query performance through cracked Parquet

Patrick Hansert, Sebastian Michel
2022 Proceedings of The International Workshop on Big Data in Emergent Distributed Environments  
The encoding proposed with Dremel has found widespread use in the form of open approaches like Apache Parquet, which can be used with a multitude of storage engines and processing frameworks, like Apache  ...  Using partitioning, we can decrease the number of runs while at the same time using the partitions for data skipping.  ...  Apache Spark [6] is a big data processing framework that has found widespread use.  ... 
doi:10.1145/3530050.3532923 fatcat:z276ztdt2vb4hkt7upsfaboare

An In-depth Investigation of Large-scale RDF Relational Schema Optimizations Using Spark-SQL

Mohamed Ragab, Riccardo Tommasini, Feras M. Awaysheh, Juan Carlos Ramos
2021 International Workshop on Data Warehousing and OLAP  
This paper discusses one of the most significant challenges of large-scale RDF data processing over Apache Spark, the relational schema optimization.  ...  The choice of RDF partitioning techniques and storage formats using SparkSQL significantly impacts query performance.  ...  This call leads the community to leverage Big Data (BD) processing frameworks like Apache Spark [25] to process large RDF datasets [3].  ... 
dblp:conf/dolap/00010AR21 fatcat:fzl5ripxdfbgjmwwspf4n5iuce

Albis: High-Performance File Format for Big Data Systems

Animesh Trivedi, Patrick Stuedi, Jonas Pfefferle, Adrian Schüpbach, Bernard Metzler
2018 USENIX Annual Technical Conference  
As high-performance networking and storage devices are used pervasively to process this data in frameworks like Spark and Hadoop, we observe that none of the popular file formats are capable of delivering  ...  Over the last decade, a variety of external file formats such as Parquet, ORC, Arrow, etc., have been developed to store large volumes of relational data in the cloud.  ...  Apache CarbonData is an indexed columnar data format for fast analytics on big data platforms [7]. It shares similarities with the Arrow/Parquet project.  ... 
dblp:conf/usenix/TrivediSPSM18 fatcat:x7ztrwaybrf2lnnityzd5p6ofu

Swarm: A federated cloud framework for large-scale variant analysis

Amir Bahmani, Kyle Ferriter, Vandhana Krishnan, Arash Alavi, Amir Alavi, Philip S. Tsao, Michael P. Snyder, Cuiping Pan, Mihaela Pertea
2021 PLoS Computational Biology  
We demonstrate its utility via common inquiries of genomic variants across BigQuery in the Google Cloud Platform (GCP), Athena in the Amazon Web Services (AWS), Apache Presto and MySQL.  ...  Compared to single-cloud platforms, the Swarm framework significantly reduced computational costs, run-time delays and risks of security breach and privacy violation.  ...  , runtimes between CSV input versus Parquet input were compared and significant P values were indicated (two sample t-tests).  ... 
doi:10.1371/journal.pcbi.1008977 pmid:33979321 fatcat:s62adyyabffwzkoczg7j36mr4i

Rethinking Data-Intensive Science Using Scalable Analytics Systems

Frank Austin Nothaft, Michael Linderman, Michael J. Franklin, Anthony D. Joseph, David A. Patterson, Matt Massie, Timothy Danford, Zhao Zhang, Uri Laserson, Carl Yeksigian, Jey Kottalam, Arun Ahuja (+1 others)
2015 Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data - SIGMOD '15  
In this paper, we describe ADAM, an example genomics pipeline that leverages the open-source Apache Spark and Parquet systems to achieve a 28× speedup over current genomics pipelines, while reducing cost  ...  From building this system, we were able to distill a set of techniques for implementing scientific analyses efficiently using commodity "big data" systems.  ...  As ADAM is an open source project, we also would like to thank the community members who have contributed code and use cases to the project, and would especially like to thank Neil Ferguson, Andy Petrella  ... 
doi:10.1145/2723372.2742787 dblp:conf/sigmod/NothaftMDZLYKAH15 fatcat:nokfli3y4fe6zi6avrluhncvau

VC@Scale: Scalable and high-performance variant calling on cluster environments

Tanveer Ahmad, Zaid Al Ars, H Peter Hofstee
2021 GigaScience  
Almost all existing cluster-scaled variant-calling workflows that use Apache Spark/Hadoop as big data frameworks loosely integrate existing single-node pre-processing and variant-calling applications.  ...  using the standardized Apache Arrow data representations.  ...  Acknowledgements This work was carried out on the Dutch national e-infrastructure with the support of SURF Cooperative.  ... 
doi:10.1093/gigascience/giab057 pmid:34494101 pmcid:PMC8424057 fatcat:ftxz5aws4zh5tb44rzhxtg2snu


Andrei Costea, Peter Boncz, Adrian Ionescu, Bogdan Răducanu, Michał Switakowski, Cristian Bârca, Juliusz Sompolski, Alicja Łuszczak, Michał Szafrański, Giel de Nijs
2016 Proceedings of the 2016 International Conference on Management of Data - SIGMOD '16  
VectorH achieves fault tolerance and storage scalability by relying on HDFS, and extends the state-of-the-art in SQL-on-Hadoop systems by instrumenting the HDFS replication policy to optimize read locality  ...  We describe the changes made to single-server Vectorwise to turn it into a Hadoop-based MPP system, encompassing workload management, parallel query optimization and execution, HDFS storage, transaction  ...  itself as the software base for a variety of technologies, thanks to its evolution which decoupled the YARN resource manager from MapReduce and the wide adoption of its distributed file system HDFS, which  ... 
doi:10.1145/2882903.2903742 dblp:conf/sigmod/CosteaIRSBSLSNB16 fatcat:2u6x43ugl5ccnaizhgkb6todrm

Next Generation Distributed Computing for Cancer Research

Pankaj Agarwal, Kouros Owzar
2014 Cancer Informatics  
, namely computing, data storage and management, and networking.  ...  tremendous challenges in data management and analysis.  ...  Acknowledgment The authors thank the reviewers for insightful and helpful comments.  ... 
doi:10.4137/cin.s16344 pmid:25983539 pmcid:PMC4412427 fatcat:wpjhyuakejhirekbzouxtjdqnu

SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data

José M. Abuín, Juan C. Pichel, Tomás F. Pena, Jorge Amigo, Ruslan Kalendar
2016 PLoS ONE  
In this work we introduce SparkBWA, a new tool that exploits the capabilities of a big data technology as Spark to boost the performance of one of the most widely adopted aligner, the Burrows-Wheeler Aligner  ...  The design of SparkBWA uses two independent software layers in such a way that no modifications to the original BWA source code are required, which assures its compatibility with any BWA version (future  ...  Acknowledgments This work was supported by Ministerio de Economía y Competitividad (Spain-http://www. grants TIN2013-41129-P and TIN2014-54565-JIN.  ... 
doi:10.1371/journal.pone.0155461 pmid:27182962 pmcid:PMC4868289 fatcat:i3ziuweua5hfvix3dh2wda2frm

High-Efficient Fuzzy Querying with HiveQL for Big Data Warehousing

Bozena Malysiak-Mrozek, Jadwiga Wieszok, Witold Pedrycz, Weiping Ding, Dariusz Mrozek
2021 IEEE transactions on fuzzy systems  
Apache Hive is a data warehousing framework working on top of the Hadoop platform for Big Data processing.  ...  Such extensions make Big Data warehousing more flexible and contribute to the portfolio of tools used by the community of people working with fuzzy sets and data analysis.  ...  Hive queries and analyzes Big Data stored in the Hadoop Distributed File System using SQL-like query language, called HiveQL.  ... 
doi:10.1109/tfuzz.2021.3069332 fatcat:kgktgaza7fdrbb5xxjqahj4qly


Avrilia Floratou, Umar Farooq Minhas, Fatma Özcan
2014 Proceedings of the VLDB Endowment  
Both systems optimize their data ingestion via columnar storage, and promote different file formats: ORC and Parquet.  ...  In this paper, we compare the performance of these two systems by conducting a set of cluster experiments using a TPC-H like benchmark and two TPC-DS inspired workloads.  ...  Impala automatically sets the HDFS block size and the Parquet file size to a maximum of 1 GB. In this way, I/O and network requests apply to a large chunk of data.  ... 
doi:10.14778/2732977.2733002 fatcat:7onmassrafh33dp3kpo4eod2jy

Fusion insight librA

Le Cai, Jacques Hebert, Kamini Jagtiani, Suzhen Lin, Ye Liu, Demai Ni, Chunfeng Pei, Jason Sun, Yongyan Wang, Li Zhang, Mingyi Zhang, Jianjun Chen (+8 others)
2018 Proceedings of the VLDB Endowment  
In particular, we focus on top four requirements from our customers related to data analytics on the cloud: system availability, auto tuning, query over heterogeneous data models on the cloud, and the  ...  It started as a prototype more than five years ago, and is now being used by many enterprise customers over the globe, including some of the world's largest financial institutions.  ...  SQL on HDFS Apache HAWQ [8] was originally developed out of Pivotal Greenplum database [10], a database management system for big data analytics.  ... 
doi:10.14778/3229863.3229870 fatcat:zwxgz2se5fcehmnvyliwmktoey