
A Formal Semantics For Data Analytics Pipelines

Maurizio Drocco, Claudia Misale, Guy Tremblay, Marco Aldinucci
2017 Zenodo  
In this report, we present a new programming model based on Pipelines and Operators, which are the building blocks of programs written in PiCo, a DSL for Data Analytics Pipelines. In the model we propose, we use the term Pipeline to denote a workflow that processes data collections -- rather than a computational process -- as is common in the data processing community. The novelty with respect to other frameworks is that all PiCo operators are polymorphic with respect to data types. This makes it possible to 1) reuse the same algorithms and pipelines on different data models (e.g., streams, lists, sets, etc.); 2) reuse the same operators in different contexts; and 3) update operators without affecting the calling context, i.e., the previous and following stages in the pipeline. Notice that in other mainstream frameworks, such as Spark, replacing one transformation in a pipeline with another is not necessarily trivial, since it may require the development of an input and output proxy to adapt the new transformation to the calling context. Along the same lines, we provide a formal framework (i.e., typing and semantics) that characterizes programs from the perspective of how they transform the data structures they process -- rather than the computational processes they represent. This approach allows reasoning about programs at an abstract level, without taking into account any aspect of the underlying execution model or implementation.
doi:10.5281/zenodo.571802 fatcat:vqzxnt2lxne55nxzgchyfgipsq
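
The abstract above centres on operators that are polymorphic with respect to the data collection they process. As a rough, hedged illustration of that idea only (this is not PiCo's implementation; the name map_stage and everything else below are invented), here is a generic C++ map stage that is reused unchanged on a list and on a set:

    #include <algorithm>
    #include <iostream>
    #include <iterator>
    #include <list>
    #include <set>
    #include <vector>

    // Generic "map" stage: applies f to every element of any input collection
    // and materializes the results into a vector.
    template <typename Collection, typename F>
    auto map_stage(const Collection& in, F f)
        -> std::vector<decltype(f(*std::begin(in)))> {
        std::vector<decltype(f(*std::begin(in)))> out;
        std::transform(std::begin(in), std::end(in), std::back_inserter(out), f);
        return out;
    }

    int main() {
        auto square = [](int x) { return x * x; };

        std::list<int> l{1, 2, 3};
        std::set<int> s{4, 5, 6};

        // The same stage (and the same user function) is reused on different
        // data models without writing an adapter between pipeline stages.
        for (int v : map_stage(l, square)) std::cout << v << ' ';
        std::cout << '\n';
        for (int v : map_stage(s, square)) std::cout << v << ' ';
        std::cout << '\n';
    }

The point the abstract makes is precisely this: swapping the data model (list, set, stream) does not force changes in the operator or in the surrounding pipeline stages.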

A Dynamic, Hierarchical Resource Model for Converged Computing [article]

Daniel J. Milroy, Claudia Misale, Stephen Herbein, Dong H. Ahn
2021 arXiv   pre-print
arXiv:2109.03739v1 fatcat:zl7z4hjrsvg53ju2ltbnhwul2y

PiCo: A Domain-Specific Language For Data Analytics Pipelines

Claudia Misale, Marco Aldinucci, Guy Tremblay
2017 Zenodo  
Drocco, Claudia Misale, and M. Aldinucci. A cluster-as-accelerator approach for SPMD-free data parallelism. In Proc. of Intl.  ...  IEEE (vi) Claudia Misale. Accelerating bowtie2 with a lock-less concurrency approach and memory affinity. In Proc. of Intl.  ... 
doi:10.5281/zenodo.579753 fatcat:aadje57qh5hk3ijmqn4j7vkhpm

A Formal Semantics for Data Analytics Pipelines [article]

Maurizio Drocco and Claudia Misale and Guy Tremblay and Marco Aldinucci
2017 arXiv   pre-print
In this report, we present a new programming model based on Pipelines and Operators, which are the building blocks of programs written in PiCo, a DSL for Data Analytics Pipelines. In the model we propose, we use the term Pipeline to denote a workflow that processes data collections -- rather than a computational process -- as is common in the data processing community. The novelty with respect to other frameworks is that all PiCo operators are polymorphic with respect to data types. This makes it possible to 1) reuse the same algorithms and pipelines on different data models (e.g., streams, lists, sets, etc.); 2) reuse the same operators in different contexts; and 3) update operators without affecting the calling context, i.e., the previous and following stages in the pipeline. Notice that in other mainstream frameworks, such as Spark, replacing one transformation in a pipeline with another is not necessarily trivial, since it may require the development of an input and output proxy to adapt the new transformation to the calling context. Along the same lines, we provide a formal framework (i.e., typing and semantics) that characterizes programs from the perspective of how they transform the data structures they process -- rather than the computational processes they represent. This approach allows reasoning about programs at an abstract level, without taking into account any aspect of the underlying execution model or implementation.
arXiv:1705.01629v1 fatcat:gk5ga5zidbbnhcaddbpgkq4lza

A Comparison of Big Data Frameworks on a Layered Dataflow Model [article]

Claudia Misale and Maurizio Drocco and Marco Aldinucci and Guy Tremblay
2016 arXiv   pre-print
Misale et al.  ...  operations with their (multiset-based) semantics:

    groupByKey(a) = {(k, {v : (k, v) ∈ a})}                            (1)
    join(a, b)    = {(k, (v_a, v_b)) : (k, v_a) ∈ a ∧ (k, v_b) ∈ b}    (2)
    map_f(a)      = {f(v) : v ∈ a}                                     (3)
arXiv:1606.05293v1 fatcat:l5xkqpqcjbbjfk7aih2t54af6q
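
To make the multiset-based definitions quoted above concrete, here is a small, hedged C++ sketch that evaluates groupByKey, join, and map over multisets represented as vectors of key-value pairs. The representation and the names (KV, groupByKey, join, map_f) are chosen for illustration only and are not taken from any of the compared frameworks:

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>

    using KV = std::pair<std::string, int>;  // one (key, value) element of a multiset

    // groupByKey(a) = {(k, {v : (k, v) in a})}
    std::vector<std::pair<std::string, std::vector<int>>>
    groupByKey(const std::vector<KV>& a) {
        std::vector<std::pair<std::string, std::vector<int>>> out;
        for (const auto& kv : a) {
            auto it = std::find_if(out.begin(), out.end(),
                                   [&](const std::pair<std::string, std::vector<int>>& g) {
                                       return g.first == kv.first;
                                   });
            if (it == out.end()) out.push_back({kv.first, {kv.second}});
            else it->second.push_back(kv.second);
        }
        return out;
    }

    // join(a, b) = {(k, (va, vb)) : (k, va) in a and (k, vb) in b}
    std::vector<std::pair<std::string, std::pair<int, int>>>
    join(const std::vector<KV>& a, const std::vector<KV>& b) {
        std::vector<std::pair<std::string, std::pair<int, int>>> out;
        for (const auto& pa : a)
            for (const auto& pb : b)
                if (pa.first == pb.first)
                    out.push_back({pa.first, {pa.second, pb.second}});
        return out;
    }

    // map_f(a) = {f(v) : v in a}
    template <typename T, typename F>
    auto map_f(const std::vector<T>& a, F f) -> std::vector<decltype(f(a.front()))> {
        std::vector<decltype(f(a.front()))> out;
        for (const auto& v : a) out.push_back(f(v));
        return out;
    }

    int main() {
        std::vector<KV> a{{"x", 1}, {"x", 2}, {"y", 3}};
        std::vector<KV> b{{"x", 10}, {"y", 20}};

        for (const auto& g : groupByKey(a)) {
            std::cout << g.first << ": ";
            for (int v : g.second) std::cout << v << ' ';
            std::cout << '\n';
        }
        for (const auto& j : join(a, b))
            std::cout << j.first << " -> (" << j.second.first << ", " << j.second.second << ")\n";
        for (int v : map_f(std::vector<int>{1, 2, 3}, [](int x) { return x + 1; }))
            std::cout << v << ' ';
        std::cout << '\n';
    }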

PiCo: A Novel Approach to Stream Data Analytics [chapter]

Claudia Misale, Maurizio Drocco, Guy Tremblay, Marco Aldinucci
2018 Lecture Notes in Computer Science  
In this paper, we present a new C++ API with a fluent interface called PiCo (Pipeline Composition). PiCo's programming model aims at making the programming of data analytics applications easier while preserving or enhancing their performance. This is attained through three key design choices: 1) unifying batch and stream data access models, 2) decoupling processing from data layout, and 3) exploiting a stream-oriented, scalable, efficient C++11 runtime system. PiCo proposes a programming model based on pipelines and operators that are polymorphic with respect to data types, in the sense that it is possible to reuse the same algorithms and pipelines on different data models (e.g., streams, lists, sets, etc.). Preliminary results show that PiCo can attain better performance in terms of execution times and greatly improve memory utilization when compared to Spark and Flink in both batch and stream processing.
doi:10.1007/978-3-319-75178-8_10 fatcat:2yaxzubpibhtnlniose365ww6m
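
The abstract above describes a fluent C++ API for composing pipelines. The following is only a hedged sketch of the general fluent-composition style, not the actual PiCo interface: the Pipe class and its map/filter/print methods are invented here for illustration (C++14 for the deduced return type of map):

    #include <iostream>
    #include <utility>
    #include <vector>

    // Toy pipeline holder with a fluent interface.
    template <typename T>
    class Pipe {
    public:
        explicit Pipe(std::vector<T> data) : data_(std::move(data)) {}

        // map: apply a user function to every element, yielding a new pipe.
        template <typename F>
        auto map(F f) const {
            std::vector<decltype(f(std::declval<T>()))> out;
            for (const auto& v : data_) out.push_back(f(v));
            return Pipe<decltype(f(std::declval<T>()))>(std::move(out));
        }

        // filter: keep only the elements satisfying the predicate.
        template <typename P>
        Pipe<T> filter(P pred) const {
            std::vector<T> out;
            for (const auto& v : data_) if (pred(v)) out.push_back(v);
            return Pipe<T>(std::move(out));
        }

        void print() const {
            for (const auto& v : data_) std::cout << v << ' ';
            std::cout << '\n';
        }

    private:
        std::vector<T> data_;
    };

    int main() {
        // Stages read left to right, as in a fluent pipeline-composition API.
        Pipe<int>({1, 2, 3, 4, 5})
            .map([](int x) { return x * x; })
            .filter([](int x) { return x % 2 == 1; })
            .print();
    }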

Sequence Alignment Tools: One Parallel Pattern to Rule Them All?

Claudia Misale, Giulio Ferrero, Massimo Torquati, Marco Aldinucci
2014 BioMed Research International  
In this paper, we advocate a high-level programming methodology for next-generation sequencing (NGS) alignment tools, for both productivity and absolute performance. We analyse the problem of parallel alignment and review the parallelisation strategies of the most popular alignment tools, which can all be abstracted to a single parallel paradigm. We compare these tools to their porting onto the FastFlow pattern-based programming framework, which provides programmers with high-level parallel patterns. By using a high-level approach, programmers are liberated from all complex aspects of parallel programming, such as synchronisation protocols and task scheduling, gaining more opportunities for seamless performance tuning. In this work, we show some use cases in which, by using a high-level approach for parallelising NGS tools, it is possible to obtain comparable or even better absolute performance for all the datasets used.
doi:10.1155/2014/539410 pmid:25147803 pmcid:PMC4131566 fatcat:bgxvxbavxrbx3gkd3xhi53od5a
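
The single parallel paradigm the paper abstracts these aligners to is, at its core, a map over independent reads. Below is a minimal, hedged sketch of that pattern using plain std::async (deliberately not the FastFlow API; align_read is a made-up placeholder standing in for a real alignment kernel):

    #include <future>
    #include <iostream>
    #include <string>
    #include <vector>

    // Placeholder alignment kernel: stands in for a real aligner.
    std::string align_read(const std::string& read) {
        return "aligned(" + read + ")";
    }

    int main() {
        std::vector<std::string> reads{"ACGT", "TTGA", "CCAT", "GGTA"};
        std::vector<std::future<std::string>> results;

        // Each read is an independent task: the data-parallel pattern shared
        // by the reviewed alignment tools.
        for (const auto& r : reads)
            results.push_back(std::async(std::launch::async, align_read, r));

        for (auto& f : results) std::cout << f.get() << '\n';
    }

A pattern-based framework such as FastFlow expresses the same structure with a reusable skeleton (e.g., a task farm), which is what the comparison in the paper builds on.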

A Comparison of Big Data Frameworks on a Layered Dataflow Model

Claudia Misale, Maurizio Drocco, Marco Aldinucci, Guy Tremblay
2017 Parallel Processing Letters  
In the world of Big Data analytics, there is a series of tools aiming at simplifying the programming of applications to be executed on clusters. Although each tool claims to provide better programming, data, and execution models (for which only an informal, and often confusing, semantics is generally provided), all share a common underlying model, namely, the Dataflow model. The model we propose shows how various tools share the same expressiveness at different levels of abstraction. The contribution of this work is twofold: first, we show that the proposed model is (at least) as general as existing batch and streaming frameworks (e.g., Spark, Flink, Storm), thus making it easier to understand high-level data-processing applications written in such frameworks. Second, we provide a layered model that can represent tools and applications following the Dataflow paradigm, and we show how the analyzed tools fit into each level.
doi:10.1142/s0129626417400035 fatcat:bwsjg4qs7rf6jpkvqd5mnablqm

Accelerating Bowtie2 with a lock-less concurrency approach and memory affinity

Claudia Misale
2014 2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing  
The implementation of DNA alignment tools for bioinformatics leads to several problems that affect performance. A single alignment takes an unpredictable amount of time, and different factors can affect performance: for instance, the length of the sequences determines the computational grain of the task, and mismatches or insertions/deletions (indels) increase the time needed to complete an alignment. Moreover, alignment is a strongly memory-bound problem because of its irregular memory access patterns and limitations in memory bandwidth. Over the years, many alignment tools have been implemented. A concrete example is Bowtie2, one of the fastest (concurrent, Pthread-based) state-of-the-art non-GPU-based alignment tools. Bowtie2 exploits concurrency by instantiating a pool of threads, which have access to a global input dataset, share the reference genome, and use separate objects for collecting alignment results. In this paper a modified implementation of Bowtie2 is presented, in which the concurrency structure has been changed. The proposed implementation exploits the task-farm skeleton pattern implemented as a Master-Worker. The Master-Worker pattern makes it possible to delegate dataset reading to the Master thread only, and to make private to each Worker the data structures that are shared in the original version. Only the reference genome is left shared. As a further optimisation, the Master and each Worker are pinned on cores, and the reference genome is allocated interleaved among memory nodes. The proposed implementation is able to gain up to 10 speedup points over the original implementation.
doi:10.1109/pdp.2014.50 dblp:conf/pdp/Misale14 fatcat:6k225zt4hza7fn44fr6phsqvjq
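
As a hedged sketch of the Master-Worker restructuring described above (this is not the modified Bowtie2 code; queue, dataset, and "aligned:" results are all illustrative), the Master below is the only thread that reads the input, each Worker keeps a private result buffer, and core pinning and NUMA-interleaved allocation of the reference are only noted in comments:

    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>
    #include <vector>

    // Simple task queue shared between the Master and the Workers.
    struct TaskQueue {
        std::queue<std::string> q;
        std::mutex m;
        std::condition_variable cv;
        bool done = false;

        void push(std::string r) {
            { std::lock_guard<std::mutex> lk(m); q.push(std::move(r)); }
            cv.notify_one();
        }
        bool pop(std::string& r) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [this] { return !q.empty() || done; });
            if (q.empty()) return false;
            r = std::move(q.front());
            q.pop();
            return true;
        }
        void close() {
            { std::lock_guard<std::mutex> lk(m); done = true; }
            cv.notify_all();
        }
    };

    int main() {
        TaskQueue queue;
        const unsigned nworkers = 4;
        std::vector<std::thread> workers;
        std::vector<std::vector<std::string>> results(nworkers);  // private per Worker

        for (unsigned w = 0; w < nworkers; ++w)
            workers.emplace_back([&, w] {
                // In the paper each Worker is additionally pinned to a core, and the
                // shared reference genome is allocated interleaved across NUMA nodes;
                // both optimisations are omitted in this sketch.
                std::string read;
                while (queue.pop(read))
                    results[w].push_back("aligned:" + read);  // placeholder for alignment
            });

        // Master: the only thread that reads the input dataset.
        std::vector<std::string> dataset{"ACGT", "TTGA", "CCAT", "GGTA", "ACCA"};
        for (const auto& read : dataset) queue.push(read);
        queue.close();

        for (auto& t : workers) t.join();
        for (unsigned w = 0; w < nworkers; ++w)
            std::cout << "worker " << w << " aligned " << results[w].size() << " reads\n";
    }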

Memory-Optimised Parallel Processing of Hi-C Data

Maurizio Drocco, Claudia Misale, Guilherme Peretti Pezzi, Fabio Tordini, Marco Aldinucci
2015 2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing  
This paper presents the optimisation efforts on the creation of a graph-based mapping representation of gene adjacency. The method is based on the Hi-C process, starting from Next Generation Sequencing data, and it analyses a huge amount of static data in order to produce maps for one or more genes. Straightforward parallelisation of this scheme does not yield acceptable performance on multicore architectures, since scalability is rather limited due to the memory-bound nature of the problem. This work focuses on the memory optimisations that can be applied to the graph construction algorithm and its (complex) data structures to derive a cache-oblivious algorithm and, eventually, to improve memory bandwidth utilisation. We used as a running example NuChart-II, a tool for annotation and statistical analysis of Hi-C data that creates a gene-centric neighborhood graph. The proposed approach, which is exemplified for Hi-C, addresses several common issues in the parallelisation of memory-bound algorithms for multicore. Results show that the proposed approach is able to increase the parallel speedup from 7x to 22x (on a 32-core platform). Finally, the proposed C++ implementation outperforms the first R NuChart prototype, with which it was not possible to complete the graph generation because of severe memory-saturation problems.
doi:10.1109/pdp.2015.63 dblp:conf/pdp/DroccoMPTA15 fatcat:5ufzw36ex5f6nikikagq3pbe64

A Cluster-as-Accelerator Approach for SPMD-Free Data Parallelism

Maurizio Drocco, Claudia Misale, Marco Aldinucci
2016 2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP)  
In this paper we present a novel approach for functional-style programming of distributed-memory clusters, targeting data-centric applications. The programming model proposed is purely sequential, SPMD-free, and based on high-level functional features introduced since the C++11 specification. Additionally, we propose a novel cluster-as-accelerator design principle. In this scheme, cluster nodes act as general interpreters of user-defined functional tasks over node-local portions of distributed data structures. We envision coupling a simple yet powerful programming model with a lightweight, locality-aware distributed runtime as a promising step along the road towards high-performance data analytics, in particular under the perspective of the upcoming exascale era. We implemented the proposed approach in SkeDaTo, a prototyping C++ library of data-parallel skeletons exploiting cluster-as-accelerator at the bottom layer of the runtime software stack.
doi:10.1109/pdp.2016.97 dblp:conf/pdp/DroccoMA16 fatcat:bbbvx77tcbhhbhrtdtppjltyfy
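
A hedged sketch of the user-facing idea described above: the user writes one purely sequential call, and the runtime applies the user-defined C++11 lambda to each node-local partition of a distributed structure. This is not the SkeDaTo API; the names (DistVector, dmap) are invented, and threads stand in for cluster nodes acting as interpreters of the functional task:

    #include <functional>
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    // A toy "distributed" vector: one partition per (simulated) node.
    using Partition = std::vector<int>;
    using DistVector = std::vector<Partition>;

    // dmap: the only primitive the user sees; no SPMD code is written.
    void dmap(DistVector& dv, const std::function<void(Partition&)>& task) {
        std::vector<std::thread> nodes;
        for (auto& part : dv)
            nodes.emplace_back([&part, &task] { task(part); });  // node-local work
        for (auto& t : nodes) t.join();
    }

    int main() {
        DistVector dv{{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};

        // Purely sequential user view: ship the functional task to the data.
        dmap(dv, [](Partition& p) {
            for (int& x : p) x *= 10;
        });

        for (const auto& p : dv)
            std::cout << std::accumulate(p.begin(), p.end(), 0) << '\n';
    }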

Exercising High-Level Parallel Programming on Streams: A Systems Biology Use Case

Marco Aldinucci, Maurizio Drocco, Guilherme Peretti Pezzi, Claudia Misale, Fabio Tordini, Massimo Torquati
2014 2014 IEEE 34th International Conference on Distributed Computing Systems Workshops  
The stochastic modelling of biological systems, coupled with Monte Carlo simulation of models, is an increasingly popular technique in Bioinformatics. The simulation-analysis workflow may result in a computationally expensive task, reducing the interactivity required in model tuning. In this work, we advocate high-level software design as a vehicle for building efficient and portable parallel simulators for a variety of platforms, ranging from multi-core platforms to GPGPUs to the cloud. In particular, the Calculus of Wrapped Compartments (CWC) parallel simulator for systems biology, equipped with online mining of results and designed according to the FastFlow pattern-based approach, is discussed as a running example. In this work, the CWC simulator is used as a paradigmatic example of a complex C++ application where the quality of results is correlated with both computation and I/O bounds, and where high-quality results might turn into big data. The FastFlow parallel programming framework, which advocates C++ pattern-based parallel programming, makes it possible to develop portable parallel code without relinquishing either run-time efficiency or performance tuning opportunities. Performance and effectiveness of the approach are validated on a variety of platforms, inter alia cache-coherent multi-cores, clusters of multi-cores (Ethernet and InfiniBand), and the Amazon Elastic Compute Cloud.
doi:10.1109/icdcsw.2014.38 dblp:conf/icdcsw/AldinucciDPMTT14 fatcat:xxet3emit5catkev7xcunsq2e4

Towards Memory-Optimized Data Shuffling Patterns for Big Data Analytics

Bogdan Nicolae, Carlos Costa, Claudia Misale, Kostas Katrinis, Yoonho Park
2016 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)  
Big data analytics is an indispensable tool in transforming science, engineering, medicine, healthcare, finance and ultimately business itself. With the explosion of data sizes and the need for shorter time-to-solution, in-memory platforms such as Apache Spark gain increasing popularity. However, this introduces important challenges, among which data shuffling is particularly difficult: on one hand it is a key part of the computation that has a major impact on the overall performance, and so its efficiency is paramount, while on the other hand it needs to operate with scarce memory in order to leave as much memory available for data caching. In this context, efficient scheduling of data transfers such that it addresses both dimensions of the problem simultaneously is non-trivial. State-of-the-art solutions often rely on simple approaches that yield sub-optimal performance and resource usage. This paper contributes a novel shuffle data transfer strategy that dynamically adapts to the computation with minimal memory utilization, which we briefly underline as a series of design principles.
doi:10.1109/ccgrid.2016.85 dblp:conf/ccgrid/NicolaeCMKP16 fatcat:deohde667raxpixgafx4jtpxy4

Parallel Exploration of the Nuclear Chromosome Conformation with NuChart-II

Fabio Tordini, Maurizio Drocco, Claudia Misale, Luciano Milanesi, Pietro Lio, Ivan Merelli, Marco Aldinucci
2015 2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing  
High-throughput molecular biology techniques are widely used to identify physical interactions between genetic elements located throughout the human genome. Chromosome Conformation Capture (3C) and other related techniques make it possible to investigate the spatial organisation of chromosomes in the cell's natural state. Recent results have shown that there is a large correlation between co-localization and co-regulation of genes, but this important information is hampered by the lack of user-friendly analysis and visualisation software. In this work we introduce NuChart-II, a tool for Hi-C data analysis that provides a gene-centric view of the chromosomal neighbourhood in a graph-based manner. NuChart-II is an efficient and highly optimized C++ re-implementation of a previous prototype package developed in R. Representing Hi-C data using a graph-based approach overcomes the common view relying on genomic coordinates and permits the use of graph analysis techniques to explore the spatial conformation of a gene neighbourhood.
doi:10.1109/pdp.2015.104 dblp:conf/pdp/TordiniDMMLMA15 fatcat:rvw7lu5gvjh35mhgzn2cfji6za
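
To illustrate the gene-centric, graph-based view the abstract describes, here is a small hedged sketch (not NuChart-II code; the gene symbols and contacts are made up): genes are vertices, Hi-C contacts are edges, and a gene's neighbourhood is extracted with a depth-bounded breadth-first search.

    #include <iostream>
    #include <map>
    #include <queue>
    #include <set>
    #include <string>
    #include <utility>
    #include <vector>

    using Graph = std::map<std::string, std::vector<std::string>>;

    // Collect all genes within `depth` hops of `root` in the contact graph.
    std::set<std::string> neighbourhood(const Graph& g, const std::string& root, int depth) {
        std::set<std::string> seen{root};
        std::queue<std::pair<std::string, int>> frontier;
        frontier.push({root, 0});
        while (!frontier.empty()) {
            std::pair<std::string, int> cur = frontier.front();
            frontier.pop();
            if (cur.second == depth) continue;      // do not expand past the bound
            auto it = g.find(cur.first);
            if (it == g.end()) continue;
            for (const auto& next : it->second)
                if (seen.insert(next).second)       // first time this gene is reached
                    frontier.push({next, cur.second + 1});
        }
        return seen;
    }

    int main() {
        // Toy contact graph (illustrative gene symbols, not real Hi-C data).
        Graph hic{{"TP53", {"BRCA1", "MYC"}}, {"BRCA1", {"EGFR"}}, {"MYC", {}}};
        for (const auto& gene : neighbourhood(hic, "TP53", 2))
            std::cout << gene << '\n';
    }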

Leveraging Adaptive I/O to Optimize Collective Data Shuffling Patterns for Big Data Analytics

Bogdan Nicolae, Carlos H. A. Costa, Claudia Misale, Kostas Katrinis, Yoonho Park
2017 IEEE Transactions on Parallel and Distributed Systems  
Claudia Misale is a PhD candidate at the Computer Science Department of the University of Torino and a member of the parallel computing Alpha group.  ... 
doi:10.1109/tpds.2016.2627558 fatcat:ksejutgfbvet3g23mr7ru7hmxq
Showing results 1 — 15 out of 134 results