Data-intensive science offers new opportunities for innovation and discovery, provided that large datasets can be handled efficiently. Data management for data-intensive science applications is challenging, requiring support for complex data life cycles, coordination across multiple sites, fault tolerance, and scalability to tens of sites and petabytes of data. In this paper, we argue that data management for data-intensive science applications requires a fundamentally different management approach than the current ad-hoc, task-centric one. We propose Active Data, a novel paradigm for data life cycle management. Active Data follows two principles: data-centric and event-driven. We report on the Active Data programming model and its preliminary implementation, and discuss the benefits and limitations of the approach on recognized challenging data-intensive science use cases.
doi:10.1145/2538542.2538566 dblp:conf/sc/SimonetFRA13
Multiple threats have been identified when citizens interact with online services: unknown provenance of information, unknown quality of service providers, the spread of fake news, fraud, personal privacy violations, and centralisation of power, to name a few. Blockchain has been considered a key technology to address many of these challenges; in reality, however, building trustworthy decentralized applications (Dapps) is not straightforward, as much blockchain-based functionality needs to be developed from scratch and combined with data semantics. In this paper, we propose a new software framework, namely ONTOCHAIN, that leverages semantic web and blockchain technology to build, as distinct value for the Next Generation Internet, fundamental support for trustworthy data/services exchange and trustworthy content handling. It comprises a novel protocol suite grouped into high-level application protocols, such as data provenance, reputation models, decentralised oracles, market mechanisms, ontology representation and management, privacy-aware and secure data exchange, multi-source identity verification, and value sharing and incentives, and lower-level core protocols that include authorisation, certification, privacy-aware data processing, cross-chain gateways, identity management, secure data exchange, and data semantics in smart contracts. We demonstrate that these protocols are already available and can be combined to implement interesting NGI Dapps.
doi:10.5281/zenodo.6811328
To achieve near-time insights, scientific workflows tend to be organized in a flexible and dynamic way. Data-driven triggering of tasks has been explored as a way to support workflows that evolve based on the data. However, the overhead introduced by such dynamic triggering of tasks is an under-studied topic. This paper discusses different facets of dynamic task triggers. In particular, we explore different ways of constructing a data-driven dynamic workflow and then evaluate the overheads introduced by such design decisions. We evaluate workflows with varying data size, percentage of interesting data, temporal data distribution, and number of tasks triggered. Finally, based on an analysis of the evaluation results, we provide advice for users looking to construct data-driven scientific workflows.
arXiv:2004.10381v1
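The core mechanism the abstract describes, firing a task only when incoming data satisfies a predicate, can be illustrated with a minimal sketch. All names here are hypothetical and the real workflow system in the paper is far more elaborate; this toy only shows where the per-record trigger check (and hence its overhead) sits.

```python
# Hypothetical sketch of data-driven task triggering: a monitor scans
# incoming records and fires a task only for "interesting" ones.
from typing import Callable, Iterable

def run_triggered(records: Iterable[dict],
                  is_interesting: Callable[[dict], bool],
                  task: Callable[[dict], None]) -> int:
    """Fire `task` for each interesting record; return the trigger count."""
    fired = 0
    for rec in records:
        if is_interesting(rec):   # per-record predicate = triggering overhead
            task(rec)
            fired += 1
    return fired

# Toy run: 10% of the records are "interesting" and trigger processing.
records = [{"id": i, "value": i % 10} for i in range(100)]
results = []
n = run_triggered(records, lambda r: r["value"] == 9, results.append)
print(n)  # 10 of the 100 records triggered a task
```

Varying the fraction of interesting data and the cost of the predicate in such a loop is one way to expose the trade-offs the paper evaluates.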
Academic and industry experts are now advocating a move from large, centralized Cloud Computing infrastructures to smaller ones massively distributed at the edge of the network. Among the obstacles to the adoption of this model is the development of a convenient and powerful IaaS system capable of managing a significant number of remote data centers in a unified way. In this paper, we introduce the premises of such a system by revising the OpenStack software, a leading IaaS manager in the industry. The novelty of our solution is to operate such an Internet-scale IaaS platform in a fully decentralized manner, using P2P mechanisms to achieve high flexibility and avoid single points of failure. More precisely, we describe how we revised the OpenStack Nova service by leveraging a distributed key/value store instead of the centralized SQL backend. We present experiments that validate the correct behavior and give performance trends for our prototype through an emulation of several data centers using the Grid'5000 testbed. In addition to paving the way to the first large-scale and Internet-wide IaaS manager, we expect this work to attract a community of specialists from both the distributed systems and networking areas to address the Fog/Edge Computing challenges within the OpenStack ecosystem.
doi:10.1109/ic2e.2017.35 dblp:conf/ic2e/LebrePSD17
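The key idea, replacing a centralized SQL backend with a distributed key/value store, can be sketched in a few lines. This is not OpenStack code: the key scheme, class, and method names below are assumptions made for illustration, with a plain dict standing in for the distributed store.

```python
# Minimal sketch of moving an "instances" table into a key/value layout,
# in the spirit of the Nova revision described above. Hypothetical API.
class KVInstanceStore:
    def __init__(self):
        self.kv = {}  # stand-in for a distributed key/value store

    def put_instance(self, site: str, uuid: str, record: dict) -> None:
        # Encode the "WHERE site = ?" dimension directly into the key.
        self.kv[f"instance/{site}/{uuid}"] = record

    def get_instance(self, site: str, uuid: str) -> dict:
        return self.kv[f"instance/{site}/{uuid}"]

    def list_site(self, site: str) -> list:
        # A prefix scan replaces the SQL SELECT over one data center.
        prefix = f"instance/{site}/"
        return [v for k, v in self.kv.items() if k.startswith(prefix)]

store = KVInstanceStore()
store.put_instance("dc1", "vm-1", {"state": "ACTIVE"})
store.put_instance("dc1", "vm-2", {"state": "BUILD"})
store.put_instance("dc2", "vm-3", {"state": "ACTIVE"})
print(len(store.list_site("dc1")))  # 2 instances at site dc1
```

Because lookups and prefix scans need no central coordinator, each site can serve its own slice of the keyspace, which is what removes the single point of failure.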
Academics and industry experts are now advocating a move from large, centralized Cloud Computing (CC) infrastructures to smaller ones massively distributed at the edge of the network. Referred to as "fog/edge/local computing", this dawning paradigm is attracting growing interest as it improves overall service agility in addition to bringing computing resources closer to end users. While several initiatives investigate how such Distributed Cloud Computing (DCC) infrastructures can be operated, the economic viability of such solutions is still questionable, especially if the objective is to propose prices that are attractive compared to those of giant actors such as Amazon, Microsoft, and Google. In this article, we go beyond the state of the art of current cost models for DCC infrastructures. First, we provide a classification of the different ways of deploying DCC platforms. Then, we propose a versatile cost model that can help new actors evaluate the viability of deploying a DCC solution. We illustrate the relevance of our proposal by instantiating it over three use cases and comparing them against similar computation capabilities provided by the AWS offering. This study clearly shows that deploying a DCC infrastructure makes sense for telecom operators as well as for new actors willing to enter the game.
doi:10.1109/ic2ew.2016.48 dblp:conf/ic2e/SimonetLO16
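The general shape of such a cost model, amortized hardware plus recurring operations, scaled by the number of sites, can be illustrated with a toy formula. The function and every number below are placeholders for illustration only; they are not the paper's model or figures.

```python
# Illustrative per-site cost computation for a distributed cloud.
# Formula and all numbers are invented placeholders, not the paper's model.
def dcc_monthly_cost(sites: int, capex_per_site: float,
                     amortization_months: int,
                     opex_per_site_month: float) -> float:
    """Monthly cost: amortized hardware plus operations, summed over sites."""
    return sites * (capex_per_site / amortization_months + opex_per_site_month)

cost = dcc_monthly_cost(sites=10, capex_per_site=30000.0,
                        amortization_months=36, opex_per_site_month=500.0)
print(round(cost, 2))  # 10 * (833.33 + 500.00) -> 13333.33
```

Instantiating such a function with the actual parameters of each deployment scenario is what allows the side-by-side comparison against a public-cloud price list.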
Highlights
• We present a formal model to represent the life cycle of data distributed and replicated across many systems.
• We leverage this model to propose a programming model that allows users to react to life cycle progression.
• We illustrate the approach with examples of applications that we programmed with this model.

Abstract
The Big Data challenge consists in managing, storing, analyzing and visualizing huge and ever-growing data sets to extract sense and knowledge. As the volume of data grows exponentially, the management of these data becomes more complex in proportion. A key point is to handle the complexity of the data life cycle, i.e. the various operations performed on data: transfer, archiving, replication, deletion, etc. Indeed, data-intensive applications span a large variety of devices and e-infrastructures, which implies that many systems are involved in data management and processing. We propose Active Data, a programming model to automate and improve the expressiveness of data management applications. We first define the concept of data life cycle and introduce a formal model that exposes the data life cycle across heterogeneous systems and infrastructures. The Active Data programming model allows code execution at each stage of the data life cycle: routines provided by programmers are executed when a set of events (creation, replication, transfer, deletion) happen to any data item. We implement and evaluate the model with four use cases: a storage cache for Amazon S3, a cooperative sensor network, an incremental implementation of the MapReduce programming model, and automated data provenance tracking across heterogeneous systems. Altogether, these scenarios illustrate the adequacy of the model for programming applications that manage distributed and dynamic data sets. We also show that applications that do not leverage the data life cycle can still benefit from Active Data to improve their performance.
doi:10.1016/j.future.2015.05.015
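The programming model described above, user routines executed when life cycle events happen to a data item, boils down to a publish/subscribe pattern. The sketch below is a deliberately tiny, hypothetical rendering of that idea; the actual Active Data system models life cycles formally across heterogeneous systems, which this toy does not attempt.

```python
# Toy event-driven data life cycle in the spirit of Active Data:
# routines subscribe to transitions and run when an event is published.
from collections import defaultdict

class LifeCycle:
    EVENTS = {"created", "replicated", "transferred", "deleted"}

    def __init__(self):
        self.handlers = defaultdict(list)

    def on(self, event: str, routine) -> None:
        """Register a user routine for a life cycle event."""
        assert event in self.EVENTS, f"unknown event: {event}"
        self.handlers[event].append(routine)

    def publish(self, event: str, data_id: str) -> None:
        """Run every routine subscribed to this event, in order."""
        for routine in self.handlers[event]:
            routine(data_id)

log = []
lc = LifeCycle()
lc.on("created", lambda d: log.append(f"index {d}"))
lc.on("deleted", lambda d: log.append(f"purge cache for {d}"))

lc.publish("created", "dataset-42")
lc.publish("deleted", "dataset-42")
print(log)  # ['index dataset-42', 'purge cache for dataset-42']
```

A storage cache, for example, would subscribe a purge routine to deletion events rather than polling the backing store, which is the gain in expressiveness the abstract claims.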
Modern scientific experiments often involve multiple storage and computing platforms, software tools, and analysis scripts. The resulting heterogeneous environments make data management operations challenging; the significant number of events and the absence of data integration make it difficult to track data provenance, manage sophisticated analysis processes, and recover from unexpected situations. Current approaches often require costly human intervention and are inherently error-prone. The difficulties inherent in managing and manipulating such large and highly distributed datasets also limit automated sharing and collaboration. We study a real-world e-Science application involving terabytes of data, using three different analysis and storage platforms and a number of applications and analysis processes. We demonstrate that using a specialized data life cycle and programming model, Active Data, we can easily implement global progress monitoring and sharing, recover from unexpected events, and automate a range of tasks.
doi:10.1109/pdp.2015.76 dblp:conf/pdp/SimonetCFF15
By massively adopting OpenStack for operating small to large private and public clouds, the industry has made it one of the largest active software projects, outgrowing the Linux kernel. However, with success comes increased complexity; facing technical and scientific challenges, developers are in great difficulty when testing the impact of individual changes on the performance of such a large codebase, which is likely to slow down the evolution of OpenStack. Thus, we claim it is now time for the scientific community to join the effort and get involved in the development of OpenStack, as was once done for Linux. In this spirit, we developed Enos, an integrated framework that relies on container technologies to deploy and evaluate OpenStack on any testbed. Enos allows researchers to easily express different configurations, enabling fine-grained investigations of OpenStack services. Enos collects performance metrics at runtime and stores them for post-mortem analysis and sharing. The relevance of the Enos approach for reproducible research is illustrated by evaluating different OpenStack scenarios on the Grid'5000 testbed.
doi:10.1109/ccgrid.2017.87 dblp:conf/ccgrid/CherrueauPSLS17
doi:10.1109/dsdis.2015.58 dblp:conf/dsdis/DesprezILOPS15 fatcat:kpc2x4zl7zdxrga4bnb3tzxmxm
Several metabolic enzymes undergo reversible polymerization into macromolecular assemblies. The function of these assemblies is often unclear, but in some cases they regulate enzyme activity and metabolic homeostasis. The guanine nucleotide biosynthetic enzyme inosine monophosphate dehydrogenase (IMPDH) forms octamers that polymerize into helical chains. In mammalian cells, IMPDH filaments can associate into micron-length assemblies. Polymerization and enzyme activity are regulated in part by binding of purine nucleotides to an allosteric regulatory domain. ATP promotes octamer polymerization, whereas guanosine triphosphate (GTP) promotes a compact, inactive conformation whose ability to polymerize is unknown. Also unclear is whether polymerization directly alters IMPDH catalytic activity. To address this, we identified point mutants of human IMPDH2 that either prevent or promote polymerization. Unexpectedly, we found that polymerized and non-assembled forms of recombinant IMPDH have comparable catalytic activity, substrate affinity, and GTP sensitivity, and we validated this finding in cells. Electron microscopy revealed that substrates and allosteric nucleotides shift the equilibrium between active and inactive conformations in both the octamer and the filament. Unlike other metabolic filaments, which selectively stabilize active or inactive conformations, recombinant IMPDH filaments accommodate multiple states. These conformational states are finely tuned by substrate availability and purine balance, while polymerization may allow cooperative transitions between states.
doi:10.1091/mbc.e17-04-0263 pmid:28794265 pmcid:PMC5620369
Since its introduction in 2004 by Google, MapReduce has become the programming model of choice for processing large data sets. Although MapReduce was originally developed for use by web enterprises in large data centers, this technique has gained a lot of attention from the scientific community for its applicability to large parallel data analyses (in geography, high-energy physics, genomics, etc.). So far, MapReduce has mostly been designed for batch processing of bulk data. The goal of D3-MapReduce is to extend the MapReduce programming model and propose an efficient implementation of this model to: i) cope with distributed data sets, i.e. data that span multiple distributed infrastructures or are stored on networks of loosely connected devices; and ii) cope with dynamic data sets, i.e. data that change over time or can be incomplete or only partially available. In this paper, we draw the path towards this ambitious goal. Our approach leverages the Data Life Cycle as a key concept to provide MapReduce for distributed and dynamic data sets on heterogeneous and distributed infrastructures. We first report on our attempts at implementing the MapReduce programming model for Hybrid Distributed Computing Infrastructures (Hybrid DCIs). We present the architecture of our prototype based on BitDew, a middleware for large-scale data management, and Active Data, a programming model for data life cycle management. Second, we outline the challenges in terms of methodology and present our approaches based on simulation and emulation on the Grid'5000 experimental testbed. We conduct performance evaluations and compare our prototype with Hadoop, the industry-reference MapReduce implementation. We present our work in progress on dynamic data sets, which has led us to implement an incremental MapReduce framework. Finally, we discuss our achievements and outline the challenges that remain to be addressed before obtaining a complete D3-MapReduce environment.
doi:10.1109/smartcity.2015.141 dblp:conf/smartcity/HeSSFTLSJMSCA15
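The incremental MapReduce idea mentioned at the end, reprocessing only newly arrived data and merging the partial result into the previous one, can be shown with a toy word-count. This sketch is purely illustrative and has nothing to do with the BitDew-based prototype's actual implementation.

```python
# Toy incremental MapReduce (word count): when new records arrive,
# only the delta is mapped, and its partial counts merge into the
# previous result instead of recomputing from scratch.
from collections import Counter
from itertools import chain

def map_phase(records):
    """Map: emit one key per word occurrence."""
    return chain.from_iterable(line.split() for line in records)

def reduce_phase(mapped):
    """Reduce: aggregate word occurrences into counts."""
    return Counter(mapped)

def incremental_update(previous: Counter, new_records) -> Counter:
    """Process only the new records and merge into the old result."""
    return previous + reduce_phase(map_phase(new_records))

batch1 = ["a b a", "c"]
result = reduce_phase(map_phase(batch1))       # initial batch
result = incremental_update(result, ["a c c"]) # later delta only
print(result["a"], result["c"])  # 3 3
```

The merge works here because word counts are additive; making arbitrary reduce functions incremental is one of the harder challenges such a framework faces.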
Our research aims to improve the accuracy of Earthquake Early Warning (EEW) systems by means of machine learning. EEW systems are designed to detect and characterize medium and large earthquakes before their damaging effects reach a given location. Traditional EEW methods based on seismometers fail to accurately identify large earthquakes due to their sensitivity to ground motion velocity. The recently introduced high-precision GPS stations, on the other hand, are ineffective at identifying medium earthquakes due to their propensity to produce noisy data. In addition, GPS stations and seismometers may be deployed in large numbers across different locations and may consequently produce a significant volume of data, affecting the response time and robustness of EEW systems. In practice, EEW can be seen as a typical classification problem in machine learning: multi-sensor data are given as input, and earthquake severity is the classification result. In this paper, we introduce the Distributed Multi-Sensor Earthquake Early Warning (DMSEEW) system, a novel machine learning-based approach that combines data from both types of sensors (GPS stations and seismometers) to detect medium and large earthquakes. DMSEEW is based on a new stacking ensemble method which has been evaluated on a real-world dataset validated with geoscientists. The system builds on a geographically distributed infrastructure, ensuring efficient computation in terms of response time and robustness to partial infrastructure failures. Our experiments show that DMSEEW is more accurate than the traditional seismometer-only approach and than a combined-sensor (GPS and seismometers) approach that adopts the rule of relative strength.
doi:10.1609/aaai.v34i01.5376
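The two-level structure of such a sensor-combining classifier, per-sensor base predictions feeding a meta-level decision, can be sketched schematically. This toy is NOT the DMSEEW model: its base learners are hand-written threshold rules and its meta-level is a fixed rule rather than a trained stacking classifier; all thresholds and feature names are invented.

```python
# Schematic two-level (stacking-style) severity classifier. Invented
# features and thresholds; only the architecture mirrors the abstract.
def seismometer_vote(peak_velocity: float) -> str:
    # Seismometers detect medium shaking well but saturate on large events.
    return "medium" if peak_velocity > 0.1 else "normal"

def gps_vote(displacement_cm: float) -> str:
    # High-precision GPS is reliable for large ground displacement.
    return "large" if displacement_cm > 5.0 else "normal"

def meta_classifier(seis: str, gps: str) -> str:
    # Meta level: trust GPS for "large", seismometers for "medium".
    if gps == "large":
        return "large"
    if seis == "medium":
        return "medium"
    return "normal"

def classify(peak_velocity: float, displacement_cm: float) -> str:
    return meta_classifier(seismometer_vote(peak_velocity),
                           gps_vote(displacement_cm))

print(classify(0.05, 0.2))   # normal
print(classify(0.3, 0.4))    # medium
print(classify(0.3, 12.0))   # large
```

In a real stacking ensemble, both the base learners and the meta-classifier are trained models; the point of the sketch is only that each sensor type compensates for the other's blind spot.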
As Map-Reduce emerges as a leading programming paradigm for data-intensive computing, today's frameworks that support it still have substantial shortcomings that limit its potential scalability. In this paper we discuss several directions where there is room for progress: storage efficiency under massive data-access concurrency, scheduling, volatility, and fault tolerance. We place our discussion in the perspective of the current evolution towards an increasing integration of large-scale distributed platforms (clouds, cloud federations, enterprise desktop grids, etc.). We propose an approach that aims to overcome the current limitations of existing Map-Reduce frameworks in order to achieve scalable, concurrency-optimized, fault-tolerant Map-Reduce data processing on hybrid infrastructures. This approach will be evaluated with real-life bio-informatics applications on existing Nimbus-powered cloud testbeds interconnected with desktop grids.
Corresponding author: G. Antoniu.
doi:10.1504/ijcc.2013.055265