MLlib: Machine Learning in Apache Spark [article]

Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin (+3 others)
2015 arXiv   pre-print
Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines.  ...  Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks.  ...  Most machine learning libraries do not provide native support for the diverse set of functionality required for pipeline construction.  ... 
arXiv:1505.06807v1 fatcat:xqjx7ioxizgldgec5vb7l64o24

Machine Learning Pipelines: Provenance, Reproducibility and FAIR Data Principles [article]

Sheeba Samuel, Frank Löffler, Birgitta König-Ries
2020 arXiv   pre-print
Rather, ML, similar to many other disciplines, faces a reproducibility crisis. In this paper, we describe our goals and initial steps in supporting the end-to-end reproducibility of ML pipelines.  ...  Machine learning (ML) is an increasingly important scientific tool supporting decision making and knowledge generation in numerous fields.  ...  Acknowledgments: The authors thank the Carl Zeiss Foundation for the financial support of the project "A Virtual Werkstatt for Digitization in the Sciences (K3)" within the scope of the program-line "Breakthroughs  ... 
arXiv:2006.12117v1 fatcat:fldgjlz2o5gpfj52stkbhwvene

Implicit Provenance for Machine Learning Artifacts [article]

Alexandru A. Ormenisan, Mahmoud Ismail, Seif Haridi, Jim Dowling
2020 Zenodo  
Our provenance framework is integrated into the open-source Hopsworks framework, and used in production to enable full provenance for end-to-end machine learning pipelines  ...  Machine learning (ML) presents new challenges for reproducible software engineering, as the artifacts required for repeatably training models are not just versioned code, but also hyperparameters, code  ...  Machine learning (ML) is a relatively new software engineering discipline, where we strive to continuously deliver new versions of models, and, in the event of performance, security, or behavioural regressions  ... 
doi:10.5281/zenodo.3941628 fatcat:ial2fj62uzd2zhhmxmyrpjproa

SOLIS – The MLOps journey from data acquisition to actionable insights [article]

Razvan Ciobanu, Alexandru Purdila, Laurentiu Piciu, Andrei Damian
2022 arXiv   pre-print
This approach, however, does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production-grade systems.  ...  Being able to define very clear hypotheses for actual real-life problems that can be addressed by machine learning models, collecting and curating large amounts of data for model training and validation  ...  Background: When analyzing the DevOps requirements for production-grade scalable machine learning systems, one can choose various methods for operationalizing the end-to-end pipelines.  ... 
arXiv:2112.11925v2 fatcat:bgkdomsm3rcmzn2x6bmzmppdmq

FLRA: A Reference Architecture for Federated Learning Systems [article]

Sin Kit Lo, Qinghua Lu, Hye-Young Paik, Liming Zhu
2021 arXiv   pre-print
Federated learning is an emerging machine learning paradigm that enables multiple devices to train models locally and formulate a global model, without sharing the clients' local data.  ...  for software architecture design for federated learning.  ...  for an end-to-end federated learning system.  ... 
arXiv:2106.11570v1 fatcat:fh37vlxbh5gvlas5chi7ikuq2q

Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science [article]

Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, Jason H. Moore
2016 arXiv   pre-print
As the field of data science continues to grow, there will be an ever-increasing demand for tools that make machine learning accessible to non-experts.  ...  In this paper, we introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning: pipeline design.  ...  TPOT uses a version of genetic programming [1] to automatically design and optimize a series of data transformations and machine learning models that maximize the classification accuracy for a given  ... 
arXiv:1603.06212v1 fatcat:vpxg4qx4ffbybaayt65fv4v2qu

tf.data: A Machine Learning Data Processing Framework [article]

Derek G. Murray, Jiri Simsa, Ana Klimovic, Ihor Indyk
2021 arXiv   pre-print
We demonstrate that input pipeline performance is critical to the end-to-end training time of state-of-the-art machine learning models. tf.data delivers the high performance required, while avoiding the  ...  Training machine learning models requires feeding input data for models to ingest.  ...  We gratefully acknowledge Andrew Audibert, Brennan Saeta, Fei Hu, Piotr Padlewski, Rachel Lim, Rohan Jain, Saurabh Saxena, and Shivani Agrawal for their engineering contributions to tf.data.  ... 
arXiv:2101.12127v2 fatcat:l3465y64sfd3ve6egbxkzvgs6e

Enabling End-To-End Machine Learning Replicability: A Case Study in Educational Data Mining [article]

Josh Gardner, Yuming Yang, Ryan Baker, Christopher Brooks
2018 arXiv   pre-print
This work demonstrates an approach to end-to-end machine learning replication which is relevant to any domain with large, complex or multi-format, privacy-protected data with a consistent schema.  ...  We discuss the challenges of end-to-end machine learning replication in this context, and present an open-source software toolkit, the MOOC Replication Framework (MORF), to address them.  ...  Instead, we propose a paradigm of end-to-end reproducibility for machine learning: fully reproducing (or replicating) the pipeline from raw data to model evaluation.  ... 
arXiv:1806.05208v2 fatcat:xrt7geusajbxjgoi3aqqsjsz3e

Data Infrastructure for Machine Learning

Samridhi Jha
2019 International Journal for Research in Applied Science and Engineering Technology  
This paper reviews the data infrastructure we built at Google to address these challenges in the context of large-scale production machine learning pipelines.  ...  Data quality is critical for effective machine learning, and this makes data a first-class citizen in the context of machine learning, on par with algorithms, software, and infrastructure.  ...  Model unit testing is one part of testing an end-to-end machine learning system.  ... 
doi:10.22214/ijraset.2019.4133 fatcat:b5iojbgus5ai3lbqsevsinvquu

Autostacker: A Compositional Evolutionary Learning System [article]

Boyuan Chen, Harvey Wu, Warren Mo, Ishanu Chattopadhyay, Hod Lipson
2018 arXiv   pre-print
Using EA, Autostacker quickly evolves candidate pipelines with high predictive accuracy. These pipelines can be used as is or as a starting point for human experts to build on.  ...  We introduce an automatic machine learning (AutoML) modeling architecture called Autostacker, which combines an innovative hierarchical stacking architecture and an Evolutionary Algorithm (EA) to perform  ...  Automatic Machine Learning: AutoML research has focused on combining two tasks: machine learning pipeline building and intelligent model hyperparameter search.  ... 
arXiv:1803.00684v1 fatcat:xhw7hj5jsbhujh5zcw3lvlo4hm

PipeDream: Fast and Efficient Pipeline Parallel DNN Training [article]

Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, Phil Gibbons
2018 arXiv   pre-print
PipeDream is a Deep Neural Network (DNN) training system for GPUs that parallelizes computation by pipelining execution across multiple machines.  ...  PipeDream keeps all available GPUs productive by systematically partitioning DNN layers among them to balance work and minimize communication, versions model parameters for backward pass correctness, and  ...  Fourth, weight versions need to be managed carefully to obtain a high-quality model at the end of training.  ... 
arXiv:1806.03377v1 fatcat:ufq4mwvp2jf35e4iigqpzb3yzm

ReinBo: Machine Learning pipeline search and configuration with Bayesian Optimization embedded Reinforcement Learning [article]

Xudong Sun, Jiali Lin, Bernd Bischl
2019 arXiv   pre-print
A machine learning pipeline potentially consists of several stages of operations, such as data preprocessing, feature engineering and machine learning model training.  ...  Each operation has a set of hyper-parameters, which can become irrelevant for the pipeline when the operation is not selected. This gives rise to a hierarchical conditional hyper-parameter space.  ...  a data analysis pipeline with machine learning methods and parameter settings that are optimized for a given data set, in order to make machine learning methods available for non-expert users.  ... 
arXiv:1904.05381v1 fatcat:gxzjabu4enamfnyd5n4abfbyta

MLCask: Efficient Management of Component Evolution in Collaborative Data Analytics Pipelines [article]

Zhaojing Luo, Sai Ho Yeung, Meihui Zhang, Kaiping Zheng, Lei Zhu, Gang Chen, Feiyi Fan, Qian Lin, Kee Yuan Ngiam, Beng Chin Ooi
2021 arXiv   pre-print
In this paper, we identify two main challenges that arise during the deployment of machine learning pipelines, and address them with the design of versioning for an end-to-end analytics system MLCask.  ...  With the ever-increasing adoption of machine learning for data analytics, maintaining a machine learning pipeline is becoming more complex as both the datasets and trained models evolve with time.  ...  Introduction: In many real-world machine learning (ML) applications, new data is continuously fed to the ML pipeline.  ... 
arXiv:2010.10246v4 fatcat:4xepaumanng25mdfjmgniq76tu

The Vision of BigBench 2.0

Tilmann Rabl, Michael Frank, Manuel Danisch, Hans-Arno Jacobsen, Bhaskar Gowda
2015 Proceedings of the Fourth Workshop on Data analytics in the Cloud - DanaC'15  
This leaves users in the dilemma of choosing a system that features good end-to-end performance for the use case.  ...  To this end, we have developed BigBench, an application level benchmark focused only on big data analytics.  ...  Because we do not want to dictate the machine learning algorithms, we need to set minimum thresholds for the accuracy of a machine learning task.  ... 
doi:10.1145/2799562.2799642 dblp:conf/sigmod/RablFDJG15 fatcat:c5hbpoof3rhphatjrjkkz75cbu

Data Civilizer 2.0

El Kindi Rezig, Lei Cao, Michael Stonebraker, Giovanni Simonini, Wenbo Tao, Samuel Madden, Mourad Ouzzani, Nan Tang, Ahmed K. Elmagarmid
2019 Proceedings of the VLDB Endowment  
In this demo, we will show how we used Data Civilizer 2.0 to help scientists at the Massachusetts General Hospital build their cleaning and machine learning pipeline on their 30TB brain activity dataset  ...  In this paper, we introduce Data Civilizer 2.0, an end-to-end workflow system satisfying both requirements.  ...  Decoupling Data Cleaning and Machine Learning: When it comes to building complex end-to-end data science pipelines, data cleaning is often the elephant in the room.  ... 
doi:10.14778/3352063.3352108 fatcat:otuma54mpbcwtgowmwlhtde7cq