JRBridge: A Framework of Large-Scale Statistical Computing for R

Xia Xie, Jie Cao, Hai Jin, Xijiang Ke, Wenzhi Cao
2012 IEEE Asia-Pacific Services Computing Conference
Demand for highly scalable parallel data processing platforms is rising due to an explosion in the number of massive-scale, data-intensive applications in both industry and the sciences. Performing statistical computing over huge data repositories poses a significant challenge to existing statistical software and computational infrastructure. An analysis of various open-source computational infrastructures and their programming-paradigm APIs shows that most of them are JVM based, and that their APIs are given as Java interfaces or abstract classes. This paper proposes a generic framework, JRBridge, which integrates R with JVM-based computational infrastructures by automatically generating Java API wrapper code around the native R code and handling type conversion. Using this framework, we build a distributed statistical computing environment by integrating R with Hadoop. The Hadoop Distributed File System plugin provides a way to store and access datasets with millions of objects. The MapReduce plugin provides a natural environment for coding MapReduce algorithms in R. The experimental results show that JRBridge scales linearly with the size of the datasets and thus provides a scalable solution for large-scale statistical computing in R.
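The wrapper idea in the abstract (a Java interface implemented by generated code that converts types and delegates to a native R function) can be illustrated with a minimal sketch. Everything named here is hypothetical and is not JRBridge's or Hadoop's actual API: `Aggregator` stands in for a JVM-side interface, `R_MEAN` stands in for a native R function (modeled as a plain Java function over `double[]`, since no R runtime is assumed), and `RBackedAggregator` is the kind of class a wrapper generator might emit.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class WrapperSketch {
    /** Hypothetical JVM-side API, given as a Java interface. */
    interface Aggregator {
        double aggregate(List<Double> values);
    }

    /**
     * Stand-in for a native R function such as mean(x).
     * An R numeric vector is modeled as double[] (an assumption of this sketch);
     * the result is a length-one vector, as in R.
     */
    static final Function<double[], double[]> R_MEAN = v -> {
        double sum = 0.0;
        for (double x : v) {
            sum += x;
        }
        return new double[] { v.length == 0 ? Double.NaN : sum / v.length };
    };

    /** Type conversion: Java List<Double> to the modeled R numeric vector. */
    static double[] toRVector(List<Double> values) {
        double[] out = new double[values.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = values.get(i);
        }
        return out;
    }

    /** Hypothetical generated wrapper: implements the Java API, delegates to the R-side function. */
    static final class RBackedAggregator implements Aggregator {
        private final Function<double[], double[]> rFunction;

        RBackedAggregator(Function<double[], double[]> rFunction) {
            this.rFunction = rFunction;
        }

        @Override
        public double aggregate(List<Double> values) {
            double[] rVector = toRVector(values);        // Java -> R conversion
            double[] result = rFunction.apply(rVector);  // call into the "R" code
            return result[0];                            // R -> Java conversion
        }
    }

    public static void main(String[] args) {
        Aggregator agg = new RBackedAggregator(R_MEAN);
        System.out.println(agg.aggregate(Arrays.asList(1.0, 2.0, 3.0, 6.0)));
    }
}
```

The caller only sees the `Aggregator` interface, which is what would let a JVM-based infrastructure invoke R code without knowing about it. In a real system the `Function` stand-in would be replaced by a call into an R runtime.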
doi:10.1109/apscc.2012.74 dblp:conf/apscc/XieCJKC12 fatcat:pjthqqpdbjcdbbydzqt6zoxudm