Biocompute 2.0: an improved collaborative workspace for data intensive bio-science

Rory Carmichael, Patrick Braga-Henebry, Douglas Thain, Scott Emrich
Concurrency and Computation: Practice and Experience, 2011
The explosion of data in the biological community requires scalable and flexible portals for bioinformatics. To help address this need, we proposed characteristics needed for rigorous, reproducible, and collaborative resources for data-intensive science. Implementing a system with these characteristics exposed challenges in user interface, data distribution, and workflow description/execution. We describe ongoing responses to these and other challenges. Our Data-Action-Queue design pattern addresses user interface and system organization concepts. A dynamic data distribution mechanism lays the foundation for the management of persistent datasets. Makeflow facilitates the simple description and execution of complex multi-part jobs and forms the kernel of a module system powering diverse bioinformatics applications. Our improved web portal, Biocompute 2.0, has been in production use since the summer of 2010. Through it and its predecessor, we have provided over 56 years of CPU time through its five modules (BLAST, SSAHA, SHRIMP, BWA, and SNPEXP) to research groups at three universities. In this paper, we describe the system's goals and interface, its architecture and performance, and the insights gained in its development.

Biocompute is arranged into three primary components. A single server hosts the website, submits batch jobs, and stores data. A relational database stores metadata for the system, such as user data, job status, and runtime and disk usage statistics. Each dataset is stored in a distributed cache [4] over a cluster of 32 machines that have been integrated into our campus grid computing environment [5]. These machines serve as the primary runtime environment for batch jobs and are supplemented by an extended set of machines running the same distributed caching system and advertising availability for Biocompute jobs using the Condor classad system [6].

Interface: Data-Action-Queue

Having the described functionality is insufficient if users cannot effectively use the resource. To provide the requisite interface, we employ the Data-Action-Queue (DAQ) interface design pattern. Like Model-View-Controller, DAQ suggests a useful structure for organizing a program; however, DAQ describes an interface rather than an implementation.
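As a minimal sketch of how a DAQ-style portal might be organized, the following Python fragment models the three views and the narrow interface a tool developer would implement. All class and method names here are hypothetical, not Biocompute's actual code:

```python
# Sketch of the Data-Action-Queue (DAQ) interface pattern.
# Names are illustrative, not taken from Biocompute itself.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Dataset:
    name: str
    size_mb: int

@dataclass
class Job:
    action: str
    dataset: str
    params: dict
    status: str = "queued"   # queued -> running -> done

class Portal:
    """Holds the three DAQ views: data, actions, and the job queue."""
    def __init__(self):
        self.data: dict = {}       # name -> Dataset
        self.actions: dict = {}    # name -> launch function
        self.queue: list = []      # all previous and ongoing jobs

    def register_action(self, name: str, run: Callable):
        # Tool developers only specify how to launch their tool;
        # the portal handles data selection and queue bookkeeping.
        self.actions[name] = run

    def submit(self, action: str, dataset: str, **params) -> Job:
        job = Job(action, dataset, params)
        self.queue.append(job)
        return job

    def resubmit(self, job: Job) -> Job:
        # Drilling down on a recorded job lets a user rerun it
        # with identical parameters.
        return self.submit(job.action, job.dataset, **job.params)

portal = Portal()
portal.register_action("blast", lambda job: None)
first = portal.submit("blast", "nt", evalue=1e-5)
rerun = portal.resubmit(first)
```

Because every submission, including resubmissions, lands in the same queue, the queue view doubles as both a provenance record and a live picture of resource contention.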
The DAQ design pattern rests on the idea that users of a scientific computing web portal will be interested in three things: their data, the tools with which they can analyze that data, and the record of previous and ongoing analyses. This also suggests a modular design for the implementing system: if tool developers need only specify the interface for the initial execution of their tool, adding new actions to the system becomes much simpler. The queue view documents job parameters and meta-information, and permits users to drill down to a single job in order to resubmit it or retrieve its results. Because the queue also shows the ongoing work in the system, it gives users a simple way to observe the current level of resource contention.

Social challenges of Biocompute

Up to this point, we have only addressed how Biocompute meets its technical challenges. However, as a shared resource, Biocompute requires several mechanisms to balance the needs of all of its users.

Figure 4. Runtime of BLAST against a medium-sized (491 MB) database for varying timeout lengths.
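Jobs such as the BLAST runs timed in Figure 4 are the kind of multi-part work Makeflow is described as expressing: each rule names its outputs, its inputs, and the command that produces them, in make-like syntax. The fragment below is a hypothetical sketch of a split BLAST workflow; the filenames and the `split_fasta` helper are illustrative, not from the actual Biocompute modules:

```make
# Hypothetical Makeflow for a two-way split BLAST job.

# Split the query so the pieces can run concurrently.
query.0 query.1: query.fasta
	split_fasta query.fasta 2

# Search each piece against the shared database.
out.0: query.0 nt.db
	blastall -p blastn -d nt.db -i query.0 -o out.0

out.1: query.1 nt.db
	blastall -p blastn -d nt.db -i query.1 -o out.1

# Join the partial results.
results.txt: out.0 out.1
	cat out.0 out.1 > results.txt
```

Because dependencies are explicit, the workflow engine can dispatch the two independent search rules to different cluster machines while guaranteeing the join step waits for both.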
doi:10.1002/cpe.1782