PARMON: a portable and scalable monitoring system for clusters

Rajkumar Buyya
2000 Software, Practice & Experience  
Workstation/PC clusters have become a cost-effective solution for high-performance computing. C-DAC's PARAM 10000 (internal code name OpenFrame) is a large cluster of high-performance workstations interconnected through low-latency, high-bandwidth networks. Managing and controlling such a large system is a tedious and challenging task, since workstations/PCs are typically designed to work as standalone systems rather than as part of a cluster. We have designed and developed a tool called PARMON that allows effective monitoring and control of large clusters. It supports the monitoring of critical system resource activities and their utilization at three different levels: entire system, node, and component. It also allows the monitoring of multiple instances of the same component, for instance multiple processors in SMP-type cluster nodes. PARMON is a portable, flexible, interactive, scalable, location-transparent, and comprehensive environment based on client-server technology. Its major components are parmon-server, a provider of system resource activity and utilization information, and parmon-client, a GUI-based client responsible for interacting with parmon-server and with users, gathering data in real time and presenting it graphically for visualization. The client is developed as a Java application, and the server is developed as a multithreaded server using C and POSIX/Solaris threads, since Java does not support interfaces to access system internals. PARMON is regularly used to monitor the PARAM 10000 supercomputer, a cluster of 48+ Ultra-4 workstations powered by the Solaris operating system. The recent popularity of Beowulf-class clusters (dedicated Linux clusters), driven by their price-performance ratio, has motivated us to port PARMON to Linux (accomplished by porting the system-dependent portions of parmon-server). This enables management and monitoring of both Solaris and Linux-based clusters (federated clusters) through a single user interface.

C-DAC HPCC SOFTWARE ENVIRONMENT
C-DAC HPCC (high-performance computing and communication) software is an open and flexible parallel and distributed processing environment for a cluster of UNIX workstations [5]. The software

Figure 6. System calls generated by processes running on a node in a cluster.
Memory parameters
PARMON allows continuous instrumentation of memory availability, memory in use, free memory, percentage of memory in use, reserved swap space, allocated swap space, and available swap space.

Disk parameters
PARMON allows the monitoring of disk operations such as reads, writes, the number of jobs waiting in the queue for disk service, and disk request run-time and wait-time.

Network parameters
The software instrumentation of network parameters such as input packets, output packets, and errors in packet transmission helps to detect network bottlenecks. PARMON supports the display of the percentage of incoming and outgoing data packets containing packet format errors.

Component view: physical and logical
PARMON displays a physical picture of the entire system and a few important components, which helps the user to quickly understand the machine's physical look and feel. It also displays different views of
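
The percentage metrics mentioned under the memory and network parameters above can be illustrated with a small sketch. This is not PARMON's code (the client is written in Java and the server in C); the class names, field names, and the choice to divide error counts by total packet counts are all assumptions made for illustration, and the record does not state how PARMON computes these values.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class MemorySample:
    # Hypothetical raw counters for one node, in kilobytes.
    total_kb: int
    free_kb: int


@dataclass
class NetSample:
    # Hypothetical raw packet counters for one network interface.
    in_packets: int
    in_errors: int
    out_packets: int
    out_errors: int


def percent(part: float, whole: float) -> float:
    """Return part/whole as a percentage, or 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0


def memory_in_use_percent(m: MemorySample) -> float:
    """Percentage of memory in use, assuming in-use = total - free."""
    return percent(m.total_kb - m.free_kb, m.total_kb)


def packet_error_percent(n: NetSample) -> Tuple[float, float]:
    """Percentages of incoming and outgoing packets that had errors.

    Assumes the error counters are a subset of the packet counters.
    """
    return (
        percent(n.in_errors, n.in_packets),
        percent(n.out_errors, n.out_packets),
    )
```

A monitoring client could apply such functions to each sample it receives from a node before plotting; guarding against a zero denominator matters because an idle interface may report no packets at all.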
doi:10.1002/(sici)1097-024x(200006)30:7<723::aid-spe314>3.0.co;2-5 fatcat:n2p6sjqcdvd77jcccom44wnhne