Computing the Throughput of Replicated Workflows on Heterogeneous Platforms
2009 International Conference on Parallel Processing
In this paper, we focus on computing the throughput of replicated workflows. Given a streaming application whose dependence graph is a linear chain, and a mapping of this application onto a fully heterogeneous platform, how can we compute the optimal throughput, or equivalently the minimal period? The problem is easy when workflow stages are not replicated, i.e., assigned to a single processor: in that case the period is dictated by the critical hardware resource. But when stages are replicated, i.e., assigned to several processors, the problem gets surprisingly complicated, and we provide examples where the optimal period is larger than the largest cycle-time of any resource. We then show how to model the problem as a timed Petri net to compute the optimal period in the general case, and we provide a polynomial algorithm for the one-port communication model with overlap. Finally, we report comprehensive simulation results on the gap between the optimal period and the largest resource cycle-time.

Résumé. In this paper, we study the throughput of replicated task graphs. Given a streaming application whose dependence graph is a chain, and a mapping of this application onto a heterogeneous platform, how can we compute the optimal throughput, or equivalently the minimal period? The problem is easy when each task is handled by a single processor: in that case, the period is given by the throughput of the critical resource(s). However, when tasks are replicated, that is, mapped onto several processors, the problem becomes surprisingly complicated, and we present examples of instances with no critical resource, i.e., in which every resource experiences idle time during the execution of the system. We show how to compute the period of the system using timed Petri nets, and we give a polynomial algorithm to compute it for the communication model with overlap. We also report the results of numerous simulations showing the gap between the actual period of the system and the maximum occupation time of the resources.

In this paper we deal with streaming applications, or workflows, whose dependence graph is a linear chain composed of several stages. Such applications operate on a collection of data sets that are executed in a pipeline fashion [11, 10, 14].
They are a popular programming paradigm for streaming applications such as video and audio encoding and decoding, DSP applications, etc. [7, 13, 16]. Each data set is input to the linear chain and traverses it until its processing is complete. While the first data sets are still being processed by the last stages of the pipeline, the following ones have already started their execution. In steady state, a new data set enters the system every P time units, and several data sets are processed concurrently within the system. A key criterion to optimize is the period, or equivalently its inverse, the throughput. The period P is defined as the time interval between the completions of two consecutive data sets. With this definition, the system processes data sets at a rate 1/P (the throughput).

The workflow is executed on a fully heterogeneous platform, whose processors have different speeds, and whose interconnection links have different bandwidths. When mapping application stages onto processors, we enforce the rule that any given processor executes at most one stage. However, the converse is not true. If the computations of a given stage are independent from one data set to another, then two consecutive computations (for different data sets) of the same stage can be mapped onto distinct processors. Such a stage is said to be replicated, using the terminology of Subhlok and Vondran [11, 12] and of the DataCutter team [4, 10, 15]. This corresponds to the dealable stages of Cole. Given an application and a target heterogeneous platform, the problem of determining the optimal mapping (the one maximizing the throughput) has been shown to be NP-hard. The main objective of this paper is to assess the complexity of computing the throughput when the mapping is given. The problem is easy when workflow stages are not replicated, i.e., assigned to a single processor: in that case the period is dictated by the critical hardware resource.
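The non-replicated case can be sketched in a few lines: under the one-port model with overlap, a resource can receive, compute, and send in parallel, so its cycle-time is the largest of these three per-data-set activities, and the period is the largest cycle-time over all resources. The function and the numerical values below are illustrative assumptions, not taken from the paper.

```python
# Sketch: period of a non-replicated mapping under the one-port model
# WITH overlap (computation and communications proceed in parallel, so
# each processor's cycle-time is the max of its three activities).
# Names and values are illustrative, not from the paper.

def period_no_replication(compute, comm_in, comm_out):
    """compute[i]: time processor i needs per data set for its stage;
    comm_in[i] / comm_out[i]: time to receive / send one data set."""
    return max(
        max(compute[i], comm_in[i], comm_out[i])  # cycle-time of resource i
        for i in range(len(compute))
    )

# Three stages mapped onto three processors of different speeds.
print(period_no_replication(
    compute=[4.0, 6.0, 3.0],
    comm_in=[1.0, 2.0, 5.0],
    comm_out=[2.0, 5.0, 0.0],
))  # → 6.0, dictated by the critical resource (processor 1's computation)
```

Without overlap, the inner `max` would become a sum over the three activities; the outer maximum over resources is unchanged.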
But when stages are replicated, i.e., assigned to several processors, the problem gets surprisingly complicated, and we provide examples where the optimal period is larger than the largest cycle-time of any resource. In other words, every resource is idle at some point during the execution of the system. We then show how to use timed Petri nets to compute the optimal period in the general case, and we provide a polynomial algorithm for the one-port model with overlap. Finally, we report comprehensive simulation results on the gap between the optimal period and the largest resource cycle-time.
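To make the gap concrete, the largest resource cycle-time can be sketched as a lower bound on the period, assuming data sets are dealt round-robin among the processors of a replicated stage: processor i of a stage replicated R times handles one data set out of R, so its cycle-time is its per-data-set time divided by R. The function and values below are illustrative assumptions; the point made above is precisely that the optimal period can be strictly larger than this bound.

```python
# Sketch: largest resource cycle-time as a LOWER BOUND on the period
# when stages are replicated, assuming round-robin distribution of
# data sets. Illustrative only; the optimal period may exceed it.

def cycle_time_lower_bound(stages):
    """stages[k]: list of per-data-set compute times, one entry per
    processor that stage k is replicated on (length = replication factor R)."""
    return max(
        t / len(procs)  # each of the R processors sees 1 data set out of R
        for procs in stages
        for t in procs
    )

# Stage 0 on a single processor, stage 1 replicated on two processors.
print(cycle_time_lower_bound([[3.0], [5.0, 7.0]]))  # → 3.5
```

Here the slower processor of the replicated stage (7.0 time units, one data set out of two) dictates the bound, yet the actual optimal period of such a system can still be strictly larger.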