Algorithmic Techniques for Regular Networks of Processors [chapter]

Russ Miller, Quentin Stout
2009 · Algorithms and Theory of Computation Handbook, Second Edition, Volume 2 · Chapman and Hall/CRC
Introduction

This chapter is concerned with designing algorithms for machines constructed from multiple processors. In particular, we discuss algorithms for machines in which the processors are connected to each other by some simple, systematic interconnection pattern. For example, consider a chess board, where each square represents a processor (for example, a processor similar to one in a home computer) and every generic processor is connected to its 4 neighboring processors (those to the north, south, east, and west). This is an example of a mesh computer, a network of processors that is important for both theoretical and practical reasons.

The focus of this chapter is on algorithmic techniques. We first define some basic terminology that is used to discuss parallel algorithms and parallel architectures. Following this introductory material, we define a variety of interconnection networks, including the mesh (chess board), which are used to allow processors to communicate with each other. We also define an abstract parallel model of computation, the PRAM, where processors are not connected to each other, but communicate directly with a global pool of memory that is shared amongst the processors. We then discuss several parallel programming paradigms, including the use of high-level data movement operations, divide-and-conquer, pipelining, and master-slave. Finally, we discuss the problem of mapping the structure of an inherently parallel problem onto a target parallel architecture. This mapping problem can arise in a variety of ways, and with a wide range of problem structures. In some cases, finding a good mapping is quite straightforward, but in other cases it is a computationally intractable NP-complete problem.

Terminology

To begin our investigation, we first define some basic terminology that will be used throughout the remainder of this chapter.

Shared Memory versus Distributed Memory

In a shared memory machine, there is a single global image of memory that is available to all processors in the machine, typically through a common bus, set of busses, or switching network, as shown in Figure 1 (top). This model is similar to a blackboard, where any processor can read or write to any part of the board (memory), and where all communication is performed through messages placed on the board.

As shown in Figure 1 (bottom), each processor in a distributed memory machine has access only to its private (local) memory.
In this model, processors communicate by sending messages to each other, with the messages being sent through some form of an interconnection network. This model is similar to that used by shipping services, such as the United States Postal Service, Federal Express, DHL, or UPS, to name a few. For example, suppose Tom in city A needs some information from Sue in city B. Then Tom might send a letter requesting such information from Sue. However, the letter might get routed from city A to a facility (i.e., "post office") in city C, then to a facility in city D, and finally to the facility in city B before being delivered locally to Sue. Sue will now package up the information requested and go to a local shipping facility in city B, which might route the package to a facility in city E, then to a facility in city F, and finally to a facility in city A before being delivered locally to Tom.

Note that there might be multiple paths between source and destination, and that messages might move through different paths at different times between the same source and destination depending on congestion, availability of the communication path, and so forth. Also note that routing messages between processors that are closer to each other in terms of the interconnection network (fewer hops between processors) typically requires less time than routing messages between pairs of processors that are farther apart (more hops between processors).

In such message-passing systems, the overhead and delay can be significantly reduced if, for example, Sue sends the information to Tom without him first requesting the information. It is particularly useful if the data from Sue arrives before Tom needs to use it, for then Tom will not be delayed waiting for critical data.
This analogy represents an important aspect of developing efficient programs for distributed memory machines, especially general-purpose machines in which communication can take place concurrently with calculation so that the communication time is effectively hidden.

For small shared memory systems, it may be that the network is such that each processor can access all memory cells in the same amount of time. For example, many symmetric multiprocessor (SMP) systems have this property. However, since memory takes space, systems with a large number of processors are typically constructed as modules (i.e., a processor/memory pair) that are connected to each other via an interconnection network. Thus, while memory may be logically shared in such a model, in terms of performance the machine behaves as if its memory were distributed, with some memory being "close" (fast access) to a processor and some memory being "far" (slow access) from it. Notice the similarity to distributed memory machines, where there is a significant difference in speed between a processor accessing its own memory and a processor accessing the memory of a distant processor. Such shared memory machines are called NUMA (non-uniform memory access) machines, and often the most efficient programs for NUMA machines are developed by using algorithms efficient for distributed memory architectures, rather than ones optimized for uniform access shared memory architectures.

Efficient use of the interconnection network in a parallel computer is often an important consideration for developing and tuning parallel programs. For example, in either shared or distributed memory machines, communication will be delayed if a packet of information must pass through many communication links. Similarly, communication will be delayed by contention if many packets need to pass through the same link.
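To make the cost of long paths concrete, the following minimal sketch counts links on the chess board mesh described in the introduction. It assumes that processors are identified by (row, column) coordinates and that a message follows a shortest path, so the number of links it crosses is the Manhattan distance between source and destination. The function names are illustrative and do not come from the chapter.

```python
def mesh_neighbors(row, col, n):
    """Return the neighbors (north, south, west, east) of a processor
    in an n x n mesh; boundary processors have fewer than 4."""
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [(r, c) for r, c in candidates if 0 <= r < n and 0 <= c < n]


def mesh_hops(src, dst):
    """Number of links on a shortest path between two mesh processors
    (assumed routing model: Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def mesh_diameter(n):
    """Largest hop count between any two processors in an n x n mesh
    (opposite corners)."""
    return 2 * (n - 1)


if __name__ == "__main__":
    n = 8  # chess board
    print(mesh_neighbors(0, 0, n))    # a corner processor has 2 neighbors
    print(mesh_hops((0, 0), (7, 7)))  # corner to opposite corner
    print(mesh_diameter(n))
```

Under these assumptions, a message between opposite corners of the 8 x 8 board crosses 14 links, while a message between adjacent squares crosses only one.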
As an example of contention at a link, in a distributed memory machine configured as a binary tree of processors, suppose that all leaf processors on one side of the machine need to exchange values with all leaf processors on the other side of the machine. Then a bottleneck occurs at the root, since the information must pass sequentially through the links in and out of the root. A similar bottleneck occurs in a system if the interconnect is merely a single Ethernet-based bus.

Both shared and distributed memory systems can also suffer from contention at the destinations. In a distributed memory system, too many processors may simultaneously send messages to the same processor, which causes a processing bottleneck. In a shared memory system, there may be memory contention, where too many processors simultaneously try to read from or write to the same location.
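The root bottleneck in the binary tree example can be illustrated by counting how many messages cross each link. The sketch below makes several assumptions that are not in the chapter: the tree is complete, nodes use heap numbering (root is 1, node k has parent k // 2), the number of leaves is a power of two, and each leaf in the left half exchanges one value with a single partner leaf in the right half. All names are hypothetical.

```python
from collections import Counter


def path_links(src, dst):
    """Directed links (from_node, to_node) on the unique tree path
    from src to dst, using heap numbering (root = 1)."""
    up, down = [], []
    a, b = src, dst
    while a != b:
        if a > b:                      # a is at least as deep as b: move a up
            up.append((a, a // 2))
            a //= 2
        else:                          # b is at least as deep as a: move b up
            down.append((b // 2, b))
            b //= 2
    return up + down[::-1]


def link_loads(num_leaves):
    """Count messages per directed link when left leaf i and right leaf i
    exchange one value each (assumed one-to-one pairing)."""
    first_leaf = num_leaves            # leaves are nodes num_leaves .. 2*num_leaves - 1
    half = num_leaves // 2
    loads = Counter()
    for i in range(half):
        left, right = first_leaf + i, first_leaf + half + i
        for src, dst in ((left, right), (right, left)):
            loads.update(path_links(src, dst))
    return loads


if __name__ == "__main__":
    loads = link_loads(8)
    print("link into root from left subtree:", loads[(2, 1)])
    print("link out of a left leaf:", loads[(8, 4)])
```

In this model every message crosses the root, so each link in or out of the root carries half as many messages as there are leaves, while each leaf link carries only one message per direction.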
doi:10.1201/9781584888215-c24 · fatcat:kb74cr53lngdtig6ciieirncz4