Exploring Trade-Offs in Performance and Programmability of Processing Element Topologies for Network Processors [chapter]

Matthias Gries, Chidamber Kulkarni, Christian Sauer, Kurt Keutzer
2004 Network Processor Design  
Network processors exploit task- and packet-level parallelism to achieve high throughput. To date, this has resulted in a huge diversity of architectures for similar applications. One of the main differences among network processors is the interconnection topology between processing elements. This topology significantly influences the achievable performance as well as the programmability of the system. Thus it is essential to understand the trade-offs in both performance and programmability for different processing element topologies.
Driven by practical implementations, in this paper we explore these trade-offs for topologies based on an analytical framework. The performance results for our setup and metrics indicate that for applications like IPv4 forwarding, a pooled topology is best suited. The discussion of trade-offs for additional criteria, such as programmability and scalability, also reveals benefits of the pooled topology over other topologies.

[Figure 2. Design space exploration using the Y-chart: an architecture (service curves) and an application (task graph and arrival curves) are combined in a mapping step (binding of tasks and communication) and evaluated for worst-case packet delay, resource utilization, and memory requirements.]

[Excerpt from the chapter body:] [...]anced for our application scenario since, for instance, in the pure pipeline case up to 18 packets per flow could be in the network processor concurrently (as opposed to up to nine packets in Table 1). Only under ideal assumptions is the pure pipeline (configuration I) able to match the latency values of the pool configurations (IV and V), by over-provisioning the throughput of the point-to-point (P-2-P) connections and thereby decreasing the transport delay. Under practical assumptions, however, the pipeline always falls behind the pooled topology. This effect is particularly pronounced in our configurations because the pool topologies employ an optimized mapping of tasks to processing elements, so that a packet is processed entirely on one processing element, minimizing communication overhead. Consequently, the dynamic range between the lower and upper latency bounds must be larger for a pipeline than for a pool in our case. If we tried to map pipelined processing onto a pool, the results for the pool would degrade, as an example in Section 4.2 will show. On the other hand, since P-2-P connections are always available, the latency experienced along the pipeline is less affected by high utilization of the P-2-P connections than a pool is by high bus utilization (which incurs higher arbitration penalties). The crossover point, however, is not visible in our analysis and would only become apparent with more than eight computing cores. Note that the bus load must be the same for topologies II, III, and IV, since in each case the buses distribute the traffic from the 16 ports to the corresponding number of pipelines. The implementation costs of these topologies would, however, differ considerably, since they require different numbers of bus interfaces (see Table 1). [Table 1 footnote: depends on the number of parallel pipelines.]
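The evaluation step of the Y-chart flow rests on network calculus: arrival curves bound the incoming traffic, service curves bound the processing capacity offered by a topology, and the worst-case packet delay is the maximal horizontal deviation between the two curves. The minimal sketch below illustrates why a pipeline accumulates latency that a pool avoids, using the standard closed forms for token-bucket arrival curves and rate-latency service curves; all numeric parameters and the six-stage split are hypothetical and are not taken from the chapter.

```python
# Sketch of the network-calculus bounds behind a Y-chart-style evaluation.
# All numeric parameters below are hypothetical, chosen only for illustration.

def delay_bound(b, r, R, T):
    """Worst-case delay for token-bucket arrivals alpha(t) = b + r*t served
    by a rate-latency curve beta(t) = R * max(0, t - T). When the service
    rate covers the arrival rate (r <= R), the horizontal deviation between
    the curves, and hence the delay bound, is D = T + b / R."""
    assert r <= R, "unstable: arrival rate exceeds service rate"
    return T + b / R

def pipeline_service(stages):
    """Concatenating rate-latency servers (R_i, T_i) yields a rate-latency
    curve with rate min(R_i) and latency sum(T_i); per-stage processing and
    P-2-P transport latencies therefore accumulate along a pipeline."""
    rate = min(R for R, _ in stages)
    latency = sum(T for _, T in stages)
    return rate, latency

# Hypothetical flow: burst of 2 packets, long-term rate 0.5 packets/us.
b, r = 2.0, 0.5

# Pool: the packet is processed entirely on one PE, modeled as a single
# rate-latency server, so no inter-stage transport latency accumulates.
D_pool = delay_bound(b, r, R=1.0, T=1.0)

# Pipeline: the same work split across six stages, each contributing its
# own latency term (processing plus P-2-P transport).
R_pipe, T_pipe = pipeline_service([(1.0, 1.0)] * 6)
D_pipe = delay_bound(b, r, R_pipe, T_pipe)

print(f"pool delay bound:     {D_pool:.1f} us")   # 3.0 us
print(f"pipeline delay bound: {D_pipe:.1f} us")   # 8.0 us
# Raising the per-stage rate R (over-provisioning P-2-P throughput) shrinks
# the b/R term, but the summed latency terms remain, which is why the
# pipeline matches the pool only under idealized assumptions.
```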
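The claim that topologies II, III, and IV see identical bus load but different implementation cost follows from simple accounting: every bus must carry the entire traffic of the 16 ports no matter how many pipelines it feeds, while the number of bus interfaces grows with the number of attached pipelines. A back-of-the-envelope sketch, where the per-port rate and the one-interface-per-attached-component model are assumptions for illustration:

```python
# Back-of-the-envelope accounting for the shared bus in topologies II-IV.
# The per-port rate and the one-interface-per-attached-component model are
# assumptions for illustration, not figures from the chapter.

PORTS = 16
PORT_RATE_GBPS = 0.622  # hypothetical per-port line rate

for pipelines in (2, 4, 8):  # hypothetical pipeline counts
    bus_load = PORTS * PORT_RATE_GBPS  # every bus carries all port traffic
    interfaces = PORTS + pipelines     # assumed: one interface per port and per pipeline
    print(f"{pipelines:2d} pipelines: bus load {bus_load:.2f} Gb/s, "
          f"{interfaces} bus interfaces")
```

The bus load comes out identical for every configuration; only the interface count, and hence the implementation cost, varies with the number of parallel pipelines.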
doi:10.1016/b978-012198157-0/50009-x