A low latency and low power indirect topology for on-chip communication

This paper presents the Hybrid Scalable-Minimized-Butterfly-Fat-Tree (H-SMBFT) topology for on-chip communication. The main contributions are a description of its architectural design and characteristics, and a comparative analysis against two established indirect topologies, namely Butterfly-Fat-Tree (BFT) and Scalable-Minimized-Butterfly-Fat-Tree (SMBFT). Simulation results demonstrate that the proposed topology outperforms its predecessors in terms of performance, area and power dissipation. Specifically, it improves the link interconnectivity between routing levels such that the number of required links is reduced. This results in reduced router complexity and shorter routing paths between any pair of communicating nodes in the network. Moreover, simulation results under synthetic as well as real-world embedded application workloads reveal that H-SMBFT can reduce the average latency by up to 35.63% and 17.36% compared to BFT and SMBFT, respectively. In addition, the power dissipation of the network can be reduced by up to 33.82% and 19.45%, while energy consumption can be improved by up to 32.91% and 16.83% compared to BFT and SMBFT, respectively.


Introduction
The growing complexity of System-on-Chip (SoC) designs, characterized by an increasing number of Processing Elements (PEs), requires intelligent solutions for the communication among the PEs. In response to this challenge, the Network-on-Chip (NoC) has been proposed. NoC is a promising paradigm that targets efficient communication between PEs [1]. It is based on packet switching and routing techniques in order to improve utilization of the on-chip interconnections, leading to enhanced network scalability and communication bandwidth as well as reduced power consumption and chip area utilization [2][3][4].

Background
Mesh is a well-known network topology using direct interconnections and is widely applied to on-chip communication due to its simple and regular design characteristics [18]. However, its simplicity comes at the cost of poor scalability. That is, increasing the size of a Mesh-based NoC system considerably degrades its performance, mainly because of the small bisection width and large network diameter [19]. Amongst the several strategies for reducing the network diameter, concentration is the most promising one [20][21][22]. For example, the Concentrated-Mesh (C-Mesh) topology proposed in [20] reduces the network diameter by a factor of four at the cost of higher router complexity. However, the crossbar circuitry grows quadratically with the concentration factor [21], which limits the usable concentration factor and thus reduces the network scalability. The flattened butterfly topology presented in [21] aims at circumventing this conflict by using Mesh, including improved bandwidth, reliability, scalability and regularity [14][23][24]. The Extended-Butterfly-Fat-Tree-Interconnection (EFTI) [14] has an improved network diameter in comparison to the Butterfly-Fat-Tree (BFT) topology. The latter is an updated version of the Fat-Tree (FT) and has received attention in recent applications [25][26][27][28][29][30].
The Scalable-Minimized-Butterfly-Fat-Tree (SMBFT) topology proposed in [12] is a minimized version of EFTI and BFT that reduces the number of routers, links and levels and consequently provides improved performance in terms of average network latency and power consumption [12,14].

H-SMBFT topology
The proposed Hybrid Scalable-Minimized-Butterfly-Fat-Tree topology is a combination of SMBFT and BFT. It applies a concentration factor of four and sibling links to reduce the number of network levels compared to BFT. Limiting the concentration factor to four reduces both the router complexity and the complexity of the structural design of the network, leading to improved scalability (see Fig 1(d)). Compared to SMBFT, the bottom level of H-SMBFT provides improved connectivity among the nodes and reduces the number of network levels. In H-SMBFT, all levels except the bottom one use 5-port routers, while the SMBFT implementation uses the more complex and costlier 8-port routers.
In H-SMBFT, one router link connects to a parent router, and four links are used to connect to the router's children. In the bottom level, each router has three additional sibling links that interlink to sibling routers, enabling improved connectivity compared to the BFT topology.

H-SMBFT network design
In the design of an H-SMBFT network, routers are positioned at the vertices and nodes at the leaves. Each element, which can be a node or a router, is represented by (p, l), where p is the position and l is the level of the element. Each router R(p, l) has one parent port and four child ports. The total number of network levels follows from Eq 1, where N is the total number of nodes. The number of routers at the l-th level can be estimated by Eq 2. The position p(i) of the parent router of R(i,1) can be calculated via Eq 3, where i indicates the position of a router at level l and ranges from 0 to routers-1 (see Eq (2)). Consequently, the parent router of each router R(p(i), l) is R(i mod 4, l+1). The routers at the bottom level 1 have three additional links that are connected to the router's siblings (Right, Left, and Next or Cross), where (Ri,1), (Li,1) and (Ni,1) denote the right, left and next (or cross) siblings, respectively. The positions Ri, Ni and Li are given by Eqs 4 to 6.
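The router addressing and port structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Router`, `port_count` and `parent` are hypothetical, the parent rule takes the "i mod 4" expression literally as it appears in the text, and the sibling positions of Eqs 4 to 6 are not modelled.

```python
from dataclasses import dataclass

BOTTOM_LEVEL = 1  # the text calls the lowest router level "level 1"


@dataclass(frozen=True)
class Router:
    p: int  # position of the router within its level
    l: int  # level of the router


def port_count(router: Router) -> int:
    """One parent port plus four child ports; bottom-level routers
    additionally have three sibling ports (right, left, next/cross)."""
    base_ports = 1 + 4
    sibling_ports = 3 if router.l == BOTTOM_LEVEL else 0
    return base_ports + sibling_ports


def parent(router: Router) -> Router:
    """Parent of a router, reading the rule R(i mod 4, l+1) literally.
    Assumption: the caller does not ask for the parent of the top-level router."""
    return Router(router.p % 4, router.l + 1)


print(port_count(Router(0, 1)), port_count(Router(0, 2)))
print(parent(Router(5, 1)))
```

The sketch reproduces the 8-port routers at the bottom level and the 5-port routers at all other levels that the text describes.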

H-SMBFT average distance
The average distance (D_avg) of a network follows from Eq 7. Here, N is the number of nodes, n_i and n_j are the source and destination nodes, respectively, and D is the length of the shortest path for a data packet travelling between n_i and n_j, given in hops [20,29]. Fig 2 depicts, for the H-SMBFT, the number of hops from the reference node 0 to the remaining routers. One hop is required to send a packet from node 0 to the white-colored routers and their linked nodes. Sending a packet from node 0 to the blue-colored routers and their linked nodes requires two hops, while reaching the brown-colored routers requires three hops. Similarly, reaching the grey-colored routers from node 0 requires four hops. Table 1 presents the distance in hops from node 0 to all other nodes in the BFT, SMBFT and H-SMBFT networks. The average distances computed using Eq 7, without considering router delay and errors, are D_avg_BFT = 4.38, D_avg_SMBFT = 3.63 and D_avg_H-SMBFT = 3.44 for the BFT, SMBFT and H-SMBFT networks with 64 nodes, respectively. These values show that H-SMBFT reduces the average distance by 5.2% and 21.4% compared to the SMBFT and BFT topologies, respectively.
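The arithmetic behind the reported reductions can be checked with a short Python sketch. The helper names are hypothetical, and `average_distance` is a simplification: it averages hop counts from a single reference node given as a histogram, whereas Eq 7 is defined over source-destination pairs. The histogram used in the example is made up and is not Table 1.

```python
def average_distance(hop_histogram):
    """Mean hop count for a histogram {hops: number_of_destinations}."""
    destinations = sum(hop_histogram.values())
    total_hops = sum(hops * count for hops, count in hop_histogram.items())
    return total_hops / destinations


def reduction_percent(baseline, proposed):
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


# Average distances quoted in the text for 64-node networks.
D_BFT, D_SMBFT, D_HSMBFT = 4.38, 3.63, 3.44

print(reduction_percent(D_SMBFT, D_HSMBFT))  # reduction relative to SMBFT
print(reduction_percent(D_BFT, D_HSMBFT))    # reduction relative to BFT

# Hypothetical histogram: 3 destinations at 1 hop, 12 destinations at 2 hops.
print(average_distance({1: 3, 2: 12}))
```

Both printed reductions agree with the quoted 5.2% and 21.4% to within rounding of the last digit.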

Simulation results
We simulated all components of the network to obtain the area, power dissipation, network latency and energy dissipation [16][17]. We employed the ORION 3.0 simulator [16] to estimate the area and power dissipation of all routers and links in the networks. The ORION 3.0 model achieves average estimation errors below 9.3% across microarchitectures and different RTL implementations of router components. We applied the NoCTweak simulator [17] for the scalability analysis of the BFT, SMBFT and H-SMBFT networks by comparing average network latency and energy consumption. A notable characteristic of NoCTweak is that it considers not only the number of hops but also the post-layout synthesis results of all router components. First, the Verilog RTL designs of all router components were synthesized with Synopsys Design Compiler and then placed and routed with Cadence SoC Encounter using a CMOS standard cell library in a 22 nm technology. We defined links of 1000 μm between router modules and used the actual post-layout delay, throughput and energy values. Hence, the results reflect the fidelity of real implementations.
When a processing element receives a packet, it subtracts the packet's generating time (in the packet's header flit) from the current simulation time to get the packet latency. The results obtained are therefore not merely dependent on the number of hops but also consider the post-layout results of delay, power and energy [17].
In the following, we detail the models employed in our analysis.
Network latency: Latency is the time taken by the header flit of a packet to travel between a source-destination pair in the network. It also includes the time a packet waits in all intermediate buffers on its way from the source to the destination node due to network congestion. The average network latency L_avg is therefore given by Eq (8), where L_i,j is the latency of packet j received by node i, N_i is the number of packets received by node i, and N is the number of nodes in the network.
Power consumption: The power consumption of the network results from the activities of all components while running a certain traffic pattern. The average power P_i of router i is described by Eq (9), where P_act,j and P_inact,j are the active and inactive power of component j, and α_i,j is the fraction of time during which component j in router i is active (after the warm-up time). Consequently, the average power of all routers in the network is given by Eq (10).
Network energy consumption: The average energy E_avg dissipated by each router during the simulation time T_sim after the warm-up time T_warm-up is given by Eq (11). The average energy E_P dissipated per packet by each router is given by Eq (12), where N_p is the total number of packets transferred on the network.
Router and link area: The router area results from the sizes of the basic building blocks of the router, i.e., SRAM-FIFOs, crossbar and arbiter. For example, the area of an SRAM follows from Eqs (13) and (14), where f_w, w_cell, P_r, P_w, d_w, B and h_cell are the flit width in bits, memory cell width, number of read ports, number of write ports, wire spacing, buffer size and memory cell height, respectively. Hence, the total area Area_fifo for a B-entry buffer is given by Eq (15). The area of the remaining router components, i.e., crossbar and arbiter, can be estimated from their cell-level description and the information about cell sizes [17].
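The latency, power and energy definitions above can be sketched in Python as follows. Since Eqs (8) to (12) are not reproduced here, every formula in this sketch is an assumption chosen to be consistent with the symbol definitions in the text (per-node averaging for latency, activity-weighted component power, energy as power multiplied by the measured time window). All function names are hypothetical and none of them is part of NoCTweak.

```python
def packet_latency(receive_time, generation_time):
    """Latency of one packet: current simulation time minus the
    generation time stored in the header flit."""
    return receive_time - generation_time


def average_network_latency(latencies_per_node):
    """Assumed form of L_avg: mean over the N nodes of the mean latency of
    the N_i packets each node received (every node needs >= 1 packet)."""
    per_node_means = [sum(lat) / len(lat) for lat in latencies_per_node]
    return sum(per_node_means) / len(per_node_means)


def router_power(components):
    """Assumed form of P_i: each component j contributes
    alpha * P_act + (1 - alpha) * P_inact."""
    return sum(alpha * p_act + (1.0 - alpha) * p_inact
               for p_act, p_inact, alpha in components)


def average_router_power(router_powers):
    """Assumed form of the network-wide average router power."""
    return sum(router_powers) / len(router_powers)


def average_router_energy(p_avg, t_sim, t_warmup):
    """Assumed form of E_avg: average power times the measured window."""
    return p_avg * (t_sim - t_warmup)


def energy_per_packet(e_avg, total_packets):
    """Assumed form of E_P: average router energy divided by N_p."""
    return e_avg / total_packets


# Made-up numbers, for illustration only.
l_avg = average_network_latency([[10, 20], [30]])
p_avg = average_router_power([router_power([(2.0, 0.5, 0.5)])])
e_avg = average_router_energy(p_avg, t_sim=100, t_warmup=20)
print(l_avg, p_avg, e_avg, energy_per_packet(e_avg, 40))
```

The per-node averaging in `average_network_latency` weights every node equally regardless of how many packets it received; a per-packet average over the whole network would be a different, equally plausible reading.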
The area occupied by links is due to wires and repeaters. To estimate the area of the repeaters, the area of the global wiring can be calculated from Eq (16), where Area_Link denotes the wire area, f_w is the flit width in bits, w_s is the wire width and s_w is the wire spacing, both computed from the width and spacing of the layer for a particular design style.
This section compares the characteristics of the proposed H-SMBFT topology in terms of average distance as well as router and link complexity. Further, simulation results for synthetic data and real-world examples are presented.

Router and link complexity
The ORION 3.0 simulator was used to estimate the power consumption and area utilization of the routers and links [16]. For this purpose, we grouped all links into Large-Links (L-Link), Medium-Links (M-Link) and Small-Links (S-Link). The S-Links connect nodes with routers in the first level. M-Links are used at level 1 to connect to sibling routers (R). Finally, L-Links connect routers of level l to level l-1. Fig 2 depicts all of these links for the H-SMBFT. We assumed lengths of 1000, 3000 and 8000 μm for the S-, M- and L-Links, respectively, for the estimation of power consumption and area utilization. Table 2 lists the number of links of each type, the router types, the total area and the total power for the BFT, SMBFT and H-SMBFT networks with 64 nodes. The results indicate that H-SMBFT reduces the power consumption by 9.8% and 17.3% compared to SMBFT and BFT, respectively. H-SMBFT also improves area utilization by 7.9% and 21.1% compared to SMBFT and BFT, respectively. The error of the link and router power consumption and area utilization was estimated using Student's t-test with five seeds [31]. Errors of ±1.31%, ±1.64% and ±1.82% were found for the link and router power consumption of the H-SMBFT, SMBFT and BFT networks, respectively. Similarly, errors of ±1.63%, ±1.87% and ±1.96% were recorded for the link and router area utilization of the H-SMBFT, SMBFT and BFT networks, respectively. The maximum recorded error of ±1.96% indicates that the simulated results are reliable.
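One common way to turn five seeded runs into a "± percent" figure is the half-width of a Student's t confidence interval, expressed relative to the mean. The Python sketch below shows that calculation under explicit assumptions: the text does not state the confidence level or the exact error definition, so the 95% level, the function name `percent_error` and the sample values are illustrative only.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t for 4 degrees of freedom
# (five samples). The 95% level is an assumption, not taken from the text.
T_CRIT_95_DF4 = 2.776


def percent_error(samples):
    """Half-width of the 95% confidence interval of the mean,
    as a percentage of the mean, for exactly five seeded runs."""
    if len(samples) != 5:
        raise ValueError("this sketch expects exactly five samples")
    mean = statistics.mean(samples)
    std_error = statistics.stdev(samples) / math.sqrt(len(samples))
    return 100.0 * T_CRIT_95_DF4 * std_error / mean


# Hypothetical power readings from five seeds (arbitrary units).
print(percent_error([100.0, 101.0, 99.0, 100.0, 100.0]))
```

With a different confidence level, only the critical value changes; the rest of the calculation stays the same.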

Simulation environment
We applied the NoCTweak simulator to compare the BFT, SMBFT and H-SMBFT topologies regarding latency as well as power and energy consumption [17]. We chose NoCTweak because it considers the post-layout synthesis results of all router and link components. Furthermore, it is cycle-accurate and permits the integration of different network topologies. In order to compare the topologies, we integrated the BFT, SMBFT and H-SMBFT topologies and chose the simulation parameters given in Table 3.
In an initial step, the RTL designs of all router components used by NoCTweak were synthesized with Synopsys Design Compiler for a standard cell library in a commercial CMOS technology. Next, the designs were placed and routed with Cadence SoC Encounter, followed by the extraction of basic delay and power data. These data were then fed into the simulator to estimate power, energy and delay based on the activities of the components while running the selected traffic patterns. The standard link length between router modules is defined as 1000 μm. This length is set for each node by the NoCTweak simulator according to the requirements of the design and follows the classification into Large-Links (L-Link), Medium-Links (M-Link) and Small-Links (S-Link) discussed in the Router and link complexity section. Two kinds of traffic, namely synthetic traffic traces and traces extracted from real-world application workloads, were applied to all networks for a fair comparison. The related results are presented in the following section.

Simulation results for synthetic data
The synthetic traffic patterns Random, Hotspot, Transpose, Shuffle and Neighbor were applied for an initial comparison of all three topologies. Border-critical cases were simulated by applying 100% traffic load and assigning a high priority to the extreme pairs in the 64-node networks. Table 4 shows the absolute values and percentage savings of the H-SMBFT topology in terms of network power, energy and average latency. Fig 3 depicts the network power, energy per packet and average latency for all topologies under consideration. For Random traffic, as shown in Table 4 and Fig 3(a)-3(c), the H-SMBFT topology improves power consumption by 17.36% and 8.86%, energy consumption by 16.65% and 7.46%, and average network latency by 15.93% and 6.16% compared to BFT and SMBFT, respectively. Similarly, for Hotspot traffic the power consumption improves by 15.26% and 6.24%, the energy consumption by 15.19% and 6.73%, and the average network latency by 19.75% and 9.34% compared to BFT and SMBFT, respectively. For Transpose traffic, H-SMBFT delivers power savings of 33.82% over BFT and 19.45% over SMBFT, reduces the energy consumption by 32.71% and 16.83%, and lowers the average latency by 18.51% and 8.75% compared to BFT and SMBFT, respectively. The [...] and 13.61%, and improves the average latency by 35.63% and 17.36% compared to BFT and SMBFT, as detailed in Table 4 and shown in Fig 3(a)-3(c). The error of the average network power, energy and latency was estimated using Student's t-test with five seeds. Table 4 and Fig 3(a)-3(c) (top of the bars) show the resulting error estimates for the H-SMBFT, SMBFT and BFT topologies.

Embedded applications
The NoCTweak simulator provides a variety of real-world embedded application workloads. Table 5 lists the number of cores and the required number of streams for the selected applications. We mapped all applications onto a network of 64 cores using Near-Optimal Mapping (NMAP), which is supported by NoCTweak. For this task, we applied the same mapping strategies as reported in [17]. Further, we used the source routing algorithm integrated into NoCTweak to compute the shortest path between all pairs of sources and destinations. The comparative analysis of BFT, SMBFT and the proposed H-SMBFT is illustrated with the help of the Dual Video Object Plane Decoder (DVOPD) application workload [13]. Here, two video streams are decoded in parallel using 32 cores. This application is a scaled version of the Video Object Plane Decoder (VOPD), which consists of 16 cores (see Fig 4(a)) [13]. Each core is identified by a unique number given in parentheses: Variable Length decoder (1), Run Length decoder (2), Inverse scan (3), AC/DC prediction (4), Iquant (5), IDCT (6), Up Sampling (7), VOP reconstructs (8), Padding (9), VOP Memory (10), Up Sampling (11), Reference memory (12), Down Sampling and Content Calculation (13), Arithmetic Decoder (14), Memory (15), Stripe memory (16). Fig 4(b) depicts the related core graph of the VOPD. The communication characteristics with uni/bidirectional links and the required bandwidth in MB/s between the different cores of the DVOPD benchmark are shown in Fig 4(c).
The results of mapping the DVOPD with the NMAP algorithm onto the selected network topologies BFT, SMBFT and H-SMBFT are shown in Fig 5. All cores communicate with each other via routers. Communication between cores connected to the same router takes one hop. If the cores are linked to different routers, the length of the communication path increases according to the topology of the network. For example, the communication C1 → C2 between cores C1 and C2 requires one hop via router R0 in all three topologies (see Fig 5). For the communication C3 → C4, the shortest path in the SMBFT and H-SMBFT topologies involves the two level 1 routers R0 and R1 (see Fig 5(b) and 5(c)). In contrast, in the BFT topology the same communication requires the data to traverse the level 1 and level 2 routers R0, L2 R0 and R1 (see Fig 5(a)).
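The path comparison in this example can be reproduced with a breadth-first search over a small fragment of each topology. The Python sketch below is illustrative only: the link lists are a hypothetical fragment inferred from the paths named in the text (C1, C2 and C3 attached to R0, C4 attached to R1, both routers below the level 2 router L2R0), not the full mapping of Fig 5, and the function name is an assumption.

```python
from collections import deque


def routers_on_shortest_path(links, src, dst):
    """Breadth-first search over undirected links; returns the routers
    visited between the source core and the destination core."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for neighbour in adjacency.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    if dst not in previous:
        raise ValueError("no path between the given cores")
    path = []
    node = dst
    while node is not None:
        path.append(node)
        node = previous[node]
    path.reverse()
    return path[1:-1]  # drop the two cores, keep only the routers


# Hypothetical fragment: BFT has no sibling link between R0 and R1.
BFT_LINKS = [("C1", "R0"), ("C2", "R0"), ("C3", "R0"), ("C4", "R1"),
             ("R0", "L2R0"), ("R1", "L2R0")]
# Same fragment with the level 1 sibling link added.
HSMBFT_LINKS = BFT_LINKS + [("R0", "R1")]

print(routers_on_shortest_path(BFT_LINKS, "C3", "C4"))
print(routers_on_shortest_path(HSMBFT_LINKS, "C3", "C4"))
print(routers_on_shortest_path(HSMBFT_LINKS, "C1", "C2"))
```

In this fragment, the sibling link removes the detour through the level 2 router, which is the effect the text attributes to SMBFT and H-SMBFT for C3 → C4.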
Table 6 and Fig 6(a)-6(c) show the network power, energy and average latency improvements of the 64-node H-SMBFT over the BFT and SMBFT networks for five different real-world embedded application workloads, obtained with the NoCTweak simulator. Table 6 shows that, for two parallel streams of the DVOPD application mapped onto the proposed H-SMBFT and the competing topologies BFT and SMBFT, H-SMBFT improves power consumption by 27.10% and 14.81%, energy consumption by 26.32% and 12.61%, and average network latency by 22.34% and 12.61% compared to BFT and SMBFT, respectively. For two Wifirx embedded application workloads, the power consumption improves by 21.23% and 11.71%, the energy consumption by 25.65% and 12.36%, and the average network latency by 28.75% and 16.94% compared to BFT and SMBFT, respectively. For five parallel streams of Mpeg4 workloads, H-SMBFT saves 18.45% power over BFT and 8.94% over SMBFT, and reduces the energy consumption by 23.72% and 10.65% compared to BFT and SMBFT, respectively. It also improves the average latency by 24.51% and 14.35% compared to BFT and SMBFT, respectively. Under four parallel streams of Cavlc application workloads, the power consumption is reduced by 32.91% and 17.21%, the energy consumption by 30.89% and 15.72%, and the average latency by 14.23% and 6.53% compared to BFT and SMBFT, respectively. Similarly, for two parallel streams of Telecom application workloads, H-SMBFT lowers power consumption by 25.13% and 12.39%, energy consumption by 24.63% and 11.61%, and average latency by 22.73% and 11.32% compared to BFT and SMBFT, as shown in Table 6 and Fig 6(a)-6(c).
The error of the average network power, energy and latency was also estimated using Student's t-test with five seeds for the H-SMBFT, SMBFT and BFT networks under embedded application workloads; the results are shown in Table 6.

Results and discussion
The H-SMBFT network topology for on-chip communication is compared to its predecessor topologies SMBFT and BFT. The results show that the proposed topology reduces the average distance compared to the SMBFT and BFT networks. The theoretical values in Table 1 show that H-SMBFT reduces the average distance by 5.2% and 21.4% compared to SMBFT and BFT, respectively. The proposed topology also has lower demands in terms of the number of links and router complexity, which in turn leads to reduced costs and improved communication performance. Table 2 compares the required links and routers of the BFT, SMBFT and H-SMBFT networks with 64 nodes. The results indicate that H-SMBFT reduces the power consumption by 9.8% and 17.3% and improves area utilization by 7.9% and 21.1% compared to the SMBFT and BFT topologies, respectively. Further, the proposed topology is compared to its predecessor topologies under both synthetic and real-world embedded application workloads in terms of average latency, costs, power and energy consumption of the networks.
The simulation results of Table 4 and Fig 3(a)-3(c) indicate that H-SMBFT is an efficient candidate compared to its predecessor topologies, with notable improvements in average