A New Cross-By-Pass-Torus Architecture Based on CBP-Mesh and Torus Interconnection for On-Chip Communication

Mesh is one of the most promising topologies for on-chip communication because of its regular and simple structure. However, its performance degrades sharply as the network grows, owing to its small bisection width and large network diameter. To overcome this limitation, many researchers have proposed modified Mesh designs that add extra links to improve network latency and power consumption. We previously presented the Cross-By-Pass-Mesh (CBP-Mesh), an improved version of the Mesh topology based on the intelligent addition of extra links. This paper presents an efficient topology named Cross-By-Pass-Torus (CBP-Torus) that further increases the performance of the Cross-By-Pass-Mesh topology. The proposed design merges the best features of the Cross-By-Pass-Mesh and the Torus to reduce the network diameter, minimize the average number of hops between nodes, increase the bisection width and enhance the overall performance of the network. The architectural design of the topology is presented and analyzed against similar 2D topologies in terms of average latency, throughput and power consumption. To verify the actual behavior of the proposed topology, a synthetic traffic trace and five different real embedded application workloads are applied to the proposed topology and to the competing network topologies. The simulation results indicate that Cross-By-Pass-Torus is an efficient candidate among its predecessor and competitor topologies, offering lower average latency and higher throughput at a slight cost in network power and energy for on-chip communication.


Introduction
The growing complexity of System-on-Chip (SoC) designs, characterized by an increasing number of Processing Elements (PEs), requires intelligent solutions for on-chip communication. In response to this challenge, the Network-on-Chip (NoC) is emerging as a new and promising paradigm that targets efficient communication between PEs [1]. NoC-based systems appear as an enhanced solution, an evolution offering flexibility and multitasking parallelism. However, the Mesh topology suffers from an increase in network diameter with network size and from a small bisection width [19]. Many authors have presented architectures that modify the Mesh design by adding links to the network to improve performance. To improve on Mesh performance, the Torus topology adds T-Links that connect all terminal node pairs and so reduce the network diameter, as shown in Fig 1(B) [19]. Diagonal Mesh (D-Mesh) and Diagonal Torus (D-Torus), shown in Fig 1C and 1D, introduce additional diagonal links to reduce the network diameter and the network latency [20][21]. D-Mesh is constructed by adding D-Links to a simple Mesh. D-Mesh comprises nine-degree inner nodes, which reduces the average hop count of the network at the cost of network power [18]. D-Torus combines D-, T- and M-Links in one network, Fig 1(D). D-Torus achieved higher performance than topologies such as Mesh, X-Mesh, D-Mesh and SD-Torus in the comparison in [18]. However, the cost of the D-Torus network increases drastically in terms of area and power consumption [18,21]. High-degree routers are required to implement the D-Mesh and D-Torus network topologies, which increases the cost in power consumption [18].
Center-Connected Mesh (C²-Mesh) and Center-Connected Torus (C²-Torus) networks, shown in Fig 1E and 1F, are based on simplicity and cost effectiveness [19,22]. In C²-Mesh, four additional Cross-Links (C-Links) centrally interconnect the nodes of a 5×5 Mesh network, whereas in the C²-Torus topology all terminal node pairs are also connected. The C²-Mesh and C²-Torus networks are simple and have low cost; however, they perform less efficiently than the D-Mesh and D-Torus topologies [23]. Therefore, an efficient, high-performance and low-cost network architecture is required to account for the ever-increasing number of PEs. CBP-Mesh [23] is an upgrade of C²-Mesh with improved performance compared to its predecessors, as depicted in Fig 1(G). The CBP-Links in CBP-Mesh are more effective at reducing the average latency and improve cost effectiveness compared to D-Mesh and D-Torus [23]. Fig 1A-1H depict 3×3 networks for the Mesh, Torus, D-Mesh, D-Torus, C²-Mesh, C²-Torus, CBP-Mesh and proposed CBP-Torus topologies, where the hexagonal boxes represent routers (Rt) and the lines represent the interconnecting links.

CBP-Torus Architecture
To increase the performance of the Mesh network, the worst-case hop count of the Mesh should be addressed. The worst cases in a Mesh topology are the opposite corner node pairs (Rt 0,0 ↔ Rt 2,2 and Rt 0,2 ↔ Rt 2,0 in Fig 1A), which are four hops apart in a 3×3 network. Using its T-Links, the Torus network covers this distance in two hops by passing through an intermediate corner node. The D-Mesh and D-Torus topologies also take two hops to deliver a packet to the opposite corner node using D-Links. In a 3×3 network, C²-Mesh uses one extra C-Link that reduces the hop count between two opposite corner nodes (Rt 0,0 ↔ Rt 2,2 in Fig 1E). However, communication between the second pair of opposite corner nodes is not affected (Rt 0,2 ↔ Rt 2,0 in Fig 1E). C²-Torus additionally connects the terminal nodes with T-Links to reduce the distance between opposite terminal nodes and so increase network performance. In the CBP-Mesh design, two CBP-Links are added to the Mesh network between both pairs of opposite corner nodes (Rt 0,0 ↔ Rt 2,2 and Rt 0,2 ↔ Rt 2,0 in Fig 1G). This reduces the corner-to-corner distance from two hops to one compared with the Torus, D-Mesh and D-Torus, and also connects the second corner pair (Rt 0,2 ↔ Rt 2,0) that the C²-Mesh and C²-Torus networks leave unconnected. Consequently, all four corner nodes are interlinked in the 3×3 CBP-Mesh by bypassing the central node (Rt 1,1 in Fig 1G). The CBP-Links also reduce the distance from Rt 0,0 to Rt 1,2 and Rt 2,1 to two hops by hopping through Rt 2,2. Similarly, every corner node can reach both middle nodes on the opposite side (among Rt 0,1, Rt 1,0, Rt 1,2 and Rt 2,1), and vice versa, with one further hop after the CBP-Link, which leads to higher performance in the network [23].
The proposed network is the Torus version of CBP-Mesh. The T-Links added in CBP-Torus connect the terminal nodes, as shown for a 3×3 network in Fig 1(H). The addition of T-Links reduces the network diameter from the terminal sides of the proposed network. Together with the CBP-Links and M-Links, the T-Links also provide multiple paths in the proposed network, which helps to accommodate adaptive and dynamic routing algorithms.
In a 5×5 network (Fig 4), the CBP-Links connect the corner nodes to the central node in one hop (Rt 0,0, Rt 0,4, Rt 4,0, Rt 4,4 ↔ Rt 2,2). The middle terminal nodes (Rt 0,2, Rt 4,2 ↔ Rt 2,0, Rt 2,4) also take one hop to connect (see blue arrow lines in Fig 4). Similarly, the corner nodes (Rt 0,0, Rt 0,4, Rt 4,0, Rt 4,4) reach one another in two hops via the center node (Rt 2,2). The middle terminal nodes connect with each other through one intermediate node: Rt 0,2 ↔ Rt 4,2 via Rt 2,0 or Rt 2,4, and Rt 2,0 ↔ Rt 2,4 via Rt 0,2 or Rt 4,2. From the router nodes above, the adjacent green router nodes take one more hop and the blue router nodes take two more hops using M-Links in the CBP-Torus network. The T-Links connect the opposite sides of the terminals like a loop (see green lines in Fig 4).
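The corner-to-corner hop counts quoted above for the 3×3 case can be checked with a small breadth-first search. The following is an illustrative sketch only: the function names (`mesh_links`, `torus_links`, `hops`) are hypothetical, and the two CBP-Links are hard-coded from the description of Fig 1G rather than taken from any reference implementation.

```python
from collections import deque


def mesh_links(rows, cols):
    """Undirected M-Links of a rows x cols Mesh, as a set of frozensets."""
    links = set()
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                links.add(frozenset({(r, c), (r, c + 1)}))
            if r + 1 < rows:
                links.add(frozenset({(r, c), (r + 1, c)}))
    return links


def torus_links(rows, cols):
    """Mesh links plus wrap-around T-Links (assumes rows, cols >= 3)."""
    links = mesh_links(rows, cols)
    for r in range(rows):
        links.add(frozenset({(r, 0), (r, cols - 1)}))
    for c in range(cols):
        links.add(frozenset({(0, c), (rows - 1, c)}))
    return links


def hops(links, src, dst):
    """Minimum hop count from src to dst by breadth-first search."""
    adj = {}
    for link in links:
        a, b = tuple(link)
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    raise ValueError("destination unreachable")


# CBP-Links of the 3x3 CBP-Mesh, taken from the textual description:
# one link per pair of opposite corners.
CBP_3X3 = {frozenset({(0, 0), (2, 2)}), frozenset({(0, 2), (2, 0)})}

mesh = mesh_links(3, 3)
torus = torus_links(3, 3)
cbp_mesh = mesh | CBP_3X3

corner_hops = {
    name: hops(links, (0, 0), (2, 2))
    for name, links in [("Mesh", mesh), ("Torus", torus), ("CBP-Mesh", cbp_mesh)]
}
print(corner_hops)  # {'Mesh': 4, 'Torus': 2, 'CBP-Mesh': 1}
```

Under the same assumptions, `hops(cbp_mesh, (0, 0), (1, 2))` returns 2 (via Rt 2,2), against 3 in the plain Mesh.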

CBP-Torus Design
Each T-Link in CBP-Torus reduces the distance between nodes on the same coordinate (row or column) by up to half [19]. A further advantage is the connection of CBP-Links and T-Links to the center and terminals of the network (see Fig 4), which improves traffic flow and reduces hop count. For example, the hop count between nodes Rt 0,0 ↔ Rt 4,4 or nodes Rt 0,4 ↔ Rt 4,0 reduces from eight in a 5×5 Mesh network (2n-2 with n = 5) to two hops in the CBP-Torus network.
The gray areas in Fig 4 indicate four types of network diameter for the m×n CBP-Torus, namely the diagonal diameter (D_Di), the end-to-end diameter (E_Di), the middle diameter (M_Di) and the Torus diameter (T_Di). These diameters can be computed for a symmetric CBP-Torus of dimension n×n using Eqs (1-5). As the proposed CBP-Torus scales up, the CBP-Links and T-Links become more effective in reducing the distance between nodes in the network. A 3×9 CBP-Torus is shown in Fig 5. For Rt 0,0 ↔ Rt 0,4 and Rt 2,0 ↔ Rt 2,4 (see blue dotted arrow in Fig 5), the hop count reduces to two, as opposed to four in the Mesh, Torus, D-Mesh and D-Torus networks. Similarly, Rt 0,0 ↔ Rt 0,6 and Rt 2,0 ↔ Rt 2,6 take three hops using the CBP-Links, and the adjacent green router nodes take one more hop to deliver the packets.
Moreover, the gray area in Fig 5 indicates the path between nodes Rt 1,0 and Rt 1,6 in a 3×9 network, which has a hop count of six in a Mesh and in the other selected networks. In contrast, in the proposed CBP-Torus the hop count reduces to five (see double arrow lines in Fig 5). For networks with a larger number of nodes, the gain due to CBP- and T-Links increases considerably. For example, in the 3×9 network depicted in Fig 5, the hop count between the extreme nodes Rt 1,0 and Rt 1,8 reduces from eight for a common Mesh to six in the proposed CBP-Torus by using CBP-Links. The T-Links reduce this by a further hop (see T-Link with green dotted lines in Fig 5).
The existence of alternative paths between two nodes increases the tolerance of the network against failing links and routers. Consequently, the proposed CBP-Torus, with T-, CBP- and M-Links in one network, is more robust than the classic Mesh, Torus, C²-Mesh, C²-Torus and CBP-Mesh topologies.
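The 5×5 corner-to-corner example can be reproduced with a short script. This is a sketch under stated assumptions: hops are counted as link traversals, the eight CBP-Links are hard-coded from the description of Fig 4 (corners to center, and middle terminal nodes to each other), and the helper names `build` and `hop_count` are hypothetical.

```python
from collections import deque
from itertools import product

N = 5
NODES = list(product(range(N), repeat=2))


def add(adj, a, b):
    """Insert one undirected link."""
    adj[a].add(b)
    adj[b].add(a)


def build(with_t_links, with_cbp_links):
    """Adjacency of a 5x5 Mesh, optionally with T-Links and CBP-Links."""
    adj = {node: set() for node in NODES}
    for r, c in NODES:  # M-Links
        if c + 1 < N:
            add(adj, (r, c), (r, c + 1))
        if r + 1 < N:
            add(adj, (r, c), (r + 1, c))
    if with_t_links:  # wrap-around links on every row and column
        for i in range(N):
            add(adj, (i, 0), (i, N - 1))
            add(adj, (0, i), (N - 1, i))
    if with_cbp_links:  # placement assumed from the text describing Fig 4
        for corner in [(0, 0), (0, 4), (4, 0), (4, 4)]:
            add(adj, corner, (2, 2))
        for a in [(0, 2), (4, 2)]:
            for b in [(2, 0), (2, 4)]:
                add(adj, a, b)
    return adj


def hop_count(adj, src, dst):
    """Minimum hop count by breadth-first search."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    raise ValueError("destination unreachable")


mesh_hops = hop_count(build(False, False), (0, 0), (4, 4))
cbp_torus_hops = hop_count(build(True, True), (0, 0), (4, 4))
print(mesh_hops, cbp_torus_hops)  # 8 2
```

With these link placements the second corner pair, Rt 0,4 ↔ Rt 4,0, also comes out at two hops.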

Characteristics of CBP-Torus Architecture
The addition of links affects the topology characteristics, which include network diameter, bisection width, router degree, number of links, path diversity and average distance of the network [19]. The characteristics of the selected topologies are as follows, assuming symmetric (n × n) sizes.

Network Diameter
The network diameter is the minimum number of hops between the farthest pair of terminal nodes in the network [3]. Reducing the network diameter minimizes the hop counts between nodes, which reduces the overall latency of the network. A Mesh can be made symmetrical by taking an equal number of rows and columns (n × n), in which case the Mesh network diameter is (2n-2) [20]. The reduced diameter of CBP-Mesh, realized with the CBP-Links in the network, is given in Eq (6). The network diameter of the Torus with its terminal connections is given in Eq (7). The average network diameter of CBP-Torus can be taken as the average of the network diameters of the Torus and CBP-Mesh topologies, and is represented by Eq (8).
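As a sanity check on the (2n-2) Mesh diameter, the diameter can be measured directly as the largest shortest-path distance over all node pairs. The sketch below does this for the plain Mesh and the plain Torus only; the closed form 2⌊n/2⌋ used for the Torus is the standard textbook result and is an assumption here, not a reproduction of Eq (7). The function names are hypothetical.

```python
from collections import deque


def neighbors(node, n, wrap):
    """Mesh neighbors of a node; with wrap=True the grid becomes a Torus."""
    r, c = node
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        rr, cc = r + dr, c + dc
        if wrap:
            yield rr % n, cc % n
        elif 0 <= rr < n and 0 <= cc < n:
            yield rr, cc


def diameter(n, wrap):
    """Largest shortest-path hop count over all node pairs of an n x n grid."""
    best = 0
    for src in ((r, c) for r in range(n) for c in range(n)):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in neighbors(node, n, wrap):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        best = max(best, max(dist.values()))
    return best


for n in range(3, 8):
    # columns: n, measured Mesh diameter, measured Torus diameter
    print(n, diameter(n, wrap=False), diameter(n, wrap=True))
```

The same routine could be applied to a CBP-Mesh or CBP-Torus by extending `neighbors` with the CBP-Links, once their placement for a given n is fixed.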

Bisection Width
Bisection width is the minimum number of links that must be cut to divide the (n × n) nodes of a network into two equal sets [18]. The bisection width of the Mesh network is (n) [19]. Adding links to the network architecture increases the bisection width beyond (n), which gives better throughput and traffic flow in the network [23]. For a CBP-Mesh network with (n × n) nodes, the bisection width is (2n) when n is even and (2n + 1) when n is odd. Similarly, the bisection width of CBP-Torus is (3n) for even n and (3n + 2) for odd n.
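For even n, the Mesh value (n) can be illustrated by counting the links that cross the cut between the two middle columns. The sketch below counts links across that one balanced cut, which is an upper bound on the bisection width in general and coincides with the known values for the plain Mesh (n) and plain Torus (2n). It does not model CBP-Links, and the function names are hypothetical.

```python
def grid_links(n, wrap):
    """Undirected links of an n x n Mesh (wrap=False) or Torus (wrap=True)."""
    links = set()
    for r in range(n):
        for c in range(n):
            for rr, cc in ((r, c + 1), (r + 1, c)):
                if wrap:
                    rr, cc = rr % n, cc % n
                elif rr >= n or cc >= n:
                    continue
                links.add(frozenset({(r, c), (rr, cc)}))
    return links


def middle_cut(n, wrap):
    """Links crossing the cut between columns n/2 - 1 and n/2 (even n only)."""
    if n % 2:
        raise ValueError("this sketch only handles even n")
    half = n // 2
    # A link crosses the cut when its two endpoints lie in different halves.
    return sum(
        1 for link in grid_links(n, wrap)
        if len({col < half for _, col in link}) == 2
    )


for n in (4, 6, 8):
    print(n, middle_cut(n, wrap=False), middle_cut(n, wrap=True))
```

For n = 4 this prints 4 for the Mesh and 8 for the Torus, since each row contributes one M-Link and, in the Torus, one additional wrap-around link.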

Degree of Router
All routers in the Torus topology have degree five. The Mesh, D-Mesh, D-Torus, C²-Mesh, CBP-Mesh and proposed CBP-Torus topologies use routers of varying degree, such as three, four, five, six, seven and nine, depending on the nature of the network. Details, including the local port, are given in Table 1.
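The degrees quoted for the two classic topologies can be tabulated directly. The following sketch counts ports per router in an n × n Mesh and Torus, with the local port included as in Table 1; the function name is hypothetical, the Torus case assumes n ≥ 3, and the extra ports introduced by D-, C- or CBP-Links are not modelled.

```python
from collections import Counter


def degree_histogram(n, wrap):
    """Map router degree (including the local port) to number of routers."""
    hist = Counter()
    for r in range(n):
        for c in range(n):
            ports = 1  # local port to the processing element
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                # In a Torus every direction has a link; in a Mesh only
                # directions that stay inside the grid do.
                if wrap or (0 <= rr < n and 0 <= cc < n):
                    ports += 1
            hist[ports] += 1
    return dict(hist)


print(degree_histogram(3, wrap=False))  # corner, edge and center routers
print(degree_histogram(3, wrap=True))   # every router has the same degree
```

For the 3×3 Mesh this yields four routers of degree three, four of degree four and one of degree five, while all nine Torus routers have degree five.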

Number of links
The number of links required to construct an (n × n) Mesh network is (2n² - 2n), whereas (2n²) links are required for a Torus network [23]. It can be seen from Fig 3 that the CBP-Torus architecture increases the degree of some routers because of the increased number of links; however, the improvement in bisection width gives better control over traffic flow and enhances the throughput of the network, as shown in Table 1.
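The two link-count formulas can be confirmed by enumerating the links explicitly. This is a minimal sketch with a hypothetical helper name; it covers only the M-Links and T-Links, so the additional CBP-Links of the proposed topology are not counted.

```python
def enumerate_links(n, torus):
    """List the links of an n x n Mesh, plus wrap-around links if torus."""
    links = []
    for r in range(n):
        for c in range(n):
            # Link to the right-hand neighbor, wrapping at the last column.
            if c + 1 < n:
                links.append(((r, c), (r, c + 1)))
            elif torus:
                links.append(((r, c), (r, 0)))
            # Link to the neighbor below, wrapping at the last row.
            if r + 1 < n:
                links.append(((r, c), (r + 1, c)))
            elif torus:
                links.append(((r, c), (0, c)))
    return links


for n in range(3, 10):
    mesh_count = len(enumerate_links(n, torus=False))
    torus_count = len(enumerate_links(n, torus=True))
    print(n, mesh_count, 2 * n * n - 2 * n, torus_count, 2 * n * n)
```

For n = 3 the enumeration gives 12 Mesh links and 18 Torus links, matching (2n² - 2n) and (2n²).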

Path Diversity
The CBP-Torus topology provides multiple paths between all node pairs of the network, as shown in Fig 3. Each node pair therefore has more than one path for sending packets from source to destination, which increases the fault tolerance of the network. In the proposed CBP-Torus, three types of path are available to route data packets through the network. Fig 3 depicts the Mesh, Torus and CBP-Links by black, blue and green lines respectively.

Average Distance
The average distance of an N-node network (D_avg), given in Eq (9), is calculated from the minimum hop counts between source nodes and destination nodes [24][25]. D_SP is the shortest path from the source node (Rt_i) to the destination node (Rt_j), specified in units of hops.
The computation results in Table 2 show that CBP-Torus has a lower average distance than the other selected topologies across the different network sizes. Table 3 summarizes the network characteristics of the selected topologies.
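One way to evaluate such an average is to run a breadth-first search from every node and average the resulting hop counts. The sketch below assumes the normalization D_avg = (1 / (N(N-1))) Σ_{i≠j} D_SP(Rt_i, Rt_j), that is, an average over ordered pairs of distinct nodes; the exact form of Eq (9) may differ. It is shown for the plain Mesh and Torus only, and the function names are hypothetical.

```python
from collections import deque
from itertools import product


def grid_adjacency(n, wrap):
    """Adjacency of an n x n Mesh (wrap=False) or Torus (wrap=True)."""
    adj = {}
    for r, c in product(range(n), repeat=2):
        nbrs = set()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if wrap:
                nbrs.add((rr % n, cc % n))
            elif 0 <= rr < n and 0 <= cc < n:
                nbrs.add((rr, cc))
        adj[(r, c)] = nbrs
    return adj


def average_distance(adj):
    """Mean shortest-path hop count over ordered pairs of distinct nodes.

    Assumes the network is connected.
    """
    total = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
    n_nodes = len(adj)
    return total / (n_nodes * (n_nodes - 1))


mesh_avg = average_distance(grid_adjacency(3, wrap=False))
torus_avg = average_distance(grid_adjacency(3, wrap=True))
print(mesh_avg, torus_avg)  # 2.0 1.5
```

Because `average_distance` takes an arbitrary adjacency map, the same routine applies to any topology whose links are added to the map, under the same normalization assumption.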

Performance Vs Cost Comparison
NoC performance can be measured in terms of average latency, throughput, power and energy of the network [26][27][28][29][30]. Different NoC networks need different numbers of routers, with varying numbers of ports, to link the routers and nodes of the network. To analyze the behavior and effectiveness of the proposed topology, a performance versus cost comparison is presented. The selected topologies are the classic Mesh and Torus, the CBP-Torus predecessors C²-Torus and CBP-Mesh, and its competitor D-Torus.

Simulation experiments
The NoCTweak [31] simulator was used to implement and analyze the classic Mesh, Torus, C²-Torus, CBP-Mesh, D-Torus and proposed CBP-Torus topologies. The simulator is an open-source, cycle-accurate tool written in SystemC [31]. NoCTweak was selected because of the availability of large sets of workloads. A synthetic traffic model and several real embedded system application workloads are considered in the simulations. The simulator reports average network latency, throughput, and total network power and energy. The simulator configuration uses wormhole 3-stage pipeline routers with ten-flit buffers, round-robin arbiters and 1000-μm links, in 65 nm CMOS at 1.0 V operating voltage and 1.0 GHz frequency. Each simulation runs for 100,000 cycles with a warm-up time of 20,000 cycles. The existing source routing algorithm is used to compute the shortest path, and the NMAP algorithm is used to map embedded applications onto the processing cores of the network [31]. Uniform random traffic traces with a packet length of ten flits at a flit injection rate of 0.30 flits/cycle/node over the five different network sizes 3×3 to 7×7 and 9×9 are used for simulation and analysis of the selected topologies against the proposed on-chip architecture. The latency and throughput results are depicted in Fig 6A and 6B, showing that the Mesh topology is the worst case for latency and throughput among the topologies. However, the Mesh also has the lowest cost in terms of total network power and energy because of its simple network design, as shown in Fig 6C and 6D. The CBP-Torus topology is the best candidate among Mesh, Torus, C²-Torus, CBP-Mesh and D-Torus in that it has the lowest average network latency across the different network sizes. Fig 6(B) also indicates that CBP-Torus gives higher throughput across the different network sizes and is second only to D-Torus among the selected topologies. D-Torus gives the highest throughput of all the networks.
Because D-Torus requires nine-degree routers for all inner nodes and the highest number of links (see Table 3), it increases the cost in power and energy compared to the other topologies (see Fig 6C and 6D). The proposed CBP-Torus topology uses routers of different degrees and fewer links than D-Torus to connect the network (see Table 1). Hence, CBP-Torus has lower power consumption and energy utilization than D-Torus (see Fig 6C and 6D). Nevertheless, the additional links and router ports in CBP-Torus increase its power cost, which is evident from Fig 6(C).

Embedded Applications
Besides the synthetic traffic, the NoCTweak simulator provides several real-time embedded application traces. The NMAP algorithm is adopted to map the task graph of an application onto the cores of the NoC. Table 4 shows the embedded applications selected for the comparison of topologies.
The complete task graph of one of the chosen applications, the MPEG-4 decoder with 12 cores V0 to V11, is shown in Fig 7(A). The bandwidth required for communication between the different tasks is depicted with arrow lines in Fig 7(A).
The mapping of the MPEG-4 decoder application onto CBP-Torus using the NMAP algorithm is shown in Fig 7(B). The M-, T- and CBP-Links in the CBP-Torus network minimize the paths between nodes: V0 → V9 and V2 → V8 are connected directly by CBP-Links (see the blue lines in Fig 7B), and V0, V11 → V8 are also connected directly by T-Links (see green lines in Fig 7B).
The comparison of average network latency, throughput, total network power and energy under the workloads of five different embedded applications is shown in Fig 8A-8D. Under the embedded traffic of the MPEG-4 decoder application, CBP-Torus takes fewer average latency cycles than Mesh, Torus, C²-Torus, CBP-Mesh and D-Torus by 14.2%, 11.5%, 7.4%, 6.4% and 5.1% respectively. CBP-Torus also produces higher throughput than Mesh, Torus, C²-Torus and CBP-Mesh by 28%, 20%, 16% and 8%, except relative to D-Torus, for which the corresponding figure is less than 15%. For the MPEG-4 application, the proposed architecture consumes more network power than Mesh, Torus, C²-Torus and CBP-Mesh by 37.7%, 21.2%, 7.5% and 4.2%, but 13.6% less than D-Torus. It is evident from Fig 8A-8D that under the traffic of all the selected applications, CBP-Torus takes fewer average network latency cycles than the Mesh, Torus, C²-Torus, CBP-Mesh and D-Torus topologies.

Results and Discussion
To show the scalability of the proposed network, different network sizes from 3×3 to 9×9 were used for simulation and analysis of the selected topologies. The synthetic traffic trace is applied as the workload to all the networks in order to obtain a fair comparison, shown in Fig 6A-6D, in terms of average network latency, throughput, and total network power and energy of the data packets transferred. To achieve good performance in a NoC Mesh network, some authors modified the design and presented the D-Torus network to increase the performance of the Mesh and Torus topologies. However, they achieved lower latency at a high cost in power consumption and energy utilization. The C²-Torus topology showed improved performance with an increase in cost, but it is not comparable with D-Torus-like topologies in terms of performance. CBP-Torus provides a better trade-off, with the lowest latency among all the topologies and lower power consumption than the D-Torus network. CBP-Torus gives lower average latency with better throughput than its predecessor and competitor topologies under both synthetic and embedded application traffic, as shown in Fig 6A-6D and Fig 8A-8D. CBP-Torus proved more effective in reducing the network diameter because the links between terminal node pairs are combined with CBP-Links, which provides the best connectivity in the network. The addition of these features reduces the network diameter and the number of hops between nodes in the network.

Conclusion
Intelligent placement of extra links in a 2D Mesh architecture for interconnecting the nodes of the network can play an important role in achieving high performance at low cost. The proposed CBP-Torus is a modified design of the 2D Mesh architecture that targets high performance and low power. The proposed design integrates the features of the CBP-Mesh and Torus topologies to reduce the latency of the network. The combination of M-, T- and CBP-Links in the CBP-Torus architecture achieves the goals of reducing the network diameter, minimizing the average number of hops in the network and providing multiple paths for the adoption of 2D-based adaptive routing algorithms. CBP-Torus also provides fault tolerance through the presence of additional paths between node pairs. The performance versus cost of the proposed CBP-Torus is analyzed against its predecessor and competitor topologies. The results show that CBP-Torus has the lowest average latency with good throughput among its predecessor and competitor topologies under both kinds of traffic traces, i.e., synthetic and embedded applications. CBP-Torus gives better performance than the other selected topologies, with a slight increase in cost over its predecessors and a lower cost than its competitor topologies. A scalable routing algorithm for CBP-Torus will be proposed in future work.