Temperature-Aware Optimization of Monolithic 3D Deep Neural Network Accelerators

Document Version
Accepted author manuscript

ABSTRACT
We propose a design automation methodology to aid the design of energy-efficient Mono3D DNN accelerators with safe on-chip temperatures for mobile systems. We introduce an optimizer capable of investigating the impact of different chip aspect ratios and chip footprint specifications, and of selecting energy-efficient accelerators under user-specified thermal and performance constraints. We also demonstrate that, using our optimizer, we can reduce energy consumption by 1.6× and area by 2× with a maximum of 9.5% increase in latency compared to a Mono3D DNN accelerator optimized only for performance.

1 INTRODUCTION
Deep Neural Networks (DNNs) are extremely popular for numerous machine learning applications, such as image classification or object detection [1]. There is an increasing demand for DNNs in mobile systems, such as IoT devices, autonomous drones, tablets, etc. To satisfy the performance demands of these devices, accelerators for DNNs are actively being developed [2]. However, the high energy demand of DNNs (due to their heavy computation and data movement) is a major design issue. In addition, mobile systems have tight area and power/thermal budgets (e.g., due to the absence of heat sinks and fans) that add to the constraints associated with designing energy-efficient mobile DNN accelerators.

A systolic array-based DNN accelerator comprises a two-dimensional (2D) array of simple processing elements (PEs), with on-chip scratchpad memories for the input feature map (IFMAP), filter weights (Filter), and output feature map (OFMAP), as shown in Fig. 1 [3]. Each PE consists of a Multiply-and-Accumulate (MAC) unit along with internal registers to store the inputs and partial sums. In a systolic architecture, data flows into the array from the PEs along the top and left edges in Fig. 1 and is passed on to their neighboring PEs every clock cycle. This data flow is uni-directional. Straightforward design and high compute density make systolic arrays a popular choice for DNN accelerators in mobile systems [2].

With technology scaling slowing down, improving performance under energy, power, and thermal constraints is increasingly challenging. Monolithic 3D (Mono3D) is a three-dimensional (3D) integration technology that can overcome 2D scaling bottlenecks by achieving small chip footprint, dense integration, wire length savings, power savings, and high bandwidth [4]. These properties make Mono3D attractive for designing DNN accelerators in mobile systems. However, 3D architectures have significant thermal challenges due to high power densities and vertical thermal resistance [5]. In addition, Mono3D systems have thin device layers, which result in limited lateral heat flow and high inter-tier thermal coupling (unlike through-silicon-via-based 3D stacking), thus exacerbating thermal problems in mobile systems [6]. Consequently, temperature becomes an indispensable part of the methodologies and tools used to architect Mono3D systems.

2 RELATED WORK
DNN accelerators. Energy efficiency is a major design objective for DNN accelerators. Recent works target energy efficiency in systolic array-based accelerators by adjusting DRAM design parameters, such as supply voltage and access latency [7], replacing off-chip DRAM with non-volatile memories [2], or designing dataflow mechanisms to improve data re-use and reduce SRAM accesses [8]. Prior works have also focused on co-designing DNN models and their corresponding hardware accelerators (e.g., [9]). These works focus on 2D accelerators without considering temperature. Another work achieves DNN energy efficiency and latency improvement by stacking memory-on-logic using through-silicon vias (TSVs) [10].

Mono3D. Mono3D is an emerging 3D integration technology where multiple tiers (or device layers) are fabricated sequentially, separated by thin dielectrics, even though current Mono3D fabrication challenges limit the number of tiers to two [4]. The vertical connections between the tiers are achieved using nano-scale inter-tier vias (MIVs) [4]. The thin tiers and MIVs can overcome 2D scaling limitations and provide greater interconnect density, wire length reduction, power savings, and denser integration than traditional TSV-based 3D ICs. There are three types of partitions possible in Mono3D: block-, gate-, and transistor-level. While there are several works in gate- and transistor-level partition [11–13], we focus on a
This section describes our proposed design automation methodology. The flow takes a DNN topology (such as MobileNet or VGG11 [18]) and design constraints as inputs to a Mono3D optimizer that determines design parameters for the accelerators for the subsequent iterations and finally outputs a near-optimal accelerator with safe chip temperatures. This optimization flow starts with performance evaluation using SCALE-Sim, a cycle-accurate simulator for systolic array-based DNN accelerators [19]. SCALE-Sim outputs, along with CACTI-6.5 [20] and Mono3D power models, are then used to generate power traces for the accelerator. CACTI calculates power, energy, access time, and area of SRAMs. We then use HotSpot v6.0 (which we configure to simulate Mono3D systems) to obtain on-chip temperatures at steady state [21]. In addition, inter-tier thermal coupling can affect the temperature-dependent static power, which further influences peak temperature and energy. Therefore, we implement a feedback loop that updates the power traces with the static power, after which HotSpot reruns to obtain updated chip temperatures. This loop continues until the temperature converges.
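The leakage-temperature feedback loop described above can be pictured as a fixed-point iteration. The sketch below is only a toy stand-in for the flow: `steady_state_temp` replaces a HotSpot run with a single lumped thermal resistance, `leakage_power` is a hypothetical exponential leakage model, and every constant is an illustrative assumption rather than a value from this work.

```python
import math

def leakage_power(temp_c, p_leak_ref=0.2, t_ref=45.0, k=0.02):
    """Hypothetical exponential leakage model: static power (W) at temp_c (Celsius)."""
    return p_leak_ref * math.exp(k * (temp_c - t_ref))

def steady_state_temp(total_power_w, t_ambient=45.0, r_thermal=10.0):
    """Toy stand-in for a thermal simulation: one lumped resistance (Celsius per W)."""
    return t_ambient + r_thermal * total_power_w

def converge_temperature(p_dynamic_w, tol_c=0.01, max_iters=50):
    """Iterate power -> temperature -> leakage until the temperature settles."""
    temp = steady_state_temp(p_dynamic_w)  # first run uses dynamic power only
    for iteration in range(1, max_iters + 1):
        total = p_dynamic_w + leakage_power(temp)  # update the power with static power
        new_temp = steady_state_temp(total)        # rerun the thermal model
        if abs(new_temp - temp) < tol_c:
            return new_temp, iteration
        temp = new_temp
    raise RuntimeError("temperature did not converge (possible thermal runaway)")

final_temp, iterations = converge_temperature(2.0)
print(f"converged to {final_temp:.2f} C after {iterations} iterations")
```

In this toy setting, a higher dynamic power raises the leakage term and typically needs more passes before the temperature settles, which mirrors the qualitative behavior the loop is meant to capture.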

### 3.1 Mono3D DNN Accelerator Design

To simulate a realistic Mono3D stack, we have limited our design to two tiers (see Fig. 2 for a cross-sectional view) because existing Mono3D technologies can typically support only two tiers due to the low temperature requirements during the fabrication process of upper tiers [4]. The number of metal layers, dielectric/device layer thickness, and other material properties of the stack are taken from recent work [4, 12]. The systolic array has a higher power consumption than SRAMs and is placed on the tier closer to the heat spreader. The systolic array and SRAMs have a high degree of connectivity through the MIVs since there are many read/write accesses to the SRAMs throughout the computations in the systolic array. We assume a high logic density for the tier with the systolic array, with SRAMs of the appropriate size on the other tier. Any whitespace (as a result of area mismatch between the two tiers) always appears on the SRAM tier in our design. We place whitespace along chip edges so that thermal analyses are not affected.

### 3.2 Mono3D Optimizer

We construct a multi-start simulated annealing (MSA) based optimizer to systematically sweep a sufficient portion of the design space of accelerators and select near-optimal energy-efficient Mono3D architectures for mobile systems. MSA is a probabilistic algorithm that accepts solutions that temporarily degrade the optimization goal in order to escape from local minima. MSA can launch multiple starts in parallel to increase the probability of finding the global minimum. As shown in Fig. 3, our optimizer takes a DNN topology and the following design constraints as inputs: (i) chip footprint budget; (ii) bounds on chip aspect ratio; (iii) limits on systolic array size; (iv) maximum SRAM size; (v) maximum allowed whitespace (as a result of mismatch between the two tiers in the Mono3D chip); (vi) thermal budget (i.e., maximum allowed peak temperature, \(T_{\text{threshold}}\)); and (vii) maximum performance loss \((C_{\text{loss,max}})\) w.r.t. the fastest design that satisfies the design constraints (i)-(vi). The optimizer generates performance, power, and thermal traces for systematically selected Mono3D accelerators, and converges to a near-optimal design for the user-specified optimization goal (e.g., minimizing energy or latency) while satisfying performance and thermal constraints. For the systematic selection of new design candidates, the optimizer uses the operating frequency, the chip's aspect ratio, and combinations of systolic array and SRAMs (that satisfy the whitespace constraint) as its control knobs.
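To make the MSA idea concrete, the following is a minimal, generic multi-start simulated annealing sketch. It is not Algorithm 1: the constraint handling and the simulator calls are omitted, the starts run sequentially rather than in parallel, and `toy_objective`, `toy_neighbor`, and the use of a Metropolis-style acceptance rule are assumptions made for illustration.

```python
import math
import random

def anneal(objective, neighbor, initial, rng, t_start, t_finish, decay, moves_per_temp):
    """One annealing start: accept worse candidates with probability exp(-delta / temp)."""
    current = best = initial
    temp = t_start
    while temp >= t_finish:
        for _ in range(moves_per_temp):
            candidate = neighbor(current, rng)
            delta = objective(candidate) - objective(current)
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                current = candidate  # may be worse than before: escapes local minima
            if objective(current) < objective(best):
                best = current
        temp *= decay  # cool down
    return best

def multi_start_anneal(objective, neighbor, initial_points, seed=0, **schedule):
    """Run one annealing start per initial point and keep the overall best result."""
    rng = random.Random(seed)
    results = [anneal(objective, neighbor, p, rng, **schedule) for p in initial_points]
    return min(results, key=objective)

def toy_objective(x):
    """Hypothetical objective with its minimum at x = 7."""
    return (x - 7) ** 2

def toy_neighbor(x, rng):
    """Hypothetical perturbation: move one step left or right."""
    return x + rng.choice((-1, 1))

starts = [-20, 0, 35]
best = multi_start_anneal(toy_objective, toy_neighbor, starts,
                          t_start=1.446, t_finish=0.738611, decay=0.8, moves_per_temp=35)
print(best, toy_objective(best))
```

Because each start only touches its own state, the list comprehension in `multi_start_anneal` could be replaced by a process pool without changing the logic.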

Algorithm 1 details our optimizer, which is inherently parallelizable because all the starts run in parallel (line 1). Each start is assigned an operating frequency and an aspect ratio range \((AR)\), within which the optimizer determines a near-optimal solution by minimizing the objective function, \(Obj\). \(Obj\) can be execution time (i.e., inference latency), chip power, energy, or another energy efficiency metric. \(T_{\text{start}}, T_{\text{finish}},\) and decay \((\delta)\) are parameters of the optimizer that define the annealing temperatures and the rate of cooling.
We then randomly perturb $S_i$ (lines 3-7). We set

\[
\Delta T = \frac{T_{\text{peak},i} - T_{\text{threshold}}}{1 + \text{random}(0,1)}
\]

randomly select an accelerator ($S_i$) with AR, and a frequency

\[
\frac{\Delta C_{\text{class}}}{C_{\text{class}}} = \frac{C_{\text{p}} - C_{\text{curr}}}{C_{\text{curr}}}
\]

Finally, the optimizer selects the best design among all the starts with the least Obj while satisfying the performance and thermal constraints (line 38). Note that if the objective of the user is to design one accelerator that can run multiple DNNs efficiently, then some additional meta strategies could be integrated into the optimizer. For example, the optimizer can select the fastest/most efficient design out of several optimized solutions for all target DNNs on average, or pick the design that yields the best results for the most frequently run DNNs.
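The final selection step can be sketched as a filter followed by a minimum. In this illustration the `Candidate` record, its field names, and the sample numbers are hypothetical; the performance constraint is interpreted as a latency bound relative to the fastest thermally safe candidate, which is one reading of constraint (vii).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str           # hypothetical label for a design returned by one start
    latency_ms: float   # inference latency
    peak_temp_c: float  # peak on-chip temperature
    obj: float          # value of the objective being minimized

def select_best(candidates, t_threshold_c, c_loss_max):
    """Return the lowest-Obj candidate that meets the thermal and performance limits."""
    thermally_safe = [c for c in candidates if c.peak_temp_c <= t_threshold_c]
    if not thermally_safe:
        return None  # no design satisfies the thermal budget
    fastest = min(c.latency_ms for c in thermally_safe)
    limit = fastest * (1.0 + c_loss_max)  # allowed slowdown w.r.t. the fastest safe design
    feasible = [c for c in thermally_safe if c.latency_ms <= limit]
    return min(feasible, key=lambda c: c.obj)

designs = [
    Candidate("A", 10.0, 78.0, 5.0),
    Candidate("B", 10.4, 76.0, 4.0),
    Candidate("C", 9.5, 85.0, 3.0),   # violates the thermal budget
    Candidate("D", 12.0, 70.0, 2.0),  # violates the performance constraint
]
print(select_best(designs, t_threshold_c=80.0, c_loss_max=0.05))
```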

### 3.3 Performance Model

SCALE-Sim is an open-source, state-of-the-art cycle-accurate simulator for DNN accelerators that operate on 8-bit integer data. It takes the size of systolic array and scratchpad memories and DRAM bandwidth as inputs, simulates a stall-free DNN inference, and outputs compute cycles, non-overlapping DRAM cycles, array utilization, SRAM accesses, and DRAM bandwidth to support stall-free inference. Compute cycles include cycles spent in data transfer between SRAMs and systolic array, along with DRAM cycles that overlap with the computation. We divide the compute cycles and non-overlapping cycles by chip and DRAM frequencies, respectively, to calculate the latency. Among the several dataflows SCALE-Sim supports, we use output stationary as it has been shown to outperform the other dataflows [19].
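The latency calculation above amounts to two divisions and a sum. A minimal sketch, where the function name and the cycle counts are hypothetical and the 735 MHz and 400 MHz values are the chip and DRAM frequencies used elsewhere in this paper:

```python
def inference_latency_s(compute_cycles, non_overlapping_dram_cycles,
                        chip_freq_hz, dram_freq_hz):
    """Latency = compute time at the chip clock + exposed DRAM time at the DRAM clock."""
    compute_time = compute_cycles / chip_freq_hz
    dram_time = non_overlapping_dram_cycles / dram_freq_hz
    return compute_time + dram_time

# Hypothetical cycle counts for illustration only.
latency = inference_latency_s(
    compute_cycles=7_350_000,
    non_overlapping_dram_cycles=400_000,
    chip_freq_hz=735e6,
    dram_freq_hz=400e6,
)
print(f"{latency * 1e3:.3f} ms")
```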

### 3.4 Mono3D Power Models

We use SCALE-Sim outputs to obtain the average dynamic power of the systolic array ($P_{SA,DYNAMIC}$) using Eqs. (1) and (2):

\[
U_{avg} = \frac{\sum_{i=1}^{N} U_i C_i}{\sum_{i=1}^{N} C_i} \tag{1}
\]

\[
P_{SA,DYNAMIC} = U_{avg} \cdot P_{MAC,DYNAMIC} \tag{2}
\]

where $N$ is the total number of convolutional layers in the DNN, $U_i$ and $C_i$ are the utilization and compute cycles, respectively, for the $i$th layer, and $P_{MAC,DYNAMIC}$ is the dynamic power for a MAC unit. We also integrate an exponential leakage model for MAC (see Sec. 4.1.1 for details on MAC’s power model).

We use the minimum SRAM bandwidth (bytes per cycle) generated by SCALE-Sim to decide the number of banks in SRAM. We use CACTI to calculate the SRAM dynamic power and leakage. To estimate SRAM leakage at a finer granularity than the 10-degree default granularity of CACTI, we fit a linear model (a linear model can accurately estimate leakage across close temperatures [22]).
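One simple way to realize such a linear model is piecewise-linear interpolation between neighboring tabulated points. The sketch below assumes this form; the function name and the leakage values in the table are hypothetical and are not CACTI outputs.

```python
def interpolate_leakage(temp, table):
    """Linearly interpolate leakage between the two tabulated points that bracket temp.

    table: list of (temperature, leakage) pairs sorted by temperature,
           e.g. one entry per 10-degree step.
    """
    for (t_lo, p_lo), (t_hi, p_hi) in zip(table, table[1:]):
        if t_lo <= temp <= t_hi:
            slope = (p_hi - p_lo) / (t_hi - t_lo)
            return p_lo + slope * (temp - t_lo)
    raise ValueError("temperature outside the tabulated range")

# Hypothetical leakage values (mW) at 10-degree steps.
leakage_table = [(340, 10.0), (350, 14.0), (360, 20.0)]
print(interpolate_leakage(345, leakage_table))
print(interpolate_leakage(356, leakage_table))
```

A single straight line fitted over a narrow temperature window would serve the same purpose; interpolation is used here only because it needs no fitting step.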

---

1. Annealing temperature is a unitless parameter in MSA that allows it to escape a local minimum by accepting a design with a higher Obj value. Rate of cooling is the rate at which the annealing temperature decays to achieve convergence.

---

We deploy a generic interconnect power model, where the interconnects consume 15% of the total chip dynamic power because (i) DNNs require large amounts of memory for inputs, weights, and outputs, and (ii) there is frequent data movement between the systolic array and SRAMs [23]. We then reduce the interconnect power by 10%, which is equal to the Mono3D iso-performance power savings obtained from a recent work [24]. The interconnect power is then uniformly distributed across the metal layers.
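The interconnect power model (15% of total chip dynamic power, reduced by 10%, spread uniformly over the metal layers) can be worked through numerically. The sketch below assumes that the 15% share is taken of a total that already includes the interconnects, so the remaining blocks account for 85%; that interpretation, the function name, and the example power and layer count are assumptions for illustration.

```python
def interconnect_power_per_layer(p_blocks_dynamic_w, num_metal_layers,
                                 interconnect_share=0.15, mono3d_saving=0.10):
    """Per-metal-layer interconnect power under the assumptions stated above."""
    # Assumption: systolic array + SRAM dynamic power is the remaining 85% of the total.
    p_total = p_blocks_dynamic_w / (1.0 - interconnect_share)
    p_interconnect = interconnect_share * p_total
    # Apply the 10% iso-performance interconnect power saving.
    p_interconnect *= (1.0 - mono3d_saving)
    # Distribute uniformly across the metal layers.
    return p_interconnect / num_metal_layers

# Hypothetical example: 1.7 W of array + SRAM dynamic power, 6 metal layers.
print(interconnect_power_per_layer(1.7, 6))
```

With these example inputs the total dynamic power is 2.0 W, the interconnect share is 0.3 W, the reduced value is 0.27 W, and each of the 6 layers receives about 0.045 W.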

We evaluate energy efficiency for the accelerators using the following metrics: system energy (\(E_{sys}\), includes both the chip and DRAM energy), energy-delay-area-product (EDAP), energy-delay\(^2\)-product (ED2P), and energy-delay-product (EDP). While EDP and ED2P emphasize the execution time, EDAP offers a comparison across accelerators with different chip footprints.
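These metrics are simple products of energy, delay, and area. A minimal sketch, in which the function name, argument names, and sample values are hypothetical:

```python
def efficiency_metrics(chip_energy_j, dram_energy_j, delay_s, area_mm2):
    """Compute E_sys, EDP, ED2P, and EDAP from energy, latency, and chip footprint."""
    e_sys = chip_energy_j + dram_energy_j  # system energy: chip plus DRAM
    return {
        "E_sys": e_sys,
        "EDP": e_sys * delay_s,
        "ED2P": e_sys * delay_s ** 2,     # weights execution time more heavily
        "EDAP": e_sys * delay_s * area_mm2,  # also accounts for chip footprint
    }

# Hypothetical values for illustration only.
metrics = efficiency_metrics(chip_energy_j=0.3, dram_energy_j=0.2,
                             delay_s=0.02, area_mm2=4.0)
print(metrics)
```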

### 3.5 Mono3D Thermal Model

We build a compact thermal model (CTM) in HotSpot for the chip. After each HotSpot run, we update the MAC and SRAM power with their temperature-dependent leakage and rerun HotSpot, repeating until both MAC and SRAM temperatures converge between consecutive HotSpot runs.

We validate our CTM against a model of the same design in COMSOL, a multiphysics simulator that uses the finite element method to solve a second-order heat diffusion equation [25]. We model various aspect ratios, hot spot locations, sizes, and power densities. Overall, we observe a maximum error in peak temperature of 3.89\% w.r.t. COMSOL. We also report average, maximum and RMS errors of 1.53\% (corresponds to 3.2\% w.r.t. COMSOL), and 1.76\%, respectively. We also observe that power profiles resembling our Mono3D setup show a maximum error of 1\% (corresponds to 1.3\%) for peak temperatures close to 80\(^\circ\)C (\(T_{\text{threshold}}\) in our analysis).

4 EXPERIMENTAL RESULTS

In this section, we describe our experimental setup, evaluate our optimizer for correctness and speed, and present the results of our optimization flow. For our analyses, we have used eight DNN inference benchmarks, six from MLPerf [18], namely VGG19, VGG16, VGG11, ResNet50, MobileNet, and GoogLeNet, along with Faster R-CNN [26] and Tiny-YOLO [27]. We group MobileNet, GoogLeNet, ResNet50, and Tiny-YOLO as lower-complexity (LoC) DNNs because of their lower memory usage and fewer MAC operations, and the rest as higher-complexity (HiC) DNNs because of their greater number of MAC operations and higher memory usage [28].

#### 4.1 Experimental setup

##### 4.1.1 SRAM/Systolic Array MAC model

We synthesize a 65 nm 8-bit MAC unit at 250 MHz using the Synopsys Design Compiler (DC) and scale it down to 22 nm technology node. The scaled down area, dynamic power, and the frequency are 121 \(\mu m^2\) (length = 11 \(\mu m\)), 0.25 mW, and 735 MHz, respectively. We also fit a temperature-dependent exponential leakage model for a MAC unit using data points (temperature, leakage) from our synthesized MAC model.
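A temperature-dependent exponential leakage model of the form leakage = a · exp(b · T) can be fitted with ordinary least squares on the logarithm of the leakage. The sketch below assumes that functional form; the function name and the sample data points are hypothetical and are not the synthesized MAC data.

```python
import math

def fit_exponential_leakage(points):
    """Fit leakage = a * exp(b * T) via least squares on ln(leakage).

    points: list of (temperature, leakage) pairs with leakage > 0.
    Returns the coefficients (a, b).
    """
    n = len(points)
    xs = [t for t, _ in points]
    ys = [math.log(p) for _, p in points]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope of ln(leakage) versus temperature
    a = math.exp(mean_y - b * mean_x)  # intercept mapped back from log space
    return a, b

# Hypothetical (temperature in Celsius, leakage in mW) samples generated from a known curve.
samples = [(t, 2e-3 * math.exp(0.03 * t)) for t in (25, 50, 75, 100)]
a, b = fit_exponential_leakage(samples)
print(a, b)
```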

Furthermore, we model 22 nm SRAMs in CACTI-6.5 and the off-chip DRAM is based on 8 Gb LPDDR2-800 x32 chips at 400 MHz, with 8.5 Gbps bandwidth and 200 pJ/byte energy consumption [29].

#### 4.2 Optimizer Evaluation

##### 4.2.1 Setup and Running Times

We launch 6 starts for each frequency and each start is assigned an aspect ratio range. Each start has 6 annealing temperatures with 35 perturbations. We ensure convergence by observing that the optimizer does not accept worse designs as it approaches termination. We run the optimizer multiple times to tune its parameters, i.e., \(T_{\text{start}}, T_{\text{finish}}\) and \(\delta\), to achieve better solutions. Furthermore, our optimizer can work with a larger range of frequencies and still select a near-optimal point (this may require launching more starts in parallel).

SCALE-Sim and HotSpot take 10-60 and 5-45 mins, respectively, depending on the chip footprint and DNN. HiC DNNs have a higher number of MAC operations that lead to higher power densities and peak temperatures (more active PEs), which increase temperature-dependent leakage. Thus, these DNNs require more iterations (4-5) to converge in HotSpot. LoC DNNs require fewer iterations (2-3) due to fewer MAC operations [28] and lower chip power. These long simulation times make an exhaustive search of our large design space impractical and demonstrate the need for an optimizer.

##### 4.2.2 Correctness of the Optimizer

To demonstrate the correctness of our optimizer, we select a smaller design space with one frequency (735 MHz) and an aspect ratio range of 0.94 to 1 (step size of 0.01), under the same constraints listed in Sec. 4.1.2. We evaluate the optimizer with 10\%, 5\%, and 3\% performance constraints. In total, there are 1,196 valid accelerator configurations. We select 2 DNNs, Tiny-YOLO and VGG11, and compare the designs chosen by our optimizer to those determined by an exhaustive search in this smaller design space. The optimizer's parameters \(T_{\text{start}}, T_{\text{finish}}, \delta\) for Tiny-YOLO and VGG11 are set to [1.446, 0.738611, 0.8] and [1.446, 0.885963, 0.85], respectively. The 6 starts are assigned aspect ratio ranges: [0.94, 0.95], [0.95, 0.96], and so on till [0.99, 1]. Across all the objectives (performance, power, energy, EDP, ED2P, and EDAP), the near-optimal designs selected by the optimizer and the global optimum differ by \(\leq 2\%\) in \(\text{Obj}\) values, showing close agreement.
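The assignment of aspect ratio ranges to starts can be illustrated with a small helper that splits an interval into equal sub-ranges. The function name and the rounding to two decimals are assumptions for illustration.

```python
def split_aspect_ratio_range(ar_min, ar_max, num_starts, digits=2):
    """Split [ar_min, ar_max] into num_starts equal, contiguous sub-ranges."""
    step = (ar_max - ar_min) / num_starts
    # Round the edges to suppress floating-point noise such as 0.9500000000000001.
    edges = [round(ar_min + i * step, digits) for i in range(num_starts + 1)]
    return list(zip(edges, edges[1:]))

for ar_range in split_aspect_ratio_range(0.94, 1.0, 6):
    print(ar_range)
```

Calling the same helper with `split_aspect_ratio_range(0.7, 1.3, 6)` produces six sub-ranges of width 0.1 over the full 0.7 to 1.3 aspect ratio interval.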

Exhaustive search for Tiny-YOLO and VGG11 requires 48.3 and 55.2 hours, respectively, with 6 parallel searches, while the optimizer requires 4.5 and 5.5 hours, respectively, with 6 parallel starts.

#### 4.3 Optimization Results

We next discuss the temperature-aware optimization results for various objective functions. The 6 starts are assigned aspect ratio ranges: [0.7, 0.8], [0.8, 0.9], and so on till [1.2, 1.3]. Note that in the following results, the SRAM sizes are ordered as IFMAP, Filter, and OFMAP. If we mention one size, we refer to the total SRAM size.

Table 1: Design space for DNN accelerators.

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Values</th>
</tr>
</thead>
<tbody>
<tr>
<td>Systolic array size</td>
<td>16×16 to 256×256</td>
</tr>
<tr>
<td>Each SRAM size</td>
<td>[32, 64, 128, 256, 512, 1024, 2048, 4096] KB</td>
</tr>
<tr>
<td>Aspect ratio of the chip</td>
<td>0.7 to 1.3</td>
</tr>
<tr>
<td>Frequencies</td>
<td>[735, 600, 500] MHz</td>
</tr>
</tbody>
</table>

4.3.1 Performance. Fig. 4 shows performance versus temperature results for all the designs that our optimizer evaluates before converging to near-optimal solutions for ResNet50 and VGG19 when minimizing latency. The dashed lines are the user-defined performance and thermal constraints. The optimizer selects a 198×184 systolic array with a 4160 KB SRAM at 735 MHz for ResNet50 (Fig. 4a). The figure also shows a few points with slightly worse performance but higher temperature within the performance constraint. Those points have a slightly larger footprint (1%) with more active PEs, which results in higher power and peak temperatures. LoC DNNs have adequate thermal headroom to run on big systolic arrays at 735 MHz without sacrificing performance (see Table 2).

In contrast, HiC DNNs have a higher array utilization (due to more MAC operations) and lead to more thermal violations (due to higher chip power) compared to the LoC DNNs (e.g., VGG19 in Fig. 4b). The optimizer selects a 170×214 systolic array with a 4160 KB SRAM for VGG19. Fig. 4b shows a 5% performance tradeoff w.r.t. the lowest execution time accelerator to obey the tight thermal budget for VGG19. The lowest execution time accelerator has higher utilization (with the same SRAM size), which leads to better performance but higher dynamic power and temperature in the systolic array tier. The inter-tier thermal coupling in Mono3D further increases the static power by 4% (despite the same SRAM size), eventually leading to a 3°C higher peak temperature. On average, HiC DNNs trade off 2% performance to operate under safe temperatures (Table 2).

4.3.2 Power. Fig. 5 shows performance, power, and temperature tradeoffs for ResNet50 and VGG19. We see that at low total chip power (< 1 W), peak temperatures can be high (80°C for ResNet50 and 82°C for VGG19). Here, the DNNs are running on smaller chip footprints (=1 mm²), i.e., with smaller systolic arrays and SRAMs, which leads to higher power density and peak temperatures. The optimizer selects a 126×144 systolic array with a 2112 KB SRAM at 735 MHz for ResNet50. The 600 MHz designs that satisfy the imposed constraints have a larger chip footprint with more PEs operating in parallel, which results in a net higher power (than the selected design). The 500 MHz accelerators violate the performance constraint and thus are not selected by the optimizer. Similarly, the optimizer selects 735 MHz designs for the other LoC DNNs (Table 2).

For VGG19, the optimizer selects a 180×202 systolic array with 2080 KB SRAM at 600 MHz (see Fig. 5b). At 735 MHz, the most power-efficient design under the user-specified constraints is almost the same size as the selected design (≈0.99×) with a similar array utilization and the same SRAM size. The higher dynamic power (due to faster PEs) causes higher temperatures in the systolic array tier, which further increases the static power by 9% due to inter-tier thermal coupling (despite the same size of the SRAM), eventually resulting in a 7°C higher peak temperature. Similarly, a 600 MHz design is selected for VGG16. On the other hand, Faster R-CNN and VGG11 have lower power and lead to relatively fewer thermal violations. Hence, the optimizer finds power-efficient accelerators at 735 MHz under the thermal constraint (see Table 2).

4.3.3 Energy Efficiency. Fig. 6 shows system energy (\(E_{sys}\)) distribution for ResNet50 and VGG19. The optimizer selects a 212×172 systolic array with a 2112 KB SRAM at 735 MHz for ResNet50. While there exist a few 600 MHz accelerators with lower power under the performance constraint, the higher execution time negates the power savings. Similarly, 735 MHz accelerators are selected for the other LoC DNNs. Even for HiC DNNs, the optimizer selects 735 MHz accelerators for all but VGG19 (Table 2). VGG19, being the highest power DNN, benefits from both Mono3D iso-performance power savings and slower PEs, thus making up for the performance loss w.r.t. 735 MHz designs. On average, our optimizer achieves 1.2× energy and 1.1× area savings, with a performance loss of 5.3% across all the DNNs. Finally, selections made by the optimizer for minimizing EDAP achieve up to 2× chip footprint and 1.6× \(E_{sys}\) savings, by sacrificing up to 9.7% latency (average: 1.2×, 1.4×, 5.5%, respectively).

5 CONCLUSION
We propose a design automation methodology that yields near-optimal energy-efficient DNN accelerators based on Mono3D under user-specified thermal and performance constraints. Based on our tradeoff analysis, a few conclusions can be drawn: (i) HiC DNNs with higher dynamic power result in higher temperature, which further increases leakage due to inter-tier thermal coupling, eventually resulting in thermal violations. As a result, HiC DNNs have to trade off performance to operate under safe temperatures. (ii) Although we can add more SRAM and PEs (i.e., a larger systolic array) to utilize the two tiers in a given chip footprint, power efficiency can drop (even at lower frequencies) due to (a) higher dynamic power (more active PEs) and (b) higher SRAM static power, as a result of both SRAM size and inter-tier thermal coupling in Mono3D across all DNNs. (iii) HiC DNNs (e.g., VGG19) with more PEs running in parallel can benefit from running at a lower frequency, along with Mono3D power savings, thereby achieving higher energy efficiency.

REFERENCES