A fixed point exponential function accelerator for a neuromorphic many-core system

DOI:
10.1109/ISCAS.2017.8050528

Document Version
Accepted author manuscript

Link to publication record in Manchester Research Explorer

Citation for published version (APA):

Published in:
IEEE International Symposium on Circuits and Systems

Citing this paper
Please note that where the full-text provided on Manchester Research Explorer is the Author Accepted Manuscript or Proof version this may differ from the final Published version. If citing, it is advised that you check and use the publisher's definitive version.

General rights
Copyright and moral rights for the publications made accessible in the Research Explorer are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

Takedown policy
If you believe that this document breaches copyright please refer to the University of Manchester's Takedown Procedures [http://man.ac.uk/04Y6Bo] or contact uml.scholarlycommunications@manchester.ac.uk providing relevant details, so we can investigate your claim.
A Fixed Point Exponential Function Accelerator for a Neuromorphic Many-Core System

Johannes Partzsch, Sebastian Höppner, Matthias Eberlein, Rene Schüffny, Christian Mayr  
Chair for Highly-Parallel VLSI Systems and Neuromorphic Circuits, Technische Universität Dresden, 01062 Dresden, Germany. Email: johannes.partzsch@tu-dresden.de

David R. Lester, Steve Furber  
School of Computer Science, University of Manchester, Oxford Road, Manchester M13 9PL, UK. Email: david.r.lester@manchester.ac.uk

Abstract—Many models of spiking neural networks heavily rely on exponential waveforms. On neuromorphic multiprocessor systems like SpiNNaker, they have to be approximated by dedicated algorithms, often dominating the processing load. Here we present a processor extension for fast calculation of exponentials, aimed at integration in the next-generation SpiNNaker system. Our implementation achieves single-LSB precision in a 32bit fixed-point format and 250Mexp/s throughput at 0.44nJ/exp for nominal supply (1.0V), or 0.21nJ/exp at 0.7V supply and 77Mexp/s, demonstrating a throughput multiplication of almost 50 and 98% energy reduction at 2% area overhead per processor on a 28nm CMOS chip.

Keywords—MPSoC, neuromorphic computing, SpiNNaker, exponential function

I. INTRODUCTION

In models of neural processing, one of the main mathematical primitives encountered is the exponential function. Examples include the shape of the postsynaptic current, activation function of ion channels, time courses of short term depression, or spike-time-dependent plasticity (STDP) [1]. When emulating these models in neuromorphic circuits, a circuit implementation of the exponential function has to be found. Traditionally, the subthreshold voltage-current dependence of CMOS transistors has been used to generate an exponential current waveform based on a linear (saw-tooth) voltage [2]. The OTA-C approach uses a transconductance to emulate an RC decay [3], [4]. In deep submicron technologies, solutions have to be found that rely less on analog performance, such as a switched capacitor charge sharing emulating an exponential decay [5]. However, the above approaches have four main drawbacks. First, they suffer to varying degrees from deviations due to the inevitable fixed-pattern noise [6]. Second, due to the analog, non-programmable nature of the implementation, the model behaviour can usually only be configured in a narrow range. Third, they are limited by leakage to maximum time constants on the order of seconds [7], insufficient for various plasticity models [1]. Fourth, as analog circuits, they do not scale to ultra-deep submicron technologies and therefore cannot take advantage of the high packing densities and low power operation offered by these technologies. In contrast, the SpiNNaker digital neuromorphic approach [8] uses many-core processor systems to execute software programs that emulate neuroscience models. This achieves unlimited time constants plus flexibility and reliability. However, the exponential calculation on the current SpiNNaker system is carried out in software, making it several orders of magnitude less energy efficient than analog realizations. In this paper, we present a digital exponential function accelerator that is integrated in a prototype chip of the 2nd generation SpiNNaker system. It combines scalability to ultra-deep submicron, unlimited time constants, flexible programmable operation with an energy consumption competitive to analog approaches.

II. BACKGROUND

In neuromorphic MPSoCs, spiking neural networks are emulated by calculating on each processor a set of model equations. SpiNNaker [8] is the first system to exploit this approach, modifying and extending the MPSoC infrastructure (for details, see [9]). One SpiNNaker chip includes 18 ARM cores, and together with 128MByte external DRAM consumes 1W, being significantly more energy efficient than standard high-performance computing. Exponentials are expensive operations on standard processors, as they have to be approximated via tens of basic mathematical operations, and often dominate processing load. Thus, strategies have been developed for avoiding or coarsely approximating exponentials in spiking neural network simulation and hardware [10], [11], [12]. In fixed time step simulations, exponential decays can be realized by a multiplication per time step. However, state variables have to be accessed in each time step, which can become a bottleneck when fetching them from external DRAM. Alternatively, exponentials can be stored in look-up tables [10]. However, this is efficient only if a limited number of different values has to be evaluated repeatedly. The above solutions are restricted to special cases or compromise accuracy. Thus, a hardware accelerator for exponential calculation is an attractive option for boosting performance in the 2nd generation SpiNNaker system. A prototype chip for this purpose has been implemented, following the structure shown in Fig. 1. Each of the four processing elements (PEs) includes one exponential accelerator, whose design is described next.

III. EXPOENTIAL ALGORITHM

The exponential accelerator should be easy to integrate in existing software implementations. Therefore, we use the signed fixed-point s16.15 number format, which has been widely used in the SpiNNaker software framework.
From the wide range of available algorithms for exponential calculation, we chose a hybrid algorithm, combining two LUTs with a polynomial approximation [13], utilizing the self-similarity of the exponential function:

\[ y = \exp(x) = \exp(n) \cdot \exp(p) \cdot \exp(q) \text{ with } x = n + p + q \]

The integer part of the operand is represented by \( n \), whereas the fractional digits are split into an upper part \( p \) and a lower part \( q \). The two functions \( \exp(n) \) and \( \exp(p) \) are evaluated via two separate LUTs, whereas the function \( \exp(q) \) represents a polynomial.

There are several arguments for this hybrid approach. First, the integer LUT utilizes the amplifying effect of the exponential function in conjugation with the limited dynamic range of the employed fixed-point s16.15 format: Only a limited range of operands \( x \) produce a valid result \( y \) in the supported range. The maximum representable value \( 2^{16} - 1 \approx \exp(11.1) \) limits meaningful \( x \) at positive numbers, whereas the minimum representable value \( 2^{-15} \approx \exp(-10.4) \) restricts \( x \) at negative numbers. All operands outside this range evaluate either in an overflow or a result of \( y = 0 \), and can be filtered out beforehand. In effect, the integer LUT needs only 23 entries. The fractional look-up is introduced to restrict the operands range of the polynomial approximation. As a result, the required order of the polynomial is reduced compared to a full polynomial approximation of the fractional part at the same accuracy level. The trade-off here is to make the fractional LUT small enough, so that the savings due to the reduced polynomial order dominate over the effort for the LUT and the additional multiplication. The resulting split into two LUTs significantly reduces the total memory size compared to a combined LUT.

Overall, the look-up and polynomial evaluation can be well parallelized in hardware, which results in a short latency, especially compared to iterative algorithms. This advantage comes at the price of several multiplications for polynomial evaluation and calculation of the final result. However, due to the limited operand range of the polynomial, most multipliers can be designed with relatively low bit widths, which reduces the overall hardware effort. This will be discussed next.

IV. IMPLEMENTATION

The exponential accelerator should be accurate to one LSB for all operands to not compromise on accuracy compared to a software implementation. In worst-case (high operand values), all 32 digits except the sign bit are filled in the result. Thus, all components have to have at most an absolute error of \( 2^{-31} \). To fulfill this requirement, we chose a 6bit fractional LUT for \( f_{\text{frac}} \), which restricts \( f_{\text{poly}} \) to a 4th-order polynomial on the interval \( 0 \leq q < 2^{-6} \). Lower LUT bit widths would require higher-order polynomials with correspondingly higher effort. Starting from a Taylor expansion, the polynomial coefficients were optimized for low number of valid digits in binary notation, to later simplify multiplication with these coefficients. This resulted in:

\[ f_{\text{poly}}(q) = 1 + q + \frac{1}{2} \cdot q^2 + c_3 \cdot q^3 + c_4 \cdot q^4, \text{ with} \]

\[ c_3 = 2^{-3} + 2^{-5} + 2^{-7} + 2^{-9} + 2^{-11} + 2^{-13} \]

\[ c_4 = 2^{-5} + 2^{-7} + 2^{-8} \]

The coefficient \( 1/2 \) is realized by a constant shift, while \( c_3 \) and \( c_4 \) have only 6 and 3 valid digits, respectively, effectively replacing two multiplications by a few additions. Figure 2 shows the absolute error of the polynomial. For the envisaged interval, it stays well below the required accuracy of \( 2^{-31} \), which leaves headroom for rounding errors and quantization of intermediate results.

The structure of the complete implementation is shown in Fig. 3. Multiplication of the fractional LUT result was interleaved with the polynomial evaluation. This results in three more multipliers. However, their bit widths are greatly reduced, which relaxes timing and makes the implementation more suited for tight timing requirements. Only one multiplication remains with two wide operands of 32bit or more, whose operation could easily be split over more than one clock cycle to meet increased clock speed requirements.

Due to its structure, the implementation can be operated as a pipeline, which is made accessible from outside by an output FIFO with four entries. Integration with the ARM-M4F controller is realized via the AMBA high-performance bus (AHB), see Fig. 4. Calculation of one exponential requires one bus write and read operation, resulting in a minimum of 2 clock cycles per exponential in pipelined operation, and a total latency of 6 clock cycles for single operation. The bus interface makes the exponential accelerator easily portable to other processors and also simplifies software integration via memory-mapped access, i.e. write and subsequent read of one memory address. The read operation is stalled if the output FIFO is empty and a value is still in the calculation pipeline, making timing of access completely transparent to the user.
V. RESULTS

The prototype chip of Fig. 1 has been implemented and manufactured in GLOBALFOUNDRIES 28nm SLP CMOS technology. Its photograph is shown in Fig. 5. The exponential accelerator is included in each of the four PEs.

We evaluated the performance of the exponential accelerator with respect to accuracy, speed and energy per operation. Accuracy was assessed via comparison to the double-precision floating point exponential of the C standard library. The resulting sweep over all operands is shown in Fig. 6. The absolute error is always below one LSB, meaning that the result of the exponential accelerator is always one of the two neighbouring fixed-point values of the double-precision reference.

Figure 7 shows measurement results for the energy per exponential for different operating voltages and frequencies. The values were extracted from measurements of total PE power when running the exponential accelerator at maximum throughput in pipelined operation (2 clock cycles per exponential, see Sec. IV), so power consumption of the processor and its local memory is included like in a real application scenario. At nominal operating conditions of 1.0V and 500MHz clock frequency, a value of 0.44nJ/exp at a throughput of 250Mexp/s is achieved. For a reduced supply voltage of 0.7V and 154MHz clock frequency (maximally achievable at this voltage level), energy goes down to 0.21nJ/exp at 77Mexp/s.

We compared the exponential accelerator with a software version of the same algorithm, running on the ARM-M4F on the testchip. It typically requires 95 clock cycles per exponential. Correspondingly, measured energy per exponential is significantly higher, as shown by the comparison in Fig. 8. The exponential accelerator requires only 1.8% of the software implementation for both nominal operating conditions (1.0V/500MHz) and reduced supply voltage (0.7V/154MHz).

The main characteristics of the exponential accelerator and its corresponding software version are summarized in Table I. The additional area for the accelerator block makes up for approx. 2% of the area per processing element.

Table II compares our exponential accelerator to other hardware implementations. The work in [14] achieves higher energy efficiency at lower resolution and speed. However, their
power figure is only estimated after Synthesis and done for the exponential block exclusively, whereas our value measures power spent for the whole processing element. In contrast to our work, the FPGA implementations of [13] and [15] both target floating point values. The implementation of [13] achieves almost the same throughput and higher accuracy. However, this comes at the price of a significantly higher pipeline latency and a big total LUT size of 144kByte (compared to 404Byte in our work). The floating-point representation significantly extends the numerical range compared to our fixed-point implementation, which may be a necessity in certain use cases or result in less effort to be put into normalization. Our fixed-point implementation could still be utilized in this case for calculating the mantissa, while the exponent would require a few additional operations in the normalization. Our fixed-point implementation could still be utilized in this case for calculating the mantissa, while the exponent would require a few additional operations in the

VI. Conclusion

This work presents a hardware accelerator for exponential function calculation, targeted for integration in the next-generation SpiNNaker neuromorphic MPSoC. It achieves single-LSB precision for the employed s16.15 fixed-point format. Measurements on a prototype chip demonstrate a maximum throughput of 250Mexp/s at a total energy of 0.44nJ per exponential. Scaling of supply voltage to 0.7V reduces this value to 0.21nJ/exp, outperforming a software implementation by a factor of 55 at a speed-up of almost 50. With these characteristics, the exponential accelerator will help in speeding up event-based computations of a wide range of neuron and synapse models.

ACKNOWLEDGMENT

This work was supported by the European Union under Grant Agreements No. 604102 and DLV-720270 (Human Brain Project), No. 692519 (PRIME), ERC Advanced Grant BIMPC (320689), EPSRC grant BIPMA (EP/G015740) and the Center for Advancing Electronics Dresden (cfaed). The authors thank ARM and Synopsis for IP and the Vodafone Chair at Technische Universität Dresden for contributions to RTL design.

REFERENCES


TABLE I. SUMMARY OF EXPONENTIAL ACCELERATOR IMPLEMENTATION AND SOFTWARE IMPLEMENTATION

<table>
<thead>
<tr>
<th>Measure</th>
<th>This work [14]</th>
<th>[15]</th>
<th>[13]</th>
</tr>
</thead>
<tbody>
<tr>
<td>Throughput, pipeline</td>
<td>250Mexp/s</td>
<td>5.3Mexp/s</td>
<td>30Mexp/s</td>
</tr>
<tr>
<td>Throughput, single</td>
<td>83Mexp/s</td>
<td>2.8Mexp/s</td>
<td>5.9Mexp/s</td>
</tr>
<tr>
<td>Technology/Platform</td>
<td>28nm</td>
<td>65nm</td>
<td>FPGA</td>
</tr>
<tr>
<td>Total area</td>
<td>10800µm²</td>
<td>20700µm²</td>
<td>FPGA</td>
</tr>
<tr>
<td>Rel. accuracy</td>
<td>fixed</td>
<td>fixed</td>
<td>fixed</td>
</tr>
<tr>
<td>Energy per exp</td>
<td>0.44nJ/exp</td>
<td>0.996nJ/exp</td>
<td>*</td>
</tr>
</tbody>
</table>

* total system power ** only Synthesis results available *** checked only on range -2.2

TABLE II. COMPARISON WITH OTHER HARDWARE IMPLEMENTATIONS