Structure Optimizations of Neuromorphic Computing Architectures for Deep Neural Network

Heechun Park* and Taewhan Kim†
School of Electrical and Computer Engineering, Seoul National University, Seoul, Korea
{*phc, †tkim}@snucad.snu.ac.kr

Abstract—This work addresses a new structure optimization of neuromorphic computing architectures that theoretically speeds up DNN (deep neural network) computation by up to a factor of two over the existing architectures. Precisely, we propose a new structural technique of mixing dendritic-based and axonal-based neuromorphic cores in a way that totally eliminates the inherent non-zero waiting time between cores in the DNN implementation. In addition, in conjunction with the new architecture, we propose a technique of maximally utilizing computation units so that the resource overhead of total computation units can be minimized. We provide a set of experimental data to demonstrate the effectiveness (i.e., speed and area) of our proposed architectural optimizations: ∼2x speedup with no accuracy penalty on the neuromorphic computation, or improved accuracy with no additional computation time.

I. INTRODUCTION

In the new era of neuromorphic computing, biologically inspired neural networks are realized and accelerated by various hardware platforms such as GPUs, FPGAs, ASIC chips, and memristor crossbars (e.g., [1]–[4]) to overcome the memory-computation gap in the traditional von Neumann architecture. The TrueNorth chip, released by IBM [3], is composed of 64×64 neuromorphic cores containing a total of 1 million neurons and 256 million synapses, and each core can represent a neural network with 256 axons, 256 neurons, and a network of 256×256 synapses. TrueNorth is shown to achieve two to three orders of magnitude speedup and five orders of magnitude lower energy consumption over traditional processors. On the other hand, architectures with memristor-based [5] synapse networks have attracted many researchers. A memristor resembles a synapse in that it has a tunable weight value [6], and many cognitive applications have been tested with memristor crossbars of n×n memristors [7], [8].

However, most prior research has focused on designing and optimizing a single neuromorphic core corresponding to one synapse network, and has paid no attention to the cross optimization of multiple synapse networks in DNN (deep neural network) implementation. To the best of our knowledge, no cross optimization technique is applied in any of the previous works [3] [4] [9]; they simply connect individual cores serially. Since DNN computing will be in high demand as applications requiring high accuracy or detailed learning grow more complex [10], structure optimization on the neuromorphic inter-core, as well as on the neuromorphic intra-core, is necessary. (In this work, intra-core refers to the internal structure of a neuromorphic core that performs computations using various neuron models. Meanwhile, inter-core refers to the structure that integrates two neuromorphic cores. It corresponds to implementing the connections between two consecutive synapse networks in neural networks.)

This work proposes an implementation-friendly structure optimization that accelerates the neuromorphic chip for DNN implementation, theoretically by up to a factor of two over the conventional one, with similar use of hardware resources.

II. PRELIMINARY: NEUROMORPHIC INTRA-CORE ARCHITECTURES

Fig. 1(b) shows a conceptual view of a part of the neural network in Fig. 1(a), in which the output value \(z_j\) at the \(j\)th output neuron is computed in two steps:

\[
y_j = \sum_{i=1}^{n} w_{ij} x_i + b_j, \quad z_j = h(y_j)
\]

where \(x_i\) is the input value from the \(i\)th input neuron, \(w_{ij}\) is the weight value between \(x_i\) and \(y_j\), \(h(\cdot)\) is a nonlinear activation function, and \(b_j\) is the bias value. To implement this particular network with a neuromorphic core, the concept of a synapse crossbar is generally used, as depicted in Fig. 1(c), and the weight values in the crossbar are stored in external memory [3] [9] or memristors [4].

The conventional computation flows in a neuromorphic core can be classified by their computation order into two models: axonal-based and dendritic-based\(^1\).

The left figure in Fig. 2(a) illustrates the axonal-based computation [9] which concurrently fetches the weight values from a single input (axon) to all output neurons, providing an iterative accumulation at every output neuron, resulting in parallel generation of all output values. The axonal-based model does not need all inputs ready at once, since the computation can start as long as one of the inputs is ready. However, to accumulate synapse weights every time an input triggers, extra storage elements are necessary for all output neurons to retain their intermediate values until the computation for all input values is completed. Otherwise, all intermediate values of the output neurons shall be read from and restored into memory whenever an input triggers, requiring so many memory access operations which may cause a substantial increase of latency and power consumption.

\(^1\)Although previous works [9] [3] are based on spiking neural network model [11], we used MAC (multiply-accumulate) unit, instead of specific module for spike inputs, as computation unit to describe both models for readability. We are only interested in the computation order of both models.

Fig. 1. An example of neural network, a conceptual view and a structure of neuromorphic implementation.

978-3-9819263-0-9/DATE18/©2018 EDAA

On the other hand, the left figure in Fig. 2(b) illustrates the dendritic-based computation [3], which concurrently fetches the weight values from all inputs to a single output (dendrite), providing the accumulation of all weights for one output neuron at a time, resulting in sequential generation of output values. Therefore, unlike the axonal-based model, only one storage element for the output neuron is required, and it is reused throughout the computation flow to retain the intermediate value and perform the accumulation of the output neuron currently in focus. However, the dendritic-based model can start computation only when all input values are injected, which means that the start time is delayed until all input values are received, and additional storage is needed to hold the input values during the whole computation.
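The two computation orders described above can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware of [3] or [9]: the function names are hypothetical, ReLU is assumed for \(h(\cdot)\), and each inner multiply-add stands for one MAC operation.

```python
def relu(v):
    return v if v > 0 else 0.0

def axonal_order(x, w, b):
    """Process one input (axon) at a time; every output keeps a running sum."""
    acc = list(b)                        # one storage element per output neuron
    for i, xi in enumerate(x):           # can start as soon as x[i] is available
        for j in range(len(acc)):
            acc[j] += w[i][j] * xi       # all outputs advance together
    return [relu(y) for y in acc]

def dendritic_order(x, w, b):
    """Finish one output (dendrite) at a time; needs all inputs up front."""
    z = []
    for j in range(len(b)):
        acc = b[j]                       # single accumulator, reused per output
        for i, xi in enumerate(x):
            acc += w[i][j] * xi
        z.append(relu(acc))              # outputs appear one after another
    return z

x = [1.0, 2.0, 3.0]
w = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]   # w[i][j]: input i -> output j
b = [0.5, -0.5]
assert axonal_order(x, w, b) == dendritic_order(x, w, b)
```

Both functions compute the same \(z_j\); they differ only in loop order, which is what determines when outputs become available and how many intermediate values must be stored.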

III. STRUCTURE OPTIMIZATION OF NEUROMORPHIC INTER-CORE ARCHITECTURE

A. Zero Wait Inter-core Architecture

Conventionally, a DNN is simply implemented by connecting multiple identical neuromorphic cores serially. The right figure in Fig. 2(a) shows the axonal-based inter-core connection, where two axonal-based cores are synchronized at the output neurons of each core due to the simultaneous output generation. Even though the values of all output neurons in the first core are produced in parallel, considerable time is wasted at the inputs of the second core: although all inputs are ready, the MAC unit for an output neuron can accumulate at most one input at a time, and the other inputs have to wait idle. On the other hand, the right figure in Fig. 2(b) shows the dendritic-based inter-core connection, where two dendritic-based cores are synchronized at the input of each core to wait for all inputs to be ready. Similarly, a large delay occurs at the inputs of the second core because the early arriving inputs have to wait for the other inputs to arrive.

Based on the computation flow analysis discussed above, we devise a hybrid structure that can take advantage of the parallel computation at each of the axonal-based and dendritic-based intra-cores, as shown in Fig. 3. We call it the denaxo-driven neuromorphic inter-core structure. The first (left) network in the denaxo-driven inter-core is implemented with a dendritic-based intra-core, while the second (right) network is implemented with an axonal-based intra-core. As a result, as shown in the computation flow denoted by red dotted arrows in Fig. 3, the dendritic-based computation in the first core and the axonal-based computation in the second core can be executed in parallel, causing no stall between the two intra-cores. In detail, the \(i\)th output value from the first (dendritic-based) core is received by the second core as soon as it is generated, and the accumulation for this value runs in the second (axonal-based) core in tandem with the accumulation for the \((i+1)\)th output value of the first core. Therefore, unlike the conventional inter-core structures, the computations of the first and the second cores can be performed simultaneously. This parallel computation concept can be applied to any kind of neuromorphic computing structure with limited computation resources, and when resources are allocated so that the parallelism is fully utilized, the denaxo-driven inter-core structure can achieve up to 2x speedup compared to the traditional approach of connecting identical cores serially.
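The overlap described above can be illustrated with a small timing model. It is a sketch under simplifying assumptions, not the authors' simulator: the first core emits one intermediate value every `t_den` time units, the second core needs `t_axo` time units to accumulate one value into all of its outputs, and the non-overlapped baseline is assumed to perform the same total work back to back. All function and parameter names are hypothetical.

```python
def denaxo_finish_time(n, t_den, t_axo):
    """Finish time of the two-core pipeline for n intermediate values."""
    free_at = 0                          # time at which the second core is idle again
    for i in range(1, n + 1):
        ready = i * t_den                # i-th value leaves the dendritic core
        start = max(ready, free_at)      # wait for the value and for the core
        free_at = start + t_axo          # accumulate it in the axonal core
    return free_at

def serial_finish_time(n, t_den, t_axo):
    """Same amount of work with no overlap between the two cores."""
    return n * t_den + n * t_axo

# Balanced stages: the second core finishes one step after the first core.
assert denaxo_finish_time(4, 4, 4) == 20
assert serial_finish_time(4, 4, 4) == 32
# Unbalanced stages: the slower stage dominates the total time.
assert denaxo_finish_time(4, 4, 2) == 4 * 4 + 2
assert denaxo_finish_time(4, 2, 4) == 2 + 4 * 4
```

When the two stages are balanced, the pipelined time is only one stage step longer than a single core's time, which is where the factor approaching two comes from.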

Consider an \(m \times n \times p\) DNN consisting of an \(m \times n\) network and an \(n \times p\) network aligned serially. For axonal-based implementation, \(m\) inputs for the first core and \(n\) inputs for the second core are treated sequentially, and both cores cannot run in parallel. Then total computation time \(T_{\text{axo}}\) becomes:

\[
T_{\text{axo}} = m \cdot t_{\text{axo},1} + n \cdot t_{\text{axo},2},
\]

where \(t_{\text{axo},1}\) and \(t_{\text{axo},2}\) are the time to process computations related to one axonal input of the first \((m \times n)\) and second \((n \times p)\) networks, respectively. (e.g., the dotted lines in Fig. 2(a))

For dendritic-based implementation, \(n\) outputs of the first core and \(p\) outputs of the second core are generated sequentially, and both cores also cannot run concurrently. Then total computation time \(T_{\text{den}}\) is:

\[
T_{\text{den}} = n \cdot t_{\text{den},1} + p \cdot t_{\text{den},2},
\]

where \(t_{\text{den},1}\) and \(t_{\text{den},2}\) are the time to generate one output value with all axonal inputs of the first and second networks, respectively. (e.g., the dotted lines in Fig. 2(b))

**Lemma 1.** For denaxo-driven implementation, the total computation time \(T_{\text{denaxo}}\) is:

\[
T_{\text{denaxo}} = \begin{cases} n \cdot t_{\text{den},1} + t_{\text{axo},2} & \text{if } t_{\text{den},1} \geq t_{\text{axo},2}, \\ t_{\text{den},1} + n \cdot t_{\text{axo},2} & \text{otherwise}. \end{cases}
\]

Lemma 1 is derived from the observation in Fig. 3 that both of the dendritic-based (first) and the axonal-based (second) intra-cores are fully parallelized, and thus the longer computation time of the intra-cores becomes the computation time of the denaxo-driven inter-core structure. For fair comparison, assume that all structures use one identical computation unit (i.e., MAC) for each of their intra-cores. Then, the computation time improvement by the denaxo-driven inter-core structure over the conventional structures can be abstracted as:

**Theorem 1.** The computing speed improvement ratios \(\rho_{\text{den}} = T_{\text{den}} / T_{\text{denaxo}}\) and \(\rho_{\text{axo}} = T_{\text{axo}} / T_{\text{denaxo}}\) for an \(m \times n \times p\) DNN are:

\[
\rho_{\text{den}} = \rho_{\text{axo}} = \begin{cases} \frac{mn+np}{mn+p} & \text{if } t_{\text{den}} \geq t_{\text{axo}}, \\ \frac{mn+np}{m+np} & \text{otherwise}. \end{cases}
\]

Let us consider the improvement ratio when both synapse networks in the inter-core structure have the same size, which means \(m \times n = n \times p\). By Theorem 1, \(\rho\) becomes:

\[
\rho_{\text{den}} = \rho_{\text{axo}} = \frac{2n}{n+1}.
\]

Thus, for a DNN of two consecutive synapse networks with identical size, the speedup ratio of denaxo-driven inter-core structure over others theoretically approaches 2 as the network size increases.
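The limit above can be checked numerically with exact fractions. The sketch assumes one MAC per intra-core and measures time in units of one MAC delay, so that the dendritic step of the first network costs \(m\), the axonal step of the second network costs \(p\), and either conventional structure costs \(mn + np\) in total. It also assumes that the pipelined time is the slower stage repeated \(n\) times plus one step of the faster stage. Function names are hypothetical.

```python
from fractions import Fraction

def conventional_time(m, n, p):
    # n outputs of m MACs each, then p outputs of n MACs each
    # (the axonal-based total, m*n + n*p, is the same number).
    return n * m + p * n

def denaxo_time(m, n, p):
    t_den, t_axo = m, p                  # per-step cost of each core
    if t_den >= t_axo:
        return n * t_den + t_axo
    return t_den + n * t_axo

def speedup(m, n, p):
    return Fraction(conventional_time(m, n, p), denaxo_time(m, n, p))

for n in (4, 64, 256, 4096):
    # Equal-size networks (m = p): the ratio equals 2n / (n + 1).
    assert speedup(n, n, n) == Fraction(2 * n, n + 1)
    print(n, float(speedup(n, n, n)))
```

The printed ratios grow toward 2 as \(n\) increases but never reach it, in line with the statement that the speedup approaches 2 for large networks.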

**B. Resource Configuration of Inter-core Architecture**

The internal configuration of the proposed inter-core structure should enable the first and second networks in the structure to be executed in parallel. Fig. 4 shows an example of denaxo-driven inter-core structure configuration with 100% computation resource utilization for a \(4 \times 4 \times 4\) DNN. The left plane implements a dendritic-based intra-core structure and produces the internal output values by using two computation units (MACs) while the right plane implements an axonal-based intra-core structure and produces the external output values by using another two MACs. We can confirm that if all MACs have identical computation time, the internal output values from the left (dendritic-based) plane can be handled by the MACs at the right (axonal-based) plane just before next output values are generated from the left plane.

**Observation 1.** Let \(N_1\) and \(N_2\) be the numbers of MACs allocated to the first and second networks of denaxo-driven inter-core structure for the implementation of \(m \times n \times p\) neural network, respectively, and let \(t_{\text{mac}}\) denote the computation delay of a MAC. Then,

\[
t_{\text{den}} = m \cdot t_{\text{mac}}, \quad t_{\text{axo}} = p \cdot \frac{N_1}{N_2} \cdot t_{\text{mac}}.
\]

Observation 1 follows since each of \(N_1\) MACs of the first (dendritic-based) network can multiply and accumulate the \(m\) input and weight values in \(t_{\text{den}}\) time independently, and the second (axonal-based) network receives \(N_1\) intermediate values generated from the first network and performs multiply-accumulate with \(p\) weights for each value using \(N_2\) MACs in parallel in \(t_{\text{axo}}\) time.

**Lemma 2.** \(T_{\text{denaxo}}\) defined in Lemma 1 can be refined for the cases of \(N_1 \geq 1\) and \(N_2 \geq 1\):

\[
T_{\text{denaxo}} = t_{\text{mac}} \cdot \left[ m + \left( \frac{n}{N_1} - 1 \right) \cdot \max\left\{ m,\; p \cdot \frac{N_1}{N_2} \right\} + p \cdot \frac{N_1}{N_2} \right].
\]

**Theorem 2.** The computing speed improvement ratios \(\rho_{\text{den}}\) and \(\rho_{\text{axo}}\) when \(N_1 \geq 1\) and \(N_2 \geq 1\) MACs are allocated respectively to the first and second planes of a denaxo-driven inter-core structure for an \(m \times n \times p\) DNN are:

\[
\rho_{\text{den}} = \rho_{\text{axo}} = \begin{cases} \frac{mnN_2 + npN_1}{mnN_2 + pN_1^2} & \text{if } t_{\text{den}} \geq t_{\text{axo}}, \\ \frac{mnN_2 + npN_1}{mN_1N_2 + npN_1} & \text{otherwise}. \end{cases}
\]
For implementing DNNs with \( m \times n = n \times p \) and \( n \gg N_1, N_2 \) (which means \( N_1 \approx N_2 \)), \( \rho \) becomes:

\[
\rho_{\text{den}} = \rho_{\text{axo}} = \frac{2n}{n + N_1}.
\]

Thus, the maximum theoretical speedup ratio approaches 2 as the network size increases. The variation of the speedup ratio for several configurations of \( N_1 \) and \( N_2 \) is shown in the experimental section. Theoretically, the highest ratio with full utilization comes from the configuration that satisfies (1) \( 1 \leq N_1 \leq n \), (2) \( 1 \leq N_2 \leq p \) and (3) \( \frac{N_1}{N_2} = \frac{n}{p} \). (Proof is omitted for space limitation.)
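The effect of the MAC allocation can be explored with a short script. It is a sketch under stated assumptions rather than the authors' model: Lemma 2 is read as \(t_{\text{mac}} \cdot [m + (n/N_1 - 1) \cdot \max\{m, p N_1/N_2\} + p N_1/N_2]\), \(N_1\) is assumed to divide \(n\), and the conventional baseline is assumed to use the same \(N_1\) and \(N_2\) MACs with its two cores running back to back. Times are in units of \(t_{\text{mac}}\), and the function names are hypothetical.

```python
from fractions import Fraction

def denaxo_time(m, n, p, n1, n2):
    """Pipelined time under the reading of Lemma 2 stated in the lead-in."""
    t_den = Fraction(m)                  # one batch of N1 outputs, first core
    t_axo = Fraction(p * n1, n2)         # N1 values into p outputs on N2 MACs
    return t_den + (Fraction(n, n1) - 1) * max(t_den, t_axo) + t_axo

def conventional_time(m, n, p, n1, n2):
    """Assumed baseline: same MAC split, no overlap between the cores."""
    return Fraction(m * n, n1) + Fraction(n * p, n2)

def speedup(m, n, p, n1, n2):
    return conventional_time(m, n, p, n1, n2) / denaxo_time(m, n, p, n1, n2)

# The 4 x 4 x 4 example with two MACs per plane.
assert denaxo_time(4, 4, 4, 2, 2) == 12
assert speedup(4, 4, 4, 2, 2) == Fraction(4, 3)

# Equal-size networks with N1 = N2 = N give 2n / (n + N).
for n in (64, 256, 1024):
    for k in (1, 2, 4, 8):
        assert speedup(n, n, n, k, k) == Fraction(2 * n, n + k)
```

Under these assumptions, adding MACs shortens every structure, but the fixed pipeline fill and drain cost weighs more as \(N_1\) grows, so the ratio moves away from 2 unless \(n\) is much larger than \(N_1\).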

C. Using Denaxo-driven Inter-core in Large DNNs

We suggest two options to use the proposed denaxo-driven inter-core as a basic building block in implementing DNNs with many synaptic connections.

**Option 1**: The straightforward option is to implement DNN of \( k \) synapse networks with \( k/2 \) denaxo-driven inter-core structures, in which one inter-core corresponds to the implementation of two serially connected synapse networks. The computation speedup exactly follows that in Theorem 2. Note that this option can implement any form of neural networks only if \( k \geq 2 \), at the expense of the extra work to allocate and deploy MACs in the constituent inter-core structures.

**Option 2**: This option does not require redesigning the denaxo-driven inter-cores according to the size and structure of the neural network to implement. Instead, it requires a slight transformation of the structure of the input DNN, i.e., inserting a smaller hidden layer between the input and output neurons of every synapse network and re-training the weight values with the transformed DNN structure. This transformation provides two distinct benefits:

1. The first one is convenience. The denaxo-driven inter-core structure can be in a compact form of size \( m \times n \) that can implement transformed \( m \times \lfloor \frac{mn}{m+n} \rfloor \times n \) neural network. Fig. 5 shows our proposed denaxo-driven inter-core architecture that can be used as a basic building block of the transformed DNN. In other words, it can be plugged into any transformed neural network to use it as a basic building block.

2. The second one is accuracy. When we assume that an equal amount of arithmetic resources (i.e., \( N = N_1 + N_2 \)) is used by both the conventional intra-cores (i.e., axonal-based and dendritic-based) and the proposed basic block (i.e., denaxo-driven) for an \( n \times n \) neural network, the conventional cores take \( \frac{n \times n}{N} \times t_{\text{mac}} \) time, while our basic block takes \( \left( \frac{n \times n}{N} + n \right) \times t_{\text{mac}} \) time (according to Eq. (8)). Thus, the speed ratio \( \rho \) is
\[
\rho = \frac{n}{n+N} \approx 1 \quad \text{as} \quad n \gg N,
\]
and the transformed neural network with more layers (almost doubled) can improve the learning accuracy.
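The layer transformation of Option 2 can be sketched as follows. The helper names are hypothetical, and the example network is chosen only for illustration; the hidden-layer width is taken as \(\lfloor \frac{mn}{m+n} \rfloor\), as stated above.

```python
def hidden_size(m, n):
    """Hidden-layer width floor(m*n / (m + n)) for one m x n synapse network."""
    return (m * n) // (m + n)

def transform(layers):
    """Insert a smaller hidden layer into every synapse network of a DNN."""
    out = [layers[0]]
    for m, n in zip(layers, layers[1:]):
        out += [hidden_size(m, n), n]
    return out

def synapses(layers):
    """Total number of weights over all consecutive layer pairs."""
    return sum(a * b for a, b in zip(layers, layers[1:]))

original = [128, 128, 128]
deeper = transform(original)
assert deeper == [128, 64, 128, 64, 128]
# The transformed network never needs more weights than the original one,
# because h * (m + n) <= m * n for h = floor(m*n / (m + n)).
assert synapses(deeper) <= synapses(original)
```

With this width, each \(m \times n\) block is replaced by two smaller networks whose combined weight count fits in the original \(m \times n\) budget, which is why the transformed block can stay in a compact form of size \(m \times n\).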

IV. EXPERIMENTAL RESULTS

Experiments are conducted on a Linux machine of 3.50GHz Intel i7 Processor and 16GB memory. We modeled various structures with Verilog HDL description, synthesized them with Synopsys Design Compiler using NANGATE 45nm open cell library [12], and compared the cell area reported from the tool. All structures are simulated on Cadence Incisive Enterprise Simulator to obtain DNN computation time and speed improvement ratio (\( \rho \)). The conventional inter-core structures are produced by aligning two identical intra-core structures, either dendritic-based [3] or axonal-based [9], in serial\(^2\). The proposed denaxo-driven inter-core structure is implemented by aligning dendritic-based core on the first and the axonal-based core on the second, followed by various MAC allocations. All values are encoded in a 16-bit fixed-point number according to [13]. A 4-cycle 16-bit multiplier and a 1-cycle 32-bit accumulator compose one MAC unit, and ReLU (rectified linear unit) module is used for activation.
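A software model of one fixed-point MAC followed by ReLU might look like the sketch below. Several details are assumptions made only for illustration, since the text specifies just a 16-bit fixed-point encoding and a 32-bit accumulator: the Q8.8 split (`FRAC_BITS = 8`), saturation on quantization, two's-complement wraparound in the accumulator, and the way the bias is aligned. All names are hypothetical.

```python
FRAC_BITS = 8   # assumed Q8.8 split; the source only states "16-bit fixed-point"

def to_fixed(v):
    """Quantize a real value to a signed 16-bit fixed-point integer."""
    q = int(round(v * (1 << FRAC_BITS)))
    return max(-(1 << 15), min((1 << 15) - 1, q))    # saturate to 16 bits

def mac(acc, a, b):
    """Add the product of two 16-bit operands to a 32-bit accumulator."""
    acc = (acc + a * b) & 0xFFFFFFFF                 # wrap to 32 bits
    return acc - (1 << 32) if acc & (1 << 31) else acc

def relu_to_fixed(acc):
    """Apply ReLU, drop the extra fractional bits, saturate to 16 bits."""
    return min(max(acc, 0) >> FRAC_BITS, (1 << 15) - 1)

def neuron(xs, ws, bias):
    acc = to_fixed(bias) << FRAC_BITS                # align bias with products
    for x, w in zip(xs, ws):
        acc = mac(acc, to_fixed(x), to_fixed(w))
    return relu_to_fixed(acc) / (1 << FRAC_BITS)

assert neuron([1.0, 2.0], [0.5, 0.25], 0.5) == 1.5
assert neuron([1.0], [-1.0], 0.0) == 0.0
```

The product of two Q8.8 values carries 16 fractional bits, so the accumulator works at that scale and the result is shifted back by `FRAC_BITS` after the activation.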

A. Evaluation of Computation Speed

Table I compares our denaxo-driven inter-core structure with the conventional inter-core structures for various network sizes and MAC unit allocations. Throughout this experiment, identical weight values are used for all types of inter-core implementations, which means all DNN implementations have the same accuracy. Since the computation times of the conventional structures are similar, the speedup ratio (\( \rho \)) is calculated by comparing the computation time of the denaxo-driven structure with the average of both conventional structures. Comparison of cell area is described in the next subsection. Our denaxo-driven inter-core architecture outperforms the conventional ones by 38% to 99%. The denaxo-driven structure gains its speedup by performing the calculations of both network planes in parallel, and the last column of Table I shows that \( \rho \) becomes larger when the size difference between the two network planes is smaller.

\(^2\)Although the intra-core structures in [3] and in [9] both were used in computing a spiking neural network in a prior work, for a fair comparison we trained and tested the conventional intra-cores that use 16-bit fixed-point precision.
TABLE I
COMPARISON OF THE PERFORMANCE OF PRE-DEFINED DENAXO-DRIVEN INTER-CORE STRUCTURE WITH THAT OF THE CONVENTIONAL DENDRITIC-BASED (E.G., [3]) INTER-CORE AND AXONAL-BASED (E.G., [9]) INTER-CORE.

<table>
<thead>
<tr>
<th>Neural networks (m x n x p)</th>
<th>Size ratio (m x n) : (n x p)</th>
<th>#MAC</th>
<th>Dendritic-based inter-core [3]</th>
<th>Axonal-based inter-core [9]</th>
<th>Denaxo-driven inter-core</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td>Area(μm²)</td>
<td>Runtime(cycles)</td>
<td>Area(μm²)</td>
</tr>
<tr>
<td>64 x 64 x 64</td>
<td>1 : 1</td>
<td>1</td>
<td>25027</td>
<td>82698</td>
<td>58638</td>
</tr>
<tr>
<td>128 x 128 x 128</td>
<td></td>
<td>1</td>
<td>48564</td>
<td>329226</td>
<td>113623</td>
</tr>
<tr>
<td>256 x 256 x 256</td>
<td></td>
<td>1</td>
<td>88048</td>
<td>1313802</td>
<td>224370</td>
</tr>
<tr>
<td>128 x 128 x 64</td>
<td>2 : 1</td>
<td>2</td>
<td>56400</td>
<td>181113</td>
<td>88684</td>
</tr>
<tr>
<td>256 x 128 x 128</td>
<td></td>
<td>2</td>
<td>85976</td>
<td>361933</td>
<td>114365</td>
</tr>
<tr>
<td>256 x 250 x 128</td>
<td></td>
<td>2</td>
<td>107052</td>
<td>727201</td>
<td>168520</td>
</tr>
<tr>
<td>128 x 128 x 32</td>
<td>4 : 1</td>
<td>4</td>
<td>123207</td>
<td>214003</td>
<td>89602</td>
</tr>
<tr>
<td>512 x 128 x 128</td>
<td></td>
<td>4</td>
<td>215466</td>
<td>427827</td>
<td>117083</td>
</tr>
<tr>
<td>512 x 256 x 64</td>
<td>8 : 1</td>
<td>8</td>
<td>357385</td>
<td>279983</td>
<td>95920</td>
</tr>
<tr>
<td>512 x 256 x 64</td>
<td></td>
<td>8</td>
<td>378510</td>
<td>558591</td>
<td>149248</td>
</tr>
</tbody>
</table>

Fig. 6. Comparison of computation speedup ratio (ρ) for various MAC allocations (N₁, N₂) to two network planes in an inter-core.

TABLE II
COMPARISON OF CELL AREA USED BY DENAXO-DRIVEN INTER-CORE STRUCTURE WITH THAT OF THE CONVENTIONAL INTER-CORES.

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Area(μm²)</td>
<td>Runtime(cycles)</td>
<td>Area(μm²)</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Runtime(cycles)</td>
</tr>
<tr>
<td>256 x 64 x 128</td>
<td>56515 × 1.18</td>
<td>86148 × 1.32</td>
<td>10079</td>
</tr>
<tr>
<td>256 x 256 x 128</td>
<td>66943 × 1.18</td>
<td>113636 × 1.32</td>
<td>100819</td>
</tr>
<tr>
<td>512 x 256 x 64</td>
<td>56493 × 1.18</td>
<td>169009 × 1.96</td>
<td>73301</td>
</tr>
<tr>
<td>256 x 32 × 32</td>
<td>51255 × 1.17</td>
<td>44935 × 1.44</td>
<td>59576</td>
</tr>
<tr>
<td>256 x 64 x 32</td>
<td>56471 × 1.1</td>
<td>72342 × 2.32</td>
<td>59616</td>
</tr>
</tbody>
</table>

B. Evaluation of Cell Area

Fig. 7 shows the impact of MAC allocation on cell area, i.e., how the cell area of the two conventional intra-core (one neural network) implementations changes as the amount of MAC resource (N) increases. When N is 1 or 2, the area of the axonal-based core is about twice that of the dendritic-based one for the same network size. However, as the value of N increases, the area of the dendritic-based core grows faster than that of the axonal-based one. This is because allocating more MACs to a dendritic-based core also requires more storage units for weight values and intermediate values. Therefore, when designing a denaxo-driven inter-core structure, arithmetic modules must be distributed carefully to the first (dendritic-based) plane with a large input width, since that plane needs more storage for loading weights from memory.

Fig. 7. Comparison of cell area used by the intra-cores under various MAC resource usages (N).

Table II summarizes the relation of network size to cell area of all inter-core structures. We assumed N₁ = N₂ = 1 and the same weight values are used. The area of dendritic-based structure is proportional to the input width of each network (m) and (n) to store all input values for accumulation. In contrast, the area of axonal-based structure is proportional to the output.
compiled implementation can achieve speedup over 60% with reasonable number of MACs.

One more interesting point from Table III is that the runtime of the 784×256×256×10 DNN with the denaxo-driven structure is comparable to that of the much smaller DNN (784×256×10) with the conventional (dendritic-based, axonal-based) structures, and it is almost equal when only a few (i.e., 1 or 2) MACs are used. This implies that using our proposed denaxo-driven inter-core structure with restricted computation resources can improve accuracy by implementing a 'deeper' neural network with no additional computation time.

V. CONCLUSIONS

We proposed a new structure optimization technique to improve the computation speed in neuromorphic computing architectures for deep neural networks. The proposed denaxo-driven inter-core structure, exploiting the characteristics of both dendritic-based and axonal-based neuromorphic core architectures, increases the computation speed theoretically by up to a factor of 2 over that of the conventional structures, and in practice by 38% to 99% depending on the network size and resource utilization.

VI. ACKNOWLEDGEMENTS

This work was supported by Samsung Research Funding & Incubation Center of Samsung Electronics under Project Number SRFC-IT1703-00.

REFERENCES


TABLE III
COMPARISON OF THE PERFORMANCE OF INTER-CORE STRUCTURES WITH MNIST DATABASE [14].
<table>
<thead>
<tr>
<th>N1/N2</th>
<th>784×256×10 (Accuracy = 83.90%)</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Runtime(#cycle)</td>
</tr>
<tr>
<td></td>
<td>Dendritic-based</td>
</tr>
<tr>
<td>1/1</td>
<td>2055513</td>
</tr>
<tr>
<td>2/1</td>
<td>1223995</td>
</tr>
<tr>
<td>4/1</td>
<td>830373</td>
</tr>
<tr>
<td>8/1</td>
<td>629212</td>
</tr>
<tr>
<td>1/11</td>
<td>2092791</td>
</tr>
<tr>
<td>2/11</td>
<td>1888739</td>
</tr>
<tr>
<td>4/11</td>
<td>1487017</td>
</tr>
<tr>
<td>8/11</td>
<td>1286165</td>
</tr>
</tbody>
</table>

Although higher accuracy can be achieved with better weight initialization (e.g., [15]) or efficient learning algorithms (e.g., [16]), we used a simple and fast method for training since trying to improve accuracy with learning algorithm is not relevant to this work.