Development of application specific accelerators for deep convolutional neural networks (ConvNets) have mainly focussed on accelerating the computationally intensive layers, that is the convolutional layers, to improve performance and energy efficiency. Traditional approaches in this space have relied on handcrafted dataflow implementations to leverage the fine-grained parallelism and datalocality properties within these layers. However, ConvNets layers also have an untapped potential from cross-layer data locality.
In our work, we explore a novel approach in the context of deep neural networks accelerators by modelling the computation as a task-dependency directed acyclic graph and proposing a memoryaware heuristic based on Heterogeneous Earliest Finish Time (HEFT) for task-graph scheduling on shared memory systems.
Our results show the benefits of task graphs in terms of better memory use (23.4 % less) over conventional layer-by-layer processing in a simulated environment with the first three layers of LeNet-5. Certain task-graphs trade-off makespan (10% increase) for memory use (20 % decrease). Finally, our exploration of graphs with different slicing configurations for the pooling layer while using memory-aware HEFT versus the original HEFT reveals that regular shaped tiles across layers offers better makespan and memory use than tiles with large dimensions along one axis.