AccPar: Tensor Partitioning for Heterogeneous Deep Learning Accelerators
Venue: HPCA 2020
Authors: Linghao Song, ..., Xuehai Qian, Hai Li, Yiran Chen
Introduction.
Training Deep Neural Networks (DNNs) is both a compute-intensive and a communication-intensive task, and parallelizing the work across many machines (or nodes) is the standard way training is done today. When many machines are used, the work must be partitioned, and many prior works discuss different partitioning techniques. The two basic types are data and model parallelism: in the former, the model weights are replicated across machines and each machine works on a different subset of the inputs; in the latter, the model itself is distributed and all machines work on parts of the same set of inputs (a small sketch of both schemes follows the list below). A prior work, HyPar -- Hybrid Parallelism, analyzes the cost of partitioning each layer under data and model parallelism and picks the option with the lowest cost. This work, AccPar, also proposes a technique to partition a DNN across multiple machines, but with these key differences compared to the prior work:
- HyPar considered only the communication cost, using it as a proxy for the overall cost; AccPar uses both communication and computation cost to build its cost model.
- HyPar does not consider the scenario where the machines across which the DNN is partitioned are heterogeneous.
- A layer can be partitioned based on three tensors; HyPar considers only two of them and hence does not search the complete design space.
- HyPar does not support networks with multiple paths (like Inception), but AccPar does.
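To make the two basic schemes concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper) of a single fully connected layer Y = X @ W, with hypothetical sizes and two accelerators:

```python
import numpy as np

B, C_in, C_out, n_acc = 8, 4, 6, 2   # hypothetical sizes, 2 accelerators
X = np.random.randn(B, C_in)         # mini-batch of inputs
W = np.random.randn(C_in, C_out)     # layer weights

# Data parallelism: split the batch across accelerators, replicate W on each.
Y_data = np.concatenate(
    [x_shard @ W for x_shard in np.split(X, n_acc, axis=0)], axis=0)

# Model parallelism: replicate the batch, split W (the kernels) across accelerators.
Y_model = np.concatenate(
    [X @ w_shard for w_shard in np.split(W, n_acc, axis=1)], axis=1)

# Both recombine into the same full output.
assert np.allclose(Y_data, X @ W) and np.allclose(Y_model, X @ W)
```

Either way the shards recombine into the full output; the schemes differ only in which tensor is split and, as the next section discusses, in where communication is paid.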
The idea.
The key concept around which everything else in this paper is built is the communication model. Each layer can be partitioned in three different ways (explained below), and depending on the partition, the intra-layer and inter-layer communication costs differ. The cost of a partition is analyzed layer by layer. DNN training involves three computations (convolutions): the forward pass, the backward pass, and the weight-gradient computation. Each of these can be partitioned in three different ways. In the first partition type, different inputs of the mini-batch are mapped to different accelerators and the model is replicated. This partition incurs communication cost in the weight-gradient step, since the gradients from all inputs in the mini-batch need to be aggregated before performing the weight update. In the second partition type, the input tensor (across all inputs in the mini-batch) is partitioned along its feature dimension across accelerators. As a result, the forward pass needs communication to accumulate the partial outputs, whereas the backward pass and weight-gradient step do not need inter-accelerator communication. The final partition type replicates the inputs across accelerators, and each accelerator operates on a subset of the kernels. In this partition type, intra-layer communication is needed for the backward pass.
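Of the three, the second type is the least familiar, so extending the earlier sketch (again my own illustration, not the paper's code): the feature dimension of X and the corresponding rows of W are split, so each accelerator produces only a partial output, and these partial outputs must be summed across accelerators -- the forward-pass communication charged to this type.

```python
import numpy as np

B, C_in, C_out, n_acc = 8, 4, 6, 2
X = np.random.randn(B, C_in)
W = np.random.randn(C_in, C_out)

# Each accelerator holds a slice of X's feature dim and the matching rows of W.
partial_Ys = [x_shard @ w_shard
              for x_shard, w_shard in zip(np.split(X, n_acc, axis=1),
                                          np.split(W, n_acc, axis=0))]

# The sum stands in for an all-reduce across accelerators in the forward pass.
Y = sum(partial_Ys)
assert np.allclose(Y, X @ W)
```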
The cost of each partition type is analyzed for each layer, together with the computation cost, to arrive at the overall cost model. For the computation cost, the compute density of the accelerator is taken into account along with the number of operations each layer requires. The final cost of a partitioning is then estimated by a recursive, layer-by-layer cost analysis, sketched below.
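The paper gives the exact cost formulas; here is a hedged sketch of how such a layer-by-layer search can be structured as a dynamic program (the function and cost-model names are hypothetical placeholders, not AccPar's actual API): for each layer and each of the three partition types, keep the cheapest total cost so far, paying an inter-layer conversion cost whenever consecutive layers switch partitions.

```python
def best_partitioning(layers, intra_cost, compute_cost, inter_cost):
    """Sketch of a layer-wise partition search. The three *_cost callables
    are assumed stand-ins for AccPar's communication/computation models."""
    n_types = 3
    # best[t] = minimal total cost of the layers so far, ending in type t
    best = [intra_cost(layers[0], t) + compute_cost(layers[0], t)
            for t in range(n_types)]
    for layer in layers[1:]:
        best = [min(best[p] + inter_cost(layer, p, t)   # conversion cost
                    for p in range(n_types))
                + intra_cost(layer, t) + compute_cost(layer, t)
                for t in range(n_types)]
    return min(best)

# Toy invocation with made-up constant cost models (purely illustrative):
layers = ["conv1", "conv2", "fc"]
total = best_partitioning(
    layers,
    intra_cost=lambda layer, t: 1.0 * t,           # stand-in comm. model
    compute_cost=lambda layer, t: 1.0,             # stand-in compute model
    inter_cost=lambda layer, p, t: 0.5 * (p != t)  # switch penalty
)
```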
Results.
The efficiency of AccPar is evaluated on a heterogeneous system consisting of 128 TPU-v2s and 128 TPU-v3s, across a variety of CNNs such as AlexNet, ResNet, and VGG-net. On the heterogeneous system, compared to the baseline of data-parallel training, AccPar achieves a 6.3x average improvement, whereas HyPar achieves 3.78x. On a homogeneous system, the gap narrows: HyPar achieves an average speedup of 3.51x over the baseline, whereas AccPar achieves 3.86x.
Thoughts.
I have mixed feelings about this paper. At a high level, it seems to be incremental work (the same way I feel HyPar is incremental compared to OWT). I'm skeptical about how much improvement adding computation cost to the overall cost model actually brings, and there is no breakdown of the improvement contributed by each idea. The authors mention that the type-3 partition has never been considered before, but again, empirical evidence of how much improvement comes just from adding this third partition type isn't shown. The average speedup over HyPar is around 2x, and I suspect the majority of this improvement comes from partitioning with awareness of the accelerator (heterogeneous flexibility).