Fig 1.
Illustration of Decoupled Training with Re-computation (DTR): A K = 3 example.
Each color in the figure represents one complete forward and backward pass of a single batch through the whole model. All modules execute the backward stage (upper part) first and then the forward stage (lower part), except the last module, which executes them in the reverse order. In the initial iterations, where modules do not yet have backward stages, w_{t−k+2} is set equal to w_{t−k+1} (e.g., in module m1 at iteration t = 1, W2 = W1).
Table 1.
We assume the ideal case, in which the communication time between GPUs is negligible and the model is split evenly into K modules.
Tf, Tb, and Taux denote the time taken by the forward pass, the backward pass, and the auxiliary network, respectively.
Table 2.
Minput and Mactivations represent the memory taken by one batch of input features and by the activation graphs, respectively; the third quantity is the memory taken by module k.
Fig 2.
An example of weight delay between two forward passes.
Fig 3.
Visualization of delay regulator.
Fig 4.
Example of module imbalance.
Fig 5.
GPU behavior.
Fig 6.
Utilization level without batch compensation (red line: average utilization level).
Fig 7.
Memory usage of ResNet56 for different values of K.
Fig 8.
Top-1 accuracy of ResNet18 using different methods.
Table 3.
Top-1 accuracy (%) of ResNet18 using different methods.
Table 4.
Top-1 accuracy (%) for models with different K on CIFAR10 and CIFAR100.
Table 5.
The validation errors (%) of the compared methods on (a) CIFAR10, (b) CIFAR100, and (c) ImageNet for K = 2.
Results marked with * were rerun using our training strategy, while those without * are the results reported in the original papers [17, 18, 20]. β denotes the gradient shrinking factor used in FDG [21], and γ denotes the learning rate shrinking factor in DTRP.
Table 6.
Top-1 errors (%) for models with different K on CIFAR10.
Table 7.
Comparison of speed using different methods.
Table 8.
Top-1 errors of ResNet20.