Decoupled neural network training with re-computation and weight prediction | PLOS One

Fig 1.

Illustration of Decoupled Training with Re-computation (DTR): A K = 3 example.
Each color in the figure represents a complete forward and backward pass of a single batch through the whole model. All modules execute the backward stage (upper part) first and then the forward stage (lower part), except the last module, which processes them in the reverse order. In the initial iterations, where modules do not yet have backward stages, w^{t−k+2} is set equal to w^{t−k+1} (e.g., in module m₁ at iteration t = 1, w^2 = w^1).

Table 1.

In the ideal case, we assume that the communication time between GPUs is negligible and that the model is evenly split into K modules.
T_f, T_b, and T_aux denote the time taken by the forward pass, the backward pass, and the auxiliary network, respectively.

Table 2.

M_input, M_activations, and […] represent the memory taken by one batch of input features, by the activation graphs, and by module k, respectively.

Fig 2.

An example of weight delay between two forward passes.

Fig 3.

Visualization of the delay regulator.

Fig 4.

Example of module imbalance.

Fig 5.

GPU behavior.

Fig 6.

Utilization level without batch compensation (red line: average utilization level).

Fig 7.

Memory usage of ResNet56 for different values of K.

Fig 8.

Top-1 accuracy of ResNet18 using different methods.

Table 3.

Top-1 accuracy (%) of ResNet18 using different methods.

Table 4.

Top-1 accuracy (%) for models with different K on CIFAR10 and CIFAR100.

Table 5.

The validation errors (%) of the compared methods on (a) CIFAR10, (b) CIFAR100, and (c) ImageNet for K = 2.
Results marked with * were rerun using our training strategy, while those without * are the results reported in the original papers [17, 18, 20]. β denotes the gradient shrinking factor used in FDG [21], and γ denotes the learning rate shrinking factor used in DTRP.

Table 6.

Top-1 errors (%) for models with different K on CIFAR10.

Table 7.

Comparison of speed using different methods.

Table 8.

Top-1 errors of ResNet20.
