Fig 1.

Illustration of Decoupled Training with Re-computation (DTR): A K = 3 example.

Each color in the figure represents one complete forward and backward pass of a single batch through the whole model. All modules execute the backward stage (upper part) first and then the forward stage (lower part), except the last module, which runs them in the reverse order. In initial iterations where modules do not yet have backward stages, wtk+2 is set equal to wtk+1 (e.g. in module m1 at iteration t = 1, W2 = W1).
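The carry-over rule for the initial iterations can be illustrated with a minimal sketch. It assumes a plain SGD update and a hypothetical helper name, `next_weights`; neither is taken from the article, and the sketch shows only the "no backward stage yet" case described in the caption.

```python
from typing import List, Optional, Sequence


def next_weights(
    weights: Sequence[float],
    grads: Optional[Sequence[float]],
    lr: float = 0.1,
) -> List[float]:
    """Return the weights a module uses at its next iteration.

    Hypothetical helper: plain SGD is an assumption, not the article's optimizer.
    """
    if grads is None:
        # No backward stage has run for this module yet, so the weights
        # are carried over unchanged (the W2 = W1 case in the caption).
        return list(weights)
    # Once a gradient is available, apply an ordinary SGD step.
    return [w - lr * g for w, g in zip(weights, grads)]


w1 = [0.5, -1.0]
w2 = next_weights(w1, None)                # initial iteration: no gradient yet
w3 = next_weights(w2, [1.0, 2.0], lr=0.1)  # later iteration: gradient arrived
```

Here `w2` is a copy of `w1`, while `w3` differs from `w2` only once a gradient is supplied.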

Table 1.

Assume in the ideal case that the communication time between different GPUs is negligible and the model is evenly split into K modules.

Tf, Tb, and Taux denote the time taken by the forward pass, the backward pass, and the auxiliary network, respectively.

Table 2.

Minput, Mactivations, and a third symbol (not reproduced here) represent, respectively, the memory taken by one batch of input features, by the activation graphs, and by module k.

Fig 2.

An example of weight delay between two forward passes.

Fig 3.

Visualization of delay regulator.

Fig 4.

Example of module imbalance.

Fig 5.

GPU behavior.

Fig 6.

Utilization level without batch compensation (red line: average utilization level).

Fig 7.

Memory usage of ResNet56 with different values of K.

Fig 8.

Top-1 accuracy of ResNet18 using different methods.

Table 3.

Top-1 accuracy (%) of ResNet18 using different methods.

Table 4.

Top-1 accuracy (%) for models with different K on CIFAR10 and CIFAR100.

Table 5.

The validation errors (%) of the compared methods on (a) CIFAR10, (b) CIFAR100, and (c) ImageNet for K = 2.

Results marked with * were rerun using our training strategy; those without * are the results reported in the original papers [17, 18, 20]. β is the gradient shrinking factor used in FDG [21], and γ is the learning rate shrinking factor in DTRP.

Table 6.

Top-1 errors (%) for models with different K on CIFAR10.

Table 7.

Comparison of speed using different methods.

Table 8.

Top-1 errors of ResNet20.
