Table 1.
Details of datasets.
Fig 1.
Representation of a sequence as a matrix (2D array) after encoding the raw amino acid sequence. The x-axis represents the 20 invariant amino acid positions plus one position for non-amino-acid characters; the y-axis represents the sequence, with a fixed length of at most 1,000 amino acid positions.
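The encoding described in this caption can be sketched in numpy as below. The alphabet ordering, the padding scheme for short sequences, and the function name are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# 20 standard amino acids; the ordering here is an assumption
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MAX_LEN = 1000  # fixed maximum sequence length from the caption

def encode_sequence(seq: str, max_len: int = MAX_LEN) -> np.ndarray:
    """One-hot encode a protein sequence into a (max_len, 21) matrix.

    Columns 0-19 correspond to the 20 standard amino acids; column 20
    flags any non-amino-acid character. Sequences longer than max_len
    are truncated; shorter ones are zero-padded (an assumption).
    """
    matrix = np.zeros((max_len, 21), dtype=np.float32)
    for i, residue in enumerate(seq[:max_len]):
        idx = AMINO_ACIDS.find(residue.upper())
        matrix[i, idx if idx >= 0 else 20] = 1.0
    return matrix

encoded = encode_sequence("MKVLAX")  # 'X' falls into the non-amino-acid column
```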
Fig 2.
The architecture of the proposed feedforward DeepHiFam model.
Encoded sequences are fed into the feature extraction blocks. The extracted feature maps are concatenated before the 1D max-pooling layer, which reduces the dimensionality. After the flattening layer, two different classification sections are designed separately, shown in parts A and B, for non-hierarchical and hierarchical multi-class classification respectively. Classification section A has a fully connected (Dense) layer with a softmax activation function. Section B has multiple outputs, comprising fully connected layers that feed into the final fully connected layers. Dropout layers with a rate of 0.5 are also applied after the flattening layer and the fully connected layers to avoid overfitting of the model.
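The section-A head (Dense layer followed by softmax over the flattened features) can be sketched in numpy as follows; the feature and class dimensions, weight values, and function names are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dense_softmax_head(features: np.ndarray,
                       weights: np.ndarray,
                       bias: np.ndarray) -> np.ndarray:
    """Fully connected (Dense) layer with softmax, as in section A."""
    return softmax(features @ weights + bias)

# Illustrative dimensions: 128 flattened features -> 10 classes
features = rng.standard_normal(128)
W = rng.standard_normal((128, 10))
b = np.zeros(10)
probs = dense_softmax_head(features, W, b)  # a valid probability vector
```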
Fig 3.
The architecture of a block of the feature extraction section of the proposed DeepHiFam.
First, the encoded sequence is fed into the first convolution unit. The extracted feature maps are batch-normalized and passed to a ReLU activation, which helps to improve speed, performance, and stability. The output of the second convolution unit is added to the output of the first convolution unit. Nine parallel blocks with different kernel sizes (see Table 3) are used for feature extraction.
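The block's data flow (convolution, activation, and the residual-style addition of the second convolution's output to the first's) can be sketched in numpy as below. This is a single-channel illustration; batch normalization is omitted, and all function names and kernels are assumptions for the sketch.

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)

def conv1d_same(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """1D convolution with zero 'same' padding (single channel, odd kernel)."""
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, pad)
    return np.array([np.dot(xp[i:i + k], kernel) for i in range(len(x))])

def feature_block(x: np.ndarray,
                  kernel1: np.ndarray,
                  kernel2: np.ndarray) -> np.ndarray:
    """Sketch of one feature-extraction block: first convolution unit
    followed by ReLU (batch norm omitted), then a second convolution
    unit whose output is added to the first unit's output."""
    out1 = relu(conv1d_same(x, kernel1))  # 1st convolution unit + ReLU
    out2 = conv1d_same(out1, kernel2)     # 2nd convolution unit
    return relu(out1 + out2)              # addition of the two outputs
```

With identity kernels, the block simply doubles a non-negative input, which makes the additive path easy to verify by hand.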
Table 2.
Hyper-parameters of DeepHiFam.
Fig 4.
Training and validation loss at the first iteration on the COG-A dataset.
It shows that DeepHiFam is not overfitting, as there is hardly any gap between the two curves.
Table 3.
3-fold cross-validation results on the COG A, B, and C datasets (see Table 1).
Table 4.
The prediction accuracy (%) comparison on the COG A, B, and C datasets.
Table 5.
Parameter comparison with ProtCNN and bi-directional LSTM.
Fig 5.
Training and validation accuracies without overfitting.
(A, B: 1,000 classes with sequence length 250; C, D: 3,000 classes with sequence length 100.) Left: the DeepHiFam model's training and validation accuracy vs. number of epochs. Right: the DeepHiFam model's loss vs. number of epochs. Both graphs show that the model learns without overfitting.
Fig 6.
Prediction accuracy (%) comparisons of GPCR dataset.
Results are extended from DeepFam [18]. This chart shows that our model is capable of simultaneous hierarchical classification through its multiple outputs.
Table 6.
Parameter comparison with ProtCNN and bi-directional LSTM.