Fig 1.

Workflow to obtain histological scores from microscopy images of murine lungs by using convolutional neural networks (CNN).

We built two types of models: a CNN to classify the Ashcroft score (used as an example in the figure) and a CNN to classify an inflammation score. A whole slide scan of a mouse lung (left) is divided into smaller image tiles. The tiles are fed into a CNN model and a probability distribution over the image classes is obtained as an output. We used the Inception-V3 CNN architecture, pre-trained on the ImageNet dataset (1.28 × 10⁶ images) and re-trained on labelled tiles of lung tissue (between 3.5 × 10³ and 1.4 × 10⁴ images, see Methods). From the probability outputs of the two neural networks, the Ashcroft fibrosis and inflammation scores are computed as the score-weighted sum of the class probabilities after a renormalization to 1 without p_ignore (see Methods).
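The score-weighted sum described above can be sketched in a few lines. This is a minimal illustration, not the paper's code; the function name, the class count (8 Ashcroft classes plus one ignore class), and the example probabilities are assumptions for demonstration:

```python
import numpy as np

def weighted_score(probs, scores, ignore_index):
    """Score-weighted sum of class probabilities, after dropping the
    ignore class and renormalizing the remaining probabilities to 1."""
    probs = np.asarray(probs, dtype=float)
    keep = np.arange(probs.size) != ignore_index
    p = probs[keep]
    p = p / p.sum()  # renormalize without p_ignore
    return float(np.dot(p, np.asarray(scores, dtype=float)[keep]))

# illustrative softmax output: Ashcroft classes 0..7, ignore class at index 8
probs = [0.05, 0.6, 0.2, 0.05, 0.02, 0.01, 0.01, 0.01, 0.05]
scores = list(range(8)) + [0]  # the ignore class carries no score
print(round(weighted_score(probs, scores, ignore_index=8), 3))  # → 1.484
```

Averaging such per-tile scores over all tiles of a slide then yields the whole-slide score.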


Fig 2.

Example tiles used for the Ashcroft fibrosis score CNN.

The value on the top left indicates the score, ranging from 0 (healthy) to 7 (large fibrotic masses). In addition, an ignore class was used to recognize various kinds of non-alveolar tissue, such as fat tissue (shown in the example), lymph nodes, large bronchi, blood vessels, and air bubbles in the mounting medium. Scale bar 50 μm.


Fig 3.

Learning curve of the Ashcroft fibrosis-CNN.

The learning curve shows the accuracy of the CNN on the training and validation data vs. the number of epochs (iterations over the training set). The two curves overlap, indicating good generalization of the CNN to the unseen validation data (i.e., no overfitting).
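The visual overlap of the two curves can also be checked numerically as the gap between training and validation accuracy per epoch. A small sketch with made-up accuracy values (the function name and numbers are illustrative, not the paper's measurements):

```python
def generalization_gap(train_acc, val_acc):
    """Per-epoch gap between training and validation accuracy;
    a persistently growing gap would signal overfitting."""
    return [t - v for t, v in zip(train_acc, val_acc)]

train = [0.55, 0.70, 0.78, 0.80]
val   = [0.54, 0.69, 0.77, 0.79]
gaps = generalization_gap(train, val)
# a small, stable gap corresponds to overlapping curves (no overfitting)
```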


Fig 4.

Analyzing annotator agreement and inherent partial ambiguity of image data.

A. Confusion matrix of predicted labels on the validation data (columns) compared to the ground truth (rows). The numbers are classification probabilities, normalized to a row sum of one. Note that the highest values lie either on the diagonal (agreement of ground truth and prediction) or in an element next to the diagonal (a deviation by one neighboring class). The ignore class is to some extent confused with all other classes and vice versa. The overall accuracy was A = 79.5%. B. Confusion matrix of the agreement of two human experts (annotators 1 and 2) on 400 randomly selected image tiles. The overall pattern was similar; however, the inter-annotator agreement of the human experts in terms of accuracy was A = 64.5%, lower than the agreement of the CNN with the unseen validation data. The exact agreement of two human experts will, however, depend on the type and amount of their training. C. Visualization of the inner representation of the image data in the CNN. The last hidden CNN layer representation of the image data was projected into two dimensions with t-SNE, a method for visualizing high-dimensional data. Each dot represents one tile from ~2000 validation images. Insets show example images, along with the predicted label and their approximate locations in the cluster. Most classes are separated, but they are interconnected, and there is ambiguity especially in the transition areas. Note the small group of class 0 tiles close to the “ignore” class (left), which already shows properties of the “ignore” class (e.g. a tile only partially covered by tissue).
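The row normalization used in panels A and B can be reproduced as follows. This is a generic sketch (the helper name and the toy labels are illustrative, not the study's data): counts are accumulated per (true, predicted) pair and each row is divided by its sum, so that every row reads as a conditional probability distribution over predictions given the ground-truth class.

```python
import numpy as np

def row_normalized_confusion(y_true, y_pred, n_classes):
    """Confusion matrix with rows (ground truth) normalized to sum to one."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    row_sums = cm.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1  # avoid division by zero for empty classes
    return cm / row_sums

# toy example with 3 classes
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]
cm = row_normalized_confusion(y_true, y_pred, 3)
print(cm[1, 1])  # fraction of class-1 tiles predicted as class 1
```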


Fig 5.

Comparison of CNN and human expert Ashcroft scores and analysis of amount of data required for training.

A. Comparison of the Ashcroft score assigned by a human pathologist vs. the Ashcroft score assigned by the CNN-based algorithm. Each value is the mean over a whole lung slice from experiments (n = 72) in which animals received varying doses of bleomycin to trigger lung fibrosis. The two scores correlate well (r² = 0.92), with a slope m close to 1 and a y-intercept b close to 0 (m = 1.07 ± 0.04, b = -0.04 ± 0.08; fit parameters are optimum fit parameters ± the difference at the 5% and 95% confidence intervals). B. Dependency of the accuracy of the Ashcroft fibrosis CNN model on the amount of available training data. A(n) / Amax compares the accuracy A(n) of a model trained on n randomly selected images to the accuracy Amax of a model trained on all available labelled images (n = 12000). The dashed line is an empirical fit using an asymptotic function.
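An asymptotic fit of the kind used in panel B can be sketched as follows. The caption does not state the functional form, so the saturating form A(n)/Amax ≈ n/(n + k) and all data points below are assumptions for illustration, fitted here with a crude grid-search least squares:

```python
import numpy as np

# hypothetical (n, A(n)/Amax) pairs, NOT the paper's measurements
n = np.array([500, 1000, 2000, 4000, 8000, 12000], dtype=float)
rel = np.array([0.80, 0.88, 0.93, 0.97, 0.99, 1.00])

def model(n, k):
    """One possible asymptotic form: approaches 1 as n grows."""
    return n / (n + k)

# crude least-squares fit over a grid of candidate k values
ks = np.linspace(1, 2000, 4000)
errs = [((model(n, k) - rel) ** 2).sum() for k in ks]
k_best = ks[int(np.argmin(errs))]
```

Such a fit makes it easy to estimate how many labelled tiles are needed to reach a target fraction of the maximum accuracy.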


Fig 6.

Whole slide scans of Masson's trichrome-stained mouse lungs and corresponding color-coded Ashcroft fibrosis scores.

A and B represent a lung sample with no fibrosis. Note that the lymph node (the dense structure at the bottom center of A) does not contribute to the Ashcroft score image, since it is recognized as tissue to be ignored. C and D show the image and the corresponding Ashcroft map of a fibrotic lung. Scale bar 1 mm.
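A color-coded score map like the one in panels B and D can be built by mapping each tile's score to a color and leaving ignored tiles blank. The grid values, the sentinel -1 for ignored tiles, and the green-to-red ramp below are illustrative assumptions, not the paper's rendering:

```python
import numpy as np

# hypothetical 3x4 grid of per-tile Ashcroft scores; -1 marks ignored tiles
tile_scores = np.array([
    [ 0, 0, 1, -1],
    [ 2, 5, 7,  0],
    [-1, 3, 1,  0],
], dtype=float)

def score_map_rgb(scores, max_score=7):
    """Color-code tile scores from green (0) to red (max); ignored tiles white."""
    h, w = scores.shape
    rgb = np.ones((h, w, 3))       # start all white (ignored tiles stay white)
    valid = scores >= 0
    t = scores[valid] / max_score  # normalized score in 0..1
    rgb[valid, 0] = t              # red channel rises with score
    rgb[valid, 1] = 1 - t          # green channel falls with score
    rgb[valid, 2] = 0
    return rgb

img = score_map_rgb(tile_scores)   # (3, 4, 3) RGB array, ready for imshow
```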


Fig 7.

Representative tiles to illustrate the inflammation score.

Numbers represent the inflammation score. The score is defined by the number of inflammatory cells in a field of view (0: 0–5, 1: 6–10, 2: 11–20, 3: more than 20 inflammatory cells). In addition, an ignore class was defined, analogous to the one shown in Fig 2 (not shown here). Scale bar 50 μm.
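The binning of cell counts into the four score classes is straightforward; a minimal sketch (the function name is ours, the thresholds are those given above):

```python
def inflammation_score(cell_count):
    """Map an inflammatory-cell count per field of view to the 0-3 score."""
    if cell_count <= 5:
        return 0
    if cell_count <= 10:
        return 1
    if cell_count <= 20:
        return 2
    return 3

print([inflammation_score(c) for c in (0, 5, 6, 10, 15, 20, 21, 40)])
# → [0, 0, 1, 1, 2, 2, 3, 3]
```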


Fig 8.

Confusion matrix comparing the classifications of the inflammation score CNN vs the ground truth.

Classifications of the validation data (columns) are compared to the ground truth (rows). Numbers are classification probabilities, normalized to a row sum of one. The highest values lie mostly on the diagonal (agreement of ground truth and prediction) or in an element next to the diagonal (a deviation by one neighboring class).


Fig 9.

Example of the spatial inflammation score.

A. Whole slide scan of a mouse lung, with insets showing a more inflamed tile (top) and a non-inflamed tile (bottom). Scale bar 5 mm. B. Corresponding color-coded map of the inflammation score (0–3).
