Deep learning with robustness to missing data: A novel approach to the detection of COVID-19

In the context of the current global pandemic and the limitations of the RT-PCR test, we propose a novel deep learning architecture, DFCN (Denoising Fully Connected Network). Since medical facilities around the world differ enormously in what laboratory tests or chest imaging may be available, DFCN is designed to be robust to missing input data. An ablation study extensively evaluates the performance benefits of DFCN as well as its robustness to missing inputs. Data from 1088 patients with confirmed RT-PCR results are obtained from two independent medical facilities. The data include results from 27 laboratory tests and a chest x-ray scored by a deep learning model. Training and test datasets are taken from different medical facilities. The data are made publicly available. The performance of DFCN in predicting the RT-PCR result is compared with that of three related architectures as well as a Random Forest baseline. All models are trained with varying levels of masked input data to encourage robustness to missing inputs. Missing data are simulated at test time by masking inputs randomly. DFCN outperforms all other models with statistical significance using random subsets of input data with 2-27 available inputs. When all 28 inputs are available, DFCN obtains an AUC of 0.924, higher than any other model. Furthermore, with two clinically meaningful subsets of just 6 and 7 inputs, DFCN achieves higher AUCs than any other model: 0.909 and 0.919, respectively.


S1 Fig. Data Distribution
The input data distribution per hospital and per RT-PCR test outcome is provided.

Training settings
We train ResNet-18 using a cyclic learning rate [1] between 0.001 and 0.01 with a step size of 2.5 epochs. We use stochastic gradient descent with Nesterov momentum of 0.95. We minimize the cross-entropy loss between the softmax activations and binary labels. We perform all experiments with a batch size of 16.
When fine-tuning the last layer on RT-PCR test results, we use a heavy label smoothing regularization [2] of 0.2 to prevent overfitting. We apply a weight of 0.625 to the positive samples, to account for the class imbalance. This number is calculated by dividing the number of negative cases in the training dataset by the number of positive cases. After training, we restore the model weights that have achieved the best validation cross-entropy loss.
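As a concrete illustration, the triangular cyclic learning-rate schedule and the positive-class weight can be sketched as follows. This is a minimal sketch, not the authors' code; the step size in optimizer steps and the case counts are illustrative assumptions (the counts are chosen only to reproduce the stated 0.625 ratio):

```python
def triangular_lr(step, base_lr=0.001, max_lr=0.01, step_size=250):
    """Triangular cyclic learning rate: rises linearly from base_lr to
    max_lr over step_size optimizer steps, then falls back symmetrically."""
    frac = (step % (2 * step_size)) / step_size
    if frac > 1:
        frac = 2 - frac  # descending half of the cycle
    return base_lr + (max_lr - base_lr) * frac

def positive_weight(n_pos, n_neg):
    """Weight applied to positive samples: negatives divided by positives."""
    return n_neg / n_pos

# e.g. hypothetical counts of 640 positive and 400 negative training cases
# give positive_weight(640, 400) == 0.625, the weight used in the paper
```

In frameworks such as PyTorch or TensorFlow the same schedule is available as a built-in cyclic scheduler; the function above only makes the arithmetic explicit.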

Image preprocessing
We convert the DICOM files to 8-bit PNG files by clipping the values above the 99th percentile and scaling the resulting values between 0 and 255. We resize the resulting images to 512 by 512. Before we feed those images to the convolutional neural network, we apply data augmentation to prevent overfitting. We augment the images by cropping a width and height in the range of (409, 512] randomly. We resize the resulting image to 448 by 448. We randomly scale (by [0.75, 1.25]) and shift (±64) the pixel values and clip the values above 255. We randomly flip the image from left to right. We standardize the resulting pixel values using the ImageNet dataset means and variances.
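The preprocessing and augmentation steps above can be sketched in numpy. This is a hedged illustration, not the authors' pipeline: the nearest-neighbour `resize_nn` helper stands in for a proper library resize (e.g. PIL or OpenCV), and the final ImageNet standardization step is omitted for brevity:

```python
import numpy as np

def to_8bit(img):
    # clip above the 99th percentile, then scale to [0, 255]
    hi = np.percentile(img, 99)
    img = np.clip(img, 0, hi)
    return (img / hi * 255).astype(np.uint8)

def resize_nn(img, size):
    # nearest-neighbour resize; a stand-in for a library resize call
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

def augment(img, rng):
    h, w = img.shape[:2]
    ch = int(rng.integers(410, 513))  # crop side in (409, 512]
    cw = int(rng.integers(410, 513))
    y = int(rng.integers(0, h - ch + 1))
    x = int(rng.integers(0, w - cw + 1))
    img = img[y:y + ch, x:x + cw]
    img = resize_nn(img, 448)                              # back to 448 x 448
    img = img.astype(np.float32) * rng.uniform(0.75, 1.25) # random scale
    img = img + rng.uniform(-64, 64)                       # random shift
    img = np.clip(img, 0, 255)
    if rng.random() < 0.5:
        img = img[:, ::-1]                                 # horizontal flip
    return img
```

A 512-by-512 8-bit input produces a 448-by-448 float image with values in [0, 255], ready for standardization.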

Training settings
Similar to the previous experiment, we train DAE, SDAE, FCN, and DFCN using a cyclic learning rate [1] between 0.001 and 0.01 with a step size of 2.5 epochs. We use stochastic gradient descent with Nesterov momentum of 0.95. We minimize the cross-entropy loss between the softmax activations and binary labels. For all models we use ReLU activations for intermediate layers. We train our models until there is no improvement in the validation loss for 10 epochs. We perform all experiments with a batch size of 16. When reconstruction regularization is used, the reconstruction loss is the sum of squared differences between the reconstructed and original inputs, weighted by a coefficient of 0.03. We normalize each of the 27 laboratory parameters to zero mean and unit variance with respect to the training dataset statistics. After normalization, we set the masked and missing values to zero. We calculate the reconstruction loss only over values that are not missing in the training dataset. We apply a weight of 0.625 to the positive samples to account for the class imbalance as described in Supporting information Training settings.

Random Forest Settings
We use the Random Forest implementation of Scikit-learn (v0.22.1) [3]. This implementation does not allow partial training or mini-batching. Hence, we apply input masking by repeating the dataset 100 times and applying the input masking to this set. For comparability with the other methods, we impute the missing values using the mean. We use the default model parameters because they performed best in our preliminary experiments.
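The repeat-and-mask scheme above can be sketched as follows. This is an illustrative sketch on synthetic data, not the authors' code: the feature matrix, labels, and the 30% masking rate are assumptions, and only the structure (100-fold repetition, random masking, mean imputation, default `RandomForestClassifier`) follows the text:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# toy stand-ins for the 28 inputs (27 lab tests + x-ray score); illustrative only
X = rng.normal(size=(200, 28))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# repeat the training set 100 times and mask a random subset of inputs each time
reps = 100
X_rep = np.tile(X, (reps, 1))
y_rep = np.tile(y, reps)
mask = rng.random(X_rep.shape) < 0.3            # ~30% of inputs masked (assumed rate)

# mean imputation with training-set column means, for comparability
col_means = X.mean(axis=0)
X_rep[mask] = np.broadcast_to(col_means, X_rep.shape)[mask]

clf = RandomForestClassifier(random_state=0)    # default parameters
clf.fit(X_rep, y_rep)
```

At test time, missing inputs would be imputed with the same column means before calling `clf.predict_proba`.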
S2 Table. Robustness validation experiments, expanded results
The results in Table 2 are truncated for simplicity. Here, in Table 5, we provide the full robustness validation experiment results.