Automated thresholding algorithms outperform manual thresholding in macular optical coherence tomography angiography image analysis

Introduction: For quantification of optical coherence tomography angiography (OCTA) images, vessel density (VD) and vessel skeleton density (VSD) are well-established parameters, and different algorithms are in use for their calculation. However, the comparability, reliability and ability of the different algorithms to discriminate healthy from impaired macular perfusion are unclear, yet of potentially high clinical relevance. Hence, we assessed the comparability and test-retest reliability of the most common approaches.
Materials and methods: Two consecutive 3×3 mm OCTA en face images of the superficial and deep retinal layer were acquired with swept-source OCTA. VD and VSD were calculated with manual thresholding and six automated thresholding algorithms (Huang, Li, Otsu, Moments, Mean, Percentile) using ImageJ and compared in terms of intra-class correlation coefficients, measurement differences and repeatability coefficients. Receiver operating characteristic analyses (healthy vs. macular pathology) were performed and area under the curve (AUC) values were calculated.
Results: Twenty-six eyes (8 female, mean age: 47 years) of 15 patients were included (thereof 15 eyes with macular pathology). Binarization thresholds, VD and VSD differed significantly between the algorithms and compared to manual thresholding (p < 0.0001). Inter-measurement differences did not differ significantly between patients with healthy versus pathologic maculae (p ≥ 0.685). Reproducibility was higher for the automated algorithms than for manual thresholding on all measures of reproducibility assessed. AUC was significantly higher for the Mean algorithm than for the manual approach with respect to the superficial retinal layer.
Conclusions: Automated thresholding algorithms yield a higher reproducibility of OCTA parameters and allow for a more sensitive diagnosis of macular pathology. However, different algorithms are not interchangeable, nor are their results readily comparable. The Mean algorithm in particular should be investigated in further detail. Automated thresholding algorithms are preferable, but more standardization is needed for clinical use.


Introduction
Optical coherence tomography angiography (OCTA) provides depth-resolved, high-resolution images of retinal and choroidal blood flow. [1,2] A number of different approaches are available to quantify OCTA image data, but to date both their reproducibility and their comparability are unclear.
Image processing is a crucial step when generating comparable and reliable quantitative data from retinal images. For the calculation of global vessel density (VD) from OCTA images, the definition of a threshold for image binarization is essential. The three most common solutions are manual binarization methods, automated binarization methods using open-source software and automated binarization using commercial software. [3][4][5][6][7][8][9] Manual and semi-automated methods often derive a threshold for binarization based on the signal within the vessel-free foveal avascular zone (FAZ). Automated algorithms use, for example, the histogram of a complete OCTA image or local clusters to obtain a threshold. Advantages of manual/semi-automated and automated binarization methods, as compared to commercial software, include high transparency for research purposes. Besides these two options, commercial software is available from various device manufacturers as well as other sources, mostly using proprietary image processing algorithms that are not publicly available. Additionally, fixed-threshold and machine learning approaches are available. [10,11] A recent study by Rabiolo and colleagues [12] found significant differences in VD calculations between manual and automated approaches but used arbitrary cut-offs for manual binarization and did not assess the test-retest reliability of two consecutive examinations. Thus, in this study we assessed both the repeatability and the comparability of manual binarization based on the FAZ and of six automated algorithms for OCTA image binarization in patients with and without macular disease.

Subject recruitment
Participants, both healthy and with any macular pathology impairing the vasculature, were consecutively recruited at the Department of Ophthalmology, University of Bonn, Germany, between April and August 2018. Ethical approval was obtained from the ethics committee of the University of Bonn (approval ID 089/08) and informed consent was obtained from all study participants prior to study initiation after explanation of the nature and possible consequences of the study. The study was conducted in adherence to the tenets of the Declaration of Helsinki.

Image analysis
Fiji [14], an open-source image processing software based on ImageJ [15] (version 1.51w), was used for image analysis. Per eye, two 8-bit grey-scale en face bitmap images of the superficial and deep retinal layers were binarized by manual thresholding and six previously published automated algorithms [16][17][18][19][20][21] implemented in Fiji, named as follows, based on the abbreviations used in the software [15]: Huang, Li, Otsu, Moments, Mean and Percentile. As previously described, the manual approach was based on delineating the FAZ by selecting its outer borders with a free-hand tool on the superficial retinal layer and using the maximum grey value in this area as the threshold for binarization. [3,8,[22][23][24][25][26][27][28] A second measurement was performed after at least 1 week by the same examiner; in case of a grey value threshold difference of ≥ 5 between the two measurements, a third measurement was performed immediately and the median of these three measurements was used as the threshold for the manual method. It has been shown that this methodology has high inter-rater reliability. [3] For the deep retinal layer, the FAZ selection of the superficial retinal layer was applied to the respective image area of the deep layer for binarization threshold determination. This approach combines best-practice manual thresholding approaches published in the literature. [3,22] The threshold determination of the six automated algorithms has been published elsewhere. [16][17][18][19][20][21] VD was calculated from the binarized images according to the formula VD = n(white pixels in binarized image)² / n(all pixels in binarized image)². [22] For the calculation of vessel skeleton density (VSD), images were skeletonized by ImageJ and VSD was calculated according to the formula VSD = n(white pixels in skeletonized image) / n(all pixels in skeletonized image)². [22]
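As an illustration, the Mean binarization and the VD calculation can be sketched in a few lines of Python with numpy. This is a simplified stand-in for the Fiji/ImageJ pipeline, not the actual implementation; the toy image and function names are invented, and skeletonization for VSD would additionally require, e.g., scikit-image's skeletonize:

```python
import numpy as np

def binarize_mean(img):
    """Sketch of ImageJ's 'Mean' method: threshold at the mean grey value."""
    return img > img.mean()

def vessel_density(binary):
    """VD = n(white pixels)^2 / n(all pixels)^2 of the binarized image."""
    white = int(binary.sum())
    return white ** 2 / binary.size ** 2

# Toy 4x4 "en face image": bright columns stand in for vessel signal.
img = np.array([[200, 10, 10, 200],
                [200, 10, 10, 200],
                [200, 10, 10, 200],
                [200, 10, 10, 200]], dtype=np.uint8)

binary = binarize_mean(img)    # mean grey value is 105 -> 8 white pixels
print(vessel_density(binary))  # (8/16)^2 = 0.25
```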

Statistical analyses
Statistical analyses were performed with SPSS Statistics for Windows, version 25 (IBM Corporation, Armonk, New York). Mean values of all VD and VSD measurements per layer and per eye were calculated and tested for associations with the signal strength index of the OCTA image, intraocular pressure and patient age. The relative differences between test and retest VD and VSD were calculated. Linear regression analysis was performed to adjust for age, with the relative differences between test and retest VD and VSD values as dependent variables and age as well as seven binary variables for the respective algorithm used as independent variables. Intra-class correlation coefficients (ICCs) between the two OCTA images of each eye were determined. Additionally, the repeatability coefficient (RC) was calculated according to the formula RC = 1.96 × √( Σ(measurement 2 − measurement 1)² / n ). [29][30][31] The Mann-Whitney U test, the Friedman test and the Kruskal-Wallis test were used as indicated. To measure discriminatory ability (healthy versus macular pathology), differences between the different approaches were assessed using receiver operating characteristic (ROC) curves and area under the curve (AUC) values. For age adjustment, we performed a binary logistic regression analysis to discriminate between healthy eyes and eyes with a macular pathology based on VD or VSD and age (per algorithm and per retinal layer). The resulting probabilities were then used for ROC analysis. A p-value of < 0.05 was considered statistically significant. Correction for multiple testing was done using the Holm-Bonferroni method [32]. Corrected p-values are reported as p_c.
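The repeatability coefficient above can be reproduced with a short numpy sketch; the test and retest VD values below are invented for illustration only:

```python
import numpy as np

def repeatability_coefficient(m1, m2):
    """RC = 1.96 * sqrt( sum((measurement 2 - measurement 1)^2) / n )."""
    d = np.asarray(m2, float) - np.asarray(m1, float)
    return 1.96 * np.sqrt(np.mean(d ** 2))

test_vd   = [0.22, 0.19, 0.21]   # hypothetical first-visit VD values
retest_vd = [0.21, 0.20, 0.21]   # hypothetical second-visit VD values
rc = repeatability_coefficient(test_vd, retest_vd)
print(round(rc, 4))  # smaller RC means better repeatability
```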

Results
Twenty-six eyes (8 female, 18 male; mean age: 47 years) of 15 patients were included, resulting in 104 images (two consecutive images of two layers per eye) of the superficial and deep retinal layers. These included 11 healthy eyes and 15 eyes with an impaired macular vasculature (due to diabetic changes, previous central venous occlusions or other maculopathies). Some eyes had to be excluded due to insufficient image quality. Spherical refraction was within ±2.0 dpt in all eyes. Intraocular pressure was within the normal range in all patients (mean ± standard deviation: 17 ± 4 mmHg). In all included images, the OCTA en face image quality was high, with a minimum signal strength index of 9/10.
The average threshold values differed significantly between the different binarization approaches (p < 0.0001, Table 1). Overall, the Moments algorithm produced the highest and the Percentile and Li algorithms the lowest mean thresholds. Overall, binarization thresholds did not differ significantly between subjects with healthy maculae and subjects with macular vessel pathology (p = 0.447). Exemplary unprocessed and binarized images are displayed in Fig 1. VD and VSD differed significantly between the different binarization algorithms (p < 0.0001, Table 1). Both values were significantly lower in individuals with macular pathology than in those without (p < 0.0001). For instance, according to the Mean algorithm, mean VD of the superficial layer was 0.22 ± 0.01 for healthy eyes and 0.19 ± 0.03 for diseased eyes, while VSD of the superficial layer was 6.9×10^-8 ± 0.3×10^-8 for healthy eyes and 6.1×10^-8 ± 0.8×10^-8 for diseased eyes. VD and VSD did not differ significantly between different image signal strength indices (p = 0.157 and p = 0.079, respectively) or intraocular pressures (p = 0.271 and …, respectively).
Relative test-retest measurement differences of VD and VSD obtained from the two consecutive OCTA examinations varied significantly between the algorithms when comparing all approaches using the Friedman test, both in the superficial and the deep retinal layers (p < 0.0001). The manual thresholding approach had significantly lower VD and VSD repeatability in the superficial and deep retinal layers than most automated algorithms in pair-wise comparisons between the different binarization approaches (p_c < 0.0084 in paired algorithm comparisons, Fig 2). The relative inter-measurement differences of the Huang algorithm (VD, deep retinal layer: p_c = 0.126; VSD, superficial and deep retinal layers: p_c ≥ 0.064) and the Li algorithm (VSD, superficial retinal layer: p_c = 0.183) did not differ significantly from the repeatability values of the manual method.
The relative inter-measurement differences, however, did not differ significantly when comparing the automated algorithms with one another (p_c > 0.05). To adjust for age, linear regression analysis was performed. For both VD and VSD, the relative test-retest differences depended significantly on use of the manual approach compared to not using it (β = 0.661, p < 0.0001 and β = 0.163, p < 0.0001, respectively). The VD and VSD relative test-retest differences did not depend significantly on age or on use of any of the automated algorithms. Inter-examination VD and VSD differences between algorithms were not significantly different between healthy eyes and eyes with macular vessel pathologies (VD, p = 0.685; VSD, p = 0.770), indicating that the results of both groups can be interpreted jointly.
A post-hoc power analysis for the Friedman test, using a 10,000-fold simulation with data generated from a normal distribution according to our sample characteristics, revealed that a sample size of 6 eyes is sufficient to detect significant differences between the VD relative differences of the different algorithms at the 5% level with 80% power, both in the superficial and the deep retinal layers. Therefore, the sample size available in our study was considered appropriate from a statistical standpoint.
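The logic of such a simulation-based power estimate can be sketched as follows. This is a simplified illustration, not our actual analysis: it assumes scipy's friedmanchisquare, and the between-algorithm shift, standard deviation and simulation count are invented rather than estimated from our sample:

```python
import numpy as np
from scipy.stats import friedmanchisquare

def friedman_power(n_eyes, n_algorithms=7, shift=0.5, n_sim=500,
                   alpha=0.05, seed=0):
    """Estimate the power of the Friedman test by simulation: draw
    normally distributed relative test-retest differences per eye and
    algorithm, add an algorithm-specific mean shift, and count how
    often the test rejects at the given alpha level."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sim):
        data = rng.normal(0.0, 1.0, size=(n_eyes, n_algorithms))
        data += shift * np.arange(n_algorithms)  # systematic algorithm effect
        _, p = friedmanchisquare(*data.T)        # one sample per algorithm
        hits += p < alpha
    return hits / n_sim
```

With shift = 0 the rejection rate stays near alpha, while a clear shift drives the power towards 1 even for small samples, mirroring the finding that few eyes suffice when between-algorithm differences are large.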
Except for the Percentile algorithm, ICCs of VD and VSD measurements between the two consecutive examinations per eye were noticeably higher for all automated algorithms than for the manual approach (Table 2). Repeatability coefficients of VD and VSD of the superficial and deep retinal layers were noticeably higher (i.e. poorer repeatability) for the manual approach than for the six automated algorithms investigated (Table 2). ROC analysis (healthy versus macular pathology) based on binary logistic regression models to adjust for age revealed AUC values between 0.838 and 0.997. The manual method was less sensitive to pathologic change than most automated algorithms (Fig 3), and the four AUC values of the manual approach were lower than almost all (23/24, 96%) values of the automated algorithms. The AUC value for manual VD acquisition in the superficial retinal layer was significantly lower (indicating lower sensitivity and specificity) than the AUC value for the Mean algorithm in the same layer (Table 3).
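The AUC itself has a simple rank-sum (Mann-Whitney) interpretation that can be computed without any ROC machinery. The sketch below, with invented VD values, estimates the probability that a randomly chosen healthy eye has a higher VD than a randomly chosen diseased eye:

```python
import numpy as np

def auc_from_scores(healthy, diseased):
    """AUC as the fraction of (healthy, diseased) pairs in which the
    healthy eye has the higher score; ties count half."""
    h = np.asarray(healthy, float)[:, None]
    d = np.asarray(diseased, float)[None, :]
    return ((h > d).sum() + 0.5 * (h == d).sum()) / (h.size * d.size)

healthy_vd  = [0.22, 0.23, 0.21]   # hypothetical healthy eyes
diseased_vd = [0.18, 0.19, 0.25]   # hypothetical diseased eyes
print(auc_from_scores(healthy_vd, diseased_vd))
```

An AUC of 1.0 would mean perfect separation of the two groups, while 0.5 corresponds to chance-level discrimination.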

Discussion
To our knowledge, this is the first quantitative study on the test-retest reliability of manual versus automated thresholding approaches for OCTA images. Automated algorithms outperformed manual thresholding, led to more reproducible results and, therefore, allow for a more sensitive discrimination of healthy maculae from maculae with pathology. Thus, automated binarization algorithms should be preferred over manual approaches for OCTA image analysis. However, the different algorithms are not interchangeable, and their results differed significantly. Therefore, better, ideally international, standardization of algorithms is needed to increase the comparability of studies.
Our study supports existing data showing low inter-method agreement for image binarization. [12,33] The thresholds varied significantly between the algorithms, as did the VD and VSD values, highlighting a lack of comparability. Rabiolo and colleagues also compared different methods to quantify perfusion in the macular region; however, they only assessed VD, used lower-resolution 6×6 mm en face OCTA images and did not investigate the test-retest reliability of the different approaches. [12] Mehta and colleagues recently applied five automated binarization algorithms to OCTA images and found significant differences between the detected VD values. However, they investigated neither the repeatability of VD values based on the algorithms nor their ability to detect pathology. [33] Shoji and colleagues compared different automated thresholding algorithms, but they did not assess any manual methods. [34] In this study, we used high-detail 3×3 mm en face OCTA images, provide data on VD as well as VSD and evaluated the test-retest reliability of both manual and automated algorithms on consecutive images.
Reproducibility was excellent for five of the automated algorithms according to the scale proposed by Chan. [35] This is in keeping with the current literature where automated algorithms tend to outperform manual image analysis in terms of reproducibility. [36][37][38] The inter-measurement differences were significantly higher after image binarization using the manual algorithm compared to all automated algorithms. This effect was independent of age. Due to this, automated binarization algorithms should be preferred over manual approaches in OCTA image binarization.
Intra-class correlation coefficients between the two consecutive OCTA measurements were significantly higher when using the Otsu, Moments or Mean algorithm than with the manual approach in at least three out of the four categories investigated (VD and VSD of the superficial and deep retinal layers, respectively). For this reason, these three algorithms should be investigated in further detail. The automated Percentile algorithm proved not to be appropriate for the analysis of VD or VSD due to inconsistent results. It selects the grey intensity closest to a fixed percentile, which limits the use of this algorithm when applied to OCTA (Table 4). We therefore do not recommend the Percentile algorithm for future investigations on this topic. The ability to discriminate between healthy and pathological maculae was good for almost all automated algorithms except the Percentile algorithm. Rabiolo et al found no significant differences between the algorithms tested in their study. [12] Interestingly, the manual approach used by Rabiolo and colleagues discriminated healthy individuals from those with macular disease noticeably better than our manual approach. However, information on the reliability of their manual approach is missing. Our method for manual thresholding was standardized against the FAZ, whereas Rabiolo et al used arbitrary binarization thresholds. Thus, it is doubtful whether their results can be reproduced with other images, and manual thresholding cannot be recommended. Shoji et al compared different automated global and local thresholding methods across two OCTA devices [34].
They also included the Otsu and the Mean algorithms in their analysis but reported significantly lower intra-class correlations compared to our data (based on the 95% confidence intervals). Our work additionally included age-adjusted statistical comparisons of the different algorithms, showing that the Mean algorithm can detect pathology significantly better than the manual approach. Such data have not been available in the previous literature. According to our results, specifically the Mean algorithm might be a good option for the binarization of en face OCTA images of the superficial retinal layer with the PLEX Elite device. It tends to enhance the retinal vasculature compared to other algorithms, as reflected in relatively high mean VD and VSD values. For this reason, it might be more prone to image artefacts than algorithms that set higher binarization thresholds overall, such as the Otsu or Moments algorithms.
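The mechanism that makes the Percentile method problematic for OCTA can be seen in a minimal numpy sketch (a simplification of ImageJ's implementation, assuming its default target fraction of 50%): the threshold is chosen purely so that a fixed fraction of pixels falls below it, regardless of how much vessel signal the image actually contains.

```python
import numpy as np

def percentile_threshold(img, fraction=0.5):
    """Return the grey level at which the cumulative fraction of pixels
    at or below that level is closest to `fraction`."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    cum = np.cumsum(hist) / img.size
    return int(np.argmin(np.abs(cum - fraction)))

# A uniform grey ramp: exactly half of the pixels fall at or below
# level 127, so the image is thresholded in the middle regardless
# of its content.
ramp = np.arange(256, dtype=np.uint8)
print(percentile_threshold(ramp))  # 127
```

Because a roughly fixed fraction of pixels is always classified as vessel, the resulting VD becomes largely insensitive to the actual vasculature, consistent with our finding that the Percentile algorithm is not appropriate for VD or VSD analysis.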
The strengths of our study include a comprehensive evaluation of the reliability and comparability of seven different, previously published approaches for OCTA image binarization, including four measures of repeatability. We also assessed the ability of all approaches to discriminate healthy from diseased eyes. For image acquisition, well-established protocols were used, allowing for easy replication of our study. In the literature, more comprehensive preprocessing steps have been proposed, including the use of filters such as the Frangi filter to enhance image contrast. We omitted such steps on purpose, since these methods have several disadvantages, including the generation of image artefacts resembling vessel structures and inconsistent results for vessels of unequal size [39,40]. Limitations include the relatively small number of subjects (and therefore limited generalizability of the comparisons between the different automated algorithms), having only one repeat measurement per eye and the limited comparability of vessel density and skeleton density values because of differing calculation approaches in the literature. We did not assess comparability and reproducibility across different OCTA devices and evaluated OCTA scans of the macula only. We did not compare the algorithms to commercial tools for the quantification of vessel parameters because, for scientific purposes, understanding as many of the image processing steps as possible is warranted. Future research should also focus on the binarization of OCTA images of the optic disc with different algorithms.
In conclusion, because of higher repeatability and improved discrimination, automated binarization algorithms should be preferred over manual approaches. Better standardization of algorithms is needed to improve comparability of studies.