Fully Automated Pulmonary Lobar Segmentation: Influence of Different Prototype Software Programs onto Quantitative Evaluation of Chronic Obstructive Lung Disease

Objectives
Surgical or bronchoscopic lung volume reduction (BLVR) techniques can be beneficial for heterogeneous emphysema. Post-processing software tools for lobar emphysema quantification are useful for patient and target lobe selection, treatment planning and post-interventional follow-up. We aimed to evaluate the inter-software variability of emphysema quantification using fully automated lobar segmentation prototypes.

Material and Methods
66 patients with moderate to severe COPD who underwent CT for planning of BLVR were included. Emphysema quantification was performed using 2 modified versions of in-house software (without and with prototype advanced lung vessel segmentation; programs 1 [YACTA v.2.3.0.2] and 2 [YACTA v.2.4.3.1]), as well as 1 commercial program (program 3 [Pulmo3D VA30A_HF2]) and 1 pre-commercial prototype (program 4 [CT COPD ISP ver7.0]). The following parameters were computed for each segmented anatomical lung lobe and the whole lung: lobar volume (LV), mean lobar density (MLD), 15th percentile of lobar density (15th), emphysema volume (EV) and emphysema index (EI). Bland-Altman analysis (limits of agreement, LoA) and linear random effects models were used for comparison between the software programs.

Results
Segmentation using programs 1, 3 and 4 was unsuccessful in 1 (1%), 7 (10%) and 5 (7%) patients, respectively. Program 2 could analyze all datasets. The 53 patients with successful segmentation by all 4 programs were included for further analysis. For LV, programs 1 and 4 showed the largest mean difference of 72 ml and the widest LoA of [-356, 499 ml] (p<0.05). Programs 3 and 4 showed the largest mean difference of 4% and the widest LoA of [-7, 14%] for EI (p<0.001).

Conclusions
Only a single software program was able to successfully analyze all scheduled datasets. Although the mean bias of LV and EV was relatively low in lobar quantification, the ranges of disagreement were substantial for both. For longitudinal emphysema monitoring, not only the scanning protocol but also the quantification software needs to be kept constant.



Introduction
Pulmonary emphysema, a phenotype of chronic obstructive pulmonary disease (COPD) caused mostly by cigarette smoke, is currently ranked 12th as a worldwide burden of disease and is projected to be ranked 5th by the year 2020 as a cause of loss of quantity and quality of life [1]. There is currently no definitive cure for this major health problem, and many patients remain significantly disabled despite evolving pharmacological treatment and pulmonary rehabilitation [2]. Lung volume reduction surgery (LVRS) and bronchoscopic lung volume reduction (BLVR) are strategies for the treatment of advanced emphysema. With careful selection of cases, LVRS or the less invasive BLVR has been reported to be clinically beneficial to patients who suffer from heterogeneous advanced emphysema [3][4][5][6]. The basic mechanism of these methods is the reduction of hyperinflation of the target lobe, which is identified as the most diseased lobe in case of a heterogeneous distribution of emphysema on chest computed tomography (CT) [7]. Expansion of healthier adjacent lung parenchyma and improvement of overall lung function follow.
Complementary to clinical and pulmonary function testing (PFT) [8][9][10], quantitative multi-detector computed tomography (MDCT) densitometry allows evaluation of the distribution of emphysema (i.e., lobar-based volume and attenuation changes), which can be useful not only for patient selection in treatment planning but also for post-interventional follow-up [11][12][13]. State-of-the-art emphysema quantification software programs are being developed, and different programs and/or versions have been implemented in clinical trials or routine care at different institutions. Variations in the emphysema quantification results obtained from these different programs, even for the same patient, and the resulting differences in interpretation may lead to substantial differences in patient selection and management. The variation of fully automated densitometry results of the whole lung with different software tools was reported recently [14]. Selection for LVRS and BLVR, however, depends mainly on the heterogeneity of emphysema and the definition of a target lobe with the highest emphysema severity [15]. The most recent tools allow for a lobe-based quantification of emphysema for optimal target lobe definition. This introduces lung lobe segmentation as a novel, reader-independent further step into quantitative MDCT, accompanied by potential sources of error.
Consequently, we aimed to evaluate the inter-software variability of lobe-based emphysema quantification using 2 versions of scientific software, 1 commercially available software program and 1 pre-commercial prototype for fully automated pulmonary lobar segmentation. We additionally hypothesized, and tested, that high intra-patient variability of emphysema distribution (one of the requirements for volume reduction surgery or BLVR, as mentioned before) may also predispose to higher inter-software measurement variability.

Study population
This study retrospectively enrolled consecutive chronic obstructive pulmonary disease (COPD) patients who underwent clinically indicated MDCT for planning of endoscopic lung volume reduction from August to October 2012. Informed written consent for examination and pseudonymized data processing was obtained from all patients. The responsible Heidelberg University Ethics Committee approved this study according to Good Clinical Practice (GCP) guidelines and applicable law (S-609/2012). Patients who had a pneumothorax at the time of the CT scan, a history of previous lung operation, or severe artifacts derived from poor respiratory control at CT were not eligible. Table 1 shows a summary of the patients' clinical characteristics. There was one never-smoker, and smoking history was unknown in one of the patients with smoking history.

MDCT acquisition and reconstruction
Non-enhanced, thin-section MDCT was performed in supine position as recommended for COPD [16,17]. All patients were trained for full inspiration and were carefully monitored so that the inspiration level was stabilized at full inspiration before the start of MDCT scanning (64-slice Somatom Definition AS64, Siemens Medical Solutions AG, Forchheim, Germany). The system underwent dedicated routine calibration for water every 3 months and for air daily. We used a dose-modulated protocol with a reference of 120 kV, 70 mA or 100 kV, 117 mA with automated kV and mA modulation (Caredose4D, Siemens Medical Solutions, Forchheim, Germany), a collimation of 64 x 0.6 mm, a pitch of 1.45, a reconstruction slice thickness of 1.0 mm with 0.825 mm increment, and a medium soft B40f algorithm considered optimal for densitometry and automatic segmentation [10,18].

Quantitative image evaluation
MDCT datasets were subjected to the following 4 software programs for fully automated lobar emphysema quantification. No manual correction of segmentation results was carried out. The results were compared for inter-software reproducibility. The processing was repeated after one week for each case and software program in order to evaluate the intra-software reproducibility.
A reader with more than 6 years of experience (HL) performed a visual inspection of the segmentation results for each case and software program in order to identify obvious errors in lobar segmentation (e.g. false annotation of a lobe, false identification of a fissure, leakage of airway segmentation into the parenchyma and vice versa). The comparison of the measurement results between programs was then repeated with the remaining datasets, after removal of those with obvious segmentation errors in at least one of the programs, to assess inter-software reproducibility after user interaction.
YACTA. Two versions of the in-house program YACTA ("yet another CT analyzer") (versions v.2.3.0.2 and v.2.4.3.1, both programmed by O. W.), applying algorithms without and with advanced lung vessel segmentation, were used in this study (programs 1 and 2, respectively). The program analyzed each stack of around 300 images per patient fully automatically, as employed in previous studies [10,19,20,21]. YACTA operates in a server mode and may receive DICOM data directly from the PACS. The exact steps of lung and airway segmentation and emphysema quantification were performed as described in detail elsewhere [22,23]. When the density of a lung voxel was equal to or below the threshold of -950 HU (currently the most often used value), it was assigned to emphysema [18,24]. Noise correction was performed for voxels with -910 to -949 HU, which needed at least 4 adjacent voxels with a density of -950 HU to be annotated as emphysema. The following variables were computed and exported as a structured report: total lung volume (LV) and the respective lobar volumes (LV LUL, LV LLL, LV RUL, LV RML, LV RLL), EV, EI, MLD, and 15th. In program 2 (YACTA v.2.4.3.1), an additional algorithm was introduced for advanced lobe segmentation. While program 1 took only the bronchial tree into account for lobe separation, program 2 additionally included the pulmonary vessels.
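The threshold and noise-correction rule described for YACTA can be illustrated with a short sketch. This is not YACTA's implementation: the function name `emphysema_mask`, the nested-list volume layout, the reading of "adjacent" as the 6 face-sharing neighbours, and the reading of "a density of -950 HU" as "at or below -950 HU" are all assumptions made for illustration.

```python
EMPHYSEMA_HU = -950   # threshold stated in the text
NOISE_BAND_HU = -910  # upper edge of the noise-correction band (-949 to -910 HU)
MIN_NEIGHBOURS = 4    # emphysema neighbours required inside the band

# Assumption: "adjacent" means the 6 face-sharing neighbours of a voxel.
OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def emphysema_mask(hu, lung):
    """Return a boolean [z][y][x] mask of emphysema voxels.

    hu   -- nested lists of densities in HU, indexed [z][y][x]
    lung -- nested lists of booleans marking segmented lung voxels
    """
    nz, ny, nx = len(hu), len(hu[0]), len(hu[0][0])

    def below_threshold(z, y, x):
        # True only for in-bounds lung voxels at or below -950 HU.
        inside = 0 <= z < nz and 0 <= y < ny and 0 <= x < nx
        return bool(inside and lung[z][y][x] and hu[z][y][x] <= EMPHYSEMA_HU)

    mask = [[[False] * nx for _ in range(ny)] for _ in range(nz)]
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                if not lung[z][y][x]:
                    continue
                if hu[z][y][x] <= EMPHYSEMA_HU:
                    mask[z][y][x] = True
                elif hu[z][y][x] <= NOISE_BAND_HU:
                    # Band voxel: count neighbours that are below threshold.
                    hits = sum(below_threshold(z + dz, y + dy, x + dx)
                               for dz, dy, dx in OFFSETS)
                    mask[z][y][x] = hits >= MIN_NEIGHBOURS
    return mask
```

A different neighbourhood definition (for example 26-connectivity) would change which band voxels are promoted, which is one way in which two programs using the same nominal threshold can report different emphysema volumes.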
Pulmo3D. Syngo.Via (Pulmo3D version VA30A_HF2, Siemens Medical Solutions, Forchheim, Germany) is a commercial post-processing software environment for routine diagnostics (referred to as program 3 in the following). The MDCT datasets were sent from the PACS to the respective post-processing server. The emphysema threshold of -950 HU was chosen as for the other software programs. The parameters measured were: LV, MLD, 15th, full width at half maximum of the lung density histogram (FWHM), and low attenuation volume in percent (equal to the EI of the other programs). The EV had to be calculated manually by multiplying the low attenuation volume in percent by the lung volume.
CT COPD. CT COPD is a pre-commercial prototype visualization software package (ISP ver7.0, Philips, Boston, MA) (referred to as program 4 in the following). The DICOM data of each patient were loaded manually into the software interface on a dedicated workstation. The emphysema threshold can be preselected, and -950 HU was used for the present study as for the other software programs. The following parameters were calculated by program 4: LV, 15th, MLD, EV and EI.
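How the five parameters relate to one another can be shown with a minimal sketch that derives them from the HU values of one segmented lobe. None of the programs' internal code is known here; the function name `lobar_densitometry`, the `voxel_volume_ml` parameter, and the nearest-rank definition of the 15th percentile are assumptions chosen for illustration.

```python
from statistics import mean


def lobar_densitometry(hu_values, voxel_volume_ml, threshold_hu=-950.0):
    """Compute LV, MLD, 15th percentile, EV and EI for one lobe.

    hu_values       -- densities (HU) of all voxels assigned to the lobe
    voxel_volume_ml -- volume of a single voxel in ml (hypothetical input)
    threshold_hu    -- emphysema threshold; -950 HU as used in the study
    """
    n = len(hu_values)
    ordered = sorted(hu_values)
    # Nearest-rank percentile (an assumption): rank = ceil(0.15 * n),
    # computed with integers to avoid floating-point rounding.
    rank = max(1, (15 * n + 99) // 100)
    emphysema_voxels = sum(1 for v in hu_values if v <= threshold_hu)
    return {
        "LV_ml": n * voxel_volume_ml,
        "MLD_HU": mean(hu_values),
        "P15_HU": ordered[rank - 1],
        "EV_ml": emphysema_voxels * voxel_volume_ml,
        "EI_percent": 100.0 * emphysema_voxels / n,
    }
```

Because every parameter depends on which voxels are assigned to the lobe, a shift in the lobe boundary changes LV directly and the densitometric values indirectly.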

Pulmonary Function Testing
Whole-body plethysmography (MasterScreen Body, E. Jaeger, Hoechberg, Germany) was performed for each patient within one week prior to referral to MDCT [25], and the European Coal and Steel Community (ECSC) predicted values served as the standard of reference [26]. The following lung function parameters (absolute and percent predicted values) were used for further analysis: forced expiratory volume in 1 s (FEV1), vital capacity (VC), FEV1 to VC ratio (FEV1/VC, "Tiffeneau index"), residual volume (RV) and total lung capacity (TLC). To estimate the degree of hyperinflation, the RV to TLC ratio was calculated (RV/TLC).

Statistical Analysis
Before statistical analysis, the results from the 4 programs were reviewed by a reader with more than 6 years of expertise in chest radiology. None of the software program developers, including O. W. (the programmer of YACTA), participated in statistical data analysis or interpretation, in order to allow an unbiased comparison of all programs analyzed. Statistics were done by an independent professional statistician (T.H.). Technical replicates per patient from the two paired sessions were averaged. Values were summarized as mean/median and standard deviation/mean absolute deviation by lobes and in total. Spearman's correlation coefficient based on aggregated measurements over all individual lobes per patient was calculated. (For Spearman's correlation analysis, repeated measurements due to multiple lobe measurements per patient could not be accounted for in a sensible manner. Lobe measurements were therefore aggregated into a single value per patient. Total measurements refer to the sum across lobes for volume parameters and the average for the other parameters.) Agreement between programs was analyzed for each parameter (LV, MLD, 15th, EI and EV) separately, employing Bland-Altman plots and 95% limits of agreement (LoA). LoA were based on all individual lobes using a random effects model with replicates linked by segment [27]. Pairwise differences in measurements between methods were tested based on a linear random effects model with an additional fixed segment effect and a random patient effect. EI values were logit transformed prior to testing. (The emphysema index is given as a percentage, which in general cannot be considered a normally distributed variable. Logit transformation is a standard approach for percentage values, or any variable with values in the range of 0 to 1, to obtain more normally distributed values. Since we used a linear regression approach to test for differences, this transformation was advisable.)
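For readers unfamiliar with these quantities, the following sketch shows the classical Bland-Altman calculation (bias and 95% LoA from paired differences) and the logit transform of a percentage. It is a simplification: the study derived its LoA from a random effects model with linked replicates in R, which this sketch does not reproduce. The function names and the `eps` clipping used to keep the logit finite at 0% and 100% are assumptions.

```python
import math
from statistics import mean, stdev


def bland_altman(a, b):
    """Return (bias, lower LoA, upper LoA) for paired measurements a and b.

    Classical form: bias = mean of differences, LoA = bias +/- 1.96 * SD.
    """
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def logit_percent(p, eps=1e-6):
    """Logit of a percentage value p (0 to 100).

    Values are clipped to [eps, 1 - eps] (an assumption) so that
    0% and 100% do not produce infinite results.
    """
    q = min(max(p / 100.0, eps), 1.0 - eps)
    return math.log(q / (1.0 - q))
```

A small bias with wide LoA, as reported for several program pairs, means that the programs agree on average while individual lobes can still differ considerably.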
To see whether intra-patient variability of EI influences inter-program variability, we first assessed the intra-patient variability (standard deviation, SD) of EI among lobes for each software program. Then, we divided patients into two groups by using the median value of the SD to analyze the effect of intra-patient variability of EI on LoA: patients with low intra-patient variability and those with high intra-patient variability. The ratio of the predicted SD of LoA between patients with low and high intra-patient variability was calculated for each pair of programs to assess the difference in inter-program variability.
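The grouping step can be sketched as follows. The function name `split_by_intra_patient_variability`, the dictionary input format, and the handling of ties (patients with an SD exactly equal to the median go to the low group) are assumptions, since the text does not specify them.

```python
from statistics import median, stdev


def split_by_intra_patient_variability(ei_by_patient):
    """Split patients at the median of their per-lobe EI standard deviation.

    ei_by_patient -- dict mapping a patient id to the list of lobar EI
                     values (in %) measured by one program
    Returns (low_group, high_group, cutoff).
    """
    sd_by_patient = {pid: stdev(lobes) for pid, lobes in ei_by_patient.items()}
    cutoff = median(sd_by_patient.values())
    # Assumption: ties at the cutoff are assigned to the low group.
    low = [pid for pid, sd in sd_by_patient.items() if sd <= cutoff]
    high = [pid for pid, sd in sd_by_patient.items() if sd > cutoff]
    return low, high, cutoff
```

The LoA would then be estimated separately within each group, and the ratio of the two predicted SDs compared for each pair of programs.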
P-values for all pair-wise comparisons were multiplicity adjusted. All tests were two-sided. P-values below 0.05 were considered statistically significant. Statistical analyses were performed using R program [28] with add-on packages MethComp [29], nlme [28] and multcomp [30].

Data processing
66 patients with advanced COPD were included initially, and all 4 software programs successfully loaded all datasets in DICOM format into their respective servers. Completely unsuccessful segmentation using programs 1, 3 and 4 occurred in 1 (1%), 7 (10%) and 5 (7%) patients, respectively.
Program 1 failed to segment the right upper lobe for one patient in both sessions. The segmentation failed because the right upper lobe bronchi were not segmented by the bronchial tree segmentation algorithm, which caused the subsequent lobe segmentation algorithm to fail as well. There was an unexpected halt with program 3 during the lobar segmentation process in 6 patients (Fig 1A), and an erroneous outline of the lung was produced in another patient (Fig 1B). Program 4 also failed to generate results during the segmentation process in 5 patients. In one patient in whom segmentation could not be achieved with program 3, program 4 also delivered erroneous results due to segmentation of part of the central airway as right upper lobe (Fig 1C). Program 2 was able to analyze all datasets. We observed that the right upper lobe and right middle lobe were major sources of relative variability of lobar segmentation, resulting in differences in quantification results related to the course of the minor fissure. Fig 2 shows examples of patients whose values differed substantially between programs.
Excluding image data retrieval, pure computational runtime was around 3-4 minutes for all programs. The values from the 53 patients who were successfully assessed with all programs were used for further analysis (for software variability on a lobe-by-lobe basis).

Intra-software reproducibility
There was 100% reproducibility for each value between the 2 paired sessions of programs 1, 2 and 3. In program 4, the values based on the total lung were the same between the 2 sessions. However, minimal discrepancies in lobar-based results occurred. The mean difference between the two sessions was almost negligible: 0.063 ml for lobar volume, 0.3 HU for MLD, 0.011 HU for 15th, 0.057 ml for EV and 0.05% for EI (on a lobe-by-lobe basis as mentioned above).

Inter-software variability of fully automated analysis
The means of measurements from each program are shown in Table 2. The results of Bland-Altman analysis for each parameter are summarized in Table 3. According to the pairwise tests on difference, LV was significantly different between programs 1 and 4, and 2 and 4 (p = 0.02, ...). The difference for MLD was significant between programs 1 and 3, 1 and 4, 2 and 3, 2 and 4, and 3 and 4 (p<0.001 for all except between programs 3 and 4, where p = 0.008). The largest difference for MLD was between programs 1 and 4. The LoA was widest between programs 2 and 4 for MLD. In the Bland-Altman plot describing MLD, the 95% confidence interval is narrower between programs 3 and 4 than for other pairs (data not shown), indicating better agreement between these two programs. When comparing MLD values between programs 3 and 1, 3 and 2, 4 and 1, and 4 and 2 (data not shown), the 95% confidence interval is below the line of equality, indicating that programs 1 and 2 overestimate MLD in all cases relative to programs 3 and 4 (which is connected with the fact that programs 1 and 2 calculate greater lung volumes).
For 15th, there were significant differences between programs 1 and 3, 2 and 3, and 3 and 4 (p<0.001). The narrowest interval was seen between programs 1 and 2 (data not shown). Programs 3 and 4 revealed the largest mean difference and the widest LoA for 15th.
As for EV, there were significant differences between programs 1 and 4, 2 and 4, and 3 and 4 (p = 0.005, 0.005 and 0.02, respectively). The difference for EV was largest between programs 1 and 4, with a mean difference of 61 ml. However, the widest LoA existed between programs 3 and 4 [-148, 250 ml].

Influence of intra-patient variability
The median standard deviation (inter-quartile range) of the EI amongst the lobes of each single patient as a marker of intra-patient variability was 9.86% (7.67-13.24) for program 1, 9.86% ... We then used the median SD of the intra-patient EI to separate patients into groups with low and high intra-patient EI variability. Interestingly, the group with high intra-patient variability also showed wider LoA for inter-software variability of the EI, which was up to 1.81 times higher than in the group with low intra-patient variability (Table 4). This effect was not dependent on the software used for determining intra-patient EI variability (data not shown).

Influence of user interaction
After visual inspection by a thoracic radiologist, considerable errors in lobar segmentation were found in 27 of 53 patients (program 1: 11 patients, program 2: 9 patients, program 3: 2 patients, program 4: 3 patients, both program 1 and 4: 2 patients). Note that the datasets in which programs 1, 3 and 4 delivered completely unsuccessful segmentations (1, 7 and 4 datasets, respectively) were not contained in these 27 datasets. The comparison of the measurement results between programs was thus repeated with the remaining 26 datasets (S1 Table and S2 Table), after removal of those with obvious segmentation errors in at least one of the programs, to assess inter-software reproducibility after user interaction.
The LoA between the software tools for LV, MLD, 15th, EV and EI became smaller after user interaction and removal of all datasets with obvious segmentation errors in one or more software tools, reflecting the improvement of inter-software agreement in the remaining pairs after this interaction (S3 Table). The mean differences, however, did not change substantially for any parameter (S4 Table). Importantly, even after this user interaction, densitometric results for MLD, 15th and EI remained significantly different between the programs.

Discussion
The main findings of the present study about emphysema quantification using fully automated lobar segmentation are as follows: (1) intra-program reproducibility was generally excellent in all four programs in patients with moderate to severe emphysema with various degrees of lung destruction and a distorted course of the fissures due to severe emphysematous changes of the adjacent lung parenchyma; (2) the mean differences of lobar LV, MLD, 15th, EV and EI are very small among different software programs; (3) LoA, however, remain substantially wide, resulting in non-interchangeability of the results obtained from different software programs; (4) high intra-patient variability of EI resulted in a higher inter-software variability of EI.

Table 4. Predicted standard deviation (SD) of limits of agreement (LoA) for emphysema index (EI) for the inter-software comparison, grouped for patients with low and high intra-patient variability of the EI. Columns: Programs; Low intra-patient variability; High intra-patient variability; Ratio.

Compared to PFT or other clinical examinations, one of the main strengths of quantitative MDCT in evaluating emphysema lies in offering regional information about the distribution of emphysema from lobar segmentation, which enables selection of the target lobe to be treated regionally, e.g. by BLVR [31]. Fully automated lobar segmentation and lobe-based emphysema quantification should be preferred to semi-automated or manual segmentation methods because they are more time efficient, especially in the setting of clinical routine practice in specialized centers. The latter two obviously imply inter- and intra-user variability depending on the operators' skill and perseverance [32]. Also, a previous study reported that automated and semi-automated lobar quantification of emphysema are concordant and show good agreement with visual scoring [33]. However, the software used there relied mainly on vessel segmentation instead of the bronchial course [33], which results in problems especially in joint vascular regions such as segment 3 as part of the upper lobe and segment 4 as part of the middle lobe, which most frequently have common pulmonary vasculature (S1A-S1F Fig) [34].
The clinical relevance of a diagnostic modality depends on its ability to provide reproducible results regardless of the influence of external factors [35]. Along the chain of quantitative MDCT for emphysema, most factors have been previously studied including inspiration depth, radiation exposure parameters, kernels, reconstruction methods and slice thickness [14]. In a recent study, we could show that inter-software variability for whole-lung emphysema quantification is higher than the natural inter-individual variability of emphysema [14]. Since then, novel software has emerged, introducing fully automated lobe segmentation into quantitative MDCT for regional emphysema quantification.
In the present study, we intended to evaluate the impact of selecting different software tools on fully automated lobar emphysema quantification by comparing the results from 4 different software programs. Firstly, even commercial programs were not able to process all provided data successfully. After exclusion of those error-inducing datasets, we found a high degree of correlation and linearity between results derived from the four software programs in our study. However, these high correlation values should not be misinterpreted as a measure of the interchangeability of results from different programs. In Bland-Altman analysis, calculated values are located around the mean line within a 95% confidence interval for perfect agreement [36].
In the present study, we did not observe good agreement between different programs, mostly because the LoA on Bland-Altman plots were not narrow enough to be considered negligible from our radiological perspective. In general, programs 1 and 2 (which are different versions of YACTA and are therefore based on the same initial algorithm and apply identical noise correction) showed better agreement compared with other pairs of software programs. Even between programs 1 and 2, however, which are different versions of the same program without and with application of advanced lung vessel segmentation, there was some bias (mean difference) for all five parameters. For example, we noticed one extreme example of substantial improvement in LV measurement by program 2 compared with program 1, which is probably due to the additional algorithm considering pulmonary vessels (Fig 2C and 2D). Program 2 improved the lobar segmentation at the level of the smaller broncho-pulmonary bundles, where not all the bronchi might have been segmented. This explains why program 2 yielded the smallest LV for the middle lobe as compared to all other 3 programs (Table 2).
Some researchers believe that EI directly reflects the destruction of lung tissue that occurs in emphysema, diminishing the partial volume effect of air and lung structures on each voxel of the affected zone, unlike percentile lung density [37]. In the current study, the lines of mean difference for EI lie close to the line of equality in every pair of programs (S2 Fig), suggesting that all programs provide similar results. If we take the largest bias of 4% (between programs 3 and 4), however, this value is too large considering Harris' proposal and previously published data [38]. Also, this bias amounted to 13% of the values from global EI measurement in this study. It was reported that the variability of EI measured with two different programs should be approximately less than 1% [14].
According to the previous study [14], potential sources of error in whole-lung emphysema quantification among different software programs are as follows: the steps of lung segmentation (e.g., the use of different morphologic algorithms or the inclusion of leakage from airways into lung parenchyma), airway segmentation (the extent of airway segmentation into the periphery of the airway tree) and subsequent emphysema segmentation, and variations in noise correction among software programs.
The additional measurement variation in the present study probably comes from the different lobar segmentation algorithms, resulting in different LV and subsequently different densitometry values. In patients with an inhomogeneous emphysema distribution, who are generally more suitable for volume reduction surgery or BLVR, we found a higher inter-software variability, probably related to a stronger distortion of normal lobe anatomy posing additional challenges to lung lobe detection. Thus, patients with inhomogeneous emphysema are prone to higher measurement variability in computational emphysema quantification.
As we expected, we found by visual observation that the right upper lobe and right middle lobe were major sources of relative variability, because they share the minor fissure, in agreement with the previous study [39]. Fissure analysis was not included as a part of the algorithm in any of the 4 programs that we used. Although fissure analysis might give additional information for lobe separation and more accurate values, it is also associated with very variable results between subjects, the problem of frequent incompleteness of fissures, and confounding factors such as scarring and subsegmental atelectasis near the fissure, especially in the target patients. For more accurate lobar segmentation results, it would be ideal for radiologists to use the editing function of each program. However, this would require extra time and effort, which prevents it from being widely used in real clinical practice. There is also the problem of objectivity, including inter- and intra-observer reproducibility, when manual or even semi-automated methods are used. The scope of this study was focused on the evaluation of the software's current potential in fully automated lung densitometry.
There are several limitations in the current study. First, there is no in-vivo gold standard for lobar-based emphysema quantification. Measurements of emphysema quantification software programs are usually based on segmentation of airways on MDCT datasets and an algorithm that translates the segmented voxels into lung volumes. Depending on the voxels included in each segmentation, the resulting density histograms can differ even though the segmentation volumes are the same or very similar. It is also known that the segmentation of the pulmonary lobes becomes cumbersome when the inter-lobar boundary is unclear [40], which might cause further accuracy problems among software programs. Therefore, the visual assessment performed by an experienced reader frequently serves as a "silver standard". To learn at least something about the weaknesses of fully automated lobe segmentation, a second work-up was implemented in patients with limited inter-software match, with review by an experienced reader, which may help to improve the segmentation algorithms. However, the purpose of this study was not to evaluate accuracy, as this is currently impossible. It was rather to compare the current state of different software programs and see whether it is possible to monitor patients with different programs or to interchange the results among hospitals using different programs. The message is: just as CT equipment and scanning protocols have to be kept constant, the software used for emphysema quantification also has to be the same, although all software programs tested here deliver good quality. Second, the dose protocol was adapted on the fly individually to the patients' absorption by using Caredose4D technology. Before starting this study, we examined whether this has any significant effect on emphysema quantification. We used the same type of GOLD II-IV patients for this and found no impact of the two types of protocols after statistical analysis (not shown). Besides many technical parameters in CT scanning and image reconstruction, radiation exposure is one of the factors that affect densitometry results [21,41]. However, radiation exposure did not influence our inter-software comparison, which included one MDCT examination per patient; we did not find a trend between the slightly different acquisition technologies. The subjects of the current study are patients with moderate to severe emphysema, who are the main population of interest for BLVR. Thus, the results are valid in this target population. The potential differences in variability induced by the degree of emphysema were not determined.
In conclusion, we should not interpret the results from different software programs as interchangeable. The significant differences between software programs used for lobar emphysema quantification may lead to contradictory target lobe selection for BLVR in some cases. Another important issue is that different emphysema quantification programs or different versions of the same program may be used in different institutions, impairing comparability of study results. When performing follow-up studies in patients, the software tool should be kept exactly constant.