Crowdsourcing Airway Annotations in Chest Computed Tomography Images

Measuring airways in chest computed tomography (CT) scans is important for characterizing diseases such as cystic fibrosis, yet very time-consuming to perform manually. Machine learning algorithms offer an alternative, but need large sets of annotated scans for good performance. We investigate whether crowdsourcing can be used to gather airway annotations. We generate image slices at known locations of airways in 24 subjects and ask crowd workers to outline the airway lumen and airway wall. After combining the results of multiple crowd workers, we compare the measurements to those made by the experts in the original scans. Similar to our preliminary study, a large portion of the annotations were excluded, possibly due to workers misunderstanding the instructions. After excluding such annotations, moderate to strong correlations with the expert can be observed, although these correlations are slightly lower than inter-expert correlations. Furthermore, the results across subjects in this study are quite variable. Although the crowd has potential in annotating airways, further development is needed for it to be robust enough for gathering annotations in practice. For reproducibility, data and code are available online: http://github.com/adriapr/crowdairway.git.


Introduction
Chest computed tomography (CT) can be used to quantify structural abnormalities in the lungs, such as bronchiectasis, air trapping and emphysema, which in turn can be used for diagnostic or prognostic purposes. For example, the airway-to-artery ratio (AAR) is an objective measurement of bronchiectasis which is sensitive enough to detect early lung disease [7,13]. Other promising measurements are the wall-area percentage (WAP) and the wall thickness ratio (WTR), which characterize the ratio of the airway wall to the airway lumen [9]. Unfortunately, manual measurements of the airways and vessels suffer from intra- and inter-observer variation and are time-consuming (8-16 hours per chest CT) [3]. Machine learning techniques such as [4,6,11] can be an alternative, but may require a large amount of annotated data to be able to generalize to all situations.
In various applications, crowdsourcing has been proposed as an alternative for tasks where annotated data is scarce. Crowdsourcing refers to outsourcing tasks (often referred to as human intelligence tasks or HITs) to a group of online users (often referred to as knowledge workers or KWs). This strategy has also been quite effective in medical image analysis - [8] surveys over 50 papers where results have been mostly positive. One of these papers is our earlier study [2] where we described our experiences with crowdsourcing airway measurements. We found that 67.8% of the collected results were not valid, i.e. the airway measurements could not be extracted. However, after filtering out such results, strong correlations between the crowd and expert were observed. Although these experiences were encouraging, they only concerned a single chest CT image, and it was unclear whether they could be generalized to other scans.
In this paper we describe crowdsourced airway measurements collected shortly thereafter for a larger set of 24 chest CT images, and with a slightly updated crowdsourcing procedure. With this follow-up study we aim to answer the following questions:
• Does the crowd create valid results?
• What is the quality of the crowd compared to a trained expert, after combining different results per task?
• Can we predict the quality of the crowd results, given a particular scan?

Chest CT scans
We used inspiratory pediatric CT scans from a cohort of 24 subjects [5,10], collected at the Erasmus MC-Sophia Children's Hospital. In each scan, a number of airways were annotated by an expert. The expert localized an airway, outlined the airway lumen (inner airway boundary) and airway wall (outer airway boundary) in a plane approximately perpendicular to the airway center line, and recorded the measurements of the areas.
Generating Airway Images
Fig. 1 shows a global overview of our method. The first step is to create a crowdsourcing task for each airway, which requires extracting 2D image slices from a 3D volume. This in turn requires a 3D location and orientation of the airway. Normally this localization would be done by the expert; in this study, however, we assume that localization has already been done, and focus only on outlining the airway in the image. More specifically, we used the 3D voxel coordinates at which experts had previously outlined airways using the Myrian™ software. We generated 2D slices of 50 × 50 voxels, which were reviewed by one observer (APR) to retain only the images with a visible airway that was cut approximately perpendicularly. There were 1026 such images, which are further analysed here. We used cubic interpolation and an intensity range between -950 and 550 Hounsfield units for better contrast, as recommended by the experts. Each image slice was rescaled to 500 × 500 pixels for annotation purposes.
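The intensity windowing mentioned above can be illustrated with a minimal sketch. It assumes that values outside the range of -950 to 550 Hounsfield units are clipped and that the range is then mapped linearly to 8-bit grey levels; the function name `window_to_uint8`, the 0-255 output range and the toy values are illustrative assumptions, not taken from our pipeline. Cubic interpolation and the rescaling to 500 × 500 pixels are not covered here.

```python
def window_to_uint8(slice_hu, lo=-950.0, hi=550.0):
    """Clip Hounsfield units to [lo, hi] and map them linearly to 0..255.

    slice_hu is a 2D list (rows of HU values). The 8-bit output range is an
    assumption made for display purposes.
    """
    out = []
    for row in slice_hu:
        new_row = []
        for v in row:
            clipped = min(max(v, lo), hi)          # clip to the window
            grey = (clipped - lo) * 255.0 / (hi - lo)  # linear rescale
            new_row.append(int(round(grey)))
        out.append(new_row)
    return out


# Toy 2 x 3 "slice" with made-up HU values.
toy_slice = [[-1000, -950, -300], [100, 550, 1200]]
windowed = window_to_uint8(toy_slice)
```

Values at or below -950 HU become black and values at or above 550 HU become white, so the available contrast is spent on the intensities in between.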

Annotating Airway Images
Each of the generated airway images is a crowdsourcing task. A worker assigned to a task creates a result, consisting of one or more annotations (outlines) placed in the image. To gather these results, we used Amazon Mechanical Turk [1]. All decisions regarding Amazon Mechanical Turk (MTurk) were based on consultation with colleagues who had used MTurk in the past. Apart from updating the instructions to workers, we used the same settings as in our preliminary study, which we repeat here for completeness. All results were collected in 2016.
The annotation interface was integrated into the platform by supplying a dynamic webpage, built with HTML5 and JavaScript. This custom-made interface had an ellipse tool, which resembled the tool used by the experts more closely than the default annotation tools available on MTurk. The details of our HIT, which the workers could see when searching for HITs, are shown in Table 1. The workers were instructed to draw two ellipses outlining the airway lumen and the airway wall, or to place a small circle in the top right corner of the image if no airway was visible. Following our experiences in the preliminary experiments, we revised our instructions, placing more emphasis on the need to draw two ellipses. A screenshot is shown in Fig. 2.

Table 1: Details of the HIT, as shown to workers searching for HITs.
Title: Save lives by annotating airways!
Description: Draw two contours to annotate an airway (dark circle or ellipse) in image from a lung scan
Keywords: image, annotation, contour, draw, drawing, segmentation, medical

We randomly created HITs with 10 images per HIT. A worker could request a HIT, annotate 10 images, and then submit the HIT. The workers were paid $0.10 per completed HIT. Only workers who had previously done at least 100 HITs with an acceptance rate of 90% could request the HITs.
We collected 20 results per image, because with 10 results per image as in the preliminary experiment, some images did not have valid annotations. For each result, we recorded an anonymized ID of the worker and the coordinates of the annotations. The data collection was done in 2016, shortly after our preliminary study [2].

Measuring Crowd Annotations
We applied a simple filtering step to remove invalid results. The following results were excluded:
• number of ellipses not equal to 2
• ellipses that were not resized (the tool's default is a circle)
• ellipses that do not overlap
After filtering, we measured the areas of the inner ($a_i$) and outer ($a_o$) ellipse, and calculated the wall thickness ratio (WTR) and wall area percentage (WAP). The WTR is the wall thickness divided by the outer diameter:
$$\mathrm{WTR} = \frac{WT}{d_o},$$
where $d_o$ is the diameter of the outer ellipse and $d_i$ is the diameter of the inner ellipse, and the wall thickness $WT$ is defined as:
$$WT = \frac{d_o - d_i}{2}.$$
The WAP is the percentage of the total airway area that is airway wall:
$$\mathrm{WAP} = \frac{a_o - a_i}{a_o} \times 100.$$
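As a worked illustration of these measurements, the sketch below computes the areas, WTR and WAP for one result. It makes several assumptions that are not stated in the text: each ellipse is given by its two semi-axes, the diameter of an ellipse is taken as the diameter of a circle with the same area, and the wall thickness is half the difference between the outer and inner diameter. Only the first filtering rule (exactly two ellipses) is implemented; the function names are hypothetical.

```python
import math


def ellipse_area(semi_a, semi_b):
    """Area of an ellipse with semi-axes semi_a and semi_b."""
    return math.pi * semi_a * semi_b


def equivalent_diameter(area):
    # Assumption: diameter of the circle with the same area as the ellipse.
    return 2.0 * math.sqrt(area / math.pi)


def measure_result(ellipses):
    """Return (a_i, a_o, WTR, WAP), or None if the result is not valid.

    ellipses is a list of (semi_a, semi_b) tuples. Only the rule
    "number of ellipses equal to 2" is checked in this sketch.
    """
    if len(ellipses) != 2:
        return None
    areas = sorted(ellipse_area(a, b) for a, b in ellipses)
    a_i, a_o = areas  # smaller area is the lumen, larger is the outer wall
    d_i, d_o = equivalent_diameter(a_i), equivalent_diameter(a_o)
    wall_thickness = (d_o - d_i) / 2.0  # assumed definition of WT
    wtr = wall_thickness / d_o
    wap = 100.0 * (a_o - a_i) / a_o
    return a_i, a_o, wtr, wap


# Two concentric circles with radii 10 and 20 pixels (made-up example).
example = measure_result([(20.0, 20.0), (10.0, 10.0)])
```

For this example the diameters are 20 and 40 pixels, so the wall thickness is 10 pixels, the WTR is 0.25 and the WAP is 75.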

Quality of Crowd Measurements
Before measuring how good the crowd is on each task, we need to combine the results per task. We used three different strategies for this:
Median Taking the inner/outer areas of all valid results, and combining them with the median function. WAP and WTR are then calculated based on these median values. This is the strategy used in our preliminary study.
Random Selecting a random valid result per task. This is a pessimistically biased indication of how good the crowd could be, corresponding to each task being assigned to only one worker.
Best Taking the valid result that is closest to the expert measurement, based on the inner and outer measurements. This is an optimistically biased indication of how good the crowd could be, if we only selected the best workers.
Additionally, we can choose to exclude tasks that have fewer than v valid results. This will reduce the number of tasks for which a combined result is available, but will presumably increase the quality of the result.
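The three combining strategies and the minimum number of valid results v can be sketched as follows. This is an illustration under stated assumptions rather than our actual implementation: a valid result is represented as an (inner area, outer area) pair, and for "best" combining the closeness to the expert is taken to be the sum of absolute differences of the two areas, which is one possible reading of "closest". All function names are hypothetical.

```python
import random
import statistics


def combine_median(results):
    """Median of the inner areas and median of the outer areas."""
    inner = statistics.median(r[0] for r in results)
    outer = statistics.median(r[1] for r in results)
    return inner, outer


def combine_random(results, rng):
    """One randomly selected valid result."""
    return rng.choice(results)


def combine_best(results, expert):
    # Assumption: distance = |inner difference| + |outer difference|.
    return min(results,
               key=lambda r: abs(r[0] - expert[0]) + abs(r[1] - expert[1]))


def combine_task(results, method, min_valid=1, expert=None, rng=None):
    """Combine the valid results of one task, or return None if the task
    has fewer than min_valid valid results."""
    if len(results) < max(min_valid, 1):
        return None
    if method == "median":
        return combine_median(results)
    if method == "random":
        return combine_random(results, rng or random.Random())
    if method == "best":
        return combine_best(results, expert)
    raise ValueError(f"unknown method: {method}")


# Made-up valid results for one task and a made-up expert measurement.
valid = [(10.0, 30.0), (12.0, 33.0), (50.0, 90.0)]
expert = (10.5, 30.5)
med = combine_task(valid, "median")
best = combine_task(valid, "best", expert=expert)
rnd = combine_task(valid, "random", rng=random.Random(0))
too_few = combine_task(valid, "median", min_valid=4)
```

In this example the outlying third result does not affect the median, which is the main reason for preferring the median over the mean when some workers draw very different outlines.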
After combining the results per task, we compute Pearson's correlation coefficient, ρ, between the crowd measurement and the expert measurement. Correlation coefficients are interpreted as follows: weak correlation for 0 ≤ ρ < 0.3, moderate correlation for 0.3 ≤ ρ < 0.5, strong correlation for 0.5 ≤ ρ ≤ 1. Note that if a task has no valid results, it is excluded from the analysis.
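For concreteness, the correlation and its interpretation can be written out as below. This is a plain-Python sketch of the standard Pearson formula; the function names are hypothetical, and applying the thresholds to the magnitude of a negative coefficient is an assumption, since the categories above are only defined for non-negative values.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def interpret(rho):
    """Map a coefficient to the categories used in the text."""
    magnitude = abs(rho)  # assumption: negative values are judged by size
    if magnitude < 0.3:
        return "weak"
    if magnitude < 0.5:
        return "moderate"
    return "strong"


# Made-up crowd and expert areas for four tasks.
crowd = [10.0, 14.0, 21.0, 30.0]
expert = [11.0, 15.0, 20.0, 33.0]
rho = pearson(crowd, expert)
label = interpret(rho)
```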

Predicting Crowd Quality
Lastly, we investigate whether any factors contribute to the crowd's performance across different scans in our data. We use the crowd-expert correlation of the inner airway after median combining as a proxy for the quality.
We then look at the relationship between the quality and the following characteristics:
• Whether or not the subject has cystic fibrosis (CF)
• Forced expiratory volume in 1 second (FEV1), which measures how much air a participant can exhale in 1 second.
• Forced vital capacity (FVC), which measures the total volume of air a participant can exhale.
• Number of airways as indicated by the expert.
• Average airway generation, which indicates the number of bifurcations between the current branch and the trachea. Higher generations correspond to smaller airways and vice versa.
We use the Spearman correlation to investigate the relationship of these characteristics, because we cannot assume a linear relationship between them (in particular, the CF status variable is binary). We report the correlation coefficient and the p-value from a two-sided hypothesis test, where the null hypothesis is that the characteristics are not correlated. We use a significance threshold of 0.05. Since we perform five comparisons in total, after adjusting for multiple comparisons the threshold becomes 0.05 / 5 = 0.01.
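The rank correlation and the adjusted threshold can be sketched as follows. The Spearman coefficient is computed here as the Pearson correlation of the ranks, with tied values (such as the binary CF status) receiving their average rank. Dividing the threshold by the number of comparisons is a Bonferroni-style adjustment, which is our reading of the adjustment described above. The p-value is not computed in this sketch; in practice a library routine such as `scipy.stats.spearmanr` returns both the coefficient and the p-value. Function names and data are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with ties given the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def adjusted_alpha(alpha, n_comparisons):
    # Bonferroni-style adjustment (assumed): divide by the number of tests.
    return alpha / n_comparisons


# Made-up per-subject data: CF status (binary) and crowd quality.
cf_status = [0, 1, 1, 0, 1, 0]
quality = [0.95, 0.80, 0.85, 0.90, 0.70, 0.99]
rho = spearman(cf_status, quality)
threshold = adjusted_alpha(0.05, 5)
```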

Validity of Crowdsourcing Annotations
In total we collected 20520 results for 1026 tasks. A few typical examples are shown in Fig. 3. Of these, 11742 results (57.2%) were classified as invalid, and 624 (3.0%) contained multiple pairs of ellipses per image, which we excluded to simplify the analysis.
Of the 11742 invalid results, 8809 had only one annotation. This could indicate not seeing an airway, which was the case for 2641 of the results. A further 2933 results had signs of the worker trying to annotate the image (placing ellipses on top of airways), but not following the instruction to draw two ellipses.
Figure 3: Example results acquired for the same task: a valid result with two annotations, and two invalid results: a worker who indicates not seeing an airway, and a worker who detects the airway but does not outline it.
The results were created by 577 workers in total, who made as few as 1 or as many as 2313 results. Similar to the observations in [12], most workers only created a few results, and a few workers were responsible for a lot of the results, as shown in Fig. 4 (left). Fig. 4 (right) shows the number of valid and invalid results made by each worker. Overall there is a tendency for workers to create more valid than invalid results. However, there are a few workers who have created a lot of results overall, and who tend to create more invalid results. They contribute to 57.2% of the invalid results. Finally, there are no workers that created only invalid results.

Quality of Airway Measurements
When considering each result independently (without combining the results per task), there is a correlation of 0.803 for the inner airway and 0.697 for the outer airway.
Additionally, we found moderate correlations for the ratio based measures, 0.426 for the WAP and 0.424 for the WTR.
The airway measurements and correlations after combining the results across workers are shown in Table 2, as well as Figs. 5, 6, 7 and 8. Combining improves all correlations, and for the ratios the correlations can be categorized as strong for median and "best" combining. "Best" combining gives the highest correlations, although the difference with median combining is rather small for the inner airway, WAP and WTR. For the outer airway, the difference is more pronounced (0.769 vs 0.896), suggesting that outlining the outer airway is more difficult, leading to more variation in the crowd.
Table 2: Pearson correlations between the expert and the crowd with different combining methods, and between two experts.
Figure 5: Measurements of the inner airway, comparing expert 1 (x-axis) to three combining methods and expert 2 (y-axis).
Overall, since the "best" combining method is optimistically biased due to access to ground truth, our results suggest median combining is a good choice for this data.
Median combining simply combines all valid results (between 1 and 20) available for a particular task. To understand how the number of valid results affects the correlations, we repeated the combining using only tasks for which at least a certain number of valid results was available.
The correlations are shown in Fig. 9. There is almost no effect on the correlations for the inner and outer airway, whereas the correlations for WAP and WTR steadily improve as more valid results are combined. This could also indicate that the tasks with more valid results are in general easier images to annotate.
To summarize, the crowd can create good annotations, and combining annotations using the median helps to improve the quality, although not to the quality of the expert. For median combining, strong correlations are observed for the inner and outer airways (0.844, 0.769), and moderate to strong correlations for the ratios (0.572, 0.565). It is important to note that a similar trend is noticeable in the expert-to-expert correlations: the correlations for the airway dimensions are much higher (0.964, 0.925) than correlations of WAP and WTR (0.701, 0.687).
Figure 6: Measurements of the outer airway, comparing expert 1 (x-axis) to three combining methods and expert 2 (y-axis).
Table 3: Characteristics of the subject: ID (not used in modeling), whether a subject has CF (1 = yes), FEV1, FVC (as percentage of predicted value), number of airways (n), and correlations between the crowd (median combining) and the expert. Horizontal lines inserted for legibility.

Predicting Crowdsourcing Quality
Next, we look at the correlations per subject, and whether this correlation can be predicted based on subject characteristics. The individual subject characteristics and correlations (between expert and median combining) are shown in Table 3. Overall we can see high variability across subjects. Correlations range between 0.64 and 1.00 for inner airway, 0.60 and 0.98 for outer airway, 0.07 and 0.75 for WAP, and 0.14 and 0.68 for WTR. Note that these correlations are based on smaller (and different) numbers of tasks (column "n").
Table 4: Spearman correlation between the crowd quality (measured by the crowd-expert correlation of the inner airway) and five subject characteristics.
Lastly we looked at the relationship between the quality of the crowd (here represented by the inner airway correlation) and five subject characteristics. The Spearman correlations and corresponding p-values are shown in Table 4. There is a weak negative correlation between the subject having CF and the crowd quality; however, this correlation is not significant at the adjusted alpha level of 0.01. The other characteristics show almost no correlation with the crowd quality.
These results suggest that other factors, not investigated here, are more important. We suspect that these factors are related to the difficulty of individual tasks (for example depending on size, shape, and contrast of the airway and its proximity to vessels or other structures), and/or assignment of workers to different tasks.

Discussion
This paper describes a follow-up study of [2]. In that study we concluded that workers try to annotate airways in the images but often do not create valid results. After filtering out the invalid results, the correlations between the crowd and the expert were 0.69 for the inner and 0.75 for the outer airway, and could be further improved by combining the results. As follow-up steps, we revised our instructions to the crowd, increased the number of workers from 10 to 20 per slice, and collected annotations for all 24 subjects in the cohort.
The current study (increased from 1 to 24 subjects) shows that despite revising the instructions, the percentage of invalid results is still high (57.2%). Invalid results can occur when a worker does not see an airway, sees an airway but annotates it incorrectly, or submits spam (random results just to get the reward). Our analysis shows that most workers create both valid and invalid results. An improvement to increase the number of valid results would be to perform checks (such as requiring one ellipse to be inside the other) inside the annotation interface.
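As an illustration of such a check, the sketch below tests whether one ellipse lies inside another by sampling points on the boundary of the inner ellipse. This is an approximate, sampling-based test and not part of the interface we used; the ellipse parameterization (center, semi-axes, rotation angle in radians), the class and function names, and the number of sample points are all assumptions. In an actual interface the same logic would be written in the language of the interface.

```python
import math
from dataclasses import dataclass


@dataclass
class Ellipse:
    cx: float
    cy: float
    rx: float
    ry: float
    theta: float = 0.0  # rotation in radians (assumed parameterization)


def contains_point(e, x, y):
    """True if point (x, y) lies inside or on ellipse e."""
    dx, dy = x - e.cx, y - e.cy
    c, s = math.cos(e.theta), math.sin(e.theta)
    u = dx * c + dy * s    # coordinates in the ellipse's own frame
    v = -dx * s + dy * c
    return (u / e.rx) ** 2 + (v / e.ry) ** 2 <= 1.0


def boundary_points(e, n=72):
    """n points spread around the boundary of ellipse e."""
    c, s = math.cos(e.theta), math.sin(e.theta)
    points = []
    for k in range(n):
        t = 2.0 * math.pi * k / n
        u, v = e.rx * math.cos(t), e.ry * math.sin(t)
        points.append((e.cx + u * c - v * s, e.cy + u * s + v * c))
    return points


def is_inside(inner, outer, n=72):
    """Approximate check: all sampled boundary points of inner are in outer."""
    return all(contains_point(outer, x, y)
               for x, y in boundary_points(inner, n))


# Made-up ellipses in a 500 x 500 pixel image.
wall = Ellipse(250.0, 250.0, 100.0, 80.0)
lumen = Ellipse(250.0, 250.0, 60.0, 50.0)
shifted = Ellipse(320.0, 250.0, 60.0, 50.0)
ok = is_inside(lumen, wall)
rejected = is_inside(shifted, wall)
```

A result could then be accepted only if one of the two drawn ellipses is inside the other, giving the worker immediate feedback instead of discarding the result afterwards.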
After removing the invalid results, we examined the correlations between the crowd and the expert. Without combining, the correlations were strong for the inner and outer airway, and moderate for WAP and WTR. Combining results across workers improved correlations, so that all four measures had strong correlations. However, all correlations were still lower than correlations between two experts (see Table 2).
We used simple combining methods to combine the results of the crowd. Here many alternatives are possible, such as weighting the workers by their estimated quality. Instead we tried to estimate an "upper limit" for the crowd with a method which selected the best available result for each task. This method did indeed lead to the highest correlations, but simple median combining was a close second. We conclude that median combining is suitable for this data.
Overall we conclude that the crowd is capable of producing good-quality, but not expert-quality, results. As such, in its current form the proposed method is not robust enough for gathering measurements "in the wild". In our experience this is primarily due to the difficulty of converting a clinical problem into a crowdsourcing problem, such as figuring out how to display parts of a 3D image in a 2D interface, explaining the task to the workers, and dealing with the constraints of the crowdsourcing platform.
There are a number of important lessons from this study, which could be valuable for other researchers doing similar studies. Firstly, our interface was custom built by a crowdsourcing start-up, in the context of a pilot for academic groups. This allowed us to use an ellipse tool that is similar to the tool used by the experts. A disadvantage of this approach is that we could not easily access the interface after the pilot ended, and thus would not be able to collect additional data.
Secondly, although we did a test run of the task (collecting results described in [2]), we did not gather feedback from the workers about their experiences. This is possible through various online groups such as https://www.turkernation.com, and could have reduced possible misunderstandings of the instructions. Furthermore, we used crowd qualifications (such as acceptance rate) and rewards that are outdated by today's standards, so we recommend that other researchers consult the latest crowdsourcing literature before setting up such a study.
For reproducibility of the results and any follow-up analyses, we made the airway images, crowd results and our code available via http://github.com/adriapr/crowdairway.git.