Quantifying the foodscape: A systematic review and meta-analysis of the validity of commercially available business data

This paper reviews studies of the validity of commercially available business (CAB) data on food establishments (“the foodscape”), offering a meta-analysis of characteristics associated with CAB quality and a case study evaluating the performance of commonly-used validity indicators describing the foodscape. Existing validation studies report a broad range in CAB data quality, although most studies conclude that CAB quality is “moderate” to “substantial”. We conclude that current studies may underestimate the quality of CAB data. We recommend that future validation studies use density-adjusted and exposure measures to offer a more meaningful characterization of the relationship of data error with spatial exposure.


Introduction
The influence of local food environments on dietary behaviors has generated much interest among researchers and policymakers concerned about lifestyle, obesity, and other chronic health conditions [1][2][3][4][5][6]. However, associations between measures of exposure to food establishments (e.g. access or availability) and health or health-related behaviours are mixed [7][8][9][10][11]. While some researchers have found positive associations between measures of food establishment exposure and health outcomes [12][13][14][15], several studies report negative associations [16,17]. Errors in the information used to identify food establishments may contribute to the disparate nature of existing results [18].
Researchers seeking area-based measures of exposure to food establishments, commonly referred to as the "food environment" [10] or the "foodscape" [19,20], often rely on commercially available business (CAB) data. CAB data are often more readily available than [...] "food stores", "foodscape" OR "eating places". The detailed search strategy is available in the supporting information document (S1 File-Search strategy) and presents all keywords used for each block. The review was limited to primary studies published in English between January 1st 2006 and June 30th 2015, a decade in which considerable progress has been made in GIS-based investigations [33]. Titles and abstracts were then examined by two researchers (BL, AL) to identify all studies that compared a CAB data source to a gold standard, such as primary data collection (e.g. ground truthing) or government lists (food establishment inspections or licensing records). For those titles and abstracts that did not reveal these criteria, two researchers examined the entire article (BL, MD) and two researchers checked the final selection (MD, AL). The search procedure was summarized in a flow chart (Fig 1). Examples of manuscripts that did not meet our inclusion criteria are listed in the supporting information document (S2 File-Examples not included).
All included studies reported epidemiologic validation measures to quantify error in the CAB datasets. These measures were typically constructed from the number of true positives, false positives, and false negatives (see Table 1). Authors used these measures to calculate sensitivity (the proportion of establishments in the gold standard also found in the CAB data source), positive predictive value (the proportion of establishments in the CAB data source also found in the gold standard) and concordance (the proportion of all establishments identified in the gold standard or CAB that are in both data sources, including true positives, false positives, and false negatives). Because most studies reported validation measurements across a variety of store types or between multiple CABs, we calculated the median and interquartile range of the measures reported in each study. We also examined whether these studies reported evidence of systematic bias according to the most commonly reported contextual measurements: neighbourhood socioeconomic status, population density, and neighbourhood racial composition. Each paper measured significance differently. As a result, we also relied on author interpretations to evaluate the results; details of author interpretations can be found in the supplementary documentation (S3 File-Author interpretations).
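The three measures defined above can be illustrated with a minimal sketch. It follows the verbal definitions in this section rather than Table 1 itself (which is not reproduced here); the class name, function names and example counts are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MatchCounts:
    """Counts from comparing one CAB source against a gold standard."""
    true_pos: int   # outlets listed in both the CAB and the gold standard
    false_pos: int  # outlets listed in the CAB only
    false_neg: int  # outlets listed in the gold standard only


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    # The measure is undefined when the denominator is zero.
    return numerator / denominator if denominator else None


def sensitivity(c: MatchCounts) -> Optional[float]:
    # Share of gold-standard outlets that also appear in the CAB.
    return _ratio(c.true_pos, c.true_pos + c.false_neg)


def ppv(c: MatchCounts) -> Optional[float]:
    # Share of CAB outlets that also appear in the gold standard.
    return _ratio(c.true_pos, c.true_pos + c.false_pos)


def concordance(c: MatchCounts) -> Optional[float]:
    # Share of all outlets seen in either source that appear in both.
    return _ratio(c.true_pos, c.true_pos + c.false_pos + c.false_neg)


# Hypothetical counts, for illustration only.
example = MatchCounts(true_pos=60, false_pos=25, false_neg=15)
print(sensitivity(example), ppv(example), concordance(example))
```

None of the three measures uses true negatives, which is why specificity cannot be derived from these counts.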

Meta-analysis of CAB validity measures
The second component of this study, a meta-analysis of validity results, aimed to assess whether the use of classification schemes, characteristics of the CAB data source, or the sample size examined in the study were associated with error rates. To construct the meta-analysis, we followed several steps. First, one researcher (MD) extracted the concordance, positive predictive value (PPV), and sensitivity values across stores and CAB types from each reviewed study (S4 File-Meta-analysis dataset). For example, a study that validated both Dun & Bradstreet and InfoUSA data with ground-truthed food outlet locations for supermarkets, grocery stores, and fast food restaurants would have six entries for each concordance, PPV, and sensitivity category (separate for each CAB and each type of food establishment). Hereafter, we refer to these different types of food establishments as CAB subsamples and the multiple entries per subsample as measures on CAB subsamples.
Second, boxplots compared the distribution of sensitivity, PPV and concordance estimates across aggregated samples of all food outlets and across the subsamples to evaluate whether detailed store type classifications led researchers to report lower validity scores.
Next, we examined the associations between CAB characteristics and levels of validity. Studies commonly reported the geographic region for which the CAB was obtained as well as the CAB name. We used these data to construct scatterplots comparing subsample validity estimates with the sample size (defined below), stratified by country. Boxplots additionally compared the distributions of validity estimates for the most commonly examined CABs (InfoUSA and Dun & Bradstreet).
Finally, we examined the association of sample size and validity. We estimated the correlation of validity measurements of each CAB subsample with its sample size using Spearman's rank correlation coefficient. Sample size was calculated as the number of food outlets of the type under examination that exist in the CAB, whenever available, or as the total unique outlets examined in either CAB or gold standard when CAB numbers alone were not reported.
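As a rough illustration of this step, the sketch below computes Spearman's rank correlation coefficient as the Pearson correlation of tie-averaged ranks, together with the fallback rule for sample size described above. It is a standard-library sketch, not the software used in the study; the function names and the example values are hypothetical, and no p-value is computed.

```python
from typing import Optional, Sequence


def sample_size(cab_count: Optional[int], union_count: int) -> int:
    # Prefer the number of outlets listed in the CAB; fall back to the
    # total unique outlets in either source when it was not reported.
    return cab_count if cab_count is not None else union_count


def average_ranks(values: Sequence[float]) -> list:
    # 1-based ranks; tied values share the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    # Spearman's rho is Pearson's r computed on the ranks.
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical subsamples: (sample size, PPV).
sizes = [5, 40, 120, 300]
ppv_values = [1.0, 0.8, 0.7, 0.6]
print(spearman_rho(sizes, ppv_values))
```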

Case study comparing validity scores and correlation of per capita exposure
This case study analysis used data from Boston (Massachusetts, USA) to assess the relationships of commonly used validity measures and food outlet exposure per capita at the neighbourhood level, the type of measurement ultimately of interest in health and place research. InfoUSA food outlet data for 2009 (obtained through ESRI Business Analyst) were compared against the 2009 food store database maintained by the city of Boston's Inspectional Services Department (ISD); the former dataset served as the CAB data, while the latter, a comprehensive, well-maintained and validated government data source, was treated as the gold standard. We considered the ISD data to be the gold standard because the city of Boston is required by law to license all food establishments and to conduct annual food safety inspections [34]. Food safety inspectors visit these fixed locations, and food establishments are required to obtain a permit to operate. Therefore, there is regular "ground truthing" by government officials. Some establishments could be missed if they did not obtain proper permits, or if they were mobile installations.
The InfoUSA data set included all business establishments located within 500 m buffers of the study's selected census tracts; North American Industry Classification System (NAICS) codes were used to identify and classify establishments selling food or beverages (n = 7465). Each store classification was reviewed and category assignments were revised according to keywords as well as researchers' local area knowledge. In the ISD data, each entry was reviewed individually to remove duplicates and non-commercial entities (e.g. children's feeding programs), and was categorized according to the NAICS code definitions. All establishments (n = 1581) except those without identifiable civic addresses (n = 40) were geocoded with ArcGIS 10.0; the coordinates for addresses that could not be geocoded (n = 4) were obtained from Google Maps and validated in the field.
The clean data sets were merged in ArcGIS 10.0 according to spatial location. Each unique food establishment was examined to determine the number of stores found only in the ISD data (false negatives), those found only in the InfoUSA data (false positives), and those found in both datasets (true positives). These counts were assessed across all food establishments, regardless of classification, as well as across each of four food outlet types: full-service restaurants, fast-food restaurants, caterers and grocery/convenience stores. We considered a listing to be in both data sources (a true positive) if an outlet with the same name was observed in close proximity (within ±200 m) and on the same street in both data sources. Sensitivity, PPV and concordance between the two data sets were calculated following the formulas in Table 1.
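The matching rule (same name, same street, within 200 m) can be sketched as follows. This is only an illustration: the merge in the study was carried out in ArcGIS with manual review, whereas the sketch assumes exact matching after simple text normalisation and a great-circle distance. The record structure, function names and coordinates are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Outlet:
    name: str
    street: str
    lat: float
    lon: float


def haversine_m(a: Outlet, b: Outlet) -> float:
    # Great-circle distance in metres on a spherical Earth.
    radius_m = 6_371_000.0
    phi_a, phi_b = math.radians(a.lat), math.radians(b.lat)
    d_phi = phi_b - phi_a
    d_lambda = math.radians(b.lon - a.lon)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi_a) * math.cos(phi_b) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(h))


def normalise(text: str) -> str:
    # Assumption: case and spacing differences are not real mismatches.
    return " ".join(text.lower().split())


def is_match(cab: Outlet, gold: Outlet, max_dist_m: float = 200.0) -> bool:
    return (normalise(cab.name) == normalise(gold.name)
            and normalise(cab.street) == normalise(gold.street)
            and haversine_m(cab, gold) <= max_dist_m)


def classify(cab_list, gold_list):
    """Return (true positives, false positives, false negatives)."""
    unmatched_gold = list(gold_list)
    tp = fp = 0
    for cab in cab_list:
        hit = next((g for g in unmatched_gold if is_match(cab, g)), None)
        if hit is None:
            fp += 1          # in the CAB only
        else:
            tp += 1          # in both sources
            unmatched_gold.remove(hit)
    return tp, fp, len(unmatched_gold)  # leftovers are in the gold standard only


# Hypothetical listings.
cab = [Outlet("Joe's Pizza", "Main St", 42.3601, -71.0589),
       Outlet("Corner Mart", "Elm St", 42.3500, -71.0600)]
gold = [Outlet("joe's pizza", "main st", 42.3610, -71.0589),
        Outlet("Corner Mart", "Elm St", 42.3550, -71.0600),
        Outlet("Thai Garden", "Oak St", 42.3400, -71.0700)]
print(classify(cab, gold))
```

In this example the second pair shares a name and street but lies more than 200 m apart, so it is counted once as a false positive and once as a false negative.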
In addition to the validity statistics describing the entire area, we computed the correlation between the per capita food environment exposure estimated by both data sets. We used Spearman's rank correlation coefficient to account for the non-normal distribution of the data. For each Boston neighbourhood (n = 27), the number of stores per capita based on the estimated 2009 population in census tracts was calculated for both the InfoUSA data and the ISD data; the correlation between the per capita exposure across neighbourhoods was then calculated for all food establishments as well as for full-service restaurants, fast food restaurants, caterers, and grocery/convenience stores. In addition to showing the level of consistency between the two datasets, we compared the validation measurements (concordance, PPV and sensitivity) with the correlations of the per capita exposure estimates to reveal whether these different validation indicators provided diverging assessments of CAB data quality.
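A minimal sketch of the exposure measure is given below: stores per 1000 residents by neighbourhood, computed separately for each data source. The neighbourhood labels, counts and populations are hypothetical, and the helper names are illustrative only.

```python
def density_per_1000(store_counts: dict, populations: dict) -> dict:
    # Stores per 1000 residents; neighbourhoods with no listed stores get 0.
    return {
        name: 1000.0 * store_counts.get(name, 0) / pop
        for name, pop in populations.items()
        if pop > 0
    }


def rank_order(densities: dict) -> list:
    # Neighbourhoods sorted from highest to lowest exposure.
    return sorted(densities, key=densities.get, reverse=True)


# Hypothetical neighbourhoods.
populations = {"A": 30000, "B": 20000, "C": 15000}
cab_counts = {"A": 120, "B": 40, "C": 15}    # e.g. the CAB source
gold_counts = {"A": 90, "B": 35, "C": 10}    # e.g. the gold standard

cab_density = density_per_1000(cab_counts, populations)
gold_density = density_per_1000(gold_counts, populations)
print(rank_order(cab_density), rank_order(gold_density))
```

Here the two sources disagree on every count, yet they rank the neighbourhoods identically, which is the property a rank correlation across neighbourhoods captures.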
Thirteen studies examined the relationship of CAB data source validity scores and neighbourhood characteristics. Seven of the nine studies that examined neighbourhood socioeconomic status and three of the five studies that examined race concluded that there were no significant differences in CAB validity across neighbourhoods. In contrast, four of the seven studies that examined population density did find evidence of systematic differences in validity according to commercial or population density. It should be noted, however, that many of these studies tested several associations across different subsets of the CAB data without correcting for multiple testing, and thus the results may be subject to an inflated type 1 error rate [53].

Meta-analysis of CAB validity measures
A total of 540 measures on subsamples were extracted from the 20 studies under review. Sixteen studies reported sensitivity (n = 235), 15 studies reported PPV (n = 163), and 8 studies reported concordance.
In the comparison across different sources of CAB data, median validation scores tended to be lower in studies using Dun & Bradstreet datasets. However, all sources, including government data, showed a similarly wide range of validity measurements across studies, which makes it impossible to clearly identify whether one data source is more valid than the others (Fig 4).
The number of stores listed in the CAB was positively associated with sensitivity (Spearman's Rank ρ = 0.178, p = 0.007) and inversely associated with PPV (Spearman's Rank ρ = -0.287, p < 0.001) and concordance (Spearman's Rank ρ = -0.646, p < 0.001), but this association was in part due to the presence of subsamples containing a very small number of stores. As shown in Table 3, when we re-examined the associations of validity measures and sample size while keeping only CAB subsamples with larger numbers of food stores (above 3, 10, and 30 observations), the strength, sign and p-value of the correlations changed substantially, suggesting that the correlations were sensitive to the presence of small subsamples in the distribution.

Case study comparing validity scores and correlation of per capita exposure
The mean food store density per 1000 people, estimated for the 27 neighbourhoods of Boston, varied between the InfoUSA and the ISD datasets (Table 4). Both datasets had a very high standard deviation, which limited our ability to demonstrate significant differences either for all food stores or between food store types. The validity estimates obtained for the 2009 Boston foodscape (Table 5) were comparable to those observed across the studies surveyed (Table 2). For all food stores, InfoUSA had a sensitivity of 68%, a PPV of 51%, and a concordance of 41%. According to the Landis scale (<0.00 poor, 0.00-0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial, and 0.81-1.00 almost perfect reliability) [30], which was used to interpret validity scores in a CAB-related literature review [25], the dataset sensitivity would qualify as substantially reliable, while PPV and concordance would be considered moderately reliable.
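The Landis scale lookup used in this interpretation can be sketched as a simple function. The band limits are those quoted above; treating each upper limit as inclusive (so that a score of 0.205, for instance, falls in "fair") is an assumption of this sketch, as is the function name.

```python
def landis_label(score: float) -> str:
    # Bands as quoted in the text; upper limits treated as inclusive.
    if score < 0.0:
        return "poor"
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if score <= upper:
            return label
    raise ValueError("score must not exceed 1.0")


# Values reported for all food stores in the Boston case study.
for measure, value in [("sensitivity", 0.68), ("PPV", 0.51), ("concordance", 0.41)]:
    print(measure, landis_label(value))
```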
In contrast with the validity estimates, the relative food store exposure by neighbourhood, calculated as the number of stores available per capita in each neighbourhood, was similar between the two datasets (Table 5). The correlation of exposure to food stores per 1000 people between the gold standard (ISD) and the CAB (InfoUSA) was 86.9%. For each food store category, the correlations were 99.6% for full-service restaurants, 96.8% for fast-food restaurants, 83.5% for grocery and convenience stores, and 76.9% for caterers. All correlations were significant at the 1% level.

Discussion
Public health authorities and researchers are increasingly seeking to estimate the association of the food environment with health outcomes or diet, but the quality of food environment data poses a significant challenge. The main purpose of this research was to analyse studies assessing the validity of commercially available business (CAB) data sources for food establishments in order to characterize and interpret the validation indicators commonly used in health and place studies. This study consists of three main components: 1) a description of CAB performance across studies, 2) a meta-analysis of the associations of data errors with area characteristics, and 3) a critique of the interpretation of validity measures through an alternative method of validating geographic data.

Between-study CAB performance
The quality of CAB food outlet databases has been the subject of at least twenty studies to date. The reviewed studies used the epidemiological validation measures of sensitivity, positive predictive value and concordance to assess data quality. The resulting measures showed high variability, but the majority of sensitivity and PPV results fell between 40% and 85%. Applying the interpretations of the Landis scale, these results can be seen as indicating moderate to substantial reliability. However, the Landis scale was originally designed to evaluate Kappa statistics, which differ from the validity measures surveyed in this study (Munoz and Bangdiwala 1997). The Kappa statistic is a measure of precision between raters that compares the observed agreement between two sources with the agreement that would occur by chance; in contrast, sensitivity, PPV and concordance are not adjusted for random agreement (Viera and Garrett 2005), and thus their levels deserve a stricter interpretation.
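For reference, the usual (Cohen's) form of the Kappa statistic makes the chance adjustment explicit. The symbols are introduced here for illustration: $p_o$ is the observed proportion of agreement between the two sources and $p_e$ is the proportion of agreement expected by chance.

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
```

Because $p_e$ is subtracted from the numerator, a raw proportion of agreement of a given size corresponds to a lower $\kappa$ whenever $p_e > 0$. Sensitivity, PPV and concordance include no such correction.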
Furthermore, as Landis and Koch noted, the scale's statistical thresholds were not supported by empirical investigations, but rather provided a useful benchmark for discussion (Landis and Koch 1977). It is also important to mention that several CAB validation studies directly referred to an interpretation scale proposed by Paquet [54] to analyze the concordance of their observations, which in turn referred to Janse [55]. The latter is actually a meta-analysis of patient-doctor agreement on quality of life, and provides no justification for interpreting the degree of agreement. Analysing the concordance between CAB databases is a very different research context, and the scale may not be directly transferable. The validity of a CAB would be better evaluated in terms of the error's likely effect on study outcomes. For example, if 20% of fast food outlets are incorrectly classified in the CAB, will associations of fast food outlet exposure and diet-related health be compromised? Not necessarily, because one type of food outlet may be replaced by a similar type of establishment. In this situation, the validity measure will go down, while the exposure to food outlets of a similar type will stay about the same. As only one study has examined the effect of dataset error on measurements of the food environment [56] and no study, to our knowledge, has examined the effect of data set error on study outcomes, this question remains unanswered. Future research could address this gap through methods similar to those presented in this paper's case study (i.e. through field research that, in addition to calculating validity scores, also examines the correlations between food environment exposure measures constructed from secondary and from gold-standard data) or through simulation studies that estimate the potential effects of various levels of error on measures of food environment exposure.
This study found a statistically significant relationship between sample size and validity measures. However, the association of sensitivity and CAB sample size reversed direction when subsamples with very few listings (and thus with extreme values) were excluded. Although the associations of PPV and concordance were negative and statistically significant both for all subsamples and for subsamples with large n, excluding subsamples with few listings led to a large decrease in the magnitude of the association. As a result, comparing validity statistics between studies with large differences in the number of observations appears highly questionable, and we recommend that researchers use caution when interpreting data disaggregated into very small subcategories.
This study did not find evidence of noteworthy differences in quality across different CABs. This finding does not support that of a recent review, which reported high levels of agreement in InfoUSA and government data in comparison with other secondary data sources [25]. Although we also observed that these two sources had slightly higher median validity measurements, there is strong variability around the median values, preventing a clear conclusion regarding each CAB's relative reliability.
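One way to see why very small subsamples produce extreme values is to enumerate the scores they can take at all. The sketch below does this for PPV, assuming the sample size is the number of CAB listings, so that true positives plus false positives equal n; the function name is hypothetical.

```python
from fractions import Fraction


def attainable_proportions(n_listings: int) -> list:
    # With n listings, the number of true positives can only be 0..n,
    # so the score is restricted to the fractions k / n.
    return sorted({Fraction(k, n_listings) for k in range(n_listings + 1)})


for n in (1, 2, 3, 30):
    values = attainable_proportions(n)
    print(n, [float(v) for v in values[:4]], "...", len(values), "possible values")
```

With one or two listings the score can only be 0, 0.5 or 1, so a single misclassified outlet moves a subsample between the ends of the scale, whereas with 30 listings each outlet shifts the score by about 3 percentage points.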
Comparability between countries is also limited. There is some evidence that studies conducted in Denmark, Canada, and the United Kingdom obtained higher validity measures than those conducted in the United States, but studies in the former three countries have been much fewer in number and used smaller samples than many of the studies conducted in the U.S.

Associations with area characteristics
This meta-analysis did not reveal evidence of a systematic relationship between CAB error and neighbourhood characteristics such as socioeconomic status or neighbourhood racial composition. Of the nine studies that disaggregated measures by neighbourhood socioeconomic status, only two reported a relationship with validity measures, and three of the five studies that examined racial demographics found no significant association with CAB data validity. These results align with the measures reported in a recent, similar review [25]. Among studies using CAB data in areas with variability in commercial or population density, four out of seven studies found that validity measures differed significantly between areas with high versus low densities. This result is possibly linked to the number of food stores under investigation: as we demonstrated previously, the smallest samples (n<3) tended to lead to extreme validity scores. This finding suggests that validity scores are highly sensitive to very small sample sizes and thus may offer limited insight for studies conducted in rural areas or studies that disaggregate outlet data into many food outlet categories.

Comparison of validity indicators with a measure of exposure
This paper used a case study from Boston (Massachusetts, USA) to compare the validity measures with a more common characterization of spatial exposure data, correlation of per capita exposure. While the three validity scores identified many errors in the CAB data, the per capita exposure to the foodscape was highly correlated between the CAB and gold standard data sources. The validity measures, originally developed to evaluate the quality of diagnostic tests, may not be suited to the measurement of spatial exposure data. The calculation of true positives, false positives, and false negatives requires that the outlet characteristics in the CAB data be nearly identical to those in the gold standard dataset. Many studies did consider listings with slight errors (e.g. incorrect names but correct classifications) as true positives, but minor errors in address or classification would have been listed as false positives, while their corresponding "real-world" outlet would be considered a false negative. Small errors can thus lead to large differences in validity measures despite a high level of similarity between per capita exposure to CAB food outlets and to gold standard food outlets.
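The effect described above can be reproduced with a small, entirely hypothetical example: ten outlets exist in a neighbourhood, and the CAB lists all ten but records a wrong street number for two of them. Listings are compared here as exact (name, address) pairs, which is a stricter rule than most of the reviewed studies applied.

```python
def validity_from_listings(cab: set, gold: set) -> dict:
    tp = len(cab & gold)   # identical listing in both sources
    fp = len(cab - gold)   # listing found in the CAB only
    fn = len(gold - cab)   # listing found in the gold standard only
    return {
        "sensitivity": tp / (tp + fn),
        "ppv": tp / (tp + fp),
        "concordance": tp / (tp + fp + fn),
    }


# Hypothetical neighbourhood with ten outlets on one street.
gold = {(f"Outlet {i}", f"{i} Main St") for i in range(1, 11)}

# The CAB lists the same ten outlets, two with a wrong street number.
cab = {(f"Outlet {i}", f"{i} Main St") for i in range(1, 9)}
cab |= {("Outlet 9", "19 Main St"), ("Outlet 10", "110 Main St")}

print(validity_from_listings(cab, gold))
print(len(cab), len(gold))  # identical outlet counts in both sources
```

Each address error is counted twice, once as a false positive and once as a false negative, so concordance drops to about 0.67 even though both sources report ten outlets and therefore the same per capita exposure.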

Strengths and limitations
This study is, to our knowledge, the first to compare estimates of food environment dataset validity across countries; our assessment of the association between validity scores and sample sizes also offers researchers insight into the effects of detailed store classification schemes. However, this study did not test for associations between study characteristics (e.g. funding sources or research design) and CAB validity scores. The high variance observed in estimates of sensitivity, positive predictive value, and concordance thus may reflect differences in the quality of the studies examined rather than true differences in dataset quality. It should also be noted that this review relied only on data extracted from published studies. We did not pursue unpublished data; thus the results may be affected by publication bias.
Although exposure measurements would allow a better assessment of the food environment, they also have limitations. The computation of a relative indicator, such as per-capita measures, is clearly pertinent for between-area or between-study comparison analyses, but it is dependent on the geography on which it is computed (e.g. the size and the borders of a neighbourhood) [57]. Also, correlation may not be the best validation tool when the objective is to construct measures of access to food sources (e.g. measuring the closest fast-food restaurant from home, or the mean distance to the three closest convenience stores) for which the precision of the geographic information is particularly important.

Conclusions
All studies inspected here examined global error in preliminary food environments data. Further research is needed to understand how error affects the food environment measurements that are ultimately used in health and place research, but this work can offer guidance for future validation studies.
Although the majority of CAB data sources have moderate to substantial reliability according to the Landis scale, this scale may not provide adequate guidance to evaluate CAB validity. No guidelines currently exist to interpret validity measures specifically for geocoded built environment databases, and their interpretation requires caution. We thus suggest that the analysis of validity measures should be accompanied by a relative measure of exposure. Researchers should further be cautious in disaggregating data by outlet classification and geography, as the use of data subsets with very small sample sizes can lead to the proliferation of extreme results. The results of the case study in Boston brought new insight on this aspect, suggesting that existing validation studies may underestimate the quality of CAB data sources for food environments research. Although validity measures indicated substantial errors between the CAB and the gold standard, when adjusted for neighbourhood population density (i.e. per capita exposure to the foodscape), a relatively high correlation was found between both datasets. Future studies should include measures that better evaluate the effect of error on spatial exposure (correlation or "representativity" [50]) to offer a more meaningful characterization of CAB data quality when the aim is to estimate exposure to the food environment.
While the evidence, presented in this study, of a high correlation in measures of per capita exposure obtained from CAB and gold standard data sets will be reassuring to researchers, the results are less promising for practitioners. A policymaker who prohibits fast food restaurants from locating within a set distance of schools, for example, will need exact data on outlet locations; the lower levels of validity observed in our systematic review suggest that policies requiring exact information on store locations will need to be accompanied by improved data collection mechanisms.
Although all CAB datasets include error, the systematic underestimation of CAB data validity may be leading researchers to conduct time- and cost-intensive primary data collection efforts that ultimately lead to little improvement in research quality. Such primary data collection may be necessary in the case of a study area with high variability in population density, but food environment validation research does not offer evidence of systematic error in relation to race or socioeconomic deprivation. Further research should be conducted to develop validity measurements adapted for geographic data and to quantify the effect of data set error on measures of exposure.