Exposure assessment of adults living near unconventional oil and natural gas development and reported health symptoms in southwest Pennsylvania, USA

Recent research has shown relationships between health outcomes and residence proximity to unconventional oil and natural gas development (UOGD). The challenge of connecting health outcomes to environmental stressors requires ongoing research with new methodological approaches. We investigated UOGD density and well emissions and their association with symptom reporting by residents of southwest Pennsylvania. A retrospective analysis was conducted on 104 unique, de-identified health assessments completed from 2012–2017 by residents living in proximity to UOGD. A novel approach to comparing estimates of exposure was taken. Generalized linear modeling was used to ascertain the relationship between symptom counts and estimated UOGD exposure, while Threshold Indicator Taxa Analysis (TITAN) was used to identify associations between individual symptoms and estimated UOGD exposure. We used three estimates of exposure: cumulative well density (CWD), inverse distance weighting (IDW) of wells, and annual emission concentrations (AEC) from wells within 5 km of respondents’ homes. Taking well emissions reported to the Pennsylvania Department of Environmental Protection, an air dispersion and screening model was used to estimate an emissions concentration at residences. When controlling for age, sex, and smoker status, each exposure estimate predicted total number of reported symptoms (CWD, p<0.001; IDW, p<0.001; AEC, p<0.05). Akaike information criterion values revealed that CWD was the better predictor of adverse health symptoms in our sample. Two groups of symptoms (i.e., eyes, ears, nose, throat; neurological and muscular) constituted 50% of reported symptoms across exposures, suggesting these groupings of symptoms may be more likely reported by respondents when UOGD intensity increases. Our results do not confirm that UOGD was the direct cause of the reported symptoms but raise concern about the growing number of wells around residential areas. 
Our approach presents a novel method of quantifying exposures and relating them to reported health symptoms.

It is particularly interesting that the correlation with total reported symptoms is stronger with CWD than with IDW, and the authors make a good case for why that is: they did not have data on the particulars of each well's activities (beyond location and annual emissions), and assigning higher weights to wells closer to a residence compounds that uncertainty. In my opinion, that points out a notable limitation in the methodology and conclusions of this study, which I believe the authors should be more straightforward in acknowledging. That is, the health concerns reported by residents are correlated only with annual data on well locations and emissions. It is not known if the health issues were transient or longer lasting (which the authors acknowledge), and it is not known exactly what was going on at the well pads within 5 km of each respondent's house. We know that wells under development can have highly variable emissions, perhaps by orders of magnitude, and some wells may only be under development for weeks before going into production mode, during which emissions are generally much smaller. The body of literature suggests that higher air concentrations resulting from O&G activities are much more likely to occur during development, and that reports from local residents of health issues and nuisances also tend to peak during development. Therefore, correlating one-time reports of health complaints with annual O&G data misses an opportunity to more directly investigate possible connections between health complaints and O&G operations in real time. Can you say if new well development is active and thriving in these counties, mixed with wells in long-term production mode? Are there really no data on which wells were under development vs. in production, with sufficient time resolution to draw closer connections?
I appreciate that the authors used CONC as a third exposure-intensity metric. Though the emissions data are annualized, and the dispersion model is screening-level, this is an important metric because it combines proximity to residence with emission source strength. And if you are asserting that these health-symptom reports may be linked to inhaling chemicals emitted from well pads, then emissions and proximity are key to that exposure route. However, you should be clearer in your assumption that the respondents' exposures are entirely at their residence (or at least that's what this intensity metric represents) and that there is full chemical penetration into their home. I also found the "Ambient Air Emissions" methodology section to be rather unclear on a number of fronts, as I discuss in more detail below. This is the section for which I advise the most revisions.
I also appreciate that the authors used several characteristics of respondents as part of the GLM application. However, the discussion of the role of those characteristics in correlating symptom reporting with changes in exposure intensity is non-existent. In my opinion, if you are making observations about statistical differences in males/females, smokers/non-smokers, age, water source, etc., then the discussion on them should be more complete and include speculations about the meaning of those differences. If you are not willing to speculate, then say why.
More specific line-by-line comments are provided below. Some of them are related to my comments above, while others are newly mentioned below (some of them major).
I appreciate you giving me the opportunity to review your work. I hope that my comments are helpful and fair, and I am available to review a revised manuscript should you decide to go that route.

Additional Comments
• "Household" refers to the people in the house. I think in most cases you mean "residence" (i.e., the location of the house).
• In many cases, you use "gas well" as shorthand for "oil and gas well", but it implies by omission that they're not oil wells. Consider just using "well".
• If you estimate concentrations from emissions, then consider whether your model names and results discussions should refer to a "concentration model", "concentration intensity", "concentration gradient", etc., rather than emissions model, intensity, gradient, etc.
• Not defining duration of symptom (short periods vs long periods of symptom persistence) is a concern in terms of understanding if reported health issues are episodic versus chronic. Not correlating time of symptom with UOGD activity also weakens assumptions about correlations between well activities and health issues.
• I take some issue with calling the reported symptoms "health effects", "health impacts", and similar phrasing. These phrases imply cause (O&G emissions) and effect (itchy eyes, etc.). Perhaps terms like "negative health symptoms" are more appropriate?
• I also think you should be more careful about referring to CWD and IDW as measurements of exposure. They're metrics of proximity to wells, and that's it. CONC is closer to an exposure metric, as you attempt to estimate air concentrations of O&G-emitted chemicals at residences. I think at the least you should acknowledge this, and perhaps then establish that for convenience you will refer to them as metrics of potential exposure intensity (or something like that).
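To make the last point concrete, here is a minimal sketch of the two proximity metrics, assuming the definitions as I understand them from the manuscript: CWD as the number of wells within a radius divided by that radius (in km), and IDW as the sum of inverse distances to the wells within the radius. The function names, the planar coordinates in km, and the example well layout are hypothetical and are not taken from the manuscript.

```python
import math


def distances_km(home, wells):
    """Straight-line distances from a home to each well.

    Assumes (x, y) coordinates already projected to kilometres.
    """
    return [math.hypot(wx - home[0], wy - home[1]) for wx, wy in wells]


def cumulative_well_density(dists, radius_km):
    """Wells within the radius divided by the radius (assumed definition)."""
    n_wells = sum(1 for d in dists if d <= radius_km)
    return n_wells / radius_km


def idw_score(dists, radius_km):
    """Sum of 1/d over wells within the radius; a zero distance is skipped."""
    return sum(1.0 / d for d in dists if 0 < d <= radius_km)


# Hypothetical layout: one home and four wells, evaluated at the three radii.
home = (0.0, 0.0)
wells = [(0.5, 0.0), (0.0, 2.0), (3.0, 4.0), (6.0, 0.0)]
dists = distances_km(home, wells)
for radius in (1.0, 2.0, 5.0):
    print(radius, cumulative_well_density(dists, radius), idw_score(dists, radius))
```

Neither function takes an emission rate, a well's operating phase, or any meteorology as input, which is why both are better described as proximity metrics than as exposure measurements.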
• Introduction (minor issues)
This section provides a good though brief review of the literature on potential connections between O&G operations and impacts on human health.
○ Lines 48-49: change "human health risk" to "human-health risk"
○ Line 61: change "number" to "numbers"
○ Lines 67-68: insert comma after the [8] citation, and change "inverse distance weighting" to "IDW"
○ Lines 69-70: I think you should change "well emissions exposure" to "emissions exposure metric"? Also, change "calculate ambient air at the" to "calculate ambient-air concentrations at the". Change "exposure metric comparison as well, however, their" to "exposure-metric comparison as well, but their"
○ Line 72: I think you should put "[16]" after the Hess citation?
○ Lines 74-75: suggest revising as "...and this analysis-comparing three estimates of exposure, including reported emissions-attempts…"
○ Final sentence starting on Line 78: suggest changing to "The aggregate of methodologies applied here-using statistical modeling to analyze the influence of different exposures on symptom reporting, and applying a technique to identify specific symptoms that might be indicative of exposure-is novel in UOGD research and provides insight into new techniques for studying relationships between health and exposure variables."

• Study Sites & Health Outcomes
This section provides a good description of the symptom reports and provides confidence that they were screened appropriately.
○ Major Issues
■ Line 90: What does "Appendix A" refer to? The reference doesn't have an Appendix A, and neither do you? Also, the link provided for reference 18 is broken; I think you're missing a hyphen between "individual" and "health".
○ Minor Issues
■ Line 87: change "Between" to "In"
■ Line 95: I think you should change "Weinberger et al. the" to "Weinberger et al. [19], the"?
■ Line 97: change "oil and gas industry" to "oil-and-gas industry"
■ Line 98: suggest changing "complete the assessment form (n=118). The 118 health assessments" to "complete the assessment form (17 excluded). The remaining 118 health assessments"
■ Line 99: change "health care providers" to "health-care providers", and "occupational health physician" to "occupational-health physician"
■ Line 103: change "one of eight counties" to just "eight counties"
■ Figure 1 caption: remove comma in "Southwestern, PA"; change "Lawrence county" to "Lawrence County"; insert "County" after "Butler"; change the "[20]" citation to "[22]"
■ Figure 1: suggest making county names more readable (move them on top of the well locations?)

• Cumulative Well Density and Inverse Distance Weighting
This section lacks some clarity, as discussed below.
○ Major Issues
■ Line 119: You say that three radii were drawn initially. That implies something changed later…?
■ Line 122: suggest updating this sentence to "Active, unconventional wells for the year of a completed health assessment were plotted within the three radii around the respondent's home." Though 5 km was the maximum radius, it's probably clearer to say within the three radii.
■ Line 123: update this sentence to "A cumulative well density was calculated per respondent for the year of their survey, equal to the total number of wells divided by the radius (in km)." Again, it's using three radii with 5 km being the max, right?
■ Line 135 about the four residences with wells outside of PA within their 5-km radius (insert hyphen there!): were they also outside of PA for their 1- and 2-km radii? Also, why not throw these respondents out of your " after the Weinberger citation.
■ Line 120: change "1km" to "1 km" (insert space)
■ Line 126: the "IDW" abbreviation was already established earlier.
■ Line 127: should "qualifying" be "quantifying"? Leading into the next line, change "closer to the respondents' home" to "closer to a respondent's home".
■ Line 128: suggest updating this sentence to "The inverse distance of each well within 1-, 2-, and 5-km radii of a residence was calculated, and those values were summed into one IDW score per respondent, per radius, as shown in the following equation:"
■ Lines 132-133: change "respondents' home, and n is the number of wells within the 5 km buffer" to "respondent's home, and n is the number of wells within the radius"

• Ambient Air Emissions
○ Major Issues
■ "Ambient Air Concentrations" is a more appropriate section title, as those are what you are deriving in this section.
■ Line 139: should year 2012 have been removed from your assessment, given that 25% of the year's emissions data were unavailable?
■ Line 155: for the Pittsburgh meteorological data, can you speak to their representativeness of conditions across your study area?
■ Line 163: what is the significance of modeled concentrations being less than 10 μg/m³, and what was that based on?
■ Lines 171-173: an air concentration is not a "rate of emissions exposure", it's a concentration. You can then say (if correct) that you assume that it is the concentration at which residents are exposed (i.e., constant exposure to outdoor concentrations at their residence), and refer to it as an exposure concentration. What is meant by "total, or aggregated, emissions"? Is it the one emissions rate you said earlier that you used in your calculations? Repeating that here makes it sound like there's something more going on here at the end to determine individual exposure.
■ This section is very difficult to follow. This is due to several factors.
• There is an inconsistency in terminology that can be easily rectified: you are using emissions of chemicals from well sites (which indeed are ambient air emissions, though "air emissions" is clear enough; and they are reported as rates, not volumes; a consistent use of emission "rate" rather than "amount" is desired here, too) and meteorological data to estimate concentrations (not "levels" or "emissions concentrations" or "air level value") of those chemicals in the ambient air at various distances from the well.
• You kindly offer a brief summary of a box-model methodology more fully described in other papers, but the brief summary as it is currently written is inadequate and confusing. Pasquill used five wind-speed categories, along with cloud cover and time of day, to define six stability classes, not 30. I see that your reference [23] (Brown et al., 2019) has a Table 1 that defines 30 stability classes from A1 to D30, but I don't recall seeing these from Pasquill's work (certainly correct me if I'm wrong!) and I don't see how that's used in concert with Figure 1 of [23], which is the vertical mixing/stability/distance look-up chart showing just the six stability classes. You say you pulled data on wind direction from NOAA, but the box model does not utilize wind direction. Taking a step back from the details, it would be clearer to say (roughly) that the model utilizes atmospheric stability, wind speed, and an assumption about the size of a well-pad facility to estimate the size of a box in which the emissions are well mixed, which in turn is a measure of plume dilution, where the chemical concentration in the box is calculated as emission rate divided by box volume.
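A minimal sketch of that simplified description may help, assuming a generic ventilated-box formulation in which the box volume swept per unit time is pad width times mixing height times wind speed. This is my own illustration and not the model of reference [23]: the function names and the example mixing height and wind speed are hypothetical, while the 100-m pad width and the 300 g/h reference rate are the values quoted in the manuscript.

```python
def box_concentration_ug_m3(emission_g_per_h, pad_width_m,
                            mixing_height_m, wind_speed_m_s):
    """Well-mixed box: concentration = emission rate / air volume per time."""
    if min(pad_width_m, mixing_height_m, wind_speed_m_s) <= 0:
        raise ValueError("box dimensions and wind speed must be positive")
    emission_ug_per_s = emission_g_per_h * 1e6 / 3600.0   # g/h -> ug/s
    # Volume of air passing through the box each second (m^3/s).
    ventilation_m3_per_s = pad_width_m * mixing_height_m * wind_speed_m_s
    return emission_ug_per_s / ventilation_m3_per_s


def scale_to_actual(conc_at_reference, reference_g_per_h, actual_g_per_h):
    """Rescale a reference-rate concentration to a well's reported rate.

    Valid only because the box model is linear in the emission rate.
    """
    return conc_at_reference * actual_g_per_h / reference_g_per_h


# Hypothetical hour: 50 m mixing height and 2 m/s wind over a 100-m pad.
c_ref = box_concentration_ug_m3(300.0, 100.0, 50.0, 2.0)
c_well = scale_to_actual(c_ref, 300.0, 150.0)  # a well reporting 150 g/h
print(c_ref, c_well)
```

Because the concentration is linear in the emission rate under this formulation, the choice of reference rate does not affect the rescaled result; a run at 1 g/h multiplied by a well's reported rate gives the same concentration.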
(Hourly wind speed is used as part of the box volume calculation, right? That's the "meters of air that pass over a site/minute" stated in [23]?) Then you can march through how you identified each of those parameters (Pasquill stability from hourly data on cloud cover and wind speed from NOAA; an assumed 100-m diameter of well pad; Pasquill assumptions on vertical mixing given stability and horizontal distance; an assumption of constant 300 g/h emissions).
• The relationship between the reference emission rate used in the modeling (300 g/h) and the actual facility emission rates (variable) is not entirely clear. I think you're telling me that you ran the model to get concentrations per unit emissions at five different distances from a well, based on a high-end metric of hourly concentrations in a year (why 300 g/h and not just 1 g/h?). Then you got a well's real emissions and multiplied them by the modeled concentration per unit emissions. If that is correct, please consider updating the final paragraph of this section to be more clear about this.
• The use of quadrants around a residence is not clear to me. How does this affect concentration in any way?
○ Minor Issues