Evidence of air quality data misreporting in China: An impulse indicator saturation model comparison of local government-reported and U.S. embassy-reported PM2.5 concentrations (2015–2017)

This paper analyzes hourly PM2.5 measurements from government-controlled and U.S. embassy-controlled monitoring stations in five Chinese cities between January 2015 and June 2017. We compare the two datasets with an impulse indicator saturation technique that identifies hours when the relation between Chinese and U.S. reported data diverges in a statistically significant fashion. These temporary divergences, or impulses, are 1) More frequent than expected by random chance; 2) More positive than expected by random chance; and 3) More likely to occur during hours when air pollution concentrations are high. In other words, relative to U.S.-controlled monitoring stations, government-controlled stations systematically under-report pollution levels when local air quality is poor. These results contrast with the findings of other recent studies, which argue that Chinese air quality data misreporting ended after a series of policy reforms beginning in 2012. Our findings provide evidence that local government misreporting did not end after 2012, but instead continued in a different manner. These results suggest that Chinese air quality data, while still useful, should not be taken entirely at face value.


Introduction
For several decades, air quality in China has consistently ranked among the world's worst. Since the beginning of the reform era in 1978, Chinese air pollution has caused tens of millions of deaths [1] and reduced GDP by trillions of dollars [2]. Nonetheless, widespread reporting of real-time air quality data in China began only recently. Today, nearly all Chinese air quality monitoring stations are owned and operated by government officials, which raises questions about the reliability and integrity of the reported data. Because substandard environmental performance can cause local party leaders to be punished and denied promotion, they have a strong incentive to alter the data reported to the public and the central government in a way that understates pollution. The tendency of Chinese officials to misreport environmental data is widely documented. Government statistics misreport fish catch [3], coal use [4], GDP growth [5], and carbon emissions [6]. Government data also understates the rate at which agricultural land is converted to urban areas [7] and the rate at which burning coal emits carbon dioxide [8]. These misrepresentations make it difficult for the international community to assess Chinese compliance with international treaties [9][10][11] and prevent domestic policymakers from accurately gauging environmental impacts. For example, government data indicate that urban air quality improved significantly over the past several years [12][13][14]. While independent data sources confirm this positive trend [15,16], the magnitude of improvements could be exaggerated if the Chinese government misreports the data.
Chinese central leaders increasingly recognize the dangers posed by inaccurate environmental information, and in recent years government reforms have attempted to improve oversight and increase penalties for officials accused of data fraud. However, these efforts to fix institutional problems raise the question: are they effective? In this paper, we analyze hourly PM2.5 data from five Chinese cities to evaluate whether central government efforts to eliminate misreporting of local air quality data have succeeded.

Bureaucratic incentives for local air quality data misreporting in China
In the absence of democratic elections, all local state and party leaders in China are appointed by government officials at higher levels of the political system. The placement and promotion of these officials is determined by the cadre evaluation system (CES), which ranks the performance of local leaders using a formula that weights various 'hard' and 'soft' performance metrics. While the CES has traditionally emphasized economic growth, family planning, and maintaining social order, since 2012 the central government has also affirmed environmental performance as an important 'hard' target [17]. Each year, environmental targets are passed from central to local party leaders, who then sign a 'target responsibility document' with the director of the local environmental protection bureau (EPB). The EPB enforces pollution control measures and collects local pollution data that are then submitted to the central environmental ministry and released to the public [18].
Due to the fragmented nature of China's political system, city-level EPBs effectively serve two 'masters,' each with different policy goals [19,20]. First, city-level EPBs must achieve the targets set by higher-level environmental bureaucrats, who tend to favor stricter pollution control policies. Second, they must also maintain the support of the city's Communist Party standing committee, which typically has close ties with local business leaders and tends to favor more growth-oriented policies. In response to such conflicting demands, city-level EPBs use short-term coping strategies to achieve their bureaucratic goals [21]. With bottom-line objectives prioritized above all else, local bureaucrats face immense pressure to report the 'correct' numbers to their higher-ups, and some resort to colluding with other local officials or misreporting data [22,23]. Given these institutional incentives to cheat, official air pollution data in China often is treated with a high degree of skepticism, by both outside observers and the general public.

Evidence of pre-2012 data tampering
In China, concerted attempts to control local-scale air pollution began in 1996 with the adoption of the National Ambient Air Quality Standards (NAAQS). NAAQS mandated the collection and publication of daily data for the atmospheric concentrations of three pollutants: sulphur dioxide (SO2), nitrogen dioxide (NO2), and suspended particulates with a diameter of 10 microns or less (PM10). These three values were aggregated to form a composite measure of general air quality called the air pollution index (API), which ranged from 0 to 500. By the early 2000s, 86 cities across China reported daily API values to the central environmental ministry (then called the State Environmental Protection Administration, or SEPA), which released the data to the public online [24].
At the end of each year, SEPA used API data to rank the performance of cities from best to worst, with the most important metric being the annual percentage of 'blue sky days' (days with an average API less than 100). Cities with at least 85% blue sky days were awarded full credit for air pollution control and could be designated as a 'National Model City for Environmental Protection.' This designation was intended to spur competition and improve local environmental quality, as city leaders could receive favorable publicity and possibly even increase their odds of promotion [25].
However, the arbitrary dividing line between 'good' and 'bad' air quality created a strong incentive for local leaders to misreport data when API was close to the blue sky day threshold of 100. Distortions around the blue sky threshold were first noted in 2008 by Andrews [26], who analyzed air quality data from Beijing and found a much higher-than-expected frequency of API values just below 100 (and a correspondingly lower-than-expected frequency just above 100). These distortions were confirmed in a 2012 study by Chen et al. [25], who used daily API data from 37 Chinese cities and found a statistically significant discontinuity at the blue sky threshold of 100. Expanding on this work, Ghanem and Zhang [27] tested air quality data from 86 Chinese cities between 2001 and 2010, finding sharp discontinuities at the blue sky threshold for roughly half the cities in their dataset. Notably, these discontinuities were more pronounced on days with low wind speed and high visibility, which suggests that city officials were more likely to misreport data on days when pollution was harder to visibly detect.

Post-2012 government reforms: The end of local air quality data misreporting?
Beginning in 2012, the central government restructured the country's air quality monitoring system, in part to discourage local officials from misreporting data [28,29]. The number of monitored cities was increased from 86 to 363, and API was replaced with the more sophisticated and comprehensive Air Quality Index (AQI), which added ground-level ozone (O3), carbon monoxide (CO), and suspended particulates with a diameter of 2.5 microns or less (PM2.5). Additionally, after a two-year transition period, all cities were required to report hourly pollution concentrations instead of daily averages, with all data relayed directly from local monitoring stations to the central environmental ministry without any handling by city EPBs. Finally, the blue sky day metric was officially discontinued in 2013. Under the revised Air Pollution Prevention and Control Action Plan (APPCAP), cities were instead judged by their ability to reduce average annual particulate concentrations between 2012 and 2017 [30].
Early reports suggested that these reforms reduced the misreporting of pollution data. Applying Benford's Law (a statistical benchmark often used to detect financial fraud) to observations between 2008 and 2012, Stoerk [31] found that differences between concentrations of particulate matter reported by government-controlled monitoring stations and the U.S. embassy in downtown Beijing suggested manipulation. However, including observations for 2013 changed the results to suggest that manipulation stopped in that year. Similarly, using data collected between January 2013 and December 2015, Liang et al. [32] showed that hourly concentrations of PM 2.5 reported by government-controlled monitoring were not lower than concentrations reported at U.S.-controlled stations in five Chinese cities. Together, these results suggest that misreporting of air quality data no longer is an acute problem, at least in China's largest megacities.
Contrary to these recent findings, we postulate that some government measurements of local air quality are still being misreported. However, the form of misreporting has changed: instead of manipulating values around a given threshold, Chinese officials are now more likely to understate pollution during periods when concentrations are high. We test this hypothesis by estimating the relation between hourly PM2.5 concentrations measured by U.S. embassies and consulates in five Chinese cities (which we assume to be reported accurately) and concentrations reported by municipal governments in those same five cities. We choose PM2.5 because it is the only pollutant measured by stations controlled by both the U.S. and Chinese governments.

Air quality data
Since January 1, 2015, China's central environmental ministry (now called the Ministry of Ecology and Environment, or MEE) has published continuous, real-time hourly measurements of PM2.5 in 363 Chinese cities. These data are available to the public for 48 hours, after which they are removed from the MEE website. To obtain these deleted data, we use scraped and archived data from Beijing Sinaapp (publicly available at <https://beijingair.sinaapp.com> [33]), which is the only website to preserve continuous hourly measurements of PM2.5 concentrations for the five cities in our sample. These data, and all other air quality data used in our paper, are available for public use, and our research complies with the websites' terms and conditions. Each city has several local government-controlled monitoring stations. To create a single value for each city that can be compared to the single U.S. station, we create an average hourly value from values reported by individual Chinese stations in each city. This average weights measurements from individual stations based on the inverse of their distance from the station controlled by the U.S. embassy as follows:
$$GOVT_t = \frac{\sum_{i} z_{it} / d_i}{\sum_{i} 1 / d_i} \qquad (1)$$
in which $z_{it}$ represents the hourly PM2.5 concentration at a given local government-controlled monitoring station $i$ during hour $t$, and $d_i$ represents that station's distance (in km) from the city's U.S. embassy or consulate. This inverse distance weighting (IDW) gives the largest weight to government-controlled stations closest to the U.S. embassy. To test the degree to which our findings are robust to the weighting scheme used in Eq (1), we repeat the analysis using inverse quadratic distance weighting (i.e. $1/d_i^2$). The results of this alternative weighting specification, which are described in S1-S3 Tables in S1 File, do not affect our conclusions in a substantive manner.
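The weighting scheme described above can be illustrated with a minimal Python sketch. The function name `idw_average`, the example readings and distances, and the choice to skip stations with a missing reading are all assumptions made for illustration; they are not taken from the paper or its data.

```python
def idw_average(readings, distances_km, power=1):
    """Inverse-distance-weighted mean of station readings for one hour.

    readings: PM2.5 values (ug/m3) from government-controlled stations;
              None marks a missing reading.
    distances_km: distance of each station from the U.S. station (> 0).
    power: 1 for inverse distance, 2 for inverse quadratic distance.
    """
    num = 0.0
    den = 0.0
    for z, d in zip(readings, distances_km):
        if z is None:  # assumption: stations with no reading are skipped
            continue
        w = 1.0 / d ** power
        num += w * z
        den += w
    if den == 0.0:
        raise ValueError("no valid readings for this hour")
    return num / den


# Hypothetical hour: three stations located 2, 5 and 10 km from the embassy.
hour_readings = [80.0, 100.0, 120.0]
station_km = [2.0, 5.0, 10.0]
govt_t = idw_average(hour_readings, station_km)                # inverse distance
govt_t_quadratic = idw_average(hour_readings, station_km, 2)   # inverse quadratic
```

With these made-up inputs the quadratic weighting pulls the city value further toward the nearest station, which is the sensitivity that the alternative specification is meant to probe.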
Observations for hourly PM2.5 concentrations at stations operated by U.S. embassies are obtained from the U.S. State Department, and are publicly available at <https://china.usembassy-china.org.cn> [34]. Measurements at U.S. embassies extend through June 30, 2017, which allows us to compare hourly PM2.5 concentrations during a 30-month period (January 2015-June 2017) when the two datasets overlap.
As described in S4 Table in S1 File, each city contains eight to twelve stations that generally are within 10 km of the station controlled by the U.S. government. The large number of stations per city implies that conditions unique to a single station have relatively little effect on the city average. The importance of local weather conditions is reduced further by the proximity of stations and locations that 'surround' the station controlled by the U.S. government.
We also collect hourly PM2.5 concentration data from 74 government-controlled monitoring stations in Taiwan, publicly available at <https://airtw.epa.gov.tw/ENG/default.aspx> [35], which we use to test our methodology for Type I errors (Section 3.5).

Statistical methodology
Consistent with our hypothesis that officials report lower values when concentrations are high, we identify the hours when the relation between measurements at embassy-controlled and government-controlled monitoring stations changes in a statistically significant fashion using the following general equation:
$$EMB_t = \alpha_1 + \beta_1 GOVT_t + \beta_2 IMP_t + \varepsilon_t \qquad (2)$$
in which $EMB_t$ is the concentration of PM2.5 (μg/m³) measured by U.S. embassy-controlled stations during hour $t$; $GOVT_t$ is the corresponding inverse-distance-weighted average concentration measured by local government-controlled stations; $IMP_t$ is an impulse that identifies a statistically significant change in this relation at hour $t$; $\alpha_1$, $\beta_1$, and $\beta_2$ are regression coefficients; and $\varepsilon_t$ is a random, heteroskedastic regression residual.
In the absence of misreporting and of geographical or meteorological differences between stations, we expect $\alpha_1 = 0$ and $\beta_1 = 1.0$. These expectations are strongly rejected by estimating the relation across all five cities: $\hat{\alpha}_1 = 4.42$ (t = 38.1, p < 0.000001) and $\hat{\beta}_1 = 1.016$ (t = -13.9, p < 0.000001). Alone, these results do not indicate misreporting because measurements can be affected by the instruments used to measure concentrations as well as geographic and meteorological differences between stations. For example, U.S. embassies and consulates are typically located in the urban core of cities, whereas government-controlled monitoring stations are diffused throughout each city, and so embassy-controlled stations are likely to record slightly higher pollution values (e.g. $\alpha_1 > 0$ and/or $\beta_1 > 1.0$). As such, empirical estimates of $\hat{\alpha}_1$ and $\hat{\beta}_1$ alone cannot be used to detect misreporting.
Misreporting could be detected by testing the relation $EMB_t = \alpha_1 + \beta_1 GOVT_t + \varepsilon_t$ for one or more changes in $\alpha_1$ and/or $\beta_1$ [36,37]. However, this approach is not well suited for testing our hypothesis because change points identify systemic changes that continue over an extended period. For example, if one or both of the instruments used to measure PM2.5 is not maintained correctly or replaced, the relation between measurements could change for an extended period, and this change would likely alter the intercept $\alpha_1$. Furthermore, extended changes in $\alpha_1$ and/or $\beta_1$ are inconsistent with our hypothesis because systematic changes would be easier to detect than changes in individual measurements.
To detect changes for short periods when the value of $EMB_t$ is high, we focus on the quantity, sign, and timing of values of $\beta_2$ in Eq (2). Values of $\beta_2$ that are statistically different from zero identify hours when the relation between measurements reported by Chinese and U.S. controlled stations differs from the relation that prevails during most other hours. This approach is flexible because it contains no a priori assumptions about misreporting. Misreporting can occur at any time, be positive or negative, and have any magnitude. These hourly divergences are termed impulses, and they can be used to detect misreporting by testing the following hypotheses:
• Null Hypothesis #1: The number of impulses is consistent with random chance. Rejecting this null hypothesis would indicate the data are being misreported.
• Null Hypothesis #2: The number of negative impulses equals the number of positive impulses. Rejecting this null hypothesis would indicate the data are being misreported in a specific direction.
• Null Hypothesis #3: The timing of the impulses is random. Rejecting the third null hypothesis would indicate that data are being misreported during strategically important hours.
To evaluate the degree to which our methodology is robust, we check for Type I and Type II errors by testing hypotheses #4 and #5:
• Null Hypothesis #4: Using similar hourly PM2.5 concentration data from Taiwan, where no misreporting is suspected, the number of impulses is consistent with random chance. Rejecting this null hypothesis would indicate that the methodology is prone to false positives (Type I errors) and is therefore not well suited to detecting whether the Chinese government misreports data.
• Null Hypothesis #5: Using data from two Chinese cities where local officials were caught manipulating data at specific times and locations, the number of impulses during this period is consistent with random chance. Failing to reject this null hypothesis would indicate that the IIS methodology is not well suited to detecting misreporting when and where it is already known to have occurred (Type II error).
The impulses in Eq (2) that are used to test these hypotheses are identified using an econometric technique called impulse indicator saturation (IIS) [38,39]. IIS creates an impulse (a zero-one dummy variable) for every hourly observation in a sample. Impulses are then dropped or retained in two ways: (1) irrelevant impulses are dropped (gauge) based on a specified nominal significance level, and (2) relevant impulses are retained (potency) near the theoretical average power, based on a significance level that is specified by the user. We specify a significance level of p = 0.01 because it reduces the likelihood that the impulses are chosen by random chance, but still identifies a sufficient number of observations via random chance that the results can be evaluated statistically (see Eq (3)). The asymptotics of IIS are thoroughly documented [40].
Although developed by statisticians to analyze econometric data, the IIS procedure is used well beyond economics. For example, the IIS procedure is used to identify the timing of the so-called hiatus in climate change [41], and also to identify periods when speculation and policy changes move oil prices away from market fundamentals [42].
To identify significant impulses in our dataset, we use the IIS procedure to estimate Eq (2) using the R-package gets (https://cran.r-project.org/web/packages/gets/index.html). Because this software cannot analyze the large number of hourly observations (~20,000) for each city, we break the thirty-month sample into fourteen subsamples, each containing roughly 1,500 hourly observations. Results generated by analyzing these subsamples are not affected by the number of splits, or by choosing unequal splits [43]. Similarly, the IIS methodology is able to analyze both stationary and non-stationary autoregressive processes without bias [44].
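The logic of impulse saturation can be illustrated with a simplified split-half sketch in Python. This is not the algorithm implemented in the gets package, which uses a more elaborate block and multi-path search; it is a teaching version under stated assumptions. It relies on the fact that a zero-one dummy absorbs its own observation, so the dummy coefficient equals that hour's prediction error from a regression fitted on the remaining hours. The function names, the critical value of 2.576 (a normal approximation to two-sided p = 0.01), and the synthetic data are all hypothetical.

```python
import random


def fit_line(x, y):
    """OLS fit of y = a + b*x; returns (a, b, sigma, xbar, sxx, m)."""
    m = len(x)
    xbar = sum(x) / m
    ybar = sum(y) / m
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    return a, b, (rss / (m - 2)) ** 0.5, xbar, sxx, m


def dummy_tstats(x, y, dummied):
    """Coefficient and t-statistic of a zero-one dummy on each index in `dummied`.

    The intercept and slope are fitted on the hours without a dummy; each
    dummy coefficient is the prediction error of its own hour.
    """
    keep = [i for i in range(len(x)) if i not in dummied]
    a, b, sigma, xbar, sxx, m = fit_line([x[i] for i in keep],
                                         [y[i] for i in keep])
    out = {}
    for i in dummied:
        coef = y[i] - (a + b * x[i])
        se = sigma * (1 + 1 / m + (x[i] - xbar) ** 2 / sxx) ** 0.5
        out[i] = (coef, coef / se)
    return out


def iis_split_half(x, y, crit=2.576):
    """Simplified split-half IIS; returns {hour index: impulse coefficient}."""
    n = len(x)
    blocks = [set(range(0, n // 2)), set(range(n // 2, n))]
    retained = set()
    for block in blocks:  # saturate one half at a time
        stats = dummy_tstats(x, y, block)
        retained |= {i for i, (_, t) in stats.items() if abs(t) > crit}
    # Final step: re-estimate with only the union of retained impulses.
    final = dummy_tstats(x, y, retained)
    return {i: coef for i, (coef, t) in final.items() if abs(t) > crit}


# Synthetic example (not the paper's data): EMB = 4 + 1.0*GOVT + noise,
# with three hypothetical hours in which GOVT understates EMB by 60 ug/m3.
random.seed(0)
n = 400
govt = [random.uniform(20, 300) for _ in range(n)]
emb = [4.0 + 1.0 * g + random.gauss(0, 5) for g in govt]
for i in (50, 120, 300):
    emb[i] += 60.0
impulses = iis_split_half(govt, emb)
```

In this synthetic sample the three injected hours should be retained with positive coefficients, alongside a small number of chance impulses of the kind that the gauge is meant to control.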
All of our tests are based on the null hypothesis that random chance generates the impulses. Using a significance level of p = 0.01 implies that random chance would identify one impulse for every 100 hourly observations. We test the null hypothesis that the number of impulses retained is not different from the number expected based on random chance with a test statistic, $S_{prop}$, developed specifically for the IIS methodology [45] (Eq (3)), in which $\tilde{\gamma}_c$ is the observed proportion of impulses and $\gamma_c$ denotes the gauge (the expected proportion of detected impulses under the null hypothesis of no impulses) in the initial step of the impulse indicator saturation algorithm. The normal approximation to the gauge is first established by selecting the fixed cut-off $c$ to control the frequency of wrongly detected impulses as the sample size $n$ increases. Under the null hypothesis of no impulses in the model, the expected and observed proportion of impulses should be equal and the $S_{prop}$ statistic (with appropriate scaling) follows a standard normal distribution. Rejecting the null hypothesis using the $S_{prop}$ statistic provides evidence that the observed proportion of impulses is significantly different from $\gamma_c = 0.01$, which is the proportion expected by random chance. Rejecting the null hypothesis implies one of two alternatives: large but infrequent shifts in measuring equipment and/or meteorological conditions, or data misreporting.
We choose between these two possibilities based in part on the signs of $\hat{\beta}_2$ (Eq (2)). Under the null hypothesis of no misreporting, there is no a priori reason to expect more positive than negative impulses. However, if government officials periodically underreport hourly PM2.5 concentrations relative to the 'true' values measured by U.S. embassies and consulates, we would expect more positive impulses than negative impulses, because a positive impulse identifies an hour when the PM2.5 concentration reported by the U.S. embassy is significantly greater (p < 0.01) than implied by the corresponding Chinese station, as given by $(\hat{\alpha}_1 + \hat{\beta}_1 GOVT_t)$. We test the null hypothesis that the numbers of positive and negative impulses are equal with a z-statistic for the proportion of positive impulses ($\hat{\beta}_{2t}^{+} = \hat{\beta}_{2t}^{-}$) and with a t-statistic for the null hypothesis that the mean value of the impulses ($\bar{\beta}_2$) equals zero.
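The two directional tests can be sketched as follows. This is a minimal illustration that assumes a standard normal approximation for the sign proportion and an ordinary one-sample t-statistic; the function names and the list of impulse coefficients are hypothetical and are not values from the paper.

```python
import math


def sign_z(n_positive, n_negative):
    """z-statistic for H0: positive and negative impulses are equally likely."""
    n = n_positive + n_negative
    p_hat = n_positive / n
    return (p_hat - 0.5) / math.sqrt(0.25 / n)


def mean_t(coefs):
    """One-sample t-statistic for H0: the mean impulse coefficient is zero."""
    n = len(coefs)
    mean = sum(coefs) / n
    var = sum((c - mean) ** 2 for c in coefs) / (n - 1)
    return mean / math.sqrt(var / n)


# Hypothetical impulse coefficients (ug/m3), for illustration only.
coefs = [35.0, 42.0, -28.0, 51.0, 30.0, -33.0, 47.0, 39.0, 44.0, -31.0]
z = sign_z(sum(c > 0 for c in coefs), sum(c < 0 for c in coefs))
t = mean_t(coefs)
```

A large positive `z` would indicate a preponderance of positive impulses, and a large positive `t` would indicate that the average impulse is above zero; with only ten made-up values neither statistic is meant to be interpreted.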
Finally, we evaluate the null hypothesis that random chance generates the impulses by testing whether impulses are distributed randomly throughout the sample. If local Chinese officials misreport measurements, they are more likely to understate pollution when true PM2.5 concentrations are high. Under this alternative hypothesis, we would expect a positive relation between the observed impulses ($\beta_2$) and PM2.5 concentrations measured at U.S. embassies. We test this third hypothesis by estimating a logistic regression given by Eq (4):
$$\ln\left(\frac{\Pr(Impulse_t^{+} = 1)}{1 - \Pr(Impulse_t^{+} = 1)}\right) = \alpha_2 + \beta_3 EMB_t \qquad (4)$$
in which $Impulse_t^{+}$ is a binary variable that equals one for hours when $\beta_2 > 0$ (positive impulses). We also estimate a second logistic regression in which the dependent variable equals one for hours when any impulse, either positive or negative, is present ($\beta_2 \neq 0$). For Eq (4), positive values of $\beta_3$ indicate that high concentrations (as measured by U.S. embassies) increase the likelihood of a positive impulse. This would suggest that Chinese government officials underreport concentrations of PM2.5 during periods of heavy pollution, a result which would be consistent with purposeful misreporting. We calculate the threshold (X) at which local officials are likely to understate concentrations as follows:
$$X = -\frac{\alpha_2}{\beta_3} \qquad (5)$$
where X is the PM2.5 concentration at which the likelihood of a positive impulse reaches 50%, and $\alpha_2$ and $\beta_3$ are regression coefficients from the logit model (Eq (4)).
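The 50% threshold follows from setting the log-odds in the logit model to zero. The short Python sketch below works through that arithmetic, assuming the logit takes the usual form with intercept $\alpha_2$ and slope $\beta_3$ on the embassy concentration. The coefficient values are hypothetical and chosen only to give a round number; they are not the estimates reported in the paper.

```python
import math


def positive_impulse_probability(emb, alpha2, beta3):
    """Fitted probability of a positive impulse at embassy concentration `emb`."""
    return 1.0 / (1.0 + math.exp(-(alpha2 + beta3 * emb)))


def misreporting_threshold(alpha2, beta3, prob=0.5):
    """Concentration X at which the fitted probability equals `prob`.

    For prob = 0.5 the log-odds are zero, so X = -alpha2 / beta3.
    """
    log_odds = math.log(prob / (1.0 - prob))
    return (log_odds - alpha2) / beta3


# Hypothetical logit coefficients, for illustration only.
alpha2, beta3 = -6.0, 0.012
x50 = misreporting_threshold(alpha2, beta3)
p_at_x50 = positive_impulse_probability(x50, alpha2, beta3)
```

The `prob` argument is an added convenience so that other cut-offs can be explored; the text itself only uses the 50% point.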

Null hypothesis #1: Differences between Chinese and U.S. station measures are generated by random chance
Strong evidence for data misreporting is provided by the higher-than-expected frequency of impulses in Eq (2). Using a significance level of 0.01 to identify impulses implies that random chance will generate roughly 1,012 impulses from the 101,245 hourly observations across all five sample cities. However, using this pooled sample to estimate Eq (2) identifies 1,390 impulses, roughly 40% more than expected by random chance. This greater-than-expected number of impulses also is present in four of the five cities (except Shanghai) when tested individually. These results are confirmed statistically by the positive and significant values for the $S_{prop}$ statistic (Section 2.2), which indicate that random chance likely did not generate the large number of impulses estimated from Eq (2) (Table 1). This suggests that the higher-than-expected frequency of observed impulses in four of the five tested cities is caused by non-random weather patterns, changes/poor maintenance of monitoring equipment, and/or purposeful misreporting by local Chinese officials. Two graphical examples of specific 24-hour and 36-hour time periods with high concentrations of significant impulses (in Beijing and Shenyang respectively) are provided in S1 and S2 Figs in S1 File.
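The pooled counts quoted above can be checked with a back-of-the-envelope calculation. The sketch below uses a plain binomial normal approximation as a stand-in; it is not the $S_{prop}$ statistic of [45], whose scaling differs, so the resulting z-value is only indicative. The function name is hypothetical, while the counts (1,390 impulses, 101,245 hours, gauge 0.01) are those given in the text.

```python
import math


def impulse_proportion_z(n_impulses, n_obs, gauge=0.01):
    """Expected count, excess share, and a simple binomial z for the impulse rate."""
    expected = gauge * n_obs
    excess_share = n_impulses / expected - 1.0
    p_hat = n_impulses / n_obs
    z = (p_hat - gauge) / math.sqrt(gauge * (1.0 - gauge) / n_obs)
    return expected, excess_share, z


expected, excess_share, z = impulse_proportion_z(1390, 101245)
```

Under this approximation the expected count is about 1,012, the excess is a little under 40%, and the z-value is far beyond conventional critical values, in line with the direction of the result reported in Table 1.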

Null hypothesis #2: The number of positive and negative impulses are equal
While a higher-than-expected frequency of impulses can signify a non-random data generation process, the signs associated with $\beta_2$ also contain important information about how they are generated. If random chance generates the impulses, the number of positive and negative impulses should be roughly equal. The specification of Eq (2) makes $\beta_2 > 0$ when measurements at stations controlled by the Chinese government are significantly lower than the value implied by the corresponding measurement at the U.S. embassy-controlled station. In the pooled sample (and in four of the five individual cities), more than 63% of the observed impulses are positive. A z-statistic (Table 2) indicates that, for the pooled model and all individual cities except for Shanghai, it is highly unlikely (p < 0.01) that random chance generates the preponderance of positive impulses. This result is confirmed by a two-tailed, one sample t-statistic (Table 2) which evaluates the null hypothesis that the mean coefficient value of significant impulses is zero. Thus, not only are impulses more frequent than expected but, when they do occur, government-controlled stations are far more likely than U.S. embassy-controlled stations to report concentrations lower than the values implied by $EMB_t = \alpha_1 + \beta_1 GOVT_t + \varepsilon_t$. Although, by itself, this result does not reject the possibility that the high frequency of impulses is generated by weather conditions and/or changes in/poor maintenance of monitoring equipment, the preponderance of positive impulses suggests that the observed divergences are directional, which is clearly consistent with the hypothesis that local Chinese officials misreport data in ways that understate population exposure to high concentrations of PM2.5.

PLOS ONE

Null hypothesis #3: The timing of positive impulses is random
If random chance generates the excessive number of positive impulses, they would likely occur randomly throughout the sample. If impulses are generated by non-random weather patterns or changes/poor maintenance of monitoring equipment, the excessive number of positive impulses would likely cluster during periods throughout the sample. By contrast, if the excessive number of positive impulses is generated by Chinese officials purposefully understating measurements, positive impulses are likely to be positively correlated with concentrations measured at U.S. embassies. That is, because officials want to lower annual average pollution concentrations, the greatest incentive to understate pollution occurs during periods when 'true' PM2.5 concentrations are unusually high. Consistent with this notion, we find a strong positive correlation between observed impulses and hourly PM2.5 concentrations measured at U.S. embassies. This positive correlation is evident for the pooled sample (Fig 1), and for each of the five individual cities (Table 3). The positive correlation suggested by Fig 1 and Table 3 is confirmed by the results of the logit model (Eq (4)). The positive coefficients ($\beta_3$) associated with PM2.5 concentrations measured at U.S. stations indicate that higher concentrations of PM2.5 increase the likelihood of positive impulses (Table 3). These results are confirmed by a second logit model in which the dependent variable assumes a value of one for every significant impulse, regardless of whether it is positive or negative (S5 Table in S1 File). The PM2.5 concentration at which the likelihood of a positive impulse surpasses 50% (which we term the 50% misreporting threshold) is 502 μg/m³ in the pooled sample, and ranges from 180 μg/m³ to 625 μg/m³ for individual cities.
To approximate the magnitude of misreporting, we compare average reported PM2.5 values between Chinese and U.S.-controlled stations for all positive impulses above the 50% misreporting threshold (Fig 2). The results show that, during hours with significant positive impulses, Chinese-controlled stations report concentrations that are between 63 μg/m³ and 304 μg/m³ lower than those reported by U.S.-controlled stations. In percentage terms, values reported by Chinese stations are between 18% (Shanghai) and 41% (Guangzhou) lower than values reported by U.S. stations during these hours.
Together, the results for hypothesis #3 are consistent with strategic efforts to hide hours with unusually high pollution levels. Conversely, this finding is inconsistent with the 'clustered pattern' that would result if the impulses were generated by the poor performance of the instruments used to measure PM2.5 and/or local weather.

Robustness checks: Testing the IIS methodology
The impulse indicator saturation technique identifies periods when government-controlled monitoring stations underreport PM2.5 concentrations in a systematic, non-random fashion. These results could be caused by local Chinese officials misreporting high concentrations of PM2.5 and/or biases in our methodology. To assess the degree to which the IIS technique is prone to false positives (Type I errors) or false negatives (Type II errors), we apply the same methodology to samples in which we 'know' the 'correct' results.
To test for Type I errors, we analyze hourly measurements of air pollution from Taiwan. We assume that Taiwanese officials do not misreport pollution data because Taiwan exhibits greater levels of environmental transparency and because all monitoring stations are controlled directly by the central government. Under these conditions, our methodology should not detect large numbers of positive impulses when concentrations are high. If our methodology detects widespread evidence of data misreporting in Taiwan, this would suggest that our statistical methodology is prone to false positives. Conversely, we test for Type II errors by analyzing data in cities where the Chinese MEE admits that air quality data were manipulated by local officials at known times and at known locations. For these times and locations, our methodology should detect large numbers of positive impulses when embassy-reported PM2.5 concentrations are high. If our methodology does not detect evidence of data misreporting when it is known to have occurred, this would suggest that our statistical methodology is prone to false negatives.
Fig 1. The coefficient values of all significant impulses were estimated using Eq (2) for Beijing (purple), Shenyang (red), Shanghai (blue), Guangzhou (yellow), and Chengdu (green). Statistical information about these relations is given in Table 3. https://doi.org/10.1371/journal.pone.0249063.g001

Null hypothesis #4: Measurement errors in Taiwan are caused by random chance
Because U.S. diplomatic outposts in Taiwan do not report air quality data, we test our methodology by choosing the five pairs of Taiwanese monitoring stations (out of 74 total stations) that are separated by the shortest geographical distance (average distance ≈ 4.3 km). Although we cannot assume that either of the two government-controlled monitoring stations in a pair represents 'true' pollution values, any misreporting at either station will still cause a significant impulse, as long as both stations do not misreport air quality data simultaneously and by the same magnitude. The null hypothesis of no misreporting implies that our methodology should identify impulses at a frequency approximately equal to the significance level of the criterion used to identify impulses (p = 0.01). Consistent with this notion, our results fail to reject the null hypothesis that the frequency of observed impulses is generated by random chance. None of the S prop statistics in Table 4 is significant at the p < 0.05 level, which suggests that the number of impulses retained by the IIS technique is not significantly greater than the number that would be generated by random chance. The null hypothesis comes closest to rejection for Pair 1 (p = 0.069), but in this case there were fewer impulses than expected by random chance, rather than more as we would expect with purposeful misreporting.
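The station pairing described at the start of this subsection can be illustrated with a short sketch. It ranks all station pairs by great-circle (haversine) distance and keeps the closest few. The function names, the input layout (a dictionary of station name to latitude and longitude), and the choice to let a station appear in more than one pair are assumptions made for illustration; the text does not state how distances were computed or whether pairs were required to be disjoint.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def closest_pairs(stations, k=5):
    """Return the k closest station pairs as (distance_km, name_a, name_b).

    `stations` maps a station name to a (lat, lon) tuple. Hypothetical
    helper: pairs are allowed to share a station, which is an assumption.
    """
    ranked = sorted(
        (haversine_km(stations[p], stations[q]), p, q)
        for p, q in combinations(sorted(stations), 2)
    )
    return ranked[:k]
```

With 74 stations this evaluates 2,701 candidate pairs, so a brute-force ranking is adequate.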
These results suggest that the methodology used to analyze concentrations of PM2.5 in Chinese cities is not prone to Type I errors. As such, false positives probably do not cause us to (incorrectly) conclude that the Chinese government understates concentrations of PM2.5 during periods of high concentrations. Furthermore, the failure to reject the null hypothesis using Taiwan data also suggests that errors/changes in instrumentation and/or local variations in weather are not responsible for the preponderance of impulses found in four of the five mainland Chinese cities analyzed.
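The logic of the frequency comparison in this subsection can be sketched as a simple proportion test: under the null hypothesis, the share of hours flagged as impulses should be close to the 0.01 significance level used to retain them. The sketch below uses a two-sided normal-approximation z-test. This is an illustrative stand-in, not necessarily the S prop statistic reported in Table 4, and the function name and arguments are hypothetical.

```python
import math


def impulse_frequency_test(n_impulses, n_hours, alpha=0.01):
    """Two-sided z-test of H0: share of impulse hours equals alpha.

    Returns (observed_share, z, p_value). Uses the normal approximation
    to the binomial, which is an assumption that is reasonable only when
    n_hours * alpha is not small.
    """
    if n_hours <= 0 or not 0 <= n_impulses <= n_hours:
        raise ValueError("need 0 <= n_impulses <= n_hours and n_hours > 0")
    share = n_impulses / n_hours
    se = math.sqrt(alpha * (1 - alpha) / n_hours)  # std. error under H0
    z = (share - alpha) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))     # equals 2 * (1 - Phi(|z|))
    return share, z, p_value
```

A negative z indicates fewer impulses than chance would produce, which is the situation described for Pair 1; a large positive z is what purposeful misreporting would be expected to generate.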

Null hypothesis #5: Government misreporting causes the large number of positive impulses in Chinese cities
To test whether our methodology correctly identifies instances of data falsification in Chinese cities, we test for Type II errors by analyzing data from cities where local officials admit to underreporting measurements for PM2.5. In both Xinyu City of Jiangxi Province and Xinyang City of Henan Province, the local EPB director hired individuals to falsify air quality measurements during September and October of 2017 (the exact dates and hours of tampering were not released by the MEE). According to the MEE, both instances of manipulation involved physical tampering with air quality monitors, including stuffing cotton yarn into sensors and spraying them with mist from 'fog gun cars' [46]. More importantly for the test of our methodology, officials manipulated only one monitoring station in each city, leaving all other stations unaffected.
For both Xinyu and Xinyang, we use the IIS methodology to compare hourly PM2.5 measurements from the corrupted monitoring station to an average of hourly measurements from the surrounding, non-corrupted monitoring stations. To identify periods when data may be misreported, we break the full sample (January 2016 to December 2018) into eighteen two-month subsamples. If our methodology is not prone to Type II errors, the period when local officials misreported data (the September-October 2017 subsample) should contain more impulses than expected by random chance. It should also contain more positive than negative impulses. Consistent with these predictions, test statistics reject the null hypothesis that the number of impulses is consistent with random chance for both cities in the September-October 2017 subsample (Table 5). Similarly, test statistics reject the null hypothesis that the mean value of impulses is equal to zero. Instead, the number of positive impulses is significantly higher than expected in both cities.
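As a rough illustration of the subsample design, the sketch below maps a timestamp to one of the two-month subsamples and applies an exact two-sided sign test to counts of positive and negative impulses. The sign test is a simpler stand-in for the test of a zero mean impulse value and is not the statistic reported in Table 5. The function names `subsample_index` and `sign_test_p` are hypothetical.

```python
from datetime import datetime
from math import comb


def subsample_index(ts, start_year=2016, start_month=1, width_months=2):
    """Zero-based index of the fixed-width subsample that contains ts."""
    months = (ts.year - start_year) * 12 + (ts.month - start_month)
    if months < 0:
        raise ValueError("timestamp precedes the start of the sample")
    return months // width_months


def sign_test_p(n_positive, n_negative):
    """Exact two-sided sign test of H0: positive and negative impulses
    are equally likely. Returns the p-value (1.0 if there are no impulses).
    """
    n = n_positive + n_negative
    k = max(n_positive, n_negative)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With the sample starting in January 2016, hours in September and October 2017 both map to index 10, and December 2018 maps to index 17, which gives the eighteen subsamples described above.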
Furthermore, the test of our methodology expands our understanding of what happened in Xinyu and Xinyang. None of the subsamples after October 2017 rejects either of the two null hypotheses at a p < 0.05 significance level, which suggests that local officials stopped misreporting data after being caught and punished by the central government. Conversely, both null hypotheses are rejected in some of the subsamples before September 2017 (such as January 2015-February 2015 in Xinyang City and January 2016-February 2016 in Xinyu City). Together, these results suggest that local officials in Xinyu and Xinyang were misreporting air quality data well before the period in which they were caught. Finally, the small number of impulses observed in the periods after misreporting was detected also suggests that errors/changes in instrumentation and/or local variations in weather are not responsible for causing the high frequency of impulses present in our main dataset.

Discussion
Our results strongly suggest that some local Chinese officials continued to misreport measurements of PM2.5 concentrations in many of the country's largest megacities, even after the government's post-2012 policy reforms. Consistent with our findings, in early 2018 the MEE announced that it had caught officials in seven cities manipulating data during the previous year [46]. Our findings of ongoing air quality data misreporting in China are not surprising, because the government's post-2012 reforms did not eliminate incentives for local officials to cheat. Although requiring hourly, real-time measurements and abolishing the blue sky day metric eliminated manipulation around a given API threshold, local EPBs still face enormous pressure to report pollutant concentrations that decline continuously year-over-year. This pressure is compounded by the fact that the central government has increased penalties for local cadres in failing cities without also increasing the flow of centrally-backed resources or financial support [48]. Thus, faced with increasingly difficult attainment targets and a persistent lack of resources, some local officials have taken the path of least resistance by continuing to misreport air quality data.
Nonetheless, the persistence of local data misreporting does not invalidate results that suggest urban air quality in China has improved in recent years [12][13][14][15][16]. Even measurements from U.S. embassies and consulates show that annual concentrations of PM2.5 fell by more than 25% between 2013 and 2017. Although these broader trends are clear, day-to-day air quality numbers remain highly suspect, especially on high pollution days. The fact that air pollution data is less likely to be accurate on highly polluted days is of particular importance, because acute health effects appear to be more strongly related to hourly peak concentrations than daily averages [49]. Moreover, even though nationwide concentrations of certain pollutants are likely decreasing, local data misreporting makes it difficult for central officials to determine which cities are driving these improvements and which are free-riding.
Central leaders are aware of this problem, but until recently their policy responses have been slow and ineffectual. Starting in late 2016, the central government instituted a series of new reforms aimed at improving environmental governance. Specifically, Beijing announced a transition towards a more centralized environmental bureaucracy that seeks to reduce the negative impacts of 'local protectionism' in environmental management [50]. In September 2016, the power to nominate city-level EPB directors was transferred from city governments to provincial EPBs (although city approval is still required to confirm nominees). In addition, at the end of 2020, provincial EPBs (rather than city EPBs) assumed full responsibility for funding and personnel decisions at local monitoring stations [51]. The central government is also increasing its supervision and oversight of local officials, and in July 2017, the MEE began surprise environmental inspections in the Beijing-Tianjin-Hebei region [52]. The party-controlled People's Daily announced that these inspections would become the 'new normal,' and by the end of 2017, central agencies had disciplined nearly 12,000 local officials and assessed more than $130 million in fines [53]. These MEE inspections have been carried out in tandem with the Central Organization Department (COD) and the Central Commission for Discipline Inspection (CCDI), which are the two most important groups in determining the promotion prospects of local officials. Together, these most recent reforms may represent the beginning of an effective strategy to combat misreporting of local air quality. However, we cannot fully evaluate these reforms because they did not begin until late 2016 and 2017, and our sample only runs through June 2017. Under these conditions, further research is needed to determine whether these latest reforms have been more successful in curtailing the misreporting of local pollution data.
Overall, China's post-2012 environmental policy reforms improved the availability of air quality data and helped to reduce pollutant concentrations in urban areas, but they did not dissuade some local officials from misreporting data. Although previous studies erroneously suggest that data misreporting ended after 2012, this paper indicates that it continued, just in different, more difficult-to-detect forms. As the central government continues to implement new efforts to address data fraud, it remains to be seen whether these efforts will be effective or whether they will simply induce local officials to find newer and ever-more-innovative ways to falsify official statistics.