Scope 3 emissions: Data quality and machine learning prediction accuracy

Investors’ sophistication on climate risk is increasing, and as part of this they require high-quality and comprehensive Scope 3 emissions data. Accordingly, we investigate Scope 3 emissions data divergence (across different providers), composition (which Scope 3 categories are reported), and whether machine-learning models can be used to predict Scope 3 emissions for non-reporting firms. We find considerable divergence in the aggregated Scope 3 emissions values from three of the largest data providers (Bloomberg, Refinitiv Eikon, and ISS). The divergence is largest for ISS, as it replaces reported Scope 3 emissions with estimates from its economic input-output and life cycle assessment modelling. With respect to composition, firms generally report an incomplete set of Scope 3 categories, yet they are reporting more categories over time. There is a persistent contrast between relevance and completeness in the composition of Scope 3 emissions across sectors, with low-materiality categories such as travel emissions being reported more frequently than typically high-materiality ones, such as the use of sold products and the processing of sold products. Finally, machine-learning algorithms can improve the prediction accuracy of aggregated Scope 3 emissions by up to 6%, and by up to 25% when each category is estimated individually and then aggregated into total Scope 3 emissions. However, absolute prediction performance is low even with the best models, with the accuracy of estimates primarily limited by the low number of observations in specific Scope 3 categories. We


Introduction
Corporate carbon footprints, a popular proxy for firms' climate transition risks, measure the level of greenhouse gas [GHG] emissions associated with a firm's business activities or products. Corporate carbon footprints provide an indication of how much anthropogenic carbon a company contributes to atmospheric GHGs and to global warming [1,2]. Carbon footprints are preferred by academics and industry practitioners over other climate transition risk rating metrics, and their clear advantage is that they can be converted to dollar losses (using the effective carbon price) or hidden costs (using the future costs of carbon across different transition scenarios) [3]. Carbon footprints help facilitate the implementation of divestment strategies or low-carbon indices (e.g., S&P Carbon Efficient Indices, MSCI Low Carbon Index) by establishing a link between climate risk and financial risk. Although corporate carbon footprints are a popular metric for assessing climate transition risks, carbon emissions data has numerous problems, including limited, inconsistent and inaccurate reporting [4,5].
The GHG Protocol (WRI and WBCSD, 2020) divides carbon emissions into three categories: Scope 1 (direct emissions from sources and assets controlled by the firm), Scope 2 (indirect emissions from purchased electricity), and Scope 3 (indirect emissions from a firm's value chain). Traditionally, the protocol requires all firms to report Scope 1 and Scope 2 emissions, whereas firms have discretion over whether to report Scope 3 emissions and which categories to report. Recently, there have been signals that mandatory disclosure of Scope 3 emissions may be required in some jurisdictions. For instance, a draft rule by the U.S. Securities and Exchange Commission in March 2022 proposes that firms need to disclose emissions generated by their suppliers or partners if they are material or if they are included in any of their emissions targets [6]. Accurately quantifying Scope 3 emissions is therefore critical. Even so, systematically accounting for all emissions along the entire value chain (sometimes up to tens of thousands of firms) to the same level of accuracy is broadly acknowledged to be extremely challenging [7].
Scope 3 has many merits: it covers all of the indirect emissions spanning a firm's full value chain, from acquiring and pre-processing raw materials (upstream) to distributing, storing, using, and disposing of the end products sold to customers (downstream). It captures a significant proportion of many firms' total carbon footprints, especially for firms operating in the energy sector [8,9]. Further, Scope 3 represents the most significant emissions reduction opportunities going forward, and a full assessment of Scope 3 emissions is critical for understanding the end-to-end impacts of carbon taxes and climate policies on individual firms [10]. However, the analyses of firm-level emissions by external stakeholders are usually limited to Scope 1 and Scope 2 emissions [11-13]. This is due to the following three issues associated with Scope 3:

[1] No regulation and lack of clear guidance
Despite recent signals, there are no binding rules on Scope 3 emissions disclosure. Further, related sustainability reporting standards/frameworks such as Global Reporting Initiative, Sustainability Accounting Standards Board, and International Integrated Reporting either remain silent on Scope 3 emissions reporting or fail to provide detailed recommendations on how Scope 3 should be properly disclosed [14]. As measurement and disclosure of Scope 3 are inconsistent and unsystematic, the quality and accuracy of firms' voluntary disclosures remain unclear. Further, given the complexity in calculating Scope 3 emissions and the extensive data collection efforts needed (in particular granular activity-level data from supply chains, which may be business-sensitive), it is not surprising that the reporting of Scope 3 emissions is generally sparse [15].

[2] Incomplete composition/ activity exclusion
Firms are not required to disclose the full composition of Scope 3 emissions across the fifteen distinct Scope 3 categories (see Section 2 for more details), as reporting is on a 'comply-or-explain' basis [15, p. 10]. Thus, using the aggregated Scope 3 emissions data from an incomplete composition can be misleading, as firms may choose to report only areas in which they perform well or that are easier to measure, whilst intentionally ignoring other areas. For example, two firms that have similar value chain emissions and firm characteristics may choose to report different categories (e.g., one may report material Purchased Goods and Services emissions while another may choose to report immaterial Business Travel). It does not make sense to aggregate emissions data with many missing values when firms have the discretion to choose which categories they would like to report and the boundaries they would like to report within. Rather than comparing apples with oranges, one should either look at firms' Scope 3 data at the category level, or replace missing values (i.e., unreported Scope 3 categories) with estimated values before performing any cross-sectional comparisons.

[3] Measurement divergence/ reporting inconsistency
Firms may set different operational boundaries on the same Scope 3 emissions category, report different values across different communication channels (i.e., annual filings, sustainability reports, or through third-party initiatives such as the Carbon Disclosure Project [CDP]), and/or occasionally update (re-state) their reported emissions for past years in later years [14,16]. These issues make it hard for third-party data providers, such as Bloomberg and Refinitiv, to build consensus and provide consistent Scope 3 measures. To be more specific, third-party data providers may collect Scope 3 emissions data from different sources, update restated values at different time frames, and/or make adjustments to the reported values using different proprietary models. Further, differences in scenarios (e.g., from methodological choices in allocation methods, product use assumptions, end-of-life assumptions) and estimation models make Scope 3 data unreliable and difficult to compare among different data providers [17,18]. Researchers and industry practitioners (e.g., asset managers and institutional investors) should be aware of the measurement divergence among third-party data providers when performing analysis or forming investment portfolios using Scope 3 data.
Given the issues mentioned above for disclosing firms, there is an additional need to develop estimation models that employ externally available predictors to cover non-disclosing firms in a broader investment universe. This need arises because traditional approaches to modelling Scope 3 emissions either require very granular activity-level data that are rarely accessible to third-party stakeholders (i.e., the bottom-up life-cycle assessments [LCA]) or employ industry-based metrics to allocate national emissions, which fail to account for heterogeneity among firms (i.e., the top-down environmental input-output models [EIO]) [17]. Although models using simple extrapolation techniques [19,20], multi-variable regression models [21] or out-of-the-box machine learning techniques [5,22,23] are readily available for estimating Scope 1 and Scope 2 emissions, little has been done on Scope 3.
A few attempts are discernible from emissions data providers using a variety of modelling approaches. Some providers employ a bottom-up LCA model and use parameters such as firm activities and emissions factors (Carbon4Finance) [24]. Others employ EIO models using top-down metrics at the industry level (such as Trucost, which uses them to estimate most of its upstream emissions, except for "Transport and distribution", where it collects self-reported data) [25], or combine both models in their workflow (such as ISS with EIO models for upstream and LCA models for downstream emissions) (ISS Methodology, Factset). More recently, CDP employs regression models using metrics at the firm level [26], and Bloomberg employs machine-learning techniques on a subset of oil & gas firms (https://www.bloomberg.com/professional/blog/bloombergs-greenhouse-gas-emissions-estimates-model-a-summary-of-challenges-andmodeling-solutions/; at the time of writing this paper, we have not had access to their modelled Scope 3 data). Unfortunately, many organisations provide limited information on their Scope 3 estimation methods. Furthermore, the prediction performance of these models is often vague. At best, data providers including Bloomberg and ISS disclose a model confidence ranking associated with their estimates; however, the absolute magnitude of their prediction errors is rarely disclosed. The quality and integrity of the estimated Scope 3 datasets are unknown, as evidenced by Busch et al. [16], who discovered that the correlation between estimates of the aggregated Scope 3 values from ISS and Trucost is surprisingly low (16%). The estimation of Scope 3 emissions is important since it helps fill in the gaps (i.e., unreported Scope 3 categories) in data that in turn are used for a variety of financial functions, including portfolio construction [15,24]. However, it is problematic that third-party providers do not disclose the limitations of their estimates, such as inherent prediction errors and data uncertainties.
From the preceding discussion, it is evident that investors' sophistication on climate risk is increasing and, as part of this, they require high-quality and comprehensive Scope 3 data. Accordingly, we investigate Scope 3 emissions data divergence (across different providers), composition (which Scope 3 categories are reported) and whether machine-learning models can be used to predict Scope 3 emissions for non-reporting firms. These three issues are inherently interlinked if investors want to understand the quality of reported and predicted Scope 3 data. More specifically, using data retrieved from Bloomberg, Refinitiv, and ISS, we examined the following research questions: (i) What is the quality of Scope 3 emissions data in terms of measurement divergence between data vendors?; (ii) What is the quality of Scope 3 emissions data in terms of the composition of emissions categories reported by firms?; and (iii) What is the prediction accuracy of Scope 3 emissions estimates using machine learning models for non-disclosing firms? To answer the first question, we looked at the Scope 3 emissions datasets from Bloomberg, Refinitiv, and ISS, and the divergence among these data providers through a three-way reconciliation of aggregated Scope 3 emissions values. To answer the second question, we analysed the composition of Scope 3 emissions from Bloomberg, as this is the only dataset (out of the three in our study) that has a detailed breakdown by category. (Note that both ISS and Eikon report only the aggregated Scope 3 emissions data. CDP also has a breakdown of Scope 3 emissions by category, and it is the source data that is fed into ISS, Eikon and Bloomberg. It would be interesting to perform a comparison of Scope 3 emissions from third-party data providers and the source data such as CDP and company reports.) For each Scope 3 emissions category, we measured its relevance based on its intensities in relation to the aggregated Scope 3 values, then we explored its completeness based on the proportion of firms that choose to disclose this category. (By doing this, the relevance of a category is defined purely by its relative size. CDP has a similar approach to determine the 'relevance' of Scope 3 emissions: https://cdn.cdp.net/cdp-production/cms/guidance_docs/pdfs/000/003/504/original/CDP-technicalnote-scope-3-relevance-by-sector.pdf) To address the third question, we evaluated whether the Scope 3 emissions values could be estimated using top-down business and financial data and whether prediction accuracy could be improved to an acceptable degree using out-of-the-box machine learning techniques. We continued to use the Bloomberg dataset for this part of the analysis, as we aim to predict both aggregated Scope 3 emissions and its component categories.
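The two composition measures described above can be illustrated with a minimal sketch. It assumes that relevance is the pooled share of a category in all reported Scope 3 emissions and that completeness is the share of firms disclosing the category; the firm names and values are hypothetical, and the paper's exact intensity-based definition may differ.

```python
from collections import defaultdict

# Hypothetical disclosures: firm -> {category: reported emissions in tCO2e}.
# An unreported category is simply absent from the firm's dictionary.
firms = {
    "A": {"Business Travel": 10.0, "Use of Sold Products": 990.0},
    "B": {"Business Travel": 5.0},
    "C": {"Business Travel": 20.0, "Purchased Goods and Services": 180.0},
}

def relevance_and_completeness(firms):
    """Return {category: (relevance, completeness)}.

    relevance    = category emissions / all reported Scope 3 emissions (pooled)
    completeness = share of firms that disclose the category
    Both definitions are simplifying assumptions for illustration.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for reported in firms.values():
        for category, value in reported.items():
            totals[category] += value
            counts[category] += 1
    grand_total = sum(totals.values())
    n_firms = len(firms)
    return {c: (totals[c] / grand_total, counts[c] / n_firms) for c in totals}

result = relevance_and_completeness(firms)
for category, (relevance, completeness) in sorted(result.items()):
    print(f"{category}: relevance={relevance:.1%}, completeness={completeness:.1%}")
```

In this toy sample, Business Travel is disclosed by every firm but accounts for a small share of emissions, while Use of Sold Products dominates emissions yet is disclosed by only one firm, which mirrors the contrast between completeness and relevance discussed in the text.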
Our study makes several contributions to the existing literature, and collectively these contributions make this paper the most comprehensive analysis to date of interlinked Scope 3 data challenges. First, we extend the study by Busch et al. [16] that analyses the divergence in third-party carbon emissions datasets for Scope 1, Scope 2 and Scope 3 between 2005 and 2016. Focusing solely on Scope 3 emissions data from 2013 to 2019, we go beyond correlation analysis to quantify the degree of divergence among raw emissions data and to understand the implication of this divergence for emissions rankings (i.e., where firms stand, in terms of carbon emissions, compared to the universe of firms covered in the Scope 3 emissions dataset). Second, we analyse over time and across sectors the completeness of reporting of Scope 3 emissions categories using Bloomberg data between 2010 and 2019. This is an extension of parts of the analysis by Klaaßen and Stoll [14], who examined the impact of incomplete composition/category exclusion for 56 technology firms in 2019. Finally, our paper applies machine-learning algorithms to predict Scope 3 emissions in a similar manner to Serafeim and Velez Caicedo [27]. Due to the difference in the predictor set (excluding Scope 1 and 2 and market capitalization) as compared to Serafeim and Velez Caicedo [27], our model is applicable to a wide universe of public or private firms regardless of their emissions disclosure status. (On the other hand, Serafeim and Velez Caicedo [27] limit the scope of their analysis to publicly listed firms and those that disclose Scope 1 and Scope 2 emissions, by using market capitalization, Scope 1 and Scope 2 emissions. While including Scope 1 and Scope 2 emissions may provide a better representation of Scope 3 patterns, the use of market capitalization as an additional size proxy for Scope 3 emissions lacks a clear baseline. Market capitalization fluctuates on a daily basis and is heavily dependent on investors' supply and demand, making the direct link to a firm's operational activities unclear. It could be argued that other size proxies, such as revenue and total assets, better reflect the impact of operational scales on Scope 3 emissions.) In addition, we include energy consumption in the predictor set, since it has been shown to improve Scope 1 and 2 prediction accuracy [5]. More critically, in contrast to the results of Serafeim and Velez Caicedo [27], we conclude that there are large absolute prediction errors even when machine learning techniques are used to produce Scope 3 emissions estimates (discussed further below). This suggests that similar absolute prediction errors may be inherent in third-party estimation models, especially when they use externally available data to model emissions for thousands of firms. This finding is relevant in the current emissions data landscape, where data providers' methodologies and the errors associated with their proprietary modelling remain opaque.
Our main results are summarized as follows. First, we find that there is considerable divergence in the aggregated Scope 3 emissions values among third-party data providers. When the data provider adjusts reported emissions values with its proprietary models (in this case, ISS), none of its data points are identical to Bloomberg or Refinitiv Eikon (within 1% error), and the correlation values of this dataset with the two other datasets are low (55% and 56%, respectively). However, when data providers use purely reported emissions values without any adjustments (in this case, Bloomberg and Refinitiv Eikon), they still have a surprisingly low proportion of identical data points (only 68%) despite high correlation values (95%). Divergence between reported datasets (Bloomberg and Eikon) is generally of smaller magnitude and has no systematic biases (the trimmed mean absolute percentage error is 4% and the trimmed mean percentage error is <0.01%). Divergence between ISS and Bloomberg (or Refinitiv Eikon) has substantial magnitude and exhibits a systematic upward bias (the trimmed mean absolute percentage error is 111%, and the trimmed mean percentage error is -20%, indicating that emissions values from ISS are systematically higher than those of Bloomberg). (The mean values have been trimmed to 5%-95% due to several outliers in percentage values. See Section 4.1.) This divergence will lead to substantially different low-carbon portfolio constituents if fund managers employ the ISS dataset to rank high/low emitters and adjust their weights accordingly, but the portfolios constructed from Refinitiv Eikon and Bloomberg data should yield quite similar results. Overall, these divergences make it difficult for investors to understand their portfolios' real exposure to climate risks.
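The divergence statistics quoted above (share of identical data points within 1%, trimmed mean percentage error, trimmed mean absolute percentage error) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the sign convention (other minus reference, divided by reference), the symmetric 5% trimming rule, the function names and the sample values are all hypothetical.

```python
from statistics import mean

def percentage_errors(reference, other):
    """Signed percentage errors of `other` relative to `reference`.

    Assumed convention: (other - reference) / reference.
    """
    return [(o - r) / r for r, o in zip(reference, other)]

def trimmed_mean(values, trim=0.05):
    """Mean after dropping the lowest and highest `trim` share of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    return mean(ordered[k:len(ordered) - k])

def share_identical(reference, other, tol=0.01):
    """Share of pairs that agree within a relative tolerance `tol`."""
    errors = percentage_errors(reference, other)
    return sum(abs(e) <= tol for e in errors) / len(errors)

def divergence_summary(reference, other):
    """Collect the three divergence statistics for two matched series."""
    errors = percentage_errors(reference, other)
    return {
        "share_identical": share_identical(reference, other),
        "trimmed_mpe": trimmed_mean(errors),
        "trimmed_mape": trimmed_mean([abs(e) for e in errors]),
    }

# Hypothetical aggregated Scope 3 values (tCO2e) for the same firm-years.
provider_a = [100.0, 200.0, 300.0, 400.0]
provider_b = [100.5, 260.0, 300.0, 100.0]
summary = divergence_summary(provider_a, provider_b)
print(summary)
```

A signed error near zero combined with a large absolute error would indicate noisy but unbiased divergence, whereas a signed error far from zero indicates a systematic bias in one direction.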
Second, we find that firms normally disclose an incomplete composition of Scope 3 emissions (on average, they only disclose 3.75 out of 15 categories in 2010-2019), but they are reporting more categories over time (from 1.7 categories in 2010 to 4.7 categories in 2019). The most relevant Scope 3 emissions categories differ both between and within industries. Business Travel has been reported by most firms (up to 84% in our sample) despite accounting for less than 1% of the total Scope 3 emissions. Other, more material Scope 3 categories, such as Use of Sold Products (making up to 66% of the total Scope 3 emissions) and Processing of Sold Products (making up to 8% of the total Scope 3 emissions), have been largely ignored (disclosed by up to 18% and 6% of firms, respectively). A simple fill-in-the-gap analysis inspired by Klaaßen and Stoll [14], using the median carbon intensities from the industry peer group to proxy for unreported categories at firm level, suggests that if firms were to report the full composition (all 15 categories) of Scope 3, their total Scope 3 emissions figure could be 44% higher than currently reported.
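The fill-in-the-gap idea can be sketched as below. The sketch assumes that carbon intensity means category emissions divided by revenue and that peers are already grouped by industry; the data structure, names and numbers are hypothetical and are not taken from the paper.

```python
from statistics import median

def peer_median_intensity(peers, category):
    """Median emissions-per-revenue for `category` among peers reporting it."""
    intensities = [
        p["emissions"][category] / p["revenue"]
        for p in peers
        if category in p["emissions"]
    ]
    return median(intensities) if intensities else None

def fill_gaps(firm, peers, categories):
    """Return the firm's emissions with unreported categories proxied by
    peer median intensity times the firm's own revenue."""
    filled = dict(firm["emissions"])
    for category in categories:
        if category in filled:
            continue  # keep reported values untouched
        intensity = peer_median_intensity(peers, category)
        if intensity is not None:  # skip categories no peer reports
            filled[category] = intensity * firm["revenue"]
    return filled

# Hypothetical industry peer group (revenue in $m, emissions in tCO2e).
peers = [
    {"revenue": 100.0, "emissions": {"Business Travel": 1.0, "Use of Sold Products": 50.0}},
    {"revenue": 200.0, "emissions": {"Business Travel": 4.0, "Use of Sold Products": 50.0}},
    {"revenue": 50.0, "emissions": {"Business Travel": 1.0}},
]
firm = {"revenue": 100.0, "emissions": {"Business Travel": 2.0}}
categories = ["Business Travel", "Use of Sold Products", "Processing of Sold Products"]

filled = fill_gaps(firm, peers, categories)
uplift = sum(filled.values()) / sum(firm["emissions"].values()) - 1
print(filled, f"uplift={uplift:.0%}")
```

The uplift in this toy example is far larger than the 44% reported in the text because the hypothetical firm discloses only its least material category; the figure is illustrative only.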
Third, Scope 3 prediction accuracy is low, even with a range of machine learning algorithms and an extensive set of business and financial predictors. In general, it is easier to predict upstream emissions than downstream emissions. Critically, estimating total Scope 3 emissions from the category level instead of the aggregated level (as in the work by Nguyen et al. [5]) improves prediction accuracy (i.e., the mean absolute error [MAE] of log-transformed emissions was reduced by 25% in Linear Forest). This is most probably because the aggregated Scope 3 emissions are distorted by non-reported categories, suggesting that the modelling of Scope 3 emissions should be conducted at the category level. Further, predictor importance varies materially by category.
However, there are limited improvements in prediction performance from 'out-of-the-box' machine learning models (i.e., Linear Forest) relative to baseline models (i.e., Industry Fill or Ordinary Least Squares). More precisely, Linear Forest is slightly better at predicting total Scope 3 emissions at the category level and aggregated level than baseline models (MAE is reduced by 2% to 6%) and yields more or less equivalent prediction accuracy to a Stepwise regression model across most individual categories. In addition, the percentage errors between estimated values and actual values on the original scale (tonnes of CO2) are large, as indicated by the large median absolute percentage errors for the aggregated Scope 3 (~72%) and individual categories (59%-187%). Large estimation errors like this may lead to inefficiencies in constructing low-carbon portfolios, as documented by Kalesnik, and Zink [28] for Scope 1 and 2. This finding contrasts with Serafeim and Velez Caicedo [27], who report seemingly low percentage errors for several Scope 3 categories (as their percentage error metrics are based on log-transformed emissions). (See Section 5.3 for detailed discussions.) Overall, our findings imply that researchers and investors should be wary of the potential prediction errors when using Scope 3 emissions obtained from third parties. The findings also call for more transparent disclosure from third-party data providers in terms of estimation methodologies and prediction performance.
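The point about percentage errors on log-transformed emissions can be made concrete with a small numerical sketch. The emissions values are hypothetical, and the natural logarithm is an assumption; the same predictions are scored once on the original scale and once on the log scale.

```python
import math
from statistics import median

def median_ape(actual, predicted):
    """Median absolute percentage error: median of |pred - actual| / |actual|."""
    return median(abs(p - a) / abs(a) for a, p in zip(actual, predicted))

# Hypothetical actual and predicted Scope 3 emissions in tonnes of CO2.
actual = [1.0e5, 2.0e6, 5.0e4]
predicted = [2.5e5, 1.0e6, 9.0e4]

# Error measured on the original scale.
ape_original = median_ape(actual, predicted)

# Error measured after log-transforming both series (natural log assumed).
ape_log = median_ape(
    [math.log(a) for a in actual],
    [math.log(p) for p in predicted],
)

print(f"median APE, original scale: {ape_original:.0%}")
print(f"median APE, log scale:      {ape_log:.1%}")
```

Because the logarithm compresses large values, a prediction that is off by a factor of two on the original scale shifts the log value by only a few percent, so percentage errors computed on log-transformed emissions look much smaller than the same errors expressed in tonnes.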
The rest of the paper proceeds as follows: Section 2 provides context on the Scope 3 emissions problem. Section 3 outlines the data used and Section 4 presents the methodology implemented for the analysis. Section 5 reports the results and Section 6 concludes.

Accounting and reporting of Scope 3 emissions
The accounting and reporting of Scope 3 emissions (or 'value chain' emissions) largely follow the GHG Protocol Corporate Value Chain Accounting and Reporting Standard [17,29]. The protocol differentiates Scope 3 emissions into 15 distinct categories of upstream and downstream emissions. These categories are designed to be mutually exclusive to prevent double-counting, yet firms within the same supply chain or across different supply chains may include the same source of emissions in their Scope 3 reporting. Each category includes several activities that may individually emit GHGs, for which a minimum operational boundary is established to ensure that major sources of emissions are accounted for (e.g., cradle-to-gate or Scope 1 and Scope 2 emissions of relevant value chain partners). Further, there may be a mismatch in the timing of activities between the firm and its value chain partners. For instance, emissions related to purchased goods may occur before the firm's reporting year, employee commuting may occur simultaneously, and use of sold products may occur long after. Therefore, firms should also set a time boundary when calculating Scope 3 emissions (Table 1).
Firms have the discretion to choose which emissions categories to report and whether they would like to go beyond the minimum boundary to include optional activities. Firms' choices are generally based on five principles: relevance, completeness, consistency, transparency, and accuracy (see the GHG Protocol for the full definitions of these principles). There are potential trade-offs among these principles. For instance, firms may choose to report a certain category not because it is material (relevance) but because its emissions data is easier to collect than other categories (completeness). To determine which emissions categories and data types to report, firms should identify activities that are most relevant to their businesses and are associated with most GHG emissions. Firms are expected to justify their rationale for reporting certain emissions categories while ignoring others.
Firms generally need two kinds of information to quantify Scope 3 emissions: (i) activity data, which represents the level of activities that lead to GHG emissions (e.g., litres of fuel consumed, kilograms of material purchased); and (ii) emissions factors that convert quantified activities to GHG emissions (e.g., CO2 emitted per litre of fuel consumed or per kilogram of material produced). Activity data can be sourced from primary channels (e.g., data obtained directly from suppliers that relate to specific activities in the reporting firm's value chain) or secondary channels (e.g., industry-average data, financial data, proxy data). Primary data is generally considered to be more accurate and more specific to the activities whose emissions are being calculated. Using primary data, as opposed to secondary data, imposes additional burden and responsibility on firms reporting this data but allows for better differentiation in terms of the carbon profile of firms. However, under certain circumstances, secondary data can be used to supplement primary data to achieve completeness [7]. When primary data is not available, firms may conduct a simple extrapolation to derive emissions from industry-average data using spend-based metrics (i.e., monetary value of goods and services purchased). Reporting firms may also perform cascade calculations on how much each of their value chain partners contributes to their total emissions. The most appropriate emissions factor(s) and method(s) employed for calculating emissions vary between categories. S1 Table in the Supporting Information lists all possible calculation methods using the first Scope 3 emissions category, Purchased Goods and Services, as an example.
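The basic calculation described above (activity data multiplied by an emissions factor, with a spend-based fallback when primary data is unavailable) can be sketched as follows. The field names, factors and amounts are hypothetical and are chosen only to illustrate the arithmetic; they are not values from the GHG Protocol or from S1 Table.

```python
def purchased_goods_emissions(purchases):
    """Sum emissions (kgCO2e) over purchased items.

    Uses supplier-specific (primary) data when a quantity and a supplier
    emissions factor are available; otherwise falls back to a spend-based
    estimate using an industry-average factor per currency unit.
    """
    total = 0.0
    for item in purchases:
        quantity = item.get("quantity")
        supplier_factor = item.get("supplier_factor")
        if quantity is not None and supplier_factor is not None:
            # Primary data: activity amount x supplier-specific factor.
            total += quantity * supplier_factor
        else:
            # Secondary data: spend x industry-average factor.
            total += item["spend"] * item["spend_factor"]
    return total

# Hypothetical purchases for one reporting year.
purchases = [
    # 1,000 kg of steel with a supplier factor of 2.0 kgCO2e per kg.
    {"name": "steel", "quantity": 1000.0, "supplier_factor": 2.0,
     "spend": 800.0, "spend_factor": 0.5},
    # Consulting services: no primary data, 0.1 kgCO2e per dollar spent.
    {"name": "consulting", "quantity": None, "supplier_factor": None,
     "spend": 10000.0, "spend_factor": 0.1},
]

total_kg = purchased_goods_emissions(purchases)
print(f"Purchased Goods and Services: {total_kg:.0f} kgCO2e")
```

Two firms buying the same goods can therefore report different figures purely because one uses supplier-specific factors and the other relies on spend-based industry averages.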

Selection bias and data errors in Scope 3 emissions
Prior literature has shown that firms' decisions on whether to report Scope 3 emissions, which categories to report and what operational boundaries to establish are affected by many factors [7,15,30,31]. First, while the disclosure of Scope 1 and Scope 2 has been improving rapidly, the disclosure of Scope 3 emissions remains patchy [30,31]. Bigger emitters, including some unlisted oil and gas firms, are less likely to report Scope 3 emissions and/or offer downstream value chain partners supplier-specific data. Firms are not always able to source data directly from their suppliers. As such, carbon auditing the entire value chain can be a very daunting and costly task [7]. Consequently, firms are less likely to include Scope 3 in their carbon reduction targets, citing that these emissions occur outside of their control. It is expected that this trend will be reversed in the future. For instance, the Science Based Targets initiative (SBTi) requires that if Scope 3 emissions represent more than 40% of their carbon footprint, then firms should set a target to cover this impact [32]. Second, firms tend to cherry-pick which Scope 3 categories to disclose. Firms are expected to map out all emissions categories in their value chains and identify which ones to include based on their relevance and materiality. Yet, they may be motivated to knowingly understate or neglect certain Scope 3 emissions categories that are material. According to the CDP [33], only 26.7% of the disclosing firms calculate all emissions categories that they consider to be relevant, and this problem is even more prominent if firms report Scope 3 emissions via channels that are under public scrutiny (e.g., corporate reports) [14,34]. The inconsistency in reported Scope 3 emissions across different communication channels is also known as reporting inconsistency [14]. There are two other sources of errors in Scope 3 emissions, namely boundary incompleteness and activity exclusion [14]. Boundary incompleteness often arises when firms are not able to source primary or secondary data in a systematic way across various value chain partners or third-party data providers. Activity exclusion arises when firms intentionally exclude relevant emissions categories/business activities in their Scope 3 emissions estimates. An example of activity exclusion would be that most reporting firms choose not to disclose emissions from Purchased Goods and Services and Use of Sold Products, though these are generally considered as the most material emissions categories for firms across different industries [35]. These three sources of errors, if not dealt with carefully, would not only lead to inaccurate emissions calculations, but also make the comparison across different reporting firms difficult.

Table 1 (rows recovered from the source)
Category | Description | Minimum boundary | Time boundary
[not recovered] | Extraction, production and transportation of fuels/energy acquired by firms and not accounted in Scope 1 and 2 (upstream emissions of purchased fuels, electricity, transmissions and distributions loss, generation of purchased electricity to end-users for electricity firm/energy retailers) | For upstream emissions of purchased fuels: All upstream (cradle-to-gate) emissions of purchased fuels; For upstream emissions of purchased electricity: All upstream (cradle-to-gate) emissions of purchased fuels; For T&D losses: All upstream (cradle-to-gate) emissions of energy consumed in a T&D system; For a generation of purchased electricity that is sold to end-users: Emissions from the generation of purchased energy | Past year, reporting year
Upstream transportation and distribution | Transportation and distribution of products purchased by the firm between its tier 1 suppliers and its own operation, transportation and distribution of services purchased by the firm (inbound logistics, outbound logistics, between company activities) | The Scope 1 and Scope 2 emissions of transportation and distribution providers | Past year, reporting year
Waste generated in operations | Disposal and treatment of waste generated in firms' operations | The Scope 1 and Scope 2 emissions of waste management suppliers that occur during disposal or treatment | Reporting year, future year
Business travel | Transportation of employees for business-related activities during the reporting year | The Scope 1 and Scope 2 emissions of transportation carriers that occur during the use of vehicles | Reporting year
Employee commuting | Transportation of employees between their homes and worksites | The Scope 1 and Scope 2 emissions of employees and transportation providers that occur during the use of vehicles | Reporting year
Upstream leased assets | Operations of assets leased by the firm | The Scope 1 and Scope 2 emissions of lessors that occur during the reporting company's operation of leased assets (e.g., from energy use) | [not recovered]
Transportation and distribution | Transportation and distribution of products sold by firms between its operations and end consumers (including retails and storage) | The Scope 1 and Scope 2 emissions of transportation providers, distributors, and retailers that occur during the use of vehicles and facilities | Reporting year, future year
Processing of sold products | Processing of intermediate products sold by downstream companies | The Scope 1 and Scope 2 emissions of downstream companies that occur during processing | Reporting year, future year
Use of sold products | The end-use of goods and services sold by the company | The direct use-phase emissions of sold products over their expected lifetime (i.e., Scope 1 and Scope 2 emissions of end-users that occur from the use of products that directly consume energy (fuels or electricity) during use; fuels and feedstocks; and GHGs and products that contain or form GHGs that are emitted during use) | Reporting year, future year
These problems are intensified, as Scope 3 emissions data has been collected by third-party data providers from different reporting channels and adjusted using different estimation models. Busch et al. [16] investigated the consistency of emissions data among third-party data providers (including Bloomberg, CDP, ISS, MSCI, Sustainalytics, Thomson Reuters Refinitiv, and Trucost) spanning the period 2005 to 2016. The authors found that the divergence in reported Scope 3 is much more substantial than that of Scope 1 and Scope 2. For instance, the Pearson correlation between ISS and Trucost is surprisingly low (16%). Further, the inconsistencies among data providers tend to grow over time. Part of the reason for the divergence in Scope 3 is the variation in estimation approaches employed by third-party providers (e.g., process analysis versus input-output analysis), though these methods are expected to produce similar, if not identical, estimation results.

Data
As noted in the introduction, this paper advances the understanding of the quality of Scope 3 emissions data by exploring the divergence and composition in existing third-party datasets and by exploring the prediction accuracy of Scope 3 emissions estimates using a range of machine-learning models (see Section 4.1, Section 4.2 and Section 4.3 respectively for more details).
To address our first research question and see whether divergence exists among data providers, we obtained firm-level aggregated Scope 3 emissions values from three sources: ISS, Refinitiv Eikon, and Bloomberg. We study 2013-2019 as this period had the most complete data across the datasets. We used ISIN and reporting year to match data points across all three data providers and ended up with a small three-way matched sample of 6,725 firm-year observations of aggregated Scope 3 emissions values. Refinitiv Eikon and Bloomberg obtained firms' raw reported Scope 3 emissions from different channels (e.g., the CDP report, the firm's annual filings or sustainability reports). ISS uses proprietary modelling and trust metrics (reliability of issuer-reported emissions data and explanatory power of estimated emissions values) to provide a wider universe of emissions than just reported CDP data and other publicly available sources. For Scope 1 and 2 emissions, ISS uses self-reported data if available and provides modelled data for non-disclosing firms using a range of 800 sub-sector specific models. Hence their data sources are labelled as 'CDP', 'Sustainability/Annual Reports', 'Other Reported' and 'Modelled'. For Scope 3 emissions, ISS overwrites all self-reported data with modelled data, as it deems self-reported data to be inconsistent and incomparable across firms (since firms vary substantially in calculation methodology). Two separate modelling approaches are employed (EIO for upstream emissions and LCA for downstream emissions; see Section 1). However, ISS only provides the aggregated Scope 3 emissions values to end users (ISS Methodology, Factset). (For our analysis, we limit the ISS dataset to the observations that are labelled as 'CDP', 'Sustainability/Annual Reports' or 'Other Reported'. This means that we only compare 'modelled' Scope 3 emissions from ISS with 'self-reported' Scope 3 emissions from Bloomberg and Eikon for the universe of firms that have disclosed Scope 1 and Scope 2. This restriction has little impact on the analysis because the group of firms that disclose Scope 3 emissions normally overlaps with the group of firms that disclose their Scope 1 and Scope 2 emissions.)
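The three-way matching step can be illustrated with a minimal sketch. The ISINs, values and the helper name `three_way_match` below are hypothetical placeholders, not data or code from the study; the sketch only shows the idea of keeping firm-year keys that appear in all three providers.

```python
# Hypothetical records: (ISIN, reporting year) -> aggregated Scope 3 emissions.
# All identifiers and values are made up for illustration.
bloomberg = {("XX0000000001", 2018): 1200.0, ("XX0000000002", 2018): 530.0}
eikon = {("XX0000000001", 2018): 1200.0, ("XX0000000003", 2018): 75.0}
iss = {("XX0000000001", 2018): 2600.0, ("XX0000000002", 2018): 900.0}


def three_way_match(a, b, c):
    """Keep only the (ISIN, year) keys that are present in all three providers."""
    common = sorted(set(a) & set(b) & set(c))
    return {key: (a[key], b[key], c[key]) for key in common}


matched = three_way_match(bloomberg, eikon, iss)
print(matched)
```

Only the firm-year that all three providers cover survives the match, which is why a three-way sample is typically much smaller than any single provider's coverage.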
To address our second research question, we used Bloomberg to analyse the composition of Scope 3 emissions by category. This is because Bloomberg is the only provider (out of the three) that provides a categorical breakdown (both ISS and Eikon only provide the Scope 3 aggregated emissions values). The breakdown comprises the 15 distinct categories defined by the GHG Protocol (see Section 2.1 and Table 1), as well as a miscellaneous category named 'Other', which captures emissions that cannot be classified into one of these 15 pre-defined categories. (As several data points in Bloomberg have "Other" categories, we include this in our analysis to understand the complete composition of Scope 3 emissions values. Otherwise, the contribution of the other 15 categories would be over-represented.) While Scope 3 emissions data has been available from Bloomberg since 2005, the overall number of disclosures is very small before 2010. Therefore, we restricted our sample period to 2010-2019. We started with 12,097 aggregated Scope 3 firm-year observations from the original dataset of 21,166 firm-year observations with disclosed Scope 1, 2 or 3 emissions (this means that firms that disclose Scope 3 emissions make up 57% of the disclosing group). Not all firms provide a detailed breakdown of Scope 3 emissions, thus the sample used to analyse the category composition of Scope 3 emissions is reduced to 9,518 firm-year observations (this means that firms that disclose a category breakdown of Scope 3 emissions make up 45% of the disclosing group). The average firm in the Bloomberg dataset emits 2.8 million tons of Scope 1, 0.5 million tons of Scope 2 and 11.1 million tons of Scope 3 greenhouse gases. (Data is available upon request.) This confirms that Scope 3 accounts for the largest share of a firm's total carbon footprint.
To address our third research question, we continued to use both the aggregated values and the detailed breakdowns of Scope 3 at the category level from the Bloomberg dataset as the target variables. For our machine learning prediction analyses, the baseline predictor set contains two financial metrics (total revenues and number of employees) that have been commonly used in the past literature estimating emissions (MSCI ESG Research [19], Thomson Reuters [20]). We also extended the original predictor set to include financial metrics from firms' annual income statements and balance sheets (see Section 4.3 for more details). We divided firms into smaller industry groups using their GICS group codes so that differences in emissions patterns across industries could be properly reflected. All financial predictors and industry classifications are retrieved from Refinitiv Eikon and are matched back to the Bloomberg emissions dataset using ISIN and reporting year. Our final sample for Scope 3 predictions consists of 11,109 firm-year observations; 988 firm-year observations were removed because breakdown details were unavailable or because values were missing, extremely small or large, or non-normally distributed. (Summary statistics of our final dataset for the third research question are presented in S5 Table in Supporting Information.)

Data quality: Divergence
As mentioned in the Introduction section, the first research question looks at the divergence in the aggregated Scope 3 emissions values across data providers. To answer this question, Busch et al. [16] applied Pearson/Spearman correlation analyses to measure the consistency of firm carbon emissions data from third-party providers. We went further by seeking to quantify the degree of divergence using percentage error metrics and to understand the implication of this data inconsistency for emissions rankings (emissions ordered from highest to lowest per provider).
To do this, we obtained aggregated Scope 3 emissions data from ISS, Refinitiv Eikon and Bloomberg and calculated the percentage error and the absolute percentage error for each firm-year observation across datasets. The percentage error signifies the direction of divergence, while the absolute percentage error signifies the magnitude of divergence between two data points. Eq 1 and Eq 2 present the underlying calculations for these metrics. Here, for firm i in reporting year t, Emission^A_it and Emission^B_it are the aggregated Scope 3 emissions obtained from dataset A and dataset B, respectively.
Next, we calculate the proportion of data points that are 'identical' (i.e., with an absolute percentage error of less than 1%, see Eq 3 below) between all three providers, as well as the proportion of data points that are not identical but within an acceptable error range (i.e., with an absolute percentage error of less than 20%, see Eq 4 below).
where n is the number of overlapping firm-year observations between dataset A and dataset B, and %AbsError_ABit is the absolute percentage error of Scope 3 emissions between dataset A and dataset B for firm i in reporting year t. The divergence between two datasets A and B is then summarized into two aggregated metrics, namely the trimmed mean percentage error and the trimmed mean absolute percentage error (Eq 5 and Eq 6). As with individual data points, the trimmed mean percentage error and the trimmed mean absolute percentage error signify the direction and magnitude of divergence between two datasets. The trimmed values are obtained by taking the mean over the 5th-95th percentile range (to rule out the effect of extreme percentage error terms).
\[ \text{Trimmed mean } \%Error_{AB} = \operatorname{mean}\{\%Error_{ABit} \text{ in 5th-95th percentile}\} \qquad (5) \]

We further investigated whether the divergence in Scope 3 emissions has a substantial effect on emissions ranking. This is particularly relevant in the construction of low-carbon portfolios (e.g., S&P Carbon Efficient Indices, MSCI Low Carbon Indices), where rating agencies and/or investors may overweight firms in lower emissions deciles whilst underweighting firms in higher emissions deciles. To do so, all firm-year observations obtained from Bloomberg, Refinitiv Eikon and ISS were assigned to different ranking deciles based on emissions, and the proportion of observations that stay in the same or adjacent ranking deciles was identified using Eq 7 and Eq 8, respectively.
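The divergence and ranking metrics described in this section can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the sign convention of the percentage error (dataset B relative to dataset A), the cumulative reading of the 20% threshold, the simple tail-trimming rule, the rank-based decile assignment and all sample values are assumptions made for the example.

```python
def pct_error(a, b):
    """Signed percentage error of b relative to a (assumed convention)."""
    return (b - a) / a * 100.0


def share_within(abs_errors, tol):
    """Proportion of absolute percentage errors strictly below tol (in %)."""
    return sum(e < tol for e in abs_errors) / len(abs_errors)


def trimmed_mean(values, trim_pct=5):
    """Mean after dropping trim_pct% of observations from each tail.

    One simple convention for a 5th-95th percentile trim; others exist.
    """
    ordered = sorted(values)
    k = len(ordered) * trim_pct // 100
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def deciles(values):
    """Rank-based decile of each value (0 = highest emitters, 9 = lowest)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    out = [0] * len(values)
    for rank, i in enumerate(order):
        out[i] = rank * 10 // len(values)
    return out


def rank_agreement(a, b):
    """Shares of observations in the same and in adjacent deciles."""
    da, db = deciles(a), deciles(b)
    same = sum(x == y for x, y in zip(da, db)) / len(a)
    adjacent = sum(abs(x - y) == 1 for x, y in zip(da, db)) / len(a)
    return same, adjacent


# Hypothetical matched firm-year values from two providers.
dataset_a = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
dataset_b = [100, 210, 300, 390, 500, 600, 1400, 800, 900, 1000]

errors = [pct_error(a, b) for a, b in zip(dataset_a, dataset_b)]
abs_errors = [abs(e) for e in errors]
identical_share = share_within(abs_errors, 1.0)    # "identical" (< 1%)
acceptable_share = share_within(abs_errors, 20.0)  # within 20%, cumulative
same_share, adjacent_share = rank_agreement(dataset_a, dataset_b)
print(identical_share, acceptable_share, same_share, adjacent_share)
```

In this toy sample a single large discrepancy leaves the share of near-identical values fairly high but moves several firms into neighbouring deciles, which is why the ranking metrics are reported alongside the error metrics.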

Data quality: Composition
In the second part of our analysis, we investigate the quality of the composition of Bloomberg's Scope 3 emissions. For each category, we measured its relevance based on its relative contribution to the firm's aggregated Scope 3 emissions values, and its completeness based on the proportion of firms that choose to disclose this category. Carbon intensity was calculated using Eq 9. Normalizing by total revenues allows us to compare Scope 3 emissions across firms of different sizes:

\[ Intensity^{Cat}_{it} = \frac{Emission^{Cat}_{it}}{RV_{it}} \qquad (9) \]

where Cat represents one of the fifteen Scope 3 emissions categories, and Intensity^Cat_it, Emission^Cat_it and RV_it represent the carbon intensity, Scope 3 emissions, and total revenues of firm i in reporting year t, respectively.
The relative contribution of each emissions category to the full composition of Scope 3 was measured by Eq 10. Note that we set unreported categories (i.e., missing carbon intensity values) to zero. As a result, the contribution of any unreported category for firm i in reporting year t would be zero.
The relevance of a category is then calculated as the mean of its contribution across firm-year observations, weighted by each observation's Scope 3 intensity.
The completeness of each Scope 3 emissions category was then calculated as the proportion of firm-year observations that disclose it, where Disclosure^Cat_it is a dummy variable taking the value of one if a specific Scope 3 emissions category is reported by firm i in year t. If firms disclose "0" (i.e., zero emissions for that category), it is still counted as disclosed.
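The relevance and completeness measures can be illustrated with a small sketch. The exact functional forms are assumptions based on the verbal definitions above (contribution as a category's share of total Scope 3 intensity, relevance as an intensity-weighted mean of contributions, completeness as the mean of the disclosure dummy), and the two firm-years are hypothetical.

```python
CATEGORIES = ["Business Travel", "Use of Sold Products"]  # subset for brevity

# Hypothetical firm-years: revenue plus emissions by category.
# None marks a category that the firm did not report.
firm_years = [
    {"revenue": 100.0,
     "emissions": {"Business Travel": 10.0, "Use of Sold Products": 90.0}},
    {"revenue": 200.0,
     "emissions": {"Business Travel": 20.0, "Use of Sold Products": None}},
]


def intensities(fy):
    """Carbon intensity per category; unreported categories are set to zero."""
    return {c: (fy["emissions"][c] or 0.0) / fy["revenue"] for c in CATEGORIES}


def contributions(fy):
    """Share of each category in the firm-year's total Scope 3 intensity.

    Assumes the firm-year reports at least one non-zero category.
    """
    inten = intensities(fy)
    total = sum(inten.values())
    return {c: inten[c] / total for c in CATEGORIES}, total


def relevance(cat):
    """Mean contribution of a category, weighted by total Scope 3 intensity."""
    num = den = 0.0
    for fy in firm_years:
        contrib, total = contributions(fy)
        num += contrib[cat] * total
        den += total
    return num / den


def completeness(cat):
    """Proportion of firm-years that disclose the category (zero counts)."""
    disclosed = sum(fy["emissions"][cat] is not None for fy in firm_years)
    return disclosed / len(firm_years)


for cat in CATEGORIES:
    print(cat, round(relevance(cat), 3), completeness(cat))
```

In this toy sample Business Travel is disclosed by every firm-year but carries little weight, while Use of Sold Products is disclosed by only half of them yet dominates the weighted contribution, mirroring the contrast between completeness and relevance that the analysis is designed to expose.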
To explore the potential impact of an incomplete Scope 3 composition, we substituted unreported Scope 3 categories with the median carbon intensity of all other firms in the nearest available peer group (multiplied by the firm's revenues). In doing so, we follow Klaaßen and Stoll's [14] strategy and assume that a category is relevant to a firm if its peer group reports a positive median carbon intensity for that category, unless the firm explicitly states that the emissions for that category are zero. The nearest available peer group refers to firms within the same GICS sub-industry. We require the peer group to have at least 10 firms. When there are not sufficient observations from the same GICS sub-industry, we gradually extend our criteria to include firms operating within the same GICS industry, GICS industry group, and finally, GICS sector (following Thomson Reuters [20]). For the avoidance of doubt, this imputed dataset is used purely to explore the impact of an incomplete Scope 3 composition. The original, unimputed dataset is employed for the machine learning prediction analysis (see Section 4.3).
where Emission it is the raw and incomplete Scope
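The peer-group substitution with its stepwise widening of the GICS level can be sketched as follows. The field names, the treatment of the 10-firm minimum as a count of peers with a reported value, and the sample peers are assumptions for illustration only.

```python
from statistics import median

# GICS levels from narrowest to broadest, as described in the text.
LEVELS = ["sub_industry", "industry", "industry_group", "sector"]
MIN_PEERS = 10  # minimum peer-group size required in the text


def peer_median_intensity(firm, peers, category, min_peers=MIN_PEERS):
    """Median reported intensity in the narrowest level with enough peers."""
    for level in LEVELS:
        values = [p["intensity"][category] for p in peers
                  if p[level] == firm[level]
                  and p["intensity"].get(category) is not None]
        if len(values) >= min_peers:
            return median(values)
    return None  # no peer group is large enough


def fill_category(firm, peers, category):
    """Imputed emissions = peer median intensity x firm revenue."""
    m = peer_median_intensity(firm, peers, category)
    return None if m is None else m * firm["revenue"]


def make_peer(sub_industry, value):
    """Build a hypothetical peer in industry 'A' with one reported category."""
    return {"sub_industry": sub_industry, "industry": "A",
            "industry_group": "G", "sector": "S",
            "intensity": {"Business Travel": value}}


# 4 peers share the firm's sub-industry (too few); 12 share its industry.
peers = [make_peer("A1", v) for v in range(1, 5)]
peers += [make_peer("A2", v) for v in range(5, 13)]
firm = {"sub_industry": "A1", "industry": "A", "industry_group": "G",
        "sector": "S", "revenue": 2.0}

filled = fill_category(firm, peers, "Business Travel")
print(filled)
```

Here the sub-industry group is too small, so the sketch falls back to the industry level, where the median of the twelve peer intensities is applied to the firm's revenue.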

Prediction accuracy of Scope 3 machine learning models
Finally, we address the third research question by developing estimation models to predict Scope 3 emissions for non-disclosing firms. We employed a range of machine learning algorithms for our estimation models following [5], who predict firms' Scope 1 and Scope 2 emissions from a set of externally available data. The target variables are the aggregated Scope 3 emissions and the fifteen distinct categories that make up Scope 3 emissions. The original (non-imputed) dataset from Bloomberg is used as described in Section 3.
4.3.1. Baseline models. We started with two baseline models for benchmarking purposes. The first model is an Industry Fill model, in which the aggregated Scope 3 emissions and its individual categories were estimated using the median of disclosed emissions data of the firm's nearest available peer group (this method is similar to the "fill-in-the-gap" strategy for unreported categories in Section 4.2). For each Scope 3 emissions category, we estimated the non-disclosing firm's carbon emissions using Eq 14. All notations carry the same meaning as in Eq 13.
After each individual category has been estimated, aggregated Scope 3 emissions were calculated as the sum of the category estimates. The second baseline model is a simple Ordinary Least Squares [OLS] regression that predicts Scope 3 emissions at the category level using two financial metrics, namely Revenue (RV_it) and Total Employees (Emp_it), and a set of dummy industry indicators (IND) for j GICS groups. This baseline predictor set has been commonly used in the past literature for estimating emissions using the naïve industry fill approach (MSCI ESG Research [19]; Thomson Reuters [20]) and even the regression approach (CDP [21,26]). Both emissions values and predictor values were transformed using the natural logarithm to account for non-normal distributions.
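For orientation, the naïve OLS baseline described above could be written along the following lines. This is a hedged sketch of one plausible specification implied by the verbal description (log-transformed emissions, log-transformed revenue and employees, and industry dummies); the coefficient symbols and the exact functional form are assumptions, not the paper's own equation.

```latex
\ln\!\left(Emission^{Cat}_{it}\right)
  = \alpha
  + \beta_{1}\,\ln\!\left(RV_{it}\right)
  + \beta_{2}\,\ln\!\left(Emp_{it}\right)
  + \sum_{j} \gamma_{j}\, IND_{j}
  + \varepsilon_{it}
```

Under this form, \(\beta_{1}\) and \(\beta_{2}\) can be read as elasticities of category emissions with respect to revenue and headcount, while the \(\gamma_{j}\) shift the level of emissions by GICS group.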

Linear models.
There could be other financial metrics that are better at capturing Scope 3 emissions patterns across the entire supply chain. In [5], the set of predictors chosen for Scope 1 and Scope 2 emissions included: Revenue, Total Assets, Number of Employees, Intangible Assets, Net Property Plant and Equipment [NPPE], Capital Expense, Gross Margin, Leverage and Capital Intensity. We therefore extended our original predictor set to include additional financial variables such as: Cost of Goods Sold, Earnings Before Interest and Taxes [EBIT], Earnings Before Interest, Taxes, Depreciation, and Amortization [EBITDA], Operational Expense, Net Income, Total Debt, Current Asset, Current Liability, Inventory, and Receivables. A similar exercise was carried out independently by Serafeim and Velez Caicedo [27] on the group of public firms. Their modelling design includes Scope 1 and Scope 2 emissions as well as market capitalization in the predictor set (see Section 1). Our prediction framework uses purely business and financial data and is thus applicable to all private and public firms, even if they have not disclosed Scope 1 or Scope 2 emissions. In addition, we also test the inclusion of energy consumption amounts in the predictor set. In [5], the inclusion of this energy-related predictor is found to improve the prediction accuracy for Scope 1 and 2 significantly because it better reflects the emissions patterns of firms. (Energy production amount is not used due to the limited number of disclosing firms.)
Given the limited number of observations in the Scope 3 emissions dataset, the use of the extended set of predictors could lead to multicollinearity issues. To avoid this, we employed Forward-Backward Stepwise Regression, which automatically includes relevant predictors (< 1% significance level) in the model and excludes irrelevant ones (> 5% significance level). The list of the top five relevant predictors for each Scope 3 emissions category can be found in S4 Table of the Supporting Information.
We further employed Elastic Net, a penalized linear regression model, to address potential multicollinearity issues [36]. The regularization strategy is to shrink the size of the coefficients on identical predictors. This is achieved by adding two penalty terms to the sum of squared estimate of errors [SSE], with λ1 and λ2 acting as weights on the sum of squared coefficients and the sum of absolute values of the coefficients. If the coefficient estimates on k identical predictors are inflated by multicollinearity, each is shrunk to about 1/k of the coefficient of one single predictor (under the squared-coefficient penalty) or to exactly zero (under the absolute-value penalty). The penalty weights are usually referred to as hyperparameters in machine learning algorithms and are optimized using Bayesian hyperparameter optimization in the training set. We employ five-fold cross-validation to optimize the hyperparameters as well as to compare the performance across different prediction models (i.e., the model is optimized and trained on four folds and is evaluated on the mean error of the remaining hold-out fold). All yearly observations of a firm are either included in the same training subset or in the hold-out set, so that prediction performance is evaluated on firms the model has not seen, as is the case for non-disclosing firms.
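The penalized objective described above can be written compactly as follows. This is a generic textbook form of the Elastic Net objective given as a sketch; the assignment of \(\lambda_{1}\) to the squared-coefficient penalty and \(\lambda_{2}\) to the absolute-value penalty follows the order in which the text lists them and is an assumption, as is the omission of any scaling constants that a particular software implementation may apply.

```latex
\hat{\beta}
  = \arg\min_{\beta}\;
    \underbrace{\sum_{i,t}\left(y_{it} - x_{it}^{\top}\beta\right)^{2}}_{SSE}
    \;+\; \lambda_{1} \sum_{p} \beta_{p}^{2}
    \;+\; \lambda_{2} \sum_{p} \left|\beta_{p}\right|
```

Setting \(\lambda_{2} = 0\) recovers ridge regression, which spreads weight across correlated predictors, while setting \(\lambda_{1} = 0\) recovers the lasso, which can set redundant coefficients exactly to zero.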

Tree-based ensemble models.
The application of tree-based ensembles has been reported in the recent emissions modelling literature to yield superior prediction performance compared to other modelling techniques [5,22,23]. This is because these models can capture non-linearity and correlations among the predictor set and, at the same time, improve stability and interpretability in predicting carbon emissions. We employed two types of tree-based ensemble models in this paper, namely Random Forest and Extreme Gradient Boosting.
The basis of tree-based ensembles is the decision tree [37]. A decision tree includes multiple branches formed by if-then statements, and the estimate of the target variable is calculated using a constant approximation (i.e., the mean value of observations in the same branch). Each if-then statement is formed by choosing a predictor and a split value that minimize the aggregated SSE of the two resulting sub-samples. The tree either grows to its maximum depth or is "pruned" to a shallow tree. Several trees can be combined into an ensemble, in which trees are built independently and their predictions averaged via Random Forest (RF) [38], or built sequentially via Extreme Gradient Boosting (XGB) [39]. The main hyperparameters for these models are the number of trees and the maximum depth. Potential overfitting problems are addressed by restricting the maximum depth of the trees (so as not to create overly complex decision boundaries) and by growing multiple trees with randomness (by subsampling predictors or subsampling observations). For the XGB model, we also optimized other hyperparameters (e.g., minimum child weight, column sample by tree, subsample and regularization alpha). The hyperparameters continue to be optimized using Bayesian hyperparameter optimization in five-fold cross-validation.
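The split rule at the heart of a regression tree can be illustrated with a minimal sketch for a single predictor. The function names and the toy data are hypothetical; production libraries use far more efficient search routines, so this only shows the SSE criterion itself.

```python
def sse(ys):
    """Sum of squared errors around the mean (the leaf's constant prediction)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(xs, ys):
    """Return (threshold, total SSE) of the split 'x <= threshold' with lowest SSE."""
    pairs = sorted(zip(xs, ys))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical predictor values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical predictor (e.g., log revenue) and target (e.g., log emissions).
xs = [1, 2, 3, 10, 11, 12]
ys = [5, 6, 7, 20, 21, 22]
threshold, total_sse = best_split(xs, ys)
print(threshold, total_sse)
```

A full tree repeats this search over every predictor and then recursively within each resulting branch, until the maximum depth or another stopping rule is reached.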
4.3.4. Linear tree models. Linear Tree models are a special form of tree-based models with a linear functional model in each leaf. A linear tree combines the learning ability of decision trees with the extrapolation power of linear models. Thus, the hybrid model can lead to better predictive power and better insights than either model alone. Linear Forests generalize Random Forest algorithms by combining linear models with Random Forests. An initial linear model is fitted on the whole dataset, then the residuals from this model are used as a target variable for the subsequent Random Forest, and the final predictions are generated as the sum of the predictions from the initial linear model and the residual predictions from the Random Forest [40].
Similar to XGB, Linear Boosting builds models sequentially in a two-stage process. Starting with an initial linear model, a simple decision tree is fitted to model the residuals of the previous step. During this process, the branch with the highest absolute predicted residual is identified, and a binary vector is created that indicates which observations fall under this rule. This binary vector is then added to the linear model as a new predictor, and the procedure is repeated until a certain stopping criterion is met [41]. The hyperparameters remain similar to those of the tree-based ensemble models and are optimized using Bayesian hyperparameter optimization in five-fold cross-validation.

Results
This section summarises the results of the three analyses detailed in Section 4. Corresponding to the three research questions, this section is also divided into three sub-sections. Section 5.1 looks at the divergence, Section 5.2 looks at the composition, and Section 5.3 looks at the prediction accuracy of Scope 3 machine learning models.

Data quality: Divergence
This section presents the results of the divergence analysis detailed in Section 4.1. Fig 1 presents the coverage of Scope 3 among the three data providers over time. ISS uses proprietary models to fill missing values for non-reporting firms and has the largest coverage among the three providers (5,433 firms as of 2019), followed by Bloomberg (2,238 firms as of 2019) and Refinitiv Eikon (2,066 firms as of 2019; see Panel a). We observed a steady increase in the sample size of all three data providers over time, with ISS showing a sizeable expansion in 2018. On average, 57%-60% of firms that report Scope 1 and/or Scope 2 emissions in Bloomberg and Refinitiv Eikon also report Scope 3 emissions, and this proportion remains steady throughout the entire sample period. For ISS, the proportion of firms with Scope 3 data is 100% of those that report Scope 1/Scope 2 emissions metrics (see Panel b). This is not surprising, given that ISS adjusts all reported/missing Scope 3 emissions data with its own estimates to address the inconsistencies in reporting and to differentiate between upstream and downstream emissions.

Table 2 summarises the divergence statistics. We find that the fraction of identical data among Bloomberg, Refinitiv Eikon and ISS is surprisingly low. While it is expected that none of the ISS adjusted values are within the 1% error range of the same firm-year observation obtained from the other two datasets, the low proportion of identical data points (68%) between Bloomberg and Refinitiv Eikon was unexpected, given that reported emissions are supposed to be similar, especially when some are extracted directly from the same communication channel (i.e., corporate reports or CDP). This problem persists even when the cut-off error range is extended to 20%. We find that while 84% of the data points between Refinitiv Eikon and Bloomberg are within an acceptable range (20% cut-off), only 5% of ISS's Scope 3 emissions are similar to the other two datasets. In percentage terms, the trimmed mean percentage error between Bloomberg and Refinitiv Eikon is <0.01%, and the trimmed mean absolute percentage error is 4%. This confirms that the divergence between these two reported datasets is small in scale and random. Between ISS and Bloomberg, the trimmed mean percentage error is -20% and the trimmed mean absolute percentage error is 111%. This suggests that the ISS proprietary models exhibit an upward bias on average. Despite this, the Pearson pairwise correlations between the three datasets (Bloomberg and Refinitiv Eikon, Bloomberg and ISS, and Refinitiv Eikon and ISS) are relatively high, reaching an average of 95%, 55%, and 56%, respectively. (Prior to this analysis, we removed two outliers in the Bloomberg data that were detected by plotting the data.)
We find that emissions rankings are more consistent among the three datasets than absolute emissions values (Table 2). Specifically, Bloomberg has 82% (an additional 13%) of data points that are in the same (adjacent) ranking decile(s) as Refinitiv Eikon. ISS consistently differs in that only 22% (and an additional 31%) of its emissions data falls into the same (adjacent) ranking decile(s) as the other two datasets. This implies that if portfolio managers use emissions data obtained from Bloomberg or Refinitiv to divest from the top emitters or construct their low-carbon indices/portfolios, they are more likely to obtain consistent results. Meanwhile, low-carbon indices/portfolios based on ISS data may differ significantly from those using the other two datasets.
It would be interesting to disaggregate the divergence in the aggregated Scope 3 emissions values. Unfortunately, Bloomberg is the only provider with the category breakdown, thus a comparison of divergence by category is not possible. Furthermore, the methodologies employed by third-party providers remain largely a black box, and the limited information that comes with the datasets with respect to data collection, cleaning and modelling processes makes it extremely difficult to examine the root cause of divergence [24]. In the context of our analysis, we can only hypothesize several explanations for the divergence, and visualise the error patterns across several dimensions to assess whether such explanations are plausible. Between reported datasets (i.e., Eikon and Bloomberg), divergence happens primarily in the data collection and cleaning steps, such as: (i) timing of collecting and updating values, (ii) different rounding units (thousand/million), (iii) treatment of outliers, (iv) adjustment across sectors, and (v) adjustment for high-emitters and low-emitters within sectors. These explanations could apply to the divergence between all three datasets. For the divergence between ISS and the other two datasets, additional modelling steps could contribute to divergence, such as (vi) imputation for undisclosed categories or (vii) adjustment within disclosed categories using different modelling approaches. To explore these, we undertook visualisations (see results in S2 Table and S1 Fig in the Supporting Information) across reporting years, sectors, number of reported categories, emissions deciles, nearest rounding units and outliers. The results suggest that the difference between Eikon and Bloomberg seems random (apart from more prominent inconsistency in Utilities, Real Estate, Financials, etc.). However, between reported and modelled datasets (Bloomberg and ISS), the divergence seems to be more systematic, especially with respect to the number of reported categories (firms that report fewer categories are more likely to be adjusted upward by ISS). There also appears to be adjustment within reported categories (firms that report the full set of 15 categories plus "Other" are also adjusted by ISS). In addition, there seems to be a clear time effect and size effect in the level of inconsistency of ISS as compared to other data providers (less upward adjustment over the years, less upward adjustment with higher emissions values, fewer outliers in the ISS dataset). Despite these results, we cannot be conclusive about the root causes of the divergence across the datasets. Overall, we show that there is considerable divergence in Scope 3 emissions among third-party data providers, especially when the data provider (in this case, ISS) adjusts values using its proprietary estimation models.

Data quality: Composition
Tables 3 and 4 report the completeness and relevance of Scope 3 emissions over time and across different GICS sectors, respectively (these results are also reported in S2 and S3 Figs in the Supporting Information). On average, firms only report 3.8 out of the 15 distinct Scope 3 emissions categories during our sample period (see Table 3 below). The degree of completeness is relatively low. However, we see significant improvements over time: the average number of reported Scope 3 categories in 2019 (4.7) has nearly tripled relative to 2010 (1.7), and we see a significant increase in the proportion of disclosing firms in most Scope 3 emissions categories (Table 3). However, firms tend to report categories that are easier to calculate rather than those that are more material to their organisations' carbon footprints. Firms report emissions related to business travel (84%) more than any other category, despite the fact that Business Travel covers less than 1% of the total emissions of the value chain. While most Scope 3 emissions could be captured by Use of Sold Products (64%), fewer than 20% of firms report this emissions category. The second and third largest emissions categories, Purchased Goods and Services and Processing of Sold Products, are also largely ignored. The most relevant Scope 3 categories vary greatly across GICS sectors (see Table 4 below). For most firms (especially those operating in the Energy, Industrials and Consumer Discretionary sectors, which include oil & gas firms and fossil fuel-based car manufacturers), a significant portion of Scope 3 emissions comes from Use of Sold Products. In contrast, firms operating in the Financials sector 'fund' emissions via their loan/investment portfolios. Consequently, most of their Scope 3 emissions come from Investments. For firms operating in the Health Care and Consumer Staples GICS sectors, most of their Scope 3 emissions come from Purchased Goods and Services, whereas for Utilities firms, Fuel and Energy Related Activities contributes the most to Scope 3 emissions.
Finally, we calculate the 'corrected' Scope 3 emissions by substituting unreported Scope 3 categories with the median carbon intensity of all other firms in the nearest available peer group following Eq 13. These fill-in-the-gap analyses are conducted on the subset of firms that disclose an incomplete emissions composition and have available revenue data. A simple fill-in-the-gap analysis suggests that if firms reported the full composition of Scope 3, their total Scope 3 emissions figure could be 44% higher than currently reported. Detailed analysis by year and sector can be found in S4 and S5 Figs in the Supporting Information.

Table 3. Completeness and Relevance of Scope 3 emissions categories over time.
This table summarises the analyses of completeness and relevance of Scope 3 emissions categories over time as described in Section 4.2. For each category, completeness is measured based on the proportion of firms that choose to disclose each category (higher: green, lower: red), and relevance is measured based on the weighted mean relative contribution of each category to Scope 3 intensities (by firm intensities) (higher: green, lower: red). Unreported categories (i.e., missing carbon intensity values) are set to zero. The sample is 9,518 observations from 1,972 firms that disclose the composition of Scope 3 in the Bloomberg dataset in 2010-2019.

Performance of machine learning prediction models
Table 5 presents the out-of-sample prediction performance of all models presented in Section 4.3. The main criterion of performance assessment is the mean absolute error (MAE) of log-transformed emissions in five-fold cross-validation (the model is optimized and trained on four folds and is evaluated on the mean error of the remaining fold). All yearly observations of a firm are either included in the same training subset or in the hold-out set, so that prediction performance is evaluated on firms the model has not seen, as is the case for non-disclosing firms. In panel (a), we compare prediction results on aggregated Scope 3 emissions when they are treated as a single value (that is, only one machine learning model is built) and when they are aggregated from a group of 16 sub-models made up of the 15 categories (see Table 1) and the residual covered by the 'Other' category. In panel (b), we present prediction results on individual categories. First, we find that our two baseline models (the industry-fill model (Table 5, Column 1) and the naïve OLS model (Table 5, Column 2)) produce very similar prediction performance on both the aggregated Scope 3 emissions and its individual categories. However, the industry-fill model underperforms naïve OLS in emissions estimates for certain categories (e.g., Franchise, Investment, Downstream Transportation and Distribution, and Downstream Leased Asset) where we only have limited reported emissions data.
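The evaluation design described above (log-MAE under five-fold cross-validation with all observations of a firm kept in the same fold) can be sketched as follows. The round-robin fold assignment, the median "model" used as a placeholder predictor, the averaging of fold scores and the sample data are all assumptions for illustration; a real pipeline would shuffle firms and plug in the models from Section 4.3.

```python
import math
from statistics import median


def log_mae(actual, predicted):
    """Mean absolute error between natural logs of actual and predicted values."""
    diffs = [abs(math.log(a) - math.log(p)) for a, p in zip(actual, predicted)]
    return sum(diffs) / len(diffs)


def grouped_folds(firm_ids, k=5):
    """Assign each firm (not each observation) to one of k folds."""
    firms = sorted(set(firm_ids))
    fold_of_firm = {firm: i % k for i, firm in enumerate(firms)}
    return [fold_of_firm[firm] for firm in firm_ids]


# Hypothetical firm-year observations (firm identifier, Scope 3 emissions).
firm_ids = ["A", "A", "B", "C", "C", "D", "E", "F"]
emissions = [100.0, 120.0, 900.0, 50.0, 55.0, 3000.0, 400.0, 80.0]

folds = grouped_folds(firm_ids)
scores = []
for fold in range(5):
    train = [e for e, f in zip(emissions, folds) if f != fold]
    test = [e for e, f in zip(emissions, folds) if f == fold]
    guess = median(train)  # placeholder "model": predict the training median
    scores.append(log_mae(test, [guess] * len(test)))

cv_log_mae = sum(scores) / len(scores)
print(folds, round(cv_log_mae, 3))
```

Because a firm's yearly emissions are highly persistent, splitting at the observation level would let the model see other years of the same firm and would overstate accuracy for firms that disclose nothing at all.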
Second, we find the application of machine learning algorithms to be more useful when each category is estimated individually and aggregated into the total Scope 3 emissions values. Both the industry fill model and the naïve OLS model generate a log-MAE of 1.88 when

Third, we find that it is easier to predict upstream emissions than downstream emissions. The best prediction performance is found in Business Travel (naïve OLS log-MAE: 0.99), Employee Commuting (naïve OLS log-MAE: 1.18) and Capital Goods (naïve OLS log-MAE: 1.27). There are two possible reasons behind this: (i) firms report more of the emissions associated with these categories, and (ii) good proxies could be found in suppliers' financial statements that help capture emissions derived from these upstream activities. For instance, the number of employees could be used to calculate emissions associated with business travel/employee commuting, whereas capital expenditures (CAPEX) might be a good indicator of emissions associated with capital goods.
We extended beyond the baseline models by including (i) all possible financial predictors that may capture Scope 3 emissions along the supply chain (full OLS; Table 5, Column 3), and (ii) the most relevant (i.e., significant) predictors chosen by forward-backward stepwise selection (stepwise OLS; Table 5, Column 4). While stepwise selection significantly improves prediction accuracy, whether full OLS outperforms naïve OLS remains inconclusive.
Surprisingly, machine learning algorithms lead to only limited improvements in prediction accuracy compared to the baseline methods (industry fill and naïve OLS models) as well as the best OLS model (stepwise model). Out of all machine learning techniques used (Elastic Net, XGBoost, Random Forest, Linear Boost, Linear Forest), only Linear Forest (see Table 5, Column 9) consistently outperforms baseline models in predicting total Scope 3 emissions, using both aggregated data and category-level data (MAE is reduced by 6% and 2%, respectively). Linear Forest is slightly better at predicting the aggregated Scope 3 emissions than the Stepwise model (MAE is reduced from 1.85 to 1.77, i.e., by 4%) but yields more or less equivalent prediction accuracy across most individual emissions categories (MAE is 1.32 as compared to 1.31 from Stepwise regression). However, the gain in prediction performance from using out-of-the-box machine learning algorithms is very limited when data quality is poor and certain categories have few observations.
When doing stepwise regressions, we find that predictor importance varies materially by category (see S4 Table in the Supporting Information). For certain emissions categories (e.g., Employee Commuting, Use of Sold Products, and Upstream Leased Assets), total revenues are the most important size factor. For other categories, total revenues may not be relevant, and other financial metrics might be better proxies for calculating emissions. For instance, both Purchased Goods and Services and Use of Sold Products are better estimated if the level of inventory is included in the estimation model, whilst Capital Goods is better captured by the capital expenditures figure in the same reporting year. We also included GICS industry group dummies in the forward-backward stepwise regression. Similar to financial metrics, the importance of industry group indicators varies greatly between categories. For instance, Insurance, Retailing, Transportation and Materials are important GICS industry group indicators for Upstream Leased Assets, whereas Software & Services is important in predicting emissions associated with Business Travel.
So far, we have measured prediction accuracy on log-transformed data. It is on this basis that Serafeim and Velez Caicedo [27] conclude that ML models can be used to predict Scope 3 emissions. However, using log-transformed results to compute the accuracy of models is misleading, as it overstates their predictive power, which can only truly be measured in absolute terms (i.e., taking the antilog of the predictions, following Nguyen et al. [5]). (Since all predictions are performed on a logarithmic scale, the prediction errors in log-MAE computed so far do not depict the actual deviations on the absolute emissions scale (tonnes of CO2-e). Therefore, the percentage error (PE) on the actual emissions scale is calculated by retransforming the log-scaled actual emissions and predicted emissions, following the equations accompanying Table 6. Here ŷ is the predicted value of the log-transformed emissions y, n is the number of observations in the dataset, e is the exponential, i is the firm and t is the reporting year.) Thus, we explore how our best prediction model performs in absolute emissions values (Table 6). To do so, we transformed log-MAE (Table 6, Column 2) into median absolute percentage error [MDAPE] on non-transformed emissions values (Table 6, Column 3) and calculated the proportion of emissions estimates that lie within ±50% of the actual emissions values [PPAR] (Table 6, Column 4).
We find that despite the gain in prediction accuracy (especially when each category is estimated individually and aggregated into total Scope 3 emissions values), even the best prediction model has substantial prediction errors. Specifically, the stepwise regression model generates a median absolute percentage error of 72.2% on aggregated Scope 3 emissions values (obtained by combining all category-level emissions estimates), and only 30% of the predicted emissions values lie within ±50% of the actual reported emissions values. (This finding contrasts with Serafeim and Velez Caicedo [27], who report a seemingly low percentage error for several Scope 3 categories because their percentage error metric is based on logged emissions. For instance, Table 4 of their working paper reports a Root Mean Squared Logged Error of 0.90 on Business Travel using Adaboost. This means that if the (natural) log-transformed actual emissions value is 8.5 (Table 2 of their working paper) and the log predicted emissions value is 9.4, the logged error is 0.9 (roughly 10% of logged emissions), but the percentage difference between the retransformed predicted emissions (12,088 tons) and the retransformed actual emissions (4,914 tons) is 146%.) Potential biases in the reported datasets (i.e., the training set) may contribute to this outcome to a certain extent. Researchers and industry practitioners should be wary of prediction errors when doing risk analysis using machine learning techniques and/or 'adjusted' Scope 3 emissions obtained from third-party data providers.
Finally, we also investigate whether the inclusion of energy-related data improves prediction accuracy. In [5], the inclusion of energy consumption data is found to improve the prediction accuracy for Scope 1 and 2 significantly because it better reflects the emissions patterns of firms. We add this information to the best model (Stepwise OLS) and find a small improvement of 4% (reduction in MAE from 1.31 to 1.25). This improvement comes mainly from better prediction of Category 3, Fuel and Energy Related Activities (17% reduction in MAE from 1.52 to 1.26), and, surprisingly, of Category 14, Franchise (an 11% improvement); details of this analysis are presented in S5 Table in the Supporting Information. However, perhaps because the energy consumption information relates to the firm's own operations and not to its supply chain, substantial prediction errors remain in the estimation models of Scope 3.
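The gap between log-scale and absolute-scale accuracy discussed above can be reproduced with a short sketch. It follows the standard definition of an absolute percentage error on retransformed values; the function names are introduced here for illustration, and treating the ±50% acceptance band as inclusive is an assumption.

```python
import math
from statistics import median


def absolute_percentage_errors(y_log, y_hat_log):
    """|exp(y) - exp(y_hat)| / exp(y) for each observation (as fractions)."""
    return [
        abs(math.exp(y) - math.exp(p)) / math.exp(y)
        for y, p in zip(y_log, y_hat_log)
    ]


def mdape(y_log, y_hat_log):
    """Median absolute percentage error on the raw emissions scale, in percent."""
    return 100.0 * median(absolute_percentage_errors(y_log, y_hat_log))


def ppar(y_log, y_hat_log, tolerance=0.5):
    """Share of predictions within +/- tolerance of the actual value.

    The inclusive boundary (<=) is an assumption of this sketch.
    """
    errors = absolute_percentage_errors(y_log, y_hat_log)
    return sum(e <= tolerance for e in errors) / len(errors)


# Log actual 8.5 versus log predicted 9.4, the example discussed in the text.
pe = mdape([8.5], [9.4])  # about 146 (percent), despite a log error of only 0.9
```

Because the error is exponentiated, a log error of 0.9 corresponds to a prediction about 2.46 times the actual value, which is why a small log-scale error can hide a large error in tonnes.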

Conclusion
This paper explored the quality of Scope 3 emissions datasets. We looked at three issues: (1) the divergence in the aggregated Scope 3 (reported and estimated) emissions values among existing third-party datasets (Bloomberg, Refinitiv Eikon, ISS), (2) the (incomplete) composition of Scope 3 emissions when broken down by category, and (3) whether machine-learning models can be used to predict Scope 3 emissions for non-reporting (or incompletely reporting) firms.
With respect to the divergence of Scope 3 emissions, we find that Scope 3 emissions vary greatly across third-party data providers. The divergence between ISS (adjusted values) and the other two datasets (reported values) is the most prominent. Surprisingly, we find that only 68% of data points are identical between Bloomberg and Refinitiv, though both datasets rely on firms' reported data with no further adjustments. While emissions rankings are more consistent among the three data providers than absolute emissions values, only 22% of ISS's emissions data fall into the same ranking decile as the other two datasets. These large differences are likely to have an impact on the implementation of divestment strategies and the composition of low-carbon indices.
With respect to the composition of Scope 3 emissions, firms tend to disclose emissions categories that are easier to calculate, even though these categories make up a relatively low proportion of total Scope 3 emissions. By looking at the relative contribution of each emissions category to the full composition of Scope 3, we argue that the most relevant Scope 3 emissions categories vary greatly between industries. Therefore, firms should perform relevance tests to make sure that they establish the 'correct' operational boundaries. Users of (reported) Scope 3 emissions data should fill in the gaps for unreported categories before making comparisons across firms; otherwise a firm could be considered green just because it reports an incomplete composition. A simple industry fill could be considered for this imputation process, but more advanced estimation techniques (such as OLS regressions) could be employed as well.
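A simple industry fill of the kind mentioned above might look like the following sketch. It is an illustration under stated assumptions, not the paper's procedure: carbon intensity is taken to be emissions divided by revenue, a single peer-group level is used (no fallback to a broader "nearest available" group), and all field names are hypothetical.

```python
from collections import defaultdict
from statistics import median


def industry_fill(firms, category):
    """Return {firm name: emissions}, filling unreported values with
    peer-group median intensity times the firm's revenue.

    Assumption of this sketch: intensity = emissions / revenue.
    Firms without any reporting peer keep None.
    """
    intensities = defaultdict(list)
    for f in firms:
        value = f["emissions"].get(category)
        if value is not None:
            intensities[f["peer_group"]].append(value / f["revenue"])

    filled = {}
    for f in firms:
        value = f["emissions"].get(category)
        peers = intensities.get(f["peer_group"], [])
        if value is None and peers:
            value = median(peers) * f["revenue"]
        filled[f["name"]] = value
    return filled


# Hypothetical firm-year records for one reporting year.
firms = [
    {"name": "A", "peer_group": "Materials", "revenue": 100.0,
     "emissions": {"Business Travel": 50.0}},
    {"name": "B", "peer_group": "Materials", "revenue": 200.0,
     "emissions": {"Business Travel": 60.0}},
    {"name": "C", "peer_group": "Materials", "revenue": 400.0, "emissions": {}},
    {"name": "D", "peer_group": "Retailing", "revenue": 50.0, "emissions": {}},
]
filled = industry_fill(firms, "Business Travel")
```

In this toy example firm C receives the Materials median intensity (0.4) scaled by its revenue, while firm D stays unfilled because no Retailing peer reports the category.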
For those firms that never disclose Scope 3 emissions (or their breakdown by category), the paper also compared the prediction accuracy of various baseline (industry fill, naïve OLS), linear (full OLS, forward-backward stepwise) and machine learning models (Elastic Net, XGBoost, Random Forest, Linear Boosting, and Linear Forest). Overall, prediction is more effective when each emissions category is estimated individually and aggregated into total Scope 3 emissions. Further, we find that predictor importance varies greatly by emissions category, and upstream emissions are easier to predict than downstream emissions. Contrary to our expectations, even when we use the most advanced 'out of the box' machine learning techniques, the improvements in prediction accuracy are relatively small. Further, contrary to Serafeim and Velez Caicedo [27], we find that absolute prediction performance is low even with the best models, with the accuracy of estimates primarily limited by the small number of observations in specific Scope 3 categories. Similar absolute prediction errors may be inherent in third-party estimation models using externally available data. Therefore, researchers and investors should exercise great caution when using Scope 3 emissions estimates.
Overall, our findings emphasize the need for improvement in Scope 3 emissions disclosure. First, binding mandates should be established, and more guidance is needed to derive accurate calculations. Second, firms should expand their operational reporting boundaries to include the categories that are most relevant to their businesses. Third, it is also important for firms to source primary data from value chain partners, and to be cognizant that this data may itself be subject to large uncertainties (e.g., from SMEs in the supply chain who in turn estimate their emissions). Researchers and industry practitioners should be wary of both reported and estimated Scope 3 data and should account for potential data error in their analyses.

Fig 2 demonstrates the divergence across three emissions datasets in terms of direction (percentage errors, panel a) and magnitude (absolute percentage errors, panel b). Within each panel, each line represents a pair of datasets, and its value signifies the cumulative fraction of

Fig 1. Scope 3 data coverage over time. This figure summarises the data coverage of the three Scope 3 emissions datasets (Bloomberg, Refinitiv Eikon and ISS) in 2013-2019. Panel (a) presents the number of observations over time, panel (b) presents the proportion of firms that also have Scope 3 emissions data relative to those with Scope 1 or 2 emissions data, and panel (c) presents the pairwise scatter plots of the three datasets. https://doi.org/10.1371/journal.pclm.0000208.g001

Fig 2. Divergence between three Scope 3 emissions datasets in 2013-2019. This figure illustrates the divergence between three of the largest third-party datasets (ISS ESG, Refinitiv Eikon and Bloomberg) as described in Section 4.1. The three-way matched dataset includes 6,725 firm-year observations of aggregated Scope 3 emissions in 2013-2019. Panel (a) presents the cumulative proportion of data points that lie within a certain cut-off point of percentage errors (the direction of divergence). Panel (b) presents the cumulative proportion of data points that lie within a certain cut-off point of absolute percentage errors (the magnitude of divergence). https://doi.org/10.1371/journal.pclm.0000208.g002

Table 1. Scope 3 emissions categories.
3 emissions data obtained from Bloomberg for firm i in year t, and $\mathrm{Emission}^{\mathrm{Correct}}_{i,t}$ is the 'corrected' emissions figure obtained by filling in all missing values using the peer group median. $\mathrm{Intensity}^{\mathrm{Cat}}_{\mathrm{IND},t}$ is the median carbon intensity for all reporting firms in firm i's nearest available peer group.

Table 2. Divergence statistics between three Scope 3 emissions datasets in 2013-2019.
This table summarises the analyses of divergence between three of the largest third-party datasets (ISS ESG, Refinitiv Eikon and Bloomberg) as described in Section 4.1 and Fig 2.

Table 4. Completeness and Relevance of Scope 3 emissions categories across sectors.
This table summarises the analyses of completeness and relevance of Scope 3 emissions categories across sectors as described in Section 4.2. For each category, completeness is measured as the proportion of firms that choose to disclose that category (higher: green, lower: red), and relevance is measured as the weighted mean relative contribution of the category to Scope 3 emissions.

Table 5. Performance of machine learning prediction models.
Given that firms generally report incomplete compositions, we first obtained emissions estimates for all 16 categories and then aggregated them into total Scope 3 emissions, rather than using aggregated values to predict total Scope 3 emissions directly. By doing so, we see a large improvement in prediction accuracy for both baseline models, as evidenced by a decline in log-MAE from 1.88 to 1.35 for the industry fill model and from 1.88 to 1.34 for naïve OLS.
The main criterion is the mean absolute error (MAE) of log-transformed emissions averaged from five-fold divisions. Panel (a) presents the prediction results on aggregated Scope 3 emissions when they are treated as a single value and when they are aggregated from a group of 16 sub-models (for 15 categories and other). Panel (b) presents the prediction results on individual categories. https://doi.org/10.1371/journal.pclm.0000208.t005

Table 6. Performance of the best model with alternative measures.
This table summarises the out-of-sample performance of the best prediction model for Scope 3 emissions (Stepwise OLS) as described in Section 5.3. The main criterion is the mean absolute error (MAE) of log-transformed emissions averaged from five-fold divisions. Median Absolute Percentage Error (MDAPE) and proportion of prediction in acceptance range (PPAR) are used as the alternative performance criteria on raw emissions values. The calculations of alternative measures are presented below. Here ŷ is the predicted value of the log-transformed target variable y, n is the number of observations in the dataset, e is the exponential, i is the firm and t is the reporting year. https://doi.org/10.1371/journal.pclm.0000208.t006
$$\mathrm{MDAPE} = \operatorname{median}\left( \frac{\left| e^{y_{i,t}} - e^{\hat{y}_{i,t}} \right|}{e^{y_{i,t}}} \right) \times 100\%$$