Dropping diversity of products of large US firms: Models and measures

It is widely assumed that in our lifetimes the products available in the global economy have become more diverse. This assumption is difficult to investigate directly, however, because of the challenge of collecting the necessary data about every product in an economy each year. We solve this problem by mining publicly available textual descriptions of the products of every large US firm each year from 1997 to 2017. Although many aspects of economic productivity have been steadily rising during this period, our text-based measurements show that the diversity of the products of at least large US firms has steadily declined. This downward trend is visible using a variety of product diversity metrics, including some that depend on a measurement of the similarity of the products of every single pair of firms. The current state of the art in comprehensive and detailed firm-similarity measurements is a Boolean word vector model due to Hoberg and Phillips. We measure diversity using firm-similarities from this Boolean model and two more sophisticated variants, and we consistently observe a significant dropping trend in product diversity. These results make it possible to frame and start to test specific hypotheses for explaining the dropping product diversity trend.


Introduction
For decades economists have been using diversity to gauge the productivity and stability of regional economies, and this has motivated continuing efforts to craft better ways to measure diversity [1][2][3][4][5]. The economic diversity of geographic regions has been correlated with higher levels of gross domestic product, and economic diversification is often promoted as a route to economic stability, growth and development [6,7]. This paper focuses more narrowly on the diversity of the products bought and sold in the economy overall. The diversification of products produced by important individual firms has been studied [8,9], and so has the diversity of products in markets with many kinds of firms selling many kinds of products at fluctuating prices to many kinds of consumers [10][11][12]. Taking advantage of the existence of high quality public textual data, this paper focuses on the products of large US firms over the past two decades.
Some discussions of product diversity are theoretical and focus on the mathematical consequences of simple economic scenarios, but our focus is empirical and data-driven, and relatively theory neutral and free of economic assumptions. We simply observe the changing product diversity of large US firms, evident in their annual product descriptions, and describe the trends we observe. In recent years, there have been similar efforts to draw ideas from quantitative biology, systems science and data mining to study the diversity of systems in social science [13,14]. In economics, many papers design and apply standard indices of economic diversity and complexity (e.g., [15]), but atemporal data blinds us to how the indices have changed. The temporal data binning used here reveals how economic diversity and complexity have changed over the past generation and are trending today. As a result, our findings are precise and quantitative. In addition, our methods are easily reproducible. We first embed annual documents describing each firm's products in a high-dimensional vector space, producing a model of the similarities among the products of large US firms. As shown in Figure 1, we then group the vectors by SIC class to obtain product-focused vector representations for industry classes. The diversity of those products is calculated from this classification for each year. We focus on three different document embeddings: a Boolean embedding modeled after the current industry standard in product-focused industry classification [16,17], a slightly more sophisticated TF-IDF embedding, and a more complex Paragraph-Vector Distributed Memory (PV-DM) embedding. All of the models are first evaluated by measurement of their Industry Specificity relative to the Standard Industrial Classification (SIC) and evaluation of the a priori plausibility of their firm clusters.
Models that pass these tests are each used to measure the diversity of the products of large US firms over the past two decades. In order to identify diversity trends that are robust, we employ a suite of diversity measures of varying complexity, including a baseline measurement based merely on each firm's SIC classification; the trends that persist across this variety of models are the ones we treat as robust.
We thereby provide evidence of a falling trend during 1997-2017 in the diversity of products offered by large US companies. This evidence comes from a consensus of semantic-vector models trained on a corpus of 10-K documents from 1995-2019 that describe the products of those firms. This trend is further corroborated by the text-free model based just on SIC Codes. We conclude by evaluating a number of hypotheses for how to explain the trend of dropping diversity.
Our work is one of a growing number of text-based analyses of economic topics, such as banking, finance, accounting, mergers and acquisitions, or corporate innovation and fraud. Many use topic modeling methods akin to our methods [18][19][20][21] and apply them like we do to 10-K documents [18][19][20], while others mine other kinds of documents, such as IPO prospectuses [22][23][24] and analysts' reports and regulatory filings [23][25][26][27].
Our work also reflects the expanding diversity of applications of NLP and machine learning methods. Large US firms are conventionally classified into industries by two schemes: the Standard Industrial Classification (SIC) and the North American Industry Classification System (NAICS) [35]. Both classifications were manually designed by experts and are updated by hand as industries evolve. In general, the NAICS classifies companies according to the processes by which they produce products, while the SIC classifies them according to the types of products they produce [35]. Given our present purpose of measuring diversity of products, this paper uses the SIC classification of firms when measuring the diversity of their products.
In a hierarchical classification tree like the 4-digit SIC classification scheme, individual firms i and j are leaves at the bottom of a 4-level branching tree structure.
For example, Figure 2 depicts the SIC hierarchical classification tree, including the division Mining, which ended the study period much larger than it started.
A simple gauge of the similarity of two firms is their distance from one another in the four-level SIC classification tree. We define the distance between firms i and j as the length of the shortest tree walk (sequence of adjacent nodes) between leaves i and j.
The number of sub-classes in the SIC classification tree varies significantly across the different nodes in the tree. To create more distance between firms classified under especially heavily branching nodes, we define the length of a walk as the number of SIC codes that fall under its highest node.
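The tree-walk distance just described can be sketched in a few lines. The helper names are our own, and the lowest common ancestor is approximated here by the longest shared digit prefix of the two 4-digit codes (a simplification: real SIC divisions are ranges of 2-digit major groups, not single leading digits):

```python
def lca_prefix(a, b):
    """Longest shared prefix of two 4-digit SIC codes, standing in for
    their lowest common ancestor in the classification tree."""
    k = 0
    while k < 4 and a[k] == b[k]:
        k += 1
    return a[:k]

def walk_length(a, b, all_codes):
    """Distance between firms with SIC codes a and b: the number of SIC
    codes that fall under the highest node on the shortest tree walk
    between them (0 when the codes are identical)."""
    if a == b:
        return 0
    root = lca_prefix(a, b)
    return sum(1 for c in all_codes if c.startswith(root))

codes = {"1011", "1021", "1311", "2011", "2086"}
print(walk_length("1011", "1021", codes))  # share prefix "10" -> 2 codes under it
print(walk_length("1011", "2011", codes))  # no shared prefix -> all 5 codes
```

Note how heavily branching nodes automatically yield larger distances: the more codes fall under the highest node of the walk, the longer the walk counts as being.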
The Standard Industrial Classification (SIC) tree has been carefully designed by human experts; it has passed the test of time and is widely used. We use it here to define a simple, trustworthy metric of firm similarity against which to compare more sophisticated alternatives. This firm similarity metric based on proximity in the SIC classification tree is a crude representation of the similarities of actual firms. For example, the SIC tree proximity metric assigns a perfect similarity to every pair of firms in the same SIC Code, and it assigns identical similarities to all pairs of firms connected through the same highest node. This metric has a perfectly simple and predictable form, consisting of a number of rectangular fields with absolutely uniform similarity (Figure 4).
Embedding firms in semantic vector spaces provides a much more sensitive and product-centric measure of firm similarity. Each individual firm has a unique location in the vector space, which yields a fine-grained measure of the similarity of each pair of firms. The current industry standard in precise firm similarity matrices for large US firms is a simple Boolean word-vector embedding of documents [16,17]. We construct and study an analogous Boolean word-vector model of product similarities, and we also construct and study two more sophisticated vector spaces. After confirming the plausibility of all of the models, we examine what they reveal about trends in the diversity of products of large US firms.

Semantic vector model-training corpus
In order to build the product vector space we use the Form 10-K, a document that must be filed with the SEC by any company with more than $10 million in assets whose securities are held by 2,000 or more owners. The 10-K filing "provides a comprehensive overview of the company's business and financial condition" [36]. Companies that file 10-K forms with the SEC are large US firms. Taken together, the 10-K corpus is a complete, accurate, standardized, publicly available annual description of the products produced by every large US firm, and it was used to train the current industry standard in quantitative firm similarity measurement [16,17].
We use the section of 10-K documents typically labeled "Part 1 Section 1: Business".
The Business section of a firm's 10-K describes significant products the firm offers to their customers, what markets the firm operates in, and any subsidiaries it owns [16,17].
If it exists, we exclude the part of the Business section typically labeled "Section 1A: Risk Factors," leaving only details relevant to offered products.
We obtain 10-K, 10-K405, and 10-KSB documents from 1993 through 2018 from the Software Repository for Accounting and Finance (SRAF) [37]. The 10-K documents do not all have one standardized format, and their heterogeneity makes it a challenge to extract exactly their Business sections. SRAF stage-one parsing removes various markup from the documents and removes tables. Figure 3 shows the number of unique companies which file for each year in our dataset (broken down by SIC division).
After obtaining the data we extract the desired section by way of a series of regular expressions designed to catch common formats, as well as a more flexible keyword-based program. In total, approximately 12% of documents cannot be parsed by either the regular-expression or the keyword approach. As Figure 5 illustrates, filing data for each company exists for only a subset of the years considered, but in general our programs are able to extract business sections from filings whenever the filings exist.
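A minimal sketch of the kind of regular expression involved in this extraction. The pattern and section boundaries below are illustrative assumptions, not the paper's actual parser; real filings are far more heterogeneous, which is exactly why a flexible keyword-based program is needed as a fallback:

```python
import re

# Capture the text between "Item 1 ... Business" and the next section
# heading ("Item 1A" or "Item 2"). Hypothetical, simplified pattern.
ITEM1 = re.compile(
    r"item\s+1\s*[.:\-]?\s*business(.*?)"  # start of Item 1: Business
    r"item\s+(?:1a|2)\s*[.:\-]?",          # stop at Item 1A or Item 2
    re.IGNORECASE | re.DOTALL,
)

def extract_business_section(text):
    """Return the Item 1 'Business' text, or None if no match."""
    m = ITEM1.search(text)
    return m.group(1).strip() if m else None

filing = (
    "Item 1. Business We make widgets and widget accessories. "
    "Item 1A. Risk Factors Widgets may fail."
)
print(extract_business_section(filing))
```

A filing with non-standard headings would return None here, which mirrors the roughly 12% of documents that neither approach can parse.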
We evaluated the success of our extraction by manually checking extracted business sections to ensure that they were complete and contained no extra text, and by reading through the unparseable documents to see whether any business-section information was lost by excluding those filings. Analysis of 50 randomly chosen extracted business sections revealed 49 of them to be correctly pulled from the corresponding 10-K forms; in the one errant filing, sections beyond the business section were included in the extracted text. Manual analysis of 100 randomly chosen unparseable filings found that 90 of them contained no business section at all, while the other 10 had either especially non-standard formatting, extremely short business sections of less than 1000 characters, or combined their business and properties sections into a single section, which made the relevant details hard to distinguish from the irrelevant ones. These analyses make us confident that we are building models on a dataset which is reasonably complete as well as textually relevant.
Once the appropriate sections are extracted they are preprocessed to include only nouns, as suggested by [17]. In addition, we convert all text to lower-case. To facilitate comparison with [17], we also remove from the training corpus any filings which lack Compustat Global Company Keys, which lack at least a year of lagged Compustat data, or which belong to financial firms (SIC Codes 6000-6999), again following [17]. While the notion of a product can be extended to include some of the things that are "produced" by some financial firms, many large US financial firms do not offer the consumer products on which our analysis focuses. This last step reduces the number of individual documents in our training corpus from 179,717 to 107,500.
The number of different CIKs in the 10-K documents filed each year is plotted in Figure 3. The Shannon entropy of the yearly distribution of firms across instantiated SIC classes reflects both the distribution's width (number of bins) and its evenness (similarity of counts across all bins). Since the number of bins (instantiated SIC classes) varies by more than 7% across the years we studied, it is also interesting to plot just the evenness of the distribution, which is shown by the normalized entropy $\hat{H}_D = H_D / \log \#_D$, where $H_D$ is the Shannon entropy and $\#_D$ is the number of bins (Figure 7, right).
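The normalized entropy can be sketched as follows; the toy histogram of firms per SIC class is invented for illustration:

```python
import math
from collections import Counter

def shannon_entropy(counts):
    """Shannon entropy (in nats) of a histogram of firms per SIC class."""
    n = sum(counts)
    probs = [c / n for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

def normalized_entropy(counts):
    """Evenness: H_D divided by its maximum value log(#_D), in [0, 1]."""
    bins = sum(1 for c in counts if c > 0)
    if bins <= 1:
        return 0.0
    return shannon_entropy(counts) / math.log(bins)

# Toy histogram: number of firms filing in each instantiated SIC class.
sic_counts = Counter({"7372": 40, "2834": 25, "3674": 20, "6021": 15})
print(round(normalized_entropy(sic_counts.values()), 3))  # -> 0.952
```

Because the normalization divides by the maximum possible entropy for the observed number of bins, it isolates evenness from the changing count of instantiated SIC classes.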
The distribution's entropy and its width and evenness all display decreasing trends.
The number of instantiated SIC Codes is a simple measure of the diversity of the products produced by large US firms, as are the normalized and non-normalized Shannon entropies of the distribution. But both product diversity metrics are crude, because they ignore the different degrees of similarity between different SIC classes.

Embedding product descriptions with models
The documents in the training corpus are used to train a firm-similarity model that contains a vector representation of the products of each firm. Specifically, for every document $p \in F$, the embedding function $f_e : p \mapsto v_p$ assigns a vector $v_p \in \mathbb{R}^d$. All of these vectors are normalized to unit length. Here we compare bag-of-words embeddings and neural network embeddings.

Bag-of-words embeddings
Bag-of-words models ignore the order of words in the training corpus and build vectors based just on the occurrence of the words. We study two different bag-of-words embeddings: Boolean and Term Frequency-Inverse Document Frequency (TF-IDF).
In the Boolean model, the vector for document $p$ is defined by $v_p[i] = 1$ if the word $\Sigma[i]$ occurs in $p$, and $v_p[i] = 0$ otherwise, for every word $\Sigma[i]$ in the dictionary. Following [17], a word is included in the dictionary only if it appears in less than 20 percent of the documents in the training corpus. Removing very common words is important, but any particular threshold, such as precisely 20 percent, is somewhat arbitrary.
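A minimal sketch of the Boolean embedding and the cosine similarity it induces. The tiny dictionary and documents are invented for illustration, and a real model would apply the 20-percent document-frequency cutoff when building the dictionary:

```python
import math

def boolean_embed(tokens, dictionary):
    """Boolean vector: v_p[i] = 1 iff dictionary word i occurs in the
    document; the vector is then normalized to unit length."""
    present = set(tokens)
    v = [1.0 if w in present else 0.0 for w in dictionary]
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cosine(u, v):
    """Dot product; for unit vectors this equals cosine similarity."""
    return sum(a * b for a, b in zip(u, v))

dictionary = ["widgets", "software", "cloud", "steel", "alloys"]
a = boolean_embed("we sell software and cloud services".split(), dictionary)
b = boolean_embed("cloud software for steel mills".split(), dictionary)
print(round(cosine(a, b), 3))  # -> 0.816
```

Because the vectors are pre-normalized, firm-pair similarity matrices can be built with plain dot products.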
A more principled method is to replace the Boolean information about a word with the word's term frequency-inverse document frequency (TF-IDF) statistic, a commonly used measure of the relevance of each word in a document drawn from a large corpus. In its standard form, the TF-IDF weight of word $t$ in document $p$ is $\mathrm{tf}(t,p) \cdot \log(N/\mathrm{df}(t))$, where $\mathrm{tf}(t,p)$ is the number of occurrences of $t$ in $p$, $N$ is the number of documents in the corpus, and $\mathrm{df}(t)$ is the number of documents containing $t$.
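The TF-IDF weighting can be sketched the same way. The toy corpus and the exact idf formula used here (log N/df, with no smoothing) are illustrative assumptions rather than the paper's exact variant:

```python
import math
from collections import Counter

def tfidf_embed(docs):
    """TF-IDF vectors over a small corpus: tf(t, p) * log(N / df(t)),
    then unit-normalized, matching the weighting described above."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    vocab = sorted(df)
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        v = [tf[w] * math.log(n / df[w]) for w in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vecs.append([x / norm for x in v])
    return vocab, vecs

docs = [
    "software cloud software".split(),
    "cloud steel".split(),
    "steel alloys steel".split(),
]
vocab, vecs = tfidf_embed(docs)
# "cloud" appears in 2 of 3 documents, so it receives a lower weight
# than document-specific words like "software" or "alloys".
```

Unlike the Boolean model, TF-IDF smoothly down-weights common words instead of cutting them off at a fixed document-frequency threshold.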

Neural embeddings
To obtain neural embeddings of firms in product space we use the Paragraph-Vector Distributed Memory (PV-DM) model of Le and Mikolov, which learns a dense vector for each document by training it, jointly with word vectors, to predict the words of the document from their surrounding context.

Methods of analysis
Before we use our models to make more sophisticated measurements of the diversity of the products, we first establish the plausibility of the embeddings of firms in semantic vector spaces produced by the Boolean, TF-IDF, and PV-DM models. We gauge model plausibility in two ways. One is to measure how much similarity the embeddings attribute to firms within the same industries, where the industries are identified by some trusted source. The other is to examine whether the micro-structure of the embeddings fits with human common-sense judgments of the similarity of well-known firms.

Industry specificity
Existing classifications such as the SIC consider firms in the same industries to be relatively similar, and firms in distinct industries to be much less similar. The SIC is constructed by domain experts and is widely used by researchers and government offices, so it is safe to assume that each industry defined by a 4-digit SIC Code contains firms that are rather similar, much more similar than firms with different SIC Codes. So, one way to assess the plausibility of the vector embeddings of documents by individual firms is simply to check whether the average similarity of pairs of documents from firms in the same SIC Code is much higher than the average similarity of firms with different SIC Codes. The ratio of these two averages we term the Industry Specificity (relative to the SIC) of the similarity matrices produced by a given model. (See Appendix S1 for precise definitions.)
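The ratio of averages can be sketched as follows, under the assumption of simple unweighted means over firm pairs (the paper's precise definition lives in its S1 Appendix); the firm names, similarities, and SIC assignments are made up:

```python
from itertools import combinations

def industry_specificity(firms, sim, sic):
    """Industry Specificity: mean similarity of same-SIC firm pairs
    divided by mean similarity of different-SIC pairs. `sim(i, j)` is
    any firm-pair similarity function."""
    same, diff = [], []
    for i, j in combinations(firms, 2):
        (same if sic[i] == sic[j] else diff).append(sim(i, j))
    return (sum(same) / len(same)) / (sum(diff) / len(diff))

# Toy data: two software firms (SIC 7372) and two pharma firms (2834).
sic = {"A": "7372", "B": "7372", "C": "2834", "D": "2834"}
toy = {frozenset("AB"): 0.9, frozenset("CD"): 0.8,
       frozenset("AC"): 0.2, frozenset("AD"): 0.1,
       frozenset("BC"): 0.2, frozenset("BD"): 0.3}
spec = industry_specificity("ABCD", lambda i, j: toy[frozenset({i, j})], sic)
print(round(spec, 2))  # within-mean 0.85 vs across-mean 0.20 -> 4.25
```

A specificity well above 1 indicates that a model's similarities track the SIC industries, which is the plausibility test applied to each embedding.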

Diversity
Diversity of products is often measured in economics simply as the number of different types of commodities (goods, products) available in a marketplace [10][11][12]. This approach is roughly analogous to the plot of the number of different SIC Codes exemplified each year by large US firms (Figure 7, left). Sometimes the distribution of types of commodities in a market is weighted in some way, such as by total sales, and diversity is then measured by something like the Shannon entropy of the distribution of products [8,9], an approach analogous to the Shannon entropy of the distribution of SIC Code instances shown in Figure 7 (right). This entropy measure is quite simple, but it is also rather crude: too crude, for example, to reflect the diversity of the firms within each SIC Code, or the "distance" between different SIC Codes within a given SIC Industry Group.
A more fine-grained approach is to measure the variance of the vectors in a product feature space by computing the number of dimensions needed to account for the bulk (here, 90%) of the variance of all of the individual firm vectors in each year. This measure has the virtue of being built out of the local details of the embedding of firms in a product space, and the results are relative to that product space. This measure is easily applied to documents that have been embedded in any product space of interest, and here we use the Boolean, TF-IDF, and PV-DM vector spaces.
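The dimension count can be sketched as follows, assuming the per-component variances (the PCA eigenvalues of a year's firm vectors) have already been computed; the eigenvalues below are made up:

```python
def dims_for_variance(eigenvalues, frac=0.90):
    """Number of principal components needed to capture `frac` of the
    total variance of the embedded firm vectors for one year."""
    total = sum(eigenvalues)
    running, k = 0.0, 0
    for lam in sorted(eigenvalues, reverse=True):
        running += lam
        k += 1
        if running >= frac * total:
            return k
    return len(eigenvalues)

# Toy spectrum: variance is concentrated in the first three components.
print(dims_for_variance([5.0, 3.0, 1.0, 0.5, 0.5]))  # -> 3
```

A year whose firm vectors need fewer dimensions to reach 90% of the variance is, by this measure, a year of less diverse products.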
An even more fine-grained measure of the diversity of the products produced by a set of firms comes from a generalized measure of diversity from theoretical ecology.
Once a classification with $s$ classes (as defined by four-digit SIC Codes) is obtained for a given year, let $a$ be the relative-abundance vector over those classes and $Z$ the matrix of pairwise class similarities. The similarity-sensitive diversity of order $q$ is then

$^{q}D(a, Z) = \left( \sum_{i=1}^{s} a_i \, (Za)_i^{\,q-1} \right)^{1/(1-q)}, \quad q \neq 1, \qquad (4)$

where $q \geq 0$ is a sensitivity parameter [41] that controls how much the diversity measure emphasizes common versus rare industries. When $q$ is small, $^{q}D(a, Z)$ gives as much importance to rare industries as common ones [41]; thus, $^{0}D(a, Z)$ is a measure of industry "richness" (the effective number of industries). By contrast, when $q$ is large, rare industries are de-emphasized and $^{q}D(a, Z)$ includes information about the evenness of industries.
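Equation (4) can be sketched directly, given a relative-abundance vector over SIC classes and a class-similarity matrix with unit diagonal; the two-class example is invented for illustration:

```python
def q_diversity(a, Z, q):
    """Similarity-sensitive diversity of order q (q != 1):
    qD(a, Z) = ( sum_i a_i * (Za)_i**(q-1) ) ** (1 / (1 - q)),
    where `a` is the relative-abundance vector over classes and Z the
    class-similarity matrix (Z[i][i] = 1)."""
    Za = [sum(Z[i][j] * a[j] for j in range(len(a))) for i in range(len(a))]
    s = sum(ai * Za[i] ** (q - 1) for i, ai in enumerate(a) if ai > 0)
    return s ** (1 / (1 - q))

# Two equally abundant, completely dissimilar classes: richness 0D = 2.
a = [0.5, 0.5]
Z = [[1.0, 0.0], [0.0, 1.0]]
print(q_diversity(a, Z, q=0))  # -> 2.0
```

With a fully similar pair of classes (all entries of Z equal to 1) the same abundances give a diversity of 1, showing how the measure discounts classes that are near-duplicates of one another.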

Model plausibility results
We examine the firm-pair similarity matrices produced by the Boolean, TF-IDF, and PV-DM models, and compare them for plausibility against the simple SIC model's similarity matrix (visible in Figure 4). Next, we test the plausibility of each model by seeing whether it puts similar firms in clusters, and whether it gives especially high similarity to pairs of firms with the same SIC Codes, using (documents about the products of) the twenty-five firms examined in our micro-analysis below.

Micro-analysis of clusters
We gauge the proximity of embedded documents in the high-dimensional vector spaces by projecting them into two dimensions with t-SNE. Inspection of the t-SNEs in Figure 9 confirms that the Boolean, TF-IDF, and PV-DM models all place intuitively similar firms near one another, so all three pass this additional test of common sense. For ease of identification, the first three of these groups of firms are circled red, yellow and green.
This micro-analysis of the details of the embeddings of firms in the DJIA adds weight to the general plausibility of all three document embeddings studied here. The Boolean, TF-IDF, and PV-DM models all demonstrate a significant degree of common-sense realism and plausibility.

SIC Industry Specificity results
The SIC Industry Specificity of each model (Boolean, TF-IDF, and PV-DM) is indicated in Figure 10.

Table 1. Correlation coefficients of diversity $^{q}D$ with year and significance levels (**: p-value ≤ 0.05, ***: p-value ≤ 0.01).
In order to understand how the diversity trends vary with the degree of sensitivity to rare industries, annual $^{q}D$ values are calculated for q ∈ {0, 2, 5}. These diversity values reflect not just the abundances of different SIC classes but also how similar the classes are to each other. The scatterplots of annual diversity values and linear regression fits for q = 0 and the three models of interest are shown in Figure 12, while the Pearson correlation coefficients for all the tested sensitivities are shown in Table 1. Table 1 and Figure 12 show that for q = 0, all three models show statistically significant decreasing trends in diversity: all three agree that the richness of products is decreasing over the years. In other words, the trend of dropping product richness in the (descriptions of) products of large US firms is a consensus conclusion of q = 0 diversity measurements with the Boolean, TF-IDF and PV-DM models. Similarly, all three text-based models show statistically significant decreasing trends in $^{2}D$, which places less emphasis on rare industries and is equivalent to a commonly used diversity measure in ecology known as Rao's quadratic entropy [41]. Finally, further increasing q to 5 continues the pattern: the Boolean, TF-IDF and PV-DM models all show statistically significant decreasing trends in diversity. The upshot of the diversity correlation coefficients is that all models show statistically significant patterns of dropping diversity across the different sensitivity values. As with Shannon entropy, to remove the effect of the decreasing number of SIC Codes on $^{q}D$, the metric can be normalized as described in S2 Appendix. The normalized $^{0}D$ also shows a decreasing trend in the Boolean and TF-IDF models, and no significant trend with PV-DM.
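The trend tests summarized in Table 1 rest on Pearson correlations of annual diversity values with the year. A sketch with made-up diversity values (not the paper's data):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series
    (here: calendar year vs. annual diversity qD)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative (invented) annual diversity values with a downward drift.
years = list(range(1997, 2005))
divs = [10.2, 10.0, 9.7, 9.9, 9.4, 9.3, 9.0, 8.8]
print(round(pearson_r(years, divs), 3))
```

A strongly negative coefficient with a small p-value (computed, e.g., from the t-distribution of r) is what the starred entries in Table 1 report.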

Conclusions and Discussion
This paper presents a wealth of evidence for a significant drop in the diversity of the products produced by large US firms in this century. This downward trend is evident whether diversity is measured in crude or sophisticated ways, and whether the information about the products of individual firms is coarse- or fine-grained. This trend can be seen using a Boolean word vector model, the current industry standard in product-focused firm embeddings due to Hoberg and Phillips [17], and the trend can be seen using more sophisticated TF-IDF and PV-DM models. It remains an open question how to explain the dropping diversity trend. We noted earlier an overall drop in the number of firms over the same years (recall Figure 3), and this drop in the number of firms might be thought to explain the drop in diversity of products.
Further, it is known that since the 1990s market concentration has been occurring as fewer firms take up more market share in their industries [43,44]. However, we still observed the diversity drop when we measured diversity using normalized abundance vectors, so the drop in the number of firms is unlikely by itself to explain the observed trend in dropping product diversity.
A second, quite different hypothesis is document homogenization, which proposes that the decreasing diversity of the descriptions of firms' products is due merely to an increasing professionalization and standardization of the text in 10-K filings. However, this hypothesis does not explain the dropping diversity found by the purely SIC-based measures, which make no use of the 10-K text (recall Figure 7); nor does it explain the roughly 50% increase in the average number of word tokens and word types in each document in the training corpus (recall Figure 6). So the dropping diversity seen using text-based models is unlikely to be due specifically to document homogenization.
A number of further hypotheses could explain the dropping product diversity trend.
One is the hypothesis that products have shrunk in diversity because consumer demand for products has narrowed. Another hypothesis is that the growing diffusion of information technology into more and more products is making products overall more alike. A third hypothesis is that the drop in product diversity is due to the rise of outsourcing by large US firms, and a consequent rise in the diversity of products produced outside the United States. A fourth hypothesis would connect the drop in diversity of the products of large firms with a rise in the diversity of products produced by small firms. We have no specific evidence for or against any of these hypotheses, but all of them have empirically testable consequences. However, gathering accurate and complete data about the products of firms of most sizes in most countries remains a huge hurdle.
One final hypothesis worth considering is that the trend of falling product diversity is explained by an increasing diversity of products within large US firms. On this hypothesis, the total diversity of products in the marketplace may be stable or growing, because individual large US firms on average have been producing an increasingly diverse array of products. The diversity of products produced by some individual firms has been studied, and some have grown more diverse over time. When we measure the diversity of the products produced by large US firms, the products of each firm are embedded as a single point in a high-dimensional product space, and we measure the diversity of those points in product space. So, those measurements reflect the diversity between firms rather than the diversity within any individual firm, and they leave this final hypothesis open to future investigation.