The Government Finance Database: A Common Resource for Quantitative Research in Public Financial Analysis

Quantitative public financial management research focused on local governments is limited by the absence of a common database for empirical analysis. While the U.S. Census Bureau distributes government finance data that some scholars have utilized, the arduous process of collecting, interpreting, and organizing the data has made its use costly and inconsistent. In this article we offer a single, coherent resource that contains all of the government financial data from 1967-2012, uses easy-to-understand natural-language variable names, and will be extended when new data become available.


Introduction
Widely shared and easy-to-use databases facilitate quantitative research and render the replication of findings practical and convenient [1]. Indeed, much of what we know about public finance has been tested against large microdata sets; in the United States, these are primarily merged information files based on household-level data from the IRS Individual Public-Use Tax Files, the Current Population Survey, the Consumer Expenditure Survey, and the triennial Survey of Consumer Finances. Unfortunately, students of public financial management at the local government level must often rely on one-off, custom-built datasets to pursue their inquiries, which is costly, inimical to replication, and leaves practitioners uncertain about the utility of academic insights.
For someone from outside the field of public financial management the lack of widely used and consistently applied data might seem an unlikely obstacle. After all, scholars of public financial management have access to a database that is in many respects ideally suited to their needs. The U.S. Census Bureau has surveyed state and local governments annually since 1967, and, as the Director of the U.S. Census Bureau stated in a letter accompanying the 2013 request for financial information: "This survey is the only comprehensive source of information on the finances of local governments in the United States." Many examples of research using these data exist, including recent papers by Gore [2], Baber and Gore [3], Kido et al. [4], Murray et al. [5], Carroll [6], Mullins [7], and Fisher and understand patterns in government finance as well as important advice for other researchers working with the database.

Issues with Data Availability and Coding
The basic unit of reporting in government accounting is the fund, essentially a separate bucket of financial resources tasked with accomplishing some objective [21]. While more recent accounting guidance mandates that some government-wide information be reported in addition to fund level reports [22], the census extends this reference frame by consolidating information across funds and presenting all of its data on a government-wide basis. This approach is broadly beneficial for studies that seek to understand something about governments as separate financial entities, and better conforms to the way that citizens and financial intermediaries (as opposed to governmental managers charged with oversight) use government accounting data [23].
While many government funds report information on a modified-accrual basis, some are required to report using a cash basis, a modified cash basis, or (rarely) a full accrual basis [24]. This diversity of reporting practices within and across governments presents some difficulty to anyone attempting to present or utilize government financial data in a consistent manner. Given that a transformation of data between the different accounting treatments is not possible, the census adopts the accounting basis declared by each government fund "so long as that basis (1) conforms to generally accepted accounting procedures and (2) is applied consistently from year-to-year." In practice this means that the data are best conceptualized as roughly equivalent to cash flows, even though they will not always represent actual cash flows during the periods reported.
The data are reported in thousands of nominal dollars, unadjusted for changes in prices or wages over time, allowing researchers to choose whether and how best to convert the information into real dollars. The time period represented in the data is 1967-2012; however, the number of governments included varies significantly from year to year. The primary source of this variation is the fact that the Census Bureau collects financial data from governments in two separate but related efforts. During years ending in a 2 or 7 the Census Bureau collects a census (essentially a population) of government financial statistics in the "Census of Government Finance and Employment Data". In every year when a census is not being conducted, a sample of governments reports data through the "Annual Survey of State & Local Government Finances".
The data include federal (type 6), state (type 0), county (type 1), municipal (type 2), township (type 3), special district (type 4), and school district data (type 5), each of which can be isolated by filtering the data on the "Type_Code" variable. While every state is included in the sample every year, the coverage for other government types is less complete. Fig 1 shows the number of governments of each type that are included in the data each year and highlights several important insights into the coverage of the data over time. Reporting rates are uniformly high during years when a full census was conducted. In addition, school districts report at much higher rates than other governments, but show a large reduction in reporting during the years between 1993 and 1996. Closer examination of the data for other government types shows a similar (but less visually pronounced) reduction in coverage during those years.
In 1993 the census began sampling a smaller, but still significant, portion of all government types. Because of its work with the National Center for Education Statistics, the census was able to resume nearly complete coverage of school districts following the 1997 census, but the other government types were never again sampled at the levels seen in the late 1980s.
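As an illustration of isolating one government type with the "Type_Code" variable, the following is a minimal Python sketch rather than part of the database tooling. The rows and the "Name" and "Year" column names are hypothetical; only "Type_Code" and the type numbering come from the text above.

```python
# Type numbering as listed in the text above.
GOVERNMENT_TYPES = {
    0: "state", 1: "county", 2: "municipal", 3: "township",
    4: "special district", 5: "school district", 6: "federal",
}


def select_type(rows, type_code):
    """Return only the rows whose Type_Code matches the requested type."""
    return [row for row in rows if int(row["Type_Code"]) == type_code]


# Hypothetical rows; a real row would also carry the financial variables.
sample_rows = [
    {"Name": "Example State", "Year": 2012, "Type_Code": 0},
    {"Name": "Example County", "Year": 2012, "Type_Code": 1},
    {"Name": "Example City", "Year": 2012, "Type_Code": 2},
    {"Name": "Example Town", "Year": 2011, "Type_Code": 2},
]

municipalities = select_type(sample_rows, 2)
print(GOVERNMENT_TYPES[2], len(municipalities))
```

Counting the selected rows by year for each type would give the kind of coverage series plotted in Fig 1.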
One of the largest hurdles in the process of organizing the government financial data as a single, coherent database is learning to interpret the codes used by the census to identify what each data point represents. The database we present replaces these codes with natural language variable names borrowed from the census' classification manual; however, understanding the codes that the census uses internally will help readers to validate, interpret, and apply our work.
Each census code combines an "object code" with a "function code". Object codes are one character long and represent large categories or types of data. For instance, the object code T is used for all tax revenues. Function codes are two-digit numbers that indicate what the funds in question were used for. Combining an object code, such as A, for current charges, with a function code, such as 12, for elementary and secondary education, results in a pointer to a particular variable, in this case A12: current charges from elementary and secondary education. Function codes are not applied consistently across the entire data set, but are still useful for understanding data within large sections of it. For instance, the function code 01 represents property taxes whenever it is used with object code T, but represents air transportation with every expenditure object code.
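The object-plus-function structure described above can be sketched in Python as follows. This is an illustrative sketch that assumes every code follows the one-letter, two-digit pattern described in the text; codes that do not fit raise an error rather than being guessed at. The function names are hypothetical, and the lookup table is deliberately partial, containing only the two examples given above.

```python
def split_census_code(code):
    """Split a census code such as 'A12' into (object_code, function_code)."""
    code = code.strip().upper()
    if len(code) != 3 or not code[0].isalpha() or not code[1:].isdigit():
        raise ValueError(f"unexpected census code: {code!r}")
    return code[0], code[1:]


# Partial table holding only the examples given in the text.
KNOWN_CODES = {
    ("T", "01"): "property taxes",
    ("A", "12"): "current charges from elementary and secondary education",
}


def describe(code):
    """Look up a code by its (object, function) pair, not by function alone."""
    return KNOWN_CODES.get(split_census_code(code), "not in this partial table")


print(split_census_code("A12"))
print(describe("T01"))
```

Keying the lookup on the pair, rather than on the function code alone, reflects the point that a function code such as 01 means different things under different object codes.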

Creating a Single, Coherent Database
The government financial data comes in two forms. Data from 1967 through 2007 is more or less organized in the manner that researchers expect from panel data. The files are divided by year. Each row of each file corresponds to one government. There are several columns for identifying information and a column for each financial variable. These columns are all labeled with natural language names that make it easy to understand what they represent. One wrinkle arises from the fact that data for this period is always provided in three separate text files each year. Each file contains a row for every government and some identifying information, but the three files contain different subsets of the financial information available.
Overcoming this challenge is straightforward. Given the consistent naming scheme used by the census for these years we simply merge the three data files each year so that all of the columns are available in one large matrix. We then loop through the years available and continue aggregating the data into what we call the "early database".
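The merge-then-stack procedure described above can be sketched in Python as follows. This is not the SAS code we employed (which is in S1 File); it is a minimal illustration in which each file is assumed to be already parsed into a list of dictionaries, and the "ID" key and the financial column names are hypothetical.

```python
def merge_year(files, key="ID"):
    """Combine the files for one year into a single row per government."""
    merged = {}
    for rows in files:
        for row in rows:
            # Rows that share an ID contribute their columns to one record.
            merged.setdefault(row[key], {}).update(row)
    return merged


def build_early_database(files_by_year, key="ID"):
    """Loop over the years and stack the merged rows into one long table."""
    database = []
    for year in sorted(files_by_year):
        merged = merge_year(files_by_year[year], key)
        for gov_id in sorted(merged):
            database.append({"Year": year, **merged[gov_id]})
    return database


# Hypothetical input: three files per year, each holding different columns.
files_by_year = {
    1972: [
        [{"ID": "001", "Name": "Example County", "Total Revenue": 500}],
        [{"ID": "001", "Name": "Example County", "Total Expenditure": 450}],
        [{"ID": "001", "Name": "Example County", "Total Debt Outstanding": 120}],
    ],
    1967: [
        [{"ID": "001", "Name": "Example County", "Total Revenue": 300}],
        [{"ID": "001", "Name": "Example County", "Total Expenditure": 280}],
        [{"ID": "001", "Name": "Example County", "Total Debt Outstanding": 90}],
    ],
}

early_database = build_early_database(files_by_year)
print(len(early_database), early_database[0]["Year"])
```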
The newer data presents a much more substantial challenge, and the process of consolidating it with the early database to create one source of data is our main contribution. Data after 2007 is organized into two files per year. The first file is a fixed width text file called the "Individual Unit File". On each row of this file there is one government ID number, one census data code, one number representing data, the year of the data, and a character that encodes something about how the data was gathered.
This organization presents the first major hurdle to merging the recent data with the early database, since each row of the individual unit file holds data that must comprise one cell in the final matrix. For this reason the individual unit file is transposed so that there is one row per government and one column for each census data code.
The second file the Census provides contains identifying information for every government in that year's data, and is organized by government ID code. This "Government ID" file has the name of each government, population figures, and several other pieces of identifying information. Once the individual unit files are transposed they are merged with this identifying information to create the "recent database".
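The transposition of the individual unit file and its merge with the identifying information can be sketched in Python as follows. This is an illustrative sketch, not our SAS implementation: it starts from records that are assumed to be already parsed out of the fixed width file (the column positions are not reproduced here), it omits the data-gathering flag, and the "ID", "Name", and "Population" keys are hypothetical.

```python
def pivot_unit_file(records):
    """Turn (gov_id, code, amount, year) records into one entry per government-year."""
    wide = {}
    for gov_id, code, amount, year in records:
        # Each long-format record becomes one cell in the wide layout.
        wide.setdefault((gov_id, year), {})[code] = amount
    return wide


def attach_identifiers(wide, id_file):
    """Merge the transposed data with the identifying information."""
    rows = []
    for (gov_id, year), values in sorted(wide.items()):
        rows.append({"ID": gov_id, "Year": year, **id_file.get(gov_id, {}), **values})
    return rows


# Hypothetical parsed records and identifying information.
records = [
    ("001", "T01", 250, 2012),
    ("001", "A12", 40, 2012),
    ("002", "T01", 90, 2012),
]
id_file = {
    "001": {"Name": "Example City", "Population": 12000},
    "002": {"Name": "Example Town", "Population": 3000},
}

recent_database = attach_identifiers(pivot_unit_file(records), id_file)
print(recent_database[0])
```

A code that a government did not report is simply absent from its row here; a full implementation would need to decide how to represent such missing cells.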
The second major hurdle presented by the more recent government financial data is the fact that the data are not encoded with natural language variable names the way that the early data is. This needs to be fixed, and so the final step in our data consolidation process is a mapping of each of the census codes onto the variable names used in the early database. The Census provides some resources to facilitate the process, including a user's guide to the early data and classification manuals describing the recent data, but the process is still time consuming and meticulous in a way that likely deters other researchers from incorporating the recent data into their studies.
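The recoding step can be pictured as a lookup from census codes to variable names. The Python sketch below is only an illustration of that idea: the two variable names in the mapping are hypothetical placeholders, and the actual mapping we used is the one distributed in S1 File.

```python
# Hypothetical placeholder names; the real mapping is provided in S1 File.
CODE_TO_NAME = {
    "T01": "Property Tax",
    "A12": "Elem Educ Charges",
}

ID_COLUMNS = ("ID", "Year", "Name", "Population")


def recode_row(row, mapping, id_columns=ID_COLUMNS):
    """Rename census-code columns and report any code that has no mapping."""
    recoded, unmapped = {}, []
    for column, value in row.items():
        if column in id_columns:
            recoded[column] = value
        elif column in mapping:
            recoded[mapping[column]] = value
        else:
            # Surface unknown codes instead of silently dropping them.
            unmapped.append(column)
    return recoded, unmapped


row = {"ID": "001", "Year": 2012, "T01": 250, "A12": 40, "Z99": 7}
recoded, unmapped = recode_row(row, CODE_TO_NAME)
print(recoded, unmapped)
```

Returning the unmapped codes alongside the recoded row makes it easier to check that no census code has been overlooked.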
In the end we take the recoded recent data and merge it with the early data to present a single coherent database of government financial data between 1967 and 2012. Specific instructions for replicating our consolidation, the SAS code we employed, and a mapping of data codes to variable names are available in S1 File and are also included with the database when you download it as one of our supporting information files (S2, S3, S4, S5, S6 and S7 Files), each of which contains data for one government type. A high-level view of the process of organizing and consolidating all of the government financial data is shown in Fig 2.

Categories of Financial Data
At a high level the data for each government are grouped into four large categories: revenue flows, expenditure flows, cash and investment positions, and debt positions.
Revenue. The revenue data are organized by sector into general revenue, utility revenue, liquor store revenue, and social insurance trust revenue. Each of these sectors is comprised of a number of smaller subcategories, as shown in Table 1.
The revenue data within each subcategory are further broken down in order to identify more specific sources of funds. Tax revenues have the largest number of subcategories in the data. Table 2 summarizes the organization of tax revenue subcategories.
Intergovernmental revenue data is first separated based on its source (from the federal, state, or local government) as shown in Table 1. Within each of these sources intergovernmental revenue is categorized by its intended use. Table 3 displays this structure.
The precise application of each of these categories changes somewhat based on the source of the intergovernmental revenue. For instance, federally sourced intergovernmental revenue for public welfare includes programs such as TANF (Temporary Assistance for Needy Families) and Medicaid, whereas state sourced intergovernmental revenue for public welfare includes passthrough of these programs, as well as revenue arising from state specific programs. An exhaustive documentation of what each variable contains and excludes is available in the census classification manual included in the supporting information files (S2, S3, S4, S5, S6, or S7 Files).
Current charges are amounts that the government collects from individuals and corporations in exchange for providing services. They are reported in gross amounts, ignoring any cost of service. Liquor stores and utilities are excluded from current charges and given their own category of revenue in order to distinguish them from general revenue. Charges are separated based on the type of service provided as shown in Table 4.
Liquor store and utility revenue are not disaggregated to the extent that the other revenue data is. Total liquor store revenue is reported, and utility revenue is broken into revenue from each of the four types of utilities: water, electricity, gas, and mass transit. Several categories of general revenue are listed under miscellaneous general revenue. Their organization is shown in Table 5.
The last category of revenue is revenue from social insurance trusts. Insurance trust revenue is separated into retirement plan revenue and unemployment revenue, and several smaller partitions of both are reported as shown in Table 6.
Expenditure. Expenditures are organized according to their category and function. The category of each expenditure refers to how the cash was used, while the function of the expenditure refers to the type of service it was used to accomplish. In general every expenditure variable is a combination of one category and one function, following the logic of the census codes. For instance, "Air Transportation Capital Outlay" is in the capital outlay category and was used for the air transportation function. Table 7 shows the different categories of expenditures that are recorded in the data. Total expenditures are the sum of direct expenditures and intergovernmental expenditures. Direct expenditures can further be broken down into current expenditures used to pay employees, purchase supplies and hire contractors; construction expenditures used to build long term assets; and expenditures used to purchase (rather than build) long term assets. Capital outlay expenditures are the sum of construction and purchase expenditures.
Intergovernmental expenditures are defined by the census as "amounts paid to other governments for performance of specific functions or for general financial support." They are included in total expenditure, and are separated based on whether the funds went to state governments or local ones.
In a very small number of instances assistance, subsidies, and interest on debt are added to direct expenditures and total expenditures. Assistance and subsidies are coded by the census as object J, and occur four times in the data: state government scholarships (J19), federal categorical assistance programs (J67), other cash assistance programs (J68), and federal and state veterans' services (J85). Interest on debt is coded by the census as object I and occurs five times: interest on general debt (I89), and interest on debt for the four classes of utilities (I91, I92, I93, and I94). When this occurs the data always include a separate line item reporting the amount of assistance, subsidies, or interest, allowing researchers to correct for their inclusion in direct expenditure if necessary. Expenditures are also separated by function within the database. Table 8 shows the various expenditure functions considered in the data and the census function codes that correspond to them. Some of the expenditure functions recorded by the census only exist at the federal level and have been excluded from the database otherwise. Other codes exist in the newest census data but do not exist for years prior to 2007 and have been consolidated into their earlier versions to create a more coherent database.
Cash and Investment Positions. Several of the cash and investment positions of each government are recorded in the data. Other current and long term assets, such as those recorded on a typical statement of net position, are not included by the census. A summary of these variables is shown in Table 9.
It may seem like an odd choice to report state and local government securities twice given the explicit division between governmental and non-governmental securities in the data, but given that defaults by state and local governments are more likely than federal defaults ([25] and many others), including state and local government bonds with non-federal securities is reasonable. For situations where this combination is unwanted, state and local government securities can be subtracted out of the non-governmental securities variable.
Debt Positions. Debt statistics were significantly simplified following the 2005 redesign of the Census' government finance statistics program. Prior to this simplification data on debt were separated based on whether the debt was issued with the backing of the full faith and credit of the government in question, whether it was not guaranteed, or whether the guarantee was unspecified. Within each of those categories the debt was broken out by function: debt to be used for each of the four utilities (water, electric, gas, and transit), general use debt, elementary and secondary education debt, or higher education debt. Measures of debt outstanding, debt issued, and debt retired were recorded for each of these guarantees and functions.
Debt outstanding, issued, and retired are still reported variables, but the distinctions between the guarantee levels and functions of debt have been removed. Instead, debt variables are disaggregated into public debt for private purposes, and debt for all other general purposes.
Because the data prior to 2005 have substantial additional detail, the government finance database keeps all of the potential categories of debt, even though many of these values are missing for the most recent years. Any research using the finer-grained debt data should therefore be restricted to years prior to 2005, but the larger debt categories are comparable over the entire timespan of the data.

Implications of Unbalanced Panel Data
The government finance database is an unbalanced panel dataset because the annual samples vary in size, and so any analysis of the data should be informed by traditional approaches to working with unbalanced panels, such as fixed effects models [26]. However, a deeper understanding of how the sample varies over time can provide us with advantages over the simple application of statistical tools, by guiding future research designs and by helping to interpret results. One particularly striking finding from a high level analysis of the data is that smaller (larger) sample sizes indicate that the sample is skewed towards larger (smaller) governments. The graphs in Figs 3 and 4 show a clear, inverse relationship for counties (r = -0.93) and municipalities (r = -0.78) between the number of governments sampled each year and the median population of the governments in the sample, providing strong evidence that larger municipalities are more likely to be sampled during non-census years. The two government types that are exceptions to this pattern are states and school districts, because of the uniformly high reporting rates for both. This relationship indicates several actionable steps, beyond the straightforward advice to apply year fixed effects, for quantitative research using this data. First, considering government size in your research design will be essential. Directly controlling for size, or being able to make a plausible argument for why size is not important for the question being investigated is an important bar for studies using this database to clear. If such controls or arguments are missing, academics and policy makers should be very wary of generalizing their results.
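The relationship between yearly sample size and median population reported above can be checked with a short calculation. The Python sketch below is an illustration on made-up rows, not a reproduction of Figs 3 and 4; the "Year" and "Population" column names are assumptions, and the Pearson correlation is written out by hand to keep the example self-contained.

```python
from math import sqrt
from statistics import median


def yearly_size_and_median(rows):
    """Return the years, the sample size per year, and the median population per year."""
    by_year = {}
    for row in rows:
        by_year.setdefault(row["Year"], []).append(row["Population"])
    years = sorted(by_year)
    sizes = [len(by_year[year]) for year in years]
    medians = [median(by_year[year]) for year in years]
    return years, sizes, medians


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up rows: the smaller samples keep only the larger governments.
populations_by_year = {
    1987: [1000, 5000, 20000, 100000],
    1988: [20000, 100000],
    1989: [5000, 20000, 100000],
}
rows = [
    {"Year": year, "Population": population}
    for year, populations in populations_by_year.items()
    for population in populations
]

years, sizes, medians = yearly_size_and_median(rows)
r = pearson(sizes, medians)
print(years, sizes, medians, round(r, 2))
```

Running the same calculation separately for each government type would show which types follow the inverse pattern and which, like states and school districts, do not.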
Second, your research focus may inform the data cleaning and selection process in novel ways. For instance, studies that aim to identify long term financial trends across all governments may want to only use the data from years ending in a 2 or a 7, because that will ensure that every measure they calculate is representative of a population of governments. Some of the time series we graph later in this paper will clearly show the impact that ignoring this advice can have.
On the other hand, studies that include or truncate data based on the population of each observation may claim to be including all of the data, but are actually removing much of what is available and are prejudicing their sample towards including more observations during the most recent years. While this type of data cleaning is often implemented without much thought in other fields, reviewers of work using the government finance database should ask authors to justify (or test to ensure) that the choice to include only governments with a certain population does not bias the results of the study.
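Restricting an analysis to census years, as suggested above for studies of long term trends, is a simple filter on the year. The Python sketch below assumes a "Year" column holding a four-digit integer; the function names are hypothetical.

```python
def is_census_year(year):
    """Census years are those ending in a 2 or a 7."""
    return year % 10 in (2, 7)


def census_years_only(rows):
    """Keep only observations collected during a full census."""
    return [row for row in rows if is_census_year(row["Year"])]


census_years = [year for year in range(1967, 2013) if is_census_year(year)]
print(census_years)
```

Any population-based inclusion rule can be compared against this census-year subset to see how strongly it shifts the composition of the sample.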
While the number of governments sampled in any given year varies considerably, impacting the median population of the sample, it has long been understood that city populations follow a power law, or Pareto distribution ([27] or [28]), and thus it is reasonable to ask whether years with a small number of governments might nonetheless cover a large fraction of the population. Figs 5 and 6 take advantage of the fact that all states report data for every year to calculate the percentage of the total population covered by various government types each year.
What these figures show is that even though the samples are skewed towards governments with the largest populations, and so are not representative of all cities or all counties, they do capture a sizeable portion of the overall population in both cases. An important implication of this is that studies which use the government finance database to measure the overall economic force of a particular category of government cash flow are likely to come very close to an accurate estimate, even during small sample years. Per capita numbers, which are easy to compute in the database, will often be a reasonable tool to use given that the data cover so much of the population. These measures will still be weighted towards representing people living in the largest governments during years when the sample is the smallest, but most people live in places with large populations and so per capita measures will be broadly representative.
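The coverage calculation behind this comparison divides the population of the sampled governments of one type by the total population of the states in the same year. The Python sketch below illustrates the arithmetic on made-up rows; the "Year" and "Population" column names are assumptions, and the sketch is not the code used to produce Figs 5 and 6.

```python
def coverage_by_year(rows, type_code):
    """Share of the total state population covered by one government type each year."""
    state_totals, covered = {}, {}
    for row in rows:
        year, population = row["Year"], row["Population"]
        if row["Type_Code"] == 0:
            # States report every year, so their sum serves as the denominator.
            state_totals[year] = state_totals.get(year, 0) + population
        elif row["Type_Code"] == type_code:
            covered[year] = covered.get(year, 0) + population
    return {
        year: covered.get(year, 0) / total
        for year, total in state_totals.items()
        if total
    }


# Made-up rows for a single year.
rows = [
    {"Year": 2012, "Type_Code": 0, "Population": 1000},
    {"Year": 2012, "Type_Code": 2, "Population": 300},
    {"Year": 2012, "Type_Code": 2, "Population": 200},
    {"Year": 2012, "Type_Code": 1, "Population": 600},
]

municipal_coverage = coverage_by_year(rows, 2)
print(municipal_coverage)
```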
There are likely several reasons why none of the years reach 100% coverage for population. One is that Connecticut and Rhode Island do not report county data, even though they both have counties. In addition, the District of Columbia is coded as a state and not as a local government. In practice, however, these reasons do not account for much of the gap. Other potential explanations include systematic non-reporting from less obvious sources, or the possibility that state population estimates are updated more often than those of other governments and so display growth sooner.
Another important consideration in working with unbalanced panel data is that requiring a long, uninterrupted time series of observations will limit the generality of your results. More specifically in this case, depending on the type of government being researched, requiring consecutive observations is likely to bias your sample towards including larger governments and data measured during the years in the late 1980s when the samples were larger. The size of this effect depends on the number of consecutive observations your research design requires, so even small differences in such requirements have the potential to sizably impact your findings. Table 10 shows the number of observations that have consecutive data of a given length, and Table 11 shows how average population changes in those samples. Fig 7 graphs these sample sizes as a proportion of all of the available data for each government type.
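To see how a requirement for consecutive observations thins the sample, one can compute the longest unbroken run of years available for each government. The Python sketch below is a minimal illustration on made-up rows; the "ID" and "Year" column names and both function names are hypothetical, and it is not the procedure used to build Tables 10 and 11.

```python
def longest_run(years):
    """Length of the longest stretch of consecutive years in a collection."""
    best = current = 0
    previous = None
    for year in sorted(set(years)):
        # Extend the run when the year follows directly, otherwise restart it.
        current = current + 1 if previous is not None and year == previous + 1 else 1
        best = max(best, current)
        previous = year
    return best


def governments_with_run(rows, min_length):
    """IDs of governments observed for at least min_length consecutive years."""
    years_by_id = {}
    for row in rows:
        years_by_id.setdefault(row["ID"], []).append(row["Year"])
    return {gov_id for gov_id, years in years_by_id.items()
            if longest_run(years) >= min_length}


# Made-up rows: one government reports annually, the other only in census years.
rows = (
    [{"ID": "big", "Year": year} for year in (1987, 1988, 1989, 1992, 1993)]
    + [{"ID": "small", "Year": year} for year in (1987, 1992, 1997)]
)

print(longest_run([1987, 1988, 1989, 1992, 1993]))
print(governments_with_run(rows, 2))
```

Comparing the populations of the governments that pass and fail a given run-length threshold is one way to gauge the bias discussed above.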
These points also suggest some practical considerations that arise when using unbalanced panel data in less academic settings. For instance, if you are interested in discovering something about the revenues or expenses of a particular local government you are not likely to find a complete time pattern of behavior in this database. In fact, unless the government you are interested in serves a particularly large population you may find that data only exist once every five years. Individuals who want to dig deeply into the finances of a particular local government are likely to have much better luck asking for financial records directly from the local government they are interested in.

Results
In the process of organizing and cleaning the data we were struck by its wide applicability to many different areas of public administration. In this section we present a number of simple analyses that illustrate both the flexibility and usability of the government finance database. Fig 8 shows a time series of the average number of tax revenue sources computed for both municipalities and school districts. These data were constructed by adding an indicator variable to the database for each type of tax revenue. The indicator was coded as a 0 whenever the total amount of that tax was either missing or equal to zero, and was coded as a 1 otherwise. The indicators for property taxes, sales taxes, income taxes, license taxes, and other taxes were then summed and the average was calculated, by year, for each government type.
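The indicator construction described above can be sketched in Python as follows. The tax column names are hypothetical stand-ins for the variable names in the database, and missing values are assumed to appear either as None or as NaN.

```python
# Hypothetical column names standing in for the five tax totals.
TAX_COLUMNS = ["Property Tax", "Sales Tax", "Income Tax", "License Tax", "Other Tax"]


def has_revenue(value):
    """Indicator: 0 when the amount is missing (None or NaN) or zero, 1 otherwise."""
    return int(value is not None and value == value and value != 0)


def count_tax_sources(row):
    """Sum the five indicators for one government-year."""
    return sum(has_revenue(row.get(column)) for column in TAX_COLUMNS)


def average_sources(rows):
    """Average number of tax sources by (year, government type)."""
    counts = {}
    for row in rows:
        key = (row["Year"], row["Type_Code"])
        counts.setdefault(key, []).append(count_tax_sources(row))
    return {key: sum(values) / len(values) for key, values in counts.items()}


# Made-up rows: two municipalities (type 2) and one school district (type 5).
rows = [
    {"Year": 2012, "Type_Code": 2, "Property Tax": 100, "Sales Tax": 50,
     "Income Tax": 0, "License Tax": None},
    {"Year": 2012, "Type_Code": 2, "Property Tax": 10},
    {"Year": 2012, "Type_Code": 5, "Property Tax": 70, "Other Tax": float("nan")},
]

averages = average_sources(rows)
print(averages)
```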
The results show two interesting features. The first is a quantitative confirmation of the often-noted trend towards increasing revenue diversification by municipal governments [29]. This trend is mirrored by school districts, a fact which is far less well known. The second notable feature is that the number of municipal tax revenue sources looks much more variable than the number of school district tax revenue sources. In fact, much of that variability is induced by the different sample sizes (and therefore the different average populations) each year. Fig 9 shows a similar analysis that also highlights several additional considerations for using the data. It graphs average, real, per capita government debt at both the state and municipal levels.
Scaling by population is easy, since population figures are included in the data, but because the database is recorded in nominal thousands of dollars any analysis that wants to control for inflation needs to merge an appropriate scaling factor into the database. In this case we used the annual average CPI levels from the Bureau of Labor Statistics (with 1983 ≈ 1), scaled total debt outstanding by both CPI and population, and multiplied each resulting figure by 1,000 (to correct for the fact that all data in the government finance database is recorded in thousands of dollars). The government level figures were then averaged, by year, for each government type.
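The scaling described above amounts to the following arithmetic, sketched here in Python. The CPI values in the example are illustrative placeholders rather than Bureau of Labor Statistics figures, and the "Total Debt Outstanding", "Population", and "Year" column names are assumptions.

```python
def real_per_capita(amount_thousands, population, cpi):
    """Convert a nominal amount in thousands of dollars to real dollars per person."""
    if not population or not cpi:
        return None  # cannot scale without both a population and a price index
    return amount_thousands * 1000 / cpi / population


def average_by_year(rows, cpi_by_year, column="Total Debt Outstanding"):
    """Average the government-level real per capita figures within each year."""
    values = {}
    for row in rows:
        amount = row.get(column)
        if amount is None:
            continue
        value = real_per_capita(amount, row.get("Population"),
                                cpi_by_year.get(row["Year"]))
        if value is not None:
            values.setdefault(row["Year"], []).append(value)
    return {year: sum(v) / len(v) for year, v in values.items()}


# Placeholder index values for illustration only, not actual CPI data.
cpi_by_year = {1983: 1.0, 2012: 2.0}

# Made-up rows with debt recorded in thousands of nominal dollars.
rows = [
    {"Year": 1983, "Population": 5000, "Total Debt Outstanding": 1000},
    {"Year": 2012, "Population": 10000, "Total Debt Outstanding": 5000},
    {"Year": 2012, "Population": 10000, "Total Debt Outstanding": 3000},
]

real_debt = average_by_year(rows, cpi_by_year)
print(real_debt)
```

Averaging the government-level ratios, as done here, weights every government equally; dividing total debt by total population would instead weight by size, which is a separate design choice.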
On the surface the graph in Fig 9 shows many of the features described by Hildreth and Zorn [30], including a substantial increase in debt levels following the Tax Reform Act of 1986, decreasing new issues in the early 1990s, and a general upward trend in debt outstanding since. Beyond those well-known trends, however, the real, per capita levels tell an interesting story about how large and small municipalities have used debt markets differently. Prior to 1986, census years show relatively little difference from annual samples in terms of the average level of real debt per person. Following the 1987 census, however, those differences dominate the figure, indicating that large municipalities have taken advantage of the Tax Reform Act of 1986 far more than small municipalities, even in inflation-adjusted per capita terms. While city size has been studied in relation to its impact on interest rates [31] [32], this previously unnoticed pattern between city size and the level of outstanding municipal debt is a potential area for future research.
There is no reason why the data need to be analyzed from the aggregate perspective our previous two figures used. Breaking the data out and studying one particular government is also an interesting exercise. For instance, Fig 10 shows the state of Oregon's total revenue and total expenditure, in billions of nominal dollars through time.
The most striking feature of this graph is the sizeable impact of the Great Recession on total revenue in 2009. Contrary to what you might think, this change is not the result of a large decrease in taxes collected or any other traditional revenue source; instead, virtually all of the difference between the 2008 and 2009 numbers comes from the approximately $12 billion loss in public employee retirement system investment revenue.
The visual impact of this loss on the graph is small compared to the actual impact losses like this had on public retirement systems across the nation and the world [32], but it drives home an important point about the flexibility of the government finance database. Isolating more stable government revenues through the use of general revenue, rather than total revenue, is likely to be advisable in many situations, and further isolating your data from the impact of intergovernmental revenue by using the "own source" versions of either revenue number is also possible.  Another option for segmenting the data is to look more closely at patterns within a particular government type. Table 12 shows one such analysis for special districts. While the general pattern of growth in special districts is well known [34] [35], and there are a few isolated studies that attempt to understand what is driving that growth (cf. [36], a study using one year of data, to our table), there are no studies describing which types of special districts have contributed most to that growth.
Our findings demonstrate a number of interesting patterns. First, much of the growth seems to be an organic expansion of the most common special districts without much change in their proportion. For instance, even though local fire protection districts added 1,628 to their total and grew almost 40% over the 30 years, they represented a very stable 16% of all special districts at both points in time. Second, some of the most dramatic growth came from the other multi-function district category, which grew from around 2% to over 7% of all districts, indicating that citizens who form special districts are increasingly deciding that the efficiency of combining multiple functions (perhaps from economies of scale, or reductions in administrative costs), outweighs the burden arising from additional complexity.
There are many other interesting storylines that we might draw from Table 12, including the reduction in school building authorities and cemeteries potentially representing shifts in population demographics, or the strong growth of library, health, and solid waste management districts potentially representing increased demand for those services in areas without the population to support them previously. The diversity of potential insights from this relatively simple analysis highlights the fact that we can only begin to characterize the full extent of the flexibility and utility of the government finance database here.

Conclusion
A trade-off between ease of use and purity exists with any data cleaning effort. On the one hand we would like to present researchers with a database that is free from abnormalities and can be easily used in the widest range of circumstances. On the other hand we also want a database that is as close to the raw data as possible in order to limit statistician induced measurement error.
In this effort we have made several decisions that ensure the purity of the data even when there is some reduction in its usability. We plan to implement fixes for these issues, but have reserved these changes for a later work, since this will give us an opportunity to describe our approach to the data cleaning in a complete way, and because our approaches are potentially controversial. Academics who appreciate the changes we plan to make are free to apply them or use our revised database, and those who disagree with us or would prefer to use an alternate method can still have access to the government financial data in this form. A brief description of the issues we would like to fix is warranted, however, since our choosing not to amend the database now means that the data may have less utility for some studies.
The two primary issues surround the population figures and the fiscal year end dates. The issue with the population numbers is that they do not update annually. Given the fact that per capita levels are a common, useful transformation to apply to government finance statistics, the use of old population figures means that per capita variables are likely to be measured with error in many cases. Short of conducting a retrospective count of populations for every government in the dataset, the best solution is to model what the population must have been in every year when the population estimate is not current. This model could take many different forms, so we will reserve a discussion of how to create it for the future. The issue with the fiscal year end dates is threefold: inconsistent coding of dates, a large number of error codes that we can interpret, and a surprising collection of other strange entries that are harder to interpret. The bluntest illustration of this problem is the fact that there are 520 unique values of fiscal year end dates, and only 365 days in a year. There are several avenues for correcting this problem, none of which is perfect. In the meantime, however, research that relies on fiscal year end dates should be careful about dropping observations that do not conform to the expected format of this data field.
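To make the population issue described above concrete, the following is a minimal sketch of one simple form such a model could take: linear interpolation between the years in which a population count is current. This is an assumption chosen for illustration, not the approach we have reserved for future work, and the function names and figures are hypothetical.

```python
def interpolate_population(known, year):
    """Estimate population in `year` from a {year: population} mapping.

    Uses linear interpolation between the nearest earlier and later years
    with a known count. Outside the observed range, the nearest known
    count is carried forward or backward unchanged.
    """
    if not known:
        raise ValueError("at least one known population count is required")
    if year in known:
        return float(known[year])
    years = sorted(known)
    earlier = [y for y in years if y < year]
    later = [y for y in years if y > year]
    if not earlier or not later:
        nearest = later[0] if later else earlier[-1]
        return float(known[nearest])
    y0, y1 = earlier[-1], later[0]
    weight = (year - y0) / (y1 - y0)  # fraction of the gap elapsed
    return known[y0] + weight * (known[y1] - known[y0])


def per_capita(amount, known, year):
    """Divide a financial amount by the interpolated population for `year`."""
    return amount / interpolate_population(known, year)


# Hypothetical government with counts current only in 2000 and 2010.
known_population = {2000: 10_000, 2010: 12_000}
estimate_2005 = interpolate_population(known_population, 2005)
revenue_per_capita_2005 = per_capita(5_500_000, known_population, 2005)
print(estimate_2005, revenue_per_capita_2005)
```

In this hypothetical case, dividing 2005 revenue by the stale 2000 count would overstate the per capita figure relative to the interpolated estimate. Linear interpolation assumes steady change between counts, which will not hold for every government; other model forms may be more appropriate.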
While the data we provide is far from perfect, it still represents a substantial step forward for quantitative research in public financial analysis, and helps to solve a long-standing problem created by the lack of standardized, cross-sectional databases in local government finance.
All of the data was collected by the U.S. Census Bureau's annual surveys and five-year censuses of state and local government finance, but prior to our work the use of government financial data in public accounting and finance research always involved a substantial investment of time into data cleaning and organization. As a result there was very little standardization in the time periods and government types covered, and the interpretability, accessibility, and replicability of research suffered.
We offer this database in the hopes that it can bring more consistency and transparency to quantitative research in public financial management. In the process it should also make conducting this type of research less costly, and may provide a template for others with access to unique data sources who want to provide them to our field.
Supporting Information
S1 File. Appendix for the Government Finance Database. This file contains detailed information showing how to replicate our work creating the database described in the paper. The three appendices include step-by-step instructions, a mapping of our variable names to the census data codes, and the SAS code we used to consolidate the data files we received from the Census into the government finance database.