A general approach for predicting the behavior of the Supreme Court of the United States

Building on developments in machine learning and prior work in the science of judicial prediction, we construct a model designed to predict the behavior of the Supreme Court of the United States in a generalized, out-of-sample context. To do so, we develop a time-evolving random forest classifier that leverages unique feature engineering to predict more than 240,000 justice votes and 28,000 case outcomes over nearly two centuries (1816-2015). Using only data available prior to decision, our model outperforms null (baseline) models at both the justice and case level under both parametric and non-parametric tests. Over nearly two centuries, we achieve 70.2% accuracy at the case outcome level and 71.9% at the justice vote level. More recently, over the past century, we outperform an in-sample optimized null model by nearly 5%. Our performance is consistent with, and improves on, the general level of prediction demonstrated by prior work; however, our model is distinctive because it can be applied out-of-sample to the entire past and future of the Court, not a single term. Our results represent an important advance for the science of quantitative legal prediction and portend a range of other potential applications.


Introduction
As the leaves begin to fall each October, the first Monday marks the beginning of another term for the Supreme Court of the United States. Each term brings with it a series of challenging, important cases that cover legal questions as diverse as tax law, freedom of speech, patent law, administrative law, equal protection, and environmental law. In many instances, the Court's decisions are meaningful not just for the litigants per se, but for society as a whole.
Unsurprisingly, predicting the behavior of the Court is one of the great pastimes for legal and political observers. Every year, newspapers, television and radio pundits, academic journals, law reviews, magazines, blogs, and tweets predict how the Court will rule in a particular case. Will the Justices vote based on the political preferences of the President who appointed them or form a coalition along other dimensions? Will the Court counter expectations with an unexpected ruling? Despite the multitude of pundits and vast human effort devoted to the task, the quality of the resulting predictions and the underlying models supporting most forecasts is unclear. Not only are these models not backtested historically, but many are difficult to formalize or reproduce at all. When models are formalized, they are typically assessed ex post to infer causes, rather than used ex ante to predict future cases. As noted in [1], "the best test of an explanatory theory is its ability to predict future events. To the extent that scholars in both disciplines (social science and law) seek to explain court behavior, they ought to test their theories not only against cases already decided, but against future outcomes as well." Luckily, the Court provides a new opportunity to test each year. Thousands of petitioners annually appeal their cases to the Supreme Court. In most situations, the Court decides to hear a case by granting a petition for a writ of certiorari. If that petition is granted, the parties then submit written materials supporting their position and later provide oral argument before the Court. After considering the case, each participating Justice ultimately casts his or her vote on whether to affirm or reverse the status quo (typically seen through the lens of a decision by the lower court or special master). Over the last decade, the Court has issued between 70-90 opinions per term for an average of approximately 700 Justice votes per term.
While many questions could be evaluated, the Court's decisions offer at least two discrete prediction questions: 1) will the Court as a whole affirm or reverse the status quo judgment and 2) will each individual Justice vote to affirm or reverse the status quo judgment?
In this paper, we describe a prediction model answering these two questions as guided by three modeling goals: generality, consistency, and out-of-sample applicability. Building on developments in machine learning and the prior work of [1], [2] and [3], we construct a model to predict the voting behavior of the Court and its Justices in a generalized, out-of-sample context. As inputs, we rely upon the Supreme Court Database (SCDB) and some derived features generated through feature engineering. Our model is based on the random forest method developed in [4]. We predict nearly two centuries of historical decisions (1816-2015) and compare our results against multiple null (baseline) models.
Using only data available prior to decision, our model outperforms all baseline models at both the Justice and Court observation level under both parametric and non-parametric tests. This performance is consistent with, and improves on, the general level of prediction demonstrated by prior work; however, our model is distinctive because it can be applied out-of-sample to the entire past and future of the Court, not just a single term. Finally, our conclusion suggests areas for future improvement and collaboration. Our results represent a significant advance for the science of quantitative legal prediction and portend a range of potential applications, such as those described in [5].

Research principles and prior work
In this section, we describe the principles guiding our model construction and how we conducted our testing in light of prior work on the topic.

Generality
Leveraging the early work of [6], both [1] and [3] developed a classification tree model which was designed to predict the behavior of Supreme Court Justices for the 2002-2003 Supreme Court term. Their work represents a seminal contribution to the science of legal forecasting as their classification tree models not only performed well in absolute terms, but also matched or outperformed a number of subject matter experts.
Despite its contribution to the field, however, the approach undertaken in [1] and [3] was limited in several important ways. For example, their model construction is only applicable to a single "natural court" with full participation, i.e., cases where all of a specific set of Justices are sitting. The natural court tested in their paper, following Justice Stephen G. Breyer's appointment in 1994, was one of the longest periods without personnel changes on the Court, providing their models with an unusually large training sample. It is not possible, however, to evaluate their model in periods prior to 1994 or after 2005 following the replacements of Chief Justice William H. Rehnquist and Justices Sandra Day O'Connor, David H. Souter, and John Paul Stevens. As a result of these issues, the performance and nature of the model cannot necessarily be generalized to all Supreme Court cases during their test period, let alone cases before or after their tested natural court.
Our first principle, generality, is based on these observations. As the composition of the Court changes case-by-case or term-by-term, whether through recusal, retirement, or death, a prediction model should continue to generate predictions. It should also be possible to study the properties and performance of a prediction model across time and under "abnormal" circumstances (e.g., cases with original jurisdiction or fewer than nine Justices). Therefore, our goal is to construct a model that is general, that is, a model that can learn online, in a manner similar to the online learning models described in [7] and [8].

Consistency
Second, we prefer the model to have consistent performance across time, case issues, and Justices. Similar to our motivation for generality, existing models have had significantly varying performance over time and across Justices. To support the case for a model's future applicability, it should consistently outperform a baseline comparison.
Both legal scholars and practicing lawyers have had difficulty leveraging prediction models [5]. Among other difficulties, qualitatively-oriented legal experts tend to suggest model improvements based on anecdote or their own untested mental model. However, if these ostensible improvements cannot be systematically inferred from data, or if their impact on the model is detrimental in other periods or for other Justices, then they ought not be included in a model engineered for consistency.
While prediction models can be applied in many contexts, consistency can also be related to a risk preference in a repeated betting scenario. For example, instead of preferring the highest per-wager expected value (i.e., maximum accuracy), a bettor might prefer a wager with less volatility or long-term downside risk.
Both consistency and generality can be seen as related to overfitting and the bias-variance trade-off. But in addition to the typical learning problems under a stationary system, we are faced with a more complex reality. Court outcomes are potentially influenced by a variety of dynamics, including public opinion as in [9], inter-branch conflict [10], both changing membership and shifting views of the Justices as explored in [11] and [12], and judicial norms and procedures [13]. The classic adage "past performance does not necessarily predict future results" is very much applicable. For example, likely due to changes in norms, the number of cases per term has fallen from approximately 150 between 1950 and 1990 to fewer than 90 between 1990 and 2015. Consider another famous historical example, as explored in [14] and [15], when the aftermath of President Franklin D. Roosevelt's attempted Court-packing plan in 1937 resulted in a significant turnover of Justices in the years that followed. Each of these and other changes represents a challenge to a model engineered with consistency as a goal.

Out-of-sample applicability
Our third model principle is out-of-sample applicability. Namely, all information required for the model to produce an estimate should be knowable prior to the date of decision. This is in contrast with models like [2], which require partial knowledge about the outcome to predict the full outcome. This principle is arguably the most important, as it allows for the model to generate predictions in advance, i.e., predictions that can be applied usefully in the real world.
While existing approaches like [1], [2], and [3] may honor one or two of these principles, none simultaneously achieves all three, severely limiting their general applicability. Both [1] and [3] are predictive out-of-sample, but they are neither general enough to apply widely nor consistent when tested. By contrast, [2] is general across terms and consistent, but not predictive out-of-sample since it requires knowledge of some votes to predict others. As detailed further below, our approach is the first that satisfies all three of these criteria, and thus represents a significant advance in the science of quantitative legal prediction.

SCDB
In order to build our model, we rely on data from the Supreme Court Database (SCDB) [16]. SCDB features more than two hundred years of high-quality, expertly-coded data on the Court's behavior. Each case contains as many as two hundred and forty variables, including chronological variables, case background variables, justice-specific variables, and outcome variables. Many of these variables are categorical, taking on hundreds of possible values; for example, the ISSUE variable can take 384 distinct values. These SCDB variables form the basis for both our features and outcome variables.
SCDB is the product of years of dedication from Professor Harold Spaeth and many others. The database has been consistently subjected to reliability analysis and has been used in hundreds of academic studies (e.g., [11], [17], [18], [19], [20], [21], [22], [23]). While there are serious and important limits to SCDB, as detailed in [24], SCDB is the highest-quality and longest-duration database for Supreme Court decisions.
There are currently two releases of SCDB: SCDB Modern and SCDB Legacy. The SCDB Modern release contains terms beginning in 1946, while the SCDB Legacy release contains terms beginning in 1791. When [25], an earlier pre-print version of this paper, was released, SCDB Legacy had not yet been released. As SCDB Legacy represents more than a threefold increase in the length of simulation history and size of training data, we have re-run all model construction and analysis for the new data release; methods and results from [25] are thus superseded by this paper.

Targets
To model Supreme Court decisions, we need to define an outcome variable from SCDB corresponding to a decision. Typically, Court-watchers frame decisions as either affirming or reversing a lower court's decision. This, however, is only consistent with cases heard on appeal. In some circumstances, the United States Supreme Court is the court of original jurisdiction, and there is therefore no lower court against which to frame reversal. In these cases, decisions are typically framed as either siding with the plaintiff(s) or defendant(s). In addition, the Court and its members may take technically-nuanced positions or the Court's decision might otherwise result in a complex outcome that does not map onto a binary outcome.
In order to build a general model that can handle all cases, we created a disposition coding map that defines a Justice vote as (i) Reversed, (ii) Affirmed, or (iii) Other, depending on a Justice's vote and the SCDB's CASEDISPOSITION variable. This disposition coding map is outlined in our Github repository [26]. Our mapping displays Justice vote values by column and Court CASEDISPOSITION values by row. The case outcome is defined as Reverse if there are more total Reverse votes than Affirm votes; notably, Other votes, which may include recusals or non-standard form decisions, are excluded from the vote aggregation. Table 1 below displays the distribution of Reverse, Affirm, and Other coding by Justice outcome and case outcome.
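The vote aggregation rule described above can be illustrated with a minimal Python sketch. The function name and the string labels are illustrative assumptions and are not taken from the repository [26]; the handling of ties is also an assumption, since the text only states when a case is coded as Reverse.

```python
from collections import Counter


def case_outcome(justice_votes):
    """Aggregate coded Justice votes into a case-level outcome.

    Each vote is one of "Reverse", "Affirm", or "Other". "Other" votes
    are excluded from the tally, as described in the text.
    """
    counts = Counter(v for v in justice_votes if v != "Other")
    # Assumption: a tie, or a case with no countable votes, is not Reverse.
    if counts["Reverse"] > counts["Affirm"]:
        return "Reverse"
    return "Not Reverse"


votes = ["Reverse"] * 5 + ["Affirm"] * 3 + ["Other"]
print(case_outcome(votes))
```

Excluding Other votes before counting means that a recusal changes the size of the voting panel rather than counting toward either side.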

Features and feature engineering
With the outcome variable specified, we proceed next to describe the SCDB features used and feature engineering we performed. SCDB contains a wide range of potential features, and the majority of these are categorical variables. In our study, we begin with the following features available from SCDB: JUSTICE (
In addition to simple feature encoding, we also engineer features that do not occur in SCDB as released. The first set of features that we engineer are related to the Circuit Court of Appeals from which the dispute arose. SCDB codes this data in the form of the case source and case origin, where the source corresponds to the opinion under review and the origin corresponds to the location of original filing. While there are over 130 unique courts that these variables may be coded as, scholars primarily group them by Circuit; Circuits have been shown to be a strong predictor of reversal during certain periods, as shown in [27]. Based on this guidance, we therefore developed a translation from each SCDB court ID to the corresponding Circuit. The coding maps from these origin and source courts to a new set of 16 categorical values, which are then binarized as the raw features above.
The features engineered above can both be described as coarsened or collapsed. We move on next to features that are derived through arithmetic or interaction of one or more features. The first of this class is a set of chronologically-oriented features related to oral argument and case timing. These features include (i) whether or not oral arguments were heard for the case, (ii) whether or not there was a rehearing, and (iii) the duration between when the case was originally argued and a decision was rendered. These features are based on the qualitative observation that the length of time between argument and decision is related to the unanimity of the Court; for example, in the past three terms, the ten "fastest" decisions of each term have nearly all been unanimous 9-0.
Item (iii) may seem at first to include future or out-of-sample knowledge. However, in practice, the predictions for a case may evolve as new information about the case is acquired prior to the decision being rendered. For example, when the Court announces that a case will have arguments heard, the delay feature may be set to zero initially. Once the argument date passes, the delay feature is then incremented periodically. After each time step that passes, the feature matrix for undecided cases is updated, and the resulting predictions may therefore change. Consistent with "online" learning approaches such as [7] and [8], this does not require out-of-sample information; it only requires that the data and algorithm be re-run at a specified frequency for any undecided cases in a term.
Lastly, we engineer features that summarize the "behavior" of a Justice, the Court, the lower court, and differences between them. These features fall into three categories: (i) features related to the rate of reversal, (ii) features related to the left-right direction of a decision, and (iii) features related to the rate of dissent. These features can be thought of as conditional empirical probabilities. For example, (i) includes, at a given term and for a given justice, the historically-observed proportion of votes to reverse. Importantly, in addition to calculating these values for each justice, we also include difference terms between the Court as a whole and the individual justice. These difference terms are, qualitatively, the relative inclination of a Justice to reverse compared to the Court. We repeat these calculations for other justice-specific features including direction and agreement features, providing quantitative measures of left-right political preference and rate of dissent. In addition, we include a difference term between the lower court's decision direction and the Justice's historically-observed mean direction; this provides a measure of how far apart, ideologically, the Justice is from the lower court's opinion on review (excepting original jurisdiction cases). Together, these features provide relative information about Courts' and Justices' political and procedural leanings; for example, we find that reversal rates vary significantly even in the last 35 years at both the Court and Justice level.
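As an illustration of the reversal-rate features and difference terms, the following minimal sketch computes a Justice's historical reversal rate, the Court's historical reversal rate, and their difference, using only votes from earlier terms. The record layout, function names, and the choice to count all coded votes in the denominator are assumptions made for this example; the actual feature definitions are in the repository [26].

```python
def reversal_rate(votes):
    """Proportion of Reverse votes; None when there is no history."""
    if not votes:
        return None
    return sum(v == "Reverse" for v in votes) / len(votes)


def reversal_features(history, justice, term):
    """Reversal-rate features for `justice` as of the start of `term`.

    `history` is a list of (term, justice, vote) tuples. Only records
    from strictly earlier terms are used, so no future information leaks.
    """
    prior = [(j, v) for t, j, v in history if t < term]
    court_rate = reversal_rate([v for _, v in prior])
    justice_rate = reversal_rate([v for j, v in prior if j == justice])
    if court_rate is None or justice_rate is None:
        difference = None
    else:
        difference = justice_rate - court_rate
    return {
        "justice_rate": justice_rate,
        "court_rate": court_rate,
        "justice_minus_court": difference,
    }


# Hypothetical history for two Justices, "A" and "B".
history = [
    (1990, "A", "Reverse"), (1990, "B", "Affirm"),
    (1991, "A", "Reverse"), (1991, "B", "Reverse"),
    (1992, "A", "Affirm"), (1992, "B", "Affirm"),
]
features = reversal_features(history, "A", 1992)
print(features)
```

A positive difference indicates a Justice who has historically voted to reverse more often than the Court as a whole. The same pattern could be repeated for direction and dissent features.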

Model construction
With features and outcome data defined, we proceed to discuss the construction of our model. While this section provides a general overview of modeling procedures, readers interested in the technical details should review the Github repository accompanying the paper, [26]; all source code and data required to reproduce the results presented are freely available there. The model is developed in Python and all methods described below, unless otherwise indicated, are from scikit-learn 0.18 [28].
The modeling process begins by selecting a term $T^*$; in order to satisfy our three principles above, no information from term $T^*$ or after should be available during the training phase. If we let each docket-vote feature vector $d_i$ and docket-vote outcome $v_i$ have term $T(d_i)$, then our training feature set for model term $T^*$ is $D_T = \{d_i : T(d_i) < T^*\}$ and our training target set $V_T$ corresponds to the matching $v_i$ records. While some information may be known intra-term, i.e., for $\{d_i : T(d_i) = T^*\}$, this modeling procedure only retrains at the outset of each term. For example, while some decisions in term $T^*$ may have been observed by December, cases in January are predicted using only information prior to October. Other than the incremental delay feature discussed above, no information derived from the current court term is incorporated into the model until the start of the following term.
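The term-by-term training rule can be sketched as a walk-forward split: for each model term, the training set contains only docket-votes from strictly earlier terms, and the docket-votes of the model term itself are held out for prediction. The dictionary keys and function names below are hypothetical and chosen only for illustration.

```python
def training_set(docket_votes, t_star):
    """Return features and outcomes from terms strictly before t_star."""
    prior = [r for r in docket_votes if r["term"] < t_star]
    features = [r["features"] for r in prior]
    outcomes = [r["outcome"] for r in prior]
    return features, outcomes


def walk_forward(docket_votes, first_term, last_term):
    """Yield (term, train_X, train_y, test_records) once per model term."""
    for t_star in range(first_term, last_term + 1):
        train_X, train_y = training_set(docket_votes, t_star)
        # Docket-votes from the model term are predicted, never trained on.
        test_records = [r for r in docket_votes if r["term"] == t_star]
        yield t_star, train_X, train_y, test_records


# Hypothetical docket-vote records with two binary features each.
records = [
    {"term": 1814, "features": [1, 0], "outcome": "Affirm"},
    {"term": 1815, "features": [0, 1], "outcome": "Reverse"},
    {"term": 1816, "features": [1, 1], "outcome": "Reverse"},
]
steps = list(walk_forward(records, 1815, 1816))
for term, train_X, train_y, test_records in steps:
    print(term, len(train_X), len(test_records))
```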
While we represent $D$ and $V$ above as sets of vectors, we can easily consider $D$ to be a feature matrix with each docket-vote in a row and each feature in a column. As of 2015, $D_{2015}$ based on SCDB Legacy (beta) has 249,793 docket-votes; under our feature engineering approach described above, $D_{2015}$ has 1,501 columns. In many machine learning approaches, we might pre-process $D$ by rescaling, rotating, interacting, or removing columns. Random forest classifiers, especially when applied to binarized or indicator variables, do not generally require preprocessing. Furthermore, random subspace methods like random forests implicitly remove or "select" features by subsetting the feature space for each sub-learner tree. One weakness of the scikit-learn implementation of random forests relative to alternatives like xgboost, however, is its treatment of missing data. In most cases, we handle missing values by mapping them to a separate "missing" indicator column during encoding; in some cases, a historical mean imputation may be used instead. No additional feature selection or pre-processing methods are applied to $D$ prior to learning.
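The binarization of categorical variables with a separate "missing" indicator can be illustrated as follows. This is a standard-library sketch under stated assumptions: the function name, the category labels, and the decision to route unseen values to the "missing" column are all choices made for this example rather than details confirmed by the text.

```python
def binarize(values, categories):
    """One-hot encode `values`, adding a trailing "missing" column.

    None, and (as an assumption made here) any value not listed in
    `categories`, is mapped to the "missing" indicator column.
    """
    index = {c: i for i, c in enumerate(categories)}
    missing_column = len(categories)
    rows = []
    for v in values:
        row = [0] * (len(categories) + 1)
        row[index.get(v, missing_column)] = 1
        rows.append(row)
    return rows


# Hypothetical categorical feature with one missing observation.
encoded = binarize(["civil", None, "criminal"], ["civil", "criminal"])
print(encoded)
```

Because every row has exactly one indicator set, a tree can split on the "missing" column like any other binary feature, and no imputation is needed.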
We next apply a learning algorithm to D and V. As noted previously, we selected a random forest classifier [4]. Random forests are part of the family of ensemble methods. Ensemble methods leverage the wisdom of the statistical crowds. In the case of random forest classifiers, we construct a forest of statistically diverse trees using bootstrap aggregation on random substrates of our training data. To cast predictions, we simply calculate predictions for each of our individual trees and then average across the entire forest. While an individual statistical learner (a single tree) might offer an unrepresentative prediction of a given phenomenon, the crowdsourced average of a larger group of learners is often better able to forecast outcomes.
Not only have random forests proven to be "unreasonably effective" in a wide array of supervised learning contexts [29], but in our testing, random forests outperformed other common approaches including support vector machines (LibLinear, LibSVM) and feedforward artificial neural network models such as multi-layer perceptron models implemented with [30]. For details of the implementation, interested readers are directed to the scikit-learn documentation [28] and [31] and the keras documentation [30].
Of some note, however, is our experimentation with the warm_start parameter to "grow" the forest online. Recall that at the beginning of each term, the model is retrained to incorporate newly observed data. In [25], we built a "fresh" forest model each term with number of trees selected by cross-validated hyperparameter search. In this updated research, however, we have simulated performance using both "fresh" forests and "growing" forests, in which trees are added to an existing forest. Only under certain circumstances, such as the changing of the natural court, following the addition or loss of a Justice, does the model build a "fresh forest". For example, the models used to produce this paper's results were trained with 125 initial trees beginning in 1816 (5 × 25 trees, five for each term between 1791-1816). Each term, in the absence of a natural court change, an additional five trees were trained and added to the prior term's forest.
Our implementation of this "growing" approach allows for substantially faster simulation times and more stable predictions, as it need only train a small number of trees per step. Equally important is that most trees in the forest are stable for most years, and so the same inputs in year T and T + 1 are likely to produce the same predictions.
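A minimal sketch of the "growing" forest idea using scikit-learn's warm_start option is shown below. The data are synthetic and the forest is much smaller than the one described in the text; the variable names and the three-term setup are assumptions for illustration only. With warm_start=True, each call to fit keeps the existing trees and trains only the additional trees implied by the increased n_estimators.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Synthetic stand-in data: three "terms" of 200 docket-votes each,
# with ten binary features and a binary target.
terms = np.repeat([1, 2, 3], 200)
X = rng.randint(0, 2, size=(600, 10))
y = rng.randint(0, 2, size=600)

forest = RandomForestClassifier(n_estimators=5, warm_start=True, random_state=0)
sizes = []
for t_star in [2, 3, 4]:
    mask = terms < t_star           # train only on terms before t_star
    forest.fit(X[mask], y[mask])    # fits only the newly requested trees
    sizes.append(len(forest.estimators_))
    forest.n_estimators += 5        # request five more trees for the next term

print(sizes)
```

In this sketch, trees trained in earlier steps are never refit, which is what makes predictions stable from one term to the next. A "fresh" forest after a natural court change would correspond to constructing a new classifier instead of continuing to grow the existing one.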
Generally speaking, most learners benefit from joint cross-validation and hyperparameter search. For the "fresh" forest approach, in which a new random forest is built each term, we performed a number of experiments by grid-searching the number of trees, minimum number of leaves per node, maximum depth per tree, heuristic used to select the number of features per tree (e.g., log, sqrt), and split criterion (e.g., Gini vs. entropy) for each model retraining, i.e., for each term. This approach allows the parameters to adapt over the nearly 200 years of change in historical sample composition and size. However, we found that the marginal improvements in accuracy and F1 were not worth the substantial increase in computational requirements and the decreased stability of predictions. In the simple examples included in the Github repository, a cross-validated hyperparameter search does not have a noticeable impact on accuracy over the "default" random forest configuration.
As a whole, our model construction applies standard pre-processing and learning approaches within each step, but experiments with purposeful and atypical design around longitudinal model application. For simplicity of subsequent presentation and replication, only the "growing" forest approach described above with five trees per step is presented. All source and results are available at [26] for a reader interested in the details of model specification and implementation.

Model testing and results
The data and model described above allow us to simulate out-of-sample performance for nearly 200 terms at the Supreme Court. However, there is no single approach to assessing performance in this context. Below, we present standard, un-adjusted machine learning diagnostic results derived from the application of our prediction model. We present results both at the justice level (i.e., our performance on predicting the votes of individual justices) and at the case level (i.e., our performance on predicting the overall outcome of the Court). Then, we compare our accuracy to that of several potential "null" or "baseline" models.

Performance of case and justice prediction model
Justice-level prediction results. To begin, we present the results of our Justice vote prediction model. Recall that the Justice-level model predicts which of three classes (Affirm, Reverse, Other) a vote will fall into, but that the outcome at the case level depends only on whether or not a given Justice's vote is Reverse. As a result, in Tables 2 and 3 below, we present precision, recall, and F1 results for both the three-class and two-class problems. In total, over the period from 1816-2015, our model exhibits accuracy of 71.9% at the Justice vote level.
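For readers less familiar with these diagnostics, the sketch below computes precision, recall, and F1 for a chosen class using their standard definitions, and shows how three-class labels can be collapsed into the two-class (Reverse vs. Not Reverse) problem. The function names and toy labels are assumptions for this example and do not reproduce any values from the tables.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Standard precision, recall, and F1 for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def to_two_class(label):
    """Collapse Affirm and Other into a single Not Reverse class."""
    return "Reverse" if label == "Reverse" else "Not Reverse"


# Toy three-class labels for four Justice votes.
y_true = ["Reverse", "Affirm", "Other", "Reverse"]
y_pred = ["Reverse", "Reverse", "Affirm", "Affirm"]

three_class = precision_recall_f1(y_true, y_pred, "Reverse")
two_class = precision_recall_f1(
    [to_two_class(v) for v in y_true],
    [to_two_class(v) for v in y_pred],
    "Not Reverse",
)
print(three_class, two_class)
```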
Case-level prediction results. An alternative but related prediction task is the prediction of case outcomes. While better understanding the behavior of Justices is of interest to some court observers, the prediction of case outcomes is the key capability that motivates litigants and can move markets [32]. Table 4 presents case-level results from our prediction model. The predicted case outcome is determined from whether or not the majority of individual Justice votes favor reversing the prior status quo. Starting in 1816 and carrying through the conclusion of the October 2014 term, our model correctly predicts 70.2% of the Court's decisions.
Tables 2-4 provide the overall performance of our model (1816-2015). Fig 1, by contrast, demonstrates the consistency and generality of our approach over nearly two hundred years at both the case and justice level. While some years and some decades are better than others, our model typically delivers stable performance for both case outcomes and the votes of individual justices.

Candidate baseline (null) models
But are the results above "good?" To meaningfully answer this question requires the development of a plausible baseline or null model. Specifically, while our approach may outperform an unweighted coin flip for both the two-class and three-class problems (50% and 33%, respectively), few legal experts would rely on an unweighted coin as a null model against which to compare their predictions. Instead, informed by recent years, common wisdom among the legal community is that the baseline betting strategy should be to always guess Reverse. This strategy is supported by the recent history of the Court over the last 35 terms: 57% of Justice votes and 63% of case outcomes have been Reverse. However, this wisdom is quickly drawn into question when a broader view of history is taken into account, as Fig 2 demonstrates below. This trend is even more unbalanced when one considers the significant reduction in docket size over the past few decades, resulting in even more Affirm observations in previous years.
Since common wisdom appears too myopic to use as a historical baseline, we instead propose two additional null models to also use as comparisons. Specifically, in addition to the always guess Reverse heuristic, consider two simple and similarly-intentioned rules: a most-frequent guessing strategy with an "infinite" memory and another with a "finite" memory. The infinite memory baseline model, for a term T, simply guesses the most frequent outcome as observed in $D_T$. This model is most aligned with the spirit of common wisdom; however, as seen in Table 1, it results in a model that would still predict Affirm for the modern Court, and it has therefore significantly underperformed for the last 50 years. In fact, at the current rate of dockets per year, it would take multiple decades' worth of unanimous 9-0 decisions before this model would switch to predicting Reverse.
Therefore, we instead focus on an adapted most-frequent model featuring a "finite window" or "moving average." Instead of determining the most frequent outcome over all history up to term T, only cases decided within the last M < T terms are used.
This memory parameter M introduces a common hyper-parameter into the model definition. The optimization of "memory" parameters is a frequent challenge in many learning situations, especially "online." In less technical terms, the optimization of M can be reframed as a simple question: how much of the past is useful for predicting the future? As is demonstrated by Fig 2, it is often unclear when one should change strategy as underperformance is experienced. It should be noted that this issue affects not just models in machine learning, but especially individual human experts attempting to leverage their personal experience and mental models. While it is not possible to learn the optimal size of M for all future states of the world, in our experiments, reproduced in our Github repository [26], we have settled on a value of M = 10. Not only does M = 10 provide an easily-understood "prior decade" baseline, but also by selecting this as our memory window, we are able to test our prediction model against a null model built upon a value of M that is nearly globally optimal for case accuracy. In other words, as M = 10 is essentially derived by optimizing the M hyperparameter in-sample, i.e., using "future information," this further advantages the baseline model and substantially hampers our efforts to outperform the null. Despite this challenge, as demonstrated below, we still outperform the optimized baseline model over the past two centuries.
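The infinite-memory and finite-memory most-frequent baselines can be sketched in a few lines. The function name, the (term, outcome) record layout, and the toy history are assumptions for illustration; tie-breaking and the behavior with no history are also choices made here rather than details given in the text.

```python
from collections import Counter


def most_frequent_baseline(history, term, memory=None):
    """Guess the most frequent outcome observed before `term`.

    `history` is a list of (term, outcome) pairs. With memory=None the
    baseline has "infinite" memory; with memory=M it uses only the M
    terms immediately preceding `term`.
    """
    counts = Counter(
        outcome
        for t, outcome in history
        if t < term and (memory is None or t >= term - memory)
    )
    if not counts:
        return None  # assumption: no guess when there is no history
    # Assumption: ties are broken by whichever outcome was seen first.
    return counts.most_common(1)[0][0]


# Toy history: an older, larger docket dominated by Affirm, followed by
# a more recent, smaller docket dominated by Reverse.
history = [(t, "Affirm") for t in range(1980, 1990) for _ in range(3)]
history += [(t, "Reverse") for t in range(1990, 2000) for _ in range(2)]

print(most_frequent_baseline(history, 2000))             # infinite memory
print(most_frequent_baseline(history, 2000, memory=10))  # M = 10
```

The toy history mirrors the problem described above: the infinite-memory baseline is still dominated by the older Affirm observations, while the M = 10 window adapts to the more recent pattern.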
Tables 5-7 present the results at the justice and case level for the M = 10 "finite" memory null model. Similar to the results reported for our prediction model, Table 5 displays the performance of the M = 10 null model at the case level. As above, the predicted case outcome is determined from whether the individual Justice votes are Reverse or Not Reverse. In sum, optimizing the finite memory window using in-sample information yields justice-level accuracy of 66.2% and case-level accuracy of 67.5% from 1816-2015.

Comparison against baseline models
Our model also performs well on the case-level predictions. Our approach especially outperforms both the always guess reverse heuristic and the infinite memory window during large, sustained periods.
After more than a century of soundly defeating all three null models, the performance of our prediction model has dipped during the Roberts Court (as compared against the always guess reverse heuristic and M = 10 null model). Within the scope of this study, it is difficult to determine whether this represents some sort of systematic change in the Court's macrodynamics. However, thus far, it does appear that the Roberts Court is less predictable than its immediate predecessors.
Flattening the data by taking each term as the relevant unit of analysis, Fig 4 offers an alternative perspective on our performance. Fig 4 scores each term by comparing our performance to that of the null model. We assign a score of +1 to any term where our model outperforms the null model, -1 to any term where our model performs worse than the null model, and 0 to any term where our model and the null model offer identical performance. Given the results previously displayed in Fig 3, we only consider the M = 10 null model for the purpose of this analysis. Recall that selecting the memory window through hyperparameter optimization is actually leveraging future information. By leveraging this class of future information, the in-sample optimization of M appears to be better able than our model to fit some of the actual dynamics present in the early years of the Court.
As the size of the training data increases, our model eventually surpasses the null. Namely, our twenty-five-year learning period (1791-1816) appears insufficient, such that it requires several additional decades for our model to be able to consistently extract the signal from the noise. In addition, the ultimate success of our model vis-à-vis the null model is likely also driven by some increased level of behavioral stability on behalf of the Court starting in the second half of the nineteenth century. As reflected in Fig 4, starting after the conclusion of the American Civil War and in particular at the outset of the Fuller Court, our model begins to consistently outperform the in-sample optimized null model.
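The term-level scoring behind Fig 4 can be sketched as follows. The dictionaries of per-term accuracy and the name `score_terms` are hypothetical, and the accuracy values are made up for illustration.

```python
def score_terms(model_accuracy, null_accuracy):
    """Score each term +1, -1, or 0 by comparing model and null accuracy."""
    scores = {}
    # Only terms present in both series can be compared.
    for term in sorted(model_accuracy.keys() & null_accuracy.keys()):
        difference = model_accuracy[term] - null_accuracy[term]
        # Sign of the difference: +1 model wins, -1 null wins, 0 identical.
        scores[term] = (difference > 0) - (difference < 0)
    return scores

model_accuracy = {1900: 0.70, 1901: 0.60, 1902: 0.65}
null_accuracy = {1900: 0.65, 1901: 0.66, 1902: 0.65}
term_scores = score_terms(model_accuracy, null_accuracy)
```

Because each term contributes only a sign, this view discards the size of the gap between the two models and shows only which one did better in a given term.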
Beyond performance on a term-by-term basis, another perspective on the performance of our model is to see how it performs on a justice-by-justice basis. At the justice-by-justice level over the past 100 years,

Statistical evaluation of model performance against the null models
While the figures and tables above offer basic evidence regarding the performance of our model, we can now proceed to statistically measure the degree of confidence in our outperformance against the null model. For completeness, in Table 8, we present the results of three tests for both our justice-level and case-level prediction models compared against the M = 10 null model: (i) a paired t-test on annual case accuracy series, (ii) a Wilcoxon rank-sum test on annual case accuracy series, and (iii) a binomial test on per-case outcomes. Tests (i) and (ii) evaluate whether, under both parametric and non-parametric assumptions, our model outperforms the baseline model at an aggregate, longitudinal level as measured by annual accuracy. Test (iii), on the other hand, tests whether the distribution of individual model predictions is significantly better than a "fairly"-weighted coin flip. All tests are framed as one-sided tests that require our model accuracy to be greater than that of the null or baseline model.
These tests indicate that our random forest model significantly outperforms the baseline model, both at the aggregate, per-term level and at the per-case distribution.
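As a rough illustration of tests (i) and (iii), the sketch below computes a paired t statistic on two annual accuracy series and a one-sided exact binomial tail probability. It uses only the Python standard library, the function names and sample values are hypothetical, and it is not the procedure used to produce Table 8. The rank-sum test (ii) is omitted for brevity.

```python
from math import exp, lgamma, log, sqrt
from statistics import mean, stdev

def paired_t_statistic(model_accuracy, null_accuracy):
    """Paired t statistic on annual accuracies (positive favors the model)."""
    differences = [m - n for m, n in zip(model_accuracy, null_accuracy)]
    standard_error = stdev(differences) / sqrt(len(differences))
    return mean(differences) / standard_error

def binomial_tail_pvalue(successes, trials, p0=0.5):
    """One-sided p-value P(X >= successes) for X ~ Binomial(trials, p0).

    Requires 0 < p0 < 1. Terms are computed in log space so that large
    trial counts do not overflow.
    """
    total = 0.0
    for k in range(successes, trials + 1):
        log_pmf = (
            lgamma(trials + 1) - lgamma(k + 1) - lgamma(trials - k + 1)
            + k * log(p0) + (trials - k) * log(1 - p0)
        )
        total += exp(log_pmf)
    return total

# Made-up annual accuracies for four terms.
t_stat = paired_t_statistic([0.70, 0.72, 0.68, 0.75], [0.65, 0.66, 0.67, 0.70])
# Probability of at least 3 correct out of 3 under a fair coin.
p_value = binomial_tail_pvalue(3, 3)
```

A positive t statistic indicates that the model's annual accuracy exceeds the baseline's on average. Converting it to a p-value requires the t distribution with n - 1 degrees of freedom, which a statistics library would normally supply.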

Conclusion and future research
Building upon prior work in the field of judicial prediction [1][2][3], we offer the first generalized, consistent and out-of-sample applicable machine learning model for predicting decisions of the Supreme Court of the United States. Casting predictions over nearly two centuries, our model achieves 70.2% accuracy at the case outcome level and 71.9% at the justice vote level. More recently, over the past century, we outperform an in-sample optimized null model by nearly 5%. Among other things, we believe such improvements in modeling should be of interest to court observers, litigants, citizens and markets. Indeed, with respect to markets, given that judicial decisions can impact publicly traded companies, as highlighted in [32], even modest gains in prediction can produce significant financial rewards. We believe that the modeling approach undertaken in this article can also serve as a strong baseline against which future science in the field of judicial prediction might be cast. While a researcher seeking to optimize performance for a given case or a given time period might pursue an alternative approach, our effort herein was directed toward building a general model, one that could stand the test of time across many justices and many distinct social, political and economic periods.
Beyond predicting U.S. Supreme Court decisions, our work contributes to a growing number of articles which either highlight or apply the tools of machine learning to some class of prediction problems in law or legal studies (e.g., [5], [33], [34], [35], [36], [37], [38], [39], [40]). We encourage additional applied machine learning research directed to these areas and new areas where the application of predictive analytics might be fruitful.
At its core, our effort relies upon a statistical ensemble method used to transform a set of weak learners into a strong learner. We believe a number of future advancements in the field of legal informatics will likely rely on elements of that basic approach. Namely, our focus on statistical crowdsourcing foreshadows future developments in the field. Future research will seek to find the optimal blend of experts, crowds [41] and algorithms, as some ensemble of these three streams of intelligence will likely produce the best-performing model for a wide class of prediction problems [42].

Acknowledgments
We would like to thank our reviewers and all of those who provided comments on prior drafts of this paper.