The UK Research Excellence Framework and the Matthew effect: Insights from machine learning

With the high cost of the research assessment exercises in the UK, many have called for simpler and less time-consuming alternatives. In this work, we gathered publicly available REF data, combined them with library-subscribed data, and used machine learning to examine whether the overall result of the Research Excellence Framework 2014 could be replicated. A Bayesian additive regression tree model predicting university grade point average (GPA) from an initial set of 18 candidate explanatory variables was developed. One hundred and nine universities were randomly divided into a training set (n = 79) and test set (n = 30). The model “learned” associations between GPA and the other variables in the training set and was made to predict the GPA of universities in the test set. GPA could be predicted from just three variables: the number of Web of Science documents, entry tariff, and percentage of students coming from state schools (r-squared = .88). Implications of this finding are discussed and proposals are given.


Introduction
Many advantages of modern life are the fruits of scientific research and innovation. Yet in many countries, the proportion of research that is funded by the government has declined, while the share of industry has increased [1]. As a consequence, academic researchers are facing greater competition for funds. The competition among researchers and universities mirrors competition in the market [2], with the difference that success in the market is easy to measure (i.e. profit) and universally accepted. By contrast, research productivity has no universally accepted metric. For example, how does one rank an achievement in pure mathematics (e.g. proving Fermat's last theorem) vis-a-vis an achievement in applied mathematics (e.g. the RSA algorithm, the basis of encryption in online banking)? This comparison requires a value judgment between knowledge that advances a discipline and knowledge that is practical [3]. Despite the difficulty of these comparisons, countries such as the UK, Australia, Italy and Germany engage in research assessment [4] (and their scientists comply) because holding science accountable has been accepted as a social norm [5], and assessment exercises may be a "good enough" realization of a merit-based distribution of rewards [6].
Since 1986 and every six or so years thereafter, the UK has held research assessment exercises [7,8]. The REF 2014 assessment evaluated three areas: research outputs (weighted at 65 percent), impact (20 percent), and environment (15 percent), all of which were rated by an expert panel [9]. The classification of each institution's research quality across departments determined its grade point average (GPA). GPA multiplied by the number of units submitted for assessment became the basis for research funding allocation. Although the REF is intended to ensure that research funds are well spent, it is a costly activity, with estimates ranging from £47 million to £1 billion [10,11]. An official report to the UK's funding bodies put the figure at £246 million, which equals the annual funding awarded to UCL (ranked second) and Cambridge (ranked third) for 2015-16 [10,12]. Time-wise, each institution spent an average of 985 person-weeks [10] preparing submissions, an increase over the 2008 assessment [12]. These preparations included mock assessments in which academics were graded by panels of colleagues [4]. Those with unfavourable assessments faced the threat of job loss, denial of tenure, or transfer to a teaching-only role [13]. These demands and challenges may have prompted the heads of the British Academy and the Royal Society to call for a less burdensome and costly alternative [14].
Several alternatives to the REF have been proposed [15,16]. Generally, these alternatives rely on citation counts. The systematic analysis of citations, called bibliometrics, rests on two assumptions: first, that noteworthy scientific works elicit reactions from other scientists [17], and second, that citation count within the first year after publication predicts long-term scientific importance [18]. The h-index is the largest number of articles h that have been cited at least h times [19]. From this definition, it is clear that it has two components: article count and citation count, which represent productivity output and impact respectively [20]. Generally, these counts are not the same, but Hirsch's definition constrains them to be equal, so it is the lower of the two counts that serves as the ceiling for the other. H-index is primarily calculated for each researcher but can be aggregated at the department or institution levels. With regard to the REF, h-index is proposed as an alternative (or complementary) metric for departmental research productivity and quality [15,19,21]. Recently, Mryglod and colleagues examined whether departmental h-indices predict departmental rank in REF 2014 [8]. Importantly, they published their predictions before the actual results were out. Later on their predictions were shown not to correspond with actual rank, and they concluded that h-index could not substitute for the REF [22]. While this might be the case, previous papers have found a correlation between citations and research assessment scores [23,24]. Thus, citations and other factors could still play a role in research assessment.
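The h-index definition above can be made concrete with a short sketch. The function name and the citation counts below are hypothetical and chosen only for illustration; they are not taken from the study's data.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for position, count in enumerate(ranked, start=1):
        if count >= position:
            h = position  # the top `position` papers all have >= position citations
        else:
            break
    return h


# Hypothetical citation counts, for illustration only.
established = [25, 8, 5, 4, 3, 1, 0]   # seven papers, four of them cited at least 4 times
newcomer = [0] * 10                    # ten papers, none cited yet

print(h_index(established))
print(h_index(newcomer))
```

In this sketch the article count caps the index from above (seven papers can never yield an h-index above 7), while the citation counts decide how much of that ceiling is reached.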
As a starting point, it would be helpful to see the financial cost of the REF in light of what it intends to achieve: meritocracy and judicious spending. "Not spreading limited funds too thinly" is a condition for the UK to remain internationally competitive, according to a Russell Group report [25]. This view evokes the metaphor of competition, in which winners, who may have enjoyed advantages from the start, get rewarded: the so-called Matthew effect. An alternative metaphor, one that is partly collaborative, can be given. The REF is a resource division exercise among several players, where the amount to be divided is reduced by how much the exercise itself costs. This implies that it would be in the interest of the players to minimize the administrative cost of the competition in order to have more money left over to divide. This is consistent with the objective of judicious spending. A cost-effective REF would serve the interests of peer reviewers, researchers, and all universities. Admittedly, universities that are weaker in research have more to gain compared to stronger ones, because the latter would take a large portion regardless of the size of the funding pie.
The present study had two objectives. First, to examine if a machine learning algorithm, applied to publicly available and library-subscribed data, predicts the REF 2014 GPA score.
The REF itself used the following formula for overall GPA: GPA = (4 × %4* + 3 × %3* + 2 × %2* + 1 × %1* + 0 × %unclassified) / 100, where each percentage is the share of a submission rated in that category. The starred numbers are categories ranging from 4* (world-leading) to 1* (recognised but modest), plus an extra "unclassified" category. Although these ratings ultimately determine rank, it can be argued that they in turn depend on universities' accumulated human, social, and financial capital [26] and the extent to which institutions encourage or discourage research, as opposed to say, teaching [27]. These factors define the network or organizational context in which researchers are embedded [28]. A researcher's affiliation opens or closes doors to resources, rewards, and the genesis and diffusion of ideas [29,30], and these ultimately influence researcher productivity [27]. But research is not all about intangible factors. While the REF ostensibly rewards excellent outputs, it may actually reward amassed inputs. For example, while publishing in Science or Nature indicates a discovery of some importance [31], getting published there usually requires access to substantial funding. The housing and upkeep of laboratory mice might cost $200,000 annually [32]. Thus, the second objective of this project was to determine which among a wider set of factors most strongly predict REF performance. We required that these factors of research output be publicly available (or available through university libraries) in an attempt to reproduce the ranking at little or no cost.

Materials and methods
Of the 128 institutions appearing in the REF 2014 ranking, 19 non-typical or specialized ones were excluded (e.g. Institute of Cancer Research, Open University, Cranfield University). This was done because many of the explanatory variables (described in the next section) are not applicable to them. One hundred and nine universities remained in our sample.

Data sources and variables
The data for this study came from several different sources: the Times Higher Education report on REF 2014, HESA tables, Web of Science, and the Guardian league tables. Our selected predictor variables were of three kinds: institutional, faculty, and student characteristics. Institutional variables were: university income, total expenditures per student, number of full-time equivalent researchers submitted for assessment to the REF, and student-to-staff ratio. Faculty variables were: citation impact (total citations / total papers), average h-index, percentage of faculty with a PhD, and number of Web of Science documents. Web of Science is a citation database offered on a subscription basis by Clarivate Analytics. It covers publications in the basic and life sciences, social sciences, and humanities. Bibliometric data were extracted for each university from the InCites database, restricted to the years 2008-13, corresponding to the period being evaluated. Student variables were: average entry tariff (a measure of how selective a university is by offering admission on the basis of high examination grades), percentage of socially disadvantaged students (defined as those coming from UK social classes 4 to 7), percentage of students from state schools, percentage of disabled students, percentage of students with ADHD, percentage of UK-domiciled students employed six months after graduation, average graduate salary, student satisfaction score from the National Student Survey, and career prospect score.

Analysis
We used Bayesian additive regression trees (BART) [33] to predict GPA. The BART model "learns" associations between the predictor variables and GPA in a subset of data and is then tested in a separate subset. This can be compared to a student who studies a set of topics and participates in a mock exam. The student might do well in it, but the basis for the grade is performance in the actual exam, which may or may not be similar to the mock exam. In like fashion, the BART model is "trained" using part of the data and is then assessed on its performance on the held-out data.
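The idea of training on one subset and testing on held-out data can be sketched as a random partition of the 109 universities into 79 training and 30 test units. This is a minimal illustration in Python rather than the R code used in the study; the function name, the placeholder labels, and the seed are assumptions.

```python
import random


def split_train_test(units, n_test, seed):
    """Randomly partition units into a training list and a held-out test list."""
    rng = random.Random(seed)      # fixed seed so the split can be reproduced
    shuffled = list(units)         # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder labels standing in for the 109 universities in the sample.
universities = [f"university_{i:03d}" for i in range(1, 110)]

train, test = split_train_test(universities, n_test=30, seed=2014)
print(len(train), len(test))
```

Fixing the seed matters here because a different random split would give a different test set and therefore a somewhat different accuracy estimate.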
Conceptually, BART models are a type of regression model in which the dependent variable is predictable by a dichotomous split in one or more predictor variables, presented in sequence. For example, the best estimate for a car's price might depend on the following nested questions. Is it imported? Is it a new model? Is its engine displacement greater than 1600 cc? By successively partitioning the sample according to the responses, an accurate estimate can be reached. Sketching a diagram of the series of questions would result in a tree-like structure with a number of terminal nodes, which represent answers to the questions. Creating several trees in like manner would result in trees of different structure and number of nodes. BART then creates a sum-of-trees model and performs regularization on the parameters of that model based on a set of priors that favour simplicity (i.e. fewer nodes) [33]. For technical details, kindly refer to the original BART paper [33].
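The car-price example can be written out as a toy sum-of-trees predictor. This sketch only illustrates how nested yes/no splits lead to terminal-node values and how several trees are added together; it is not BART itself, since it has no priors and no fitting step, and every leaf value below is invented for illustration.

```python
def tree_origin(car):
    # First tree: nested yes/no questions about origin and model age.
    if car["imported"]:
        return 12000.0 if car["new_model"] else 7000.0
    return 9000.0 if car["new_model"] else 5000.0


def tree_engine(car):
    # Second tree: a single dichotomous split on engine displacement.
    return 3000.0 if car["displacement_cc"] > 1600 else 1000.0


def sum_of_trees(car, trees):
    """Add up the terminal-node value that each tree assigns to this car."""
    return sum(tree(car) for tree in trees)


# Hypothetical car: imported, older model, 1800 cc engine.
car = {"imported": True, "new_model": False, "displacement_cc": 1800}
print(sum_of_trees(car, [tree_origin, tree_engine]))  # 7000 + 3000
```

In a fitted model each tree contributes only a small part of the prediction, which is what the regularizing priors described above encourage.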
The modelling process, consisting of the following steps, is depicted in Fig 1. First, we followed the 2/3 to 1/3 ratio recommendation for assigning units to a training and test set, with 79 universities assigned to a training set and 30 to a test set [34]. The test set was set aside until step 4. In the training set, we calculated bivariate correlations of the 18 predictor variables plus GPA. Then, we fit a BART model with GPA as the dependent variable. Calibration and tuning were performed to determine hyperparameter values including: α and β (together with node depth, these give the probability that a node is nonterminal), m (number of trees), k (a parameter that controls how aggressively regularization is applied), q (probability that BART has lower error than least squares regression), and ν (a degrees-of-freedom parameter that controls the shape of residual error) [34,35]. For this work, the hyperparameter values were: α = .95, β = 2, m = 20, k = 3, q = .95, ν = 3.
Finally, we examined which predictors were most relevant. BART's algorithm builds trees, growing and pruning them iteratively to achieve better fit. In this process, only a small set of variables is used and BART keeps track of how often each variable is used [33]. This serves as a measure of variable importance. We applied the final BART model to the test set. In this step, the model was given the values of predictors to be used for calculating fitted GPA. Predicted GPA was then compared to actual GPA and r-squared was calculated, which was used as an indicator of accuracy.
BART modelling was implemented in the R software [35] using the bartMachine package [36]. The data and syntax used in the analysis are available at: https://www.protocols.io/view/uk-ref-2014-analysis-data-and-r-script-t3eeqje.
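To show how α, β and node depth combine, the sketch below evaluates the tree-structure prior from the original BART formulation, in which a node at depth d is nonterminal with probability α(1 + d)^(-β). The function name is hypothetical; the default arguments are the α and β values reported above.

```python
def p_nonterminal(depth, alpha=0.95, beta=2.0):
    """Prior probability that a node at the given depth is split further."""
    return alpha * (1.0 + depth) ** (-beta)


# The root has depth 0; the probability of splitting falls quickly with depth.
for depth in range(4):
    print(depth, round(p_nonterminal(depth), 4))
```

With these values the root is split with probability .95, but a node at depth 1 is split with probability of only about .24, which is how the prior favours shallow trees with few terminal nodes.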
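The accuracy check in the final step can be sketched as follows: compute the correlation between actual and predicted GPA, square it, and compare the two rank orders. The GPA values below are invented for illustration and are not the study's data; the helper names are hypothetical, and ties in rank are not handled.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(scores):
    """Rank 1 is the highest score; ties are not handled in this sketch."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


# Hypothetical GPA values for five test-set universities.
actual = [3.2, 2.9, 2.7, 2.5, 2.4]
predicted = [3.1, 2.8, 2.5, 2.6, 2.3]

r = pearson_r(actual, predicted)
print(round(r, 2), round(r ** 2, 2))
print(ranks(actual), ranks(predicted))
```

The example also shows why a high r-squared does not guarantee identical rankings: two universities with similar predicted GPA can swap places even when the overall fit is close.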

Results
Entry tariff (r = .78), percentage of faculty with doctoral degrees (r = .77), and h-index (r = .74) had the strongest correlations with GPA. (See Table 1). Inspecting which variables correlated with these three, we found that h-index was most strongly associated with university income, the number of full-time equivalent staff submitted to the REF for assessment, and the number of Web of Science documents for which the university was listed as an affiliation. Entry tariff was strongly inversely related to the percentage of students coming from state schools (r = -.89).
The BART model identified three of the 18 variables as important predictors (Fig 2). These were: number of Web of Science documents, entry tariff, and percentage of students coming from state schools. More Web of Science documents and higher entry tariff predicted higher GPA, while percentage of state-school-educated entrants was inversely related to GPA (see Fig 2). The final model's predicted GPA was strongly correlated with actual GPA in the test set (r = .94, r-squared = .88). The correlation and r-squared are based on the columns labelled Actual GPA and Predicted GPA in Table 2. However, the rank order of predicted GPA was discrepant with that of actual GPA. The largest discrepancies were with Roehampton University (actual rank in the test set: 13th; predicted rank in the test set: 23rd) and Central Lancashire University (actual: 25th; predicted: 18th). This error in ranking would have financial consequences in real life: a £700,000 loss (20%) for Roehampton and a £240,000 gain (6%) for Central Lancashire, based on actual 2015 research funding allocations [37]. Interestingly, our model ranked

General discussion about the results
The present study had two main findings. First, a machine learning algorithm applied to publicly available and library-subscribed data gave a good estimate of GPA. The second and more surprising finding is that Web of Science documents, entry tariff, and percentage of students coming from state schools were the most valuable predictors of GPA. We now discuss the implications of these findings. Web of Science documents measures quantity only and disregards quality or impact, unlike the h-index. These two measures are related but do not correspond exactly. For example, the h-indices of deceased and retired scientists (who do not produce new papers) can only either increase or stay the same. On the other hand, a new scientist can publish 10 papers in one year but their h-index is 0 until their papers are cited. In a study that aimed to relate the h-index and 8 alternative indices with peer assessment, Bornmann and colleagues reported two important findings [20]. These indices consisted of two factors corresponding to impact and publication count. Compared with publication count, impact was more predictive of peer assessments. In the present work, publication count almost perfectly correlated with university income (r = .97) as did h-index (r = .94). These correlations do not indicate that income is a sufficient condition for publications and citations. They do confirm that scientific discovery and research excellence are subject to economic factors like everything else.
The strong correlation between Actual and Predicted GPA indicates that an inexpensive and time-saving technique could be used in conjunction with peer review. Using Cohen's guide [38] for assessing effect size, this would be a strong effect (r-squared of .26 or larger). Would it be reasonable to expect perfect prediction, i.e. for predicted GPA to equal actual GPA? We think that it is neither necessary nor desirable. There is an inherent trade-off between accuracy and cost in any evaluation process. Beyond a certain accuracy threshold there are diminishing returns to further improvement, and the question of whether £200 million is a fair cost (vis-à-vis a semi-automated REF) deserves serious discussion. A particularly wasteful use of time and money was the Canadian experience in which the cost of peer review exceeded that of giving every researcher a baseline grant [39]. Another source of measurement error is selection bias. To appreciate this point, it is worth pointing out that universities selected which staff (or units) they wanted to be assessed. For this reason, a rise in REF rank across time could simply be due to a more astute selection of staff instead of real progress [40]. Preparatory work for REF 2021 now includes a proviso to include all researchers.
That greater entry selectivity and fewer state-educated students should predict REF ranking at all highlights the tension between research excellence and social inclusion. It is to be expected that research-intensive universities rank higher in the REF. What the results suggest is that research excellence may partly result from a lack of diversity. Ultimately, universities are made up of people, so it is unsurprising that student and staff characteristics predict research output. Distinct processes that determine the composition of students and staff in UK universities are at work. First, elite UK universities select a high proportion of privileged students. In the years 2007-09, 18 percent of pupils from comprehensive schools were accepted into the Top 30 UK universities while the corresponding figure was 48 percent for independent schools [41]. Independent school pupils were nearly seven times as likely to be accepted to Oxbridge as those from comprehensive schools [41]. Second, privileged families select private schools and elite universities as a means of passing on wealth and status to their children. Recent research in the UK and US suggests that wealthy families actively hinder downward mobility by ensuring that their children retain their social position even if they are of lower aptitude than lower class counterparts [42,43]. Not surprisingly, the majority of Nobel scientists and Royal Society Fellows are the children of parents with professional and managerial-technical occupations [44,45]. Third, children from underprivileged families (even with good grades) are less inclined to apply to elite universities because of financial and psychological barriers [46,47].
A similar selection process is at work in universities. In a study of research productivity and prestige of academic position among biochemists, Long reported that departmental prestige has a stronger effect on productivity than prior publications [48]. This is interpreted as the accumulation of advantage that is responsible for the stratification of departments and universities into more and less prestigious groups [29]. This is manifested in the US by the stability of departmental prestige rankings from 1925 to 1993 despite the movement of people in and out of departments [29]. Consistent with the results of the present study, universities with the deepest pockets can also afford to hire the most reputable researchers [32].
The present study is subject to several limitations. First, typical universities (e.g. University of the Highlands and Islands, Leeds Beckett University) with a missing value in any of the 18 candidate predictors were excluded from our sample. This can be potentially remedied by requiring all universities to make their data publicly available. Second, as a proof of concept, we split UK universities into a training and a test set. As a result, our predictions of rank were limited to 30 universities. This would have to be modified in a real-world application because all 120 or so universities would have to be part of the test set. One possibility in the next UK assessment cycle is to conduct the panel review and machine learning components in parallel. For the machine learning part, the algorithm can be trained using REF 2014 data and then fed new data for testing. The departmental rankings of the two components could then be compared. Third, our model gives university-level GPA but not GPA at the department level. Variability across departments in a given university definitely affected our GPA estimate since the citation and document counts across disciplines are not comparable. For example, the average impact factors of molecular biology journals were 8 times greater than the average impact factors of mathematics journals [49]. Lastly, the prediction model relied solely on the Web of Science database. It has been reported that a researcher's h-index, publication count, and citation count may vary across Google Scholar, Scopus, and Web of Science [50,51].

Proposals based on the results
In light of the findings of the present study, the following proposals regarding national research assessment exercises are given:
Proposal 1: Consider incorporating machine learning into research assessment by applying it to human-developed metrics of research excellence. The advantages of using machine learning over human assessors only are: reproducibility, transparency, objectivity and the inclusion of all university researchers. Huge amounts of data about research outputs can be interrogated for publications, datasets shared, patents, software programs contributed, and other intellectual products. These outputs could be linked to university researchers who can be mandated to use a universal identifier such as ORCID to link them with research outputs and their affiliations. This step alone would free researchers from documenting, cherry-picking, and perhaps window-dressing their research outputs, leaving more time to do research itself. The biggest challenge would be the selection of appropriate indicators or metrics of research excellence. Several discussions on this topic have been held including the Science and Technology Indicators Conference in Leiden, the Netherlands [52], which led to the publication of proceedings and the Leiden Manifesto [53]. These and other documents could serve as the basis for formulating standard research metrics. Once these metrics are developed, the ranking process can be left to an open-source algorithm whose result can then be verified. The introduction of machine learning could also have disadvantages. First, depending on the algorithm used, understanding how the ranking was produced could be challenging. This is especially true of neural networks and deep learning that produce accurate predictions without giving people insight into how they work [54]. Without understanding how the rankings were reached, universities would not know how to improve their performance in the next cycle. Secondly, algorithms are not immune to bias, and therefore care has to be taken so that human biases about gender, race, and reputations do not make their way into machine learning-based judgments of research excellence.
Proposal 2: Invest the savings from a partly automated REF exercise to fund research positions at universities lower-ranked in the REF. The REF results are the basis for the allocation of quality-related (QR) funding. This amounts to about £12 billion for 6 years. Assuming that the use of machine learning could save 50 percent of the cost of panel-based research assessment, there would be about £123 million (half of the £246 million official estimate) available for strengthening research capacity in universities with fewer resources. The principal reason in support of this argument is that promoting excellence needs to be balanced with broadening the college of scientists [55]. The singular pursuit of excellence could result in "teaching only universities" that disseminate facts but not the spirit and methods of inquiry which are increasingly needed in a knowledge economy [55]. Accordingly, an independent review of the REF recommended that research excellence ought to be supported wherever it is found [56]. A second reason for redistributing potential REF savings is to mitigate the possibility of ranking error. This follows from the fact that 2 of the top 3 predictors of GPA are not research indicators as such (Fig 3). If the REF rankings were a race to the finish line, these student characteristics (as well as other advantages) may represent different starting points. A recent analysis showed that apart from Oxford and Cambridge, the rest of the Russell Group universities did not form a distinct cluster of their own [57]. From a measurement perspective, this raises the question of whether an ordinal ranking is possible, or whether clusters fit the data better. A third consideration for spreading research funds is that research productivity, measured by number of publications, increases only up to a point [58]. For this reason, both the US National Institutes of Health and the Canadian Institutes of Health Research have decided to limit the size of grants any researcher can receive.

Conclusion
Machine learning could be used alongside peer review in the UK's research assessment exercises. The most important predictors represented factors not directly related to research impact. Nevertheless, the predicted ranking corresponded well with actual ranking and placed research-intensive universities at the top. This paradoxical result suggests that input factors in the form of financial and social capital play a role in research output. Incorporating machine learning into research assessment may reduce the burden of panel review. The savings in money and time could be reinvested in universities with fewer resources.
24. Oppenheim C. The correlation between citation counts and the 1992 research assessment exercise ratings for British research in genetics, anatomy and archaeology. J Doc. 1997; 53(5):477-87.