Modeling Verdict Outcomes Using Social Network Measures: The Watergate and Caviar Network Cases

Modelling criminal trial verdict outcomes using social network measures is an emerging research area in quantitative criminology. Few studies have yet analyzed which of these measures are the most important for verdict modelling or which data classification techniques perform best for this application. To compare the performance of different techniques in classifying members of a criminal network, this article applies three machine learning classifiers (Logistic Regression, Naïve Bayes and Random Forest) with a range of social network measures and the necessary databases to model the verdicts in two real-world cases: the U.S. Watergate Conspiracy of the 1970s and the now-defunct Canada-based international drug trafficking ring known as the Caviar Network. In both cases the Random Forest classifier did better than either Logistic Regression or Naïve Bayes, and its superior performance was statistically significant. Random Forest was therefore used not only for classification but also to assess the importance of the measures. For the Watergate case, the most important measure proved to be betweenness centrality, while for the Caviar Network it was effective network size. These results are significant because they show that an approach combining machine learning with social network analysis not only can generate accurate classification models but also helps quantify the importance of social network variables in modelling verdict outcomes. We conclude our analysis with a discussion and some suggestions for future work in verdict modelling using social network measures.


Introduction
Although modelling criminal trial verdict outcomes is a classic problem in predictive criminology [1], building verdict classification models for criminal networks is a relatively new area of research. This paper compares the performance of different analytic techniques for addressing the problem of verdict outcome classification using machine learning and social network measures.
The scientific investigation of social networks in criminal organizations is a branch of quantitative criminology that generates knowledge regarding such networks through the analysis of links between network members [2]. Such an analysis requires data that, unlike the information normally employed by criminologists, bears directly on these membership ties. By examining these data, the researcher can explore in detail the social behaviour of criminal groups and organizations [3][4][5][6][7][8][9][10][11][12] and terrorist operations [13][14][15]. This focus on the ties or links between group members is what accounts for the success of social network analysis in the study of criminal organizations [16][17][18][19][20][21][22][23][24][25].
The problem of verdict outcome classification in particular is of great interest to various actors in criminal justice systems, and especially to forensic criminologists faced with the task of converting a set of data into evidence of a network's criminal conduct. To our knowledge, however, only three academic studies have analyzed the relationship between verdicts and social network measures: the pioneering work by Baker and Faulkner [26] and, more recently, the papers by Faulkner and Cheney [27] and Morselli, Masías, Crespo and Laengle [28]. These authors have used different sets of social network measures to test their relationships with verdict outcomes. Their general conclusion is that social networks have much potential for constructing models that can successfully predict verdicts.
Valuable though these three analyses are, they all confine their methodologies to the use of Logistic Regression as a data classifier. Studies in other contexts comparing different classifiers have shown that their performance can vary significantly depending on the data domain they are applied to [29][30][31][32][33][34][35]. This suggests that classification techniques other than Logistic Regression should be evaluated to determine how well they perform, comparatively, on criminal network data.
The present article is an attempt to carry out just such comparisons. Two real-world cases will be used for the purpose: (1) the Watergate Conspiracy (WC), the American political scandal of the 1970s; and (2) the Caviar Network (CN), a now-dismantled international drug trafficking ring that was based in Montréal, Canada. The classifiers whose performance will be evaluated and compared in addition to Logistic Regression (LR) [36] are Naïve Bayes (NB) [37] and Random Forest (RF) [38]. Our contribution consists principally in demonstrating that an approach combining machine learning with social network analysis not only can generate accurate classification models but also helps quantify the importance of social network variables in modelling verdict outcomes. Both of these conclusions are new findings in the field of criminology and penology.
The remainder of the article is organized into four sections. Section 2 reviews the relevant literature; Section 3 presents the methodology, the data, the social network measures, the analysis and the models obtained; Section 4 sets out the results separately for the two cases studied and the importance of each network measure; and finally, Section 5 discusses the results and a number of specific issues raised by the analysis and states our final conclusion on the performance of the three classifiers in modelling verdict outcomes.
The problem that the three studies cited above attempted to address is the following: given a set of evidence or data on the relations between individuals in a social network suspected of criminal activity, what can be inferred with a certain degree of confidence regarding their guilt or innocence? Traditionally, the data criminologists work with do not include information on such relations. By contrast, the relatively new social network approach focuses explicitly on these interdependencies.
The three studies explored the predictive power of various measures of centrality, which "quantify an intuitive feeling that in most networks some vertices or edges are more central than others" ([39], p. 16). All three found the centrality of a criminal network member to be correlated with verdict outcomes. In [26], the earliest of the articles, Baker and Faulkner investigated illegal networks in the Heavy Electrical Equipment Industry (HEEI) that were involved in conspiracies to fix the prices of switchgears, transformers and turbines. Their analysis chose 78 individuals from 13 companies who directly participated in the price-fixing, 37 of whom were eventually found guilty or pleaded no contest (nolo contendere). The authors discovered that the degree centrality indicator, which measures the number of direct contacts an individual has with others, had a positive and significant relationship to the verdict. As degree centrality increased, the probability that a given agent was found guilty increased as well. Using this metric Baker and Faulkner were able to identify 87% of the individuals who were found guilty and 78% of those who were found innocent.
The second case study [27] analyzed the Watergate scandal [40], a highly complex criminal case in which various individuals were found guilty or innocent and a number of the sentences handed out were subsequently increased, reduced or revoked [41,42]; initially, 7 persons were convicted. The authors showed that the betweenness centrality indicator [43], which measures the number of shortest paths from all vertices to all others that pass through a given network member, contributed significantly to the guilty verdict classifications. As betweenness centrality increased, the probability that a given conspirator was found guilty also increased [27].
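To make the betweenness definition concrete, the following is a minimal sketch in Python that counts, for each member of an undirected and unweighted network, the shortest paths between other members that pass through it. It uses Brandes' algorithm, which is an assumption on our part and not necessarily the routine used in the studies cited. The toy network and its labels are hypothetical.

```python
from collections import deque

def betweenness_centrality(adj):
    """Unnormalised betweenness for an undirected, unweighted graph (Brandes' algorithm)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma) to every vertex.
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies, starting from the vertices farthest from s.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted twice, once from each endpoint.
    return {v: c / 2.0 for v, c in score.items()}

# Hypothetical toy network: "B" and "D" act as brokers between the other members.
toy = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B"],
    "D": ["B", "E"],
    "E": ["D"],
}
bc = betweenness_centrality(toy)
print(bc)
```

In this toy network the broker "B" lies on the shortest paths of four pairs and "D" on three, while the peripheral members lie on none, which is the pattern the indicator is meant to capture.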
The third case study, a collaborative effort by Canadian and Chilean researchers, analyzed the Caviar Network, a former Canada-based drug trafficking operation as noted in the Introduction [28]. This work is particularly revealing because unlike the other two cases, it used data collected from real communications between the network members. The police investigation of the network resulted in the arrest of 25 individuals, of whom 22 were charged and 14 found guilty. The study found that the out-degree centrality indicator [44], which measures an agent's outgoing communication edges, made a significant contribution to the verdict classifications. As out-degree centrality increased, so did the probability that a given conspirator was found guilty [28]. The findings of this analysis and the two other studies just discussed are summarized in Table 1, showing the different measures tested in each case and the statistical significance of the results.
In brief terms, the three studies found empirical support for the hypothesis that social network indicators show considerable potential for modelling verdict outcomes. This suggests that the degree of responsibility of an individual in a network can be related to their networked behaviour. These are new findings given that most previous studies have attempted to make predictions based on social, demographic, economic or ethnic variables, or on variables relating to the functioning of the judicial system, among other factors [45][46][47][48][49][50][51][52].
The demonstration of this working hypothesis opens up an area of research that raises various predictive analysis problems. One of these problems relates to the data. A promising approach to the data sets is provided by the NB and RF classifiers referred to earlier. Although there appear to be no previous works applying these classification techniques to the issues investigated here, some studies have shown that NB and RF perform better than LR in the identification of terrorist attacks [53][54][55][56][57]. For example, Graham et al. [58] reported that NB correctly identified roughly 80% of the perpetrators of terrorist group events in the Philippines. In another paper comparing the three classifiers, Hill et al. [54] found that RF outperformed LR and NB in correctly identifying the guilty parties in one of the world's major terrorism hot spots.
With the above considerations regarding the state of the art in mind, the next section describes the experimental method adopted for the present study.

Methodological Setup
The method and strategy followed in our comparative analysis are depicted in the flowchart in Fig 1. A set of 16 social network measures was chosen as the independent variables, and in both cases the dependent variable was the verdict outcome, a binary variable classifying each criminal agent as either guilty or innocent. The next step was to calculate the various social network measures using the original data sets for the WC and CN cases. Predictive models based on LR, NB and RF classifiers were then constructed. To address the class imbalance in the two data sets, the Synthetic Minority Oversampling Technique (SMOTE) was used [59]. The models were validated using the 10-fold Cross-Validation (10-fold-CV) technique [60]. To compare their respective performances, various performance measures were applied. These included Accuracy [61], Precision [62], Recall [62], the Area under the ROC curve (AUC) [63,64] and the Matthews Correlation Coefficient [65]. Finally, a series of tests using Cochran's Q and McNemar's test statistic ($\chi^2_{Mc}$) [66,67] were conducted to determine whether the differences recorded were statistically significant. More detailed information on these various steps is given in the following subsections.

Data sources and networks for analysis
The data sets for the WC and CN cases are described below:
WC Working Web. The source of our data for the WC case was the documentary research carried out by Faulkner and Cheney [27,68]. Each author individually coded the information they collected to establish the relations between network members and determine who did what with whom in the various Watergate activities, and then compared their results. Finding that they agreed 100% of the time on which actors worked with whom, they concluded that their coding was highly reliable. A sociogram of the WC data set is shown here in Fig 2.
CN Communication Flow. The source of our data for the CN case was the documentary research conducted by Morselli [69], who collected evidence derived from electronic surveillance transcripts presented in court during the trials of some of the network participants. The more than 1,000 pages of transcripts released to the public revealed the communication flow

Compute selected social network measures
The 16 social network measures serving as the independent variables are briefly described in Table 2. Some of them were found to be statistically significant in the previous studies (see Table 1) while others were tested here for the first time. Also, whereas all 16 measures could be calculated for CN, only 12 could be for WC because of the binary or non-binary nature of the data or the symmetry or asymmetry of the matrix (S1 Supporting Information).

Run machine-learning classifiers
In what follows we describe some of the techniques used in building the models, balancing the classes, and running, validating and evaluating the models.
Classifiers. The three machine-learning classifiers used in this study are described below:
• The LR classifier learns probability functions of the form P(Y|X), where Y is the class variable and X the attribute vector [36]. LR assumes a parametric form for P(Y|X) and estimates its parameters from the training data. The form is usually a logistic function, which gives the method its name and ensures that the probabilities range between 0 and 1. The log-odds of P(Y|X) are then a linear combination of the predictor attributes.
• NB computes the a posteriori probabilities of a class variable given some set of predictor variables using Bayes' rule [37]. It is simple to implement and has proven to perform very well with a variety of data types in supervised learning settings even though it implicitly assumes independence of attributes [79]. NB theoretically works best when the predictor variables are independent features. However, as has been pointed out by Rish, "The naive Bayes classifier greatly simplify [sic] learning by assuming that features are independent given class. Although independence is generally a poor assumption, in practice naive Bayes often competes well with more sophisticated classifiers" ([80], p. 41). We chose this classifier in light of the view expressed in another study that "NB is the best choice under the condition of highly imbalanced class distribution" ([81], p. 454).
• RF trains a collection of unpruned decision trees, each on a bootstrap sample drawn from the original data set with replacement. Each tree is then used to classify an instance individually [38] and the instance is assigned to the class that receives the most votes. One of RF's features is that it can calculate variable importance measures using the Out-of-Bag (OOB) method, which enhances understanding of which attributes have greater predictive power. The only parameter that has to be chosen is n, the number of variables selected randomly at each node from the N available variables. The value of n is determined experimentally by selecting the value that minimizes the error rate for the OOB data. In our study the number of variables selected at random was n = 8 for WC and n = 3 for CN. For both data sets, RF was trained with 500 trees to ensure that every input row was predicted at least a few times. We chose this classifier because it performs well in databases with relatively few cases, high-dimensional feature space and complex data structures [82,83].
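The resampling and voting steps of RF described above can be illustrated with a short sketch. It is not the randomForest implementation used in the study: it only draws bootstrap samples, records which rows are left out-of-bag for each tree, and combines hypothetical tree votes by majority. The row count of 61 is borrowed from the size of the WC network purely for illustration, and the seed is arbitrary.

```python
import random
from collections import Counter

def bootstrap_indices(n_rows, rng):
    """Draw n_rows indices with replacement; rows never drawn are out-of-bag (OOB)."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag

def majority_vote(votes):
    """Return the class chosen by the largest number of trees."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)      # fixed, arbitrary seed so the sketch is repeatable
n_rows, n_trees = 61, 500   # illustrative row count; 500 trees as in the text

# Count how many of the 500 trees leave each row out-of-bag.
oob_counts = [0] * n_rows
for _ in range(n_trees):
    _, oob = bootstrap_indices(n_rows, rng)
    for i in oob:
        oob_counts[i] += 1

# Hypothetical votes from three trees for a single instance.
decision = majority_vote(["guilty", "innocent", "guilty"])
print(decision, min(oob_counts))
```

Because each bootstrap sample omits roughly a third of the rows, growing several hundred trees leaves every row out-of-bag many times, which is what allows each row to receive OOB predictions.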
Addressing unbalanced class distribution. A dataset is unbalanced if the classes are not more or less equally represented. This is true of both of our case studies. In WC, 7 of the 61 individuals in the network were found guilty while in CN, it was 14 out of 110. If the imbalance is not corrected it may result in very low values for the recall and precision performance measures. To rebalance the two data sets we therefore applied the SMOTE approach [59], one of the most widely used strategies in the machine learning community for dealing with unbalanced classes in classification problems. This technique over-samples the minority class by creating synthetic examples rather than over-sampling with replacement. It is administered "by taking each minority class sample and introducing synthetic examples along the line segments joining any/all of the k minority class nearest neighbours. Depending upon the amount of over-sampling required, neighbours from the k nearest neighbours are randomly chosen" ([59], p. 328). SMOTE has been successfully used to balance classes in classification problems involving social network data [84][85][86]. Here, for the WC case SMOTE obtained 42 synthetic observations for class 1 and 70 for class 0 while in the CN case, the corresponding numbers were 84 and 140.
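The interpolation step quoted above can be sketched as follows. This is a simplified illustration of the SMOTE idea and not the DMwR routine used in the study: the function name, the two-feature minority points and the choice of k = 2 are all hypothetical.

```python
import random

def smote(minority, n_synthetic, k, rng):
    """Create synthetic minority points by interpolating towards nearest minority neighbours."""
    synthetic = []
    for i in range(n_synthetic):
        base = minority[i % len(minority)]
        # The k nearest minority neighbours of `base`, by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # random position on the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class rows: (degree centrality, betweenness centrality).
guilty = [(1.0, 0.0), (2.0, 1.0), (3.0, 1.0), (5.0, 4.0)]
rng = random.Random(42)  # arbitrary seed
synthetic = smote(guilty, n_synthetic=6, k=2, rng=rng)
print(synthetic)
```

Every synthetic row lies on a line segment between two real minority rows, so the new observations stay inside the region already occupied by the minority class instead of duplicating existing rows.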
Model validation. In order to guard against overfitting, the models generated by LR, NB and RF were validated using the 10-fold-CV technique. Previous research has recommended the use of k-fold cross-validation sampling in networked data given that "the test accuracies of classifiers that use network information are always better. This is due to the fact that with random partition the nodes in the test partition will naturally have more neighbours from train and validation partitions, which have the actual labels, as opposed to the labels estimated by the classifiers" ([87], p. 146). The 10-fold-CV technique has been used in cultural modelling, an emerging field aimed at developing computational models of small groups [81][88][89][90]. The technique randomly splits the original sample into 10 "folds" or subsamples. One of the ten subsamples is used only to test the model, while the remaining nine are used for the algorithm training process. This process is repeated 10 times, so that each subsample serves as the test set exactly once. Thus, 10 outcomes are obtained that are then averaged to evaluate the performance of the classifier.
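The fold rotation described above can be sketched in a few lines. The example below is illustrative only and does not reproduce the cvTools workflow used in the study: the placeholder model simply predicts the most common training label, and the labels reuse the 42 and 70 class counts reported for the rebalanced WC data.

```python
import random

def k_fold_indices(n_rows, k, rng):
    """Shuffle the row indices and deal them into k roughly equal folds."""
    idx = list(range(n_rows))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def majority_class_accuracy(labels, train_idx, test_idx):
    """Placeholder model: predict the most common training label for every test row."""
    ones = sum(labels[i] for i in train_idx)
    prediction = 1 if 2 * ones > len(train_idx) else 0
    hits = sum(1 for i in test_idx if labels[i] == prediction)
    return hits / len(test_idx)

# Illustrative labels: 42 guilty (1) and 70 innocent (0) rows.
labels = [1] * 42 + [0] * 70
rng = random.Random(1)  # arbitrary seed
folds = k_fold_indices(len(labels), 10, rng)

scores = []
for i, test_idx in enumerate(folds):
    # The nine folds other than fold i form the training set.
    train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
    scores.append(majority_class_accuracy(labels, train_idx, test_idx))

mean_accuracy = sum(scores) / len(scores)
print(round(mean_accuracy, 3))
```

Any of the three classifiers could be substituted for the placeholder model; the rotation of test folds and the final averaging stay the same.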

Table 2. The 16 social network measures serving as the independent variables.

| Measure | Description | Reference |
| --- | --- | --- |
| Degree centrality | The number of neighbours of an agent | [44] |
| In-Degree centrality | Agent in-flow communication edge | [44] |
| Out-Degree centrality | Agent out-flow communication edge | [44] |
| Eigenvector centrality | Degree of connected agents who are themselves connected to many players | [71] |
| Authority centrality | An agent is authority-central to the extent that its in-links are from agents that have many out-links | [72] |
| Hub centrality | A node is hub-central to the extent that its out-links are to agents that have many in-links | [72] |
| Betweenness centrality | Across all agent pairs that have the shortest path containing the player, the percentage that pass through the player | [43] |
| Information centrality | Calculates the Stephenson and Zelen information centrality measure for each agent | [73] |
| Triad count | Number of triads centred at the agent | [74] |
| Interlockers | Interlockers are agents that have a triad count (the number of triads each node is in) that is greater than the mean plus one standard deviation | [74] |
| Radials | Radials are agents that have a triad count (the number of triads each node is in) that is less than the mean minus one standard deviation | [74] |
| Clique count | The number of distinct cliques to which each agent belongs | [75] |
| Constraint | The degree to which an agent is constrained by its current communication network | [76] |
| Effective network size | The effective size of an agent's communication network based on redundancy of ties | [76] |
| Clustering coefficient (local) | Density of the agent's ego network, which is the subgraph induced by its immediate neighbours | [77] |
| Simmelian ties | Number of ties with strongly, reciprocally connected players when there are one or more third-party agents who commonly have strong and reciprocal edges to themselves and the connected agent | [78] |

doi:10.1371/journal.pone.0147248.t002
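As an illustration of how three of the ego-network measures listed above relate to one another, the sketch below computes degree, the local clustering coefficient and effective network size for one agent in a binary, undirected network. The effective size uses Borgatti's simplification of Burt's measure (number of alters minus the average number of ties each alter has to other alters), which is an assumption here; the values reported in the study were computed with ORA and may follow a different formulation. The toy network is hypothetical.

```python
def ego_measures(adj, ego):
    """Degree, local clustering and effective size of `ego` in a binary undirected graph."""
    alters = set(adj[ego])
    n = len(alters)
    # Ties among the ego's alters, each undirected tie counted once.
    t = sum(1 for a in alters for b in adj[a] if b in alters and a < b)
    possible = n * (n - 1) / 2
    clustering = t / possible if possible else 0.0
    # Borgatti's simplification for binary undirected data: n - 2t/n.
    effective_size = n - 2 * t / n if n else 0.0
    return {"degree": n, "clustering": clustering, "effective_size": effective_size}

# Hypothetical toy network (symmetric adjacency lists).
toy = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B", "E"},
    "E": {"D"},
}
broker = ego_measures(toy, "B")
bridge = ego_measures(toy, "D")
print(broker, bridge)
```

The sketch shows the trade-off the two measures capture: the more an agent's contacts are tied to one another, the higher its clustering coefficient and the lower its effective size, because those contacts are redundant.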

Performance measures
The performance measures for the above techniques were calculated using a confusion matrix, that is, a matrix containing the numbers of positive and negative predictions made by a classification system (see Fig 4). The three classifiers' respective performances were evaluated using standard measures of Accuracy [61], Precision [62], Recall [62] and the Area under the ROC curve (AUC) [63,64], the last of which was computed via Leave-One-Out Cross-Validation (LOOCV) [91] as suggested by Airola et al. [92,93] for small data sets. Also applied was the Matthews Correlation Coefficient (MCC) [65], which is often used to measure performance with unbalanced databases (see Table 3).
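For reference, the four measures derived directly from the confusion matrix can be computed as in the sketch below, using their standard definitions. The counts are hypothetical and are not results from this study.

```python
import math

def classification_scores(tp, fp, fn, tn):
    """Accuracy, precision, recall and MCC from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "mcc": mcc}

# Hypothetical counts: guilty is the positive class.
scores = classification_scores(tp=30, fp=5, fn=12, tn=65)
print({name: round(value, 3) for name, value in scores.items()})
```

Unlike accuracy, the MCC uses all four cells of the matrix symmetrically, which is why it remains informative when one class is much larger than the other.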

Statistical evaluation of models
Two tests were used to evaluate the performance of LR, NB and RF: Cochran's Q test [94], which evaluates the three classifiers simultaneously, and McNemar's test ($\chi^2_{Mc}$) [66], which evaluates them pair by pair. For Cochran's Q the null hypothesis ($H_0$) was that the three classifiers performed similarly, whereas the alternative hypothesis ($H_1$) was that they performed differently. For McNemar's test, the null hypothesis ($H_0$) was that the two models in a given pair performed similarly, while the alternative hypothesis ($H_1$) was that their performances differed.
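Both statistics can be computed from a record of which instances each classifier got right. The sketch below uses the textbook formulas and chi-square approximations; it is not the RVAideMemoire or exact2x2 code used in the study, and the input data are hypothetical. The p-value shortcut for Cochran's Q holds only for three classifiers (2 degrees of freedom).

```python
import math

def mcnemar_corrected(b, c):
    """McNemar's chi-square with continuity correction.

    b and c count the instances on which exactly one of the two classifiers is correct.
    """
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-square survival function, 1 df
    return stat, p_value

def cochran_q(correct):
    """Cochran's Q; rows are instances, columns are classifiers, entries are 1 or 0."""
    k = len(correct[0])
    col_totals = [sum(row[j] for row in correct) for j in range(k)]
    row_totals = [sum(row) for row in correct]
    n = sum(row_totals)
    denom = k * n - sum(r * r for r in row_totals)
    return (k - 1) * (k * sum(c * c for c in col_totals) - n * n) / denom

# Hypothetical correctness record for three classifiers on six instances.
correct = [[1, 1, 1], [1, 0, 1], [0, 0, 1], [1, 0, 1], [0, 0, 0], [1, 1, 1]]
q = cochran_q(correct)
p_q = math.exp(-q / 2)  # chi-square survival function, valid for 2 df only

stat, p_mc = mcnemar_corrected(b=10, c=2)
print(round(q, 3), round(p_q, 3), round(stat, 3), round(p_mc, 3))
```

Under the same approximation, a Q of 5.09 with 2 degrees of freedom corresponds to a p-value of about 0.078, which is consistent with a result that is significant at the 10% level but not at 5%.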

Software
The social network measures were computed using the Organization Risk Analyzer (ORA) software tool [70,95,96]. We also used the following R packages for data analysis: the DMwR R package for SMOTE [97], r-base-core for LR [98], the e1071 R package for NB [99], the randomForest R package for RF and variable importance analysis [100], the cvTools R package for computing 10-fold-CV and ROC with LOOCV [101], the RVAideMemoire R package for Cochran's Q test [102], and the exact2x2 R package for McNemar's test [103]. Questions or comments regarding the quantitative data analysis using ORA and the R packages may be addressed to the authors.

Results
In this section we present and compare the performance scores for the WC and CN cases, and also set out the importance values of the variables in the best predictive model obtained.

Classification results for the WC case
The results for the WC case are summarized in Table 4, which shows the average performance scores (AVG) and their standard deviations (SD) for the three classification models. As can be seen, RF was the classifier with the highest average scores (in bold type) and the lowest standard deviations (underlined) for the accuracy, precision, recall, MCC measures, and ROC Area (AUC) (see Fig 5). Also apparent is that LR outperformed NB on all of the measures except Recall, where NB did better.
Cochran's Q test rejected the null hypothesis (Q = 5.09 with p < 0.10), although only at the 10% significance level, meaning that the performances of LR, NB and RF were statistically different. McNemar's pair-by-pair test with continuity correction found that RF's higher scores were significant in comparison to both NB ($H_0$ is rejected, $\chi^2_{Mc}(1)$ = 29.6, p < .001) and LR ($H_0$ [...]). The tests also demonstrated that LR's superior performance to NB was significant ($H_0$ is rejected, $\chi^2_{Mc}(1)$ = 6.68, p = .0087). Clearly, then, RF was the classifier that performed best in modelling verdict outcomes in the WC case while LR did better than NB.

Classification results for the CN case
The results in the CN case are summarized in Table 5. Once again, they show that RF was the classifier with the highest average performance scores (in bold) and the lowest standard deviations (underlined) for the accuracy, precision, recall, MCC measures, and ROC Area (AUC) (see Fig 6). LR outperformed NB on all of the measures except Precision, where NB did better.
Cochran's Q test rejected the null hypothesis (Q = 31.75 with p < 0.001), meaning that the performances of LR, NB and RF were statistically different. McNemar's pair-by-pair test with continuity correction found that as in the WC case, RF's higher performance scores were significant in comparison with both NB ($H_0$ is rejected, $\chi^2_{Mc}(1)$ = 30.1, p < 0.001) and LR ($H_0$ is rejected, $\chi^2_{Mc}(1)$ = 9.63, p < 0.001). The tests also showed that LR's superior performance relative to NB was again significant ($H_0$ is rejected, $\chi^2_{Mc}(1)$ = 6.68, p = .0087). Thus, in the CN case the RF classifier performed best in modelling verdict outcomes while LR did better than NB.
Since RF obtained the best results in both cases, the following subsection presents the importance values of the social network measures in the predictive models.

Importance of social network measures in verdict classification
RF is used not only for classification but also to assess variable importance. The latter concept is defined as the total decrease in node impurities from splitting on the variable, averaged over all trees, where node impurity is measured by the Gini index [38,104]. Of all the variables tested, the one with the highest value has the greatest impact on the classifier's ability to assign each instance to its correct class.
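The impurity decrease that underlies this definition can be illustrated for a single split. The sketch below computes the Gini impurity of a node and the decrease obtained by splitting it on one variable at one threshold; a forest sums such decreases over every split on the variable and averages over trees. The variable values, labels and threshold are hypothetical.

```python
def gini(labels):
    """Gini impurity of a node holding binary labels (1 = guilty, 0 = innocent)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def gini_decrease(values, labels, threshold):
    """Impurity decrease from splitting a node into value <= threshold and value > threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

# Hypothetical betweenness scores and verdicts for eight agents.
betweenness = [0, 0, 1, 2, 5, 7, 8, 9]
verdict = [0, 0, 0, 0, 1, 1, 1, 0]
decrease = gini_decrease(betweenness, verdict, threshold=3)
print(decrease)
```

A variable that repeatedly produces large decreases of this kind across the forest separates guilty from innocent agents well and therefore receives a high importance value.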
The importance analysis for RF was conducted following the procedure proposed by Breiman [38]. The permutation-based mean decrease in accuracy was used to measure the importance of each variable in the classification. The importance values for each variable in the WC and CN cases are displayed in Figs 7 and 8, respectively. In the WC case, betweenness centrality was the most important social network measure, as measured by the Gini index, in discriminating between innocent and guilty parties. The next five most important variables were the simmelian ties, clique count, degree centrality, triad count and clustering coefficient measures.
In the CN case, effective network size was the social network measure of greatest importance as indicated by the Gini index in discriminating between innocent and guilty parties. The next two in importance were out-degree centrality and then eigenvector centrality.

Discussion
The results set out in the previous section demonstrated clearly that RF performed better, and in a statistically significant manner, than either NB or LR on the accuracy, recall, precision, MCC, and ROC measures. We may therefore conclude that RF produced better verdict outcome classifications in the two cases studied than the other two classifiers. But beyond this basic conclusion there are a number of important issues raised by the analysis presented thus far that are taken up in the following subsections.

On social network measures
The three published works discussed here earlier [26][27][28] on modelling verdict outcomes with social network measures used only seven, one and four measures, respectively (see Table 1 above). This contrasts with the present study in which 16 were employed (see Table 2 above). If we compare the importance values we obtained for the measures using the RF model in the WC case with the original WC study [27], we find that they agree on the primary importance of betweenness centrality in determining verdict outcomes. The original study utilized this measure to operationalise the notion that "political conspiracies rely on brokers between individuals and mediators between groups to integrate the cabals with each other and cabals with the cadre" ([27], p. 266). We found, however, that betweenness centrality was not able to take directly into account the small groups in which actors play the role of broker or gatekeeper. Despite this measure's importance, then, other measures pointing to the social microstructures that agents intermediate must also be investigated.
Our results showed that there are in fact several other measures with degrees of importance similar to that of betweenness centrality, such as simmelian ties, clique count, degree, triad count and clustering coefficient. All of these indicators attempt to measure network substructures, the very phenomenon that [27] referred to in the observation quoted above on the WC case network being organized into cabals that are part of cadres. If we consider the WC study authors' assertion that "[a] cabal is a clandestine team assembled to carry out political sabotage, espionage, and other illegal activities" ([27], p. 266-267), such social structures are precisely the sort that can be detected by the simmelian ties (ties embedded in cliques), clique count, degree, triad count and clustering coefficient measures. In the WC case network (i.e., a cadre), the agents initially found guilty were those who developed a high degree of betweenness and were organized into clandestine teams (i.e., a cabal).
If we compare the importance values of the RF model variables obtained above with the original investigation in the CN case [28], we find that while out-degree centrality was important in both studies, effective network size had greater importance in the RF model. This measure has been explored empirically in a paper on criminal networks in Québec, Canada by Morselli and Tremblay [105]. The two authors conducted a correlational analysis of the data gathered from a survey of inmate volunteers in southern Quebec prisons, finding that "higher proportions of nonredundancy in personal networks (networks with higher effective size) were positively associated with criminal earnings, market crime commissions and low self-control, and negatively related with the age of the offender". And in a path analysis of the same data, Morselli, Tremblay and McCarthy corroborated that "mentorship increases effective size, which in turn increases criminal earnings. In other words, criminal mentors improve their protégés' social capital and such brokerage-like networking offers a competitive advantage in crime" ([106], p. 35). Thus, both the present study and previous empirical evidence have found that the effective network size measure plays a prominent role in this type of criminal network.
The importance value obtained for the eigenvector centrality measure in the CN case also deserves comment. From a theoretical viewpoint, eigenvector centrality is designed to identify central agents connected to others who are also central. As one author has put it, "[e]igenvector centrality capitalizes on how differences in degree can propagate through a network (...) If one believes that differences in degree drive centrality, status, or power, then eigenvector centrality is called for" ([107], p. 561). Empirically, the measure has been successful in identifying criminals. For example, it was used as an index for ranking the "Men of Honor" among members of the U.S. Mafia and enabled the construction of a predictive model for detecting criminal leaders [108]. Eigenvector centrality has also been proposed for locating central agents in cooffending networks [109]. The measure's importance in our RF model indicates the possibility of a hierarchical typology of criminal networks in which there are agents in high positions connected to many agents in low positions. Indeed, the three measures taken together suggest the following is true of CN: (a) the network has an organizational style in which the criminals control the effective size of their ego-network to distribute their earnings (effective network size); (b) it communicates directly with various agents (out-degree centrality); and (c) it has central agents that connect with other central agents (eigenvector centrality). In our view, the insights into the interplay between the various features of a social network are the most interesting aspect of importance analysis.

Challenges Ahead
A criminal network can be studied using different analytical techniques. Current pedagogical practice, however, tends to give preference to conventional rather than alternative approaches. This is reflected in introductory textbooks on quantitative criminology, which present LR as one of the first techniques of choice in binary classification problems [110][111][112]. The present study provides evidence that at least one alternative method has the ability to generate predictive models which are highly competitive with those produced by LR in criminal responsibility identification. The results we obtained are consistent with a small group of studies that have also reported better performance for RF than for LR or NB [54,113]. Further research is required to compare the performance of different classifiers in this data domain.
The number of social network measures with predictive potential that can be used to characterize a criminal network is constantly growing. The data domain of social networks is based on two primitive types of data: (a) with whom a network member has a relationship; and (b) how many times that relationship was activated. With these data types a virtually infinite number of social network indicators can be constructed. In the last three decades in particular, a number of new social network measures have been developed [44,74,114–120], including variations on the classic centrality measures [121][122][123][124][125][126][127][128]. In addition, weighted social network measures have been created that enable human interaction to be explored in new ways (see [129] and [130]). However, the inclusion of weighted social network measures in prediction problems has the undesirable effect of increasing the dimensionality of the data. New machine learning approaches must therefore be developed or adapted for using this new type of measure. Other types of social network measures appear regularly in specialized journals such as Social Networks or Connections. New techniques for selecting measures and managing the growing dimensionality generated by this increasing variety will have to be developed in future research.
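To make the step from the two primitive data types to network indicators concrete, the following minimal Python sketch derives three measures from a list of (sender, receiver, count) records. It is illustrative only and rests on several assumptions: the records are hypothetical toy values rather than data from the WC or CN cases, the function names are invented for this example, and effective size is computed with Borgatti's simplification (n - 2t/n) on the undirected, unweighted version of the network, which may differ from the implementation used in this study.

```python
from collections import defaultdict

# Hypothetical toy records: (sender, receiver, number_of_contacts).
# Illustrative values only, not data from the WC or CN cases.
contacts = [
    ("A", "B", 3),
    ("A", "C", 1),
    ("A", "D", 2),
    ("B", "C", 4),
    ("D", "A", 1),
]


def out_degree(records):
    """Number of distinct agents each agent contacted (data type (a))."""
    targets = defaultdict(set)
    for sender, receiver, _ in records:
        targets[sender].add(receiver)
        targets.setdefault(receiver, set())  # keep agents who only receive
    return {node: len(nbrs) for node, nbrs in targets.items()}


def weighted_out_degree(records):
    """Total number of contacts each agent initiated (data type (b))."""
    totals = defaultdict(int)
    for sender, receiver, count in records:
        totals[sender] += count
        totals[receiver] += 0  # keep agents who only receive
    return dict(totals)


def effective_size(records):
    """Effective size of each ego network using n - 2t/n, where n is the
    number of alters and t the number of ties among those alters.
    Direction and contact counts are ignored (an assumption of this sketch)."""
    neighbours = defaultdict(set)
    for sender, receiver, _ in records:
        if sender != receiver:
            neighbours[sender].add(receiver)
            neighbours[receiver].add(sender)
    sizes = {}
    for ego, alters in neighbours.items():
        n = len(alters)
        # count each tie between two alters once (u < v avoids double counting)
        t = sum(
            1
            for u in alters
            for v in neighbours[u]
            if v in alters and u < v
        )
        sizes[ego] = n - 2 * t / n
    return sizes


print(out_degree(contacts))
print(weighted_out_degree(contacts))
print(effective_size(contacts))
```

For agent A in the toy records, out-degree is 3, weighted out-degree is 6, and effective size is 3 - 2/3 (about 2.33), because two of A's three contacts (B and C) are also tied to each other. Each additional weighted variant of a measure adds a column to the feature matrix in this way, which is the dimensionality growth noted above.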
Difficulties may also arise with the use of classifiers for verdict classification if the database for the analysis has class imbalances [131], that is, if one class of the dependent variable contains significantly more cases than the other (for example, many more innocent than guilty parties or vice versa). In this study we used the SMOTE technique to balance the classes in both cases, but other strategies exist in the literature [88,132–134]. Another potential complication has to do with the size of some criminal networks [109,135]. The natural size of a human social network is approximately 150 [136], or in more specific terms, "[m]aximum network size averaged 153.5 individuals, with a mean network size of 124.9 for those individuals explicitly contacted" ([137], p. 53). In criminal networks, however, with the exception of certain cases such as the Italian Mafia [108] or corrupt companies such as ENRON [138], the number of members may be relatively limited [139]. This means that future investigations will require techniques that can learn quickly from a small number of observations. Finally, there is the question of which structural aspects of criminal networks influence the way social network measures differ in their importance in determining verdicts. In our results for the WC and CN cases, the order of importance of the social network measures is not the same. In qualitative terms, the variables which predicted verdict outcome were different. Whereas effective network size was eighth in importance in WC, it was first in CN. Betweenness centrality, meanwhile, was one of the least important measures in CN but the most important in WC. This is indirect evidence that the configuration of the two networks and their relationships with verdict outcome also varied. Further research such as that reported in [140] is needed to compare the structures of networks in terms of their topological characteristics in order to understand how structural aspects contribute to explaining criminal network verdict outcomes.

Limitations of this experimental study
This experimental study has three main limitations. The first has to do with the interpretation of the RF models. Despite its good classification performance, RF is a black-box type of algorithm, and its models are difficult for humans to interpret in the sense that the results do not indicate the individual effect of each attribute on the output variable. Future research should include an investigation of visualization techniques based on sensitivity analyses, as suggested by [141].
The second limitation relates to the possible bias stemming from the treatment of unbalanced classes. Although SMOTE was used in the present study to address the class imbalance problem, the small number of available instances for training the algorithms may have produced an unknown level of bias in the synthetic instances generated. However, the results show that the models generated by RF exhibit low error levels in the classification results, a sign that SMOTE combined with RF helped to increase the predictive ability of the models in both classes. As noted earlier, several studies have applied SMOTE in classification problems using social network data. For example, it was used to create a model for detecting social security fraud in Belgium [86]. In another case the technique was utilized to reduce some of the effects of class imbalance between Trust and Distrust classes in online social network services [84]. Finally, it was employed to rebalance the classes for a trust prediction problem using social network data [85]. Additional research (for example, simulation studies using artificially generated data) to improve our understanding of the effect of SMOTE on class distribution would nevertheless be useful.
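The concern about small samples can be illustrated with a simplified sketch of the interpolation idea behind SMOTE, written in plain Python. It is not the implementation used in this study: the function name smote_sketch, its parameters, and the two-feature toy vectors are all hypothetical, and the real technique and its library implementations include details omitted here.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=2, seed=0):
    """Create synthetic minority-class points by interpolating between a
    randomly chosen minority point and one of its k nearest minority-class
    neighbours (Euclidean distance). Simplified, for illustration only."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # rank the remaining minority points by distance to the base point
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical feature vectors for three minority-class members,
# e.g. (betweenness centrality, effective size). Not data from this study.
minority = [(0.10, 2.0), (0.15, 2.5), (0.30, 4.0)]
new_points = smote_sketch(minority, n_synthetic=4)
print(new_points)
```

Every synthetic point lies on a line segment between two observed minority-class members. With only a handful of such members, the synthetic instances can therefore cover only a narrow region of the feature space, which is one way the bias described above could arise.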
The third limitation of this study, a problem inherent in verdict classification, is that our models assumed the judicial system reached the correct verdict in each case. Yet it is well known that in the WC case, although Richard Nixon was one of the main perpetrators, the system treated him differently than the others. As for the CN case, some of the criminal agents were not convicted in exchange for informing on the others. Given the many complications in any criminal justice process, and more specifically, the complex rules governing criminal trials, the vagaries in the performance of the prosecution and defence teams and the quality of the evidence, both this and previous studies have had little choice but to proceed on the assumption that the criminal justice systems' verdicts are correct. In other words, what is really being studied is not who was actually guilty but how the justice systems classify guilt. Thus, for determining guilt or innocence the methods we have discussed have real limitations, but for modelling the behaviour of the judicial process they have much to offer.

Final words
This study has attempted to respond to a number of questions and define new tasks for modelling criminal trial verdicts using social network measures. The ultimate goal is to provide criminologists with valuable feedback for the efficient allocation of resources and effort to issues of public interest. The application of machine learning to criminal networks requires further study, particularly as regards the ethical and legal questions that arise in real-world cases. Greater application of social network analysis and machine learning in quantitative criminology could provide valuable information about the organization of criminal networks and their networked behaviour.