A Novel Clustering Methodology Based on Modularity Optimisation for Detecting Authorship Affinities in Shakespearean Era Plays

In this study we propose a novel, unsupervised clustering methodology for analyzing large datasets. This new, efficient methodology converts the general clustering problem into the community detection problem in graph by using the Jensen-Shannon distance, a dissimilarity measure originating in Information Theory. Moreover, we use graph theoretic concepts for the generation and analysis of proximity graphs. Our methodology is based on a newly proposed memetic algorithm (iMA-Net) for discovering clusters of data elements by maximizing the modularity function in proximity graphs of literary works. To test the effectiveness of this general methodology, we apply it to a text corpus dataset, which contains frequencies of approximately 55,114 unique words across all 168 written in the Shakespearean era (16th and 17th centuries), to analyze and detect clusters of similar plays. Experimental results and comparison with state-of-the-art clustering methods demonstrate the remarkable performance of our new method for identifying high quality clusters which reflect the commonalities in the literary style of the plays.


Introduction
Following great advancements in technology, all scientific fields have been faced with huge empirical datasets. Identifying groups of objects, patterns or elements with "similar characteristics" has always drawn the attractions of researchers. Data clustering is one of the most widely studied problems in data mining and machine learning with a wide variety of applications in many fields ranging from the analysis of genomic data in biology [1,2] to classifying customers for efficient marketing strategies [3,4]. Clusters are natural groups in data such that elements within the same cluster are more similar to each other than to elements belonging to other clusters [5]. Thousands of techniques have been proposed to address a wide variety of clustering problems in different disciplines (see the review of data mining and clustering techniques by Han and Kamber [6]). The main clustering methods include probabilistic and generative models [7,8], distance-based clustering [9], density and grid-based clustering [10,11], spectral clustering [12,13] and graph clustering [14].
Graph clustering is the process of grouping vertices of a graph into clusters according to the structure of the graph. Generally speaking, a good graph clustering contains clusters with many edges within the clusters and few between clusters. However this definition is unclear, thus many quality functions have been proposed to evaluate graph clusters such as ratio cut [15], normalized cut [16] and modularity [17]. In the recent literature a cluster of vertices in a graph is mostly called a community [18] and graph clustering discussed as the community detection problem [19]. The graph clustering problem is a fundamental algorithmic problem, however it has not been solved satisfactory and it is a notably challenging member of the NP-complete class [20,21].
This study proposes a new data clustering methodology for the authorship analysis in a dataset generated from 168 plays from the Shakespearean era. Authorship study is a longstanding research activity, predating the availability of computers by many centuries. Interest in authorship follows from the magnitude of the consequences of assigning a work to one author or another, e.g. in separating canonical and apocryphal books in a religious tradition, establishing the authenticity of legal documents, or determining which works or parts of works are by celebrated literary authors. With the availability of digital text and computational tools, this activity is now predominantly quantitative. The Shakespearean canon has been a particular focus, with studies testing the claims of various possible authors other than William Shakespeare of Stratford to the works we know as Shakespeare's [22], and debating possible additions to Shakespeare's canon [23,24]. The advent of quantitative methods has also renewed interest in the idea that authorship is objectively a property of texts, rather than largely the invention of readers and institutions, as was argued by poststructuralist theorists [25][26][27][28].
Here, we propose a new methodology for identifying groups in the large text corpus dataset. The methodology is easily generalizable and can be employed for analyzing any other dataset. This advanced clustering methodology brings together Jensen-Shannon divergence(JSD) as an information theoretic dissimilarity measure and combines it with a newly proposed memetic algorithm (iMA-Net) for graph clustering to yield a powerful strategy for segmenting the text corpus dataset. The proposed memetic algorithm maximizes the modularity quality function and it is an unsupervised algorithm, i.e. the method does not require any information about the data elements, such as authorship and genre of plays, it only uses the structure of the dataset to uncover clusters of similar plays. The outcomes are promising and show great authorial affinities in each cluster; a computational proof of the highly individual literary style of each author.

Dataset
In this work, we utilized a text corpus dataset containing 168 plays from the Shakespearean era (16 th and 17 th centuries) with the unambiguous contribution of authorship of 39 authors (see Table 1). This dataset was generated by Craig and Whipp [29] using a software application called Intelligent Archive (IA) to count variant spellings of approximately 55,114 unique words across all 168 plays. The software tool IA is used to extract the word-play matrix from a collection of suitably edited texts, so that only the words of dialog are counted and other materials, such as prefaces, stage directions and speaker tags are omitted. In the texts, contractions such as "they're" are expanded so this counts as one instance of they and one of are. The IA has the capacity to combine variant spelling forms so that multiple spellings of the same word are combined into a single count [29]. As a result, this dataset contains about 9.3 million word usage statistics (word frequencies) stored in the form of a (55,114 × 168) matrix. The wordplay matrix is available in the supporting information of [30].
This dataset was analyzed by Marsden et al. [30] and they found measurably distinct literary styles by detecting the tendency towards overuse or avoidance of particular words for four key authors who account for the largest number of plays in the dataset. That study resulted in a set of marker words that can distinguish between plays written by Fletcher, Jonson, Middleton and Shakespeare compared to the remainder of the authors. Recently, Arefin et al. [31] used this dataset to test the performance of their new graph clustering method incorporating paraclique identification and a clustering method known as MST-kNN [32].

Method
As mentioned earlier, to detect clusters of plays that are more similar in the given dataset, we propose an unsupervised graph-based clustering methodology that employs a new memetic algorithm which optimizes the modularity value in k-Nearest Neighbour (kNN) graphs derived from the dataset. Fig 1 outlines our methodology as a four-stage process. This clustering methodology is general; it can be used to detect clusters in a dataset of n elements where each element has m quantified features (the input dataset can be represented in a form of a (m × n) matrix). In the current study, n equals to 168 and refers to the number of plays and m equals 55,114 and refers to the words used in the plays. Details of each stage are described in the following section. The primary idea of this partitioning approach was proposed in [33] as a new method to analyse a complete network derived from field survey aimed at detecting consumer communities of trust and confidence in the Australian not-for-profit and charity sector. In this paper, we extended the primary idea by using the square root of Jensen-Shannon divergence metric ( ffiffiffiffiffiffiffi ffi JSD p ) as the dissimilarity metric, enhancing the modularity optimization algorithm and finding clusters with higher quality.
Stage 1-Compute the Dissimilarity Matrix. We start our method by computing the dissimilarity between all pairs of plays. We use the square root of Jensen-Shannon divergence ( ffiffiffiffiffiffiffi ffi JSD p ). JSD is a metric measuring the dissimilarity between two probability distributions based on Jensen's inequality [34] and the Shannon entropy [35]. We refer to Lin [36] for a discussion of the JSD. The JSD has been used in many studies, especially in bioinformatics [37][38][39]. The ffiffiffiffiffiffiffi ffi JSD p is selected being the dissimilarity metric because it has the distance metric properties and satisfies the triangle inequality condition (this property is proved in [40,41]). The JSD reflects the dissimilarity of two plays based on the frequency of different words appeared in each play. Generally, for two probability distributions, P and Q, JSD is computed as follows, where, H(X) is Shannon information entropy for distribution X and is computed by Eq 2. Here, x i is the probability of occurrence of word i in the play X. The JSD is a non-negative and symmetric metric, therefore, JSD(P, Q) = JSD(Q, P). For two identical probability distributions (i.e. P = Q), JSD(P, Q) = 0, i.e. there is no dissimilarity between the two. Moreover, by using the log 2 in Eq 2 the JSD is bounded by 1, thus 0 ffiffiffiffiffiffiffi ffi JSD p 1. By computing the ffiffiffiffiffiffiffi ffi JSD p for all pairs of plays, we obtain a symmetric 168 × 168 matrix representing the dissimilarity between all plays (see S1 Table). In this study, the ffiffiffiffiffiffiffi ffi JSD p matrix is computed using the cibm.utils package developed in R. A graph is associated with the dissimilarity matrix, where each node represents a play and edges connect nodes where the weight of an edge is the dissimilarity between the plays. In the next stage, we will introduce a new method to analyze this graph.
Stage 2-Generate k-Nearest Neighbour (kNN) Graphs. In this stage we generate incomplete graphs by removing edges that are less important from the complete graph and preserving edges that represent stronger connections between nodes. For this purpose, we set up a series of kNN graphs starting with k = 1 to k = 10 (in total 10 graphs). To generate the kNN graphs from the dissimilarity matrix, a pair of nodes (a, b) is connected if either a is one of the k-nearest neighbors of b or if b is among the k nearest neighbors of a. The kNN graphs of this study are generated by the scalable software tool proposed by Arefin et al. [42].
Investigating 10 different kNN graphs gives us the opportunity to study the performance of our method under different cases. By increasing the value of k, the average degree of nodes is increased and the graph becomes denser. Therefore, identifying meaningful clusters in kNN graphs with a larger k is more computationally challenging in practice, as graph clustering methods identify clusters in the graph based on the graph's edge structure, and more edges simply means more possible solutions that the method has to analyze. Table 2 shows the number of edges and the average degree for each of the kNN graphs.
We then apply a clustering approach to the kNN graphs. To detect the clusters, we developed a memetic algorithm that optimizes modularity value as the evaluation criterion. Due to the fact that the memetic algorithm uses the edge weights to cluster nodes with strong connections, we have to modify the edge weights in kNN graphs in order to represent the 'similarity' between nodes so that the clusters will contain similar nodes. For this purpose, similarity is defined as one minus dissimilarity.
Stage 3-Modularity optimization algorithm. Modularity optimization is one of the most popular methods for detecting graph structure and hidden groups of nodes with many intraconnections and comparatively few inter-connections. As explained in the introduction, such a group of nodes in the graph is called a community or cluster. It is believed that nodes which belong to the same community (cluster) are more likely to play a similar role or share common functionality [18,19]. Detecting the community structure of a given graph, or community detection, is a newer nomenclature for the graph clustering problem and both problems are similarly looking for groups of "related" nodes to label as a cluster or community [14]. It is worth emphasizing that detecting the community structure is practicable only if graphs are sparse [19]. If the number of edges is close to the maximal number of edges, then the graph is a dense graph, the nodes become too homogeneous and the community structure does not make sense turning the problem into data clustering, which has its own concepts and methods that are more suitable for high density graphs [5,43]. As a variety of complex systems in biology, sociology, physics and computer science can be represented as graphs, community detection is one of the most attractive problems in many disciplines [44,45]. Although different approaches and methods have been proposed to solve the community detection problem, it is still a difficult problem and has not been solved satisfactorily [19]. Modularity, which was first introduced by Newman and Girvan [17], has the ability to evaluate the quality of communities, therefore it has become the key objective function in most of the recent algorithms.
The modularity function is formulated by computing the difference between the number of the edges within a community (i.e. sub-graph) and the number of edges expected to appear in the same sub-graph in the 'null-model'. The null-model is a model of the graph which is used for comparison, to verify whether the graph has community structure. The most popular nullmodel is the randomised version of the original graph proposed by Newman and Girvan, where the same number of nodes are randomly assigned, under the constraint that the expected degree of each node matches the degree of the node in the original graph [19,46]. The random graph is not expected to have a community structure, therefore the graph has a community structure if it is significantly different from the random graph. In other words, higher modularity shows greater difference between the edge density in a sub-graph and expected value in the random graph, revealing community structure. Given an edge-weighted graph G = (V, E, W) the modularity Q of a clustering solution (i.e. community structure) I is defined as follows: The summation runs over all the communities c contained in the solution I, and M refers to the total weight of G, i.e.
where w ij denotes the weight of the edge connecting node i and node j. In Eq 3, l c stands for the total weights of edges connecting two nodes in community c and d c is the sum of the weighted degrees of the nodes in community c, i.e.
A solution with a higher value of modularity is believed to indicate a better community structure or clustering solution within the given graph. Therefore, the maximization of modularity is by far the most popular approach for identifying the community structure of a graph [19]. Since Brandes et al. [47] proved that modularity maximization leads to a NP-complete problem, finding the community structure with the maximum modularity has become a great challenge for data scientists. Therefore, many heuristic algorithms have been developed to find near optimal and good solutions. Some of the most popular methods employed for modularity maximization are heuristics based on hierarchical or aggregation clustering methods [17,[48][49][50], genetic algorithms [51,52], memetic algorithms [53][54][55], spectral methods [56,57] and simulated annealing [58].
In the present study, we propose a new memetic algorithm specially designed for solving the community detection problems named "iMA-Net" that is an improved version of the MA-Net algorithm previously proposed in [59]. There are three main differences between MA-Net and iMA-Net. 1) MA-Net considers all graphs as unweighted graphs while iMA-Net detects communities in weighted graphs. 2) iMA-Net employs an enhanced local search algorithm to make the algorithm more stable. 3) While the population initialization procedure in both MA-Net and iMA-Net applies local search to each individual to improve its fitness, differing local searches make the initialization procedure performance different in MA-Net and iMA-Net. The framework and steps of iMA-Net are explained in detail in the following sections. In the last section of stage 3, the experimental results of the proposed algorithm (iMA-Net) in five real-world benchmark networks are given. To illustrate the performance reliability of iMA-Net, we compare results of iMA-Net with MA-Net [33] and those from five representative community detection algorithms.
iMA-Net: A Memetic algorithm for Modularity optimization. Due to the proven success of memetic algorithms and their effectiveness in dealing with many NP-hard combinatorial optimization problems [60], we chose the memetic framework for our modularity optimization algorithm, named iMA-Net. iMA-Net is a population-based algorithm in which each solution in the population represents a particular clustering of the given weighted and undirected graph. In the proposed algorithm, solutions are evolved using problem-specific genetic operators and a local search procedure employed to detect high-quality community structure with the optimal or near optimal modularity value.
We used the string-coding solution representation in which a clustering solution of G is represented by a list of n integer numbers [C 1 , C 2 , . . ., C n ], where n is equal to the number of nodes in G and C i refers to the community label of node v i . Therefore, if C i = C j then i and j are located in the same community. This way of representing clustering solutions has been used in previous algorithms, for instance [51,53,54]. An illustration example of string-coding representation is given in Fig 2. Algorithm 1 presents the overall framework of iMA-Net for finding the partition of G that achieves the maximum modularity (computed following Eq 3). Details of each step of the algorithm are explained in the following paragraphs. Initialization: The initial population consists of N p feasible solutions, so the following process runs N p times to build the initial population. Firstly, each of the n nodes in G is given a random community label that can be any integer number in the range [1,n]. Then to improve the quality of the solutions in the initial population, the local search procedure, as explained later in this section, is applied on each solution and made changes on community labels of nodes to find a neighbor solution with higher modularity. As the community label is selected randomly from a wide range of [1,n], there is not any constraint on the number of the different labels or the number of clusters in a solution. Applying the local search procedure on each solution, improves the quality of the initial population and speeds up the convergence of the algorithm.
Modularity-based recombination operator: Traditional operators including uniform, onepoint and two-point recombination operators are not suitable for the string based representation as they do not convey characteristics of parent solutions to the offspring. iMA-Net uses a problem-specific recombination operator for generating new solutions that inherit the best characteristics from the parent solutions which is named the modularity-based recombination operator. This recombination operator was first proposed for MA-Net [59]. This recombination method effectively transfers the best characteristics of the parents to the descendants and it works as follows.
First, two solutions I a and I b are selected uniformly at random from the population as parents. The initialization procedure allows us to safely assume that all population members have relatively good quality and the random selection strategy helps maintain the population diversity. Next, a priority list PL is generated by sorting all communities in both parents based on their fitness values, as computed by Eq 3. Then, the best community with the highest priority from PL is selected and the same community is formed in the offspring. Then, for the next community in the priority list, form the most similar community in the offspring with the nodes that are not assigned to any community yet. This procedure runs until all the nodes in the offspring solution have been assigned to a community.
The offspring generated by the modularity-based recombination operator inherits the best community structure from their parents [33]. One of the limitations of modularity optimization algorithms is their difficulty in detecting small size communities and their tendency to merge small communities and form large communities [61]. The modularity-based recombination operator that we proposed has the advantage of generating offspring with a larger number of communities than its parents. This advantage helps the algorithm to explore the search space more deeply and not avoid solutions with small communities.
Adaptive mutation operator: This operator changes the community's label of nodes with the probability of P m . This means that in each run of the adaptive mutation operator dP m Ã ne nodes are randomly moved to one of their neighbors' communities. Thus, this mutation operator is neighbor-based and considers only effective changes. The mutation probability (P m ) is adaptive and it linearly grows from P 0 m to 2P 0 m , as the algorithm approaches the termination condition. As shown in Algorithm 1, the stopping condition of iMA-Net is N s generations without any improvement in the best solution found so far. By increasing the mutation probability (P m ), more changes will occur in the solution, thus diversity of the population will grow and the search area will expand. For instance, when the algorithm's parameters are set to be P 0 m ¼ 0:05 and N s = 30 and the algorithm runs 15 generations without improvement in the best solution found, the mutation probability will grow to 15 30 P 0 m þ P 0 m and it will be P m = 0.075. Local search: The local search procedure works on every new solution developed by the adaptive mutation operator, and it aims to improve the fitness of the solution before adding it to the population. iMA-Net uses the Vertex Movement (VM) heuristic [62] together with a stochastic hill climbing strategy to search the neighborhood to find a better solution. The local search procedure is shown in 2. Given a graph G with n nodes, the local search procedure works as follows. Firstly, L, a random sequence of the nodes is generated. Then, for each node v i chosen according to L, v i 's label is changed to that of a randomly selected neighbor, if the change improves the fitness of the solution. If the movement of v i from its community to none of its neighbors' community increases the fitness (modularity), we leave v i in its place and check the next node in L. The above process runs until there is no change for any of the nodes in L that improves the fitness of the solution (see Algorithm 2).
The local search procedure is designed to accept the first random movement of the node that improves the fitness value. This stochastic behavior in selecting the neighbor helps the algorithm to avoid one of the known limitations of deterministic hill climbing strategies which is getting stuck in the local optima. The local search procedure operation is effective and we reduced the computational cost by using the method proposed in [50] for computing the change to the modularity ΔQ incurred moving node v i into the community of node v j .
Elitist population updating strategy: As with MA-Net [33], in iMA-Net the population updating strategy is elitist. New solutions generated by the local search procedure are added to the population and the least fit solutions are removed from the population. The elitist strategy enhances the algorithm convergence by retaining the better solutions in the population, making them eligible to be parents in the subsequent generations.
Performance results of iMA-Net on benchmark networks. Before applying the iMA-Net on the kNN graphs produced in Stage 2, we tested the performance of iMA-Net on five realworld networks with similar size to the kNN graphs. These networks include the Zachary's Karate Club network (karate) [63], the Bottlenose Dolphins network (dolphin) [64], the American College Football network (football) [18], the Political Books network (polbooks) [56] and the Jazz Musicians network (jazz) [65]. Basic information on these benchmark networks is shown in Table 3. All of these networks are unweighted and undirected graphs.
The comparison between iMA-Net and MA-Net in Table 4, is made to illustrate the effectiveness of iMA-Net. Moreover, to investigate the performance of iMA-Net, we compared the iMA-Net results with a set of well-known community detection algorithms with different approaches, frameworks and objective functions. The benchmark algorithms include LPAm [66], Meme-Net [67], MOGA-Net [68], MODPSO [69] and MLCD [70].
LPAm presented by Barber and Clark [66] is a label propagation algorithm which reformulated the original label propagation algorithm (LPA) [71] to overcome its drawbacks. LPAm employs a modularity-specific rule for label propagation and updates the community label of nodes repeatedly till no possible improvement can be found. Meme-Net [67] is a memetic algorithm optimizing the modularity density to discover communities in the graph. MOGA-Net [68] and MODPSO [69] are two multi-objective optimization algorithms for community detection. MOGA-Net maximizes the intra-community links and minimizes the inter-community links, while MODPSO maximizes the internal link density and minimizes the external link density. MLCD is a novel memetic algorithm presented by Ma et. al [70]. MLCD has a multilevel learning framework for detecting the community structure by optimizing the modularity with the aid of a special hybrid global-local heuristic search procedure.
In a comprehensive study [70], Ma et. al. ran all of the selected benchmark algorithms 50 times each, with the same key parameters, on several real-world and computer synthesized benchmark networks, and recorded the maximum, average and standard deviation of the modularity values (Q max ,Q avg ,Q std ). In this study, we refer to the reported results in [70] and compare results obtained by 50 independent runs of MA-Net and iMA-Net with the benchmark algorithms. MA-Net and iMA-Net are implemented in Python 2.7 and executed on a PC with Intel 1 Xeon 1 CPU E5-1620 at a clock speed of 3.6 GHz (4 cores and 8 logical processors) and 16 GB of memory. We tuned the parameters of MA-Net and iMA-Net as follows: population size, N p , is set to 40, the minimum mutation probability, P 0 m , is set to 0.05 and the termination condition is set to 30 generations without improvement (N s = 30). The comparison results are shown in Table 4.
Firstly, the comparison results in Table 4 show that in all benchmark networks MLCD, MA-Net and iMA-Net obtained the largest Q max . While both MA-Net and iMA-Net achieved the maximum possible modularity in all networks, iMA-Net has more reliable performance because it improves the Q avg . Meanwhile, comparing Q std values obtained by MA-Net and iMA-Net show that in all networks iMA-Net has smaller deviation, indicating that the improvements in iMA-Net enhance the stability and accuracy of the algorithm. Comparisons between the Q avg values obtained by iMA-Net and the other algorithms show that in all cases iMA-Net achieved higher Q avg and performed better than four representative algorithms, LPAm, Meme-Net, Moga-Net and MODPSO. Only MLCD, which is a novel multi-level learning memetic algorithm has a better performance than iMA-Net in benchmark networks. As the MLCD method is not publicly available and iMA-Net shows outstanding performance in detecting the high quality community structure with great stability, in this study we use the iMA-Net as the community detection algorithm. It is important to note that the clustering methodology is a generic method and any community detection algorithm can be employed in the third stage. Stage 4-Partition Quality Evaluation. To evaluate the quality of the solution obtained by our clustering method, we can compare the solution with the known authorship label of each play. In other words, we expect that a good solution would put the plays written by the same author in one group. Thus, we are considering the authorship as the true label of each play and to evaluate the solution quality we used two well-known clustering similarity measures: Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI).
NMI: A symmetric index computing the similarity between two clustering solutions based on the confusion matrix (also referred to as the contingency matrix). As defined in [72], for two clustering solutions of a given graph A and B, the NMI is defined as follows: where F is the confusion matrix for solutions A and B in which rows and columns correspond to the clusters in A and B, respectively. F ij is the node-overlap ith cluster of A and jth cluster of B. F row i is the sum of the elements in the row i of F and F col j denotes the sum of elements in the , compares two clusterings based on the number of cluster membership agreements and disarrangements between them. It shows the ratio of the number of node pairs similarly classified in both solutions, divided by the total number of pairs. RI is defined as where n is the number of nodes, α refers to the number of pairs of nodes which are in the same cluster in solution A and in the same cluster in solution B and β is the number of pairs of nodes that are classified in different clusters in both solutions. The RI lies between 0 and 1 and it is equal to 1 when two solutions are exactly the same. The ARI was proposed by Hubert and Arabie [74]. To adjust the RI they assumed the generalized hypergeometric distribution as a null model and defined ARI as follows: where, similar to Eq 7, F ij , F row i and F col j are extracted from the confusion matrix F. Like the NMI, a larger ARI shows that two solutions are more similar. The ARI attracted more attention than RI as it is in the range [−1, +1], the wider range of values increasing the sensitivity of the index. To combine the effectiveness of NMI and ARI, we also used their product, NMI × ARI, for evaluation and comparing different clusterings.

Results
We applied iMA-Net to a series of ten kNN graphs (for k = 1 to k = 10) to optimize the modularity value and find the best solution that partitions the set of 168 plays written by 39 authors. The computer used for these experiments has the same configuration as stated in the section 'Performance results of iMA-Net on real-world networks'. We ran iMA-Net 30 times for each graph with the following parameters: population size N p is set to 40, the termination criterion is set to 30 generations without improvement (N s = 30) and the minimum mutation probability is equal to 0.05 (P 0 m ). The best solution found in 30 runs of the algorithm, which has the highest value of modularity, is recorded in Table 5. Quality measures including NMI, ARI and NMI × ARI, representing the similarity between best solutions and the true clustering solution obtained from the authorship labels, are shown in Table 5.
The results in Table 5 demonstrate that by increasing the value of k and adding more connections to the graph, the number of the detected clusters and the solution modularity (Q) is reduced. Because of the dependency of the highest value of Q on the structure of the graph, it cannot be used for comparing solutions from different kNN graphs. But we can refer to quality measures (NMI and ARI) to find in which graphs iMA-Net finds the solution that better matches authorship labels. The results show that the best-matched solution with the highest quality measures, i.e. NMI = 0.742 and ARI = 0.525, is obtained by analyzing the 2-nearest neighbor (2NN) graph and contains 18 clusters, as is shown in  Furthermore, considering authors with limited contributions in this dataset and the way these works grouped with other plays reveals some interesting insights about their literary style. For instance, the single Tourneur play in the dataset is classified in Cluster 5 with 20 plays of Shakespeare. Commentators have seen Tourneur as a follower of Shakespeare [75], and it is likely that in this case the influence of Shakespeare's style explains the placement of the play within the cluster. Another example is the clustering of the play of Marmion and the single play of Brome in the same group (Cluster 12) with 10 plays attributed to Middleton that illustrate the similarity between their word usage.

Reduced Datasets
Referring to Table 1, the complete dataset contains 19 authors with only one attributed play. iMA-Net classified these plays with the most similar plays, maximizing the modularity value. However, this behavior reduces the quality measures (NMI and ARI). To better understand the performance of the proposed methodology, we applied the same stages on reduced datasets resulting from removing some plays from the original dataset. The five reduced datasets generated by removing plays written by authors who have 1, 2, 3, 4 and 5 contributions are as follows: 1. G1: Plays from authors with more than one contribution.
2. G2: Plays from authors with more than two contributions.   Table 6 shows the number of plays and authors that remained in each of the reduced datasets. We applied our four-stage method to the five reduced datasets (G1-G5) and the results are given in Table 7.
From the higher values of NMI and ARI in Table 7, compared to their values in the original dataset, Table 5, we can conclude that by removing authors with fewer contributions from the original dataset, the proposed clustering method found solutions with higher quality with clusters better matched to the authors' labels. More precisely, while the average NMI and average ARI in the original dataset are 0.670 and 0.422, respectively. The average NMI increased to 0.718 and the average value of ARI to 0.538 in the reduced datasets.
On the other hand, the results in Table 7 show that NMI and ARI are different quality measures and they do not always agree on the best solution among the ten solutions found in the kNN graphs. Thus, we consider NMI × ARI to combine the effect of two measure into a single index. Interestingly, NMI×ARI always attains the highest value in the 5-nearest neighbour (5NN) graphs in all of the five reduced datasets. The best solution according to NMI×ARI is found in the G5 dataset and in the 5NN graph where the highest NMI×ARI is equal to 0.693. Table 8 shows the confusion matrix of the true solution and the solution with the highest quality measures for NMI and NMI×ARI and Fig 4 represents this best clustering solution in the 5NN graph. For details of plays in each cluster refer to the S4 Table. Both Table 8 and Fig 4 show that plays written by Middleton (Cluster 2), Fletcher (Cluster 4), Lyly (Cluster 6) and Ford (Cluster 7) are classified in completely separated clusters. All 28 plays of Shakespeare in this dataset are in the biggest group, Cluster 1, together with three plays from different authors which are the comedy 'Blind Beggar of Alexandria' written by Chapman, the pastoral play 'Faithful Shepherdess' by Fletcher and a comedy from Lyly named 'Woman in the Moon'. So far as we are aware no commentators have suggested previously that these three non-Shakespeare plays in Cluster 1 are particularly Shakespearean, but they are sometimes noted as unusual for their authors. In the present analysis 'The Woman in the Moon' is among the five nearest neighbors of 'The Faithful Shepherdess' and 'The Blind Beggar of Alexandria', thus these three plays are themselves connected in G5.
The 'Woman in the Moon' is a departure for Lyly in that he moves to verse after a career writing prose comedies. It is his last-performed play. Critics note other important differences from Lyly's normal practice as well [76]. None of the other seven Lyly plays in this dataset appear among the five plays with the highest similarity score and connected to 'Woman in the Moon'. Moreover, though this play is normally classified as a comedy [77] the closest play according to ffiffiffiffiffiffiffi ffi JSD p measure is a Shakespeare tragedy, 'Romeo and Juliet', and other closest is another tragedy, 'Tis Pity She's a Whore', by John Ford, from a much later phase of drama-the first production of Ford's play was in 1632, compared to 1593 for the Lyly play. In her edition of the play Leah Scragg suggests that the play has more similarities to the rest of the Lyly canon than critics generally acknowledge, and has links to two other Lyly plays, 'Endymion' and 'Mother Bombie', in particular [76]; the results of the present analysis come down on the side of change rather than continuity in this play. In the context of a clustering in which author and genre likenesses come through strongly, and consistently, overall, the cross-author and crossgenre links for this play are worthy of further investigation, though that is beyond the scope of the present paper. 'The Faithful Shepherdess' is a pastoral, quite unlike Fletcher's usual work in comedy and tragicomedy. In his studies of the shares of Fletcher and Beaumont in their collaborative plays Cyrus Hoy excludes 'The Faithful Shepherdess' from Fletcher baselines for that reason [78]. Jonathan Hope notes Fletcher's departure from his usual style in 'The Faithful Shepherdess', which he calls "a consciously literary and deliberately archaic piece" [79]. Thus it is not surprising to find it clustered away from the main Fletcher cluster. Two Fletcher plays appear in the nearest neighbors of 'The Faithful Shepherdess', suggesting that the analysis has uncovered some authorial characteristics in the play, even if they are not sufficient to have it join the Fletcher cluster (Cluster 4).
Something similar may be said about the 'Blind Beggar of Alexandria', which appears with two other Chapman comedies in the list of plays with the highest similarity score. The 'Blind Beggar of Alexandria' is Chapman's first performed play and has often been viewed negatively as overly farcical with an underdeveloped romantic plot [80].
It is most likely that these three plays are clustered with Shakespeare because they have weak links to their authorial canons, and the Shakespeare cluster is large and therefore likely to be diverse, with many options for links.  Table 8. Confusion matrix of the true authorship and the clustering solution obtained by the 5NN graph with NMI × ARI = 0.693. The confusion matrix shows how 106 plays by 7 authors are distributed into 7 clusters. As expected, a good separation occurred in clusters 2, 4, 6 and 7, which are formed by plays of one specific author. Two other clusters (Cluster 3 and Cluster 5) are formed by mixing all 17 of Jonson's plays and 12 plays written by Chapman. A deeper look at the plays in these two clusters shows that the genre of the plays has a great role in putting similar works in the same cluster. The larger cluster (Cluster 3) contains 21 comedies of both Jonson and Chapman together with the only pastoral play written by Jonson named ('Sad Shepherd'). However, Cluster 5 contains all of six tragedy plays written by these two authors and a historical play of Chapman which is called 'Caesar and Pompey'. Both Jonson and Chapman would be recognized as having drastically different styles in their own comedies compared to their own tragedies, so this would fit with general critical opinion.
Taking all the clusters found by our methodology into account, authorship is generally very strong on these measures, but a few authors have either an aberrant play or a consistent internal division by genre. Chapman has both of these attributes. The clusters identified by the proposed unsupervised method show considerable associations with the authors' writing styles. Nevertheless and in order to validate the effectiveness of our method, we applied two available clustering methods on the dataset and compared the results.

Performance comparison with other clustering methods
The proposed method provides solutions for the clustering of plays in the dataset based on the similarity of word frequencies in each play. To compare our method with other clustering methods, we applied six well-known distance-based clustering techniques on the Complete dataset and the five reduced datasets (G1-G5). The selected clustering methods for comparisons are: a well-known implementation of the k-means method named k-means++ [81], one unsupervised graph-based clustering algorithm named MST-kNN [32], and four popular hierarchical clustering methods: 1)Complete-linkage, 2)Average-linkage, 3)Single-linkage and 4) Ward's algorithm. k-means is by far the most popular clustering method used in many scientific and industrial applications [82,83]. k-means++ developed by Arthur [81], is a novel k-means algorithm with an enhanced seeding technique for outperforming the standard k-means. Due to the popularity of k-means methods, k-means++ is selected for comparison. k-means++ is implemented in the Scikit-learn(0.16.1) clustering package [84] in Python and is publicly available. It is worth noting that k-means++, according to the original method, uses the Euclidean distance and identifies clusters by minimizing the average square distance between elements in the same clusters [81]. This method is a supervised method, in the sense that the number of clusters (K) is known in advance; therefore, we run the algorithm for K = 2 to K = 20 and report the best solution based on the quality measure NMI×ARI.
MST-kNN is an unsupervised graph clustering technique proposed by Inostroza-Ponta et. al [32]. This method works by partitioning the minimum spanning tree using the k-nearest neighbour graph with an adaptive determination of the number of clusters. The MST-kNN is chosen for comparison because for two main reasons: 1) it can use the ffiffiffiffiffiffiffi ffi JSD p matrix and compute the clusters based on this distance (dissimilarity) matrix, 2) MST-kNN also uses a similar approach of using graph structure for data clustering. This method has been employed in several studies [85][86][87] and it has shown remarkable results in data analysis. While MST-kNN is unsupervised, it is applied directly on ffiffiffiffiffiffiffi ffi JSD p matrices and the obtained clusters is evaluated regarding the true partitions.
Hierarchical clustering methods seek the hierarchical structure of the graph by recursively partitioning nodes into clusters either in top-down fashion by dividing clusters into smaller clusters (divisive hierarchical clustering) or bottom-up fashion by merging clusters into larger cluster (agglomerative hierarchical clustering). Dissimilarity measures and linkage criteria are used in order to select which cluster should be merged or divided. The linkage criterion refers to the manner in which the distance between two clusters is calculated. The most well-known linkage criteria are complete-linkage, average-linkage and single-linkage [88]. The distance between two cluster according to the complete-linkage criterion is equal to the longest distance between any pair of nodes from the two clusters [89], but based on the single-linkage criteria it is equal to the shortest distance between any pair of nodes from the two clusters [90] and in average-linkage methods the distance between two cluster is defined as the average distance of any member of one cluster to any member of the other cluster [91].
In this study, we use the stats (3.3.0) package (hclust function) provided in R. The ffiffiffiffiffiffiffi ffi JSD p is employed as the dissimilarity measure and for the linkage criterion four methods have been used: Complete-linkage, Average-linkage, Single-linkage and Ward's method [92,93]. Ward's method is a modified average-linkage method that works based on the squared dissimilarity. Hierarchical methods result in dendrograms, representing the nested grouping of nodes. The clustering solution of the data elements is obtained by cutting the dendrogram at the desired level. Therefore in this study to compare the clustering result of hierarchical methods with the true partitions and other methods, the obtained dendrogram was cut to get the desired number of clusters. To give the most possible chance to the dendrogram to provide the best solution the number of clusters was varied from K = 2 to K = 20. Table 9 shows the best and worst result of the proposed clustering methods together with the best results obtained by k-means++, MST-kNN and four hierarchical clustering methods (i.e. complete-linkage, average-linkage, single-linkage and Ward's method). The quality of the clustering results are evaluated by NMI, ARI and NMI×ARI which are computed by comparing the clusters with the true authorship label of the plays in each dataset. The Best and the Worst results are repetitions from Table 5 (for Complete dataset) and Table 7 (for reduced datasets). Methods are ranked in each dataset based on NMI × ARI that shows how successfully the method obtained the true partition.
Firstly, the best results of the proposed method (Best), in the three of six datasets, i.e. G2, G3 and G5 datasets, are ranked first and outperform the other methods. In these three datasets the results of Ward's method are in the second rank. In the other three datasets, i.e. Complete, G2 and G4 datasets, Ward's method is ranked first and Best is in the second place.
Secondly, comparison between the benchmark methods shows that MST-kNN has an obvious tendency to merge small clusters. Hence, it results in a solution with fewer clusters. On the other hand, the three hierarchical clustering methods, i.e. Complete-linkage, Average-linkage and Single-linkage, result in solutions with more clusters. However, among the three traditional hierarchical clustering methods the Complete-linkage method obtained better results than Average-linkage and Single-linkage in all datasets.
Finally, comparison between the worst results of the proposed method (Worst) with other benchmark methods indicates that in four of the datasets, i.e. Complete, G1, G2, G3 and G5, Worst is ranked 3 after Best and Ward's. Another striking observation about the Worst results is that in all datasets, it is better than MST-kNN, Complete-linkage, Average-linkage and Single-linkage. Furthermore, comparison between the performance of the popular method k-means++ with Worst illustrates that in four datasets, Worst is ranked better than k-means++, and only in G3 and G4 does k-means++ show slightly better performance than Worst.

Significance and Contribution
In this paper, we proposed a new four-stage methodology for data clustering. The methodology is unsupervised and it is able to identify groups of data elements which have a meaningful similarity. Though several methods have been proposed for two problems of data clustering and community detection, in this study, we introduced a new general methodology to convert the data clustering problem to the community detection problem by borrowing fundamental concepts of information theory and graph theory. This methodology aims to identify novel clusters from data. To evaluate the performance of our methodology, we have conducted several experiments on an interesting corpus dataset generated by counting word frequencies in 168 Shakespearean plays. Clusters obtained from datasets (Complete and reduced datasets) revealed stimulating findings about author's literary styles.
Moreover, the memetic algorithm (iMA-Net) which is proposed optimizing the modularity value shows efficient performance in discovering solutions in benchmark networks. Also, in the kNN graphs generated for the Complete dataset and reduced datasets, the iMA-Net obtained good clustering results that are highly matched with the true partitions of the datasets. Furthermore, comparison with six well-known clustering techniques (Table 9) illustrates the great ability of the developed methodology in detecting superior clustering solutions in the studied datasets. Finally, the proposed methodology is a generic method and in each of the four stages a variety of Table 9. The best clustering solutions obtained by benchmark methods (k-means++, MST-kNN, Complete-linkage, Average-linkage, Single-linkage and Ward's method) together with the best (Best) and the worst (Worst) clustering result obtained by the proposed method in Complete dataset and five reduced datasets (G1-G5). The Rank column is based on the value of NMI × ARI. other methods can be employed. For instance, in the third stage of identifying the community structure of the kNN graphs, variety of community detection methods might be applicable.

Limitations and future research
Although the proposed methodology obtains remarkable results in studied datasets, there are a few limitations, which are possible areas for future work. Firstly, we utilized the square root of Jensen-Shannon divergence ( ffiffiffiffiffiffiffi ffi JSD p ) to measure the dissimilarity between elements and then convert the dissimilarity (distance) in kNN graphs to the similarity by subtracting it from one; it performs competently in the studied datasets, but more studies on different datasets and comparison with other measures are required to prove that ffiffiffiffiffiffiffi ffi JSD p is the most suitable measure for constructing the kNN graphs.
Secondly, there is a limitation regarding the number of kNN graphs that should be analyzed. In the second stage of the proposed methodology, the kNN graphs were generated from the dissimilarity matrix and we set the value of k to be in the range of 1 to 10. As mentioned before, comparison between the best found clustering in the five reduced datasets (Table 7) showed that solutions with the highest value of NMI×ARI were always obtained from the 5NN graph. However more investigations are required to justify that analyzing 5NN graphs has an advantage over the other kNN graphs. One of the future research directions would be finding a proper procedure for tuning the value of k in different datasets.
Note that the proposed clustering methodology is a four-stage procedure and for each stage different tools are applicable. This has its advantages and disadvantages. While the main advantage is the flexibility of the method to employ different tools, the disadvantage is its complexity in combining many different tools. Future work will aim to develop a single software tool for the proposed clustering approach that can apply all four stages in an optimized and scalable way.
Another future research direction is to employ multi-objective community detection algorithms that optimize not only modularity value but also other quality functions. As an example, we can refer to the multi-objective community detection method proposed by Shi et al. [94]. Multi-objective algorithms return a set of non-dominated solutions and give the opportunity for the user to select the most appropriate solution from a few. Furthermore, applying overlapping community detection methods in the third stage is another important future research direction, which will enhance the ability of the proposed methodology to identify overlapping clusters; the latter has several real-world applications.
To conclude, we proposed a novel four-stage methodology for clustering based on information theory and modularity optimization. To demonstrate the effectiveness of this methodology, we conducted experiments on clustering a large text corpus dataset. We discovered remarkable results which lead to the authors' literary styles in different genres. As discussed, we envision that the proposed efficient methodology might be adopted in various fields for the purpose of investigating the clustering structure of datasets.
Supporting Information S1   Table. List of plays located in each cluster of the best clustering solution of G5 which is demonstrated in Fig 4. (XLSX)