Skip to main content
Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

A Novel Clustering Methodology Based on Modularity Optimisation for Detecting Authorship Affinities in Shakespearean Era Plays

  • Leila M. Naeni ,

    Leila.MoslemiNaeni@uon.edu.au

    Affiliations The Priority Research Centre of Bioinformatics and Information-Based Medicine, The University of Newcastle, Newcastle, New South Wales, Australia, School of Electrical Engineering and Computer Science, Faculty of Engineering and Built Environment, The University of Newcastle, Newcastle, New South Wales, Australia, School of Built Environment, Faculty of Design, Architecture and Building, University of Technology Sydney, Sydney, Australia

  • Hugh Craig,

    Affiliation Centre for Literary and Linguistic Computing, School of Humanities and Social Science, The University of Newcastle, Newcastle, New South Wales, Australia

  • Regina Berretta,

    Affiliations The Priority Research Centre of Bioinformatics and Information-Based Medicine, The University of Newcastle, Newcastle, New South Wales, Australia, School of Electrical Engineering and Computer Science, Faculty of Engineering and Built Environment, The University of Newcastle, Newcastle, New South Wales, Australia

  • Pablo Moscato

    Affiliations The Priority Research Centre of Bioinformatics and Information-Based Medicine, The University of Newcastle, Newcastle, New South Wales, Australia, School of Electrical Engineering and Computer Science, Faculty of Engineering and Built Environment, The University of Newcastle, Newcastle, New South Wales, Australia

Abstract

In this study we propose a novel, unsupervised clustering methodology for analyzing large datasets. This new, efficient methodology converts the general clustering problem into the community detection problem in graph by using the Jensen-Shannon distance, a dissimilarity measure originating in Information Theory. Moreover, we use graph theoretic concepts for the generation and analysis of proximity graphs. Our methodology is based on a newly proposed memetic algorithm (iMA-Net) for discovering clusters of data elements by maximizing the modularity function in proximity graphs of literary works. To test the effectiveness of this general methodology, we apply it to a text corpus dataset, which contains frequencies of approximately 55,114 unique words across all 168 written in the Shakespearean era (16th and 17th centuries), to analyze and detect clusters of similar plays. Experimental results and comparison with state-of-the-art clustering methods demonstrate the remarkable performance of our new method for identifying high quality clusters which reflect the commonalities in the literary style of the plays.

Introduction

Following great advancements in technology, all scientific fields have been faced with huge empirical datasets. Identifying groups of objects, patterns or elements with “similar characteristics” has always drawn the attractions of researchers. Data clustering is one of the most widely studied problems in data mining and machine learning with a wide variety of applications in many fields ranging from the analysis of genomic data in biology [1, 2] to classifying customers for efficient marketing strategies [3, 4]. Clusters are natural groups in data such that elements within the same cluster are more similar to each other than to elements belonging to other clusters [5]. Thousands of techniques have been proposed to address a wide variety of clustering problems in different disciplines (see the review of data mining and clustering techniques by Han and Kamber [6]). The main clustering methods include probabilistic and generative models [7, 8], distance-based clustering [9], density and grid-based clustering [10, 11], spectral clustering [12, 13] and graph clustering [14].

Graph clustering is the process of grouping vertices of a graph into clusters according to the structure of the graph. Generally speaking, a good graph clustering contains clusters with many edges within the clusters and few between clusters. However this definition is unclear, thus many quality functions have been proposed to evaluate graph clusters such as ratio cut [15], normalized cut [16] and modularity [17]. In the recent literature a cluster of vertices in a graph is mostly called a community [18] and graph clustering discussed as the community detection problem [19]. The graph clustering problem is a fundamental algorithmic problem, however it has not been solved satisfactory and it is a notably challenging member of the NP-complete class [20, 21].

This study proposes a new data clustering methodology for the authorship analysis in a dataset generated from 168 plays from the Shakespearean era. Authorship study is a long-standing research activity, predating the availability of computers by many centuries. Interest in authorship follows from the magnitude of the consequences of assigning a work to one author or another, e.g. in separating canonical and apocryphal books in a religious tradition, establishing the authenticity of legal documents, or determining which works or parts of works are by celebrated literary authors. With the availability of digital text and computational tools, this activity is now predominantly quantitative. The Shakespearean canon has been a particular focus, with studies testing the claims of various possible authors other than William Shakespeare of Stratford to the works we know as Shakespeare’s [22], and debating possible additions to Shakespeare’s canon [23, 24]. The advent of quantitative methods has also renewed interest in the idea that authorship is objectively a property of texts, rather than largely the invention of readers and institutions, as was argued by poststructuralist theorists [2528].

Here, we propose a new methodology for identifying groups in the large text corpus dataset. The methodology is easily generalizable and can be employed for analyzing any other dataset. This advanced clustering methodology brings together Jensen-Shannon divergence(JSD) as an information theoretic dissimilarity measure and combines it with a newly proposed memetic algorithm (iMA-Net) for graph clustering to yield a powerful strategy for segmenting the text corpus dataset. The proposed memetic algorithm maximizes the modularity quality function and it is an unsupervised algorithm, i.e. the method does not require any information about the data elements, such as authorship and genre of plays, it only uses the structure of the dataset to uncover clusters of similar plays. The outcomes are promising and show great authorial affinities in each cluster; a computational proof of the highly individual literary style of each author.

Materials and Methods

Dataset

In this work, we utilized a text corpus dataset containing 168 plays from the Shakespearean era (16th and 17th centuries) with the unambiguous contribution of authorship of 39 authors (see Table 1). This dataset was generated by Craig and Whipp [29] using a software application called Intelligent Archive (IA) to count variant spellings of approximately 55,114 unique words across all 168 plays. The software tool IA is used to extract the word-play matrix from a collection of suitably edited texts, so that only the words of dialog are counted and other materials, such as prefaces, stage directions and speaker tags are omitted. In the texts, contractions such as “they’re” are expanded so this counts as one instance of they and one of are. The IA has the capacity to combine variant spelling forms so that multiple spellings of the same word are combined into a single count [29]. As a result, this dataset contains about 9.3 million word usage statistics (word frequencies) stored in the form of a (55,114 × 168) matrix. The word-play matrix is available in the supporting information of [30].

thumbnail
Table 1. Authors and their contributions in the 168 Shakespearean era plays dataset.

https://doi.org/10.1371/journal.pone.0157988.t001

This dataset was analyzed by Marsden et al. [30] and they found measurably distinct literary styles by detecting the tendency towards overuse or avoidance of particular words for four key authors who account for the largest number of plays in the dataset. That study resulted in a set of marker words that can distinguish between plays written by Fletcher, Jonson, Middleton and Shakespeare compared to the remainder of the authors. Recently, Arefin et al. [31] used this dataset to test the performance of their new graph clustering method incorporating paraclique identification and a clustering method known as MST-kNN [32].

Method

As mentioned earlier, to detect clusters of plays that are more similar in the given dataset, we propose an unsupervised graph-based clustering methodology that employs a new memetic algorithm which optimizes the modularity value in k-Nearest Neighbour (kNN) graphs derived from the dataset. Fig 1 outlines our methodology as a four-stage process. This clustering methodology is general; it can be used to detect clusters in a dataset of n elements where each element has m quantified features (the input dataset can be represented in a form of a (m × n) matrix). In the current study, n equals to 168 and refers to the number of plays and m equals 55,114 and refers to the words used in the plays. Details of each stage are described in the following section. The primary idea of this partitioning approach was proposed in [33] as a new method to analyse a complete network derived from field survey aimed at detecting consumer communities of trust and confidence in the Australian not-for-profit and charity sector. In this paper, we extended the primary idea by using the square root of Jensen-Shannon divergence metric () as the dissimilarity metric, enhancing the modularity optimization algorithm and finding clusters with higher quality.

thumbnail
Fig 1. Flowchart of stages in the clustering methodology proposed in this study.

https://doi.org/10.1371/journal.pone.0157988.g001

Stage 1- Compute the Dissimilarity Matrix.

We start our method by computing the dissimilarity between all pairs of plays. We use the square root of Jensen-Shannon divergence (). JSD is a metric measuring the dissimilarity between two probability distributions based on Jensen’s inequality [34] and the Shannon entropy [35]. We refer to Lin [36] for a discussion of the JSD. The JSD has been used in many studies, especially in bioinformatics [3739]. The is selected being the dissimilarity metric because it has the distance metric properties and satisfies the triangle inequality condition (this property is proved in [40, 41]). The JSD reflects the dissimilarity of two plays based on the frequency of different words appeared in each play. Generally, for two probability distributions, P and Q, JSD is computed as follows, (1) where, H(X) is Shannon information entropy for distribution X and is computed by Eq 2. (2)

Here, xi is the probability of occurrence of word i in the play X. The JSD is a non-negative and symmetric metric, therefore, JSD(P, Q) = JSD(Q, P). For two identical probability distributions (i.e. P = Q), JSD(P, Q) = 0, i.e. there is no dissimilarity between the two. Moreover, by using the log2 in Eq 2 the JSD is bounded by 1, thus .

By computing the for all pairs of plays, we obtain a symmetric 168 × 168 matrix representing the dissimilarity between all plays (see S1 Table). In this study, the matrix is computed using the cibm.utils package developed in R. A graph is associated with the dissimilarity matrix, where each node represents a play and edges connect nodes where the weight of an edge is the dissimilarity between the plays. In the next stage, we will introduce a new method to analyze this graph.

Stage 2- Generate k-Nearest Neighbour (kNN) Graphs.

In this stage we generate incomplete graphs by removing edges that are less important from the complete graph and preserving edges that represent stronger connections between nodes. For this purpose, we set up a series of kNN graphs starting with k = 1 to k = 10 (in total 10 graphs). To generate the kNN graphs from the dissimilarity matrix, a pair of nodes (a, b) is connected if either a is one of the k-nearest neighbors of b or if b is among the k nearest neighbors of a. The kNN graphs of this study are generated by the scalable software tool proposed by Arefin et al. [42].

Investigating 10 different kNN graphs gives us the opportunity to study the performance of our method under different cases. By increasing the value of k, the average degree of nodes is increased and the graph becomes denser. Therefore, identifying meaningful clusters in kNN graphs with a larger k is more computationally challenging in practice, as graph clustering methods identify clusters in the graph based on the graph’s edge structure, and more edges simply means more possible solutions that the method has to analyze. Table 2 shows the number of edges and the average degree for each of the kNN graphs.

thumbnail
Table 2. Basic information of kNN graphs constructed from the complete dataset.

All graphs have 168 nodes that are representing the 168 plays in the dataset.

https://doi.org/10.1371/journal.pone.0157988.t002

We then apply a clustering approach to the kNN graphs. To detect the clusters, we developed a memetic algorithm that optimizes modularity value as the evaluation criterion. Due to the fact that the memetic algorithm uses the edge weights to cluster nodes with strong connections, we have to modify the edge weights in kNN graphs in order to represent the ‘similarity’ between nodes so that the clusters will contain similar nodes. For this purpose, similarity is defined as one minus dissimilarity.

Stage 3- Modularity optimization algorithm.

Modularity optimization is one of the most popular methods for detecting graph structure and hidden groups of nodes with many intra-connections and comparatively few inter-connections. As explained in the introduction, such a group of nodes in the graph is called a community or cluster. It is believed that nodes which belong to the same community (cluster) are more likely to play a similar role or share common functionality [18, 19]. Detecting the community structure of a given graph, or community detection, is a newer nomenclature for the graph clustering problem and both problems are similarly looking for groups of “related” nodes to label as a cluster or community [14]. It is worth emphasizing that detecting the community structure is practicable only if graphs are sparse [19]. If the number of edges is close to the maximal number of edges, then the graph is a dense graph, the nodes become too homogeneous and the community structure does not make sense turning the problem into data clustering, which has its own concepts and methods that are more suitable for high density graphs [5, 43].

As a variety of complex systems in biology, sociology, physics and computer science can be represented as graphs, community detection is one of the most attractive problems in many disciplines [44, 45]. Although different approaches and methods have been proposed to solve the community detection problem, it is still a difficult problem and has not been solved satisfactorily [19]. Modularity, which was first introduced by Newman and Girvan [17], has the ability to evaluate the quality of communities, therefore it has become the key objective function in most of the recent algorithms.

The modularity function is formulated by computing the difference between the number of the edges within a community (i.e. sub-graph) and the number of edges expected to appear in the same sub-graph in the ‘null-model’. The null-model is a model of the graph which is used for comparison, to verify whether the graph has community structure. The most popular null-model is the randomised version of the original graph proposed by Newman and Girvan, where the same number of nodes are randomly assigned, under the constraint that the expected degree of each node matches the degree of the node in the original graph [19, 46]. The random graph is not expected to have a community structure, therefore the graph has a community structure if it is significantly different from the random graph. In other words, higher modularity shows greater difference between the edge density in a sub-graph and expected value in the random graph, revealing community structure. Given an edge-weighted graph G = (V, E, W) the modularity Q of a clustering solution (i.e. community structure) I is defined as follows: (3)

The summation runs over all the communities c contained in the solution I, and M refers to the total weight of G, i.e. (4) where wij denotes the weight of the edge connecting node i and node j. In Eq 3, lc stands for the total weights of edges connecting two nodes in community c and dc is the sum of the weighted degrees of the nodes in community c, i.e. (5) (6)

A solution with a higher value of modularity is believed to indicate a better community structure or clustering solution within the given graph. Therefore, the maximization of modularity is by far the most popular approach for identifying the community structure of a graph [19]. Since Brandes et al. [47] proved that modularity maximization leads to a NP-complete problem, finding the community structure with the maximum modularity has become a great challenge for data scientists. Therefore, many heuristic algorithms have been developed to find near optimal and good solutions. Some of the most popular methods employed for modularity maximization are heuristics based on hierarchical or aggregation clustering methods [17, 4850], genetic algorithms [51, 52], memetic algorithms [5355], spectral methods [56, 57] and simulated annealing [58].

In the present study, we propose a new memetic algorithm specially designed for solving the community detection problems named “iMA-Net” that is an improved version of the MA-Net algorithm previously proposed in [59]. There are three main differences between MA-Net and iMA-Net. 1) MA-Net considers all graphs as unweighted graphs while iMA-Net detects communities in weighted graphs. 2) iMA-Net employs an enhanced local search algorithm to make the algorithm more stable. 3) While the population initialization procedure in both MA-Net and iMA-Net applies local search to each individual to improve its fitness, differing local searches make the initialization procedure performance different in MA-Net and iMA-Net. The framework and steps of iMA-Net are explained in detail in the following sections. In the last section of stage 3, the experimental results of the proposed algorithm (iMA-Net) in five real-world benchmark networks are given. To illustrate the performance reliability of iMA-Net, we compare results of iMA-Net with MA-Net [33] and those from five representative community detection algorithms.

iMA-Net: A Memetic algorithm for Modularity optimization.

Due to the proven success of memetic algorithms and their effectiveness in dealing with many NP-hard combinatorial optimization problems [60], we chose the memetic framework for our modularity optimization algorithm, named iMA-Net. iMA-Net is a population-based algorithm in which each solution in the population represents a particular clustering of the given weighted and undirected graph. In the proposed algorithm, solutions are evolved using problem-specific genetic operators and a local search procedure employed to detect high-quality community structure with the optimal or near optimal modularity value.

We used the string-coding solution representation in which a clustering solution of G is represented by a list of n integer numbers [C1, C2, …, Cn], where n is equal to the number of nodes in G and Ci refers to the community label of node vi. Therefore, if Ci = Cj then i and j are located in the same community. This way of representing clustering solutions has been used in previous algorithms, for instance [51, 53, 54]. An illustration example of string-coding representation is given in Fig 2.

thumbnail
Fig 2. An illustration of the string-coding representation for a clustering solution.

Left: a simple toy graph with two clusters shown using different colors, Right: an integer list encoding the solution, where the red and blue color clusters are labeled as 1 and 2, respectively.

https://doi.org/10.1371/journal.pone.0157988.g002

Algorithm 1 presents the overall framework of iMA-Net for finding the partition of G that achieves the maximum modularity (computed following Eq 3). Details of each step of the algorithm are explained in the following paragraphs.

Algorithm 1. iMA-Net Framework

Input: Graph G = (V, E, W)

Parameters: Np: Population size, : Minimum mutation probability, Ns: Acceptable number of generations without improvement.

Output: I* a partition of G with fitness value Q*

pop = {I1, I2, …, } ← Initialization(G,Np);

repeat

Inew ← Modularity Based Recombination(pop);

Imut ← Adaptive Mutation(Inew, , Ns);

Iimp ← Local Search(Imut);

pop ← Update(pop, Iimp);

until Termination Condition(Ns);

return I * = argmaxIpop{Q(I)};

Initialization: The initial population consists of Np feasible solutions, so the following process runs Np times to build the initial population. Firstly, each of the n nodes in G is given a random community label that can be any integer number in the range [1,n]. Then to improve the quality of the solutions in the initial population, the local search procedure, as explained later in this section, is applied on each solution and made changes on community labels of nodes to find a neighbor solution with higher modularity. As the community label is selected randomly from a wide range of [1,n], there is not any constraint on the number of the different labels or the number of clusters in a solution. Applying the local search procedure on each solution, improves the quality of the initial population and speeds up the convergence of the algorithm.

Modularity-based recombination operator: Traditional operators including uniform, one-point and two-point recombination operators are not suitable for the string based representation as they do not convey characteristics of parent solutions to the offspring. iMA-Net uses a problem-specific recombination operator for generating new solutions that inherit the best characteristics from the parent solutions which is named the modularity-based recombination operator. This recombination operator was first proposed for MA-Net [59]. This recombination method effectively transfers the best characteristics of the parents to the descendants and it works as follows.

First, two solutions Ia and Ib are selected uniformly at random from the population as parents. The initialization procedure allows us to safely assume that all population members have relatively good quality and the random selection strategy helps maintain the population diversity. Next, a priority list PL is generated by sorting all communities in both parents based on their fitness values, as computed by Eq 3. Then, the best community with the highest priority from PL is selected and the same community is formed in the offspring. Then, for the next community in the priority list, form the most similar community in the offspring with the nodes that are not assigned to any community yet. This procedure runs until all the nodes in the offspring solution have been assigned to a community.

The offspring generated by the modularity-based recombination operator inherits the best community structure from their parents [33]. One of the limitations of modularity optimization algorithms is their difficulty in detecting small size communities and their tendency to merge small communities and form large communities [61]. The modularity-based recombination operator that we proposed has the advantage of generating offspring with a larger number of communities than its parents. This advantage helps the algorithm to explore the search space more deeply and not avoid solutions with small communities.

Adaptive mutation operator: This operator changes the community’s label of nodes with the probability of Pm. This means that in each run of the adaptive mutation operator ⌈Pm * n⌉ nodes are randomly moved to one of their neighbors’ communities. Thus, this mutation operator is neighbor-based and considers only effective changes. The mutation probability (Pm) is adaptive and it linearly grows from to , as the algorithm approaches the termination condition. As shown in Algorithm 1, the stopping condition of iMA-Net is Ns generations without any improvement in the best solution found so far. By increasing the mutation probability (Pm), more changes will occur in the solution, thus diversity of the population will grow and the search area will expand. For instance, when the algorithm’s parameters are set to be and Ns = 30 and the algorithm runs 15 generations without improvement in the best solution found, the mutation probability will grow to and it will be Pm = 0.075.

Local search: The local search procedure works on every new solution developed by the adaptive mutation operator, and it aims to improve the fitness of the solution before adding it to the population. iMA-Net uses the Vertex Movement (VM) heuristic [62] together with a stochastic hill climbing strategy to search the neighborhood to find a better solution. The local search procedure is shown in 2. Given a graph G with n nodes, the local search procedure works as follows. Firstly, L, a random sequence of the nodes is generated. Then, for each node vi chosen according to L, vi’s label is changed to that of a randomly selected neighbor, if the change improves the fitness of the solution. If the movement of vi from its community to none of its neighbors’ community increases the fitness (modularity), we leave vi in its place and check the next node in L. The above process runs until there is no change for any of the nodes in L that improves the fitness of the solution (see Algorithm 2).

Algorithm 2. Local search procedure

Input: Graph G with n nodes; A clustering solution I = [C1, …, Cn].

Output: a solution with better fitness than I.

I′ ← I;

repeat

t = 0 (t shows the number of nodes that cannot move);

L ← a random sequence of n nodes;

for each node vi in L do

  N ← a list of vi’s neighbours located in other communities;

  ΔQ = 0;

  while ΔQ ≤ 0 do

   vj ← a neighbour node selected from N;

   ΔQ ← fitness change due to the movement of vi to the community of vj;

  end

  if ΔQ > 0 then

    (vi is moved to the community of vj);

  else

   t = t + 1 (when vi is not moved);

  end

end

until t = n;

return I′;

The local search procedure is designed to accept the first random movement of the node that improves the fitness value. This stochastic behavior in selecting the neighbor helps the algorithm to avoid one of the known limitations of deterministic hill climbing strategies which is getting stuck in the local optima. The local search procedure operation is effective and we reduced the computational cost by using the method proposed in [50] for computing the change to the modularity ΔQ incurred moving node vi into the community of node vj.

Elitist population updating strategy: As with MA-Net [33], in iMA-Net the population updating strategy is elitist. New solutions generated by the local search procedure are added to the population and the least fit solutions are removed from the population. The elitist strategy enhances the algorithm convergence by retaining the better solutions in the population, making them eligible to be parents in the subsequent generations.

Performance results of iMA-Net on benchmark networks.

Before applying the iMA-Net on the kNN graphs produced in Stage 2, we tested the performance of iMA-Net on five real-world networks with similar size to the kNN graphs. These networks include the Zachary’s Karate Club network (karate) [63], the Bottlenose Dolphins network (dolphin) [64], the American College Football network (football) [18], the Political Books network (polbooks) [56] and the Jazz Musicians network (jazz) [65]. Basic information on these benchmark networks is shown in Table 3. All of these networks are unweighted and undirected graphs.

The comparison between iMA-Net and MA-Net in Table 4, is made to illustrate the effectiveness of iMA-Net. Moreover, to investigate the performance of iMA-Net, we compared the iMA-Net results with a set of well-known community detection algorithms with different approaches, frameworks and objective functions. The benchmark algorithms include LPAm [66], Meme-Net [67], MOGA-Net [68], MODPSO [69] and MLCD [70].

thumbnail
Table 4. Experimental results on five real-world benchmark networks.

The maximum, average and standard deviation of modularity values (Qmax,Qavg,Qstd) obtained by LPAm, Meme-Net, Moga-Net, MODPSO, MLCD, MA-Net and iMA-Net.

https://doi.org/10.1371/journal.pone.0157988.t004

LPAm presented by Barber and Clark [66] is a label propagation algorithm which reformulated the original label propagation algorithm (LPA) [71] to overcome its drawbacks. LPAm employs a modularity-specific rule for label propagation and updates the community label of nodes repeatedly till no possible improvement can be found. Meme-Net [67] is a memetic algorithm optimizing the modularity density to discover communities in the graph. MOGA-Net [68] and MODPSO [69] are two multi-objective optimization algorithms for community detection. MOGA-Net maximizes the intra-community links and minimizes the inter-community links, while MODPSO maximizes the internal link density and minimizes the external link density. MLCD is a novel memetic algorithm presented by Ma et. al [70]. MLCD has a multi-level learning framework for detecting the community structure by optimizing the modularity with the aid of a special hybrid global-local heuristic search procedure.

In a comprehensive study [70], Ma et. al. ran all of the selected benchmark algorithms 50 times each, with the same key parameters, on several real-world and computer synthesized benchmark networks, and recorded the maximum, average and standard deviation of the modularity values (Qmax,Qavg,Qstd). In this study, we refer to the reported results in [70] and compare results obtained by 50 independent runs of MA-Net and iMA-Net with the benchmark algorithms. MA-Net and iMA-Net are implemented in Python 2.7 and executed on a PC with Intel ® Xeon ® CPU E5-1620 at a clock speed of 3.6 GHz (4 cores and 8 logical processors) and 16 GB of memory. We tuned the parameters of MA-Net and iMA-Net as follows: population size, Np, is set to 40, the minimum mutation probability, , is set to 0.05 and the termination condition is set to 30 generations without improvement (Ns = 30). The comparison results are shown in Table 4.

Firstly, the comparison results in Table 4 show that in all benchmark networks MLCD, MA-Net and iMA-Net obtained the largest Qmax. While both MA-Net and iMA-Net achieved the maximum possible modularity in all networks, iMA-Net has more reliable performance because it improves the Qavg. Meanwhile, comparing Qstd values obtained by MA-Net and iMA-Net show that in all networks iMA-Net has smaller deviation, indicating that the improvements in iMA-Net enhance the stability and accuracy of the algorithm. Comparisons between the Qavg values obtained by iMA-Net and the other algorithms show that in all cases iMA-Net achieved higher Qavg and performed better than four representative algorithms, LPAm, Meme-Net, Moga-Net and MODPSO. Only MLCD, which is a novel multi-level learning memetic algorithm has a better performance than iMA-Net in benchmark networks. As the MLCD method is not publicly available and iMA-Net shows outstanding performance in detecting the high quality community structure with great stability, in this study we use the iMA-Net as the community detection algorithm. It is important to note that the clustering methodology is a generic method and any community detection algorithm can be employed in the third stage.

Stage 4- Partition Quality Evaluation.

To evaluate the quality of the solution obtained by our clustering method, we can compare the solution with the known authorship label of each play. In other words, we expect that a good solution would put the plays written by the same author in one group. Thus, we are considering the authorship as the true label of each play and to evaluate the solution quality we used two well-known clustering similarity measures: Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI).

NMI: A symmetric index computing the similarity between two clustering solutions based on the confusion matrix (also referred to as the contingency matrix). As defined in [72], for two clustering solutions of a given graph A and B, the NMI is defined as follows: (7) where F is the confusion matrix for solutions A and B in which rows and columns correspond to the clusters in A and B, respectively. Fij is the node-overlap ith cluster of A and jth cluster of B. is the sum of the elements in the row i of F and denotes the sum of elements in the column j of F. cA and cB are the number of clusters in A and B, respectively. The NMI measures the similarity of two solutions, therefore if A = B, then NMI(A,B) = 1 and if there is no similarity between A and B then NMI(A,B) = 0.

ARI: The adjusted version of the Rand Index(RI), first introduced in [73], compares two clusterings based on the number of cluster membership agreements and disarrangements between them. It shows the ratio of the number of node pairs similarly classified in both solutions, divided by the total number of pairs. RI is defined as (8) where n is the number of nodes, α refers to the number of pairs of nodes which are in the same cluster in solution A and in the same cluster in solution B and β is the number of pairs of nodes that are classified in different clusters in both solutions. The RI lies between 0 and 1 and it is equal to 1 when two solutions are exactly the same. The ARI was proposed by Hubert and Arabie [74]. To adjust the RI they assumed the generalized hypergeometric distribution as a null model and defined ARI as follows: (9) where, similar to Eq 7, Fij, and are extracted from the confusion matrix F. Like the NMI, a larger ARI shows that two solutions are more similar. The ARI attracted more attention than RI as it is in the range [−1, +1], the wider range of values increasing the sensitivity of the index. To combine the effectiveness of NMI and ARI, we also used their product, NMI × ARI, for evaluation and comparing different clusterings.

Results

We applied iMA-Net to a series of ten kNN graphs (for k = 1 to k = 10) to optimize the modularity value and find the best solution that partitions the set of 168 plays written by 39 authors. The computer used for these experiments has the same configuration as stated in the section ‘Performance results of iMA-Net on real-world networks’. We ran iMA-Net 30 times for each graph with the following parameters: population size Np is set to 40, the termination criterion is set to 30 generations without improvement (Ns = 30) and the minimum mutation probability is equal to 0.05 (). The best solution found in 30 runs of the algorithm, which has the highest value of modularity, is recorded in Table 5. Quality measures including NMI, ARI and NMI × ARI, representing the similarity between best solutions and the true clustering solution obtained from the authorship labels, are shown in Table 5.

thumbnail
Table 5. The best solution found by iMA-Net in ten kNN graphs derived from 168 Shakespearean era play dataset.

https://doi.org/10.1371/journal.pone.0157988.t005

The results in Table 5 demonstrate that by increasing the value of k and adding more connections to the graph, the number of the detected clusters and the solution modularity (Q) is reduced. Because of the dependency of the highest value of Q on the structure of the graph, it cannot be used for comparing solutions from different kNN graphs. But we can refer to quality measures (NMI and ARI) to find in which graphs iMA-Net finds the solution that better matches authorship labels. The results show that the best-matched solution with the highest quality measures, i.e. NMI = 0.742 and ARI = 0.525, is obtained by analyzing the 2-nearest neighbor (2NN) graph and contains 18 clusters, as is shown in Fig 3.

thumbnail
Fig 3. Best clustering outcome of iMA-Net with the highest NMI and ARI.

18 clusters are detected in 2-nearest neighbor graph shown by different colours. Node size is proportional to the node’s total degree. Nodes are labeled by the play.author.

https://doi.org/10.1371/journal.pone.0157988.g003

Fig 3 presents the best solution in 2NN graph where each node represents a unique play. The play’s author is shown as the label of the node and the detected clusters are identified by different colors. Lists of the plays located in each cluster of this clustering solution together with the confusion matrix of the solution and the true solution is available in S2 and S3 Tables.

Moreover, Fig 3 shows that the solution contains five clusters dedicated to only one author, where all plays in these clusters are written by the same author. The five finely detected authors are Dekker (Cluster 3), Ford (Cluster 6), Lyly (Cluster 16), Chapman (Cluster 11) and Webster (Cluster 14). It also contains a good classification of authors with more contributions in the dataset; for instance the biggest cluster (Cluster 5) with 23 plays includes 20 plays attributed to Shakespeare together with three plays from authors that have only one contribution in the dataset. Moreover, we have a very good clustering of the plays written by four authors Fletcher (Cluster 1), Middleton (Cluster 13), Jonson (Cluster 15) and Chapman (Cluster 17). These authors are separately clustered excepting only one play by another author included in each cluster.

Furthermore, considering authors with limited contributions in this dataset and the way these works grouped with other plays reveals some interesting insights about their literary style. For instance, the single Tourneur play in the dataset is classified in Cluster 5 with 20 plays of Shakespeare. Commentators have seen Tourneur as a follower of Shakespeare [75], and it is likely that in this case the influence of Shakespeare’s style explains the placement of the play within the cluster. Another example is the clustering of the play of Marmion and the single play of Brome in the same group (Cluster 12) with 10 plays attributed to Middleton that illustrate the similarity between their word usage.

Reduced Datasets

Referring to Table 1, the complete dataset contains 19 authors with only one attributed play. iMA-Net classified these plays with the most similar plays, maximizing the modularity value. However, this behavior reduces the quality measures (NMI and ARI). To better understand the performance of the proposed methodology, we applied the same stages on reduced datasets resulting from removing some plays from the original dataset. The five reduced datasets generated by removing plays written by authors who have 1, 2, 3, 4 and 5 contributions are as follows:

  1. G1: Plays from authors with more than one contribution.
  2. G2: Plays from authors with more than two contributions.
  3. G3: Plays from authors with more than three contributions.
  4. G4: Plays from authors with more than four contributions.
  5. G5: Plays from authors with more than five contributions.

Table 6 shows the number of plays and authors that remained in each of the reduced datasets. We applied our four-stage method to the five reduced datasets (G1-G5) and the results are given in Table 7.

thumbnail
Table 7. Best solutions found by iMA-Net in 10 kNN graphs derived from each reduced dataset (G1-G5).

The highest values of NMI, ARI and NMI×ARI in each dataset are denoted in bold.

https://doi.org/10.1371/journal.pone.0157988.t007

From the higher values of NMI and ARI in Table 7, compared to their values in the original dataset, Table 5, we can conclude that by removing authors with fewer contributions from the original dataset, the proposed clustering method found solutions with higher quality with clusters better matched to the authors’ labels. More precisely, while the average NMI and average ARI in the original dataset are 0.670 and 0.422, respectively. The average NMI increased to 0.718 and the average value of ARI to 0.538 in the reduced datasets.

On the other hand, the results in Table 7 show that NMI and ARI are different quality measures and they do not always agree on the best solution among the ten solutions found in the kNN graphs. Thus, we consider NMI × ARI to combine the effect of two measure into a single index. Interestingly, NMI×ARI always attains the highest value in the 5-nearest neighbour (5NN) graphs in all of the five reduced datasets. The best solution according to NMI×ARI is found in the G5 dataset and in the 5NN graph where the highest NMI×ARI is equal to 0.693. Table 8 shows the confusion matrix of the true solution and the solution with the highest quality measures for NMI and NMI×ARI and Fig 4 represents this best clustering solution in the 5NN graph. For details of plays in each cluster refer to the S4 Table.

thumbnail
Table 8. Confusion matrix of the true authorship and the clustering solution obtained by the 5NN graph with NMI × ARI = 0.693.

The confusion matrix shows how 106 plays by 7 authors are distributed into 7 clusters. As expected, a good separation occurred in clusters 2, 4, 6 and 7, which are formed by plays of one specific author.

https://doi.org/10.1371/journal.pone.0157988.t008

thumbnail
Fig 4. The best clustering outcome of iMA-Net with the highest NMI × ARI in G5.

Seven clusters are shown by different colours in the 5-nearest neighbour graph constructed from similarity between 106 plays from 7 authors. Nodes sizes are proportioned to their degree.

https://doi.org/10.1371/journal.pone.0157988.g004

Both Table 8 and Fig 4 show that plays written by Middleton (Cluster 2), Fletcher (Cluster 4), Lyly (Cluster 6) and Ford (Cluster 7) are classified in completely separated clusters. All 28 plays of Shakespeare in this dataset are in the biggest group, Cluster 1, together with three plays from different authors which are the comedy ‘Blind Beggar of Alexandria’ written by Chapman, the pastoral play ‘Faithful Shepherdess’ by Fletcher and a comedy from Lyly named ‘Woman in the Moon’. So far as we are aware no commentators have suggested previously that these three non-Shakespeare plays in Cluster 1 are particularly Shakespearean, but they are sometimes noted as unusual for their authors. In the present analysis ‘The Woman in the Moon’ is among the five nearest neighbors of ‘The Faithful Shepherdess’ and ‘The Blind Beggar of Alexandria’, thus these three plays are themselves connected in G5.

The ‘Woman in the Moon’ is a departure for Lyly in that he moves to verse after a career writing prose comedies. It is his last-performed play. Critics note other important differences from Lyly’s normal practice as well [76]. None of the other seven Lyly plays in this dataset appear among the five plays with the highest similarity score and connected to ‘Woman in the Moon’. Moreover, though this play is normally classified as a comedy [77] the closest play according to measure is a Shakespeare tragedy, ‘Romeo and Juliet’, and other closest is another tragedy, ‘Tis Pity She’s a Whore’, by John Ford, from a much later phase of drama—the first production of Ford’s play was in 1632, compared to 1593 for the Lyly play. In her edition of the play Leah Scragg suggests that the play has more similarities to the rest of the Lyly canon than critics generally acknowledge, and has links to two other Lyly plays, ‘Endymion’ and ‘Mother Bombie’, in particular [76]; the results of the present analysis come down on the side of change rather than continuity in this play. In the context of a clustering in which author and genre likenesses come through strongly, and consistently, overall, the cross-author and cross-genre links for this play are worthy of further investigation, though that is beyond the scope of the present paper.

‘The Faithful Shepherdess’ is a pastoral, quite unlike Fletcher’s usual work in comedy and tragicomedy. In his studies of the shares of Fletcher and Beaumont in their collaborative plays Cyrus Hoy excludes ‘The Faithful Shepherdess’ from Fletcher baselines for that reason [78]. Jonathan Hope notes Fletcher’s departure from his usual style in ‘The Faithful Shepherdess’, which he calls “a consciously literary and deliberately archaic piece” [79]. Thus it is not surprising to find it clustered away from the main Fletcher cluster. Two Fletcher plays appear in the nearest neighbors of ‘The Faithful Shepherdess’, suggesting that the analysis has uncovered some authorial characteristics in the play, even if they are not sufficient to have it join the Fletcher cluster (Cluster 4).

Something similar may be said about the ‘Blind Beggar of Alexandria’, which appears with two other Chapman comedies in the list of plays with the highest similarity score. The ‘Blind Beggar of Alexandria’ is Chapman’s first performed play and has often been viewed negatively as overly farcical with an underdeveloped romantic plot [80].

It is most likely that these three plays are clustered with Shakespeare because they have weak links to their authorial canons, and the Shakespeare cluster is large and therefore likely to be diverse, with many options for links.

Two other clusters (Cluster 3 and Cluster 5) are formed by mixing all 17 of Jonson’s plays and 12 plays written by Chapman. A deeper look at the plays in these two clusters shows that the genre of the plays has a great role in putting similar works in the same cluster. The larger cluster (Cluster 3) contains 21 comedies of both Jonson and Chapman together with the only pastoral play written by Jonson named (‘Sad Shepherd’). However, Cluster 5 contains all of six tragedy plays written by these two authors and a historical play of Chapman which is called ‘Caesar and Pompey’. Both Jonson and Chapman would be recognized as having drastically different styles in their own comedies compared to their own tragedies, so this would fit with general critical opinion.

Taking all the clusters found by our methodology into account, authorship is generally very strong on these measures, but a few authors have either an aberrant play or a consistent internal division by genre. Chapman has both of these attributes. The clusters identified by the proposed unsupervised method show considerable associations with the authors’ writing styles. Nevertheless and in order to validate the effectiveness of our method, we applied two available clustering methods on the dataset and compared the results.

Performance comparison with other clustering methods

The proposed method provides solutions for the clustering of plays in the dataset based on the similarity of word frequencies in each play. To compare our method with other clustering methods, we applied six well-known distance-based clustering techniques on the Complete dataset and the five reduced datasets (G1-G5). The selected clustering methods for comparisons are: a well-known implementation of the k-means method named k-means++ [81], one unsupervised graph-based clustering algorithm named MST-kNN [32], and four popular hierarchical clustering methods: 1)Complete-linkage, 2)Average-linkage, 3)Single-linkage and 4)Ward’s algorithm.

k-means is by far the most popular clustering method used in many scientific and industrial applications [82, 83]. k-means++ developed by Arthur [81], is a novel k-means algorithm with an enhanced seeding technique for outperforming the standard k-means. Due to the popularity of k-means methods, k-means++ is selected for comparison. k-means++ is implemented in the Scikit-learn(0.16.1) clustering package [84] in Python and is publicly available. It is worth noting that k-means++, according to the original method, uses the Euclidean distance and identifies clusters by minimizing the average square distance between elements in the same clusters [81]. This method is a supervised method, in the sense that the number of clusters (K) is known in advance; therefore, we run the algorithm for K = 2 to K = 20 and report the best solution based on the quality measure NMI×ARI.

MST-kNN is an unsupervised graph clustering technique proposed by Inostroza-Ponta et. al [32]. This method works by partitioning the minimum spanning tree using the k-nearest neighbour graph with an adaptive determination of the number of clusters. The MST-kNN is chosen for comparison because for two main reasons: 1) it can use the matrix and compute the clusters based on this distance (dissimilarity) matrix, 2) MST-kNN also uses a similar approach of using graph structure for data clustering. This method has been employed in several studies [8587] and it has shown remarkable results in data analysis. While MST-kNN is unsupervised, it is applied directly on matrices and the obtained clusters is evaluated regarding the true partitions.

Hierarchical clustering methods seek the hierarchical structure of the graph by recursively partitioning nodes into clusters either in top-down fashion by dividing clusters into smaller clusters (divisive hierarchical clustering) or bottom-up fashion by merging clusters into larger cluster (agglomerative hierarchical clustering). Dissimilarity measures and linkage criteria are used in order to select which cluster should be merged or divided. The linkage criterion refers to the manner in which the distance between two clusters is calculated. The most well-known linkage criteria are complete-linkage, average-linkage and single-linkage [88]. The distance between two cluster according to the complete-linkage criterion is equal to the longest distance between any pair of nodes from the two clusters [89], but based on the single-linkage criteria it is equal to the shortest distance between any pair of nodes from the two clusters [90] and in average-linkage methods the distance between two cluster is defined as the average distance of any member of one cluster to any member of the other cluster [91].

In this study, we use the stats (3.3.0) package (hclust function) provided in R. The is employed as the dissimilarity measure and for the linkage criterion four methods have been used: Complete-linkage, Average-linkage, Single-linkage and Ward’s method [92, 93]. Ward’s method is a modified average-linkage method that works based on the squared dissimilarity. Hierarchical methods result in dendrograms, representing the nested grouping of nodes. The clustering solution of the data elements is obtained by cutting the dendrogram at the desired level. Therefore in this study to compare the clustering result of hierarchical methods with the true partitions and other methods, the obtained dendrogram was cut to get the desired number of clusters. To give the most possible chance to the dendrogram to provide the best solution the number of clusters was varied from K = 2 to K = 20.

Table 9 shows the best and worst result of the proposed clustering methods together with the best results obtained by k-means++, MST-kNN and four hierarchical clustering methods (i.e. complete-linkage, average-linkage, single-linkage and Ward’s method). The quality of the clustering results are evaluated by NMI, ARI and NMI×ARI which are computed by comparing the clusters with the true authorship label of the plays in each dataset. The Best and the Worst results are repetitions from Table 5 (for Complete dataset) and Table 7 (for reduced datasets). Methods are ranked in each dataset based on NMI × ARI that shows how successfully the method obtained the true partition.

thumbnail
Table 9. The best clustering solutions obtained by benchmark methods (k-means++, MST-kNN, Complete-linkage, Average-linkage, Single-linkage and Ward’s method) together with the best (Best) and the worst (Worst) clustering result obtained by the proposed method in Complete dataset and five reduced datasets (G1-G5).

The Rank column is based on the value of NMI × ARI.

https://doi.org/10.1371/journal.pone.0157988.t009

Firstly, the best results of the proposed method (Best), in the three of six datasets, i.e. G2, G3 and G5 datasets, are ranked first and outperform the other methods. In these three datasets the results of Ward’s method are in the second rank. In the other three datasets, i.e. Complete, G2 and G4 datasets, Ward’s method is ranked first and Best is in the second place.

Secondly, comparison between the benchmark methods shows that MST-kNN has an obvious tendency to merge small clusters. Hence, it results in a solution with fewer clusters. On the other hand, the three hierarchical clustering methods, i.e. Complete-linkage, Average-linkage and Single-linkage, result in solutions with more clusters. However, among the three traditional hierarchical clustering methods the Complete-linkage method obtained better results than Average-linkage and Single-linkage in all datasets.

Finally, comparison between the worst results of the proposed method (Worst) with other benchmark methods indicates that in four of the datasets, i.e. Complete, G1, G2, G3 and G5, Worst is ranked 3 after Best and Ward’s. Another striking observation about the Worst results is that in all datasets, it is better than MST-kNN, Complete-linkage, Average-linkage and Single-linkage. Furthermore, comparison between the performance of the popular method k-means++ with Worst illustrates that in four datasets, Worst is ranked better than k-means++, and only in G3 and G4 does k-means++ show slightly better performance than Worst.

Discussion

Significance and Contribution

In this paper, we proposed a new four-stage methodology for data clustering. The methodology is unsupervised and it is able to identify groups of data elements which have a meaningful similarity. Though several methods have been proposed for two problems of data clustering and community detection, in this study, we introduced a new general methodology to convert the data clustering problem to the community detection problem by borrowing fundamental concepts of information theory and graph theory. This methodology aims to identify novel clusters from data. To evaluate the performance of our methodology, we have conducted several experiments on an interesting corpus dataset generated by counting word frequencies in 168 Shakespearean plays. Clusters obtained from datasets (Complete and reduced datasets) revealed stimulating findings about author’s literary styles.

Moreover, the memetic algorithm (iMA-Net) which is proposed optimizing the modularity value shows efficient performance in discovering solutions in benchmark networks. Also, in the kNN graphs generated for the Complete dataset and reduced datasets, the iMA-Net obtained good clustering results that are highly matched with the true partitions of the datasets. Furthermore, comparison with six well-known clustering techniques (Table 9) illustrates the great ability of the developed methodology in detecting superior clustering solutions in the studied datasets. Finally, the proposed methodology is a generic method and in each of the four stages a variety of other methods can be employed. For instance, in the third stage of identifying the community structure of the kNN graphs, variety of community detection methods might be applicable.

Limitations and future research

Although the proposed methodology obtains remarkable results in studied datasets, there are a few limitations, which are possible areas for future work. Firstly, we utilized the square root of Jensen-Shannon divergence () to measure the dissimilarity between elements and then convert the dissimilarity (distance) in kNN graphs to the similarity by subtracting it from one; it performs competently in the studied datasets, but more studies on different datasets and comparison with other measures are required to prove that is the most suitable measure for constructing the kNN graphs.

Secondly, there is a limitation regarding the number of kNN graphs that should be analyzed. In the second stage of the proposed methodology, the kNN graphs were generated from the dissimilarity matrix and we set the value of k to be in the range of 1 to 10. As mentioned before, comparison between the best found clustering in the five reduced datasets (Table 7) showed that solutions with the highest value of NMI×ARI were always obtained from the 5NN graph. However more investigations are required to justify that analyzing 5NN graphs has an advantage over the other kNN graphs. One of the future research directions would be finding a proper procedure for tuning the value of k in different datasets.

Note that the proposed clustering methodology is a four-stage procedure and for each stage different tools are applicable. This has its advantages and disadvantages. While the main advantage is the flexibility of the method to employ different tools, the disadvantage is its complexity in combining many different tools. Future work will aim to develop a single software tool for the proposed clustering approach that can apply all four stages in an optimized and scalable way.

Another future research direction is to employ multi-objective community detection algorithms that optimize not only modularity value but also other quality functions. As an example, we can refer to the multi-objective community detection method proposed by Shi et al. [94]. Multi-objective algorithms return a set of non-dominated solutions and give the opportunity for the user to select the most appropriate solution from a few. Furthermore, applying overlapping community detection methods in the third stage is another important future research direction, which will enhance the ability of the proposed methodology to identify overlapping clusters; the latter has several real-world applications.

To conclude, we proposed a novel four-stage methodology for clustering based on information theory and modularity optimization. To demonstrate the effectiveness of this methodology, we conducted experiments on clustering a large text corpus dataset. We discovered remarkable results which lead to the authors’ literary styles in different genres. As discussed, we envision that the proposed efficient methodology might be adopted in various fields for the purpose of investigating the clustering structure of datasets.

Supporting Information

S1 Table. The 168×168 matrix contains the between each pair of 168 plays.

https://doi.org/10.1371/journal.pone.0157988.s001

(XLSX)

S2 Table. Confusion Matrix of the best solution that is demonstrated in Fig 3.

https://doi.org/10.1371/journal.pone.0157988.s002

(XLSX)

S3 Table. List of plays located in each cluster of the solution demonstrated in Fig 3.

https://doi.org/10.1371/journal.pone.0157988.s003

(XLSX)

S4 Table. List of plays located in each cluster of the best clustering solution of G5 which is demonstrated in Fig 4.

https://doi.org/10.1371/journal.pone.0157988.s004

(XLSX)

Acknowledgments

We would like to thank Renato Vimieiro and Carlos Riveros for providing the R package for JSD matrix computation, and to Luke Mathieson for his extensive proof-reading and constructive comments. We also thank the four anonymous reviewers for their valuable comments.

Author Contributions

  1. Conceived and designed the experiments: LN PM.
  2. Performed the experiments: LN.
  3. Analyzed the data: LN PM HC.
  4. Contributed reagents/materials/analysis tools: LN PM HC RB.
  5. Wrote the paper: LN PM HC.

References

  1. 1. Baldi P, Hatfield GW. DNA microarrays and gene expression: from experiments to data analysis and modeling. Cambridge University Press; 2002.
  2. 2. Clark MB, Johnston RL, Inostroza-Ponta M, Fox AH, Fortini E, Moscato P, et al. Genome-wide analysis of long noncoding RNA stability. Genome research. 2012;22(5):885–898. pmid:22406755
  3. 3. Arabie P. Cluster analysis in marketing research. Advanced methods in marketing research. 1994;p. 160–189.
  4. 4. de Vries NJ, Carlson J, Moscato P. A Data-Driven Approach to Reverse Engineering Customer Engagement Models: Towards Functional Constructs. PLoS ONE. 2014 07;9(7):1–19.
  5. 5. Aggarwal CC, Reddy CK. Data clustering: algorithms and applications. CRC; 2013.
  6. 6. Han J, Kamber M, Pei J. Data mining: concepts and techniques. Elsevier; 2011.
  7. 7. Smyth P. Model selection for probabilistic clustering using cross-validated likelihood. Statistics and Computing. 2000;10(1):63–72.
  8. 8. Hathaway RJ, Bezdek JC. Extending fuzzy and probabilistic clustering to very large data sets. Computational Statistics & Data Analysis. 2006;51(1):215–234.
  9. 9. Jain AK. Data clustering: 50 years beyond K-means. Pattern Recognition Letters. 2010;31(8):651–666.
  10. 10. Ester M, Kriegel HP, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD. vol. 96; 1996. p. 226–231.
  11. 11. Amini A, Wah TY, Saybani MR, Yazdi SRAS. A study of density-grid based clustering algorithms on data streams. In: Fuzzy Systems and Knowledge Discovery (FSKD), 2011 Eighth International Conference on. vol. 3. IEEE; 2011. p. 1652–1656.
  12. 12. Ng AY, Jordan MI, Weiss Y, et al. On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems. 2002;2:849–856.
  13. 13. Von Luxburg U. A tutorial on spectral clustering. Statistics and Computing. 2007;17(4):395–416.
  14. 14. Schaeffer SE. Graph clustering. Computer Science Review. 2007;1(1):27–64.
  15. 15. Chan PK, Schlag MD, Zien JY. Spectral k-way ratio-cut partitioning and clustering. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on. 1994;13(9):1088–1096.
  16. 16. Shi J, Malik J. Normalized cuts and image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on. 2000;22(8):888–905.
  17. 17. Newman ME, Girvan M. Finding and evaluating community structure in networks. Physical Review E. 2004;69(2):026113.
  18. 18. Girvan M, Newman ME. Community structure in social and biological networks. Proceedings of the National Academy of Sciences. 2002;99(12):7821–7826.
  19. 19. Fortunato S. Community detection in graphs. Physics Reports. 2010;486(3):75–174.
  20. 20. Andreev K, Räcke H. Balanced Graph Partitioning. In: Proceedings of the Sixteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures. ACM; 2004. p. 120–124.
  21. 21. Feldmann AE. Fast balanced partitioning is hard even on grids and trees. In: Mathematical Foundations of Computer Science. Springer; 2012. p. 372–382.
  22. 22. Elliott WE, Valenza RJ. Oxford by the Numbers: What Are the Odds that the Earl of Oxford Could Have Written Shakespeare’s Poems and Plays. Tennessee Law Review. 2004;72:323.
  23. 23. Vickers B. ‘Counterfeiting’ Shakespeare: Evidence, Authorship and John Ford’s Funerall Elegye. Cambridge University Press; 2002.
  24. 24. Jackson MP. Determining the Shakespeare Canon: Arden of Faversham and A Lover’s Complaint. Oxford University Press; 2014.
  25. 25. Foucault M. What is an author? In: Faubion JD, trans Harari Josué V, editors. Aesthetics, Method, and Epistemology. The New Press; 1994. p. 205–21.
  26. 26. Barthes R. The Death of the Author. Image / Music / Text. 1977;p. 142–148.
  27. 27. Craig H. Style, statistics, and new models of authorship. Early Modern Literary Studies. 2009;15(1):1–42.
  28. 28. Egan G. What Is Not Collaborative About Early Modern Drama In Performance And Print? In: Holland P, editor. Shakespeare Survey. vol. 67. Cambridge University Press; 2014. p. 18–28. Shakespeare Survey Online.
  29. 29. Craig H, Whipp R. Old spellings, new methods: automated procedures for indeterminate linguistic data. Literary and Linguistic Computing. 2010;25(1):37–52.
  30. 30. Marsden J, Budden D, Craig H, Moscato P. Language individuation and marker words: Shakespeare and his Maxwell’s demon. PLoS ONE. 2013;8(6):e66813. pmid:23826143
  31. 31. Arefin AS, Riveros C, Berretta R, Moscato P. The MST-kNN with Paracliques. In: Artificial Life and Computational Intelligence. Springer; 2015. p. 373–386.
  32. 32. Inostroza-Ponta M, Mendes A, Berretta R, Moscato P. An integrated QAP-based approach to visualize patterns of gene expression similarity. In: Progress in Artificial Life. Springer; 2007. p. 156–167.
  33. 33. Naeni LM, de Vries NJ, Reis R, Arefin AS, Berretta R, Moscato P. Identifying Communities of Trust and Confidence in the Charity and Not-for-profit Sector: A Memetic Algorithm Approach. In: Big Data and Cloud Computing (BdCloud), 2014 IEEE Fourth International Conference on; 2015. p. 500–507.
  34. 34. Rao CR, Nayak TK. Cross entropy, dissimilarity measures, and characterizations of quadratic entropy. Information Theory, IEEE Transactions on. 1985;31(5):589–593.
  35. 35. Shannon CE. A mathematical theory of communication. Bell System Technical Journal. 1948 July;27(3):379–423.
  36. 36. Lin J. Divergence measures based on the Shannon entropy. Information Theory, IEEE Transactions on. 1991;37(1):145–151.
  37. 37. Berretta R, Moscato P. Cancer biomarker discovery: the entropic hallmark. PLoS ONE. 2010;5(8):e12262.
  38. 38. Grosse I, Bernaola-Galván P, Carpena P, Román-Roldán R, Oliver J, Stanley HE. Analysis of symbolic sequences using the Jensen-Shannon divergence. Physical Review E. 2002;65(4):041905.
  39. 39. Sims GE, Jun SR, Wu GA, Kim SH. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proceedings of the National Academy of Sciences. 2009;106(8):2677–2682.
  40. 40. Endres DM, Schindelin JE. A new metric for probability distributions. IEEE Transactions on Information theory. 2003;.
  41. 41. Österreicher F, Vajda I. A new class of metric divergences on probability spaces and its applicability in statistics. Annals of the Institute of Statistical Mathematics. 2003;55(3):639–653.
  42. 42. Arefin AS, Riveros C, Berretta R, Moscato P. GPU-FS-k NN: A Software Tool for Fast and Scalable k NN Computation Using GPUs. PLoS ONE. 2012;7(8):e44000. pmid:22937144
  43. 43. Gan G, Ma C, Wu J. Data clustering: theory, algorithms, and applications. vol. 20. Siam; 2007.
  44. 44. Watts DJ. A twenty-first century science. Nature. 2007;445(7127):489–489. pmid:17268455
  45. 45. Lazer D, Pentland AS, Adamic L, Aral S, Barabasi AL, Brewer D, et al. Life in the network: the coming age of computational social science. Science. 2009;323(5915):721. pmid:19197046
  46. 46. Newman ME, Girvan M. Finding and evaluating community structure in networks. Physical Review E. 2004;69(2):026113.
  47. 47. Brandes U, Delling D, Gaertler M, Görke R, Hoefer M, Nikoloski Z, et al. Maximizing modularity is hard. arXiv preprint physics/0608255. 2006;.
  48. 48. Clauset A, Newman ME, Moore C. Finding community structure in very large networks. Physical Review E. 2004;70(6).
  49. 49. Wakita K, Tsurumi T. Finding community structure in mega-scale social networks. In: Proceedings of the 16th international conference on World Wide Web. ACM; 2007. p. 1275–1276.
  50. 50. Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment. 2008;2008(10):P10008.
  51. 51. Tasgin M, Herdagdelen A, Bingol H. Community detection in complex networks using genetic algorithms. arXiv preprint arXiv:07110491. 2007;.
  52. 52. Li J, Song Y. Community detection in complex networks using extended compact genetic algorithm. Soft Computing. 2013;17(6):925–937.
  53. 53. Gach O, Hao JK. In: Coello CAC, Cutello V, Deb K, Forrest S, Nicosia G, Pavone M, editors. A memetic algorithm for community detection in complex networks. Springer Berlin Heidelberg; 2012. p. 327–336.
  54. 54. Shang R, Bai J, Jiao L, Jin C. Community detection based on modularity and an improved genetic algorithm. Physica A: Statistical Mechanics and its Applications. 2013;392(5):1215–1231.
  55. 55. Liu D, Jin D, Baquero C, He D, Yang B, Yu Q. Genetic algorithm with a local search strategy for discovering communities in complex networks. International Journal of Computational Intelligence Systems. 2013;6(2):354–369.
  56. 56. Newman ME. Modularity and community structure in networks. Proceedings of the National Academy of Sciences. 2006;103(23):8577–8582.
  57. 57. Leicht EA, Newman ME. Community structure in directed networks. Physical Review Letters. 2008;100(11):118703. pmid:18517839
  58. 58. Liu J, Liu T. Detecting community structure in complex networks using simulated annealing with k-means algorithms. Physica A: Statistical Mechanics and its Applications. 2010;389(11):2300–2309.
  59. 59. Naeni LM, Berretta R, Moscato P. In: Handa H, Ishibuchi H, Ong YS, Tan CK, editors. MA-Net: A Reliable Memetic Algorithm for Community Detection by Modularity Optimization. Cham: Springer International Publishing; 2015. p. 311–323.
  60. 60. Neri F, Cotta C, Moscato P. Handbook of memetic algorithms. vol. 379. Springer; 2012.
  61. 61. Fortunato S, Barthélemy M. Resolution limit in community detection. Proceedings of the National Academy of Sciences. 2007;104(1):36–41.
  62. 62. Schuetz P, Caflisch A. Efficient modularity optimization by multistep greedy algorithm and vertex mover refinement. Physical Review E. 2008;77(4):046112.
  63. 63. Zachary WW. An information flow model for conflict and fission in small groups. Journal of Anthropological Research. 1977;p. 452–473.
  64. 64. Lusseau D, Schneider K, Boisseau OJ, Haase P, Slooten E, Dawson SM. The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Behavioral Ecology and Sociobiology. 2003;54(4):396–405.
  65. 65. Gleiser PM, Danon L. Community structure in jazz. Advances in Complex Systems. 2003;6(04):565–573.
  66. 66. Barber MJ, Clark JW. Detecting network communities by propagating labels under constraints. Physical Review E. 2009;80(2):026129.
  67. 67. Gong M, Fu B, Jiao L, Du H. Memetic algorithm for community detection in networks. Physical Review E. 2011;84(5):056101.
  68. 68. Pizzuti C. A multiobjective genetic algorithm to find communities in complex networks. Evolutionary Computation, IEEE Transactions on. 2012;16(3):418–430.
  69. 69. Gong M, Cai Q, Chen X, Ma L. Complex network clustering by multiobjective discrete particle swarm optimization based on decomposition. IEEE Transactions on Evolutionary Computation. 2014;18(1):82–97.
  70. 70. Ma L, Gong M, Liu J, Cai Q, Jiao L. Multi-level learning based memetic algorithm for community detection. Applied Soft Computing. 2014;19:121–133.
  71. 71. Raghavan UN, Albert R, Kumara S. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E. 2007;76(3):036106.
  72. 72. Ana L, Jain AK. Robust data clustering. In: Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on. vol. 2. IEEE; 2003. p. II–128.
  73. 73. Rand WM. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association. 1971;66(336):846–850.
  74. 74. Hubert L, Arabie P. Comparing partitions. Journal of classification. 1985;2(1):193–218.
  75. 75. Eliot S T. Eassys on Elizabethan Drama. Harcourt, Brace and Company, Inc.; 1956.
  76. 76. Lyly J, Scragg L. The Woman in the Moon: By John Lyly. The Revels Plays. Manchester University Press; 2006.
  77. 77. Harbage A, Schoenbaum S. Annals of English Drama, 975–1700: An Analytical Record of All Plays, Extant Or Lost, Chronologically Arranged and Indexed by Authors, Titles, Dramatic Companies & C. 2nd ed. University of Pennsylvania; 1964.
  78. 78. Hoy C. The Shares of Fletcher and His Collaborators in the Beaumont and Fletcher Canon (I). Studies in Bibliography. 1956;8:129–146.
  79. 79. Hope J. The authorship of Shakespeare’s plays: a socio-linguistic study. Cambridge University Press; 1994.
  80. 80. Kaufman HA. The Blind Beggar of Alexandria: A Reappraisal. Philological Quarterly. 1959;38:101.
  81. 81. Arthur D, Vassilvitskii S. k-means++: The advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics; 2007. p. 1027–1035.
  82. 82. Berkhin P. A survey of clustering data mining techniques. In: Grouping multidimensional data. Springer; 2006. p. 25–71.
  83. 83. Jacques J, Preda C. Functional data clustering: a survey. Advances in Data Analysis and Classification. 2014;8(3):231–255.
  84. 84. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research. 2011;12:2825–2830.
  85. 85. Arefin AS, Mathieson L, Johnstone D, Berretta R, Moscato P. Unveiling clusters of RNA transcript pairs associated with markers of Alzheimer’s disease progression. PLoS ONE. 2012;7(9):e45535. pmid:23029078
  86. 86. Capp A, Inostroza-Ponta M, Bill D, Moscato P, Lai C, Christie D, et al. Is there more than one proctitis syndrome? A revisitation using data from the TROG 96.01 trial. Radiotherapy and Oncology. 2009;90(3):400–407. pmid:18952309
  87. 87. Inostroza-Ponta M, Berretta R, Moscato P. QAPgrid: A two level QAP-based approach for large-scale data analysis and visualization. PLoS ONE. 2011;6(1):e14468. pmid:21267077
  88. 88. Rokach L, Maimon O. Clustering methods. In: Data mining and knowledge discovery handbook. Springer US; 2005. p. 321–352.
  89. 89. King B. Step-wise clustering procedures. Journal of the American Statistical Association. 1967;62(317):86–101.
  90. 90. Sneath PH, Sokal RR. Numerical taxonomy. The principles and practice of numerical classification. W. H. Freeman; 1973.
  91. 91. Murtagh F. Multidimensional clustering algorithms. Compstat Lectures; 1985.
  92. 92. Ward JH. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association. 1963;58(301):236–244.
  93. 93. Murtagh F, Legendre P. Ward’s hierarchical agglomerative clustering method: which algorithms implement Ward’s criterion? Journal of Classification. 2014;31(3):274–295.
  94. 94. Shi C, Yan Z, Cai Y, Wu B. Multi-objective community detection in complex networks. Applied Soft Computing. 2012;12(2):850–859.