Extractive single document summarization using binary differential evolution: Optimization of different sentence quality measures

With the increase in the amount of text information in different real-life applications, automatic text-summarization systems become more predominant in extracting relevant information. In the current study, we formulated the problem of extractive text-summarization as a binary optimization problem, and multi-objective binary differential evolution (DE) based optimization strategy is employed to solve this. The solutions of DE encode a possible subset of sentences to be present in the summary which is then evaluated based on some statistical features (objective functions) namely, the position of the sentence in the document, the similarity of a sentence with the title, length of the sentence, cohesion, readability, and coverage. These objective functions, measuring different aspects of summary, are optimized simultaneously using the search capability of DE. Some newly designed self-organizing map (SOM) based genetic operators are incorporated in the optimization process to improve the convergence. SOM generates a mating pool containing solutions and their neighborhoods. This mating pool takes part in the genetic operation (crossover and mutation) to create new solutions. To measure the similarity or dissimilarity between sentences, different existing measures like normalized Google distance, word mover distance, and cosine similarity are explored. For the purpose of evaluation, two standard summarization datasets namely, DUC2001, and DUC2002 are utilized, and the obtained results are compared with various supervised, unsupervised and optimization strategy based existing summarization techniques using ROUGE measures. Results illustrate the superiority of our approach in terms of convergence rate and ROUGE scores as compared to state-of-the-art methods. We have obtained 45% and 5% improvements over two recent state-of-the-art methods considering ROUGE−2 and ROUGE−1 scores, respectively, for the DUC2001 dataset. While for the DUC2002 dataset, improvements obtained by our approach are 20% and 5%, considering ROUGE−2 and ROUGE−1 scores, respectively. In addition to these standard datasets, CNN news dataset is also utilized to evaluate the efficacy of our proposed approach. It was also shown that the best performance not only depends on the objective functions used but also on the correct choice of similarity/dissimilarity measure between sentences.


Introduction
Text summarization [1] is a natural language processing task which aims to create a summary describing the main theme of the document. Summarization techniques can be categorized into two groups depending on the extraction methodology: extractive [2] and abstractive [3][4][5]. In extractive summarization (ESDocSum), portions of the original document are used to form summary. While in abstractive summarization, reformulation of text is required which needs linguistic knowledge. Some of the application domains where text summarization techniques can be applied are web document summary generation, summarization of bug-reports, report generation and generation of personalized summary helping in question-answering systems. Because of the complexity of text documents and consideration of semantic and syntactic information present in texts, text-summarization has become a challenging natural language processing task.
Nowadays, sentence based extractive summarization [6][7][8][9] systems become popular where a set of sentences are extracted from the document for the overall understanding of a document. These set of sentences are selected using some sentence scoring features. Some such scoring features are position of the sentence in the document [9], the length of the sentence [9], similarity with respect to the title of document [9] etc.
In the existing literature, a lot of works have been reported to solve the summarization problem. Different learning paradigms have been tried like supervised [10], unsupervised [11], deep learning [12,13], etc. But, in recent years, researchers have solved ESDocSum using different meta-heuristic optimization techniques like evolutionary algorithms which include genetic algorithm [14], differential evolution [15], etc. These algorithms help in extracting relevant sentences from the given document by optimizing some criteria. The algorithms have shown significant improvements [6,9] over the existing methods. In this paper also, we have proposed a novel approach for single-document summarization which utilizes multi-objective binary differential evolution (MOBDE) [16] as the underlying optimization strategy. However, there exist several other optimization algorithms like AMOSA [17], PSO [18], etc. Several new concepts like self-organizing map [19] based mating pool generation etc. are introduced in our proposed framework. Before discussing the motivation behind developing such an algorithm, the existing works on ESDocSum are analyzed next.

Related works
We have divided the related works on single document summarization into four categories: (a) supervised; (b) unsupervised; (c) meta-heuristic; and, (d) neural-network. Brief descriptions of these methods with the corresponding drawbacks are described below: Supervised methods. SVM [20] considered pre-existing document-summary pair for learning. In [10], summarization problem is treated as a sequence labeling problem and is solved using Condition Random Field (CRF) [21]. In [22], a method named, Manifold Ranking was proposed in which a ranking score was assigned to each sentence in the document based on its information richness and diversity. Then, sentences having high ranking scores are only selected to generate the final summary. In [23], regression-based model was proposed using Integer Linear Programming [24] which uses three features to select the candidate summary from the set of available summaries. Main limitation of the methods proposed in these papers is that they make use of labeled data for training (i.e., whether sentence belongs to the summary or not) which requires manual effort and this is also a time-consuming step.
Un-supervised methods. In [25], QCS, a query-based method was proposed by Dunlavy et al. to generate the summary. It uses Hidden Markov Model (HMM) which predicts the probability of a sentence to be included in the summary or not. Note that the method developed was a graph-based method which was adopted for simultaneous summarization of single as well as multi-documents. Main drawback of this approach was that it considers only three features: sentence position, local saliency (for single-document summarization) and global saliency (for multi-document summarization) scores of the sentences. Ferreira et al. [8] developed a context-based summarization system and have shown that quality of generated summary obtained using different combinations (sum) of sentence scoring functions/features depends on the type of text (news, article, blog). Their sentence scoring features include wordbased scoring (like term frequency, etc.), graph-based scoring (obtained using Text Rank algorithm [7]) and sentence-based scoring (sentence position, sentence similarity with the title, etc.). Main limitation of the discussed unsupervised methods [8,25] is that they have not explored the feature like readability which is important in understanding the generated summary by the end-user.
Meta-heuristics based methods. Aliguliyev et al. [6] proposed an optimization based automatic text summarization method. Here, the sentences in the document are assigned to different clusters and clusters quality are optimized using differential evolution algorithm. Then in every cluster, sentences are sorted based on some sentence scoring features. Finally, high ranked sentences are selected as a part of the summary. In [26], fuzzy evolutionary optimization model (FEOM) was developed and applied to extractive summarization as an application. In [9], the method, MA-SingleDocSum was proposed by Mendoza et al. using optimization algorithm named as Memetic algorithm. It makes use of guided local search to solve the summarization problem. In [27], a method named ESDS-GHS-GLO is proposed based on Global-best Harmony Search meta-heuristic and a greedy local search procedure. It considers extractive single document summarization as a binary optimization problem. Rasim et al. [28] proposed a COSUM method utilizing clustering and optimization technique optimizing coverage and diversity of the summary simultaneously. Main drawbacks of these metaheuristic algorithms are their low convergence rate and low ROUGE score. Moreover, they optimized sum (in some of the cases, weighted sum) of different objective functions, thus, converting multiple objective values to a single value.
Neural-network based methods. In [11], a neural network based method was developed namely, NetSum, which uses the RankNet [29] algorithm to assign a rank to the sentences in the document and then identifies informative sentences. In recent years [12,13], some deep learning models like a recurrent neural network, etc. have been used for solving single document extractive summarization task. Note that these methods make use of supervised information while training.

Motivation
Existing meta-heuristic strategies suffer from the following problems: slow convergence and low ROUGE score values for the obtained summary. None of the existing approaches makes use of the self-organizing map (SOM) in their architecture, which can help in exploring the neighborhood of a solution in an efficient way to determine the optimal ROUGE score. Here, ROUGE score is an evaluation function to measure the informativeness of the summary. In the current paper, summarization problem is treated as a binary optimization problem where different quality measures of the summary are optimized simultaneously. Six objective functions, namely, the position of the sentence in the document, the similarity of a sentence with the title, length of the sentence, cohesion, readability, coverage are selected to be optimized simultaneously. Multi-objective binary differential evolution (MOBDE) [16] is used as the underlying optimization strategy to optimize all objective functions simultaneously where each chromosome (or solution) is a binary string representing a set of possible sentences to be selected in the generated summary. Optimization of multiple objective functions helps in generation of good quality summary for a given document and thus, attaining better ROUGE score.
To increase the convergence rate further, concepts of self-organizing map (SOM) [19,30] are incorporated in MOBDE framework. SOM is a type of neural network which maps high dimensional input space to low-dimensional output space, where, output space is a grid of neurons arranged in 2-dimensional space. The central principle behind the SOM is that input samples which are close to each other in the input space should also come close to each other in the output space. Thus, it can be used as a cluster analysis tool. In any evolutionary algorithm, the qualities of new solutions generated from the old solutions play vital roles in convergence as they help in reaching the global optimum solutions. In our approach, SOM is used to generate high-quality solutions which in turn help in faster convergence. SOM is first trained using the current population to discover the localities of chromosomes/solutions, and then a mating pool is constructed for each chromosome using the neighborhood relationships extracted by SOM. After that, chromosomes present in the mating pool are combined using reproduction operators (crossover and mutation) [31] to generate some new solutions.
To show that the performance of the proposed summarization technique not only depends on objective functions considered but, also on the type of sentence similarity/dissimilarity function used, experiments are conducted by varying the similarity/dissimilarity measures, namely normalized Google Distance (NGD) [48], word mover distance [49] and cosine similarity [6]. The proposed approach is tested on two standard datasets of text summarization namely, DUC2001 and DUC2002 (https://www-nlpir.nist.gov/projects/duc/data.html). One more dataset related to CNN news [23] is also used to evaluate the efficacy of our proposed system. Results obtained clearly show the superiority of our proposed algorithm in comparison to various state-of-the-art techniques.

Contributions
The major contributions of this paper are enumerated below: • In the literature, ESDocSum problem is often formulated as a single objective optimization problem with the weighted sum of different objectives [6,9] and this is popularly solved using different EA techniques. However, in this paper, summarization problem is treated as a multi-objective optimization problem where various aspects of summary like the readability of the summary, the similarity of the sentences in the summary with the title, etc. are optimized simultaneously.
• In the existing multi-objective evolutionary algorithms, usually, reproduction operators like roulette wheel selection, tournament selection [14] etc., popularly used in a single-objective optimization framework, are used to generate new solutions. But, in the current study, to generate high-quality solutions, some newly designed self-organizing map based genetic operators are used which further help in reaching the global optimum solutions in a faster way.
• In order to show that performance of summarization system not only depends on the objective functions used but also depends on the type of similarity/dissimilarity measure used between sentences, three types of similarity/dissimilarity measures namely, normalized google distance [48], word mover distance [49], and cosine similarity [6], are explored in this paper.
• Most of the papers on summarization using some optimization strategies make use of actual summary to report the results. But, in real time situations, actual summary may not be available. Therefore, in this paper, we have explored various unsupervised strategies for selecting a single solution from the final Pareto optimal front produced by any multi-objective optimization based technique.

Background knowledge
This section discusses some related concepts used in our proposed framework.

Multi-objective optimization
Multi-objective optimization (MOO) [14] refers to the task of optimizing more then one objective function, simultaneously, to solve a particular problem. It provides a set of alternative solutions to the decision maker as oppose to the single objective optimization. Mathematically, MOO can be formulated as: where X is a set of decision vectors in n-dimensional space denoted as fx 1 ;x 2 . . .x n g, m is the number of objective functions to be maximized and should be grater than 2. Note that there can be some constraints as a part of the optimization process.

A binary differential evolutionary algorithm for optimization
Differential Evolution (DE) is a population-based global optimization technique proposed by Storn and Price [15] to solve real-world problems. There exist many variants of the DE; each differs in representation (real-coded or binary-coded) of the solution and in the use of parameters. In our paper, a binary differential evolution algorithm [16] is used where each solution is represented as a binary vector. Each solution is associated with two or more objective functions in DE framework for multi-objective optimization. It executes similar to any other evolutionary algorithms. It starts with a set of solutions called as population. At time stamp (generation) t, ith solution is represented as where, n is the length of the solution and i = 1, 2. . ., jPj, jPj is the size of the population, x i,m can take value either 0 or 1 for m = 1, 2. . ., n. For each current solution i, offspring y 0 is generated using crossover and mutation operations [15,50] and then it undergoes evaluation in comparison with current solution, i. Crossover is the exchange of components between two solutions and mutation is the modification in the component. Only the better solution in terms of objective function value out of these two solutions (current and new offspring) can survive into the next generation.

Problem definition
Consider a document D consisting of N sentences, {s 1 , s 2 , . . ., s N }. Our main task is to find a subset of sentences, S 2 D, such that X where, S represents the main theme/topic of the document or subset of sentences which cover the relevant information from the document, s i is the sentence belonging to S, l i measures the length of ith sentence in terms of number of words, S max is the maximum number of words allowed in generated summary.

Sentence similarity/dissimilarity measures and sentence scoring features
To select the best possible set of sentences to be present in the summary, various statistical features (fitness functions or objective functions) are used to evaluate the subset and those are optimized simultaneously using the binary differential evolution algorithmic framework. Some of the features use similarity/dissimilarity criteria between sentences. In the current paper, we have utilized different types of similarity/dissimilarity measures and statistical functions. Descriptions of these functions/features are given below:

Sentence similarity/dissimilarity measures
In our work, three similarity/dissimilarity measures are used: Normalized Google Distance [48], word mover distance [51] and cosine similarity [6]. Normalized google distance. Normalized Google Distance (NGD) measures the semantic relationship between two sentences using terms present in the sentences. It was first proposed in [6]. Two terms tend to be close to each other if they are having similar sense. It is important to note that it is a dissimilarity measure, not a distance function. NGD between two sentences, s i and s j , can be defined as: where, t 1 and t 2 are the terms belonging to sentences, s i and s j , respectively; nt i and nt j are the number of terms in sentence s i and s j , respectively; NGD can be expressed as: where, f t1 denotes the number of sentences in the document (D) containing term t1, f t2 denotes the number of sentences in the document (D) containing term t2, f t1,t2 indicates the number of sentences in the document (D) containing both terms, t1 and t2, N is the number of sentences in the document. Three important properties of NGD are listed below: 1. The range of d NGD (s i , s j ) lies in the scale of 0 to 1.

If
Note that if N = 1, then we have f t1 = f t2 = f t1,t2 . In this case, d NGD ðt1; t2Þ ¼ 0 0 , will be considered as 0 by the 2nd property of NGD.
Word mover distance. Word Mover Distance (WMD) [49,51,52] calculates the dissimilarity between two texts as the amount of distance that the embedded words [53] of one text needs to travel to reach the embedded words of another text [51]. In our approach, text means a sentence. To obtain word embedding of different words, it makes use of word2vec [53]. If two sentences are similar, then WMD will be 0.
Cosine similarity. Cosine similarity [6] is a measure of similarity between two non-zero vectors that measures the cosine of the angle between vectors. It can be defined as: ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi P n i¼1 V 1 i q ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi where,Ṽ 1 andṼ 2 are the vectors of length n, V j i is the ith component of jth vector, j = 1, 2.
The value of this similarity lies between -1 to 1. 1 means two vectors are overlapping or exactly similar to each other, -1 means two vectors are opposite to each other, and 0 indicates they are orthogonal to each other. As our documents contain texts, in order to measure cosine similarity between two sentences, sentence vectors are required. For this purpose, word2vec [53] tool is used.
Word2vec [53] is a model that is used to generate word embedding. It is a two-layered neural network which takes a large corpus of text as the input and generates a unique vector of several hundred dimensions for each word in the corpus as the output. The main goal is to predict a word given the other words in a context. Therefore, it is capable of capturing the semantics between the two words. In our framework, pre-trained word2vec model on googlenews corpus (https://github.com/mmihaltz/word2vec-GoogleNews-vectors) is used. The sentence vector is obtained by averaging the word vectors of the words (obtained from the pretrained word2vec model) present in the sentence.

Statistical features or objective functions
To obtain a good summary, selection of objective functions (quality functions on sentences) is crucial. These objective functions assign some fitness values to the sentences and further help in improving the quality of generated summary. The set of objective functions used in our approach are: the position of the sentence in the document, similarity of a sentence with the title, length of the sentence, cohesion, coverage, and readability. First five objective functions are selected motivated by the paper [9]. Authors of cited paper have optimized weighted sum of first five objective functions and shown that their results are better that state-of-the-art results. But combining the values of different objective functions using weighted criteria into a single value may not be meaningful [54]. Moreover, in any text-based summarization system, readability is an important factor as generated summary should be readable to end-users. Therefore, in our approach, readability feature is considered as a sixth objective function. All these objective functions have to be maximized simultaneously by the use of some multi-objective optimization framework instead of using weighted sum approach. Brief description on these objective functions are provided below: Sentence position. In any document, regardless of domain, relevant/informative sentences can be found in some sections of the document like the leading paragraph of the document. Therefore, to consider this information into account, sentence position [55,56] is expressed as: where q i is the position of ith sentence. It assigns higher scores to initial sentences of the document. As the sentence position in the document increases, the value of p decreases. Similarity with title. Sentences in the summary should be similar to the title [57] to obtain a good summary because the title describes the theme of the document. This objective function is defined as given below: where, title is the headline/title of the document in which sentence s i belongs to, sim(s i , s j ) is the similarity between sentences, s i and s j , O is the number of sentences in generated summary, SWT avg is the average similarity of the sentences in summary with the title, max 8Summary SWT is the average maximum similarities of all sentences with the title, and SWT is the similarity factor of the summary S with the title. SWT is close to 1 if sentences, in summary, are closely related to the title of the document. Sentence length. Literature survey suggests that shorter sentences have less chances to appear as a part of the summary [58]. In this paper, normalization based sigmoid function [59] is used which favors the longest sentence but does not entirely rule out the medium length sentences. Mathematically, it is expressed as: Where, μ(l) is the mean length of sentences in the summary, l(s i ) is the length of sentence, s i , and std(l) is the standard deviation of lengths of sentences in summary S. Cohesion. Cohesion [60,61] measures the relatedness of the sentences in the summary. For a good summary, relatedness between sentences should be tightly coupled. It is expressed as Where, M = max sim(s i , s j ), i, j � N, C s measures the average similarity of the sentences in the summary, sim(s i , s j ) is the similarity between sentences, s i and s j , N is the total number of sentences in the document, M is the maximum similarity between two sentences. It ranges between [0, 1]. 1 indicates sentences in summary are highly related to each other.
Coverage. Coverage (CoV) [9] measures the extent to which sentences in the summary provide useful information about the document and should be maximized. Coverage is defined as where s i and s j are the sentences belonging to generated summary and document, respectively, Doc is the document, N is the number of sentences in the document, sim(s i , s j ) is the similarity between sentences, s i and s j . Readability factor. Readability factor [60] is the last objective function which is the most important factor for summary formation. In this, each sentence should be related to the previous one to make the summary readable. It is expressed as: where, N p is the number of sentences in the predicted summary, s i and s i−1 are two consecutive sentences in the predicted summary, sim(s i , s i−1 ) is the similarity between sentences, s i and s i−1 .

Self-organized multi-objective differential evolution based ESDocSum approach
In this paper, two approaches were developed for sentence based extractive single document summarization. Note that both approaches utilize a multi-objective based differential evolution technique as the underlying optimization strategy. SOM-based genetic operators are introduced in the process to increase the convergence. The flowchart of the proposed approach is shown in Fig 1 and underlying steps are discussed in subsequent sections. The pseudo code is also provided in Algorithm 1.
1. Approach-1: In this approach all objective functions are assigned some importance factors/ weights. For example, if fitness values of six objective functions are < ob 1 , ob 2 , ob 3 , ob 4 , ob 5 , Where g is the current generation number initialized with 0; g max is the maximum number of generations which is defined by the user; |P| is the number of solutions in the population. After step-8, g is incremented by 1 and the process continues until maximum number of generations is reached.
2. Approach-2: In this approach all objective functions are simultaneously optimized without assigning any weight values.
In the literature [9,62], it was shown that some of the objective functions used in our approach have more importance than others. Therefore, Approach-1 is developed to see the effect of the varying importance of different objectives functions.

Algorithm 1: SOM-based Extractive Text Summarization
Data: Single Text Document Result: The best solution and corresponding summary generated 1 Initialize population size |P|, population P (including calculation of objective functions) and max_generation; 2 Initialize training data for SOM as S = P; 3 t = 1; 4 while t<max_generation do Perform training of SOM using S; 7 for each solution in P do 8 Construct Mating pool (Q); 9 Generate new solution using Q, crossover and mutation; 10 Calculate new solution's objective functional values; 11 Add new solution into P 0 ; 12 end 13 P@ = Merge populations P and P 0 ; 14 P Apply non-dominated sorting and crowding distance operator (if needed) on P@ to select the top jPj solutions; 15 Update SOM training data as S = P 0 \ P@; 16 t = t+1; 17 end 18 return the best solution from the final Pareto optimal front and corresponding summary;

Preprocessing
Before generating the summary, a series of steps are executed to pre-process the document. These steps include segmentation of the document into sentences, stop word removal (frequent words like is, am, are, etc. are removed from the document), case folding (lower case conversion) and removal of punctuation marks. Here, the nltk toolkit [63] is used for document segmentation and removal of stop words.

Representation of solution and population initialization
Any evolutionary algorithm, starts with a set of solutions (or chromosomes), <x 1 ;x 2 . . .x jPj >, called as population, where, jPj is the number of solutions. As our approach is based on binary optimization, each solution is represented in the form of a binary vector. The size of the solution is set equal to the number of sentences in the document. For example, if a document consists of 10 sentences then a valid solution can be represented as [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]. This solution indicates that first, fourth, fifth, and seventh sentences of the original document should be in the summary. The initial population is generated randomly. While generating the solution, the constraint on summary length is taken into account as P s i 2Summary l i � S max , where, l i measures the length of sentence in terms of the number of words, S max is the maximum number of words allowed in the generated summary.

Objective functions used
To measure the quality of each solution in the population, a set of objective/fitness functions are evaluated. These functions are discussed in the previous section, and all are of maximization type. Note that optimization of these functions helps in getting a good quality summary.

SOM training
In this step, SOM [19,30] will be trained using the solutions in the population. In this paper, we have used the sequential learning algorithm to train the SOM. Readers can refer to [32] for more information. SOM will help in understanding the distribution structure of the solutions in the population. The solutions which are closer in the input space, come closer to each other in the output space (neuron grid in SOM).

Genetic operators
In any evolutionary algorithm, genetic operators help in generating new solutions. This set of new solutions forms a new population, P 0 . In our framework, from each solution, a new solution is generated using three genetic operators: mating pool generation, mutation, and, crossover. These genetic operators are described below: Mating pool generation. Using this operator, mating pool is constructed for each solution. It consists of a set of solutions which can mate to generate new solutions. For its constructions, neighboring solutions are identified using the trained SOM. Let us assume that we want to generate a new solution for current solution denoted asx current . Let β be some threshold probability. Then its construction steps are described below: 1. Identify the winning neuron 'h 0 in the SOM grid forx current using the shortest Euclidean distance criterion as b ¼ arg min 1�u�U kx current Àw u k, wherew u is the weight vector of u th neuron, U is the total number of neurons.
2. The solutions mapping to the neighboring neurons are identified by calculating the Euclidean distances between the position vector of neuron 'h 0 and other neurons' position vectors.
3. A random probability, r, is generated.
4. If r < β, then indices of the neurons are sorted based on minimum distance to winning neuron (h). Then, fix number of solutions mapped to sorted neuron indices are extracted to form a mating pool.
5. If r > β, then all solutions in the population are considered as a part of the mating pool.
Note that r < β and r > β exhibit the exploitation and exploration behaviour of the evolutionary algorithm, respectively. These are necessary phenomenon to explore the search space efficiently.
Mutation and crossover. Mutation and crossover operations result in the change in the values of the components of the current solutionx current , thus, generating new solution corresponding tox current . But, before performing the operation, three random solutions,x r1 ,x r2 and x r3 , from its constructed mating pool are selected and then a probability prototype vector is generated. If the component value of the prototype vector is found to be higher than some random probability lying between 0 to 1, then, that component value is replaced by 1 and vice versa. For more information, reader can refer to the paper [16].
It is important to note that during generation of the new solution, y@, all possible combinations of the mating pool (randomly chosen solutionsx r1 ,x r2 ,x r3 ) are tried, and mutation and crossover are performed against each combination and then constraint of summary length is checked. It may be possible that more than one combination may satisfy the constraint. In that case, only that combination is selected which is close to length constraint (considering the maximum number of words in the summary).

Selection of the best |P| solutions for next generation
This step includes selection of the best |P| solutions out of the old population (P) and new population (P 0 ). Note that size of population P 0 is equal to population P. To perform this operation, non-dominated sorting (NDS) and crowding distance operator (CDO) of NSGA-II are utilized [14]. NDS includes assignment of ranks to different solutions based on their objective functional values and puts them in different fronts. This phenomenon is shown in Fig 2. CDO identifies the solutions in a front which reside in more crowded region using the nearby solutions in objective space. Best solutions are selected based on their rankings until a desired number of solutions are obtained. In case of a tie in a front, solution having the highest crowding distance is given priority.
Example: Consider a problem where two objectives/fitness functions are to be maximized. Let the size of the population P be 3, i.e., |P| = 3 and the objective functional values are (6, 7), (5, 3), (2, 2) corresponding to solutions a, b and c, respectively. Let d, e, and f be the new solutions generated after applications of genetic operators. Let their fitness functional values be (6.5, 5), (5.5, 2.3), and (2.5, 1), respectively. After merging, total number of solutions will be 6 out of which only 3 solutions are passed to the next generation. Firstly, these solutions are ranked using NDS. After calculating ranking as per NDS algorithm, rank-1 solutions are {a, d}; rank-2 solutions are {b, e} and rank-3 solutions are {c, f}. As rank-1 includes two solutions, therefore they will be passed for next generation. Now, only (3 − 2) = 1 solution is left to be chosen from rank-2 set of solutions. Therefore, to select (3 − 2) = 1 solution, crowding distance operator is applied to rank-2 solutions and thus (3 − 2) = 1 solution is selected having highest crowding distance.

Updation of SOM training data
In this step, training data for SOM is updated. In the next generation, SOM will be trained using those selected solutions (out of the best solutions selected in previous step) which have not been seen before. It is important to note that updated weight vectors of the neurons in the Extractive single document summarization using fusion of SOM and multi-objective differential evolution current generation will now be treated as the initial weight vectors of the neurons in the next generation.

Termination condition
For any iterative procedure, termination condition is required. Therefore, in our work, proposed algorithm is repeated until a maximum number of generations (iterations), g max is reached. This step is shown by diamond box in Fig 1.

Selection of single best solution and generation of summary
At the end of the final generation, a set of non-dominated solutions on the final Pareto optimal front are generated by our MOO-based algorithm. Here, Pareto optimal front means all solutions are having equal importance to each other. Thus, it provides a flexibility to a decision maker to select a solution based on his/her requirement. In this paper, firstly, we have generated summaries corresponding to different solutions and then selected that solution which has the highest ROUGE-2 score. This is done to illustrate that our proposed approach can produce the best solution having highest Rouge score in comparison to the state-of-the-art techniques. We have reported the average Rouge score values corresponding to the best solutions (selected based on the highest ROUGE-2 values) for all documents. It also helps in proper comparison with existing methods which produce only a single solution.
Note that to calculate the Rouge score, gold/reference summary is used, which may not be available in real time situations. Therefore, a single solution from the final Pareto front should be selected after considering other criteria which do not use any supervised information. To address this issue, we have explored various methods to select the best solution. Let us name the approaches making use of supervised (available gold summary information) and unsupervised information for selection of single best solution from the final Pareto optimal front as SMaxRouge and UMaxRouge, respectively. The methods explored under UMaxRouge policy are explained below: To calculate these, firstly, for all the solutions of the final generation, the single objective function (for example, readability score) is analyzed, and then, the solution having the highest value based on chosen single objective function is considered as the best solution. Some combinations of these objective functions are also explored. In this case also, the solution with the highest value is considered as the best solution. For example: • MaxWeightSumAllObj: In this approach, summation of all objective functional values optimized in our approach is considered.
• MaxWeightSum2Obj: In MaxWeightSum2Obj, the summation of two objective functions, namely, sentence position and sentence similarity with the title is considered.
• MaxWeightSum3Obj: This is similar to MaxWeightSum2Obj. Only difference is that we have added one more objective function namely, cohesion.
2. Ensemble approach (EnSem): In this approach, we have firstly considered all the sentences which are present in the summaries corresponding to all generated rank-1 solutions of the final Pareto optimal front. Then the frequency of occurrence of each of these sentences over different summaries corresponding to different rank-1 solutions is calculated as per Eq 15.
Sentences are then sorted based on their frequencies of occurrence and those are added one by one as per their sorted order in the final summary until the desired length is reached. Let |PS| is the number of rank-1 solutions, PSS is the set of all unique sentences present in the summaries corresponding to PS number of solutions. Let us assume that we want to count the frequency of occurrence of ith sentence, i.e., sent i , belonging to PSS. Then, the following equation is followed: where, PS k is the kth summary corresponding to kth solution of a document. Same equation (the above Eq) was followed to calculate the count of remaining sentences belonging to PSS. Two other variations of the ensemble approach are also tried. After collecting the sentences of rank-1 solutions (merged pool), they are sorted based on (a) maximum length; (b) maximum sentence to title similarity. For both cases (a) and (b), final summary is generated by adding the sentences from the merged pool one by one following their sorted order until the desired length is reached. In this paper, the approaches corresponding to (a) and (b) are named as EnSemMaxLen and EnSemMaxSentTitle, respectively.

Sentence and Word embedding (MinReconsError):
This approach is based on the semantic similarity between the document and the generated summary. The motivation behind this idea is to check whether the generated summary can represent the central theme of the document or not. The solution having maximum semantic similarity [53,64] will be considered as the best solution. In this approach, firstly, we generate the sentence vectors of all sentences present in a particular document by averaging the word vectors (word-embedding) of the words present in the sentences. To get the word vectors, we have used the pretrained word2vec model on [53] GoogleNews corpus which contains 3 billions words and each word vector is of 300 dimension. A document theme is represented by averaging the sentence vectors of that document. Then the similarity is calculated between sentences present in the summary and document theme vector. The solution with summary having highest similarity will be treated as the best solution. In other words, we can say that a solution will be treated as the best solution if it has the minimum reconstruction error (ReconsError) which is defined as where DocVec is the vector representing document's theme, SentVec i is the ith sentence vector of jth summary (or summary corresponding to jth solution), K is the number of sentences in jth summary, kDocVec − SentVec i k 2 is the Euclidean distance between document vector and ith sentence vector. In the current paper, we name this approach as MinRecon-sErrorWord2vec.
Performing the averaging of word vectors to get the sentence vector and then averaging the sentence vectors to obtain the document vector, somehow reduce the semantics of sentence and document [65] vectors. Therefore, we have tried another approach based on Doc2vec [66]. Its performance is shown to be good when trained on large corpora with pre-trained word-embedding [66]. From the trained model, we can directly get the document vector and sentence vector [67]. Here also we want to minimize the reconstruction error between document vector and generated summary as mentioned in Eq 16. Let us name this approach as MinReconsErrorDoc2vec.
4. Maximum distance from the origin (MaxObjDistOrigin): As six objective functions used in our proposed approach are of maximization type; therefore, here, we have calculated the Euclidean distance between the origin having position (0, 0, 0, 0, 0, 0) and objective functional values of the solution. The solution having the largest distance is selected as the best solution.
Note that, the sentences, present in the final summary, are reported based on their occurrences in the original document. For example, the sentence which appears first in the document will be the first sentence in the summary.

Experimental setup
This section presents the datasets used for the experimentation, evaluation metrics to measure the performance, comparing methods, followed by parameter settings. All the proposed approaches were implemented on Ubuntu server having Intel Xeon CPU 2.20 GHz with 256 GB of RAM.

Datasets
In order to show the effectiveness of the proposed approach and to show that performance not only depends on the chosen objective functions, but, also depends on the type of similarity/dissimilarity measures used, two benchmarks datasets namely, DUC2001 and DUC2002 from Document Understanding Conference (https://www-nlpir.nist.gov/projects/duc/data.html0) are used. These contain 309 and 567 news reports (in the form of documents), respectively, written in English. For each document, the original/actual summary is available in approximate 100 words for single document summarization. In addition to these datasets, we have also used the CNN dataset [8,23] which contains news articles collected from CNN news site https://edition.cnn.com/. It consists of 3000 news articles/documents out of which only 50 articles are made available on https://sites.google.com/view/doceng19-extesu/home?authuser=0 by the authors (at the time of submission). Their actual summary includes 3-4 sentences on an average. Note that our proposed algorithm is fully unsupervised in nature in the sense that it does not use any actual summary information for generating the summary. Actual summary is utilized only for evaluation of our generated summary at the end of execution of our algorithm. A brief description of the used datasets is provided in Table 1.

Comparing methods
We have compared our proposed system with 13 existing systems. Some methods use supervised approaches, while, others used neural network. Some of the comparing algorithms are also based on optimization techniques to improve the ROUGE score. The names of the existing systems used for comparison are Unified Rank [68], MA-SingleDocSum [9], Manifold Ranking [22], QCS [25], CRF [10], NetSum [11], SVM [20], DE [6], FEOM [26], SummaRuN-Ner [13], NN-SE [12], COSUM [28], ESDS-GHS-GLO [27]. These works except [12,13] make Extractive single document summarization using fusion of SOM and multi-objective differential evolution use of both DUC2001 and DUC2002 datasets for reporting the performance of summarization systems. In addition to these methods, in paper [23], five regression-based methods are proposed, namely, LeastMedSq, Linear Regression, MLP Regressor, RBF Regressor, and SMOreg, which differ in terms of machine learning classifier used. Out of these regression-based models, Linear Regression and LeastMedSq performed the best for DUC2001 and DUC2001 datasets, respectively. Therefore, these best methods are also considered for comparison purpose. Note that [12,13] make use of only DUC2002 dataset. Therefore, for a fair comparison, results are directly taken from these reference papers. Above discussed techniques are already described in literature survey.

Evaluation metrics
To evaluate the performance of the proposed architecture, we have utilized the ROUGE measure [69]. It measures the overlapping units between the actual/gold summary and our predicted summary. More will be the ROUGE score, closer will be our summary with respect to the actual summary. The mathematical definition of ROUGE score is defined below: Where N represents the length of n-gram, Count match (N-gram) is the maximum number of overlapping N-grams between actual summary and the generated summary, Count(N-gram) is the total number of N-grams present in the actual summary. In our experiment, N takes the values of 1 and 2 for ROUGE−1 and ROUGE−2, respectively. In addition to the ROUGE score, we have also reported another evaluation measure namely, BLEU or the Bilingual Evaluation Understudy [70]. It is generally used in machine translation system and compares a candidate translation of text to one or more reference translations, but, can also be used in text summarization task. It also counts the n-gram-overlap between system generated summary and available actual summary, but, with one difference: ROUGE-N considers N-grams (1-gram or 2-gram etc.) at a time, while, BLUE considers various sizes of N-grams (in our case, 1-gram, 2-gram, 3-gram, 4-gram) simultaneously to compute the same. For mathematical definition of BLUE, the reader can refer to [70]. It is important to note that the existing work doesn't report BLUE score, therefore, we have reported the same only for our best proposed approaches when applied on three datasets.

Parameter settings
Different parameter values used in our proposed framework are-DE parameters: jPj = 40, mating pool size = 4, threshold probability in mating pool construction (β) = 0.7, maximum number of generations (g max ) = 25, crossover probability (CR) = 0.2, b = 6, F = 0.8. SOM parameters: initial neighborhood size (σ 0 ) = 2, initial learning rate (σ 0 ) = 0.6, training iteration in SOM = jPj, topology = rectangular 2D grid; grid size = 5 × 8. Sensitivity analysis on DE parameters and SOM parameters can be found in [16] and [32], respectively. Inspired by these works, similar values of parameters are utilized in the current work. Importance factors/weight values assigned to different objective functions: α = 0.25, β = 0.25, γ = 0.10, δ = 0.11, λ = 0.19, ϕ = 0.10; System summary: length (in words) = 100 words. In most of the existing literature [16,62], similar weight values of importance factors are considered. Results obtained are averaged over 10 runs of the algorithm. Word Mover Distance makes use of pre-trained word2vec model on GoogleNews (https://github.com/mmihaltz/word2vec-GoogleNews-vectors) corpus to calculate the distance between two sentences. Table 2 reports the ROUGE scores obtained by our proposed approaches using different similarity/dissimilarity measures (NGD, CS, WMD) and different state-of-the-art methods on DUC2001 and DUC2002 data sets. Note that these results are generated by our proposed approach with SMaxRouge strategy of selection of a single best solution from the final Pareto optimal front as mentioned in section 'Selection of Single Best Solution and Generation of Summary'. To illustrate the utility of incorporating SOM based genetic operators in the DE process, results are also reported for multi-objective binary DE-based summarization approach with standard genetic operators of DE (without using SOM). It can be observed that our approaches using discussed similarity/dissimilarity measures outperforms all other approaches for both the data sets in terms of ROUGE-1 and ROUGE-2 scores. The best ROUGE scores as reported in Table 2 for both the datasets were obtained using Approach-1 with SOM-based genetic operators and WMD as the similarity measure. Thus it can be concluded from obtained results that the use of different sentence similarity/dissimilarity Table 2. ROUGE scores attained by different methods for DUC2001 and DUC2002 data sets. Here our proposed methods are executed using Normalized Google Distance (NGD), Cosine Similarity (CS) and Word Mover Distance (WMD), and, SMaxRouge strategy is used for selecting a single best solution from the final Pareto front. Here, † denotes the best results; it also indicates that results are statistically significant at 5% significance level; xx indicates results are not available in reference paper. For LeastMedSq and Linear Regression methods, results in the reference paper are presented up to 4 decimal points, therefore, to make a fair comparison up to 5 decimal points, we have added 0 as the last decimal digit such that their results remain unchanged. Similar case also applicable to NN-SE and SummaRuNNer methods. Extractive single document summarization using fusion of SOM and multi-objective differential evolution measures and self-organized multi-objective differential evolution for optimization indeed helps in achieving improved performance.

Results and discussion
As Approach-1 (as per results of Table 2), utilizing word mover distance, performs best, therefore, we have evaluated the same approach on the third dataset namely, CNN. The corresponding results are reported in the Table 3. Here, results are shown only for 50 articles. Although, there exist papers [8] and [23] which use 400 and 3000 articles of CNN, respectively, but, it will be unfair to compare our results with these papers due to unavailability of complete dataset for CNN. Moreover, the codes of these papers, [8] and [23], are not available. Therefore, for comparison purpose, we have used our own Approach-2 utilizing WMD. Note that results of Approach-1 and Approach-2 for CNN dataset are shown using with and without SOM-based genetic operators. From Table 3, it can be observed that Approach-1 using WMD as a dissimilarity measure and SOM as a genetic operator, performs the best which was also the case for DUC2001 and DUC2002 datasets.

Comparison of results using BLEU score
The results in terms of BLEU score corresponding to three datasets are reported in Table 4. Note that from the final set of Pareto optimal solutions, a particular solution sol1, maybe best w.r.t. ROUGE-2 F1-measure, but, may not be best w.r.t. BLUE score. Therefore, the average BLUE score reported in the Table is obtained by selecting the best solution from the final set of Pareto optimal solutions based on maximum BLUE score. As the existing approaches do not report the BLUE score values; therefore, for the purpose of comparison, we have considered our best approach, Approach-1 utilizing WMD in comparison with Approach-2 (WMD). Here also, results of these approaches are illustrated using SOM and without SOM-based genetic operators. From this Table, it can be observed that Approach-1 (WMD) using SOMbased genetic operators performs best and is able to attain BLUE score values of 0.32623, 0.21641, and, 0.62009 for DUC2001, DUC2001 and CNN datasets, respectively.
As any evolutionary algorithm generates Pareto optimal solutions in the final generation, therefore, we have shown the Pareto optimal fronts obtained (over one random document of DUC2001/DUC2002) after the application of the proposed Approach-1 (WMD) with SOMbased operators in the Fig 3. These fronts correspond to first, fourteen, nineteen and twenty- Extractive single document summarization using fusion of SOM and multi-objective differential evolution fifth generations. Note that it is difficult to plot Pareto optimal fronts for six objective functions. Therefore, we have shown the projected Pareto optimal fronts in three objective space (as shown in Fig 3). The following three subsections will discuss the results obtained using different distance/similarity measures on DUC2001 and DUC2002 datasets. Discussion of results obtained using normalized google distance (NGD). In Table 2, considering all cases (both approaches, with SOM and without SOM based genetic operators), our results beat other existing methods. The best ROUGE scores for both the datasets were obtained using Approach-1 with SOM-based genetic operators. On comparing the results of Extractive single document summarization using fusion of SOM and multi-objective differential evolution Approach-2 with SOM and without SOM-based operators for DUC2002 dataset, it was observed that ROUGE-2 and ROUGE-1 scores are higher in case of Approach-2 without SOM-based operators. But, the difference is not much significant when compared using SOM based operators.

Discussion of results obtained using cosine similarity (CS)
In Table 2, considering all cases (both approaches, 'with SOM' and 'without SOM' based genetic operators), it can be concluded that our proposed approaches outperform other existing methods. Out of both operators in Approach-1 utilizing WMD, 'with SOM' operator perform well. On comparing the results of Approach-2 using 'with SOM' and 'without SOM' based operators for the DUC2002 dataset, it was observed that ROUGE-2 and ROUGE-1 scores are higher in case of 'without SOM' based operators. However, the difference is not much significant when compared using SOM based operators.

Discussion of results obtained using word mover distance (WMD)
In Table 2, considering all cases (both approaches), it was found that Approach-1 obtains the best ROUGE scores with SOM-based genetic operators for both the datasets. This result is also the best when comparing with other similarity/dissimilarity measures. One of the reasons behind this improved performance is the ability of WMD in capturing semantic relationships between sentences. Another possible reason is the use of SOM-based operators which helps the algorithm to reach the optimal solution having good ROUGE scores. Time taken to generate summary using Approach-1 with SOM-based operators for DUC2001 is 32 second/document, while the same approach without SOM based operators takes 29 second/document. For DUC2002, Approach-1 with SOM and without SOM-based operators, take the almost same time, i.e., 20 second/document. Note that these reported times exclude the time taken to calculate similarity/ dissimilarity between two sentences, which is approx 10-20 second in case of WMD.
Analysis on conflicting behaviours of the two objective functions. It should be noted that ROUGE score measures the informativeness of the summary obtained and makes use of the actual summary, thus, can be considered as a type of extrinsic measure. But, to measure the quality of the summary, we have also reported an intrinsic measure (independent of actual summary), readability, of the summary which was one of objective functions in our proposed approach. This was done because it is one of the major concerns in any summarization system. The corresponding results for DUC2001, DUC2001, and CNN datasets, are shown in Table 5. These results correspond to the summaries obtained whose ROUGE scores are reported in Tables 2 and 3. Here also, we have used our best approach (Approach-1) utilizing WMD as a dissimilarity measure. For comparison, we have used the Approach-2. Results are shown using SOM and without SOM-based genetic operators. Higher the readability score, easier will be Extractive single document summarization using fusion of SOM and multi-objective differential evolution the understanding of the summary for the end-users or in other words, more readable it is. From Table 5, it can be inferred that maximum readability scores of 0.43362 and 0.44392 were obtained by Approach-1 (WMD) using SOM-based operator for DUC2001 and DUC2002 dataset, respectively. On the other hand, for CNN dataset, maximum readability score was attained by the same approach, but, without using SOM-based genetic operator. As in any multi-objective optimization based approach, objective functions are generally conflicting in behaviour or in other words, one solution may be good in terms of one objective, while, may not be good w.r.t. another objective function. We can also say that increase in one, may decrease in another objective functional value in a single solution. Therefore, we have reported another intrinsic measure, i.e., coverage (COV), in the same Table. COV is another concern of any summarization system and is used as one of the objective functions in our optimization strategy. After observing the results, it can be inferred that Approach-1 (WMD) utilizing SOM has highest coverage of 0.97735 for CNN dataset, but, for remaining datasets, highest coverage was obtained using Approach-1 (WMD) using without SOM-based operators. These results illustrate the conflicting behaviour of coverage and readability. Note that we have omitted the discussion on other objective functions to avoid a longer discussion.

Study on different methods of selecting a single best solution from final pareto front
In Table 2, we have shown the best results produced by our proposed approaches utilizing SMaxRouge strategy for selecting a single best solution from the final Pareto front. But, in real time situations, actual summary may not be available. Therefore, we have explored various unsupervised methods under UMaxRouge strategy to generate a single summary out of multiple solutions on the final Pareto optimal front as discussed in section 'Selection of Single Best Solution and Generation of Summary'. Corresponding results are reported in Table 6. It is important to note that among-st different proposed approaches, Approach-1 (WMD) performs the best with SMaxRouge strategy for the selection of single best solution; therefore, unsupervised methods are explored under this approach only. It can be observed from Table 6 that the method, MaxWeightSum2Obj, is able to beat the remaining approaches for DUC2002 dataset; having Rouge-1 and Rouge-2 scores of 0.51191 and 0.24871 (using SOM based operators), respectively, but, these scores are less than Rouge-1 and Rouge-2 scores of 0.51662 and 0.28846, respectively, which were the best results attained by SMaxRouge strategy. For DUC2001 dataset, using MaxWeightSum2Obj, we obtain better results in terms of Rouge-1 score having value 0.20839, but, it is just close to the best result of existing approaches. But, most of the approaches under UMaxRouge strategy are not able to select the best solution as selected by SMaxRouge strategy. Hence performances of these approaches are poorer compared to SMaxRouge strategy as reported in Table 2. Ensemble based approach in general performs well. But, as there are a large number of non-dominated, a variety of solutions (a solution 'sol i ' may be good in terms of 'sentences to title similarity' objective as compared to 'sol j '. On the other hand, solution 'sol j ' may be good in terms of cohesion objective which is of low priority in our approach) and we have considered the sentences belonging to these solutions to generate the final summary, the ensemble approach does not perform better than SMaxRouge strategy.
After observing the results obtained by MaxCoh, MaxCov, MaxRead, MaxSenLen, MaxSen-Pos, MaxSimTitle approaches of selecting a single best solution (based on maximum value of single objective function) it was concluded that these approaches are also not able to extract the best solution from the final Pareto optimal front. Only the approach, MinReconsError-Doc2vec is able to perform well and beats the existing algorithms. But, there are slight variations in the results as reported in Table 2. In summary, it can be concluded that solutions selected using MinReconsErrorDoc2vec under UMaxRouge scheme are very similar to those selected by SMaxRouge scheme (refer to Table 2) where available reference/gold summary is utilized for selecting single best solution. Thus performances of the proposed approaches under MinReconsErrorDoc2vec and SMaxRouge strategies are similar. But the MinReconsEr-rorDoc2vec scheme does not utilize any available supervised information. Thus the use of the MinReconsErrorDoc2vec scheme is recommended with the proposed approaches for selecting the single best solution from the final Pareto front. Note that doc2vec used in this approach was trained using DUC2001, DUC2002, DUC2006 and DUC2007 data sets utilizing implementation available at https://github.com/jhlau/doc2vec under the default parameters mentioned in that link and makes use of pre-trained model on googlenews corpus. DUC2006 and DUC2007 are the standard summarization datasets consisting of 50 and 45 document sets, respectively. Table 6. ROUGE scores obtained using Approach-1 (WMD) when the best solution is selected using any of the strategies under UMaxRouge strategy. All the strategies explored here for selecting a single best solution from the final Pareto front are unsupervised in nature. Bold entries indicate they are able to beat the state-of-the-art algorithms.

Convergence speed
To demonstrate that our proposed approach converges faster compared to existing algorithms, we have summarized the population size and the number of fitness function evaluation (NFE) used by different algorithms in Table 7. NFE is calculated as: The time complexity of any optimization algorithm depends on the number of fitness function evaluations. Table 7 clearly demonstrates that our proposed approach evaluates less or equal number of functions compared to other existing state-of-the-art techniques. Despite of this, the ROUGE score values attained by our proposed approach are better than those attained by the existing techniques. This proves that our approach converges much faster compared to the existing techniques. As our algorithm is based on optimization strategy, therefore, in Table 7, only algorithms based on some optimization strategies are compared.
We have also shown the convergence plots obtained by our proposed approach for some random documents in Fig 4. Maximum Rouge-1 and Rouge-2 score values attained by our approach over the generations are plotted. These figures show that Approach-1 (WMD) with SOM converges to a Rouge-1/Rouge-2 value after a particular iteration (as there is no change in Rouge-1/Rouge-2 score values after that iteration). This also proves the faster convergence of our approach towards the near optimal value of Rouge score (in comparison to other approaches).

Improvements obtained
We have also calculated the performance improvements obtained (PIO) by our best approach under SMaxRouge strategy to select a single best solution from the final Pareto front in comparison to existing methods using the ROUGE−2 and ROUGE−1 scores and those values are shown in Table 8. These improvements correspond to the best results when using Approach-1 (WMD) with SOM-based operators. Mathematically, PIO is defined as: Here, improvements obtained by our proposed approach compared to MA-SingleDocSum and DE are 45.16% and 4.98%(� 5%) respectively, for the DUC2001 dataset, considering ROUGE−2 and ROUGE−1 scores. While for DUC2002 dataset, improvements obtained by our approach compared to MA-SingleDocSum and COSUM are 26.3% and 5.25%, respectively. After comparing with the latest work on summarization [13] based on neural network, we obtained 20.70% and 8.99%(� 9%) improvements over ROUGE-2 and ROUGE-1 scores, respectively, for the DUC2002 dataset. In summary, for DUC2001 dataset, minimum 38.57% and 5.24% improvements are obtained over the existing techniques in terms of ROUGE-2 and ROUGE-1 score, respectively. While for DUC2002 dataset, mimimum 20.60% and 3.70% improvements are obtained over the existing techniques in terms of ROUGE-2 and ROUGE-1 score, respectively. Extractive single document summarization using fusion of SOM and multi-objective differential evolution

Error-analysis
In this section, we have thoroughly analyzed the errors made by our proposed approach, Approach-1 (with SMaxRouge strategy of selection of a single best solution from the final Pareto optimal front) with SOM-based operators using WMD as similarity/dissimilarity measure between sentences (as this approach gives the best result). Some random documents from DUC2001/DUC2002 are selected to perform error-analysis from each dataset. It has been observed that proposed approach generates less informative summary if a document length is very large because of length constraint. Some parts of the lines in predicted and reference/ actual summary do not match because some sentences in the actual summary were generated by human annotators. In Fig 5, an example of generated summary by our proposed algorithm is shown corresponding to document AP881109-0149 of topic d21d under DUC2001 dataset. The same color shows matching lines, and the beginning of a line is indicated by [Line-number]. Here, the generated summary covers most of sentences in actual summary having 0.8115 and 0.6383 as ROUGE-1 and ROUGE-2 scores, respectively. Therefore, it is considered as a good summary. Fig 6 shows an example of a predicted summary which does not seem to be good, and the corresponding values of ROUGE-1 and ROUGE-2 scores are 0.44 and 0.1276, respectively. The possible reasons could be the generation of reference summary by human annotators. Our developed approach is based on extractive summarization. Therefore, it selects direct sentences from the document to be present in the generated summary, but, it is not capable of restructuring the sentences. For example, consider Line-1 of Fig 6 in the predicted summary which is too long in original document, but is shortened by annotators in Line − 1 of reference summary to allow the reference summary to cover other themes of the main document (as more number of words can be added to reach the desired summary length). However, our predicted summary is not able to cover the whole idea of the document as the selection of Line − 1 increases the number of the words in summary and not many sentences can be added because of restriction in the number of words in summary. Extractive single document summarization using fusion of SOM and multi-objective differential evolution

Statistical significance t-test
To validate the results obtained by the proposed approach, a statistical significance test named as, Welch's t-test [71], is conducted at 5% significance level. It is carried out to check whether   Table 9. Test results support the hypothesis that obtained improvements by the proposed approach are not occurred by chance, i.e., improvements are statistically significant.

Study on effectiveness of SOM based operators on DUC2001 and DUC2002 datasets
Note that difference in the Rouge-1/Rouge-2 score values attained by 'with SOM' and 'without SOM' versions of Approach-1 (WMD) (shown in the Table 2) seems to be very small. In order to further investigate the issue, we have carried out the following analyses: (a) plotted the box plots; (b) performed the t-test. Detailed information about these are given below: where, R1 and j indicate the Rouge-1 score and rank-1 jth solution, respectively. Similar steps are followed to calculate the average Rouge-2 value. Following the above process, average Rouge scores are calculated for all the documents. This is done because we have reported the average Rouge-1/Rouge-2 scores of the best solutions of all documents in Table 2 and the best solution is one of the highest ranked solutions. Note that the best results are obtained using Approach-1 (WMD). Therefore, box plots are drawn for this method. From Fig 7(a) and 7(b), it is evident that Approach-1 with SOM-based operators attains better median values of the average of Rouge-1/2 values of rank-1 solutions of all documents for DUC2001 and DUC2002 datasets, respectively, in comparison to those obtained by 'without SOM based operators'. Also for both the datasets, Approach-1 (WMD) using SOM-based operator covers solutions having a high range of Rouge-1/  Extractive single document summarization using fusion of SOM and multi-objective differential evolution 2. t-test: We have also conducted t-test to show the significant difference between the Rouge recall values obtained by two versions (with SOM and without SOM based operators) of Approach-1 (WMD) under SMaxRouge scheme. The p-values (at 5% significant level) attained by these approaches are reported in Table 10. Extractive single document summarization using fusion of SOM and multi-objective differential evolution These p-values obtained on DUC2001 dataset clearly show that Approach-1 (WMD), when used with SOM based operator significantly improves the results. But, on DUC2002 dataset, results obtained are not significant as Rouge score values attained by Approach-1 (WMD) with SOM based operators are close to those attained by Approach-1 (WMD) without SOM based operators. However, from Figs 7-9, it is appropriate to say that there exist a set of documents for which our approach is able to determine good quality solutions with high Rouge scores in less number of iterations when used with SOM based operators.

Ranking of methods
We have also calculated the ranking scores of different methods using the Unified Ranking [9] method by considering the individual ranks of different methods with respect to different measures as shown in Table 11. It is created using Table 2. This method was first proposed by Ramiz M. Aliguliyev [72]. While calculating the ranking, we have excluded the rankings of NN-SE, and SummaRuNNer approaches as their results for the DUC2001 dataset are not available in the reference papers. The resultant rank of each method is calculated as follows: Ranking scoreðmethodÞ ¼ where, the number, 12 denotes the number of methods in comparison including the proposed one, R p denotes how many times a method appears at the pth position. Finally, the method having the highest Ranking_score is assigned the highest rank. From the Table 11, we can see that Approach-1 with SOM based operators, when used with word-mover-distance, is selected at rank 1.

Complexity analysis of the proposed approach
In this section, the complexity of the proposed approaches both for with SOM and without SOM based genetic operators are analyzed. Let N be the number of solutions, M be the number of objectives to be optimized, T be the maximum number of generations. With SOM.
1. Population initialization step takes OðNÞ time as there are N solutions which are randomly initialized using a binary vector obeying some constraint. Each solution undergoes objective function calculation step which takes OðNMÞ time. Thus, the total time complexity of population initialization is OðN þ NMÞ which is equivalent to OðNMÞ.
2. The solutions in the population undergo SOM training which takes OðN 2 Þ time [73]. 5. Evaluation of dominance and the non-dominance relationships between 2N solutions (after merging old population and new population) and then the selection of best N solutions take OðMN 2 Þ time [14].
Steps-2 to 5 are repeated up to T number of generations. Note that updation of SOM training data takes constant time. So it can be ignored. Thus, the total time complexity of the proposed architecture with SOM based operators is which is the worst time complexity of our approach when using SOM based genetic operators.
Without SOM-based genetic operators. In the proposed architecture without SOM based genetic operators, step-2 and step-3 will not be there. Here, the mating pool for each solution is the entire population. Other steps will remain the same. Thus total time complexity without SOM based genetic operators is which is the same as the time complexity of the proposed architecture when developed with SOM based genetic operators.

Conclusions and future works
In this paper, an extractive single document text summarization (ESDocSum) system is developed. The key-contributions of the proposed approach are the following: 1) a self-organized multi-objective binary differential evolution based technique is proposed for summary extraction. It utilizes the topological space identified by SOM to develop some new genetic (selection) operators; 2) the similarity/dissimilarity between two sentences is calculated utilizing three measures: normalized google distance, word mover distance and cosine similarity to show that summarization results not only depend on proposed framework but also depend on type of similarity/dissimilarity measures used; 3) six objective functions are utilized for selecting a good subset of sentences present in the document; 4) various unsupervised methods are explored to select a single best summary from the available set of summaries on the final Pareto optimal front; 5) results on standard datasets prove the efficacy of the proposed technique in comparison to state-of-the-art in terms of faster convergence and better ROUGE scores.
Experimental results demonstrate that our SOM-based approach with WMD as a distance measure has obtained 45% and 5% improvements over the best existing method considering ROUGE−2 and ROUGE−1 scores, respectively, for the DUC2001 dataset. While for the DUC2002 dataset, improvements obtained by our approach are 20% and 5%, considering ROUGE−2 and ROUGE−1 scores, respectively. Results are also validated using statistical significance test.
As the performance of summarization system depends on the type of similarity/dissimilarity measure used and also depends on the dataset, therefore, in future, we will try to make the similarity/dissimilarity measure selection automatic for different datasets. In future, we also want to extend the current approach for solving multi-document summarization problem.