Using full-text content to characterize and identify best seller books: A study of early 20th-century literature

Artistic pieces can be studied from several perspectives, one example being their reception among readers over time. In the present work, we approach this topic from the standpoint of literary works, assessing in particular the task of predicting whether a book will become a best seller. Unlike previous approaches, we focus on the full content of books and consider both visualization and classification tasks. We employed visualization, involving SemAxis and linear discriminant analysis, for a preliminary exploration of the structure and properties of the data. To obtain quantitative and more objective results, we employed various classifiers. These approaches were applied to a dataset containing (i) books published from 1895 to 1923 and listed as best sellers in the Publishers Weekly Bestseller Lists and (ii) literary works published in the same period but not mentioned in those lists. Our comparison of methods revealed that the best result, obtained by combining a bag-of-words representation with a logistic regression classifier, was an average accuracy of 0.75 for both leave-one-out and 10-fold cross-validation. This outcome highlights the difficulty of predicting the success of books with high accuracy, even when using the full content of the texts. Nevertheless, our findings provide insights into the factors leading to the relative success of a literary work.


I. INTRODUCTION
Understanding the factors and reasons determining the effectiveness and acceptance of given pieces of artistic or scientific work represents a continuing challenge in artificial intelligence (e.g., [4,7,16,28,30]). As is often the case with complex systems, not only is a large number of factors potentially involved, but their individual and combined effects also tend to be highly non-linear. As a result, small effects can lead to considerable impacts, which are also likely to vary over time and space in ways that are hard to predict.
Among the aspects most likely to influence the visibility and accomplishment of an artistic piece are its intrinsic quality, its innovation, and its affinity with the main trends, interests, and expectations predominating in a given period and place. These three aspects are not only challenging to define but even more so to predict, which has motivated growing interest from the scientific community (e.g., [32]).
A better understanding of why an artistic piece becomes successful is a particularly interesting objective for several reasons: (i) this type of study can motivate the development of new concepts and methods capable of quantifying the three main aspects identified above, namely the quality, innovation, and affinity of an artistic piece; (ii) this kind of research has great potential for revealing important aspects of the mechanisms underlying human preferences for specific subjects and styles over time and space; (iii) such developments can lead to strategies for predicting the acceptance of certain types of works, which may provide support and motivation for developing new and more effective artistic pieces.
The present work aims at studying whether it is feasible to characterize and identify stories and narratives listed as best sellers by combining full-text content information and machine learning models. To this end, the textual content of a set of books was modeled, and a series of experiments assessed the possibility of automatically differentiating a best seller from an ordinary book. In particular, we employed a dataset encompassing the full-text content of literary works collected from the Project Gutenberg platform. The dataset was split into two categories: success (books that appear at least once in the Publishers Weekly Bestseller Lists) and others. After a preprocessing step (removal of stopwords, tokenization, and lemmatization), the content of each book was represented as an embedding by using the bag-of-words [17] and doc2vec [15] approaches.
Finally, we employed different strategies to assess the prediction of the success of books in terms of their embedding representations, including: (i) visualization approaches, namely the linear discriminant analysis (LDA) [12] and SemAxis [2] techniques; (ii) classification approaches, encompassing different models and cross-validation strategies.
In contrast to previous studies, here we rely on one of the prime published sources of best seller book lists, namely the Publishers Weekly Bestseller Lists, which comprise the best-selling books of every year since 1895. Although its criteria for defining a book as an absolute success are not entirely specified, it is established that every considered paperbound book sold at least 2,000,000 copies, and every selected hardbound book sold 750,000 copies or more.
It is also settled that Publishers Weekly only regards books distributed through the trade (that is, bookstores and libraries), not including those sold by mail or book clubs [14].
Besides that, our work adds to the short list of studies that investigated the success factor by analyzing the full-text content of books, subsequently modeling it through embeddings, and examining it both qualitatively (applying visualization and seeking words that lead to discrimination) and quantitatively (involving supervised classifiers).
The obtained results suggest that it is infeasible to predict the success of a literary work with high accuracy by using only its full-text content. The best classification accuracy obtained was 0.75, combining a bag-of-words representation with a logistic regression model, which is a moderate outcome. Nonetheless, our experiments indicate that the subject of the books does not seem to be a core factor for a title becoming a best seller and that there are words more typically found in this category of books.
This work is organized as follows. Section II presents and discusses the related works. In Section III, we present the research questions. Section IV describes the dataset used. Section V describes the methodology adopted to analyze the books, including text preprocessing, representation, visualization, and classification. The results and discussions are reported in Section VI. Finally, in Section VII, we present the conclusions and future works.

II. RELATED WORKS
The study conducted in [34] analyzed the success of books using as reference The New York Times Best Sellers, which includes a list of best-selling books in the United States. The authors considered the books appearing on the list between August 2008 and March 2016.
As additional information, the sales patterns of books were also considered by using data from NPD BookScan [34]. Several interesting results were reported. Fiction books were found to be more likely to become best sellers, while nonfiction books tended to be sold with lower intensity. The authors also proposed a model that can accurately measure long-term impact, since it can predict the number of copies sold by best sellers shortly after their release. The proposed description was found to be consistent with a previous model devised to describe the attention received by scientific papers [30]. The authors argue, therefore, that the underlying processes of attention are similar, despite the differences in time scale.
A model to predict book sales was proposed in [32]. The authors used as a dataset the NPD BookScan, focusing on a list of the 10 thousand top-selling books in a given period. A machine learning approach was proposed using different book features. Authors' visibility was taken into account by measuring the public interest in authors via Wikipedia page views. Previous sales were also considered as a feature to measure the previous success of authors. Book features included genre (e.g., horror and science fiction) and topic information (as provided by readers). In addition, publishers' information was used. All features were combined in the so-called Learning to Place (L2P) machine learning algorithm [31], which aims at placing a new instance (i.e., predicting book sales) within a sequence of previously published books. This study found that, in both fiction and nonfiction books, the publisher quality tends to play an important role in the prediction. The visibility of authors was also found to be an important feature, as more visible authors are more likely to sell more copies. Finally, the other factors related to the text content itself (e.g., genre and topic information) were found to play a relatively minor role in the prediction model.
Differently from previous works that did not take into account the textual content [32,34], the relevance of writing style was analyzed in [3]. The authors analyzed full books from different genres (e.g., adventure, mystery, fiction). The dataset was collected from the Project Gutenberg repository. Several linguistic markers of writing style were used to characterize the texts. Examples include lexical features, distribution of grammar rules, and sentiment analysis. The authors used an SVM as classifier [9], and download counts were used as a surrogate for the visibility of books. Additional information such as award recipients and the number of copies sold was also used to quantify success. The authors concluded that the stylistic metrics used are effective in quantifying the success of novels.
Because only a few works have analyzed the content of books to predict if they will become best sellers, in the current study we focus our analysis on full-textual features to discriminate between best sellers and ordinary literary works.

III. RESEARCH QUESTIONS
This study aims to probe whether the full-text content of a book alone can indicate if it will become a best seller. While there are several ways to represent a text, we focused on the most common approaches devoted to representing long texts. For this reason, here we also investigate which text representation better grasps the information about a book becoming a success. Finally, to recognize patterns in the textual data, we also examined which classifier is the most appropriate for discriminating between successes and ordinary books.
Briefly, the main research questions here are:
1. Is it possible to predict the inclusion of books into best seller lists by analyzing only their full-text content?
2. Can one use bag-of-words and neural network embeddings to detect informative attributes for identifying best sellers?
3. Can the abovementioned embeddings be influenced by the subject headings available in the dataset (such as genre or literary class)?
4. How different is the performance of supervised classifiers in discriminating between the two categories of books analyzed?

IV. DATASET
As the main objective of this work is to understand whether it is possible to identify and characterize styles and stories classified as best sellers, our dataset was composed of two categories: success and others. In the first, we included all books considered best sellers; in the latter, literary works not listed as such (at least not in the analyzed period and the consulted list).
Two best seller book lists were considered in the present work: The New York Times Best Sellers, first published in 1931, and Publishers Weekly Bestseller Lists, published for the first time in 1895. In the case of the first list, in the period from 1931 to the present day, only 18 titles were available on the Project Gutenberg [22] platform (a digital library whose collection is composed of full texts of books in the public domain). In the case of the second list, we mapped 110 titles (published from 1895 to 1924), which therefore became part of our dataset, composing the success category.
To select the titles of the other category, we considered the collection of books (a) published in the same period as the selected successful ones and (b) not included in the best seller lists of Publishers Weekly. In this sense, if the success class had ten titles published in 1923, the other class would have the same number of titles published in the same year, with the titles randomly selected from the Gutenberg repository. At the end of the process, this category contained 109 titles (one less than the other category, as it was infeasible to collect the same number of titles for all the years considered).
The best seller lists used in this study can be found in [14]. It is worth mentioning that we considered only one book from each author to prevent the machine learning algorithms applied subsequently from identifying authorship.
Once the dataset was ready, we cleaned up the textual content of the 219 texts to maintain only the relevant contents of the books. In this process, the header and footer included by Project Gutenberg were removed, as well as editor/translator/author notes, captions and illustration indications, glossaries, footnotes, side-notes, annexes, and appendices.

V. METHODOLOGY

A. Text preprocessing
Using the dataset described in the previous section, we started the preprocessing stage. First, we replaced all capital letters with their lowercase counterparts. Then, the stopwords (i.e., words that provide little or no additional meaning to the context, such as articles and connectives) were removed. Next, we performed the tokenization of the books, in which elements such as punctuation marks and numbers were disregarded. Finally, the obtained words were lemmatized, lemmatization being a technique whose objective is to reduce a word to its canonical form and to group different forms of the same word (e.g., the term "boys" is reduced to "boy" and "took" becomes "take"). Table I shows an example.

TABLE I: Preprocessing of the excerpt "It is difficult to live up to this kind of thing, and my thoughts drift to the auld schule-house and Domsie.", obtained from the book Beside the Bonnie Brier Bush, by Ian Maclaren. The column titled Initial holds the original excerpt; the next, the phrase without capital letters and stopwords; the last, the excerpt after the tokenization and lemmatization processes.
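The preprocessing steps above can be sketched as follows. This is only an illustration under stated assumptions: the stopword set and the lemma dictionary are tiny hypothetical stand-ins (the text does not say which stopword list or lemmatizer was used), and the sketch tokenizes before filtering stopwords, since filtering operates on tokens.

```python
import re

# Hypothetical, deliberately tiny resources (assumptions for illustration);
# a real pipeline would rely on a full stopword list and a proper lemmatizer.
STOPWORDS = {"it", "is", "to", "up", "this", "of", "and", "my", "the"}
LEMMAS = {"thoughts": "thought", "boys": "boy", "took": "take"}


def preprocess(text):
    """Lowercase, tokenize, drop stopwords, and lemmatize a piece of text."""
    lowered = text.lower()
    # Keep alphabetic runs (allowing inner hyphens or apostrophes); this
    # discards punctuation and numbers.
    tokens = re.findall(r"[a-z]+(?:[-'][a-z]+)*", lowered)
    kept = [token for token in tokens if token not in STOPWORDS]
    # Map each remaining token to its canonical form when one is known.
    return [LEMMAS.get(token, token) for token in kept]


excerpt = ("It is difficult to live up to this kind of thing, and my thoughts "
           "drift to the auld schule-house and Domsie.")
tokens = preprocess(excerpt)
```

With the toy resources above, `tokens` keeps content words such as "difficult" and "drift" and reduces "thoughts" to "thought"; the exact output of the original pipeline may differ.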

B. Text embeddings
Techniques to embed textual content have been extensively used for a variety of tasks, including grasping text similarity, sentiment analysis, and classification. Among the most widely used techniques is the bag-of-words [17] approach, in which the relative frequencies of words appearing in a document are organized as a vector.
Recently, other approaches, now based on neural networks, have been developed to obtain dense embedding representations of words, sentences, or entire documents, these approaches being trained to predict masked parts in texts. Among the most used of these techniques is word2vec, which is based on a network comprising one hidden layer and a softmax output layer. The output layer is trained to predict the context (words appearing together) given a focus word in a sentence [19]. For a given set of sentences, such a process provides an embedding for each word.
More sophisticated techniques such as BERT [8] and sentence BERT [25] generate embeddings that capture richer context and semantic information of words or sentences. However, these techniques are limited to a small number of tokens and cannot be applied to large portions of text, such as entire books. For this reason, we opted to use the doc2vec (D2V) method to extract a vector representation of each book [15].
The doc2vec approach is based on the traditional word2vec [19] pipeline with the addition of the document tags as input. More specifically, it constitutes a neural network of three layers (input, hidden, and softmax), as illustrated in Figure 1a. Just like in word2vec with a continuous bag-of-words (CBOW) architecture, the inputs are one-hot vectors representing a sequence of words from a sentence in a book. A target word is omitted from the input and used to train the neural network. In addition, the input includes an extra one-hot vector identifying the book. The model is trained to predict the target word from the context (words adjacent to the target) using a negative sampling strategy. The hidden-layer vectors connected directly to the one-hot-encoded books are used as the book embeddings. Here, we opted to use the Gensim [24] software to obtain the doc2vec representations of books.

C. Visualization
Neural network embeddings usually result in high-dimensional dense vectors that are not correlated among themselves, which limits the use of linear techniques to reduce the dimensionality of these spaces (such as PCA [11]).Thus, the process of visualizing such structures is usually undertaken using non-linear projections, such as t-SNE [29] and UMAP [18].
However, embeddings can encode many different aspects of the data. For instance, a certain axis in a book embedding may be related to its number of pages or its adherence to the non-fiction or fantasy genres. The SemAxis approach [2] is a way to find an axis in a high-dimensional embedding that describes a certain aspect of the data. This is accomplished by first obtaining the centroids of two classes, e.g., small vs. large books or non-fiction vs. fantasy books. The line connecting the two centroids defines an axis onto which all the remaining books are projected. This process is illustrated in Figure 1. Since in the current work we are interested in encoding the success of books, we employed the SemAxis approach to find an axis for samples of the success and other classes. In addition to SemAxis, we also employed linear discriminant analysis (LDA) [12], which likewise results in an axis encoding a continuous representation of the two classes.
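As a minimal sketch of the centroid-based axis described above, the following pure-Python code builds the axis from two groups of vectors and projects a vector onto it. The function names are hypothetical, and scoring by scalar projection onto the unit-length axis is an assumption; cosine similarity with the axis would be a drop-in alternative.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def semaxis(success_vectors, other_vectors):
    """Axis pointing from the centroid of 'others' to the centroid of 'success'."""
    c_success = centroid(success_vectors)
    c_other = centroid(other_vectors)
    return [s - o for s, o in zip(c_success, c_other)]


def project(vector, axis):
    """Scalar projection of a vector onto the (normalized) axis."""
    norm = math.sqrt(sum(a * a for a in axis))
    return sum(v * a for v, a in zip(vector, axis)) / norm


# Toy two-dimensional embeddings (made-up values, not data from the study).
success = [[2.0, 2.0], [4.0, 2.0]]
others = [[0.0, 0.0], [0.0, 2.0]]
axis = semaxis(success, others)
scores = [project(v, axis) for v in success + others]
```

Each book is thus reduced to a single scalar, and larger values indicate a position closer to the success end of the axis.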
In contrast to neural network-based approaches, the bag-of-words embedding can result in highly correlated and sparse vectors. For instance, the frequency patterns of two closely related words can correlate strongly, and rare words may only be present in a small set of documents. Nonetheless, both LDA and SemAxis are still applicable in these conditions.
All these tests were implemented in Python [27] using the classifiers of the Scikit-Learn [23] library.
Following the guidelines described in related works [1,26], we used the default parameters of the methods to classify texts. As an exception, in the case of the SVM, we changed the parameter "max_iter", the maximum number of iterations, to 10,000.

VI. RESULTS AND DISCUSSIONS
This section describes the experiments performed to study the task of automatically characterizing and identifying best seller books. The proposed data analysis pipeline is illustrated in Figure 2. First, we obtain embedding representations of each book by employing two distinct techniques (Figure 2a): bag-of-words and doc2vec (the latter with different dimensions, namely 32, 64, 128, and 256). Next, we investigate the proposed classification problem through two main approaches: visualization and classification. In the first, we employed a simple visualization pipeline to verify and illustrate the potential of using embeddings to identify best seller books (Figure 2b-d). The objective of this approach is to provide a preliminary and simple way to visually inspect the considered high-dimensional embeddings by summarizing them into a single continuous axis.
The visualization pipeline starts with the standardization of the obtained embeddings (Figure 2b). To reduce the dimensionality of the embeddings, we employed SemAxis [2] and LDA [12]. Since these methods are supervised, the final visualizations are performed by employing the leave-one-out technique to avoid overfitting.
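The leave-one-out idea for a supervised projection can be sketched as follows: the axis used to place each book is computed from all the other books, so a book's own label never influences its position. The function name is hypothetical, and centering the projection on the midpoint between the two centroids (so that the sign of the score is meaningful) is an assumption made here for illustration.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def loo_semaxis_scores(X, y):
    """Project each sample onto a centroid axis fitted without that sample."""
    scores = []
    for held_out, x in enumerate(X):
        # Training data: every sample except the held-out one.
        rest = [(v, label) for i, (v, label) in enumerate(zip(X, y)) if i != held_out]
        c_success = centroid([v for v, label in rest if label == 1])
        c_other = centroid([v for v, label in rest if label == 0])
        axis = [s - o for s, o in zip(c_success, c_other)]
        norm = math.sqrt(sum(a * a for a in axis))
        midpoint = [(s + o) / 2 for s, o in zip(c_success, c_other)]
        # Positive score: the held-out sample lies on the success side.
        score = sum((xi - m) * a for xi, m, a in zip(x, midpoint, axis)) / norm
        scores.append(score)
    return scores


# Toy data (made-up values): label 1 = success, label 0 = others.
X = [[2, 2], [4, 2], [3, 3], [0, 0], [0, 2], [1, 1]]
y = [1, 1, 1, 0, 0, 0]
scores = loo_semaxis_scores(X, y)
```

The resulting scores can then be plotted per class, for example as density estimates, without the optimistic bias of fitting and projecting on the same samples.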
The second approach considered in this work is the direct application of classification methods, allowing a quantitative comparison of the respective performances. For that, we employed a pipeline comprising the same embedding configurations as before, but followed by three successive stages: preprocessing, learning method, and validation, each presented as a box in Figure 2. All combinations of the components of these boxes are considered in our evaluation.
In this sense, the first two subsections below detail the visualization task, followed by the classification, using first the bag-of-words and then the doc2vec representation. Then, in the last subsection, we repeat these experiments to evaluate a specific variation of the constructed dataset.

A. Bag-of-words analysis
The first experiment evaluates whether the frequency of the words composing the books can discriminate between best sellers and ordinary literary works. For this purpose, we considered the set S, built from the 3,585 different words that appeared in at least N/2 texts of the dataset, N being the total number of books in the dataset. The proportion N/2 was chosen because smaller ones (such as N/3 or N/4) brought in archaic words and words not belonging to the English vernacular, while higher proportions led to poorer results in the experiments. Considering each entry in S, we computed its frequency for all books in the dataset, resulting in a 219 × 3585 matrix of frequencies, henceforward called M. Next, the rows of M, each representing a book, were standardized and transformed according to two approaches, LDA and SemAxis, the results being cross-validated through leave-one-out. As shown in Figure 3, such processing led to a visual separation in both 3a and 3b, giving evidence that the bag-of-words model can provide a good, although not exact, split between the two studied categories.
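The construction of S and M described above can be sketched in pure Python as follows. The function names are hypothetical, the entries are assumed to be relative frequencies (as in the bag-of-words description of Section V), and standardization is assumed to be a per-word z-score; the text does not specify these details.

```python
from collections import Counter
from statistics import mean, pstdev


def build_vocabulary(docs, min_fraction=0.5):
    """Words present in at least min_fraction of the documents, sorted."""
    doc_freq = Counter(word for doc in docs for word in set(doc))
    threshold = min_fraction * len(docs)
    return sorted(word for word, df in doc_freq.items() if df >= threshold)


def frequency_matrix(docs, vocab):
    """One row per document; entries are relative frequencies of vocab words."""
    rows = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        rows.append([counts[word] / total for word in vocab])
    return rows


def standardize(matrix):
    """Z-score each column (word); constant columns are mapped to 0."""
    columns = list(zip(*matrix))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [[(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
            for row in matrix]


# Toy corpus of already preprocessed books (made-up tokens).
docs = [
    ["eye", "face", "hand"],
    ["eye", "hand", "ordinary"],
    ["face", "eye", "eye", "examination"],
    ["ordinary", "eye"],
]
S = build_vocabulary(docs)       # words found in at least N/2 = 2 books
M = frequency_matrix(docs, S)    # shape: 4 books x len(S) words
Z = standardize(M)               # standardized version of M
```

In this toy corpus, "examination" occurs in a single book and is therefore excluded from S, mirroring how the N/2 threshold filters out rare words.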
Moreover, to quantitatively assess the obtained separation, M was used as input to supervised classification methods (namely KNN, logistic regression, naive Bayes, decision tree, random forest, and SVM). We also applied leave-one-out and k-fold (taking k = 10) cross-validation methods and considered both the standardized and the non-standardized versions of M (the standardized version denoted by M ). Here, we adopted the standard hyperparameters of each model; in other words, no tuning was involved.
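The evaluation grid described above (six classifiers, standardized or not, two validation schemes) can be sketched with Scikit-Learn as follows. Several choices are assumptions rather than details taken from the text: the specific estimator classes (e.g., GaussianNB and LinearSVC), the stratified and shuffled 10-fold split, the fixed random seeds, fitting the scaler inside each training fold, and the synthetic data standing in for M.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier


def make_models():
    """Default-parameter classifiers; only the SVM iteration cap is raised."""
    return {
        "KNN": KNeighborsClassifier(),
        "LR": LogisticRegression(),
        "NB": GaussianNB(),
        "DT": DecisionTreeClassifier(random_state=0),  # seed only for reproducibility
        "RF": RandomForestClassifier(random_state=0),  # seed only for reproducibility
        "SVM": LinearSVC(max_iter=10_000),
    }


def evaluate(X, y):
    """Mean accuracy for every (model, standardized, validation) combination."""
    validations = {
        "LOO": LeaveOneOut(),
        "10-fold": StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    }
    results = {}
    for cv_name, cv in validations.items():
        for standardized in (False, True):
            for name, model in make_models().items():
                # Inside a pipeline, the scaler is fitted on each training fold only.
                estimator = make_pipeline(StandardScaler(), model) if standardized else model
                scores = cross_val_score(estimator, X, y, cv=cv, scoring="accuracy")
                results[(name, standardized, cv_name)] = scores.mean()
    return results


# Synthetic stand-in for the frequency matrix: 30 "books", 5 features.
rng = np.random.default_rng(0)
y = np.array([0] * 15 + [1] * 15)
X = rng.normal(size=(30, 5)) + y[:, None]  # shift the second class by one unit
results = evaluate(X, y)
```

The dictionary `results` holds one mean accuracy per combination, which makes it straightforward to compare the standardized and non-standardized inputs for each model and validation scheme.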
As shown in Table II, the logistic regression model was the best choice for grasping discrepancies between classes, leading to an average classification accuracy of 0.75 for both LOO and k-fold cross-validation. This result shows that the approach is able, to a reasonable extent, to identify successful literary works. Furthermore, it is worth observing that standardization positively impacts the outcomes, leading to performances as good as or better than the non-standardized case in ten out of twelve scenarios and lowering only the accuracy of the KNN model.
In addition, we retrieved the 40 words of S with the greatest impact on the SemAxis projection, aiming at analyzing what sort of words seem to be characteristic of best sellers and of the other books. As presented in Table III, the most meaningful words for successful books encompass six adjectives, nine nouns, one adverb, and four verbs; for the non-best-seller books, we have three adjectives, seven nouns, one adverb, and nine verbs. Similarly to a result formerly reported in [3], words referring to body parts (such as eye, face, and hand) play a central role in less successful titles. Furthermore, none of the 20 most relevant terms for successes ranks among the 40 most frequent words of S; however, when analyzing the non-best-seller books, the principal words eye, face, hand, and back represent, respectively,

B. Doc2vec analysis
The second experiment evaluates whether doc2vec's representation of literary works can grasp the dissimilarity between the two analyzed classes. With this aim, we instantiated D2V models with 32, 64, 128, and 256 dimensions (a feature commonly called vector size, hereafter referred to as # 2 D). We also set the minimum word count to 1 (to ignore all words with a total frequency lower than one), the window (maximum distance between the current and predicted term within a sentence) to 5, and the epochs (number of iterations over the corpus) to 40. Lastly, the models were trained using all 219 instances of the dataset.
Next, each model vector (henceforth called D), each representing a different book, was transformed employing LDA and SemAxis techniques along with leave-one-out cross-validation, yielding the results shown in Figure 4. As can be observed, the method was able to characterize best seller and non-best-seller works in a contrasting fashion, in both 4a and 4b. This result shows that it is possible to emphasize the differences between the two classes with two noticeably distinct approaches (BoW and D2V).
What if the abovementioned approaches are only grasping the subjects of the books? That would be a valid inquiry, since we did not take this type of information into account during the construction of the dataset. To assess this possibility, we retrieved the list of subjects of each book provided by the Gutenberg platform and then analyzed the ten most common ones in the dataset. In Figure 5, we plot those subjects against the SemAxis projection of the books' D2V vector representation (using # 2 D = 64), stratifying the results by category. As one can see, the only subjects with a representative number of instances are PS and PR, which also seem to explain, to some degree, the separation obtained through the D2V method.
PR and PS are classes used by the Library of Congress [13] to catalog English and American literature, respectively. In our case, the PR subject represents 102 instances of the dataset: 34 best sellers and 68 non-best-seller works. The PS one, by contrast, encompasses 98 books: 72 best sellers and 26 other types of works. Since, in principle, the success category is the only one with a limited number of instances (given its criteria), we created a new dataset (with 72 successes and 72 others, following the same standards adopted in the creation of the former dataset) with only literary works belonging to subject PS.
To that end, we selected 46 new non-best-seller titles from the Gutenberg platform. Using this new dataset, we repeated the previous experiments, aiming at understanding whether the fact that a book belongs to English or American literature was enough to explain the separation provided by the BoW and D2V methods. The results are presented and discussed below.

Bag-of-words analysis
Similarly to the previous experiment, we considered the set S, built from the 3,257 different words that appeared in the text of at least N/2 books of the PS dataset. Then, we calculated their frequencies, resulting in a 144 × 3257 matrix (henceforth called M_PS). Next, the rows of M_PS were standardized, transformed using LDA and SemAxis, and verified via LOO cross-validation, as shown in Figure 6. It is possible to observe that the separation between the classes is still perceptible, in both 6a and 6b, although now solely American literature (subject PS) is being considered.
To quantitatively analyze the separation of the categories, we performed supervised classification methods using the standardized and non-standardized versions of M P S as input.
The applied models, hyperparameters, and cross-validation methods were the same as in the former experiment. As shown in Table VI, the random forest model was the best option to distinguish the best seller and other instances, leading to an average accuracy of 0.71 in both LOO and 10-fold cross-validations. The standardization did not affect the results.
Even though this is a lower accuracy than that obtained with the previous dataset (which led to the highest accuracy of 0.75), it is worth mentioning that the PS dataset has 35% fewer instances than the other, which makes us expect lower accuracies and higher standard deviations.Thus, it is possible to state that the BoW method is not classifying the corpus instances based predominantly or solely on their literary class.

Doc2vec analysis
The D2V models were instantiated in the context of the new dataset with the same hyperparameters as the previous tests, the only exception being that we trained the model using 144 books instead of 219. We repeated the former configurations (vector sizes
Figure 7 shows the results of the transformations of the model vectors via LDA and SemAxis. The split between best seller and non-best-seller works was again observed, suggesting that the method is insensitive to the literary class.
The quantitative assessment using supervised classification led to the results shown in Tables VII and VIII. From Table VII, it is possible to conclude that the models that performed best under LOO cross-validation were logistic regression and naive Bayes, with the highest accuracy (0.67) given by the latter, with # 2 D P S = 256. In this case, KNN was the only model for which standardization did not improve performance. From Table VIII, we conclude that no model stood out for 10-fold cross-validation, with the highest accuracy of 0.67 given by the SVM model, with # 2 D P S = 32 and non-standardized input. The standardization process induced better results on six distinct occasions, although the best accuracy was obtained with a non-standardized vector.
For the 219-instance dataset, the best-achieved accuracy was 0.72. Again, we expected a drop in the accuracy, as the new dataset has 35% fewer instances than the other. Thus, just as in the BoW method, it is possible to infer that the separation between classes obtained in the D2V approach does not rely on whether a book belongs to English or American literature.

VII. CONCLUSIONS
The study of the characteristics leading literary pieces to become best sellers constitutes an intriguing and challenging research subject. The present work addressed this issue while considering aspects derived from the full content of a list of more and less successful books retrieved from Project Gutenberg, based on the best seller lists of Publishers Weekly.
Several alternative content representation, standardization, visualization, and classification approaches were considered, as summarized in the diagram shown in Figure 2.
We started our analysis by examining the data using visualization techniques. The visualization enabled a preliminary direct inspection of the embedding by looking at a single axis that maximizes the separation between best sellers and ordinary books. Specifically, we employed SemAxis and LDA techniques, the former providing better discrimination between classes than the latter, both for bag-of-words and doc2vec representations. Furthermore, SemAxis provided means that helped to: (i) understand the most characteristic words in best sellers and non-best-sellers; and (ii) check if the respective success was related to the subjects of the books (e.g., love stories, adventure stories, fiction, among others). In line with earlier work [3], words related to body parts (like face, eye, and hand) played a central role in non-best-seller books, while more varied and less common words (such as ordinary, accordingly, and examination) were characteristic of more successful books. Moreover, we found no evidence that the subject of the books impacted the class discrimination obtained.
For the classification tasks, we tested two strategies for preprocessing the two distinct representations: (i) standardizing and (ii) not standardizing the embeddings. Then, we evaluated the proposed representations via different classifiers (namely KNN, LR, NB, DT, RF, and SVM). The best result was obtained with the complete dataset (219 books), using the LR classifier with the standardized bag-of-words representation. In this case, the final classification accuracy was 0.75. Still dealing with the complete set, the best accuracy obtained for the D2V embedding was 0.72, combining the standardized representation with the NB model. For the dataset considering only the PS subject (144 books), the bag-of-words approach yielded the most promising results for the standardized data fed into the RF classifier. The D2V representation, in contrast, returned better outcomes for the standardized data combined with the NB classifier. These results agree with the tendency of the two classes' separation found in the visualization analysis. Interestingly, the standardization did not affect the results significantly in the doc2vec approach for both datasets.
The reported methodology and results pave the way for several related studies, some of which are described as follows. Firstly, it would be interesting to adapt the reported method to other types of embeddings, one example being the BERT transformer modified to work with long texts. Secondly, it would be interesting to apply the described approach to better understand: (i) other types of documents, such as scientific books and articles, and (ii) additional types of artistic production, including music, poetry, and theater. Lastly, another point that could be explored concerns explaining why some literary works become best sellers and others do not.

FIG. 1: Representation of the doc2vec and SemAxis approaches. In (a) we illustrate the neural network employed to obtain the embedding representation of books based on sequences of words (encoded as one-hot vectors) extracted from a book. The network is trained to predict a target word in the sentence based on the adjacent terms. Additionally, the original book ID is also encoded as input to the neural network, and the respective trained vectors correspond to the embedding space of books. In (b), we illustrate the SemAxis approach, in which the line connecting the centroids of the two categories (success vs. others) defines an axis onto which all the books are projected. This process results in a continuous one-dimensional (scalar) representation of books, which is employed for visualization purposes.
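The projection in panel (b) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the function name `semaxis_scores` and the toy data are hypothetical, and the score is taken here as the scalar projection onto the unit-length centroid axis (the cosine similarity between each book vector and the axis is a common alternative).

```python
import numpy as np

def semaxis_scores(X, y):
    """Project each row of X onto the axis joining the two class centroids.

    X: (n_books, n_features) array of book embeddings.
    y: (n_books,) array of labels, 1 for success and 0 for other.
    Returns one scalar per book (larger values lie toward the success side).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    centroid_success = X[y == 1].mean(axis=0)
    centroid_other = X[y == 0].mean(axis=0)
    axis = centroid_success - centroid_other
    norm = np.linalg.norm(axis)
    if norm == 0:
        raise ValueError("class centroids coincide; the axis is undefined")
    # Scalar projection of every book onto the unit axis (an assumption of
    # this sketch; see the lead-in for the cosine-similarity alternative).
    return X @ (axis / norm)

# Toy two-dimensional embeddings: two "success" books and two "other" books.
X = np.array([[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]])
y = np.array([1, 1, 0, 0])
scores = semaxis_scores(X, y)
print(scores)
```

The resulting one-dimensional scores are the kind of values whose per-class distributions can then be inspected with a kernel density estimate.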

FIG. 2: Overall diagram of the main approaches, methods, representations, and validations employed in the present work. All the methods within the blue and orange boxes are applied to the two considered embeddings in a combined fashion. For example, a valid path would be: (i) embedding: doc2vec with dimension equal to 32; (ii) preprocessing: standardized; (iii) learning method: logistic regression; (iv) validation: leave-one-out.
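The combination of choices described in this caption can be enumerated as a Cartesian product. The sketch below only lists the configurations; the variable names are hypothetical, and the option values are those mentioned in the text and table captions (D2V vector sizes 32, 64, 128, and 256; six classifiers; two preprocessing strategies; two validation schemes).

```python
from itertools import product

# Embeddings: bag-of-words has no vector-size option, while doc2vec is
# considered with four vector sizes.
embeddings = [("bag-of-words", None)] + [
    ("doc2vec", size) for size in (32, 64, 128, 256)
]
preprocessing = ["standardized", "non-standardized"]
classifiers = ["KNN", "LR", "NB", "DT", "RF", "SVM"]
validations = ["leave-one-out", "10-fold"]

# Every valid path through the diagram is one element of this product.
configurations = list(
    product(embeddings, preprocessing, classifiers, validations)
)

# The example path given in the caption.
example_path = (("doc2vec", 32), "standardized", "LR", "leave-one-out")
print(len(configurations), example_path in configurations)
```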

FIG. 3: Kernel density estimation of the 219 investigated literary works, considering (a) the LDA projection and (b) the SemAxis projection of M. The colors on the plot correspond to the categories of the books, blue indicating success and orange other.

FIG. 4: Kernel density estimation of the 219 investigated literary works, considering (a) the LDA projection and (b) the SemAxis projection of the D2V representation (adopting a D2V vector size of 64). The colors on the plot correspond to the categories of the books, blue indicating success and orange other.

FIG. 5: On the y-axis, the ten most common subjects in the dataset. On the x-axis, the SemAxis projection of the books' D2V representation (adopting a D2V vector size of 64). The colors on the plot correspond to the categories of the books, blue indicating success and orange other.

FIG. 6: Kernel density estimation of the 144 investigated literary works belonging to the PS subject, considering (a) the LDA projection and (b) the SemAxis projection of M_PS. The colors on the plot correspond to the categories of the books, blue indicating success and orange other.

FIG. 7: Kernel density estimation of the 144 investigated literary works belonging to the PS subject, considering (a) the LDA projection and (b) the SemAxis projection of the D2V representation of the PS dataset (adopting a D2V vector size of 64). The colors on the plot correspond to the categories of the books, blue indicating success and orange other.

TABLE II: Classification accuracy for different models and arrangements (i.e., whether the standardized or the non-standardized bag-of-words representation M was employed, and whether leave-one-out or k-fold cross-validation was used). Highlighted in bold is the best result for each configuration.

TABLE III: The forty most significant words for the SemAxis discrimination between best sellers and others. The importance of each term for the method dictates its position in the table: the element on the first row and the first column (of success/other) is the most important for the class; the one on the second row and the second column is the second most important; and so forth.

TABLE IV: Classification accuracy for LOO cross-validation combined with different models and arrangements (i.e., whether the standardized or the non-standardized D2V representation D was employed, and with which D2V vector size: 32, 64, 128, or 256). Highlighted in bold is the best result for each configuration.
C. Are the subjects being grasped by the approaches?

TABLE V: Classification accuracy for 10-fold cross-validation combined with different models and arrangements (i.e., whether the standardized or the non-standardized D2V representation D was employed, and with which D2V vector size: 32, 64, 128, or 256). Highlighted in bold is the best result for each configuration.

TABLE VI: Classification accuracy for different models and arrangements (i.e., whether the standardized or the non-standardized bag-of-words representation M_PS was employed, and whether leave-one-out or k-fold cross-validation was used), considering the PS dataset. Highlighted in bold is the best result for each configuration.
and the non-standardized versions of the model vectors, called DPS and D P S, respectively.

TABLE VII: Classification accuracy for LOO cross-validation combined with different models and arrangements (i.e., whether the standardized or the non-standardized D2V representation D_PS was employed, and with which D2V vector size: 32, 64, 128, or 256), considering the PS dataset. Highlighted in bold is the best result for each configuration.

TABLE VIII: Classification accuracy for 10-fold cross-validation combined with different models and arrangements (i.e., whether the standardized or the non-standardized D2V representation D_PS was employed, and with which D2V vector size: 32, 64, 128, or 256), considering the PS dataset. Highlighted in bold is the best result for each configuration.