
Prediction of future customer needs using machine learning across multiple product categories

  • David Kilroy ,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    david.kilroy1@ucdconnect.ie

    Affiliation School of Computer Science, University College Dublin, Dublin, Ireland

  • Graham Healy,

    Roles Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Writing – original draft, Writing – review & editing

    Affiliation School of Computing, Dublin City University, Dublin, Ireland

  • Simon Caton

    Roles Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Writing – original draft, Writing – review & editing

    Affiliation School of Computer Science, University College Dublin, Dublin, Ireland

Abstract

In recent years, computational approaches for extracting customer needs from user-generated content have been proposed. However, there is a lack of studies that focus on extracting unmet needs for future popular products. Therefore, this study presents a supervised keyphrase classification model which predicts needs that will become popular in real products in the marketplace. To do this, we utilize Trending Customer Needs (TCN)—a monthly dataset of trending keyphrase customer needs occurring in new products during 2011–2021 across multiple categories of Consumer Packaged Goods e.g. toothpaste, eyeliner, beer, etc. We are the first study to use this dataset, and we employ it by training a time series algorithm to learn the relationship between the features we generate for each candidate keyphrase on Reddit and the keyphrases that appear in the dataset 1–3 years in the future. We show that our approach outperforms a baseline in the literature and, through Multi-Task Learning, can accurately predict needs for a category it wasn’t trained on, e.g. train on toothpaste, cereal, and beer products yet still predict for shampoo products. The findings from this research could provide many advantages to businesses such as gaining early access into markets.

1 Introduction

Constantly driving innovation by producing new products is a critical success factor for Small and Medium-sized Enterprises (SMEs) [1] as well as large ones [2]. Businesses listen to the Voice of the Customer (VOC) to aid in the new product development/discovery process [3–5] as customer fulfillment is essential for the success of new products. Businesses are highly interested in identifying customer needs that are currently unmet [6] or anticipating/predicting future ones their customers may be unaware of [7]. Identifying these types of needs allows companies to gain early access into new markets and increase their overall profitability [8].

Over the last decade, researchers and businesses have been turning to computational approaches to identify new customer needs in addition to traditional methods, e.g. questionnaires, user observations, customer specifications, interviews, etc. [9]. Generally, these approaches mine User Generated Content (UGC) using statistical techniques from Artificial Intelligence (AI) and Machine Learning (ML) [10]. However, few of these techniques focus on mining needs that are unmet or will be of importance in the future. Yet approaches that can identify these types of needs are of major interest; for example, Black Swan Data (a firm specializing in predicting future market needs) accumulated over £15.2M in investments in 2022 alone—https://www.crunchbase.com/organization/black-swan-data/company_financials.

Computational approaches in the literature mainly mine sets of general customer needs which are posted as UGC [11–34]. However, these approaches fail to narrow their focus to identifying needs that are of substantial value to businesses, e.g. unmet needs [16], needs of future importance [35], needs that can be turned into product opportunities [11, 13], etc. Unless the task is document classification, researched approaches typically don’t employ supervised ML to extract customer needs, instead opting for unsupervised or rule-based approaches, e.g. unsupervised clustering [11, 12, 20, 24, 36, 37] or rule-based keyphrase extraction [23, 38–41]. This is because supervised ML requires a ground truth dataset; having one would most likely improve accuracy over previous approaches. Aside from the customer needs mining literature needing a supervised approach, there is also a lopsided number of studies analyzing at the product model level rather than the product category level, as pointed out in [35] (e.g. iPhone4 compared to mobile phones). Category-level analysis provides a different view as it allows needs in general products to be found rather than needs for a specific product model.

A major obstacle to attaining the mentioned aims is the lack of a supervised keyphrase classification model to predict future customer needs that are currently unmet. Therefore, in this study, we build a Multivariate Time Series Classification (MTSC) model which attempts to predict needs that will become popular in future products. To formulate this task, we utilize Trending Customer Needs (TCN) [42]—a dataset of trending keyphrase needs occurring in products each month from 2011–2021 which spans multiple product categories in the area of Consumer Packaged Goods (CPG) e.g. toothpaste, eyeliner, beer, etc. We are the first study to utilize this dataset, and we use it by training a time series algorithm to learn the relationships between keyphrases on Reddit and the ones appearing in the dataset 1–3 years into the future. In our evaluation, we show that our approach outperforms a baseline from a previous study carrying out the same task. We also build a model that incorporates Multi-Task Learning (MTL) by being trained on multiple product categories (e.g. toothpaste, cereal, and beer) rather than just one category (e.g. toothpaste). This is significant as it can still predict accurately for a category it doesn’t use during training, e.g. it can be trained on toothpaste, cereal, and beer yet still predict for cookies. In doing so, our approach addresses the aforementioned limitations by predicting future customer needs occurring in multiple product categories using a supervised time series classification approach. There are 4 unique contributions of our work:

  • The challenging task of predicting future customer needs is performed more accurately than in previous studies, allowing product development teams to identify unmet needs ahead of their competitors with greater accuracy.
  • Due to the availability of the newly released TCN dataset, supervised ML is used to build a model capable of identifying future customer needs.
  • MTL is employed by incorporating data from multiple product categories when making predictions, which yields a model capable of predicting needs for a category it doesn’t see during training.
  • Due to the availability of the TCN dataset, our analysis spans many product categories. Consequently, it’s also performed at the product category level (e.g. cheese) rather than the product model level (e.g. The Laughing Cow).

The remainder of the paper is organized as follows. Section 2 provides a literature review of the related approaches for extracting customer needs from UGC. Section 3 illustrates and describes our proposed method. Section 4 discusses the proposed evaluation and provides results. Finally, Section 5 concludes the study and discusses future research directions. We refer the reader to our GitHub repository for resources associated with the study which are mentioned throughout the paper—https://github.com/davidkilroy/Multi-Task-Future-Customer-Needs-Model.

2 Related work

Studies that mine customer needs from UGC use techniques from text mining, Natural Language Processing (NLP) and ML. These studies can be categorised based on: 1) Data Used; 2) Methods Performed; 3) Application Scenario; and 4) Evaluation. This section discusses studies in the area with respect to these factors. Before discussing these techniques, we first contextualise what is meant by a future customer need, noting that there is no universally accepted definition.

A customer need has been defined in the marketing literature as a “description in the customer’s own words of the benefit to be fulfilled by the product or service” [43], e.g. the need to prevent chapping for Vaseline lip balm; i.e., they generally refer to requirements, demands, preferences, wants, etc. Computational approaches analyzing UGC have also included the features or attributes of a product in this definition [44, 45] as they contain benefits, e.g. the coconut flavour/scent which contains the need fragrant for Vaseline. In our study, the definition of a customer need is based on the output label from the TCN dataset [42], i.e. our ground truth label. In TCN, a customer need is in the form of a keyphrase, not a document or group of words/phrases (i.e. topic) as in other studies. It groups needs into two categories: 1) direct needs—stated benefits or claims the user gets/overcomes from using the product (e.g. prevent chapping); and 2) indirect needs—actual features or attributes of a product which contain benefits (e.g. coconut). There are few generic definitions in the literature of a future customer need. However, it is often mentioned alongside words like hidden [46] or unmet [47], hinting that such needs are sometimes undiscovered/unsolicited. In our study, we define a future customer need as a keyphrase with direct/indirect benefits where the benefit will only be obtained at some future time period. In our definition, this future time period is 1–3 years before the need starts trending in real products on the market (i.e. in the TCN dataset).

To clarify our definition of a customer need: a keyphrase captures the main topics in a document [48–51]. It is different from a keyword as it connotes a multiword lexeme [51]. By extension, a candidate keyphrase is a phrase an algorithm analyzes to predict whether it is a keyphrase. In many computational studies, candidate keyphrases are initial sets of phrases that are first analyzed by an algorithm [48, 49, 51–53].

2.1 Data used

The main types of data used for extracting customer needs from UGC are social media [11–17, 54], product reviews [18–28, 55] and patents [29–34]. Social media has the drawback of containing a large number of posts that are irrelevant to customer needs when compared to product reviews and patents [35]. However, given the context of our research, which aims to discover future needs, social media is the most suitable as it has served as a proving ground for new and emerging ideas, e.g. social media users discussing do-it-yourself solutions to beauty products before they become popularized in the market [56].

The social media platform Reddit is chosen for our analysis as it is one of the few opinion-based platforms with an open Application Programming Interface (API) in the “Post-API Age” [57], in which platforms like Twitter and Facebook have restricted access following major data scandals (e.g. Facebook Cambridge Analytica [58]) or changes in company policy (e.g. Twitter [59]). Specifically, when obtaining data we use the Pushshift API which has a limit “five times greater” [60] than the official Reddit API. We note that since collecting all the data needed for this article, API rules for Reddit have changed. At present it is unclear how Reddit data will be accessed in the future. Currently, it seems that a user licence (similar to that of Twitter) will be put in place for academics—https://techcrunch.com/2023/04/18/reddit-will-begin-charging-for-access-to-its-api. Barring data access, previous research using Reddit for customer needs mining [11–13] has noted the advantages of using the platform. These studies note the benefit of having data organized into defined “subreddits” when capturing needs, where platforms like Facebook and Twitter are limited in this sense [11, 35], e.g. the subreddit r/PickAnAndroidForMe for Android products. In our study, we make use of this “subreddit” structure when classifying keyphrases as future customer needs. Documents are also generally longer on Reddit than on other platforms [11, 13] (e.g. Twitter’s 280-character limit), which could potentially help in mitigating the short text problem [61, 62].

Many studies use ground truth data to train/evaluate ML algorithms capable of detecting customer needs from UGC. Depending on the task, the form of this ground truth data differs, e.g. for document classification, a binary indicator may be included to show whether the instance contains a type of customer need [37, 63–70]. With our task being a keyphrase prediction problem, we instead require a dataset of ground truth keyphrases with an indicator of whether they are future customer needs.

To do this we use the newly available TCN dataset [42] which is specifically designed to train/evaluate models for discovering future customer need keyphrases. The dataset itself provides a set of the top 20 trending keyphrases each month from 2011–2021 for 37 product categories in the area of CPG e.g. toothpaste, eyeliner, beer, etc. (although we only use 15 categories in our analysis). It is constructed by having annotators label over 9000 keyphrases which are automatically extracted from a database of new-to-market product descriptions provided by Mintel Global New Products Database (GNPD) [71] (a large product information database used by industry and academics). By labelling needs from a dataset of product descriptions, TCN assures a certain quality of the keyphrases it provides as they are needs addressed in real products. The goal of this research is to predict the top needs occurring at a future time period in the dataset e.g. use Reddit data in 2015 to predict needs occurring in TCN in 2018.

The only other dataset that allows the training/evaluation of ML-based algorithms for predicting future customer needs in the form of keyphrases is that of [35], which also uses Mintel GNPD to annotate keyphrases for the evaluation of their approach. Unlike TCN, they annotate keyphrases using a named entity annotation approach; however, the output of the two datasets is the same: a ranked list of trending keyphrases that algorithms run over UGC attempt to identify ahead of time. One obvious drawback of the dataset in [35] is that it is only available for one product category (i.e. toothpaste), whereas TCN covers multiple categories. Because of this, TCN makes possible a major contribution of this research: the use of MTL to build a single model that learns to predict future customer needs for any product category, e.g. it can be trained on toothpaste, eyeliner and popcorn customer needs yet still predict toothpaste or even tea customer needs.

2.2 Methods performed

There are generally three families of methods used to mine for customer needs in the text mining and NLP literature: i) document classification; ii) clustering; and iii) keyphrase classification/ranking.

Document classification methods reduce the number of documents under analysis to ones that are “informative” from the standpoint of mining for customer needs. This definition of informative changes depending on the study at hand. In [63], this definition is based on whether the document contains a “wish” making a suggestion to improve a product or an intention to purchase it. Similarly, studies on “purchasing intent” [64–67] identify documents showing “a desire to purchase a product or service in the future” [64]. Purchase intent studies mine various UGC data sources such as Quora [64], Yahoo Answers [64] and Twitter [65, 66]. Other studies specifically classify documents based on whether they contain “customer needs” [37, 68, 69] under the definition in their respective studies. [68] gives examples of what customer needs are when obtaining labeled data for their classification task, e.g. they state that the sentence “this product can make your teeth super-sensitive” is a need as it is informative in the sense of providing information about the product, whereas “this product can be found at CVS” is uninformative as it only mentions the store it can be purchased in. Other studies have tried to classify documents that contain “product innovations” [70]. Methods used to solve these tasks range from classical ML methods (like Support Vector Machines, tree-based and Bayesian classifiers [69]) to deep learning approaches (such as Convolutional Neural Networks [68] or Long Short-Term Memory networks [70]).

Clustering methods can generally be split into three subfamilies of approaches: a) keyphrase clustering; b) document clustering; and c) topic modeling. Keyphrase methods group similar customer needs in the form of keyphrases together, e.g. [36] clusters keyphrases from Amazon reviews of 4 smartphone products. Document methods group similar documents discussing customer needs together, e.g. [20] clusters documents of Amazon reviews for recliner products. Topic modeling can be seen as both keyphrase and document clustering, as each document is a probability distribution over topics, which are in turn distributions over keyphrases. Topic models are used quite extensively in the literature, although for various purposes, e.g. analyzing fashion trends, smartphones or Amazon product ecosystems using Latent Dirichlet Allocation (LDA) [11, 12, 24, 37], or finding shorter-lived trends using LDA, Non-negative Matrix Factorization (NMF), Latent Semantic Analysis (LSA) and neural topic models [72–74]. As our approach uses a keyphrase classification algorithm, it doesn’t necessitate any clustering algorithms. However, it does borrow many techniques from these studies, for example, text preprocessing (e.g. Part-of-Speech (POS) tag filtering [24]) and document classification before running an ML algorithm (e.g. [37] ran an uninformative/informative review classifier before running LDA).

Approaches that work on the keyphrase level use various techniques and are applied for multiple purposes. A large body of work run over product reviews has focused on extracting a ranked list of the most important keyphrases [23, 38–41]. These keyphrases are found using rule-based approaches, with various studies considering factors such as frequency or sentiment when ranking [23, 38]. However, the ranking of these keyphrases is not specifically designed for finding customer needs for product development, with many of these studies noting their use as assistive purchasing information for future buyers based on previous ones [23, 38–41]. Other approaches do, however, focus on ranking keyphrases representing customer needs for product development [14, 15, 35, 75]. For example, [14, 15, 75] rank and categorize needs into strong, weak and controversial phrases from Twitter data for specific models of smartphone and automobile products (e.g. iPhone4, Motorola Droid RAZR, Tesla Model S, etc.). [35] builds on previous research by using a rule-based approach to compare a ranked list of customer need keyphrases extracted from Reddit to that of future needs extracted from a large database of toothpaste products. With these studies using rule-based approaches to rank keyphrases from UGC data, one of the research gaps this study fills is the use of supervised ML to solve the ranking problem. We do this by leveraging the TCN dataset [42], a recent benchmark that extracts the top trending keyphrase needs between 2011–2021 from Mintel GNPD [71]—a database of new-to-market CPG products e.g. toothpaste, eyeliner, beer, etc. This dataset allows us to fit a supervised model to predict future trending keyphrases appearing in a set of new-to-market products from UGC. Although supervised ML has been used to classify documents in customer needs mining, it has not been used to predict trending keyphrases in this context before and is hence a contribution of our work. To classify keyphrases, we generate a group of time series representing features for each keyphrase, e.g. an individual univariate time series representing some signal of sentiment, frequency, number of comments on the post, etc. We then classify these keyphrases using techniques from MTSC [76] based on whether or not they go on to trend in the TCN dataset at some future time period.

This study also explores the use of MTL, which aims to improve the learning of a model for one task by using the knowledge contained in other learning tasks that are related but not identical to the initial task [77]. The general approach of MTL has been applied in many areas of ML including, but not limited to, image classification [78], NLP tasks (such as sentiment analysis) [79, 80] and time series classification [81, 82]. It has also been used in unrelated customer needs mining tasks, e.g. understanding customer needs from vehicle behaviour [83]. In our study, we use it to learn a general notion of a customer need across multiple product categories, so that the model can be used for a category it has or hasn’t seen during training, e.g. build a model on needs from toothpaste, lip balm and soda to predict for a seen category like toothpaste or even an unseen category like pizza. In our evaluation, we show how using MTL in this manner results in a high-performing model capable of accurately predicting for categories it has and hasn’t seen during training.

2.3 Application scenario

Other than methodological differences, studies in the area of customer needs mining also differ on the application level. Relevant to our study, is how other approaches differ on: i) the types of products analyzed; ii) whether the studies are based on previous methodologies in the business literature; and iii) the types of needs mined.

2.3.1 Types of products analyzed.

Many studies in the literature tend to focus on analyzing customer needs for a specific product model, such as smartphone products, e.g. [11, 12] extract needs for the Samsung Galaxy Note 5 by mining the subreddit r/galaxynote5. Other studies extract needs for multiple product models, e.g. [14] finds needs for 4 smartphone models (e.g. iPhone4, Motorola Droid RAZR, etc.), [75] extracts for 4 automobile models (e.g. Tesla Model S, Honda Civic, etc.), while [15] extracts for 4 models of smartphones and 4 models of automobiles. There is, however, a lack of studies that analyze at the product category level, e.g. mining for general smartphone needs on social media rather than for a particular model (e.g. iPhone4). Some that do include a document classification technique [68] which not only extracts needs at the category level but also for multiple categories, i.e. toothpaste, kitchen appliances, skin treatment products and prepared foods. Likewise, [70] classifies documents containing innovation ideas across 20 categories of Amazon products. Similar to our approach, [35] extracts needs at the product category level using a keyphrase ranking approach. However, it only extracts and evaluates needs for one product category, i.e. toothpaste. In contrast, our approach extracts and evaluates needs across 15 different product categories in the area of CPG, which is required to show that a proposed approach generalizes beyond just one product category. To implement this training and evaluation across 15 product categories we make use of the aforementioned TCN dataset.

2.3.2 Business methodology.

A major application scenario of customer needs mining has sought to provide automated solutions for models/methodologies previously defined in the business literature. In [11, 13], the idea of an “opportunity algorithm” is implemented to find new or existing customer needs. As initially described in [84], this algorithm works on the basis that if a need has high importance but low satisfaction then a business opportunity is present. In [11, 13], these importance and satisfaction values are computed based on the frequency at which a need is discussed (importance) along with its sentiment (satisfaction).
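To make this logic concrete, below is a minimal sketch in Python. It assumes importance is derived from how often a need is discussed and satisfaction from its mean sentiment, both scaled to a 0 to 10 range, and it uses the commonly cited Ulwick-style combination formula; the exact scaling and formula used in [11, 13] may differ, and all values shown are hypothetical.

```python
def opportunity_score(importance: float, satisfaction: float) -> float:
    """Ulwick-style opportunity score (assumed form): needs with high
    importance and low satisfaction score highest."""
    return importance + max(importance - satisfaction, 0.0)

# Hypothetical keyphrase statistics derived from UGC:
# importance ~ discussion frequency (scaled 0-10), satisfaction ~ mean sentiment (scaled 0-10).
needs = {"long lasting": (8.2, 3.1), "whitening": (7.5, 7.9), "vegan": (4.0, 2.5)}

ranked = sorted(needs.items(), key=lambda kv: opportunity_score(*kv[1]), reverse=True)
for phrase, (imp, sat) in ranked:
    print(f"{phrase}: opportunity={opportunity_score(imp, sat):.1f}")
```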

Kansei (Japanese for “affective”) engineering is another business methodology that computational approaches have attempted to automate. It deals with translating human emotions toward a product into design elements [85]. Previous business studies have implemented this method through questionnaires, in which respondents rate their feelings on a point scale between two opposite pairs of words (one positive and one negative) known as Kansei attributes [86, 87]. Recently, however, automated studies using text mining and ML have tried to solve this problem using UGC, removing the time it takes to gather requirements through questionnaires [20, 88–90]. For example, [88] shows that their algorithm run over Amazon product reviews can extract emotions towards a customer need with high precision and recall.

Another business model addressed by computational methods is the Kano model [91], which is used in product development to weigh how much needs satisfy/dissatisfy customers. There have been many computational approaches to implementing this model [18, 22, 37, 92–94]. For example, [37] applied sentiment analysis to the output returned by LDA to obtain the levels of satisfaction and dissatisfaction of customer needs in the form of topics. In our analysis, we don’t implement any of the mentioned business methodologies specifically; however, we do use a particular study in Kansei engineering [88] to detect emotional words in UGC documents, which contributes additional features to our MTSC model for the prediction of future customer needs.

2.3.3 Types of customer needs.

Many of the studies mining customer needs in the literature don’t go beyond detecting general needs in UGC to identify ones that may be of more business interest, e.g. specifically looking for unmet needs. For example, some document classification studies only detect “purchasing intent” [64–67] or only distinguish posts containing a customer need [68, 69, 95] without determining whether the document contains information of higher business interest, e.g. an innovation that could disrupt the market. Similarly, some clustering approaches only detect groups of documents/terms forming needs discussed in UGC without highlighting ones that have more value, e.g. [24] detects groups of general fashion needs in Amazon and Rakuten reviews. The same can be said for some keyphrase ranking approaches, which consider the factors of frequency and sentiment when sorting phrases without specifically identifying ones of perceived business value [23, 38–41].

Some studies do, however, specifically focus on mining needs that are of business interest. [70] classifies documents that contain customer needs detailing product innovations, which could be seen as more important than detecting general needs. In [11, 13], the aforementioned “opportunity algorithm” is implemented, which identifies unmet needs by finding ones with high importance (keyphrase frequency) and low satisfaction (sentiment). This approach to identifying needs of greater interest to business builds on previous literature; however, it has been described as simplistic [84, 96, 97] and is criticized for this reason, i.e. not all unmet needs exhibit high importance and low satisfaction. Similarly to the opportunity algorithm, the Kano model also goes beyond detecting general customer needs by providing a categorization/prioritization framework of needs into 3 groups (although some studies extend to more): 1) basic/must-have—needs which, if left unfulfilled, lead to dissatisfaction; 2) performance/one-dimensional—needs which give a proportionate increase in satisfaction as they are invested in; and 3) excitement/attractive—needs which cause no decrease in satisfaction if not fulfilled but may give disproportionately high satisfaction if fulfilled [98–100]. There are many computational approaches to implementing the classification of these need types using the Kano methodology from UGC [18, 22, 37, 92, 93]. For example, [92] classifies LDA topics into the 3 mentioned categories plus 2 more (reverse and indifferent). The studies implementing this model go beyond just detecting general needs by providing businesses with more information on whether they should address them in their product, and hence may be of more use in certain product development scenarios.

Comparable to clustering approaches, studies in keyphrase ranking also attempt to identify unmet needs, though using different ideologies. In [35], unmet needs are detected using a rule-based approach by attempting to predict which needs will go on to be heavily addressed in products up to 3 years in the future. This idea rests on the premise that future needs are currently unmet, and therefore finding them is of interest to businesses. Other approaches have used this principle but have applied regression techniques rather than keyphrase ranking to solve the problem [101–103]. Similarly, our approach attempts to find unmet needs by predicting ones that will be popular in the future. To do this, we use the TCN dataset, which has been specifically designed for this purpose, i.e. finding future customer needs.

2.4 Evaluation

In the customer needs mining literature, the difficulty of defining an evaluation strategy depends on the task being solved. For document classification tasks (e.g. does this document contain a “customer need”), the evaluation is straightforward and ML metrics such as accuracy, precision, recall and F1 can be employed [37, 63–70]. However, for tasks such as clustering or topic modeling the evaluation is more difficult to define, i.e. how does one establish that a cluster/topic of keyphrases represents a customer need that is useful for product development? Due to this, some studies don’t evaluate their approaches at all and instead only demonstrate their usage [11–13]. Others use general intrinsic validation measures such as the Bayesian Information Criterion (BIC) [104] or the Silhouette score [17] to evaluate clustering solutions, or perplexity to evaluate topic models [25, 37, 105, 106]. However, these studies don’t show the validity of their solutions by evaluating extrinsically against clusters/topics of manually labeled customer needs, a necessary measure in Information Retrieval (IR) tasks [107]. The lack of a benchmark dataset for evaluating clusters/topics has recently been raised as an issue with studies evaluating these techniques [42].

When evaluating our approach, the two main branches of literature we consider are those for: 1) keyphrase-based evaluation; and 2) future customer needs evaluation. This is because our approach combines these two areas within the general landscape of customer needs mining. One common methodology for evaluating keyphrase algorithms is to examine the top set of keyphrases produced to obtain performance metrics [14, 15, 75], e.g. [14] examines a list of the top keyphrases generated from tweets. One of the main pitfalls of this approach is that it can only calculate precision but not recall, as a list of the total set of needs is not generated in the evaluation [42]. Other approaches do generate a list of ground truth keyphrases to calculate recall and precision [23, 38–41], but they don’t generate these lists for the specific task of mining needs for product development [42] (they do, however, help guide our evaluation). Approaches that predict future customer needs are mainly framed as regression problems [101–103], which apply metrics such as Prediction Error [101] or Mean Absolute Percentage Error [102]. These approaches have large prediction errors, indicating the difficulty of the future customer need prediction problem. As pointed out in [35], a problem with these approaches is that they don’t predict a meaningful dependent/target variable for product development (e.g. sales), and instead forecast other variables based on the sentiment or frequency of a keyphrase in UGC.

To evaluate our approach, we use the TCN dataset—a set of the top 20 keyphrases representing customer needs each month from 2011–2021 across multiple product categories. This dataset allows for an evaluation that is keyphrase-based and that assesses future customer needs. When using it for evaluation, we use traditional ML metrics (i.e. precision, recall, F1) in such a way that our algorithm can be assessed on its ability to identify future customer needs. In addition, we compute metrics that are used to evaluate lists of future customer needs. These metrics are the same ones calculated in the baseline approach [35], which we compare our algorithm to during our evaluation. The baseline is the only other available approach we are aware of that predicts future customer needs as keyphrases. The metrics calculated in the baseline are based on previous literature and represent a formulation of precision and recall that can be used for evaluating ranked lists, i.e. List Mean Precision and List Recall. They are also the metrics recommended in the TCN dataset study [42], which advises a repeatable evaluation methodology so that other researchers can benchmark their algorithms. The results in [35] show the difficulty of identifying future customer needs, with the algorithm achieving a List Mean Precision in the range of 10.3% to 15.8% and a List Recall of 2.1% to 4.6%. These results may seem low; however, given the quality and difficulty of the evaluation approach (as discussed in [35, 42]), they are to be expected for the difficult task of identifying future customer needs. An algorithm with this performance may seem of low benefit at first; however, given the outcome of potentially identifying future customer needs, which can be highly profitable for businesses [7], its value becomes more apparent.
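For illustration only, the sketch below shows one plausible way precision and recall over monthly ranked lists could be computed; the exact formulations of List Mean Precision and List Recall used in [35, 42] are not reproduced here, so the definitions, function names and toy values in this code are assumptions.

```python
def list_mean_precision(predicted_by_month: dict, truth_by_month: dict) -> float:
    """Mean over months of the share of predicted keyphrases that appear in the
    ground-truth list for that month (assumed formulation)."""
    precisions = []
    for month, predicted in predicted_by_month.items():
        truth = truth_by_month.get(month, set())
        if predicted:
            precisions.append(len(set(predicted) & truth) / len(predicted))
    return sum(precisions) / len(precisions) if precisions else 0.0

def list_recall(predicted_by_month: dict, truth_by_month: dict) -> float:
    """Share of all ground-truth keyphrases recovered in the corresponding month
    (assumed formulation)."""
    hits = total = 0
    for month, truth in truth_by_month.items():
        predicted = set(predicted_by_month.get(month, []))
        hits += len(truth & predicted)
        total += len(truth)
    return hits / total if total else 0.0

# Hypothetical toy example
preds = {"2018-01": ["charcoal", "vegan", "mint"]}
truth = {"2018-01": {"charcoal", "whitening"}}
print(list_mean_precision(preds, truth), list_recall(preds, truth))
```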

Table 1 summarizes the key studies mentioned in this section and shows the advantages of our approach and where it addresses the gap in the research when compared to other studies. The table shows that contributions are made in 1) finding unmet needs; 2) using supervised learning and MTL to mine for customer needs; 3) evaluating on the product category level (e.g. mobile phone) as opposed to the product model level (e.g. iPhone 2); and 4) evaluating using multiple product categories (e.g. lip balm, toothpaste and beer compared to just beer).

Table 1. Summary of studies in the customer needs mining literature.

https://doi.org/10.1371/journal.pone.0307180.t001

3 Methodology

Fig 1 outlines the keyphrase prediction problem addressed in this study. In this section, the main approach used to tackle this problem is described. In brief, the proposed approach aims to extract candidate keyphrases from social media data that predict keyphrases representing future customer needs, i.e. in a future time period within TCN. The task itself is the exact one performed in [35] (i.e. future customer need prediction); however, in our evaluation it is analyzed on several different product categories with the use of MTL, whereas [35] evaluates it on only one, i.e. toothpaste. The algorithm makes use of the timestamp associated with each social media post to make predictions at each Fixed Time Window, i.e. it predicts keyphrases as customer needs each month (as in Fig 1). To make predictions, the social media algorithm considers data from Previous Time Windows. This is seen in Fig 1, where the algorithm uses data from 3 Previous Time Windows (i.e. April 2018, May 2018 and June 2018) to produce its final predictions for an individual Fixed Time Window (i.e. July 2018). Data from these Previous Time Windows is required due to the nature of the method being used (requiring past data for its computation). The overall prediction task is then to observe whether keyphrases from social media can be used to predict ones that appear in the TCN dataset at a future time period. This is seen in Fig 1, where the keyphrases predicted in the Fixed Time Window by the algorithm run over the social media data (i.e. June 2018) predict the keyphrases in the TCN dataset for a specific product category (i.e. June 2021, July 2021, August 2021, etc.). It is of note that the exact time frame seen in Fig 1 is not the one used in the experimental setup; it is used only to illustrate how the task is performed.

Fig 1. Overview of task: Using past social media data to predict trending keyphrases in future customer needs addressed in products i.e. in the TCN dataset.

https://doi.org/10.1371/journal.pone.0307180.g001

In our experiments, for each product category in the analysis, the TCN dataset is used to train and evaluate a supervised keyphrase extraction model run over social media to identify future customer needs. This is framed as a binary classification problem by checking whether a keyphrase on social media appears in the TCN dataset in some future time period for each product category i.e. keyphrase does/doesn’t appear in the TCN dataset in the future. The significance of this is that the TCN dataset contains top keyphrases addressed in products and therefore by predicting what will occur in it we effectively forecast what new customer needs will be heavily addressed in future products e.g. predict the top needs for breakfast cereal items.
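The following is a minimal sketch of how such a binary label could be constructed, assuming the TCN ground truth is available as a set of (month, keyphrase) pairs per product category and that the prediction horizon is 12 to 36 months ahead of the Fixed Time Window; the function names and the exact horizon handling are illustrative assumptions rather than the paper's implementation.

```python
from datetime import date

def months_between(start: date, end: date) -> int:
    return (end.year - start.year) * 12 + (end.month - start.month)

def future_need_label(keyphrase: str, fixed_window: date, tcn_entries: set,
                      min_horizon: int = 12, max_horizon: int = 36) -> int:
    """1 if the keyphrase trends in TCN 1-3 years after the Fixed Time Window, else 0."""
    for tcn_month, tcn_phrase in tcn_entries:
        if tcn_phrase == keyphrase and min_horizon <= months_between(fixed_window, tcn_month) <= max_horizon:
            return 1
    return 0

# Hypothetical example: "charcoal" trends in TCN for toothpaste in June 2016,
# 25 months after the Fixed Time Window May 2014, so the label is 1.
tcn_toothpaste = {(date(2016, 6, 1), "charcoal")}
print(future_need_label("charcoal", date(2014, 5, 1), tcn_toothpaste))
```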

Fig 2 outlines the algorithm’s approach when predicting future customer need keyphrases at each Fixed Time Window on social media for a specified product category. First, social media data is collected from Reddit (e.g. a corpus of posts for the category “soup”). Secondly, the posts are cleaned using text preprocessing techniques (e.g. tokenization, lemmatization, etc.) and candidate keyphrases are selected for the classification task. Thirdly, a large number of univariate time series are generated for each candidate keyphrase for the MTSC task. These series range from linguistic-based (e.g. dependency or part-of-speech tags) to sentiment-based information series. Finally, an MTSC model is trained on the univariate series associated with the candidate keyphrases, which carry a binary output label determined by whether the keyphrase appears in the TCN dataset in the future, i.e. is a future customer need. This general framework of classifying keyphrases from social media using MTSC techniques has been used before, with [108] generating similar families of features to our study (e.g. sentiment and content-based) to distinguish between “organic” and “promoted” hashtags on Twitter. Many of the core concepts in the methodology of [108] are also used in our study, especially when generating univariate time series for each candidate keyphrase (or hashtag in their study). However, the types of univariate time series we generate are tailored to the task of finding future customer needs, e.g. time series encoding whether a keyphrase occurs in posts discussing products, to help the model find keyphrases which are customer needs.

Fig 2. Overview of methodology used to predict keyphrases representing future customer needs.

https://doi.org/10.1371/journal.pone.0307180.g002

All of the numbered components in Fig 2 make up the subsections of this section with the addition of a further section to explain the types of univariate time series created to solve the classification task: 1) Data Collection—scraping Reddit data for each product category analyzed (Section 3.1); 2) Text Preprocessing & Keyphrase Selection—preprocessing posts and selecting candidate keyphrases (Section 3.2); 3) Generate Multiple Time Series For Each Keyphrase—discussing how time series are generated for each keyphrase for the classification task (Section 3.3); 4) Families of Univariate Time Series Generated—examining the families of univariate time series computed for the task of identifying future customer needs (Section 3.4); and 5) Time Series Classification—detailing how the ground truth label is added in the binary classification set-up and reviewing the MTSC algorithm applied (Section 3.5). As shown in Fig 2, we also discuss the use of MTL which is one of the main contributions of this study (detailed in Section 3.5). Here we describe the approach of building a single model capable of accurately predicting future customer needs in any product category.

3.1 Data collection

The social media data we use in our study is from Reddit. Some previous approaches using Reddit have looked at specific subreddits when mining customer needs for a specific product category [11–13], e.g. r/mobiles for mobile phone products. Instead, our approach searches for posts containing target keyphrase(s) representing the product category under analysis [16, 35], e.g. the target keyphrases “cookie” and “biscuit” are searched for when analyzing the Cookie product category. We confirm that we have permission to use all the data collected in this analysis. We also confirm that the collection and analysis methodology complies with the terms and conditions of the data owners. This work received an ethics waiver from the UCD Human Research Ethics Committee—Sciences (HREC-LS) under ref LS-E-20-81-Kilroy-Caton.

Table 2 shows the 15 product categories we analyze in this study. We collect data for each of these categories from 2011-01-01 to 2018-12-31. When choosing categories to analyze, we are restricted to ones that are in TCN (i.e. the ground truth dataset)—initially totaling 37 CPG product categories. From these 37, we are further restricted by categories having too few Reddit posts to analyze, e.g. the product category Dishwashing Liquid has only 14,161 posts from 2011-01-01 to 2018-12-31 across the searched target keyphrases: “dishwashing liquid”, “washing up liquid”, “wash up liquid”, “dishwasher detergent” and “dishwashing detergent”. From the remaining categories in TCN, we selected 15 for analysis to reduce computational requirements. For these 15 categories, we select a diverse range of category classes from the ones stated within TCN, i.e. Health & Beauty (e.g. eyeliner), Pet (e.g. dog food), Food (e.g. cookie) and Drink (e.g. beer). This is done to show that our approach can work on multiple different classes of product categories, e.g. not just Food and Drink.

Table 2. Overview of Product Categories used in analysis along with the corresponding: a) searched Target Keyphrase(s) on Reddit; b) total Number of Posts (rounded to nearest thousand) for each category we collect on Reddit; and c) the date the TCN ground truth is available.

https://doi.org/10.1371/journal.pone.0307180.t002

Table 2 also shows the target keyphrase(s) searched for when collecting Reddit posts. When deciding which keyphrases are to be used as search terms we first include the product category name as a target keyphrase e.g. “beer” for the category Beer. Secondly, we include any obvious synonyms of the category name e.g. “soft drink” for the category Soda. Finally, we include any potential spelling variations or misspellings of the category name as it could account for a substantial number of posts missed on Reddit e.g. “eye liner” for the category Eyeliner.

Additionally, Table 2 shows the total number of posts we collect for each product category. To keep computational requirements reasonable for the following stages of the approach, we limit the number of posts analyzed per product category. We do this by randomly sampling posts for each category for which we scrape data. Specifically, we employ disproportionate stratified random sampling at each Fixed Time Window (or stratum) [109, 110]. That is to say, in our case, we sample a maximum number of posts at each Fixed Time Window regardless of its size relative to the total number of posts. We do this to ensure that there is a sufficient number of posts in each Fixed Time Window. Specifically, the maximum sampling rate at each Fixed Time Window is 5000 posts, even though some windows don’t have this many posts, e.g. early in 2011 when Reddit uptake was low.
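A minimal sketch of this collection and sampling step is shown below. It assumes the publicly documented Pushshift search endpoint and its q/after/before/size parameters as they existed when the data was collected (access rules have since changed, as noted above); the function name and per-request batch size are illustrative.

```python
import random
import requests

PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission"

def collect_month(query: str, after_epoch: int, before_epoch: int, cap: int = 5000) -> list:
    """Collect posts mentioning a target keyphrase within one Fixed Time Window,
    then cap the window at 5000 posts (disproportionate stratified sampling)."""
    posts, last = [], after_epoch
    while True:
        resp = requests.get(PUSHSHIFT_URL, params={
            "q": query, "after": last, "before": before_epoch, "size": 500, "sort": "asc"})
        batch = resp.json().get("data", [])
        if not batch:
            break
        posts.extend(batch)
        last = batch[-1]["created_utc"]  # advance the window (a production version would guard against ties)
    return random.sample(posts, cap) if len(posts) > cap else posts
```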

Furthermore, Table 2 shows the dates for which the ground truth data from TCN is available for each product category. Unfortunately, TCN only started collecting ground truth data for some categories on 2018-01-01. This makes some of the experiments in our study more challenging as it’s not possible to use these categories in the current evaluation set-up of our model training process (detailed in Section 4). These categories can be used in the model testing process, however, hence their inclusion in the analysis.

We refer the reader to our GitHub repository (Section 1) for the release of Reddit IDs associated with each post from every category analyzed in the study.

3.2 Text preprocessing & keyphrase selection

For each product category analyzed, text preprocessing and keyphrase selection techniques are applied to automatically choose keyphrases from the social media post data—required to perform the future customer need classification task. For all the main preprocessing tasks implemented in this section, the Python library spaCy is used [111] i.e. sentence boundary detection, lemmatization and POS tagging. Specifically, the en_core_web_lg model from spaCy trained on OntoNotes 5 [112] is used to perform these tasks, which achieves high performance across many general NLP problems.

Fig 3 outlines the steps we perform to select candidate keyphrases from social media posts. The first step we carry out is sentence boundary detection [113] i.e. splitting a post into an array of sentences. We do this as we only extract the sentence where the searched “Target Keyphrase(s)” (i.e. Table 2) is mentioned e.g. when mining for the Lip Balm product category we only search for sentences containing “lipbalm”, “lip balm” or “chapstick”. As in [35], we do this as posts on Reddit can be quite large, with much of the discussion unrelated to the product category being analyzed.
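A minimal sketch of this first step, using the same spaCy en_core_web_lg model named above; the function name and the example targets (taken from the Lip Balm example) are illustrative.

```python
import spacy

nlp = spacy.load("en_core_web_lg")  # model used in the study

def target_sentences(post_text: str, targets: list) -> list:
    """Split a post into sentences and keep only those mentioning a searched target keyphrase."""
    doc = nlp(post_text)
    return [sent.text for sent in doc.sents
            if any(t in sent.text.lower() for t in targets)]

print(target_sentences("I switched lip balm last week. The weather was awful.",
                       ["lipbalm", "lip balm", "chapstick"]))
```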

Fig 3. Overview of text preprocessing & keyphrase selection in order to extract candidate keyphrases.

https://doi.org/10.1371/journal.pone.0307180.g003

Secondly, the sentence is tokenized, lemmatized and uncased, as performed in many other studies using keyphrase extraction [114116]. Tokenization is needed as it is the first step required to separate candidate keyphrases for the classification task, while lemmatization and uncasing are carried out to group inflected phrases together.

Thirdly, multi-word phrases are formed by taking the set of all possible consecutive n-grams in the range of 1 to 4 grams [117, 118]. This process is seen in Fig 3 by forming multi-word phrases from unigrams e.g. “coconut_lip” from the consecutive words of “coconut” and “lip”.
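A minimal sketch of the tokenization/lemmatization/uncasing step followed by the 1- to 4-gram construction; joining multi-word phrases with underscores follows the coconut_lip example above, and the function name is illustrative.

```python
import spacy

nlp = spacy.load("en_core_web_lg")

def candidate_ngrams(sentence: str, max_n: int = 4) -> set:
    """Lemmatize and lowercase tokens, then form all consecutive 1- to 4-grams,
    joining multi-word phrases with underscores (e.g. 'coconut_lip')."""
    tokens = [tok.lemma_.lower() for tok in nlp(sentence)
              if not tok.is_punct and not tok.is_space]
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.add("_".join(tokens[i:i + n]))
    return grams

print(candidate_ngrams("Loving the coconut lip balm scent"))
```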

Finally, only n-grams with specific POS tag combinations are kept for the next stages of the analysis, as in [119, 120]. For our task, we require tag combinations that correspond to customer needs. To do this, we extract tag combinations of phrases that are already labelled as customer needs in the TCN dataset. Specifically, we extract all the keyphrases recorded across 5 product categories in the dataset i.e. Vitamins & Dietary Supplements, Cat Food, Pasta Sauce, Tea and Potato Snacks. These categories are not used in the primary analysis (Table 2) to avoid any potential bias in our experimental evaluation. They are also diverse in category classes including Pet Food (i.e. Cat Food), Drink (i.e. Tea), etc. This diversity is necessary as POS tag combinations associated with keyphrase customer needs are different across category class types e.g. the POS tag combinations in Pet Food are different to Drink tag combinations. To extract these tag combinations from the keyphrases in TCN, we run the same en_core_web_lg model over them to identify their POS. In total, we identified 31 tag combinations from the 5 product categories. These are made up of single-word combinations (e.g. nouns like chicken or adjectives like energetic) as well as multi-word combinations (e.g. adjective phrases like micro-cleaning). All the POS tag combinations identified are contained within the single POS tags of nouns, verbs, adjectives, adverbs or proper nouns. This is expected as the TCN dataset instructed annotators to only label customer needs with these POS tags [42]—https://github.com/davidkilroy/TCN-Dataset. A complete list of these POS tags can be found in the GitHub repository which accompanies this study (Section 1). It’s important to note that there is similar work for generating task-specific POS tag combinations for keyphrase extraction e.g. [119] generates a list of tags for the extraction of computational linguistic terms.
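A minimal sketch of this final filtering step; the allowed tag combinations shown are only an illustrative subset of the 31 combinations derived from TCN (the full list is in the accompanying GitHub repository), and the function name is illustrative.

```python
import spacy

nlp = spacy.load("en_core_web_lg")

# Illustrative subset of allowed coarse POS tag combinations derived from TCN keyphrases.
ALLOWED_TAG_COMBOS = {("NOUN",), ("ADJ",), ("PROPN",), ("ADJ", "NOUN"), ("NOUN", "NOUN")}

def keep_candidate(phrase: str) -> bool:
    """Keep an n-gram only if its POS tag sequence matches an allowed combination."""
    tags = tuple(tok.pos_ for tok in nlp(phrase.replace("_", " ")))
    return tags in ALLOWED_TAG_COMBOS

print([p for p in ["coconut", "coconut_lip", "the", "very"] if keep_candidate(p)])
```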

3.3 Generating multiple time series for each keyphrase

In this section, we describe how we transform the collection of preprocessed posts for each product category (discussed in Section 3.2) into a form suitable for the prediction of keyphrases using techniques from MTSC. Fig 4 shows an example output of the data we produce. As with classical ML, we generate several features for each candidate keyphrase. However, for our task, the value in each feature field is not an individual number but a univariate time series. For each candidate keyphrase instance, these univariate series together make up a multivariate series (required for the task of predicting future customer needs using MTSC techniques). In this section, we solely describe the process of going from a collection of preprocessed posts (i.e. the output of Section 3.2) to multivariate time series data for each keyphrase (as in Fig 4). In the next sections, we discuss the types of time series features generated (Section 3.4) before detailing how the ground truth label from the TCN dataset is added to each candidate keyphrase instance along with the MTSC techniques used to classify them (Section 3.5).

Fig 4. Example output of data suitable for multivariate time series classification.

https://doi.org/10.1371/journal.pone.0307180.g004

Fig 5 shows a top-down view of the transformations performed to move from a collection of preprocessed posts to a set of candidate keyphrases with multiple associated features in the form of univariate time series. Each of the steps in the figure will be described in this section. At a high level we do the following: 1) Add Additional Features—text-based models are run over the post data e.g. running a text classification “Buy Intent” model from HuggingFace over posts; 2) Group Keyphrases & Summarize Features—features are summarized for each candidate keyphrase at each Fixed Time Window (i.e. month) e.g. for the keyphrase charcoal calculate the mean “Buy Intent”; and 3) Turn Into Time Series—for each keyphrase at a given Fixed Time Window (i.e. month) the features are turned into individual univariate time series by obtaining the values each month for the keyphrase of interest 36 months into the past (i.e. number of Previous Time Windows) and sorting them by time e.g. for charcoal on the 2014-01-01 find the “Mean Purchase Intent” each month from 2011-01-01 to 2014-01-01. As previously discussed, this entire generation process is performed the same way as [108] by constructing time series for keyphrases (or hashtags in their case) based on calculating summary statistics from the posts it occurs in at each Fixed Time Window. We do, however, generate different time series in our study to reflect the task of recognizing future customer needs (discussed in Section 3.4), which is different from classifying between “organic” and “promoted” hashtags [108].

Fig 5. Overview of preprocessing to extract candidate keyphrases.

https://doi.org/10.1371/journal.pone.0307180.g005

The first step of the approach is to generate several additional features about the post or the candidate keyphrases of interest within it. When calculating post-level features, we apply various text-based models to the sentence containing the target keyphrase and record their output. As seen in Fig 5, the chosen models may record document-level information such as Buy Intent from the Hugging Face library—distinguishing posts containing “buy intent”. We include these features as we believe there is justification for them improving the task of predicting future customer needs. A detailed list of all the features used, along with the reasons they are added, is given in the next section (Section 3.4).
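A minimal sketch of recording such a post-level feature with a Hugging Face text-classification pipeline. The exact Buy Intent model used in the study is not named in this section, so a widely available sentiment model is used here purely as a stand-in to show the mechanics; the function name is illustrative.

```python
from transformers import pipeline

# Stand-in classifier: substitute the actual Buy Intent model used in the study.
clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english")

def post_level_features(sentence: str) -> dict:
    """Run a text classifier over the sentence containing the target keyphrase
    and record its output as post-level features."""
    pred = clf(sentence)[0]  # e.g. {'label': 'POSITIVE', 'score': 0.98}
    return {"label": pred["label"], "score": pred["score"]}

print(post_level_features("I really want to buy this charcoal toothpaste."))
```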

The second step is to group candidate keyphrases in the posts together by Fixed Time Window (i.e. month) and calculate summary statistics for them based on the posts they appear in. This is seen in Fig 5, where we compute the “Mean Buy Intent” for the keyphrase charcoal on 2011-01-01. Depending on the data type of the feature being summarized different summary statistics are computed. For example in Fig 5, the mean can be calculated for the “Buy Intent” feature as it’s a continuous feature, however, it can’t be calculated for the “Post Type (sub/com)” feature as it’s a string. Here “sub” and “com” refer to Reddit submissions (main posts) and comments (comments on submissions).

In total, we calculate summary statistics for four different types of features (three of which are data types): 1) continuous features, 2) boolean features, 3) string features and 4) keyphrase-level features. Continuous features are columns where the fields contain numerical values (e.g. 4, 5.1, etc). For these features we compute the following summary statistics a) mean, b) maximum, c) minimum and d) sum. Boolean features are columns where the fields contain True or False values. For these features, we only compute the percentage of posts that are True as a summary statistic of the feature column. We only record True, as False can be inferred from it, and hence does not add any additional information to the ML model. In some situations, however, we record False along with the value Not a Number (NaN) as it appears in some fields provided by the Reddit Pushshift API (e.g. “is_video” field). String features are columns where the fields contain a sequence of characters e.g. “submission” for the “Post type (sub/com)” in Fig 5. For string features, we carry out type matching of a user-defined string and report the percent of posts that contain the matched string as a summary statistic e.g. percent of posts that are “submissions” based on the “Post type (sub/com)” feature (as in Fig 5). The process for choosing the types of strings to search for is dependent on the feature summarized. In some cases, it’s an exhaustive list of all possible strings in the string-based feature column (e.g. “submission” and “comment” for the “Post type (sub/com)” feature) and in other cases, only specific strings are searched for due to there being too many values in the feature (e.g. subreddit feature column). Keyphrase-level statistics are obtained by retrieving an attribute from the keyphrase, therefore accounting for just one summarized statistic. As seen in Fig 5, an example of a keyphrase-level feature is the Document Frequency at a given Fixed Time Window. Another example may include a dimension of a word embedding for a keyphrase. Table 3 outlines the discussed summary statistics we compute in this study. It’s important to note that it’s possible to include additional statistics (e.g. median for continuous types), however, for computational reasons we decided to keep this number low. The current statistics are just put in place to test whether the general framework works i.e. to see whether it can learn what future customer need trends look like.
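A minimal pandas sketch of this grouping and summarization step for the data-type-based feature families described above (continuous, boolean and string); the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical keyphrase-mention table: one row per (keyphrase, post) occurrence.
mentions = pd.DataFrame({
    "keyphrase":  ["charcoal", "charcoal", "charcoal"],
    "month":      ["2011-01", "2011-01", "2011-02"],
    "buy_intent": [0.9, 0.4, 0.7],                           # continuous feature
    "is_video":   [False, True, False],                      # boolean feature
    "post_type":  ["submission", "comment", "submission"],   # string feature
})

summary = (
    mentions.groupby(["keyphrase", "month"])
    .agg(mean_buy_intent=("buy_intent", "mean"),
         max_buy_intent=("buy_intent", "max"),
         min_buy_intent=("buy_intent", "min"),
         sum_buy_intent=("buy_intent", "sum"),
         pct_is_video=("is_video", "mean"),                                  # share of True posts
         pct_submission=("post_type", lambda s: (s == "submission").mean()))
    .reset_index()
)
print(summary)
```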

The third and final step of the approach is to turn each summary statistic feature into an individual univariate time series. This is done by obtaining the monthly values of a specific feature for the candidate keyphrase of interest 36 months into the past (i.e. the number of Previous Time Windows) and ordering these values by time, e.g. for charcoal on 2014-01-01, find the “Mean Purchase Intent” each month from 2011-01-01 to 2013-12-31. As seen in Fig 5, the final output of this is multiple univariate time series for each candidate keyphrase instance per month (i.e. Fixed Time Window).
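A minimal sketch of this rolling-window construction, assuming the monthly summary table produced in the previous sketch; filling months in which the keyphrase does not appear with 0, and the function name, are illustrative assumptions.

```python
import pandas as pd

def univariate_series(summary: pd.DataFrame, keyphrase: str, feature: str,
                      fixed_window: str, n_previous: int = 36) -> pd.Series:
    """Return the feature values for the keyphrase over the 36 months preceding
    the Fixed Time Window, ordered by time (missing months filled with 0)."""
    end = pd.Period(fixed_window, freq="M") - 1           # last Previous Time Window
    months = pd.period_range(end - (n_previous - 1), end, freq="M")
    sub = summary[summary["keyphrase"] == keyphrase].copy()
    sub["month"] = pd.PeriodIndex(sub["month"], freq="M")
    return sub.set_index("month")[feature].reindex(months, fill_value=0)

# e.g. the "Mean Buy Intent" series for "charcoal" at the Fixed Time Window 2014-01
# series = univariate_series(summary, "charcoal", "mean_buy_intent", "2014-01")
```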

As we build time series based on data 36 months into the past (i.e. 36 Previous Time Windows), we reduce the time frame in which we have multivariate time series data. When collecting Reddit post data, we scraped between 2011-01-01 and 2018-12-31 for each product category. However, due to this construction step, we only have time series data from 2014-01-01 to 2018-12-31. Another consequence of the construction step is that instances with the same keyphrase for the same product category in nearby Fixed Time Windows (i.e. months) are highly similar. This is a result of the rolling window nature of our approach when constructing the time series, e.g. the two instances of “charcoal” in the product category Toothpaste for the months 2014-05-01 and 2014-06-01 are nearly the same. As we describe later in our evaluation (i.e. Section 4), this constrains the training and testing data to be separated by at least 36 Previous Time Windows for each instance, i.e. the time period required for the instances to no longer share any time series data. Therefore in our evaluation, the training period for each category is between 2014-01-01 and 2014-12-31 and the testing period is between 2018-01-01 and 2018-12-31. We do this because 2014-12-31 and 2018-01-01 are separated by 36 months, which avoids any potential train/test overlap. Because the training period is between 2014-01-01 and 2014-12-31, we can only train on 7 product categories in our analysis, as there is only TCN ground truth available for these 7 for those dates (as seen in Table 2). Although the remaining 8 categories are not used in the training process, they are used in our evaluation (Section 4).

Each month when generating candidate keyphrases, the approach also only considers keyphrases for the classification task that exceed a minimum frequency over the past 36 months (i.e. the length of the Previous Time Windows). Specifically, we apply a minimum document frequency of 0.00005 (as in [35]) and a minimum raw count of 2 over the past 36 months. These thresholds are very lenient: candidate keyphrases that fail them are unlikely to become future customer needs that trend in the TCN dataset, i.e. to be addressed as top customer needs in future products. Applying these thresholds also means we summarize fewer candidate keyphrases (we cut off the long tail of the distribution), which reduces the computational resources required.
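The filter itself is straightforward; a minimal sketch follows, where keyphrase_stats is a hypothetical mapping of each candidate keyphrase to its document frequency and raw count over the previous 36 months:

# Minimal sketch of the candidate filter; `keyphrase_stats` is a hypothetical
# mapping of keyphrase -> (document frequency, raw count) over the past 36 months.
MIN_DOC_FREQ = 0.00005   # as in [35]
MIN_RAW_COUNT = 2

keyphrase_stats = {"charcoal": (0.0012, 431), "zzyzx": (0.00001, 1)}

candidates = [kp for kp, (df36, cnt36) in keyphrase_stats.items()
              if df36 >= MIN_DOC_FREQ and cnt36 >= MIN_RAW_COUNT]
print(candidates)   # ['charcoal']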

To obtain a better understanding of the process described in this section, we refer the reader to our GitHub repository (Section 1), which contains a small random sample of the data generated at each of the steps detailed here, i.e. at each of the processes in Fig 5.

3.4 Families of univariate time series generated

In our analysis, we can split the total number of unique univariate time series into families of series, as shown in Table 4. In total, there are 1263 univariate time series from 10 families of series. The idea behind including this large feature set is to learn what a future customer need instance looks like on the social media platform Reddit. In this section, we give an overview of these families of features with a rationale behind the selection process for their inclusion. The appendices give a more in-depth description of how these features are generated.

The Reddit Information Based Series are generated from attributes that are provided with each post from Pushshift, a historical Reddit API [60]—https://pushshift.io/api-parameters/. These collected attributes range from the score (i.e. the number of upvotes minus the number of downvotes on a post) to whether the post contains a video. Many of the time series generated here (e.g. a time series generated from an attribute about a post containing a video) may not be directly useful in the multivariate problem of detecting whether a keyphrase will become a future customer need addressed in real products. However, some features are directly useful, such as the series derived from the score or the number of comments; other studies on Twitter have used retweet and like attributes to identify future product trends [121]. Refer to S1 Appendix for a more detailed description of the exact features created for the Reddit information-based series.

The time series from the Frequency Based Series are generated from different statistics about each candidate keyphrase’s occurrence in each Fixed Time Window (i.e. month). All of these features are keyphrase-level statistics (as described in Section 3.3), e.g. document frequency. Measures of keyphrase frequency have been used in previous tasks identifying customer needs from social media [35, 44] and are therefore useful in this classification task. Refer to S2 Appendix for a more detailed description of the exact features generated for the frequency-based series.

For the Product Information Based Series we generate features from pre-trained models which are run over Reddit posts. These models all try to capture whether a post is “product-related” in some sense, e.g. whether a post contains purchase intent [67]—https://huggingface.co/j-hartmann/purchase-intention-english-roberta-large. These types of features are good at identifying customer needs [68], hence their inclusion. Refer to S3 Appendix for a more detailed description of the exact features generated for the product information-based series.
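As an illustration, post-level purchase-intent scores could be obtained from the linked Hugging Face checkpoint roughly as follows; the output label names and any pre/post-processing are assumptions, so the model card should be consulted rather than this sketch:

from transformers import pipeline

# Load the purchase-intention checkpoint linked above; the output label names
# are an assumption here, so consult the model card before relying on them.
clf = pipeline("text-classification",
               model="j-hartmann/purchase-intention-english-roberta-large",
               top_k=None)

post = "Thinking about finally buying a charcoal toothpaste, any recommendations?"
print(clf(post)[0])   # list of {'label': ..., 'score': ...} dicts for the post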

For the Sentiment Based Features, as with the Product Information Based Features, we generate features from a pre-trained model that is run over Reddit posts. Specifically, we summarize the outputs of a model trained on the GoEmotions dataset [122], which contains 28 output class labels representing emotions (e.g. anger, caring, disappointment, excitement). Sentiment has been widely used in the customer needs mining literature [37, 105], hence its inclusion as a feature. Refer to S4 Appendix for a more detailed description of the exact features generated for the sentiment information-based series.

For the Question Detection Based Series, as with some other features discussed in this section, we generate features from models that are run over Reddit posts. These models try to detect whether a post is asking a question or stating an answer. These features are included under the hypothesis that future customer need keyphrases appear more often in posts containing questions or statements, e.g. people asking what charcoal toothpaste was before it became a popular customer need in toothpaste products—https://www.reddit.com/r/NaturalBeauty/comments/2s6h2u/best_homemade_whitening_toothpaste/. Refer to S5 Appendix for a more detailed description of the exact features generated for the question detection-based series.

For the Embedding Based Series, as with some other features discussed in this section, we generate features from models that are run over Reddit posts. We do this by using pretrained document and phrase embedding models with the Python libraries SBERT [123], spaCy [111] and fastText [124]. Embedding information has already been used to identify customer needs in other studies [68, 125] (although for document classification). It is also plausible that embeddings provide predictive information for our classification task: past trending phrases with similar meanings likely occupy a similar region of the vector space (phrase embeddings), while customer need keyphrases may appear in documents similar to those containing past trending ones (document embeddings). Refer to S6 Appendix for a more detailed description of the exact features generated for the embedding-information based series.
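As a hedged illustration of how a single embedding feature might be produced (the checkpoint named below is a generic SBERT model chosen only for the example, not necessarily one used in our pipeline; see S6 Appendix for the actual models):

from sentence_transformers import SentenceTransformer

# Embed a candidate keyphrase and a post with a pretrained SBERT model; each
# embedding dimension becomes one keyphrase-level statistic. The checkpoint
# below is a generic example, not necessarily the one used in our pipeline.
model = SentenceTransformer("all-MiniLM-L6-v2")

phrase_vec = model.encode("charcoal")                                           # phrase embedding
doc_vec = model.encode("Has anyone tried charcoal toothpaste for whitening?")   # document embedding
print(phrase_vec.shape, doc_vec.shape)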

For the Subreddit Based Series, we summarize subreddit (discussion forum on Reddit) information associated with each post. As the subreddit feature on Reddit is a string (e.g. r/AskReddit, r/Music), we search for defined strings to generate a statistic for each candidate keyphrase (as described in Section 3.3). The defined strings we search for come from 100 of the most subscribed subreddits at the time of experimentation, resulting in 100 new univariate time series features in the classification problem. Subreddit information has been used in previous research applying Reddit to identify future customer needs [35]. It is useful because certain subreddits may be places where new trends are discussed, e.g. in the subreddit r/eli5 people may ask questions they want answered, such as the best ingredients for smoother lips (lip balm) or whiter teeth (toothpaste). Refer to S7 Appendix for a more detailed description of the exact features generated for the subreddit information-based series.

For the Kansei Engineering Based Series, we classify posts based on whether they contain one of the words in a Kansei group. Kansei engineering has been described as “translating technology of a consumer’s feeling and image for a product into design elements” [85]. Recently, it has become an important topic in the customer needs mining literature for product development, with many computational approaches to it being built [19, 20, 88–90]. Traditional non-computational approaches to Kansei engineering use questionnaires in which groups of words called Kansei attributes measure a user’s feelings towards a customer need. Kansei attributes consist of pairs or groups of bipolar words from which respondents choose to indicate their feeling towards a product, e.g. 1) unique-personalized-rare vs common-general; 2) quality-reliable-sturdy-safe vs unreliable; or 3) novel-fresh-interesting vs boring. To retrieve a list of these Kansei attributes, we follow the work in [88], which first identifies 16 groups of bipolar Kansei attributes from 10 previous Kansei engineering studies (mostly from the last decade) and then expands on these attributes using an automated method. Refer to S8 Appendix for a more detailed description of the exact features generated for the Kansei Engineering information-based series.
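A minimal sketch of this keyword-group feature is shown below; the two groups are illustrative samples of the bipolar attributes quoted above rather than the full set from [88]:

# Minimal sketch of the Kansei feature: flag whether a post contains any word
# from a Kansei attribute group. The two groups below are illustrative samples
# of the bipolar attributes quoted above, not the full set from [88].
KANSEI_GROUPS = {
    "unique": {"unique", "personalized", "rare"},
    "novel":  {"novel", "fresh", "interesting"},
}

def kansei_flags(post_text):
    tokens = set(post_text.lower().split())
    return {group: bool(tokens & words) for group, words in KANSEI_GROUPS.items()}

print(kansei_flags("This charcoal toothpaste feels really fresh and unique"))
# {'unique': True, 'novel': True}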

For the Linguistic Based Series, as with some other features discussed in this section, we summarize features from models that are run over Reddit posts. We also analyze keyphrase-level statistics. All the univariate series we generate either represent 1) tagging information (e.g. POS, dependency labels, etc.), 2) document information (e.g. post length) or 3) phrase-level information (e.g. the number of vowels, whether it contains an @ symbol, etc.). Refer to S9 Appendix for a more detailed description of the exact features generated for the linguistic information-based series.

For User Based Series we generate features based on authors (i.e. users on Reddit). The use of author information has been seen in many of the social media forecasting topics already discussed in this study e.g. predicting customer needs [95] and detecting future occurrences using MTSC [108]. Refer to S10 Appendix for a more detailed description of the exact features generated for the user information-based series.

3.5 Time series classification

In this section, we describe the time series techniques used to address the future customer needs keyphrase classification problem. Specifically, we discuss 1) how the ground truth label is added to each candidate keyphrase from the TCN dataset; 2) the MTSC algorithm used for the task (i.e. Supervised ML); and 3) the use of MTL to build a single model capable of identifying future customer needs in any product category.

As previously discussed, the TCN dataset consists of the top 20 most addressed customer needs in products each month from 2014-01-01 across multiple product categories. Fig 6 shows how we add this data as the ground truth label for each multivariate time series instance, where each instance is a candidate keyphrase in a given Fixed Time Window (i.e. month) for a particular product category (i.e. Toothpaste in Fig 6). A binary output label indicates whether the phrase appears in the TCN dataset for the product category being analyzed 1–3 years in the future. This is seen in the figure for the term “charcoal” on 2014-01-01, which has a positive output label (i.e. Yes) given that it appears as a top customer need in TCN 3 years in the future, i.e. on 2018-01-01. The next instance “bread” has a negative output label (i.e. No) as it doesn’t appear in the TCN dataset during that time period. It’s important to understand that the main objective of adding the ground truth label to the instances is to train and evaluate a MTSC algorithm that predicts customer needs ahead of time, before they hit the marketplace (specifically 1–3 years ahead). The main premise behind this is that future customer needs in a product dataset (i.e. TCN) represent needs that are currently unmet and are therefore valuable for businesses to identify. Although not seen in Fig 6, it’s also worth pointing out that there is a high degree of data imbalance between the positive and negative labels in the final ground truth column (i.e. the “Trend in TCN 1–3 Years in the Future?” column in Fig 6). To clarify, there are far more negative instances than positive ones because several thousand instances are analyzed in each Fixed Time Window (i.e. month) while there are only 20 customer needs 1–3 years ahead each month in the TCN dataset. In Section 4, we discuss the techniques used to handle this data imbalance.
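A simplified sketch of this labelling step is given below; the TCN table schema (keyphrase, category and month columns) is an assumption used only for illustration:

import pandas as pd

def label_instance(keyphrase, window, category, tcn):
    # Return 1 if `keyphrase` appears in the TCN dataset for `category`
    # between 1 and 3 years after `window`, else 0. The TCN schema
    # (columns 'keyphrase', 'category', 'month') is simplified for illustration.
    start = pd.Timestamp(window) + pd.DateOffset(years=1)
    end = pd.Timestamp(window) + pd.DateOffset(years=3)
    hits = tcn[(tcn["category"] == category)
               & (tcn["keyphrase"] == keyphrase)
               & (tcn["month"].between(start, end))]
    return int(not hits.empty)

# e.g. label_instance("charcoal", "2014-01-01", "Toothpaste", tcn) -> 1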

Fig 6. How ground truth label is added for the classification problem from TCN.

https://doi.org/10.1371/journal.pone.0307180.g006

Unlike univariate time series classification, where an instance is a single time series with a number of temporally ordered observations and an output class, in multivariate time series classification an instance consists of multiple dimensions, each with a number of observations, along with an output class [76]. As of 2018, an archive of 30 multivariate time series datasets has been released (diverse in series length, number of dimensions and number of output classes), allowing algorithms to be benchmarked on these data types, which has led to an increase in research in this area [126]. These families of algorithms have not often been applied when mining for customer needs; however, they have been applied in related areas of study, e.g. smart manufacturing [127] and customer churn prediction [128]. Additionally, libraries implementing many popular algorithms in the field have been made publicly available, enabling studies that demonstrate the applicability of these algorithms. Two popular ones include 1) sktime—a Python-based package compatible with sklearn [129] and 2) tsml—a Java-based package compatible with Weka. In our study, we use the multivariate supervised MINImally RandOm Convolutional KErnel Transform (MINIROCKET) algorithm [130], a faster version of the RandOm Convolutional KErnel Transform (ROCKET) algorithm [131], which has been shown to obtain better results in terms of speed and accuracy than comparative approaches [76]. ROCKET transforms a 3D multivariate time series into a 2D vector space using random convolutional kernels. This 2D vector space is then used as ML features to train a linear classifier such as Ridge/Logistic Regression [131] to solve the classification task. In our analysis, we use MINIROCKET, a (mostly) deterministic algorithm that speeds this ROCKET transformation process up to 75 times on large datasets [130]. When applying MINIROCKET, we use the multivariate version from sktime—http://www.sktime.net/en/v0.13.0/api_reference/auto_generated/sktime.transformations.panel.rocket.MiniRocketMultivariate.html. To train the linear classifier on the embeddings produced by MINIROCKET, the cross-validated version of Ridge Regression from sklearn is used (one of the recommended algorithms to use with MINIROCKET [130])—https://scikit-learn.org/1.0/modules/generated/sklearn.linear_model.RidgeClassifierCV.html. We apply these implementations as they are the ones recommended in the linked coding repository of MINIROCKET—https://github.com/angus924/minirocket. We also use the same default hyper-parameter values for the two models as in the repository. Two of the important hyper-parameter values are: 1) 10,000 for the num_kernels parameter of MINIROCKET, producing an embedding space of 10,000 dimensions on which the linear model is trained; and 2) True for the normalize parameter of Ridge Regression, which standardizes the embeddings before training/testing the classifier. It’s important to note that although we normalise the embedding inputs into the linear classifier, we do not normalise the multivariate time series data before running MINIROCKET (as performed in major MTSC benchmarking studies [76]). This is because scale and variance in one dimension within multivariate data may be discriminatory factors, which is particularly relevant to MTSC where interactions in shape, level and variance are required [76].
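A minimal sketch of this pipeline on toy data is shown below; note that the normalize argument of RidgeClassifierCV only exists in older scikit-learn releases (such as the version linked above), so newer versions would require standardizing the transformed features instead:

import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from sktime.transformations.panel.rocket import MiniRocketMultivariate

# Toy stand-in for the real panel: (n_instances, n_dimensions, series_length);
# in our setting the dimensions are the univariate feature series and the
# length is 36 months.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 12, 36))
y_train = rng.integers(0, 2, size=100)

minirocket = MiniRocketMultivariate(num_kernels=10_000)
X_feat = minirocket.fit_transform(X_train)

# RidgeClassifierCV as recommended in the MINIROCKET repository; standardize
# X_feat first on scikit-learn versions without the `normalize` argument.
clf = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10))
clf.fit(X_feat, y_train)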

As discussed in Section 2, the use of MTL is a key contribution of our study. How we use it is described in Fig 7. During training, we generate time series features from the instances of the available training product categories at our disposal (e.g. Dog Food, Shampoo and Toothpaste in Fig 7). A model is then built from these instances, which contain the ground truth label, allowing the prediction of future customer needs. During testing, we generate time series features using the same process as during training, but only for one category. The trained model is then used to classify these instances. We use two types of product categories when testing our model during evaluation: 1) Seen Testing Category—a category the model has seen during training (e.g. Dog Food in Fig 7); and 2) Unseen Testing Category—a category the model has not seen during training (e.g. Cookies in Fig 7). For the Seen Testing Category, although the model has used the category in the training process, the same data is not used for training and testing—described in more detail in our evaluation (Section 4). As discussed in Section 3.3, the categories in Table 2 which have ground truth data on/before 2014-01-01 are used to train the model (i.e. 7 categories) and thus also make up the Seen Testing Categories. This is because the training time period in our evaluation is between 2014-01-01 and 2014-12-31 (for the reasons described in Section 3.3). All other categories are not used in model training and therefore make up the Unseen Testing Categories (i.e. 8 categories). In our evaluation, we show that the model produced from this MTL approach achieves similar performance to training and testing on the same product category, e.g. training on Dog Food to predict Dog Food. This is important as the model can then be used on categories it has not seen during training. It does this by learning what future customer needs look like on Reddit in general, rather than for one particular product category. The reason this model performs well is the MTL characteristic of Task Relatedness [77], i.e. the tasks are similar. In our setting, this characteristic arises because the signals of future customer needs on Reddit are similar across categories, e.g. Toothpaste and Cookies. This is also the logic behind many successful MTL approaches in the ML literature, e.g. [132] built a better-performing classification model that learned higher-level features by using MTL to train on images from multiple object categories. Task Relatedness in our problem is helped by how we generate task-agnostic features: as seen in Section 3.4, the features we generate are not specific to any one product category but rather general across product categories, e.g. user/frequency/sentiment features.

Fig 7. Multi-Task Learning: A generalizable model is built from multiple product categories (e.g. Dog Food, Shampoo and Toothpaste) and tested on categories it has seen (e.g. Dog Food) and not seen during training (e.g. Cookies).

https://doi.org/10.1371/journal.pone.0307180.g007

4 Evaluation

This section aims to answer the following two research questions: 1) can future customer needs be predicted with better performance than previous approaches using UGC from the social media platform Reddit across multiple product categories within CPG?; and 2) can the use of MTL (described in Section 3.5) be employed to achieve similar performance to training/testing on the same category so that it can be applied to categories the model hasn’t seen during training i.e. by learning what a general future customer need looks like across multiple product categories? To do this, we first measure our approach against a baseline in the literature [35] by assessing how the model compares to it when trained and tested on the same product category e.g. train and test on toothpaste. We then show the performance of MTL when compared to the approach of training and testing on the same category.

This section first details the two different training strategies we use in our experiments i.e. training on one versus multiple categories for prediction (Section 4.1). When describing the strategies, we also give an in-depth explanation of the specific model training and validation details used in our experiments. We then describe the evaluation approaches we employ and explain why certain metrics are used to assess performance (Section 4.2). Our approach is then compared to a baseline [35] in the literature which carries out the same future prediction task as in our study (Section 4.3). The impact of MTL is then assessed (Section 4.4). A further investigation into the results is then performed which highlights the benefits of the approach in general and looks at where future improvements can be made e.g. viewing misclassifications (Section 4.6). Finally, a summary and discussion of our evaluation is given (Section 4.7).

4.1 Evaluation methodology

In our evaluation, we employ two training strategies: 1) the One Category training approach and 2) the Multiple Category training approach, i.e. MTL (as shown in Fig 7). We use these approaches at various stages in the evaluation and finally compare them at the end of this section. When generating results for the One Category approach, we use the same category data to train and test the model, e.g. train and test on dog food. In the Multiple Category approach, we train one model using a large number of product categories and then test individually on categories the model has seen during training and not seen during training (as described in Section 3.5), e.g. train on dog food, nail polish, shampoo, etc. and test on a Seen Testing Category like dog food as well as an Unseen Testing Category like cookies. Although we use the same product category data for the One Category and Multiple Category approaches, this data is still split to remove any train/test overlap.

As discussed in Section 3.3, as a result of the 36-month (i.e. length of Previous Time Window) rolling window nature of our approach, instances with the same keyphrase for the same product category in nearby Fixed Time Windows (i.e. months) are highly similar, e.g. the two instances of “charcoal” in the product category Toothpaste for the months 2014-05-01 and 2014-06-01 are nearly the same. As a result of this data overlap, we are constrained to having the training and testing data for the One Category approach separated by at least 36 Previous Time Windows (i.e. 36 months), because we use 36 Previous Time Windows of data for each instance (to avoid potential train/test contamination). We have multivariate time series data available from 2014-01-01 to 2018-12-31. Due to this train/test overlap issue, we train on data from 2014-01-01 to 2014-12-31 and test on data from 2018-01-01 to 2018-12-31. For similar reasons, we follow the same train/test split for the Multiple Category approach. As discussed in Section 3.1 and Section 3.5, this is the reason why 7 (and not 15) categories are used in the model training process: 7 categories have ground truth data on/before the dates between 2014-01-01 and 2014-12-31 (see Table 2). These are the only categories used in the One Category approach in our experiments, as the One Category approach uses the same category to train and test the model, i.e. if no training data is available for a category then it cannot be tested. These are also the only categories used to train the Multiple Category MTL approach, therefore making up the Seen Testing Categories (described in Section 3.5). The remaining 8 categories are solely used to test the Multiple Category MTL model to see if it generalizes to categories it has not seen during training, thus making up the Unseen Testing Categories (described in Section 3.5).

As discussed in Section 3.5, we use a Mini Rocket model followed by a Linear Ridge Regression classifier when detecting future customer needs. During the model-building process, we discovered that these two models are infeasible to use on the entirety of our training data from a time complexity standpoint. This is because on average each category has ∼34,000 instances every month with 1,263 univariate time series, which are each 36 months in length. This results in ∼18,550,944,000 data points for each category over 12 months i.e. from 2014-01-01 to 2014-12-31. We do not have the time or computing power to deal with this data, even with the fast speed of Mini Rocket. Hence, we undersample our data. This significantly reduces the number of instances used for training as there is a massive data imbalance between the positive and negative ground truth labels (Section 3.5). Specifically, there is a ∼260:1 ratio of negative instances to positive instances before undersampling across the categories used in the analysis from 2014-01-01 to 2014-12-31. When undersampling, we employ random undersampling—https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.RandomUnderSampler.html.
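A minimal sketch of this undersampling step on toy data, flattening the 3D panel so that the sampler can operate on 2D input and then reshaping it back, might look as follows:

import numpy as np
from imblearn.under_sampling import RandomUnderSampler

# Sketch of the undersampling step on toy data: flatten the 3D panel so the
# sampler can work on 2D input, resample to a 1:1 class ratio, reshape back.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12, 36))
y = np.array([1] * 4 + [0] * 996)                     # heavy imbalance (toy numbers)

n, d, t = X.shape
rus = RandomUnderSampler(sampling_strategy=1.0, random_state=0)
X_flat, y_res = rus.fit_resample(X.reshape(n, d * t), y)
X_res = X_flat.reshape(-1, d, t)
print(X_res.shape, np.bincount(y_res))                # (8, 12, 36) [4 4]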

For our task, undersampling allows us to significantly reduce training time; however, it has the drawback of not representing the true distribution of the output classes, which affects the predictions produced by classifiers [133]. For example, a model trained on all the training data versus a model trained on an undersampled version of the same data could predict a different output class for the same instance. Our model would likely predict far too many instances as the positive class (i.e. future customer needs) during testing, given that it’s trained on a downsampled version of the data in which the negative class is heavily undersampled. To mitigate this undesired consequence, we use the predicted probability output of the model rather than asking it directly what class an instance belongs to. We use this predicted probability output to find an optimal threshold that yields the highest performance on a held-out validation set at predicting future customer needs, e.g. all instances that have a probability output greater than 0.8 (the threshold) for the positive class are predicted as positive instances. Specifically, we validate the probability output of the Linear Ridge Regression classifier (which is first trained on the 2D embeddings produced by the Mini Rocket algorithm run over the 3D multivariate time series data). When choosing the threshold, we perform an exhaustive grid search for the probability value which yields the best F1 score between 0 and 1 with a step size of 0.01 (we elaborate on the choice of the F1 metric later in this section). It’s noteworthy that finding such a probability threshold for the classification of imbalanced data is an area that has been thoroughly explored in the ML literature [134]. We also note that we don’t validate any other parameters in the overall process, for computational and time reasons. Such parameters include the inputs into Mini Rocket (e.g. the number of kernels) or Linear Ridge Regression (e.g. the alpha parameter).
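The threshold search itself reduces to a simple grid over candidate values; the sketch below assumes per-instance scores in [0, 1] are already available for the validation set (e.g. a scaled or calibrated version of the ridge classifier's confidence), which is a simplification of our actual pipeline:

import numpy as np
from sklearn.metrics import f1_score

def best_threshold(valid_scores, y_valid):
    # Grid-search the score threshold (0 to 1, step 0.01) that maximizes F1 on
    # the held-out validation set; `valid_scores` are assumed to already lie
    # in [0, 1] (e.g. a scaled version of the ridge classifier's confidence).
    thresholds = np.arange(0.0, 1.01, 0.01)
    f1s = [f1_score(y_valid, (valid_scores >= t).astype(int), zero_division=0)
           for t in thresholds]
    return float(thresholds[int(np.argmax(f1s))])

# At test time: y_pred = (test_scores >= best_threshold(valid_scores, y_valid))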

When splitting the training data from 2014-01-01 to 2014-12-31 into training and validation sets, we don’t use the traditional approach of randomly splitting the data on a per-instance basis, as instances with the same keyphrase could have highly similar multivariate time series (as previously discussed in this section). This would not represent the real relationship between the training and testing sets, which have no overlap. To bypass this issue, we split at the unique keyphrase level instead, which ensures no training/validation data overlap. Specifically, we randomly sample 90% of the unique keyphrases from the training data and use the instances associated with these keyphrases to train the model. The instances associated with the remaining 10% of keyphrases are used to validate it. When splitting the keyphrases, we use stratified random sampling, where unique keyphrases are divided based on whether they ever become a future customer need, i.e. are contained in the TCN dataset 1–3 years in the future. When training the model, the data is then undersampled to a 1:1 ratio of positive and negative instances (allowing for faster training times). The validation data is undersampled only to the ratio that represents the initial class distribution between positive and negative instances in the original training data, e.g. a 260:1 ratio. This is done because the probability threshold must be estimated on data that represents the class distribution of the real training data; by removing 90% of the unique keyphrases that represent future customer need instances, this distribution is misrepresented, hence the need to undersample again.
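A minimal sketch of this keyphrase-level stratified split is given below; the ever_positive mapping and the per-instance keyphrases array are assumptions used to keep the example self-contained:

import numpy as np

def split_by_keyphrase(keyphrases, ever_positive, valid_frac=0.1, seed=0):
    # Split instances into train/validation masks by unique keyphrase (not by
    # row), stratified on whether a keyphrase ever becomes a future customer
    # need; `ever_positive` maps each unique keyphrase to True/False (assumed given).
    rng = np.random.default_rng(seed)
    unique = np.unique(keyphrases)
    valid_kps = set()
    for label in (True, False):
        group = [kp for kp in unique if ever_positive[kp] == label]
        rng.shuffle(group)
        valid_kps.update(group[:int(len(group) * valid_frac)])
    valid_mask = np.isin(keyphrases, list(valid_kps))
    return ~valid_mask, valid_mask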

Social Media Algorithm: How the One Category approach is trained, validated and tested

Data: categoryArr ← ['shampoo', 'toothpaste', 'eyeliner']

for category in categoryArr do

allData = getDataForCategory(category)

allTrainData, allTestData = splitDataByDate(allData)

trainData, validData = splitDataByValidProcedure(allTrainData)

trainDataUndersampled = undersample(trainData)

model = trainModel(trainDataUndersampled)

probabilityThreshold = validate(model, validData)

prediction = test(allTestData, model, probabilityThreshold)

end

The Social Media Algorithm shows the pseudocode for how we train, validate and test our model. It draws upon all of the topics discussed in this section, e.g. how we undersample our data and how we split our data into training/validation sets. The Social Media Algorithm specifically shows how we do this for the One Category approach (i.e. train and test using the same product category); however, it can easily be extended to the Multiple Category approach by instead using multiple categories to train/validate/test the model. To start, we define the product categories we analyze, i.e. categoryArr. In the algorithm, these categories are shampoo, toothpaste and eyeliner; however, in our experiments we analyze more than just these categories, as discussed throughout the study. These categories are looped through, with the processes responsible for training, validating and testing the model applied in each iteration. As the first process in each iteration, we get all the data for the category being analyzed, i.e. getDataForCategory. We assume all processing has been applied in this step to turn the Reddit posts into candidate keyphrases, each consisting of multiple features in the form of univariate time series (as described in Section 3.3). Secondly, we split the multivariate time series data by date into training and testing splits, i.e. splitDataByDate. Due to the train/test overlap issue described in this section, for each category we reserve data from 2014-01-01 to 2014-12-31 for training and 2018-01-01 to 2018-12-31 for testing. Thirdly, we further split the training data into training and validation sets, i.e. splitDataByValidProcedure. Here we split at the unique keyphrase level (for the reasons described in this section). We then undersample the training data using random undersampling, as it is computationally infeasible to run the model on the entirety of our data (as described in this section), i.e. undersample. The model (Mini Rocket followed by a Linear Ridge Regression classifier) is then trained on the undersampled data, i.e. trainModel. It’s then validated on the held-out validation set to obtain the probability threshold which optimises the F1 score, i.e. validate. This is done because the output class distribution represented by the undersampled training data does not reflect the class distribution of the testing data (which is not undersampled). When the model is finally applied to the testing data (i.e. test), we use the probability threshold when classifying instances, e.g. if the probability output of the instance is above the 0.8 threshold it’s classified as a positive instance.

When running our experiments on the test data, we noticed that different runs of the same model type sometimes produced different results, e.g. the One Category model for Lip Balm not producing the same results across runs. This is due to the various stochastic processes we perform when transforming our data, such as applying the Mini Rocket algorithm (whose 2D kernel embeddings can vary between runs) and randomly undersampling our training data. To obtain a realistic result set, we therefore run each approach 10 times and report the mean results in our experiments. This increases the time complexity of our experiments but reduces experimental bias.

4.2 Evaluation approaches

When assessing our approach we use two evaluation strategies. The first evaluates the model on the instance level in a binary classification setting, thus assessing the performance of the model on instances it predicts as future customer needs. The second evaluates the model based on the lists of ranked keyphrases it produces each Fixed Time Window (i.e. month), therefore assessing the model’s capability of creating lists of keyphrases it predicts as future customer needs.

4.2.1 Binary Classification Evaluation.

For the first approach, we simply evaluate at the instance level for the binary task of predicting whether the keyphrase instance will become a trend in 1–3 years in the TCN dataset. As our task is an imbalanced classification problem, the F1 measure is used. Accuracy is not a suitable metric for these types of problems [135], e.g. a model could predict everything as the majority class and still achieve high accuracy. The F1 metric is a trade-off between precision and recall, both suitable metrics for the evaluation of our task (and imbalanced classification tasks in general [135]). Recall is necessary because it evaluates how many future customer needs the model can find, while precision is needed to keep the number of false positives low, i.e. not everything can be predicted as a future customer need.

4.2.2 List Evaluation.

The second approach evaluates a ranked list of submitted keyphrases for each Fixed Time Window (i.e. month). We use this procedure as it’s the exact evaluation used by the baseline approach [35] (detailed in Section 4.3). The baseline itself based its evaluation procedure on related text mining literature [107, 136], from which many of its assessment decisions are drawn. We also use it because we want to show that our approach performs better even when using the same evaluation methodology as the baseline. Although we use the same evaluation approach, we do not use the same evaluation data as the baseline. This is because the baseline only uses evaluation data for one category in its experiments, i.e. Toothpaste. We instead use the aforementioned TCN dataset, which incorporates multiple categories and has the same final output format as the evaluation data used by the baseline, i.e. lists of keyphrases representing customer needs addressed in real products. To assess our approach using this ranked List Evaluation, we transform the output of our classification approach into a ranked list of keyphrases for each Fixed Time Window (discussed in Section 4.3). Conversely, the baseline approach, which produces a ranked list of keyphrases, has its output transformed so that it can be assessed by the Binary Classification Evaluation (also discussed in Section 4.3).

Unlike the Binary Classification Evaluation, when comparing ranked lists of keyphrases in each Fixed Time Window, this approach counts string matches between the output submitted by the model and the TCN dataset within a Levenshtein distance threshold of 0.8 [35]. This allows for some potential misspellings that can occur on social media. As the keyphrases in this approach are submitted in ranked lists, matching is performed over reduced numbers of keyphrases, both from the model run over the UGC data and from the TCN dataset. Specifically, the numbers of keyphrases K used are 5, 10, 15 and 20 (as in the baseline [35]). This is enough to capture highly important needs (e.g. top 5 needs) as well as needs that are slightly less important but still relevant (e.g. top 20 needs). In this evaluation, two metrics are recorded: a) List Mean Precision and b) List Recall. To calculate List Mean Precision, List Precision is first calculated. List Precision is calculated at each Fixed Time Window (i.e. each month) and is defined as the number of correct keyphrases the model run over UGC can find (1–3 years ahead of the date at which the keyphrases are found) divided by the number of keyphrases K it produced. As List Precision is calculated every month, List Mean Precision over the entire evaluation period can be computed, defined as the mean of the List Precision scores. List Recall is defined as the total number of unique keyphrases the model run over UGC can match in the TCN dataset (i.e. 1–3 years ahead) divided by the total number of unique keyphrases the TCN dataset contains (for a given K) over the entire testing period. This metric instead focuses on assessing the unique customer needs detected by the model that are contained in the TCN dataset. However, in our experiments the model run over UGC produces keyphrases from 2018-01-01 to 2018-12-31 while the ground truth data produces keyphrases from 2019-01-01 to 2021-12-31. As pointed out in the baseline study [35], this makes it very difficult to achieve high performance for this metric (i.e. List Recall).
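The two list metrics can be sketched as follows; note that the fuzzy matching below uses difflib's similarity ratio as a stand-in for the Levenshtein-based criterion in [35], which is an approximation for illustration only:

from difflib import SequenceMatcher

def fuzzy_match(a, b, threshold=0.8):
    # Stand-in for the Levenshtein-based matching in [35]: a normalized
    # similarity ratio of at least 0.8 is treated as a match (an approximation).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def list_precision(predicted_k, tcn_future):
    # Fraction of the K keyphrases submitted for one month that match any
    # ground-truth need appearing in TCN 1-3 years later.
    hits = sum(any(fuzzy_match(p, t) for t in tcn_future) for p in predicted_k)
    return hits / len(predicted_k)

def list_recall(all_predicted, tcn_all_future):
    # Unique ground-truth needs matched anywhere in the testing period,
    # divided by the number of unique ground-truth needs.
    matched = {t for t in tcn_all_future
               if any(fuzzy_match(p, t) for p in all_predicted)}
    return len(matched) / len(tcn_all_future)

# List Mean Precision is then the mean of list_precision over all months.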

4.3 Baseline approach comparison

In this section, we outline how our model performs against a baseline [35] using the two evaluation approaches defined in Section 4.2 i.e. Binary Classification Evaluation and List Evaluation. Here we only compare the baseline against the One Category approach i.e. training/testing on the same product category (Section 4.1). We do this as we want to clearly show that the ML approach is better than the baseline, to illustrate that future customer needs can be predicted with better performance than previous approaches i.e. our first research question discussed at the beginning of our evaluation. In Section 4.4, we show the performance of the Multiple Category MTL approach.

The baseline approach we compare against is a recent rule-based algorithm that finds customer need keyphrases from Reddit that are of interest to businesses [35]. It addresses the same overall task as our study, i.e. it predicts candidate keyphrases it estimates will be future customer needs addressed in real products. However, these keyphrases are predicted as a ranked list rather than a binary output (as in this study). It does this by performing 3 main steps, each containing many substeps: 1) data reduction; 2) keyphrase extraction; and 3) keyphrase ranking. For the final step (i.e. keyphrase ranking), it’s noteworthy that the algorithm incorporates knowledge from Google Trends as well as Reddit, as this leads to an increase in performance. The baseline approach is evaluated on the task of identifying future customer needs in Toothpaste products [35]. It’s able to significantly outperform a random baseline for both the metrics discussed in Section 4.2, i.e. List Mean Precision and List Recall. It’s also able to detect 4 out of 6 highly important needs identified by a large Multinational Corporation (MNC) specializing in the oral-care sector. Even though the baseline is only assessed on one product category in its evaluation, we compare it on three categories, as our study is a multi-category analysis. These categories are 1) Toothpaste, 2) Perfume and 3) Dog Food. We don’t compare on more categories due to the time it takes to collect the Google Trends data required for the baseline to work for each category. When picking categories to compare, we chose Toothpaste as it’s the one used in the baseline, and selected the other two categories as they make up a diverse range for the baseline comparison experiment. There are several parameters required to run the baseline, including category-specific input parameters. We aimed to be as favourable as possible to the baseline when providing these parameters for the categories it didn’t use in its evaluation, i.e. Dog Food and Perfume. S11 Appendix further details the parameters and parameter values used for each category.

4.3.1 Baseline approach comparison—Binary Classification Evaluation.

As the output of the rule-based baseline is a ranking of keyphrases for each Fixed Time Window (i.e. month), some manipulation is required for it to be assessed using the Binary Classification Evaluation (as discussed in Section 4.2). To turn this ranking approach into a form suitable for binary classification, we use the natural ordering of the keyphrases when classifying. Specifically, we use a threshold to label the top-ranked keyphrases (sorted in ascending order of rank) up to that threshold as true (i.e. future customer needs) while the others remain false (i.e. not future customer needs). In our experiments, we use 15 such thresholds to explore which one yields the best F1 score. Precisely, we use the thresholds 5, 10, 15, 20, 50, 100, 250, 500, 750, 1000, 1250, 1500, 1750, 2000, and 2500. These thresholds represent a wide and widespread range of values for finding a near-optimal F1 for the baseline. Smaller increments are used at the start (e.g. 5, 10, 15, 20) because there are only a small number of positive instances in the dataset (Section 4.1).
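A minimal sketch of this rank-to-binary conversion and threshold search is shown below, with hypothetical ranks and y_true arrays standing in for the baseline's output and the ground truth:

import numpy as np
from sklearn.metrics import f1_score

THRESHOLDS = [5, 10, 15, 20, 50, 100, 250, 500, 750, 1000,
              1250, 1500, 1750, 2000, 2500]

def best_baseline_f1(ranks, y_true):
    # `ranks` holds the baseline's rank per keyphrase (1 = best); the top-k
    # ranked keyphrases are labelled positive for each candidate threshold k,
    # and the threshold giving the highest F1 is returned.
    scores = [(k, f1_score(y_true, (ranks <= k).astype(int), zero_division=0))
              for k in THRESHOLDS]
    return max(scores, key=lambda kv: kv[1])   # (best threshold, best F1)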

The results of this evaluation for the One Category approach and the baseline approach are seen in Table 5. As seen in the table, the One Category approach outperforms the best baseline approach across all 3 categories. Furthermore, it outperforms the baseline by a large margin on the categories that aren’t addressed in its experiments, i.e. Dog Food and Perfume. This is probably because the baseline’s parameter values aren’t tailored to these categories. In the broader picture, however, our approach does better because it uses supervised ML: it can learn from past instances of future customer needs, which generalizes better than the human-encoded rules in the baseline approach (e.g. is the Min Chi Square P-value parameter < 0.02?). For completeness, we also record the precision and recall results of this comparison (S12 Appendix).

Table 5. Binary Classification Evaluation showing the mean F1 scores (rounded to 3 decimal places) for the One Category and baseline approaches.

The best result for the baseline across each threshold for each category is in bold.

https://doi.org/10.1371/journal.pone.0307180.t005

In this section, we also compare the F1 scores of the One Category and baseline approaches using a statistical test for each of the product categories used in the baseline comparison, i.e. Toothpaste, Dog Food and Perfume. This can be done as both the One Category approach and the baseline approach are run multiple times, as detailed in Section 4.1 and Section 4.3. Specifically, a Mann-Whitney U test [137] is run comparing the F1 scores of the One Category approach and the “best” baseline approach. The “best” baseline approach for each category is represented by the threshold which has the highest mean performance in Table 5, i.e. 500 for Toothpaste, 1250 for Dog Food and 1000 for Perfume. As with previous studies comparing results from the output of different ML models [35, 72], this test is used instead of a t-test because the F1 scores for each approach are not normally distributed [137]. The t-test compares the means of the two samples and assumes they are normally distributed, while the Mann-Whitney U test compares the rank sums of the two samples and does not assume normality [137]. For the same reasons, this test is used throughout our evaluation to compare different samples of results. Table 6 shows the p-value (rounded to 3 decimal places) of this test for each product category. If the test favours the baseline, i.e. its median scores are greater than the One Category scores, a + is appended to the result (as in [72]). We compare median scores here because the Mann-Whitney U test is a “test of medians” [138]; the test therefore favours the baseline if its median scores are greater than the One Category scores. The results in the table solidify the finding that the One Category approach is better than the baseline for the F1 metric using the Binary Classification Evaluation: for all the categories analyzed, the results from the two samples are significantly different (i.e. all p-values are <0.001) and the median results for the One Category approach are all greater than those of the baseline.
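The test itself is a single call to scipy; the sketch below uses placeholder score arrays (not our actual results) purely to show how the 10 per-run F1 scores of each approach are compared for one category:

from scipy.stats import mannwhitneyu

# Placeholder per-run F1 scores (not our actual results), purely to show how
# the 10 runs of each approach are compared for one category.
one_category_f1 = [0.21, 0.19, 0.22, 0.20, 0.23, 0.18, 0.21, 0.20, 0.22, 0.19]
baseline_f1 = [0.05, 0.06, 0.04, 0.05, 0.06, 0.05, 0.04, 0.05, 0.06, 0.05]

stat, p_value = mannwhitneyu(one_category_f1, baseline_f1, alternative="two-sided")
print(round(p_value, 3))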

Table 6. P-value (rounded to 3 decimal places) for the Mann-Whitney U test of F1 scores from the One Category approach vs the best baseline approach for Binary Classification Evaluation.

https://doi.org/10.1371/journal.pone.0307180.t006

4.3.2 Baseline approach comparison—List Evaluation.

As with the output of the ranking algorithm, the ML approach proposed in this study needs its output transformed before it can be evaluated with the List Evaluation approach, i.e. to calculate List Mean Precision and List Recall (described in Section 4.2). Specifically, this involves changing the binary prediction output into a ranked list of keyphrases for each Fixed Time Window (i.e. month). This is done by using the predicted probability score output by the Linear Ridge Regression classifier (i.e. the ML model used in this study) to rank the terms of each Fixed Time Window in descending order of confidence. By ranking this way, the instances the model estimates are most likely to become future customer needs are at the top of the list, while those it estimates are least likely to become future customer needs are at the bottom.
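A minimal sketch of this ranking step is shown below; the keyphrases and scores are illustrative placeholders:

import numpy as np

def top_k_keyphrases(keyphrases, scores, k):
    # Rank one month's candidate keyphrases by the classifier's confidence for
    # the positive class (highest first) and return the top k for List Evaluation.
    order = np.argsort(scores)[::-1]
    return [keyphrases[i] for i in order[:k]]

# e.g. top_k_keyphrases(["charcoal", "bread", "mint"], np.array([0.9, 0.1, 0.4]), 2)
# -> ['charcoal', 'mint']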

The results of the evaluation for the One Category and baseline approach are seen in Table 7. The One Category approach is better than the baseline across all the results for the Dog Food and Perfume categories. However, the baseline performs better for the Toothpaste category by obtaining higher performance on all the List Mean Precision results and one of the List Recall results. As discussed previously in the evaluation, the baseline is specifically tuned for the Toothpaste category across the metrics used in the List Evaluation, so it’s not surprising that it performs better here.

Table 7. List Evaluation showing the mean results (rounded to 3 decimal places) for the One Category and baseline approaches.

For each category, the result from the best approach is in bold.

https://doi.org/10.1371/journal.pone.0307180.t007

As in Section 4.3.1, we also compare the results of the One Category and baseline approaches using a Mann-Whitney U test, as they are run multiple times. Table 8 shows the p-value (rounded to 3 decimal places) for each metric in the List Evaluation over every product category used in the baseline comparison. Across all the results in the table, the One Category approach is significantly better 18/24 times at the 0.1, 0.05 and 0.01 significance levels (i.e. the test favours the One Category approach and the p-value is below the given level). The baseline is significantly better 4/24 times at the 0.1 level and 3/24 times at the 0.05 and 0.01 levels. These levels (i.e. 0.1, 0.05, 0.01) are reported as they are commonly used in other studies to test for statistical significance [139]. The results in this table demonstrate that the One Category approach is better than the baseline for the List Evaluation, barring the List Mean Precision metric for the Toothpaste category.

Table 8. P-value (rounded to 3 decimal places) for the Mann-Whitney U Test of results from the One Category approach vs the baseline approach for List Evaluation.

https://doi.org/10.1371/journal.pone.0307180.t008

4.3.3 Baseline approach comparison—Summary.

To summarize, the One Category approach outperforms the baseline entirely in the Binary Classification Evaluation (Section 4.3.1) and mostly in the List Evaluation (Section 4.3.2), except for the List Mean Precision metric for the Toothpaste category. Considering these results, we answer our first research question: future customer needs can be predicted with better performance than previous approaches using UGC (as discussed at the beginning of our evaluation).

4.4 Impact of Multi-Task Learning

In this section, we outline how the Multiple Category approach achieves similar performance to the One Category approach, i.e. an MTL model trained on all product categories performs similarly to a model trained and tested solely on the same category’s data. In the previous section (Section 4.3), we showed that our ML approach is better than the current baseline in the literature, addressing our first research question. In this section, we therefore address the second research question (discussed at the beginning of our evaluation): that MTL can achieve detection performance for the classification of future customer needs similar to the One Category approach. This is important because an MTL model can be used to predict needs for categories not seen in the training process, therefore generalizing to unseen categories without having to be retrained.

As discussed in Section 3.5, there are two types of categories used to test the MTL model: 1) Seen Testing Categories; and 2) Unseen Testing Categories. The Seen Testing Categories are categories used in the training process. Conversely, the Unseen Testing Categories are categories used in testing but not used by the model in the training process. The Seen Testing Categories are also the only categories used in the One Category approach, as they have data to train and test on the same category. As seen in Table 2, the Seen Testing Categories are Dog Food, Eyeliner, Lip Balm, Nail Polish, Perfume, Shampoo and Toothpaste. The Unseen Testing Categories make up the 8 remaining categories in Table 2, i.e. Beer, Cereal, Coffee, Cookie, Pizza, Popcorn, Soda and Soup. In this section, we carry out two separate evaluations for the Seen and the Unseen Testing Categories. The main reason for this is that a comparison between the One Category approach and the Multiple Category approach can only be performed on the Seen Testing Categories, because the One Category approach can only be run on these categories. Although the Unseen Testing Categories are not used in this comparison analysis (i.e. examining whether the MTL approach is better than using the same category to train/test a model), they still contribute to the evaluation as they test whether the MTL model is capable of detecting future customer needs for categories it hasn’t seen during training, e.g. can a model trained on Eyeliner, Toothpaste and Perfume predict an unseen category such as Cookies?

4.4.1 Multi-Task Learning approach comparison for Seen Categories—Binary Evaluation.

In this section, we compare the One Category approach to the Multiple Category MTL approach for the Seen Testing Categories using the Binary Classification Evaluation (Section 4.2). The results of this evaluation, shown in Table 9, illustrate that the Multiple Category approach outperforms the One Category approach across 5 of the 7 categories. The One Category approach obtains higher performance in 1 category (i.e. Perfume) and they both achieve the same performance for 1 category (i.e. Shampoo). The precision and recall scores associated with the F1 scores in Table 9 are also recorded (S12 Appendix).

Table 9. Binary Classification Evaluation showing the mean F1 scores (rounded to 3 decimal places) for the One Category and Multiple Category approaches across the Seen Testing Categories.

For each category, the result from the best approach is in bold.

https://doi.org/10.1371/journal.pone.0307180.t009

As in Section 4.3, we also compare the F1 scores of the One Category and Multiple Category approaches for the Seen Testing Categories using a Mann-Whitney U test. Table 10 shows the p-value (rounded to 3 decimal places) of this test for each product category analyzed. Although the Multiple Category approach performs better across 5 of the 7 categories (as shown in Table 9), it only performs significantly better 1/7 times at the 0.1 and 0.05 levels and never at the 0.01 level. It’s of note that the Multiple Category approach for Shampoo matches the One Category approach in Table 9, yet the One Category approach slightly outperforms it in Table 10; this is because the mean result is recorded in Table 9 and the median in Table 10. The One Category approach is likewise significantly better only 1/7 times at the 0.1 and 0.05 levels and never at the 0.01 level. The results in the table show that the Multiple Category approach performs similarly to the One Category approach for the Binary Classification Evaluation.

Table 10. P-value (rounded to 3 decimal places) for the Mann-Whitney U test of F1 scores from the One Category approach vs the Multiple Category approach for Binary Classification Evaluation across the Seen Testing Categories.

https://doi.org/10.1371/journal.pone.0307180.t010

4.4.2 Multi-Task Learning approach comparison for Seen Categories—List Evaluation.

In this section, we compare the One Category approach to the Multiple Category MTL approach for the Seen Testing Categories using the List Evaluation (discussed in Section 4.2). As in the baseline comparison (Section 4.3), we transform the output of both the One Category and Multiple Category approaches so they can be evaluated using the List Evaluation approach. The results of this evaluation are seen in Table 11. The One Category approach outperforms the Multiple Category approach, obtaining 33 of the 56 best results in the table. The Multiple Category approach obtains 21 of the best results, while they both obtain the same result twice (i.e. Recall when K is 10 for Nail Polish and Recall when K is 5 for Toothpaste).

Table 11. List Evaluation showing the mean results (rounded to 3 decimal places) for the One Category and Multiple Category approaches across the Seen Testing Categories.

For each category the best approach is in bold.

https://doi.org/10.1371/journal.pone.0307180.t011

As in Section 4.3, we also compare the results of the One Category and Multiple Category approaches using a Mann-Whitney U test. Table 12 shows the p-value (rounded to 3 decimal places) of this test for each product category. Although Table 11 may suggest that many of the results are better for the One Category approach, it in fact only performs significantly better 6/56 times at the 0.1 level and 4/56 times at the 0.05 level, and never at the 0.01 level. The Multiple Category approach likewise only performs significantly better 3/56 times at the 0.1 level and 2/56 times at the 0.05 level, and never at the 0.01 level. Furthermore, for both approaches, statistical significance at the mentioned levels is only ever achieved in 2 categories: Perfume (One Category) and Lip Balm (Multiple Category). The results in the table show that the Multiple Category approach performs similarly to the One Category approach for the List Evaluation. To summarize, the Multiple Category approach performs similarly to the One Category approach in both the Binary and List Evaluation approaches. Due to this, we partly address our second research question: future customer needs can be predicted with similar performance using MTL (as discussed at the beginning of our evaluation).

Table 12. P-value (rounded to 3 decimal places) for the Mann-Whitney U Test of results from One Category approach vs Multiple Category approach for List Evaluation.

https://doi.org/10.1371/journal.pone.0307180.t012

4.4.3 Multi-Task Learning approach comparison for Unseen Categories—Binary Evaluation.

In this section, we show the results of the Multiple Category MTL approach for the Unseen Categories using the Binary Evaluation (discussed in Section 4.2). The results of this evaluation are seen in Table 13. We also record the precision and recall scores associated with the F1 results in Table 13 (S12 Appendix). Comparing the Seen and Unseen Testing Categories with a statistical test would be unfair because some categories are inherently predicted with better performance than others. That said, the results in the table are not too different from the Seen Testing Category results in Table 9. This shows that (according to the Binary Evaluation) the MTL model can still predict future customer needs for a category it has not seen during training with performance relatively similar to categories it has seen during training. This is very useful because even if no ground truth data is available for a product category, future customer needs for it can still be predicted on Reddit. To further emphasize that the results from the Seen and Unseen Testing Categories do not differ much from each other, we plot the distribution of F1 scores for these category types using the Multiple Category approach in S13 Appendix.

Table 13. Binary Classification Evaluation showing the mean F1 scores (rounded to 3 decimal places) for the Multiple Category approach for the Unseen Testing Categories.

https://doi.org/10.1371/journal.pone.0307180.t013

4.4.4 Multi-Task Learning approach comparison for Unseen Categories—List Evaluation.

In this section, we show the results of the Multiple Category MTL approach for the Unseen Categories using the List Evaluation (discussed in Section 4.2). The results of this evaluation are seen in Table 14. As in the previous section (Section 4.4.3), it would not be fair to compare the results from the Seen and Unseen Testing Categories using a statistical test; however, the results are not too dissimilar from the Seen Testing Category results in Table 11. Because the information in Tables 11 and 14 can be difficult to summarize, Table 15 also reports the mean result of the Multiple Category approach for each metric and each value of K, averaged across all categories, for both the 80 Unseen Testing Category results and the 70 Seen Testing Category results. As seen in the tables, the performance for the Seen and Unseen Testing Categories is very similar. This shows that (according to the List Evaluation) the MTL model can still predict future customer needs for a category it has not seen during training with performance relatively similar to categories it has seen during training. To further visualize that the results from the Seen and Unseen Testing Categories do not differ much from each other, we plot the distribution of mean Precision and Recall scores across all the mentioned values of K for these category types using the Multiple Category approach in S14 Appendix. Together with the Binary Evaluation for Unseen Testing Categories (Section 4.4.3), this shows that even if no ground truth data is available for a category, future customer needs for it can still be predicted on Reddit.

Table 14. List Evaluation showing the mean results (rounded to 3 decimal places) for the Multiple Category approach across the Unseen Testing Categories.

https://doi.org/10.1371/journal.pone.0307180.t014

Table 15. Seen and Unseen Testing Category mean results across all the categories used in the analysis for List Evaluation (rounded to 3 decimal places).

For each metric the result from the best approach is in bold.

https://doi.org/10.1371/journal.pone.0307180.t015

4.4.5 Multi-Task Learning approach comparison—Summary.

To summarize, the Multiple Category approach and the One Category approach perform similarly when assessed on the Seen and Unseen Testing Categories. This addresses our second research question (discussed at the beginning of our evaluation): MTL achieves performance similar to the approach that uses the same category data to train and test a model for predicting future customer needs. Although the two approaches perform similarly, we recommend the Multiple Category MTL approach because it can provide predictions for categories it has not seen during training.

4.5 Comparison to State of the Art methods

In this section, we compare our approach against State of the Art (SOTA) methods using 4 case study examples from related work, to show how it fits into the wider customer needs mining literature.

First, we compare our approach to a method that mines current customer needs in the form of ranked keyphrases for specific product models [75]. This method differs from the approach described in this study in two main ways: 1) it mines product models (e.g. Coca-Cola, Haribo) while our approach mines product categories (e.g. soft drinks, sweets); and 2) it predicts current customer needs while our approach predicts future customer needs. The approach in [75] uses LDA to rank keyphrases according to the topics they belong to. In a case study, it achieved precision results ranging from 12% to 38% for detecting customer needs from 4 models of automobiles (i.e. Toyota Prius, Tesla Model S, Honda Civic and Jeep Wrangler). These precision figures were obtained by having humans manually read through the predictions to check if they were correct. Our method achieves higher precision results, i.e. 19.7% to 51.3%, as shown in S14 and S15 Tables in S12 Appendix, although such a comparison is unfair as each approach addresses a different task and has a different evaluation methodology. That being said, achieving comparable metrics on the more difficult task of predicting future (rather than current) customer needs shows the value of the work in this study.

The second case study also mines current customer needs in the form of ranked keyphrases for specific product models [14]. Its methodology is highly similar to that of the previous case study and also uses an LDA-based model to rank customer needs. In a case study ranking customer needs from 4 mobile phone products (i.e. iPhone 4, Samsung Galaxy S II, Motorola Droid RAZR and Sony Ericsson Xperia Play), it achieved precision results ranging from 0.1 to 0.62. As in [75], these precision figures were obtained by having humans scan through the ranked lists. The approach in this study achieves similar precision scores, thus showing its usefulness.

The third case study is the one used as the baseline comparison in Section 4.3. As stated, this approach mines future customer needs using a rule-based algorithm run over Reddit data. The task it addresses is exactly the same as the one addressed in this study, i.e. predicting future customer needs for product categories as ranked lists of keyphrases, so it can be compared directly with our approach. As shown in Section 4.3, our approach significantly outperforms the rule-based algorithm on various evaluation metrics across 3 different product categories. The approach in this paper can therefore be seen as a contribution to this area of the literature, as it makes significant improvements over prior work.

Finally, our approach is compared to a case study method which predicts future customer needs as a regression problem [103]. Specifically, this approach uses a fuzzy time series method to predict the importance of customer needs addressed in an electric hairdryer product. Comparing this approach to ours is difficult given the methodological differences: one outputs ranked lists of keyphrases (keyphrase ranking) while the other predicts a continuous value for a keyphrase (regression). The approaches are instead compared on their ability to predict far into the future. The approach in [103] has been shown to predict Google Trends data far into the future with high performance and to be useful for product development. Similarly, our approach is shown to outperform other methods at predicting ranked lists of customer needs that will be of importance far into the future, i.e. 1–3 years.

4.6 Further examination of results

In this section, we perform a deeper analysis of the results and identify where optimizations can be made to improve the model's capability. We specifically examine the MTL approach as it is the model we recommend using (Section 4.4.5), so the discussion in this section assumes this model is applied. As this analysis does not present the major outcomes of our approach (e.g. that our model outperforms a baseline), we only present high-level findings here; linked appendices back up the specific claims made in this section.

The first analysis looks at the lead times with which the MTL model detects future customer needs before they appear in trending products in the marketplace, i.e. in the TCN dataset. Lead times are measured as the distance from the month of prediction to the month a keyphrase first trends in the TCN dataset. The finding is that although the model detects a large number of customer needs with lead times of less than 5 months, it also detects many needs with lead times of up to 2 years. Fig 8 shows a kernel density estimate plot of these lead times across all 15 categories in the analysis. Such lead times would be highly beneficial for companies seeking to identify needs before they become popular in the marketplace. Refer to S15 Appendix for a more detailed description of this analysis, e.g. details on how the plot is generated.

Fig 8. Lead times (years) of detecting future keyphrase customer needs before they are addressed in the marketplace i.e. TCN dataset.

https://doi.org/10.1371/journal.pone.0307180.g008
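As a rough illustration of how such lead times could be derived and plotted, the sketch below uses a hypothetical table of prediction months and first-trending months; the keyphrases, dates and column names are placeholders and not taken from our data.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical predictions: the month each keyphrase was predicted as a future
    # need and the month it first trends in the TCN dataset.
    predictions = pd.DataFrame({
        "keyphrase": ["vegan", "charcoal", "matcha"],
        "prediction_month": pd.to_datetime(["2018-01-01", "2018-03-01", "2018-06-01"]),
        "first_trend_month": pd.to_datetime(["2019-06-01", "2018-07-01", "2020-05-01"]),
    })

    # Lead time in years from the month of prediction to the month of first trending.
    lead_years = (predictions["first_trend_month"]
                  - predictions["prediction_month"]).dt.days / 365.25

    # Kernel density estimate of the lead times, analogous to Fig 8.
    lead_years.plot(kind="kde")
    plt.xlabel("Lead time (years)")
    plt.show()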

The second analysis explores the room for future optimization of the approach. The main finding is that large performance gains are available from the way the parameters are validated, in particular the probability threshold parameter. One area of improvement is how we split the data into training and validation sets, which is not straightforward given that the training data contains overlapping time series (detailed in Section 4.1). Based solely on better estimation of the probability threshold parameter, the model can predict categories with an F1 score increase of 2.1%–5.4%. Refer to S16 Appendix for a more detailed description of this analysis.
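The snippet below is a minimal sketch of how a probability threshold could be validated against a held-out set by maximizing F1; the labels and scores are randomly generated placeholders and the names y_val and val_scores are illustrative, not the exact pipeline used in the paper.

    import numpy as np
    from sklearn.metrics import f1_score

    rng = np.random.default_rng(0)
    y_val = rng.integers(0, 2, size=200)   # hypothetical validation labels
    val_scores = rng.random(size=200)      # hypothetical predicted probabilities

    # Evaluate F1 over a grid of candidate probability thresholds and keep the best.
    thresholds = np.linspace(0.05, 0.95, 19)
    f1_per_threshold = [f1_score(y_val, (val_scores >= t).astype(int)) for t in thresholds]
    best_threshold = thresholds[int(np.argmax(f1_per_threshold))]
    print(f"Best probability threshold on validation data: {best_threshold:.2f}")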

The third analysis shows alternative visualizations of the model's performance, specifically the Receiver Operating Characteristic (ROC) Curve and the Precision-Recall (PR) Curve. This provides a different view of our results beyond the F1 score, which is used extensively in our study. Alongside these plots we also show the performance of a random classifier. The key finding reaffirms that our model achieves high performance on the task of predicting future customer needs. Refer to S17 Appendix for a more detailed description of this analysis.
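For reference, the sketch below shows one way such ROC and PR curves with a random-classifier baseline could be produced with scikit-learn and matplotlib; y_test and test_scores are hypothetical placeholders rather than our experimental outputs.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve, precision_recall_curve

    rng = np.random.default_rng(0)
    y_test = rng.integers(0, 2, size=500)   # hypothetical test labels
    test_scores = rng.random(size=500)      # hypothetical predicted probabilities

    fpr, tpr, _ = roc_curve(y_test, test_scores)
    precision, recall, _ = precision_recall_curve(y_test, test_scores)

    fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(10, 4))
    ax_roc.plot(fpr, tpr, label="model")
    ax_roc.plot([0, 1], [0, 1], linestyle="--", label="random")  # random-classifier ROC
    ax_roc.set_xlabel("False positive rate")
    ax_roc.set_ylabel("True positive rate")
    ax_pr.plot(recall, precision, label="model")
    ax_pr.axhline(y_test.mean(), linestyle="--", label="random")  # PR baseline = positive rate
    ax_pr.set_xlabel("Recall")
    ax_pr.set_ylabel("Precision")
    ax_roc.legend()
    ax_pr.legend()
    plt.show()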

The final analysis highlights the misclassifications made by the model. The primary error the MTL model makes is misclassifying keyphrases that are irrelevant to a particular product category, e.g. “garlic” for the Beer category or “beef” for the Cookie category. This occurs because the MTL model is trained on a wide variety of product categories: by training this way it learns the characteristics of a future customer need across multiple categories rather than the needs specific to its own category. This is how the MTL approach obtains generalizable performance, but it is also a drawback of the approach. Refer to S18 Appendix for a more detailed description of this analysis.

4.7 Summary and discussion

To help guide our evaluation we proposed two research questions at the beginning of the section: 1) can future customer needs be predicted with better performance than previous approaches using UGC from the social media platform Reddit across multiple product categories within CPG; and 2) can MTL (described in Section 3.5) be employed to achieve similar performance to training and testing on the same category. To address these questions, we described how our approach is implemented and detailed the two training strategies used in our evaluation (Section 4.1): 1) One Category training—which uses the same category data to train and test the model at finding future customer needs; and 2) Multiple Category training—which incorporates MTL by using multiple categories to train a model. We also described the two evaluation approaches used to test the described models (Section 4.2): 1) Binary Classification Evaluation and 2) List Evaluation.

Using the One Category approach, we compared our approach to a baseline in the literature (Section 4.3). We only used the One Category approach here as we wanted to test whether our general ML approach is better than the baseline. We showed that our One Category model significantly outperformed the baseline in both of the described evaluation approaches across multiple categories used in the baseline analysis, demonstrating that our approach can predict with better performance than previous approaches (question 1). The Multiple Category MTL approach was then tested against the One Category approach to observe whether training on multiple categories yields performance similar to training and testing on a single category. We showed that this was the case, as the Multiple Category approach performed very similarly to the One Category approach in both evaluation approaches across the categories used in the experiment (question 2). This illustrates that the MTL model can be employed to predict for categories it hasn't seen, making it useful in situations where a category has no ground truth training data available. It is noteworthy that the two research questions addressed in this evaluation map to two of the research contributions stated in our Introduction (Section 1): 1) the task of predicting future customer needs is performed better than in previous studies (question 1); and 2) MTL is employed by incorporating data from multiple product categories, yielding a model capable of predicting for a category it did not use during training (question 2).

Throughout our evaluation some of the reported results may seem underwhelming and draw criticism, e.g. F1 scores ranging from 7–14% for the best-performing MTL model across the 15 product categories used to evaluate it. Similar criticisms are leveled at tasks with a high data imbalance, which are also difficult to predict. Tasks of this kind appear across a wide range of topics in the ML literature, including hashtag prediction [140], intrusion detection [141], image classification [142], predicting responses to intensive Post-Traumatic Stress Disorder (PTSD) treatment [143], predicting treatment discontinuation in patients with diabetes [144], classification of unstructured medical notes [145], etc. Depending on the level of imbalance, these tasks can achieve performance similar to our study. In a study addressing virality prediction of hashtags [140] with a class imbalance of 15:1, the best two models achieved an F1 score of 36.28% and an Area Under Curve (AUC) PR score of 30%. In a study on intrusion detection [141], a model achieved an AUC PR score of 20.51% at detecting blacklist intrusions with a label distribution ratio of ≈166:1. A classification model identifying images on Wikipedia achieved a mean F1 score of 26.7% across 31 labels with a positive label percentage of 5.71–7.55% (depending on the dataset used). A model predicting Treatment Discontinuation (TD) for diabetes patients achieved AUC PR scores of 8.1%, 22.8% and 29% at 2, 3 and 4 months into treatment respectively, with a positive to negative label ratio of ≈30:1 in the training set and ≈26:1 in the test set. Finally, a text classification model assigning medical notes to 16 different classes obtained mean AUC PR scores between ≈10% and ≈90% depending on the prevalence of the disease (i.e. label distribution), with lower disease prevalence yielding lower AUC PR scores. These studies show that models performing difficult tasks with a high data imbalance generally achieve low headline performance (when evaluated correctly with high-quality ground truth data and suitable metrics). Although these models achieve modest scores, they still perform useful tasks, e.g. predicting responses to intensive PTSD treatment [143]. The same can be said for the area of research addressed in this study (i.e. predicting future customer needs), which has been discussed in the business literature for many years [46, 146–148].

5 Conclusion & future work

There are many families of approaches that mine customer needs from UGC. Some perform document classification, reducing the number of documents under analysis to those that contain customer needs. Some cluster keyphrases or documents into groups that contain customer needs. Finally, a large body of work focuses on the keyphrase level, highlighting important keyphrases deemed to be customer needs. Within the keyphrase-level literature, there is a lack of research addressing unmet customer needs by predicting future ones. There is also an absence of supervised approaches that highlight important customer needs, due to the unavailability of a ground truth dataset. Furthermore, few studies analyze customer needs over multiple product categories (e.g. toothpaste, cereal, beer, etc.). Having a multi-category ground truth dataset for detecting customer needs at the keyphrase level would open the door to many different tasks, such as training a single model that detects customer needs across a range of product categories.

To address these limitations, we outline an approach to predicting future customer needs from UGC using supervised ML. We do this by framing the problem of extracting customer needs from Reddit as a binary keyphrase classification problem where candidate keyphrases are classified at each Fixed Time Window. 15 individual corpora, each representing a product category, were collected by only considering posts containing defined keyphrase(s) likely to discuss the category of analysis, e.g. the defined keyphrases “cookie” and “biscuit” make up the Cookie category. The posts from each category were then preprocessed and candidate keyphrases were selected from them for the classification task. We then generated 1263 features for each candidate keyphrase in each product category, drawn from 10 families of features, e.g. frequency-based, product-based, sentiment-based, user-based, etc. Each feature is a univariate time series, so each candidate keyphrase instance is associated with a multivariate time series. We then described the process of adding the ground truth label to each candidate keyphrase instance across each of the 15 product categories. To do this we utilized the TCN, a dataset of trending keyphrase needs occurring in products each month from 2011–2021, spanning multiple product categories in the area of CPG. We used the dataset to indicate whether a candidate keyphrase will appear as a top customer need addressed in real products 1–3 years in the future. We are the first study to use TCN; without it, supervised ML could not be performed for the task of classifying future keyphrase customer needs. Finally, we detailed the MTSC algorithm used in our study (i.e. MiniRocket followed by Linear Ridge Regression) to learn the relationship between the candidate keyphrases and the binary output label from TCN. To evaluate our approach, 15 product categories were analyzed. In the main evaluation, we showed that our approach detects future customer needs significantly better than previous approaches and that our MTL model can detect future customer needs in categories it had not seen during training with performance similar to categories it had seen. In a further examination of our model, we showed that it can detect customer needs with lead times of up to 2–3 years before they occur in products, and that it can be improved by large margins by changing the validation procedure.
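To give a concrete sense of the MTSC step, the sketch below shows one way a MiniRocket transform followed by a linear ridge classifier could be assembled with sktime and scikit-learn. The input shapes, labels and the use of RidgeClassifierCV as the linear ridge step are assumptions for illustration; the exact pipeline and parameters used in our experiments are those described in Section 4.1.

    import numpy as np
    from sklearn.linear_model import RidgeClassifierCV
    from sklearn.pipeline import make_pipeline
    from sktime.transformations.panel.rocket import MiniRocketMultivariate

    rng = np.random.default_rng(0)
    # Hypothetical input: 100 candidate keyphrases, 1263 feature time series, 36 months.
    X = rng.random((100, 1263, 36))
    y = rng.integers(0, 2, size=100)  # hypothetical TCN-derived future-need labels

    clf = make_pipeline(
        MiniRocketMultivariate(),                           # random convolutional kernel features
        RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)),   # linear ridge classifier
    )
    clf.fit(X, y)
    print(clf.predict(X[:5]))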

The contributions of this research are as follows:

  • Task of predicting future customer need keyphrases is performed with higher performance than previous approaches, thus improving the detection of unmet needs.
  • Supervised ML was employed for the keyphrase classification task—made possible by the TCN dataset.
  • MTL was employed by incorporating data from multiple product categories to build a single model.
  • Conducted over multiple product categories at the category level (e.g. cheese) rather than the product model level (e.g. Charleville)—where other studies have tended to analyze a single product at the product model level (e.g. Toyota rather than cars).

In light of these contributions, there were also various limitations, indicating areas of future work. Although our experiments were run on powerful machines, we lacked the resources to perform some highly intensive tasks. As a workaround, we used undersampling in the training process and did not validate important algorithm hyper-parameters, e.g. the num_kernels parameter of the MiniRocket algorithm or the alpha parameter of the Linear Ridge Regression algorithm (as discussed in Section 4.1). An obvious area of future work is therefore to explore ways to reduce computational demands so the experiment can be run on all the training data and additional hyper-parameters can be optimized; other techniques could also be used to efficiently perform hyper-parameter optimization for large datasets [149]. After generating features for the task, we did not perform feature selection. Hence, we did not explore which features most impacted the model or which could be excluded to allow for potential increases in model performance or improvements in training times, i.e. by reducing the input space. Such benefits may have allowed other important limitations of this study to be addressed, e.g. the validation of model hyper-parameters, due to the decrease in the initial feature input space. Recently, there has been an increase in the number of algorithms performing feature selection for multivariate time series data [150–153], so there is no reason this cannot be performed in future studies. This study also has the limitation of not providing an Explainable Artificial Intelligence (XAI) analysis of the classification task, which has been a growing area of research for studies using ML on social media [154]. This could provide feature-level explanations for why a particular attribute is important for the task, i.e. feature importance. It would also help answer questions about why a particular feature (e.g. admiration sentiment) or feature family (e.g. sentiment) is important, e.g. whether the feature family “sentiment” plays a crucial role in the prediction of future customer needs. XAI could also provide instance-level explanations, allowing analysts using the prediction model to understand why a particular customer need is being predicted, e.g. “vegan” for Dog Food products is predicted to become popular in future products due to its rising frequency, high sentiment and diverse user base discussing it. As with feature selection, there has been an increase in the number of algorithms enabling explainability for MTSC tasks [155–158], so there is no reason this analysis cannot be performed in future studies. Similarly, an examination of which product categories most impacted the performance of the MTL model built in this study could be carried out. Many findings could arise from such an analysis, including discovering that only a small number of product categories produce a similarly performing model (compared to using all available categories), or that some categories negatively impact the performance of the model.
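As a rough illustration of the hyper-parameter validation mentioned above as future work, the sketch below grid-searches MiniRocket's num_kernels and the ridge alpha over the pipeline from the earlier sketch; the parameter ranges, cross-validation setting and names are illustrative assumptions only, and a time-aware split would be needed in practice given the overlapping time series discussed in Section 4.1.

    from sklearn.linear_model import RidgeClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sktime.transformations.panel.rocket import MiniRocketMultivariate

    pipe = Pipeline([
        ("minirocket", MiniRocketMultivariate()),
        ("ridge", RidgeClassifier()),
    ])
    param_grid = {
        "minirocket__num_kernels": [5_000, 10_000, 20_000],  # illustrative grid
        "ridge__alpha": [0.1, 1.0, 10.0],
    }
    # cv=3 is purely illustrative; a time-aware split should replace it in practice.
    search = GridSearchCV(pipe, param_grid, scoring="f1", cv=3)
    # search.fit(X, y)  # X, y as in the earlier pipeline sketch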

Another limitation of our study is that although it is performed on multiple product categories, these categories are all in the area of CPG; it could be the case that our model only works in this area. A further analysis would be needed to test whether this is the case, requiring another ground truth dataset (which would be very expensive to build). Finally, although careful consideration was taken to avoid bias in our experimental set-up (e.g. correctly splitting our data into training and testing sets), this study has the limitation that customer needs are being predicted retrospectively. This has been a stated limitation of the results of a previous study predicting future customer needs retrospectively [35] and has been a criticism of predicting election results after the fact [159]. That being said, our analysis is of interest nonetheless, as we were able to map customer needs on Reddit to future needs in a dataset of real products, i.e. TCN.

References

  1. 1. Feindt S, Jeffcoate J, Chappell C. Identifying success factors for rapid growth in SME e-commerce. Small business economics. 2002;19(1):51–62.
  2. 2. Freund YP. Critical success factors. Planning Review. 1988;.
  3. 3. Melander L. Customer involvement in product development: Using Voice of the Customer for innovation and marketing. Benchmarking: An International Journal. 2019;.
  4. 4. Kärkkäinen H, Piippo P, Tuominen M. Ten tools for customer-driven product development in industrial companies. International journal of production economics. 2001;69(2):161–176.
  5. 5. Cooper RG. The drivers of success in new-product development. Industrial Marketing Management. 2019;76:36–47.
  6. 6. Urban GL, Hauser JR. ‘Listening in’to Find Unmet Customer Needs and Solutions. Available at SSRN 373061. 2003;.
  7. 7. Tseng MM, Du X. Design by customers for mass customization products. Cirp Annals. 1998;47(1):103–106.
  8. 8. Sawhney M, Wolcott RC, Arroniz I. The 12 different ways for companies to innovate. MIT Sloan management review. 2006;47(3):75.
  9. 9. Araujo C, Benedetto-Neto H, Campello A, Segre F, Wright I. The utilization of product development methods: A survey of UK industry. Journal of Engeering Design. 1996;7(3):265–277.
  10. 10. Hamdani F, Monticolo D, Boly V. Etude de l’apport de l’Intelligence Artificielle pour l’innovation de produit [Study of the contribution of Artificial Intelligence to product innovation]. PFIA 2023. 2023;.
  11. 11. Jeong B, Yoon J, Lee JM. Social media mining for product planning: A product opportunity mining approach based on topic modeling and sentiment analysis. International Journal of Information Management. 2019;48:280–290.
  12. 12. Ko N, Jeong B, Choi S, Yoon J. Identifying product opportunities using social media mining: application of topic modeling and chance discovery theory. IEEE Access. 2017;6:1680–1693.
  13. 13. Choi J, Oh S, Yoon J, Lee JM, Coh BY. Identification of time-evolving product opportunities via social media mining. Technological Forecasting and Social Change. 2020;156:120045.
  14. 14. Tuarob S, Tucker CS. Fad or here to stay: Predicting product market adoption and longevity using large scale, social media data. In: International Design Engineering Technical Conferences and Computers and Information in Engineering Conference. vol. 55867. American Society of Mechanical Engineers; 2013. p. V02BT02A012.
  15. 15. Tuarob S, Tucker CS. Quantifying product favorability and extracting notable product features using large scale social media data. Journal of Computing and Information Science in Engineering. 2015;15(3).
  16. 16. Ko T, Rhiu I, Yun MH, Cho S. A novel framework for identifying Customers’ unmet needs on online social media using context tree. Applied Sciences. 2020;10(23):8473.
  17. 17. Han X, Li R, Li W, Ding G, Qin S. User requirements dynamic elicitation of complex products from social network service. In: 2019 25th International Conference on Automation and Computing (ICAC). IEEE; 2019. p. 1–6.
  18. 18. Chen D, Zhang D, Liu A. Intelligent Kano classification of product features based on customer reviews. CIRP Annals. 2019;68(1):149–152.
  19. 19. Chiu MC, Lin KZ. Utilizing text mining and Kansei Engineering to support data-driven design automation at conceptual design stage. Advanced Engineering Informatics. 2018;38:826–839.
  20. 20. Kim W, Ko T, Rhiu I, Yun MH. Mining affective experience for a kansei design study on a recliner. Applied ergonomics. 2019;74:145–153. pmid:30487093
  21. 21. Zhou F, Jiao RJ. Latent customer needs elicitation for big-data analysis of online product reviews. In: 2015 IEEE international conference on industrial engineering and engineering management (IEEM). IEEE; 2015. p. 1850–1854.
  22. 22. Jiang K, Li Y. Mining customer requirement from online reviews based on multi-aspected sentiment analysis and Kano model. In: 2020 16th Dahe Fortune China Forum and Chinese High-educational Management Annual Academic Conference (DFHMC). IEEE; 2020. p. 150–156.
  23. 23. Zha ZJ, Yu J, Tang J, Wang M, Chua TS. Product aspect ranking and its applications. IEEE transactions on knowledge and data engineering. 2013;26(5):1211–1224.
  24. 24. Hananto VR, Kim S, Kovacs M, Serdült U, Kryssanov V. A machine learning approach to analyze fashion styles from large collections of online customer reviews. In: 2021 6th International Conference on Business and Industrial Research (ICBIR). IEEE; 2021. p. 153–158.
  25. 25. Joung J, Kim HM. Automated keyword filtering in latent Dirichlet allocation for identifying product attributes from online reviews. Journal of Mechanical Design. 2021;143(8).
  26. 26. Aman JJ, Smith-Colin J, Zhang W. Listen to E-scooter riders: Mining rider satisfaction factors from app store reviews. Transportation research part D: transport and environment. 2021;95:102856.
  27. 27. Chen WK, Riantama D, Chen LS. Using a text mining approach to hear voices of customers from social media toward the fast-food restaurant industry. Sustainability. 2021;13(1):268.
  28. 28. Kwon HJ, Ban HJ, Jun JK, Kim HS. Topic modeling and sentiment analysis of online review for airlines. Information. 2021;12(2):78.
  29. 29. Lee S, Lee S, Seol H, Park Y. Using patent information for designing new product and technology: keyword based technology roadmapping. R&d Management. 2008;38(2):169–188.
  30. 30. Wang J, Chen YJ. A novelty detection patent mining approach for analyzing technological opportunities. Advanced Engineering Informatics. 2019;42:100941.
  31. 31. Jin G, Jeong Y, Yoon B. Technology-driven roadmaps for identifying new product/market opportunities: Use of text mining and quality function deployment. Advanced Engineering Informatics. 2015;29(1):126–138.
  32. 32. Roh T, Jeong Y, Jang H, Yoon B. Technology opportunity discovery by structuring user needs based on natural language processing and machine learning. PloS one. 2019;14(10):e0223404. pmid:31661516
  33. 33. Russo D, Spreafico M, Spreafico C. Supporting decision making in design creativity through requirements identification and evaluation. International Journal of Design Creativity and Innovation. 2023; p. 1–17.
  34. 34. Livotov P. Using patent information for identification of new product features with high market potential. Procedia engineering. 2015;131:1157–1164.
  35. 35. Kilroy D, Healy G, Caton S. Using Machine Learning to Improve Lead Times in the Identification of Emerging Customer Needs. IEEE Access. 2022;10:37774–37795.
  36. 36. Jin J, Jia D, Chen K. Mining online reviews with a Kansei-integrated Kano model for innovative product design. International Journal of Production Research. 2021; p. 1–20.
  37. 37. Zhou F, Ayoub J, Xu Q, Jessie Yang X. A machine learning approach to customer needs analysis for product ecosystems. Journal of Mechanical Design. 2020;142(1).
  38. 38. Yu J, Zha ZJ, Wang M, Chua TS. Aspect ranking: identifying important product aspects from online consumer reviews. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies; 2011. p. 1496–1505.
  39. 39. Liu C, Tang L, Shan W. An extended hits algorithm on bipartite network for features extraction of online customer reviews. Sustainability. 2018;10(5):1425.
  40. 40. Alrababah SAAA, Gan KH, Tan TP. Product aspect ranking using sentiment analysis and TOPSIS. In: 2016 Third International Conference on Information Retrieval and Knowledge Management (CAMP). IEEE; 2016. p. 13–19.
  41. 41. Alrababah SAA, Gan KH, Tan TP. Comparative analysis of MCDM methods for product aspect ranking: TOPSIS and VIKOR. In: 2017 8th International Conference on Information and Communication Systems (ICICS). IEEE; 2017. p. 76–81.
  42. 42. Kilroy D, Healy G, Caton S. The Trending Customer Needs (TCN) Dataset: A Benchmarking and Automated Evaluation Approach for New Product Development. HICS. 2022;10:37774–37795.
  43. 43. Gaskin SP, Griffin A, Hauser JR, Katz GM, Klein RL. Voice of the customer. Wiley International Encyclopedia of Marketing. 2010;.
  44. 44. Kühl N, Mühlthaler M, Goutier M. Supporting customer-oriented marketing with artificial intelligence: automatically quantifying customer needs from social media. Electronic Markets. 2020;30:351–367.
  45. 45. Kühl N, Satzger G. Needmining: Designing digital support to elicit needs from social media. arXiv preprint arXiv:210106146. 2021;.
  46. 46. Kärkkäinen H, Piippo P, Puumalainen K, Tuominen M. Assessment of hidden and future customer needs in Finnish business-to-business companies. R&d Management. 2001;31(4):391–407.
  47. 47. Hoonsopon D, Puriwat W. Organizational agility: Key to the success of new product development. IEEE Transactions on Engineering Management. 2019;68(6):1722–1733.
  48. 48. Nguyen TD, Kan MY. Keyphrase extraction in scientific publications. In: Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers: 10th International Conference on Asian Digital Libraries, ICADL 2007, Hanoi, Vietnam, December 10-13, 2007. Proceedings 10. Springer; 2007. p. 317–326.
  49. 49. Liu Z, Li P, Zheng Y, Sun M. Clustering to find exemplar terms for keyphrase extraction. In: Proceedings of the 2009 conference on empirical methods in natural language processing; 2009. p. 257–266.
  50. 50. Wu YfB, Li Q, Bot RS, Chen X. Domain-specific keyphrase extraction. In: Proceedings of the 14th ACM international conference on Information and knowledge management; 2005. p. 283–284.
  51. 51. Siddiqi S, Sharan A. Keyword and keyphrase extraction techniques: a literature review. International Journal of Computer Applications. 2015;109(2).
  52. 52. Hasan KS, Ng V. Automatic keyphrase extraction: A survey of the state of the art. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2014. p. 1262–1273.
  53. 53. Turney PD. Learning algorithms for keyphrase extraction. Information retrieval. 2000;2:303–336.
  54. 54. Yin C, Jiang C, Jain HK, Liu Y, Chen B. Capturing product/service improvement ideas from social media based on lead user theory. Journal of Product Innovation Management. 2023;.
  55. 55. Lee J, Jeong B, Yoon J, Song CH. Context-aware customer needs Identification by linguistic pattern mining based on online product reviews. IEEE Access. 2023;.
  56. 56. Morais I, Brito-Eliane E. Productive consumption and marketplace dynamics: A study in the DIY homemade natural beauty products context. ANPAD, São Paulo, Brazil, Tech Rep. 2015;.
  57. 57. Freelon D. Computational research in the post-API age. Political Communication. 2018;35(4):665–668.
  58. 58. Isaak J, Hanna MJ. User data privacy: Facebook, Cambridge Analytica, and privacy protection. Computer. 2018;51(8):56–59.
  59. 59. Kupferschmidt K. Twitter’s threat to curtail free data access angers scientists. Science (New York, NY). 2023;379(6633):624–625. pmid:36795820
  60. 60. Baumgartner J, Zannettou S, Keegan B, Squire M, Blackburn J. The pushshift reddit dataset. In: Proceedings of the international AAAI conference on web and social media. vol. 14; 2020. p. 830–839.
  61. 61. Cheng X, Yan X, Lan Y, Guo J. Btm: Topic modeling over short texts. IEEE Transactions on Knowledge and Data Engineering. 2014;26(12):2928–2941.
  62. 62. Pang J, Li X, Xie H, Rao Y. SBTM: Topic modeling over short texts. In: International Conference on Database Systems for Advanced Applications. Springer; 2016. p. 43–56.
  63. 63. Ramanand J, Bhavsar K, Pedanekar N. Wishful thinking-finding suggestions and’buy’wishes from product reviews. In: Proceedings of the NAACL HLT 2010 workshop on computational approaches to analysis and generation of emotion in text; 2010. p. 54–61.
  64. 64. Gupta V, Varshney D, Jhamtani H, Kedia D, Karwa S. Identifying purchase intent from social posts. In: Eighth International AAAI Conference on Weblogs and Social Media; 2014.
  65. 65. Wang J, Cong G, Zhao XW, Li X. Mining user intents in twitter: A semi-supervised approach to inferring intent categories for tweets. In: Twenty-Ninth AAAI Conference on Artificial Intelligence; 2015.
  66. 66. Hollerit B, Kröll M, Strohmaier M. Towards linking buyers and sellers: detecting commercial intent on twitter. In: Proceedings of the 22nd international conference on world wide web; 2013. p. 629–632.
  67. 67. Hartmann J, Heitmann M, Schamp C, Netzer O. The power of brand selfies. Journal of Marketing Research. 2021;58(6):1159–1177.
  68. 68. Timoshenko A, Hauser JR. Identifying customer needs from user-generated content. Marketing Science. 2019;38(1):1–20.
  69. 69. Kuehl N, Scheurenbrand J, Satzger G. Needmining: Identifying micro blog data containing customer needs. arXiv preprint arXiv:200305917. 2020;.
  70. 70. Zhang M, Fan B, Zhang N, Wang W, Fan W. Mining product innovation ideas from online reviews. Information Processing & Management. 2021;58(1):102389.
  71. 71. Solis E. Mintel global new products database (GNPD). Journal of Business & Finance Librarianship. 2016;21(1):79–82.
  72. 72. Kilroy D, Caton S, Healy G. Finding Short Lived Events on Social Media. In: AICS; 2020. p. 49–60.
  73. 73. Forler C, Egyed-Zsigmond E. Studies on interactive event detection and labeling from timestamped texts. In: Proceedings of the 2nd Joint Conference of the Information Retrieval Communities in Europe (CIRCLE 2022), Samatan, Gers, France, July; 2022.
  74. 74. Kazemi A, Younus A, Jeon M, Qureshi MA, Caton S. InÉire: An Interpretable NLP Pipeline Summarising Inclusive Policy Making Concerning Migrants in Ireland. IEEE Access. 2023;.
  75. 75. Tuarob S, Tucker CS. Automated discovery of product preferences in ubiquitous social media data: A case study of automobile market. In: 2016 International Computer Science and Engineering Conference (ICSEC). IEEE; 2016. p. 1–6.
  76. 76. Ruiz AP, Flynn M, Large J, Middlehurst M, Bagnall A. The great multivariate time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery. 2021;35(2):401–449. pmid:33679210
  77. 77. Zhang Y, Yang Q. An overview of multi-task learning. National Science Review. 2018;5(1):30–43.
  78. 78. Chen S, Bortsova G, García-Uceda Juárez A, Van Tulder G, De Bruijne M. Multi-task attention-based semi-supervised learning for medical image segmentation. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part III 22. Springer; 2019. p. 457–465.
  79. 79. Zhang Z, Yu W, Yu M, Guo Z, Jiang M. A survey of multi-task learning in natural language processing: Regarding task relatedness and training methods. arXiv preprint arXiv:220403508. 2022;.
  80. 80. Chen S, Zhang Y, Yang Q. Multi-task learning in natural language processing: An overview. arXiv preprint arXiv:210909138. 2021;.
  81. 81. Mahmoud RA, Hajj H, Karameh FN. A systematic approach to multi-task learning from time-series data. Applied Soft Computing. 2020;96:106586.
  82. 82. Wei C, Wang Z, Yuan J, Li C, Chen S. Time-frequency based multi-task learning for semi-supervised time series classification. Information Sciences. 2023;619:762–780.
  83. 83. Khoshkangini R, Mashhadi P, Tegnered D, Lundström J, Rögnvaldsson T. Predicting Vehicle Behavior Using Multi-task Ensemble Learning. Expert systems with applications. 2023;212:118716.
  84. 84. Ulwick AW. Turn customer input into innovation. Harvard business review. 2002;80(1):91–7. pmid:12964470
  85. 85. Nagamachi M. Kansei engineering: a new ergonomic consumer-oriented technology for product development. International Journal of industrial ergonomics. 1995;15(1):3–11.
  86. 86. Schütte S, Eklund J. Design of rocker switches for work-vehicles—an application of Kansei Engineering. Applied ergonomics. 2005;36(5):557–567. pmid:15950167
  87. 87. Saito E. Analysis of the desirable images for clothes in modern society. Kansei Engineering International. 2000;1(3):33–38.
  88. 88. Wang WM, Li Z, Tian Z, Wang J, Cheng MN. Extracting and summarizing affective features and responses from online product descriptions and reviews: A Kansei text mining approach. Engineering Applications of Artificial Intelligence. 2018;73:149–162.
  89. 89. Lin S, Shen T, Guo W. Evolution and emerging trends of Kansei engineering: A visual analysis based on citespace. IEEE Access. 2021;9:111181–111202.
  90. 90. Lai X, Zhang S, Mao N, Liu J, Chen Q. Kansei engineering for new energy vehicle exterior design: An internet big data mining approach. Computers & Industrial Engineering. 2022;165:107913.
  91. 91. Kano N. Attractive quality and must-be quality. Hinshitsu (Quality, The Journal of Japanese Society for Quality Control). 1984;14:39–48.
  92. 92. Bi JW, Liu Y, Fan ZP, Cambria E. Modelling customer satisfaction from online reviews using ensemble neural network and effect-based Kano model. International Journal of Production Research. 2019;57(22):7068–7088.
  93. 93. Zhao M, Zhang C, Hu Y, Xu Z, Liu H. Modelling consumer satisfaction based on online reviews using the improved Kano model from the perspective of risk attitude and aspiration. Technological and Economic Development of Economy. 2021;27(3):550–582.
  94. 94. Li Y, Sha K, Li H, Wang Y, Dong Y, Feng J, et al. Improving the elicitation of critical customer requirements through an understanding of their sensitivity. Research in Engineering Design. 2023; p. 1–20. pmid:36811036
  95. 95. Kuehl N. Needmining: Towards analytical support for service design. In: International Conference on Exploring Services Science. Springer; 2016. p. 187–200.
  96. 96. Ulwick AW. What Is Outcome-Driven Innovation®(ODI)? White Paper. 2009;.
  97. 97. Killen CP, Walker M, Hunt RA. Strategic planning using QFD. International Journal of Quality & Reliability Management. 2005;.
  98. 98. Chaudha A, Jain R, Singh A, Mishra P. Integration of Kano’s Model into quality function deployment (QFD). The International Journal of Advanced Manufacturing Technology. 2011;53(5):689–698.
  99. 99. Velikova N, Slevitch L, Mathe-Soulek K. Application of Kano model to identification of wine festival satisfaction drivers. International Journal of Contemporary Hospitality Management. 2017;.
  100. 100. Basfirinci C, Mitra A. A cross cultural investigation of airlines service quality through integration of Servqual and the Kano model. Journal of Air Transport Management. 2015;42:239–248.
  101. 101. Jiang H, Kwong CK, Yung KL. Predicting future importance of product features based on online customer reviews. Journal of Mechanical Design. 2017;139(11).
  102. 102. Tucker C, Kim H. Predicting emerging product design trend by mining publicly available customer review data. In: DS 68-6: Proceedings of the 18th International Conference on Engineering Design (ICED 11), Impacting Society through Engineering Design, Vol. 6: Design Information and Knowledge, Lyngby/Copenhagen, Denmark, 15.-19.08. 2011; 2011.
  103. 103. Yakubu H, Kwong C. Forecasting the importance of product attributes using online customer reviews and Google Trends. Technological Forecasting and Social Change. 2021;171:120983.
  104. 104. Suryadi D, Kim H. Automatic identification of product usage contexts from online customer reviews. In: Proceedings of the Design Society: International Conference on Engineering Design. vol. 1. Cambridge University Press; 2019. p. 2507–2516.
  105. 105. Ayoub J, Zhou F, Xu Q, Yang J. Analyzing customer needs of product ecosystems using online product reviews. In: International design engineering technical conferences and computers and information in engineering conference. vol. 59186. American Society of Mechanical Engineers; 2019. p. V02AT03A002.
  106. 106. Wang W, Feng Y, Dai W. Topic analysis of online reviews for two competitive products using latent Dirichlet allocation. Electronic Commerce Research and Applications. 2018;29:142–156.
  107. 107. Aiello LM, Petkos G, Martin C, Corney D, Papadopoulos S, Skraba R, et al. Sensing trending topics in Twitter. IEEE Transactions on multimedia. 2013;15(6):1268–1282.
  108. 108. Varol O, Ferrara E, Menczer F, Flammini A. Early detection of promoted campaigns on social media. EPJ data science. 2017;6:1–19.
  109. 109. Aoyama H. A study of stratified random sampling. Ann Inst Stat Math. 1954;6(1):1–36.
  110. 110. Iliyasu R, Etikan I. Comparison of quota sampling and stratified random sampling. Biom Biostat Int J Rev. 2021;10:24–27.
  111. 111. Honnibal M, Montani I, Van Landeghem S, Boyd A. spaCy: Industrial-strength natural language processing in python. Zenodo, Honolulu, HI, USA. 2020;.
  112. 112. Weischedel R, Palmer M, Marcus M, Hovy E, Pradhan S, Ramshaw L, et al. Ontonotes release 5.0 ldc2013t19. Linguistic Data Consortium, Philadelphia, PA. 2013;23.
  113. 113. Read J, Dridan R, Oepen S, Solberg LJ. Sentence boundary detection: A long solved problem? In: Proceedings of COLING 2012: Posters; 2012. p. 985–994.
  114. 114. Zesch T, Gurevych I. Approximate matching for evaluating keyphrase extraction. In: Proceedings of the International Conference RANLP-2009; 2009. p. 484–489.
  115. 115. Berend G. Exploiting extra-textual and linguistic information in keyphrase extraction. Natural Language Engineering. 2016;22(1):73–95.
  116. 116. Gopan E, Rajesh S, Vishnu G, Thushara M, et al. Comparative study on different approaches in keyword extraction. In: 2020 Fourth International Conference on Computing Methodologies and Communication (ICCMC). IEEE; 2020. p. 70–74.
  117. 117. Li X, Song F. Keyphrase extraction and grouping based on association rules. In: The Twenty-Eighth International Flairs Conference; 2015.
  118. 118. Papagiannopoulou E, Tsoumakas G. Local word vectors guiding keyphrase extraction. Information Processing & Management. 2018;54(6):888–902.
  119. 119. QasemiZadeh B, Handschuh S. The ACL RD-TEC: a dataset for benchmarking terminology extraction and classification in computational linguistics. In: Proceedings of the 4th International Workshop on Computational Terminology (Computerm); 2014. p. 52–63.
  120. 120. Ahel R, Dalbelo Bašic B, Šnajder J. Automatic keyphrase extraction from Croatian newspaper articles. The Future of Information Sciences, Digital Resources and Knowledge Sharing. 2009; p. 207–218.
  121. 121. Simon H, Leker J. Using startup communication for opportunity recognition—an approach to identify future product trends. International Journal of Innovation Management. 2016;20(08):1640016.
  122. 122. Demszky D, Movshovitz-Attias D, Ko J, Cowen A, Nemade G, Ravi S. GoEmotions: A Dataset of Fine-Grained Emotions. In: 58th Annual Meeting of the Association for Computational Linguistics (ACL); 2020.
  123. 123. Reimers N, Gurevych I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics; 2019. Available from: https://arxiv.org/abs/1908.10084.
  124. 124. Mikolov T, Grave E, Bojanowski P, Puhrsch C, Joulin A. Advances in Pre-Training Distributed Word Representations. In: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018); 2018.
  125. 125. Stahlmann S, Ettrich O, Kurka M, Schoder D. What Do Customers Say About My Products? Benchmarking Machine Learning Models for Need Identification. In: Proc. of the HICSS; 2023.
  126. 126. Bagnall A, Dau HA, Lines J, Flynn M, Large J, Bostrom A, et al. The UEA multivariate time series classification archive, 2018. arXiv preprint arXiv:181100075. 2018;.
  127. 127. Hsieh RJ, Chou J, Ho CH. Unsupervised online anomaly detection on multivariate sensing time series data for smart manufacturing. In: 2019 IEEE 12th Conference on Service-Oriented Computing and Applications (SOCA). IEEE; 2019. p. 90–97.
  128. 128. Xiahou X, Harada Y. B2C E-Commerce Customer Churn Prediction Based on K-Means and SVM. Journal of Theoretical and Applied Electronic Commerce Research. 2022;17(2):458–475.
  129. 129. Löning M, Bagnall A, Ganesh S, Kazakov V, Lines J, Király FJ. sktime: A unified interface for machine learning with time series. arXiv preprint arXiv:190907872. 2019;.
  130. 130. Dempster A, Schmidt DF, Webb GI. Minirocket: A very fast (almost) deterministic transform for time series classification. In: Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining; 2021. p. 248–257.
  131. 131. Dempster A, Petitjean F, Webb GI. ROCKET: exceptionally fast and accurate time series classification using random convolutional kernels. Data Mining and Knowledge Discovery. 2020;34(5):1454–1495.
  132. 132. Kuang Z, Li Z, Zhao T, Fan J. Deep multi-task learning for large-scale image classification. In: 2017 IEEE Third International Conference on Multimedia Big Data (BigMM). IEEE; 2017. p. 310–317.
  133. 133. Kaur H, Pannu HS, Malhi AK. A systematic review on imbalanced data challenges in machine learning: Applications and solutions. ACM Computing Surveys (CSUR). 2019;52(4):1–36.
  134. 134. Hancock J, Johnson JM, Khoshgoftaar TM. A Comparative Approach to Threshold Optimization for Classifying Imbalanced Data. In: 2022 IEEE 8th International Conference on Collaboration and Internet Computing (CIC). IEEE; 2022. p. 135–142.
  135. 135. Brownlee J. Imbalanced classification with Python: better metrics, balance skewed classes, cost-sensitive learning. Machine Learning Mastery; 2020.
  136. 136. Comito C, Forestiero A, Pizzuti C. Bursty event detection in Twitter streams. ACM Transactions on Knowledge Discovery from Data (TKDD). 2019;13(4):1–28.
  137. 137. Mann HB, Whitney DR. On a test of whether one of two random variables is stochastically larger than the other. The annals of mathematical statistics. 1947; p. 50–60.
  138. 138. Hart A. Mann-Whitney test is not just a test of medians: differences in spread can be important. Bmj. 2001;323(7309):391–393. pmid:11509435
  139. 139. Cowles M, Davis C. On the origins of the .05 level of statistical significance. American Psychologist. 1982;37(5):553.
  140. 140. Bora S, Singh H, Sen A, Bagchi A, Singla P. On the role of conductance, geography and topology in predicting hashtag virality. Social Network Analysis and Mining. 2015;5:1–15.
  141. 141. Yilmaz I, Masum R, Siraj A. Addressing imbalanced data problem with generative adversarial network for intrusion detection. In: 2020 IEEE 21st International Conference on Information Reuse and Integration for Data Science (IRI). IEEE; 2020. p. 25–30.
  142. 142. Vieira Bernat M. Topical Classification of Images in Wikipedia: Development of topical classification models followed by a study of the visual content of Wikipedia; 2023.
  143. 143. Held P, Schubert RA, Pridgen S, Kovacevic M, Montes M, Christ NM, et al. Who will respond to intensive PTSD treatment? A machine learning approach to predicting response prior to starting treatment. Journal of psychiatric research. 2022;151:78–85. pmid:35468429
  144. 144. Kurasawa H, Waki K, Chiba A, Seki T, Hayashi K, Fujino A, et al. Treatment Discontinuation Prediction in Patients With Diabetes Using a Ranking Model: Machine Learning Model Development. JMIR Bioinformatics and Biotechnology. 2022;3(1):e37951. pmid:38935955
  145. 145. Lu H, Ehwerhemuepha L, Rakovski C. A comparative study on deep learning models for text classification of unstructured medical notes with various levels of class imbalance. BMC Medical Research Methodology. 2022;22(1):181. pmid:35780100
  146. 146. Judson K, Schoenbachler DD, Gordon GL, Ridnour RE, Weilbaker DC. The new product development process: let the voice of the salesperson be heard. Journal of Product & Brand Management. 2006;15(3):194–202.
  147. 147. Frishammar J. Managing information in new product development: A literature review. International Journal of Innovation and Technology Management. 2005;2(03):259–275.
  148. 148. Chong YT, Chen CH. Customer needs as moving targets of product development: a review. The International Journal of Advanced Manufacturing Technology. 2010;48:395–406.
  149. 149. Klein A, Falkner S, Bartels S, Hennig P, Hutter F. Fast bayesian optimization of machine learning hyperparameters on large datasets. In: Artificial intelligence and statistics. PMLR; 2017. p. 528–536.
  150. 150. Kathirgamanathan B, Cunningham P. Correlation based feature subset selection for multivariate time-series data. arXiv preprint arXiv:211203705. 2021;.
  151. 151. Sun Y, Li J, Liu J, Chow C, Sun B, Wang R. Using causal discovery for feature selection in multivariate numerical time series. Machine Learning. 2015;101:377–395.
  152. 152. Pistorius F, Baumann D, Sax E. Differential Correlation Approach for Multivariate Time Series Feature Selection. In: Proceedings of the Future Technologies Conference (FTC) 2021, Volume 1. Springer; 2022. p. 928–942.
  153. 153. Kathirgamanathan B, Cunningham P. A feature selection method for multi-dimension time-series data. In: Advanced Analytics and Learning on Temporal Data: 5th ECML PKDD Workshop, AALTD 2020, Ghent, Belgium, September 18, 2020, Revised Selected Papers 6. Springer; 2020. p. 220–231.
  154. 154. Younus A, Qureshi MA, Jeon M, Kazemi A, Caton S. XAI Analysis of Online Activism to Capture Integration in Irish Society Through Twitter. In: International Conference on Social Informatics. Springer; 2022. p. 233–244.
  155. 155. Le Nguyen T, Gsponer S, Ilie I, O’reilly M, Ifrim G. Interpretable time series classification using linear models and multi-resolution multi-domain symbolic representations. Data mining and knowledge discovery. 2019;33:1183–1222.
  156. 156. Fauvel K, Lin T, Masson V, Fromont É, Termier A. Xcm: An explainable convolutional neural network for multivariate time series classification. Mathematics. 2021;9(23):3137.
  157. 157. Assaf R, Giurgiu I, Bagehorn F, Schumann A. Mtex-cnn: Multivariate time series explanations for predictions with convolutional neural networks. In: 2019 IEEE International Conference on Data Mining (ICDM). IEEE; 2019. p. 952–957.
  158. 158. Ozyegen O, Ilic I, Cevik M. Evaluation of interpretability methods for multivariate time series forecasting. Applied Intelligence. 2022; p. 1–17. pmid:34764613
  159. 159. Gayo-Avello D. No, you cannot predict elections with Twitter. IEEE Internet Computing. 2012;16(6):91–94.