The Detection of Emerging Trends Using Wikipedia Traffic Data and Context Networks

doi:10.1371/journal.pone.0141892

Fig 1.

(a) A semantic concept (SC) is represented by Wikipedia pages regarding the same topic in different languages (semantic core nodes, SCN, blue). One node from the SCN, i.e. the page in one specific language, is selected as central node (CN, green) and separated from the SCN. All pages in the same language directly linked to the CN make up the local neighborhood (LN, light green), while all pages directly linked to the other SCN make up the global neighborhood (GN, light blue). (b) Schematic representation of the corresponding adjacency matrix. SCN and CN combined are called the multilingual semantic core, while LN and GN form the hull. The pages in the core are connected via inter-language links (ILL, black for links involving the CN and purple for links among SCN). Links between CN and LN are shown in green, while links between SCN and GN are shown in gray. All other links are ignored because only the nearest neighborhood is taken into account. (c) Network representation using the same colors as in parts (a, b). The network representation allows an inspection of the dataset before group definitions are finalized. A qualitative interpretation and a quantitative measurement of network properties help to evaluate the impact of separation into sub-networks, e.g. to prepare data for time-resolved relevance analysis.
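The grouping described in this caption can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' implementation: the function name `build_context`, the `links` dictionary layout, and the page names in the example are all hypothetical.

```python
def build_context(cn, scn, links):
    """Split pages into the groups named in the caption.

    cn    -- central node (page in the selected language)
    scn   -- semantic core nodes (same topic, other languages)
    links -- dict: page -> set of directly linked pages in the same language
             (hypothetical input format, assumed for this sketch)
    """
    core = {cn} | set(scn)              # multilingual semantic core
    ln = set(links.get(cn, ())) - core  # local neighborhood: neighbors of CN
    gn = set()
    for page in scn:                    # global neighborhood: neighbors of SCN
        gn |= set(links.get(page, ()))
    gn -= core
    return {"CN": cn, "SCN": set(scn), "LN": ln, "GN": gn}


# Hypothetical toy data; page names are illustrative only.
links = {
    "en:Apache Hadoop": {"en:MapReduce", "en:Java"},
    "de:Apache Hadoop": {"de:MapReduce"},
    "fr:Hadoop": {"fr:MapReduce", "fr:Java"},
}
ctx = build_context("en:Apache Hadoop", ["de:Apache Hadoop", "fr:Hadoop"], links)
hull = ctx["LN"] | ctx["GN"]  # LN and GN together form the hull
```

Only direct neighbors are collected, which mirrors the caption's statement that links beyond the nearest neighborhood are ignored.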


Table 1.

Selection of English Wikipedia pages (CNs) regarding topics with a direct relation to the emerging Hadoop (Big Data) market.

Apache Hadoop is the central software project, alongside Apache SOLR and Apache Lucene (SW, software). Companies which offer Hadoop distributions and Hadoop-based solutions are the central companies in the scope of the study (HV, hardware vendors). Other companies started very early with Hadoop-related projects as early adopters (EA). Global players (GP) are affected by this emerging market, its opportunities and the new competitors (NC). Some new but highly relevant companies like Talend or LucidWorks have been selected because of their obvious commitment to open source ideas. Widely adopted technologies with a relation to the selected research topic are represented by the group TEC.


Fig 2.

(a) Wikipedia network representation for several interconnected local networks (several CNs with their LNs). The selected topics are related to the emerging Hadoop market, see Table 1 in section 2.1 for details. The colors indicate the membership of the nodes in topical clusters (see legend). Wikipedia pages about Hadoop-related projects are found in the close neighborhood of new companies and also close to two important programming languages, Java and C++, which have both been highly recognized for many years. All CNs with fewer than five links (k = |LN| < 5) were removed when the layout was calculated with Gephi (OpenOrd layout). (b) Representation and relevance plot showing text volume REL_v versus REP_v with circle sizes given by log REP_k for all selected CNs as in (a). Colors indicate the role of each group in the emerging Hadoop market and differ from the modularity classes used in (a). One can see a high text volume representation index for pages about the early adopters of Hadoop technology (EA, purple). Wikipedia pages about core technology, i.e. software such as Apache Lucene, Apache Nutch, and Apache Mahout (SW, orange), have a low relevance index REL_v. Their representation index REP_v is also small compared with that of the companies that use the new technology. The label (HV, green) stands for Hadoop-related hardware vendors, (TEC, light blue) for general technology companies, (GP, blue) for global players, and (NC, red) for the competitors within the emerging market.


Fig 3.

Relative attraction and trends based on Google Trends data (a, b, c) and on access-rate time series from Wikipedia (d) for names of NoSQL databases as search terms (see legends) versus time in months since (a, b) 2010-01-01, (c) 2004-01-01, and (d) 2009-01-01. The approximately linear slopes in (a) differ, and very strong fluctuations are found for Hadoop (black) and SOLR (green), so that the different curves can hardly be compared. In (b) the Google Trends data from (a) have been normalized to zero average and unit standard deviation to facilitate a comparison. However, ambiguous keywords can strongly influence the results, as seen in (c), which shows raw Google Trends data acquired simultaneously for eight different keywords (using unofficial API software [31]). Here, the maximum at interest level 100 (arbitrary units from Google Trends) occurs for only one of the curves, in 2015, and is not shown in the plot. Wikipedia access-rate data in (d), corresponding to the time range of the yellow box in (c), indicate a jump in user interest for pages about Apache Hadoop, Apache Zookeeper, and Apache SOLR, which was not visible in the Google Trends data. The gray boxes in (d) indicate times for which no Wikipedia access data are available for technical reasons.
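The normalization used for panel (b), zero average and unit standard deviation, is the standard z-score. The sketch below shows one way to compute it. It assumes the population standard deviation (the caption does not say which variant was used), and the sample values are invented for illustration, not taken from the study.

```python
from statistics import fmean, pstdev


def zscore(series):
    """Rescale a series to zero mean and unit (population) standard deviation."""
    mean = fmean(series)
    std = pstdev(series)  # assumption: population std, not sample std
    if std == 0:
        raise ValueError("a constant series cannot be normalized")
    return [(x - mean) / std for x in series]


# Hypothetical monthly interest values for two search terms (not real data).
hadoop = [10, 14, 13, 22, 30, 41]
solr = [50, 52, 55, 54, 58, 61]

hadoop_z = zscore(hadoop)
solr_z = zscore(solr)
```

After this step both curves share the same scale, so their shapes can be compared even though the raw levels and slopes differ.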


Fig 4.

Project life cycle phases derived from Wikipedia usage data based on L.TRRI_a(t) (solid lines, Eq (5)) and G.TRRI_a(t) (dashed lines, Eq (6)) for Apache Hadoop (black) and Apache SOLR (green) versus time in weeks since 2009-01-01.

For Hadoop, L.TRRI_a(t) shows a strong, linearly increasing trend with some short-term fluctuations fading out after three months in the years 2009 and 2010. These short-term peaks reflect conference seasons. In 2011, finally, L.TRRI_a(t) exceeds one: a sporadic jump in user interest is followed by a saturation. G.TRRI_a(t) shows a significantly weaker trend during the same time and remains below one. This means that public interest in Hadoop-related information is bound to the English language. The 'breakthrough' of Apache Hadoop as a relevant topic thus occurs in 2011, about 18 months after Apache SOLR became a relevant topic. In the end, however, Apache SOLR is less relevant than Apache Hadoop. The gray areas indicate times for which no data were available for technical reasons. While the thin lines are the original L.TRRI_a(t) data, the thick lines have been obtained by applying a running average filter with a window length of 12 weeks.
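The smoothing step behind the thick lines can be illustrated with a simple running average. This is a sketch under stated assumptions: the caption gives only the window length (12 weeks), so the trailing (non-centered) window and the lack of any handling for data gaps are choices made here, and the function name `running_average` is hypothetical.

```python
def running_average(values, window=12):
    """Trailing running average; returns one value per full window.

    Assumption: the window covers the current point and the window - 1
    points before it. Gaps in the data (gray areas) are not handled.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    total = 0.0
    for i, value in enumerate(values):
        total += value
        if i >= window:
            total -= values[i - window]  # drop the point leaving the window
        if i >= window - 1:
            smoothed.append(total / window)
    return smoothed


# Hypothetical weekly index values for one year (illustrative only).
weekly = [0.1 * week for week in range(52)]
smooth = running_average(weekly, window=12)
```

With 52 weekly points and a 12-week window, the result has 41 points, because the first 11 weeks do not yet fill a complete window.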


Fig 5.

Version number versus release date (in weeks since 2004-01-01) for Apache Hadoop (gray) and Apache SOLR (green).

The version numbers for different parallel development branches of both projects indicate ongoing improvements up to the ends of the active projects. The dates of the jump in interest derived from the plot of L.TRRI_a(t) in Fig 4 are shown here as vertical black lines for SOLR (a) and Hadoop (b).


Fig 6.

(a) Contextual local and global time-resolved relevance indexes L.TRRI_a(t) (solid lines, Eq (5)) and G.TRRI_a(t) (dashed lines, Eq (6)) for Wikipedia pages regarding the companies Oracle (red) and Capgemini (blue) and for Apache Hadoop (black) between 2009-01-01 and 2011-12-31 at weekly resolution. In (b), the text volume representation and relevance indexes are shown (as in Fig 2b) for Capgemini in six different languages: the highest representation is in French. In this case, the local context can be defined as a hybrid context by two CNs: French because of the country of origin, and English because of international IT business. If only one language is used, as in (a), one cannot clearly differentiate between local and global relevance.
