Fig 1.
Workflow and architecture of the computational processing framework.
Spouts (tap symbol) emit data (here: web pages); bolts (lightning symbol) process data (i.e., computing term statistics, readability metrics, and vocabulary-based text difficulty, and storing the results). SVM: Support Vector Machine; R: Readability Metrics.
Fig 2.
Workflow of the processing steps and software components for topic modeling: (1) text material is retrieved from a central relational database; (2) several processing threads perform a collection of pre-processing tasks; (3) LDA is applied to the resulting document vectors. The software takes raw text material as input and outputs n topics, where n is a user-defined input parameter to LDA.
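Step (2) of this workflow can be illustrated with a minimal sketch, assuming that pre-processing consists of lowercasing, tokenization and stop-word removal, and that document vectors are plain term counts. The stop-word list, the function names and the worker count are hypothetical and are not taken from the original software; the LDA step (3) is not shown.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = {"der", "die", "das", "und", "ist", "ein", "eine"}

def preprocess(text):
    """Lowercase, tokenize, drop stop words, and return term counts."""
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def build_document_vectors(texts, workers=4):
    """Pre-process documents in parallel threads, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(preprocess, texts))

docs = ["Die Grippe ist eine Infektion.", "Das Fieber und die Grippe."]
vectors = build_document_vectors(docs)
```

Each resulting term-count vector would then serve as one input document for a topic model such as LDA.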
Fig 3.
Size-rank plot of degree distribution of the host-aggregated sGHW graph.
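The data behind a size-rank plot can be derived as in the following sketch, which assumes an undirected reading of the host graph (each link counts toward the degree of both hosts). The edge list and host names are made up for illustration. Such plots are typically drawn on log-log axes.

```python
from collections import Counter

def size_rank(edges):
    """Return (rank, degree) pairs, with rank 1 for the highest-degree node."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    sizes = sorted(degree.values(), reverse=True)
    return list(enumerate(sizes, start=1))

# Hypothetical host-level links, not taken from the sGHW graph.
edges = [("a.de", "b.de"), ("a.de", "c.de"), ("a.de", "d.de"), ("b.de", "c.de")]
points = size_rank(edges)
```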
Table 1.
Domains of 25 top-ranked web sites for ccTLD “.de” with their respective information provider according to PageRank.
Table 2.
Domains of 25 top-ranked web sites for ccTLD “.at” with their respective information provider according to PageRank.
Table 3.
Domains of 25 top-ranked web sites for ccTLD “.ch” with their respective information provider according to PageRank.
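The ranking criterion used in Tables 1 to 3 can be illustrated with a minimal power-iteration sketch of PageRank. The damping factor of 0.85, the fixed iteration count, the handling of hosts without out-links and the toy link graph are assumptions for illustration, not the configuration used for the tables.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration; links maps each node to the list of nodes it links to."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical host graph (assumption for illustration).
links = {"a.de": ["b.de"], "b.de": ["a.de", "c.de"], "c.de": ["a.de"]}
scores = pagerank(links)
top = sorted(scores, key=scores.get, reverse=True)
```

Sorting hosts by score in descending order and keeping the first 25 would yield a list of the kind shown in the tables.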
Table 4.
Mapping of readability and vocabulary scales to difficulty classes according to Wiesner et al. [25]. VE: very easy; E: easy; M: moderate; D: difficult; VD: very difficult.
Fig 4.
Distribution of readability values on the Flesch Reading Ease scale for each ccTLD (“.de”, “.at”, “.ch”).
Difficulty is indicated by color, with dark green as the highest readability (90–100) and dark red as the lowest readability (0–10). Note: For consistency, the x axis is reversed and ranges from 100 to 0.
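How a page ends up in one of the ten color bands can be sketched as follows. The sketch assumes Amstad's German adaptation of the Flesch Reading Ease, 180 − ASL − 58.5 × ASW, where ASL is the average sentence length in words and ASW the average number of syllables per word; the caption does not state which variant was used. Function names and the sample counts are hypothetical.

```python
def flesch_reading_ease_german(n_words, n_sentences, n_syllables):
    """Amstad's German adaptation (assumed variant): 180 - ASL - 58.5 * ASW."""
    asl = n_words / n_sentences   # average sentence length in words
    asw = n_syllables / n_words   # average syllables per word
    return 180.0 - asl - 58.5 * asw

def fre_band(score):
    """Return the 10-point band (low, high) for a score, clamped to 0-100."""
    clamped = min(max(score, 0.0), 100.0)
    low = min(int(clamped // 10) * 10, 90)
    return (low, low + 10)

# Hypothetical text with 100 words, 10 sentences and 200 syllables.
score = flesch_reading_ease_german(n_words=100, n_sentences=10, n_syllables=200)
band = fre_band(score)
```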
Fig 5.
Distribution of readability values on the Vienna formula scale for each ccTLD (“.de”, “.at”, “.ch”).
Difficulty is indicated by color, with dark green as the highest readability (4–5) and dark red as the lowest readability (14–15).
Fig 6.
Distribution of achieved vocabulary values on the SVM classification scale L for each ccTLD (“.de”, “.at”, “.ch”).
Difficulty is indicated by color, with dark green as the most layperson-friendly level (1) and dark red as the highest expert level required (10). SVM: support vector machine.
Fig 7.
Scatter plot of the distributions for FRE (Flesch Reading Ease), WSTF (Vienna formula) and L (SVM vocabulary classification scale) for each ccTLD.
Fig 8.
Perplexity score per number of topics for the 3,747,055 health-related web pages belonging to the 3 × 1000 top-ranked web sites (1000 per ccTLD) from the sGHW.
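Perplexity can be computed from per-token log-probabilities as in the sketch below, which uses the standard definition (the exponential of the negative mean log-likelihood per token). Choosing the topic count with the lowest perplexity is one common heuristic and is an assumption here; the perplexity values in the example are invented for illustration and are not results from the study.

```python
import math

def perplexity(token_log_probs):
    """Exponential of the negative mean log-probability per token (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def best_topic_count(perplexity_by_k):
    """Return the number of topics with the lowest perplexity."""
    return min(perplexity_by_k, key=perplexity_by_k.get)

# A model that assigns every token probability 0.25 (uniform over 4 terms).
uniform = [math.log(0.25)] * 8
ppl = perplexity(uniform)

# Hypothetical perplexity values per topic count, for illustration only.
choice = best_topic_count({2: 9.0, 4: 6.5, 8: 7.1})
```

Lower perplexity indicates that the model predicts held-out text better, so a curve like the one in the figure is usually read by looking for its minimum or the point where it flattens.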
Table 5.
The 50 topics that were identified from the web pages of the top 1000 web sites for each ccTLD.
The sample terms were ordered based on their relevance to the topic.
Fig 9.
Theme distribution per information provider type for the ccTLD “.de”.
Information provider types: GPH: Government, Public Institution or Public Health; NPO: Non-Profit Organization; PO: Private Organization; M: Mainstream or Local News; PC: Pharmaceutical Company; PB: Private Blog; O: Other.
Fig 10.
Theme distribution per information provider type for the ccTLD “.at”.
Information provider types: GPH: Government, Public Institution or Public Health; NPO: Non-Profit Organization; PO: Private Organization; M: Mainstream or Local News; PC: Pharmaceutical Company; PB: Private Blog; O: Other.
Fig 11.
Theme distribution per information provider type for the ccTLD “.ch”.
Information provider types: GPH: Government, Public Institution or Public Health; NPO: Non-Profit Organization; PO: Private Organization; M: Mainstream or Local News; PC: Pharmaceutical Company; PB: Private Blog; O: Other.
Fig 12.
Theme distribution per ccTLD.