The internet has become an increasingly important resource for health information, especially for lay people. However, the information found does not necessarily match the user's health literacy level. Therefore, it is vital to (1) identify prominent information providers, (2) quantify the readability of written health information, and (3) analyze how different types of information sources are suited for people with differing health literacy levels.
In previous work, we showed the use of a focused crawler to "capture" and describe a large sample of the "German Health Web", which we call the "Sampled German Health Web" (sGHW). It includes health-related web content of the three predominantly German-speaking countries Germany, Austria, and Switzerland, i.e. the country-code top-level domains (ccTLDs) ".de", ".at" and ".ch". Based on the crawled data, we now provide a fully automated readability and vocabulary analysis of a subsample of the sGHW, an analysis of the sGHW's graph structure covering its size, its content providers and the ratio of public to private stakeholders. In addition, we apply Latent Dirichlet Allocation (LDA) to identify topics and themes within the sGHW.
Important web sites were identified by applying PageRank on the sGHW’s graph representation. LDA was used to discover topics within the top-ranked web sites. Next, a computer-based readability and vocabulary analysis was performed on each health-related web page. Flesch Reading Ease (FRE) and the 4th Vienna formula (WSTF) were used to assess the readability. Vocabulary was assessed by a specifically trained Support Vector Machine classifier.
In total, n = 14,193,743 health-related web pages were collected during the study period of 370 days. The resulting host-aggregated web graph comprises 231,733 nodes connected via 429,530 edges (network diameter = 25; average path length = 6.804; average degree = 1.854; modularity = 0.723). Among the 3000 top-ranked pages (1000 per ccTLD according to PageRank), 18.50% (555/3000) belong to web sites from governmental or public institutions, 18.03% (541/3000) from nonprofit organizations, 54.03% (1621/3000) from private organizations, 4.07% (122/3000) from news agencies, 3.87% (116/3000) from pharmaceutical companies, 0.90% (27/3000) from private bloggers, and 0.60% (18/3000) from others. LDA identified 50 topics, which we grouped into 11 themes: "Research & Science", "Illness & Injury", "The State", "Healthcare structures", "Diet & Food", "Medical Specialities", "Economy", "Food production", "Health communication", "Family" and "Other". The most prevalent themes were "Research & Science" and "Illness & Injury", accounting for 21.04% and 17.92% of all topics across all ccTLDs and provider types, respectively. Our readability analysis reveals that the majority of the collected web sites are structurally difficult or very difficult to read: 84.63% (2539/3000) scored a WSTF ≥ 12 and 89.70% (2691/3000) scored a FRE ≤ 49. Moreover, our vocabulary analysis shows that 44.00% (1320/3000) of the web sites use vocabulary that is well suited for a lay audience.
We were able to identify major information hubs as well as topics and themes within the sGHW. Results indicate that the readability within the sGHW is low. As a consequence, patients may face barriers, even though the vocabulary used seems appropriate from a medical perspective. In future work, the authors intend to extend their analyses to identify trustworthy health information web sites.
Citation: Zowalla R, Pfeifer D, Wetter T (2023) Readability and topics of the German Health Web: Exploratory study and text analysis. PLoS ONE 18(2): e0281582. https://doi.org/10.1371/journal.pone.0281582
Editor: Nabeel Al-Yateem, University of Sharjah, UNITED ARAB EMIRATES
Received: January 12, 2022; Accepted: January 27, 2023; Published: February 10, 2023
Copyright: © 2023 Zowalla et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting information files.
Funding: The authors received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Abbreviations: API, Application Programming Interface; ASL, Average Sentence Length; ASW, Average Number of Syllables per Word; ccTLD, country-code top level domain; FKG, Flesch-Kincaid Grade; FRE, Flesch Reading Ease Scale; GHW, German Health Web; L, vocabulary measure; LDA, Latent Dirichlet Allocation; MS, Words with three or More Syllables; NLP, Natural Language Processing; PA, Percent Agreement; PCC, Pearson correlation coefficient; SEO, Search Engine Optimization; sGHW, Sampled German Health Web; SMOG, Simple Measure of Gobbledygook score; SVM, Support Vector Machine; TLD, top level domain; WSTF, 4th Vienna Formula (German: Wiener SachTextFormel)
The Internet has become an increasingly important resource for health information, especially for lay people [1–7]. Web users perform online searches to obtain health information regarding diseases, diagnoses, and different treatments. However, the information found does not necessarily match the users' health literacy level and, consequently, might not be well understood by the respective reader. This can result in an overall poorer general health status, as well as greater barriers to accessing adequate medical care.
Another major problem of written information is the gap between the language of medical experts and that of lay people. Even with a higher level of education, medical vocabulary poses problems for people reading relevant health information. Moreover, the medical terms associated with the etiology of a disease tend to differ between health professionals and patients [10–12].
Health information on the web is provided by different stakeholders, each with its own set of interests. Thus, the provided health information material does not necessarily reflect the needs of a (lay) health information seeker. Therefore, it is important to (1) identify information providers, (2) quantify the readability as well as the type of vocabulary used, and (3) analyze how different types of information sources are suited for people with differing health literacy levels.
Given the great variety and vast amount of health information available on the internet, a manual or semiautomatic approach for analysis seems futile. To the best of the authors’ knowledge, there exists no study that applies machine learning methods in order to find relevant health information and that determines its readability level as well as its vocabulary level in a fully automated approach.
As a follow-up of the research conducted by Zowalla et al., this study provides a fully automated readability and vocabulary analysis of the health-related web restricted to web pages in German. We limit our study to the three predominantly German-speaking countries Germany, Austria, and Switzerland (D-A-CH) and call the sample of the "German Health Web" (GHW) acquired by our focused web crawler the "Sampled German Health Web" (sGHW). In addition, our study finds, for each country, the 1000 top-ranked information providers in the sGHW according to PageRank and uses Latent Dirichlet Allocation (LDA) to find abstract topics present within the sGHW.
Readability of health information material.
The health literacy level of individuals living in Europe was assessed within the European Health Literacy Survey. It offers an instrument with a scale ranging from 1 (lowest) to 50 (highest) and was used to compare health literacy levels across European countries. For Germany, Zok reports an average score of 31.9 for participants, which was below the European average score (33.8). In 2016, Schaeffer et al. reported that "54.3% of [German study participants] were found to have limited health literacy" (n = 2000). For Switzerland, Bieri et al. reported that 54% of the study participants (n = 2000) were found to have limited health literacy. Pelikan et al. reported that 51.6% of the Austrian study participants (n = 1813) achieved only a limited health literacy level. These findings support the need for online health information materials that meet the capabilities of their readers. Consequently, such information should be written at a sufficient readability level and (medical) specialty language should be avoided in order to reduce barriers for patients.
However, several studies found that online health information is often written and published with low readability, which reduces or even hinders understandability for its intended readers (mainly laymen) [18–27].
A recent analysis by Brütting et al. of prominent web sites (n = 45) on melanoma immunotherapy written in German revealed low readability scores according to the Flesch Reading Ease Scale (FRE), which ranges from 0 to 100. A low FRE indicates an insufficient level of readability while a high FRE indicates easy-to-read text material. In 2018, Basch et al. assessed the readability of online information material related to prostate cancer. They found that the "majority of web sites had difficult readability" and concluded that a "large majority of information available on the Internet about prostate cancer will not be readable for many individuals."
Similar studies were conducted for other diseases: Thomas et al. analyzed nephrology-related Wikipedia articles written in English as a resource for patient education. The overall mean FRE was 19.4, which corresponds to an insufficient level of readability. A study by Edmunds et al. assessed the readability of 160 web sites providing ophthalmic patient information and found "83% [..] as being of 'difficult' readability." Tulbert et al. assessed the readability of "three sources of patient-education material on the internet (WebMD.com, Wikipedia.org, and MedicineOnline.com)". They found that "no single source of commonly used internet patient-education material demonstrates optimal features with regard to readability, length, and presence of photographic illustrations."
In 2014, Zowalla et al. used a specifically trained Support Vector Machine (SVM) to assess the difficulty of health-related text material. It was trained to distinguish between documents written for laymen and documents written for (medical) experts on the basis of 10,000 texts from various German health content providers. The resulting SVM classifier was tested against two datasets (n1 = 1202, n2 = 1200) and achieved an accuracy of 0.8458 and 0.8741, respectively. Subsequently, it was applied to online health websites in the context of a Firefox browser extension in 2015. The SVM outputs a class probability using Platt Scaling. This class probability is then transformed to an "expert level" expressing vocabulary-based text difficulty, which was named L.
In 2018, Zowalla and Wiesner analyzed 2931 articles of the "Public Health Portal of Austria" (www.gesundheit.gov.at) using FRE, the 4th Vienna formula (WSTF) and the measure L. Their analysis revealed low readability levels paired with a "moderate level of vocabulary difficulty." In 2018, L, WSTF and FRE were also applied by Keinki et al. to 51 German cancer information booklets. They report "that the majority of the 51 booklets (92.16%) is hard to read". In 2020, the study design was replicated by Wiesner et al. for Psoriasis/Psoriatic Arthritis material written in German. They found that "patient education materials in German require, on average, a college or university education level [..] even though the vocabulary used seems appropriate".
McInnes and Haglund entered 22 health condition terms in five different search engines and computed the readability scores of the first 10 web sites retrieved via each individual search using the Gunning Fog Index (FOG), Simple Measure of Gobbledygook score (SMOG), Flesch-Kincaid Grade (FKG) and FRE. They found that "Websites with .gov and .nhs TLDs [top level domains] were the most readable while .edu sites were the least". A recent study by Worrall et al. used Google search to collect the first 20 web pages for searches related to the coronavirus disease and assessed the readability using FOG, FRE, FKG and SMOG. They conclude that "only 17.2% [(n = 165)] of web pages [were] at a universally readable level." In addition, Worrall et al. reported that "Public Health organisations and Government organisations provided the most readable COVID-19 material, while digital media sources were significantly less readable".
In addition to classic readability metrics such as FRE or WSTF, other approaches for computing the readability of (German) text material exist. vor der Brück et al. describe the readability checker DeLite, which uses 48 morphological, lexical, syntactic, and semantic indicators to assess the readability of a text written in German. A similar approach is presented by Berends and Vajjala, which uses 165 custom features to assess the readability of German geography text books for secondary school. However, neither approach can easily be applied as the related source code is not publicly available. In addition, these tools are not commonly used for readability assessment of (health-related) text material.
Other studies [33–35] leveraged crowd sourcing to measure the readability of text material. In this context, crowd workers judge the readability of a given text. However, such approaches require considerable financial resources, as the crowd workers need to be paid. The costs depend heavily on the amount of text material to be reviewed, which might not be feasible for large-scale analyses of text material from the web.
Topic modeling on health information material.
Topic modeling is a well-accepted technique to discover abstract topics in unstructured text. It is often applied to clinical and/or health-related content posted on social media, online newspapers or on web sites in general [36–42].
In 2014, Paul and Dredze showed that topic models can be leveraged to infer health topics in Twitter messages. To do so, they analyzed 144 million health-related Twitter posts and discovered 13 topics, e.g. "cancer & serious illness", "dental health", "exercises" or "injuries & pain", in the dataset. Another study by Liu and Yin used topic modeling to analyze the abstract topics of 477,904 posts in the reddit community r/loseit. They identified 25 topics concerning the overall theme "weight loss" such as "food and drinks", "exercises", or "communication".
Another study by Muralidhara and Paul  leveraged topic modeling to discover the abstract health-related topics contained in 96,426 Instagram posts with hashtags related to health. Overall, they identified 47 health-related topics covering ten broad themes such as “acute illness”, “alternative medicine”, “chronic illness and pain”, or “substance use”. The most prevalent topics were related to “diet” and “exercise”.
In 2017, Melkers et al.  assessed the content of 89 dental blogs by using topic modeling techniques. In total, the authors found 176 abstract topics inside the data and grouped them into four leading themes: “Status/Social”, “Dental care”, “Dental practice related”, and “Other”.
Liu et al.  collected 642 newspaper articles related to third hand smoke and analyzed the text material by using LDA. They discovered ten topics, e.g. “cancer”, “risks of smoking”, or “air quality” and grouped them into three major themes.
In 2020, Min et al.  analyzed the content of 145 web sites related to “occupational accidents” by using topic modeling. They discovered 14 topics with three themes: “workers’ compensation benefits”, “illicit agreements with the employer”, and “fatal and non-fatal injuries and vulnerable workers”.
Bahng and Lee  analyzed posts on the social question-and-answer platform “Naver Knowledge-iN” by using LDA “to identify patients’ perceptions, concerns, and needs on hearing loss.” They found 21 topics, which “mostly correspond to sub-fields established in hearing science research”, and grouped them into five main themes such as “noise-induced hearing loss” or “sudden hearing loss”.
Crawling the German Health Web.
In 2020, we demonstrated the suitability of a distributed focused web crawler for the acquisition of a large sample of the GHW. The presented system ran for 277 days, had an average harvest rate of 19.76%, and achieved a recall of 0.821 (estimated via a seed-target approach), which indicates that our approach is a suitable method to acquire most health-related content found under the country-code top-level domains (ccTLDs) ".de", ".at", and ".ch". The crawler uses an SVM text classifier to estimate the health relevance of a given web page. It was trained on a large data set (n = 70,048) acquired from various German content providers to distinguish between health-related and non–health-related web pages. The classifier was evaluated on two different datasets. The first dataset (TD1) consisted of 17,514 documents and was based on a-priori class labeling; the second one (TD2) consisted of 384 real-world web pages and was annotated using a crowd sourcing approach. Both TD1 and TD2 had an equal class distribution. The system achieved an accuracy of 0.937 on TD1 (TD2: 0.966), a precision of 0.934 on TD1 (TD2: 0.954), and a recall of 0.944 (TD2: 0.989). The results indicated that the presented crawler was a suitable method for acquiring a large sample of the GHW in a fully automated manner. Subsequently, we call the acquired sample of the GHW the "Sampled German Health Web" (sGHW).
This paper presents a follow-up study of the research conducted in 2020. It analyzes the acquired data, namely the sGHW graph and the content of health-related web pages, after running the distributed focused web crawler for 370 days.
Aims of the study.
In line with the methodology presented in , the authors decided to concentrate on health-related web pages available free of charge on the internet in the D-A-CH region that can be found under the respective ccTLDs “.de”, “.at”, and “.ch”. In this context, the aim of this study was four-fold:
- Analyze the current situation, that is, the volume of and the information providers behind health-related web pages in the D-A-CH region.
- Demonstrate the suitability of a fully automated approach to compute the following three aspects of the sGHW: its readability by using established readability formulas, its type of vocabulary, and the prevalent topics.
- Quantify the level of readability of and the type of vocabulary used within the sGHW. In addition, identify the topics presented within health-related web pages in the sGHW.
- Evaluate whether web pages offered by certain types of information providers are better suited for citizens with lower health literacy levels than others.
Definition of health information
In the context of this study, we define “health information” or the “health relevance” of a given web page very openly. Therefore, we include, among others, the following topics:
- Diseases and their diagnoses,
- Diagnostic procedures, therapies or treatments,
- Pharmaceutical Information (e.g., about medications),
- Nutrition, sports and lifestyle information that is intended to lead to a “healthier” life (prevention),
- Information on health care structure (hospitals, doctor’s offices, etc),
- Information from and about self-help groups,
- Content generated by patients or users on the topic of health, e.g. in social media or internet forums.
Thus, websites considered as “health-related” do not necessarily comply with the criteria of evidence-based medicine and may have both laypersons and professionals as their target audience. Information on the health condition of animals or their treatment (veterinary medicine) is not considered as health information in the context of this study.
This study of health-related web pages consisted of four stages:
- Regarding study aim 1, we used the focused web crawling system presented in our previous work to collect health-related web pages and to create a health-related host-aggregated web graph. As before, we applied the PageRank algorithm to identify important web sites in the sGHW on the aforementioned graph representation.
- Then, one author screened the 1000 top-ranked web sites for each ccTLD by visiting the related web site in the incognito mode of a Chromium browser. In addition, the same author looked for legal information (imprint) of the web site’s owner. If a legal entity could be identified, a background check was conducted using popular search engines.
- Based on these findings, one of the following categories was assigned to each web site's information provider: Government or Public (Health) Institution (GPH), Non-Profit Organization (NPO), Private Organization or Individual Person (PO), Mainstream or Local News (M), Pharmaceutical Company (PC), Personal Blog (PB), Social Network (SN), and Other (O). The categories were defined on the basis of prior literature. A detailed explanation for each category is given in S1 Appendix.
To mitigate rater bias, the assignment was done twice with a gap of two months between the two runs. If the two assignments differed, the rater reviewed the case again and resolved the discrepancy by performing an additional background check. In addition, the interrater reliability metrics percent agreement (PA) and Cohen's κ were computed.
- At the last stage, a fully automated readability and vocabulary analysis was conducted on the 1000 top-ranked web sites for each ccTLD. In addition, topic modeling was applied on the same data. The resulting topics were then paraphrased in a group discussion. These analyses were intended to answer the aims of the study 2 to 4.
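The two interrater reliability metrics used in the screening stage can be computed directly from the two category assignments. A minimal pure-Python sketch (the provider labels below are hypothetical examples, not study data):

```python
from collections import Counter

def percent_agreement(r1, r2):
    """Share of items both rating runs labeled identically."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohen_kappa(r1, r2):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(r1)
    po = percent_agreement(r1, r2)                        # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    pe = sum(c1[c] * c2[c] for c in c1.keys() | c2.keys()) / (n * n)
    return (po - pe) / (1 - pe)

# Hypothetical provider categories from the two rating runs
run1 = ["GPH", "NPO", "PO", "PO", "PC", "M"]
run2 = ["GPH", "NPO", "PO", "NPO", "PC", "M"]
pa = percent_agreement(run1, run2)   # 5/6
kappa = cohen_kappa(run1, run2)
```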
Several studies have extensively analyzed the graph structure of the web [46–48]. In this context, a graph node represents a web page and an edge represents a link between two web pages. In our study, we generated a host-aggregated graph in order to reduce its computational complexity and explore its properties. To do so, individual web pages are combined and represented by their parent web site (including outgoing and ingoing links). On the resulting host-aggregated sGHW graph, we applied the following metrics or algorithms:
- Average degree is the average number of edges connected to a node. For a directed web graph, this is defined as the total number of edges divided by the total number of nodes.
- Modularity measures the strength of division of a graph into clusters or groups [50, 51]. Graphs with a high modularity have dense connections between the web sites within certain clusters but sparse connection to other web sites, which are contained in different clusters.
- PageRank is a centrality-based metric that allows identification of web sites (nodes) of importance inside a graph. The underlying assumption is that an important graph node (web site) will receive more links from other important nodes (i.e., a higher in-degree).
Other metrics such as network diameter and the average path length (i.e., the average number of clicks which will lead from one web site to another) are frequently used for graph analysis [50, 52].
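As an illustration of the metrics above, the following sketch computes the average degree and a plain power-iteration PageRank on a toy host graph (the damping factor 0.85 is the common default in the literature, not a value taken from this study):

```python
def pagerank(links, d=0.85, iters=100):
    """Power-iteration PageRank on a dict {host: [linked hosts]}."""
    nodes = set(links) | {v for outs in links.values() for v in outs}
    n = len(nodes)
    pr = dict.fromkeys(nodes, 1.0 / n)
    for _ in range(iters):
        new = dict.fromkeys(nodes, (1 - d) / n)
        for u in nodes:
            outs = links.get(u, [])
            if outs:                      # spread rank over outgoing links
                for v in outs:
                    new[v] += d * pr[u] / len(outs)
            else:                         # dangling host: spread evenly
                for v in nodes:
                    new[v] += d * pr[u] / n
        pr = new
    return pr

# Toy host graph: two sites link to the hub "c"
links = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
avg_degree = sum(len(v) for v in links.values()) / len(ranks)  # edges / nodes
```

In this toy graph the hub "c" receives the highest rank, mirroring the intuition that frequently linked hosts are important information hubs.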
Coverage of relevant web sites
The coverage (or completeness) of our focused web crawl was evaluated by comparing the overlap to another web crawl. For this purpose, search results of the commercial search engine provider Google were used. The underlying assumption is that a (commercial) search engine provider such as Google has already indexed a large part of the web. To compute the overlap, search queries with relevant (medical) terms were sent to the application programming interface (API) of the related search engine over a period of time. Based on the results, it is then possible to determine the percentage of URLs returned by Google that are included in our focused web crawl. The related proportion is an indicator regarding the completeness of our sampled dataset.
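The overlap computation itself reduces to URL normalization plus set membership. A minimal sketch (the URLs are invented examples; the normalization rules are illustrative assumptions, not the study's exact procedure):

```python
from urllib.parse import urlsplit

def normalize(url):
    """Drop scheme, 'www.' prefix and trailing slash so crawl and search URLs compare equal."""
    parts = urlsplit(url)
    host = parts.netloc.lower().removeprefix("www.")
    return host + parts.path.rstrip("/")

def coverage(crawl_urls, search_urls):
    """Share of search-engine result URLs that are present in the crawl."""
    crawled = {normalize(u) for u in crawl_urls}
    return sum(normalize(u) in crawled for u in search_urls) / len(search_urls)

crawl = ["https://www.example.de/diabetes/", "http://klinik.at/therapie"]
google = ["http://example.de/diabetes", "https://other.ch/gesundheit"]
cov = coverage(crawl, google)   # 0.5
```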
Web site ranking strategies
Web sites can be ranked using different, potentially combined approaches, ranging from traffic estimates for a given website and unique-visitor counts in a given timeframe to manual or search-engine-based approaches and graph-based ranking algorithms. Many ranking strategies originate from the field of search engine optimization (SEO) and aim to reproduce the confidential black-box ranking algorithms of (commercial) search engine providers such as Google.
In most cases, related metrics and rankings are offered by commercial third-party providers such as ALEXA, Sistrix, Searchmetrics or SimilarWeb as part of their business. However, their methods for ranking a given web site as well as the influencing factors remain confidential. Obviously, this leaves an enormous gap with respect to transparency and reproducibility [53, 58].
In this study, we solely relied on PageRank, a clearly defined and transparent algorithm that is well established in computer science for assessing the relevance of graph nodes. In particular, we apply PageRank to the host-aggregated graph representation of the sGHW. Therefore, our ranking is not based on any traffic estimations, popularity or visibility indices measured by third-party providers. Moreover, it is not influenced by commercial interests and can easily be reproduced by other researchers. It provides a ranking of the sGHW based on its link structure as collected by our focused web crawler.
Readability describes the properties of written text with respect to the readers’ understanding of a document [59, 60]. It depends on the complexity of a text’s structure, the sentence structure and the vocabulary used.
Flesch reading ease scale.
The FRE is a well-established readability metric for the English language. FRE relies on the average sentence length (ASL) and the average number of syllables per word (ASW). FRE assumes that short words or sentences are usually easier to understand than longer ones. We applied the modified FRE scale by Toni Amstad for the German language. It is defined as follows: FRE = 180 − ASL − (58.5 × ASW).
In contrast to the FRE, the Vienna formula (WSTF) was originally developed for the German language by Bamberger and Vanacek. They derived different versions of the Vienna formula for prose and non-fictional text. Typically, the 4th WSTF is used for text analysis. It relies on the average sentence length (ASL) and on the proportion of (complex) words with three or more syllables (MS): WSTF = 0.2656 × ASL + 0.2744 × MS − 1.693.
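Both formulas need only sentence, word, and syllable counts. The sketch below uses the published coefficients (Amstad's German FRE: 180 − ASL − 58.5 × ASW; 4th WSTF: 0.2656 × ASL + 0.2744 × MS − 1.693) with a naive vowel-group syllable counter; the study itself estimates syllables via Liang's hyphenation algorithm:

```python
import re

def syllables(word):
    """Naive estimate: one syllable per vowel group (stand-in for Liang's hyphenation)."""
    return max(1, len(re.findall(r"[aeiouyäöü]+", word.lower())))

def readability(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[\wäöüß-]+", text.lower())
    asl = len(words) / len(sentences)                    # average sentence length
    asw = sum(syllables(w) for w in words) / len(words)  # avg. syllables per word
    ms = 100 * sum(syllables(w) >= 3 for w in words) / len(words)  # % words >= 3 syllables
    fre = 180 - asl - 58.5 * asw                # Amstad's German Flesch Reading Ease
    wstf = 0.2656 * asl + 0.2744 * ms - 1.693   # 4th Vienna formula
    return fre, wstf

fre, wstf = readability("Das ist ein Haus.")
```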
Vocabulary-based text difficulty.
The German language makes use of many compound words (e.g. "Halsschmerzen", "Magen-Darm-Erkrankung", "Zuckerkrankheit"). These terms are quite layman-friendly (for an average patient) but very lengthy. Consequently, average word length or syllable counts are not a good indicator of whether a given word is easily comprehensible (i.e., whether it can be understood by people with a grade level of 6–7).
Machine learning techniques can be used to compensate for the limitations of established sentence-based readability measures such as FRE scale or WSTF [28, 64].
To quantify the vocabulary-based text difficulty (i.e., the "expert-centricity" of a given text), we defined the measure L ∈ [1, .., 10] similar to [23–25, 29], which leverages the SVM classifier described in "Related Work". Before using this pretrained classifier to assess the vocabulary-based difficulty of medical text material, several preprocessing steps are necessary. As a first step, text material is cleaned from syntactic markup (i.e., boilerplate code, HTML tags). Next, each text is tokenized (i.e., split into single word fragments) and each character is converted to lower case (case folding). Stop words (e.g., "the", "and", "it") are removed as they do not influence the difficulty of a text. Next, stemming techniques are applied in order to map tokens to their stems and reduce morphological variations of words (e.g., "goes" becomes "go"). Finally, the text content of a document is transformed into a document vector based on previously selected features. For each text, the SVM classifier outputs a class probability using Platt Scaling. The class probability is then transformed to the value L, which expresses vocabulary-based text difficulty.
Low values of L indicate a very easy text written for the elementary level or elementary school; a value of 3–4 corresponds to an easy text (intermediate level / junior high school), a value of 4–5 to a moderate text (laymen with medical educational background), a value of 5–6 to a difficult text, a value of 7–8 to a very expert-centric text, and a value of > 8 indicates that academic (medical) background knowledge or working experience in the medical domain is required. The procedure and the related processing steps are described in detail in the cited literature.
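The exact transformation from the Platt-scaled class probability to L is defined in the cited work. As an illustration only, a hypothetical linear mapping of the expert-class probability onto the interval [1, 10] could look as follows:

```python
def expert_level(p_expert):
    """Hypothetical linear mapping of an SVM's Platt-scaled expert-class
    probability (0 = clearly lay, 1 = clearly expert) to L in [1, 10].
    Illustration only; not the published transformation."""
    if not 0.0 <= p_expert <= 1.0:
        raise ValueError("probability expected in [0, 1]")
    return 1.0 + 9.0 * p_expert

levels = [expert_level(p) for p in (0.0, 0.5, 1.0)]  # [1.0, 5.5, 10.0]
```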
In this study, we applied topic modeling to identify themes and topics within the sGHW. Specifically, we used LDA to identify the main topics of the three times 1000 top-ranked web sites. Since LDA is an unsupervised algorithm, we relied on perplexity to determine the optimal number of topics. To do so, we trained LDA models using Gibbs sampling with 3000 iterations for 1 to 90 topics (with a step size of 10) on the full dataset of the three times 1000 top-ranked web sites consisting of 3,746,055 web pages. To mitigate word sparsity, we conducted stemming and removed words with little to no analytical value (e.g., "der" (article), "und" (conjunction), "jetzt" (particle)). In addition, only words with a minimum frequency of 200 were kept in the text corpus.
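The perplexity-guided model selection can be sketched with scikit-learn, which uses variational inference rather than the Gibbs sampling employed in the study; the documents and parameters below are toy values, with `min_df` standing in for the frequency cut-off of 200:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy German document collection (invented examples)
docs = ["arzt patient therapie klinik", "ernährung sport gesund leben",
        "arzt therapie medikament patient", "sport ernährung diät gesund"] * 10

# min_df drops rare words, analogous to the study's minimum-frequency cut-off
X = CountVectorizer(min_df=2).fit_transform(docs)

# Pick the topic count with the lowest held-in perplexity
best_k, best_perplexity = None, float("inf")
for k in (2, 3, 4):  # the study scanned 1 to 90 topics
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    p = lda.perplexity(X)
    if p < best_perplexity:
        best_k, best_perplexity = k, p
```

In practice, perplexity should be evaluated on held-out documents to avoid favoring overly large topic counts.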
To estimate LDA's hyperparameters (named α and β), we applied a method from Asuncion et al., which is based on Minka and an EM procedure nesting the actual Gibbs sampling algorithm. Thus, the approach determines optimized hyperparameters as part of the topic inference. Moreover, we relied on Wallach et al. (Equation 7) in order to assess the prevalence of topics in web pages as described in the cited work (Section 3.4). To describe the statistical dispersion of the topic distribution, we used the Gini coefficient.
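The Gini coefficient of a topic distribution can be computed directly from the per-topic prevalence values; a small self-contained sketch:

```python
def gini(values):
    """Gini coefficient: 0 for a perfectly even distribution,
    approaching 1 when the mass concentrates on few topics."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    rank_weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * rank_weighted / (n * total) - (n + 1) / n

even = gini([0.25, 0.25, 0.25, 0.25])   # 0.0: topics evenly spread
skewed = gini([0.0, 0.0, 0.0, 1.0])     # 0.75: one dominant topic
```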
The preprocessing steps and software libraries used to conduct this analysis are described in more detail in Section “Computational Processing & System Environment“.
Each topic consists of a set of keywords and was visualized using word clouds. The word clouds were subsequently labeled by eight volunteers with different backgrounds including "Medical Informatics", "Health Economics", "Physics", "Social Economics", "Marketing", and "Electrical Engineering": A spreadsheet document containing the word clouds to be labeled was provided along with instructions to each volunteer (see S2 Appendix). The results were then aggregated by one of the authors and given to two other volunteers ("Medical Informatics" and "Civil Engineering"), who conducted the final paraphrasing for each topic in a group discussion. Summarization into themes was conducted in a group discussion between two of the authors.
The graph database Neo4j, version 4.1.1, was used to store the host-aggregated web graph, which was generated by the focused crawler. The Neo4j graph algorithm plugins were used to compute PageRank and related metrics on an Ubuntu 20.04 LTS 64-bit server.
The statistics software R (The R Foundation for Statistical Computing), version 3.6.3 (February 29, 2020), on an Ubuntu 20.04 LTS 64-bit computer was used to compute PA, Cohen’s κ and the Pearson correlation coefficient (PCC).
Computational processing & system environment
Given the results of our previous study, it became obvious that sequential processing of the huge amount of crawled data would take too much time and too many resources. For this reason, a parallel and distributed system architecture is necessary to process the crawled data efficiently. Several frameworks allow for such distributed processing; in this study, we relied on the Apache Storm framework, a software development kit for building scalable computation systems in Java.
Fig 1 depicts the architecture of our distributed text analysis framework. A set of spouts emits yet-unprocessed URLs along with their underlying text material (as tuples) from the crawl database. The tuples are assigned to cluster nodes (based on their hostname) and directed to text analysis components. First, the raw text material is tokenized (i.e., split into single word fragments) and transformed into a bag of words, which is added to the given tuple. Next, several statistical measures, such as syllable counts, (complex) word counts, and character counts, are computed.
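The per-page statistics can be sketched as follows. This is a simplified stand-in: the vowel-group syllable counter only approximates the hyphenation-based counting used in our pipeline, and the regex tokenizer is minimal:

```python
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Split raw text into word tokens (umlauts included for German).
    return re.findall(r"[A-Za-zÄÖÜäöüß]+", text)

def count_syllables(word: str) -> int:
    # Rough stand-in for hyphenation-based counting: count vowel groups.
    return max(1, len(re.findall(r"[aeiouyäöü]+", word.lower())))

def text_statistics(text: str) -> dict:
    tokens = tokenize(text)
    syllables = [count_syllables(t) for t in tokens]
    return {
        "words": len(tokens),
        "characters": sum(len(t) for t in tokens),
        "syllables": sum(syllables),
        "complex_words": sum(1 for s in syllables if s >= 3),  # words with >= 3 syllables
        "bag_of_words": Counter(t.lower() for t in tokens),
    }
```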
Spouts (tap symbol) emit data (here: web pages), bolts (lightning symbol) process data (i.e. term statistics, readability metrics, vocabulary-based text difficulty, storing results). SVM: Support Vector Machine, R: Readability Metrics.
Each tuple is then processed to compute the readability measures FRE and WSTF. To do so (see the lower part of Fig 1, “gear icon” marked with the label “R”), the tuple’s full text is fed into a natural language processing (NLP) pipeline. Regular expression filters sanitize the input and remove disturbance artifacts (e.g., different hyphen encoding schemes). Finally, the aforementioned readability metrics are computed. For sentence detection, we rely on the Apache OpenNLP library and its sentence model for the German language. Liang’s hyphenation algorithm is used to estimate syllable counts.
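Both metrics are simple formulas over the counts described above. The commonly cited forms (Amstad’s German FRE adaptation and the 4th Vienna formula) can be sketched as follows — coefficients as published in the literature, but treat this as an illustrative sketch rather than our production code:

```python
def flesch_reading_ease_de(words: int, sentences: int, syllables: int) -> float:
    """German FRE adaptation (Amstad): lower values indicate harder text."""
    asl = words / sentences   # average sentence length (words per sentence)
    asw = syllables / words   # average syllables per word
    return 180 - asl - 58.5 * asw

def wstf4(words: int, sentences: int, complex_words: int) -> float:
    """4th Vienna formula (4. Wiener Sachtextformel), roughly a school-grade level."""
    ms = 100 * complex_words / words  # percentage of words with >= 3 syllables
    sl = words / sentences            # mean sentence length
    return 0.2744 * ms + 0.2656 * sl - 1.693
```

Note the opposite orientations: a long-sentence, many-syllable text drives the FRE down but the WSTF up.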
Next, the tuple is processed to gauge the vocabulary-based text difficulty (see the lower part of Fig 1, “gear icon” marked with the label “SVM”). Several pre-processing steps are necessary to apply the pre-trained classifier to our text material [28, 65]: As a first step, regular expression (regex) filters are applied in a similar manner as for FRE and WSTF. Second, the text is tokenized, converted to lower case, and stop words are removed. The latter is important as stop words do not influence the difficulty of a text. Third, the remaining tokens are reduced to their stems (e.g., “goes” becomes “go”) by means of Porter’s Snowball stemmer in order to limit linguistic variations.
Each text is transformed into a bag of words representation (document vector) based on a broad list of previously selected terms from the medical domain as such terms greatly influence the vocabulary-based difficulty of a text. Each document vector is then fed into the classifier and the related output is mapped to the vocabulary measure L. Finally, each enriched tuple is stored in a PostgreSQL (v10.15) database for subsequent analysis.
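The pre-processing and vectorization steps can be sketched as follows. Both the tiny stop-word set and the crude suffix stripper are toy stand-ins (the real pipeline uses a full German stop-word list and Porter’s Snowball stemmer), and the medical term list passed to document_vector is a hypothetical example:

```python
import re

STOP_WORDS = {"der", "die", "das", "und", "ist", "ein", "eine"}  # tiny sample set

def stem(token: str) -> str:
    # Crude suffix stripper standing in for Porter's Snowball stemmer.
    return re.sub(r"(ungen|ung|en|er|e)$", "", token)

def preprocess(text: str) -> list[str]:
    tokens = re.findall(r"[a-zäöüß]+", text.lower())         # tokenize + lower-case
    return [stem(t) for t in tokens if t not in STOP_WORDS]  # drop stop words, stem

def document_vector(text: str, vocabulary: list[str]) -> list[int]:
    # Bag-of-words counts restricted to a fixed term list (e.g., medical domain terms).
    stems = preprocess(text)
    return [stems.count(term) for term in vocabulary]
```

The resulting fixed-length vector is what a pre-trained classifier would consume.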
The computing cluster consists of 22 virtual machines running on Ubuntu 18.04 LTS 64-bit. Two physical servers (each equipped with two Intel Xeon E5-2689 and 256GB of memory) of a Cisco unified computing system provide the computational resources and run as a virtualization platform to allow shared resource allocation. The analysis was conducted between August 6 and August 30, 2020.
Fig 2 depicts the architecture of our analysis framework to conduct topic modeling using LDA.
Workflow of the processing steps and software components for topic modeling: (1) text material is retrieved from a central relational database; (2) several processing threads perform a collection of pre-processing tasks; (3) LDA is applied to the resulting document vectors. The software takes raw text material as input and outputs n topics, where n is a user-defined input parameter to LDA.
As a first step, the bag of words representation of each web page is fetched by multiple threads from the PostgreSQL database containing the pre-processed web pages. If a corresponding web page had not yet been handled by the readability analysis, the pre-processing steps are conducted in the same way as for the classification pipeline from Section “Readability Analysis”. As an additional step, terms are filtered based on their minimum frequency within the document collection. Next, LDA is applied to the given document collection. We relied on the LDA implementation contained in the Topic Grouper framework by Pfeifer and Leidner.
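The minimum-frequency filter can be sketched as follows (an illustration of the idea; the actual implementation is part of our Java pipeline, and the threshold is a user-chosen parameter):

```python
from collections import Counter

def filter_min_frequency(documents: list[list[str]], min_freq: int) -> list[list[str]]:
    """Drop terms whose total frequency across the whole collection is below
    min_freq, before handing the document vectors to the LDA implementation."""
    totals = Counter(t for doc in documents for t in doc)
    return [[t for t in doc if totals[t] >= min_freq] for doc in documents]
```

Filtering rare terms shrinks the vocabulary and removes noise (e.g., misspellings) that would otherwise dilute the inferred topics.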
The LDA-procedure and analysis to determine a reasonable number “n” of topics using the perplexity score (see “Methods” section) was conducted on a bare-metal server (equipped with two Intel Xeon E5-2630 v4 and 384 GB of memory) running Ubuntu 18.04 LTS with Java 11.0.9 between November 5 and December 30, 2020.
The focused web crawling system  ran from May 27, 2019 to May 31, 2020 and collected 14,193,743 health-related web pages. The resulting host-aggregated web graph of the sGHW comprises 231,733 nodes (web sites) connected via 429,530 edges (links between web sites).
A total of 82.63% (191,479/231,733) of the web sites belong to the ccTLD “.de”; 7.89% (18,272/231,733) to “.at”, and 9.48% (21,976/231,733) to “.ch”. The graph has a network diameter of 25. The average path length is 6.804. The average degree is 1.854. Modularity was computed to be 0.717.
Fig 3 depicts the size-rank plot of the degree distribution of the host-aggregated sGHW graph. In- and out-degree represent the number of hyperlinks to or from all web pages that belong to an individual host. Visually, the distribution shows a concavity, indicating that it does not follow a power law. This is in line with the results of Meusel et al., who conducted a similar analysis for a host-aggregated graph of an unfocused web crawl.
As the ccTLD “.de” has the highest share within the graph, a global ranking according to PageRank would be dominated by “.de” web sites. For this reason, we used the 1000 top-ranked web sites according to PageRank in the subsequent analyses for each ccTLD separately.
Coverage of relevant web sites
To measure the coverage of our focused web crawl, we computed the overlap of our data against the commercial search engine provider Google. For this purpose, term-based search queries were sent to a Google Search Engine configured for the ccTLDs “.de”, “.at”, and “.ch” over a period of 306 days (September 16, 2020 to July 19, 2021).
The search queries were based on the then most common diseases in Germany  and included the following terms: “Coronary heart disease”, “Back pain”, “Lung cancer”, “Chronic obstructive pulmonary disease COPD”, “Alzheimer’s disease”, “Falls”, “Diabetes”, “Stroke”, “Migraine headache” and “Neck pain”. In addition, search queries for ten randomly selected rare diseases  with the following terms were used: “Cystic Fibrosis”, “Narcolepsy”, “Gaucher disease”, “Acrodermatitis”, “Munchhausen-by-proxy syndrome”, “Niemann-Pick disease”, “Multiple endocrine neoplasia”, “Huntington’s disease”, “Creutzfeldt-Jakob syndrome”, and “Asperger syndrome”.
A total of 4,093 web sites for the most common diseases and 2,736 for the random selection of rare diseases were returned by Google. Our focused web crawl covered 3,519/4,093 (85.98%) of the web sites for the most common diseases and 2,425/2,736 (88.63%) of the web sites for rare diseases. In summary, the web crawl contained 5,944/6,829 (87.04%) of the web sites returned by Google.
This suggests that we obtained a high coverage of health-related German web sites as our results parallel the coverage of a very comprehensive commercial web crawler.
Ranking of web sites
The most important host-aggregated URLs (according to PageRank) were categorized according to the categories introduced in Section “Study Setting”. The raters achieved a PA of 0.879 and a Cohen’s κ of 0.797. According to Landis and Koch, these κ values correspond to “substantial agreement”. In 10.82% (364/3000) of the cases, no majority vote was achieved. Such cases were subsequently resolved following the procedure described in Section “Study Setting”. The category “Social Network” was not selected, as no social network was contained in the 1000 top-ranked web sites of any ccTLD.
Table 1 lists the 25 top-ranked web sites according to PageRank with their respective information provider for “.de”. In total, 214 out of 1000 (21.40%) are published by governmental or public (health) institutions (GPH), 23.70% (237/1000) are published by non-profit organizations (NPO) and 43.50% (435/1000) by private organizations or individual persons (PO), i.e. web sites of medical professionals or related businesses. 62 out of 1000 (6.20%) are published by mainstream or local news agencies (M), 39 out of 1000 (3.90%) by pharmaceutical companies (PC) and 0.80% (8/1000) originated from private or personal blogs (PB). The category “Other” was given to 5 out of 1000 web sites (0.50%).
Table 2 lists the 25 top-ranked web sites according to PageRank with their respective information provider for “.at”. In total, 145 out of 1000 (14.50%) are published by GPH, 14.70% (147/1000) are published by NPO and 60.30% (603/1000) by PO. 40 out of 1000 (4.00%) are published by M, 46 out of 1000 (4.60%) by PC and 1.20% (12/1000) originated from PB. The category “Other” was given to 7 out of 1000 web sites (0.70%).
Table 3 lists the 25 top-ranked web sites according to PageRank with their respective information provider for “.ch”. In total, 196 out of 1000 (19.60%) are published by GPH, 15.70% (157/1000) are published by NPO and 58.30% (583/1000) by PO. 20 out of 1000 (2.00%) are published by M, 31 out of 1000 (3.10%) by PC and 0.70% (7/1000) originated from PB. The category “Other” was assigned to 6 out of 1000 web sites (0.60%).
Overall, 555 out of 3000 (18.50%) were published by GPH, 18.03% (541/3000) by NPO, 54.03% (1,621/3000) by PO, 4.07% (122/3000) by Ms, 3.87% (116/3000) by PC and 0.90% (27/3000) by PB. The category “Other” was given to 18 out of 3000 web sites (0.60%).
S3 Appendix provides a full listing of the 1000 top-ranked web sites for each ccTLD.
Overall, the web pages from 2720 of the top ranked web sites were included for readability and vocabulary assessment. These web pages account for 26.39% (3,746,055/14,193,743) of the initially crawled dataset. In this sample, 75.1% (2,813,953/3,746,055) originated from the ccTLD “.de”, 9.2% (344,828/3,746,055) from “.at”, and 15.7% (587,274/3,746,055) from “.ch”.
The number of web pages per web site ranged from 1 to 304,420 (mean 1375.7; median 24; SD 10,570.7). The average number of sentences per web site ranged from 1 to 2,836 (mean 51.182; median 29.9; SD 107.7), and the average number of words from 17 to 21,865 (mean 852.7; median 504.7; SD 1277.3). The average number of complex words (i.e., words with ≥3 syllables) ranged from 4 to 10,429 (mean 307.6; median 176; SD 483.3).
A complete listing for each web site with data on the number of sentences, words, complex words, and syllables is given in S3 Appendix. 280 out of the 3000 top-ranked web sites could not be analyzed as (a) the related web pages were either not visited or not stored by our focused crawler, (b) text material could not be extracted, or (c) was too short for further analyses.
All web sites were analyzed according to the readability metrics FRE, WSTF and L, as outlined in the Methods section. The applied metrics FRE, WSTF and L are based on different scales. For a more accessible presentation, we mapped the values of each scale to five classes in order to express text difficulty across the metrics in a uniform way. We applied the same mapping as presented by Wiesner et al. The mapping for each metric is given in Table 4.
The class distribution for FRE, WSTF and L, for each information provider type, is given in S4 Appendix. For the ccTLD “.de”, the web site with the lowest readability was “www.uksh.de” (n = 168,185) with an FRE value of 0.147 (SD = 2.105) and a WSTF of 14.936 (SD = 0.923). This corresponds to VD (very difficult to read). For the ccTLD “.at”, the lowest readability was computed for “www.mycare.at” (n = 1398) with an FRE value of 0.025 (SD = 0.330) and a WSTF of 15 (SD = 0) (VD). “www.implantat-berater.ch” (n = 251) had the lowest readability in “.ch” with FRE = 0.091 (SD = 0.827) and WSTF = 14.998 (SD = 0.0152) (VD). The most readable web sites in all three countries were those for which the focused crawler collected only a small number of web pages (n < 10) (see S3 Appendix).
According to FRE, most web sites (90.533%; 2,716/3000) are difficult (D) or very difficult (VD) to read. This corresponds to the WSTF scores for which 2,539/3000 (84.633%) web sites are difficult or very difficult to read. The distributions for each ccTLD are depicted in Fig 4 (FRE) and Fig 5 (WSTF).
Difficulty indicated by color, with dark green as the highest readability (90–100) and dark red as the lowest readability (0–10). Note: For consistency reasons, the x axis is reverted and ranges from 100 to 0.
Difficulty is indicated by color, with dark green as the highest readability (4–5) and dark red as the lowest readability (14–15).
Regarding the vocabulary-based difficulty, a total of 568/3000 (18.93%) web sites had an L ≥ 9 and are thus only suitable for an academic readership. 829 out of 3000 (27.63%) web sites achieved a score ≤ 4 (VE+E) and are therefore suitable for a lay audience. The remaining web sites (44.07%; 1,322/3000) scored above 4 and below 9, which corresponds to a level suitable for persons with medical knowledge or a strong medical background.
The web sites of the ccTLD “.at” scored the lowest vocabulary measure with L = 5.796 (SD = 2.543), followed by L = 5.885 (SD = 2.499) for web sites under the ccTLD “.ch”. Web sites under the ccTLD “.de” scored the highest vocabulary measure with L = 6.340 (SD = 2.572). The distribution of the classification results over all web sites is depicted in Fig 6. In this context, 281 out of the 3000 top-ranked web sites could not be analyzed for reasons explained in the “Readability Analysis” section.
Difficulty is indicated by color with dark green as the most layman friendly (1) and dark red as the highest expert level required (10). SVM: support vector machine.
Fig 7 shows a scatter plot of the distributions of FRE, WSTF and L for each ccTLD. The scatter plots indicate a correlation between FRE vs WSTF, WSTF vs L, and FRE vs L. This is confirmed by the related PCCs: PCCde(L, WSTF) = 0.5906, PCCde(L, FRE) = 0.5601, PCCde(WSTF, FRE) = 0.9333; PCCat(L, WSTF) = 0.5692, PCCat(L, FRE) = 0.5541, PCCat(WSTF, FRE) = 0.9128; PCCch(L, WSTF) = 0.4234, PCCch(L, FRE) = 0.3813, PCCch(WSTF, FRE) = 0.8748. As one can see, WSTF and FRE are highly correlated, and therefore function as almost interchangeable measures to characterize sentence complexity. Also, high vocabulary difficulty moderately correlates with sentence complexity.
In order to determine a suitable number of topics, we performed LDA topic modeling with a varying topic number and observed the perplexity (see “Methods”). Fig 8 depicts the corresponding perplexity graph: with LDA hyperparameter optimization in place, an increasing number of topics allows the model to better predict the document collection. However, the gain lessens considerably beyond 50 topics. Therefore, we decided to work with n = 50 topics for further analysis.
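The selection rule above can be sketched as an elbow heuristic on the perplexity curve (illustrative only: the gain threshold and the example values in the test are assumptions, not our measured perplexities):

```python
def choose_topic_count(perplexities: dict[int, float], min_gain: float = 0.01) -> int:
    """Pick the smallest topic count after which the relative perplexity
    improvement falls below min_gain (an elbow heuristic)."""
    ns = sorted(perplexities)
    for prev, nxt in zip(ns, ns[1:]):
        gain = (perplexities[prev] - perplexities[nxt]) / perplexities[prev]
        if gain < min_gain:
            return prev
    return ns[-1]  # perplexity kept improving across all tested counts
```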
Table 5 shows the inferred 50 topics, their marginal distribution, and the most relevant terms of the web pages (N = 3,746,055) of the top 3000 web sites (1000 for each ccTLD). The marginal distribution of a topic was measured by the probability that the topic was sampled from web pages, while the relevance of a term was measured by the probability that it was sampled from its topic. Word cloud representations of these topics can be found in S5 Appendix. The topics were summarized into 11 themes (see “Methods”). The most prevalent theme was related to “Research & Science”, followed by “Illness & Injury”, “The State”, “Healthcare structures”, “Diet & Food”, “Medical Specialities”, “Economy”, “Food production”, “Health communication”, “Family”, and “Other”.
The sample terms were ordered based on their relevance to the topic.
The theme “Research & Science” covered eight topics: “Clinical Trials” (T6), “Funding” (T8), “Efficacy Studies” (T9), “Human Biology & Genetics” (T13), “Science Communication” (T29), “Medical Newspaper” (T33), “Space Research” (T37), and “Medical Journals” (T49).
“Illness & Injury” contained nine topics: “Oncology” (T5), “Accidents & Injuries” (T19), “Chronic Illness” (T21), “Pain” (T26), “Heart diseases” (T28), “Mental Illness” (T31), “Pandemic & Vaccination” (T35), “Respiratory & pulmonary diseases” (T42) and “Neurological diseases” (T46). Among them were topics related to the most common diseases in the D-A-CH region (T5, T21, T26, T31, T42, T46). It also contained one topic (T35) referring to the COVID-19 pandemic and one topic (T19) about (car) accidents and related injuries. The theme “Medical Specialities” covered topics related to “Therapies in Cardiology” (T20), “Treatment planning in Cardiology” (T48), “Medication information” (T10), “Homeopathy” (T34) and “Plastic surgery” (T38). In addition, a theme “Family” containing only one topic, “Pregnancy & family planning” (T16), was found.
“The state” covered four topics about the “Healthcare system” (T1), “Sociology & society” (T4) and legal aspects related to healthcare (T7, T18). The closely related field “Healthcare structures” contained topics related to “Health insurance” (T15), “Medical education” (T23), “Research at University hospitals” (T24), “Institutes at University hospitals” (T25), “Online Pharmacy” (T30), “(Online) Appointment & Telemedicine” (T32), “Hospital clinics” (T36) and “University Hospital clinics” (T50).
The theme “Diet & Food” covered aspects such as “Food intolerance” (T2), “Healthy lifestyle & nutritional counseling” (T12), “Food ingredients” (T39) and “Cooking recipes” (T41). The closely related theme “Food production” covered topics such as “Environmental protection agriculture” (T14), “Pasture and agriculture” (T43), “Food safety & consumer protection” (T44) and “Agricultural land use” (T47). “Economy” covered the topics “Work & process organization”, “Economic growth in Germany” (T17), and “Production of medical products” (T2).
In addition, we found a theme “Health communication” including two topics: “Health (discussion) forum” (T22) and “Doctor rating portal” (T27). “Other” was assigned to T40 and T45, which could not be named by the volunteers.
Figs 9–11 depict the theme distribution per information provider type for each ccTLD. The theme distribution for each information provider type appears to be similar across the countries. Mainstream or local news agencies (M) report primarily on the themes “Illness & Injury” and “Economy”. Governmental or public (health) organizations (GPH), on the other hand, mainly focus on “Research & Science”, “Healthcare Structures”, and “Illness & Injury”. In contrast, NPOs report predominantly on “Illness & Injury”, followed by “Research & Science” and “Healthcare Structures”. This is similar to the theme distribution for private organizations (PO) and pharmaceutical companies (PC). Overall, the primary content of the sGHW across all ccTLDs appears to be focused on “Research & Science”, “Illness & Injury”, and “Healthcare Structures”.
Information provider types: GPH: Government, Public Institution or Public Health, NPO: Non-Profit Organization, PO: Private Organization, M: Mainstream or Local News, PC: Pharmaceutical Company, PB: Private Blog, Other: O.
Information provider types: GPH: Government, Public Institution or Public Health, NPO: Non-Profit Organization, PO: Private Organization, M: Mainstream or Local News, PC: Pharmaceutical Company, PB: Private Blog, Other: O.
Information provider types: GPH: Government, Public Institution or Public Health, NPO: Non-Profit Organization, PO: Private Organization, M: Mainstream or Local News, PC: Pharmaceutical Company, PB: Private Blog, Other: O.
Fig 12 depicts the theme distribution per ccTLD. On average, the theme “Research & Science” accounts for 21.04% of all topics across all ccTLDs and provider types (“Illness & Injury”: 17.92%; “Healthcare Structures”: 15.27%; “The State”: 10.52%; “Economy”: 10.50%; “Medical Specialities”: 7.30%; “Diet & Food”: 6.36%; “Other”: 3.35%; “Food production”: 2.94%; “Health Communication”: 2.90%; “Family”: 2.00%). This suggests that the content of the sGHW is similar between the countries of the D-A-CH region (at least for the ccTLDs studied) and that the information needs of users may not vary greatly between the individual countries.
In addition, we computed the Gini coefficient for the topic distributions of each ccTLD: G(“.de”) = 0.763, G(“.at”) = 0.746, and G(“.ch”) = 0.748. These values indicate that the topics vary strongly between web sites of a given ccTLD.
The graph analysis (see study aim 1) shows that the sGHW is dominated by private stakeholders (54.03%; 1,621/3000), followed by public institutions (18.50%; 555/3000) and nonprofit organizations (18.03%; 541/3000). However, looking at the 25 top-ranked web sites (see Tables 1–3), the largest share of web sites originates from governmental or public (health) institutions (35%; 26/75) and non-profit organizations (16%; 12/75). “Mainstream or Local News” accounts for 15% (11/75). In addition, we were able to identify 50 abstract topics, which we summarized and grouped into 11 themes: “Research & Science”, “Illness & Injury”, “The State”, “Healthcare structures”, “Diet & Food”, “Medical Specialities”, “Economy”, “Food production”, “Health communication”, “Family” and “Other”.
With respect to study aims 2 and 3, our readability analysis reveals that the majority of the collected web sites is difficult or very difficult (D+VD) to read (see S4 Appendix), as shown by the WSTF (84.63%; 2,539/3000). This ratio is similar for each ccTLD: 86.20% (862/1000) for “.de”, 84.40% (844/1000) for “.at”, and 83.30% (833/1000) for “.ch”. This finding coincides with the outcome of the German adaptation of the FRE scale: 2,691/3000 (89.70%) web sites are D or VD. Again, the ratio is similar for each ccTLD: 88.30% (883/1000) for “.de”, 90.70% (907/1000) for “.at”, and 90.10% (901/1000) for “.ch”. Thus, health-related web sites are often written at a high difficulty level and might not suit the intended group of readers. This is in line with the results of other studies, which also reported high difficulty levels for such resources [18–20, 22, 23, 26, 27].
Our vocabulary analysis revealed that 44.00% (1,320/3000) of the web sites use vocabulary that is well suited for a lay audience. Again, the ratio is similar for each ccTLD: 48.50% (485/1000) for “.de”, 41.90% (419/1000) for “.at”, and 41.60% (416/1000) for “.ch”. This suggests that relatively few medical expert terms were used on the related web pages, or that expert terminology was actively avoided.
The distribution of in- and out-degrees, i.e., links per host by rank, is in line with the results of Meusel et al. Although the latter publication analyzed a large but unfocused crawl, the nature of its respective distribution is similar to ours. This suggests that the distribution of incoming and outgoing links in the sGHW does not differ from the rest of the web.
We found that the sentence complexity measures FRE and WSTF are strongly correlated on health-related web pages, such that they can be used interchangeably. Also, high vocabulary difficulty moderately correlates with sentence complexity. On average, the theme “Research & Science” accounts for 21.04% of all topics across all ccTLDs and provider types (“Illness & Injury”: 17.92%; “Healthcare Structures”: 15.27%; “The State”: 10.52%; “Economy”: 10.50%; “Medical Specialities”: 7.30%; “Diet & Food”: 6.36%; “Other”: 3.35%; “Food production”: 2.94%; “Health Communication”: 2.90%; “Family”: 2.00%). This suggests that the content of the sGHW is similar between the countries of the D-A-CH region (at least for the ccTLDs studied).
Overall, we demonstrated that a focused crawling approach and subsequent graph analysis can be leveraged to conduct a full scale readability and vocabulary assessment on a large sample of a language-specific part of the health-related web (study aim 4).
Several limitations apply to this study. First, we only considered the ccTLDs “.de”, “.at”, and “.ch” to avoid the need for a language classification system, as most web sites on these ccTLDs are written in German. Therefore, our dataset covers only a certain fraction of the GHW. For example, (German) web sites published under “.com”, e.g. the web site of the electronic health record provider “www.vivy.com”, are not contained. In addition, our web crawl represents only a snapshot of the time when it was taken, i.e. web sites that were created after the end of our crawl are not included in our dataset, as we abstained from performing update operations to reduce computational complexity. A prominent example of such a web site is the national health portal of Germany, “gesund.bund.de”, operated by the German Ministry of Health and released on September 1, 2020.
Second, with a mean accuracy of 0.951, our classifier might have produced false positive results during the crawling process. Third, we used a focused web crawling system to collect health-related web pages and to extract the raw text material from HTML content. For this reason, disturbance artifacts, such as different kinds of hyphens, XML fragments or misencoded characters, may still have been included in the extracted text material and thus have influenced our readability analysis. In addition, some analyzed web sites may only contain a small number of (content) web pages, which might lead to an underestimated or overestimated average readability and/or vocabulary score (see S3 Appendix). This is due to the automatic nature of our web crawling process: (1) we omit (content) web pages that were classified as non-relevant, (2) we respect crawler ethics (i.e., robots.txt), and (3) we use an estimated priority value to determine crawling priority. Consequently, we might have missed additional relevant (content) web pages for a given web site.
Next, we relied on the PageRank algorithm to determine a ranking of the most important web sites contained within the generated host-aggregated sGHW graph. This does not necessarily comply with the perspective of an individual user who is using a (commercial) search engine to find relevant health content, nor does it correlate with visibility indices or “organic ranks” provided by (commercial) third-party services. However, we think that ranking web sites based on PageRank computed on the host-aggregated sGHW graph is justified, as it is not biased by commercial interest and can be reproduced easily. Even more importantly, it is a well accepted approach in graph theory to assess the importance of a graph node [50, 82].
Moreover, detecting syllables is not a trivial task for the German language and is not always reliable. As the adapted FRE and the WSTF are computed on the basis of the mean number of syllables per word, they can be influenced by the aforementioned inaccuracies. However, this applies to all NLP analysis tools for German text material. In addition, there is a lack of proper validation studies on the application of readability measures to German health-related text material. However, due to the frequent use of these instruments in the scientific community and their use by the German Agency for Quality in Medicine to assess the readability of their patient education guidelines and S3 guidelines, we consider them a reference that allows comparisons of readability analyses of health-related text material written in German.
Furthermore, solely computing the readability of text material disregards the individual knowledge and motivation of readers. Aspects related to illustration and design were not included in the analysis. Consequently, the suitability of health-related web sites cannot be judged exclusively on the basis of their readability or the vocabulary used. Other methods, such as the Suitability Assessment of Materials (SAM) instrument or DISCERN, go beyond measures of word and sentence length and cover other aspects of a web page that influence the understandability (or quality) of health information and text comprehension. However, these instruments require manual work and a sufficient number of judges to ensure an objective assessment. Moreover, with regard to our study, assessing 3,746,055 texts (i.e. web pages) would require very high financial and human resources, which is not feasible.
Comparison with prior work
Readability of health information material.
Previous studies investigated the readability of health-related web pages [18, 26, 27] or the vocabulary difficulty of health education material provided as PDF brochures [24, 25].
In contrast to McInnes and Haglund  or Worrall et al. , we obtained our data collection by using a specifically trained focused web crawler  instead of retrieving it via a (commercial) search engine provider such as Google. Thus, our data collection is not influenced by commercial interests.
McInnes and Haglund  analyzed 352 web sites and computed a mean FRE of 46.08, which is difficult to read. In 2020, Worrall et al.  report that “only 17.2% [(n = 165)] of web pages [related to COVID-19 were written] at a universally readable level.” These findings are supported by Brütting et al.  who found low readability scores for 45 prominent web sites on melanoma immunotherapy written in German. These results are in line with our findings which reveal that the majority of the collected web sites is difficult or very difficult (D+VD) to read (see S4 Appendix).
In a previous study, Keinki et al. analyzed information booklets for German cancer patients. The authors found a mean vocabulary score of L = 5.09, signaling a higher difficulty for lay people. Wiesner et al. found a mean vocabulary score of L = 3.66 for health education materials on Psoriasis/Psoriatic Arthritis written in German, indicating the use of less complex medical terminology. In contrast to the aforementioned studies, our study revealed higher mean vocabulary scores: L = 6.340 (SD = 2.572) for “.de”, L = 5.796 (SD = 2.543) for “.at”, and L = 5.885 (SD = 2.499) for “.ch”. This difference might result from the fact that we focused on health-related material contained in the GHW rather than limiting our study to patient information material only. Consequently, our data collection might contain web pages targeting (medical) experts, who make use of (medical) expert vocabulary.
Topic modeling on health information material.
Previous studies applied topic modeling techniques to a variety of health information material, such as content posted on social media, online newspaper articles or web sites in general [36–42]. Most of these studies [38–42] focused on a specific health-related topic such as “hearing loss”, “weight loss”, “dental health” or “occupational accidents”. Only two studies [36, 37] analyzed health topics covered by posts in social media (Twitter and Instagram).
Compared to the study by Paul and Dredze  on health topics on Twitter, we identified similar themes and/or topics within the sGHW such as “cancer & serious illness”, “injuries & pain”, “diet & exercise” and “family”. Muralidhara and Paul  explored health topics on Instagram and discovered ten broad categories. Compared to their work, we were able to identify similar topics such as “acute illness”, “alternative medicine”, “chronic illness and pain”, “mental health”, “diet” as well as “substance use”.
In contrast to the studies by Paul and Dredze  and Muralidhara and Paul , we focused on the German language and the sGHW rather than on social media. In addition, contrary to [38–42], we explored general health topics within the sGHW rather than focusing on one certain (health-related) discipline.
We found topic representations of the most common diseases in the D-A-CH region, such as “Oncology” (T5), “Chronic Illness” (T21), “Pain” (T26), “Heart diseases” (T28), “Mental Illness” (T31), “Pandemic & Vaccination” (T35), “Respiratory & pulmonary diseases” (T42) and “Neurological diseases” (T46). Interestingly, T35 includes terms such as “covid” and “vaccination” referring to the COVID-19 pandemic, even though the European outbreak only started during the last months of our web crawl. In addition, (healthy) food (T12, T39, T41), food intolerance (T2) as well as food production (T14, T43, T44, T47) seem to play an important role within the sGHW.
Conclusions and further research.
In this study, a system was presented which computes the readability and vocabulary difficulty of health-related web pages gathered by a focused web crawler in a fully automated way. We showed that a graph representation of the sGHW can be extracted during the data collection phase, which can then be used to compute a ranking of the top 1000 web sites for the ccTLDs “.de”, “.at”, and “.ch”. In addition, we demonstrated that LDA can be used to explore the collected dataset. In total, we were able to identify 50 topics, which were summarized into 11 themes.
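The ranking step described here rests on PageRank over the host-aggregated web graph. As an illustrative sketch only (not the authors' implementation, which operated on the full 231,733-node sGHW graph), a plain power iteration over a toy host graph might look like this:

```python
def pagerank(edges, d=0.85, iters=50):
    """Toy power-iteration PageRank over host-level links (src, dst)."""
    nodes = sorted({n for e in edges for n in e})
    out = {v: [] for v in nodes}
    for src, dst in edges:
        out[src].append(dst)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - d) / n for v in nodes}  # teleportation share
        for v in nodes:
            if out[v]:  # distribute rank along outgoing links
                share = d * rank[v] / len(out[v])
                for dst in out[v]:
                    new[dst] += share
            else:  # dangling host: spread its rank uniformly
                for u in nodes:
                    new[u] += d * rank[v] / n
        rank = new
    return rank
```

Hosts with many incoming links from well-ranked hosts accumulate the highest scores, which is exactly the property exploited when selecting the top 1000 web sites per ccTLD.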
Our results indicate that the readability within the sGHW is low. For this reason, publishing organizations and authors should reevaluate existing text material and reduce sentence complexity. Our findings suggest, however, that the vocabulary used often suits the target audience, although it could still be improved. We therefore recommend using both the sentence dimension and the vocabulary dimension as supportive measures to ensure understandable online health information. Content providers should be supported by proper tooling during text production: for instance, one could envision a cloud service with which health content providers check their health-related web content automatically for readability and vocabulary difficulty. In addition, users should be supported by browser-based tooling (e.g., browser extensions) that helps them identify easy-to-read content and also gives an indication of the quality of the related content.
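The two readability measures underlying this recommendation are simple formulas: Amstad's German adaptation of the Flesch Reading Ease is FRE = 180 - ASL - 58.5 * ASW, and the 4th Vienna formula is WSTF = 0.2744 * MS + 0.2656 * SL - 1.693, where ASL/SL is the mean sentence length in words, ASW the mean number of syllables per word, and MS the percentage of words with three or more syllables. A minimal sketch with a naive vowel-group syllable heuristic (the study's pipeline uses proper hyphenation-based syllable detection instead) could be:

```python
import re

VOWEL_GROUPS = re.compile(r"[aeiouyäöüAEIOUYÄÖÜ]+")

def count_syllables(word):
    # Naive heuristic: one syllable per vowel group; real tools use
    # hyphenation patterns or treebank-trained syllable boundary models.
    return max(1, len(VOWEL_GROUPS.findall(word)))

def readability(text):
    """Return (FRE, WSTF) for a German text under the naive heuristic."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-zÄÖÜäöüß]+", text)
    syllables = [count_syllables(w) for w in words]
    asl = len(words) / len(sentences)          # mean sentence length (words)
    asw = sum(syllables) / len(words)          # mean syllables per word
    ms = 100.0 * sum(1 for s in syllables if s >= 3) / len(words)
    fre = 180.0 - asl - 58.5 * asw             # Amstad's German FRE
    wstf = 0.2744 * ms + 0.2656 * asl - 1.693  # 4th Vienna formula
    return fre, wstf
```

Under the thresholds reported above (WSTF ≥ 12 or FRE ≤ 49 indicating difficult text), a short simple sentence scores as easy; production pipelines differ mainly in tokenization and syllable detection quality.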
In future work, the authors intend to extend their analyses to identify trustworthy health information web sites. To do so, we plan to combine the DISCERN instrument [86] with crowd-sourcing approaches. Using these insights, and with the acquired data available, the implementation and evaluation of a trustworthy health-specific search engine for information-seeking citizens will become possible.
S1 Appendix. Overview of information provider categories.
S2 Appendix. Instructions for the volunteers (written in German) and the word clouds to be named.
S3 Appendix. Top-ranked 1000 web sites for each ccTLD, their linguistic characteristics and the related text difficulty.
S4 Appendix. Class distribution for FRE, WSTF and L for each web site category.
S5 Appendix. Topics inferred by LDA in word clouds representation.
The authors would like to thank Dr. Monika Pobiruchin for her valuable feedback and input to the work. In addition, the authors would like to thank Sebastian Eisenhardt, Saskia Koch, Maximilian Kurscheidt, Philipp Höfer, Dr. Monika Pobiruchin, Verena Sauter, Marcel Schmid, Susanne Steuer, Martin Wiesner and Stefanie Zowalla for labeling the topics generated by LDA.
- 1. Cline RJW, Haynes KM. Consumer health information seeking on the Internet: the state of the art. Health Educ Res 2001 Jan 1;16(6):671–692. pmid:11780707
- 2. Eysenbach G, Köhler C. How do consumers search for and appraise health information on the world wide web? Qualitative study using focus groups, usability tests, and in-depth interviews. BMJ 2002 Jan 1;324(7337):573–577. pmid:11884321
- 3. Fox S, Duggan M. Health Online 2013 [Internet]. 2013. https://www.pewinternet.org/2013/01/15/health-online-2013/
- 4. Wetter T. Consumer Health Informatics: New Services, Roles, and Responsibilities. Cham: Springer International Publishing; 2016. ISBN:978-3-319-19590-2
- 5. Jacobs W, Amuta AO, Jeon KC. Health information seeking in the digital age: An analysis of health information seeking behavior among US adults. Cogent Soc Sci 2017 Jan 1;3(1):1302785.
- 6. Sbaffi L, Rowley J. Trust and Credibility in Web-Based Health Information: A Review and Agenda for Future Research. J Med Internet Res 2017;19(6):e218. pmid:28630033
- 7. Wong DK-K, Cheung M-K. Online Health Information Seeking and eHealth Literacy Among Patients Attending a Primary Care Clinic in Hong Kong: A Cross-Sectional Survey. J Med Internet Res 2019;21(3):e10831. pmid:30916666
- 8. Berkman ND, Sheridan SL, Donahue KE, Halpern DJ, Crotty K. Low Health Literacy and Health Outcomes: An Updated Systematic Review. Ann Intern Med 2011 Jul 19;155(2):97. pmid:21768583
- 9. Ownby RL. Influence of vocabulary and sentence complexity and passive voice on the readability of consumer-oriented mental health information on the Internet. AMIA Annu Symp Proc 2005;585–589. pmid:16779107
- 10. Chapple A, Campion P, May C. Clinical terminology: anxiety and confusion amongst families undergoing genetic counseling. Patient Educ Couns 1997 Oct;32(1–2):81–91. pmid:9355575
- 11. Wittenberg-Lyles E, Goldsmith J, Oliver DP, Demiris G, Kruse RL, Van Stee S. Using medical words with family caregivers. J Palliat Med 2013 Sep;16(9):1135–1139. pmid:23937064
- 12. Wittenberg E, Goldsmith J, Ferrell B, Platt CS. Enhancing Communication Related to Symptom Management Through Plain Language. J Pain Symptom Manage 2015 Nov;50(5):707–711. pmid:26162506
- 13. Zowalla R, Wetter T, Pfeifer D. Crawling the German Health Web: Exploratory Study and Graph Analysis. J Med Internet Res 2020 Jul 24;22(7):e17853. pmid:32706701
- 14. Zok K. Unterschiede bei der Gesundheitskompetenz—Ergebnisse einer bundesweiten Repräsentativ-Umfrage unter gesetzlich Versicherten [Differences of Health Literacy—Results of a nation-wide Representative Survey among Statutory Health Insurees]. WIdO-monitor 2014;11(2):1–12.
- 15. Schaeffer D, Berens E-M, Vogt D. Health Literacy in the German Population: Results of a Representative Survey. Dtsch Arztebl 2017 Jan 27.
- 16. Bieri U, Kocher JP, Gauch C, Tschöpe S, Venetz A, Hagemann M, et al. Bevölkerungsbefragung Erhebung Gesundheitskompetenz 2015 [Population survey on health literacy 2015] [Internet]. Bern, Switzerland: gfs.bern; 2016. https://www.obsan.admin.ch/sites/default/files/uploads/152131_geskomp_sb_def.pdf
- 17. Pelikan JM, Röthlin F, Ganahl K. Die Gesundheitskompetenz der österreichischen Bevölkerung [Health literacy of the Austrian population] [Internet]. Wien: Ludwig Boltzmann Institut Health Promotion Research; 2013. https://fgoe.org/sites/fgoe.org/files/project-attachments/Gesundheitskompetenz_Bundesl%C3%A4nder_%C3%96ffentlich.pdf
- 18. Brütting J, Steeb T, Reinhardt L, Berking C, Meier F. Exploring the Most Visible German Websites on Melanoma Immunotherapy: A Web-Based Analysis. JMIR Cancer 2018 Dec 13;4(2). pmid:30545808
- 19. Basch CH, Ethan D, MacLean SA, Fera J, Garcia P, Basch CE. Readability of Prostate Cancer Information Online: A Cross-Sectional Study. Am J Mens Health 2018 Sep;12(5):1665–1669. pmid:29888641
- 20. Thomas GR, Eng L, de Wolff JF, Grover SC. An evaluation of Wikipedia as a resource for patient education in nephrology. Semin Dial 2013 Apr;26(2):159–163. pmid:23432369
- 21. Edmunds MR, Barry RJ, Denniston AK. Readability assessment of online ophthalmic patient information. JAMA Ophthalmol 2013 Dec;131(12):1610–1616. pmid:24178035
- 22. Tulbert BH, Snyder CW, Brodell RT. Readability of Patient-oriented Online Dermatology Resources. J Clin Aesthet Dermatol 2011 Mar;4(3):27–33. pmid:21464884
- 23. Zowalla R, Wiesner M. Quantifying readability and vocabulary metrics of the Austrian National Health Portal. 63. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (GMDS) e.V.; 2018 Aug 27.
- 24. Keinki C, Zowalla R, Pobiruchin M, Huebner J, Wiesner M. Computer-Based Readability Testing of Information Booklets for German Cancer Patients. J Canc Educ 2019 Aug;34(4):696–704. pmid:29651761
- 25. Wiesner M, Zowalla R, Pobiruchin M. The Difficulty of German Information Booklets on Psoriasis and Psoriatic Arthritis: Automated Readability and Vocabulary Analysis. JMIR Dermatol 2020 Feb 28;3(1):e16095.
- 26. Mcinnes N, Haglund BJA. Readability of online health information: implications for health literacy. Inform Health Soc Care 2011 Dec 1;36(4):173–189. pmid:21332302
- 27. Worrall AP, Connolly MJ, O’Neill A, O’Doherty M, Thornton KP, McNally C, et al. Readability of online COVID-19 health information: a comparison between four English speaking countries. BMC Public Health 2020 Nov 13;20. pmid:33183297
- 28. Zowalla R, Wiesner M, Pfeifer D. Automatically Assessing the Expert Degree of Online Health Content using SVMs. Stud Health Technol Inform 2014 Jan 1;202:48–51. pmid:25000012
- 29. Zowalla R, Wiesner M, Pfeifer D. Expertizer: A Tool to Assess the Expert Level of Online Health Websites. Stud Health Technol Inform 2015;10–14. pmid:25991092
- 30. Platt JC. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In: Advances in Large-Margin Classifiers. MIT Press; 1999. p. 61–74.
- 31. vor der Brück T, Hartrumpf S, Helbig H. A Readability Checker with Supervised Learning Using Deep Indicators. Informatica 2008;32(4):429–435.
- 32. Berendes K, Vajjala S, Meurers D, Bryant D, Wagner W, Chinkina M, et al. Reading demands in secondary school: Does the linguistic complexity of textbooks increase with grade level and the academic orientation of the school track? Journal of Educational Psychology 2018 May;110(4):518–543.
- 33. Crossley SA, Skalicky S, Dascalu M. Moving beyond classic readability formulas: new methods and new models. Journal of Research in Reading 2019 Nov;42(3–4):541–561.
- 34. De Clercq O, Hoste V, Desmet B, Van Oosten P, De Cock M, Macken L. Using the crowd for readability prediction. Nat Lang Eng 2014 Jul;20(3):293–325.
- 35. Temnikova I, Vieweg S, Castillo C. The Case for Readability of Crisis Communications in Social Media. In: Proceedings of the 24th International Conference on World Wide Web [Internet]. Florence, Italy: ACM; 2015 [cited 2022 Aug 15]. p. 1245–1250.
- 36. Paul MJ, Dredze M. Discovering Health Topics in Social Media Using Topic Models. Lambiotte R, editor. PLoS ONE 2014 Aug 1;9(8):e103408. pmid:25084530
- 37. Muralidhara S, Paul MJ. #Healthy Selfies: Exploration of Health Topics on Instagram. JMIR Public Health Surveill 2018;4(2):e10150. pmid:29959106
- 38. Melkers J, Hicks D, Rosenblum S, Isett KR, Elliott J. Dental Blogs, Podcasts, and Associated Social Media: Descriptive Mapping and Analysis. J Med Internet Res 2017;19(7):e269. pmid:28747291
- 39. Liu Y, Yin Z. Understanding Weight Loss via Online Discussions: Content Analysis of Reddit Posts Using Topic Modeling and Word Clustering Techniques. J Med Internet Res 2020 Jun 8;22(6):e13745. pmid:32510460
- 40. Liu Q, Chen Q, Shen J, Wu H, Sun Y, Ming W-K. Data Analysis and Visualization of Newspaper Articles on Thirdhand Smoke: A Topic Modeling Approach. JMIR Med Inform 2019;7(1):e12414. pmid:30694199
- 41. Bahng J, Lee CH. Topic Modeling for Analyzing Patients’ Perceptions and Concerns of Hearing Loss on Social Q&A Sites: Incorporating Patients’ Perspective. Int J Environ Res Public Health 2020 Jan;17(17):6209. pmid:32867035
- 42. Min K-B, Song S-H, Min J-Y. Topic Modeling of Social Networking Service Data on Occupational Accidents in Korea: Latent Dirichlet Allocation Analysis. J Med Internet Res 2020 Aug 13;22(8):e19222. pmid:32663156
- 43. Page L, Brin S, Motwani R, Winograd T. The PageRank citation ranking: bringing order to the web. Stanford InfoLab; 1999.
- 44. Lombard M, Snyder‐Duch J, Bracken CC. Content Analysis in Mass Communication: Assessment and Reporting of Intercoder Reliability. Hum Commun Res 2002 Oct 1;28(4):587–604.
- 45. Cohen J. A Coefficient of Agreement for Nominal Scales. Educ Psychol Meas 1960 Apr 1;20(1):37–46.
- 46. Broder A, Kumar R, Maghoul F, Raghavan P, Rajagopalan S, Stata R, et al. Graph structure in the Web. Comput Netw 2000 Jun;33(1–6):309–320.
- 47. Meusel R, Vigna S, Lehmberg O, Bizer C. Graph structure in the Web—revisited: a trick of the heavy tail. In: Proceedings of the 23rd International Conference on World Wide Web. Seoul, Korea: International World Wide Web Conferences Steering Committee; 2014. p. 427–432. doi:10.1145/2567948.2576928
- 48. Meusel R. The Graph Structure in the Web–Analyzed on Different Aggregation Levels. JWS 2015 Aug 13;1(1):33–47.
- 49. Lehmberg O, Meusel R, Bizer C. Graph structure in the Web: aggregated by pay-level domain. In: Proceedings of the 2014 ACM conference on Web science. Bloomington, Indiana, USA: ACM; 2014. p. 119–128.
- 50. Gross JL, Yellen J, editors. Handbook of graph theory. Boca Raton: CRC Press; 2004. ISBN:978-1-58488-090-5
- 51. Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech 2008 Jan 1;2008(10):10008.
- 52. Albert R, Barabási A-L. Statistical mechanics of complex networks. Rev Mod Phys 2002 Jan 30;74(1):47–97.
- 53. Król K, Zdonek D. Aggregated Indices in Website Quality Assessment. Future Internet Multidisciplinary Digital Publishing Institute; 2020 Apr;12(4):72.
- 54. Cooper K. Keyword Research, Competitor Analysis, & Website Ranking | Alexa [Internet]. Alexa.com. [cited 2021 Nov 29]. https://www.alexa.com/
- 55. SISTRIX: bessere Rankings, mehr Sichtbarkeit & wirksamere Inhalte [SISTRIX: better rankings, more visibility & more effective content] [Internet]. SISTRIX. [cited 2021 Nov 29]. https://www.sistrix.de/
- 56. Digital Marketing Analytics for Leaders, SEO & Content Professionals | Searchmetrics [Internet]. Searchmetrics. [cited 2021 Nov 29]. https://www.searchmetrics.com/
- 57. Website-Traffic—Überprüfen und Analysieren jeder Website [Website traffic—checking and analyzing any website] [Internet]. Similarweb. [cited 2021 Nov 29]. https://www.similarweb.com/de/
- 58. Härting R-C, Mohl M, Steinhauser P, Möhring M. Search Engine Visibility Indices Versus Visitor Traffic on Websites. In: Abramowicz W, Alt R, Franczyk B, editors. Business Information Systems. Cham: Springer International Publishing; 2016. p. 91–101.
- 59. Klare GR. Assessing Readability. Read Res Q 1974;10(1):62.
- 60. Klare GR. The formative years. In: Zakaluk BL, Samuels SJ, editors. Readability: its past, present, and future. Newark, Del: International Reading Association; 1988.
- 61. Flesch R. A new readability yardstick. J Appl Psychol 1948;32(3):221–233. pmid:18867058
- 62. Amstad T. Wie verständlich sind unsere Zeitungen? [How understandable are our newspapers?]. Universität Zürich; 1978.
- 63. Bamberger R, Vanecek E. Lesen—Verstehen—Lernen—Schreiben. Die Schwierigkeitsstufen von Texten in deutscher Sprache [Reading—Understanding—Learning—Writing. The difficulty levels of German texts]. Wien: Jugend u. Volk Sauerlaender; 1984.
- 64. Leroy G, Miller T, Rosemblat G, Browne A. A balanced approach to health information evaluation: A vocabulary-based naïve Bayes classifier and readability formulas. J Am Soc Inf Sci 2008 Jul;59(9):1409–1419.
- 65. Joachims T. Text categorization with support vector machines: Learning with many relevant features. Dortmund: Dekanat Informatik, Univ; 1997.
- 66. Blei DM, Ng AY, Jordan MI. Latent Dirichlet Allocation. J Mach Learn Res 2003 Jan 1;3:993–1022.
- 67. Griffiths TL, Steyvers M. Finding scientific topics. Proceedings of the National Academy of Sciences 2004 Apr 6;101(Supplement 1):5228–5235. pmid:14872004
- 68. Asuncion A, Welling M, Smyth P, Teh YW. On Smoothing and Inference for Topic Models. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. Arlington, Virginia, USA: AUAI Press; 2009. p. 27–34.
- 69. Minka TP. Estimating a Dirichlet distribution [Internet]. Cambridge, UK: Microsoft Research; 2000. https://tminka.github.io/papers/dirichlet/minka-dirichlet.pdf
- 70. Wallach HM, Murray I, Salakhutdinov R, Mimno D. Evaluation methods for topic models. In: Proceedings of the 26th Annual International Conference on Machine Learning—ICML ’09. Montreal, Quebec, Canada: ACM Press; 2009. p. 1–8.
- 71. Pfeifer D, Leidner JL. A Study on Topic Modeling for Feature Space Reduction in Text Classification. In: Cuzzocrea A, Greco S, Larsen HL, Saccà D, Andreasen T, Christiansen H, editors. Flexible Query Answering Systems. Cham: Springer International Publishing; 2019. p. 403–412. https://doi.org/10.1007/978-3-030-27629-4_37
- 72. Gini C. Measurement of Inequality of Incomes. Econ J 1921 Mar;31(121):124.
- 73. Allen ST, Jankowski M, Pathirana P. Storm applied: Strategies for real-time event processing [Internet]. Shelter Island, NY: Manning Publications Co; 2015. http://proquest.tech.safaribooksonline.de/9781617291890 ISBN:978-1-61729-189-0
- 74. Apache Software Foundation. Apache OpenNLP [Internet]. 2020 [cited 2020 Nov 11]. https://opennlp.apache.org/
- 75. Apache Software Foundation. Apache OpenNLP Tools—Models [Internet]. 2020 [cited 2020 Nov 11]. http://opennlp.sourceforge.net/models-1.5/
- 76. Liang F. Word Hy-phen-a-tion by Com-put-er. Stanford University; 1983.
- 77. Porter M. An algorithm for suffix stripping. Program: electronic library and information systems 1980 Jan 1;14(3):130–137.
- 78. Pfeifer D, Leidner JL. Topic Grouper: An Agglomerative Clustering Approach to Topic Modeling. In: Azzopardi L, Stein B, Fuhr N, Mayr P, Hauff C, Hiemstra D, editors. Advances in Information Retrieval. Cham: Springer International Publishing; 2019. p. 590–603. https://doi.org/10.1007/978-3-030-15712-8_38
- 79. James SL, Abate D, Abate KH, Abay SM, Abbafati C, Abbasi N, et al. Global, regional, and national incidence, prevalence, and years lived with disability for 354 diseases and injuries for 195 countries and territories, 1990–2017: a systematic analysis for the Global Burden of Disease Study 2017. The Lancet Elsevier; 2018 Nov 10;392(10159):1789–1858. pmid:30496104
- 80. Dávila Vanegas MM, Krause T, Dulas F, Weber S. Zusammenführung der ICD-10-GM und der Orpha-Kennnummer für die Kodierung von Seltenen Erkrankungen [Merging the ICD-10-GM and the Orpha code number for coding rare diseases]. 61. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik; 2016 Aug 8.
- 81. Landis JR, Koch GG. The Measurement of Observer Agreement for Categorical Data. Biometrics 1977;33(1):159–174. pmid:843571
- 82. Zaki MJ, Meira W. Data mining and analysis: fundamental concepts and algorithms. New York, NY: Cambridge University Press; 2014. ISBN:978-0-521-76633-3
- 83. Müller K. Automatic detection of syllable boundaries combining the advantages of treebank and bracketed corpora training. In: Proceedings of the 39th Annual Meeting on Association for Computational Linguistics—ACL ’01. Toulouse, France: Association for Computational Linguistics; 2001. p. 410–417.
- 84. Schaefer C, Zowalla R, Wiesner M, Siegert S, Bothe L, Follmann M. Patientenleitlinien in der Onkologie: Zielsetzung, Vorgehen und erste Erfahrungen mit dem Format [Patient guidelines in oncology: objectives, procedure and first experiences with the format]. Z Evid Fortbild Qual Gesundhwes 2015;109(6):445–451.
- 85. Doak CC, Doak LG, Root JH. Teaching patients with low literacy skills. Philadelphia: J.B. Lippincott; 1996. ISBN:978-0-397-55161-3
- 86. Charnock D, editor. Das DISCERN-Handbuch: Qualitätskriterien für Patienteninformationen über Behandlungsalternativen; Nutzerleitfaden und Schulungsmittel [The DISCERN handbook: quality criteria for patient information on treatment alternatives; user guide and training resource]. München: Zuckschwerdt; 2000. ISBN:978-3-88603-694-3