Hybrid text mining models for investigative keyword expansion on child sexual abuse in the dark web

Jin Gyeong Kim; Jiyeon Kim

doi:10.1371/journal.pone.0344470

Abstract

The distribution of child sexual abuse materials (CSAM) via the dark web continues to hinder digital investigations due to the network’s inherent anonymity and fragmentation. This work presents a comparative analysis of text mining techniques for extracting investigative keywords from CSAM-related content on the dark web and aims to establish a foundation for scalable, expandable keyword-based detection. Using a custom crawler, we collected data from 2,414 dark web pages indexed by the Torch search engine. Based on this dataset, three methods (TF-IDF, Eigenvector Centrality, and Word2Vec) were applied to extract CSAM-related keywords, and their effectiveness was evaluated through dark web search experiments measuring the retrieval performance of CSAM-related sites. Among the individual techniques, Eigenvector Centrality, a graph-based keyword ranking algorithm, showed the highest precision and contextual relevance by identifying structurally central terms within co-occurrence networks. Building on this, we developed hybrid models that combined Eigenvector Centrality with either TF-IDF or Word2Vec. In particular, the model integrating Eigenvector Centrality with Word2Vec-based semantic similarity proved most effective in expanding investigative clues and retrieving highly relevant keywords. Unlike prior studies, this work uses domain-specific dark web data collected first-hand to demonstrate a multi-method approach that not only improves keyword accuracy but also enables the dynamic expansion of early-stage crime indicators. The proposed methodology offers practical value for automating the detection of illicit content and improving the operational efficiency of cyber investigations.

Citation: Kim JG, Kim J (2026) Hybrid text mining models for investigative keyword expansion on child sexual abuse in the dark web. PLoS One 21(5): e0344470. https://doi.org/10.1371/journal.pone.0344470

Editor: Masoud Rahmati, Lorestan University, IRAN, ISLAMIC REPUBLIC OF

Received: August 22, 2025; Accepted: February 9, 2026; Published: May 8, 2026

Copyright: © 2026 Kim, Kim. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: In accordance with the PLOS ONE data sharing policy, we have made publicly available the non-sensitive derived datasets (i.e., aggregated and processed analysis outputs that do not allow reconstruction of the underlying raw text) to enhance transparency. The shared materials include keywords and keyword ranking outputs derived through text mining, keyword co-occurrence–based network data (nodes, edges, and weights), and trained Word2Vec embedding vectors generated during the analysis. The underlying raw text collected from dark web sources is not included in the repository because sharing it may not be appropriate under relevant legal and ethical considerations and the terms and conditions of the data sources. The repository is available here: https://github.com/jingyeong27/DarkWeb_CSAM_KeywordAnalysis_Repository.git.

Funding: This work was supported by Daegu University Research Grant, 2022 (to JK).

Competing interests: The authors have declared that no competing interests exist.

Introduction

The dark web, which operates through encrypted communication protocols, has become a central medium for organized cybercrime activities such as digital sexual abuse, illegal gambling, and drug trafficking [1]. Unlike the surface web that is easily accessed via standard browsers like Chrome or Microsoft Edge, the dark web requires specialized software—most commonly the Tor browser—for access. Tor utilizes onion routing, a system that encrypts traffic and relays it through multiple nodes—entry, middle, and exit—thus concealing IP addresses and preserving user anonymity. This structural anonymity enables the dark web to serve as a platform for the dissemination of highly illegal and harmful content. In particular, child sexual abuse materials (CSAM) continue to circulate across hidden services with minimal regulatory oversight [2,3]. A well-known case is “Welcome to Video” in South Korea, where a large-scale CSAM distribution network operated by a single individual drew thousands of international users who accessed explicit content involving minors [4]. These crimes have evolved from informal exchanges between individuals into a profit-driven and organized ecosystem [5]. Operators deliberately dismantle and re-establish sites to avoid detection by law enforcement. This practice makes it increasingly difficult to track criminal activity or secure digital evidence. As CSAM production and distribution increasingly follow commercial and coordinated models, investigations require scalable, adaptive, and keyword-driven techniques to extract actionable leads from dynamic and opaque platforms [6]. However, most prior research has relied on static datasets extracted from previously known dark web forums or now-defunct sites, which often fail to reflect the latest patterns of criminal activity. The lack of real-time, domain-specific data collection frameworks has limited the ability to detect newly emerging threats and expand investigative leads. 
While existing studies have applied machine learning and content classification techniques to analyze dark web data, most have focused on surface-level patterns or isolated terms. Few have addressed the need for methods that can dynamically expand early investigative clues into deeper leads across platforms. To address these limitations, we propose a text mining–based approach for the automated collection and analysis of textual data from dark web sources distributing CSAM. Specifically, we compare three representative techniques (TF-IDF, Eigenvector Centrality, and Word2Vec) to evaluate their effectiveness in extracting investigative keywords. Furthermore, we design combined models that improve contextual relevance and detection accuracy by integrating the structural centrality captured by Eigenvector Centrality with either the statistical rarity captured by TF-IDF or the semantic similarity provided by Word2Vec.

The remainder of this paper is structured as follows. The related works section reviews existing studies on dark web investigation and keyword extraction using text mining. The materials and methods section then describes the dataset construction process and the individual text mining models. The results section presents the proposed combined models, compares their performance, and evaluates their effectiveness in identifying CSAM-related content. The discussion examines the robustness and generalizability of the proposed approach, including additional experiments using alternative seed keywords, and the conclusion summarizes the key findings and outlines directions for future research.

Related works

Research on dark web investigation trends

The dark web has emerged as a central medium for illegal activities such as drug trafficking, unlawful gambling, and the distribution of CSAM. To respond to these crimes, various investigative strategies have been introduced to detect and monitor criminal activities across dark web platforms. Machine learning approaches have been widely applied to classify and analyze content from the dark web. One line of research involved extracting HTML tags and textual content from dark web pages and applying neural networks in combination with semi-supervised support vector machines to classify crime types [7]. Another approach employed convolutional neural networks (CNNs) and KeyBERT to identify relevant keywords and classify images and texts extracted from illicit websites [8]. Deep learning methods, such as long short-term memory (LSTM) models, have also been trained on dark web forum data to detect criminal behaviors [9], while transductive semi-supervised learning techniques have been adopted for improved detection using domain-specific datasets [10]. Beyond content classification, structural and network-based methods have been used to identify patterns in criminal ecosystems. Studies have analyzed hyperlink networks among dark web sites to uncover the evolution of criminal groups and identify high-centrality nodes [11,12]. Other research focused on financial traces by analyzing Bitcoin transactions—commonly used on the dark web—to estimate the scale of illegal marketplaces and map user behaviors [13]. These combined efforts reflect a growing trend toward integrating content analysis, network structure, and transaction-level data to better expose and understand hidden criminal infrastructures.

However, most of these studies have relied on static or outdated datasets, often based on previously identified or now-defunct illicit sites. As the dark web evolves rapidly and many such platforms are taken offline or replaced, these approaches struggle to reflect the current threat landscape. More importantly, there remains a lack of research on establishing real-time collection systems that can accurately identify crime-relevant pages and dynamically expand early investigative clues into broader leads. To address these limitations, this work develops a custom crawler to collect live dark web data and proposes a text mining–based framework that can extract and scale high-precision investigative keywords for active cybercrime detection.

Cyber investigation research based on text mining

Text mining has become a fundamental methodology in cybercrime analysis, enabling researchers to extract meaningful patterns from large-scale unstructured data. Prior studies have demonstrated that text-mining techniques can support security and risk analytics across various application domains [14]. In the crime analysis field, graph-based community detection and clustering algorithms—such as the Louvain method—have been applied to real-world crime reports to uncover threat-related structures and derive investigative cues [15]. WordNet-based semantic analysis has also been employed to identify relationships among crime-related terminology and support cybercrime investigation workflows [16]. In addition, machine-learning models built on TF–IDF representations have been used to classify crime-related textual records from public datasets, including categorizing theft or other offense types using XGBoost and similar classifiers [17]. Within dark web research specifically, machine-learning approaches have been applied to classify cybercrime offenses and analyze linguistic characteristics of illicit online communications, demonstrating the utility of embedding-based and feature-based text representations [18]. Graph-embedding techniques have further been introduced to track the evolution of hacker-forum terminology over time and proactively identify emerging cyber threats [19]. Recent dark web research has widely applied text-mining and machine-learning techniques to analyze illicit content, using TF–IDF and related term-weighting schemes as standard baselines for representing salient terms, and combining them with methods such as K-means clustering, decision trees, logistic regression, and N-gram-based topic modeling to categorize onion sites and characterize dark web forums [20–23]. 
From a network analysis standpoint, graph-based metrics such as degree distribution, centrality, and PageRank have been applied to identify influential sites and trace content distribution paths in dark web ecosystems [24–27], providing insights into the structure and flow of illicit information.

While these studies have advanced both content-based and structure-aware approaches, many studies still rely on single techniques and have not been empirically validated in real-world investigations. Moreover, few approaches focus on building keyword-driven detection models that can rapidly identify early-stage criminal clues and actively expand them into broader investigative leads. To address these limitations, we compare multiple text mining techniques for extracting CSAM-related keywords and evaluate their effectiveness in dark web environments. In addition to individual methods, we propose hybrid models designed to improve both the precision and scalability of investigative keyword extraction, with an emphasis on applicability to actual forensic workflows.

Materials and methods

Keyword extraction with individual text mining models

This section describes the process of extracting CSAM-related keywords from dark web sources. Using a seed keyword, we developed a crawler to collect CSAM-related pages from the dark web. Based on the collected dataset, we applied three text mining techniques—TF-IDF, Eigenvector Centrality, and Word2Vec—to extract key terms. We then conducted a comparative evaluation of these models to identify the most effective approach for detecting high-relevance CSAM keywords.

Data collection from the dark web.

To construct a dataset for CSAM-related keyword analysis, a dark web crawler was developed using the Torch search engine. Torch is one of the longest-running search engines on the Tor network and provides access to a wide range of .onion domains [28]. Fig 1 illustrates the overall keyword crawling process.


Fig 1. Workflow of the dark web keyword crawler for crime data collection.

https://doi.org/10.1371/journal.pone.0344470.g001

In the first step, the crawler establishes a session through the Tor browser and accesses Torch. Next, it enters seed keywords—terms known to be associated with CSAM—into the search engine. In this study, the term “Lolita”, which signifies an adult man’s sexual attraction to young girls, was selected as a seed keyword. Using this term, the crawler retrieved 2,414 dark web pages indexed by Torch that were potentially related to child sexual abuse. The crawler then follows each resulting link and collects all visible text data from the retrieved pages.

Throughout this process, the crawler was employed solely for academic research purposes, and the collected raw data were neither publicly released nor redistributed. We utilized textual data collected from the Torch search engine, accessed via the Tor browser, using CSAM-related seed keywords as search inputs. The resulting dataset consists exclusively of textual content from dark web sites retrieved with the designated seed keywords; no illegal CSAM content such as images or videos was downloaded or stored.

To prevent redundancy in the corpus, previously collected content is compared against new entries and duplicates are removed. The collected raw text corpus then undergoes a multi-stage preprocessing procedure. First, unnecessary text components such as HTML tags, scripts, and boilerplate markup are removed. The remaining data are processed through a regular expression–based tokenizer and morphological analysis to extract tokens, with a particular focus on nouns that are more likely to represent core entities or concepts. Standard English stopwords and additional stopwords defined to reflect the characteristics of the dark web domain are removed to reduce noise caused by meaningless or non-informative text. In addition, random strings, system placeholders, and markup artifacts are manually excluded. All subsequent analyses are performed on this refined set of textual tokens, which constitutes a de-duplicated, text-only dataset derived from CSAM-related dark web pages.
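The cleaning steps described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the stopword sets are hypothetical placeholders, the sample pages are neutral toy strings, and the noun filtering via morphological analysis is omitted because it would require a part-of-speech tagger that the text does not name.

```python
import re
from html import unescape

# Hypothetical stopword lists; the lists actually used in the study are not given.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}
DOMAIN_STOPWORDS = {"onion", "login", "http", "https", "www"}


def strip_markup(html: str) -> str:
    """Drop script/style blocks and remaining tags, keeping visible text."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    html = re.sub(r"(?s)<[^>]+>", " ", html)
    return unescape(html)


def tokenize(text: str) -> list[str]:
    """Regex tokenizer: lowercase alphabetic tokens (3+ letters), minus stopwords."""
    tokens = re.findall(r"[a-z]{3,}", text.lower())
    blocked = STOPWORDS | DOMAIN_STOPWORDS
    return [t for t in tokens if t not in blocked]


def build_corpus(pages: list[str]) -> list[list[str]]:
    """Skip pages whose normalized text was already seen, then tokenize the rest."""
    seen, corpus = set(), []
    for page in pages:
        text = " ".join(strip_markup(page).split())
        if text in seen:
            continue
        seen.add(text)
        corpus.append(tokenize(text))
    return corpus


# Neutral toy pages; the second is a whitespace variant of the first.
pages = [
    "<html><script>var x = 1;</script><p>Forum index and board rules</p></html>",
    "<html><p>Forum   index and board rules</p></html>",
    "<p>Gallery archive for the community board</p>",
]
corpus = build_corpus(pages)
```

Exact-match de-duplication on whitespace-normalized text is the simplest choice; near-duplicate detection would need a different comparison and is not attempted here.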

The reason this study adopted an English-based analysis is that nearly all dark web CSAM-related sites collected were written in English. In our crawling results, excluding inaccessible pages, 100% of the retrieved sites were English-based. Accordingly, focusing on English for keyword extraction and model construction was deemed the most appropriate approach. At the same time, the TF-IDF, Eigenvector Centrality, and Word2Vec techniques used in this study are language-agnostic; with appropriate adjustments to morphological structures and tokenization rules, they can be applied to other languages as well.

Building on this dataset, the study proposes a text mining–based approach for the automatic collection and analysis of textual data from dark web sources associated with the distribution of CSAM. Three representative techniques are applied: term frequency–inverse document frequency (TF–IDF), which quantifies the importance of terms based on document-level frequency; eigenvector centrality, which identifies structurally central keywords within a term co-occurrence network; and Word2Vec, which captures semantic similarity between terms. The effectiveness of crime-related keywords extracted by each individual method is first validated, and the performance of TF–IDF, eigenvector centrality, and Word2Vec in investigative keyword extraction is then compared and evaluated. Based on these experiments, a hybrid model is designed that integrates statistical rarity information from TF–IDF and semantic similarity information from Word2Vec, centered on eigenvector centrality, with the aim of simultaneously improving the contextual relevance and detection accuracy of extracted keywords. The study qualitatively and quantitatively compares the performance of the individual and hybrid models to evaluate their effectiveness in detecting CSAM-related content.

The data collection and analysis procedures were designed to comply with all applicable legal and ethical standards, including regulations concerning CSAM. Although the raw data used in this research may contain textual expressions related to CSAM collected from the dark web, these data were used exclusively for internal analysis within the research team and were not shared externally in any form. The dataset employed in this paper consists of derivative text-only data generated from the raw corpus, and it does not include any multimedia files such as images or videos. Moreover, the crawler was implemented as a strict text-only collector and did not download any images, videos, or file attachments from target pages, thereby minimizing legal and ethical risks and preventing researchers from being directly exposed to illegal materials.

Ethical and legal compliance.

All data collection and analysis procedures were conducted in accordance with applicable legal and ethical standards for research related to CSAM. In this study, only textual data were collected and analyzed, and no illegal CSAM content, such as images or videos, was downloaded, stored, or viewed by the researchers. The crawler was implemented as a text-only collector, so multimedia files were not collected. All raw corpus data were stored in a secure local research environment, with access restricted to authorized members of the research team for research purposes only. Furthermore, because the methodology proposed in this study relies exclusively on derived, text-based analytical outputs, it does not enable readers to access CSAM content.

Terms and conditions compliance.

Data collection and analysis were conducted in accordance with the publicly available terms and conditions and acceptable-use policies of the data sources used in this study (including the Tor Browser software and the Torch search service used to locate .onion pages). The crawler accessed only content retrievable through standard browsing without authentication, did not circumvent access controls, and did not redistribute third-party page content.

Text mining techniques for CSAM keyword extraction.

This section introduces three distinct text mining methods—TF-IDF, Eigenvector Centrality, and Word2Vec—used to extract keywords related to CSAM from dark web text data. Each method is described in terms of its conceptual framework and analytical strengths. Their comparative performance and validation results are addressed in a subsequent section.

TF-IDF-based keyword extraction: TF-IDF is a statistical method used to evaluate how important a word is within a document relative to a collection of documents. It favors terms that appear frequently in a specific document but are rare across the broader corpus, making it effective for extracting distinctive terms [29–32]. When applied to the collected dataset of 71,666 nouns from 2,414 dark web pages, TF-IDF surfaced a set of top-ranked keywords that reflect the underlying structure and distribution mechanisms of CSAM-related content. These keywords were ranked by their TF-IDF scores, which quantify a term’s distinctiveness by combining its frequency within a document and its rarity across the corpus. Table 1 summarizes the extracted keywords.


Table 1. Top 20 keywords extracted using TF-IDF.

https://doi.org/10.1371/journal.pone.0344470.t001

The highest-ranking keyword, “maxchan” (1st), refers to publicly available login credentials on a dark web site, which are used to facilitate repeated user access. “mixedlolitas” (2nd) appears as a hyperlink embedded within site navigation, while “zenphoto” (20th) denotes software used to organize and present multimedia files on CSAM-sharing sites. “pornoslonik” (8th) and “camera” (16th) are related to tools used for producing or storing exploitative materials. Community structures are reflected in terms like “endchan” (10th) and “community” (13th), which suggest the presence of forums or boards for sharing materials and peer communication. Personal identifiers such as “hanamaru” (14th) and “sonya” (15th) likely function as usernames or pseudonyms associated with content creators or distributors. In addition, broader contextual terms such as “countries” (3rd), “city” (6th), and “language” (7th) highlight the cross-national and multilingual nature of these platforms, reflecting the global reach of CSAM dissemination. Although many of these keywords do not explicitly describe abusive content, they provide investigative value by revealing the infrastructural, communicative, and operational layers of dark web ecosystems.
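As a rough illustration of how TF-IDF rewards terms that are frequent in one document but rare across the corpus, the following sketch ranks terms over a toy corpus. The text does not state how per-document scores were aggregated into a single corpus-level ranking, so taking each term's maximum score over documents is an assumption, as are the unsmoothed natural-log IDF and the neutral sample tokens.

```python
import math
from collections import Counter


def tfidf_ranking(docs: list[list[str]]) -> list[tuple[str, float]]:
    """Rank terms by their highest per-document TF-IDF score (assumed aggregation)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    best: dict[str, float] = {}
    for doc in docs:
        if not doc:
            continue
        for term, count in Counter(doc).items():
            tf = count / len(doc)               # relative frequency in this document
            idf = math.log(n_docs / df[term])   # 0 when the term is in every document
            best[term] = max(best.get(term, 0.0), tf * idf)
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))


# Neutral toy corpus of already-tokenized pages.
docs = [
    ["board", "forum", "forum", "index"],
    ["board", "gallery", "archive"],
    ["board", "mirror", "mirror", "mirror"],
]
ranking = tfidf_ranking(docs)
```

In this toy corpus "board" occurs in every document, so its IDF is zero and it falls to the bottom of the ranking, while "mirror", concentrated in a single document, ranks first.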

Eigenvector centrality-based keyword extraction: Eigenvector Centrality is a graph-based ranking algorithm that evaluates a keyword’s importance based on both the number and influence of its co-occurring neighbors. This allows it to identify structurally central terms in a co-occurrence network, often capturing deeply embedded or thematically cohesive concepts [32–35].

In this study, Eigenvector Centrality identified high-impact CSAM-related keywords such as “child” (1st), “zoo” (3rd), “pedomoms” (4th), and “jblinks” (6th), which showed strong semantic relevance to exploitative content and related community structures. These terms occupy structurally central positions within the co-occurrence network, indicating their frequent appearance alongside many other key terms. Table 2 presents the top 20 keywords ranked by their eigenvector centrality scores.


Table 2. Top 20 keywords extracted using eigenvector centrality.

https://doi.org/10.1371/journal.pone.0344470.t002

“child” (1st) had the highest centrality, reaffirming the primary focus of these networks on minors. “lola” (2nd), a stylized variant of “Lolita”, and “jblinks” (6th) function as hyperlink labels or navigational anchors connecting users across multiple sites. “Myloveboard” (8th), a known category within the “MixedLolitas” ecosystem, likely serves as a thematic hub organizing exploitative content.

Other keywords point to specific behaviors or victim profiles. For instance, “zoo” (3rd) and “pedomoms” (4th) reference extreme themes such as animal exploitation or parental abuse. “Toddlers” (5th) indicates an even narrower target age group, likely children aged 1–3. Media-related keywords such as “thumbnailed” (9th), “videos” (10th), “pic” (13th), “pictures” (14th), “photo” (18th), and “image” (19th) highlight practices of visual content organization and dissemination. Terms like “little” (11th), “cute” (12th), and “young” (17th) reflect grooming language or descriptors that may serve to attract specific audiences. The appearance of terms such as “nude” (16th), “model” (15th), and “beautiful” (20th) suggests attempts to normalize or stylize illegal content.

Overall, this analysis highlights that CSAM content on the dark web is both semantically rich and structurally organized around central keywords performing navigational and thematic roles. The results demonstrate Eigenvector Centrality’s strength in identifying deeply embedded keywords that define the topology of exploitative content networks, thereby providing practical insights for investigative detection and intervention.
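The idea that a term is central when its co-occurring neighbors are themselves central can be sketched with a small power iteration. This is an illustrative sketch rather than the authors' implementation: it assumes two terms co-occur when they appear in the same document (the text does not specify the co-occurrence window), uses neutral toy tokens, and iterates on the adjacency matrix plus the identity, which leaves the leading eigenvector unchanged while avoiding oscillation on bipartite graphs.

```python
import math
from collections import defaultdict
from itertools import combinations


def cooccurrence_graph(docs):
    """Undirected weighted graph: weight = number of documents sharing both terms."""
    weights = defaultdict(lambda: defaultdict(float))
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            weights[a][b] += 1.0
            weights[b][a] += 1.0
    return weights


def eigenvector_centrality(weights, iterations=200, tol=1e-10):
    """Power iteration on (A + I); returns an L2-normalized score per node."""
    nodes = sorted(weights)
    if not nodes:
        return {}
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Each node keeps its own score and adds the weighted scores of its neighbors.
        new = {
            n: score[n] + sum(w * score[m] for m, w in weights[n].items())
            for n in nodes
        }
        norm = math.sqrt(sum(v * v for v in new.values())) or 1.0
        new = {n: v / norm for n, v in new.items()}
        converged = max(abs(new[n] - score[n]) for n in nodes) < tol
        score = new
        if converged:
            break
    return score


# Neutral toy corpus: "hub" co-occurs with every other term.
docs = [["hub", "alpha"], ["hub", "beta"], ["hub", "gamma"], ["alpha", "beta"]]
centrality = eigenvector_centrality(cooccurrence_graph(docs))
ranked = sorted(centrality, key=lambda n: (-centrality[n], n))
```

Here "gamma" and "alpha" both neighbor "hub", but "alpha" also neighbors "beta", so it scores higher than "gamma"; this is the sense in which the measure weighs the influence of neighbors and not only their number.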

Word2Vec-based keyword extraction: Word2Vec is a neural embedding model that represents words in a continuous vector space, allowing for the identification of semantically similar or related terms. It is particularly adept at surfacing slang or euphemistic expressions that might not be captured by frequency-based models [36–39]. Table 3 presents the top 20 keywords extracted using the Word2Vec technique from the dark web CSAM dataset. The similarity scores indicate the semantic proximity of each keyword to the target context within the embedding space.


Table 3. Top 20 keywords extracted using Word2Vec.

https://doi.org/10.1371/journal.pone.0344470.t003

The highest-ranked keywords, such as “guys” (1st) and “boys” (2nd), reflect the strong association with the thematic core of CSAM, especially targeting young male victims. Emotionally expressive terms like “sweet” (3rd), “cutie” (7th), and “enjoy” (16th) suggest promotional or evaluative language often used within exploitative communities.

Sexually explicit slang terms are also prominent, including “pussy” (5th) and “ass” (6th), which are frequently used to describe body parts in a sexualized context. Notably, “telegru” (10th) appears as a slang variation of “Telegram,” indicating platforms commonly mentioned in dark web communications. Geographic indicators such as “Russian” (13th) and “american” (20th) imply the global distribution of such content, often referenced through national or regional tags. The term “school” (9th) anchors the content in specific social contexts, while “models” (4th) may reflect attempts to legitimize or disguise exploitative imagery using softened terminology. These findings highlight the model’s ability to extract not only semantically related keywords but also criminal slang, emotional rhetoric, platform references, and distributional clues. Word2Vec proves effective for uncovering concealed linguistic signals within CSAM-related content on the dark web, offering valuable insights for interpreting criminal lexicons and designing robust detection strategies.
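The similarity ranking described above reduces to comparing embedding vectors, commonly by cosine similarity. The sketch below shows only that ranking step on hand-made three-dimensional vectors; it does not train a Word2Vec model, and the vectors, token names, and the choice of cosine similarity as the scoring function are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(target, vectors, topn=3):
    """Rank every other term by cosine similarity to the target term."""
    ref = vectors[target]
    scored = [(w, cosine(ref, v)) for w, v in vectors.items() if w != target]
    return sorted(scored, key=lambda kv: (-kv[1], kv[0]))[:topn]


# Hypothetical toy embeddings; real Word2Vec vectors typically have 100+ dimensions.
vectors = {
    "seed": [1.0, 0.0, 0.0],
    "near": [0.9, 0.1, 0.0],
    "mid": [0.5, 0.5, 0.0],
    "far": [0.0, 0.0, 1.0],
}
neighbors = most_similar("seed", vectors)
```

Because the ranking depends only on direction in the embedding space, a term can score highly without ever being frequent, which is consistent with the observation that this method surfaces colloquial terms that frequency-based models miss.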

Comparative evaluation of individual models.

To assess the performance of the three techniques, we conducted a comparative analysis based on two metrics: keyword categorization and retrieval accuracy. Fig 2 categorizes the top 20 keywords from each model into three CSAM-relevant categories: sexual crimes, children, and criminal organizations.


Fig 2. Categorization of the top 20 keywords extracted using individual text mining techniques by crime-related category.

https://doi.org/10.1371/journal.pone.0344470.g002

Eigenvector Centrality yielded the most balanced results, identifying five keywords related to sexual crimes (e.g., “lola”, “pedomoms”, “jblinks”), four to children (e.g., “child”, “toddlers”), and one to criminal organizations (“myloveboard”). This balance indicates that the graph-based mechanism of Eigenvector Centrality effectively captures structurally important and thematically meaningful keywords embedded in dark web text.

TF-IDF extracted two keywords associated with sexual crimes (“pornoslonik”, “erotic”), one with children (“child”), and three with criminal organizations (“maxchan”, “mixedlolitas”, “endchan”). While TF-IDF performs well in identifying terms with strong document-level frequency characteristics, some extracted keywords may reflect infrastructural elements or general web terminology rather than direct criminal indicators.

In contrast, Word2Vec’s results skewed toward emotionally expressive or colloquial language (e.g., “sweet”, “cutie”, “enjoy”), with only one keyword (“boys”) clearly categorized under child-related terms. This pattern demonstrates Word2Vec’s strength in capturing semantic similarity rather than domain-specific criminal terminology, resulting in broader yet occasionally less precise keyword groups.

The CSAM site identification accuracy used in this study represents the proportion of sites identified as CSAM-related among the valid sites collected from the dark web using a specific keyword k, and is defined as shown in Equation (1).

$$\mathrm{Accuracy}(k) = \frac{N_{\mathrm{CSAM}}(k)}{N_{\mathrm{valid}}(k)} \times 100\ (\%) \tag{1}$$

In Equation (1), $N_{\mathrm{valid}}(k)$ denotes the number of valid sites collected using keyword k that were accessible and whose actual content could be confirmed, while $N_{\mathrm{CSAM}}(k)$ represents the number of those valid sites classified as CSAM-related. Sites that were offline, returned error codes, or were otherwise inaccessible were excluded from the analysis, and accuracy was computed solely based on verifiable, valid sites.
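The accuracy measure in Equation (1) can be worked through in a few lines. This is a sketch under stated assumptions: the label encoding (True, False, None) and the sample counts are hypothetical and are not taken from the study's data.

```python
def identification_accuracy(labels):
    """Percentage of valid sites labeled as related.

    labels: one entry per retrieved site for a keyword.
      True  -> valid site, manually classified as related
      False -> valid site, classified as not related
      None  -> offline or otherwise inaccessible (excluded from the denominator)
    Returns None when no valid site remains.
    """
    valid = [label for label in labels if label is not None]
    if not valid:
        return None
    related = sum(1 for label in valid if label)
    return related / len(valid) * 100


# Hypothetical keyword: 6 results, 2 inaccessible, 3 of the 4 valid sites related.
accuracy = identification_accuracy([True, True, False, None, True, None])
```

Excluding inaccessible sites from the denominator means the measure describes only what could be verified; a keyword with many dead links can still show high accuracy on a small valid set.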

To obtain these measurements, major keywords were submitted to the dark web search engine Torch, and the sites exposed on the result pages were collected. Each collected site was then manually inspected in full to determine whether it was related to CSAM. CSAM-related sites were further categorized into two types: (1) Child Sexual Abuse Material Distribution, referring to sites that directly facilitate the upload or download of child sexual abuse material, and (2) Sexual Abuse Community, referring to sites that primarily share news, information, or discussions related to sexual crimes involving minors. Site classification was conducted manually by the researchers according to predefined criteria. Each site was evaluated based on whether it directly distributed CSAM content or exhibited characteristics of a CSAM-related community. Due to the sensitive nature of the data and ethical considerations, independent multi-annotator labeling and the assessment of inter-rater reliability were not performed. Instead, classification consistency was ensured by applying clearly defined annotation guidelines and conducting an exhaustive review of all collected sites under the same criteria. The validation results obtained using this procedure are presented in Table 4.


Table 4. Validation accuracy based on keywords extracted from individual models.

https://doi.org/10.1371/journal.pone.0344470.t004

Keywords derived from Eigenvector Centrality demonstrated consistently high retrieval accuracy for CSAM-related sites, with four out of five keywords achieving over 79% and some exceeding 95% (“pedomoms” at 97.54%, “toddlers” at 95.63%). These results affirm the method’s ability to uncover high-impact, structurally central terms. In contrast, TF-IDF’s top five keywords showed wide variability in accuracy. Three keywords (“maxchan”, “mixedlolitas”, and “stronghold”) scored below 6%, whereas only “countries” achieved a relatively high accuracy of 72.25%. This variance indicates TF-IDF’s sensitivity to context and its tendency to extract general or infrastructural terms when applied to diverse web content. Word2Vec displayed intermediate performance, with several keywords exceeding 80% accuracy (“boys” at 89.08%, “sweet” at 90.21%, “pussy” at 91.61%), yet others showed semantic drift, reducing their forensic utility.

These findings indicate that Eigenvector Centrality is the most effective individual method for extracting structurally important CSAM-related keywords within the dark web ecosystem. Its awareness of network topology enables the identification of deeply embedded terms that are frequently central to illicit content clusters. However, despite its strength in capturing co-occurrence-based centrality, this approach may overlook context-specific or semantically nuanced clues that are equally critical in digital investigations. Therefore, the following section refines the approach by introducing hybrid models that integrate complementary methods to overcome these limitations and enhance the practical applicability of keyword-based crime detection.

Refinement of investigative keywords using combined text mining models

Building on the findings of the earlier analysis, which highlighted the relative strengths and weaknesses of TF-IDF, Eigenvector Centrality, and Word2Vec, this section aims to enhance keyword extraction performance by combining their complementary advantages. While Eigenvector Centrality proved effective in identifying structurally central terms within co-occurrence networks, its limitations in capturing context-specific and semantically rich expressions necessitate a more holistic approach.

To this end, we introduce four hybrid models that integrate Eigenvector Centrality with either TF-IDF or Word2Vec. These combined models are designed to balance structural importance with semantic relevance, thereby improving the contextual accuracy and investigative utility of extracted CSAM-related keywords. Specifically, Combined Models 1 and 2 integrate Eigenvector Centrality with TF-IDF, while Combined Models 3 and 4 incorporate Word2Vec-based semantic similarity to expand the range of crime-relevant terms.

Combined Models 1 and 2: Eigenvector centrality–TF-IDF based models.

In Combined Model 1, the top 20 keywords extracted via Eigenvector Centrality were selected based on their structural centrality within the co-occurrence network. These were paired with the top 20 TF-IDF keywords, excluding 2 overlapping terms, resulting in 18 unique TF-IDF keywords. A total of 360 keyword pairs (20 × 18) were constructed to combine structurally prominent and statistically rare terms, enhancing the discovery of crime-related indicators. In Combined Model 2, the same 20 Eigenvector Centrality keywords were combined with a broader set of 6,974 TF-IDF keywords. After removing 19 overlapping terms, 6,955 non-redundant TF-IDF keywords were retained, producing 139,100 unique keyword pairs (20 × 6,955). To ensure consistent scoring across differing value scales, Min-Max normalization was applied, transforming all keyword scores into a [0, 1] range. Subsequently, the final score for each keyword pair was computed by summing the normalized Eigenvector Centrality value of keyword i and the normalized TF-IDF value of keyword t, as shown in Equation (2). This scoring mechanism integrates both structural centrality and statistical rarity to highlight investigative relevance. The top 20 keyword pairs generated from Combined Models 1 and 2 are presented in Table 5.
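The pairing and scoring steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the keyword names and score values are placeholders, the helper names `min_max` and `pair_scores` are hypothetical, and normalizing each score list before dropping overlapping terms is an assumption, since the order of these two steps is not specified in the text.

```python
from itertools import product

def min_max(scores):
    """Rescale a {keyword: score} dict into the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {k: (v - lo) / span if span else 0.0 for k, v in scores.items()}

def pair_scores(centrality, tfidf):
    """Score every (centrality keyword, TF-IDF keyword) pair by summing
    the two normalized values; overlapping TF-IDF terms are skipped."""
    ec = min_max(centrality)
    tf = {k: v for k, v in min_max(tfidf).items() if k not in centrality}
    pairs = {(i, t): ec[i] + tf[t] for i, t in product(ec, tf)}
    return sorted(pairs.items(), key=lambda kv: kv[1], reverse=True)

# Placeholder inputs (not values from the study).
centrality = {"term_a": 0.8, "term_b": 0.5, "term_c": 0.2}
tfidf = {"term_x": 3.0, "term_y": 1.0, "term_a": 2.0}  # term_a overlaps

ranked = pair_scores(centrality, tfidf)
top_pair, top_score = ranked[0]
```

With 20 centrality keywords and 18 non-overlapping TF-IDF keywords, the same function would yield the 360 pairs of Combined Model 1. Because each normalized list contains exactly one keyword with value 1.0, a purely additive score always ranks the pair formed by the two top-scoring keywords first.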

Table 5. Top 20 keyword pairs extracted from combined models using eigenvector centrality and TF-IDF.

https://doi.org/10.1371/journal.pone.0344470.t005

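$$S_{it} = \tilde{E}_i + \tilde{T}_t \tag{2}$$

where $\tilde{E}_i$ is the Min-Max normalized Eigenvector Centrality value of keyword i and $\tilde{T}_t$ is the Min-Max normalized TF-IDF value of keyword t.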

The keyword pair “child+maxchan” (1st), which appeared in both Combined Models 1 and 2, indicates a structurally strong association between the two keywords. “child” is a general target keyword related to CSAM within the dataset, and its combination with “maxchan” highlights a network centered on this theme. These findings indicate that “maxchan” functions as a central hub, connecting multiple keywords and CSAM-related sites rather than acting as an isolated platform. “lola+maxchan” (2nd) and “zoo+maxchan” (3rd) represent specific content and communities, respectively, and their combination with “maxchan” confirms that these keyword pairs frequently co-occur within the network. These keyword pairs go beyond mere co-occurrence, implying that theme-specific crime-related terms are interconnected through “maxchan” as a central hub. This structural pattern illustrates how networks form and materials are exchanged among child sexual abuse–related sites. The keyword pairs “myloveboard+maxchan” (4th) and “jblinks+maxchan” (4th) illustrate the flow of materials or navigation paths between platforms through categories or hyperlink structures. In particular, keywords like “jblinks” are not merely nodes within the network but also function as pathways facilitating access to other sites. The structural analysis of Combined Models 1 and 2 confirms that “maxchan” exerts substantial influence within the CSAM site network. The associated keyword pairs clarify the primary information flow and interconnections among nodes. This provides critical insights into the connection structure and material exchange mechanisms within the dark web network and highlights the role and importance of specific hub nodes.

Combined Models 1 and 2 provided useful information for collecting crime leads, but several limitations were identified. First, the simple combination of eigenvector centrality and TF-IDF does not sufficiently ensure meaningful associations between keywords. For instance, “zoo+stronghold” was extracted as a key keyword pair by both methods, but the two keywords did not appear simultaneously on the same dark web site. This indicates that such keyword pairs may fail to represent a concrete relationship with crime leads. Second, since TF-IDF and eigenvector centrality rely on keyword frequency and importance, there is a possibility that frequently appearing or highly connected keywords may be evaluated as significant even if they are unrelated to actual crimes.

Overall, the analysis of the results from Combined Models 1 and 2 shows that the interpretation of keyword pairs remains limited to identifying frequently appearing keywords or tracking general crime trends across multiple CSAM-related sites. These limitations are attributed to the approach of simply combining top-ranked keywords from both methods to generate keyword pairs.

To address these issues and secure more reliable crime leads, we explore additional experiments by incorporating semantic similarity from Word2Vec into the eigenvector centrality method, as detailed in the following analysis.

Combined Models 3 and 4: Eigenvector centrality–Word2Vec based models.

This section presents the design of Combined Models 3 and 4, based on eigenvector centrality and Word2Vec. First, Combined Model 3 focuses on the top 20 critical criminal keywords identified through eigenvector centrality and utilizes Word2Vec to extract additional keywords. By combining the 20 keywords derived from eigenvector centrality with 120 unique keywords extracted from Word2Vec that do not overlap with eigenvector centrality, a total of 2,400 keyword pairs were generated. To maximize the reflection of the characteristics of the keywords extracted through the combined model, Equation (3) proposed in this study was applied to calculate the scores of the keyword pairs. In this context, the “characteristics of keywords” refer to the use of eigenvector centrality keywords as-is, while incorporating the values of related keywords with semantic similarity extracted based on the central eigenvector centrality keyword i.

Equation (3) was designed to calculate the pair score by incorporating the characteristics of both eigenvector centrality and Word2Vec, enabling an evaluation of how effective each keyword pair is as a critical criminal lead.

$$S_{ij} = E_i + W_{ij}, \quad i \in X,\ j \notin X \tag{3}$$

Equation (3) applies when only keyword i belongs to the eigenvector centrality set X. The score $S_{ij}$ is calculated by accumulating the eigenvector centrality coefficient of i, $E_i$, and the probability of j appearing centered around i, $W_{ij}$. The higher the $S_{ij}$, the more effective the keyword pair for CSAM-related evidence collection. In this study, the priority of the keyword pairs is determined based on $S_{ij}$.
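The scoring rule of Equation (3) can be sketched in a few lines. The sketch assumes the Word2Vec output has already been collected into a nested dictionary of the form `{center keyword: {related keyword: similarity}}`; that structure, the function name `model3_scores`, and all keywords and values below are placeholders for illustration rather than the study's data.

```python
def model3_scores(centrality, similar):
    """Score pairs (i, j) where i is in the centrality set X and j is not.

    centrality: {keyword: eigenvector centrality value}, i.e. the set X.
    similar:    {center keyword i: {related keyword j: similarity of j to i}}.
    """
    scores = {}
    for i, related in similar.items():
        if i not in centrality:
            continue  # only keywords from X act as center keywords
        for j, sim in related.items():
            if j in centrality:
                continue  # Equation (3) covers only j outside X
            scores[(i, j)] = centrality[i] + sim
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Placeholder inputs chosen to be exactly representable as floats.
centrality = {"term_a": 0.75, "term_b": 0.5}
similar = {
    "term_a": {"term_x": 0.5, "term_b": 0.875},  # term_b is in X, so skipped
    "term_b": {"term_y": 0.625},
}
ranked = model3_scores(centrality, similar)
```

Since the centrality term is shared by every pair with the same center keyword, pairs built on the most central keyword tend to dominate the top of the ranking under this additive rule.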

For example, if “lola” belongs to the eigenvector centrality set X and “sex” does not, the score for the keyword pair is calculated by considering both the eigenvector centrality value of “lola” and the semantic similarity value of the related keyword “sex”, which is extracted using Word2Vec with “lola” as the central keyword. Unlike Combined Model 3, Combined Model 4 uses keywords extracted from eigenvector centrality as center keywords to further extract related keywords, and then reuses these newly extracted keywords as center keywords to extract additional keywords. This process results in the extraction of 1 to 94 related keywords for each eigenvector centrality keyword, generating a total of 1,886 keyword pairs. In Combined Model 4, the pair score is calculated using Equation (4), which, like Equation (3), is designed to fully reflect the characteristics of both eigenvector centrality and Word2Vec.

$$S_{ij} = \begin{cases} E_i + E_j + \max(W_{ij}, W_{ji}), & i \in X,\ j \in X \\ E_i + \max(W_{ij}, W_{ji}), & i \in X,\ j \notin X \end{cases} \tag{4}$$

Equation (4) considers two cases: when both keywords i and j belong to the eigenvector centrality set X, and when only i belongs to X. Here, $E_i$ and $E_j$ denote the eigenvector centrality values of i and j, and $W_{ij}$ and $W_{ji}$ denote the two Word2Vec-based semantic similarity values, obtained with i and with j as the center keyword, respectively. In both cases, the higher of the two similarity values, $\max(W_{ij}, W_{ji})$, is added to the centrality terms. In the first case, where both i and j belong to set X, the score is determined by summing the eigenvector centrality values of both keywords ($E_i$ and $E_j$) and adding the higher semantic similarity score between the two, $\max(W_{ij}, W_{ji})$. For example, if both “lola” and “child” are in set X, the final score includes their centrality values and the strongest semantic similarity between them as calculated by Word2Vec. In the second case, where only i belongs to X and j does not, the score is calculated by accumulating $E_i$ and $\max(W_{ij}, W_{ji})$.

Even though j is not structurally central, the model allows for its semantic relevance to be considered in combination with a central keyword. For example, if “city” is not in set X but “zoo” is, and “zoo” is set as i, then the score is calculated using $E_i$ and the higher of the two semantic similarity values between “zoo” and “city”. This unified scoring formula enables the model to reflect both structural importance and semantic closeness in the keyword pairing process.
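The two cases of Equation (4) can be expressed as a single function. This is a sketch under assumptions: similarities are stored in a dictionary keyed by `(center, related)` tuples, a missing direction is treated as 0.0, and all names and values are hypothetical placeholders.

```python
def model4_score(i, j, centrality, sim):
    """Pair score covering both cases of Equation (4).

    centrality: {keyword: eigenvector centrality value}, i.e. the set X.
    sim:        {(center, related): similarity}; a missing direction
                counts as 0.0 (an assumption of this sketch).
    """
    if i not in centrality:
        raise ValueError("keyword i must belong to the centrality set X")
    # Take the stronger of the two directional similarity values.
    best_sim = max(sim.get((i, j), 0.0), sim.get((j, i), 0.0))
    score = centrality[i] + best_sim
    if j in centrality:
        score += centrality[j]  # first case: both keywords are in X
    return score

# Placeholder inputs chosen to be exactly representable as floats.
centrality = {"term_a": 0.75, "term_b": 0.5}
sim = {
    ("term_a", "term_b"): 0.25,
    ("term_b", "term_a"): 0.5,
    ("term_a", "term_x"): 0.625,
}
both_in_x = model4_score("term_a", "term_b", centrality, sim)  # 0.75 + 0.5 + 0.5
only_i_in_x = model4_score("term_a", "term_x", centrality, sim)  # 0.75 + 0.625
```

If the similarity measure is symmetric, as cosine similarity between word vectors is, the two directional values coincide and the maximum reduces to that single value.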

The scores derived in this manner are used to extract the top 20 keyword pairs in Combined Model 4. By comparison, Combined Model 3 generated 2,400 keyword pairs by combining the 20 eigenvector centrality keywords with 120 non-overlapping, semantically related terms from Word2Vec, and selected the top 20 pairs using Equation (3). In contrast, Combined Model 4 expands the semantic search space by recursively retrieving additional keywords centered on the eigenvector keywords and evaluating them using Equation (4). The final top 20 keyword pairs identified through both models are summarized in Table 6.
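The recursive retrieval step of Combined Model 4 can be illustrated with a two-hop expansion over precomputed neighbor lists. The sketch assumes that the related keywords for each center keyword (for example, Word2Vec nearest neighbors) are already available in a dictionary, that expansion stops after the second hop, and that every retrieved term is paired with the original eigenvector keyword; these choices and all names are illustrative assumptions.

```python
def expand_two_hops(seeds, neighbors):
    """Build (seed, related) pairs from first- and second-hop neighbors.

    seeds:     keywords from the eigenvector centrality set.
    neighbors: {keyword: [related keywords]}, assumed precomputed.
    """
    pairs = set()
    for seed in seeds:
        first_hop = neighbors.get(seed, [])
        related = set(first_hop)
        for center in first_hop:
            # Reuse each newly found keyword as a center keyword.
            related.update(neighbors.get(center, []))
        related.discard(seed)  # a keyword is never paired with itself
        pairs.update((seed, term) for term in related)
    return pairs

# Placeholder neighbor lists (not data from the study).
neighbors = {
    "term_a": ["term_x", "term_y"],
    "term_x": ["term_z", "term_a"],
    "term_b": ["term_y"],
}
pairs = expand_two_hops(["term_a", "term_b"], neighbors)
```

The number of related keywords per seed varies with how many neighbors each hop returns, which is consistent with the uneven 1 to 94 related keywords per eigenvector centrality keyword reported for Combined Model 4.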

Table 6. Top 20 keyword pairs extracted from combined models using eigenvector centrality and Word2Vec.

https://doi.org/10.1371/journal.pone.0344470.t006

The analysis of keyword pairs in Combined Model 3 revealed that “child” emerged as the central keyword, highlighting its significance in eigenvector centrality analysis. Its pairing with CSAM-related keywords extracted through Word2Vec further suggests that “child” could serve as a critical clue in crimes involving children. Notably, keywords paired with “child” included “loves,” “models,” “alice,” “library,” and “hard.” While these individual keywords may seem superficially unrelated, the contextual analysis capabilities of Word2Vec indicate their potential to provide crucial insights into child sexual crimes. Among the top-ranking keyword pairs, “child+loves” (1st) serves as a significant clue indicating a connection to CSAM on the dark web. Meanwhile, “child+models” (2nd), although it might ostensibly refer to child models, may implicitly suggest sexual activities involving children in specific contexts. The keyword “alice” (3rd) was verified as the name of a child model recurrently linked to CSAM. Additionally, keywords such as “teen”, “age”, “baby” and “material” derived through Word2Vec may not appear directly as criminal terms. However, when paired with “child,” they can be interpreted as crime-related clues referring to acts or materials linked to child sexual crimes.

In Combined Model 4, the analysis of keyword pairs identified “child+zoo” (1st), suggesting a connection on the dark web to child-animal sexual exploitation. Other high-ranking keyword pairs, such as “child+kitty” (2nd), “lola+pedomoms” (3rd), and “child+jblinks” (4th), were confirmed to be directly associated with CSAM, frequently appearing together on dark web sites related to such crimes. This indicates that these top-ranking keywords effectively reflect the significance of child sexual crime-related terms. However, lower-ranking keyword pairs, while appearing less relevant on the surface, may still carry latent connections to crimes due to their contextual pairing. This suggests the necessity of not excluding lower-ranking keyword pairs from analysis, as these may reveal seemingly less significant terms to be pivotal crime-related clues. Therefore, a comprehensive analysis using diverse techniques is essential to perform in-depth crime clue investigations and uncover additional leads. This section demonstrates that the integration of eigenvector centrality and the Word2Vec method enables the extraction of meaningful crime-related information beyond simple keyword combinations. The experiments with these Combined Models hold significant value in precisely identifying associations with crimes and expanding connections to potential criminal activities by securing additional keywords.

Comparative analysis of combined model performance.

We objectively evaluated the effectiveness of critical crime keyword collection by categorizing the keywords extracted through combined models 1, 2, 3, and 4 into categories such as sex crimes, children, and criminal organizations. Fig 3 shows the classification results of the top keywords extracted using the combined text mining methods.

Fig 3. Categorization of the top 20 keywords extracted using combined text mining models by crime-related category.

https://doi.org/10.1371/journal.pone.0344470.g003

Fig 3 illustrates that the top keywords from Combined Models 1 and 2 are grouped into five related to sexual crimes, five to children, and three to criminal organizations. The sexual crime category includes keywords such as “lola,” “pedomoms,” and “zoo.” The child category includes “child” and “toddlers,” while the criminal organization category includes “myloveboard” and “mixedlolitas.” In contrast, Combined Model 3 yielded no keywords related to sexual crimes or criminal organizations and only three related to children, namely “child,” “teen,” and “baby.” Compared to other models, this indicates that Combined Model 3 was less effective in collecting crime-related terms across categories. Finally, Combined Model 4 produced four sexual crime–related keywords, four child-related keywords, and one related to criminal organizations. All of these were validated as highly relevant to their respective categories.

In Combined Model 4, the sexual crime category includes keywords such as “lola,” “pedomoms,” “zoo,” and “jblinks,” while the child category includes “child” and “toddlers,” and the criminal organization category includes “myloveboard.” Comparing the methods, Combined Models 1 and 2, which utilized eigenvector centrality and TF-IDF, extracted a total of 19 keywords across their keyword pairs. Of these, 12 were confirmed to be valid across the categories of sexual crimes, children, and criminal organizations. In contrast, Combined Model 3 generated 21 keywords, yet only three were relevant, all of them in the child category.

This outcome suggests that, although Model 3 aimed to diversify the keyword pool by combining eigenvector centrality with non-overlapping keywords from Word2Vec, it was less effective in retrieving keywords directly related to the crime categories. Two primary factors may explain this result: (1) semantic discrepancies between the keyword sets derived from eigenvector centrality and Word2Vec, and (2) the exclusion of meaningful terms due to the strict emphasis on non-overlapping extraction.

On the other hand, Combined Model 4 produced only nine keywords—fewer than any other model—but all were confirmed to be valid within the key crime categories. Despite the limited number, this result highlights the model’s strength in generating high-relevance keywords by effectively capturing semantically associated terms through recursive keyword expansion. These findings support the potential of combining eigenvector centrality with TF-IDF and Word2Vec for generating effective keyword pairs that serve as crime leads in dark web investigations. Compared to using a single text-mining technique, this integrated approach enables broader keyword coverage and improved contextual relevance. In conclusion, eigenvector centrality consistently demonstrated superior performance in identifying core keywords tied to CSAM. Its ability to detect high-importance terms within the network structure makes it a particularly valuable technique for extracting direct and actionable CSAM-related investigative leads. Thus, a strategy centered on eigenvector centrality emerges as the most reliable and efficient method for analyzing dark web crime content.
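As a quick worked check of the comparison above, the share of valid keywords per model can be computed directly from the counts reported in the text (12 of 19, 3 of 21, and 9 of 9). The variable names are illustrative; only the counts come from the study.

```python
# (valid keywords, extracted keywords) as reported in the text above.
counts = {
    "Combined Models 1 and 2": (12, 19),
    "Combined Model 3": (3, 21),
    "Combined Model 4": (9, 9),
}

valid_share = {model: valid / total for model, (valid, total) in counts.items()}

for model, share in valid_share.items():
    print(f"{model}: {share:.1%}")
```

This gives roughly 63.2% for Combined Models 1 and 2, 14.3% for Combined Model 3, and 100.0% for Combined Model 4, which makes explicit the trade-off between the number of extracted keywords and their relevance.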

Quantitative evaluation of model performance.

Each model’s effectiveness was quantitatively validated using search accuracy in the Torch dark web search engine. The top five keyword pairs from each model were used as queries. Table 7 presents the accuracy of these searches.

Table 7. Validation accuracy based on keyword pairs extracted from combined models.

https://doi.org/10.1371/journal.pone.0344470.t007

Since Combined Models 1 and 2 generated identical top 5 keyword pairs, they are presented together in a single row in Table 7 for clarity and conciseness. The analysis results showed that the keyword pairs extracted from Combined Models 1 and 2 — “child+maxchan” (1st, 0.00%), “lola+maxchan” (2nd, 0.00%), “zoo+maxchan” (3rd, 0.00%), “pedomoms+maxchan” (4th, 0.00%), and “toddlers+maxchan” (5th, 0.00%) — recorded no search results on the dark web search engine Torch. This demonstrates that Combined Models 1 and 2 are ineffective for identifying CSAM sites or collecting crime leads. These keyword pairs failed to reflect the actual search patterns or contexts used on the dark web, exposing limitations in contextual relevance and practical usability. Consequently, the practical applicability of these models is limited, and their effectiveness for searching and collecting data in the dark web environment is very low.

On the other hand, the top keyword pairs extracted from Combined Model 3, namely “child+loves” (1st, 88.61%), “child+models” (2nd, 75.35%), “child+alice” (3rd, 95.96%), “child+library” (4th, 92.83%), and “child+hard” (5th, 76.44%), achieved high accuracy. Notably, “child+loves” and “child+library” showed exceptional performance in detecting CSAM communities, serving as highly effective keywords for investigative lead collection. Combined Model 4 outperformed the other models, with its top keyword pairs, “child+zoo” (1st, 97.89%), “child+kitty” (2nd, 84.09%), “lola+pedomoms” (3rd, 100.00%), “child+jblinks” (4th, 100.00%), and “lola+myloveboard” (5th, 100.00%), showing consistently high accuracy in detecting child sexual exploitation content. These results demonstrate superior performance compared to the standalone eigenvector centrality model. Furthermore, Combined Model 4 significantly improved the efficiency of collecting CSAM sites and crime leads by accurately identifying a large number of relevant sites.
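The per-pair accuracies quoted above can be condensed into one figure per model by averaging the five values. The averages below are derived arithmetic on the reported percentages, not additional results from the study, and the variable names are illustrative.

```python
from statistics import mean

# Top-five accuracies (%) as quoted in the text, in rank order.
top5_accuracy = {
    "Combined Models 1 and 2": [0.00, 0.00, 0.00, 0.00, 0.00],
    "Combined Model 3": [88.61, 75.35, 95.96, 92.83, 76.44],
    "Combined Model 4": [97.89, 84.09, 100.00, 100.00, 100.00],
}

mean_accuracy = {model: round(mean(values), 2)
                 for model, values in top5_accuracy.items()}
```

The resulting means are about 85.84% for Combined Model 3 and 96.40% for Combined Model 4, against 0.00% for Combined Models 1 and 2.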

Robustness checks using multiple seed keywords.

In this part of the study, we conducted a multi-seed stability evaluation using pthc, upskirt, and pedomoms as seed keywords to minimize biases caused by relying on a single seed keyword and to assess the robustness of the model. These three keywords are among the most frequently used search terms for detecting CSAM on the dark web and commonly appear in CSAM-related posts and community discussions.

First, “pthc”, an abbreviation for “pre-teen hard core,” refers to illegal sexual exploitation material involving children under the age of 13 and is widely used as a high-risk tag on websites that share or trade such illicit content. “upskirt” is a term that indicates voyeuristic images or videos secretly taken under the skirt of a woman or minor. Lastly, “pedomoms” is a slang term derived from pedophile, referring to individuals who perceive their own children as sexual targets. Using these seed keywords, we collected text data with the dark web crawler developed in this study and evaluated the crime-site retrieval accuracy under the multi-seed setting. The results of this multi-seed evaluation for pthc, upskirt, and pedomoms are presented in Table 8.

Table 8. Multi-seed-based stability verification results for TF-IDF, eigenvector centrality, and Word2Vec models.

https://doi.org/10.1371/journal.pone.0344470.t008

Table 8 presents a comparison of the top 10 extracted keywords for each seed keyword—“pthc”, “upskirt”, and “pedomoms”—based on their accuracy in identifying CSAM-related dark web sites. The analysis evaluates the performance of “TF–IDF”, “Eigenvector Centrality”, and “Word2Vec”, focusing specifically on the core crime-related keywords extracted by each method. For the seed keyword “pthc”, the “TF–IDF” method yielded a substantial number of keywords with 0% or very low accuracy, with only “child” (79.7%) and “language” (77.66%) showing relatively high accuracy. This indicates a limitation in “TF–IDF”’s ability to extract high-quality core CSAM-related keywords. In contrast, the “Eigenvector Centrality” method produced several high-accuracy keywords such as “boys” (90.74%), “pedo” (82.45%), and “child” (79.7%). Among these, “boys” and “child” directly reference offenses against minors, whereas “pedo” and “incest” denote sexual abuse and taboo relations—terms frequently used within actual CSAM communities. “Word2Vec”, meanwhile, extracted keywords such as “fotos” (97.93%), “nude” (97.41%), and “young” (82.12%), which are strongly associated with visual or age-related cues. However, unrelated terms such as “sticky” (18.8%) and “shorturl” (0%) were also included, revealing limitations in selectively isolating core crime indicators during contextual expansion.

For the seed keyword “upskirt”, the “TF–IDF” method produced highly accurate technical terms related to illicit recording and gallery management: “zenphoto” (100%), “gallery” (98.23%), and “aperture” (98.26%). “Eigenvector Centrality” similarly extracted “zenphoto” (100%), “upskirt” (97.92%), and “gallery” (98.23%), demonstrating strong overlap with “TF–IDF” and presenting consistently high accuracy, thereby confirming their suitability as core CSAM-related keywords. In contrast, “Word2Vec” returned general or camera-related terms such as “nikon” (74.74%), “country” (73.02%), and “cock” (70.07%), while many remaining keywords showed lower accuracy, indicating limited effectiveness in pinpointing core crime indicators for this seed.

For the seed keyword “pedomoms”, the “TF–IDF” method produced a few high-accuracy terms such as “boys” (90.74%), “society” (90.1%), and “child” (79.7%), whereas most other keywords showed 0% or low accuracy, demonstrating restricted capability in identifying core CSAM-related terms. Conversely, “Eigenvector Centrality” extracted “pedomoms” (97.54%), “myloveboard” (100%)—a known board or community name used for sharing illegal child content—and “toddler” (92.63%), a term denoting very young children. These keywords are widely observed in real CSAM contexts, supporting the method’s ability to reliably capture crime-specific terminology. “Word2Vec” extracted partially relevant terms such as “boys” (90.74%), “school” (62.3%), and “abuse” (59.1%), but its overall accuracy remained lower than that of “Eigenvector Centrality”. Moreover, several extracted terms (e.g., “school”) are commonly used in non-criminal, everyday contexts, indicating a tendency to include keywords whose meaning may become ambiguous when interpreted within a crime-specific environment.

Overall, across all three seed keywords (“pthc”, “upskirt”, and “pedomoms”), “Eigenvector Centrality” consistently achieved the highest average accuracy and exhibited stable structural coherence. While “TF–IDF” and “Word2Vec” captured certain core CSAM-related terms through document-based weighting or semantic expansion, they frequently introduced unrelated keywords or showed reduced accuracy in identifying crime-specific sites. In contrast, “Eigenvector Centrality” reliably extracted core crime-related keywords directly associated with CSAM across all seed configurations. These findings empirically demonstrate that even without relying solely on a single seed keyword such as “lolita”, the eigenvector-based approach can robustly and consistently identify key CSAM-related terminology in diverse seed keyword environments.

Discussion

The results of the multi-seed robustness analysis confirm that the proposed framework does not depend on a single seed keyword, such as “lolita,” and can consistently identify core child sexual abuse material (CSAM)–related terminology across heterogeneous seed environments using different seed keywords (pthc, upskirt, and pedomoms). In particular, the eigenvector centrality–based approach exhibited stable performance across all evaluated seed keywords, indicating its effectiveness in capturing structurally meaningful crime-related terms that are deeply embedded within CSAM-related communities. This robustness is especially important in real-world investigative settings, where initial seed keywords may vary depending on case types, platforms, or newly emerging slang. Unlike conventional approaches that rely solely on document-level frequency or simple semantic similarity, the proposed framework effectively integrates complementary analytical techniques, thereby providing a robust and scalable keyword expansion mechanism that remains applicable in the rapidly evolving dark web environment.

Despite these strengths, the proposed framework has a limitation. Because both the corpus construction and the subsequent validation processes in this study were conducted based on content accessible through the Torch search engine, the analytical results may not fully reflect the diversity of CSAM-related content across the entire dark web, but may instead primarily capture characteristics of data observed within the Torch-accessible ecosystem. Such dependency may influence the interpretation of retrieval performance under certain environments or conditions. Nevertheless, Torch is currently one of the most accessible dark web search engines and provides a practical and appropriate data foundation aligned with the scope and objectives of this study.

In addition, the search-hit–based accuracy metric used in this study was calculated using only non-duplicated search results for each seed keyword, thereby limiting potential distortions caused by duplicate pages. However, depending on the characteristics of the initial seed keywords selected for collecting search results, factors such as term ambiguity, search engine ranking bias, and top-k cutoff settings may partially affect the measured accuracy, leading to a potential overestimation of performance. To account for this possibility, we conducted an additional multi-seed robustness analysis rather than relying on a single seed keyword. The results demonstrated consistent performance trends across different seed environments, thereby jointly validating the stability and reliability of the observed findings.
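The search-hit–based accuracy with deduplication can be sketched as follows. The sketch assumes that accuracy is the percentage of unique result sites that were manually labeled as CSAM-related, and that a query with no results is reported as 0%; the function name, the input format, and the site identifiers are hypothetical placeholders.

```python
def search_accuracy(results):
    """Percentage of unique result sites labeled as relevant.

    results: iterable of (site identifier, is_relevant) tuples produced
             by manual review (hypothetical input format).
    """
    unique = {}
    for site, is_relevant in results:
        # Keep only the first occurrence of each site to ignore duplicates.
        unique.setdefault(site, bool(is_relevant))
    if not unique:
        return 0.0  # no search results at all
    return 100.0 * sum(unique.values()) / len(unique)

# Placeholder search results; "site-1" appears twice and is counted once.
results = [
    ("site-1", True),
    ("site-2", False),
    ("site-1", True),
    ("site-3", True),
    ("site-4", True),
]
accuracy = search_accuracy(results)
```

Counting each site once keeps a heavily mirrored or repeatedly indexed page from inflating the measured accuracy of a keyword.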

Conclusion

As crimes on the dark web become increasingly diverse and sophisticated, investigators must devote substantial time and effort to understanding criminal activity across platforms. In particular, child sexual abuse is severely punished worldwide, and viewing CSAM increases the risk of further abuse by fostering the perception of children as sexual objects. Therefore, developing technologies that can assist in the detection and investigation of such crimes is crucial.

To address this challenge, we conducted a text-mining-based investigation aimed at systematically collecting investigative clues related to child sexual abuse crimes on the dark web. To achieve this, we developed a custom crawler that collected text data from 2,414 dark web pages, resulting in a dataset of approximately 71,666 nouns. Using this dataset, we applied three text mining techniques—TF-IDF, Eigenvector Centrality, and Word2Vec—to extract the top 20 keywords with high structural or semantic significance. Among the individual models, Eigenvector Centrality proved to be the most effective for identifying CSAM-related keywords. Building on these results, we designed combined models that integrated Eigenvector Centrality with TF-IDF or Word2Vec. Combined Model 4, which incorporated Word2Vec-based semantic similarity into the Eigenvector-based keyword set, achieved superior retrieval accuracy ranging from approximately 84% to 100%, clearly outperforming the other approaches. This demonstrates the value of combining structural and semantic methods, with Eigenvector Centrality serving as a robust foundation. Importantly, the proposed methodology enables the dynamic expansion of early-stage investigative clues by automatically generating semantically associated keywords, moving beyond static keyword sets. This allows investigators to adapt more effectively to evolving dark web threats and content variations. In future work, we aim to extend the proposed crime information collection model to track dark web activities linked to social media platforms such as Telegram and Discord, ultimately evolving it into an integrated investigative support framework capable of operating across diverse platforms. 
Within this expanded system, the keyword-ranking–based clue identification framework developed in this study can function as a core module, automatically filtering and prioritizing large-scale dark web data to substantially reduce the volume of information investigators must manually review. This will further enable the system to directly support investigative workflows by identifying and monitoring high-risk sites that should be examined with priority.

References

1. Liggett R, Lee JR, Roddy AL, Wallin MA. The dark web as a platform for crime: an exploration of illicit drug, firearm, CSAM, and cybercrime markets. The Palgrave Handbook of International Cybercrime and Cyberdeviance. Cham: Springer International Publishing; 2019. pp. 1–27.
- View Article
- Google Scholar
2. Beshiri AS, Susuri A. Dark web and its impact in online anonymity and privacy: a critical analysis and review. J Comput Commun. 2019;07(03):30–43.
- View Article
- Google Scholar
3. Ngo FT, Marcum C, Belshaw S. The dark web: what is it, how to access it, and why we need to study it. J Contemp Crim Justice. 2023;39(2):160–6.
- View Article
- Google Scholar
4. Wang Y, Arief B, Franqueira VNL, Coates AG, Ó Ciardha C. Investigating the availability of child sexual abuse materials in dark web markets: Evidence gathered and lessons learned. In: Proceedings of the 2023 European Interdisciplinary Cybersecurity Conference. 2023. pp 59–64.
- View Article
- Google Scholar
5. Oosthoek K, Van Staalduinen M, Smaragdakis G. Quantifying dark web shops’ illicit revenue. IEEE Access. 2023;11:4794–808.
- View Article
- Google Scholar
6. Edwards G, Christensen LS, Rayment-McHugh S, Jones C. Cyber strategies used to combat child sexual abuse material. Trends Issues Crime Criminal Justice. 2021;636:1–16.
- View Article
- Google Scholar
7. Rajawat AS, Bedi P, Goyal SB, Kautish S, Xihua Z, Aljuaid H, et al. Dark Web data classification using neural network. Comput Intell Neurosci. 2022;2022:8393318. pmid:35387252
- View Article
- PubMed/NCBI
- Google Scholar
8. Fayzi A, Fayzi M, Ahmadi K. Dark web activity classification using deep learning. arXiv. 2023. https://arxiv.org/abs/2306.07980
- View Article
- Google Scholar
9. Adewopo V, Gonen B, Elsayed N, Ozer M, Elsayed ZS. Deep learning algorithm for threat detection in hackers forum (deep web). arXiv. 2022. https://arxiv.org/abs/2202.01448
- View Article
- Google Scholar
10. Ebrahimi M, Nunamaker JF Jr, Chen H. Semi-supervised cyber threat identification in dark net markets: a transductive and deep learning approach. J Manage Inform Syst. 2020;37(3):694–722.
- View Article
- Google Scholar
11. Takaaki S, Atsuo I. Dark Web Content Analysis and Visualization. In: Proceedings of the ACM International Workshop on Security and Privacy Analytics. 2019. pp. 53–9.
- View Article
- Google Scholar
12. Aoki T, Goto A. Graph visualization of dark web hyperlinks and their feature analysis. Int J Netw Comput. 2021;11(2):354–82.
- View Article
- Google Scholar
13. Hiramoto N, Tsuchiya Y. Measuring dark web marketplaces via bitcoin transactions: from birth to independence. Forensic Sci Int: Digit Investig. 2020;35:301086.
- View Article
- Google Scholar
14. Hassani H, Beneki C, Unger S, Mazinani MT, Yeganegi MR. Text mining in big data analytics. Big Data Cogn Comput. 2020;4(1):1.
- View Article
- Google Scholar
15. Degadwala S, Vyas D, Hossain MR, Dider AR, Ali MN, et al. Location-based modelling and analysis of threats by using text mining. In: 2021 2nd International Conference on Electronics and Sustainable Communication Systems (ICESC). 2021. pp. 1940–4.
- View Article
- Google Scholar
16. Iqbal F, Fung BCM, Debbabi M, Batool R, Marrington A. Wordnet-based criminal networks mining for cybercrime investigation. IEEE Access. 2019;7:22740–55.
- View Article
- Google Scholar
17. Qi Z. The text classification of theft crime based on TF-IDF and XGBoost model. In: 2020 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA). 2020. pp 1241–6.
- View Article
- Google Scholar
18. Ch R, Gadekallu TR, Abidi MH, Al-Ahmari A. Computational system to classify cyber crime offenses using machine learning. Sustainability. 2020;12(10):4087.
- View Article
- Google Scholar
19. Samtani S, Zhu H, Chen H. Proactively identifying emerging hacker threats from the dark web: a diachronic graph embedding framework (D-GEF). ACM Transactions on Privacy and Security. 2020;23:21.
- View Article
- Google Scholar
20. Ghosh S, Das A, Porras P, Yegneswaran V, Gehani A. Automated Categorization of Onion Sites for Analyzing the Darkweb Ecosystem. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2017. pp. 1793–802.
- View Article
- Google Scholar
21. Xu Y, Chen G, Wu J, Xu W, Liu Q. Research on dark web monitoring crawler based on TOR. In: 2021 IEEE 2nd International Conference on Information Technology, Big Data and Artificial Intelligence (ICIBA). 2021. pp 197–202.
- View Article
- Google Scholar
22. Nazah S, Huda S, Abawajy JH, Hassan MM. An unsupervised model for identifying and characterizing dark web forums. IEEE Access. 2021;9:112871–92.
- View Article
- Google Scholar
23. Al Nabki MW, Fidalgo E, Alegre E, de Paz I. Classifying illegal activities on Tor network based on web textual contents. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. 2017. pp 35–43. https://aclanthology.org/E17-1004
- View Article
- Google Scholar
24. Chatzimarkaki G, Karagiorgou S, Konidi M, Alexandrou D, Bouras T, et al. Harvesting large textual and multimedia data to detect illegal activities on dark web marketplaces. In: 2023 IEEE International Conference on Big Data (BigData). 2023. pp 4046–4055.
- View Article
- Google Scholar
25. Zulkarnine AT, Frank R, Monk B, Mitchell J, Davies G. Surfacing collaborated networks in dark web to find illicit and criminal content. In: 2016 IEEE Conference on Intelligence and Security Informatics (ISI). 2016. pp. 109–14.
- View Article
- Google Scholar
26. L’Huillier G, Alvarez H, Ríos SA, Aguilera F. Topic-based social network analysis for virtual communities of interests in the dark web. SIGKDD Explor Newsl. 2011;12(2):66–73.
- View Article
- Google Scholar
27. Borgonovo F, Fisher A, Porrino G, Lucini SR, Martignon L. Target Detection of Dark web Hidden Services Network through MoniTOR and Social Network Analysis. In: 2023 IEEE International Workshop on Technologies for Defense and Security (TechDefense). 2023. pp. 412–6.
- View Article
- Google Scholar
28. Boekhout HD, Blokland AAJ, Takes FW. Early warning signals for predicting cryptomarket vendor success using dark net forum networks. Sci Rep. 2024;14(1):16336. pmid:39009720
- View Article
- PubMed/NCBI
- Google Scholar
29. Aizawa A. An information-theoretic perspective of tf–idf measures. Inform Process Manage. 2003;39(1):45–65.
- View Article
- Google Scholar
30. Cahyani DE, Patasik I. Performance comparison of TF-IDF and Word2Vec models for emotion text classification. Bulletin EEI. 2021;10(5):2780–8.
- View Article
- Google Scholar
31. Jiang Z, Gao B, He Y, Han Y, Doyle P, Zhu Q. Text Classification Using Novel Term Weighting Scheme-Based Improved TF-IDF for Internet Media Reports. Math Problem Eng. 2021;2021:1–30.
- View Article
- Google Scholar
32. Dessi D, Helaoui R, Kumar V, Recupero DR, Riboni D. TF-IDF vs word embeddings for morbidity identification in clinical notes: An initial study. arXiv. 2021. https://arxiv.org/abs/2105.09632
- View Article
- Google Scholar
33. Ruhnau B. Eigenvector-centrality — a node-centrality? Soc Netw. 2000;22(4):357–65.
- View Article
- Google Scholar
34. Ando H, Bell M, Kurauchi F, Wong KI, Cheung K-F. Connectivity evaluation of large road network by capacity-weighted eigenvector centrality analysis. Transportmetrica A: Transp Sci. 2020;17(4):648–74.
- View Article
- Google Scholar
35. Carrizosa E, Marin A, Pelegrin M. Spotting key members in networks: clustering-embedded eigenvector centrality. IEEE Syst J. 2020;14(3):3916–25.
- View Article
- Google Scholar
36. Jatnika D, Bijaksana MA, Suryani AA. Word2Vec model analysis for semantic similarities in English words. Procedia Computer Science. 2019;157:160–7.
- View Article
- Google Scholar
37. Savytska L, Vnukova NM, Bezugla IV, Pyvovarov V, Sübay MT. Using Word2Vec technique to determine semantic and morphologic similarity in embedded words of the Ukrainian language. Kharkiv National University of Economics; 2021. http://repository.hneu.edu.ua/handle/123456789/26122
38. Shen Y, Liu J. Comparison of Text Sentiment Analysis based on Bert and Word2vec. In: 2021 IEEE 3rd International Conference on Frontiers Technology of Information and Computer (ICFTIC). 2021. pp. 144–7.
- View Article
- Google Scholar
39. Qiao Y, Zhang W, Du X, Guizani M. Malware classification based on multilayer perception and Word2Vec for IoT security. ACM Trans Internet Technol. 2021;22(1):1–22.
- View Article
- Google Scholar

