Abstract

Social networks on the Internet have become a home that attracts all types of human thinking to exchange knowledge and ideas and share businesses. On the other hand, they have also become a source for researchers to analyze this knowledge and frame it into patterns that define the types of thought circulating on these networks and characterize the communities around them. In particular, some social networks on the Dark Web attract a special kind of thinking centered around the malicious and illegal activities disseminated on Dark Web websites and marketplaces. These networks host discussions in which members exchange information, tips, and advice on conducting such business. The study of social networks on the Dark Web is still in its infancy. In this paper, we present a methodology for analyzing the content of social networks on the Dark Web using topic modeling methods. We describe the stages of the topic modeling process, from data preprocessing and feature extraction to the topic modeling algorithms themselves. We utilize and discuss four topic models: LDA, CTM, PAM, and PTM. We discuss four topic coherence measures as evaluation metrics, namely, UMass, UCI, CNPMI, and CV, and demonstrate the selection of the best number of topics for each model according to the most coherent produced topics. Furthermore, we discuss limitations, challenges, and future work. Our proposed approach highlights the ability to discover latent thematic patterns in the conversations and messages written in the common language of Dark Web social networks, constructing topics as groups of terms and their associations. This paper provides researchers with a leading methodology for analyzing thought patterns on the Dark Web.

1. Introduction

In crime analysis, studying content related to malicious and criminal activities forms a significant aspect of understanding crime and its motivations. The misuse of technological and communication development has led to the dissemination of malignant content and has encouraged illicit business to be conducted on various websites and social networks. This content is particularly abundant in the dark part of the Internet, the Dark Web, which prompts researchers to analyze and study the posts and discussions published on Dark Web platforms to enrich their knowledge about crime and the patterns of thought related to it.

The Dark Web is a part of the web that includes websites hosted on encrypted networks and accessed only via encryption software, such as the onion router (TOR). These sites cannot be reached or indexed by search engines [1, 2]. Such software provides users with anonymity and traffic encryption, encouraging illicit and criminal activities to take place widely on the Dark Web without the fear of revealing identities or geographic locations and increasing their ability to avoid detection and arrest [3]. Such malicious activities include the drug trade, stolen and counterfeit credit cards, fraud, terrorism and extremism, propaganda dissemination, hacking tools and tutorials, child pornography, the weapon trade, and many others [1, 3, 4]. It is estimated that activities of these types form about 80% of the Dark Web [5].

Moreover, the Dark Web thrives on the rapid digital development of communication and social networking, which has led to an emerging digital culture of illegal activity that draws individuals into Dark Web communities, such as the marketplaces on the Dark Web, or cryptomarkets. Such cultures are built and strengthened through social interactions, which are crucial to ensure the operability and sustainability of cryptomarkets [6]. Therefore, some social networks on the Dark Web are essential platforms for participating in cryptomarkets, forming cybercrime communities. These social networks are mostly the forums associated with cryptomarkets, which are organized according to specific topics in which common interests and goals are shared and illicit products are presented and promoted, in addition to exchanging information, experiences, advice, and negotiations [6–8].

For these reasons, the Dark Web, in general, has gained a prominent interest from researchers. In particular, cryptomarkets and their associated forums make substantial sources for law enforcement and cybersecurity agencies to investigate and detect cybercrimes [2].

It is worth noting that the Surface Web also includes communities related to criminal activities on the Dark Web that support the continuity of cryptomarkets and achieve flexibility for their systems [9]. Although the policies of Surface Web platforms prohibit communities related to illicit activities on the Dark Web, such communities still emerge and thrive, which makes them another target for researchers and law enforcement agencies to study and analyze in depth [9].

Studies emphasize the importance of analyzing the content of the Dark Web communities and the related ones on the Surface Web to understand the thoughts and concepts they comprise, understand members’ interests, trending topics, and crime perpetration methods, and anticipate new events [2, 10]. Therefore, the semantic analysis of the social network contents helps to discover the relationships within the semantics of messages and discussions on the social network.

Consequently, the intellectual content of Dark Web platforms has been studied and analyzed for various purposes, and different methods and tools have been developed and employed. Such approaches include machine learning and data mining methodologies, such as classification, clustering, and summarization, as well as statistical analysis methodologies, which depend on statistics calculated over tokens or chunks of the text. However, such techniques may miss the hidden semantic relationships in the analyzed content. Therefore, semantic-based approaches are needed to extract the shared concepts from massive textual data while considering the significant relationships within it.

This paper presents an empirical methodology to analyze Dark Web social networks semantically using generative and probabilistic topic modeling. The approach utilizes several topic modeling methods and evaluation metrics, providing comprehensive insights about the semantic correlations among words and among topics and exemplifying the intellectual concepts on which the associated community is based.

The main contributions of the paper are as follows:
(1) Providing a comprehensive background of the needed stages of the followed methodology, including a thoughtful selection of data preprocessing procedures that address the particularity of the language used in Dark Web social networks, a term weighting scheme for feature extraction, and four selected topic models that serve various purposes.
(2) Presenting a discussion about evaluating the generated topics to choose the optimal number of topics for each method.
(3) Presenting further aspects of the generated topics by employing different visualization techniques, which help to understand the modeled content and the extracted relationships.
(4) Presenting a discussion about the limitations and challenges that lead to new fields of research in the domain of Dark Web content analysis.

The remainder of the paper is structured as follows. Section 2 discusses the related work. Section 3 demonstrates the methodology basics, including topic modeling, data preprocessing, feature extraction, topic modeling algorithms, evaluation metrics, and the proposed approach. Section 4 discusses the results. Section 5 indicates limitations, challenges, and future work. Section 6 concludes the paper.

2. Related Work

Porter [2] presented a topic modeling approach to analyze a public subreddit related to the Dark Web called DarkNetMarkets. LDA was used to discover monthly reciprocated information regarding the state of marketplaces on the Dark Web, in addition to information about security and anonymization technologies, cryptocurrency, and commercial exchange services. The research studied the changes and trends occurring in the Dark Web communities during a specific period. A relevance measure was used to rank the words of each topic according to their relevance scores, and the topics were then labeled according to the ranked keywords. The results showed that during the studied period and up to the security crisis, the topics shifted from expressing normalcy and comfort to expressing a state of tension, low confidence, and an increased orientation towards a security mindset.

Similarly, Cho and Wright [11] presented a study that sheds light on the social phenomenon resulting from forum bans executed against communities of illegal products. The study included an assessment of the extent to which unexpected disruptions cause changes in the public debate. The approach included topic modeling using LDA and sentiment analysis to examine how members perceive the ban and how user participation has changed in the new system.

In Kigerl’s approach [7], the comments written by each user were combined into a single piece of text, so that each word a user had used in any comment appears at least once. Each word is represented by the number of times it is used by a particular user, following the Bag-of-Words model and a term frequency matrix, in which each row represents an individual user and each column represents a unique word, holding the frequency of use of that word by each user. LDA was used to cluster the texts into topics so that one user can be a member of more than one topic, with the topic probabilities totaling 1.0 for each user. The approach utilized model fit and performance measures to determine the optimal number of topics, including internal methods based on the similarity of documents assigned to the same topics and external metrics measuring the separation and distance between each topic and the others.

Kwon and Shao [12] relied on the communicative constitution of organization (CCO) theory to analyze the human side of social networks on the Dark Web and to study the formation of knowledge in cryptomarket communities. The study aims to discover, through topic modeling, the characteristics that make the sociotechnical environment associated with markets on the Dark Web resilient. They utilized structural topic modeling (STM), an algorithm based on LDA, after NLP and text-cleaning operations, in addition to setting a minimum word frequency threshold of ten occurrences. Quantitative and heuristic evaluations were used to determine the optimal number of topics, including exclusivity and coherence, in addition to manual reviews of thematic overlaps.

Heistracher et al. [10] highlighted the importance of determining the appropriate procedures of NLP to process the text and convert it into structured data. The research focuses on named entity recognition, relationship extraction, and event detection. The presented approach utilized topic modeling to visualize the deliberated topics and their distributions and to generate data classifications by ranking the most critical keywords of each topic. Relationship extraction was implemented using part of speech tagging, dependency tree analysis from SpaCy, neural networks, and word embeddings. DBSCAN was used to cluster the data elements according to their similarities using the cosine similarity measure.

Yang et al. [13] presented an approach to extract latent and trending topics from Dark Web forums. They suggested improving the results of the biterm topic model (BTM) by filtering words to obtain more coherent and interpretable topics and to reduce cost and complexity. The filtering process was based on reducing the redundancy of biterms using a proposed new criterion called “generality”, based on the document ratio formula. The generality measure identifies the least significant terms to be filtered out. They argued that if a term appears frequently and widely across the entire set of documents, the term belongs to the stopwords, while topical terms appear frequently in only a few documents. The coherence measures UMass, UCI, and centroid coherence were used to assess the quality of the topics.

Kwon and Shao [9] presented an approach to examine the types of hidden knowledge shared in Dark Web-related communities on the Surface Web, specifically Reddit, and the extent to which the distribution of this knowledge differs between periods of constant market operation and periods of instability or crisis. LDA was utilized and implemented in R with the STM package. The study trained the model with different numbers of topics and applied FREX weights and semantic coherence for evaluation.

Several research studies analyzed the Dark Web content using content analysis and topic modeling techniques as a part of integrated processes. Topic modeling was used as a part of a two-step methodology to reduce the dimensionality of the feature space for better classification results [14] and with association rules mining among top words and other frequent words to describe the content of categorized Dark Web sites [15]. Some approaches involved text summarization [16] and classification techniques to classify reactions occurring in Dark Web communities in times of crises or shutdowns of the markets [17].

Topic modeling is still not widely utilized in the semantic content analysis of Dark Web communities, and shedding light on the conceptual bases of such communities is still in its early steps. By using topic modeling methods and coherence score measures, this study aims to extract the concepts that form the intellectual character of the communities around Dark Web marketplaces and to represent the knowledge emerging from such communities in different forms.

3. Methodology

In this section, we begin with the basic concepts on which this study relies: the definition of topic modeling in subsection 3.1, data preprocessing procedures in subsection 3.2, and feature extraction in subsection 3.3. Four topic modeling algorithms are discussed in subsection 3.4, namely, latent Dirichlet allocation (LDA), the correlated topic model (CTM), the Pachinko allocation topic model (PAM), and the pseudodocument topic model (PTM). Topic coherence measures are then discussed as evaluation metrics in subsection 3.5, and subsection 3.6 introduces the proposed approach.

3.1. Topic Modeling

In psychology, researchers define a concept as a network of correlated words. In a more generalized notion, a concept can be defined as “elements and their organization” [18].

In computational linguistics, researchers proposed several definitions of a concept, or topic. A topic can be defined as a set of words and their frequencies [19], as a set of words or phrases that represent a common temporal concept [20] or as latent topical features in given texts that correspond to contextually related words [21]. In another definition, a topic is a set of words likely to appear in the same context [22, 23].

On the other hand, topic extraction, or topic modeling, is defined as the technique used to infer conceptual topics hidden in a set of documents, or corpus [22, 24], where there is no explicit taxonomic scheme to project onto a corpus or when such projection (or labeling) is costly [25]. In another definition, topic modeling is an automated process to define the “latent thematic structure” of a corpus, summarizing the texts into topics or categorizing them into labels [26].

There are two fundamental research interests in topic modeling: topic interpretation and labeling, and defining dominant topics from the generated topic models [27]. Topic modeling can be considered a form of text mining used to extract frequent patterns of words in a corpus, where a set of words expresses a topic and thus infers the nature of the information in the document set. These interpretable topical schemes help to label each document in the corpus with an annotation. Such annotations have uses in social computing and numerous other applications, such as information retrieval, classification, summarization, and sentiment analysis [22, 24, 28].

A topic model links documents and words using probabilistic and statistical analysis to extract latent features from the text where these features symbolize the hidden topical themes in the text [22]. Consequently, a document can contain several topics, where each topic is represented by a probability distribution over the vocabulary [29, 30].

Topic modeling can process massive volumes of textual data to extract hidden concepts, distinguish features, and define latent variables from text according to the application purpose [31]. Moreover, it can represent documents of a large corpus with concise but comprehensive commentaries [22], with the most likely ones assigned to each document [25].

Methodologically [32], it is advantageous to determine the following definitions as the basic units in the modeling process:
(i) Word or term: the single basic unit of data.
(ii) Document: a string of N words.
(iii) Corpus: a collection of M documents.
(iv) Vocabulary: the set of all the unique words in a corpus.
(v) Topic or concept: represented by a probability distribution over the vocabulary.
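To make these units concrete, the following minimal Python sketch (with hypothetical documents of our own) shows a corpus, its vocabulary, and the shape of a topic as a probability distribution over that vocabulary; it is purely illustrative and not part of the implementation described later.

# Illustrative only: the basic modeling units as plain Python structures.
from collections import Counter

corpus = [                                    # corpus: a collection of M documents
    "vendor ships product fast".split(),      # document: a string of N words (tokens)
    "escrow protects buyer and vendor".split(),
]
vocabulary = sorted({word for doc in corpus for word in doc})   # all unique words

# A topic is a probability distribution over the vocabulary; here we derive one
# from raw counts only to show the shape of the object a topic model outputs.
counts = Counter(word for doc in corpus for word in doc)
total = sum(counts.values())
toy_topic = {word: counts[word] / total for word in vocabulary}
print(vocabulary)
print(toy_topic)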

The words can be correlated through similarity, co-occurrence, proximity, and a subject-predicate structure [33, 34]. Each document is represented by a vector with dimensions corresponding to each term in the vocabulary and valued with the weights of the terms [35].

In this context, topic modeling considers that choosing the words and their positions in the text is intentional. Therefore, statistical analysis of the vocabularies and their co-occurrences with other terms in a particular text helps discover the vital concepts, premises, and intentions implied within the text [34].

In sociology, topic modeling reduces the human impact on the analysis objectivity and improves its efficiency compared to traditional methods. Therefore, advantages such as accuracy and objectivity make topic modeling a robust tool for sociologists [34, 36].

On the other hand, traditional content analysis methods suffer from unhandled polysemy. Topic modeling overcomes this limitation and considers the different meanings of the word by including the lexical context for a more accurate analysis of the word [34, 37].

Topic modeling methods evaluate the importance of terms at several levels. Each word is weighted based on its position (i.e., the distances between words), its frequency, and its context, or semantics. A theme is then created according to the following steps [34]:
(1) Determine which words have the highest weights (salient words).
(2) Determine the words closest to the words with the highest weights.
(3) Determine the prominence of a particular set of words according to its frequency.

Depending on how high the weight of a particular set of words is, this set can be considered one of the topical themes in the corpus. The topic is thus composed of a set of significant words with sufficient proximity to each other and frequently appears as an integrated unit in the corpus [34].

Topic modeling methods can be supervised, unsupervised, or semisupervised and can use structured or unstructured data. Consequently, topic modeling is widely used on web resources to discover the abstract topics underlying a variety of text inputs on the web, such as short articles, chats, social media posts, user comments and reviews, blogs, emails, and other formats [31].

3.2. Data Preprocessing

The main challenge when analyzing Dark Web forums and marketplaces is the diversity of content structures and writing styles, which can contain grammatical and spelling errors, slang, and symbolized and ambiguous words intended to obscure their nature [10].

Like any text mining method, topic modeling inevitably relies on cleaning and preparing the raw text prior to analysis. For this purpose, textual data pass through several preprocessing procedures that remove unimportant, irrelevant, and redundant attributes to reduce the dimensionality of the data space. Some of the most essential and widespread techniques used in text preprocessing are as follows [30, 38]:
(1) Normalization, which converts all letters to lowercase.
(2) Removing common words, or stopwords.
(3) Removing nonalphabetic characters.
(4) Removing punctuation, which includes removing all special characters and symbols (such as @ # $ % ^ & < >).
(5) Tokenization, which breaks a text into elements or attributes called tokens.
(6) Lemmatization, which assembles the different morphological forms of a term into one single form (the lemma), so that they can be analyzed as a single element.
(7) Stemming, which returns a word to its root.
(8) Parsing, which finds the dependencies between the words in a sentence and represents them in a tree structure called the parsing tree; a parsing tree helps to find relationships among vocabularies by extracting the shortest path in the tree structure.
(9) Part-of-speech tagging, which determines the types of the different parts of speech that occur in the text.
(10) Removing hashtags, HTML tags, and links.

3.3. Feature Extraction

The data then enter a feature extraction procedure as a further filtering process. Feature extraction is the core process of identifying word patterns and extracting topics. It aims to improve the modeling performance by reducing the dimensionality of the vocabulary space. The selected features are represented as a vector of salient words with contextual terms designated by a weighted distribution [30]. Several approaches have been proposed to weight the terms. The traditional and common method is term frequency-inverse document frequency (TF-IDF) and its variants, which depend on the frequency of term occurrences within the document and throughout the corpus [25]. On the other hand, pointwise mutual information (PMI) is a leading weighting scheme, as it weights terms with regard to their dependencies and co-occurrences [25].
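To illustrate the difference between frequency-based and co-occurrence-based weighting, the sketch below computes a simple PMI-style weight for each (term, document) pair from raw token counts. It is a minimal sketch under the assumption that PMI is taken between a term and a document; the exact weighting implemented by a given topic modeling library may differ, and the two documents are hypothetical.

# A simple PMI-style term weighting sketch (assumption: PMI between a term and a
# document, estimated from raw token counts; library implementations may differ).
import math
from collections import Counter

docs = [["drug", "market", "vendor"], ["vendor", "escrow", "market", "market"]]

term_counts = Counter(t for d in docs for t in d)   # corpus-wide term counts
doc_lengths = [len(d) for d in docs]
total_tokens = sum(doc_lengths)

def pmi_weight(term: str, doc_index: int) -> float:
    """PMI(term, doc) = log( P(term, doc) / (P(term) * P(doc)) )."""
    f_td = docs[doc_index].count(term)
    if f_td == 0:
        return 0.0
    p_td = f_td / total_tokens
    p_t = term_counts[term] / total_tokens
    p_d = doc_lengths[doc_index] / total_tokens
    return math.log(p_td / (p_t * p_d))

for i, doc in enumerate(docs):
    weights = {t: round(pmi_weight(t, i), 3) for t in set(doc)}
    print(f"doc {i}:", weights)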

3.4. Topic Modeling Algorithms

Topic modeling algorithms are a multiperspective technique applied to discover the semantics in a corpus and to group extracted word patterns into topics. These algorithms generate a representation of the word meanings based on statistical and probabilistic analyses of the words [24, 28, 33]. The algebraic perspective was initially followed to reduce dimensionality, as the original matrix is decomposed into a matrix of factors. Thus, algorithms in topic modeling can be divided into the following two types: algebraic-perspective based and probabilistic-perspective based [28]. This paper concentrates on four probabilistic topic models, i.e., latent Dirichlet allocation (LDA), correlated topic model (CTM), Pachinko allocation topic model (PAM), and pseudodocument topic model (PTM). In the implementation, these algorithms follow the Bag-of-Words model to represent the documents and Gibbs sampling to identify topics. One of the most fundamental features of probabilistic models is that the topics can be extracted directly from the corpus without any predefined input from any prior knowledge [31].

The Bag-of-Words (BoWs) model represents the document as a word-document matrix regardless of the order or grammar of the words. The matrix is valued by the number of word occurrences. The word-document matrices form the input for the topic model instead of the entire document [22].
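For illustration, a Bag-of-Words word-document matrix can be produced with scikit-learn's CountVectorizer; this is not the representation code used in this study, and the documents are hypothetical.

# Bag-of-Words sketch: a document-term count matrix, ignoring word order and grammar.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["vendor ships the product fast",
        "escrow protects the buyer and the vendor"]

vectorizer = CountVectorizer()        # tokenizes, lowercases, builds the vocabulary
bow = vectorizer.fit_transform(docs)  # sparse matrix: rows = documents, columns = terms

print(vectorizer.get_feature_names_out())  # the vocabulary (column order)
print(bow.toarray())                       # raw word counts per document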

The Gibbs sampler is the most common sampling method used in topic modeling. It performs conditional sampling using Markov chain Monte Carlo (MCMC) and uses the distributions of the variables to approximate the posterior distribution, which is afterwards used to determine the best number of topics and to identify the topics [30, 39].
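To make the sampling mechanism concrete, here is a didactic sketch of a collapsed Gibbs sampler for LDA on a toy corpus. It is not the sampler implemented by the library used later in this paper; the corpus, hyperparameters, and topic count are arbitrary assumptions for illustration.

# Didactic collapsed Gibbs sampler for LDA (toy version, not production code).
import random

docs = [["drug", "market", "vendor", "drug"],
        ["bitcoin", "escrow", "market", "vendor"],
        ["bitcoin", "wallet", "escrow"]]
K, alpha, beta = 2, 0.1, 0.01
vocab = sorted({w for d in docs for w in d})
V = len(vocab)
w_id = {w: i for i, w in enumerate(vocab)}

n_dk = [[0] * K for _ in docs]          # document-topic counts
n_kw = [[0] * V for _ in range(K)]      # topic-word counts
n_k = [0] * K                           # topic totals
z = []                                  # current topic assignment of every token

random.seed(0)
for d, doc in enumerate(docs):          # random initialization
    z.append([])
    for w in doc:
        k = random.randrange(K)
        z[d].append(k)
        n_dk[d][k] += 1; n_kw[k][w_id[w]] += 1; n_k[k] += 1

for _ in range(200):                    # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]                 # remove the token's current assignment
            n_dk[d][k] -= 1; n_kw[k][w_id[w]] -= 1; n_k[k] -= 1
            # Full conditional: p(z = t | rest) ~ (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta)
            weights = [(n_dk[d][t] + alpha) * (n_kw[t][w_id[w]] + beta) / (n_k[t] + V * beta)
                       for t in range(K)]
            k = random.choices(range(K), weights=weights)[0]
            z[d][i] = k                 # reassign the token to the sampled topic
            n_dk[d][k] += 1; n_kw[k][w_id[w]] += 1; n_k[k] += 1

for k in range(K):                      # most frequent words per topic
    top = sorted(vocab, key=lambda w: n_kw[k][w_id[w]], reverse=True)[:3]
    print(f"topic {k}:", top)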

3.4.1. Latent Dirichlet Allocation (LDA)

LDA was developed by Blei et al. [40], and it is the most popular topic modeling algorithm. It is a generative Bayesian probabilistic model that generates document-topic and word-topic distributions using the Dirichlet priors alpha and beta as hyperparameters to estimate the document-topic and word-topic distributions, respectively [30]. The vocabulary space is transformed into a topic space owing to the reduced volume of the latter. This transformation results in two matrices: the first represents the probability distributions of words over topics, and the second represents the probability distributions of topics over documents [22]. Thus, the Bayesian model is built in a three-level hierarchy: word, document, and topic [32, 35]. Consequently, each document is a mixture of distinct topics, and each topic is composed of probabilities of words that are likely to co-occur in that topic [21]. This contrast helps to reduce ambiguity by ensuring that each document enfolds a small set of topics and that each topic consists of a small cluster of words. Topic generation in LDA depends on the probability of vocabulary co-occurrences; in other words, a term may occur in different topics, but the other terms in its context determine the interpretability of the topic they represent [20]. LDA relies on the de Finetti theorem to define the statistical structure of the document internally (the relationships between terms within the document) and externally (the relationships between documents). The algorithm depends on a predetermined number of topics k, where the specified number of topics is distributed over the documents with varying proportions [28]. It has proven effective in defining consistent topics, especially in datasets with sufficient vocabulary [20].
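Assuming the Tomotopy package adopted later in this paper, a minimal LDA run that exposes the two distributions described above (words over topics and topics over documents) might look like the following sketch; the documents and parameter values are illustrative only.

# Minimal LDA sketch with tomotopy (illustrative documents and parameters).
import tomotopy as tp

mdl = tp.LDAModel(k=5, alpha=0.1, eta=0.01, seed=42)   # k topics, Dirichlet priors alpha/eta
for text in ["drug market vendor escrow", "bitcoin wallet escrow vendor"]:
    mdl.add_doc(text.split())

for _ in range(30):
    mdl.train(20)                       # Gibbs sampling iterations
print(mdl.ll_per_word)                  # log-likelihood per word after training

print(mdl.get_topic_words(0, top_n=5))  # word-topic distribution: top words of topic 0
print(mdl.docs[0].get_topic_dist())     # document-topic distribution of the first document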

3.4.2. Correlated Topic Model (CTM)

LDA suffers from a critical limitation: its inability to detect potential correlations between topics, due to the use of the Dirichlet distribution to model the variability among topic proportions. However, in real-world data, topics are interconnected [28]. To overcome this limitation, CTM was developed by Blei and Lafferty [41] to model topics while also discovering the correlations between them through the logistic normal distribution. Therefore, CTM has a more flexible and realistic distribution, taking into account topic proportions to generate a hierarchical representation of the latent structure and the correlations among components of the different topics [27, 28, 32, 42]. Studies based on the perplexity measure show that CTM provides a better fit than LDA. Moreover, for a document with a relatively small number of words, the perplexity value was much lower, and thus the certainty was significantly higher for CTM. This advantage arises because CTM uses the correlation among topics in the prediction procedure and infers that words appearing in related topics may also occur in the document under process. In contrast, LDA needs a higher proportion of the document to be observed and the topics to be fully generated to predict the remaining words [32].

3.4.3. Pachinko Allocation Topic Model (PAM)

PAM is a topic modeling algorithm developed by Li and McCallum [43]. While CTM detects correlations between any two topics at a time, PAM creates a mixture model in the form of a directed acyclic graph (DAG) to discover topical correlations of different types, such as nested, arbitrary, and sparse correlations. The DAG is randomly constructed, where each word of the vocabulary is represented by a leaf node and each topic is represented by an interior node; thus, the inner nodes are parent nodes, and the nodes branched from them (leaf and nonleaf) are child nodes. PAM can define correlations within the vocabulary and correlations among the topics; in other words, it explores the distributions of topics over other related topics in the form of categories of supertopics and subtopics, representing a hierarchical relationship scheme [28, 32].

3.4.4. Pseudodocument Topic Model (PTM)

PTM was introduced by Zuo et al. [44]. It is based on LDA in a process called self-aggregation, which implicitly groups short texts into pseudodocuments to address the problem of data sparsity, achieving higher quality with reduced training samples. PTM aggregates documents without employing supplementary information so that topic distributions are modeled on larger and fewer documents of regular sizes (the latent documents) rather than on a sparsely large amount of short texts (the observed documents). The aggregation process is done with a multinomial distribution of short texts over pseudodocuments, where each short text belongs to only one pseudodocument.

3.5. Evaluation Metrics

Evaluation metrics are used to assess the robustness of discovered topics. They evaluate the understandability (or quality) of the topics and the performance and accuracy of the modeling process. Topic evaluation can be achieved through standard measures (such as recall, precision, and F-score), perplexity, or semantic coherence measures [25, 28, 31]. In this paper, we choose perplexity as a prior evaluation metric to estimate the performance of the topic modeling process and topic coherence as a posterior metric to evaluate the quality of the generated topics, as it has been proven that it is well correlated with human evaluations [45].

3.5.1. Perplexity

Perplexity estimates the log-likelihood of the held-out documents [28]. It is used to examine the performance of the topic model and is calculated as shown in the following equation [46]:

\mathrm{perplexity}(D) = \exp\left\{ -\, \frac{\sum_{d=1}^{M} \log p(w_d)}{\sum_{d=1}^{M} N_d} \right\}

where D is the document set, M is the number of documents in the set, N_d is the number of words in document d, and p(w_d) is the probability of the words in document d.

3.5.2. Topic Coherence

Coherence expresses the interpretability of topics through word co-occurrences, as words that contextually and frequently co-occur in the corpus are more correlated and thus convey a better-defined concept [25, 45]. Coherence is one of the paramount techniques for measuring topic quality. Probabilistic coherence estimates to what extent the words in a topic are correlated [28]. Coherence means that a set of statements or facts support each other contextually; in other words, they can be interpreted in a particular context that covers all or most of the facts, and thus they are coherent [47].

Several variations of coherence measures have been proposed that employ different probabilities calculations, such as PMI and NPMI. Studies proved that coherence measures based on PMI and NPMI give the highest agreement with human evaluations [47]. Pointwise mutual information (PMI) among top words of a topic assesses the amount of information gain of a word given the presence of the other word, taking into account the dependencies between words [25]. In an updated version, normalized PMI (NPMI) was developed and applied in many studies [25].

Coherence metrics consist of several components, mainly divided into four dimensions, namely, segmentation (S), probability calculation (P), confirmation measure (M), and aggregation (Σ), applied to the generated topics (T) to produce the final coherence score (C), as demonstrated in Figure 1 [47].

3.5.3. Standard Coherence Measures

Several approaches have been proposed to estimate topic coherence. These measures consider the set of the N top words of each topic, calculate the coherence of word pairs (w_i, w_j) based on the probabilities of word occurrences and word pair co-occurrences, and sum the resulting scores into a final coherence score for the topic. Coherence scores also lead to a ranking of a topic's keywords, where a higher value indicates better topic coherence (closest to zero in the case of negative values) [45, 47, 48].

In this section, we demonstrate the four customary coherence measures, namely, UMass, UCI, CNPMI, and CV (the topic coherence measures are thoroughly explained by Röder et al. [47] with an extensive comparison. We refer the interested reader to their study for further information).

UMass is calculated as shown in the following equation, where ϵ is a small value added to prevent taking the logarithm of zero [47]:

C_{\mathrm{UMass}} = \frac{2}{N(N-1)} \sum_{i=2}^{N} \sum_{j=1}^{i-1} \log \frac{P(w_i, w_j) + \epsilon}{P(w_j)}

UCI relies on PMI and is calculated as shown in the following equation [47]:

C_{\mathrm{UCI}} = \frac{2}{N(N-1)} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \mathrm{PMI}(w_i, w_j)

where PMI is calculated as shown in the following equation [47]:

\mathrm{PMI}(w_i, w_j) = \log \frac{P(w_i, w_j) + \epsilon}{P(w_i)\,P(w_j)}

The probabilities are estimated based on the number of co-occurrences of the words, and these co-occurrences are calculated from documents generated by the sliding window technique with a specified size [47].

Researchers deduced that when a word's context is represented by a vector of its co-occurrences with other words within context windows (±5 words around the keyword), the topic coherence assessment is in the highest agreement with human assessment when NPMI is used to define the elements of these vectors [47, 49]. It also achieves the highest performance when the keyword space is limited to words belonging to the same topic. Thus, for element j of the context vector of the word w_i, the NPMI is calculated as shown in the following equation, where γ is a weighting factor used to give high NPMI values more weight [47]:

\mathrm{NPMI}(w_i, w_j)^{\gamma} = \left( \frac{\log \frac{P(w_i, w_j) + \epsilon}{P(w_i)\,P(w_j)}}{-\log\left(P(w_i, w_j) + \epsilon\right)} \right)^{\gamma}

In a modified version of UCI, NPMI is used instead of PMI to calculate the CNPMI score [49]. Therefore, the CNPMI coherence measure is calculated as shown in the following equation:

C_{\mathrm{NPMI}} = \frac{2}{N(N-1)} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \mathrm{NPMI}(w_i, w_j)

The final coherence measure, CV, proposed by Röder et al. [47] is based on the cosine measure with NPMI and a Boolean sliding window of size ≥50.
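As a concrete illustration of the coherence formulas above, the sketch below estimates word and word-pair probabilities by document co-occurrence and computes UMass and CNPMI for the top words of one candidate topic. It is a simplification: the UCI/NPMI family normally estimates probabilities over a Boolean sliding window rather than whole documents, and the documents and top words here are hypothetical.

# Sketch: UMass and C_NPMI coherence from document co-occurrence counts.
# Simplification: probabilities are estimated over whole documents here; the
# UCI/NPMI family normally uses a Boolean sliding window over the text.
import math
from itertools import combinations

docs = [{"drug", "market", "vendor"}, {"vendor", "escrow", "market"},
        {"bitcoin", "wallet", "escrow"}, {"drug", "vendor", "bitcoin"}]
topic_top_words = ["vendor", "market", "drug"]   # hypothetical top-N words of one topic
eps = 1e-12

def p(*words):
    """Fraction of documents containing all the given words."""
    return sum(all(w in d for w in words) for d in docs) / len(docs)

def umass(words):
    pairs = [(words[i], words[j]) for i in range(1, len(words)) for j in range(i)]
    return 2 / (len(words) * (len(words) - 1)) * sum(
        math.log((p(wi, wj) + eps) / p(wj)) for wi, wj in pairs)

def c_npmi(words):
    def npmi(wi, wj):
        joint = p(wi, wj) + eps
        return math.log(joint / (p(wi) * p(wj))) / -math.log(joint)
    return 2 / (len(words) * (len(words) - 1)) * sum(
        npmi(wi, wj) for wi, wj in combinations(words, 2))

print("UMass :", round(umass(topic_top_words), 4))
print("C_NPMI:", round(c_npmi(topic_top_words), 4))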

3.6. Proposed Approach

In this section, we demonstrate the proposed approach to model topics from Dark Web social networks. Figure 2 illustrates the methodology.

3.6.1. Dataset

For our experiments, we use a dataset of a Dark Web forum associated with the Wallstreet Cryptomarket. The dataset is retrieved from AZSecure Dark Net forums datasets (https://www.azsecure-data.org/dark-net-markets.html (accessed on 20 September 2022)) provided by Du et al. [50].

For the creation of the text corpus, we selected the flat content of each post as our only interest, without revealing any identities or usernames.

3.6.2. Data Cleaning and Preprocessing

We implemented our own cleaner to include several basic text-cleaning procedures and some specific ones suited to the nature of the selected dataset. The data preprocessing steps are described as follows:
(1) Replace accented characters (such as à, ê, õ, and ü) with plain ASCII equivalents using the unidecode Python module. The purpose of this step is to unify the character coding for a fair judgement of the words during the weighting and feature extraction phase.
(2) Normalize characters to lowercase. The same word can be found in different capitalization forms; thus, this step is necessary to bring all cases of the word into one form.
(3) Remove line breaks, tabs, and extra spaces.
(4) Remove hyperlinks by defining them as regular expressions using the re Python module.
(5) Expand contractions (such as you're = you are) by defining a list of contractions and their replacements. This step prepares for a better exclusion of stopwords performed in a later step.
(6) Remove special characters (such as ∼ ! @ # $) defined as regular expressions using the re module.
(7) Lemmatize using the WordNetLemmatizer package from the nltk (Natural Language Toolkit (NLTK): https://www.nltk.org/ [Accessed 2 October 2022]) library.
(8) Remove stopwords defined in the nltk stopwords list for English.
(9) Remove specific words observed throughout the text that add no meaning, such as "wrote" and "would", and the names of the months (which basically precede a post to declare the posting date).

We did not include a stemming process, as we noticed that stemming transforms a word into an erroneously spelled word or stems a word that does not need stemming (such as quality to qualiti), which may confuse the interpretation of the results. After several cleaning trials and observations, we noticed that lemmatization produces better results than stemming. Therefore, we depend on the lemmatization process to reduce the dimensionality of the word space.

We also did not include a spelling correction process because, as discussed previously, members may use intentionally misspelled words as an obfuscation strategy, and such words may be well known among the community participants; therefore, we keep the spelling as it is.
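A condensed sketch of the cleaning pipeline described above is shown below, using the unidecode and nltk packages; the contraction list and the extra stopword list are abbreviated, hypothetical stand-ins for the full lists used in this study.

# Condensed cleaning sketch following the steps above (abbreviated word lists).
import re
from unidecode import unidecode
from nltk.corpus import stopwords          # requires: nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer    # requires: nltk.download('wordnet')

CONTRACTIONS = {"you're": "you are", "don't": "do not"}   # hypothetical short list
EXTRA_STOPWORDS = {"wrote", "would"}                      # dataset-specific words
STOPWORDS = set(stopwords.words("english")) | EXTRA_STOPWORDS
lemmatizer = WordNetLemmatizer()

def clean(text: str) -> str:
    text = unidecode(text).lower()                        # unify accents, lowercase
    text = re.sub(r"\s+", " ", text)                      # line breaks, tabs, extra spaces
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # hyperlinks
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)       # expand contractions
    text = re.sub(r"[^a-z\s]", " ", text)                 # special characters and digits
    tokens = [lemmatizer.lemmatize(t) for t in text.split()]
    return " ".join(t for t in tokens if t not in STOPWORDS)

print(clean("I believe one of the most important opportunities... https://example.onion"))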

The following example shows a text before and after preprocessing:

Before preprocessing:

I believe one of the most important opportunities or factors of buying your drugs on the Dark Net is because once you get to know the markets, vendors and the people that post on these forums we can together create a community that watches out for the safety of all. The DNMs should be a place where we can be sure that we are buying safer quality drugs than what we can obtain from the streets. Therefore we need to learn about what can be the dangerous adulterants in the drugs we are buying. As for cocaine the one that is showing up most often that is particularly dangerous is LEVAMISOLE. For a better understanding of the whats and the whys there is a well written series of articles by the free weekly publication from Seattle, WA, USA “The Stranger”. You can find these articles here: https://cocaineo5z66elwy.onion/Levamisol...caine.html

After preprocessing:

“believe one important opportunity factor buy drug dark net get know market vendor people post forum together create community watch safety dnms place sure buy safe quality drug obtain street therefore need learn dangerous adulterant drug buy cocaine one show often particularly dangerous levamisole good understand whats well series article free weekly publication seattle usa strange find article caine html”

3.6.3. Feature Extraction and Topic Modeling Algorithms Implementation

We utilized four topic modeling algorithms, namely, LDA, CTM, PAM, and PTM, implemented using the Tomotopy (Tomotopy: https://github.com/bab2min/tomotopy [Accessed 14 October 2022]) Python package for topic modeling. We chose Tomotopy for the variety of algorithms and utilities it provides.

As discussed in Section 3.4, we chose these four methods for their characteristics. LDA is the most common topic modeling method, and it is easy and fast. CTM has an advantage over LDA in that it extracts the relationships among the generated topics, depicted as a network. This network helps in detecting the connections between different thoughts in the discussions, as concepts in the real world are not independent but correlated. PAM helps build a hierarchy of supertopics and subtopics, which illustrates the relationship between a generalized thought and more specific ones. PTM has an advantage when it is used to extract topics from posts of diverse sizes, grouping them into pseudodocuments of regular lengths for better modeling results.

For a fair comparison between the four models, we set the basic settings and parameters equally. First, we use PMI as the word-weighting scheme to capture salient words according to their semantic relevance [51]. We found that the topic modeling methods performed better with PMI than with TF-IDF for feature extraction, with lower entropy of the term-weighted words and lower perplexity. Second, we set the number of training iterations to the point where the log-likelihood stops increasing significantly. A higher log-likelihood (closer to 0) indicates a lower perplexity, and a low perplexity score means that the prior calculated probabilities define the generated topics well [7]. Thus, we set the number of iterations to 30 iterations of Gibbs sampling, each implying 20 cross-sampling iterations, which makes a total of 600 iterations for each modeling process. The third setting is the number of topics k. For each algorithm, we train the model for ten candidate numbers of topics, from 5 to 50 in steps of 5.

For CTM, we set the number of iterations used to sample the beta parameter to 5, a moderate number chosen to regulate time cost, as CTM takes longer than the other methods. For PAM, we set the number of supertopics to a small value to give it a sense of clustering, grouping the subtopics into small groups of master domains. For PTM, we set the number of pseudodocuments to ten times the number of topics (10k). Each model is run three times to estimate the best performance and select the best model for each method. Table 1 shows the result of the word-weighting process. Table 2 presents the log-likelihood per word of the last iteration for each model and each value of k.
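Assuming Tomotopy's model classes and TermWeight enumeration behave as described in its documentation, the configuration above could be set up roughly as in the following sketch; the corpus file name is hypothetical, and the constructor arguments (especially for PAM and PTM) should be checked against the installed Tomotopy version.

# Sketch of the experimental setup: four models, PMI term weighting, candidate k values.
import tomotopy as tp

corpus = [line.split() for line in open("cleaned_posts.txt")]   # hypothetical cleaned corpus
candidate_ks = range(5, 55, 5)                                  # 5, 10, ..., 50

def build_models(k):
    return {
        "LDA": tp.LDAModel(k=k, tw=tp.TermWeight.PMI),
        "CTM": tp.CTModel(k=k, tw=tp.TermWeight.PMI),
        "PAM": tp.PAModel(k1=max(2, k // 5), k2=k, tw=tp.TermWeight.PMI),  # few supertopics (rough rule)
        "PTM": tp.PTModel(k=k, p=10 * k, tw=tp.TermWeight.PMI),            # 10k pseudodocuments
    }

for k in candidate_ks:
    for name, mdl in build_models(k).items():
        for words in corpus:
            if words:
                mdl.add_doc(words)
        for _ in range(30):             # 30 x 20 = 600 Gibbs sampling iterations
            mdl.train(20)
        print(f"{name:>3} k={k:<2} ll/word={mdl.ll_per_word:.3f} perplexity={mdl.perplexity:.1f}")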

3.6.4. Evaluating Topics Using Topic Coherence Measures

For coherence evaluations, we use the four common coherence score measures, namely, UMass, UCI, CNPMI, and CV. The evaluations were conducted for each topic model method and each value of k, and we applied the standard deviation (denoted as STDEV) to estimate the error rate of the coherence measures for the three runs. The resulting values are demonstrated for LDA, CTM, PAM, and PTM in Tables 3–6, respectively.
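For the repeated runs, the per-run average coherence and its spread across the three runs can be computed with Python's statistics module; the scores below are placeholders, not the values reported in Tables 3–6.

# Sketch: mean coherence per run and its spread across the three runs (placeholder values).
from statistics import mean, stdev

# Hypothetical per-topic CNPMI scores for one model at one value of k, over three runs.
runs = [
    [0.051, 0.047, 0.055, 0.049],
    [0.045, 0.050, 0.043, 0.048],
    [0.052, 0.046, 0.050, 0.044],
]
per_run_means = [mean(r) for r in runs]
print("per-run mean CNPMI:", [round(m, 4) for m in per_run_means])
print("STDEV across runs :", round(stdev(per_run_means), 4))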

Figures 3–6 illustrate the coherence scores for each topic model according to the coherence measures, i.e., UMass, UCI, CNPMI, and CV, respectively.

4. Results and Discussion

Of the four coherence score measures, we discuss the results from CNPMI, as NPMI has proven to have the best overall performance and correlation with human evaluations [47, 49]. Thus, from Figure 5, we infer the optimal value of k for each topic modeling method. We suggest considering several closely ranked high points of coherence for a topic model for further human observation, so as to choose the number of topics that best suits the aim of the research and supports better analysis and labeling. Therefore, we discuss the highest points of coherence for each topic model we trained on the Dark Web forum. For each model, we examine the optimal value of k, demonstrate the generated topics, and label them according to the most frequent and most related terms in each topic. We notice that more than one topic may fall under the same label, and a topic may hold more than one label. This effect is due to the nature of the probability distributions of terms over topics and of topics over documents with different proportions.

LDA shows high coherence scores at k = 20, 25, and 30. Two of the top three highest scores are achieved with k = 20, with CNPMI = 0.0509 and 0.0452 and an error rate STDEV = 0.0076. To help demonstrate the separation of topics for each of the three models, we use pyLDAvis (pyLDAvis: https://github.com/bmabey/pyLDAvis [Accessed 23 October 2022]) to visualize the topic maps corresponding to the abovementioned models, for k = 20 at CNPMI = 0.0509, k = 25 at CNPMI = 0.0445, and k = 30 at CNPMI = 0.0390, as illustrated in Figure 7.
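One way to build such a topic map from a trained Tomotopy LDA model is to pass its topic-word and document-topic distributions to pyLDAvis.prepare. The attribute and method names below follow Tomotopy's documented API but should be verified against the installed versions of both packages; the tiny stand-in corpus only makes the sketch runnable end to end.

# Sketch: preparing a pyLDAvis topic map from a tomotopy LDA model.
import numpy as np
import pyLDAvis
import tomotopy as tp

# Tiny stand-in model so the sketch runs; replace with the actual trained model.
mdl = tp.LDAModel(k=3, seed=1)
for text in ["drug market vendor", "bitcoin escrow vendor", "wallet bitcoin market"]:
    mdl.add_doc(text.split())
mdl.train(100)

topic_term = np.stack([mdl.get_topic_word_dist(t) for t in range(mdl.k)])  # topics x vocab
doc_topic = np.stack([doc.get_topic_dist() for doc in mdl.docs])           # docs x topics
doc_lengths = np.array([len(doc.words) for doc in mdl.docs])
vocab = list(mdl.used_vocabs)
term_frequency = np.array(mdl.used_vocab_freq)

vis = pyLDAvis.prepare(topic_term, doc_topic, doc_lengths, vocab, term_frequency)
pyLDAvis.save_html(vis, "topic_map.html")   # interactive intertopic distance map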

In Figure 7, we notice the best separation is gained from k = 20 (shown in a), while for 25 and 30 (shown in b and c, respectively), more topic clusters overlap. We can infer that some numbers of topics produce good coherence; however, with an increased number of topics, more topic clusters overlap. Table 7 shows the top 20 words of the topics generated by LDA for k = 20 at CNPMI = 0.0509.

For CTM, the top three coherence scores are achieved at k = 5 and k = 10, with the highest score of all recorded at k = 10 with CNPMI = 0.0173 and error rate STDEV = 0.021, a high peak of coherence compared to the other numbers of topics. Table 8 presents the positive and negative correlations among the generated topics. We illustrate the network of the generated topics and their correlations using pyvis (pyvis: https://github.com/WestHealth/pyvis [Accessed 23 October 2022]), as shown in Figure 8. A positive correlation indicates that two topics are likely to appear together in the same document, while a negative correlation indicates that they are unlikely to co-occur. As mentioned earlier, the correlations are computed through the logistic normal distribution [41]. Table 9 shows the top 20 words of the topics generated by CTM for k = 10 at CNPMI = 0.0173, with the sizes of the topics defined by the number of terms in each topic.
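The correlation network itself can be drawn with pyvis by adding one node per topic and one edge per correlation; the topic labels and correlation values below are placeholders standing in for the actual values in Table 8.

# Sketch: a topic correlation network with pyvis (placeholder topics and correlations).
from pyvis.network import Network

topics = {0: "markets/vendors", 1: "shipping", 2: "payments"}   # hypothetical labels
correlations = [(0, 1, 0.42), (0, 2, 0.31), (1, 2, -0.18)]      # hypothetical values

net = Network(height="500px", width="100%")
for topic_id, label in topics.items():
    net.add_node(topic_id, label=f"T{topic_id}: {label}")
for a, b, corr in correlations:
    # Edge width reflects correlation strength; color distinguishes the sign.
    net.add_edge(a, b, value=abs(corr), title=f"corr={corr:+.2f}",
                 color="green" if corr > 0 else "red")
net.save_graph("topic_correlations.html")   # writes the interactive HTML network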

It is worth noting that if a larger number of topics with more correlations is desired for a specific analysis purpose, one may choose a higher number of topics at the cost of lower coherence, as CTM generates topics with a dense correlation structure but less guaranteed coherence [52].

PAM achieved the top three coherence scores at k = 10 with CNPMI = 0.0869, 0.0542, and 0.0662 and error rate STDEV = 0.0165. The closest point that follows is at 0.0518 for k = 50. Again, the research purpose can define the best number of topics by determining the desired size of the hierarchy of supertopics and subtopics. For example, for 10 subtopics (k2), we set the number of supertopics (k1) to 3, while we set it to 10 for 50 subtopics. Table 10 shows the distribution of the subtopics over the supertopics for k1 = 3 and k2 = 10 at CNPMI = 0.0869, with their distribution probabilities sorted in descending order. Table 11 shows the top 20 words of the subtopics.

Lastly, for PTM, the CNPMI coherence shows a high peak at k = 10 with the top three coherence scores recorded as CNPMI = 0.0500, 0.0510, and 0.0363. Table 12 shows the top 20 words of the generated topics for k = 10 at CNPMI = 0.0510. Figure 9 illustrates the topics map.

5. Limitations, Challenges, and Future Work

Our presented approach provides comprehensive discussions and solutions for a variety of issues and challenges, including a thoughtful selection of proper data preprocessing procedures, a weighting scheme for feature extraction, and suitable topic modeling methods that serve various purposes in understanding the terms' semantics and the hidden topics in Dark Web discussions.

Text analysis inevitably depends on preprocessing and cleaning procedures for accurate results; thus, it is crucial to make thoughtful decisions in selecting the appropriate ones. In social networks, texts often contain slang, misspellings, and grammatical errors, which may be intentional in the case of the Dark Web in particular. Such datasets may require further manual observation of the posts to eliminate meaningless words, which entails training the topic model several times to detect them. On the other hand, we found that lemmatization produces better results than stemming for the studied dataset. However, some words are still missed by the lemmatization process and remain unlemmatized, which calls for further research to enhance and extend the dictionary.

The BoW model uses a vocabulary of terms that explicitly occur in the document; thus, it ignores important correlations between the terms that do not co-occur, yet they are connected [35].

Topic models’ performance and coherence are steered by the properties of the dataset. Thus, results may vary according to the dataset type, size, content, and the lengths of its entries. Therefore, each dataset needs unique and thoughtful evaluation decisions to determine the best-performing, most coherent model. Moreover, as we observed in this study, each training process for the same model with the same number of topics produced different outcomes with varied coherence scores. Thus, an analyst may need to run the experiments several times to select the best result for each topic model and each value of k.

In our previous work [4], we discussed the challenges in the field of analyzing Dark Web content, which is mostly characterized by the language inconsistency of the discussions. This inconsistency can manifest in weak grammatical contexts and intentionally ambiguated words, such as emerging slang, intentional misspellings, abbreviated terms, and idiomatic contexts, all of which are customary to the Dark Web communities but ambiguous to those outside them. Furthermore, unlike public or regular social networks, members of Dark Web communities create concepts of their own that might change over time as needed and may be understood by their members only. For these reasons, further research is needed to analyze the intentions behind the used words and contexts, which may require the knowledge of experts to decipher the linguistic purposes. Moreover, ethical considerations must be carefully taken into account when studying Dark Web content, as the discussions may contain sensitive data.

For future work, we will incorporate N-grams into the language processing, which can help overcome the shortcomings of the BoW model. Furthermore, integrating ontologies can enhance text preprocessing, the generated topic models, and their labeling. Other topic modeling methods, such as the dynamic topic model (DTM), hierarchical LDA (HLDA), and hierarchical PAM (HPAM), will be considered to extend the empirical analysis and comparison.

6. Conclusion

Topic modeling is a promising methodology for analyzing the contents of social networks semantically and for capturing the correlations within them. Social networks on the Dark Web are no exception. Forums associated with cryptomarkets are fraught with discussions about criminal behaviors and the perpetration of illegal business. We introduced an approach to model the latent topics, and their correlations, from a forum of illicit and malicious activities on the Dark Web into thematic patterns. We emphasized the significance of choosing the appropriate preprocessing and cleaning procedures, on which the accuracy and quality of the topic models primarily depend. We used four topic modeling algorithms, LDA, CTM, PAM, and PTM, and four coherence measures, UMass, UCI, CNPMI, and CV, and discussed their performance and outcomes for the studied dataset. According to these evaluation metrics, we examined the most coherent topics produced by each model to choose the optimal number of topics for each method. Subsequently, we visualized the results as labeled groups of semantically associated terms, including the relationships among topics for CTM and PAM. Lastly, we discussed limitations, challenges, and future work. Analyzing discussions and contents on the Dark Web can be tremendously advantageous to sociologists, criminologists, psychologists, law enforcement agencies, cybersecurity agencies, and many others. This study presents a leading start for further research in the field by providing a comprehensive approach to extracting hidden thought patterns from the Dark Web.

Data Availability

The dataset used to support the findings of this study was retrieved from an open source repository “Dark Net Forums Datasets” provided by AZSecure and can be viewed and downloaded from the URL: https://www.azsecure-data.org/dark-net-markets.html.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Supplementary Materials

A detailed list of the abbreviations and symbols used in this paper can be found in the Appendix as a ready reference.