Abstract

From proactive detection of cyberattacks to the identification of key actors, analyzing the contents of the Dark Web plays a significant role in deterring cybercrime and understanding criminal minds. Research on the Dark Web has proved to be an essential step in fighting cybercrime, whether as a standalone investigation of the Dark Web or as part of an integrated one that also covers the Surface Web and the Deep Web. In this review, we probe recent studies in the field of analyzing Dark Web content for Cyber Threat Intelligence (CTI), introducing a comprehensive analysis of their techniques, methods, tools, approaches, and results, and discussing their possible limitations. We also demonstrate the significance of studying the contents of different platforms on the Dark Web, leading new researchers through state-of-the-art methodologies. Furthermore, we discuss the technical challenges, ethical considerations, and future directions in the domain.

1. Introduction

In the age of technology, and with the rapid development of hacking techniques and tools, it has become urgent for organizations to take appropriate countermeasures against cyberattacks and cybercriminals. Moreover, proactive detection of cybersecurity threats is one of the fundamental and challenging actions needed to anticipate and detect an attack before it occurs. Cyberattacks come in different patterns and at different levels, differing in their complexity, breadth, and objectives. This vast diversity compels institutions, and countries in general, to make cybersecurity one of their essential systems.

Because of rapid technical development, a large part of hacking activity has transformed from individual acts of theft and vandalism into well-organized and financially supported operations aiming for large-scale profits. The objectives of organized crime in this domain range from financial gain to achieving political goals [1].

This transformation urges organizations to consider contemporary and sophisticated techniques to keep pace with the development of cyberattacks. Therefore, a new in-demand generation of cybersecurity tools is arising and attracting increasing interest from researchers and security practitioners: Cyber Threat Intelligence (CTI). CTI is an information system that provides evidence-based knowledge about cyber threats. Based on the gained knowledge, organizations can make cybersecurity decisions, including detecting, preventing, and recovering from cyberattacks [1].

Recently, in light of the COVID-19 pandemic outbreak, the number of data security attacks has increased dramatically, as the pandemic forced organizations worldwide into a state of work-at-home without taking adequate and effective measures against these attacks [2]. In another COVID-19-related aspect, hacking communities witnessed an increase in posts discussing exploiting the pandemic as a new opportunity for attacks, most notably attacks targeting remote-work tools and fraud targeting people looking for jobs or for information about COVID-19 [3].

Social networks on the Deep and Dark Webs are considered essential places where hackers gain technical information and develop their skills. On such networks, they share information, communicate with each other, and sell hacking-related materials such as breached data, stolen card numbers, and system vulnerabilities [4]. Criminal networks rely on social ties in their formation and growth. On the Internet, social networks, and forums specifically, give criminals with similar intellectual affinities, such as hackers, a place to exchange information and experiences or to plan crimes and attacks [5].

On these platforms, members can follow posts from other members whom they consider trustworthy or expert. Moreover, forum members have different specializations and rankings within the same community according to their activities and services. Therefore, forums provide a proper environment for the growth of cybercriminal networks and increase opportunities for planning and conducting cybercrime worldwide [5]. Consequently, these places provide vital resources for researchers and cybersecurity experts to detect cyberattacks early and provide organizations with warnings of potential threats [4]. Moreover, studying these hacker communities on the Dark Web allows for the continuous development of new areas in security informatics technologies [6].

In this review, we highlight the importance of analyzing content on Dark Web platforms to detect and predict cybercrimes, leading new researchers through previous works. We introduce a comparative study of recent research, covering research goals, approaches, methods and tools, applied case studies, and results, and discuss possible limitations. Furthermore, this review highlights notable challenges in analyzing Dark Web content, emerging fields, and future directions. We exclude studies that focus on crawling or collecting data from the Dark Web or that present only statistical analyses. We focus on studies that analyze the content of Dark Web platforms (such as websites, forums, social networks, and marketplaces) to gain valuable information for Cyber Threat Intelligence (CTI). (Abbreviations: social network analysis (SNA), content analysis (CA), artificial intelligence (AI), machine learning (ML), data mining (DM), natural language processing (NLP), topic modeling (TM).)

It is worth directing the interested reader to other surveys in the domain of CTI in general. In these surveys, researchers can find descriptions of different threat intelligence types and information-sharing strategies, with a comparative study of the most popular open-source threat intelligence tools and a discussion of technical and nontechnical challenges [7, 8]. Moreover, an analysis of the current role of the Dark Web as an environment that facilitates cybercrime and illicit gain can help understand the particularity of the Dark Web [9], in addition to the importance of knowing and employing DM and ML methods to predict and discover patterns of cyberattacks [10]. An extensive description of possible cybersecurity data sources and applications can support launching cybersecurity studies [11]. Researchers can leverage a detailed explanation of CTI types and their phased cycles with the application of AI and ML methods for prompt, actionable operations [12]. An overview of existing CTI platforms and their approaches and information provision abilities can help researchers find current gaps to start from and contribute improvements to the CTI industry [13].

3. The Particularity of the Dark Web

To begin with, we present a simple explanation of the Internet layers down to the Dark Web, while explaining these layers in detail is out of the scope of this review.

First, the Surface Web (also called the Open Web or Clear Web) comprises all websites that are publicly and easily accessible because search engines can index them. In contrast, numerous websites are inaccessible because search engines cannot index them; these form the Deep Web (or the Invisible Web). For the latter, one needs to type the URL of the website directly in the address bar of the web browser, or the website itself is visible but its content requires a password to access [14].

Researchers should differentiate the Deep Web from the Dark Web. The Deep Web is the part of the web that search engines cannot access for various reasons related to the operational functions of the websites. Researchers estimate this part at more than 90% of the entire web. The Dark Web, in turn, is the part of the Deep Web that uses special encryption software to hide users' identities and IP addresses [15].

Thus, the most difficult-to-access part of the Deep Web is the Dark Web, also called the Darknet. This anonymization leads to the predominance of malicious and criminal activities in that hidden and encrypted environment [15]. Various crimes and heinous actions are prevalent in this part of the web. Novice and professional hackers operate there, whether for amusement or for gains through extortion, sabotaging networks, or stealing organizations' data, alongside many other crimes such as child pornography and pedophile networks, drug and arms trade, human trafficking, terrorism and recruitment of extremists, planning of terrorist attacks, murder for hire, trade in hacked digital media, counterfeit documents, fraud, and many others [15, 16].

The Dark Web provides the ability to hide the user's identity, network traffic, and data exchanged through it. Users outside the Dark Web cannot access it using standard web browsers; they must use special software, such as The Onion Router (TOR), Invisible Internet Project (I2P), and Freenet [15, 17]. Researchers consider dark networks the primary host for various criminal activities. For example, marketplaces on the Dark Web are evidence of Crime-as-a-Service (CaaS), as they provide most of the items commonly found in conventional black markets [18]. Trades on Dark Web marketplaces are anonymized as well, where members complete their transactions using cryptocurrencies, such as Bitcoin and Monero [16]. In this regard, some cybercriminals act as cryptocurrency providers to make it easier for others to perform criminal activities [19].

In terms of cybersecurity threats, hacking communities are active on Dark Web platforms, where hackers exchange experiences and share information, in addition to circulating hacking tools, malware, ransomware, and breached data, and planning large-scale cyberattacks resembling a pattern of organized crime [16].

Meanwhile, Dark Web marketplaces are fraught with hacking products and tools for organizing attacks. Additionally, vendors offer breached personal data, such as credit cards, bank accounts, PINs, credentials, and other Personally Identifiable Information (PII). These marketplaces also provide botnets for rent to perform Distributed Denial of Service (DDoS) attacks, along with fraud and spam services such as e-mail lists for sending phishing e-mails [14].

Dark Web marketplaces include sellers and buyers with different levels of technical expertise. For example, a small class of highly experienced professional sellers creates and sells sophisticated hacking tools and malware, whereas other less-experienced members buy from or collaborate with them to organize massive attacks or exploit breached data in a CaaS paradigm. This example of crime indicates that technical professionalism is no longer an essential prerequisite for conducting cybercrime [14]. In this context, some professional vendors offer security services to others to provide an extra level of protection and privacy against law enforcement agencies' operations. Thus, if a cyberattack is detected, the identity of the perpetrator remains unknown [14].

In this regard, studies have shown that many successful cyberattacks relied on the cohesion of mutual relationships between hackers, established over long periods of cooperation, especially given the different levels of skill they possess. These differing skill levels lead them to cooperate to implement attacks and achieve their pursued gains. Therefore, these networks and marketplaces form what look like peership or colleagueship networks [19].

Moreover, many cybercrime marketplaces operate alongside hacking forums. Sellers advertise their products on these forums along with a description of the product features, price details, payment methods, terms of service, and contact information of the seller. For contact, sellers and buyers tend to use other encrypted communication media, such as private messaging apps, or the direct messaging features included in the forum [14]. Dark Web marketplaces play a significant role in providing hacking-related items. From the existence of markets for hackers, one can infer that the focus of such business on the Dark Web is financial gain, which is sometimes monopolized by the professional minority that dominates the market [20].

Some forums maintain a level of professionalism by establishing a reputation system to prevent intruders, or researchers, from gathering information. The reputation system gives professional and active users in the community more privileges as their professionalism and trustworthiness increase, such as more reputation points and permission to access other sections of the forum [3].

TOR also allows hosting websites while masking the location of the hosting servers; these TOR Hidden Services can only be accessed through TOR [14]. Recently, Darknets have become more complex and difficult to penetrate. In 2017, TOR added a layer of privacy that increases the complexity of identifying both website hosts and visitors. Thus, platforms on the Dark Web will be less discoverable. Moreover, website administrators have become more inclined toward making sites and forums accessible by invitation only [16].

Conventional cybersecurity solutions have focused on protecting endpoint devices of all kinds; however, while they can be effective for some time, they are not a long-term remedy [16].

On the bright side, methods and techniques of artificial intelligence, machine learning, data mining, and analytics are vital tools in fighting cybercrime. Such tools assist law enforcement agencies in targeting and disrupting websites on the Dark Web. Additionally, they provide them with the legal evidence they need to sanction perpetrators [16].

It is worth noting that not all activities on the Dark Web are illegal; many entities use encryption software for legitimate purposes, such as journalists, political activists, whistleblowers, and law enforcement agencies and researchers for research purposes [15, 16].

4. Dark Web and CTI

As discussed previously, the Dark Web represents a critical source of information for CTI. In this section, we first give a brief description of CTI and its different aspects. Then, we review recent studies, comparing their objectives, approaches, methods and tools, results, and possible limitations.

4.1. Cyber Threat Intelligence (CTI)

Cyber Threat Intelligence (CTI) is an information system that supports public and private organizations in detecting, identifying, monitoring, and responding to cyber threats. The acquired information helps to understand the tactics, techniques, and procedures (TTPs) of threats and threat actors. Moreover, it provides organizations with timely security alerts, recommended settings, and other information according to the type and purpose of the CTI system [21].

CTI provides information answering the customary five questions: who, what, where, how, and when. CTI can leverage data from multiple sources, which can be internal (such as network event log files, firewall logs, alerts, responses to previous incidents, malware used in attacks, and network flows) or external (such as reports from other institutions or governments and experts' blogs) [21].

According to Sari, an effective and efficient CTI should have five major characteristics: it should be timely, relevant, accurate, specific, and actionable [21].

CTI includes several subcategories according to its purpose and sources. For example, there are Open-Source Intelligence (OSINT), Social Media Intelligence (SOCMINT), Measurement and Signature Intelligence (MASINT), Human Intelligence (HUMINT), and Technical Intelligence (TECHINT). Sari [21] addressed in detail each type of CTI for further reading. Alsmadi [22] added other examples like Communication Intelligence (COMINT), Deep or Dark Web Intelligence, Signal Intelligence (SIGINT), and Geospatial Intelligence (GEOINT).

Alsmadi [22] defined three levels of Cyber Intelligence:
(1) Strategic Cyber Intelligence, which identifies threats in terms of sources, objectives, and possible consequences
(2) Operational Cyber Intelligence, which provides information about attackers' capabilities and resources and predicts the targets and methods employed by actors to achieve their goals
(3) Tactical or Technical Cyber Intelligence, which provides information about real-time methods and tools used by attackers and addresses the countermeasures and defense strategies to be followed by organizations

Several informational aspects of the Dark Web can help form a defense and remedy against cyberattacks. These aspects include analyzing a recent attack on a specific organization, tracking changes on hacker marketplaces, monitoring hackers' behavior in hacking communities, and evaluating developments in hackers' skills and capabilities [23]. Indeed, Shakarian [23] classified CTI into four tiers (illustrated in Figure 1): (1) situational awareness, (2) imminent threats, (3) understand capabilities, and (4) understand communities, in which he gives the upper tiers (third and fourth) great importance for long-term considerations.

CTI is primarily a data-driven process; therefore, it needs several stages to collect, process, and analyze data according to the security needs of the organizations (public, private, or cybersecurity specialized). To understand the intelligence an organization requires, it should address several components, including inspecting the existing security domain, determining the current cyber threats, monitoring its cyber assets, and modeling potential directions of future threats [13].

Generally, CTI has four stages, illustrated in Figure 2 [13]:
(1) Intelligence planning/strategy
(2) Data collection and aggregation
(3) Threat analytics
(4) Intelligence usage and dissemination

4.2. Dark Web Analysis in the Field of CTI

We organize the reviewed studies listed below by the leading issues they cover, although some studies fall under multiple subjects.

4.2.1. Detecting and Predicting Cyber Threats

Sapienza [24] presented an approach that integrates information from social networks on the Surface, Deep, and Dark Webs. The system matches terms discovered in posts of cybersecurity experts on social media (Twitter) against hackers' discussions on Dark Web forums to generate warnings about anticipated or current cyber threats. The system identifies important terms using text mining techniques and computes their occurrences in Dark Web hacking forums. Additionally, the system connects the resulting terms with a group of words that have contextual semantic relationships with them, enriching the interpretation of the warning and eventually supporting the tracking and observation of how activities related to the discovered terms evolve on the Dark Web.
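
To make the matching step concrete, the following is a minimal sketch (not the authors' implementation) of counting expert-derived terms in forum posts and flagging those that cross a hypothetical warning threshold; the term list, posts, and threshold are all illustrative assumptions:

```python
from collections import Counter

def term_warnings(expert_terms, forum_posts, threshold=2):
    """Count occurrences of expert-derived terms in forum posts and
    flag terms whose total mention count reaches the given threshold."""
    counts = Counter()
    for post in forum_posts:
        tokens = post.lower().split()
        for term in expert_terms:
            counts[term] += tokens.count(term)
    return {t: c for t, c in counts.items() if c >= threshold}

posts = [
    "new mirai variant targets iot routers",
    "selling mirai source with setup guide",
    "anyone tried the new phishing kit",
]
print(term_warnings({"mirai", "phishing"}, posts))  # {'mirai': 2}
```

A real system would add stemming, multiword terms, and semantically related words, but the core signal is the same: cross-platform term frequency.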

Working on already discovered vulnerabilities, Almukaynizi et al. [25] introduced an approach that maps multiple resources, such as white-hat community, vulnerability research community, and websites on the Dark Web and Deep Web, to predict if hackers will exploit a vulnerability. The approach highlights how the likelihood of exploitation increases when the associated vulnerability has frequent mentions in the resources at study.

In another work, Almukaynizi et al. [26] introduced the DARKMENTION system that employs association rules to find correlations between threats mentioned on Dark and Deep Webs and real-world cyber incidents. By using the discovered correlations, the system generates warnings to cybersecurity organizations promptly. The approach depends on Causal Reasoning concepts and Logic Programming (specifically Point Frequent Function (PFR)) to learn the rules.
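
As a generic illustration of the rule-learning idea (a toy stand-in, not DARKMENTION's logic-programming machinery), one can mine one-to-one association rules linking Dark Web mentions to subsequent incidents, with support and confidence thresholds; the transactions and item names below are invented:

```python
from itertools import combinations

def association_rules(transactions, min_support=0.5, min_confidence=0.7):
    """Mine simple one-to-one association rules (A -> B) whose support
    and confidence exceed the given thresholds."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    rules = []
    for a, b in combinations(items, 2):
        for ante, cons in ((a, b), (b, a)):
            sup = support({ante, cons})
            if sup >= min_support and support({ante}) > 0:
                conf = sup / support({ante})
                if conf >= min_confidence:
                    rules.append((ante, cons, round(sup, 2), round(conf, 2)))
    return rules

# Each transaction: terms seen in Dark Web posts plus any incident that followed.
txns = [{"ddos-tool", "incident"}, {"ddos-tool", "incident"},
        {"spam-kit"}, {"spam-kit", "incident"}]
print(association_rules(txns))  # [('ddos-tool', 'incident', 0.5, 1.0)]
```

The surviving rule ("ddos-tool" mentions co-occur with incidents) is the kind of correlation a warning generator would then act on.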

Williams et al. [27] introduced an incremental crawling approach that crawls, classifies, and visualizes cyber threats. The classification phase categorizes up-to-date hacking exploits and attachments, detects trending and emerging threats, and analyzes hackers’ activities by year and exploit type.

Using ontologies, Narayanan et al. [28] proposed a framework that provides cybersecurity experts with semantically enriched knowledge representation and reasoning, with the aim of early detection of cyberattacks. The approach accounts for information incompleteness, considering that security blogs and discussions on the Dark Web are often written for a specific audience (such as like-minded peers or fellow experts). The system analyzes data about previous attack patterns, the tools used for these attacks, and other indicators as the reasoning part of the system to detect known and unknown attacks. Furthermore, the system employs association rules and clustering techniques to find complex patterns of events in the data stream.

Tavabi et al. [29] proposed DarkEmbed, a framework with a neural language modeling approach to predict the exploitation of vulnerabilities. The framework represents discussions from Dark Web and Deep Web forums in a low-dimensional vector space by employing language embeddings, which capture the contextual, syntactic, and semantic relationships between words, and then uses the distributed representations as classification features.
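
The "distributed representation" step can be sketched as averaging per-word vectors into one document vector that a classifier could consume. This is a toy illustration, not DarkEmbed itself; the 3-dimensional hand-made embeddings below are assumptions (real ones are learned from the corpus):

```python
def doc_vector(tokens, embeddings, dim=3):
    """Average per-word embedding vectors into a single document vector;
    words missing from the embedding table are skipped."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Hypothetical toy embeddings for two vocabulary words.
emb = {"exploit": [1.0, 0.0, 0.0], "kernel": [0.0, 1.0, 0.0]}
print(doc_vector(["exploit", "kernel", "unknownword"], emb))  # [0.5, 0.5, 0.0]
```

The resulting fixed-length vector is what gets fed to a downstream classifier in place of sparse word counts.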

Arnold et al. [30] developed a CTI tool to identify cyber threats by analyzing social network data on the Dark Web to detect valuable information relevant to the diffusion of malicious tools and services, their key actors and participants, and breached data. Based on text feature analysis, the approach integrates text analytics from both forums and marketplaces on the Dark Web to gain information from actors' discussions and interactions and information about the actual traded products (breached data). Key actors post diverse content that contains contextual and temporal information related to their offers on the markets. Thus, the authors confirmed that considering forum data is essential, as forums are rich in text.

For source-code classification, Ampel et al. [31] presented an approach to classify exploit source code based on a deep transfer learning methodology. Their Deep Transfer Learning for Exploit Labeling (DTL-EL) tool takes labels identified by professionals in public exploit repositories and generalizes them to exploits extracted from hacker forums with enriching metadata. Furthermore, the system classifies the gained information into predefined categories so that cybersecurity organizations can take proactive measures. The approach takes advantage of metadata and categories found in specific Dark Web marketplaces, forums, and public repositories.

The INTIME tool in Koloveas et al.'s work [32] provides a framework to identify and analyze cyber threats in specific cybersecurity topics (Internet of Things (IoT)) to share knowledge among cybersecurity organizations. Their approach integrates information extracted from Surface, Deep, and Dark Web resources, including websites, forums, marketplaces, social networks, and databases published by security agencies to enrich the gained data. The tool performs several tasks such as data collection, usefulness ranking, identifying cyber threats, linking different acquired information, and sharing the resulting cyber threat intelligence. These tasks utilize specific crawlers, social network monitors, NLP techniques, ML methods, named-entity recognition, and semantic correlations and similarity.

4.2.2. Analyzing Hacker Behavior and Detecting Key Actors

Samtani et al. [33] introduced a CTI framework that focuses on analyzing hackers' assets within forum discussions. The framework utilizes classification and TM over the disseminated hacking tools. Furthermore, it takes advantage of the available metadata and post contents to explore trending tools. The framework applies bipartite SNA to detect key hackers in the communities. The constructed bipartite networks represent the relationship between hackers and threads of specific asset types, eventually surfacing key hackers for each extracted topic.
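
The bipartite construction and its projection onto authors can be sketched as follows (a generic illustration, not the authors' pipeline; the author and thread names are invented): two hackers become linked, with a weight, for every thread both posted in.

```python
from collections import defaultdict
from itertools import combinations

def project_authors(author_threads):
    """Project a bipartite author-thread graph onto authors: the weight
    on a pair of authors is the number of threads they share."""
    thread_authors = defaultdict(set)
    for author, threads in author_threads.items():
        for t in threads:
            thread_authors[t].add(author)
    weights = defaultdict(int)
    for authors in thread_authors.values():
        for a, b in combinations(sorted(authors), 2):
            weights[(a, b)] += 1
    return dict(weights)

participation = {
    "h1": {"keylogger", "botnet"},
    "h2": {"keylogger"},
    "h3": {"botnet", "keylogger"},
}
print(project_authors(participation))
```

In the projected network, high-weight, high-degree nodes (here h1 and h3, who share two threads) are the candidates for key hackers per topic.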

Focusing on mobile malware and key hackers, Grisham et al. [34] introduced a proactive CTI tool to identify mobile malware attachments and their key authors from Dark Web hacker forums in different languages. They employed text classification methods to detect malware using neural network architecture and recurrent neural networks while applying SNA to identify key authors. The tool extracts the textual features from the subforum name, thread title, post content, and the attachment name to classify the malware. Furthermore, it identifies trending and common malware by distributing the discovered malware over posting date. On the actors’ side, they constructed a bipartite author-thread network, projecting it with the authors’ network to infer hackers’ co-occurrences in a unified network.

Pastrana et al. [35] pursue two objectives in their approach: preventing young people from drifting into cybercrime and helping security agencies take suitable actions in response to cyberattacks. They employ several procedures to understand the behavioral pathways of cybercriminals. The system identifies key actors by modeling their shared characteristics based on their forum activities, analyzes the temporal evolution of their interests and knowledge, and predicts the probability of some users becoming future key actors. The proposed system utilizes ML, NLP, SNA, and TM. Furthermore, a directed network of replies and citations represents the relationships between actors.

Biswas et al. [36] studied hacker behavior in hacker forums to identify significant predictors that detect key hackers or leaders. They define a hacker community as a community of practice or knowledge community of practice where each member plays a role. Therefore, they employed text mining and sentiment analysis techniques to generate predictors and construct a hacker-role classification model. The study addressed eleven hypotheses and proved the high significance of four: (1) discussion threads to determine hacker expertise, (2) the number of messages posted to classify hackers based on their meritocracy, (3) the number of responses in each thread to classify hackers based on their expertise, and (4) average message size to classify hacker role. Four hypotheses have possible connections with hacker expertise, whereas the other three were insignificant.

Marin et al. [37] introduced an approach to identify key actors and discover unique features of cybercriminals. They employed CA over topics and their authors, SNA to construct interaction graphs and detect communities, and seniority analysis to identify hacker coverage and involvement over time. They utilized the three techniques separately and combined, and validated their results against reputation systems within hacker forums as a ground-truth dataset. The approach aims to show how a model learned from one forum can be generalized to identify key actors in other forums.

Marin et al. [38] proposed an approach to predict future posts of hackers by analyzing their adoption behavior, that is, how hacking community members adopt the posting topics of influential hackers and post in the same direction. They employ sequential rule mining to discover members' posting rules from their sequences of posts within defined time windows (day and hour) and then use these rules to predict near-future posts.
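
A minimal sketch of the sequential-rule idea (a toy version, not the authors' algorithm; the topics, timestamps, window, and confidence threshold are illustrative) derives rules of the form "topic A is followed by topic B within a time window":

```python
def sequential_rules(post_log, window, min_conf=0.6):
    """Derive simple sequential rules 'A -> B within `window` time units'
    from time-stamped (time, topic) posts, keeping rules whose
    follow-up count relative to the antecedent count passes min_conf."""
    post_log = sorted(post_log)
    antecedent_counts, rule_counts = {}, {}
    for i, (t_a, topic_a) in enumerate(post_log):
        antecedent_counts[topic_a] = antecedent_counts.get(topic_a, 0) + 1
        for t_b, topic_b in post_log[i + 1:]:
            if t_b - t_a > window:
                break
            if topic_b != topic_a:
                key = (topic_a, topic_b)
                rule_counts[key] = rule_counts.get(key, 0) + 1
    return {rule: c / antecedent_counts[rule[0]]
            for rule, c in rule_counts.items()
            if c / antecedent_counts[rule[0]] >= min_conf}

log = [(1, "exploit"), (2, "malware"), (5, "exploit"), (6, "malware")]
print(sequential_rules(log, window=2))  # {('exploit', 'malware'): 1.0}
```

Rules learned this way can then be matched against a member's most recent posts to predict what they are likely to post next.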

In a different aspect, Deb et al. [39] suggested using sentiment analysis to support time series modeling in predicting cyber events, tested on ground-truth events from two organizations. Their approach aims to generate predictive signals from hacker forums by analyzing the sentiments dominating forum posts, for a better understanding of hackers' behavior over time.
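
The signal-generation step can be sketched as a per-day average sentiment score; a real system would use a trained sentiment model, whereas the tiny lexicon and posts below are invented for illustration:

```python
# Toy lexicon; a real system would use a trained sentiment model.
LEXICON = {"easy": 1, "working": 1, "patched": -1, "detected": -1, "failed": -1}

def daily_sentiment(posts):
    """Average a lexicon-based sentiment score per day, producing the
    kind of signal a time series model could consume alongside events."""
    by_day = {}
    for day, text in posts:
        score = sum(LEXICON.get(w, 0) for w in text.lower().split())
        by_day.setdefault(day, []).append(score)
    return {day: sum(s) / len(s) for day, s in sorted(by_day.items())}

posts = [(1, "exploit working and easy"), (1, "target patched"),
         (2, "attack detected and failed")]
print(daily_sentiment(posts))  # {1: 0.5, 2: -2.0}
```

Shifts in such a series (here, the drop from day 1 to day 2) are the features correlated with real-world cyber events.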

Zenebe et al. [40] employed descriptive and predictive analysis tools and ML methods to detect cyber threats proactively and identify key actors in hacker forums on the Dark Web. Their approach focused on extracting trending topics among hackers as the most common threats and influential members as key actors, achieved using IBM Watson Analytics and the WEKA ML tools.

To analyze hackers' behaviors and strategies and then predict near-future attacks, Marin et al. [41] proposed a temporal logical framework that learns rules correlating hacking activities (preconditions) with real-world cyber incidents (postconditions). They used sociopersonal characteristics of the hackers who mention the vulnerabilities (such as hackers' activities, influence, and expertise) and technical attributes of the attacks derived from the mentioned vulnerabilities (such as recent patches or released exploitation scripts).

Sarkar et al. [42] used information from Dark Web forum posts, represented in a graph of replies, to predict real-world cyberattacks on specific organizations. They used classification methods over social network features. The suggested approach focused on the dynamics of discussions to extract valuable patterns about how a single piece of information gains popularity among members in a specific timestamp. These patterns can help identify specialized members or experts. Afterward, they apply time series methods to capture the dynamism of other members' interactions with those experts' discussions and link the extracted patterns with real-world security incidents to predict future attacks.

Huang et al. [43] proposed a hybrid method, HackerRank, which combines CA and SNA to detect key hackers. The approach uses CA to extract the topic preferences of forum members, and then uses SNA to construct a network representing relationships among members and identify key hackers. HackerRank evaluates members' rankings and extracts the members with the highest ranks as key hackers. The method constructs the social network graph based on members' interactions (the replies) and evaluates the activeness of a user by the number of posts and replies.
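
As a generic illustration of ranking members on a reply network (plain PageRank over a toy graph, not HackerRank's actual scoring; the members and edges are invented), one can score centrality as follows:

```python
def pagerank(edges, d=0.85, iters=50):
    """Plain PageRank over a directed reply graph (edge u -> v means
    u replied to v); higher scores indicate more central members."""
    nodes = {n for e in edges for n in e}
    out = {n: [v for u, v in edges if u == n] for n in nodes}
    rank = {n: 1 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - d) / len(nodes) for n in nodes}
        for u in nodes:
            targets = out[u] or list(nodes)  # dangling nodes spread evenly
            share = d * rank[u] / len(targets)
            for v in targets:
                new[v] += share
        rank = new
    return rank

replies = [("a", "c"), ("b", "c"), ("c", "a")]
scores = pagerank(replies)
print(max(scores, key=scores.get))  # the member most replied to overall
```

Here member "c", who receives replies from both "a" and "b", comes out on top; a hybrid method would weight such scores with content-derived topic preferences and activity counts.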

4.2.3. Performance and Optimization

Deliu et al. [44] presented a comparative study of ML methods employed for detecting cyber threats from hacker forums. The study compares the performance of a convolutional neural network (CNN) and a support vector machine (SVM). The comparison includes traditional classifiers with Bag-of-Words and N-gram features, while the CNN is applied with several feature vectors representing forum posts as sequences of words, using Word Embeddings to preserve the semantics of the words. The approach focuses on filtering out irrelevant posts using the optimized classifier with the highest accuracy for more accurate CTI results.
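
The traditional feature side of such a comparison can be sketched as follows (a minimal illustration of unigram-plus-bigram Bag-of-Words extraction, not the study's pipeline; the input text is invented):

```python
def ngram_bow(text, n_values=(1, 2)):
    """Build a Bag-of-Words feature dict over unigrams and bigrams,
    the sparse features typically fed to a traditional classifier
    such as an SVM."""
    tokens = text.lower().split()
    features = {}
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            features[gram] = features.get(gram, 0) + 1
    return features

print(ngram_bow("sql injection tutorial"))
```

Each post becomes such a count dictionary; the CNN alternative instead keeps the token sequence and maps each token to a dense embedding vector.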

In another work, Deliu et al. [45] introduced a reduction approach to optimize the TM process of detecting cyber threats from hacker forums. They employed three reduction approaches: (1) classification to filter out irrelevant topics, (2) reducing the vocabulary size, and (3) reducing the number of topics.

Koloveas et al. [46] used classification and language modeling methods to support the crawling tasks by representing the collected information in a latent low-dimensional feature space, to analyze the content relevant to a specific hacking topic (IoT in the proposed study).

Queiroz et al. [47] proposed an approach to enhance classification models using language models for feature representation. They employed Word Embeddings (WEMB) and Sentence Embedding (SEMB) techniques to find semantic contextual properties of words and sentences, to detect cyber threats and posts related to vulnerabilities in forums and social networks in Surface, Deep, and Dark Webs.

Johnsen and Franke [48] presented two essential suggestions: using OSINT for proactive cyberattacks detection and applying extensive preprocessing on the document corpus iteratively before applying TM. These operations help to extract more coherent and focused topics.

A deep learning approach by Ebrahimi et al. [49] used a semisupervised labeling methodology to reduce the manual labeling of training data. The system takes advantage of the lexical and structural characteristics of Dark Web marketplaces to increase classification performance while automatically detecting cyber threats from hacking listings. The approach uses both Transductive Learning (TSVM) and Deep Bidirectional LSTM networks to identify threats.

4.2.4. Language Variations

The approach of Nunes et al. [50] integrated hackers' discussions on forums and their offerings on Dark Web marketplaces for proactive detection of cyberattack targets, considering different languages. They categorized the targets into three main domains: platforms, vendors, and products. Their tool employs knowledge reasoning techniques that combine Defeasible Logic Programming (DeLP) with ML classifiers to reduce the set of classification labels and obtain more focused results.

By transferring knowledge from English Dark Web marketplaces to non-English ones, Ebrahimi et al. [51] proposed an approach to detect cyber threats in non-English marketplaces without the need for mono- or bilingual word embeddings or automatic translation. The system utilizes Deep Cross-Lingual Modeling that simultaneously learns common representations from two languages (English and Russian in the study). It trains a Bidirectional Long Short-Term Memory (BiLSTM) network shared between English and Russian marketplaces by integrating labeled data from English markets with the limited labeled data from the Russian market. The goal of this knowledge transfer is to reduce both false positives and false negatives.

BlackWidow, a tool proposed by Schäfer et al. [52], discovers features shared among platforms on both the Dark Web and the Deep Web to help anticipate future cybersecurity problems. The tool analyzes and compares forums in different languages to find cross-relationships, trending subjects, and key authors in multilingual content. Eventually, it constructs a knowledge graph of threads, actors, messages, and topics and the relationships among these four types of nodes. The generated network defines relationships among users by their replies to each other and identifies trending topics by applying time-series analysis.
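A knowledge graph of this kind can be stored as typed nodes connected by labeled edges. The sketch below is only in the spirit of BlackWidow's four node types; the triple representation, relation names, and example data are our own illustrative choices, not the tool's actual implementation.

```python
# Knowledge graph as a set of (source, relation, target) triples,
# where each node is a (type, identifier) pair.
graph = set()

def add_edge(src, relation, dst):
    graph.add((src, relation, dst))

# Example: an actor posts a message in a thread about a topic,
# and another actor replies (hypothetical data).
add_edge(("actor", "alice"), "posted", ("message", "m1"))
add_edge(("message", "m1"), "in_thread", ("thread", "t1"))
add_edge(("message", "m1"), "mentions", ("topic", "zero-day"))
add_edge(("actor", "bob"), "replied_to", ("actor", "alice"))

def neighbors(node, relation):
    """All targets reachable from `node` via `relation`."""
    return {dst for src, rel, dst in graph if src == node and rel == relation}
```

Queries over such triples (e.g., which actors mention a trending topic) are the basis for identifying key authors and cross-platform relationships.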

In another work, Ebrahimi et al. [53] argued that translating non-English texts into English causes a loss of semantics, thereby degrading classification results. They presented an approach to detect cyber threats in untranslated non-English hacker forums, preserving the original semantics and producing an integrated cross-language knowledge representation of cyber threats from multiple sources in different languages. They proposed the Adversarial CLKT (A-CLKT) approach, which builds on Long Short-Term Memory (LSTM) networks, Cross-Lingual Knowledge Transfer (CLKT), and Generative Adversarial Network (GAN) principles.

4.2.5. The Role of Dark Web Marketplaces

Dong et al. [54] introduced a lightweight framework for detecting new cyber threats emerging from Dark Web marketplaces and new releases of already existing threats. The framework applies classification and text mining to the titles and descriptions of offered items to identify new terms that may represent new vulnerabilities or newly released malware. Furthermore, the framework generates warnings from the discovered terms with the associated properties such as vendor name, release date, and keywords.
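The core idea of flagging previously unseen terms can be sketched as a set difference between a listing's text and a known-term vocabulary. This is a deliberately lightweight illustration; the field names (`vendor`, `release_date`) and the example data are hypothetical, and the actual framework of [54] applies trained classifiers and text mining rather than a plain set difference.

```python
def flag_new_terms(listing, known_terms):
    """Flag terms in a listing's title/description that were not seen
    before; new terms may indicate a newly released threat."""
    text = f"{listing['title']} {listing['description']}".lower()
    new_terms = sorted(set(text.split()) - known_terms)
    if not new_terms:
        return None  # nothing novel in this listing
    return {
        "vendor": listing["vendor"],
        "release_date": listing["release_date"],
        "keywords": new_terms,
    }

known = {"ransomware", "botnet", "keylogger", "for", "sale"}
listing = {"title": "cryptolocker2 for sale", "description": "ransomware",
           "vendor": "v123", "release_date": "2021-05-01"}
warning = flag_new_terms(listing, known)
```

Each returned warning bundles the novel keywords with the listing's properties, matching the kind of alert the framework generates.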

Marin et al. [55] presented an approach to detect communities of malware and exploit vendors on Dark Web marketplaces. They addressed the hypothesis that vendors with high similarities will form a community in the real world. The approach employed ML and SNA to show how multiplex social ties play an essential role in detecting and validating such communities. For cross-validation of the detected communities, they divided a group of marketplaces into two sets and calculated similarities among vendors according to the number of shared product categories and the corresponding number of products in each shared category. Consequently, they generated two bipartite networks, a vendor-product network and a vendor-category network. Eventually, they projected the generated networks to create a monopartite network of correlated vendors.
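The projection step can be illustrated directly: starting from a vendor-category bipartite network, connect every pair of vendors that share at least one category, weighting the edge by the number of shared categories. The vendor names and categories below are toy data, and the weighting is a simplified stand-in for the similarity measures used in [55].

```python
from itertools import combinations

# Toy vendor -> offered-category map (one side of a bipartite network).
vendor_categories = {
    "v1": {"exploits", "malware"},
    "v2": {"exploits", "malware", "accounts"},
    "v3": {"accounts"},
}

def project_to_vendor_network(bipartite, min_shared=1):
    """Project the vendor-category bipartite network onto a monopartite
    vendor network; edge weight = number of shared categories."""
    edges = {}
    for a, b in combinations(sorted(bipartite), 2):
        shared = len(bipartite[a] & bipartite[b])
        if shared >= min_shared:
            edges[(a, b)] = shared
    return edges

network = project_to_vendor_network(vendor_categories)
```

Community-detection algorithms can then be run on the resulting weighted vendor network to find groups of highly similar vendors.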

Table 1 summarizes the reviewed studies according to their goals, proposed approaches, utilized methods and tools, case studies, results, and possible limitations (for readability, we place the limitations below each research row). Table 2 demonstrates the topics covered by each study.

5. Challenges and Ethical Concerns

In the linguistics domain, Ferguson [56] addressed several significant challenges in studying Dark Web content:
(1) Inconsistency of the language used in communications between community members and in forum discussions; this inconsistency is intentional in Dark Web communities, serving as an anonymity measure.
(2) Weak grammatical, spelling, and idiomatic context (also intentional).
(3) Individuals deliberately avoiding particular terms or using them only in specific cases and ways.
(4) The cultural dynamics of Dark Web communities: members come from all over the world and thus do not follow standard terminology or a normative cultural context when contributing to the community.

Similarly, Queiroz and Keegan [4] indicated that hackers use constantly changing and evolving technical terms that carry semantic differences, in addition to abbreviations and misspellings, which require frequent updates to the analysis model to keep pace with these changes. Moreover, this urges researchers to adopt a different modeling approach for each social network; in other words, a model developed for one network may not perform similarly on another due to changes in terminology. In another work, Queiroz et al. [57] attributed the resulting “Concept Drift” to these changes in hackers' terms. Furthermore, they introduced an approach to overcome this drift by updating and retraining the model with temporal features and weighting.
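One simple form of temporal weighting is to discount training samples by age so that recent hacker jargon dominates retraining. The exponential half-life scheme below is our own illustrative choice, not necessarily the weighting used in [57].

```python
def temporal_weights(sample_ages_days, half_life_days=90.0):
    """Weight training samples by recency: a sample's weight halves
    every `half_life_days`, so newer terminology dominates retraining."""
    return [0.5 ** (age / half_life_days) for age in sample_ages_days]

# Samples collected today, 90 days ago, and 180 days ago.
weights = temporal_weights([0, 90, 180])  # [1.0, 0.5, 0.25]
```

Most classifier implementations accept such per-sample weights at fit time, making periodic retraining against drift straightforward.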

Queiroz and Keegan [4] added two more challenges in the CTI field. One is the lack of ground-truth datasets that researchers need to evaluate their modeling approaches and validate their results. The second is the ethical considerations when dealing with the data. Unlike common social media platforms (such as Facebook and Twitter), hacking forums and chat rooms have no explicit agreement informing users that their data may be used by third parties (such as researchers). Additionally, the sheer volume of data makes it difficult to obtain explicit consent for the use of participants' data in research. These considerations call for researchers to make careful decisions about how to use the acquired data.

Regarding the technical particularity of the Dark Web, Akhgar et al. [58] addressed the following challenges:
(1) The nature of the web in general: the web consists of different types of media besides textual data, most commonly images, video, and audio.
(2) The published multimedia is in different languages, colloquialisms, or accents, using different terminologies.
(3) The complexity of accessing criminals' social networks and closed groups: investigators often need to wait several weeks before obtaining approval to join these networks. Moreover, they need to make their profiles look authentic, and their stories sound realistic and believable, to the administrators of the websites under study.

Due to the technical nature of the Dark Web, developing crawlers to collect and analyze the required data can be complicated. Furthermore, researchers must take effective precautionary measures, since the techniques and tools they employ are themselves at risk of being disclosed and are vulnerable to cyberattacks [6].

In particular, Pastrana et al. [59] discussed the ethical issues of collecting and analyzing data from underground forums. Ethical considerations require research studies involving human participants to be reviewed by a Research Ethics Board (REB). The importance of such reviews lies in considering the potential harm, how to reduce or avoid its consequences, and how to protect the researchers from possible liabilities.

Moreover, Pastrana et al. [59] differentiate the ethical issues of collecting data from those of analyzing it, a separation justified by the nature of each process. Collecting the data aims to understand the forum's behavior as a computer system, whereas analyzing the data involves understanding the human beings behind it. In the former, researchers should consider technical risks such as breaking the platform's terms of service or circumventing crawling-prevention measures like CAPTCHAs. They suggest that if the benefits surpass the potential harms, it is ethically reasonable to break such measures. On the other hand, researchers using Tor for research purposes cannot avoid their device itself becoming a relay on the network.

Researchers can consider several measures to mitigate potential harm [59]:
(1) Avoiding the identification of individuals (such as publishing their usernames)
(2) Presenting the results objectively
(3) Avoiding the disclosure of sensitive personal data (such as victims' credit card numbers)
(4) Protecting the researcher: for example, by avoiding comments that offend the community and taking precautions not to download malicious content that can cause security or legal issues, such as malware, child pornography, or terrorist materials
(5) Hiding the name of the platform from which the data were collected and analyzed
(6) Taking caution when analyzing leaked data, as it can include private messages, e-mail addresses, IP addresses, and exclusive posts

6. Discussion and Future Directions

Leukfeldt et al. [5] found that forums play a significant role in originating cybercriminal networks. Forums help individuals or small groups find colleagues for collaboration and encourage the growth of the networks. Therefore, it is advantageous to analyze Dark Web forums to discover how cybercriminals’ networks originate and grow, and understand the factors that attract individuals and groups to be active on these forums [5].

In addition to SNA, researchers need to study the types of criminals who communicate with each other on Dark Web platforms, their levels of technical expertise, and the number of attacks generated in discussions. Furthermore, it is imperative to understand whether members' participation in forums is merely out of curiosity, whether they are petty thieves, or whether they form a professional network that systematically carries out organized crime against various organizations [5].

Vilić [60] indicated another type of crime that can be categorized under cyber threats but on an extensively wide scale: Cyber Terrorism. Cyber Terrorism can take many forms, including logic bombs, Trojan horses, worms, viruses, and other cyberattacks. Such attacks target large systems of critical institutions in a country (such as air forces, transportation, and hospitals), causing these systems to shut down, malfunction, or lose information. Vilić [60] discussed the diverse definitions of Cyber Terrorism and its various goals and techniques, all of which can lead to future research in several directions. As with CTI, these directions include disclosing criminals' identities and their supporting entities, and acquiring information related to the attacks, such as tools, techniques, methods, targets, motivations, and when the attacks will occur.

One challenging aspect of Dark Web analysis, which has seen limited research, is analyzing the encrypted messages exchanged among forum members. Future CTI tools can benefit considerably from identifying members who use these encrypted means of communication and from the content of the messages, which may contain extremely vital information about future cyberattacks [60].

Moreover, future directions will involve extensive employment of Social Network Analysis, Content Analysis, Link Analysis, and Sentiment Analysis conducted on various platforms on the Dark Web. These techniques help understand attacks and attackers by identifying criminals and detecting attack patterns, leading organizations to take the proper proactive measures against them [60].

From the linguistics perspective, few studies have analyzed Dark Web content in different languages. Notably, not all Dark Web content is in English; on the contrary, many other languages are heavily used on Dark Web platforms, either singly or in multilingual combinations. This aspect needs further research.

Cyber Threat Intelligence and Cyber Terrorism detection can leverage an integrated analysis of the virtual criminal environment and the physical, conventional crime world. Such studies can lead to identifying the geographical locations of attackers, as researchers suggest that some criminal networks may originate in the physical world before moving into the cyber world [5].

An emerging area of research is how to exchange CTI among security organizations via secure channels to extend the level of protection and responses against cyberattacks. However, cyberattacks are rapidly becoming more complex, more extensive, and more effective due to the wide variety of methods, technologies, and platforms used by cyber threat actors. Therefore, the CTI domain needs constant developments, particularly implementing appropriate real-time procedures, to keep pace with the level of attacks and threats [21].

In their review, Samtani et al. [13] suggested four directions for future development in the CTI industry:
(1) A genuine shift from developing reactive CTI tools to developing proactive OSINT-based CTI platforms
(2) Sufficient adoption of AI and ML techniques, such as NLP, text mining, TM, ontology development, named-entity recognition (NER), and diachronic linguistics
(3) The extensive use of optimized DM methods
(4) The integration of Big Data and Cloud Computing technologies: Big Data tools help to extract features, reduce the feature space, and improve the performance of DM methods, while Cloud Computing enables organizations to increase their capabilities by extending their operating environment across the Cloud

More specifically, ontology techniques hold particular promise for the future of CTI. A multilayer CTI ontology can integrate formal definitions and lexicons, representing the abstract layer of CTI with defined constraints for the proper utilization of Web Ontology Language (OWL) capabilities [2].

On a different note, Saalbach addressed the process of Attribution, defining it as “the identification of the origin of a cyberattack” [61], which also implies the identity of the attacker (an individual or an organization) and spans both the digital and physical worlds. In this context, Saalbach suggested integrating both cyber and conventional intelligence against cyberattacks and their actors; therefore, cooperation among organizations of different specialties is essential for successful attribution [61]. Moreover, matching the identities of cyberattack actors across several platforms requires further research [6].

Finally, as this review has shown, social relations among cybercriminals play a key role in executing large-scale cyberattacks and achieving shared objectives (financial or nonfinancial). Such cooperation represents a type of organized crime, or Crime-as-a-Service (CaaS). Therefore, it is advantageous to integrate computational modeling with social modeling to understand how these communities arise, develop, and grow, and eventually how they plan and perform organized attacks. Applying a Sociological Model of the Organizational Development of criminal networks helps to understand their structure, levels of profession or roles, their evolution over time, and their objectives [19].

7. Conclusion

With the rapid increase in the quantity and complexity of cyber threats emerging from different parts of the Internet, organizations increasingly consider Cyber Threat Intelligence (CTI) one of the vital systems of their operational existence. CTI leverages multiple information sources and produces valuable insights, analytics, and knowledge for decision-makers to take proper actions against cyber threats. One of the most crucial sources is the Dark Web, which is attracting growing interest from researchers due to its richness of information related to cyber threats, presented by cybercriminals on different types of platforms such as forums (discussions, tutorials, and assets) and marketplaces (offered products and services).

In this review, we discussed the particularity of the Dark Web as an information source for effective CTI through several state-of-the-art studies, whether they investigate the Dark Web solely or combine it with other sources such as the Surface Web, the Deep Web, or information shared by cybersecurity institutions. We compared their goals, approaches, methods and tools, case studies, results, and possible limitations to help future researchers acquire the necessary background on CTI and the Dark Web in particular and identify gaps that need further research. Furthermore, we discussed the critical challenges, ethical considerations, and future directions in this specific domain.

CTI in the future may, or should, see greater engagement of artificial intelligence, machine learning, language processing, and ontology techniques to respond proactively and promptly to relentlessly evolving cyber threats while achieving high standards of accuracy, effectiveness, and efficiency. Although these countermeasures cannot completely eradicate the malicious parts of the web, they can extensively alleviate the severe effects of the threats lurking in those parts.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.