Abstract

Are nearby places (e.g., cities) described by related words? In this article, we transfer this research question from the field of lexical encoding of geographic information to the level of intertextuality. To this end, we explore Volunteered Geographic Information (VGI) to model texts addressing places at the level of cities or regions with the help of so-called topic networks. This is done to examine how language encodes and networks geographic information on the aboutness level of texts. Our hypothesis is that the networked thematizations of places are similar, regardless of their distances and the underlying communities of authors. To investigate this, we introduce Multiplex Topic Networks (MTN), which we automatically derive from Linguistic Multilayer Networks (LMN) as a novel model, especially of thematic networking in text corpora. Our study shows a Zipfian organization of the thematic universe in which geographical places (especially cities) are located in online communication. We interpret this finding in the context of cognitive maps, a notion which we extend by so-called thematic maps. According to our interpretation of this finding, the organization of thematic maps as part of cognitive maps results from a tendency of authors to generate shareable content that ensures the continued existence of the underlying media. We test our hypothesis using the example of special wikis and extracts of Wikipedia. In this way, we come to the conclusion that geographical places, whether close to each other or not, are located in neighboring semantic places that span similar subnetworks in the topic universe.

1. Introduction

In this article, we explore crowd-sourced resources for automatically characterizing geographical places with the help of so-called topic networks. Our goal is to model the thematic structure of corpora of natural language texts that are about certain places seen as thematic frames. This is done in order to automatically compare the thematic structures of corpora of texts about these places, which will be represented as topic networks. In this way, we want to investigate the regularity or systematicity according to which geographical objects (i.e., cities and regions) are dealt with, especially in online communication.

Our work relates to what is described by Crooks et al. [1] as a novel paradigm of modeling “urban morphologies.” We not only add special wikis such as regional and city wikis as candidates to the resources listed in [1] but also introduce a novel method for modeling their content. This concerns local media of collaborative writing about places (cf. [2]), which contain everyday place descriptions [3] authored and networked according to the wiki principle. The corresponding wikis and the subgraphs of Wikipedia that we additionally analyze manifest Volunteered Geographic Information (VGI) [4–6] and thus relate to what is called the wikification of Geographical Information Systems (GIS) [7]. VGI is “completing traditional authoritative geographic information” [8]; it is an information source still “underutilized” in geography [9] and a source of big textual data [8], making natural language processing an indispensable prerequisite for its analysis. According to Hardy et al. [6], authoring VGI has a spatial component in the sense that people likely write about local content, though this also holds for Wikipedia to a minor degree [10]. This spatial component can be accompanied by a lack of quality assurance, which makes VGI susceptible to deficiencies and turns it into a distorted resource of still unknown extent [5]. In any event, the biased coverage of VGI is a characteristic of resources like Wikipedia, so that the same region can be displayed very differently in its various language editions [11], a sort of bias that is typical of user-generated content. Nevertheless, Hahmann and Burghardt [12] show that more than 50% of the articles in the German Wikipedia contain georeferenced data (at least indirectly via links to other articles), so that such media can be regarded as rich resources of VGI. Moreover, Goodchild and Li [5] point to the fact that crowd-sourcing or, more precisely, crowd-curation [13], as enabled by wikis, is a means of quality assurance.

We follow this concept and assume that geographic data, as manifested linguistically in online media, are a valuable resource to investigate how communities form a common sense for addressing places of common interest. In line with Clare ([14], p. 41), we additionally assume that “[a]s people communicate more about a place, social consensus will create increased similarity between and within people’s judgments of it.” However, we also assume that the latter similarity can affect communications of different communities about different places. In this way, we assume a kind of horizontal self-similarity [15] of the thematic structure of online media, which is more or less independent of the underlying theme and the community. That is, our hypothesis on the theming of places is as follows.

Hypothesis 1. Thematizations of different places at a certain level of thematic abstraction tend to be similar to each other (rather than dissimilar) (1) in the sense that they focus on similar topics, (2) in the way these topics are networked, and (3) with respect to the skewness of this focus, regardless of whether the underlying media are generated by different communities and whether these communities address related or unrelated places at near or distant spaces.
The intuition behind Hypothesis 1 is that thematizations of places in web-based communication appear to be thematically redundant: when authors report, for example, on the cities in which they live, they may aim to emphasize the special character of these places. It seems, however, as if a thematic trend is breaking ground that ultimately makes such reports appear thematically very similar. Whether or not this intuition reflects a trend that can actually be observed in the field of wiki-based media is something this study is intended to clarify. From this point of view, it is obvious that Hypothesis 1 is only a starting point which itself needs further clarification in order to be testable: similarity, for example, is a highly context-sensitive attribute [17] that needs further definitional specification in order to be computable. Likewise, the concept of thematization (theme or topic)—a concept which according to Adamzik [18] has so far found comparatively little attention in linguistics—is not yet specified in Hypothesis 1. Thus, an appropriate elaboration and concretization of Hypothesis 1 is one of the main tasks of the present paper. To this end, we develop a generic topic network model in conjunction with a measurement procedure that specifies both the notion of similarity (which will be defined in terms of the graph similarity of topic networks) and that of the thematization of places (which will be defined in terms of topic labeling and topic networking). This topic network model will allow Hypothesis 1 to be reformulated and concretized in the form of variants (i.e., Hypotheses 2–4), which will be presented in Section 3.2.7 and whose formulations presuppose the topic network model that this paper develops in the preceding sections.
The skewness mentioned in Hypothesis 1 reminds one of a Zipfian process, according to which a few topics dominate, while the majority of candidate topics are underrepresented or disregarded. Therefore, we speak of Zipfian thematic universes, which are spanned by the thematization of the same places in online media such as the special wikis studied here. By the term topic, we refer to the notion of aboutness of texts [18, 19]. From a linguistic point of view, the terminology of Hypothesis 1 may seem confusing when referring to places as what is given and with topic to what is said about these places. The reason is that linguistics distinguishes between what is given (theme or topic) and what is said about it (rheme, comment, or focus) in a given piece of text [18, 20–22]: a mention of a city like Vienna, for example, can be connected with certain subtopics (e.g., classical music), which characterize this place rhematically by providing new information about it. The latter distinction is meant when we relate subtopics in the role of rhemes to places in the role of topics in the linguistic sense. Thus, when talking about topics as part of a computational model, we will use the term topic (topic2), while when talking about places as topics in the linguistic sense (topic1), we will use the term theme and speak of its rhemes as its subtopics modeled by topics (topic2) as units of our model. This scenario and its relation to Hypothesis 1 are depicted in Figure 1. It shows a generalization of a hypothesis of Louwerse and Zwaan [16] according to which language encodes geographical information: the places p and q, which are understood as conceptual units (i.e., mental models), are described by or expressed in two discourse units (texts, dialogs, etc.) x and y. From the latter units, the topic representations α and β are derived by means of a computational model (e.g., Latent Dirichlet Allocation (LDA) [23] or the topic network model introduced in Section 3). While such derived topics are part of the computational model, the underlying discourses belong to the modeled system. We assume that each of the conceptual units p and q is structured into a system of networked rhemes or subtopics. Ideally, the derived topic α in Figure 1 is a valid model of one of the rhemes of place p, and β of one of the rhemes of place q. If we assume now that p and q are conceptually related (e.g., similar) to each other, then the linguistic encoding hypothesis implies that this is possibly reflected by a relatedness (e.g., similarity) relation among some rhemes of these places. From the point of view of modeling, this relation is ideally mapped by the relatedness (e.g., similarity) of the derived topics α and β. We assume that conceptual relations between places can be parallelized by relations of physical proximity or distance between the spaces that are mentally modeled by these places. If one additionally assumes that proximity in space correlates with relatedness in conceptual space (the less distant, the more similar, for example), one obtains a linguistic variant of Tobler’s so-called first law (see Section 2).
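For illustration, the following minimal sketch derives the topic representations α and β of two places with LDA and compares them by cosine similarity, as sketched in Figure 1. It is not the topic network model developed in Section 3; all texts, place names, and parameter choices are invented toy data.

```python
# Minimal sketch: derive topic profiles alpha and beta for two place corpora
# with LDA and compare them (toy data, not the model of Section 3).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

texts_p = ["vienna classical music opera concert hall",
           "vienna coffee house culture and music festivals"]
texts_q = ["salzburg mozart festival classical music stage",
           "salzburg old town baroque architecture and concerts"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(texts_p + texts_q)      # shared vocabulary
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(bow)

# Aggregate document-topic distributions per place (mean over its texts).
doc_topics = lda.transform(bow)
alpha = doc_topics[:len(texts_p)].mean(axis=0)         # topic profile of place p
beta = doc_topics[len(texts_p):].mean(axis=0)          # topic profile of place q

cosine = alpha @ beta / (np.linalg.norm(alpha) * np.linalg.norm(beta))
print(f"topic-based relatedness of p and q: {cosine:.3f}")
```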
If we look at the literature (see Section 2), we find that the approaches in this area differ in terms of the linguistic level at which they observe the linguistic encoding of platial [13] relations: for example, at the level of intertextually linked texts, at the level of the topics these texts are about, or at the level of lexical elements used by these and other texts to deal with the latter topics. In lexical variants of this approach, the places p and q, which we assume to be conceptually related, are preferably referred to or described by means of lexical items (see Figure 1) of the underlying lexis that are syntagmatically or paradigmatically associated. From the point of view of modeling, we then have to assume two types (as models of these words) for which we automatically detect, for example, their (paradigmatic) closeness in semantic space (cf. [24, 25]) or the similarity of their (syntagmatic) co-occurrence statistics (cf. [26]).
From this analysis, we obtain a series of reference points or means for encoding geographical information about conceptual relations (see [1] in Figure 1) of places. This concerns more precisely a series of possible parallelizations of such relations, which may ultimately be parallelized by relations between the spaces designated by these places (for the numbers in brackets, see Figure 1): at the level of the modeled system, this refers to thematically linked rhemes, intertextually linked discourse units (e.g., texts), and syntagmatically or paradigmatically linked words ([1]). From a modeling point of view, we distinguish the statistical relatedness of types or of topics as candidate parallelizations ([1]). Beyond that, we find the parallelization of the relatedness of rhemes and words on the one hand and of types and topics on the other ([2], [3]), as well as that of the relatedness of words on the one hand and of types on the other ([4]). The parallelization of the relatedness of rhemes of the same place ([0]) by the relatedness of the rhemes of another place concerns the core of our network approach. Such relations among rhemes constitute rhematic networks or networks of rhemes on both sides of the affected places. Our main assumption is now that any such rhematic network, which manifests the thematic structure of a place, can be related as a whole to that of another place. In doing so, it is, from a modeling point of view, ideally parallelized by the structural relatedness (e.g., similarity or complementarity) of topic networks, which are derived from corpora of texts, each of which describes one of these places ([5]). This type of parallelization affects entire networks of linguistic objects and yet offers a means of encoding the conceptual relationship of places ([1]) or the proximity of spaces, respectively. In the present paper, we explore relations of Type [5] in order to learn about the encoding of geographical information in natural language texts, that is, about relations of Type [1]. To this end, we develop, instantiate, and empirically test a formal model of multiplex topic networks derived from so-called linguistic multilayer networks as a model of relations of Type [5].
From this point of view, Hypothesis 1 means that certain rhemes of places and the structure they span resemble each other, regardless of the quantified distances between the spaces represented by these places and regardless of the fact that the texts in which these rhemes are described are written by different communities. To test this hypothesis, we introduce topic networks to make the networking of topics a research object according to the scenario described in Figure 1, that is, in relation to the hypothesis of the linguistic encoding of geographical information. The contributions of this article are of a theoretical, methodical, and empirical nature:
(1) Formal modeling: we develop a generic, extensible formalism for the representation of topic networks that covers a wide range of informational sources for spanning and weighting topic links. To this end, we introduce the notion of multiplex topic networks derived from so-called multilayer linguistic networks. In this way, we enable the same place to be represented by a family of thematic networks that offer different perspectives on the networking of its rhemes. We exemplify this model by means of two perspectives provided by so-called Text Topic Networks (TTN) and their corresponding Author Topic Networks (ATN).
(2) Procedural modeling: we develop a measurement procedure for instantiating our formal model. To this end, we introduce novel measures of the similarity of labeled graphs that are sensitive to their links and to their nodes.
(3) Experimentation: we further develop the range of baseline statistics in network theory in order to better assess the quality of our measurements. To this end, we test our model by means of a threefold classification experiment that compares a set of TTNs with each other, a set of corresponding ATNs with each other, and the former TTNs with the latter ATNs.
(4) Theory formation: we interpret our findings in the context of cognitive maps, thus building a bridge between our network-theoretical approach and approaches to the cognitive representation of geographical information. We show how to integrate the analysis of entire networks into the research on the linguistic encoding of geographical information (see Figure 1).
This paper is organized as follows: Section 2 discusses related work. Section 3 introduces our formal model of linguistic multilayer networks and the multiplex topic networks derived from them. Section 4 describes our experiments in detail, and Section 5 discusses our findings. Finally, Section 6 concludes and gives an outlook on future work.

2. Related Work

Our work is related to linguistic research on Tobler’s [27] first law (TFL), which says that “[…] everything is related to everything else, but near things are more related than distant things” ([27], p. 236). Due to its underspecification, this so-called law raised many questions about what it means to be related or distant [28]. Accordingly, a range of approaches exist that make different proposals to interpret relatedness also in terms of semantic relatedness. In the context of information visualization, Montello et al. [29] test a variant of TFL called the first law of cognitive geography, which says that “people believe closer things to be more similar than distant things” ([29], p. 317), where spatial distance is referred to for judging the similarity of information objects. This approach is contrasted with a study by Hecht and Moxley [30] who model relations of Wikipedia articles as a function of the probability of being linked in the web graph and find that this probability is related to the geographical distance of toponyms described in the articles. Hecht and Moxley relate their finding to the transitivity of networks by stating that the smaller the geographical distance of nodes, the higher their clustering coefficient ([30], p. 101). This work is extended by Li et al. [31], who calculate semantic relationships of articles instead of hyperlinks and show that TFL holds independently of the geographical domain up to a certain distance threshold. A lexical variant of TFL is mentioned by Yang et al. [32], according to which geographically close words tend to be clustered into the same geographical topics. This phenomenon has earlier been studied by Louwerse et al. (cf. the review in [26]), who reformulated Firth’s famous dictum by saying that “[…] you shall know the physical distance between locations by the lexical company they keep” ([26], p. 1557). This means that the distance of places correlates with syntagmatic associations between the lexical items used to describe them. That is, language encodes geographical information [16] at least regarding the distances of semantically related places. From this perspective, TFL appears to be reformulated as a candidate for a geolinguistic law that is compatible with the more general Symbol Interdependency Hypothesis (SIH) [33]. According to SIH, linguistic information encodes perceptual information so that the former serves as a shortcut to the latter [33]. Finally, a rather text-linguistic variant of TFL is proposed by Adams and McKenzie [34], which states that near places are each described by texts whose topics are more similar than in the case of texts about distant places.
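The Louwerse-style test can be illustrated by a small sketch that correlates pairwise great-circle distances of cities with cosine similarities of their lexical co-occurrence vectors; under TFL, the correlation should be negative. The city coordinates are real, but the co-occurrence counts are invented and do not stem from the cited studies.

```python
# Illustrative sketch of a TFL-style test: correlate physical distance with
# similarity of lexical co-occurrence vectors (toy co-occurrence counts).
import numpy as np
from scipy.stats import spearmanr

coords = {"Berlin": (52.52, 13.40), "Munich": (48.14, 11.58),
          "Hamburg": (53.55, 9.99)}
# Rows: cities; columns: counts of co-occurring context words (invented).
cooc = np.array([[10, 3, 0, 5], [2, 8, 6, 1], [9, 2, 1, 6]], dtype=float)

def haversine(p, q):
    """Great-circle distance in km between two (lat, lon) pairs."""
    (la1, lo1), (la2, lo2) = np.radians(p), np.radians(q)
    d = np.sin((la2-la1)/2)**2 + np.cos(la1)*np.cos(la2)*np.sin((lo2-lo1)/2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(d))

names = list(coords)
dists, sims = [], []
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        dists.append(haversine(coords[names[i]], coords[names[j]]))
        u, v = cooc[i], cooc[j]
        sims.append(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rho, _ = spearmanr(dists, sims)   # TFL predicts a negative correlation
print(f"Spearman rho(distance, similarity) = {rho:.2f}")
```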

In contrast to these approaches, we hypothesize that places, no matter how far apart, have similar topic distributions when their descriptions are transmitted by media such as city and region wikis. If we find evidence for this hypothesis, there are various candidates for explaining it: Firstly, such a finding could indicate a trivial meaning of TFL (cf. [28]) in relation to the topics modeled by us, implying that everything, distant or not, is highly related. Secondly, it could indicate the (in)effectiveness of distances and similarities at different scales: at the level of local, specific topics (within the scope of TFL) and at the level of global, more general topics (outside the scope of TFL). Thirdly, such a finding could indicate a hidden similarity of processes of collaboratively writing wikis about different places, even if the wikis are written by different communities (see Hypothesis 1). In order to decide between these alternatives, we need a new topic model that derives networks of thematic structures at different scales from texts in online media about the same places. This should at least include the networking of topics along relations of intertextuality and coauthorship in order to allow for revealing similarities of the underlying processes of collaborative writing. To this end, we will develop multiplex networks that integrate text- and author-driven topic networks.

So far, most approaches to thematic aspects of places use topic modeling based on Latent Dirichlet Allocation (LDA) to associate topics and texts about geographical units, where topics are represented as sets of thematically related words. An early approach in this regard is described by Mei et al. [35], who model spatiotemporal theme patterns to identify dominant topics in texts that are connected to places. A related approach is proposed by Qiang et al. [36], who aim to detect topics that are “localized” in places. This is done to ground the similarities of places in relations of their thematic representations—a scenario that is omnipresent in linguistically motivated work in the context of TFL (cf. Figure 1). Likewise, Adams and McKenzie [34] extract topic models from travel blogs to detect topics as groups of semantically related words associated with places, so that relations among places can be identified by shared topics. Another example is proposed by Bahrehdar and Purves [37]: instead of documents written by individual authors, they analyze tagging data extracted from image descriptions in Flickr. A hybrid model of topic modeling comes from Yin et al. [38], in which representations of regions are used instead of documents to link topics to places. A related region-topic model that uses regions as topics to map words, sentences, and texts to distributions of regions or to ground them semantically (cf. [39]) is proposed by Speriosu et al. [40]. A promising extension is developed by Gao et al. [41], who aim at detecting higher-level functional regions as semantically coherent areas of interest. To this end, they analyze co-occurrence relations between topics to describe many-to-many relations of locations and urban functions. Another direction is pursued by Lansley and Longley [42], who investigate the location- and time-based distribution of topics on Twitter, setting twenty topics as the target number for LDA. See also Jenkins et al. [13], who utilize a list of six high-level topic categories. One of the largest studies in this context is that of Gao et al. [43], who present an integrative approach to modeling texts from a range of different media such as Wikipedia, Twitter, and Flickr to demarcate cognitive regions [44]. All these approaches start from topic modeling to map natural language texts onto distributions of topics in order to relate the places thematized by these texts (cf. Figure 1).

A prominent precursor of topic models [45] is given by Latent Semantic Analysis (LSA) [46]. Consequently, there are studies in the context of TFL based on this predecessor. Davies [24], for example, interprets the associations of place names computed by LSA from place descriptions as a model of the cognitive representation of the corresponding spaces (cf. [47]). This approach opens up a perspective for measuring biased cognitive representations of spatial systems: according to Davies, her approach provides representations of cognitive geographies that are explored by the associations of semantically close place names in accordance or not with the underlying geographical relations, that is, in accordance or not with TFL (cf. [39]). These and related studies produce interesting results about the localization of topics or vice versa about the thematization of places in texts. However, they mostly disregard topic networking, not to mention the networking of topics viewed from different angles. Although it is easy to derive a network approach from binary relations of topic similarity, relationships that cannot be traced back to sharing similar words are hardly mapped by topic models of the sort considered so far. By generating topic distributions per location, for example, we know nothing about the dynamics of the coauthorship of the underlying texts: in the extreme case, one observes (dis)similarities, which result from the activity of a small number of authors or even only one author—in contrast to the assumed collaboration density of online media such as Wikipedia. Therefore, it is our goal to develop a model of topic networks that simultaneously addresses the dynamics of the coauthorship of the underlying texts. A subtask will be to develop a formal model of thematic networking that is generic enough to integrate a wide range of sources of networking—at least theoretically.

While most of the approaches considered so far ignore aspects of networking, a second branch of research tends to follow the paradigm of network theory. Hu et al. [48], for example, measure the semantic relatedness of cities as nodes of a city network [9] depending on the co-occurrences of city names in news articles. This approach is related to Liu et al. [49], who explore co-occurrences of toponyms to induce city networks that can be used to test predictions associated with TFL. Hu et al. [48] further develop this approach to networking cities by reference to topics of articles in which the corresponding toponyms are observed. They use Labeled LDA [50] to learn to extract topics α from texts to finally determine the α-relative similarity of cities based on the co-occurrences of their names in texts about α. Another approach to city networks using Wikipedia as a data source is proposed by Salvini and Fabrikant [9]: they link cities as a function of the number of articles “co-siting” [51] their Wikipedia articles. A comprehensive perspective on modeling spatial information is developed by Luo et al. [52], who propose a three-part network model that integrates representations of spatial, social, and semantic networks. In this conceptual model, semantics plays the role of interpreting behavior in spatial and social space and thus of bridging them. Although we share this hybridization of the network perspective on spatial information, we strive for a more concrete model that can be empirically tested.

Any such study has to face various aspects of the vagueness [44, 53] or informational uncertainty [5] of concepts of regions [44] and places [13] and especially of the names of such entities [43]. According to Winter and Freksa [54], this includes semantic ambiguity, indeterminacy of spatial extent, or boundary vagueness [43], preference-oriented re-scaling of extent, and the dynamics of salience affected by various dimensions of contrast. Beyond boundary vagueness, Gao et al. [43] speak of the shape and location vagueness by example of cognitive regions. Furthermore, Jenkins et al. [13] refer to the temporal dynamics of places as evolving concepts as a source of uncertainty. From a methodological point of view, this multifaceted uncertainty has two implications: in relation to the model, which should be flexible enough to map these facets, and in relation to the object itself, which could complicate its modeling by unsystematically distorting it.

In accordance with Hu’s study [55], we assume that the thematic perspective complements the spatial and temporal perspective of the study of places. A rheme can be understood as the “content” of a geographical region that expands its dimensionality [44]. This content may be further specified in terms of affordances, functions, or shared conceptual representations associated by members of a community with the corresponding place so that different places can be related by being associated with similar content. This thematic perspective will be at the core of our article. To this end, we follow the approach of Jenkins et al. [13], according to which places are connected with meanings generated by collaborators of crowd-sourcing media such as Wikipedia: their collaboration creates what Jenkins et al. call platial themes, namely, themes that are characteristic for certain places. As shared meanings, these platial themes ultimately create a “collective sense of place,” as it is perceived by the corresponding community. In this context, Jenkins et al. [13] propose to study politics, business, education, recreation, sports, and entertainment as six high-level topics of places. However, by reference to the Dewey Decimal Classification (DDC), we will instead deal with more than six hundred hierarchically organized topics, each of which is manifested by a range of Wikipedia articles. In any event, we have to consider that thematic aspects may distort the conceptualization and perception of spatial objects [43]. A central question then concerns the regularity or systematicity of this distortion in the sense of asking to what extent thematic representations of different places show similar aspects of being biased. This question will be at the core of this article.

3. Multiplex Topic Networks: A Novel Approach to Topic Modeling

In order to study relations of thematic preference in VGI as a manifestation of distributed cognition, we introduce Topic Networks (TNs) as an alternative to Topic Models (TMs) [23, 58, 59]. TMs are based on the idea that texts manifest probabilistic distributions of topics which are represented as probability distributions over the lexical constituents of these texts, where these distributions may be affected by style, the underlying genre, or any other (syntactic, semantic, or pragmatic) criterion of text production [60–62]. Regardless of its success, this model is unsuitable for modeling TNs as manifestations of distributed cognitive maps because of the following problems:
(P1) Corpus specificity: the corpus specificity of TMs impairs comparability and transferability to ever new corpora, since the topic distributions are learned from the input corpora whose topics are to be modeled. This approach apparently cannot use a transferable topic model as a basis for representing the topics of a large number of different corpora.
(P2) Topic labeling: the corpus-specific derivation of topic labels from the input corpora makes it difficult to compare their topic distributions. As reviewed by Herzog et al. [63], external resources can be used for this task. However, there are hardly any such resources for all possible topic combinations—unless one wants to explore an overarching system such as Wikidata, making such a project considerably more difficult due to its size. The labeling problem can be addressed using, for example, Labeled LDA [50], an approach that leads us into the area of supervised classification, which is also followed here.
(P3) Scalability: instead of dealing with corpora of equally large texts, online communication often leads to sparse, tiny texts that sometimes consist of a single sentence, a single phrase, or a single word. Regardless of the size of the text, we need a procedure that determines its topic distribution so that texts of different sizes can be compared using topic models of comparable size. Even if small texts are postprocessed (after topic modeling) in such a way that their topic distributions are derived from their lexical constituents, such an approach would nevertheless mean excluding text snippets from the training process.
(P4) Rare topics: one reason to prefer training by means of corpora as large as Wikipedia is to allow for detecting topics even if they form a kind of thematic hapax legomenon in the corpora to be analyzed. If we try to identify rare topics directly from these corpora, we will probably not detect them, since by definition these corpora do not provide enough information to identify such topics. In any event, the rarity of evidence about a topic should not be an impediment to identifying its occurrences even at the level of single sentences.
(P5) Methodical closedness: instead of deriving all distributions of all dependent and independent variables as part of the same topic model, one possibly wants to include different information sources that are computed by different methods based on diverse computational paradigms (e.g., ontological approaches to measuring sentence similarities, approaches to word embeddings based on neural networks, and topic models). In order to enable this, we look for a methodologically open topic model that allows such different resources to be easily integrated.
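To illustrate the supervised alternative targeted by P1–P4, the following sketch trains a topic classifier once on a small labeled reference corpus and then applies it to unseen snippets down to single words. The training texts and the DDC-like labels are invented, and the classifier is a simple stand-in for the hierarchical classifier assumed later (Definition 2), not the authors' actual system.

```python
# Sketch of the supervised alternative (P1-P4): a classifier trained once on
# a labeled reference corpus and applied to arbitrary, even one-word, inputs.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reference_texts = ["trains buses and urban transit networks",
                   "railway stations and ground transportation",
                   "gothic cathedrals and modern architecture",
                   "building design facades and architects",
                   "rivers mountains and regional geography",
                   "maps travel and the geography of europe"]
reference_labels = ["380", "720", "720", "720", "910", "910"]  # toy DDC codes
reference_labels[0] = "380"; reference_labels[1] = "380"       # two per class

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(reference_texts, reference_labels)

# The label set stays fixed (P2), and even tiny snippets get a score (P3).
for snippet in ["tram", "cathedral facade"]:
    probs = classifier.predict_proba([snippet])[0]
    print(snippet, dict(zip(classifier.classes_, probs.round(2))))
```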

In a nutshell, we are looking for an approach that (i) allows thematic comparisons of previously unforeseen text corpora using an underlying reference corpus, (ii) offers a generic solution to the problem of topic labeling, (iii) is highly scalable and can therefore map even the smallest text snippets to topic distributions, (iv) simultaneously takes rare topics into account, and (v) is methodologically open and expandable. Such a topic network model is now developed in two steps: in Section 3.1, we introduce the underlying formal apparatus. This is done by deriving multiplex topic networks from linguistic multilayer networks. Section 3.2 describes a method by which this model is instantiated as a prerequisite for its empirical testing.

3.1. From Linguistic Multilayer Networks to Multiplex Topic Networks

In this section, we introduce multiplex topic networks. This is a type of network that is based on the idea of deriving the networking of topics of textual units by evaluating evidence from different sources of information such as text vocabulary, higher-level text components, distributed authorship or readership, genre, register, or medium. Since these sources of evidence can be explored in different compositions, this can lead to different perspectives on the salience and networking of the topics addressed by the same texts. Topic networks are multiplex precisely in this respect: the different evidence-providing perspectives may lead to different topic networks that allow comparisons to be made through which differences in the linguistic, social, or otherwise contextual embedding of thematizations become visible. This concept of a multiplex topic network is now formalized generically.

To introduce multiplex topic networks, we start by defining linguistic multilayer networks (Definition 1), whose layeredness allows for distinguishing several (non)linguistic information sources of topic networking. We refer to supervised topic classifiers trained by means of large reference corpora to tackle the challenges P1, P2, P3, and P4. Based thereon, we introduce so-called text topic networks (Definition 3), which evaluate intra- and intertextual relations for the purpose of topic networking. Then, we introduce two-level topic networks (Definition 4) and exemplify them by author (Definition 5) and word topic networks (Definition 6), which explore relations of (co)authorship and lexical relatedness, respectively, as sources of topic networking. These notions are generalized to arrive at n-level topic networks (Definition 7), which are based on multiple informational sources of topic networking (cf. challenge P5). Finally, multiplex topic networks are defined as families of n-level topic networks (Definition 8) representing the networking of the same set of topics from different informational perspectives and thus allowing for mapping the thematic dynamics, for example, of descriptions of the same place.

Definition 1. Let $X = \{x_1, \ldots, x_n\}$ be a corpus of texts and $l \geq 3$. A Linguistic Multilayer Network (LMN) is a tuple (Mehler [57] speaks of multilevel graphs; see Boccaletti et al. [64] for a comprehensive overview of related notions, whose formalism is used here; see Stella et al. [65] for an example of a multiplex network of lexical systems)

$$L = \big(\{L_i = (V_i, A_i)\}_{i=1}^{l},\; \{M_{ij}\}_{i \neq j}\big)$$

of two sets of directed graphs such that the set of kernel layers $\{L_1, \ldots, L_l\}$ consists of a pivotal text layer and several derivative layers, that is, a coauthoring layer, a language-systematic word layer, and possibly several layers modeling the networking of constituents of the pivotal texts:
(1) The pivotal text layer $L_1 = (V_1, A_1)$, also called text network, is spanned by the texts of the corpus such that $V_1 = X$ and $A_1 \subseteq V_1 \times V_1$ manifests intratextual (as in the case of reflexive arcs) or intertextual relations.
(2) The author layer $L_2 = (V_2, A_2)$, also called agent network, is spanned by the network of agents (co)authoring the texts in $V_1$ and their social relations.
(3) The lexicon layer $L_3 = (V_3, A_3)$, also called word network, is spanned by the language-systematic lexical signs (i.e., lexemes and related units) used by the agents of $V_2$ as part of their agent lexica to author the texts in $V_1$.
(4) For $3 < i \leq k$, $L_i$ is called a constituent layer modeling the networking of (e.g., lexical, phrasal, and sentential) constituents of texts such that $A_i$ maps intratextual (e.g., anaphoric) or intertextual (e.g., sentence similarity) relations.
(5) For $k < i \leq l$, $L_i$ is called a contextual layer modeling the networking of units (e.g., media, genres, and registers [66]) of the contextual embedding of texts such that $A_i$ maps, for example, relations of the switching, merging, or embedding [67, 68] of these contextual units.
(6) For each $i, j \in \{1, \ldots, l\}$, $i \neq j$, $M_{ij} = (V_i, V_j, A_{ij})$ is called a margin layer, where $A_{ij} \subseteq V_i \times V_j$.
For $1 \leq i, j \leq l$, $i \neq j$, $\nu_i: V_i \to \mathbb{R}$ and $\nu_{ij}: V_i \cup V_j \to \mathbb{R}$ are vertex weighting functions, $\mu_i: A_i \to \mathbb{R}$ and $\mu_{ij}: A_{ij} \to \mathbb{R}$ are arc weighting functions, $\lambda_i$ and $\lambda_{ij}$ are vertex labeling functions, and $\kappa_i$ and $\kappa_{ij}$ arc labeling functions. We say that the linguistic multilayer network L is spanned over the text corpus X and layered into l layers.

Example 1. To illustrate our definitions, we construct a minimal example. Suppose a corpus of four texts $X = \{x_1, x_2, x_3, x_4\}$, each containing three of the lexemes $a_1$, $a_2$, $a_3$, and $a_4$ (for reasons of simplicity, we exemplify texts as bags-of-words), that is, $x_1 = \{a_1, a_2, a_3\}$, $x_2 = \{a_1, a_2, a_4\}$, and $x_3 = x_4 = \{a_2, a_3, a_4\}$. Further, we assume four authors such that $r_1$ and $r_2$ coauthored $x_1$ and $x_2$, while $r_3$ and $r_4$ coauthored $x_3$ and $x_4$; that is, $V_2 = \{r_1, r_2, r_3, r_4\}$ and $A_2 = \{(r_1, r_2), (r_3, r_4)\}$. Further, we assume that the texts $x_1$ and $x_2$ are linked by some intertextual coherence relation (e.g., by a rhetorical relation, an argument relation, or some hyperlinks), as are the texts $x_3$ and $x_4$, so that $A_1 = \{(x_1, x_2), (x_3, x_4)\}$. Note that additional arcs of the layers will be generated according to the subsequent definitions. For simplicity reasons, we assume all weighting functions to be limited to the set $\{0, 1\}$ of vertex/arc weights. Since we assume no additional constituent layer, we get $l = 3$. Thus, any linguistic multilayer network based on this setting is layered into three layers.
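Under the reconstructed reading of Example 1 given above, the toy LMN can be encoded, for instance, as follows; the layer names mirror Definition 1, and the margin layer linking agents to texts is made explicit. This is a plain illustration of the data structure, not an implementation of the paper's system.

```python
# Toy encoding of the LMN of Example 1 (three layers plus one margin layer).
lmn = {
    "L1_texts": {  # pivotal text layer: bags of words and intertextual arcs
        "vertices": {"x1": {"a1", "a2", "a3"}, "x2": {"a1", "a2", "a4"},
                     "x3": {"a2", "a3", "a4"}, "x4": {"a2", "a3", "a4"}},
        "arcs": {("x1", "x2"), ("x3", "x4")},
    },
    "L2_agents": {  # author layer: coauthorship arcs
        "vertices": {"r1", "r2", "r3", "r4"},
        "arcs": {("r1", "r2"), ("r3", "r4")},
    },
    "L3_lexemes": {"vertices": {"a1", "a2", "a3", "a4"}, "arcs": set()},
    "M21_authorship": {  # margin layer linking agents to the texts they edit
        ("r1", "x1"), ("r1", "x2"), ("r2", "x1"), ("r2", "x2"),
        ("r3", "x3"), ("r3", "x4"), ("r4", "x3"), ("r4", "x4"),
    },
}
# All vertex and arc weights are fixed to 1, as assumed in Example 1.
```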
Throughout this paper, we use the following simplifying notation: for any graph of order , arc set of size and vertex labeling function λ, and any vertex , we write . Thus, for any two graphs with vertex labeling functions and , for which , , we can write . Further, for any function , for which , we use the following alternative notations:Finally, for any function , Z being any set, we introduce the following notation based on square brackets:To leave no room for ambiguity, we assume that expressions of the sort are replaced from left to right into expressions of the sort . Henceforth, a structure such as will be called information link. Based on Definition 1, we start now with introducing text topic networks using the following auxiliary notion.

Definition 2. Let $R = (W, B)$ be a directed Generalized Tree (GT) according to Mehler [69, 70] representing a hierarchical topic structure, henceforth called Reference Classification System (RCS), that is spanned by kernel arcs which are possibly superimposed by upward, downward, lateral, sequential, external, or reflexive arcs. (See Figure 2 for an example of a GT. This notion is required since we may decide to use, for example, the category system of Wikipedia as an RCS, which spans a GT [70].) That is, vertices $t \in W$ represent topics, while kernel arcs $(t, u) \in B$ represent subordination relations according to which u is a thematic specialization of t. Let further θ denote a hierarchical text classifier [71] taking values in W that has been trained, validated, and tested by means of a reference corpus Y. Let now L be an LMN spanned over the text corpus X and layered into l layers. We call the structure

$$\mathcal{D} = (L, R, \theta)$$

a Definitional Setting for defining topic networks.

Example 2. Given the LMN L of Example 1, the Dewey Decimal Classification (see Section 3.2), and the topic classifier θ of [72], which uses the DDC as its Reference Classification System R, a definitional setting is exemplified by $\mathcal{D} = (L, R, \theta)$. More specifically, by $t_1$, $t_2$, and $t_3$ we will denote three topic labels of the third level of the DDC so that $t_1, t_2, t_3 \in W$. Note that by using the DDC as a reference classification, the generalized tree of Definition 2 is reduced to a tree (see Section 3.2 for more details).

Definition 3. Given a definitional setting $\mathcal{D} = (L, R, \theta)$ according to Definition 2, a Text Topic Network (TTN) is a vertex- and arc-weighted simple directed graph

$$T_1(L) = (V, A, \nu, \mu, \lambda, \kappa)$$

with vertex set V and arc set $A \subseteq V \times V$ which is said to be derived from R and inferred from $L_1$ by means of the optional classifier ϑ and the monotonically increasing functions $g_1, g_2: \mathbb{R} \to \mathbb{R}$ if and only if for all $v \in V$ and all $(v, w) \in A$:

$$\nu(v) = g_1\Big(\sum_{x \in X} \theta(\lambda(v), x)\,\vartheta(\lambda(v), x)\Big) > 0, \qquad (6)$$

$$\mu((v, w)) = g_2\Big(\sum_{(x, y) \in A_1} \theta(\lambda(v), x)\,\theta(\lambda(w), y)\,\mu_1((x, y))\Big) > 0, \qquad (7)$$

where $\nu: V \to \mathbb{R}$ is a vertex weighting function, $\mu: A \to \mathbb{R}$ an arc weighting function, $\lambda: V \to W$ an injective vertex labeling function, and κ an injective arc labeling function. $T_1(L)$ is called a one-layer topic network that is generated by the generating layer $L_1$.
Formulas (6) and (7) require that the weighting values for nodes and arcs are greater than 0: otherwise, the candidate vertices and arcs do not exist in the TTN. ϑ is a classifier mapping pairs of topics t and texts x onto real numbers indicating the extent to which x is a “prototypical” instance of t (obviously, the textual arguments of the functions θ and ϑ are not restricted to elements of X).

Example 3. Given Example 2, we assume that $\theta(t_1, x_1) = \theta(t_2, x_2) = 1$ and $\theta(t_3, x_3) = \theta(t_3, x_4) = 1$, so that $V = \{v_1, v_2, v_3\}$ with $\lambda(v_i) = t_i$. In our example, we disregard ϑ. Further, we assume that the functions $g_1$ and $g_2$ are identity functions. Thus, $\nu(v_1) = \nu(v_2) = 1$ and $\nu(v_3) = 2$. Now, we can generate a topic link between $v_1$ and $v_2$ by exploring the intertextual relation $(x_1, x_2) \in A_1$: To this end, we assume that

$$\mu((v_1, v_2)) = \theta(t_1, x_1)\,\theta(t_2, x_2)\,\mu_1((x_1, x_2)) = 1$$

so that $(v_1, v_2) \in A$. By analogy to this case, we link topic $v_3$ by means of a reflexive link so that $\mu((v_3, v_3)) = 1$. Note that these assumptions are made for simplicity's sake only: Section 3.2 will elaborate a realistic weighting scenario. However, the function of the latter illustration is to show that by the intertextual linkage of both texts, we get evidence about the linkage of the topics instantiated by these texts. TTNs always operate according to this premise: they network topics as a function of the networking of an underlying set of texts. Figure 3 gives a schematic depiction of this scenario, which is varied subsequently to illustrate the other types of topic networks developed in this paper.
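Continuing the example, the following sketch infers a TTN in the spirit of formulas (6) and (7), with identity functions g1 and g2 and ϑ omitted; it reproduces the toy weights derived above and is a simplification, not the weighting scheme of Section 3.2.

```python
# Sketch of TTN inference: node weights count classified texts (formula (6));
# arc weights aggregate weights of intertextual links (formula (7)).
from collections import defaultdict

theta = {"x1": "t1", "x2": "t2", "x3": "t3", "x4": "t3"}   # text -> topic
text_arcs = {("x1", "x2"): 1.0, ("x3", "x4"): 1.0}         # weighted layer L1

nodes, arcs = defaultdict(float), defaultdict(float)
for x, t in theta.items():
    nodes[t] += 1.0                      # formula (6) with identity g1
for (x, y), w in text_arcs.items():
    arcs[(theta[x], theta[y])] += w      # formula (7) with identity g2

print(dict(nodes))  # {'t1': 1.0, 't2': 1.0, 't3': 2.0}
print(dict(arcs))   # {('t1','t2'): 1.0, ('t3','t3'): 1.0} (reflexive link)
```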
A concrete example of a TTN that is derived from the articles of the so-called Dresden wiki (see Section 4.1) is depicted in Figure 4. It shows the highest weighted topics addressed by these articles and their (undirected) links. The TTN has been computed by means of the procedural model of Section 3.2. Evidently, the topic Transportation; ground transportation is most prominent in this wiki, followed by the topic Central Europe; Germany. Most topics belong to the areas transportation (red), geography and history (turquoise), and architecture (gray) (for the color code, see Appendix). More examples of TTNs can be found in Figures 5–7.
Arguments of the sort $\theta(t, x)$ can be used to quantify evidence about text x as an instance of topic t: the more the evidence of this sort, the higher possibly the impact of x in formula (6) and the higher possibly the final weight of the vertex labeled by t. The adverb possibly refers to what is licensed by the parameters $g_1$ and $g_2$. Arguments of the sort $\mu_1((x, y))$, where $x \neq y$, can be used to quantify evidence that text x is intertextually linked to text y: the more the evidence of this sort, the higher possibly the weight of the link from x to y and the higher possibly the influence of this link onto the weight of the link from topic $\lambda(v)$ to topic $\lambda(w)$ in formula (7) (in cases in which there is no explicit information about intertextual links, one can use functions of aggregated word embeddings of the lexical constituents of texts to calculate their intertextual similarity). In this and related definitions, we do not fully specify the functions $g_1$, $g_2$, θ, and ϑ to leave enough space for different instances of topic networks.

Definition 3 relies on the pivotal text layer $L_1$ for deriving topic networks. To integrate further layers into the process of inferring topic networks, we introduce the following generalized schema.

Definition 4. Given a definitional setting $\mathcal{D} = (L, R, \theta)$ according to Definition 2, an $L_i$-Topic Network, $i \in \{1, \ldots, l\}$, is a vertex- and arc-weighted simple directed graph

$$T_i(L) = (V, A, \nu, \mu, \lambda, \kappa)$$

which is said to be derived from R and inferred from $L_1$ and the elements of $\{L_i, M_{1i}, M_{i1}\}$ by means of the optional classifiers $\vartheta, \theta_i, \vartheta_i$ and monotonically increasing functions $g_1, g_2: \mathbb{R} \to \mathbb{R}$ iff for all $v \in V$ and all $(v, w) \in A$:

$$\nu(v) = g_1\Big(\sum_{x \in X} \sum_{r \in V_i:\,(r, x) \in A_{i1}} \delta_1\big(\theta(\lambda(v), x), \vartheta(\lambda(v), x), \theta_i(\lambda(v), r), \vartheta_i(\lambda(v), r), \mu_{i1}((r, x))\big)\Big) > 0, \qquad (10)$$

$$\mu((v, w)) = g_2\Big(\sum_{(x, y) \in A_1} \sum_{(r, s) \in A_i:\,(r, x) \in A_{i1} \wedge (s, y) \in A_{i1}} \delta\big(\theta(\lambda(v), x), \theta(\lambda(w), y), \mu_1((x, y)), \mu_i((r, s)), \mu_{i1}((r, x)), \mu_{i1}((s, y))\big)\Big) > 0, \qquad (11)$$

where $\delta_1$ and δ are monotonically increasing aggregation functions. $\nu: V \to \mathbb{R}$ is a vertex weighting function, $\mu: A \to \mathbb{R}$ an arc weighting function, $\lambda: V \to W$ an injective vertex labeling function, and κ an injective arc labeling function. For $i > 1$, we say that $T_i(L)$ is a two-level topic network that is generated by the generating layers $L_1$ and $L_i$. If $i = 1$, then formula (10) changes to formula (6) and formula (11) to formula (7). By omitting any optional classifier (e.g., $\vartheta_i$), expressions of the sort $\vartheta_i(\lambda(v), r)$ change to 1, so that they drop out of the aggregation. θ and ϑ are treated analogously.
To understand formula (10), look at Figure 8: among other things, formula (10) collects the triangles spanned by a topic $\lambda(v)$, a text x, and a unit a (e.g., an agent), supposing that the two-level topic network is based on text and authorship links. Obviously, Definition 4 generalizes Definition 3. Now, it should be clear why we speak of the text network of an LMN as its pivotal layer: it is the reference layer of any additional layer that is integrated into a two-level topic network according to Definition 4. This role is maintained below when we generalize this definition to capture n layers, $n > 2$. With the help of Definition 4, we can immediately derive so-called author topic networks.

Definition 5. An Author Topic Network (ATN) is a directed graph

$$T_2(L) = (V, A, \nu, \mu, \lambda, \kappa)$$

according to Definition 4 such that $i = 2$; that is, its generating layers are the text layer $L_1$ and the agent layer $L_2$.
The relational arguments of this definition can be motivated as follows—assuming that they are instantiated appropriately:
(1) $\theta(t, x)$ can be used to represent evidence that text x is about topic t possibly in relation to other topics of W.
(2) $\vartheta(t, x)$ can be used to represent evidence that text x is a prototypical instance of topic t possibly in relation to other texts in X.
(3) $\theta_2(t, r)$ can be used to represent the extent to which agent r tends to write about topic t possibly in relation to other topics of W.
(4) $\vartheta_2(t, r)$ represents evidence that agent r is a prototypical author writing about topic t possibly in relation to other agents in $V_2$.
(5) For $x \neq y$, $\mu_1((x, y))$ can be calculated to represent evidence about text x to be intertextually linked to text y (e.g., in the sense of linking contributions of different authors). Otherwise, if $x = y$, $\mu_1((x, x))$ can be used to quantify evidence about x being intratextually structured.
(6) $\mu_{21}((r, x))$ can be used to quantify evidence about the role of agent r as an author of text x possibly in relation to other texts authored by r. Typically, $\mu_{21}$ is a function of the number of edit actions performed by r on x [74].
(7) $\mu_{12}((x, r))$ can be used to quantify evidence about the role of agent r as a prototypical author of text x possibly in relation to other authors of x. In the simplest case, $\mu_{21}$ is symmetric in this respect, making $\mu_{12}$ obsolete.
(8) $\mu_2((r, s))$ represents evidence that agent r is a coauthor of or interacting with s. For instantiating $\mu_2$, the literature knows a wide range of alternatives [74, 75] (which mostly concern symmetric measures of coauthorship). Note that we do not require that $\mu_2((r, s)) = \mu_2((s, r))$.
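As an illustration of item (6), the following sketch instantiates a μ21-like function as a normalized count of edit actions performed by an agent on a text; the edit log is invented, and the normalization is just one plausible choice among those surveyed in [74].

```python
# Toy instantiation of item (6): mu_21 as the share of an agent's edit
# actions that went into a given text (edit counts are invented).
edits = {("r1", "x1"): 7, ("r1", "x2"): 3, ("r2", "x1"): 2, ("r2", "x2"): 8}

def mu_21(r, x):
    """Share of r's edit actions spent on x (0 if r never edited x)."""
    total = sum(n for (s, _), n in edits.items() if s == r)
    return edits.get((r, x), 0) / total if total else 0.0

print(mu_21("r1", "x1"))  # 0.7
```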

Example 4. Starting from Example 3 to exemplify arcs between topics in author topic networks, we can now additionally explore the evidence that the texts $x_1$ and $x_2$ are both coauthored by the agents $r_1$ and $r_2$. That is, we can assume a coauthorship link $(r_1, r_2) \in A_2$ ($A_2$ is the arc set of the author layer $L_2$ in Definition 1) of weight $\mu_2((r_1, r_2)) = 1$. Let us now assume the following simplification of the function δ in Definition 4, for which we assume that it simply multiplies and adds up its argument values in the following way:

$$\mu((v, w)) = \sum_{(x, y) \in A_1} \sum_{(r, s) \in A_2:\,(r, x) \in A_{21} \wedge (s, y) \in A_{21}} \theta(\lambda(v), x)\,\theta(\lambda(w), y)\,\mu_1((x, y))\,\mu_2((r, s))\,\mu_{21}((r, x))\,\mu_{21}((s, y)).$$

In our example, we get $\theta(t_1, x_1) = 1$, $\theta(t_2, x_2) = 1$, $\mu_1((x_1, x_2)) = 1$, $\mu_2((r_1, r_2)) = 1$, $\mu_{21}((r_1, x_1)) = 1$, and $\mu_{21}((r_2, x_2)) = 1$. Since there is no other interlinked pair of texts (see Example 1) instantiating the topics $t_1$ and $t_2$, we get $\mu((v_1, v_2)) = 1$ as the weight of this topic link in the corresponding ATN. By this simplified example of an ATN, we get the information that the link of topic $t_1$ to topic $t_2$ is additionally supported by the coauthorship of agents $r_1$ and $r_2$: this information extends the evidence about the topic link as provided by the underlying TTN of Example 3. Likewise, the reflexive link of topic $t_3$ is augmented by 1 compared to the underlying TTN, while there is no other topic link to be considered in this example of an ATN. By analogy to Figure 3, Figure 9 gives a schematic depiction of this scenario. Note that in our example, the weight of the link between authors (cf. $\mu_2$) is a function of their coauthorship: this is only one alternative to weight the social relatedness of both agents, actually one that can be measured by exploring (special) wikis. However, any other social relatedness might be explored to weight the interaction of agents.
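The computation of Example 4 can be restated as follows; the sketch sums, over all coauthor pairs, the product of text link, coauthorship, and authorship weights, that is, the simplified δ assumed above, with invented weight tables.

```python
# Continuation of the TTN sketch: an ATN arc weight in the sense of Example 4.
authorship = {("r1", "x1"): 1.0, ("r2", "x1"): 1.0,
              ("r1", "x2"): 1.0, ("r2", "x2"): 1.0}
coauthor_arcs = {("r1", "r2"): 1.0}

def atn_arc_weight(x, y, text_w):
    """Sum over coauthor pairs (r, s) with r editing x and s editing y."""
    total = 0.0
    for (r, s), w_rs in coauthor_arcs.items():
        w_rx = authorship.get((r, x), 0.0)
        w_sy = authorship.get((s, y), 0.0)
        total += text_w * w_rs * w_rx * w_sy
    return total

print(atn_arc_weight("x1", "x2", 1.0))  # 1.0: evidence from one coauthor pair
```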
By comparing a text topic network with an author topic network derived from the same LMN L, we can learn how the topics of W are manifested in the texts of corpus X in the form of a concomitance or a disparity of intertextual and coauthorship-based networking. Consider, for example, two vertices v and w of a TTN and a corresponding ATN, respectively, such that both are labeled by the same topic; let further ⊥ and ⊤ denote the minimum and maximum that the vertex weighting functions ν′ (of the TTN) and ν″ (of the ATN) can assume. Then, we can distinguish four extremal cases:
(1) Cases of the sort

$$\nu'(v) \approx \top \wedge \nu''(w) \approx \top \qquad (14)$$

provide information on prominent topics that tend to be addressed by many texts which are coauthored by many authors.
(2) Situations like

$$\nu'(v) \approx \bot \wedge \nu''(w) \approx \bot \qquad (15)$$

probably apply to the majority of the topics in W, which are hardly or even not at all addressed by texts in X due to the narrow thematic focus of these texts.
(3) Cases like

$$\nu'(v) \approx \top \wedge \nu''(w) \approx \bot \qquad (16)$$

suggest a Zipfian topic effect, according to which a prominent topic is addressed by a small group of agents or even by a single author.
(4) Finally, situations of the sort

$$\nu'(v) \approx \bot \wedge \nu''(w) \approx \top \qquad (17)$$

refer to rarely manifested topics addressed by a few but highly coauthored texts. In conjunction with many cases of the sort described by formula (16), situations of this kind indicate a Zipfian coauthoring effect, according to which many authors write only a few texts, while many texts are written by a few authors without encountering many (relevant) coauthors.
Formulas (14)–(17) compare the node weighting functions of a TTN with those of a related ATN. The same can be done regarding their arc weighting functions. That is, for two arcs (v, w) of a TTN and (v′, w′) of the corresponding ATN, whose topic labels coincide pairwise, we distinguish again four cases (⊥ and ⊤ now denote the minimum and maximum the arc weighting functions μ′ and μ″ of both graphs can assume):
(1) In the case of

$$\mu'((v, w)) \approx \top \wedge \mu''((v', w')) \approx \top, \qquad (18)$$

topic λ(v) is intertextually linked more strongly to topic λ(w), and authors of its text instances tend to cooperate with those of instances of topic λ(w) likewise to a greater extent.
(2) In the case of

$$\mu'((v, w)) \approx \bot \wedge \mu''((v', w')) \approx \bot, \qquad (19)$$

topic λ(v) is intertextually less strongly linked to topic λ(w), and the few authors of its textual instances tend to cooperate with authors of instances of topic λ(w) likewise to a lesser extent.
(3) In the case of

$$\mu'((v, w)) \approx \top \wedge \mu''((v', w')) \approx \bot, \qquad (20)$$

topic λ(v) is intertextually more strongly connected with topic λ(w), while authors of its text instances tend to cooperate with those of instances of topic λ(w) to a lesser extent, if at all.
(4) Finally, in the case of

$$\mu'((v, w)) \approx \bot \wedge \mu''((v', w')) \approx \top, \qquad (21)$$

topic λ(v) is intertextually less strongly linked to topic λ(w), while the numerous authors of its text instances tend to cooperate with those of instances of topic λ(w) to a much greater extent.
Our central question regarding the relationship between TTNs and ATNs derived from the same LMN is whether these networks are similar or not. If they are similar, we expect that cases of the sort described by formulas (14), (15), (18), and (19) predominate, so that cases matched by formula (14) are parallelized by those considered by formula (18) and cases according to formula (15) are concurrent with those described by formula (19). An opposite situation would be that two topic nodes in the TTN are highly weighted but weakly linked, while they are weakly weighted but strongly linked in the corresponding ATN. In this case, a few or even only a single author is responsible for the thematic focus of the TTN. Note that this scenario reminds again of a Zipfian effect regarding the relation of TTNs and ATNs. By characterizing TTNs in relation to ATNs along these and related scenarios, we want to investigate laws of the interdependence of both types of networks, which may consist, for example, in the simultaneity of dense or sparse intertextuality-based networking on the one hand and dense or sparse coauthorship-based networking on the other. We may expect, for example, that the more related two topics are, the more likely the authors of their textual instances cooperate. However, not much is known about such scenarios in the area of VGI, especially with regard to Hypothesis 1. Thus, we address this gap at least by introducing a novel theoretical model which may help to fill it.
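As a simple diagnostic, the following sketch assigns topics to the four extremal cases of formulas (14)–(17) by thresholding normalized TTN and ATN node weights; the weights, the DDC-like labels, and the threshold are invented for demonstration.

```python
# Sketch: classify topics into the four extremal cases of formulas (14)-(17)
# by comparing normalized node weights of a TTN and its corresponding ATN.
ttn_w = {"380": 0.9, "720": 0.8, "910": 0.1, "500": 0.05}  # toy weights
atn_w = {"380": 0.85, "720": 0.1, "910": 0.7, "500": 0.05}

def case(v, hi=0.5):
    t, a = ttn_w[v] >= hi, atn_w[v] >= hi
    return {(True, True): "(14) prominent in both",
            (False, False): "(15) marginal in both",
            (True, False): "(16) Zipfian topic effect",
            (False, True): "(17) few but highly coauthored texts"}[(t, a)]

for v in ttn_w:
    print(v, "->", case(v))
```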
Figure 5 exemplifies two ATNs in relation to a corresponding TTN (T1), which were computed using the apparatus of Section 3.2 to instantiate the formal model of this section. The upper right ATN (A1) is computed by globally weighting coauthorship activities based on Wikipedia (as explained in Section 3.2.3); the ATN (A2) below is calculated by weighting these activities relative to the city wiki itself. Figure 5 shows that the topic with DDC number 720 (Architecture) is weighted higher in A1 than in T1. This is all the more pronounced in A2, where 720 becomes the most prominent topic and consequently displaces the top subject of T1, that is, topic 380 (Commerce, communications & transportation). That is, although topic 380 is most frequently addressed in this wiki's texts, topic 720 not only is almost as salient but also attracts many more activities among its interacting coauthors. Similar observations concern the switch of the roles of the topics 910 (Geography & travel) and 940 (History of Europe) from T1 to A1 and A2.
Regardless of the answer to this and related questions, we will also ask whether the shape of an ATN can be predicted if one knows the shape of the corresponding TTN and vice versa. To answer this question, we will consider LMNs of different text genres: of city wikis and regional wikis on the one hand and extracts of encyclopedic wikis on the other. We expect that LMNs spanned over corpora of the same genre exhibit a pattern of collaboration- and intertextuality-based networking that makes TTNs and ATNs derived from them mutually recognizable or predictable, whereas for LMNs generated from corpora of different genres this does not apply.
For reasons of formal variety, we now consider an alternative to author topic networks, namely, so-called word topic networks, which in turn are derived from Definition 4.

Definition 6. A Word Topic Network (WTN) is a directed graph

$$T_3(L) = (V, A, \nu, \mu, \lambda, \kappa)$$

according to Definition 4 such that $i = 3$; that is, its generating layers are the text layer $L_1$ and the lexicon layer $L_3$.
This definition departs by five new relational arguments from Definition 5, which—if instantiated appropriately—can be motivated as follows:
(1) $\mu_{31}((a, x))$ quantifies evidence about the role of word a as a lexical constituent of text x possibly in relation to all other texts in which a occurs. Typically, $\mu_{31}$ is implemented by a global term weighting function [76] or by a neural network-based feature selection function.
(2) $\mu_{13}((x, a))$ quantifies evidence about the role of the word a as a lexical constituent of the text x possibly in relation to other lexical constituents of x. Typically, $\mu_{13}$ is a local term weighting function, such as normalized term frequency [76], or a topic model-based function.
(3) $\theta_3(t, a)$ represents evidence about the word a to be associated with the topic t possibly in relation to all other topics of W.
(4) $\vartheta_3(t, a)$ calculates evidence about the extent to which the topic t is prototypically labeled by the word a, possibly in relation to all other words in $V_3$.
(5) $\mu_3((a, b))$ quantifies evidence about the extent to which the word a associates the word b. Typically, $\mu_3$ is computed by means of word embeddings [77].
Based on this list, we better understand what topic networks offer in contrast to TMs. This concerns the flexibility with which we can include informational resources computed by different methods (e.g., based on neural networks, topic models, and LSA) to generate topic networks (cf. challenge P5). Different relational arguments can be quantified using different methods, which in turn can belong to a wide range of computational paradigms. Table 2 gives an account of the generality of our approach by hinting at candidate procedures for computing the different relations of Figure 8.

Example 5. Starting from Example 3 to exemplify arcs between topics in word topic networks, we have to additionally explore evidence regarding the lexical relatedness of the vocabularies of the texts $x_1$ and $x_2$. In Example 1, we assumed that the intersection of both texts (represented as bags-of-words) is given by the set $\{a_1, a_2\}$. By analogy to Example 4, we assume now the following simplification of the function δ of Definition 4:

$$\mu((v, w)) = \sum_{(x, y) \in A_1} \sum_{(a, b) \in A_3:\,(a, x) \in A_{31} \wedge (b, y) \in A_{31}} \theta(\lambda(v), x)\,\theta(\lambda(w), y)\,\mu_1((x, y))\,\mu_3((a, b))\,\mu_{31}((a, x))\,\mu_{31}((b, y)).$$

In this scenario, we have to instantiate Definition 4 as follows: $\theta(t_1, x_1) = 1$, $\theta(t_2, x_2) = 1$, $\mu_1((x_1, x_2)) = 1$, and $\mu_3((a_1, a_1)) = \mu_{31}((a_1, x_1)) = \mu_{31}((a_1, x_2)) = 1$ for one summand and—everything else being constant—$\mu_3((a_2, a_2)) = \mu_{31}((a_2, x_1)) = \mu_{31}((a_2, x_2)) = 1$ for a second summand (for $a_3$ ($a_4$), we do not assume a lexical relatedness w.r.t. the words of text $x_2$ ($x_1$)). Note that under this regime, we assume that relatedness of lexical constituents only concerns shared usages of identical words—of course, this is a simplifying example. By analogy to the setting of Example 4, we thus have to conclude that $\mu((v_1, v_2)) = 2$ is the weight of the topic link from $t_1$ to $t_2$ in the corresponding WTN. Alternatively, we may assume that lexical relatedness does not only concern shared lexical items but also relatedness that is measured, for example, by means of a terminological ontology [83] or by means of word embeddings [77]. In this way, we may additionally arrive at topic links between topics whose texts share no identical words. In order to allow for a comparison of a WTN with its corresponding TTN, a more realistic weighting scheme is needed that also reflects above and below average lexical relatednesses of the lexical constituents of interlinked texts—in Section 3.2, we elaborate such a model regarding ATNs in relation to TTNs. Figure 10 gives a schematic depiction of the scenario of WTNs as elaborated so far.
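The WTN weight of Example 5 can be recomputed as follows; lexical relatedness is reduced to the identity of shared words, as assumed above, and all data are the toy values of Example 1.

```python
# Sketch of the WTN arc weight of Example 5: evidence from shared lexical
# items of intertextually linked texts (simplified, identity-only delta).
texts = {"x1": {"a1", "a2", "a3"}, "x2": {"a1", "a2", "a4"}}
theta = {"x1": "t1", "x2": "t2"}
word_relatedness = lambda a, b: 1.0 if a == b else 0.0  # identity-only proxy

weight = sum(word_relatedness(a, b)
             for a in texts["x1"] for b in texts["x2"])  # = |x1 & x2| = 2
print(f"WTN weight of ({theta['x1']}, {theta['x2']}): {weight}")
```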
It is worth emphasizing that instead of the (language-systematic) lexicon layer , we may use a constituent layer to infer a two-level topic network. For example, we can use the layer spanned by the sentences of the pivotal texts to obtain a sort of sentence topic network. In this case, may quantify evidence about the extent to which the sentence a entails the sentence b or the extent to which the sentence a is similar to the sentence b, etc., while may quantify evidence about the extent to which the sentence a is thematically central for the text x, etc. In sentence topic networks, topic linkage is a function of sentence linkage: prominent topics emerge from being addressed by many sentences, while prominent topic links arise from the relatedness of many underlying sentences. Another example of inferring two-level topic networks is to link topics as a function of places mentioned (by means of toponyms) within the texts of the underlying corpus X, where geospatial relations of these places can be explored to infer concurrent topic relations: if place p is mentioned in text x about topic and place q in text y about topic , and the platial relation relates p and q, this information can be used to link the corresponding topic nodes in the topic network. As a result, we obtain networks manifesting the networking of topics as a function of parallelized geographical relations.
Obviously, any other relationship (e.g., entailment among sentences, sentiment polarities shared by linked texts, and co-reference relations) can be investigated to induce such two-level networks. Beyond that, we can think of n-level networks in which several such relationships are explored at once to generate topic links. We can ask, for example, which locations are linked by which geospatial relations while being addressed in which sentences about which topics, where these sentences are related by which sentiment relations. Another example is to ask which authors prefer to write about which topics while tending to use which vocabulary: the higher the number of authors who frequently use the same words to write about the same topic, and the higher the number of such words, the higher the weight of that topic. In this case, topic weighting is a function of frequently observed pairs of linguistic (here: lexical) means and authors. On the other hand, the higher the degree of coauthorship of two authors contributing to different topics and the higher the degree of association of the words used by these authors to write about these topics, the higher the weight of the link between the topics. This concept of a topic network induced by the text, the coauthorship, and the lexicon layer of an LMN is addressed by the following generalization, which provides a generation scheme for topic networks:

Definition 7. Given a definitional setting according to Definition 2, an -Topic Network, for which , is a vertex- and arc-weighted simple directed graph which is said to be derived from and inferred from and the elements of by means of the optional classifiers and monotonically increasing functions iff and : is a vertex weighting function, an arc weighting function, an injective vertex labeling function, , and κ an injective arc labeling function. For , we say that is an m-level, , topic network generated by the generating layers and the elements of . If , formula (26) changes to formula (6) and formula (27) to formula (7). By omitting the optional classifier , expressions of the sort change to . θ and are treated analogously. In order to derive an undirected m-level topic network from , we define and , where are monotonically increasing functions.
Evidently, Definition 7 is a generalization of Definition 3 that considers higher numbers of generating layers. A schematic depiction of the scenario addressed by this definition is shown in Figure 11 by the example of a 3-level topic network that explores evidence about topic linking starting from the text, the author, and the lexicon layer of Definition 1. Likewise, Figure 12 depicts an n-level topic network, , in which additional resources are explored beyond the word, author, and text level. Figure 8 illustrates more formally the inference process underlying Definition 7, and in particular the arguments used. It illustrates the inference of an arc that connects two topics by exploring the links of the text, author, and lexicon layers of an underlying LMN. In this example, the blue and black arcs are evaluated to determine the weights of the red arcs connecting the focal topic nodes. Blue arcs are used to orient inferred arcs. We will not develop this apparatus further, nor will we empirically examine -layer topic networks for . Rather, the apparatus developed so far serves to demonstrate the generality, flexibility, and extensibility of our formal model.
In the above, we explained that one of the reasons for introducing a flexible and extensible formalism of topic networks is to compare topic networks derived from different layers (e.g., from the text layer on the one hand and the author layer on the other). In order to systematize this approach, we finally introduce the concept of a multiplex topic network, which is derived from the same or from different linguistic multilayer networks:

Definition 8. Given a definitional setting according to Definition 2, a Multiplex Topic Network (MTN) is a k-layer network such that each , , is an -Topic Network derived from according to Definition 7 and for each , , , , is called a margin layer fulfilling the following requirements: , , , and .
See Figure 13 for a schematic depiction of the comparison of two MTNs. Note that because of Definition 7, it does not necessarily hold that , but it always holds that . In this respect, we depart from [64], which instead requires, more strongly, that . In the case of topic networks, this would be too restrictive, since different topic networks derived from the same definitional setting can focus on different subsets of topics while ignoring the remaining topics in the co-domain of θ. (A way to extend Definition 8 is to include the RCS of Definition 2 as an additional layer. This would allow for directly relating its constituent topic networks with the hierarchical classification system .)
In this paper, we quantify similarities of the different layers of MTNs to shed light on Hypothesis 1. More specifically, we generate an LMN for each corpus of a set of different text corpora in order to derive a separate two-layer MTN for each of these LMNs, each consisting of a TTN and an associated ATN. Then, among other things, we conduct a triadic classification experiment: firstly with respect to the subset of all TTNs derived from our corpus, secondly with respect to the subset of all corresponding ATNs, and thirdly with respect to the subset of all TTNs in relation to the subset of the corresponding ATNs. In the next section, we explain the measurement procedure for carrying out this triadic classification experiment.

3.2. A Procedural Model of Topic Network Analysis

In order to instantiate topic networks as manifestations of the rhematic networking of places, we employ the procedure depicted in Figure 14. It combines nine modules for the induction, comparison, and classification of topic networks.

3.2.1. Module 1: Natural Language Processing

Preparatory for all modules is the natural language processing of the input text corpora. To this end, we utilize the NLP tool chain of TextImager [84] to carry out tokenization, sentence splitting, part of speech tagging, lemmatization, morphological tagging, named entity recognition, dependency parsing [85], and automatic disambiguation—the latter by means of fastSense [86]. For more details on these submodules, see [86, 87]. As a result of Module 1, the topic classification can be fed with texts whose lexical components are disambiguated at the sense level. As a sense model, we use the disambiguation pages of Wikipedia, currently the largest available model of lexical ambiguity.

3.2.2. Module 2: Topic Classification

According to Definition 2, the derivation of TNs from LMNs requires the specification of a Reference Classification System (RCS) . For this purpose, we utilize the Dewey Decimal Classification (DDC), a system that is well established in the area of (digital) libraries. As a result, the generalized tree from Definition 2 degenerates into an ordinary tree, since the DDC has no arcs superimposed on its kernel hierarchy (see Figure 15 for a subtree of the DDC). As a classifier θ addressing the DDC, we use text2ddc [72], a topic classifier based on neural networks, which has been trained for a variety of languages [88] (see https://textimager.hucompute.org/DDC/). Starting from the output of Module 1 (NLP), we use text2ddc to map each input text x to the distribution of the 5 top-ranked DDC classes that best match the content of x as predicted by text2ddc. Since text2ddc reflects the three-level topic hierarchy of the DDC, this classifier can output a subset of 98 classes of the 2nd DDC level (two classes of this level are unspecified) and a subset of 641 classes of the 3rd DDC level for each input text. (We did not have training data for all 3rd level classes, which are partly unspecified. See [72] and the appendix for details.) Thus, each topic network of each input corpus is represented on two levels of increasing thematic resolution. Note that text2ddc classifies input texts of any size (from single words to entire texts, in order to meet challenge P3) and works as a multilabel classifier for processing thematically ambiguous input texts. By using an RCS, text2ddc meets challenge P2 simply by referring to the labels of the topic classes of the DDC. Furthermore, since text2ddc is trained with the help of a reference corpus, it can detect topics even if they occur only once in a text (this is needed to meet challenge P4) and guarantees comparability across different input corpora (challenge P1). text2ddc is based on fastText, whose time complexity is , where “k is the number of classes and h the dimension of the text representation” ([89], p. 2), making this classifier competitive compared to TMs.
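Since text2ddc is based on fastText, the following sketch shows how the top-5 topic distribution of an input text can be queried from a fastText-style classifier; the model file ddc_level2.bin is a hypothetical stand-in, not part of the published tool chain:

```python
import fasttext

# Hypothetical model file standing in for a trained DDC-level-2 classifier:
model = fasttext.load_model("ddc_level2.bin")

# Top-5 DDC classes with membership values for one input text:
labels, probs = model.predict("Der Frankfurter Hauptbahnhof ist ein Verkehrsknotenpunkt.", k=5)
for label, prob in zip(labels, probs):
    print(label, round(float(prob), 3))
```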

Figures 4–7 show examples of TTNs and ATNs generated by means of text2ddc by addressing the second level of the DDC. Each of these topic networks was generated for a subset of articles of the German Wikipedia that are at most 2 clicks away from the respective start article x (for the statistics of the corpora underlying these topic networks, see Section 4.1). Formally speaking, let be a directed graph and ; the nth orbit induced by is the subgraph that is induced by the subset of vertices whose geodetic distance from is at most n (cf. [90]). We compute the first orbit and the second orbit of a set of Wikipedia articles (so that G denotes Wikipedia’s web graph). This is done to obtain a basis of comparison for the evaluation of topic networks derived from special wikis. Since Wikipedia is probably more strongly regulated than these special wikis, we expect higher disparities between networks of different groups (Wikipedia vs. special wiki) and smaller differences between networks of the same group.
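In terms of code, the nth orbit can be sketched as follows using networkx; this is a minimal sketch, and the toy graph and node names are illustrative assumptions:

```python
import networkx as nx

def nth_orbit(G: nx.DiGraph, start, n: int) -> nx.DiGraph:
    """Subgraph induced by all vertices of geodetic distance <= n from start."""
    reachable = nx.single_source_shortest_path_length(G, start, cutoff=n)
    return G.subgraph(reachable.keys()).copy()

# Illustrative toy web graph; in our setting, G is Wikipedia's web graph:
G = nx.DiGraph([("Frankfurt", "Main"), ("Main", "Hessen"), ("Hessen", "Kassel")])
orbit2 = nth_orbit(G, "Frankfurt", 2)
print(sorted(orbit2.nodes()))  # ['Frankfurt', 'Hessen', 'Main']
```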

3.2.3. Module 3: Network Induction

Network induction is done according to the formal model of Section 3.1. It starts with inducing an LMN for each input corpus X. That is, for each corpus X, we generate a text network and an agent network according to Definition 1:
(1) In this paper, X always denotes the set of texts (web documents) of a corresponding wiki W, so that the text layer of the LMN , in which is an agent network defined below, can be used to represent the web graph [91] of this wiki. Thus, for any two texts that are linked in W, we generate an arc , where and . Further, for .
(2) The author layer of the LMN corresponding to (see Definition 1) is generated as follows: is the set of all registered authors or TCP/IP addresses of anonymous users working on texts in X, so that maps to this name or IP address, respectively. Let be the sum of all additions made by the author to any revision of the edit history of the text x; we use to approximate the more difficult to measure concept of authorship as introduced by Brandes et al. [74]. Then, we define . Further, is the set of all arcs between users for which there is at least one text x to which both contribute, so that . Then, we define (cf. [92]):

Finally, . Obviously, is symmetric.

Now, given the definitional setting , where are instantiated in terms of Section 3.2.2, we induce a TTN according to Definition 3 by means of appropriately defined monotonically increasing functions . To this end, we utilize the set of the membership values of text to the topics in , where the parameter denotes a lower bound of an acceptable degree of aboutness. We set . Further, by we denote the mean value of the set of selected topic membership values, and by we denote the largest value of the arbitrary set . Finally, we select a number and define , thereby instantiating the parameters of formulas (6) and (7) of Definition 3:

According to formula (35), iff is one of the highest membership values of x to the topics in , provided that . Otherwise, . In this paper, we experiment with . The higher the value of , the more sensitive the generation of is to the thematic ambiguity of the underlying texts. However, since θ creates a membership value for each pair of texts and topics, we use as a lower bound of aboutness (in the sense of addressing a topic known by θ), so that irrelevant classifications do not affect .
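The selection logic just described can be sketched as follows; this is a hedged illustration in which the function name and the example data are our own, while the computation mirrors the top-ranked-and-thresholded selection of formula (35):

```python
def contributing_topics(memberships: dict, p: int = 1, threshold: float = 0.1) -> dict:
    """Topics through which a text contributes to vertex weights: the p
    top-ranked membership values that exceed the aboutness threshold."""
    ranked = sorted(memberships.items(), key=lambda kv: kv[1], reverse=True)
    return {topic: value for topic, value in ranked[:p] if value > threshold}

# A text classified as 38% class 380, 25% class 720, and 5% class 780:
print(contributing_topics({"380": 0.38, "720": 0.25, "780": 0.05}, p=2))
# -> {'380': 0.38, '720': 0.25}; class 780 falls below the threshold
```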

Regarding the ATN corresponding to the TTN , we have to define monotonically increasing functions . To this end, we use several auxiliary functions:
(i) By , we denote the mean activity per author per Wikipedia article.
(ii) By , we denote the average number of active authors per Wikipedia article.

The corresponding estimators are found in Table 4. Now, consider the set of all active authors of the text x and the set of all texts that potentially contribute to and thus to the weight of the vertex :

Then, we define the following functions and ratios:

where is a function which is used to rescale below- or above-average values (see formula (39)). Formula (40) defines the mean of the rescaled numbers of active users per article in . Based on these preliminaries and regarding the vertex weighting function , we define , thereby instantiating the functions α and β of formula (10) of Definition 4:

In the present paper, we experiment with . To understand this definition, we have to run through the cases of formula (42):
(1) The case : suppose that, for each , the following condition holds: . In this case, we obtain, for each , the following result:
In other words, if all authors of all texts contributing to the weight of a topic contribute to these texts with average activity, the weight of this topic in the ATN corresponds to that of the corresponding TTN. In this case, average activity does not bias the weight of a topic in the ATN compared to the same topic in the corresponding TTN. Obviously, this scenario gives us a neutral point or, more specifically, a calibration point for the comparison of ATNs and TTNs. Such a calibration point allows us to interpret any downward or upward deviation of the topic weights in both networks, since no deviation means average activity and an average number of active users. However, this consideration presupposes that so that . If , then the number of authors of texts contributing to the weight of is on average higher than expected on the basis of Wikipedia, so that the weight of the topic in the ATN is “biased upwards” compared to the weight of the same topic in the corresponding TTN. Conversely, if , then the number of authors of texts contributing to the weight of is on average smaller than expected, so that ’s weight in the ATN is “biased downwards” compared to the weight of the same topic in the corresponding TTN. This scenario teaches us the different roles of and with respect to the weighting of the values: while operates as a function of the activities of authors, considers their number.
(2) The case : suppose for each that while . Then, we conclude the following:
Thus, for , we penalize the contribution of a below-average active author of a text to the weight of the topic to which this text contributes. The different effects of have already been discussed.
(3) The case : if we now suppose that while , we conclude that, for , we reward the contribution of an above-average active author of a text to the weight of the topic to which this text contributes.

In a nutshell, and implement the following proportionality assumptions:
(i) By we penalize or reward below- or above-average coauthorship: the higher the above-average number of authors contributing to the texts of a topic, the higher the reward effect and the higher the weight of the topic. Conversely, the lower the below-average number of authors contributing to the texts of a topic, the higher the penalty effect and the lower the weight of the topic.
(ii) By we penalize or reward below- or above-average activities of single authors: the higher the above-average activity of a single author contributing to a text of a topic, the higher the reward effect and the higher the contribution of this author-text pair to the weight of the topic. Conversely, the lower the below-average activity of a single author contributing to a text of a topic, the higher the penalty effect and the lower the contribution of this author-text pair to the weight of the topic.

Finally, we define the functions and to get instantiations of the functions γ and δ of formula (11) of Definition 4 (or, in the generalized case, of formula (27) of Definition 7). This is done by means of the following auxiliary function:

where estimates the average degree of coauthorship in Wikipedia according to formula (31). (We estimate by means of 10,000 randomly selected Wikipedia articles, so that .) is a readjustment of in relation to the mean value : the higher the above-average coauthorship, the higher the value of , and the lower the below-average coauthorship, the lower the value of . Then, we define

In this definition, quantifies the link and the link (cf. formula (11)), the product quantifies the link , and quantifies the link . The calibration point of arc weighting is now reached under the conditions of the following scenario (for the first two conditions, see above):

Under these conditions, the authors r and s contribute to the texts x and y at an average level while interacting at an average level of coauthorship. In this case, the (co)authorship of both authors does not influence the strength of the corresponding arc in the ATN: in terms of neither reducing nor increasing . Note that the size of an ATN (i.e., the number of its arcs) is always less than or equal to that of the corresponding TTN, since the arcs present in a TTN are merely reweighted in the corresponding ATN: no new arcs are added. The same holds for the order of the ATN, since there is no node in a TTN for which there is no author authoring it.
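Since the concrete formulas are not reproduced above, the following is only a schematic sketch of the calibration-point idea: a value equal to the Wikipedia-derived mean maps to the neutral value 1, below-average values are penalized, and above-average values are rewarded; the simple ratio used here is an assumption for illustration, not the article's actual rescaling function:

```python
def rescale(value: float, mean: float) -> float:
    """Neutral (1.0) at the mean; < 1 below average, > 1 above average.
    Illustrative assumption, not the article's exact formula."""
    return value / mean if mean > 0 else 0.0

assert rescale(42.0, 42.0) == 1.0  # average activity leaves topic weights unbiased
```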

Our instantiation of multiplex text and author topic networks has shown two things: firstly, we demonstrated a single parameter setting as an element of a huge parameter space spanned by parameters such as p, , , , θ, , , , , , , , and . (In the latter eight cases, various information links are included as candidate parameters. Formula (42) shows, for example, that out of the six possible information links, only two are evaluated to instantiate . Obviously, numerous alternatives exist for instantiating this function.) Secondly, anyone who objects to the apparently inherent parameter explosion in our approach should consider the hyperparameter spaces of neural networks as objects of parameter optimization. Regardless of the heuristic character of our approach, and in contrast to the black box character of neural networks, its settings are extensible on the basis of the schematic framework provided by Definition 8 of MTNs and the definitions it is based upon. At the same time, this approach guarantees interpretability as long as the different ingredients entering our model via formulas such as (26) and (27) fulfill this condition, thereby meeting challenge P5.

3.2.4. Module 4: Network Randomization

Randomization is conducted to assess the significance of our findings. This is necessary because there is currently no related classification in the area examined here that can serve this role. To fill this gap, we compute the following randomizations:
(1) Baseline B1: a lower bound of a baseline is obtained by randomly assigning the object networks onto the gold standard (target) classes. This can be done by informing the assignment about the true cardinality of these classes (B11) or not (B12). We opt for B11 since this variant yields a higher F-score, making it more difficult to surpass. Of course, any serious network representation and classification model should go beyond this baseline. B1 is averaged over 100,000 iterations.
(2) Baseline B2: an alternative is to randomize the input networks and to derive vector representations (according to Section 3.2.3), which ultimately undergo the same classification process as the original networks. That is, the input networks are randomly rewired to generate Erdős-Rényi (ER) graphs, for which we ask whether they are separable by the same classification model (see the sketch after this list). (An alternative, not considered here, would be to randomize the topic classification of the underlying texts.) If this is successful in terms of high F-scores (the F-score is a measure of the accuracy of a classification, that is, the harmonic mean of its precision and recall), then we conclude that the network representation model or the operative classifier is not informative enough regarding the hypothetical class memberships of the input networks. Conversely, the lower the average F-scores obtained by classifying the randomized networks compared to the classification of the original ones, the more informative the representation model or the classification procedure regarding the underlying hypotheses. By keeping the model constant while varying the classifier, we can ultimately attribute this (non)informativity to the underlying representation model. Conversely, by keeping the classifier constant while varying the model, we can attribute this informativity to the classification model. B2 is repeated 100 times.
(3) Baseline B3: a third baseline results from randomizing the matrices that form the input of the target classifiers. This means that instead of calculating graph invariants or similarity values to feed the classifiers, we use matrices whose dimensions are chosen uniformly at random from the domain of the corresponding invariants or (dis)similarity measures. (We require that the main diagonal of the random matrix is 1 and that the matrix is symmetric.) If the classification based on the original networks does not exceed this baseline, we are again informed about a deficit of our representation model. Evidently, we are looking for models that significantly exceed this baseline; otherwise, we would have to accept that the same classifiers perform better on random values than on our feature model. B3 is repeated 100 times.
(4) Baseline B4: finally, we randomly reorganize the set of observations into random classes while using the same representation model to separate the resulting random gold standard. (Obviously, we have to prevent the gold standard from ever being part of the set of these randomizations.) We choose the variant of using randomized cardinalities of the random classes rather than keeping the sizes of the gold standard classes, since tests have shown that this approach tends to generate higher F-scores. If our network representation and classification model do not outperform this baseline, we learn that the underlying invariants used to characterize the networks are not specific enough; rather, they can be related to random classifications of the same objects using the same feature space. Obviously, we are looking for a model characterizing the gold standard (tendency to specificity) and not a random counterpart of it (tendency to non-specificity). B4 is averaged over 100 repetitions.
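Baseline B2's rewiring step can be sketched as follows, under the assumption that "randomly rewired" means generating an ER graph of the same order and size as the observed network (the gnm model of networkx):

```python
import networkx as nx

def b2_randomize(G: nx.DiGraph, seed=None) -> nx.DiGraph:
    """Erdős-Rényi (gnm) counterpart of G with equal numbers of nodes and edges."""
    return nx.gnm_random_graph(G.number_of_nodes(), G.number_of_edges(),
                               seed=seed, directed=True)
```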

B1 is a lower bound: models that fall below this bound are obsolete. B2 concerns the evaluation of the network representation or classification model. B3 focuses on evaluating the classification model, and B4 aims at evaluating the specificity of the operative feature model.

3.2.5. Module 5: Network Quantification

Module 5 is a preparatory step for a subset of network similarity measures. This relates to so-called topology-based approaches to graph similarity [57, 93–96]. The idea behind this approach is to map input networks onto vectors of graph indices or invariants in order to compare them with each other. That is, graph similarity is traced back to similarity in vector space: the higher the number of indices for which two graphs resemble each other, the more similar the graphs. The apparatus that we employ in this context is described next.

3.2.6. Module 6: Graph Similarity Analysis

Our hypothesis about thematic networks of geographical places says that these networks are similar in terms of the skewness of their thematic focus and their network structure, regardless of whether the underlying texts are written by different communities and regardless of the framing theme. To test this hypothesis, we apply the framework of graph similarity measurement, which allows for addressing the second of these three reference points by exploring the structure of topic networks as well as features of their nodes. Since graph similarity measurement is generally known to be computationally complex, we benefit from the fact that we are dealing with labeled graphs. By using alignments of the labels of the nodes of the graphs to be compared, we reduce the time complexity of these approaches considerably.

The literature knows a number of approaches to graph similarity measurement (see Emmert-Streib et al. [97] for an overview; cf. [98, 99]). The present article does not aim at a comprehensive study of them but focuses on a selected subset, including the following approaches:
(1) Graph Edit Distance- (GED-) based approaches [100–102] and their relatives (e.g., the Vertex and Edge Overlap (VEO) [103])
(2) Spherical [90] or neighborhood-related approaches (cf. [99])
(3) Network topology-related approaches [57, 93–96, 103]

We will develop and test candidates of each of these classes.

GED-based methods are well studied in the area of web mining [104]. Since we are dealing with labeled graphs, we can compute the GED directly from the vertex and edge sets of the input graphs [99, 100]. Let be two TNs; then their GED is computed as follows:

where . Since we are targeting graph similarities, we consider instead of , where overlaps of vertex and arc sets are equally weighted:

The same is done in the case of Wallis’ approach to graph distance [102], which is adapted as follows to get a similarity measure:

A relative of is the Vertex/Edge Overlap (VEO) graph similarity measure [103]:
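Since the formula itself is not reproduced above, the following sketch is based on the published VEO definition [103], which is directly applicable here because vertex identity is given by topic labels; its agreement with the omitted formula in every detail is an assumption:

```python
def veo(V1: set, A1: set, V2: set, A2: set) -> float:
    """VEO = 2 * (|V1 & V2| + |A1 & A2|) / (|V1| + |V2| + |A1| + |A2|)."""
    return 2 * (len(V1 & V2) + len(A1 & A2)) / (len(V1) + len(V2) + len(A1) + len(A2))

# Two small topic networks sharing two (labeled) nodes and one arc:
print(veo({"380", "720"}, {("380", "720")},
          {"380", "720", "940"}, {("380", "720")}))  # 6/7 ~ 0.857
```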

Since node and arc weights are not taken into account by these measures, we compute the following variant of to close this gap:

wges is sensitive to the arc weights [99] and to the vertex weights of TNs, the latter measuring the membership degree of the underlying texts to the topic represented by the corresponding vertex. We say that such measures are dual weight-dependent. These measures are of high interest since they cover more information about the underlying networks than single weight-dependent or even weight-independent measures (cf. the axiom of edge weight sensitivity of Koutra et al. [99]).

GED and its relatives share a view of similarity according to which graphs are considered the more similar, the more (equally weighted) vertices and arcs they share. This notion of similarity is contrasted by spherical approaches (see above) as exemplified by DeltaCon [99]. Roughly speaking, according to DeltaCon, the more two graphs resemble each other from the perspective of their vertices, the more similar they are. Since DeltaCon is not dual weight-dependent, we consider a dual weight-dependent relative of it. To this end, we compute the cosine of the vectors of geodetic distances for each pair of equally labeled vertices. Since topic networks can differ in their order, we first have to align their node sets to make them comparable; this is also needed because we aim for a dual weight-dependent measurement. The required alignment is addressed by means of the following auxiliary graphs and :

and are needed to make and comparable, whose symmetric difference can be nonempty while their vertex labeling functions share the same co-domain (since and belong to the same multiplex topic network according to Definition 8). Obviously, , so that for each there is no path from to in . Cases in which no such path exists are denoted by ; otherwise, if such a path exists, we denote by the length of the shortest path, that is, the geodetic distance between and in . As we deal with graph similarities, we first transform the distance values into similarity values:

gep is short for geodetic proximity. With the denominator , we penalize situations in which there is no path between and , that is, . The parameter specifies whether the geodetic distance and the geodetic proximity are computed for the weighted () or unweighted () variant of . If , we assume that each arc weighting value is normalized by means of the nonzero maximum value assumed by the arc weighting function for this network (this means that a graph , which is obtained from a graph by multiplying the weights of all arcs of by a factor , will be equal to in terms of the graph similarity measure to be introduced now (insensitivity to certain scalings)). specifies the maximum geodetic distance to be considered: beyond this value, nodes are considered to be of maximum geodetic distance to , irrespective of their real distance. For , we have to compute all geodetic distances. For values of (e.g., ), we arrive at variants of that are less time complex. We consider the variant , so that we take all path-related information into account. Now, we calculate the dual weight-dependent cosine of and as follows:

is the weighted cosine of the vectors of geodetic proximities of the same-named vertices in and . In this article, we consider two instantiations of the parameter :

where implements an arithmetic mean. is a function of the degree centrality [105] of its arguments: the more linked a topic is in a network, the higher its impact on the similarity of the input networks. The similarity view behind this approach is that while treats all nodes equally, whether peripheral or central, gives central nodes more influence. Take the example of two city networks [106]: it is plausible to say that if city networks look similar from the point of view of their central places, this should have more impact on the overall similarity assessment than similarities from the point of view of peripheral locations. An extension would be to use more informative node weighting measures (e.g., closeness centrality). Finally, the parameter limits the number of vertices for which cosine values are computed. In the unlimited case, . It is easy to see that formulas (64)–(66) are similarity measures. For , this can be shown as follows:
(1) Symmetry: since formulas (63)–(66) are all symmetric.
(2) Positivity: since we are considering only positive arc weights, it always holds that for any and .
(3) Upper bound: for any and and thus

It is worth noting that the range of values of formulas (63) and (65) is limited to , since the values of gep are always positive and we only consider positive membership values of texts to topic nodes.
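A simplified sketch of this spherical measure follows, assuming unweighted graphs, gep defined as 1/(1 + d) with gep = 0 for unreachable pairs (in line with the penalty for missing paths described above), and a plain arithmetic mean over the aligned node set; all of these simplifications are ours:

```python
import math
import networkx as nx

def gep_vector(G: nx.DiGraph, source, targets: list) -> list:
    """Geodetic proximities 1/(1 + d) from source to all targets (0 if unreachable)."""
    lengths = nx.single_source_shortest_path_length(G, source)
    return [1.0 / (1.0 + lengths[t]) if t in lengths else 0.0 for t in targets]

def geodetic_cosine(G1: nx.DiGraph, G2: nx.DiGraph) -> float:
    """Mean cosine of geodetic-proximity vectors over the label-aligned node set."""
    nodes = sorted(set(G1) | set(G2))     # alignment via shared vertex labels
    H1, H2 = G1.copy(), G2.copy()
    H1.add_nodes_from(nodes)
    H2.add_nodes_from(nodes)
    sims = []
    for v in nodes:
        x = gep_vector(H1, v, nodes)
        y = gep_vector(H2, v, nodes)
        nx1 = math.sqrt(sum(a * a for a in x))
        nx2 = math.sqrt(sum(b * b for b in y))
        if nx1 > 0 and nx2 > 0:
            sims.append(sum(a * b for a, b in zip(x, y)) / (nx1 * nx2))
    return sum(sims) / len(sims) if sims else 0.0
```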

So far, we have looked at measures that mostly process the arc set A of TNs. This is contrasted by measures operating on topological indices of graphs. An example is NetSimile [107], which is based on the idea of characterizing networks by vectors of graph indices, mostly drawing on theories of social networks or egonets. Starting from seven local, node-related structural features (e.g., node degree, node clustering, or size of a node’s egonet; see Berlingerio et al. [107] for the details of this approach), it computes the mean and the first four moments of the corresponding distributions to generate a 35-dimensional feature vector per network, where the Canberra Distance is used to compute their distances: let be two vectors; then their Canberra Distance is defined as

Soundarajan et al. [108] show that NetSimile is consistently close to the consensus among all measures studied by them, showing that it approximates the results of more complex competitors. This finding makes NetSimile a first choice in any comparative study of graph similarities.
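The Canberra Distance is directly available in scipy; here it is applied to two illustrative stand-ins for the 35-dimensional feature vectors NetSimile produces:

```python
import numpy as np
from scipy.spatial.distance import canberra

u, v = np.random.rand(35), np.random.rand(35)  # stand-ins for NetSimile vectors
print(canberra(u, v))  # sum over |u_i - v_i| / (|u_i| + |v_i|)
```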

Following on from this success, we introduce a topology-related approach to graph similarity, which draws on the hierarchical classification of the texts underlying the topic networks by reference to the Dewey Decimal Classification (DDC) (see Section 3.2.2). Starting from a pretest, which essentially showed that the graph invariants of complex network theory [109] do not sufficiently distinguish our networks from their random counterparts, we decided to calculate a series of graph indices that evaluate the assignment of topics to the second level of the DDC. More specifically, we compute three node type-sensitive variants of each of the four cluster coefficients [110], [111], [112], and [113] (cf. [114]). This variation can be exemplified by means of : to derive the desired variants from , we use the following scheme, where serves as a parameter to distinguish these alternatives ( is the degree of ):

where is the number of adjacent neighbors of sharing their level topic classification with , is the number of adjacent neighbors of whose identical classification differs from that of , and is the number of adjacent neighbors of whose classification differs among each other and from that of (a 4th case is that shares its level topic with a single neighbor while differing from the topics of all other neighbors). In this way, we compute for each of the cluster values (unweighted), (unweighted), (weighted), and (weighted) three variants considering intra- and interrelational as well as heterogeneous type-sensitive clustering, so that topic networks are finally represented by 12-dimensional feature vectors, which are compared using the cosine measure. We call this approach ToSi (short for topological similarity).
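To make the three type-sensitive variants more tangible, the following sketch counts a node's neighbors according to the three cases named above; the counting scheme is our illustrative reading of the (omitted) scheme, not its exact reproduction:

```python
from collections import Counter

def neighbor_type_counts(node_class: str, neighbor_classes: list) -> dict:
    """Counts per case: neighbors sharing the node's DDC class (intra),
    neighbors agreeing among each other but not with the node (inter),
    and neighbors with singleton classes differing from the node's (hetero)."""
    counts = Counter(neighbor_classes)
    return {
        "intra": counts.get(node_class, 0),
        "inter": sum(c for cls, c in counts.items() if cls != node_class and c > 1),
        "hetero": sum(c for cls, c in counts.items() if cls != node_class and c == 1),
    }

print(neighbor_type_counts("380", ["380", "380", "720", "720", "940"]))
# -> {'intra': 2, 'inter': 2, 'hetero': 1}
```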

As a result of this survey of candidate graph similarity measures, we consider the set of measures displayed in Table 5 for measuring the similarities of topic networks in order to shed light on Hypothesis 1, part (2).

3.2.7. Modules 7 and 8: Machine Learning and Classification Analysis

We conduct experiments in supervised learning with the aim of training classifiers to detect the layer (TTN or ATN) to which a topic network of an MTN belongs and the genre of the corpus from which the underlying LMN is derived. That is, our machine learning starts from a set of n genres , , each of which is represented by a set of text corpora (see Figure 16). The set defines a gold standard for which we assume that . Next, for each corpus of each genre , we span an LMN that in turn is used to derive a two-layer MTN such that consists of exactly two topic networks: a TTN and an ATN, both derived from . In this way, we obtain the set and the set of all TTNs and ATNs, respectively, both derived from according to Section 3.2.3. Next, each of the sets and is randomized according to the procedure described in Section 3.2.4 (Baseline B2). In this way, we obtain the sets and as the randomized counterparts of and . As a result, we distinguish a range of classification experiments (1–14), only a subset of which will be conducted in Section 4 to tackle Hypothesis 1. We start by distinguishing TTNs from ATNs. The underlying classification hypothesis is as follows.

Hypothesis 2. Topic networks of the same layer (also called mode) (i.e., TTN or ATN) are more similar than networks of different modes (this concerns Scenario 1 (observed data) and Scenario 6 (randomized data) in Figure 16).
The similarity of TNs will be quantified by means of the apparatus of Section 3.2.6. Regardless of the genre (urban vs. regional vs. encyclopedic communication) to which the underlying corpus belongs, Hypothesis 2 assumes that one can always distinguish TTNs from ATNs by their structure, while TTNs and ATNs are less distinguishable among themselves. This scenario is depicted in Figure 14 by Arrow 1. If we falsify the alternative to this hypothesis, we can assume that (poor, rich, or moderate) thematic intertextuality, as manifested by TTNs, differs from the coauthorship-based networking of topics in ATNs. Collaboration- and intertextuality-based networking would then differ in a way that characterizes their layer. In order to test the genre sensitivity disregarded by Hypothesis 2, we carry out two experiments: one in which we classify TTNs (ATNs) by genre and one in which we combine both classifications by simultaneously classifying by genre and layer. When classifying by genre, we distinguish TNs derived from city wikis (urban communication), regional wikis (regional communication), and subnetworks of Wikipedia (knowledge communication) (see Section 3.2.2). Finally, we generate two control classes of wikis and Wikipedia-based networks outside of these three genres. The corresponding wikis are sampled in such a way that their members are rather dissimilar; our similarity measurement should therefore not work on them. In a nutshell, the underlying classification hypothesis is as follows.

Hypothesis 3. Topic networks of the same genre are more similar than those of different genres (this concerns Scenarios 2–4 (observed) and Scenarios 7–9 (random data) in Figure 16).
As we consider the genre-sensitive classification in the context of the layer-sensitive one, we get different classification scenarios:
(1) Scenario 2 in Figure 16 denotes the task of training a classifier that detects TTNs of the same genre while distinguishing TTNs of different ones. If this is successful, we can assume that the TTNs analyzed here are genre-sensitive or that the communication functions that we hypothetically associate with these genres influence the structure of these TTNs.
(2) Scenario 3 from Figure 16 regards the analogous experiment for the genre-sensitive classification of ATNs.
(3) Scenario 4 concerns the alternative in which the modal difference of TTNs and ATNs is ignored in order to classify topic networks independently of their modal difference according to their underlying genre.
(4) This scenario is contrasted with Scenario 5, which considers classifiers for simultaneously detecting the genre and layer of TNs. The underlying classification hypothesis is as follows.

Hypothesis 4. Topic networks of the same layer and genre are more similar than networks of different layers or genres (this concerns Scenario 5 (observed data) and Scenario 10 (random data) in Figure 16).
Falsifying the alternative to part (2) of Hypothesis 1 implies that TNs derived from corpora written by different communities addressing different thematic frames (e.g., cities) nevertheless appear similar in their gestalt. Such a finding is very unlikely in cases in which the underlying corpora serve very different communication functions: Hypothesis 1 does not say that everything is similar irrespective of the heterogeneity of the underlying function or thematic orientation. Rather, a genre-oriented classification showing that TNs of the same genre (serving a certain communication function and having a certain thematic orientation) are more similar than those belonging to different genres would correspond to such a finding. From this point of view, Hypotheses 3 and 4 are of interest: dealing with them experimentally could pave the way for testing part (2) of Hypothesis 1.
As explained in Section 3.2.4, we randomize input networks so that we obtain five additional classification scenarios, labeled 6–10 in Figure 16. The experiments corresponding to these scenarios will be conducted here as far as they concern the baseline scenario B2 of Section 3.2.4. Furthermore, we enumerate scenarios that attempt to distinguish observed networks directly from their randomized counterparts. In this context, Scenario 11 aims at distinguishing TTNs from their randomized counterparts by means of the classifiers trained to detect TTNs. Analogously, Scenario 12 considers ATNs in relation to their randomized counterparts, while Scenario 13 aims to separate observed topic networks (whether ATNs or TTNs) from randomized ones. Finally, Scenario 14 extends the latter scenario by trying to additionally account for the modal difference of ATNs and TTNs. These scenarios are listed for theoretical reasons only.

4. Experimentation

To test Hypothesis 1 and its relatives (i.e., Hypotheses 2–4), we conduct several experiments using two resources: a corpus of special wikis, called the Frankfurt Regional Wiki Corpus, and a corpus of subnetworks of Wikipedia that mostly contain information about cities and regions.

4.1. Tools and Resources

The Frankfurt Regional Wiki Corpus (FRWC) contains 43 wikis collected from online wiki lists (e.g., https://de.wikipedia.org/wiki/Regiowiki). Table 1 shows the statistics of this corpus, which is divided into three genres: CITIES relates to wikis describing certain cities, REGIONS includes wikis focusing on a specific region, while the residual class OTHERS collects wikis that are not off-topic w.r.t. regional communication but are unusual in their structure or the described rhemes. We consider only articles that are not redirects. Wiki authors use redirect pages to lead readers of articles with outdated, incorrect, or alternatively spelled titles to the desired target page. We remove all such redirects and rewire all affected links accordingly. As a result, the number of processed articles is smaller than their overall number (see Table 1). In addition to the FRWC, we extracted a corpus of Wikipedia subgraphs (see Section 3.2.2 for the formal definition of these graphs and Table 3 for the corpus statistics). Subsequently, we denote the two orbit variants of this Wikipedia corpus by WP-REGIO-1 and WP-REGIO-2. We choose 25 articles about cities or regions matching the titles of the wikis in the FRWC and additionally include the subgraphs of six off-topic articles to build two additional corpora, called WP-OTHERS-1 and WP-OTHERS-2, for purposes of comparison.

We process the content, link structure, and metadata (e.g., authorship-related information) of all articles in our corpora. This includes their history, that is, the chains of revisions which led to their current state. We do not consider past states of link structure and content itself but incorporate the authorship and the amount of content being added or removed per revision (see Section 3.2.3). The wikis considered here are based on MediaWiki. The structure of their articles varies from wiki to wiki so that HTML-based extractions are error-prone. To circumvent this problem, we use WikiDragon [115], a Java-based framework for importing and processing wikis offline.

For our experiments, we used, adapted, and newly developed several tools, including the so-called GeneticClassifierWorkbench (GCW), a Python library for performing feature selections and sensitivity analyses in classification experiments. Since our experiments are based on feature vectors with sometimes more than 100 features, a complete sensitivity analysis of all feature combinations was not possible. Therefore, we conducted a genetic search for the best-performing subset of features by maximizing the F-score. That is, a population of p feature subsets is evaluated and mutated over a number of t rounds. Instances which score best are saved unchanged for the next round and partly added in a slightly mutated form. The worst-performing instances are removed and replaced by random feature combinations. The Workbench is based on the Python library scikit-learn [116], allowing us to abstract from the underlying machine learning paradigm so that the same genetic search can be applied to optimize different classifiers. We experimented with neural networks, which produced similar results on our test data but took too much time to be used for genetic searches and random baseline computations. Therefore, we opted for Support Vector Machines (SVM) as the embedded method of supervised learning, using the Radial Basis Function (RBF) as a kernel. Our source code is available on GitHub (https://github.com/texttechnologylab/GeneticClassifierWorkbench).
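A minimal sketch of this genetic search follows, assuming a feature matrix X and labels y; population size, mutation rate, and cross-validation setup are illustrative simplifications, not the exact GCW configuration:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def genetic_search(X, y, pop=20, rounds=50, seed=0):
    """Evolve binary feature masks; fitness is the cross-validated macro
    F-score of an RBF-kernel SVM restricted to the selected features."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    masks = rng.integers(0, 2, size=(pop, n)).astype(bool)

    def fitness(mask):
        if not mask.any():
            return 0.0
        return cross_val_score(SVC(kernel="rbf"), X[:, mask], y,
                               scoring="f1_macro", cv=3).mean()

    for _ in range(rounds):
        scores = np.array([fitness(m) for m in masks])
        masks = masks[np.argsort(scores)[::-1]]        # best instances first
        for i in range(pop // 2, pop):                 # replace the worst half
            child = masks[rng.integers(0, pop // 2)].copy()
            flip = rng.random(n) < 0.05                # slight mutation
            child[flip] = ~child[flip]
            masks[i] = child
        masks[-1] = rng.integers(0, 2, size=n).astype(bool)  # fresh random mask
    return masks[0]
```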

4.2. Classification Experiments

We investigate the similarities of our seven corpora of regional wikis (CITIES, REGIONS, and OTHERS) and of Wikipedia-based subgraphs (WP-REGIO-1, WP-REGIO-2, WP-OTHERS-1, and WP-OTHERS-2) (each defining a corpus of texts) in order to test Hypothesis 1 and its derivatives, that is, Hypotheses 2–4. Thus, we distinguish up to seven target classes in our experiments. For reasons of simplicity, we call each element of these corpora a wiki and each of the seven classes a genre. Unless otherwise stated, the experiments are performed on all of them. In the case of WP-REGIO-2 and WP-OTHERS-2, we did not induce the corresponding ATNs, as some of these would have included several million edit events. Thus, in this case, we have at most five target classes. Each experiment includes three consecutive steps:
(1) The all variant: the first step, denoted by all, is a hyperplane parameter optimization and evaluation using the entire feature set. The optimized parameters of the respective classifier are then used in the subsequent steps. Ideally, the parameters would be optimized independently for each step, but this would have slowed down the genetic search.
(2) The opt variant: in the second step, denoted by opt, genetic searches for optimal feature subsets are performed using a population of 20 feature vector instances and 50 rounds, trying to maximize the F-score of the classification. Note that these searches may only reach a local maximum.
(3) The ext variant: for experiments which are not conducted on random baseline data, we perform an extended genetic search for optimal feature subsets based on 20 instances and 500 rounds. In an additional step, a bit-wise genetic optimization attempts to further minimize the number of used features while keeping or even improving the F-score, again using 20 instances and 500 rounds.

4.2.1. Graph-Similarity-Based Classification

Using the apparatus of Section 3.2.6, each TN (ATN or TTN) of each MTN is represented by a vector of values indicating its similarities to the wikis of the underlying experiment. Any such vector is computed separately for each of the 11 similarity measures of Table 5. Thus, if is the set of all TNs of whatever mode (ATN or TTN) and genre (CITIES, REGIONS, etc.) and if is the subset of these TNs used in a classification experiment concerning the genres (target classes) (cf. Figure 16), then each topic network is represented, for each similarity measure, by a -dimensional feature vector, which is processed by the three-step algorithm described above. If, for a given similarity measure, the topic networks derived from wikis of the same genre are mapped to neighboring similarity vectors, then they belong to overlapping neighborhoods in vector space: related networks are similar in their similarity and dissimilarity relations. In this way, TNs of the same genre should become as recognizable as TNs of different genres. Now we see why a genetic search for optimal subsets of features is necessary: otherwise we would assume that all dimensions of our feature vectors are equally informative, an assumption that is probably wrong.

Relating to Hypothesis 3, Tables 6 and 7 summarize our findings regarding the genre-sensitive classification of TTNs and ATNs, respectively. Cosine-based measures always perform best. Especially in the case of ATNs, we see that accounting for arcs and for nodes ensures better performance: dual weight-dependent measures (see Section 3.2.6) outperform single weight-dependent or weight-insensitive measures. However, in the case of TTNs, we also see that as long as we do not perform an extended optimization (ext), the measure , which disregards arc weights, is among the best performers. Of special interest is , the best performer regarding the classification of ATNs (Table 7), which is not only arc and node sensitive but also weights nodes as a function of their degree centrality and therefore covers the highest amount of structural information among all candidates considered here. This measure is also a robust candidate working at a high level in both experiments (it is the best performer in the case of TTNs if optimized by an extended genetic search). Thus, we conclude that spherical measures clearly outperform GED-related approaches and especially network-topology-based approaches (ToSi and NetSimile), which perform worst: the kind of information we seek is apparently ignored or “abstracted away” by the latter measures. However, NetSimile has at least a high optimization potential (see the column ext in Table 6), a potential which is missing in the case of ToSi. In any event, none of the measures considered here is outperformed by our baselines. But in Table 6, we also see that B3 (opt) approaches ToSi (all); in Table 7, we make analogous observations for other measures. A serious problem concerns NetSimile in relation to Baseline B2 regarding the classification of ATNs (Table 7): the baseline surpasses this topology-related measure whether optimized (opt) or not (all). The graph indices collected by NetSimile obviously have difficulties in making observed networks distinguishable from their random counterparts, at least in some of the cases considered here. B3 is also of interest with regard to the classification of ATNs: it achieves F-scores of up to 40% and thus makes representation models based on measures such as NetSimile, ToSi, and wges problematic candidates. The values of B4 opt are also remarkably high and can therefore be regarded as a challenge for the measures.

Figure 17 shows that the baselines B1, B3, and B4 are outperformed by the results obtained for TTNs. However, it also shows that feature optimization affects the random baselines. This is particularly evident in the case of B3, which is based on random matrices. This gain in F-score can be explained by random numbers that allow the target classes to be separated, at least to some extent; such features are then selected by the genetic feature selection. The baseline results for ATNs show a similar picture (see Figure 17(b)). Regarding B2, we make the following observations in Figure 17(b) (for reasons of complexity, we did not consider all measures to compute B2): although the best B2 candidates are better than the average F-scores calculated on the basis of real data, B2 is clearly surpassed on average. Thus, we come to the conclusion that we have found effective measures for comparing networks; this concerns in particular the spherical approach based on the cosine measure. From these experiments, we conclude the following:
(1) Hypothesis 3 is not falsified: we know the genre of a topic network by its structure. Note that this only concerns Scenarios 2 and 3 of Figure 16; Scenario 4 is not computed here. Similarly, by calculating our baselines, this also involves Scenarios 7 and 8 while ignoring Scenario 9. The classification benefits especially from information that is explored by dual weight-dependent measures. This holds regardless of the mode (ATN or TTN).
(2) Spherical measures should be preferred to GED-based measures, and these in turn to topology-based measures.

The boxplots in Figure 18 give another perspective on the classification results by summarizing the distributions of the precision and recall values generated by the graph similarity measures. Except for the results on ATNs using all features, the average precision is higher than the average recall. The figure also demonstrates the strong effect of feature selection.

So far, we have considered classifications as a whole and thus abstracted from the scores obtained for individual genres. The boxplots in Figure 19 give insights into these genre-related scores regarding the classification of TTNs by means of the extended feature optimization (ext). The members of the genre CITIES are well identified in terms of recall and precision. The genre REGIONS is far less separable and causes many classification errors (low recall); apparently, this class contains more heterogeneous TTNs. In any event, the Wikipedia-based genres WP-REGIO-1 and WP-REGIO-2 are very well separated. By contrast, instances of the category OTHERS are extremely difficult to detect (as predicted in Section 3.2.7). Similarly, elements of the classes WP-OTHERS-1 and WP-OTHERS-2 are difficult to identify, albeit to a minor degree. Thus, we conclude that the upper bound of separability concerns Wikipedia-based regional wikis: the corresponding subgraphs are very similar. This upper bound is approached by city wikis. Region wikis are less homogeneous, making the corresponding class REGIONS rather blurred and thus calling its status as a genre into question. Figure 20 shows the corresponding results of classifying ATNs. The general picture is quite similar to that of the TTNs.

We take another perspective on the results to examine classification errors. The best result on TTNs using all features is achieved by . Figure 21 shows to what degree wikis of a target class are wrongly classified using this measure. The labels show the proportions of the categories according to the gold standard (top) and the classification result (bottom). The picture is diverse, but some details become clear: wikis of the classes REGIONS and OTHERS are often falsely categorized as CITIES. City wikis, on the other hand, are wrongly classified as WP-OTHERS-1 or WP-REGIO-1.

Genetic feature selection has proven to increase the F-score significantly. In the extended optimization (ext), the last step is to minimize the number of features used. Since our features stand for similarities to networks, we have to ask whether some of the wikis underlying these networks are more relevant for the differentiation of the target classes than others, possibly because of their prototypical status. If all wikis were equally important, we would expect a uniform distribution of the frequencies with which these features are selected by the genetic optimization. Figure 22 shows the corresponding rank frequency distribution: we are far from evenly distributed features. From this, we conclude that the selection of features is indispensable and that the underlying wikis play very different roles in our classification experiments.

Next, we try to distinguish TTNs from ATNs, thereby addressing Hypothesis 2 (or, more specifically, Scenario 1 of Figure 16). The error analysis in Figure 23 shows that networks of these two modes are not separable using our approach. Table 8 differentiates this outcome by reporting the results obtained for the different measures. It shows that this classification scenario is far exceeded by Baseline B1 and is therefore irrelevant. From this result, we conclude that ATNs are so similar to their corresponding TTNs that they cannot be distinguished by our measures, or, alternatively, that our similarity measures are not suitable for distinguishing them. This is not surprising, as the order and the size of an ATN always correspond to the order and the size of the TTN from which it was derived, so that they can only differ in the weighting of their nodes and arcs. Concerning Hypothesis 4 and thus the distinction of twelve target classes (in the case of WP-OTHERS-2 and WP-REGIO-2, we do not induce ATNs), Table 8 shows a somewhat different picture: though the F-scores are still rather low, Baseline B1 is clearly outperformed when using a cosine measure for graph similarity measurement. From this observation, we conclude that while Hypothesis 2 is falsified, there is at least some potential regarding the simultaneous distinction of genre and mode: ATNs do not uniformly resemble their corresponding TTNs.

So far, we have considered part (2) of Hypothesis 1 by showing that TTNs (and also ATNs) with similar functions resemble each other while differing from networks of other genres. It remains to be shown that these networks are also thematically focused in a highly skewed manner. To test this, we fit power laws to the distributions of node weights in TTNs. Remember that these weights result from detecting textual instances of the topic represented by the respective node, so that the more such instances are detected, the more salient the topic in the network. Fitting a power law to such a distribution means that a minority of topics, or just one topic, surpasses all other topics in importance, while the majority of topics are of little or no importance. The boxplots in Figure 24(a) show the distribution of the exponents of the power laws fitted to these distributions, differentiated by the genres considered here. To assess the goodness of the fits, we compute the adjusted R-squares and display the value distributions in Figure 24(b). Obviously, the fits are very good (the adjusted R-squares are on average above 95%), while the averages of the exponents range between 0.5 and 1.5. From this analysis, we conclude that the underlying wikis are all thematically focused and skewed by dealing with a minority of topics in depth. The five most frequently detected DDC labels per genre are shown in Table 9. It shows that Transportation; ground transportation is by far the most dominant topic in city wikis and in region wikis. Obviously, these wikis are thematically focused in a highly skewed manner.
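A hedged sketch of such a fit via linear regression in log-log (rank-weight) space follows; whether this matches the article's exact fitting and adjusted R-squared procedure is an assumption, and the weights are illustrative (plain R-squared is reported here; the adjusted variant would additionally correct for the number of parameters):

```python
import numpy as np

def fit_power_law(weights: list):
    """Fit w(r) ~ r^(-gamma) by least squares on log rank vs. log weight."""
    w = np.sort(np.array(weights, dtype=float))[::-1]
    x = np.log(np.arange(1, len(w) + 1))   # log rank
    yv = np.log(w)                          # log weight
    slope, intercept = np.polyfit(x, yv, 1)
    pred = slope * x + intercept
    ss_res = np.sum((yv - pred) ** 2)
    ss_tot = np.sum((yv - yv.mean()) ** 2)
    return -slope, 1 - ss_res / ss_tot      # exponent gamma, R-squared

exponent, r2 = fit_power_law([120, 40, 18, 9, 6, 4, 3, 2, 2, 1])
print(round(exponent, 2), round(r2, 3))
```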

It remains to be shown that our findings about urban wikis depend neither on the distances of the corresponding places nor on the communities writing these wikis. Figure 25 shows that the similarities we detected hardly correlate with the underlying distances of the places. In the heatmap in Figure 25(a), the connection between two city wikis is the greener, the closer and the more similar they are to each other, and the redder, the more distant and the less similar they are. Similarity is measured by our graph similarity measure, while distance is converted into closeness and normalized to the unit interval; the heatmap values combine both scores. Figure 25(b) shows that there is hardly any tendency for wikis to be more similar the closer they are to each other. The lower similarity values are mostly induced by unusually small wikis such as Boppard (see Table 1). Figure 26 shows the Fuzzy Jaccard coefficient of the communities underlying the wikis, that is, the overlap of these communities weighted by the activities of their authors: the fewer authors two wikis share and the less active these authors, the lower the fuzzy overlap of the wikis. The Fuzzy Jaccard coefficient is computed as follows (cf. [117]): let $U$ be the set of all registered users contributing to any of the wikis in $\mathbb{W} = \text{CITIES} \cup \text{REGIONS} \cup \text{OTHERS} \cup \text{WP-REGIO-1} \cup \text{WP-OTHERS-1}$ and let $A_X$ be the set of all (nonredirect) articles of wiki $X$; then, for two wikis $X, Y \in \mathbb{W}$, we compute
$$J(X, Y) = \frac{\sum_{u \in U} \min(\mu_X(u), \mu_Y(u))}{\sum_{u \in U} \max(\mu_X(u), \mu_Y(u))}$$
where $\mu_X(u)$ denotes the activity of user $u$ in wiki $X$, for example, measured as the share of articles in $A_X$ to which $u$ contributed.
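
A minimal sketch of this computation, assuming the min/max reading given above (the standard fuzzification of the Jaccard coefficient); user ids and activity values are invented:

```python
def fuzzy_jaccard(activity_x, activity_y):
    """Fuzzy Jaccard of two wiki communities: activity_x and activity_y
    map user ids to their (normalized) activity in wiki X and wiki Y.
    Users absent from a wiki contribute membership 0 for that wiki."""
    users = set(activity_x) | set(activity_y)
    num = sum(min(activity_x.get(u, 0.0), activity_y.get(u, 0.0)) for u in users)
    den = sum(max(activity_x.get(u, 0.0), activity_y.get(u, 0.0)) for u in users)
    return num / den if den else 0.0

# Hypothetical activities: share of a wiki's articles a user contributed to
wiki_x = {"alice": 0.4, "bob": 0.1}
wiki_y = {"alice": 0.2, "carol": 0.5}
print(fuzzy_jaccard(wiki_x, wiki_y))  # low overlap: only alice is shared
```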

Figure 26 shows that while the overlap among the Wikipedia-based extractions is remarkably high, it is nearly nonexistent between any of the city or region wikis: these wikis are written by mostly completely different communities. The picture does not change if one considers all authors, registered and unregistered.

5. Discussion

Section 4 has shown that topic networks, whether TTNs or ATNs, are similar if they belong to the same genre, and that they are characterized by a high degree of thematic focusing. To operationalize this notion of network similarity, we tested 11 different measures of network similarity, partly established and partly newly developed, relying on four different paradigms of measuring the similarity of graphs (see Table 5 and the discussion of graph/network similarity measures in Section 3.2.6) as instantiated by the complex networks studied here. Each of these measures and paradigms comes with a different notion of network similarity. We have shown that a subclass of them, especially cosine-based measures of network similarity, allows for detecting similarities of topic networks in line with Hypotheses 3 and 4. At the same time, the concept of network similarity underlying this class of dual weight-dependent measures seems to be the most promising from a research point of view, as it is based on node and arc weights and instantiates a very intuitive concept of network similarity: two networks are the more similar, the more of their nodes agree in viewing them as similar. Thus, at the level of thematic abstraction examined here, there seems to be a hidden tendency to write about very prominent topics when thematizing places and to link the underlying texts in such a way that the resulting networks become almost indistinguishable.
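
To make the weight-dependent reading concrete, the following sketch computes a cosine similarity over the node weights of two topic networks. The DDC labels and weights are hypothetical, and the article's actual measures additionally take arc weights into account; this is an illustration of the concept, not the reported implementation:

```python
import math

def cosine_network_similarity(weights_a, weights_b):
    """Cosine similarity of two topic networks represented as mappings
    from topic labels (e.g., DDC classes) to node weights. Topics absent
    from one network contribute weight 0 there."""
    topics = set(weights_a) | set(weights_b)
    dot = sum(weights_a.get(t, 0.0) * weights_b.get(t, 0.0) for t in topics)
    norm_a = math.sqrt(sum(w * w for w in weights_a.values()))
    norm_b = math.sqrt(sum(w * w for w in weights_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented node weights of two city-wiki TTNs dominated by the same topic
ttn_1 = {"388 Transportation": 90.0, "720 Architecture": 20.0, "900 History": 15.0}
ttn_2 = {"388 Transportation": 75.0, "900 History": 25.0, "790 Sports": 5.0}
print(round(cosine_network_similarity(ttn_1, ttn_2), 3))  # close to 1
```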

Starting from this kind of thematic distortion of VGI as conveyed by online media, we now ask for a more general explanation of our findings. The candidate we consider for this purpose is the notion of Cognitive Maps (CM), which were introduced as models of the cognitive representation and processing of spatial information in order to explain a number of different cognitive biases. Because they bridge the gap between geographical information and its biased representation, CMs are a promising candidate for our task. At the same time, this notion connects cognitive geography on the one hand with our generalized model of the linguistic encoding of geographical information on the other (see Figure 1). The reason is that, as mental representations, CMs are seen to integrate a wide range of representations of spatial objects, their relations, and thematic units (see below). We may now argue that we have developed a method to represent and analyze a particular type of thematic information that can be subsumed under the latter list. If this is true, then the thematic distortion observed by us can be seen as a result of the biased processing of geographic information by a community of agents dealing with the same place to generate a common cognitive map, thereby manifesting a particular type of distributed cognition. When creating such a common CM of the same place, agents tend to focus on a highly selected set of rhemes (see Figure 1), even if there is no explicit agreement among them about this selection, even if there is little or no direct communication between them, and irrespective of the focal place. The agents seem to participate in processes of distributed cognition in such a way that their own thematically distorted maps flow into the formation of a shared, stable, but likewise distorted “thematic map.” Such maps then appear as the result of a sort of swarm behavior regarding the formation of a particular distribution of the preference and salience of certain place-related rhemes. From this perspective, topic networks serve as models of these thematic maps, which in turn are parts of CMs. To underpin this interpretation, we briefly summarize the research on CMs and, above all, ask about the distortions distinguished by research in this area.

Understood as mental representations of spatial knowledge, CMs have been the subject of scientific work for decades. Starting from different disciplinary perspectives, this research provides insights into how people perceive their environment, how they think about it, and how this influences their spatial behavior. The interdisciplinary research on CMs has led to a multitude of notions, research designs, and outcomes, the integration of which is still pending. Over the years, researchers have worked with different terms for the mental representations in question, such as cognitive maps [118], environmental images [119], mental maps [120], mental sketch maps [121], narrative space maps [122], or internal representations [123], of which the constituent map is the most common. However, there has been a discussion as to whether the term map is generally misleading. In this context, Kitchin ([124], 3 pp.) distinguishes approaches that understand CMs as
(1) three-dimensional maps,
(2) an analogy to maps (because of their map-like characteristics),
(3) a metaphor for maps (because they function as if they were maps), or
(4) a hypothetical construct used to explain spatial behavior.

While we refer to cognitive maps as an auxiliary notion, we adhere to the fourth of these variants. Regardless of this discussion, there is a greater consensus on some characteristics of CMs as mental representations: CMs are understood as complexes of mental images and concepts that humans have in mind when thinking about places, their location (in terms of distance and direction), accessibility (regarding questions like how to get there), and the meanings associated with them. They serve as a means of understanding spatial circumstances and as a frame of reference for the interpretation, preference, and prediction of spatial structures, their relations, and events in which they participate (see ([125], 100 pp., 313), ([120], 3), and ([119], 5 p.)). Beyond that, they also serve as a basis for decision-making regarding spatial behavior (e.g., in route planning). In a nutshell, humans activate, generate, and utilize CMs in spatial thinking and spatial behavior (cf. [126], 233). CMs are distinguished according to the entities they model. Kitchin and Blades ([127], 5 p.) distinguish CMs of object spaces (e.g., rooms and cars), environmental spaces (e.g., buildings, streets, neighborhoods, and cities), geographical spaces (e.g., regions and countries), panoramic spaces, and map spaces (including models) (cf. [128]). In this way, they cover existing as well as imagined places, where facts about the former can be mixed with imaginations of the latter [129]. This list includes the kind of places that are central to our study, especially cities.

To build a bridge between the notion of CMs and our analysis, we need to look more closely at their content and the principles by which they are created. Generally speaking, CMs are seen to cover at least two types of information (see ([124], 1 p.) and ([129], 314 p.)):
(1) Regarding spatial cognition, information about where entities are located in the environment of a person (location, distance, and direction in relation to her location or to reference points like landmarks)
(2) Regarding environmental cognition, information about the kind of these entities, their attributes, meanings, valuations, and attitudes that the person associates with them, individually, socially, or culturally mediated ([126], 224, 235)

Our study focuses on the second part of this distinction: it is related to the rhemes that are associated with places as framing themes (see Section 1). In any event, CMs are systematically characterized by distortions ([129], 315) concerning judgments about locations, distances, and directions, as well as the formation of preferences affecting spatial or environmental cognition. One example is the localization effect [120], according to which people can discriminate nearby places better and have stronger preferences for them (see also [126]). This relates to errors in distance judgments depending on the perspective from which they are made: more differences are seen between closer areas than between more distant ones, so that shorter distances are exaggerated, while longer distances are underestimated ([130], 133). Furthermore, spatial knowledge can be organized by reference to landmarks, which “distort” places in their “neighborhood” so that buildings, for example, are judged to be closer to them than vice versa ([130], 134). Tversky ([130], 135 pp.) describes additional modes of distortion: to remember the position and orientation of objects, humans isolate them from their background and organize them by referring to a general frame of reference (rotation) or to other figures (alignment). While these examples primarily concern spatial cognition, the following bias focuses more on environmental cognition: the hierarchical organization of conceptual systems, according to which places of the same category are supposed to be closer in distance than places of different categories, while the direction of a category (with a direction slot) determines that of its members ([130], 132 p.). Last but not least, Golledge and Stimson [126] describe distortions of the representation of urban spaces. They observe that interactions influence the perception of a city in the sense that spatial information accumulates along the representations of the paths used to carry out these interactions. Likewise, structural properties of cities that are more salient than others are likely to become anchor points in CMs. In such maps, areas between used paths and anchor points may appear to be “folded” or “wrapped” so that preferentially visited places are represented closer to each other. As a result, positional and relational errors can occur in perception (see ([126], 254) and ([131], 7)).

To interpret our findings in the light of this research, we need to link the formation of CMs with linguistic processes. The idea that this formation is substantially influenced by human language processing, so that geographical information is nontrivially encoded in linguistic structure, goes back to the work of Louwerse and Benesh (cf. [26]) (see Section 1; see also Montello and Freundschuh ([132], 171) for an earlier hint at “obtain[ing] spatial knowledge through language”). In this context, Golledge and Stimson ([126], 235) distinguish shared components of CMs from personalized ones by stating that “The common elements facilitate communication with others about the characteristics of an environment; the idiosyncratic elements provide the basis of the personalized responses to such situations.” Our hypothesis now is that, at the level of thematic abstraction modeled here, the organization of platial rhemes shared by the members of a community is influenced by the general law of preferential order, most prominently instantiated by Zipf’s first law [133]. Such an organization makes the thematization of a place highly expectable among the members of a community, so that communication about this place is facilitated, as predicted by Golledge and Stimson [126].
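
For reference, Zipf's first law can be stated as a rank-frequency relation; applying it to our setting (a reading we assume here), $f(r)$ is the salience of the $r$-th most salient rheme:

```latex
% Zipf's first law as a rank-salience relation: the salience f(r) of the
% r-th ranked rheme decays as a power of its rank. In the classical case
% alpha is close to 1; the exponents fitted in Section 4 range between
% 0.5 and 1.5, scattering around this value.
f(r) = \frac{c}{r^{\alpha}}
```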

This Zipfian organization allows for relating our findings to the well-known power-law-like degree distributions found in many natural, social, semiotic, or technical networks (see [109, 134] and especially [135] for overviews of this and related research) and exemplified by many linguistic systems, especially on the text level [136–138]. Because of this commonality, one might assume that we merely detected a well-known text or network characteristic. What is characteristic of our findings, however, is that we developed a measurement procedure that detects a text (corpus)-related semantic, thematic trend with the help of network theory: instead of counting directly observable arcs, for example, in ontological networks or co-occurrences in texts, and instead of relying on monoplex networks [70, 93, 139–143], we generated and analyzed a range of different networks in relation to each other in order to determine the corresponding thematic trend by means of multiplex networks. This is not to say that we were the first to discover a Zipfian process in the organization of linguistic networks, but rather that we observe such a process in a very specific area in which it has not been observed before and which requires an appropriate explanation, as elaborated above. Indeed, if thematic salience is skewed, and if skewed topic distributions derived from different corpora are similar not only topologically but also regarding the ranking of the majority of salient topics, such an observation requires explanation, given that the underlying text networks are constituted by different, distributed communities of authors. Answering this question is what this article has been about.

At this point, one might further object that we made a rather expectable observation in the sense that descriptions of cities, for example, are very likely related to rhemes like traffic, trade, culture, and history. However, this would mean underestimating our results: (i) the thematic distortions observed by us are extremely skewed, (ii) they seem to emerge rather early in the development of a wiki (this is not shown here but is the result of a pretest in which we looked at the life cycles of three different wikis; in future work, we will analyze the underlying time series of multiplex topic networks in detail), and (iii) they make members of the same genre similar to each other while allowing members of different genres to be distinguished. To phrase it as a question: if the number of rhemes under which places are thematized is limited, why should a tiny subset of them always dominate the discourse about a place, and why should the networking of these rhemes make discourses of the same genre identifiable? From this point of view, we argue that we discovered an additional form of the distortion of CMs: the underlying place is always conceptualized from the point of view of a few but extremely preferred rhemes. When organizing their distributed processes of coauthorship, communities of authors seem to strive for a kind of thematic unification that makes different wikis serving alike functions look structurally similar with respect to the preference order of themes and their networking. It seems that people participate in processes of collaborative writing with a tendency to organize their thematic contributions and references in such a way that they remain shareable [144] and communicable among members of the same community. Ensuring shareability means securing the continued existence of the underlying wiki, which could otherwise collapse because of too many personalized or individualized fragmentations. At this point, we can speculate that people unconsciously prefer thematic contributions that make their social roles and participations expectable and acceptable, and that this selection behavior produces the described similarity of thematic maps as components of CMs. In other words, the participants anticipate social roles and neglect the reproduction of their idiosyncratic, personalized views of cities and regions, whose documentation would fragment the corresponding media thematically. To put it in terms of the distinction made by Golledge and Stimson [126] between shared and personalized components of CMs: participants overweight the former to the disadvantage of the latter in order to guarantee the shareability [144, 145] of CMs as a result of distributed cognition.

Note that in our study we did not simply map a frequency effect: although we counted frequencies of topic assignments, these assignments were determined by an inference process based on (machine) learning. To support such an interpretation, however, a deeper analysis is required, based on a larger corpus of wikis and related media serving different functions. It also requires experiments with other, and above all much finer, classification systems than the DDC to find out how much the use of the DDC has influenced our measurements. And it requires a deeper analysis of the social roles of authors in online media, their interactions, and the regulatory systems under which they interact. But this already concerns future work.

6. Conclusion

We developed a novel model of topic networks in order to investigate the networking of rhemes addressing the same places in underlying corpora of natural language texts. We designed our network model so that it enables thematic comparisons of previously unforeseen text corpora using an underlying reference corpus, offers a generic solution to the problem of topic labeling, is highly scalable and can therefore map even the smallest text snippets onto topic distributions, simultaneously takes rare topics into account, and is methodologically open and expandable. Moreover, our model allows for comparatively investigating the networking of thematic units from different angles. In this way, it is open and expandable in the sense that it allows for integrating different analytical perspectives into the study of the same semantic networks. We exemplified our model by means of corpora of special wikis and extracts from Wikipedia in order to investigate how textual information encodes geographical information on the aboutness level of texts. Our experiments show that the thematizations of different places at a certain level of abstraction are similar to each other in that they focus on a few themes in a highly distorted manner while networking them in similar ways. This happens regardless of whether the underlying media are generated by different communities and regardless of whether these communities address related or unrelated, nearby or distant places. We interpreted our findings in the context of the notion of cognitive maps. To this end, we proposed to extend this notion in terms of thematic maps and argued that participants or interlocutors of online communication tend to organize their contributions in a way that makes them shareable. This means that the contributions are abstracted and depersonalized at the aboutness level in such a way that the social roles of these participants become expectable and acceptable, while their personal views of places, whose documentation would fragment the corresponding media thematically, are reduced. Ensuring shareability means securing the continued existence of the wiki, which could otherwise collapse in the face of too many personalized or individualized fragmentations. Future work concerns several tasks: we want to conduct deeper analyses based on larger corpora that manifest a greater variety of communication functions in order to shed more light on the genre sensitivity discovered in our study. Beyond the DDC, we strive for the use of finer-grained, higher-resolution classification systems in order to model the contents of texts much more precisely. Ideally, this should be carried out with the help of systems like the category system of Wikipedia or even Wikidata, both of which develop as open topic universes [146]. Last but not least, a deeper analysis of the social roles of authors in online media and of their coauthorship is required to gain a deeper understanding of the processes of the linguistic encoding of geographical information. This will be the task of future work.

Appendix

A. text2ddc

text2ddc is trained by means of corpora that are derived by integrating information from Wikidata, Wikipedia, and the Integrated Authority File (Gemeinsame Normdatei, GND) of the German National Library: we explore the links of Wikipedia articles to entries in Wikidata that contain the property https://www.wikidata.org/wiki/Property:P1036, which links directly to the DDC, or that link to a GND page containing a DDC tag. An example is the article about the Pythagorean theorem (https://en.wikipedia.org/wiki/Pythagorean_theorem), which is linked to the GND page 4176546-1 (https://d-nb.info/gnd/4176546-1) referring to the DDC tag 516 (geometry). Using such information, we obtain a corpus for a subset of 98 classes of the 2nd DDC level and for a subset of 641 classes of the 3rd DDC level. Since Wikipedia exists for many languages, such corpora can be created for each of them. For preprocessing the input data of text2ddc, we use TextImager [86] and fastSense [88] to disambiguate these data on the sense level. The resulting information is used to train a neural network that classifies any piece of text (down to the word level) into DDC classes (see https://textimager.hucompute.org/DDC/). To this end, text2ddc uses a very efficient classifier, namely fastText [91], a bag-of-words model that trains a neural network with a single hidden layer. We optimize the following hyperparameters of fastText: learning rate: 0; update rate: 150; minimal number of word occurrences: 5; number of epochs: 10,000. In this way, we increase the F-score to 87% for the 2nd level and to 78% for the 3rd level of the DDC.
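
A minimal sketch of such a supervised fastText training run, assuming training data in fastText's one-sample-per-line label format; the file name is hypothetical, and since the learning-rate value printed above appears truncated, the library default is used for it:

```python
import fasttext  # pip install fasttext

# Training data format, one sample per line (hypothetical example):
#   __label__516 the pythagorean theorem relates the sides of a right triangle
model = fasttext.train_supervised(
    input="ddc_train.txt",  # hypothetical corpus derived from Wikidata/GND links
    lrUpdateRate=150,       # update rate as reported in the appendix
    minCount=5,             # minimal number of word occurrences
    epoch=10000,            # number of epochs
)

# Classify a piece of text into the k most probable DDC classes
labels, probs = model.predict("the theorem of pythagoras", k=3)
print(labels, probs)
```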

B. Color Codes and Class Members of the DDC

Figure 27 shows the colors and labels of the classes of the 2nd level of the DDC.

Data Availability

Parts of the programs that underlie our work are available via GitHub (https://github.com/texttechnologylab/GeneticClassifierWorkbench).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

Financial support by the Federal Ministry of Education and Research (BMBF) via the Centre for the Digital Foundation of Research in the Humanities, Social, and Educational Sciences CEDIFOR is gratefully acknowledged.