Abstract

Social event detection in large photo collections is very challenging, and multimodal clustering is an effective methodology for dealing with the problem. Geographic information is important in event detection. This paper proposes a topic-model-based approach to estimate the missing geographic information of photos. The approach uses a supervised multimodal topic model to estimate the joint distribution of time, geographic, content, and attached textual information. We then annotate photos that lack geographic information with predicted geographic coordinates. Experimental results indicate that the annotated geographic information improves clustering performance.

1. Introduction

Social events are events that are planned by people, attended by people, and for which the social multimedia are also captured by people [1]. Massive numbers of social multimedia documents, such as photos and videos, are uploaded to social media. Social event detection aims to categorize or index these documents automatically according to the events to which they belong [2]. Accurate and fast detection is of great significance in social media. Take the Boston Marathon bombings as an example: a large number of photos were taken by the crowd and shared during the event, and these photos may provide valuable clues in the search for suspects. Organizing these photos automatically by event helps to identify the suspects, whereas manual annotation is tedious and lengthy.

Temporal and spatial information are the two major cues for identifying an event. Temporal information can be obtained easily from the time at which a photo was taken. Meanwhile, GPS coordinates, a typical form of spatial information, are widely supported by smartphones nowadays, and many services are built on location information [3]. These location-based services (LBS) are employed in a number of applications, including recommending social events in a city, locating people on a map displayed on a mobile phone, and delivering alerts. Benefiting from the convenience of LBS, more and more people enable the GPS option when taking pictures. However, many users still disable GPS for the sake of privacy, which results in a huge number of photos without geographic information (geomissing photos), and event detection for these geomissing photos is a big challenge [4]. Estimating geographic coordinates for such photos brings further benefits: it improves event clustering, allows photos to be placed on a map by event, and helps different users who attend the same event (e.g., a concert) to share photos.
The main task of this paper is to assign missing geostamps automatically to photos within a collection, based on the existing spatiotemporal information. Specifically, we
(1) employ the Supervised Document Neural Autoregressive Distribution Estimator (SupDocNADE) to model a joint distribution over textual annotation words (including title, tags, and description), image visual words, and geographic information;
(2) estimate the missing geographic information;
(3) improve event-based large-scale photo clustering using the estimated geographic information.
The remainder of this paper is structured as follows. Section 2 presents related work. Section 3 describes the proposed approach in detail. Section 4 presents the social event detection application based on geoannotation. Section 5 presents empirical results, and Section 6 concludes the paper and discusses future work.

2. Related Work

In recent years, research on the organization of social media documents has attracted considerable attention because of the information overload problem, and many researchers have worked on social event detection. Hintsa et al. [5] and Brenner and Izquierdo [6] tried to find additional information about social events from public sources. Papadopoulos et al. [7] used such information as a clue to find more related images. The multimodal clustering algorithm [8] is a widely accepted approach to social event detection. Reuter and Cimiano [2] applied support vector machines (SVM) to classify photos with respect to events. Becker et al. [9] presented an incremental clustering approach for event identification. Papadopoulos et al. [10] used textual, temporal, and spatial information together to increase clustering performance; visual features were also used to exclude noisy pictures. The Latent Dirichlet Allocation (LDA) based methodology [11] first classifies media data according to location and then analyzes textual metadata based on topics extracted from the image descriptions. That work also applies the “same class” model to map images to a graph so that a community detection algorithm can be employed. The spatiotemporal clustering method presented by Zeppelzauer et al. [12] generates a list of candidate events, which can be filtered in a further step by taking additional information into account. The essence of these methodologies lies in creating distance matrices based on several modalities: temporal, spatial, visual, and textual. An event is easy to identify if both temporal and spatial information are known. However, a large number of photos in large collections are not geotagged, which hinders correct event identification.
The basic idea of this paper is to predict the spatial information of a photo without geographic coordinates from the joint distribution of temporal, spatial, visual, and textual information. We assume that two photos with semantically similar visual and textual information are likely to have been taken at the same location.

A topic model is a tool for measuring semantic similarity. Text document modeling is a primary application of topic modeling; an example is the Document Neural Autoregressive Distribution Estimator (DocNADE) [13], which models the joint distribution of the words in a document directly. This distribution is decomposed into a product of conditional distributions, each of which is modeled by a neural network. Zheng et al. [14] extended the model to Supervised DocNADE for image classification and annotation by using label information as supervision during training. Supervised DocNADE has also been used to estimate the joint distribution of time, geographic, visual content, and textual information, which can in turn be used to predict missing geographic information [15].

3. Event-Based Geoannotation

The basic idea of this paper is inspired by Supervised DocNADE. In this section, we first describe the original Supervised DocNADE model and how it learns a joint distribution from multimodal data. We then describe how temporal and spatial information are represented. Finally, we describe a model for the joint distribution of this multimodal information.

3.1. Supervised DocNADE

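Supervised DocNADE models the joint distribution of the image words $\mathbf{v}$ and their class label $y$ as
$$p(\mathbf{v}, y) = p(\mathbf{v})\, p(y \mid \mathbf{v}). \tag{1}$$
The observation is $\mathbf{v} \in \{0, 1\}^{D}$, where the value 0 indicates that a word does not exist in a text while 1 indicates that it does. Following DocNADE, through the probability chain rule, $p(\mathbf{v})$ can be decomposed as
$$p(\mathbf{v}) = \prod_{i=1}^{D} p(v_i \mid \mathbf{v}_{<i}). \tag{2}$$
For $i \in \{1, \dots, D\}$, every autoregressive conditional can be modeled and learned by the following feed-forward neural network:
$$\mathbf{h}_i(\mathbf{v}_{<i}) = g(\mathbf{c} + \mathbf{W}_{:,<i}\, \mathbf{v}_{<i}), \qquad p(v_i = 1 \mid \mathbf{v}_{<i}) = \mathrm{sigm}(b_i + \mathbf{V}_{i,:}\, \mathbf{h}_i(\mathbf{v}_{<i})). \tag{3}$$
Here $\{\mathbf{b}, \mathbf{c}, \mathbf{W}, \mathbf{V}\}$ are the neural network parameters, where $\mathbf{b} \in \mathbb{R}^{D}$ and $\mathbf{c} \in \mathbb{R}^{H}$ are bias parameter vectors, $\mathbf{W} \in \mathbb{R}^{H \times D}$ and $\mathbf{V} \in \mathbb{R}^{D \times H}$ are the connection parameter matrices between the hidden units (topics) and the vocabulary, whose size is $D$, and $H$ is the number of hidden units. $\mathbf{v}_{<i}$ is the subvector $[v_1, \dots, v_{i-1}]$, $\mathbf{W}_{:,<i}$ is the matrix made of the first $i-1$ columns of $\mathbf{W}$, and $g$ is a nonlinear activation function.

A balanced binary tree whose leaves represent the different image words is used to decompose the computation of the conditionals. In particular, on the path from the root node to the leaf of image word $v_i$, let $l(v_i)$ be the sequence of tree nodes and $\pi(v_i)$ be the corresponding sequence of binary left/right choices. For example, if the word is in the left subtree of the $m$th node on the path, $\pi(v_i)_m$ is 0; otherwise it is 1. The probability can be computed as follows:
$$p(v_i \mid \mathbf{v}_{<i}) = \prod_{m=1}^{|\pi(v_i)|} p(\pi(v_i)_m \mid \mathbf{v}_{<i}), \tag{4}$$
where
$$p(\pi(v_i)_m = 1 \mid \mathbf{v}_{<i}) = \mathrm{sigm}(b_{l(v_i)_m} + \mathbf{V}_{l(v_i)_m,:}\, \mathbf{h}_i(\mathbf{v}_{<i})). \tag{5}$$

Supervised DocNADE takes the class label as a supervised layer. To obtain the joint distribution $p(\mathbf{v}, y)$, it models the conditional probability $p(y \mid \mathbf{v})$ as a regular multiclass neural network:
$$p(y \mid \mathbf{v}) = \mathrm{softmax}(\mathbf{d} + \mathbf{U}\, \mathbf{h}_y(\mathbf{v}))_y, \tag{6}$$
where
$$\mathbf{h}_y(\mathbf{v}) = g(\mathbf{c} + \mathbf{W}\mathbf{v}), \tag{7}$$
$\mathbf{d}$ is the bias parameter vector of the supervised layer, and $\mathbf{U}$ is the connection matrix between the hidden layer and the class label $y$. In the Supervised DocNADE model, the image words include image visual words (coming from SIFT features and region pairs) and textual annotation words; that is, $\mathbf{v} = [\mathbf{v}^{\mathrm{vis}}, \mathbf{v}^{\mathrm{txt}}]$.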
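
As an illustration of the tree decomposition described above, the following minimal Python sketch computes the probability of a word as a product of left/right decisions along its root-to-leaf path. The helper names (`build_paths`, `word_probability`), the toy hidden vector, and the random parameters are assumptions made for this example; they are not taken from the original model code.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def build_paths(vocab_size):
    """Build a balanced binary tree over word ids 0..vocab_size-1.

    Returns (paths, n_nodes), where paths[w] lists the (node id, bit) pairs
    from the root to the leaf of word w; bit 0 means left, bit 1 means right.
    """
    paths = {w: [] for w in range(vocab_size)}
    n_nodes = 0
    stack = [(0, vocab_size)]          # half-open ranges of word ids
    while stack:
        lo, hi = stack.pop()
        if hi - lo <= 1:               # a single word is a leaf
            continue
        node, n_nodes = n_nodes, n_nodes + 1
        mid = (lo + hi) // 2
        for w in range(lo, mid):
            paths[w].append((node, 0))
        for w in range(mid, hi):
            paths[w].append((node, 1))
        stack.append((lo, mid))
        stack.append((mid, hi))
    return paths, n_nodes


def word_probability(word, hidden, node_bias, node_weights, paths):
    """Multiply the branch probabilities along the path of `word`."""
    prob = 1.0
    for node, bit in paths[word]:
        activation = node_bias[node] + sum(
            w * h for w, h in zip(node_weights[node], hidden))
        p_right = sigmoid(activation)  # probability of taking the right branch
        prob *= p_right if bit == 1 else 1.0 - p_right
    return prob


random.seed(0)
paths, n_nodes = build_paths(5)
hidden = [0.2, -0.1, 0.7]              # toy hidden representation
node_bias = [random.uniform(-1, 1) for _ in range(n_nodes)]
node_weights = [[random.uniform(-1, 1) for _ in hidden] for _ in range(n_nodes)]
probs = [word_probability(w, hidden, node_bias, node_weights, paths)
         for w in range(5)]
```

Because each internal node splits its probability mass between its two children, the values in `probs` sum to one for any parameter setting, while each word needs only about $\log_2 D$ sigmoid evaluations instead of a normalization over the whole vocabulary.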

3.2. Adding Time Word and Geo Word

So far, Supervised DocNADE provides an approach to model the joint distribution of image class labels and image words. To model the joint distribution of photos and social events, we take time and geographic information into account. Specifically, we take the social event type as the class label and extend the image words with a time word and a geographic word.

3.2.1. Adding Time Word

We adopt the time at which an image was taken as the source of the time word. In our experimental dataset, all images were taken between 2006 and 2012. We collect the dates on which the photos were taken and treat every different date as a distinct time word. To simplify the computation, we sort all dates in ascending order and assign an integer id to every distinct time word, so every image has exactly one active time word. The image word vector can now be represented as $\mathbf{v} = [\mathbf{v}^{\mathrm{vis}}, \mathbf{v}^{\mathrm{txt}}, \mathbf{v}^{\mathrm{time}}]$.
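
The time-word construction can be sketched in a few lines of Python. This is a minimal illustration under the assumption that each photo carries a `datetime` timestamp; the function names `build_time_vocabulary` and `time_word` are hypothetical.

```python
from datetime import datetime


def build_time_vocabulary(timestamps):
    """Map every distinct capture date to an integer id, in ascending order."""
    dates = sorted({ts.date() for ts in timestamps})
    return {date: index for index, date in enumerate(dates)}


def time_word(timestamp, vocabulary):
    """Return the single active time word of a photo."""
    return vocabulary[timestamp.date()]


taken = [
    datetime(2010, 5, 1, 18, 30),
    datetime(2006, 1, 2, 9, 0),
    datetime(2010, 5, 1, 23, 59),
]
vocabulary = build_time_vocabulary(taken)
ids = [time_word(ts, vocabulary) for ts in taken]
```

Photos taken on the same calendar day share a time word regardless of the hour, so the first and third photos above receive the same id.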

3.2.2. Adding Geographic Word

First, we construct a geographic id list and select a distance threshold $\tau$. Then, we calculate the distance between every two images: if the distance is less than $\tau$, we assign the two images the same geographic id; otherwise, we add a new geographic id to the list. Each id is saved in the list together with its corresponding GPS coordinate. Finally, the image word vector can be represented as $\mathbf{v} = [\mathbf{v}^{\mathrm{vis}}, \mathbf{v}^{\mathrm{txt}}, \mathbf{v}^{\mathrm{time}}, \mathbf{v}^{\mathrm{geo}}]$. Note that the length of the geographic id list may differ from the number of events in the dataset: an event may have two or more geographic ids, and distinct events may share the same geographic id.
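
One possible reading of this procedure is a greedy assignment, sketched below in Python: each photo reuses the first stored id whose coordinate lies within the threshold and otherwise opens a new id. The greedy order, the great-circle distance helper, the Earth radius value, and the 1 km threshold are assumptions of this sketch, not details given in the text.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, assumed for this sketch


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def assign_geo_ids(coordinates, threshold_km):
    """Return (ids, geo_list); geo_list[id] is the coordinate stored for id."""
    geo_list = []
    ids = []
    for coordinate in coordinates:
        for geo_id, stored in enumerate(geo_list):
            if haversine_km(coordinate, stored) < threshold_km:
                ids.append(geo_id)          # close enough: reuse this id
                break
        else:
            geo_list.append(coordinate)     # no nearby id: create a new one
            ids.append(len(geo_list) - 1)
    return ids, geo_list


coordinates = [
    (42.3601, -71.0589),   # Boston
    (42.3611, -71.0570),   # a few hundred meters away
    (40.7128, -74.0060),   # New York
]
ids, geo_list = assign_geo_ids(coordinates, threshold_km=1.0)
```

Storing one coordinate per id is what later allows a predicted geographic word to be turned back into a GPS coordinate.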

3.3. Model Training

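So far, the joint distribution can be modeled as $p(\mathbf{v}, y)$, where $\mathbf{v} = [\mathbf{v}^{\mathrm{vis}}, \mathbf{v}^{\mathrm{txt}}, \mathbf{v}^{\mathrm{time}}, \mathbf{v}^{\mathrm{geo}}]$. The model is trained by minimizing the negative log-likelihood
$$-\log p(\mathbf{v}, y) = -\log p(y \mid \mathbf{v}) - \sum_{i=1}^{D} \log p(v_i \mid \mathbf{v}_{<i}).$$
Following Supervised DocNADE, a regularization hyperparameter $\lambda$ is introduced to weight the importance of the generative term:
$$L(\mathbf{v}, y) = -\log p(y \mid \mathbf{v}) - \lambda \sum_{i=1}^{D} \log p(v_i \mid \mathbf{v}_{<i}).$$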
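
The weighting of the generative term can be illustrated numerically. The Python sketch below assumes that the label probability and the per-word conditionals have already been computed by the model; the function name `hybrid_loss` and the probability values are hypothetical.

```python
import math


def hybrid_loss(p_label, word_conditionals, lam):
    """Discriminative term plus the generative term weighted by lam.

    p_label           -- p(y | v) for the true event type
    word_conditionals -- the values p(v_i | v_<i), one per word
    lam               -- regularization hyperparameter
    """
    discriminative = -math.log(p_label)
    generative = -sum(math.log(p) for p in word_conditionals)
    return discriminative + lam * generative


full_nll = hybrid_loss(0.5, [0.5, 0.25], lam=1.0)    # plain -log p(v, y)
label_only = hybrid_loss(0.5, [0.5, 0.25], lam=0.0)  # ignores the words
```

With `lam = 1` the objective equals the negative log-likelihood of the joint distribution, and with `lam = 0` only the event-type prediction is trained; intermediate values trade the two terms off against each other.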

Algorithm 1 gives pseudocode for computing the joint distribution $p(\mathbf{v}, y)$.

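input: bag-of-words representation $\mathbf{v}$ and event type $y$
output: $-\log p(\mathbf{v}, y)$
begin
  $\mathbf{act} \leftarrow \mathbf{c}$
  $\log p(\mathbf{v}) \leftarrow 0$
  for $i$ from 1 to $D$ do
     $\mathbf{h}_i \leftarrow g(\mathbf{act})$
     $p(v_i \mid \mathbf{v}_{<i}) \leftarrow 1$
     for $m$ from 1 to $|\pi(v_i)|$ do
        $p(v_i \mid \mathbf{v}_{<i}) \leftarrow p(v_i \mid \mathbf{v}_{<i})\, p(\pi(v_i)_m \mid \mathbf{v}_{<i})$
     end
     $\log p(\mathbf{v}) \leftarrow \log p(\mathbf{v}) + \log p(v_i \mid \mathbf{v}_{<i})$
     $\mathbf{act} \leftarrow \mathbf{act} + \mathbf{W}_{:,i}\, v_i$
  end
  $\mathbf{h}_y(\mathbf{v}) \leftarrow g(\mathbf{act})$
  $p(y \mid \mathbf{v}) \leftarrow \mathrm{softmax}(\mathbf{d} + \mathbf{U}\, \mathbf{h}_y(\mathbf{v}))_y$
  $-\log p(\mathbf{v}, y) \leftarrow -\log p(\mathbf{v}) - \log p(y \mid \mathbf{v})$
end
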
3.4. Geoannotation

After the joint distribution is obtained, we predict the geographic information for photos without a geotag. Based on the visual words, annotation words, and time word, the document representation is computed, and the probability of observing each possible geographic word is obtained through the tree decomposition according to (4). We then rank the geographic ids in the list and select the id with the highest probability as the predicted geographic word; the GPS coordinate stored for that id is used as the predicted coordinate.
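
The ranking step can be sketched as follows. This minimal Python example assumes that the model has already produced one probability per candidate geographic id; the function name `predict_geo_word`, the probabilities, and the coordinates are hypothetical.

```python
def predict_geo_word(geo_probabilities, geo_list):
    """Rank candidate geographic ids and return the most probable one.

    geo_probabilities -- {geo id: probability of that geographic word}
    geo_list          -- {geo id: (latitude, longitude) stored for that id}
    """
    ranked = sorted(geo_probabilities, key=geo_probabilities.get, reverse=True)
    best = ranked[0]
    return best, geo_list[best], ranked


geo_list = {
    0: (42.3601, -71.0589),
    1: (40.7128, -74.0060),
    2: (51.5074, -0.1278),
}
probabilities = {0: 0.2, 1: 0.7, 2: 0.1}
best_id, coordinate, ranked = predict_geo_word(probabilities, geo_list)
```

Keeping the full ranking, rather than only the top id, would also make it possible to inspect how confident a prediction is relative to the runner-up.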

4. Event Detection

After geographic annotation, we perform the event detection task on large photo collections using multimodal clustering. Every photo is represented by a set of features, and for any two photos we compute a vector that contains their per-feature distances. Formally, given two photos $p_1$ and $p_2$, represented by features $f_1$ and $f_2$, respectively, we compute the distance between the two photos for each feature and then predict whether the photos belong to the same event by a function of these distances. In our scenario, we use the following set of features and distance measures.
(1) Capture Time. We rely on the timestamp of a photo to define a time-based distance as follows:
$$d_{\mathrm{time}}(p_1, p_2) = \frac{\log |t_1 - t_2|}{y},$$
where $t_1$ and $t_2$ are the timestamps of the two photos in minutes and $y$ is the logarithm of the number of minutes in a year.
(2) Geographical Information. We use the latitude and longitude denoting the location of a photo (the actual coordinate is used if it exists; otherwise the predicted coordinate is used) and compute the great-circle distance between two locations with the Haversine formula:
$$d_{\mathrm{geo}}(p_1, p_2) = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\frac{\psi_2 - \psi_1}{2}\right)}\right),$$
where $\varphi$ is latitude, $\psi$ is longitude, and $r$ is the Earth radius.
(3) Textual Information. Photos uploaded by users are typically accompanied by a title, a set of tags, and a description. We extract features from the dataset using a sparse vectorizer based on Term Frequency-Inverse Document Frequency (TF-IDF) and map the most frequent words to feature indices to obtain a word occurrence frequency matrix. We also use Latent Semantic Analysis (LSA) to perform dimensionality reduction.
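
The time and geographic distances can be sketched in Python as below. The Haversine part follows the standard formula. The time part is an assumption of this sketch: it takes the distance to be the logarithm of the time difference in minutes divided by the logarithm of the number of minutes in a year, and it returns 0 for differences under one minute to avoid taking the logarithm of zero. The function names, the photo dictionaries, and the Earth radius value are likewise hypothetical.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0                          # mean Earth radius (assumed)
LOG_MINUTES_PER_YEAR = math.log(365 * 24 * 60)    # y in the text


def time_distance(t1, t2):
    """Normalized log time difference; 0 for gaps shorter than one minute."""
    minutes = abs((t1 - t2).total_seconds()) / 60.0
    if minutes < 1.0:
        return 0.0
    return math.log(minutes) / LOG_MINUTES_PER_YEAR


def geo_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km by the Haversine formula."""
    phi1, psi1, phi2, psi2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin((psi2 - psi1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def distance_vector(photo_a, photo_b):
    """Per-feature distances (time, geographic) between two photos."""
    return (time_distance(photo_a["taken"], photo_b["taken"]),
            geo_distance_km(*photo_a["geo"], *photo_b["geo"]))


a = {"taken": datetime(2011, 1, 1), "geo": (0.0, 0.0)}
b = {"taken": datetime(2012, 1, 1), "geo": (0.0, 180.0)}
d_time, d_geo = distance_vector(a, b)
```

In this example the two photos are exactly one (non-leap) year apart and on opposite sides of the equator, so the time distance is 1 and the geographic distance is half the Earth's circumference.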

5. Experiments

To test the proposed approach, we measured its performance on a social event clustering task using a real-world dataset: the MediaEval Benchmark for Multimedia Evaluation dataset for the Social Event Detection (SED) task 2013 [1]. We first measure the accuracy of geoannotation and then compare event clustering performance with and without geoannotation.

5.1. Datasets and Experimental Setup

The dataset consists of 437,370 photos uploaded by 4,923 different users. The events in the dataset are heterogeneous, including sport events, protest marches, BBQs, debates, expositions, festivals, and concerts. There are 200,917 geotagged photos, and the average geotagged ratio per event in the whole collection is 45.94%. The dataset includes the metadata of photos uploaded in the years from 2006 to 2012. For this paper we require the timestamp, the geotag if available, the title, the text tags, and the description of each photo, as well as the venue coordinate of each event. We also require the event type label for the supervised layer.

We first preprocess the dataset. We delete photos with invalid geographic information, that is, photos whose latitude is not between −90 and 90 or whose longitude is not between −180 and 180. We also delete photos whose timestamp is not between January 1, 2006, and December 31, 2012. Then we perform word segmentation on the textual information (including title, tags, and description) and remove stop words, some other specific words such as “http://,” “www,” “href,” “com,” and “org,” and non-ASCII characters. We also remove words that occur only once in the whole dataset. After that, we delete events with fewer than 10 photos left. Finally, we obtain 5,676 events with 247,227 photos. The average geotagged ratio per event in the final collection is 47.22%, which is close to that of the whole dataset.

We select data from the preprocessed dataset at two levels. The small level consists of all events that are fully geotagged; it is used to evaluate the average distance within the same event. The large level consists of events with more than 3 geotagged photos, a condition that is necessary for training the joint distribution. We use all geotagged photos as training data and estimate the missing geographic information for the remaining photos. Detailed information on the two levels is given in Table 1.
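
The filtering steps above can be sketched in Python as follows. The record layout (`event`, `taken`, `lat`, `lon`, `words`), the function names, and the choice to drop whole tokens that contain non-ASCII characters are assumptions of this sketch; the example lowers the minimum event size to 2 only to keep the toy data short.

```python
from collections import Counter
from datetime import datetime

START = datetime(2006, 1, 1)
END = datetime(2012, 12, 31, 23, 59, 59)
BLOCKED = {"http", "www", "href", "com", "org"}


def valid_photo(photo):
    """Reject out-of-range coordinates and timestamps; geotags are optional."""
    lat, lon = photo["lat"], photo["lon"]
    if lat is not None and lon is not None:
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            return False
    return START <= photo["taken"] <= END


def preprocess(photos, stop_words, min_photos=10):
    kept = [p for p in photos if valid_photo(p)]
    counts = Counter(w for p in kept for w in p["words"])
    cleaned = []
    for p in kept:
        words = [w for w in p["words"]
                 if w.isascii() and w not in stop_words
                 and w not in BLOCKED and counts[w] > 1]  # drop words seen once
        cleaned.append(dict(p, words=words))
    sizes = Counter(p["event"] for p in cleaned)
    return [p for p in cleaned if sizes[p["event"]] >= min_photos]


photos = [
    {"event": "e1", "taken": datetime(2010, 5, 1), "lat": 42.36, "lon": -71.06,
     "words": ["boston", "marathon", "www"]},
    {"event": "e1", "taken": datetime(2010, 5, 1), "lat": None, "lon": None,
     "words": ["boston", "marathon", "the"]},
    {"event": "e1", "taken": datetime(2010, 5, 2), "lat": 95.0, "lon": 10.0,
     "words": ["boston"]},                      # invalid latitude
    {"event": "e2", "taken": datetime(2013, 1, 1), "lat": None, "lon": None,
     "words": ["concert"]},                     # outside the time range
    {"event": "e3", "taken": datetime(2011, 7, 4), "lat": None, "lon": None,
     "words": ["bbq", "unique"]},               # event too small afterwards
]
clean = preprocess(photos, stop_words={"the"}, min_photos=2)
```

Photos without a geotag are kept at this stage, since they are exactly the ones whose coordinates are to be estimated later.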

5.2. Event Distance Evaluation

We first evaluate the average distance in the small level dataset. We take the venue location as the center of an event and calculate the distance between each photo coordinate and the venue coordinate. We select 10 events from those with more than 100 photos and show the average distance and standard deviation of the selected events in Figure 1. There is no apparent correlation between the average distance and the standard deviation. Over all events in the small level dataset, the average distance is 3.20 km.

5.3. Geographic Information Estimate

We then train the joint distribution on the large level dataset and estimate the missing geographic information. Because the original geotagged ratio is high (more than 70%), we decrease it by randomly removing the geographic information of some photos. For comparison, we create three separate datasets by removing different amounts of geographic information, ignoring events with fewer than 3 geotagged photos. The three datasets are described in Table 2.
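
A minimal Python sketch of this reduction step is given below. The record layout, the function name `remove_geotags`, the fixed random seed, and the way the target ratio is applied (as a fraction of all photos) are assumptions made for illustration.

```python
import random


def remove_geotags(photos, target_ratio, min_geotagged=3, seed=0):
    """Randomly drop geotags down to target_ratio, then drop sparse events."""
    rng = random.Random(seed)
    tagged = [i for i, p in enumerate(photos) if p["geo"] is not None]
    n_keep = min(len(tagged), int(round(target_ratio * len(photos))))
    keep = set(rng.sample(tagged, n_keep))
    reduced = [dict(p, geo=p["geo"] if i in keep else None)
               for i, p in enumerate(photos)]
    # Count the geotagged photos that remain in each event.
    counts = {}
    for p in reduced:
        if p["geo"] is not None:
            counts[p["event"]] = counts.get(p["event"], 0) + 1
    # Ignore events with fewer than min_geotagged geotagged photos.
    return [p for p in reduced if counts.get(p["event"], 0) >= min_geotagged]


photos = ([{"event": "e1", "geo": (42.36, -71.06)} for _ in range(10)]
          + [{"event": "e2", "geo": (40.71, -74.01)} for _ in range(4)])
reduced = remove_geotags(photos, target_ratio=0.5)
```

The removed coordinates can be kept aside as ground truth, which is what makes it possible to measure the accuracy of the estimates afterwards.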

As with the small level dataset, we first calculate the average distance between the existing geographic information and the venue location of every event. To evaluate the accuracy of the estimated geographic information, we define a measure similar to the root mean square error (RMSE):
$$E = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (d_i - \bar{d})^2},$$
where $n$ is the number of estimated photos, $d_i$ is the distance between the estimated location of photo $i$ and the venue location, and $\bar{d}$ is the mean distance between the existing locations and the venue location. The smaller the value of $E$, the higher the accuracy. We again select 10 events from those with more than 100 photos and show the three values of each event in the three large level datasets in Figure 2. For each event, the geotagged ratio differs among the three datasets: the ratio of Large 1 is the smallest and that of Large 3 is the largest. Correspondingly, the accuracy on Large 1 is lower than on Large 2, and the accuracy on Large 2 is lower than on Large 3. The reason is that the joint distribution of the trained model is more accurate with a higher geotagged ratio, which leads to more accurate estimates.
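
The RMSE-like measure can be computed as in the following Python sketch. The function name `estimation_error` and the distance values are hypothetical; distances are assumed to be given in kilometers.

```python
import math


def estimation_error(estimated_distances, existing_distances):
    """RMSE-like deviation of estimated distances from the existing mean.

    estimated_distances -- distance from each estimated location to the venue
    existing_distances  -- distance from each existing geotag to the venue
    """
    reference = sum(existing_distances) / len(existing_distances)
    squared = [(d - reference) ** 2 for d in estimated_distances]
    return math.sqrt(sum(squared) / len(estimated_distances))


existing = [1.0, 3.0]          # mean distance to the venue: 2.0 km
estimated = [2.0, 4.0, 0.0]
error = estimation_error(estimated, existing)
```

The measure is zero only when every estimated photo lies exactly as far from the venue as the existing geotags do on average, so it compares distances to the venue rather than positions directly.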

5.4. Event Detection Comparison

For every photo we extract a time feature, a geographic feature, and textual features, and we use one or more of these features to perform the event clustering task. Clustering quality is measured by Adjusted Mutual Information (AMI). To demonstrate the importance of geographic information for the clustering task, we perform clustering with and without geographic information. The results are shown in Figure 3. We set a threshold on the number of photos in each event to reduce the number of clusters. For example, with a threshold of 50, only events with more than 50 photos are taken into account in the clustering task; a lower threshold therefore yields more events.

Clustering with geographic information dramatically outperforms clustering without it. Furthermore, we use LSA to perform dimensionality reduction on the textual information. In the experiment without geographic information, a higher text dimension leads to better clustering performance, although the difference between text dimensions is slight. The importance of textual information is reduced when geographic information is taken into account, so the textual information can be reduced to a lower dimension to speed up the clustering task. We also perform a comparative clustering experiment on the datasets before and after geographic information estimation. The results are shown in Figure 4.

As a baseline, we perform clustering on the three large level datasets. For photos missing geographic information, we set the geographic coordinate to a fixed value in order to proceed with the clustering task. We then perform clustering on the same photo collection after the geographic information has been predicted. As in the previous experiment, we set the textual dimension to 10 and 20. The AMI improves from 0.67 to 0.97 after the geographic information is estimated. Comparing Figure 3 with Figure 4, we also find that the difference between textual dimensions is smaller when geographic information is taken into account: the AMI difference between textual dimensions 10 and 20 is nearly 0.05 without geographic information but 0.03 with the before-estimation geographic information. However, the average AMI decreases because the missing geographic information is set to a fixed value, which introduces noise.

5.5. Discussion

Experimental results show that multimodal topic modeling combining geographic information and image tags is helpful for clustering social media images by event. Figure 4 indicates that the accuracy of the estimated geographic information affects clustering performance. However, as Figure 2 shows, this accuracy varies widely between events: the error may be as small as a few meters or larger than tens of kilometers. Comparing Figures 1 and 2, we find that the geographic coverage of different events also varies widely. We therefore analyzed the geographic dispersion of photos within events and found two different patterns: a random pattern and a clumped pattern, shown in Figures 5(a) and 5(b), respectively. In the random pattern, all photos of an event are scattered randomly within a certain area. In the clumped pattern, the photos of an event cluster into several groups located in separate areas. For example, for a concert tour event, photos uploaded by audiences may be located thousands of kilometers from each other. For either dispersion pattern, the accuracy of the estimated geographic information depends on the geotagged photos and on the correlation between their visual words and textual tags. In the random pattern event of Figure 5(a), the geotagged photos (shown as crosses) are located within a circle of diameter $d$. The missing geographic information of an untagged photo (shown as a dot) is taken from the geographic coordinates with the highest correlation, so the estimate also lies within the circle, and the maximum error is not greater than the diameter $d$. For a clumped event, as shown in Figure 5(b), the geotagged photos are distributed in four groups within circles of diameters $d_1$, $d_2$, $d_3$, and $d_4$. We again take the coordinates of the geotagged photo with the highest correlation as the estimated geographic information, so the maximum error is not greater than the maximum of $d_1$, $d_2$, $d_3$, and $d_4$. Finally, the estimated error is less than … even if … .

6. Conclusions and Future Work

We presented a system that clusters photos in large collections by event. Evaluations show that the performance of event clustering in photo collections depends strongly on geographic information. We used the Supervised DocNADE model to estimate the joint distribution of time, geographic, visual content, and textual information. The missing geographic information is predicted based on the shortest path on the trained balanced binary tree. The event clustering experiment shows that the estimated geographic information improves clustering performance dramatically. There are two important issues that we intend to address in future work. First, we intend to tackle the event evolution problem. The proposed method needs an event database that includes the event id and event type for training the supervised layer. However, some events evolve from one type to another. If the type changes, the event should be treated as a new event, and a dividing line must be determined for doing so. Second, we intend to investigate dynamic retraining of the model. As more and more geotagged photos arrive, the model should be retrained to obtain a more reasonable joint distribution.

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the Fundamental Research Funds for the Central Universities of China under Grants no. N150404003 and no. N150308001, the Liaoning Province Science and Technique Foundation under Grant no. 2013217004-1, and the National Natural Science Foundation of China under Grant no. 51607029.