Abstract

In this paper, we develop a framework to estimate network flow length distributions in terms of the number of packets. We model the network flow length data as a three-way array of counts indexed by day-of-week, hour-of-day, and flow length. In a high-speed network, only a sampled version of such an array can be observed, and reconstructing the true flow statistics from fewer observations becomes a computational problem. We formulate the sampling process as matrix multiplication, so that any sampling method can be used in our framework as long as its sampling probabilities are written in matrix form. We demonstrate our framework on a high-volume real-world data set collected from a mobile network provider, with both a random packet sampling and a flow-based packet sampling method. We show that modeling the network data as a tensor improves estimates of the true flow length histogram under both sampling methods.

1. Introduction

Monitoring network statistics is crucial for maintenance and infrastructure planning at network service providers. Statistical information about traffic patterns helps a service provider to characterize its network resource usage and user behavior, to infer future traffic demands, to detect traffic and usage anomalies, and to provide insights to improve the performance of the network [1]. However, network monitoring has become a difficult task due to the increasing volume and speed of data over modern networks, and in most cases, it requires special hardware. For this reason, sampling [2] becomes a viable approach for extracting statistics from such high-speed networks. In this work, we are interested in one of the most important network statistics, the flow length distribution (FLD).

A network flow is defined as a set of Internet protocol (IP) packets with the same signature observed within a limited time period. The flow signature is composed of the IP and port pairs of both the source and destination nodes, together with the transport-layer protocol type such as the transmission control protocol (TCP) or the user datagram protocol (UDP). A flow starts with the arrival of the first packet and is terminated when the interpacket timeout is exceeded. The total number of packets in a flow is referred to as the flow length, and the length distribution of a set of flows that are terminated in a time window is called the flow length distribution.

In this work, we are using one of the most popular methods for collecting per-flow information, i.e., passive measurement. In this method, network packets are processed as they pass through a passive measurement beacon connected to the network, e.g., a router. The beacon keeps a look-up table for flow identification. The beacon processes a packet by searching for its corresponding flow in the look-up table using the packet's signature. If such a flow is found, its statistics are updated. Otherwise, the packet is treated as the first packet of a new flow, and the new flow is inserted into the table. Once a flow is terminated, its statistics are transferred to storage.

The flow length histogram can be calculated exactly by processing every packet that passes through the measurement beacon. In order to implement such a direct method, the monitoring beacon needs to maintain a table holding information for all active flows on the network. However, the substantial number of concurrent flows and the very short packet interarrival times of current high-speed networks (on the order of 10 Gbps to 100 Gbps inside carriers' networks today) make this brute-force counting method very costly to implement. First of all, this method would require a large amount of memory to record the flow table. Secondly, on a high-speed link, the interarrival times between packets, which may be as small as 8 nanoseconds on an OC-768 link, may be smaller than the time required to process the flow hash operations, such as identifying the packet and updating the flow statistics.

The characteristics of the network traffic data inevitably lead to the development of alternative measurement methods such as random sampling, where a fraction of the network traffic is randomly selected and processed. The simplest sampling method is uniform packet sampling [3–6], used in commercial systems [7, 8]. In uniform sampling, each packet is selected with a predefined constant probability. This approach is easy to implement since it does not require the flow identification of each packet. However, recovering the true flow length distribution from randomly packet-sampled traffic is a challenging problem. The unbiased estimator of the original flow length $n$ for sampling probability $\pi$ is $\hat{n} = m/\pi$, where $m$ is the observed flow length. The relative error of this estimator, calculated as $\sqrt{(1-\pi)/(n\pi)}$ [3], grows unboundedly for short flows as the sampling rate gets smaller. For example, at $\pi = 1/100$, a 5-packet flow is missed entirely with probability $(1-\pi)^{5} \approx 0.95$. The high error on the small flow lengths comes from the fact that most of the samples are collected from longer flows.

Flow-based adaptive sampling methods [9–14] were proposed for more accurate flow length estimation. In these methods, each incoming packet is processed and then sampled with a probability that is a function of the current sampled length of the flow that the packet belongs to. Here, the main idea is to use a smaller memory by compressing the flow statistics counters in the router. However, these methods need to be implemented on specialized and expensive hardware due to the mandatory packet identification and look-up step.

Both packet-based and flow-based adaptive sampling methods rely on numerical methods to recover the true FLD. In this work, we propose a framework that can be used to recover the true FLD from the sampled observation obtained by any sampling method. This framework uses a variant of the nonnegative tensor factorization (NTF) model, which we call the thin nonnegative tensor factorization (ThinNTF), where the "thin" prefix emphasizes that the factorization is applied directly to the samples, or "thinned", data.

In our framework, the network traffic data is modeled as a 3-way array, containing the number of flow length observations, with dimensions interpreted as (1) flow length, (2) hour-of-day, and (3) day-of-week, to capture the hourly and daily periodicity in the data. The nonnegative factorization of this tensor gives us estimates in the form of a nonparametric mixture model. Our model therefore improves on the nonparametric flow length models in [3, 6] by modeling the data with an arbitrary number of mixture components and by exploiting the periodicity.

While the ordinary NTF model [15] factorizes an observation tensor, the ThinNTF directly factorizes its sampled version and recovers the original tensor from the estimated factors. We take a fully Bayesian approach here and provide a generative model for the ThinNTF and a variational Bayes algorithm for inference. The contributions of this paper can be listed as follows:
(i) We model one week of flow length observations as a 3-dimensional tensor and observe its periodic behavior.
(ii) We propose a novel tensor factorization scheme, ThinNTF, which is able to find the factors of a latent tensor from its sampled counterpart. By doing so, we also solve the reconstruction problem.
(iii) We apply ThinNTF to real-world data sampled with two different sampling methods: uniform random packet sampling and flow-based adaptive sampling.

The structure of the paper is as follows. In Section 2, we review the related work on network sampling and tensor factorization. In Section 3, we describe our real-world data and how we visualize it as a tensor; additionally, we describe the sampling methods that we used to sample the data. In Section 4, we describe our ThinNTF model and the variational Bayes algorithm for estimating the factors. In Section 5, we describe our real-world data collection architecture. In Section 6, we present our synthetic and real-world experiments and results. Finally, in Section 7, we draw our conclusions.

2. Related Work

Sampling methods have long been applied to network traffic monitoring. A survey of fundamental network sampling strategies is given in [2]. Uniform packet sampling has been extensively studied by various authors. Duffield et al. [3] propose the first nonparametric model for the flow length distribution and provide a maximum likelihood estimation to recover the flow lengths. Ribeiro et al. [4] show that using protocol-specific information gives better flow length distribution estimates for TCP flows. Yang and Michailidis [6] adopt the maximum likelihood approach to estimate both flow length and flow volume (number of bytes in a flow) distributions. Additionally, they model the data with a nonparametric mixture model of two components, where the first component models small flows and the second models large ones.

Flow-based sampling methods have been proposed as alternatives to uniform packet sampling, since packet sampling has theoretical limitations when recovering true flow statistics [5]. These methods process every incoming packet and apply sampling conditionally. Kumar et al. [13, 16] propose two different algorithms where the flow size counters are compressed statistically. They also propose a nonuniform packet sampling algorithm based on sketch counting [12]. Hu et al. [10, 14] propose another nonuniform packet sampling algorithm, called adaptive nonlinear sampling (ANLS), for estimating the length of each flow and later adapt this method to flow volumes [11]. In our experiments, we use ANLS as the example of flow-based sampling methods, since it is the current state-of-the-art nonuniform sampling method.

Nonnegative tensor factorization is the generalization of the nonnegative matrix factorization (NMF) [17] to multiway arrays. In NMF, a nonnegative matrix is approximated with a multiplication of two nonnegative matrices. Minimizing the Kullback–Leibler divergence between the initial matrix and multiplied factors is a popular formulation of this method and can be solved with fixed-point iterations [18] or full Bayesian methods [19]. NMF has been used in many applications such as spectral data analysis [20], face recognition [17], and document clustering [21].

Modeling the flow length distribution as a mixture of distributions was first proposed in [6]. However, to the best of our knowledge, there is no previous work that models a large volume of flow length data as a tensor. This work fills that gap in the literature by introducing tensor factorization methodology to network monitoring.

3. Problem Description

We describe our problem as a tensor thinning problem, where the counts of the original flow lengths are stored in a tensor. We formulate the sampling process as a matrix multiplication operating on this data tensor. To that end, each sampling method must be represented as a matrix that transforms the original data tensor into a sampled one. We provide matrices for two sampling models: uniform packet sampling and ANLS flow-based packet sampling.

3.1. Notation and Indexes

For a clear notation, scalar values are denoted by lightface letters, such as the index variable j and its maximum value J. Vectors are represented by boldface lower case letters, such as the vector x. Boldface upper case letters represent matrices, such as S and D, and tensors are represented by calligraphic upper case letters, e.g., $\mathcal{X}$. The individual entries of matrices and tensors are written like scalars, i.e., $s_{\nu i}$ and $x_{ijk}$. The index $:$ denotes all the entries in the given dimension. For example, $s_{\nu :}$ is the $\nu$th row of the S matrix and $\mathcal{X}_{i::}$ is the $i$th slice of the tensor in the first dimension.

The index parameters are also fixed for clarity. The list of indexes with their ranges and semantic descriptions is given in Table 1. For example, the index i always represents an original flow length, while ν represents a sampled flow length. The range of ν starts from 0, since all of the packets of a flow may be discarded during the sampling process, yielding a zero-length sampled flow, which is never observed.

3.2. Data Tensor

The original flow length data is represented as an $I \times J \times K$ tensor $\mathcal{X}$, with individual elements $x_{ijk}$ regarded as the number of flows of length $i$ measured at hour $j$ of day $k$. In this setup, I is the maximum flow length value, J is the number of hours in a day, and K is the number of days in a week. For our real-world data, collected continuously for 1 week, these values are $J = 24$ and $K = 7$, with a maximum observed flow length I on the order of $2 \times 10^{6}$ packets.

Working with a large maximum flow size is not feasible for two reasons. First, learning a mixture model where each flow component has 2 million parameters is not a good formulation for this problem. Secondly, 99.9% of the flows in our data have less than 100 packets, which means the tensor will be very sparse for large $i$. We therefore clamp the tensor at a maximum length $I'$, folding all flows of length $I'$ or more into the last entry. The clamping process can be defined as
$$ \tilde{x}_{ijk} = \begin{cases} x_{ijk}, & i < I', \\ \sum_{l \geq I'} x_{ljk}, & i = I', \end{cases} $$
where $\tilde{\mathcal{X}}$ is the clamped tensor. The clamping does not require any change in the model and inference equations given in Section 4. Therefore, for notational clarity, we simply use $\mathcal{X}$ as the generic original data tensor.
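As a concrete illustration, clamping only folds the tail counts of the length histogram into the last bin. Below is a minimal numpy sketch (the function name and shapes are ours, assuming flow lengths are indexed from 1 along the first axis):

import numpy as np

def clamp_tensor(X, I_clamp):
    # X: original tensor of shape (I, J, K); axis 0 holds flow
    # lengths 1..I. Returns a tensor of shape (I_clamp, J, K)
    # whose last bin counts all flows of length >= I_clamp.
    X_clamped = X[:I_clamp].copy()
    X_clamped[-1] += X[I_clamp:].sum(axis=0)
    return X_clamped

# Example: clamp a week-long tensor at 100 packets.
X = np.random.poisson(1.0, size=(2000, 24, 7))
X_100 = clamp_tensor(X, 100)   # shape (100, 24, 7)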

Figure 1 shows the vertical slices of our unsampled real-world data tensor $\mathcal{X}$, collected at the backbone of a mobile operator during a one-week period, from Monday to Sunday, and clamped at $I' = 100$. The intensity images are generated from the logarithm of the flow length counts. The daily and hourly patterns are easily recognizable in the original FLD data.

3.3. Sampling Methods

Independent of the sampling method, we can define an $(I+1) \times I$ sampling matrix S, where I is the maximum flow length, with entries $s_{\nu i}$ interpreted as the probability of sampling ν packets from an original flow of length i. Naturally, S is triangular, with $s_{\nu i} = 0$ whenever $\nu > i$, and its $i$th column defines a probability distribution over the sampled flow length of a flow of size i. Given a flow size distribution $x \in \mathbb{Z}_{+}^{I}$, where $\mathbb{Z}_{+}$ is the set of nonnegative integers, the expected sampled flow length distribution is given by $y = Sx$. It immediately follows that the sampled flow size distribution y has length $I+1$, with $y_{0}$ being the proportion of flows with none of their packets sampled. During the sampling process, this value is naturally never observed, since flow identification is performed only on selected packets. In all experiments throughout this paper, the y vector (or tensor $\mathcal{Y}$, which will be described later on) is element-wise multiplied with a binary mask vector m (or binary mask tensor $\mathcal{M}$), whose entries are set to 1 except the ones corresponding to zero-sampled flow lengths, in order to simulate the real-life scenario.

For any given sampling method, we can calculate the S matrix directly if a closed-form expression is available. Otherwise, it can be approximated by simulating the sampling process and counting the sampling statistics. In this paper, both uniform random sampling and ANLS provide closed-form expressions for the calculation of the S matrix.
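When a closed form is unavailable, the S matrix can be approximated exactly as described, by simulating the sampler many times for each flow length. A sketch, where sample_flow is a hypothetical stand-in for the sampling process under study:

import numpy as np

def estimate_S(sample_flow, I, n_trials=100000, rng=None):
    # S[nu, i-1] approximates the probability that a flow of
    # original length i yields nu sampled packets (nu = 0..I).
    rng = np.random.default_rng() if rng is None else rng
    S = np.zeros((I + 1, I))
    for i in range(1, I + 1):
        for _ in range(n_trials):
            S[sample_flow(i, rng), i - 1] += 1
    return S / n_trials

# Example: Monte Carlo check of uniform sampling with pi = 0.1.
S_mc = estimate_S(lambda i, rng: rng.binomial(i, 0.1), I=20)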

An important practical issue is that, if the original tensor is clamped at $I'$, the S matrix must also be clamped. In that case, the last column entries $s_{\nu I'}$ must represent the probability of selecting ν packets from a flow of length greater than or equal to $I'$. This clamping operation can be done by calculating the full-size S matrix first and setting $s_{\nu I'}$ to the average of the columns $i \geq I'$, under the naive assumption that beyond $I'$ the original flow sizes are uniformly distributed.

3.3.1. Uniform Sampling Method

In uniform sampling, each packet is processed with a fixed probability π, irrespective of the flow it belongs to. In this method, the sampling matrix entries are calculated directly through the Binomial distribution with i trials and success probability π, i.e.,
$$ s_{\nu i} = \binom{i}{\nu}\, \pi^{\nu} (1 - \pi)^{i - \nu}. $$
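Since the entries are plain binomial probabilities, the whole matrix can be built in one vectorized call; a sketch using scipy, where column i−1 corresponds to original length i:

import numpy as np
from scipy.stats import binom

def uniform_S(I, pi):
    nu = np.arange(I + 1)[:, None]    # sampled lengths 0..I (rows)
    i = np.arange(1, I + 1)[None, :]  # original lengths 1..I (columns)
    return binom.pmf(nu, i, pi)       # zero whenever nu > i

S = uniform_S(100, 1.0 / 64)          # each column sums to 1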

Algorithm 1 describes how the flow table is updated with uniform sampling upon the arrival of a new packet. The algorithm uniformly draws a random number in the interval $(0, 1)$, and if it is less than π, the packet is processed; otherwise, the packet is discarded. To process the packet, a look-up operation is performed on the flow table to find and update the flow that the packet belongs to. If no such flow is found, a new flow is created using the packet's signature.

(1)function SampleUniformlyRandom (π, flow_table, packet)
(2)if π > rand_double (0, 1) then
(3)  flow = flow_table·look-up (packet)
(4)  if flow is null then
(5)   flow ⟵ new Flow (packet)
(6)  else
(7)   flow·length += 1
(8)  flow_table·insert_or_update (flow)

Figure 2(a) shows the top entries of the triangular S matrices for different sampling probabilities. As the sampling probability π gets smaller, fewer packets from a flow get observed, and the flow may even be missed entirely when none of its packets are observed. The rightmost sampling matrix shows the case with the smallest π, where the matrix has a very high concentration of zero-length sampled flows.

Figure 2(b) shows the original Monday data (the leftmost matrix) and its sampled versions under uniform sampling with the probabilities shown in the sampling matrices above. Here, we see that, at the smallest sampling probabilities, the observed flow lengths are mostly less than 10, while the majority of flows are not observed at all.

3.3.2. ANLS Sampling Method

The ANLS [10] will be used as the representative of the flow-based adaptive sampling methods. In ANLS, each packet is sampled according to the number of packets previously sampled from its corresponding flow. If a sampled flow has length x, the probability $p(x)$ of its next packet being sampled is calculated as
$$ p(x) = \frac{1}{f(x+1) - f(x)}, $$
where
$$ f(x) = \frac{(1+u)^{x} - 1}{u}. $$

Here, $f(x)$ is a monotonically increasing function of the flow length x, parametrized by u, which makes $p(x)$ monotonically decreasing. The parameter u determines the tendency of sampling packets: as u gets smaller, more packets are sampled and estimating the original flow lengths gets easier.

The ANLS method is described in detail in Algorithm 2. Prior to the sampling, f and p are calculated in the SetupANLS function according to equations (4) and (5). During sampling, a look-up operation is performed unconditionally for each incoming packet. If the corresponding flow is found, it is updated with a probability determined by its current observed size. Otherwise, a new flow is created, ensuring that every flow is observed with at least one packet.

(1)function SetupANLS (u)
(2)f[0] = 0
(3)for x = 1 to I do
(4)  f[x] = ((1 + u)^x − 1)/u
(5)  p[x − 1] = 1/(f[x] − f[x − 1])
(6)return f, p
(7)function SampleWithANLS (p, flow_table, packet)
(8) flow = flow_table·look-up (packet)
(9)if flow is null then
(10)  flow ⟵ new Flow (packet)
(11)else if p[flow·length] ≥ rand_double (0, 1) then
(12)  flow·length += 1
(13) flow_table·insert_or_update (flow)

We calculate the sampling matrix S recursively for ANLS. In this method, the first packet is always sampled, since $p(0) = 1$ independent of u. We start by assigning all zero-sampled probabilities as $s_{0i} = 0$ for $i \geq 1$ and setting $s_{11} = 1$. The recursive equation for calculating the sampling matrix can be deduced as
$$ s_{\nu i} = s_{\nu, i-1}\, \bigl(1 - p(\nu)\bigr) + s_{\nu-1, i-1}\, p(\nu - 1). $$
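A numpy sketch of this recursion follows (the naming is ours; note that with f as above, $p(x)$ simplifies to $(1+u)^{-x}$):

import numpy as np

def anls_S(I, u):
    p = (1.0 + u) ** (-np.arange(I + 1, dtype=float))  # p[0] = 1
    S = np.zeros((I + 1, I + 1))  # rows: sampled length nu, columns: original length i
    S[1, 1] = 1.0                 # the first packet is always sampled
    for i in range(2, I + 1):
        for nu in range(1, i + 1):
            S[nu, i] = S[nu, i - 1] * (1.0 - p[nu]) + S[nu - 1, i - 1] * p[nu - 1]
    return S[:, 1:]               # drop the unused column for i = 0

S = anls_S(100, u=0.1)            # each column sums to 1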

Figure 3 shows the ANLS sampling matrices for the first 10 flow lengths and the Monday data sampled with them, analogously to Figure 2. First, we can see that when u is small, the S matrix looks like the identity, and as u gets larger, the sampling probability of large flows decreases. Secondly, the ANLS method operates at much higher sampling ratios than uniform sampling. However, operating at such high sampling ratios in real time would require specialized hardware.

4. Methodology

Our methodology is based on the nonnegative factorization of the data tensor. Our model, which we call ThinNTF, introduces the sampling matrix as a constant factor into the original NTF with the Poisson–Gamma observation model. The rationale for using factorization to recover true flow sizes is that the flow size distributions have daily periodic behavior, as we show in Section 3. Inferring the factors, instead of the original tensor, requires estimating fewer parameters and results in smoother estimates compared to the standard maximum likelihood estimation described in [3]. In this section, we first describe the original tensor factorization model and then provide our novel version, the ThinNTF. At the end of the section, we present the fully Bayesian variational Bayes algorithm for inference.

4.1. Nonnegative Tensor Factorization

NTF is the generalization of the 2-dimensional NMF model to multiple dimensions. In NTF, an N-dimensional tensor is approximated by the product of lower dimensional factors. Unlike NMF, tensor factorization can be done in multiple ways. In this work, we use the PARAFAC [22–24] factorization scheme. In PARAFAC, an $I_1 \times I_2 \times \cdots \times I_N$ tensor is approximated by $I_n \times R$ matrices for $n = 1, \ldots, N$. Here, R is the number of components, i.e., the number of clusters in the data. Figure 4 shows the PARAFAC factorization of our FLD tensor $\mathcal{X}$ into 3 factors: an $I \times R$ factor F representing the flow length clusters, a $J \times R$ factor H representing the hourly behavior, and a $K \times R$ factor D representing the daily behavior of the data. Every single entry of the tensor is approximated by
$$ x_{ijk} \approx \sum_{r=1}^{R} f_{ir}\, h_{jr}\, d_{kr}. $$
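In code, this reconstruction is a single tensor contraction; a numpy sketch with illustrative sizes:

import numpy as np

I, J, K, R = 100, 24, 7, 4
F = np.random.rand(I, R)   # flow length components
H = np.random.rand(J, R)   # hour-of-day coefficients
D = np.random.rand(K, R)   # day-of-week coefficients

# x_hat[i,j,k] = sum_r F[i,r] * H[j,r] * D[k,r]
X_hat = np.einsum('ir,jr,kr->ijk', F, H, D)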

Bro [25] explains that the PARAFAC factorization is unique under certain circumstances, where uniqueness is defined as being unable to rotate the factors without loss of fit. NMF and NTF are statistical models that impose a nonnegativity constraint without the uniqueness property. The uniqueness may be important if the individual factors are of special interest. In our case, we are concerned with the estimation of the original data tensor $\mathcal{X}$ from the sampled tensor $\mathcal{Y}$, but not with the individual factors or their interpretation. Our problem is closer to a missing value imputation problem; hence, uniqueness is not a requirement.

4.2. Thin Nonnegative Tensor Factorization

ThinNTF is basically an NTF with an additional constant factor, which in our case is the sampling matrix S. Figure 5 shows the graphical models of the NTF and the ThinNTF models for factorizing original and sampled flow length observations. In the graphical models, the shaded nodes are the observed entities and the unshaded ones are the latent entities.

In Section 3.3, we described the sampling process as a matrix multiplication with a sampling matrix S. In ThinNTF, this sampling matrix operates on the original tensor $\mathcal{X}$ and creates a thinned version of it, which we call $\mathcal{Y}$, by down-sampling its entries according to a sampling scheme, as shown in Figure 6. The entries $y_{\nu jk}$ of the tensor $\mathcal{Y}$ represent the number of flows of sampled length ν at hour j of day k. The operation $\mathcal{Y} = S \times_1 \mathcal{X}$ denotes the 1-mode product of the matrix S and the tensor $\mathcal{X}$, which corresponds to the set of matrix multiplications $Y_{::k} = S\, X_{::k}$ for $k = 1, \ldots, K$.

In this scheme, one can immediately suspect that $\mathcal{X}$ can be estimated by $S^{-1} \times_1 \mathcal{Y}$. However, this solution is not feasible for several reasons. First, the S matrix is not square and hence not invertible. Its pseudoinverse can be calculated instead, but this does not impose nonnegativity. Moreover, the top slice of the tensor, which stores the number of flows with zero sampled size, is never observed and hence must be estimated. Therefore, we need a solid statistical model and an inference method to estimate $\mathcal{X}$ under this model.

In ThinNTF, we observe the tensor $\mathcal{Y}$ but try to factorize the tensor $\mathcal{X}$, which is latent (Figure 5(b)). In the end, the factors of $\mathcal{X}$ provide us an approximation that solves the original flow length distribution reconstruction problem. We mathematically express this approximation as
$$ y_{\nu jk} \approx \sum_{i=1}^{I} \sum_{r=1}^{R} s_{\nu i}\, f_{ir}\, h_{jr}\, d_{kr}, $$
where F, H, and D are described in exactly the same way as in the original NTF case. In Sections 3.3.1 and 3.3.2, we described two different S matrices for two different schemes. The ThinNTF model can be employed with any sampling method, as long as it is described by a sampling matrix.
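The thinned forward model only adds one more factor to the PARAFAC contraction sketched above; reusing the illustrative F, H, D and a sampling matrix S of shape (I + 1) × I:

# y_hat[v,j,k] = sum_{i,r} S[v,i] * F[i,r] * H[j,r] * D[k,r]
Y_hat = np.einsum('vi,ir,jr,kr->vjk', S, F, H, D)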

4.3. Generative Model

Taking the Bayesian approach, we first provide a generative model for the ThinNTF and then describe how we can estimate the posterior probabilities of the model parameters (in this case, the factor matrices) conditioned on the sampled flow length observations $\mathcal{Y}$ and the sampling matrix S using the well-known Bayes rule. Table 2 contains all tensors and matrices used in the model together with their index sets.

The original and latent data tensor $\mathcal{X}$ and the sampled and observed data tensor $\mathcal{Y}$ have nonnegative integer entries. The natural probability distribution for this type of count data is the Poisson distribution. We assume that each entry of a latent 5-dimensional tensor $\mathcal{Z}$ is drawn from a Poisson distribution whose parameter is a function of the sampling matrix S and the factors F, H, and D, such as
$$ z_{\nu ijkr} \sim \mathcal{PO}\left(z;\ s_{\nu i}\, f_{ir}\, h_{jr}\, d_{kr}\right), $$
where the Poisson distribution is defined as
$$ \mathcal{PO}(z; \lambda) = e^{-\lambda}\, \frac{\lambda^{z}}{z!}. $$

We choose the prior distributions of the factor entries as the Gamma distribution, since it is the conjugate prior of the Poisson distribution [26]. For each entry of factor F, we write
$$ f_{ir} \sim \mathcal{G}(f;\ \kappa, \theta), $$
where the Gamma distribution is described as
$$ \mathcal{G}(f; \kappa, \theta) = \frac{f^{\kappa - 1}\, e^{-f/\theta}}{\Gamma(\kappa)\, \theta^{\kappa}}, $$
with shape parameter κ and scale parameter θ. In our generative model, the parameters for the Gamma distributions are $\kappa = a$ and $\theta = b/a$, respectively. This means that the mean of $f_{ir}$ is $\kappa\theta = b$, which is independent of $a$. The variance of $f_{ir}$ becomes $\kappa\theta^{2} = b^{2}/a$, which means that as $a$ gets smaller, the factors get sparser.

In order to avoid repetition, we are going to omit the equations regarding the factors H and D throughout the paper. These factors behave exactly like factor F, and it is easy to derive equations related to these factors once their corresponding equation for F is given.

Finally, we generate the $\mathcal{X}$ and $\mathcal{Y}$ tensors from $\mathcal{Z}$. Each entry $z_{\nu ijkr}$ of $\mathcal{Z}$ can be interpreted as the number of original flows of length i, generated at hour j of day k by cluster r, and observed with length ν. By summing over the cluster and original length dimensions, $y_{\nu jk} = \sum_{i,r} z_{\nu ijkr}$, we get the sampled observation tensor $\mathcal{Y}$. Similarly, by summing over the cluster and sampled length dimensions, $x_{ijk} = \sum_{\nu,r} z_{\nu ijkr}$, we get the original flow length tensor $\mathcal{X}$. The whole generative process is summarized in Algorithm 3. The sets of all indexes and tensors in the model are summarized in Tables 1 and 2, respectively.

(1)function RandInit ()
  //Sample factor F from Gamma (a, b/a)
(2)  for all (i, r) do
(3)   f_ir ∼ Gamma (a, b/a)
  //Sample factor H from Gamma (a, b/a)
(4)  for all (j, r) do
(5)   h_jr ∼ Gamma (a, b/a)
  //Sample factor D from Gamma (a, b/a)
(6)  for all (k, r) do
(7)   d_kr ∼ Gamma (a, b/a)
  //Sample latent tensor Z from Poisson distributions
(8)  for all (ν, i, j, k, r) do
(9)   z_νijkr ∼ Poisson (s_νi f_ir h_jr d_kr)
(10)  return {F, H, D, Z}
(11)function GenerateData ()
  //Randomly initialize factors and latent tensor
(12) {F, H, D, Z} ⟵ RandInit ()
  //Generate original tensor X
(13)for all (i, j, k) do
(14)  x_ijk = Σ_ν Σ_r z_νijkr
  //Generate sampled tensor Y
(15)for all (ν, j, k) do
(16)  y_νjk = Σ_i Σ_r z_νijkr
(17)return {F, H, D, Z, X, Y}
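A runnable numpy version of this generative process might look as follows (a sketch under our notation; a and b are the Gamma shape and mean hyperparameters from Section 4.3):

import numpy as np

def generate_data(S, J, K, R, a=1.0, b=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    V, I = S.shape                        # sampled lengths 0..I, original lengths 1..I
    F = rng.gamma(a, b / a, size=(I, R))  # flow length factor
    H = rng.gamma(a, b / a, size=(J, R))  # hourly factor
    D = rng.gamma(a, b / a, size=(K, R))  # daily factor
    rate = np.einsum('vi,ir,jr,kr->vijkr', S, F, H, D)
    Z = rng.poisson(rate)                 # latent 5-D count tensor
    X = Z.sum(axis=(0, 4))                # original tensor: sum over nu and r
    Y = Z.sum(axis=(1, 4))                # sampled tensor: sum over i and r
    return F, H, D, Z, X, Y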
4.4. Variational Bayes

After defining the generative model, we can infer the factors F, H, and D from a sampled flow length observation tensor $\mathcal{Y}$. In the original NMF paper, Lee and Seung [17] provide fixed-point update equations for inferring the factors. Bro [25] gives similar fixed-point equations for updating the factors in the PARAFAC factorization. Cemgil [19] shows that these updates correspond to the Kullback–Leibler minimization between the original matrix (or tensor) and the approximated one and also provides a fully Bayesian variational algorithm for matrix factorization. Ermis et al. [15] provide a similar variational algorithm for the Gamma–Poisson tensor factorization.

We start our Bayesian inference by writing the posterior distributions over the factors F, H, and D conditioned on the observed tensor $\mathcal{Y}$. For notational clarity, we introduce θ to denote the list of model hyperparameters. The log-likelihood of observing $\mathcal{Y}$ under the model parameters θ is written as
$$ \log p(\mathcal{Y} \mid \theta) = \log \int p(\mathcal{Y}, \mathcal{Z}, F, H, D \mid \theta)\; \mathrm{d}\{\mathcal{Z}, F, H, D\}. $$

This log-likelihood is intractable due to the integration over the latent factors, but it is lower bounded as
$$ \log p(\mathcal{Y} \mid \theta) \geq \big\langle \log p(\mathcal{Y}, \mathcal{Z}, F, H, D \mid \theta) \big\rangle_{q} - \big\langle \log q \big\rangle_{q}, $$
where q is an auxiliary joint distribution over the latent quantities. This bound is tight when q equals the exact posterior $p(\mathcal{Z}, F, H, D \mid \mathcal{Y}, \theta)$. However, this posterior is also intractable to calculate. Instead, we use a variational approximation [27] for q such that
$$ q = q(\mathcal{Z})\, q(F)\, q(H)\, q(D), $$
where we iteratively update the posterior distribution of each factor by calculating the expectation of the logarithm of the full joint likelihood under the posteriors of all other latent factors.

4.5. Update Equations

Here, we provide the update equations for $q(\mathcal{Z})$ and $q(F)$. The updates of $q(H)$ and $q(D)$ can be easily deduced from the update equations of $q(F)$. The full joint likelihood whose expectation will be calculated at each step is
$$ p(\mathcal{Y}, \mathcal{Z}, F, H, D \mid \theta) = p(\mathcal{Y} \mid \mathcal{Z})\, p(\mathcal{Z} \mid F, H, D)\, p(F)\, p(H)\, p(D), $$
where $p(\mathcal{Y} \mid \mathcal{Z})$ is a degenerate distribution ($y_{\nu jk} = \sum_{i,r} z_{\nu ijkr}$ with probability 1) to make sure the summation of $\mathcal{Z}$ equals $\mathcal{Y}$. By inserting the necessary Poisson and Gamma distributions given in the generative model into equation (16), we get the log-joint expression
$$ \log p = \sum_{\nu ijkr} \left( z_{\nu ijkr} \log\left( s_{\nu i} f_{ir} h_{jr} d_{kr} \right) - s_{\nu i} f_{ir} h_{jr} d_{kr} - \log z_{\nu ijkr}! \right) + \sum_{ir} \left( (a - 1) \log f_{ir} - \frac{a}{b} f_{ir} \right) + \cdots, $$
where the omitted terms are the analogous Gamma terms for H and D and constants.

4.5.1. Update Rule for $q(\mathcal{Z})$

Considering the terms in the log-likelihood expression in equation (17) that only include $z_{\nu ijkr}$, we find that
$$ q(z_{\nu : jk :}) = \mathcal{M}\left(z;\ y_{\nu jk},\ p_{\nu ijkr}\right), \qquad p_{\nu ijkr} = \frac{s_{\nu i} \exp\left( \langle \log f_{ir} \rangle + \langle \log h_{jr} \rangle + \langle \log d_{kr} \rangle \right)}{\sum_{i', r'} s_{\nu i'} \exp\left( \langle \log f_{i'r'} \rangle + \langle \log h_{jr'} \rangle + \langle \log d_{kr'} \rangle \right)}, $$
where $q(\mathcal{Z})$ becomes multinomial distributed. The expectation of $z_{\nu ijkr}$ is calculated as
$$ \langle z_{\nu ijkr} \rangle = y_{\nu jk}\, p_{\nu ijkr}. $$

4.5.2. Update Rule for $q(F)$

Similarly, considering the terms in the log-likelihood equation (17) that only include $f_{ir}$, we find that
$$ q(f_{ir}) = \mathcal{G}\left(f;\ \alpha^{F}_{ir},\ \beta^{F}_{ir}\right), $$
where $q(f_{ir})$ becomes Gamma distributed with shape and scale parameters
$$ \alpha^{F}_{ir} = a + \sum_{\nu, j, k} \langle z_{\nu ijkr} \rangle, \qquad \beta^{F}_{ir} = \left( \frac{a}{b} + \sum_{\nu, j, k} m_{\nu jk}\, s_{\nu i}\, \langle h_{jr} \rangle \langle d_{kr} \rangle \right)^{-1}, $$
where $m_{\nu jk}$ is the binary mask of Section 3.3 that zeroes out the unobserved zero-length row.

We calculate the expectation of $f_{ir}$ and the expectation of its logarithm as
$$ \langle f_{ir} \rangle = \alpha^{F}_{ir} \beta^{F}_{ir}, \qquad \langle \log f_{ir} \rangle = \psi\left(\alpha^{F}_{ir}\right) + \log \beta^{F}_{ir}, $$
where ψ denotes the digamma function.

The variational Bayes algorithm that uses the above equations is given in Algorithm 4. The calculation of the lower bound is given in the Appendix. The exact derivations of all equations can be found in [28].

(1)function ThinNTF_VB (Y, S, a, b, R)
  //Randomly initialize factors and latent tensor
(2) {F, H, D, Z} ⟵ RandInit ()
(3)repeat
(4)  Calculate α^F and β^F as in equations (21)–(23).
(5)  Calculate α^H and β^H similarly.
(6)  Calculate α^D and β^D similarly.
(7)  Calculate ⟨F⟩ and ⟨log F⟩ as in equation (24).
(8)  Calculate ⟨H⟩ and ⟨log H⟩ similarly.
(9)  Calculate ⟨D⟩ and ⟨log D⟩ similarly.
(10)  Calculate ⟨Z⟩ as in equation (19).
(11)  Calculate the lower bound.
(12)until max iterations are reached or the lower bound converged
(13)return F, H, D, ⟨Z⟩
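For concreteness, here is a compact numpy sketch of one variational iteration under our notation (all names are ours, not a reference implementation): Y is the observed (I + 1) × J × K count tensor, M is the binary mask that zeroes out the unobserved zero-length row, and the einsum contractions avoid ever materializing the 5-dimensional ⟨Z⟩.

import numpy as np
from scipy.special import digamma

def vb_iteration(Y, M, S, alphaF, betaF, alphaH, betaH, alphaD, betaD, a, b):
    # Expectations under the current Gamma posteriors.
    EF, ELF = alphaF * betaF, np.exp(digamma(alphaF)) * betaF
    EH, ELH = alphaH * betaH, np.exp(digamma(alphaH)) * betaH
    ED, ELD = alphaD * betaD, np.exp(digamma(alphaD)) * betaD
    # Multinomial step: ratio[v,j,k] = y / sum_{i,r} s * exp<log f h d>
    Lhat = np.einsum('vi,ir,jr,kr->vjk', S, ELF, ELH, ELD)
    ratio = M * Y / np.maximum(Lhat, 1e-30)
    # Gamma update for F (H and D are updated analogously).
    alphaF = a + ELF * np.einsum('vi,vjk,jr,kr->ir', S, ratio, ELH, ELD)
    betaF = 1.0 / (a / b + np.einsum('vi,vjk,jr,kr->ir', S, M, EH, ED))
    return alphaF, betaF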
4.6. Computational Complexity

The nonnegative tensor factorization problem is NP-hard [29]. The variational Bayes algorithm introduced in Algorithm 4 is an iterative solution that converges to a local maximum. The complexity of each iteration is determined by the leading term, which is equation (19), i.e., the expectation of the latent tensor $\mathcal{Z}$. In general, fitting a ThinNTF model with R components to a κ-dimensional tensor with all dimensions of length N has complexity $O(RN^{\kappa+1})$ per iteration, one factor of N more than ordinary NTF due to the extra sampled-length dimension.

5. Data Collection

Applying a tensor model to the flow length estimation problem requires high-volume data collected over a long period, so as to capture the temporal behavior of the network. Publicly available data sets do not fulfill this requirement. Therefore, we collected our own real-world data from a mobile network service provider in Turkey [30]. The architecture of our system and a description of the data we collected are presented below.

5.1. System Architecture

The system architecture of a mobile operator's general packet radio service (GPRS) network infrastructure, including radio access and core network elements, is illustrated in Figure 7. IP traffic generated or received by mobile devices between a mobile station (MS) and a packet data network (PDN), e.g., the IP Multimedia Subsystem (IMS), is tunneled in the core network of mobile operators through the serving GPRS support node (SGSN) and the gateway GPRS support node (GGSN) via the user data part of the GPRS tunneling protocol (GTP) [31]. The GTP message exchanges include information such as the size of the traffic, IP session start and end times, and device and user identifiers.

The Gn interface (Gn is an interface between SGSN and GGSN where GTP is the main protocol for network packets flowing through) carries user packets to be transferred between the mobile users and the Internet together with control packets necessary for the universal mobile telecommunications service (UMTS) core network [32]. All packets in this channel are carried by the GTP, which is an IP-based protocol for carrying GPRS data within UMTS networks, used for data encapsulation in order to keep the core network independent of the protocols that are being used between MS and the packet-switched network.

The Gn interface carries mainly two types of GTP message structures or packets: GTP-C and GTP-U. GTP-C is used for signaling between the SGSN and GGSN in the core network; it carries packet data protocol (PDP) context messages for activating and deactivating user sessions, configuring service parameters, or updating the session. GTP-U is used for transmitting user data between the radio access network and the core network. In our experiments, we filtered out the GTP-C packets (they are not considered part of any flow by the flow definition), which make up 10% of the total Gn traffic. Therefore, the sampling is applied to GTP-U packets only. GTP is carried mainly over UDP.

5.2. Data Extraction Process

The mobile operator network consists of several districts with more than 10 regional core areas throughout Turkey. The average total traffic over all regional areas consists of more than 15 billion packets in the uplink direction and more than 20 billion packets in the downlink direction daily. This corresponds to approximately 80 terabytes of total data flowing uplink and downlink daily inside the mobile operator's core network as a whole. In this work, the Gn interface, which connects the SGSN and GGSN nodes, is mirrored, and the network traffic is transferred to an FLD server located at the mobile operator's technology center in the core network. A speed of 200 Mbit/s at peak hours can be observed through one of the mirrored interfaces in the core network.

We monitored the network traffic in one of the servers of a mobile operator continuously for 10 days. We developed a packet extraction tool inside the monitoring server shown in Figure 7, which parses each GTP-U packet and stores the packet signature together with packet length and arrival time, discarding their payload. The stored network data is processed offline for extracting the true flow lengths.

After the data collection and flow extraction, the total number of packets collected is found to be , which makes up around flows. Figure 8 shows the cumulative flow length distribution of the data. We see that most of the flows have less than 5 packets, and 99.9% of them have less than 100 packets.

6. Experiments and Results

We designed two sets of experiments to verify our model: synthetic and real-world experiments. In each set, we sampled the original data with both the uniform and ANLS models using different sampling parameters. We then tried to recover the original tensor with ThinNTF models. The ThinNTF model takes a single parameter R, the number of components in each factor. Additionally, we also represented the data as a matrix by unfolding the tensor along the first dimension, as described in [28], and applied the 2-dimensional version of ThinNTF, which we simply call thin nonnegative matrix factorization (ThinNMF). For uniform sampling, we used the maximum likelihood estimation (MLE) defined in [3] as the baseline. For ANLS, we used both MLE and the model's own unbiased estimator as baselines.

Both the ThinNMF and ThinNTF models explain the data as a linear combination of R flow length distributions, stored in the columns of the F matrix. In the ThinNMF model, we have $J \cdot K$ independent coefficients per component for this combination. In ThinNTF, on the other hand, we have J coefficients for hour-of-day and K coefficients for day-of-week, whose Cartesian product makes a total of $J \cdot K$ coupled coefficients and creates a dependency between the hour and day attributes. Therefore, we expect ThinNTF to capture the periodicity and give better estimates.

During the experiments, we always run the stochastic algorithms, i.e., ThinNMF, ThinNTF, and MLE, 10 times and keep the parameters of the model with the highest lower bound value. We then report the success of our algorithm with the weighted mean relative distance (WMRD) metric, as it was used in all previous flow size estimation works. The WMRD gives more weight to the relative differences that occur with larger frequency. It is formulated as
$$ \mathrm{WMRD} = \frac{\sum_{i} |x_i - \hat{x}_i|}{\sum_{i} (x_i + \hat{x}_i)/2}, $$
where x is an original flow size distribution measured at the end of the hour and $\hat{x}$ is its estimated version. For the whole tensor $\mathcal{X}$, we calculate the average WMRD value over the hourly slices.

Additionally, we report the Kullback–Leibler (KL) divergence between the original and the estimated tensors, since this is the metric minimized by the variational Bayes algorithm. The (generalized) KL divergence between two nonnegative vectors x and $\hat{x}$ is calculated as
$$ \mathrm{KL}(x \,\|\, \hat{x}) = \sum_{i} \left( x_i \log \frac{x_i}{\hat{x}_i} - x_i + \hat{x}_i \right). $$
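Both metrics are one-liners in numpy; a sketch of how they might be computed per hourly slice:

import numpy as np

def wmrd(x, x_hat):
    # Weighted mean relative distance between two histograms.
    return np.abs(x - x_hat).sum() / ((x + x_hat) / 2.0).sum()

def gen_kl(x, x_hat, eps=1e-30):
    # Generalized KL divergence for nonnegative vectors.
    x, x_hat = np.maximum(x, eps), np.maximum(x_hat, eps)
    return (x * np.log(x / x_hat) - x + x_hat).sum()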

6.1. Experiments on Synthetic Data

We prepared our synthetic experiments to test the validity of our models. In this experiment set, we used the generative model of ThinNTF, as described in Algorithm 3, to generate a small network data set with a small maximum flow size. The original synthetic flow length distribution is generated by a generative model with 3 components, where each component is a column of the F factor. We selected these 3 components as exponential, inverted exponential, and uniform random distributions. Accordingly, in the experiments, we used the ThinNMF-R3 and ThinNTF-R3 models, where the suffix R3 indicates that the model has 3 components.

We sampled the synthetic data with the uniform and ANLS sampling methods using different sampling parameters. The sampling was done simply by randomly drawing a sampled size for each flow according to the sampling probabilities in the S matrix. In this way, we ignored the flow splitting problem, which gave us ideal data for the ThinNTF model. We report and compare the mean and standard deviation of the WMRD values for all experiments.

As expected, the ThinNTF model always performed best under uniform sampling, as shown in Table 3. Under ANLS sampling, the MLE and ANLS estimators performed better at high sampling probabilities, i.e., for small u, as shown in Table 4. On the other hand, as the sampling probability of the ANLS model decreases, ThinNMF provided the better estimates. From these initial results, we conclude that the factorization is definitely helpful for the more difficult uniform sampling method and allows lowering the sampling probabilities in flow-based packet sampling. The results are also visualized in Figure 9.

6.2. Experiments on Real-World Data

The original data collected from a mobile network provider, as described in Section 5, are sampled with both sampling methods. However, this time, we simulated the real network offline by feeding the packets one by one to the monitoring server described in Section 5. This way, we were able to recreate the actual conditions of a sampler installed at a network provider's backbone. This also created the flow splitting problem, since we applied a 30-second timeout in our flow hash. We set the maximum flow length as $I' = 100$, meaning that the last entries show the count of flows that have more than 99 packets. This clamping decision was made according to the cumulative distribution of flow lengths shown in Figure 8. We also clamped the sampling matrices S so that they exactly match the model.

Since the number of components in the original flow distribution is unknown, we ran our experiments with different numbers of components R for ThinNMF and ThinNTF. The rest of the experiment is similar to the synthetic one. The sampled tensor $\mathcal{Y}$ is factorized, and $\mathcal{X}$ is reconstructed from the estimated factors. We report and compare the mean and standard deviation of the WMRD and KL values.

The factorization models, both ThinNMF and ThinNTF, helped lower the WMRD score under both the uniform and ANLS sampling methods. The ThinNTF-R4 model consistently gave lower error than the MLE baseline for the uniform model, as shown in Table 5 and Figure 10. Indeed, our factorization framework improved the results overall for uniform sampling. However, since recovering true estimates under uniform sampling is quite difficult, we see less impact of the factorization as the sampling ratio increases.

Figure 10 also gives the KL values between the true and estimated flow length distributions. While the scale of this metric is different, it gives consistent results with the WMRD. This shows that our model, which minimizes the KL metric, also minimizes the commonly used WMRD metric; hence, the model is suitable for this problem.

Another important issue is that for uniform sampling, 3-way factorization is more successful than the 2-way factorization. The periodicity information which is captured by the ThinNTF model helps improve the estimates and makes it a more successful model for this sampling method.

In ANLS, all our factorization models gave lower error values than the MLE and the unbiased estimator of ANLS, as shown in Table 6 and Figure 11. Since ANLS is a more powerful sampling method than uniform sampling, the effect of the framework is slightly smaller for small sampling parameters u. However, both ThinNMF and ThinNTF gave better results while sampling a smaller number of packets (when u is large). Furthermore, since we are trying to recover the same original data in both experiments, we can compare our ThinNMF and ThinNTF models across the two sampling methods. We see that, in both methods, as the number of components increases, the models give lower error rates. However, with the uniform sampling method, the 3-dimensional models give better results, while with ANLS, the 2-dimensional models perform slightly better.

6.3. Effect of Clamping

The choice of where to clamp the data can be driven by multiple factors. First of all, one can set the clamping value according to a flow length of special interest. Otherwise, we would like to choose a small $I'$ so that we deal with a dense tensor and fewer parameters. On the other hand, we would like to set $I'$ as high as possible so that the clamped portion of the data is as small as possible.

We ran the best algorithms found in the previous section for the uniform and ANLS sampling methods over a range of clamping values. The WMRD values are given in Figure 12. In both methods, the smallest clamping value gave relatively poor performance, while $I' = 100$ was generally the best choice, and the results for the larger clamping values are close to each other. This is consistent with the graph in Figure 8, where the cumulative flow lengths do not change much after 100 packets. A final remark from this experiment is that as the clamping value increases, estimation becomes harder at small sampling rates. This explains the results for uniform sampling with sampling rate 1/64.

7. Conclusions

In this work, we introduced a novel nonnegative tensor factorization model called ThinNTF, which extends the classic nonnegative tensor factorization with an additional constant factor that can represent a network packet sampling method. We showed that this model can be employed to improve the current reconstruction algorithms in recovering the original flow length distributions.

We tested our model with two different types of sampling methods: the uniform packet sampling method and a flow-based packet sampling method, called ANLS. We described how to use these methods by showing how to build their sampling matrices.

In order to test our model, we collected high-volume data from a mobile network provider over a long period, so as to observe the periodic behavior of the flow length distribution. In experiments on synthetic and real-world data, our models gave promising results by lowering the estimation errors compared to the baselines of each sampling method. We conclude that our model can be used to decrease estimation errors or to decrease the sampling probabilities without increasing the estimation error.

An important issue left as future work is the online execution of the ThinNTF model. Theoretically, the ThinNTF model can be used online once sufficient data from the target network is collected and the flow length distribution components, i.e., the F factor, are inferred. The power of our model is that this inference can be done directly from the sampled observations. Once the F factor is estimated, for each incoming observation, the corresponding entries in the other factors can be inferred by keeping F constant during the inference. Moreover, F can be updated periodically, say weekly, in a sliding window fashion and kept up to date with the network's flow length behavior.

Appendix

Variational Lower Bound Calculation

The calculation of the lower bound involves a few arithmetic tricks; we refer the reader to the Bayesian nonnegative matrix factorization tutorial in [28] for the detailed derivation and coding tricks. The final form of the lower bound is the expected complete-data log-likelihood plus the entropy of the variational distribution:
$$ \mathcal{L} = \big\langle \log p(\mathcal{Y}, \mathcal{Z}, F, H, D \mid \theta) \big\rangle_{q} - \big\langle \log q \big\rangle_{q}. $$

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.