Abstract

In this paper, we consider a multivariate statistical model of accident frequencies having a variable number of parameters and whose parameters are dependent and subject to box constraints and linear equality constraints. We design a minorization-maximization (MM) algorithm and an accelerated MM algorithm to compute the maximum likelihood estimates of the parameters. We illustrate, through simulations, the performance of our proposed MM algorithm and its accelerated version by comparing them to Newton-Raphson (NR) and quasi-Newton algorithms. The results suggest that the MM algorithm and its accelerated version are better in terms of convergence proportion and, as the number of parameters increases, they are also better in terms of computation time.

1. Introduction

A large majority of the problems encountered in applied statistics (maximum likelihood estimation, least squares, data fitting, machine learning, data analysis, experimental design, clustering, and classification) involve the numerical optimization of a real-valued objective function of a parameter vector of finite dimension. Of all the numerical optimization algorithms developed over the years, the Newton-Raphson (NR) algorithm is the most famous and the most widely used. Starting from an initial guess, it iterates according to the scheme $\theta^{(k+1)} = \theta^{(k)} - [H(\theta^{(k)})]^{-1}\,\nabla f(\theta^{(k)})$, where $\nabla f(\theta^{(k)})$ and $H(\theta^{(k)})$ are, respectively, the gradient vector and the Hessian matrix of the objective $f$ evaluated at the current iterate. The popularity of this algorithm is explained by its fast convergence when the starting guess is chosen close enough to the maximum, which is unknown in practice [1]. Unfortunately, the source of the NR algorithm's popularity is also its first flaw: its success strongly depends on choosing the starting guess in a neighbourhood of the unknown solution. The second major drawback of the NR algorithm, inherited from its mathematical formulation, is that it requires the numerical inversion of the Hessian matrix (a square matrix whose order equals the number of parameters) at each iteration, which can be very tricky if, at a given iteration, this matrix is singular or ill-conditioned. When the NR algorithm is unsuccessful, alternatives may be considered. Among them, we can mention quasi-Newton algorithms [2] (which compute approximations of the Hessian, or of its inverse, at each iteration), block-relaxation algorithms (which divide the parameters into disjoint blocks and proceed to optimization by cycling through these blocks) [3], and derivative-free optimization (DFO) algorithms [4, 5].
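As an illustration of the generic NR scheme above (this is our own minimal sketch, not code from the paper; the names newton_raphson, grad_f, and hess_f are hypothetical), the following R code performs the NR update on a toy concave objective with user-supplied gradient and Hessian:

# Minimal sketch of a generic Newton-Raphson maximizer (illustrative only).
newton_raphson <- function(theta0, grad, hess, tol = 1e-8, max_iter = 100) {
  theta <- theta0
  for (k in seq_len(max_iter)) {
    g <- grad(theta)
    H <- hess(theta)
    # The update requires solving a linear system with the Hessian;
    # this step fails when H is singular or badly ill-conditioned.
    step <- tryCatch(solve(H, g), error = function(e) NULL)
    if (is.null(step)) stop("Hessian is singular or ill-conditioned")
    theta_new <- theta - step
    if (sqrt(sum((theta_new - theta)^2)) < tol) return(theta_new)
    theta <- theta_new
  }
  theta
}

# Toy concave objective f(t) = -(t1 - 1)^2 - 2 * (t2 + 0.5)^2
grad_f <- function(t) c(-2 * (t[1] - 1), -4 * (t[2] + 0.5))
hess_f <- function(t) diag(c(-2, -4))
newton_raphson(c(5, 5), grad_f, hess_f)  # returns approximately (1, -0.5)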

In statistics, the last decades have seen the development and the fast breakthrough of the minorization-maximization (MM) principle for constructing maximization algorithms [6–8], and of the expectation-maximization (EM) algorithm [9], which can be viewed as a special case of the MM algorithm for statistical estimation with incomplete data [6]. Designing an MM algorithm consists in constructing, at the current iterate, a surrogate function that is simpler to maximize than the objective and whose repeated maximization yields the maximization of the objective, and then maximizing that surrogate. Since they do not need any matrix inversion, MM algorithms generally outperform the NR algorithm and are considered relevant for high-dimensional problems and for some discrete multivariate distributions [10].

In this paper, we are interested in one of the most important issues in statistics applied to road safety, namely, the statistical evaluation of road safety measures. We consider a discrete multivariate statistical model coupling the distributions of accidents, classified by level of severity, on the sites where the measure has been implemented (called treated sites) and on the sites where the measure has not been implemented (called control sites). The dimension of the constrained parameter vector of this model grows with the number of treated sites and with the number of accident severity levels. Our main purpose is to design an MM algorithm for the maximum likelihood estimation of this parameter vector. Since MM algorithms may be slow to converge, we also consider the acceleration of our proposed MM algorithm.

The rest of this paper is structured as follows. In Section 2, we describe the model and present the maximum likelihood estimation problem. In Section 3, we devise our new MM algorithm for estimating the parameter vector; afterwards, we apply an acceleration scheme in order to obtain its accelerated version. In Section 4, we present the results of a simulation study comparing our proposed MM algorithm and its accelerated version to NR and quasi-Newton algorithms.

2. Statistical Model and Parameter Estimation

Let us consider a set of geographical sites (hereafter referred to as treated sites) where a road safety measure (maximum speed reduction, installation of roundabouts, and so on) has been implemented for a certain period of time. The accidents occurring on these sites are assumed to be classified by level of severity. Suppose that, in order to avoid confusing the effect of the measure with those of other factors likely to influence the number of crashes, each treated site is paired with another geographical site (hereafter called control site) with the same characteristics (traffic flow, weather conditions, and so on) as the treated site but where the measure has not been implemented. For each treated site, consider the random vector whose components are, for every severity level, the number of crashes of that severity level that occurred on the treated site in the period before and in the period after the implementation of the measure. Also consider the non-random vector whose components are, for each severity level, the ratio of the number of accidents of that severity level in the “after” period to the number of accidents of the same severity level in the “before” period on the corresponding control site.

Different models combining accident frequencies from the treated sites and the control sites have been proposed in order to estimate mainly the average effect of the measure and, secondarily, the accident risks [11–13]. Let us also introduce the classical inner product on the relevant Euclidean space and, for each treated site, the total number of accidents observed at that site. In this paper, we consider the statistical model proposed in [13] under the following assumptions:

(A1) For all , follows the multinomial distribution , where

(A2) , , , and for all

(A3) ,

Model (4) has a parameter vector composed of the mean effect of the measure and, for each treated site and each severity level, of the probability that an accident occurring on that treated site is of that severity level. In this paper, we are interested in estimating this unknown parameter vector.

The log-likelihood of the observed data is given, up to an additive constant, by Equation (5). The maximum likelihood estimate (MLE) of the parameter vector is the solution to the following constrained maximization problem:

In a particular case, there exists a closed-form expression of the MLE [14]. However, in the general case, the optimization problem (6a), (6b), and (6c) does not appear to admit a closed-form solution and therefore calls for numerical optimization. As stated earlier, different iterative algorithms may be used to solve (6a), (6b), and (6c), each with strengths and weaknesses. The minorization-maximization (MM) strategy for constructing optimization algorithms has been shown to yield algorithms (then called MM algorithms) that outperform the classical NR and quasi-Newton algorithms [6, 8, 10]. Moreover, it can handle constraints easily. In problems such as (6a), (6b), and (6c), it consists in defining a special surrogate function and then maximizing the latter rather than the log-likelihood. MM algorithms are considered relevant for high-dimensional problems and for discrete multivariate distributions [10]. In the next section, we explain the MM principle; afterwards, we devise an MM algorithm to solve problem (6a), (6b), and (6c).

3. An MM Algorithm for Computing the MLE

3.1. Brief Reminder of the MM Principle for Constructing Maximization Algorithms

The MM principle for constructing an iterative maximization algorithm consists of two steps carried out at each iteration. Given the current iterate, the first step consists in defining a function that minorizes the log-likelihood at this iterate, and the second step consists in maximizing this surrogate function, rather than the log-likelihood, with respect to the parameter vector; the next iterate is the point at which the surrogate attains its maximum. Before going further, let us recall the definition of a minorizing function [6].

For a given point of the parameter space, a function is said to minorize the log-likelihood at this point provided it nowhere exceeds the log-likelihood (Formula (7)) and coincides with it at this point (Formula (8)).

We have, successively: the log-likelihood at the new iterate is at least the value of the surrogate at the new iterate, by Formula (7); this value is at least the value of the surrogate at the current iterate, since the new iterate solves the surrogate maximization problem, by Equation (21); and this latter value equals the log-likelihood at the current iterate, by Formula (8).

Thus, the log-likelihood never decreases from one iteration to the next and, therefore, a maximization MM algorithm is an ascent algorithm. This ascent property ensures its numerical stability [6].
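As a concrete illustration of the MM principle and of the ascent property (this is our own toy example with a single effect parameter, not the multivariate model of Section 2), consider the simplified log-likelihood l(theta) = sum_i [ z_i * log(theta * c_i) - (x_i + z_i) * log(1 + theta * c_i) ], where x_i and z_i are before/after counts and c_i are control ratios. Bounding the concave logarithm by its tangent line at the current iterate minorizes the second term by a linear function of theta, and maximizing the resulting surrogate gives a closed-form update. The following R sketch (all names are ours) checks the ascent property numerically:

# Toy one-parameter MM illustration (not the model of this paper).
set.seed(1)
x <- rpois(20, lambda = 10)                    # "before" counts on treated sites
c_ratio <- runif(20, 0.8, 1.2)                 # control ratios
z <- rpois(20, lambda = 0.7 * c_ratio * 10)    # "after" counts, true effect 0.7

loglik <- function(theta) sum(z * log(theta * c_ratio) - (x + z) * log(1 + theta * c_ratio))

# Surrogate maximizer obtained from the tangent-line minorization of -log(1 + theta * c).
mm_update <- function(theta) sum(z) / sum((x + z) * c_ratio / (1 + theta * c_ratio))

theta <- 1
for (k in 1:100) {
  theta_new <- mm_update(theta)
  stopifnot(loglik(theta_new) >= loglik(theta) - 1e-12)  # ascent property holds at every step
  if (abs(theta_new - theta) < 1e-10) break
  theta <- theta_new
}
theta  # MM estimate of the effect parameter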

3.2. Design and Maximization of the Minorizing Function

To develop our MM algorithm, we must define and maximize a minorizing function for the log-likelihood. To this end, we use some mathematical inequalities related to convex functions presented in [6]. The following lemma gives the expression of the minorizing function.

Lemma 1. Let a value of the parameter vector be given and consider the real-valued function of the parameter defined by Equation (9), in which the additive constant does not depend on the parameter. Then, this function minorizes the log-likelihood at the given value.

Proof. By convexity of , we know that, for all , , . So, for all , Hence, We can then write and, after summing on the indexes, we get By convexity of and by Equation (10) of [6], we also have Hence, which is equivalent to From (14), (17), and the definition of (see (5)), we can write which means that where is defined by (9). Moreover, we have . We can then conclude that minorizes at .

Figure 1 gives an example of the graphs of the log-likelihood and of its minorizing function in a low-dimensional configuration; because of the equality constraint, both are represented as functions of two free variables.

To finalize the design of our MM algorithm, we need to deal with the constraints (6b) and (6c). In [6], the authors recommend extending the objective function to a new form that takes the inequality constraints into account; the equality constraints are then enforced during the optimization of this extended form. Their method is based on the logarithmic barrier method for handling inequality constraints: for constraints requiring quantities to stay positive, the extension presented in Equation (23) of [6] adds linear combinations of the logarithms of those quantities. In our case, the form of the inequality constraint (6b) implies that this extension would be composed of linear combinations of the logarithms of the parameters, so the log-likelihood defined in Equation (5) already contains the logarithmic barrier: it diverges to negative infinity if any of the parameters tends to zero. Thus, as the authors of [6] themselves note, if the initial point satisfies the inequality constraints (6b), then the presence of these logarithmic terms in the log-likelihood prevents boundary values from occurring at subsequent iterates.

Now, knowing the minorizing function, we just have to maximize it under the equality constraints (6c) to obtain the next iterate.

Theorem 2. Let an estimate of the parameter vector obtained after some steps of our proposed MM algorithm be given. Then, the components of the next iterate are given by Formulas (19) and (20).

Proof. If is the estimate of after steps, then the next iterate denoted by is obtained as under equality constraints (6c). Solving problem (21) is equivalent to looking for the stationary point of the Lagrangian, where is a vector of Lagrange multipliers and is defined by Equation (9). Setting and to zero, we get Multiplying (23a) by and (23b) by , one gets For all , summing over the index in Equation (24b) and noting that and lead to Hence, Combining (24b) and (26) leads to Thus, the solution to (21) satisfies and for all The updates and are obtained by replacing by , by , and by in Equations (28) and (29).

Remark 3. The updates in Formulas (19) and (20) are ideal but impossible to apply in practice since (a) the computation of depends on the unknown and vice versa and (b) in Equation (20), the computation of each depends on and vice versa. To circumvent these difficulties, we can replace by on the right side of Equations (19) and (20).

3.3. The Proposed MM Algorithm

Our proposed MM algorithm is Algorithm 1. It starts from a randomly generated initial point whose components satisfy the box constraints (6b) and the equality constraints (6c). At each iteration, the effect parameter is updated first using Formula (28) and Remark 3; afterwards, the severity probabilities are updated using Formula (29) and Remark 3. This process is repeated until a convergence criterion is satisfied. Since the probability updates use the freshly computed effect parameter, our MM algorithm (Algorithm 1) is a cyclic MM algorithm [6].

Input: , , ... and , ...,
Output: MLE
1 Initialize and where and for all , ;
2 repeat
3 ;
4 For all , update , where for all ,
  
5 Set and
6 until
7 Set .
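To make the cyclic structure of Algorithm 1 explicit, here is a structural R sketch (our own skeleton: the names cyclic_mm, update_theta, update_p, and loglik are hypothetical placeholders, the model-specific update rules are Formulas (19), (20), (28), and (29) of the paper and are not reproduced here, and a log-likelihood-based stopping rule is used purely for illustration):

# Skeleton of a cyclic MM iteration (structure only).
cyclic_mm <- function(theta0, p0, update_theta, update_p, loglik,
                      tol = 1e-8, max_iter = 10000) {
  theta <- theta0
  p <- p0
  ll_old <- loglik(theta, p)
  for (k in seq_len(max_iter)) {
    theta <- update_theta(theta, p)   # update the effect parameter given current probabilities
    p <- update_p(theta, p)           # then update the probabilities given the new effect
    ll_new <- loglik(theta, p)
    converged <- abs(ll_new - ll_old) < tol
    ll_old <- ll_new
    if (converged) break
  }
  list(theta = theta, p = p, loglik = ll_old, iterations = k)
}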
3.4. Acceleration of the MM Algorithm

In many statistical estimation problems, MM algorithms may need a large number of iterations and converge very slowly. Consequently, different acceleration schemes have been developed [15–17].

To the best of our knowledge, the acceleration schemes developed in [16] are among the most popular, and they will also be considered in this paper. The authors of [16] presented a new class of iterative methods, called SQUAREM (squared iterative methods for EM acceleration), for accelerating the expectation-maximization (EM) algorithm and stated that SQUAREM may also be used to accelerate MM algorithms.

Let the MM map of Algorithm 1 be the function from the constrained parameter space to itself that maps the current iterate to the iterate produced by one full cycle of the updates of Algorithm 1.

Knowing the current iterate, SQUAREM computes the next iterate by combining two successive applications of the MM map with a scalar steplength. Varadhan and Roland [16] described three choices for the steplength but, in many numerical experiments (see, for example, [16, 17]), the third choice, given by Formula (34), yields faster convergence. As in [16, 17], the accelerated MM algorithm with steplength (34) (the third choice of steplength proposed by [16]) will be denoted SqS3. The SqS3 algorithm is given hereafter (see Algorithm 2).

Input:, , , ... and , ...,
Output: MLE
1 Initialize and where and for all , ;
2 repeat
3 ;
4 ;
5 ;
6 ;
7 if the extrapolated iterate does not belong to the constrained parameter space then
8  ;
9 end
10 ;
11 until
12 Set .

Note that lines 7 to 9 of Algorithm 2 correct the new iterate by performing a simple step of the non-accelerated MM algorithm whenever the extrapolated point does not belong to the constrained parameter space.
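For illustration, here is a minimal R sketch of one SqS3 step around a generic MM map (our own sketch under the following assumptions: the steplength is the usual third SQUAREM choice of [16], the safeguard of lines 7 to 9 falls back to the plain MM iterate, and a final stabilizing MM step is applied as in [16]; mm_map and is_feasible are hypothetical placeholders):

# One SqS3 (SQUAREM) acceleration step around a generic MM fixed-point map.
squarem_step <- function(theta0, mm_map, is_feasible) {
  theta1 <- mm_map(theta0)
  theta2 <- mm_map(theta1)
  r <- theta1 - theta0
  v <- (theta2 - theta1) - r
  if (sum(v^2) == 0) return(theta2)                  # map is (numerically) at a fixed point
  alpha <- -sqrt(sum(r^2)) / sqrt(sum(v^2))          # third steplength choice of [16]
  theta_acc <- theta0 - 2 * alpha * r + alpha^2 * v  # squared extrapolation
  if (!is_feasible(theta_acc)) theta_acc <- theta2   # safeguard: keep the plain MM iterate
  mm_map(theta_acc)                                  # stabilizing MM step
}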

4. Simulation Study

We compare, in the R software [18], our MM algorithm and its accelerated version SqS3 to the NR algorithm (package nleqslv [19]) and to the quasi-Newton BFGS algorithm (package alabama [20]). The design of the simulation study is inspired by [21, 22].
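For reference, the two benchmark solvers might be wired up as follows (a hypothetical sketch, not the paper's actual code: negloglik, score, hin, and heq stand for the model's negative log-likelihood, stationarity equations, inequality constraints, and equality constraints, which are not reproduced here):

# Hypothetical calls to the benchmark solvers used for comparison.
library(nleqslv)   # Newton-type solver for nonlinear systems
library(alabama)   # augmented Lagrangian optimizer built on BFGS

fit_nr <- function(start, score) {
  # Newton iteration on the stationarity (score) equations of the problem
  nleqslv(x = start, fn = score, method = "Newton")$x
}

fit_bfgs <- function(start, negloglik, hin, heq) {
  # hin(par) >= 0 encodes the box constraints, heq(par) = 0 the equality constraints
  auglag(par = start, fn = negloglik, hin = hin, heq = heq)$par
}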

4.1. Data Generation

We have generated the data under assumptions (A1), (A2), and (A3), where the true parameter vector took five values, defined by the following scenarios:

Scenario 1. , ,

Scenario 2. , ,

Scenario 3. , ,

Scenario 4. , ,

Scenario 5. , ,

For all treated sites, the total number of accidents was given two values: 50 and 5000. The starting guess is randomly generated as follows: the effect component is a random observation from a uniform distribution, and the probability components are generated from random observations of a uniform distribution, transformed so that the constraints are satisfied.
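A feasible starting point of this kind might be generated as in the following sketch (our own illustration: the uniform range for the effect component and the normalization device are assumptions, since only the use of uniform draws is stated above; random_start is a hypothetical name):

# Hypothetical generation of a random feasible starting point.
random_start <- function(n_sites, n_levels) {
  theta0 <- runif(1, 0, 2)            # illustrative range for the effect component
  p0 <- t(replicate(n_sites, {
    u <- runif(n_levels)
    u / sum(u)                        # each site's probabilities sum to one
  }))
  list(theta = theta0, p = p0)
}
start <- random_start(n_sites = 5, n_levels = 3)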

4.2. Results

For the different scenarios and sample sizes, the average values obtained over 1000 replications are given in Tables 1–5. In these tables, the convergence proportion refers to the percentage of runs, out of the 1000 replications, for which the algorithm converged; CPU (central processing unit) times are given in seconds; and time ratios are obtained by dividing the CPU time of each algorithm by that of the MM algorithm (which explains why the time ratio of MM is always equal to 1). The mean square error (MSE) linked to an estimate is the average, over the 1000 replications, of the squared difference between the estimate and the corresponding true value.

To avoid overloading the tables, the estimate has been included for Scenario 1 only (see Table 1).

In Tables 1–5, we can notice that the estimated values, the standard deviations, the log-likelihoods, and the MSE are globally the same for all algorithms, except for BFGS in some settings. Regarding the convergence proportions, our proposed MM algorithm and its accelerated version SqS3 always have a 100% convergence proportion, while the convergence proportion of NR is close to but strictly lower than 100%. The convergence proportion of BFGS varies between 4.1% and 77.3% and decreases when the number of parameters increases.

For all the algorithms, we notice a decrease of the MSE when the sample size increases from 50 to 5000, which suggests a good fit between the true and estimated values as the sample size grows. Moreover, the MSE of BFGS is sometimes greater than those of the other algorithms, which suggests that BFGS converges to poor values.

As far as the CPU times are concerned, we notice that, except for Scenario 1, the computation time of the NR algorithm is greater than that of the MM algorithm (more than 2 times for one sample size and more than 1.8 times for the other in Table 2, more than 3 times in Table 3, more than 6 times in Table 4, and more than 19 times in Table 5). The proposed MM algorithm is 23 to 67 times faster than BFGS.

The performance of the MM algorithm is even better with acceleration. Indeed, the CPU time ratio of SqS3 is between 0.35 and 0.51. The percentage of computation time reduction yielded by the acceleration thus varies between 49% and 65% (since 1 - 0.51 = 0.49 and 1 - 0.35 = 0.65).

5. Conclusion

In this paper, we built a minorization-maximization (MM) algorithm to compute, under box constraints and linear equality constraints, the maximum likelihood estimates of the parameters of a multivariate statistical model used in the analysis of accident frequencies. This statistical model is composed of several multinomial distributions whose parameters are dependent. Since MM algorithms are generally considered slow to converge, we have also proposed an accelerated version of our MM algorithm using a squared iterative acceleration scheme developed in [16]. Using simulated data, we have shown that our proposed MM algorithm and its accelerated version are better than the Newton-Raphson (NR) and quasi-Newton BFGS algorithms in terms of convergence proportion and computation time.

Data Availability

This research uses simulated data and the data generation process is described in the paper.

Conflicts of Interest

The authors declare that they have no conflicts of interest.