Abstract

Numerical computation of maximum likelihood estimates (MLE) is one of the most common problems encountered in applied statistics. Although many algorithms are regarded as efficient, they may fall short in some cases on one or more of the following criteria: global convergence (the ability of an algorithm to converge to the true unknown solution from any starting guess), numerical stability (the ascent property), implementation feasibility (for example, algorithms requiring matrix inversion cannot be implemented when the matrices involved are not invertible), low computation time, low computational complexity, and the ability to handle high-dimensional problems. In practice, no algorithm is perfect, and for each problem one must identify the best-performing existing algorithm or even develop new ones. In this paper, we consider the computation of the maximum likelihood estimate of the parameter vector of a statistical model of crash frequencies. We split the parameter vector into two blocks and develop a new estimation algorithm using the profile likelihood principle. We provide an automatic starting guess for which convergence and numerical stability are guaranteed. We study the performance of the new algorithm on simulated data by comparing it with some of the best-known modern optimization algorithms. The results suggest that our proposed algorithm outperforms them.

1. Introduction

Consider a log-likelihood function of a parameter vector whose structure will be specified later. In computing the maximum likelihood estimate (MLE), that is, the point at which the log-likelihood attains its maximum, the very first algorithm one may try is the Newton-Raphson (NR) algorithm, which needs no introduction. The success of this algorithm comes from an enviable and unequalled property: quadratic convergence when the starting guess is well chosen (i.e., close to the unknown solution). However, it also has drawbacks: it is not globally convergent (i.e., its success depends on the starting guess), and it can be costly or impossible to implement in high-dimensional problems because it requires inverting the Hessian matrix at each iteration. Many other algorithms can be used whenever NR cannot. We refer the reader to [14] for a comprehensive review of these algorithms. Among them, we can mention quasi-Newton methods (which use approximations of the inverse Hessian matrix rather than inverting it), Fisher scoring (a purely statistical method which replaces the Hessian matrix with its mathematical expectation in order to ensure the ascent property and therefore numerical stability), block optimization, and derivative-free optimization algorithms. We can also mention minorization-maximization (MM) algorithms [5, 6], which have made an extraordinary breakthrough in computational statistics and are increasingly used. The MM philosophy for maximizing a function is, in the first M step, to define a minorizing function for the objective function and, in the second M step, to maximize the minorizing function with respect to the parameter vector.

Although all these algorithms are regarded as efficient, they may fall short in some cases on one or more of the following criteria: global convergence (the ability of an algorithm to converge to the true unknown solution from any starting guess), numerical stability (the ascent property), implementation feasibility (for example, algorithms requiring matrix inversion cannot be implemented when the matrices involved are not invertible), low computation time, low computational complexity, and the ability to handle high-dimensional problems. In practice, no algorithm is perfect, and for each problem one must identify the best-performing existing algorithm or even develop new ones.

In this paper, we consider the computation of the maximum likelihood estimate (MLE) of the parameter vector of a statistical model of crash frequencies proposed by [7]. The parameter vector consists of a parameter of interest and a vector of secondary parameters. Although secondary, this subvector can contain a large number of components (up to several hundred) depending on the structure of the data. Thus, the classical algorithms mentioned above may require a great number of iterations (and hence converge slowly) or simply fail to converge. For the model considered in this paper, Newton-Raphson and minorization-maximization (MM) algorithms have been implemented by [7, 8], but the numerical study of these algorithms has been limited to simple cases. The NR algorithm is known to be fast in simple cases but inefficient (or even impossible to implement) for high-dimensional problems. Moreover, its convergence depends on the starting guess (the initial solution from which the algorithm starts). The MM algorithm, on the other hand, often requires a large number of iterations before converging to the solution.

The main contribution of this paper is to build an estimation algorithm that converges faster than the other algorithms and whose convergence is guaranteed and unaffected by the dimension (the number of components of the vector of secondary parameters). For this purpose, we exploit the splitting of the parameter vector into two blocks and apply the profile likelihood principle (see for example [9], p. 231) to reduce the search for the full MLE to the search for the parameter of interest alone. We then develop a new estimation algorithm for which we provide an automatic starting guess from which convergence and numerical stability (the ascent property) are guaranteed. This automation of the starting guess turns out to be a real advantage of our algorithm over the others because it circumvents the difficulty of finding an adequate starting guess. Using simulated data, we show that our algorithm outperforms the minorization-maximization (MM) and Newton-Raphson algorithms, two of the best-known modern optimization algorithms.

The remainder of this paper is organized as follows. In Section 2, we describe the data and the estimation problem; the statistical model and the constrained maximum likelihood estimation of its parameters are also presented. In Section 3, we present the profile likelihood principle and use it to design our new profile-likelihood-based algorithm (PLBA) for computing the MLE. We also provide an automatic starting guess and prove that it guarantees convergence of the proposed algorithm. In Section 4, we prove that the proposed algorithm satisfies the ascent property. In Section 5, we report the results of a comparison of the PLBA with the MM and NR algorithms. The paper ends with a discussion and concluding remarks in Section 6.

2. Data, Statistical Model, and Problem Setup

The framework of this paper is the statistical analysis of crash data before and after the implementation of a road safety measure at several experimental sites (called treated sites), where crashes are classified into several severity levels. The analysis aims at estimating the mean effect of the safety measure simultaneously on all the sites. This mean effect must be understood in the multiplicative sense and therefore must be compared to 1. A mean effect smaller than 1 indicates a positive effect of the measure (an average reduction in the number of crashes), a value greater than 1 indicates a negative effect (an average increase in the number of crashes), and a value equal to 1 indicates that the measure had no significant influence on the number of crashes.

Let the crash data be arranged in a matrix (Equation (1)) whose entries are the numbers of accidents of each severity level which occurred at each treated site in the period before, respectively after, the implementation of the measure. Also let a second matrix be given whose entry for a treated site and a severity level is the ratio of the number of accidents of that severity in the “after” period to the number of accidents of the same severity in the “before” period at a control site (a site where the measure has not been implemented) paired with the treated site.

Let the total number of accidents observed at each treated site also be given. For these counts, N’Guessan et al. [7] proposed the multinomial-based probability distribution (3), with class probabilities (4), whose parameter vector gathers the mean effect and a vector of secondary parameters subject to constraints.

The parameter of the model is this vector, and the log-likelihood of the observed data is given, up to an additive constant, by Equation (5) (see [7] for more details).

In this paper, we are interested in computing the maximum likelihood estimate (MLE) of the unknown parameter vector, defined by the constrained maximization problem (6).

In a particular case, [10] built an algorithm to solve Equation (6). In the next section, we present a profile-likelihood-based algorithm (PLBA) for computing the MLE in the general case. Our proposed PLBA is automated in the sense that we provide an automatic starting guess which guarantees convergence.

3. The Automated Profile-Likelihood-Based Algorithm (PLBA) for Computing the MLE

3.1. A Brief Reminder on Profile Likelihood

Let us write the log-likelihood as $\ell(\theta, \beta)$, where $\theta$ denotes the parameter of interest and $\beta$ the vector of secondary parameters. If, for a given $\theta$, the MLE of $\beta$ can be written as a function of $\theta$, say $\hat{\beta}(\theta)$, then the profile likelihood function (see for example [9], p. 231) is $\ell_P(\theta) = \ell(\theta, \hat{\beta}(\theta))$, expressed as a function of $\theta$ only. The maximization of $\ell_P$ is equivalent to that of $\ell$ [11].
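
To make the principle concrete, the following R sketch illustrates it on a textbook example (a normal sample with unknown mean and variance), not on the crash model itself: for a fixed mean, the maximizing variance has a closed form, so the two-parameter maximization reduces to a one-dimensional search over the mean. All names and values here are illustrative only.

# Toy illustration of the profile likelihood principle in R, on a normal
# sample (unrelated to the crash model): for a fixed mean mu, the MLE of the
# variance is mean((x - mu)^2), so the two-parameter problem reduces to a
# one-dimensional search over mu.
set.seed(1)
x <- rnorm(200, mean = 3, sd = 2)
loglik <- function(mu, sigma2) sum(dnorm(x, mean = mu, sd = sqrt(sigma2), log = TRUE))
sigma2_hat <- function(mu) mean((x - mu)^2)                # inner maximizer in closed form
profile_loglik <- function(mu) loglik(mu, sigma2_hat(mu))  # profile log-likelihood of mu
opt <- optimize(profile_loglik, interval = range(x), maximum = TRUE)
c(mu_hat = opt$maximum, sigma2_hat = sigma2_hat(opt$maximum))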

3.2. Computation of the Secondary Parameter Estimates in Closed Form

Lemma 1. For a given value of the parameter of interest, the components of the corresponding solution to Equation (7) are given in closed form by the expression below.

Proof. For a given value of the parameter of interest, [7] proved that the components of the solution satisfy a system of equations. Applying Theorem 3.5 of [12] to Equation (10) of this system yields the stated closed form, which involves a block-defined square matrix, a diagonal matrix, and the Schur complement of one block in the full matrix (see for example [13] (p. 34) for a reminder on the use of the Schur complement for inverting a block matrix). The proof is thus complete.
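
The proof relies on inverting a block-partitioned matrix through its Schur complement. As a generic illustration (with arbitrary blocks A, B, C, D, not the model's matrices), the following R sketch checks the block-inversion identity numerically:

# Generic check of block-matrix inversion via the Schur complement
# (arbitrary blocks A, B, C, D; not the model's matrices).
set.seed(2)
A <- diag(2) + crossprod(matrix(rnorm(4), 2, 2))   # 2 x 2, invertible
B <- matrix(rnorm(6), 2, 3)
C <- matrix(rnorm(6), 3, 2)
D <- diag(3) + crossprod(matrix(rnorm(9), 3, 3))   # 3 x 3, invertible
M <- rbind(cbind(A, B), cbind(C, D))
Ainv <- solve(A)
S <- D - C %*% Ainv %*% B                          # Schur complement of A in M
Sinv <- solve(S)
Minv <- rbind(
  cbind(Ainv + Ainv %*% B %*% Sinv %*% C %*% Ainv, -Ainv %*% B %*% Sinv),
  cbind(-Sinv %*% C %*% Ainv, Sinv)
)
max(abs(Minv - solve(M)))                          # numerically close to zero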

3.3. Profile Likelihood

Theorem 2. The profile likelihood function is given, up to an additive constant independent of the parameter of interest, by Equation (14).

Proof. Expression (5) can first be rewritten in expanded form. For every site, Equation (9) yields (16), and relationships (9) and (16) then allow the log-likelihood to be rearranged term by term. After some manipulations on the first, second, and fourth terms, and after removing the third and sixth terms as well as the constants (the first and fifth terms), we obtain (14).

3.4. Design of the PLBA

Lemma 3. Consider the mapping defined by Formula (19). The MLE of the parameter of interest is the unique root of this mapping.

Proof. (i) On the one hand, the mapping is one-to-one and decreasing (it is continuous and its derivative is strictly negative on its domain), and it changes sign on its domain. Hence, the equation obtained by setting it to zero has a unique solution. (ii) On the other hand, the MLE of the parameter of interest, if it exists, is a solution to the profile optimization problem. The profile log-likelihood being differentiable everywhere on the domain, the MLE is then a solution to the corresponding score equation. This score equation is equivalent to setting the mapping of Formula (19) to zero, so the MLE is its unique root.

This root equation seems too complicated to solve in closed form in general and must therefore be solved numerically. There are, of course, many root-finding algorithms (see for example [14] (Chapter 3) or [15] (Chapter 3)). Here, we propose to approximate the MLE of the parameter of interest using the following one-dimensional Newton-Raphson (NR) root-finding iteration:

where the starting guess should be chosen by the user within the admissible domain. Our choice of the NR algorithm is motivated by the fact that it converges quadratically to the solution if the starting guess is chosen near the unknown solution. To overcome the difficulty of this choice, we prove that, with the automatic starting guess we propose, the convergence of NR iterations (23) is always guaranteed.
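
To illustrate iteration (23) itself, here is a minimal R sketch of one-dimensional Newton-Raphson root finding on a toy strictly decreasing function; g, dg, and the starting value are placeholders and do not correspond to the actual mapping of Lemma 3 or to the automatic starting guess:

# One-dimensional Newton-Raphson root finding on a toy strictly decreasing
# function; g, dg and the starting value are placeholders, not the actual
# mapping of Lemma 3 or the automatic starting guess.
g  <- function(x) 5 - x - exp(x)    # strictly decreasing, unique root near 1.3
dg <- function(x) -1 - exp(x)
newton_root <- function(x0, tol = 1e-10, max_iter = 100) {
  x <- x0
  for (k in seq_len(max_iter)) {
    x_new <- x - g(x) / dg(x)       # NR update, as in iteration (23)
    if (abs(x_new - x) < tol) return(list(root = x_new, iterations = k))
    x <- x_new
  }
  list(root = x, iterations = max_iter)
}
newton_root(x0 = 0)                 # placeholder starting guess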

Theorem 4. If the starting guess is taken as the automatic value proposed above, then the sequence defined by NR iterations (23) converges to the MLE of the parameter of interest.

To prove Theorem 4, we need the following Lemma 5.

Lemma 5. Let the iteration function associated with NR iterations (23) be the map sending the current iterate to the next one. Then:
(i) the iteration function is increasing to the left of the MLE and decreasing to the right of it;
(ii) the MLE is its unique fixed point;
(iii) inequality (25) holds for all admissible values;
(iv) the iteration function never exceeds the MLE.

Proof of Lemma 5.
(i) The iteration function is differentiable, and the sign of its derivative is the same as that of the mapping of Lemma 3. As this mapping is decreasing and vanishes at the MLE, the derivative is positive to the left of the MLE and negative to the right of it. It follows that the iteration function is increasing to the left of the MLE and decreasing to the right of it.
(ii) The fixed-point equation is equivalent to the root equation of Lemma 3, where the mapping is defined by Formula (19). By Lemma 3, this equation has the MLE as its unique solution; hence, the MLE is the unique fixed point of the iteration function.
(iii) The relation (25) is a simple consequence of (26) and (29).
(iv) If the argument lies below the MLE, its image lies below the image of the MLE because the iteration function is increasing there (item (i)). If it lies above the MLE, the same conclusion follows because the iteration function is decreasing there. Thus, the iteration function never exceeds the MLE, which is its value at its own fixed point.

We may now prove Theorem 4.

Proof of Theorem 4. Assume the starting guess is the automatic value. Then, by items (iii) and (iv) of Lemma 5, the first iterate is not smaller than the starting guess and does not exceed the MLE. It can easily be proved by induction that the same holds at every iteration, as stated in (30). Thus, the sequence is increasing and bounded above; hence, it converges to the unique fixed point of the iteration function, which is the MLE.

Remark 6. Actually, from the proof of Theorem 4, it is clear that any starting value not exceeding the MLE will guarantee convergence of the sequence generated by NR iterations (23). However, since the MLE is unknown, it is difficult to exhibit such a value other than the automatic one proposed here.

We therefore propose an algorithm (see Algorithm 1) starting from the automatic starting guess. The MLE of the parameter of interest is computed using NR iterations (23); afterwards, the secondary parameter estimates are computed from the closed-form expressions of Lemma 1.

Input: the crash data, the control ratios, and a convergence tolerance
Output: the MLE of the parameter vector
1 Initialize the parameter of interest at the automatic starting guess and set the iteration counter to zero;
2 repeat
3 Update the current value of the parameter of interest by one NR iteration (23);
4 Increment the iteration counter;
5 until the convergence criterion is met;
6 Set the estimate of the parameter of interest to the last iterate;
7 For every secondary parameter, compute its estimate from the closed-form expression of Lemma 1;
8 Set the full estimate by combining the estimate of the parameter of interest and the secondary parameter estimates.
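
For readers wishing to implement Algorithm 1, the following R skeleton mirrors its structure. The functions starting_guess, profile_score, profile_score_deriv, and beta_closed_form are hypothetical placeholders for the model-specific quantities (the automatic starting value of Theorem 4, the mapping of Lemma 3 and its derivative, and the closed-form expressions of Lemma 1), which are not reproduced here:

# Skeleton of Algorithm 1 (PLBA). All model-specific functions below are
# hypothetical placeholders to be supplied from the paper:
#   starting_guess(data)             automatic starting value (Theorem 4)
#   profile_score(theta, data)       mapping of Lemma 3
#   profile_score_deriv(theta, data) its derivative
#   beta_closed_form(theta, data)    closed-form secondary estimates (Lemma 1)
plba <- function(data, tol = 1e-8, max_iter = 100) {
  theta <- starting_guess(data)                            # step 1
  for (k in seq_len(max_iter)) {                           # steps 2-5
    theta_new <- theta - profile_score(theta, data) /
                         profile_score_deriv(theta, data)  # NR iteration (23)
    if (abs(theta_new - theta) < tol) { theta <- theta_new; break }
    theta <- theta_new
  }
  beta <- beta_closed_form(theta, data)                    # step 7 (Lemma 1)
  list(theta = theta, beta = beta, iterations = k)         # step 8
}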

4. Ascent Property

The ascent property of Algorithm 1 (the fact that the profile log-likelihood is increased monotonically by the algorithm) is given by Theorem 7.

Theorem 7. The sequence generated by Algorithm 1 monotonically increases the profile log-likelihood; that is, the profile log-likelihood evaluated at successive iterates is nondecreasing.

Proof. From (22) and (29), we deduce that the derivative of the profile log-likelihood is positive below the MLE and negative above it. Hence, the profile log-likelihood is increasing below the MLE and decreasing above it. From (30), we know that the sequence of iterates is increasing and remains below the MLE. Then, at every iteration, the value of the profile log-likelihood at the new iterate is at least its value at the previous one, because the profile log-likelihood is increasing on the relevant interval. The proof of Theorem 7 is then complete.

5. Simulation Study

We compare the performance of our PLBA with that of the Newton-Raphson (NR) and MM algorithms in the R software [16]. We implemented the NR algorithm using the R package pracma [17] and the MM algorithm using Theorem 3.3 of [8]. The choice of these two comparison algorithms is motivated by the following factors: (a) the MM and NR algorithms are two of the most widely used algorithms in statistics for parameter estimation; (b) other algorithms, such as quasi-Newton and derivative-free algorithms, showed very low convergence proportions, and their results are not reported here; (c) the results obtained for the particular case studied in [10] suggest that the MM and NR algorithms are much more efficient than quasi-Newton and derivative-free algorithms.

5.1. Data Generation Principle

For given values of the design parameters, the entries of the control-ratio matrix are randomly generated from a uniform distribution. Given the true parameter vector, the true class probabilities (see Equation (4)) are computed; afterwards, the data matrix (Equation (1)) is generated using the random generation function associated with the probability distribution (3).
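
The following R sketch shows one possible structure for a single replication. The function true_probs is a hypothetical placeholder for the class probabilities of Equation (4), and the uniform bounds and argument shapes are arbitrary choices made only for illustration:

# Structural sketch of one simulated replication. true_probs() is a
# hypothetical placeholder for the class probabilities of Equation (4);
# the uniform bounds and the shapes of the arguments are arbitrary choices.
simulate_one <- function(n_sites, n_severities, theta_true, beta_true, n_crashes) {
  # control ratios drawn from a uniform distribution (arbitrary bounds here)
  Z <- matrix(runif(n_sites * n_severities, min = 0.5, max = 1.5),
              nrow = n_sites, ncol = n_severities)
  # crash counts generated site by site from the multinomial model (3)
  Y <- t(sapply(seq_len(n_sites), function(k) {
    p <- true_probs(theta_true, beta_true[k, ], Z[k, ])    # Equation (4), placeholder
    rmultinom(1, size = n_crashes[k], prob = p)
  }))
  list(Y = Y, Z = Z)
}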

The true parameter vector is presented under the following six scenarios:

Scenario 1: , ,

Scenario 2: , ,

Scenario 3: , ,

Scenario 4: , ,

Scenario 5: , ,

Scenario 6: , ,

For these different scenarios, the number of parameters is given in Table 1.

For the sample sizes, we have chosen two common values: a small one and a large one. Except for the proposed algorithm, whose starting guess is automated, the algorithms were given randomly generated starting guesses.

5.2. Results

Tables 2–7 present the average results obtained for the different scenarios over 1000 replications (i.e., 1000 repetitions of the data generation and of the computation of the MLE). In these tables, CPU times are given in seconds, and CPU time ratios are calculated as the ratio of the mean CPU time of a given algorithm to the mean CPU time of the PLBA. Thus, the CPU time ratio of the PLBA is always equal to 1. The Mean Square Error (MSE) of the estimates with respect to the true parameter values is also reported.
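
As an illustration, the R sketch below assumes the usual definition of the MSE (the average squared deviation of the estimates from the true values over the replications) and shows how the CPU time ratios can be formed; estimates, truth, and cpu_times are hypothetical objects collecting the replication results:

# Hypothetical post-processing of the replications.
# estimates: matrix (replications x parameters) of MLEs
# truth:     vector of true parameter values
# cpu_times: named vector of mean CPU times (in seconds) per algorithm
mse <- function(estimates, truth) mean(sweep(estimates, 2, truth)^2)  # assumed definition
cpu_ratio <- function(cpu_times) cpu_times / cpu_times["PLBA"]        # ratio to PLBA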

Because of the increasing number of parameters, the estimated parameter values are reported only for Scenario 1 (Table 2).

From these tables, it can be seen that the PLBA and MM algorithms always converged, while the convergence proportion of the NR algorithm decreases (from 55.9% to 19.6%) with the number of parameters (see Figure 1). As far as the MSE is concerned, the trend for all the algorithms is that the MSE decreases as the sample size increases.

Looking at the CPU time ratios, we notice that those of the MM and NR algorithms are all well above 1. This means that the PLBA is significantly faster than these two algorithms. As shown in Figures 2 and 3, the CPU time ratios of the MM and NR algorithms increase with the number of parameters. The PLBA is on average 8 to 50 times faster than MM and 6 to 74 times faster than NR.

5.3. Analysis of the Results

It appears that our proposed PLBA outperforms the minorization-maximization (MM) and the full Newton-Raphson (NR) algorithms. It always converges, requires few iterations, and needs little computation time. The number of iterations varies only slightly and seems to stabilize around six iterations regardless of the starting guess and of the number of parameters to be estimated. The MM algorithm also has a convergence rate of 100%, but its number of iterations is quite high, which may indicate some sensitivity to the starting guess. The NR algorithm has a convergence proportion at most slightly above 50%, and this proportion decreases as the number of parameters increases. This is not surprising because, at each iteration of the NR algorithm, a square matrix whose order grows with the number of parameters must be numerically inverted. This inversion can be difficult or impossible when the matrix is ill-conditioned or singular. The fact that NR does not converge for all replications is also not surprising, since it is well known that NR may converge to incorrect values, or not converge at all, if the starting guess is far from the true parameter vector.

The good results obtained by our proposed PLBA can be explained by several factors. First, the profile likelihood principle makes it possible to reduce the search for all the parameter estimates to the search for the single estimate of the parameter of interest, from which the remaining estimates can be obtained by a simple formula. This also reduces the computation time required by the PLBA to converge. Secondly, the fact that the estimation of the parameter of interest relies on a one-dimensional NR iteration allows our algorithm to enjoy the quadratic convergence of the NR algorithm. Thirdly, the automated starting guess ensures the convergence of the proposed PLBA, and the ascent property ensures its numerical stability.

6. Conclusion

In this article, we have built a profile-likelihood-based algorithm (PLBA) to compute, under constraints, the maximum likelihood estimate (MLE) of the parameter vector of a statistical model used in the analysis of a road safety measure applied to several sites, with crashes classified into several severity levels. The parameter vector consists of a parameter of interest and a vector of secondary parameters. Using the likelihood equations, we obtained a closed-form expression of the secondary parameter estimates as a function of the main parameter and then used the profile likelihood principle to express the log-likelihood as a function of the parameter of interest only. We then built an algorithm combining a one-dimensional Newton method, used to compute the estimate of the parameter of interest, with the computation of the secondary parameters from their closed-form expressions. The starting guess of the proposed algorithm is automated in such a way as to guarantee its convergence towards the MLE. The numerical studies suggest that our PLBA outperforms the minorization-maximization (MM) and full Newton-Raphson (NR) algorithms in terms of computation time. The PLBA converges to estimates close to the true values of the parameter vector even for small sample sizes. The studies also suggest that the problem addressed in this paper is difficult for the full NR algorithm, the latter having a convergence proportion at most slightly above 50%.

Data Availability

This study uses simulated data, and the data generation process is described in the paper.

Conflicts of Interest

The author declares that he/she has no conflicts of interest.