Abstract

I present a parametric, bijective transformation to generate heavy-tailed versions $Y$ of arbitrary random variables $X \sim F_X$. The tail behavior of this heavy tail Lambert W × $F_X$ random variable $Y$ depends on a tail parameter $\delta \geq 0$: for $\delta = 0$, $Y \equiv X$; for $\delta > 0$, $Y$ has heavier tails than $X$. For $X$ being Gaussian it reduces to Tukey's $h$ distribution. The Lambert W function provides an explicit inverse transformation, which can thus remove heavy tails from observed data. It also provides closed-form expressions for the cumulative distribution function (cdf) and probability density function (pdf). As a special case, these yield analytic expressions for Tukey's $h$ pdf and cdf. Parameters can be estimated by maximum likelihood and applications to S&P 500 log-returns demonstrate the usefulness of the presented methodology. The R package LambertW implements most of the introduced methodology and is publicly available on CRAN.

1. Introduction

Statistical theory and practice are both tightly linked to Normality. In theory, many methods require Gaussian data or noise: (i) regression often assumes Gaussian errors; (ii) many time series models are based on Gaussian white noise [1–3]. In such cases, a model $\mathcal{M}$, parameter estimates $\widehat{\theta}$ and their standard errors, and other properties are then studied, all based on the ideal(istic) assumption of Normality.

In practice, however, data/noise often exhibits asymmetry and heavy tails, for example, wind speed data [4], human dynamics [5], or Internet traffic data [6]. Particularly notable examples are financial data [7, 8] and speech signals [9], which almost exclusively exhibit heavy tails. Thus a model developed for the Gaussian case does not necessarily provide accurate inference anymore.

One way to overcome this shortcoming is to replace the Gaussian model $\mathcal{M}$ with a new model $\mathcal{M}_H$, where $H$ is a heavy tail distribution: (i) regression with Cauchy errors [10]; (ii) forecasting long memory processes with heavy tail innovations [11, 12]; or (iii) ARMA modeling of electricity loads with hyperbolic noise [13]. See also Adler et al. [14] for a wide range of statistical applications and methodology for heavy-tailed data.

While such fundamental approaches are attractive from a theoretical perspective, they can become unsatisfactory from a practical point of view. Many successful statistical models and techniques assume Normality, their theory is very well understood, and many algorithms are implemented for the simple and often much faster Gaussian case. Thus developing models based on an entirely unrelated distribution is like throwing out the (Gaussian) baby with the bathwater.

It would be very useful to transform a Gaussian random variable to a heavy-tailed random variable and vice versa and thus rely on knowledge and algorithms for the well-understood Gaussian case, while still capturing heavy tails in the data. Optimally such a transformation should (a) be bijective, (b) include Normality as a special case for hypothesis testing, and (c) be parametric so the optimal transformation can be estimated efficiently.

Figure 1 illustrates this pragmatic approach: researchers can make their observations $\boldsymbol{y}$ as Gaussian as possible ($\boldsymbol{x}_{\widehat{\tau}}$) before making inference based on their favorite Gaussian model $\mathcal{M}$. This avoids the development of, or the data analyst's waiting for, a whole new theory of $\mathcal{M}_H$ and new implementations based on a particular heavy-tailed distribution $H$, while still improving statistical inference from heavy-tailed data $\boldsymbol{y}$. For example, consider the sample $\boldsymbol{y}$ from a standard Cauchy distribution in Figure 2(a): modeling heavy tails by a transformation makes it even possible to Gaussianize this Cauchy sample (Figure 2(c)). This "nice" data can then be subsequently analyzed with common techniques. For example, the location can now be estimated using the sample average (Figure 2(d)). For details see Section 6.1.

Liu et al. [15] use a semiparametric approach, where $X$ has a nonparanormal distribution if $f(X)$ is Gaussian for an increasing smooth function $f$; they estimate $f$ using nonparametric methods. This leads to a greater flexibility in the distribution of $X$, but it suffers from two drawbacks: (i) nonparametric methods have slower convergence rates and thus need large samples, and (ii) for identifiability of $f$, $\mathbb{E}\,f(X) = \mathbb{E}\,X$ and $\operatorname{Var} f(X) = \operatorname{Var} X$ must hold. While (i) is inherent to nonparametric methods, point (ii) requires $X$ to have finite mean and variance, which is often not met for heavy-tailed data. Thus here we use parametric transformations which do not rely on restrictive identifiability conditions and also work well for small sample sizes.

The main contributions of this work are threefold: (a) a metafamily of heavy tail Lambert W × $F_X$ distributions (see also [16]) with Tukey's $h$ distribution [17] as a special case, (b) a bijective transformation to "Gaussianize" heavy-tailed data (Section 2), and (c) simple expressions for the cumulative distribution function (cdf) and probability density function (pdf) (Section 2.4). In particular, analytic expressions for the pdf and cdf of Tukey's $h$ (Section 3) are presented here, to the best of the author's knowledge, for the first time in the literature.

Section 4 introduces a method of moments estimator and studies the maximum likelihood estimator (MLE). Section 5 shows their finite sample properties. As has been shown in many case studies, Tukey's $h$ distribution (heavy tail Lambert W × Gaussian) is useful to model data with unimodal, heavy-tailed densities. Applications to S&P 500 log-returns confirm the usefulness of the Lambert W framework (Section 6). Finally, we discuss the new methodology and future work in Section 7. Detailed derivations and proofs are given in the Supplementary Material available online at http://dx.doi.org/10.1155/2015/909231.

Computations, figures, and simulations were done in R [18]. The R package  LambertW implements most of the presented methodology and is publicly available on CRAN.

1.1. Multivariate Extensions

While this work focuses on the univariate case, multivariate extensions of the presented methods can be defined component-wise, analogously to the multivariate version of Tukey's $h$ distribution [19]. While this may not make the transformed random variables jointly Gaussian, it still provides a good starting point for more well-behaved multivariate estimation.

1.2. Box-Cox Transformation

A popular method to deal with skewed, high variance data is the Box-Cox transformation $y^{(\lambda)} = \frac{y^{\lambda} - 1}{\lambda}$ for $\lambda \neq 0$ and $y^{(0)} = \log y$. (1) A major limitation of (1) is the nonnegativity constraint on $y$, which prohibits its use in many applications. To avoid this limitation it is common to shift the data, $\widetilde{y} = y + c > 0$, which restricts $\widetilde{y}$ to a half-open interval. If, however, the underlying process can occur on the entire real line, such a shift undermines statistical inference for yet unobserved data (see [20]). Even if out-of-sample prediction is not important for the practitioner, Figure 2(b) shows that the Box-Cox transformation in fact fails to remove heavy tails from the Cauchy sample. (We shift the sample to be positive and choose $\lambda$ by profile maximum likelihood, using boxcox from the MASS R package.)
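The following minimal R sketch (not the exact sample of Figure 2; the variable names are illustrative) reproduces this failure mode: even after an ad hoc shift and a profile-likelihood choice of $\lambda$, a Box-Cox transformed Cauchy sample typically retains its excess kurtosis.

    library(MASS)
    set.seed(10)
    y <- rcauchy(100)                # heavy-tailed sample
    y.pos <- y - min(y) + 1          # ad hoc shift to enforce positivity
    bc <- boxcox(y.pos ~ 1, plotit = FALSE)
    lambda <- bc$x[which.max(bc$y)]  # profile-likelihood estimate of lambda
    y.bc <- (y.pos^lambda - 1) / lambda
    kurtosis <- function(x) mean((x - mean(x))^4) / mean((x - mean(x))^2)^2
    c(original = kurtosis(y), box.cox = kurtosis(y.bc))  # typically both far above 3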

Moreover, the main purpose of the Box-Cox transformation is to stabilize variance [21–23] and remove right tail skewness [24]; a lower empirical kurtosis is merely a by-product of the variance stabilization. In contrast, the Lambert W framework is designed primarily to model heavy-tailed random variables and remove heavy tails from data and has no difficulties with negative values.

2. Generating Heavy Tails Using Transformations

Random variables exhibit heavy tails if more mass than for a Gaussian random variable lies at the outer end of the density support. A random variable $Y$ has a tail index $a > 0$ if its cdf satisfies $1 - G_Y(y) = y^{-a} L(y)$, where $L(y)$ is a slowly varying function at infinity, that is, $\lim_{y \to \infty} L(ty)/L(y) = 1$ for all $t > 0$ [25]. (There are various similar definitions of heavy, fat, or long tails; for this work these differences are not essential.) The heavy tail index $a$ is an important characteristic of $Y$; for example, only moments of order strictly less than $a$ can exist.

2.1. Tukey's h Distribution

A parametric transformation is the basis of Tukey's $h$ random variables [17]: $Y = Z e^{\frac{\delta}{2} Z^2}, \ \delta \geq 0$, (2) where $Z$ is a standard Normal random variable and $\delta$ is the heavy tail parameter. The random variable $Y$ has tail index $1/\delta$ [17] and reduces to the Gaussian for $\delta = 0$. Morgenthaler and Tukey [26] extend the $h$ distribution to the skewed, heavy-tailed family of $hh$ random variables $Y = Z e^{\frac{\delta_{\ell}}{2} Z^2}$ if $Z \leq 0$ and $Y = Z e^{\frac{\delta_{r}}{2} Z^2}$ if $Z > 0$, (3) where again $Z \sim \mathcal{N}(0, 1)$. Here $\delta_{\ell} \geq 0$ and $\delta_{r} \geq 0$ shape the left and right tail of $Y$, respectively; thus transformation (3) can model skewed and heavy-tailed data; see Figure 3(a). For the sake of brevity let $\boldsymbol{\delta} = (\delta_{\ell}, \delta_{r})$.
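For concreteness, a short R sketch that draws from (2) and (3) by direct transformation of Gaussian input (parameter values are illustrative):

    set.seed(1)
    z <- rnorm(1e5)
    delta <- 0.2
    y.h <- z * exp(delta / 2 * z^2)                      # Tukey's h, see (2)
    delta.l <- 0.05; delta.r <- 0.3                      # left/right tail parameters
    y.hh <- ifelse(z <= 0, z * exp(delta.l / 2 * z^2),   # Tukey's hh, see (3)
                   z * exp(delta.r / 2 * z^2))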

However, despite their great flexibility, Tukey's $h$ and $hh$ distributions are not very popular in statistical practice, because expressions for the cdf or pdf have not been available in closed form. Although Morgenthaler and Tukey [26] express the pdf of (2) as $g_Y(y) = f_Z\left(H_{\delta}^{-1}(y)\right) / H_{\delta}'\left(H_{\delta}^{-1}(y)\right)$, where $H_{\delta}(z) := z e^{\frac{\delta}{2} z^2}$, (4) they fall short of making $H_{\delta}^{-1}(\cdot)$ explicit. So far the inverse of (2) or (3) has been considered analytically intractable [4, 27]. Thus parameter inference relied on matching empirical and theoretical quantiles [4, 17, 26] or on the method of moments [28]. Only recently Headrick et al. [28] provided numerical approximations to the inverse. However, numerical approximations can be slow and prohibit analytical derivations. Thus a closed form, analytically tractable pdf, which can be computed efficiently, is essential for a widespread use of Tukey's $h$ (and variants).

In this work I present this long sought closed-form inverse: it is based on the Lambert W function, which has a large body of literature in mathematics and is readily available in standard statistics software. For ease of notation and concision main results are shown for the symmetric case $\delta_{\ell} = \delta_{r} = \delta$; analogous results for $\delta_{\ell} \neq \delta_{r}$ will be stated without details.

2.2. Heavy Tail Lambert W × F_X Random Variables

Tukey's transformation (2) is strongly related to the approach taken by Goerg [16] to introduce skewness in continuous random variables $X \sim F_X$: there, $Y = U e^{\gamma U}$ is a skewed Lambert W × $F_X$ random variable with skew parameter $\gamma \in \mathbb{R}$. Tukey's transformation (2) replaces the linear term $\gamma U$ in the exponent with the quadratic term $\frac{\delta}{2} U^2$.

Adapting the skew Lambert W × $F_X$ input/output idea (see Figure 1), Tukey's $h$ random variables can be generalized to heavy-tailed Lambert W × $F_X$ random variables. (Most concepts and methods from the skew Lambert W × $F_X$ case transfer one-to-one to the heavy tail Lambert W × $F_X$ random variables presented here. Thus for the sake of concision I refer to Goerg [16] for details of the Lambert W framework.)

Definition 1. Let $U$ be a continuous random variable with cdf $F_U(u \mid \boldsymbol{\beta})$, pdf $f_U(u \mid \boldsymbol{\beta})$, and parameter vector $\boldsymbol{\beta}$. Then, $Z = U e^{\frac{\delta}{2} U^2}, \ \delta \in \mathbb{R}$, (5) is a noncentral, nonscaled heavy tail Lambert W × $F_U$ random variable with parameter vector $\theta = (\boldsymbol{\beta}, \delta)$, where $\delta$ is the tail parameter.

Tukey's $h$ distribution results for $U$ being a standard Gaussian, $U \sim \mathcal{N}(0, 1)$.

Definition 2. For a continuous location-scale family random variable $X \sim F_X(x \mid \boldsymbol{\beta})$ define a location-scale heavy-tailed Lambert W × $F_X$ random variable $Y = U e^{\frac{\delta}{2} U^2} \sigma_x + \mu_x, \ \delta \in \mathbb{R}$, (6) with parameter vector $\theta = (\boldsymbol{\beta}, \delta)$, where $U = (X - \mu_x)/\sigma_x$ and $\mu_x$ and $\sigma_x$ are mean and standard deviation of $X$, respectively.

The input $X$ is not necessarily Gaussian (Tukey's $h$) but can be any other location-scale continuous random variable, for example, from a uniform distribution $X \sim \mathcal{U}(a, b)$ (see Figure 4).

Definition 3. Let $X$ be a continuous scale-family random variable, with scale parameter $s$ and standard deviation $\sigma_x$; let $U = X/\sigma_x$. Then, $Y = X e^{\frac{\delta}{2} U^2}, \ \delta \in \mathbb{R}$, (7) is a scaled heavy-tailed Lambert W × $F_X$ random variable with parameter $\theta = (\boldsymbol{\beta}, \delta)$.

Let $\tau := (\mu_x(\boldsymbol{\beta}), \sigma_x(\boldsymbol{\beta}), \delta)$ define transformation (6). (For noncentral, nonscale input (5) set $\tau = (0, 1, \delta)$; for scale-family input (7), $\tau = (0, \sigma_x, \delta)$.) The shape parameter $\delta$ governs the tail behavior of $Y$: for $\delta > 0$ values further away from $\mu_x$ are increasingly emphasized, leading to a heavy-tailed version of $F_X$; for $\delta = 0$, $Y \equiv X$; and for $\delta < 0$ values far away from the mean are mapped back again closer to $\mu_x$. The transformation preserves the support for $\delta \geq 0$: if $X \in (-\infty, \infty)$, also $Y \in (-\infty, \infty)$; if $X \in [0, \infty)$, also $Y \in [0, \infty)$.

Remark 4 (only nonnegative $\delta$). Although $\delta < 0$ gives interesting properties for $Y$, it defines a nonbijective transformation and leads to parameter-dependent support and nonunique input. Thus for the remainder of this work assume $\delta \geq 0$, unless stated otherwise.
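A short R sketch of Definition 2 with uniform input, as in Figure 4 (a hypothetical illustration; any continuous location-scale input works the same way):

    set.seed(2)
    delta <- 1/3
    x <- runif(1e5)                     # input X ~ U(0, 1)
    mu.x <- 1/2; sigma.x <- sqrt(1/12)  # mean and sd of U(0, 1)
    u <- (x - mu.x) / sigma.x
    y <- u * exp(delta / 2 * u^2) * sigma.x + mu.x  # heavy tail Lambert W x U(0,1), see (6)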

2.3. Inverse Transformation: “Gaussianize” Heavy-Tailed Data

Transformation (6) is bijective and its inverse can be obtained via the Lambert W function, which is the inverse of $z = u e^{u}$, that is, that function which satisfies $W(z) e^{W(z)} = z$. It has been studied extensively in mathematics, physics, and other areas of science [29–31] and is implemented in the GNU Scientific Library (GSL) [32]. Only recently the Lambert W function received attention in the statistics literature [16, 33–35]. It has many useful properties (see Appendix A in the supplementary material and Corless et al. [29]); in particular, $W(z)$ is bijective for $z \geq 0$.

Lemma 5. The inverse transformation of (6) is $W_{\tau}(y) := W_{\delta}\left(\frac{y - \mu_x}{\sigma_x}\right) \sigma_x + \mu_x = x$, (8) where $W_{\delta}(z) := \operatorname{sgn}(z) \left(\frac{W(\delta z^2)}{\delta}\right)^{1/2}$ (9) and $\operatorname{sgn}(z)$ is the sign of $z$. $W_{\delta}(z)$ is bijective for all $\delta \geq 0$ and all $z \in \mathbb{R}$.

Lemma 5 gives for the first time an analytic, bijective inverse of Tukey's $h$ transformation: $H_{\delta}^{-1}(\cdot)$ of Morgenthaler and Tukey [26] is now analytically available as (9). Bijectivity implies that for any data $\boldsymbol{y}$ and parameter $\tau$, the exact input $\boldsymbol{x}_{\tau} = W_{\tau}(\boldsymbol{y})$ can be obtained.
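In R, (9) is a few lines using the principal branch lambert_W0 from the gsl package (the paper cites GSL for the Lambert W function; the LambertW package ships equivalent functionality). The helper name W.delta below is a sketch, not a library function:

    library(gsl)
    W.delta <- function(z, delta) {   # (9); W_0(z) = z recovers the identity
      if (delta == 0) return(z)
      sign(z) * sqrt(lambert_W0(delta * z^2) / delta)
    }
    # round trip check of bijectivity
    z <- seq(-3, 3, by = 0.1); delta <- 1/3
    y <- z * exp(delta / 2 * z^2)
    max(abs(W.delta(y, delta) - z))   # ~ 0 up to floating point error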

In view of the importance and popularity of Normality, we clearly want to back-transform heavy-tailed data to data from a Gaussian rather than yet another heavy-tailed distribution. Tail behavior of random variables is typically compared by their kurtosis $\gamma_2(X) = \mathbb{E}(X - \mu_x)^4 / \sigma_x^4$, which equals $3$ if $X$ is Gaussian. Hence, here "normalizing $\boldsymbol{y}$" means to not only center and scale the data but also transform it to $\boldsymbol{x}_{\tau}$ with $\gamma_2(\boldsymbol{x}_{\tau}) = 3$ (see Figure 2(c)). While $\gamma_2 = 3$ does not guarantee that $\boldsymbol{x}_{\tau}$ is Gaussian, it is a good baseline for a Gaussian sample. Furthermore, it puts different data not only on the same scale, but also on the same tail.

This data-driven view of the Lambert W framework can also be useful for kernel density estimation (KDE), where multivariate data is often prescaled to unit variance, so the same bandwidth can be used in each dimension [36, 37]. Thus “normalizing” the Lambert Way can also improve KDE for heavy-tailed data (see also [38, 39]).

Remark 6 (generalized transformation). Transformation (2) can be generalized to $Y = Z e^{\frac{\delta}{2} (Z^2)^{\alpha}}$. (10) The inner term $Z^2$ guarantees bijectivity for all $\alpha > 0$. The inverse is $W_{\delta, \alpha}(z) = \operatorname{sgn}(z) \left(\frac{W\left(\alpha \delta (z^2)^{\alpha}\right)}{\alpha \delta}\right)^{1/(2\alpha)}$. (11) For comparison with Tukey's $h$ I consider $\alpha = 1$ only. For $\alpha = 1/2$ transformation (10) is closely related to skewed Lambert W × $F_X$ distributions.
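A numerical check of (11), reusing lambert_W0 loaded above; note that $z^{2\alpha}$ must be computed as $(z^2)^{\alpha}$ to stay real-valued for negative $z$:

    W.delta.alpha <- function(y, delta, alpha)   # inverse (11); sketch helper
      sign(y) * (lambert_W0(alpha * delta * (y^2)^alpha) / (alpha * delta))^(1 / (2 * alpha))
    z <- seq(-3, 3, by = 0.1); delta <- 1/3; alpha <- 0.75
    y <- z * exp(delta / 2 * (z^2)^alpha)        # generalized transformation (10)
    max(abs(W.delta.alpha(y, delta, alpha) - z)) # ~ 0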

2.4. Distribution and Density

For ease of notation let $z := \frac{y - \mu_x}{\sigma_x}$. (12)

Theorem 7. The cdf and pdf of a location-scale heavy tail Lambert W × $F_X$ random variable $Y$ equal $G_Y(y \mid \boldsymbol{\beta}, \delta) = F_X\left(W_{\delta}(z)\, \sigma_x + \mu_x \mid \boldsymbol{\beta}\right)$ (13) and $g_Y(y \mid \boldsymbol{\beta}, \delta) = f_X\left(W_{\delta}(z)\, \sigma_x + \mu_x \mid \boldsymbol{\beta}\right) \cdot \frac{W_{\delta}(z)}{z \left[1 + W(\delta z^2)\right]}$. (14) Clearly, $G_Y(y \mid \boldsymbol{\beta}, 0) = F_X(y \mid \boldsymbol{\beta})$ and $g_Y(y \mid \boldsymbol{\beta}, 0) = f_X(y \mid \boldsymbol{\beta})$, since $W_0(z) = z$ and $W(0) = 0$ for all $z$.
For scale-family or noncentral, nonscale input set $\mu_x = 0$ or $(\mu_x, \sigma_x) = (0, 1)$, respectively.

The explicit formula (14) allows a fast computation and theoretical analysis of the likelihood, which is essential for statistical inference. Detailed properties of (14) are given in Section 4.1.
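For instance, (13) and (14) translate directly into R for Gaussian input (a sketch reusing W.delta from Section 2.3; replace pnorm/dnorm with the respective $F_X$ and $f_X$ for other inputs):

    G.Y <- function(y, mu.x, sigma.x, delta) {   # cdf (13)
      z <- (y - mu.x) / sigma.x
      pnorm(W.delta(z, delta) * sigma.x + mu.x, mean = mu.x, sd = sigma.x)
    }
    g.Y <- function(y, mu.x, sigma.x, delta) {   # pdf (14), for y != mu.x
      z <- (y - mu.x) / sigma.x
      u <- W.delta(z, delta)                     # back-transformed kernel
      dnorm(u * sigma.x + mu.x, mean = mu.x, sd = sigma.x) *
        u / (z * (1 + lambert_W0(delta * z^2)))
    }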

Figure 4 shows (13) and (14) for various $\delta$ with four different input distributions: for $\delta = 0$ the input equals the output (solid black); for larger $\delta$ the tails of $G_Y$ and $g_Y$ get heavier (dashed colored).

2.5. Quantile Function

Quantile fitting has been the standard technique to estimate $\mu_x$, $\sigma_x$, and $\delta$ of Tukey's $h$. In particular, the medians of $Y$ and $X$ are equal. Thus for symmetric, location-scale family input the sample median of $\boldsymbol{y}$ is a robust estimate of $\mu_x$ for any $\delta \geq 0$ (see also Section 5). General quantiles can be computed via [17] $y_{\alpha} = u_{\alpha} e^{\frac{\delta}{2} u_{\alpha}^2} \sigma_x + \mu_x$, (15) where $u_{\alpha}$ are the $\alpha$-quantiles of the input $U$.
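For Gaussian input, (15) becomes a one-liner in R (helper name illustrative):

    q.Y <- function(p, mu.x, sigma.x, delta) {   # quantiles of Y via (15)
      u <- qnorm(p)                              # p-quantiles of U = (X - mu_x)/sigma_x
      u * exp(delta / 2 * u^2) * sigma.x + mu.x
    }
    q.Y(c(0.05, 0.5, 0.95), mu.x = 0, sigma.x = 1, delta = 1/3)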

3. Tukey's h Distribution: Gaussian Input

For Gaussian input the Lambert W × Gaussian distribution equals Tukey's $h$, which has been studied extensively. Dutta and Babbel [40] show that $\mathbb{E}\,Y^n = \frac{n!}{2^{n/2} (n/2)!} (1 - n\delta)^{-(n+1)/2}$ for even $n$ with $n\delta < 1$, $\mathbb{E}\,Y^n = 0$ for odd $n$ with $n\delta < 1$, and the moments do not exist for $n\delta \geq 1$, (16) which, in particular, implies that [28] $\mu_Y = 0$ (17) and $\sigma_Y^2 = (1 - 2\delta)^{-3/2}$ for $\delta < 1/2$. (18) Thus the kurtosis of $Y$ equals (see Figure 5) $\gamma_2(\delta) = 3 \frac{(1 - 2\delta)^3}{(1 - 4\delta)^{5/2}}$ for $\delta < 1/4$. (19) For $\delta = 0$, (18) and (19) reduce to the familiar Gaussian values.

Expanding (19) around $\delta = 0$ yields $\gamma_2(\delta) = 3 + 12\delta + 66\delta^2 + \mathcal{O}(\delta^3)$. (20) Dropping the $\mathcal{O}(\delta^3)$ term and solving (20) gives a rule of thumb estimate $\widehat{\delta}_{\text{Taylor}} = \frac{1}{66}\left(\sqrt{66\,\widehat{\gamma}_2(\boldsymbol{y}) - 162} - 6\right)$, (21) where $\widehat{\gamma}_2(\boldsymbol{y})$ is the sample kurtosis; (21) applies if $\widehat{\gamma}_2(\boldsymbol{y}) > 3$; otherwise, set $\widehat{\delta}_{\text{Taylor}} = 0$.
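In R, (21) is immediate and yields a cheap starting value, for example, for the numerical MLE of Section 4 (helper name illustrative):

    delta.Taylor <- function(y) {   # rule of thumb (21)
      g2 <- mean((y - mean(y))^4) / mean((y - mean(y))^2)^2  # sample kurtosis
      if (g2 > 3) (sqrt(66 * g2 - 162) - 6) / 66 else 0
    }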

Corollary 8. The cdf of Tukey's $h$ equals $G_Y(y \mid \delta) = \Phi\left(W_{\delta}(y)\right)$, (22) where $\Phi$ is the cdf of a standard Normal. The pdf equals (for $\delta > 0$) $g_Y(y \mid \delta) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} \frac{W(\delta y^2)}{\delta}\right) \frac{W_{\delta}(y)}{y \left[1 + W(\delta y^2)\right]}$. (23)

Proof. Take $X \sim \mathcal{N}(0, 1)$, that is, $\mu_x = 0$ and $\sigma_x = 1$, in Theorem 7.

Section 4.1 studies functional properties of (23) in more detail.

3.1. Tukey's h versus Student's t

Student's $t$-distribution with $\nu$ degrees of freedom is often used to model heavy-tailed data [41, 42], as its tail index equals $\nu$. Thus the $n$th moment of a Student's $t$ random variable $T$ exists only if $n < \nu$. In particular, $\mathbb{E}\,T = 0$ for $\nu > 1$, $\operatorname{Var}(T) = \frac{\nu}{\nu - 2}$ for $\nu > 2$, (24) and kurtosis $\gamma_2(\nu) = 3 + \frac{6}{\nu - 4}$ for $\nu > 4$. (25)

Comparing (24) and (25) with (18) and (19) shows a natural association between $1/\nu$ and $\delta$ and a close similarity between the first four moments of Student's $t$ and Tukey's $h$ (Figure 5). By continuity and monotonicity, the first four moments of a location-scale $t$-distribution can always be exactly matched by a corresponding location-scale Lambert W × Gaussian. Thus if Student's $t$ is used to model heavy tails, and not as the true distribution of a test statistic, it might be worthwhile to also fit heavy tail Lambert W × Gaussian distributions for an equally valuable "second opinion." For example, a parallel analysis on S&P 500 log-returns in Section 6.2 leads to divergent inference regarding the existence of fourth moments.
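The kurtosis-matching association between $\nu$ and $\delta$ can be computed numerically; a sketch (helper name hypothetical), equating (19) and (25):

    delta.from.nu <- function(nu) {   # delta with the same kurtosis as a t with nu df
      stopifnot(nu > 4)
      kurt.t <- 3 + 6 / (nu - 4)                                    # (25)
      kurt.h <- function(d) 3 * (1 - 2 * d)^3 / (1 - 4 * d)^(5/2)   # (19)
      uniroot(function(d) kurt.h(d) - kurt.t, c(0, 1/4 - 1e-9))$root
    }
    delta.from.nu(6)   # delta matching the kurtosis of a t with 6 df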

4. Parameter Estimation

Due to the lack of a closed form pdf of $Y$, the parameter vector $\theta = (\boldsymbol{\beta}, \delta)$ has typically been estimated by matching quantiles or by a method of moments estimator [4, 26, 28]. These methods can now be replaced by the fast and usually efficient maximum likelihood estimator (MLE). Rayner and MacGillivray [43] introduce a numerical MLE procedure based on quantile functions, but they conclude that "sample sizes significantly larger than 100 should be used to obtain reliable estimates." Simulations in Section 5 show that the MLE using the closed form Lambert W × $F_X$ distribution converges quickly and is accurate even for small sample sizes.

4.1. Maximum Likelihood Estimation (MLE)

For an i.i.d. sample $\boldsymbol{y} = (y_1, \ldots, y_n)$ from $Y \sim g_Y(y \mid \boldsymbol{\beta}, \delta)$ the log-likelihood function equals $\ell(\boldsymbol{\beta}, \delta \mid \boldsymbol{y}) = \sum_{i=1}^{n} \log g_Y(y_i \mid \boldsymbol{\beta}, \delta)$. (26) The MLE is that $\widehat{\theta} = (\widehat{\boldsymbol{\beta}}, \widehat{\delta})$ which maximizes (26); that is, $\widehat{\theta}_{\text{MLE}} = \arg\max_{\theta} \ell(\theta \mid \boldsymbol{y})$. (27) Since $g_Y(y \mid \theta)$ is a function of $f_X(x \mid \boldsymbol{\beta})$, the MLE depends on the specification of the input density. Equation (26) can be decomposed as $\ell(\theta \mid \boldsymbol{y}) = \ell(\boldsymbol{x}_{\tau} \mid \boldsymbol{\beta}) + \mathcal{R}(\tau \mid \boldsymbol{y})$, (28) where $\ell(\boldsymbol{x}_{\tau} \mid \boldsymbol{\beta}) = \sum_{i=1}^{n} \log f_X(x_{\tau, i} \mid \boldsymbol{\beta})$ (29) is the log-likelihood of the back-transformed data $\boldsymbol{x}_{\tau}$ (via (8)) and $\mathcal{R}(\tau \mid \boldsymbol{y}) = \sum_{i=1}^{n} \log R(\mu_x, \sigma_x, \delta \mid y_i)$, (30) where $R(\mu_x, \sigma_x, \delta \mid y_i) = \frac{W_{\delta}(z_i)}{z_i \left[1 + W(\delta z_i^2)\right]}$, $z_i = \frac{y_i - \mu_x}{\sigma_x}$. (31) Note that $\mathcal{R}(\tau \mid \boldsymbol{y})$ only depends on $\mu_x$, $\sigma_x$, and $\delta$ (and $\boldsymbol{y}$), but not necessarily on every coordinate of $\boldsymbol{\beta}$.

Decomposition (28) shows the difference between the exact MLE $\widehat{\theta}_{\text{MLE}}$ based on $\boldsymbol{y}$ and the approximate MLE $\widehat{\boldsymbol{\beta}}$ based on $\boldsymbol{x}_{\tau}$ alone: if we knew $\tau$ beforehand, then we could back-transform $\boldsymbol{y}$ to $\boldsymbol{x}_{\tau}$ and estimate $\boldsymbol{\beta}$ from $\boldsymbol{x}_{\tau}$ (maximize (29) with respect to $\boldsymbol{\beta}$). In practice, however, $\tau$ must also be estimated and this enters the likelihood via the additive term $\mathcal{R}(\tau \mid \boldsymbol{y})$. A little calculation shows that $\mathcal{R}(\tau \mid \boldsymbol{y}) \leq 0$ for any $\boldsymbol{y}$ if $\delta \geq 0$, with equality if and only if $\delta = 0$. Thus $\mathcal{R}(\tau \mid \boldsymbol{y})$ can be interpreted as a penalty for transforming the data. Maximizing (28) faces a trade-off between transforming the data to follow $f_X(x \mid \boldsymbol{\beta})$ (and increasing $\ell(\boldsymbol{x}_{\tau} \mid \boldsymbol{\beta})$) and the penalty of a more extreme transformation (and decreasing $\mathcal{R}(\tau \mid \boldsymbol{y})$); see Figure 6(b).
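The decomposition (28) is straightforward to code for Gaussian input (a sketch reusing W.delta and lambert_W0 from Section 2.3; $y_i \neq \mu_x$ assumed):

    loglik.input <- function(x, mu.x, sigma.x)       # (29)
      sum(dnorm(x, mean = mu.x, sd = sigma.x, log = TRUE))
    penalty <- function(y, mu.x, sigma.x, delta) {   # (30); <= 0 for delta >= 0
      z <- (y - mu.x) / sigma.x
      sum(log(W.delta(z, delta) / (z * (1 + lambert_W0(delta * z^2)))))
    }
    loglik.LambertW <- function(y, mu.x, sigma.x, delta) {   # total likelihood (28)
      x <- W.delta((y - mu.x) / sigma.x, delta) * sigma.x + mu.x
      loglik.input(x, mu.x, sigma.x) + penalty(y, mu.x, sigma.x, delta)
    }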

Figure 6(a) shows a contour plot of $R(0, 1, \delta \mid y)$ as a function of $\delta$ and $y$: it increases (in absolute value) either if $|y|$ gets larger (for fixed $\delta$) or for larger $\delta$ (for fixed $y$). In both cases, increasing $\delta$ moves the transformed data closer to $\mu_x = 0$, which in turn increases its input likelihood. For $\delta = 0$ the penalty disappears since input equals output; for $y = 0$ there is no penalty since $W_{\delta}(0) = 0$ for all $\delta$.

Figure 6(b) shows an i.i.d. Lambert W × Gaussian sample $\boldsymbol{y}$ and the decomposition of the log-likelihood as in (28). Since $\boldsymbol{\beta} = (\mu_x, \sigma_x)$ is known, the likelihood and penalty are only functions of $\delta$. Theorem 9 shows that the convexity of the penalty (decreasing, red) and concavity of the input likelihood (increasing, green) as a function of $\delta$ hold true in general for any data $\boldsymbol{y}$, and their sum (solid, black) has a unique maximum $\widehat{\delta}_{\text{MLE}}$ (blue, dashed vertical line). See Theorem 9 below for details.

The maximization of (28) can be carried out numerically. Here I show existence and uniqueness of $\widehat{\delta}_{\text{MLE}}$ assuming that $\mu_x$ and $\sigma_x$ are known; further theoretical results for the joint MLE $\widehat{\theta}_{\text{MLE}}$ remain for future work. Given the "nice" form of $g_Y$ in (14), which is continuous and twice differentiable (under the assumption that $f_X$ is twice differentiable), the MLE for $\theta = (\boldsymbol{\beta}, \delta)$ should have the usual optimality properties, such as being consistent and efficient [44].

4.1.1. Properties of the MLE for the Heavy Tail Parameter δ

Without loss of generality let $\mu_x = 0$ and $\sigma_x = 1$. In this case the log-likelihood (26) reduces to $\ell(\delta \mid \boldsymbol{y}) = -\frac{n}{2} \log 2\pi - \frac{1}{2} \sum_{i=1}^{n} \left[W_{\delta}(y_i)\right]^2 + \sum_{i=1}^{n} \log \frac{W_{\delta}(y_i)}{y_i \left[1 + W(\delta y_i^2)\right]}$ (32) and $\widehat{\delta}_{\text{MLE}} = \arg\max_{\delta \geq 0} \ell(\delta \mid \boldsymbol{y})$. (33)

Theorem 9 (see [14]). Let $Y$ have a Lambert W × Gaussian distribution, where $\mu_x = 0$ and $\sigma_x = 1$ are assumed to be known and fixed. Also consider only $\delta \geq 0$. (While for some samples $\boldsymbol{y}$ the MLE also exists for $\delta < 0$, it cannot be guaranteed for all $\boldsymbol{y}$: if $\delta < 0$ and $y \neq 0$, then $W_{\delta}(y)$ is either not unique in $\mathbb{R}$ (principal and nonprincipal branch) or does not have a real-valued solution at all if $\delta y^2 < -e^{-1}$.)
(a) If $\frac{1}{n} \sum_{i=1}^{n} y_i^2 \leq 1$, (34) then $\widehat{\delta}_{\text{MLE}} = 0$.
If (34) does not hold, then
(b) $\widehat{\delta}_{\text{MLE}}$ exists and is a positive solution to the score equation $\frac{\partial}{\partial \delta} \ell(\delta \mid \boldsymbol{y}) = 0$. (35)
(c) There is only one such positive $\delta$ satisfying (35); that is, $\widehat{\delta}_{\text{MLE}}$ is unique.

Proof. See Appendix  B in the supplementary material.

Condition (34) says that $\widehat{\delta}_{\text{MLE}} > 0$ only if the data is heavy-tailed enough: the second sample moment must exceed that of the standard Gaussian input. Points (b) and (c) guarantee that there is no ambiguity in the heavy tail estimate. This is an advantage over Student's $t$-distribution, for example, which has numerical problems and local maxima for unknown (and small) $\nu$ [45, 46]. On the contrary, $\widehat{\delta}_{\text{MLE}}$ is always a global maximum.

Given the heavy tails in $\boldsymbol{y}$ one might expect convergence issues for larger $\delta$. However, $W_{\delta}(Y) \sim \mathcal{N}(0, 1)$ for the true $\delta$, and $W_{\widehat{\delta}}(\boldsymbol{y})$ is close to a standard Gaussian sample if $\widehat{\delta} \approx \delta$. Since the log-likelihood and its gradient depend on $\delta$ and $\boldsymbol{y}$ only via $W_{\delta}(\boldsymbol{y})$, the performance of the MLE should thus not get worse for large $\delta$ as long as the initial estimate is close enough to the truth. Simulations in Section 5 support this conjecture, even for $\delta = 1.5$.
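A minimal sketch of the $\delta$-only MLE (33), reusing loglik.LambertW from above: condition (34) is checked first, then (32) is maximized numerically (the search interval is an assumption of this sketch):

    delta.MLE <- function(y) {
      if (mean(y^2) <= 1) return(0)   # condition (34): not heavy-tailed enough
      optimize(function(d) loglik.LambertW(y, 0, 1, d),
               interval = c(1e-8, 10), maximum = TRUE)$maximum
    }
    set.seed(3)
    z <- rnorm(1000)
    delta.MLE(z * exp(0.25 * z^2))    # true delta = 0.5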

4.2. Iterative Generalized Method of Moments (IGMM)

A disadvantage of the MLE is the mandatory a priori specification of the input distribution. Especially for heavy-tailed data the eye is a bad judge to choose a particular parametric $f_X(x \mid \boldsymbol{\beta})$. It would be useful to directly estimate $\tau$ without the intermediate step of estimating $\theta$ first.

Goerg [16] presented an estimator for $\tau$ based on an iterative generalized method of moments (IGMM). The idea of IGMM is to find a $\tau$ such that the back-transformed data $\boldsymbol{x}_{\tau}$ has desired properties, for example, being symmetric or having kurtosis $3$. An estimator for $\mu_x$, $\sigma_x$, and $\delta$ can be constructed entirely analogously to the IGMM estimator for the skewed Lambert W × $F_X$ case. See the Supplementary Material, Appendix C, for details.

IGMM requires less specific knowledge about the input distribution and is usually also faster than the MLE. Once $\widehat{\tau}_{\text{IGMM}}$ has been obtained, the back-transformed $\boldsymbol{x}_{\widehat{\tau}}$ can be used to check if $X$ has characteristics of a known parametric distribution $F_X(x \mid \boldsymbol{\beta})$. It must be noted though that testing $\boldsymbol{x}_{\widehat{\tau}}$ for a particular distribution is too optimistic, as $\boldsymbol{x}_{\widehat{\tau}}$ has "nicer" properties regarding $F_X$ than the true $\boldsymbol{x}$ would have. However, estimating the transformation requires only three parameters and, for a large enough sample, losing three degrees of freedom should not matter in practice.
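A simplified IGMM sketch for symmetric input (the full algorithm is in Appendix C of the Supplementary Material; this toy version alternates between a kurtosis-matching choice of $\delta$ and moment updates of $\mu_x$ and $\sigma_x$, and assumes a leptokurtic sample so the root search is well-defined):

    kurtosis <- function(x) mean((x - mean(x))^4) / mean((x - mean(x))^2)^2
    IGMM <- function(y, n.iter = 20) {
      mu <- median(y); sigma <- mad(y)       # robust starting values
      for (k in 1:n.iter) {
        f <- function(d) kurtosis(W.delta((y - mu) / sigma, d)) - 3
        delta <- if (f(0) <= 0) 0 else uniroot(f, c(0, 5))$root
        x <- W.delta((y - mu) / sigma, delta) * sigma + mu
        mu <- mean(x); sigma <- sd(x)        # re-estimate input moments
      }
      c(mu.x = mu, sigma.x = sigma, delta = delta)
    }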

5. Simulations

This section explores finite sample properties of estimators for $\theta = (\mu_x, \sigma_x, \delta)$ under Gaussian input $X \sim \mathcal{N}(\mu_x, \sigma_x^2)$. In particular, it compares the Gaussian MLE (estimation of $\mu_Y$ and $\sigma_Y$ only), IGMM, and the Lambert W × Gaussian MLE and, as a heavy tail competitor for the location, the median. (For IGMM, optimization was restricted to $\delta \geq 0$.) All results below are based on a large number of replications of each setting.

5.1. Estimating δ Only

Here I show finite sample properties of $\widehat{\delta}_{\text{MLE}}$ for $Y = U e^{\frac{\delta}{2} U^2}$, where $U \sim \mathcal{N}(0, 1)$ and $\mu_x = 0$ and $\sigma_x = 1$ are known and fixed. Theorem 9 shows that $\widehat{\delta}_{\text{MLE}}$ is unique: either at the boundary $\delta = 0$ or at the globally optimal solution to (35). Results in Table 1 were obtained by numerical optimization restricted to $\delta \geq 0$ using the nlm function in R.

Table 1 suggests that the MLE is asymptotically unbiased for every $\delta$ and converges quickly to its asymptotic variance, which is increasing with $\delta$. Assuming $\mu_x$ and $\sigma_x$ to be known is unrealistic and thus these finite sample properties are only an indication of the behavior of the joint MLE $\widehat{\theta}_{\text{MLE}}$. Nevertheless they are very remarkable for extremely heavy-tailed data ($\delta > 1$), where classic average-based methods typically break down. One reason lies in the particular form of the likelihood (32) and its gradient (35) (Theorem 9): although both depend on $\boldsymbol{y}$, they only do so through $W_{\delta}(\boldsymbol{y})$. Hence, as long as $\widehat{\delta}$ is sufficiently close to the true $\delta$, (32) and (35) are functions of almost Gaussian random variables and standard asymptotic results should still apply.

5.2. Estimating All Parameters Jointly

Here we consider the realistic scenario where $\mu_x$ and $\sigma_x$ are also unknown. We consider various sample sizes ($n = 50$, $100$, and $1000$) and different degrees of heavy tails, $\delta \in \{0, 1/3, 1, 1.5\}$, each one representing a particularly interesting situation: (i) Gaussian data (does additional, superfluous estimation of $\delta$ affect other estimates?); (ii) fourth moments do not exist; (iii) nonexisting mean; and (iv) extremely heavy-tailed data (can we get useful estimates at all?).

The convergence tolerance for IGMM was set to the same small threshold in all settings. Table 2 summarizes the simulation.

The Gaussian MLE estimates $\sigma_Y$ directly, while IGMM and the Lambert W × Gaussian MLE estimate $\sigma_x$ and $\delta$, which implicitly give $\sigma_Y$ through (18) if $\delta < 1/2$. For a fair comparison each subtable also includes a column for $\widehat{\sigma}_Y = \widehat{\sigma}_x (1 - 2\widehat{\delta})^{-3/4}$. Some of these entries contain "$\infty$", even for $\delta < 1/2$; this occurs if at least one $\widehat{\delta}_i \geq 1/2$. For any $\delta < 1$, $\mu_x = \mu_Y$; thus $\widehat{\mu}_x$ and $\widehat{\mu}_Y$ can be compared directly. For $\delta \geq 1$, the mean does not exist; each subtable for these $\delta$ interprets $\widehat{\mu}_Y$ as an estimate of the median.

Gaussian Data ($\delta = 0$). This setting checks if imposing the Lambert W framework, even though it is superfluous, causes a quality loss in the estimation of $\mu_Y = \mu_x$ or $\sigma_Y = \sigma_x$. Furthermore, critical values for testing $H_0\colon \delta = 0$ (Gaussian) versus $H_1\colon \delta > 0$ can be obtained. As in the $\delta$-only case above, Table 2(a) suggests that estimators are asymptotically unbiased and quickly tend to a large-sample variance. Additional estimation of $\delta$ does not affect the efficiency of $\widehat{\mu}_x$ compared to estimating solely $\mu$. Estimating $\sigma_Y$ directly by Gaussian MLE does not give better results than the Lambert W × Gaussian MLE.

No Fourth Moment ($\delta = 1/3$). Here the variance of $Y$ exists, but fourth moments do not. This results in an increasing empirical standard deviation of $\widehat{\sigma}_Y$ as $n$ grows. In contrast, estimates for $\sigma_x$ are not drifting off. In the presence of these large heavy tails the median is much less variable than the Gaussian MLE and IGMM. Yet the Lambert W × Gaussian MLE for $\mu_x$ even outperforms the median.

Nonexisting Mean ($\delta = 1$). Both sample moments (mean and standard deviation) diverge, and their standard errors also grow quickly. The median still provides a very good estimate for the location but is again inferior to both Lambert W estimators, which are closer to the true values and appear to converge to an asymptotic variance at the usual $\sqrt{n}$ rate.

Extreme Heavy Tails ($\delta = 1.5$). As in Section 5.1, IGMM and the Lambert W MLE continue to provide excellent estimates even though the data is extremely heavy-tailed. Moreover, the Lambert W MLE also has the smallest empirical standard deviation overall; in particular, the Lambert W MLE for $\mu_x$ has a considerably lower standard deviation than the median.

The last column shows that for some $\delta$ a small fraction of the simulations generated invalid likelihood values (NA or $-\infty$). Here the search for the optimal $\widehat{\tau}$ led into regions with a numerical overflow in the evaluation of the transformation. For a comparable summary, these few cases were omitted and new simulations added until the full number of finite estimates was reached. Since this happened only in a small fraction of the cases, and such extremely heavy-tailed data is rarely encountered in practice, this numerical issue is not a limitation in statistical practice.

5.3. Discussion of the Simulations

IGMM performs well independently of the magnitude of $\delta$. As expected, the Lambert W MLE for $\theta$ has the best properties: it can recover the truth for all $\delta$, and for $\delta = 0$ it performs as well as the classic sample mean and standard deviation. For small $\delta$ it has the same empirical standard deviation as the Gaussian MLE, but a lower one than the median for large $\delta$.

Hence the only advantage of estimating $\mu_Y$ and $\sigma_Y$ by the sample moments of $\boldsymbol{y}$ is speed; otherwise the Lambert W × Gaussian MLE is at least as good as the Gaussian MLE and clearly outperforms it in the presence of heavy tails.

6. Applications

Tukey's $h$ distribution has already proven useful to model heavy-tailed data, but parametric inference was limited to quantile fitting or method of moments estimation [4, 27, 28]. Theorem 7 allows us to estimate $\theta$ by maximum likelihood.

This section applies the presented methodology to simulated as well as real world data: (i) Section 6.1 demonstrates Gaussianizing on the Cauchy sample from the Introduction, and (ii) Section 6.2 shows that heavy tail Lambert W × Gaussian distributions provide an excellent fit to the daily S&P 500 log-return series.

6.1. Estimating Location of a Cauchy with the Sample Mean

It is well known that the sample mean $\bar{y}$ is a poor estimate of the location parameter $c$ of a Cauchy distribution, since the sampling distribution of $\bar{y}$ is again a Cauchy (see [47] for a recent overview); in particular, its variance does not go to $0$ as $n \to \infty$.

Heavy-tailed Lambert W × Gaussian distributions have similar properties to a Cauchy for $\delta \approx 1$. The mean $\mu_x$ of the input $X$ equals the location $c$ of $Y$ due to symmetry around $\mu_x$ and $c$, respectively. Thus we can estimate $\tau$ from the Cauchy sample $\boldsymbol{y}$, back-transform $\boldsymbol{y}$ to $\boldsymbol{x}_{\widehat{\tau}} = W_{\widehat{\tau}}(\boldsymbol{y})$, estimate $\mu_x$ from $\boldsymbol{x}_{\widehat{\tau}}$, and thus obtain an estimate of $c$ by $\widehat{c} = \widehat{\mu}_x$.

The random sample $\boldsymbol{y} \sim \text{Cauchy}(0, 1)$ in Figure 2(a) has heavy tails with two extreme (positive) observations. Both a Cauchy ML fit and a heavy tail Lambert W × Gaussian ML fit (standard errors in parentheses) correctly fail to reject a zero location. Table 3(a) shows summary statistics on both samples. Since the Cauchy distribution does not have a well-defined mean, $\bar{y}$ is not meaningful. However, $\boldsymbol{x}_{\widehat{\tau}}$ is approximately Gaussian and we can use the sample average for inference: a $t$-test correctly fails to reject a zero location. The transformed $\boldsymbol{x}_{\widehat{\tau}}$ features additional Gaussian characteristics (symmetric, no excess kurtosis), and even the null hypothesis of Normality cannot be rejected. Note, however, that Normality of the transformed data is only an empirical approximation; the random variable $W_{\tau}(Y)$, where $Y$ is Cauchy, does not have a Normal distribution.

Figure 2(d) shows the cumulative sample average for the original sample and its Gaussianized version. For a fair comparison $\tau$ was reestimated cumulatively for each $n$ and then used to compute $\bar{x}_n$. The transformation works extremely well: the most extreme data point is highly influential for $\bar{y}_n$ but has no relevant effect on $\bar{x}_n$. Even for small $n$ it is already clear that the location of the underlying Cauchy distribution is approximately zero.

Although this is a simulated example, it demonstrates that removing (strong) heavy tails from data works well and provides "nice" data that can subsequently be analyzed with more refined Gaussian methods.
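The whole Section 6.1 workflow fits in a few lines of R. The sketch below runs on a fresh simulated sample, so the numbers will differ from Table 3(a); it reuses the IGMM and W.delta sketches from earlier sections (the LambertW package wraps these steps in its Gaussianize function):

    set.seed(4)
    y <- rcauchy(100)
    tau <- IGMM(y)                               # tau = (mu.x, sigma.x, delta)
    x <- W.delta((y - tau["mu.x"]) / tau["sigma.x"],
                 tau["delta"]) * tau["sigma.x"] + tau["mu.x"]
    t.test(x)         # location inference, now based on ~Gaussian data
    shapiro.test(x)   # Normality typically no longer rejected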

6.2. Heavy Tails in Finance: S&P 500 Case Study

A lot of financial data displays negative skewness and excess kurtosis. Since financial data in general is not i.i.d., it is often modeled with a (skew) Student's $t$-distribution underlying a (generalized) autoregressive conditional heteroskedasticity (GARCH) [2, 48] or a stochastic volatility (SV) [49, 50] model. Using the Lambert W approach we can build upon the knowledge and implications of Normality (and avoid deriving properties of a GARCH or SV model with heavy-tailed innovations) and simply "Gaussianize" the returns before fitting more complex GARCH or SV models.

Remark 10. Time series models with Lambert W × Gaussian white noise are far beyond the scope of this work but can be a direction of future research. Here I only consider the unconditional distribution.

Figure 7(a) shows the daily S&P 500 log-returns $\boldsymbol{y}$ (R package MASS, dataset SP500). Table 3(b) confirms the heavy tails (sample kurtosis far above $3$) but also indicates negative skewness. As the sample skewness is very sensitive to outliers, we fit a distribution which allows skewness and test for symmetry. In case of the double-tail Lambert W × Gaussian this means testing $H_0\colon \delta_{\ell} = \delta_{r}$ versus $H_1\colon \delta_{\ell} \neq \delta_{r}$. Using the likelihood expression in (28), we can use a likelihood ratio test with one degree of freedom ($3$ versus $4$ parameters). The log-likelihood of the double-tail fit (Table 4(a)) decomposes into input log-likelihood plus penalty, and likewise for the symmetric fit. Here the symmetric fit gives a transformed sample that is more Gaussian, but it pays a greater penalty for transforming the data. Comparing twice the difference of the two log-likelihoods to a $\chi_1^2$ distribution gives a nonsignificant $p$ value. For comparison, a skew-$t$ fit [51], with location $c$, scale $s$, shape $\alpha$, and $\nu$ degrees of freedom, also yields a nonsignificant $\widehat{\alpha}$ (function st.mle in the R package sn; Table 4(b)). Thus both fits cannot reject symmetry.
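A sketch of this likelihood ratio test, coding the double-tail log-likelihood directly from (3) and (28); the fits are illustrative and will only approximate the values behind Table 4:

    neg.ll <- function(par, y, symmetric = TRUE) {  # par = (mu, log(sigma), dl[, dr])
      mu <- par[1]; sigma <- exp(par[2])
      dl <- par[3]; dr <- if (symmetric) par[3] else par[4]
      if (min(dl, dr) <= 0) return(Inf)             # restrict to positive tails
      z <- (y - mu) / sigma
      d <- ifelse(z < 0, dl, dr)                    # branchwise tail parameter
      u <- sign(z) * sqrt(lambert_W0(d * z^2) / d)  # branchwise inverse, cf. (9)
      -sum(dnorm(u * sigma + mu, mu, sigma, log = TRUE) +
           log(u / (z * (1 + lambert_W0(d * z^2)))))
    }
    data(SP500, package = "MASS")
    f1 <- optim(c(0, log(sd(SP500)), 0.1), neg.ll, y = SP500)
    f2 <- optim(c(0, log(sd(SP500)), 0.1, 0.1), neg.ll, y = SP500, symmetric = FALSE)
    pchisq(2 * (f1$value - f2$value), df = 1, lower.tail = FALSE)  # LRT p value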

Assume we have to make a decision if we should trade a certificate replicating the S&P 500. Since we can either buy or sell, it is not important if the average return is positive or negative, as long as it is significantly different from zero.

6.2.1. Gaussian Fit to Returns

Estimating $(\mu_Y, \sigma_Y)$ by Gaussian MLE, and thus ignoring the heavy tails, $H_0\colon \mu_Y = 0$ cannot be rejected at the $5\%$ level (Table 4(e)). Ignoring heavy tails we would thus decide to not trade a replicating certificate at all.

6.2.2. Heavy Tail Fit to Returns

Both a heavy tail Lambert W × Gaussian fit (Table 4(c)) and a Student $t$ fit (Table 4(d)) reject the zero mean null hypothesis ($p$ values in Tables 4(c) and 4(d), resp.).

Location and scale estimates are almost identical, but the tail estimates lead to different conclusions: for the Student $t$ fit only moments up to order $\widehat{\nu} < 4$ exist, while in the Lambert W × Gaussian case moments up to order $1/\widehat{\delta} > 4$ exist. This is especially noteworthy as many theoretical results in the (financial) time series literature rely on finite fourth moments [52, 53]; consequently many empirical studies test this assumption [7, 54]. Here Student's $t$ and a Lambert W × Gaussian fit give different conclusions. Since previous empirical studies often use Student's $t$ as a baseline [41], it might be worthwhile to reexamine their findings in light of heavy tail Lambert W × Gaussian distributions.

6.2.3. “Gaussianizing” Financial Returns

The back-transformed $\boldsymbol{x}_{\widehat{\tau}}$ is indistinguishable from a Gaussian sample (Figure 7(b)) and thus demonstrates that a Lambert W × Gaussian distribution is indeed appropriate for $\boldsymbol{x}_{\widehat{\tau}}$. Not even one test can reject Normality: all four $p$ values are clearly nonsignificant (Anderson-Darling, Cramér-von Mises, Shapiro-Francia, Shapiro-Wilk; see Thode [55]). Table 3(b) confirms that the Lambert W "Gaussianization" was successful: skewness and kurtosis of $\boldsymbol{x}_{\widehat{\tau}}$ are within the typical variation of a Gaussian sample of this size. Thus the heavy tail Lambert W × Gaussian distribution is an adequate (unconditional) model for the S&P 500 log-returns $\boldsymbol{y}$. For trading, this means that the expected return is significantly larger than zero and thus replicating certificates should be bought.
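These Normality checks can be reproduced approximately with the sketch below (reusing IGMM and W.delta from earlier sections; ad.test, cvm.test, and sf.test are in the nortest package, shapiro.test is in base R):

    library(nortest)
    data(SP500, package = "MASS")
    tau <- IGMM(SP500)
    x <- W.delta((SP500 - tau["mu.x"]) / tau["sigma.x"],
                 tau["delta"]) * tau["sigma.x"] + tau["mu.x"]
    c(AD = ad.test(x)$p.value, CvM = cvm.test(x)$p.value,
      SF = sf.test(x)$p.value, SW = shapiro.test(x)$p.value)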

6.2.4. Gaussian MLE for Gaussianized Data

For $X \sim \mathcal{N}(\mu_x, \sigma_x^2)$, also $W_{\tau}(Y) \sim \mathcal{N}(\mu_x, \sigma_x^2)$. We can therefore replace testing $H_0\colon \mu_Y = 0$ versus $H_1\colon \mu_Y \neq 0$ for a non-Gaussian $Y$ with the very well-understood hypothesis test $H_0\colon \mu_x = 0$ versus $H_1\colon \mu_x \neq 0$ for the Gaussian $\boldsymbol{x}_{\widehat{\tau}}$. In particular, standard errors based on $\widehat{\sigma}_x / \sqrt{n}$, and thus $t$ and $p$ values, should be closer to the "truth" (Tables 4(c) and 4(d)) than those of a Gaussian MLE on the non-Gaussian $\boldsymbol{y}$ (Table 4(e)). Table 4(f) shows that standard errors for $\widehat{\mu}_x$ are even a bit too small compared to the heavy-tailed versions. Since the "Gaussianizing" transformation was estimated, treating $\boldsymbol{x}_{\widehat{\tau}}$ as if it were original data is too optimistic regarding its Normality (recall the penalty (30) in the total likelihood (28)).

This example confirms that if a model and its theoretical properties are based on Normality, but the observed data is heavy-tailed, then Gaussianizing the data first gives more reliable inference than applying Gaussian methods to the original, heavy-tailed data (Figure 1). Clearly, a joint estimation of the model parameters based on Lambert W × Gaussian random variables (or any other heavy-tailed distribution) would be optimal. However, theoretical properties and estimation techniques may not be available or well understood. The Lambert Way to Gaussianize data is thus a pragmatic method to improve statistical inference on heavy-tailed data, while preserving the applicability and interpretation of Gaussian models.

7. Discussion and Outlook

In this work I use the Lambert W function to model and remove heavy tails from continuous random variables using a data-transformation approach. For Gaussian random variables this not only contributes to existing work on Tukey's $h$ distribution but also gives convincing empirical results: unimodal data with heavy tails can be transformed to Gaussian data. Properties of a Gaussian model on the back-transformed data mimic the features of the "true" heavy-tailed model very closely.

Since Normality is the single most typical and often required assumption in many areas of statistics, machine learning, and signal processing, future research can take many directions. From a theoretical perspective, properties of Lambert W × $F_X$ distributions, viewed as generalizations of already well-known distributions, can be studied. This area will profit from the existing literature on the Lambert W function, which has been discovered only recently by the statistics community. Empirical work can focus on transforming the data and comparing approximate Gaussian with joint heavy tail analyses. The comparisons in this work showed that approximate inference for Gaussianized data is comparable with the direct heavy tail modeling and so provides a simple tool to improve inference for heavy-tailed data in statistical practice.

I also provide the R package LambertW, publicly available on CRAN, to facilitate the use of heavy tail Lambert W × $F_X$ distributions in practice.

Disclosure

The first version of this article, entitled "Closed-form cdf and pdf of Tukey's h-distribution, the heavy-tail Lambert W approach, and how to bijectively 'Gaussianize' heavy-tailed data," has appeared online [56]. This research was performed while the author was at Carnegie Mellon University. Currently, Georg M. Goerg works at Google Inc., New York, NY 10011, USA.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Supplementary Materials

The supplementary material contains proofs of the main results, details on the simulation setup and algorithms, and also lists some results for the two tails case.
