Abstract

We consider a problem in parametric estimation: given samples from an unknown distribution, we want to estimate which distribution, from a given one-parameter family, produced the data. Following Schulman and Vazirani (2005), we evaluate an estimator in terms of the chance of being within a specified tolerance of the correct answer, in the worst case. We provide optimal estimators for several families of distributions on $\mathbb{R}$. We prove that for distributions on a compact space, there is always an optimal estimator that is translation invariant, and we conjecture that this conclusion also holds for any distribution on $\mathbb{R}$. By contrast, we give an example showing that it does not hold for a certain distribution on an infinite tree.

1. Introduction

Estimating probability distribution functions is a central problem in statistics. Specifically, beginning with an unknown probability distribution on an underlying space $X$, one wants to be able to do two things: first, given some empirical data sampled from the unknown probability distribution, estimate which one of a presumed set of possible distributions produced the data; and second, obtain bounds on how good this estimate is. For example, the maximum likelihood estimator selects the distribution that maximizes the probability (among those under consideration) of producing the observed data. Depending on what properties of the estimator one is trying to evaluate, this may or may not be optimal. An extensive literature, dating back to the early 20th century, addresses problems of this sort; see for example [16].

In this paper, we consider one such problem. We presume that samples are coming from an unknown "translate" of a fixed known distribution. The challenge is to guess the translation parameter. More precisely, we are given a distribution $\mu$ on a space $X$, along with an action of a group $G$ on $X$, which defines a set of translated distributions as follows:
$$\mu_\theta(S) = \mu(\theta^{-1} S) \quad \text{for } \theta \in G. \tag{1.1}$$

Thus, in this context an estimator is a (measurable) function $f\colon X^n \to G$; the input is the list of $n$ samples, and the output is the estimate of $\theta$, the translation parameter. For the majority of the paper, we will study the case of $G = \mathbb{R}$ acting by translations (changes in location) on $X = \mathbb{R}$, and the group action will be written additively, as seen beginning from Section 2.

We are interested in finding good estimators; thus we need a way of measuring an estimator's quality. A common way to do this is to measure the mean squared error, in which case an optimal estimator minimizes this error. Various results are known in this case; for instance, the maximum likelihood estimator (which agrees with the sample mean estimator) minimizes the mean squared error if $\mu$ is a Gaussian distribution on $\mathbb{R}$.

In this paper, we investigate a different and natural measure of quality, whereby we consider an estimator to succeed or fail according to whether or not its estimate is within a certain threshold of the correct answer. We then define the quality of the estimator to be the chance of success in the worst case. This notion was introduced in [7] to analyze certain approximation algorithms in computer science. Precisely, the $\delta$-quality of $f$ is defined as
$$Q_\delta(f) = \inf_{\theta \in G} Q^\theta_\delta(f), \qquad Q^\theta_\delta(f) = \mu^n_\theta\{x \in X^n : d(f(x), \theta) < \delta\}, \tag{1.3}$$
where $d$ is a metric on $G$ and $\mu^n_\theta$ is the product measure on $X^n$. (In the case of perverse measures $\mu$, we must define this probability as the sup of the measures of all measurable subsets of the set in question. We will ignore this caveat throughout. Indeed, we primarily focus on absolutely continuous measures (as [8, 9] have done, e.g.) and purely atomic measures.) Note that, depending on context, it is sometimes advantageous to define the quality using a closed interval $d(f(x), \theta) \le \delta$ rather than an open one; for example, in the discrete case we could then interpret $Q_0$ as the probability that $f(x)$ is exactly equal to $\theta$. We write $Q$ when the value of $\delta$ is unambiguous. For fixed $\delta$, an ($n$-sample) estimator $f$ is optimal if $Q_\delta(f) \ge Q_\delta(f')$ for all ($n$-sample) estimators $f'$. Many authors use the term minimax to describe optimal estimators. Note that much of the literature on this subject uses the notion of loss functions and the associated risk; our point of view is equivalent but more convenient for our purposes.
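To make the definition concrete, here is a minimal Monte Carlo sketch (our illustration, not part of the original analysis) that estimates the $\delta$-quality of the sample-mean estimator when $\mu$ is a standard Gaussian on $\mathbb{R}$; the grid of shifts, the sample size, and the tolerance are arbitrary choices. Since the mean estimator is shift invariant, its success probability is the same at every $\theta$, so the infimum over a few shifts already tells the story.

```python
import numpy as np

def delta_quality(estimator, sampler, thetas, n, delta, trials=100_000, seed=0):
    """Monte Carlo estimate of inf_theta P(|f(x) - theta| < delta), where each
    row of x holds n i.i.d. samples from the theta-translate of mu."""
    rng = np.random.default_rng(seed)
    worst = 1.0
    for theta in thetas:
        x = sampler(rng, theta, (trials, n))
        worst = min(worst, np.mean(np.abs(estimator(x) - theta) < delta))
    return worst

# Translates of a standard Gaussian: mu_theta = N(theta, 1).
gaussian = lambda rng, theta, size: rng.normal(theta, 1.0, size)
mean_est = lambda x: x.mean(axis=1)

q = delta_quality(mean_est, gaussian, thetas=[0.0, 3.0, -7.5], n=4, delta=0.5)
print(f"estimated delta-quality: {q:.3f}")   # about 0.683 = P(|N(0, 1/4)| < 1/2)
```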

Motivated initially by analyzing an approximation algorithm for determining the average matching size in a graph, Schulman and Vazirani [7] introduce the stronger notion of a majorizing estimator, which is optimal (by the above definition) simultaneously for all $\delta$. This type of question was previously studied by Pitman [5], who considered several different optimality criteria and, for each one, constructed optimal "shift-invariant" estimators (defined below). Schulman and Vazirani focus on the Gaussian distribution and prove that the mean estimator is the unique majorizing estimator in this case.

In the first part of this paper, we investigate the optimal estimators for several different classes of distributions on $\mathbb{R}$. We conjecture that there is always an optimal estimator that is shift invariant, that is, one satisfying $f(x_1 + t, \dots, x_n + t) = f(x_1, \dots, x_n) + t$ for all $t \in \mathbb{R}$. These estimators are typically easier to analyze than general estimators because the quality is the same everywhere, that is, $Q^\theta_\delta(f) = Q^0_\delta(f)$ for every $\theta$. Conditions under which invariant minimax estimators can be obtained have been studied, for example, in [10-12]. Indeed, some of our existence results follow from the quite general Hunt-Stein theorem [12, Theorem 9.5.5], but we give constructions that are very natural and explicit. We obtain general bounds on the quality of shift-invariant estimators (Section 2) and general estimators (Section 3), and then we apply these bounds to several families of distributions (Section 4). In each case, we are able to construct an optimal estimator that is shift invariant. These examples include the Gaussian and exponential distributions, among others.

These results motivate our study of shift-invariant estimators on other spaces; these are estimators that are equivariant with respect to the induced diagonal action of $G$, on either the left or the right, on $X^n$. That is, a left-invariant estimator satisfies $f(g \cdot x) = g f(x)$, where $g \cdot (x_1, \dots, x_n) = (g x_1, \dots, g x_n)$. Right invariance is defined similarly.

In Section 5, we show that on a compact group $G$, if $f$ is an estimator for $\mu$, then there is always a shift-invariant estimator with quality at least as high as that of $f$. The idea is to construct a shift-invariant estimator as an average of the translates of $f$; this is essentially a simple proof of a special case of the Hunt-Stein theorem. As there is no invariant probability measure on $\mathbb{R}$, the proof does not extend to the real case.

Finally, in the last section, we give an example due to Schulman which shows that (on noncompact spaces) there may be no shift-invariant estimator that is optimal. It continues to be an interesting problem to determine conditions under which one can guarantee the existence of a shift-invariant estimator that is optimal.

2. The Real Case: Shift-Invariant Estimators

Let $G = X = \mathbb{R}$, and consider the action of $\mathbb{R}$ on itself by translations. Because much of this paper is concerned with this context, we spell out once more the parameters of the problem. We assume that $\delta > 0$ is fixed throughout. We are given a probability distribution $\mu$ on $\mathbb{R}$, and we are to guess which translate $\mu_\theta$ produced a given collection of $n$ data points, where $\mu_\theta(S) = \mu(S - \theta)$. An estimator is a function $f\colon \mathbb{R}^n \to \mathbb{R}$, and we want to maximize its quality, which is given by
$$Q_\delta(f) = \inf_{\theta \in \mathbb{R}} \mu^n_\theta\{x : |f(x) - \theta| < \delta\}.$$

First we present some notation. We will write the group action additively and likewise the induced diagonal action of $\mathbb{R}$ on $\mathbb{R}^n$; in other words, if $x = (x_1, \dots, x_n) \in \mathbb{R}^n$ and $t \in \mathbb{R}$, then $x + t$ denotes the point $(x_1 + t, \dots, x_n + t)$. Similarly, if $S \subseteq \mathbb{R}^n$ and $t \in \mathbb{R}$, then $S + t = \{x + t : x \in S\}$. We also use the "interval notation" $[x + a, x + b]$ for the set $\{x + t : a \le t \le b\}$; this is a segment of length $(b - a)\sqrt{n}$ in $\mathbb{R}^n$ if $a$ and $b$ are finite. If $f$ is any real-valued function and $t \in \mathbb{R}$, define $f + t$ by $(f + t)(x) = f(x) + t$. If $b\colon H \to \mathbb{R}$ is a function on the coordinate hyperplane $H = \{x \in \mathbb{R}^n : x_1 = 0\}$, then define $\hat b\colon \mathbb{R}^n \to \mathbb{R}$ by $\hat b(h + t) = b(h) + t$.

We now establish our upper bound on the quality of shift-invariant estimators. Note that a shift-invariant estimator has the property that $f(x + t) = f(x) + t$. Also note that a shift-invariant estimator is determined uniquely by its values on the coordinate hyperplane $H$ and that a shift-invariant estimator exists for any choice of such values on $H$: the estimator $\hat b$ is the shift-invariant extension of $b\colon H \to \mathbb{R}$. In addition, $Q^\theta_\delta(f) = Q^0_\delta(f)$ for $f$ shift invariant, so the quality can be ascertained by setting $\theta = 0$.
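In code, the extension from $H$ is a one-liner: move $x$ to its orbit representative in $H$ by subtracting $x_1$, apply $b$, and add $x_1$ back. The following sketch (ours; the particular $b$ is an arbitrary Borel function chosen for illustration) realizes $\hat b$ and checks shift invariance numerically.

```python
import numpy as np

def shift_invariant_from_H(b):
    """Extend b : H -> R, where H = {x : x[0] == 0}, to the shift-invariant
    estimator b-hat on R^n defined by b-hat(h + t) = b(h) + t."""
    def f(x):
        x = np.asarray(x, dtype=float)
        t = x[0]          # the orbit parameter: x = h + t with h in H
        return b(x - t) + t
    return f

# An arbitrary choice of b for illustration: median of the remaining coordinates.
b = lambda h: float(np.median(h[1:]))
f = shift_invariant_from_H(b)
x = np.array([1.0, 2.0, 3.0])
assert np.isclose(f(x + 5.0), f(x) + 5.0)   # f(x + t) = f(x) + t
```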

Definition 2.1. For fixed $\delta$, let $\mathcal{C}_\delta$ denote the collection of all Borel subsets of $\mathbb{R}^n$ of the form
$$S_b = \{h + t : h \in H,\ |b(h) + t| < \delta\},$$
where $b\colon H \to \mathbb{R}$ is a Borel function. For fixed $\mu$ and $n$, define
$$A_\delta = A_\delta(\mu, n) = \sup_{S \in \mathcal{C}_\delta} \mu^n(S).$$

Theorem 2.2. Let $\mu$ and $n$ be given; then any shift-invariant $n$-sample estimator $f$ satisfies $Q_\delta(f) \le A_\delta$.

Proof. Due to the observation above, it suffices to bound the quality of $f$ at $\theta = 0$. But this quality is just $\mu^n(S)$, where $S = \{x : |f(x)| < \delta\}$. Note that $S \in \mathcal{C}_\delta$ (take $b = f|_H$ in Definition 2.1) and in particular $\mu^n(S) \le A_\delta$. Thus, the quality of $f$ is at most $A_\delta$.

Theorem 2.3. Let $\mu$ and $n$ be given. If the supremum in Definition 2.1 is achieved, then there is a shift-invariant $n$-sample estimator with quality $A_\delta$. In any case, for any $\epsilon > 0$, there is a shift-invariant $n$-sample estimator with quality greater than $A_\delta - \epsilon$.

Proof. For a given $S \in \mathcal{C}_\delta$, let $b$ be the corresponding Borel function (see Definition 2.1). Define the estimator $f$ to be $b$ on $H$ and then extend to all of $\mathbb{R}^n$ to make it shift invariant; that is, set $f = \hat b$. Note that $\{x : |f(x)| < \delta\} = S$, so $Q_\delta(f) = Q^0_\delta(f) = \mu^n(S)$. The theorem now follows from the definition of $A_\delta$.
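For $n = 1$, the set $\mathcal{C}_\delta$ consists of the open intervals of length $2\delta$, so $A_\delta$ is simply the largest $\mu$-mass of such an interval, and the estimator of Theorem 2.3 is $f(x) = x - c$, where $(c - \delta, c + \delta)$ is a best interval. A small sketch (ours) computes $A_\delta$ for the standard Gaussian by a crude grid search; the maximum occurs at $c = 0$.

```python
from math import erf, sqrt

def gauss_mass(a, b):
    """mu((a, b)) for the standard Gaussian."""
    Phi = lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0)))
    return Phi(b) - Phi(a)

delta = 0.5
centers = [k / 100.0 for k in range(-300, 301)]            # crude grid search
best_c = max(centers, key=lambda c: gauss_mass(c - delta, c + delta))
print(best_c, gauss_mass(best_c - delta, best_c + delta))  # c = 0, A_delta ~ 0.383
# The corresponding shift-invariant estimator is f(x) = x - best_c = x.
```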

3. The Real Case: General Estimators

In this section, we obtain a general upper bound on the quality of randomized estimators, still in the case $G = X = \mathbb{R}$. The arguments are similar to those of the previous section.

Again $\delta$ is fixed throughout. A randomized estimator is a function $f\colon \mathbb{R}^n \times \Omega \to \mathbb{R}$, where $(\Omega, P)$ is a probability space of estimators; thus for fixed $\omega \in \Omega$, $f(\cdot, \omega)$ is an estimator. The $\delta$-quality of a randomized estimator is $Q_\delta(f) = \inf_\theta Q^\theta_\delta(f)$, where
$$Q^\theta_\delta(f) = (\mu^n_\theta \times P)\{(x, \omega) : |f(x, \omega) - \theta| < \delta\}.$$

Definition 3.1. For fixed $\delta$, let $\mathcal{D}_\delta$ denote the collection of all Borel subsets $S$ of $\mathbb{R}^n$ whose intersection with each orbit $\{x + t : t \in \mathbb{R}\}$ has one-dimensional Lebesgue measure (in the parameter $t$) at most $2\delta$. For fixed $\mu$ and $n$, define
$$B_\delta = B_\delta(\mu, n) = \sup_{S \in \mathcal{D}_\delta} \mu^n(S).$$

Comparing with Definition 2.1, we observe that $\mathcal{C}_\delta \subseteq \mathcal{D}_\delta$ and hence $A_\delta \le B_\delta$.

Theorem 3.2. Let $\mu$ and $n$ be given. Any $n$-sample randomized estimator $f$ satisfies $Q_\delta(f) \le B_\delta$.

Proof. We will give a complete proof in the case that $\mu$ is defined by a density function $p$ and then indicate the modifications required for the general case. The difference is purely technical; the ideas are the same.
Consider first a nonrandomized estimator $f$. The performance of $f$ at $\theta$ is $Q^\theta_\delta(f) = \mu^n_\theta(S_\theta)$, where $S_\theta$ denotes the set $\{x : |f(x) - \theta| < \delta\}$; we will suppress the subscript when no ambiguity exists. Since $Q_\delta(f)$ is an infimum, the average performance of $f$ at the points $2\delta j$ ($1 \le j \le N$) is at least $Q_\delta(f)$:
$$Q_\delta(f) \le \frac{1}{N} \sum_{j=1}^{N} \mu^n_{2\delta j}(S_{2\delta j}).$$
Note that the sets $S_{2\delta j}$ are pairwise disjoint, since $|f(x) - 2\delta j| < \delta$ can hold for at most one integer $j$.
Now, we use the density function $p$. Recall that $\mu^n_\theta(S) = \int_{S - \theta} q(x)\,dx$, where $q(x) = p(x_1)\cdots p(x_n)$. Define the slabs $W_k \subseteq \mathbb{R}^n$, for $k \in \mathbb{Z}$, by
$$W_k = \{h + t : h \in H,\ 2\delta(k-1) < t \le 2\delta k\},$$
and set $U_j = S_{2\delta j} - 2\delta j$, so that $\mu^n_{2\delta j}(S_{2\delta j}) = \mu^n(U_j)$.
Since the $W_k$ are disjoint and cover $\mathbb{R}^n$, we now have
$$\frac1N \sum_{j=1}^N \mu^n(U_j) = \frac1N \sum_{m \le 0} \sum_{j=1}^N \mu^n(U_j \cap W_{m-j}) + \frac1N \sum_{m=1}^{N} \sum_{j=1}^N \mu^n(U_j \cap W_{m-j}) + \frac1N \sum_{m > N} \sum_{j=1}^N \mu^n(U_j \cap W_{m-j}). \tag{3.7}$$
We will bound the middle term by $B_\delta$ and show that the first and last terms go to zero (independently of $f$) as $N$ gets large. The bound on the middle term is a consequence of the following claim.
Claim 3.3. For any $m \in \mathbb{Z}$,
$$\sum_{j=1}^N \mu^n(U_j \cap W_{m-j}) \le B_\delta.$$
To prove the claim, set $V_j = U_j \cap W_{m-j}$, and set $Y = V_1 \cup \cdots \cup V_N$. Thus, the $V_j$ are disjoint, as they lie in distinct slabs. Now,
$$\sum_{j=1}^N \mu^n(U_j \cap W_{m-j}) = \mu^n(Y) \le B_\delta.$$
The first equality follows from the fact that the $V_j$ are disjoint (recall that $U_j = S_{2\delta j} - 2\delta j$), and the final step follows because the set $Y$ is in $\mathcal{D}_\delta$: on each orbit, the translates $V_j + 2\delta j = S_{2\delta j} \cap W_m$ are disjoint (the $S_{2\delta j}$ are disjoint) and all lie in the single slab $W_m$, so their total $t$-measure is at most $2\delta$; translating back along the orbit preserves these measures and keeps the pieces disjoint, so $Y$ meets each orbit in measure at most $2\delta$. This proves the claim.
Next, we show that the first term approaches zero as $N$ grows. Recall that $U_j = S_{2\delta j} - 2\delta j$, and note that every point of $U_j \cap W_{m-j}$ with $m \le 0$ lies in a slab $W_k$ with $k \le -j$, that is, in the set $\{x : x_1 \le -2\delta j\}$; set $r_j = \mu^n\{x : x_1 \le -2\delta j\}$. The function $q$ is a probability density function, so it is nonnegative and has total integral 1. The Dominated Convergence Theorem then implies that the sequence $(r_j)$ is decreasing to 0. Bounding the first term by $\frac1N \sum_{j=1}^N r_j$, we see that it tends to zero, since the Cesàro averages of a sequence decreasing to zero also tend to zero.
A similar argument shows that the term with $m > N$ goes to 0 as $N$ grows: it is bounded by $\frac1N \sum_{j=1}^N \mu^n\{x : x_1 > 2\delta(N - j)\}$, which is again a Cesàro average of a sequence decreasing to zero. Since (3.7) holds for all $N$, we have $Q_\delta(f) \le B_\delta$ for any estimator $f$.
We have shown that for any $\epsilon > 0$, we can find $N$, depending on $\delta$ and $\mu$ but not on the estimator, such that the average performance of an arbitrary estimator on the points $2\delta j$ ($1 \le j \le N$) is bounded above by $B_\delta + \epsilon$. Now, for a randomized estimator $f$, the quality is bounded above by its average performance on the same points, and that performance can be no better than the best nonrandomized estimator's performance. We conclude that $Q_\delta(f) \le B_\delta + \epsilon$ for every $\epsilon > 0$, and the theorem follows.
The proof is now complete in the case that $\mu$ has a density $p$. In general, the argument requires minor technical adjustments. The first step that requires modification is the use of the density: without it, we let $\nu = \mu^n$ and work with $\nu$ rather than $q$, using the identity $\mu^n_\theta(S) = \nu(S - \theta)$ and the continuity of finite measures in place of the Dominated Convergence Theorem. From here, one defines the sets $U_j$ and $W_k$ accordingly, and the remainder of the argument goes through with corresponding changes.
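Theorem 3.2 is easy to probe numerically. In the sketch below (our illustration), $\mu$ is uniform on $[0, 1]$ with $n = 1$, so $B_\delta = \min(2\delta, 1)$; the first two estimators are shift invariant and achieve the bound exactly, while the third does very well at some shifts and fails badly at others, which is why the infimum over $\theta$ matters.

```python
import numpy as np

def quality_at(est, theta, delta, trials, rng):
    x = rng.uniform(theta, theta + 1.0, trials)   # one sample from Uniform[theta, theta+1]
    return np.mean(np.abs(est(x) - theta) < delta)

rng = np.random.default_rng(1)
delta = 0.1
B = min(2 * delta, 1.0)                           # B_delta for Uniform[0,1], n = 1

estimators = {
    "x - 1/2":   lambda x: x - 0.5,               # shift invariant, achieves B_delta
    "x - delta": lambda x: x - delta,             # shift invariant, also achieves B_delta
    "floor(x)":  np.floor,                        # not shift invariant
}
for name, est in estimators.items():
    worst = min(quality_at(est, th, delta, 200_000, rng) for th in [0.0, 0.25, 2.0])
    print(f"{name:10s} worst-case quality ~ {worst:.3f}   (bound B_delta = {B})")
```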

4. The Real Case: Examples

We have obtained bounds on quality for general estimators and for shift-invariant ones. In this section, we give several situations where the bounds coincide, and therefore the optimal shift-invariant estimators constructed in Section 2 are in fact optimal estimators, as promised by the Hunt-Stein theorem. These examples include many familiar distributions, and they provide evidence for the following conjecture.

Conjecture 4.1. Let $\mu$ be a distribution on $\mathbb{R}$. Then for every $n$ and $\delta$, there is an optimal $n$-sample estimator for $\mu$ that is shift invariant.

4.1. Warmup: Unimodal Densities—One Sample

Our first class of examples generalizes Gaussian distributions and many others. The argument works only with one sample, but we will refine it in Section 4.2. Note that for a symmetric unimodal density the optimal estimator constructed here is the maximum likelihood estimator; in general it guesses the sample minus the center of a best window of length $2\delta$, and that window always contains the mode.

We say that a density function $p$ is unimodal if for all $c > 0$, the superlevel set $\{x : p(x) \ge c\}$ is convex.

Example 4.2. Let $\mu$ be defined by a unimodal density function $p$. Then there is a shift-invariant one-sample estimator that is optimal.

Proof. We first show that $A_\delta = B_\delta$. It follows from the definition of $\mathcal{D}_\delta$ that any set $S \in \mathcal{D}_\delta$ must have Lebesgue measure less than or equal to $2\delta$ (for $n = 1$ there is a single orbit, namely $\mathbb{R}$ itself). Since $p$ is unimodal, $\int_S p$ is maximized by concentrating $S$ around the peak of $p$; thus the best $S$ will be an interval of length $2\delta$ that includes the peak of $p$. But any open interval of length $2\delta$ is contained in $\mathcal{C}_\delta$, and thus $B_\delta \le A_\delta$. Since $A_\delta \le B_\delta$ always, we have $A_\delta = B_\delta$.
Now, recalling that $A_\delta$ and $B_\delta$ are defined as suprema, we observe that the above argument shows that if one is achieved, then so is the other. Therefore, the result follows from Theorems 2.3 and 3.2.
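The proof shows that a best window of length $2\delta$ contains the peak, but for an asymmetric unimodal density it need not be centered there, so the maximum likelihood guess can be strictly suboptimal. The sketch below (ours; the triangular density and all parameters are arbitrary choices) locates the best window center $c^*$ and compares it with the mode.

```python
import numpy as np

# Triangular density on [0,1] with mode m: p(x) = 2x/m on [0,m], 2(1-x)/(1-m) on [m,1].
m, delta = 0.9, 0.3

def F(x):                                        # CDF of the triangular distribution
    x = np.clip(x, 0.0, 1.0)
    return np.where(x <= m, x**2 / m, 1.0 - (1.0 - x)**2 / (1.0 - m))

window = lambda c: F(c + delta) - F(c - delta)   # mass of the window (c - delta, c + delta)

cs = np.linspace(0.0, 1.0, 2001)
c_star = cs[np.argmax(window(cs))]
print(f"best window center c* = {c_star:.3f}, mass = {float(window(c_star)):.4f}")
print(f"window centered at the mode m = {m}: mass = {float(window(m)):.4f}")
# Here c* != m, so the MLE f(x) = x - m is strictly worse than f(x) = x - c*,
# although the best window (c* - delta, c* + delta) does contain the mode.
```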

4.2. A Sufficient Condition

The next class is more restrictive than the preceding one, but with the stronger hypothesis we get a result for arbitrary $n$. Any Gaussian distribution continues to satisfy the hypothesis.

Example 4.3. Let $\mu$ be a distribution defined by a density function of the form $p = e^{g}$ with $g'$ continuous and strictly decreasing. Then for any $n$, there is a shift-invariant $n$-sample estimator that is optimal.

Proof. For any fixed $x \in \mathbb{R}^n$, we define a function $\psi$ by
$$\psi(t) = \frac{d}{dt} \log q(x + t) = \sum_{i=1}^n g'(x_i + t),$$
where as before $q(x) = p(x_1) \cdots p(x_n)$. Since $\psi$ is continuous and $g'$ is strictly decreasing, it is clear that for each $c$, $\psi(t) = c$ for at most one value of $t$. Since $q(x + t) \to 0$ as $t \to \pm\infty$, it follows that for any $x$, $q(x + t)$ is a unimodal function of $t$.
Now the argument is similar to Example 4.2. We will show that $A_\delta = B_\delta$. Since $q$ restricted to each orbit is unimodal, as we have just shown, a set in $\mathcal{D}_\delta$ on which the integral of $q$ is maximized is obtained by choosing an interval of length $2\delta$ from each orbit. To make this more precise, for each $h \in H$, let $c(h)$ be the center of the length-$2\delta$ interval of $t$-values that maximizes $\int_{c(h) - \delta}^{c(h) + \delta} q(h + t)\,dt$. Then let
$$S = \{h + t : h \in H,\ |t - c(h)| < \delta\}.$$
Now $S \in \mathcal{C}_\delta$, and moreover, $\mu^n(S) \ge \mu^n(S')$ for any $S' \in \mathcal{D}_\delta$ because the corresponding inequality holds on each orbit.
Thus, $B_\delta$ is achieved by $S$, and it follows that $A_\delta = B_\delta$ and that the best shift-invariant estimator is optimal.
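Here is a quick numerical check of the key step (our illustration, for the standard Gaussian, where $g(u) = -u^2/2$ and $g'(u) = -u$ is continuous and strictly decreasing): along the diagonal orbit through a sample $x$, the product density $q(x + t)$ is unimodal in $t$ with its maximum at $t = -\bar{x}$, so centering the window there recovers the sample mean as the optimal shift-invariant estimator.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(3.0, 1.0, size=5)       # five samples from mu_theta with theta = 3

# Log of the product density along the diagonal orbit {x + t}, up to a constant:
# log q(x + t) = sum_i g(x_i + t) with g(u) = -u^2/2 for the standard Gaussian.
logq = lambda t: sum(-0.5 * (xi + t) ** 2 for xi in x)

ts = np.linspace(-10.0, 10.0, 100_001)
vals = logq(ts)
assert np.all(np.diff(np.sign(np.diff(vals))) <= 0)   # concave, hence unimodal in t
t_star = ts[np.argmax(vals)]
print(f"orbit maximum at t ~ {t_star:.3f};  -mean(x) = {-x.mean():.3f}")
# Centering the length-2*delta window of t's at t* = -mean(x) on every orbit
# yields the sample mean as the optimal shift-invariant estimator.
```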

4.3. Monotonic Distributions on

The third class of examples generalizes the exponential distribution, defined by the density $p(x) = \lambda e^{-\lambda x}$ for $x \ge 0$ and $p(x) = 0$ for $x < 0$. The optimal estimator in this case is not the maximum likelihood estimator. (Note that in a typical estimation problem involving a family of exponential distributions, one is trying to estimate the "scale" parameter $\lambda$ rather than the "location" parameter $\theta$.)

Example 4.4. Let $\mu$ be defined by a density function $p$ that is decreasing for $x \ge 0$ and identically zero for $x < 0$. Then for any $n$, there is a shift-invariant $n$-sample estimator that is optimal.

Proof. We construct the estimator as follows: for $x = (x_1, \dots, x_n)$, define $f(x) = \min_i x_i - \delta$. Note that this is shift invariant; therefore its quality can be computed at $\theta = 0$. That is, it suffices to show that $Q^0_\delta(f) = B_\delta$.
Let $S_0 = \{x : |f(x)| < \delta\} = \{x : 0 < \min_i x_i < 2\delta\}$. Note that $Q^0_\delta(f) = \mu^n(S_0)$, and so $\mu^n(S_0)$ is the quality of $f$. Note also that $S_0 \in \mathcal{D}_\delta$ (in fact $S_0 \in \mathcal{C}_\delta$), so certainly $\mu^n(S_0) \le B_\delta$. We will show that any $S \in \mathcal{D}_\delta$ can be modified to a set $S'$ such that $S' \subseteq S_0$ and $\mu^n(S') \ge \mu^n(S)$. It then follows that $\mu^n(S_0) \ge B_\delta$, and this will complete the proof.
So, let $S \in \mathcal{D}_\delta$, and define $S'$ by replacing the intersection of $S$ with each orbit by the leftmost possible interval: writing $m(h) = \min_i h_i$ for $h \in H$, let
$$S' = \{h + t : h \in H,\ -m(h) < t < -m(h) + \ell(h)\},$$
where $\ell(h) \le 2\delta$ is the $t$-measure of the intersection of $S$ with the orbit of $h$. Note that $S'$ is determined uniquely by $S$, and $S' \subseteq S_0$, since $\min_i(h_i + t) = m(h) + t \in (0, 2\delta)$ for every point of $S'$. Now $S'$ is in $\mathcal{D}_\delta$, and by our hypotheses on $p$, the function $t \mapsto q(h + t)$ is zero for $t \le -m(h)$ and decreasing thereafter; hence on each orbit the leftmost interval of a given length carries the largest $q$-integral among sets of that length. Therefore, $\mu^n(S') \ge \mu^n(S)$.
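For the exponential distribution, the construction gives $f(x) = \min_i x_i - \delta$, whereas the maximum likelihood estimator is $\min_i x_i$. A quick Monte Carlo sketch (ours; rate 1 and the particular $n$, $\delta$, $\theta$ are arbitrary) confirms the gap, which can also be computed in closed form since $\min_i x_i - \theta$ is exponential with rate $n$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, delta, theta, trials = 5, 0.2, 1.7, 200_000

x = theta + rng.exponential(1.0, size=(trials, n))    # samples from the theta-translate
m = x.min(axis=1)

for name, guess in [("MLE: min(x)", m), ("min(x) - delta", m - delta)]:
    print(f"{name:15s} success rate ~ {np.mean(np.abs(guess - theta) < delta):.3f}")

# Exact values: min(x) - theta is Exponential(n), so the success probabilities are
#   1 - exp(-n*delta)   ~ 0.632   for the MLE, and
#   1 - exp(-2*n*delta) ~ 0.865   for min(x) - delta.
```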

4.4. Discrete Distributions

Here, we discuss purely atomic distributions on finite sets of points. Because we are only trying to guess within $\delta$ of the correct value of $\theta$, there are many possible choices of estimators with the same quality. Among the optimal ones is the maximum likelihood estimator.

Example 4.5. Let $\mu$ be a distribution on a finite set of points $P \subseteq \mathbb{R}$. There is a shift-invariant one-sample estimator that is optimal. Furthermore, if all of the pairwise distances between points of $P$ are distinct, then for every $n$ there is a shift-invariant $n$-sample estimator that is optimal.

Proof. We first treat the case $n = 1$. Since $\mu$ is discrete, the supremum defining $A_\delta$ is attained; therefore, by Theorems 2.3 and 3.2, it suffices to show that every estimator has quality at most $A_\delta$.
Recall that $P$ denotes the support of $\mu$, and for any $x \in \mathbb{R}$, let $w(x)$ denote the mass at $x$. For a finite set $F$, we use $|F|$ to denote the cardinality. Suppose that $f$ is any estimator.
Lemma 4.6. Let $\Theta$ be any finite subset of $\mathbb{R}$; then
$$Q_\delta(f) \le A_\delta \cdot \frac{|\Theta + P|}{|\Theta|},$$
where $\Theta + P$ denotes the set $\{\theta + y : \theta \in \Theta,\ y \in P\}$.

Proof of Lemma 4.6. To prove the lemma, we estimate the average quality over $\Theta$. We have
$$\frac{1}{|\Theta|} \sum_{\theta \in \Theta} Q^\theta(f) = \frac{1}{|\Theta|} \sum_{\theta \in \Theta} \sum_{s} w(s - \theta)\, \chi\{|f(s) - \theta| < \delta\},$$
with the inner part of the last sum taken over those $s$ that lie in $\theta + P$. Using $y$ to denote $s - \theta$, this condition becomes $y \in P$, and the right-hand side above may be rewritten as
$$\frac{1}{|\Theta|} \sum_{s \in \Theta + P} \sum_{y} w(y),$$
with the inner sum now taken over all $y \in P$ with $s - y \in \Theta$ and $|f(s) - (s - y)| < \delta$. This latter condition implies that $y$ is within $\delta$ of $s - f(s)$. But by definition, $A_\delta$ is the maximum measure of any interval of length $2\delta$. Hence, for any fixed $s$, the inner sum is at most $A_\delta$, and the entire sum is thus bounded above by $|\Theta + P| \cdot A_\delta$. Dividing by $|\Theta|$ gives a bound for the average quality over $\Theta$, and since $Q_\delta(f)$ is defined as an infimum, the lemma follows.
We now apply the lemma to complete the example. Let $N$ be a natural number, and let
$$\Theta_N = \underbrace{P + P + \cdots + P}_{N}$$
be the $N$-fold sumset of $P$. Note that $\Theta_N + P = \Theta_{N+1}$ and that $|\Theta_N|$ grows at most polynomially in $N$ (a sum of $N$ elements of $P$ is determined by how many times each point of $P$ occurs). It follows that for any $\epsilon > 0$, there exists $N$ such that $|\Theta_{N+1}| \le (1 + \epsilon)|\Theta_N|$, for otherwise, $|\Theta_N|$ would grow at least exponentially in $N$. Using the fact that $\Theta_N + P = \Theta_{N+1}$, the lemma applied to $\Theta = \Theta_N$ implies that $Q_\delta(f) \le (1 + \epsilon) A_\delta$. Therefore, $Q_\delta(f) \le A_\delta$, and this finishes the case $n = 1$.
Lastly, we consider an arbitrary $n$. If we are given samples $s_1, \dots, s_n$ and if $s_i \ne s_j$ for some $i$ and $j$, then by our hypothesis the difference $s_i - s_j$ determines which two points of $P$ produced these samples, and hence the shift $\theta$ is uniquely determined. Thus, we may assume that any optimal estimator picks the right $\theta$ in these cases, and the only question is what value the estimator returns if all the samples are identical. The above analysis of the one-sample case can be used to show that the best shift-invariant estimator is optimal.
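Both combinatorial ingredients of the $n = 1$ argument are easy to compute. The sketch below (our illustration; the support and weights are an arbitrary toy example) computes $A_\delta$ by sliding a window of length $2\delta$ so that an atom sits at its left end, and then watches the sumset ratio $|\Theta_N| / |\Theta_{N-1}|$ tend to 1, which is exactly what the proof needs.

```python
P = (0.0, 1.0, 2.5)                    # a toy support
w = {0.0: 0.5, 1.0: 0.3, 2.5: 0.2}     # atom weights
delta = 0.6

# A_delta for n = 1: the best mass of a window of length 2*delta; an optimal
# window may be slid until an atom sits at its left end.
A = max(sum(w[q] for q in P if p <= q < p + 2 * delta) for p in P)
print("A_delta =", A)                  # the window starting at 0 captures {0.0, 1.0}: 0.8

# Sumset growth: |Theta_N| is polynomial in N, so |Theta_N| / |Theta_{N-1}| -> 1.
Theta, prev = {0.0}, 1
for N in range(1, 13):
    Theta = {t + q for t in Theta for q in P}
    print(N, len(Theta), round(len(Theta) / prev, 3))
    prev = len(Theta)
```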

5. The Compact Case

So far, we have dealt only with distributions on $\mathbb{R}$, where the shift parameter is a translation. In every specific case that we have analyzed, we have found a shift-invariant estimator among the optimal estimators. In this section, we prove that if $G$ is a compact group acting on itself by (left) multiplication, then, at least for measures defined by density functions, there is always a shift-invariant estimator as good as any given estimator. In Section 6, we show that the compactness hypothesis cannot be eliminated entirely; we do not know how much it can be weakened, if at all.

We will continue to use both $G$ and $X$ as notations, in order to emphasize the distinction between the two roles played by this object. Eaton [2] discusses estimators in a context in which the group $G$ acts on both the parameter space and the sample space $X$. In his work, the sample space is an arbitrary homogeneous space (i.e., a space with a transitive $G$-action). In this generality, shift-invariant estimators may not exist, since there may not even exist a function from $X^n$ to $G$ that preserves the action. For this reason, we choose to identify the sample space $X$ with the group $G$.

As usual, $G$ acts diagonally on $X^n$; we denote the orbit space by $X^n/G$. An element of $X^n/G$ is an equivalence class $[x] = \{g \cdot x : g \in G\}$, which we identify with a canonical representative via the following normalization. For $x \in X^n$, we denote by $\bar x$ the point $x_1^{-1} \cdot x$; thus, $\bar x$ is in the orbit of $x$ and has first coordinate 1. The set $\{x \in X^n : x_1 = 1\}$ is naturally identified with $X^{n-1}$.

Equip $G$ with a left- and right-invariant metric $d$, meaning that $d(g_1, g_2) = d(h g_1, h g_2) = d(g_1 h, g_2 h)$ for all $g_1, g_2, h \in G$. Let $B(g, r)$ denote the ball of radius $r$ around $g$. If $S$ is a subset of a measure space $(Y, \nu)$, then we denote the measure of $S$ variously by $\nu(S)$, $\nu\{y : y \in S\}$, or $\int_Y \chi_S\, d\nu$. (The notation $\chi_S$ refers to the characteristic function of the set $S$.)

Fix $n$ and $\delta$, and let $\nu$ be an arbitrary measure on $X^n$. The following technical lemma says that to evaluate an integral over $X^n$, we can integrate over each $G$-orbit and then integrate the result over the orbit space.

Lemma 5.1. There exist measures $\bar\nu$ on $X^n/G$ and $\nu_{[x]}$ on each orbit $[x]$ such that for any integrable function $\phi$ on $X^n$,
$$\int_{X^n} \phi\, d\nu = \int_{X^n/G} \left( \int_{[x]} \phi\, d\nu_{[x]} \right) d\bar\nu.$$

Proof. Let $\pi\colon X^n \to X^n/G$ be defined by $\pi(x) = [x]$, and let $\bar\nu$ be the image of $\nu$ with respect to $\pi$, that is, $\bar\nu(E) = \nu(\pi^{-1}E)$ for all Borel $E \subseteq X^n/G$; then
$$\int_{X^n} (\psi \circ \pi)\, d\nu = \int_{X^n/G} \psi\, d\bar\nu$$
for nonnegative Borel functions $\psi$ on $X^n/G$. The map $x \mapsto (x_1, \bar x)$ identifies $X^n$ with $G \times (X^n/G)$, and under this identification the disintegration theorem provides measures concentrated on the fibers; carrying the fiber measure over $[x]$ back to the orbit via $g \mapsto g \cdot \bar x$ yields measures $\nu_{[x]}$ with
$$\int_{X^n} \phi\, d\nu = \int_{X^n/G} \int_{[x]} \phi\, d\nu_{[x]}\, d\bar\nu,$$
completing the proof.

Lemma 5.2. If $f$ is a shift-invariant ($n$-sample) estimator, then
$$Q_\delta(f) = \int_{X^n/G} \nu_{[x]}\{g \cdot \bar x : d(g, f(\bar x)^{-1}) < \delta\}\, d\bar\nu,$$
where $\bar\nu$ and the $\nu_{[x]}$ are obtained by applying Lemma 5.1 to $\nu = \mu^n$.

Proof. Since $f$ is shift invariant, its quality can be computed at the identity. Thus, $Q_\delta(f) = \mu^n\{x : d(f(x), 1) < \delta\}$. By Lemma 5.1, this integral can be decomposed as
$$\int_{X^n/G} \nu_{[x]}\{y \in [x] : d(f(y), 1) < \delta\}\, d\bar\nu.$$
Now, writing $y = g \cdot \bar x$ and using the invariance of $f$ and of the metric, note that $d(f(y), 1) < \delta$ if and only if $d(g f(\bar x), 1) < \delta$ if and only if $d(g, f(\bar x)^{-1}) < \delta$. Thus, the integral above is the same as the one in the statement of the lemma, and we are done.

We are now ready to prove the result. Note that we do not prove that optimal estimators exist—only that if they exist, then one of them is shift invariant.

Theorem 5.3. Let $G$ be a compact group, let $n$ and $\delta$ be given, and let $\mu$ be defined by a density function. If $f$ is any estimator, then there exists a shift-invariant estimator $f'$ with $Q_\delta(f') \ge Q_\delta(f)$.

Proof. Let $f$ be any estimator. For each group element $a \in G$, we define a shift-invariant estimator $f_a$ that agrees with $f$ on the set $\{x : x_1 = a\}$ as follows:
$$f_a(g \cdot \bar x) = g\, a^{-1} f(a \cdot \bar x).$$
We will show that there exists $a$ such that $Q_\delta(f_a) \ge Q_\delta(f)$.
Denote by $\lambda$ the invariant (Haar) probability measure on $G$. Since $Q_\delta(f)$ is defined as an infimum, we have
$$Q_\delta(f) \le \int_G \mu^n_\theta\{x : d(f(x), \theta) < \delta\}\, d\lambda(\theta) = \int_G \int_{X^n/G} \nu_{[x]}\{g \cdot \bar x : d(f(\theta g \cdot \bar x), \theta) < \delta\}\, d\bar\nu\, d\lambda(\theta), \tag{5.8}$$
where the last equality comes from Lemma 5.1: the condition that $x$ is sampled from $\mu^n_\theta$ is equivalent to the condition that $x = \theta \cdot y$ with $y$ sampled from $\mu^n$, and we write $y = g \cdot \bar x$. Now, we make the substitution $a = \theta g$; for each fixed $g$, the map $\theta \mapsto \theta g$ is measure preserving. Thus, $\theta = a g^{-1}$, and the condition becomes $d(f(a \cdot \bar x), a g^{-1}) < \delta$, or, by invariance of the metric, $d(f(a \cdot \bar x)\, g, a) < \delta$. This says that $d(g, f(a \cdot \bar x)^{-1} a) < \delta$, or equivalently, $d(g, f_a(\bar x)^{-1}) < \delta$.
This allows us to rewrite the triple integral (5.8), using the measure-preserving transformation $\theta \mapsto \theta g$, as
$$\int_G \int_{X^n/G} \nu_{[x]}\{g \cdot \bar x : d(g, f_a(\bar x)^{-1}) < \delta\}\, d\bar\nu\, d\lambda(a).$$
Now, comparing with Lemma 5.2, we see that the inner integral above is exactly the quality of the shift-invariant estimator $f_a$.
We therefore have
$$Q_\delta(f) \le \int_G Q_\delta(f_a)\, d\lambda(a),$$
or in other words, the average quality of the shift-invariant estimators $f_a$ is at least $Q_\delta(f)$. Therefore, at least one of the $f_a$ satisfies $Q_\delta(f_a) \ge Q_\delta(f)$.
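The averaging argument can be watched in action on a finite (hence compact) group. In the sketch below (ours; the group $\mathbb{Z}_6$, the distribution, and the random estimator are arbitrary choices), quality means the worst-case probability of guessing $\theta$ exactly, $f_a$ is the shift-invariant estimator agreeing with $f$ where the first coordinate is $a$, and, as the theorem guarantees, the average of the $Q(f_a)$ is at least $Q(f)$.

```python
import itertools, random
import numpy as np

m, n = 6, 2
mu = np.array([0.4, 0.3, 0.1, 0.1, 0.05, 0.05])    # a distribution on Z_6
random.seed(4)
f = {x: random.randrange(m)                        # an arbitrary (bad) estimator
     for x in itertools.product(range(m), repeat=n)}

def quality(est):
    """Worst-case probability of guessing theta exactly."""
    return min(sum(mu[e1] * mu[e2]
                   for e1 in range(m) for e2 in range(m)
                   if est[(theta + e1) % m, (theta + e2) % m] == theta)
               for theta in range(m))

def f_a(a):
    """The shift-invariant estimator agreeing with f where x[0] == a."""
    return {x: (f[a, (x[1] - x[0] + a) % m] + x[0] - a) % m
            for x in itertools.product(range(m), repeat=n)}

qs = [quality(f_a(a)) for a in range(m)]
print(f"Q(f) = {quality(f):.4f},  mean_a Q(f_a) = {np.mean(qs):.4f},  max_a Q(f_a) = {max(qs):.4f}")
assert max(qs) >= quality(f)
```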

6. A Non-Shift-Invariant Example

In the following example, suggested by Schulman, the optimal shift-invariant estimator is not optimal. This provides an interesting complement to Conjecture 4.1. Lehmann and Casella [11, Section 5.3] also give examples of this phenomenon.

Let $T$ be the infinite trivalent tree, which we view as the Cayley graph of the group $\Gamma = \mathbb{Z}_2 * \mathbb{Z}_2 * \mathbb{Z}_2$. In words, $\Gamma$ is a discrete group generated by three elements $a$, $b$, and $c$, each of order two and with no other relations. Each nonidentity element of $\Gamma$ can be written uniquely as a finite word in the letters $a$, $b$, and $c$ with no letter appearing twice in a row; we refer to such a word as the reduced form of the group element. (We write 1 for the identity element of $\Gamma$.) Multiplication in the group is performed by concatenating words and then canceling any repeated letters in pairs. Evidently, $\Gamma$ is infinite. The Cayley graph $T$ is a graph with one vertex labeled by each group element of $\Gamma$, and with an edge joining vertices $g$ and $h$ if and only if $g = ha$, $g = hb$, or $g = hc$. Note that this relation is symmetric, since $a$, $b$, and $c$ each has order 2. Each vertex of $T$ has valence 3, and $T$ is connected and contains no circuits, that is, it is a tree. Finally, $T$ becomes a metric space by declaring each edge to have length one.

Because of how we defined the edges of $T$, $\Gamma$ acts naturally on the left of $T$: given $g \in \Gamma$, the map $L_g$ defined by $L_g(h) = gh$ is an isometry of $T$. So if $\theta \in \Gamma$ is given, $\mu$ is a probability distribution on the vertices of $T$, $n = 1$, and $f$ is an estimator, then the shift $\mu_\theta$ and the quality $Q_\delta(f)$ are defined as usual by (1.1) and (1.3).

We are ready to present the example. Suppose that $\delta \le 1$ is fixed, and let $\mu$ be the probability distribution with atoms of weight $1/3$ at the three vertices $a$, $b$, and $c$. Thus for $\theta \in \Gamma$, the distribution $\mu_\theta$ has atoms of weight $1/3$ at the three neighbors of the vertex $\theta$ in $T$.

Example 6.1. There is an optimal one-sample estimator with quality 2/3, but the quality of any shift-invariant one-sample estimator is at most 1/3.

Proof. Consider the one-sample estimator $f$ that truncates the last letter of the sample (unless the sample is the identity, in which case we arbitrarily assign the value $a$). That is, for a vertex $w = w_1 w_2 \cdots w_k$ of $T$ in reduced form,
$$f(w) = \begin{cases} w_1 w_2 \cdots w_{k-1} & \text{if } w \ne 1,\\ a & \text{if } w = 1. \end{cases}$$
Geometrically, this estimator takes a sample $w$ and, unless $w = 1$, guesses that the shift is the (unique) neighbor of $w$ that is closer to 1.
We compute the quality of $f$. Note that $Q^1(f) = 1$, because if $\theta = 1$, then the sample will be $a$, $b$, or $c$, and the estimator is guaranteed to guess correctly. In fact, $Q^a(f) = 1$ also, as is easily verified. For any other shift $\theta$, the sample is $\theta s$ for $s = a$, $b$, or $c$, and the estimator guesses correctly exactly when $s$ differs from the last letter of $\theta$. So $Q^\theta(f) = 2/3$ for $\theta \notin \{1, a\}$, and $Q(f) = 2/3$.
It is easy to see that this estimator is optimal. Suppose that $f'$ is another estimator and $Q(f') > 2/3$. Since each local quality $Q^\theta(f')$ is either 0, 1/3, 2/3, or 1, we must have $Q^\theta(f') = 1$ for all $\theta$. This means that $f'$ always guesses correctly. But since there are different values of $\theta$ that can produce the same sample, this is impossible.
Observe that the estimator above is neither left nor right invariant. For instance, right invariance fails, as $f(a \cdot b) = a \ne b = f(a)\, b$, and the same sample shows the failure of left invariance: $f(ab \cdot 1) = a \ne aba = ab \cdot f(1)$.
Indeed, we conclude by showing that the quality of any shift-invariant one-sample estimator is at most 1/3. Suppose $f(1) = g_0$. If $f$ is left invariant, it follows that $f(g) = g g_0$ for all $g$; if $f$ is right invariant, it follows that $f(g) = g_0 g$ for all $g$.
Since $\delta \le 1$, the quality of $f$ at $\theta$ is equal to the probability that $f(x) = \theta$, given that $x$ was sampled from $\mu_\theta$. With equal probability, $x$ is $\theta a$, $\theta b$, or $\theta c$; since at most one of $a g_0$, $b g_0$, and $c g_0$, and at most one of $\theta^{-1} g_0 \theta a$, $\theta^{-1} g_0 \theta b$, and $\theta^{-1} g_0 \theta c$, can equal 1, we conclude that $Q^\theta(f) \le 1/3$ in either case.
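The bookkeeping with reduced words is easy to mechanize. The following sketch (our illustration) implements multiplication in $\Gamma$, the truncation estimator, and the local qualities; it also spot-checks that left-invariant estimators, which necessarily have the form $g \mapsto g g_0$, never exceed local quality $1/3$.

```python
GENS = "abc"

def mult(u, v):
    """Multiply reduced words in Z_2 * Z_2 * Z_2: concatenate, then cancel."""
    u = list(u)
    for s in v:
        if u and u[-1] == s:
            u.pop()               # s has order two, so a repeated letter cancels
        else:
            u.append(s)
    return "".join(u)

truncate = lambda w: w[:-1] if w else "a"    # the estimator f; f(1) = a arbitrarily

def local_quality(theta):
    """Probability that f guesses theta when the sample is theta*s, s in {a,b,c}."""
    return sum(truncate(mult(theta, s)) == theta for s in GENS) / 3

print([round(local_quality(t), 3) for t in ["", "a", "b", "ba", "abc"]])
# -> [1.0, 1.0, 0.667, 0.667, 0.667]; the infimum over all shifts is 2/3.

# A left-invariant estimator is g |-> g*g0 with g0 = f(1); at most one of the
# three samples theta*s can then land on theta, so its quality is <= 1/3.
for g0 in ["", "a", "ab"]:
    q = min(sum(mult(mult(t, s), g0) == t for s in GENS) / 3
            for t in ["", "a", "b", "ab", "cb"])
    print(f"g0 = {g0!r}: worst local quality of g |-> g*g0 is {round(q, 3)}")
```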

We remark that this example readily generalizes to other non-amenable groups; the key is that truncation is a two-to-one map, but with only one sample, a shift-invariant estimator is necessarily one-to-one.

Acknowledgments

The authors thank V. Vazirani and L. Schulman for suggesting the problem that motivated this paper along with subsequent helpful discussions. They are grateful to the reviewers for many helpful suggestions, especially for correcting our proof of Lemma 5.1. As always, they thank Julie Landau, without whose support and down-the-line backhand this work would not have been possible.