Abstract

As a diffusion distance, we propose to use a metric (closely related to cosine similarity) which is defined as the $L^2$ distance between two $L^2$-normalized vectors. We provide a mathematical explanation as to why the normalization makes diffusion distances more meaningful. Our proposal is in contrast to that made some years ago by R. Coifman, which finds the $L^2$ distance between certain $L^1$ unit vectors. In the second part of the paper, we give two proofs that an extension of mean first passage time to mean first passage cost satisfies the triangle inequality; we do not assume that the underlying Markov matrix is diagonalizable. We conclude by exhibiting an interesting connection between the (normalized) mean first passage time and the discretized solution of a certain Dirichlet-Poisson problem and verify our result numerically for the simple case of the unit circle.

1. Introduction

Several years ago, motivated by considering heat flow on a manifold, R. Coifman proposed a diffusion distance, both for the case of a manifold and a discrete analog for a set of data points in $\mathbb{R}^n$. In the continuous case, his distance can be written as the $L^2$ norm of the difference of two specified vectors, each of which has unit $L^1$ norm. (An analogous situation holds in the discrete case.) Coifman's distance can be successfully used in various applications, including data organization and approximately isometric embedding of data in low-dimensional Euclidean space. See, for example, [1–3]. For a unified discussion of diffusion maps and their usefulness in spectral clustering and dimensionality reduction, see [4].

We see a drawback in Coifman's diffusion distance in that it finds the $L^2$ norm of the difference between two $L^1$ unit vectors, rather than $L^2$ unit vectors. As shown by a simple example later in this paper, two vectors (representing two diffusions), which we may want to consider to be far apart, are actually close to each other in $L^2$, even though the angle between them is large, because they have small $L^2$ norm, while still having unit $L^1$ norm. Additionally, when Coifman's distance is applied to heat flow in $\mathbb{R}^n$, a factor of a power of time $t$ remains, with the exponent depending on the dimension $n$. It would be desirable not to have such a factor.

Our main motivation for this paper is to propose an alternate diffusion metric, which finds the $L^2$ distance between two $L^2$ unit vectors (with analogous statements for the discrete case). Our distance is thus the length of the chord joining the tips, on the unit hypersphere, of two $L^2$-normalized diffusion vectors, and is therefore based on cosine similarity (see (4.4) below). Cosine similarity (affinity) is popular in kernel methods in machine learning; see, for example, [5, 6] (in particular, the section "Document Clustering Basics") and, for a review of kernel methods in machine learning, [7].

In the case of heat flow on $\mathbb{R}^n$, our proposed distance has the property that no dimensionally dependent factor is left. Furthermore, for a general manifold, our diffusion distance gives, approximately, a scaled geodesic distance between two points $x$ and $y$, when $x$ and $y$ are closer than $\sqrt{t}$, and maximum separation when the geodesic distance between $x$ and $y$, scaled by $\sqrt{t}$, goes to infinity.

We next give two proofs that the mean first passage cost (defined later in this paper as the cost to visit a particular point for the first time after leaving a specified point) satisfies the triangle inequality. (See the theorem in [8] in which the author states that the triangle inequality holds for the mean first passage time.) Neither proof assumes that the underlying Markov matrix is diagonalizable, and neither relies on spectral theory.

We calculate explicitly the normalized limit of the mean first passage time for the unit circle by identifying the limit as the solution of a specific Dirichlet-Poisson problem on the circle. We also provide numerical verification of our calculation.

The paper is organized as follows. After a section on notation, we discuss R. Coifman's diffusion distance for both the continuous and discrete cases in Section 3. In Section 4, we define and discuss our alternate diffusion distance. In Section 5, we give two proofs of the triangle inequality for mean first passage cost. We conclude that section by exhibiting an interesting connection between the (normalized) mean first passage time and the discretized solution of a certain Dirichlet-Poisson problem and verify our result numerically for the simple case of the unit circle.

2. Notation and Setup

In this paper, we will present derivations for both the continuous and discrete cases.

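In the continuous situation, we assume there is an underlying Riemannian manifold $M$ with measure $dx$; $x, y, u, \ldots$ will denote points in $M$. For $t \ge 0$, $\rho_t(x, y)$ will denote a kernel on $M \times M$, with $\rho_t(x, y) \ge 0$ for all $x, y \in M$, and satisfying the following semigroup property:

$$\int_M \rho_t(x, u)\, \rho_s(u, y)\, du = \rho_{t+s}(x, y) \tag{2.1}$$

for all $x, y \in M$, and $s, t \ge 0$. In addition, we assume the following property:

$$\int_M \rho_t(x, y)\, dx = 1 \tag{2.2}$$

for all $y \in M$ and all $t \ge 0$. The latter convention gives the mass preservation property

$$\int_M T_t f(x)\, dx = \int_M f(y)\, dy, \tag{2.3}$$

where

$$T_t f(x) = \int_M \rho_t(x, y)\, f(y)\, dy. \tag{2.4}$$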

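We will often specialize to the case when $\rho_t(x, y) = \rho_t(y, x)$ for all $x, y \in M$ and $t \ge 0$, as in the case of heat flow. Note that when $\rho_t$ is the fundamental solution for heat flow, we have $\rho_0(x, y) = \delta_x(y)$, where $\delta_x$ denotes the Dirac delta function centered at $x$. We will sometimes assume (as in the case of heat flow on a compact manifold) that there exist $0 \le \lambda_1 \le \lambda_2 \le \cdots$, with each $\lambda_j$ corresponding to a finite dimensional eigenspace, and a complete orthonormal family of $L^2$ functions $\phi_1, \phi_2, \ldots$, such that

$$\rho_t(x, y) = \sum_j e^{-\lambda_j t}\, \phi_j(x)\, \phi_j(y) \tag{2.5}$$

for $t > 0$. We will also frequently use the following fact: if $\rho_t$ is symmetric in the space variables, then for any $x, y \in M$,

$$\int_M \rho_t(x, u)\, \rho_t(y, u)\, du = \int_M \rho_t(x, u)\, \rho_t(u, y)\, du = \rho_{2t}(x, y), \tag{2.6}$$

where we have used the symmetry of $\rho_t$ and its semigroup property.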

For the discrete situation, the analog of $\rho_t(x, y)$ is an $N \times N$ matrix $A = (a_{ij})$, with $a_{ij} \ge 0$ for every $i, j$. In keeping with the usual convention that $A$ is Markov if each row sum equals 1, that is, $\sum_j a_{ij} = 1$ for all $i$, the analog of $T_t f$ is $(A^T)^t v$, where $A^T$ is the transpose of $A$, and $v$ is an $N \times 1$ column vector. So the index $i$ corresponds to the second space variable in $\rho$, the index $j$ corresponds to the first space variable in $\rho$, and $t$, $t = 0, 1, 2, \ldots$, corresponds to the $t$th power of $A$. The obvious analog of $\rho_t$ symmetric in its space variables is a symmetric Markov matrix $A$, that is, $a_{ij} = a_{ji}$.

For $A = (a_{ij})$ as above, not necessarily symmetric, we think of $a_{ij}$ as the probability of transitioning from state $s_i$ to state $s_j$ in one tick of the clock; $S = \{s_1, \ldots, s_N\}$ is the underlying set of states. For a subset $\Gamma$ of the set of states $S$, the $N \times N$ matrix $P_\Gamma$ will denote the following projection: all entries of $P_\Gamma$ are $0$ except for diagonal entries $(k, k)$, when $s_k \in \Gamma$; the latter entries are equal to $1$.

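Finally, $\mathbf{1}$ will denote the $N \times 1$ column vector where each entry is 1; $e_i$ will denote the $N \times 1$ column vector with the $i$th component 1, and all others 0, and, for a set of states $\Gamma$, $\Gamma^c$ will denote the complement of $\Gamma$ with respect to the full set of states $S$.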

3. A Diffusion Distance Proposed by R. Coifman

Several years ago, R. Coifman proposed a novel diffusion distance based on the ideas of heat flow on a manifold or a discrete analog of heat flow on a set of data points (see, e.g., [1, 2] for a thorough discussion). In this section, we will describe Coifman's distance using our notation, and consider some of its good points, and what we see as some of its drawbacks.

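Referring to Section 2, for the continuous case, the unweighted version of Coifman's distance between $x, y \in M$, which we will denote by $D_t(x, y)$, can be defined as follows:

$$[D_t(x, y)]^2 = \langle T_t\delta_x - T_t\delta_y,\; T_t\delta_x - T_t\delta_y \rangle. \tag{3.1}$$

Here,

$$(T_t\delta_x)(u) = \rho_t(u, x) \tag{3.2}$$

for $u \in M$. The $\langle \cdot, \cdot \rangle$ is the usual inner product on $L^2(M)$. (In [1], the authors consider a weighted version of (3.1) which naturally arises when the underlying kernel does not integrate to $1$ (in each variable). In terms of data analysis, this corresponds to cases where the data are sampled nonuniformly over the region of interest. For simplicity, we are just using Coifman's unweighted distance.)

Note that we thus have

$$[D_t(x, y)]^2 = \langle T_t\delta_x, T_t\delta_x \rangle + \langle T_t\delta_y, T_t\delta_y \rangle - 2\langle T_t\delta_x, T_t\delta_y \rangle. \tag{3.3}$$

Although Coifman's original definition used a kernel symmetric with respect to the space variables, $D_t(x, y)$ as given above need not be based on a symmetric $\rho_t$. Note that, by the defining (3.1), $D_t(x, y)$ is symmetric in $x$ and $y$ (even if $\rho_t$ is not), and satisfies the triangle inequality. If $\rho_t$ is symmetric in the space variables, from (2.6) we see that:

$$[D_t(x, y)]^2 = \rho_{2t}(x, x) + \rho_{2t}(y, y) - 2\rho_{2t}(x, y), \tag{3.4}$$

a form matching one of Coifman's formulations for the continuous case.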

If, in addition to being symmetric in the space variables, we have that (2.5) holds, as in the case of heat flow, we easily see that Coifman's distance, written $D_t(x, y)$, satisfies:

$$[D_t(x, y)]^2 = \sum_j e^{-2\lambda_j t}\, \bigl(\phi_j(x) - \phi_j(y)\bigr)^2, \tag{3.5}$$

the original form proposed by Coifman. Note that the latter expression again explicitly shows that $D_t(x, y)$ is symmetric in $x$ and $y$ and satisfies the triangle inequality (by considering, for example, the right-hand side as the square of a weighted distance in $\ell^2$).

Referring again to Section 2, for the discrete situation, where we start with a set of data points , and is a Markov matrix specifying the transition probabilities between the “states" of , the distance between two data points and is given by where is the usual inner product in , and for a matrix , denotes the entry of . Again, symmetry and the triangle inequality are easily verified. If is symmetric, The “1" appearing in the subscript of refers to the fact that is used, corresponding to in the continuous case. As the diffusion along data points flows, after ticks of the clock, we can successively consider which, for a symmetric , equals
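As a minimal numerical sketch of the discrete distance just described (not the authors' code), the following Python snippet compares the direct definition, the $\ell^2$ distance between $(A^T)^t e_i$ and $(A^T)^t e_j$, with the closed form in terms of entries of $A^{2t}$ that holds when $A$ is symmetric. The lazy random walk on a cycle and the names `lazy_cycle_walk` and `coifman_distance` are illustrative assumptions, not taken from the source.

```python
import numpy as np

def lazy_cycle_walk(n):
    """Hypothetical symmetric Markov matrix: stay with prob. 1/2, step to each neighbor with prob. 1/4."""
    A = 0.5 * np.eye(n)
    for k in range(n):
        A[k, (k + 1) % n] += 0.25
        A[k, (k - 1) % n] += 0.25
    return A

def coifman_distance(A, i, j, t=1):
    """l2 distance between the diffused point masses (A^T)^t e_i and (A^T)^t e_j."""
    At = np.linalg.matrix_power(A.T, t)
    return float(np.linalg.norm(At[:, i] - At[:, j]))

A = lazy_cycle_walk(8)
t, i, j = 3, 0, 2

direct = coifman_distance(A, i, j, t)

# For symmetric A, <(A^T)^t e_i, (A^T)^t e_j> is the (i, j) entry of A^(2t).
A2t = np.linalg.matrix_power(A, 2 * t)
via_powers = float(np.sqrt(A2t[i, i] + A2t[j, j] - 2 * A2t[i, j]))

print(direct, via_powers)
```

The two printed values should agree up to rounding; for a non-symmetric $A$ only the direct definition applies.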

An important benefit of introducing a diffusion distance as above can be illustrated by considering (3.5). If is such that (3.5) holds for a complete orthonormal family , we see that as increases, we are achieving an (approximate) isometric embedding of into successively lower-dimensional vector spaces (with a weighted norm). More specifically, for , if is large, the terms are nearly . So, as increases, we see that the “heat smeared" manifold is parametrized by only a few leading 's. Thus, “stepping" through higher and higher times, we are obtaining a natural near-parametrization of more and more smeared versions of , giving rise to a natural ladder of approximations to .

Analogous considerations hold in the discrete situation for symmetric, when we easily see that the eigenvalues of are between and and decrease exponentially for , as increases (the “heat smeared" data points are now parametrized by a few leading eigenvectors of , associated to the largest eigenvalues).

See [1–3] for more discussion and examples of the natural embedding discussed above, along with illustrations of its power to organize unordered data, as well as its insensitivity to noise.

We would now like to point out what we see as some drawbacks of Coifman's distance, which led us to propose an alternative distance in Section 4.

Let us consider (3.4) for the case where the fundamental solution to the heat equation in . Then, If is small, then to the leading order in , Thus, if , we do recover the geodesic distance between and but, due to the term in front, normalized by a power of which depends on the dimension . As pointed out by the reviewer, for itself, the normalization does depend on , but is simply a global change of scale, for each , and thus basically immaterial. Suppose, however, that the data we are considering come in two “clumps", one of dimension and the other of dimension , with . Let us also suppose these clumps are somehow joined together and, far away from the joining region, each clump is basically a flat Euclidean space of the corresponding dimension. Then, far away from the joint, heat diffusion in a particular clump would behave as if it were in , respectively (until the time that the flowing heat “hits" the joint region). Thus, in the part of each clump that is far from the joint, the diffusion distance would be normalized differently, one normalization depending on and the other on . An overall change of scale would not remove this difference, thus we would not recover the usual Euclidean distance in the two clumps simultaneously, as we would like.
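The dimension-dependent prefactor discussed above can be made explicit with a short worked computation. This is a sketch under stated assumptions: the heat equation is normalized as $\partial_t u = \Delta u$ (the source's normalization may differ by constants), $\rho_t$ denotes the heat kernel, and $D_t$ is used here as a label for Coifman's distance in the symmetric-kernel form referred to as (3.4).

```latex
% Assumption: heat equation u_t = \Delta u on R^n, so the kernel is
\rho_t(x, y) = (4\pi t)^{-n/2}\, e^{-|x-y|^2/(4t)} .

% Symmetric-kernel form of Coifman's squared distance, then small-distance expansion:
\begin{aligned}
[D_t(x, y)]^2
  &= \rho_{2t}(x, x) + \rho_{2t}(y, y) - 2\rho_{2t}(x, y) \\
  &= 2\,(8\pi t)^{-n/2}\bigl(1 - e^{-|x-y|^2/(8t)}\bigr) \\
  &\approx (8\pi t)^{-n/2}\, \frac{|x-y|^2}{4t}
  \qquad \text{when } |x-y| \ll \sqrt{t} .
\end{aligned}
```

The factor $(8\pi t)^{-n/2}$ is the power of $t$ whose exponent depends on the dimension $n$.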

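The second point of concern is more general in nature. In the continuous case, Coifman's distance involves the $L^2$ distance between $T_t f$, when $f = \delta_x$, and $T_t f$ when $f = \delta_y$; see (3.1). The $L^1$ norm of $T_t\delta_x$ is $1$, since

$$\int_M |T_t\delta_x(u)|\, du = \int_M \rho_t(u, x)\, du = 1,$$

using the mass preservation assumption of (2.2). For the discrete case, $\|(A^T)^t e_i\|_1 = \mathbf{1}^T (A^T)^t e_i = 1$, where $\mathbf{1}$ is the vector of $1$'s.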

So the diffusion distance proposed by Coifman finds the $L^2$ (resp., $\ell^2$) distance between $L^1$ (resp., $\ell^1$) normalized vectors. Let us illustrate by an example for the discrete situation, with $N = 10{,}000$, in which this may lead to undesired results. Without specifying the matrix $A$, suppose that after some time has passed, we have the following two vectors giving two different results of diffusion:

$$u = (0.01, \ldots, 0.01, 0, \ldots, 0)^T,$$

where the first one hundred entries are each $0.01$, and the remaining entries are $0$, and

$$v = (0.0001, \ldots, 0.0001)^T,$$

where each entry is $0.0001$.

Note that $u$ and $v$ both have $\ell^1$ norm 1. Now, considering two canonical basis vectors $e_i$ and $e_j$, $i \ne j$, each of which has $\ell^1$ norm $1$, we see that $\|e_i - e_j\|_2 = \sqrt{2}$. So, a distance of $\sqrt{2}$ gives the (in fact, maximum) separation between two completely different ($\ell^1$ unit) diffusion vectors. Returning to $u$ and $v$, note that $v$ corresponds to total diffusion, while $u$ has only diffused over 1% of the entries. We would thus hope that $u$ and $v$ would be nearly as much separated as $e_i$ and $e_j$, that is, have diffusion distance not much smaller than $\sqrt{2}$. But a trivial calculation shows that

$$\|u - v\|_2 = \sqrt{0.0099} \approx 0.0995,$$

which seems much smaller than what we would like. The problem is that $\|u - v\|_2$ is small since the $\ell^2$ norm of each of $u$ and $v$ is small, even though the $\ell^1$ norm of each is $1$.
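The example above is easy to check numerically. The sketch below assumes $N = 10{,}000$ (so that one hundred entries are 1% of the total) and uses the illustrative names `u`, `v`, and `unit`; it also computes, for comparison, the distance obtained after rescaling each vector to unit $\ell^2$ norm, which is the normalization proposed in the next section.

```python
import numpy as np

N = 10_000                      # assumed size: 100 entries are 1% of N
u = np.zeros(N)
u[:100] = 1.0 / 100             # mass spread over the first 100 entries only
v = np.full(N, 1.0 / N)         # total diffusion: mass spread uniformly

def unit(w):
    """Rescale w to have Euclidean (l2) norm 1."""
    return w / np.linalg.norm(w)

l1_norms = (np.abs(u).sum(), np.abs(v).sum())        # both equal 1
raw_distance = np.linalg.norm(u - v)                 # l2 distance between l1-unit vectors
chord_distance = np.linalg.norm(unit(u) - unit(v))   # l2 distance after l2 normalization

print(l1_norms, raw_distance, chord_distance, np.sqrt(2))
```

The raw distance is $\sqrt{0.0099} \approx 0.0995$, whereas the chord distance is $\sqrt{1.8} \approx 1.342$, close to the maximum separation $\sqrt{2} \approx 1.414$ (the cosine of the angle between $u$ and $v$ is $0.1$).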

In the next section, we propose a variant of the diffusion distance discussed in this section. Our version will find the $L^2$ (resp., $\ell^2$) distance between vectors which are normalized to have $L^2$ (resp., $\ell^2$) norm $1$, rather than $L^1$ (or $\ell^1$) norm $1$.

4. An Alternate Diffusion Distance

In this section, we propose a new diffusion distance. Let us first define our alternate diffusion distance for the continuous case. Refer to Section 2 for the definitions of functions and operators used below.

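For any $f$, let

$$N_t f = \frac{T_t f}{\|T_t f\|_2}. \tag{4.1}$$

Then,

$$N_t\delta_x = \frac{\rho_t(\cdot, x)}{\left(\int_M \rho_t(u, x)^2\, du\right)^{1/2}}. \tag{4.2}$$

Note that $N_t\delta_x$ has $L^2$ norm $1$:

$$\langle N_t\delta_x, N_t\delta_x \rangle = 1. \tag{4.3}$$

For $x, y \in M$, we define our diffusion distance, $\widetilde{D}_t(x, y)$, as follows:

$$[\widetilde{D}_t(x, y)]^2 = \langle N_t\delta_x - N_t\delta_y,\; N_t\delta_x - N_t\delta_y \rangle = 2 - 2\langle N_t\delta_x, N_t\delta_y \rangle, \tag{4.4}$$

where we have used (4.3). Here again, $\langle \cdot, \cdot \rangle$ is the usual inner product on $L^2(M)$. Note the analogy to (3.3).

As is clear from the defining equality in (4.4), $\widetilde{D}_t(x, y)$ is symmetric in $x$ and $y$ and satisfies the triangle inequality:

$$\widetilde{D}_t(x, y) \le \widetilde{D}_t(x, z) + \widetilde{D}_t(z, y) \tag{4.5}$$

for all $x, y, z \in M$. Geometrically, $\widetilde{D}_t(x, y)$ is the length of the chord joining the tips of the unit vectors $N_t\delta_x$ and $N_t\delta_y$. We have that

$$0 \le \widetilde{D}_t(x, y) \le \sqrt{2} \tag{4.6}$$

for all $x, y \in M$ and $t$.

If $\rho_t$ is symmetric in the space variables, by (2.6), we have that

$$[\widetilde{D}_t(x, y)]^2 = 2 - \frac{2\rho_{2t}(x, y)}{\sqrt{\rho_{2t}(x, x)\, \rho_{2t}(y, y)}}. \tag{4.7}$$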

As an example, again consider the case where , the fundamental solution to the heat equation in . Then, Note that if , then so gives (approximately) the geodesic distance in , in the “near regime" where , and with scale . Note that unlike (3.12), no term appears. (Also see the discussion following (3.12).) Also note that if is large, (the greatest possible distance, see (4.6)), so for such the points and are (nearly) maximally separated. Hence, , for the case of heat flow in , gives a scaled geodesic distance when is close to , with as the unit of length and near maximum separation when is far from at the scale .
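The near and far regimes described above can be seen in a short worked computation. This is a sketch under stated assumptions: the heat equation is normalized as $\partial_t u = \Delta u$, $\rho_t$ is the heat kernel on $\mathbb{R}^n$, and $\widetilde{D}_t$ is used as a label for the normalized distance of this section in its symmetric-kernel form.

```latex
% Assumption: \rho_t(x, y) = (4\pi t)^{-n/2} e^{-|x-y|^2/(4t)}  (heat equation u_t = \Delta u).
\begin{aligned}
[\widetilde{D}_t(x, y)]^2
  &= 2 - \frac{2\rho_{2t}(x, y)}{\sqrt{\rho_{2t}(x, x)\,\rho_{2t}(y, y)}}
   = 2\bigl(1 - e^{-|x-y|^2/(8t)}\bigr), \\[4pt]
\widetilde{D}_t(x, y)
  &\approx \frac{|x-y|}{2\sqrt{t}}
  \qquad \text{when } |x-y| \ll \sqrt{t}, \\[4pt]
\widetilde{D}_t(x, y)
  &\to \sqrt{2}
  \qquad \text{as } |x-y|/\sqrt{t} \to \infty .
\end{aligned}
```

The prefactor $(8\pi t)^{-n/2}$ cancels in the ratio, which is why no dimension-dependent power of $t$ survives.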

For any, say, compact Riemannian manifold , if is the fundamental solution to the heat equation on , we have that where is the geodesic distance on (see [9]). Hence, repeating the expansion in (4.9) for a compact manifold , with small, and , we have that , again recovering (scaled) geodesic distance. (The discussion following (3.12) gives an example for which it would be preferable not to have presented a normalization factor which depends on the dimension.) Exponentially decaying bounds on the fundamental solution of the heat equation for a manifold (see [9, Chapter , Section ]), suggest that and become nearly maximally separated, as given by , when (scaled by ) is large, just as in the Euclidean case.

In the discrete situation, where we start with a set of data points , and is a Markov matrix specifying the transition probabilities between the “states" of , for we let where is the canonical basis vector (see Section 2), and is the vector norm. For and , we define by where and are, respectively, the usual inner product and norm in , and, for a matrix , denotes the entry of .

If is symmetric, As before, represents the tick of the clock.
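A minimal Python sketch of the discrete normalized distance follows. It assumes the definition described above (the $\ell^2$ distance between the $\ell^2$-normalized vectors $(A^T)^t e_i$ and $(A^T)^t e_j$); the function name `normalized_diffusion_distance` and the random test matrix are illustrative assumptions. The matrix is deliberately not symmetric, since the definition does not require symmetry.

```python
import numpy as np

def normalized_diffusion_distance(A, i, j, t=1):
    """Chord length between the l2-normalized vectors (A^T)^t e_i and (A^T)^t e_j."""
    At = np.linalg.matrix_power(A.T, t)
    p = At[:, i] / np.linalg.norm(At[:, i])
    q = At[:, j] / np.linalg.norm(At[:, j])
    return float(np.linalg.norm(p - q))

# Hypothetical test data: a random row-stochastic (Markov) matrix, generally not symmetric.
rng = np.random.default_rng(0)
R = rng.random((6, 6))
A = R / R.sum(axis=1, keepdims=True)

t = 2
D = np.array([[normalized_diffusion_distance(A, i, j, t) for j in range(6)]
              for i in range(6)])
print(D.round(4))
```

Each column $(A^T)^t e_i$ has entries summing to 1, so it is never the zero vector and the normalization is well defined. The resulting matrix `D` is symmetric with entries between $0$ and $\sqrt{2}$.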

5. The Mean First Passage Cost Satisfies the Triangle Inequality: An Example of Its Normalized Limit

In this section, we consider a slightly different topic: the mean first passage cost (defined below) between two states as a measure of separation in the discrete situation. We give two explicit proofs showing that the mean first passage cost satisfies the triangle inequality (in [8], the author states this result, when all costs are equal to $1$, as a theorem, but the proof is not very explicit in our opinion).

In [10–12], as well as some of the references listed therein, it is shown that the symmetrized mean first passage time and cost are metrics (for mean first passage cost see, in particular, [10]; also, in the above sources the symmetrized mean first passage time is called the commute time). "Symmetrized" refers to the sum of the first cost (time) to reach a specified state from a starting state and to return back to the starting state. This symmetrization is necessary to ensure a quantity symmetric in the starting and destination states. In the sources cited above, the fundamental underlying operator is the graph Laplacian $L$, which, using the notation of [12], is defined as $L = D - W$. Here, $W = (w_{ij})$ is the adjacency matrix of a graph, and $D$ is the diagonal degree matrix, with the $i$th entry on the diagonal equaling $\sum_j w_{ij}$. In addition to assuming the nonnegativity of the $w_{ij}$'s, the authors in the above works assume that $W$ is symmetric. The resulting symmetry (and positive semi-definiteness of $L$) implies the existence of a full set of nonnegative eigenvalues of $L$, and the diagonalizability of $L$ is used heavily in the proofs that the commute time/cost is a distance. In the random walk interpretation, see, for example, [12], the following normalized Laplacian is relevant: $D^{-1}L = I - D^{-1}W$. To make a connection with the notation in the present paper, $A = D^{-1}W$, a Markov matrix giving the transition probabilities of the random walk. Although $D^{-1}W$ is not necessarily symmetric, it is easy to see that $D^{-1}W = D^{-1/2}\,(D^{-1/2} W D^{-1/2})\, D^{1/2}$ (see the discussion in [12]). Hence $D^{-1}W$, while not itself symmetric in general, is conjugate to the symmetric matrix $D^{-1/2} W D^{-1/2}$, and thus too has a full complement of eigenvalues.

In this section, as in the rest of the paper unless stated otherwise, we are not assuming that the Markov matrix $A$ is symmetric or conjugate to a symmetric matrix; hence $A$ may not be diagonalizable (i.e., $A$ may have Jordan blocks of dimension greater than $1$). We thus do not have spectral theory available to us. Furthermore, we do not wish to necessarily symmetrize the mean first passage time/cost to obtain a symmetric quantity; we are not actually going to get a distance, but will try to obtain the "most important" property of being a distance, namely, the triangle inequality.

A model example we are thinking about is the following. Suppose we have a map grid and are tracking some localized storm which is currently at some particular location on the grid. We suppose that the storm behaves like a random walk and has a certain (constant in time) probability to move from one grid location to another at each “tick of the clock" (time step). We can thus model the movements of the storm by a Markov matrix , with the power of giving the transition probabilities after ticks of the clock. If there is no overall wind, the matrix could reasonably be assumed to be symmetric, and we could use spectral theory. But suppose there is an overall wind in some fixed direction, which is making it more probable for the storm to move north, say, rather than south. Then the matrix is not symmetric; there is a preferred direction of the storm to move in, from one tick of the clock to the next; spectral theory cannot, in general, be used. Furthermore, it may not be reasonable in this situation to consider the commute time—the symmetrized mean first passage time—since we may rather want to know the expected time to reach a certain population center from the current location of the storm, and may not care about the storm's return to the original location. Thus the mean first passage time would be the quantity of interest.

In the first part of this section, we give two proofs that the mean first passage cost/time, associated with a not-necessarily-symmetric Markov matrix , does indeed satisfy the triangle inequality; our proofs do not rely on spectral theory. We think that satisfying the triangle inequality, while in general failing to be symmetric, is still a very useful property for a bilinear form to have.

We conclude the section by exhibiting a connection between the (normalized) mean first passage time and the discretized solution of a certain Dirichlet-Poisson problem and verify our result numerically for the simple case of the unit circle.

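In this section, $S = \{s_1, \ldots, s_N\}$ is a finite set of states and $A = (a_{ij})$ is a Markov matrix giving the transition probabilities between states in one tick of the clock (see Section 2). $C = (c_{ij})$ will denote an $N \times N$ matrix with non-negative entries, $c_{ij} \ge 0$. We will think of each $c_{ij}$ as the "cost" associated with the transition from state $s_i$ to state $s_j$. By a slight abuse of notation, for $1 \le j \le N$, $P_j$ will be the $N \times N$ matrix in which all entries are $0$, except the $(j, j)$ entry which is $1$ (this corresponds to the projection onto the single state $s_j$ in Section 2). Also, $P_{jk}$ will be the $N \times N$ matrix in which all entries are $0$, except the $(j, j)$ and $(k, k)$ entries each of which is $1$ (this corresponds to the projection onto the pair of states $s_j, s_k$ in Section 2).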

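Let $X_{ij}$ be the random variable which gives the cost accumulated by a particle starting at state $s_i$ until its first visit to state $s_j$ after leaving $s_i$. In other words, if a particular path of the particle is given by the states $s_i = s_{k_0}, s_{k_1}, \ldots, s_{k_m} = s_j$, the value of $X_{ij}$ is $c_{k_0 k_1} + c_{k_1 k_2} + \cdots + c_{k_{m-1} k_m}$. We suppose the Markov matrix $A$ has the property that for every $i, j$ there exists an $n$ such that $(A^n)_{ij} > 0$, that is, every state $s_j$ is eventually reachable from every state $s_i$. Then, as is shown in [13] (using slightly different notation), we have the following formula for $E(X_{ij})$, which is the expected cost of going from state $s_i$ to state $s_j$:

$$E(X_{ij}) = e_i^T \bigl(I - A(I - P_j)\bigr)^{-1} (A \circ C)\, \mathbf{1}, \tag{5.1}$$

where $A \circ C$ is the $N \times N$ matrix with $(k, l)$ entry equal to $a_{kl} c_{kl}$, and $P_j$ is the matrix with $(j, j)$ entry $1$ and all other entries $0$. (In particular, it is shown in [13] that $I - A(I - P_j)$ is invertible and $(A(I - P_j))^m \to 0$, as $m \to \infty$.) See [14, 15] for discussion of related expected values, and [8, 10–12, 16–18] for discussion of mean first passage times and related concepts.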
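The expected cost can be computed by first-step analysis: the expected cost from a state equals the expected cost of the next transition plus the expected remaining cost from wherever the walker lands, unless it lands on the target. The Python sketch below solves the resulting linear system. It is a sketch based on this standard recurrence rather than a transcription of the formula from [13]; the function name `mean_first_passage_cost` and the example matrices are illustrative assumptions.

```python
import numpy as np

def mean_first_passage_cost(A, C, j):
    """Return m with m[i] = expected cost to first reach state j after leaving state i.

    Solves m = (A * C) 1 + Q m, where Q is A with column j zeroed
    (the walk stops accumulating cost once it lands on j).
    """
    n = A.shape[0]
    Q = A.copy()
    Q[:, j] = 0.0
    step_cost = (A * C).sum(axis=1)      # expected cost of the next transition
    return np.linalg.solve(np.eye(n) - Q, step_cost)

# Hypothetical non-symmetric Markov matrix in which every state reaches every other.
A = np.array([[0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6],
              [0.7, 0.1, 0.2]])
C = np.ones((3, 3))                      # unit costs: expected cost = mean first passage time

m_to_0 = mean_first_passage_cost(A, C, 0)
print(m_to_0)
```

With unit costs, the entry `m_to_0[0]` is the mean return time to state 0, which for an irreducible chain equals the reciprocal of the stationary probability of that state.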

We will give two proofs that the expected cost of going from one state to another satisfies the triangle inequality.

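Proposition 5.1. For all states $s_i$, $s_j$, $s_k$, the expected cost of going from $s_i$ to $s_k$ is at most the sum of the expected cost of going from $s_i$ to $s_j$ and the expected cost of going from $s_j$ to $s_k$.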

We again note that this proposition, for the case all costs are , is stated as Theorem in [8], but we feel the proof is not very explicit. (In our proofs below, we assume ; if , the inequality in Proposition 5.1 is immediate.)

Proof. Our first proof is probabilistic. Let a random walker start at state and accumulate costs given by the matrix as he moves from state to state. As soon as the walker reaches state , we obtain a sample value of . Now, at this point of the walk, there are two possibilities. Either the walker has passed through state before his first visit to after leaving , or he has not. In the first instance, we have obtained sample values of and along the way, and for this simulation. In the second case, we let the walker continue until he first reaches , to obtain a sample value of , and walk still more until he reaches for the first time since leaving , thus giving a sample value of (note that by the memoryless property, this sample value of is independent of the walker's prior history). In the second case, we thus clearly have that . Combining the two cases, we have . Repeating the simulation, averaging, and taking the limit as the number of simulations goes to infinity, we obtain that
(2) Our second proof is via explicit matrix computations. Let us define the following two quantities: (See Section 2 and the paragraphs before the statement of Proposition 5.1.) Now, we have see (5.1). Also, But where we have used a series expansion to show the first inequality (all entries are non-negative), and the fact that , since . Thus, .
We will finish our second proof by showing that . First note that using that . Thus, Here we have used the fact that (as mentioned earlier, we are assuming ; the triangle inequality we are proving holds trivially for the case ).
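The triangle inequality can also be checked numerically on a random example. The sketch below is an illustration, not part of either proof: it builds a random non-symmetric Markov matrix and random non-negative costs, computes all expected first passage costs by first-step analysis, and reports the smallest slack over all triples of states. The names `cost_matrix`, `slack`, and `worst` are illustrative assumptions.

```python
import numpy as np

def cost_matrix(A, C):
    """M[i, j] = expected cost of first reaching state j after leaving state i."""
    n = A.shape[0]
    step_cost = (A * C).sum(axis=1)          # expected cost of the next transition
    M = np.empty((n, n))
    for j in range(n):
        Q = A.copy()
        Q[:, j] = 0.0                        # stop accumulating once j is reached
        M[:, j] = np.linalg.solve(np.eye(n) - Q, step_cost)
    return M

rng = np.random.default_rng(1)
n = 7
R = rng.random((n, n))
A = R / R.sum(axis=1, keepdims=True)         # random Markov matrix, not symmetric
C = rng.random((n, n))                       # random non-negative transition costs

M = cost_matrix(A, C)

# slack[i, j, k] = M[i, j] + M[j, k] - M[i, k]; the triangle inequality says slack >= 0.
slack = M[:, :, None] + M[None, :, :] - M[:, None, :]
worst = slack.min()
asymmetry = np.abs(M - M.T).max()

print("smallest slack:", worst)
print("largest |M[i,j] - M[j,i]|:", asymmetry)
```

The slack should be non-negative up to rounding for every triple, while the asymmetry is generally nonzero, which matches the remark that the quantity satisfies the triangle inequality without being symmetric.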

We would like to point out that the decomposition of in the second proof above is not a “miraculous" guess. We arrived at this decomposition by writing as the derivative (evaluated at ) of the characteristic function (Fourier transform) of (see [13]), and breaking up the expression to be differentiated into a sum of terms: one term corresponding to the random walk going from to without visiting first, and one term corresponding to visiting before reaching . After differentiation, the resulting six pieces, when suitably combined into two terms, yielded and .

We conclude this section by considering certain (suitably normalized) limiting values of the expected cost of going from state to state , for the first time after leaving , given by (5.1). For this discussion, we will take all the costs to be identically , that is, for all . Then, we see from (5.1) that where we have used , since .

Now, let us digress a little to describe a stochastic approach to solving certain boundary value problems. The description below follows very closely parts of a chapter in [19]. Some statements are excerpted verbatim from that work, with minor changes in some labels. The background results below are well known and are often referred to as Dynkin's formula (see, e.g., [20]). We are presenting them for the reader's convenience and will use them to exhibit an interesting connection between the mean first passage time and the discretized solution of a certain Dirichlet-Poisson problem; see (5.16).

Let be a domain in , and let denote a partial differential operator on of the form: where . We assume each is bounded and has bounded first and second partial derivatives; also, each is Lipschitz. Suppose is uniformly elliptic in (i.e., all the eigenvalues of the symmetric matrix are positive and stay uniformly away from in ). Then, for , some , and bounded, and for , the function defined below solves the following Dirichlet-Poisson problem: (Regular points in this context are defined in [19] and turn out to be the same as the regular points in the classical sense, i.e., the points on where the limit of the generalized Perron-Wiener-Brelot solution coincides with , for all .)

Now we define . We choose a square root of the matrix , that is, Next, for , let be an Itô diffusion solving where is n-dimensional Brownian motion. Then, is a solution of (5.10). Here, the expected values are over paths starting from , and is the first exit time from .

Let us transfer the above discussion to, say, a compact manifold , rather than . We sample and let the “states" be the sample points. We construct a transition matrix to give a discretized version of (5.12). Let be the approximate separation between the sample points. Fix a sample point , and let be the domain in consisting of the complement of the closure of the ball of radius in , center . Let be a sample point in . For this situation, in (5.10), let be the function, and be the constant function. Then (5.13) becomes: is the first exit time from , that is, first visit time to the neighborhood of . (Compare with Proposition in [21] which discusses the case of the Dirichlet-Poisson problem (5.10) with and for a manifold.) As shown in [13] (with slightly different notation), a discrete version of (5.14) is where , all , and , for all not equal to , and all . Thus, . Combining (5.15) and (5.8), we see that: for small.

We thus see a connection between the (normalized) mean first passage time and the solution to the Dirichlet-Poisson problem discussed above.

Let us illustrate the preceding discussion by a simple example: , the unit circle. We will consider , the Laplacian on , and sample uniformly. We will let the transition matrix take the walker from the current state to each of the two immediate neighbor states, with probability for each. The variance is then . Since , see (5.12), we must have , and we should use as our value of in (5.16). Using symmetry, we can take , the angle on , without loss of generality. Let Note that So is the unique solution satisfying (5.10) for our example, on the domain .

To numerically confirm (5.16), we ran numerical experiments in which we discretized into equispaced points, with the transition matrix taking a state to each of its 2 immediate neighbors with probability , and used as the value of in (5.16) to calculate . We took to be the angle , and to be the closest sample point to the angle with radian measure , for example. Letting in (5.17), we compared the value of with . For instance, with , , and the nearest sample point to the angle with radian measure , the relative error is less than . Note that is, for close to , essentially a scaled geodesic distance on (from our base angle ).
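A numerical experiment of this kind can be sketched as follows. The sketch makes several assumptions that should be read as such: the operator is taken to be $d^2/d\theta^2$, so that one step of size $h = 2\pi/N$ has variance $h^2 = 2\Delta t$; the target is the single sample point at angle $0$ rather than a small neighborhood of it; and the comparison function $\theta(2\pi - \theta)/2$ is derived here independently as the solution of $w'' = -1$ on $(0, 2\pi)$ with $w(0) = w(2\pi) = 0$. The names `cycle_walk` and `hitting_times` and the choice $N = 400$ are illustrative.

```python
import numpy as np

def cycle_walk(n):
    """Simple random walk on n equispaced points of the circle (prob. 1/2 to each neighbor)."""
    A = np.zeros((n, n))
    for k in range(n):
        A[k, (k + 1) % n] = 0.5
        A[k, (k - 1) % n] = 0.5
    return A

def hitting_times(A, target):
    """m[i] = expected number of steps to first reach `target` after leaving state i."""
    n = A.shape[0]
    Q = A.copy()
    Q[:, target] = 0.0
    return np.linalg.solve(np.eye(n) - Q, np.ones(n))

n = 400
h = 2 * np.pi / n
dt = h**2 / 2                      # assumed time step: step variance h^2 = 2 * dt

A = cycle_walk(n)
m = hitting_times(A, 0)

theta = h * np.arange(n)
predicted = 0.5 * theta * (2 * np.pi - theta)   # solves w'' = -1, w(0) = w(2*pi) = 0

# Compare away from the target itself (index 0 gives the return time, not a passage time).
rel_err = np.max(np.abs(dt * m[1:] - predicted[1:]) / predicted[1:])
print("max relative error:", rel_err)
```

For this simple walk the expected number of steps from the point $k$ steps away is $k(N - k)$, so under the stated assumptions the scaled value `dt * m[k]` matches the comparison function up to rounding. For $\theta$ close to $0$ the comparison function is approximately $\pi\theta$, a scaled geodesic distance from the base angle.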

6. Conclusions

We have presented a diffusion distance which uses $L^2$ unit vectors, and which is based on the well-known cosine similarity. We have discussed why the normalization may make diffusion distances more meaningful. We also gave two explicit proofs of the triangle inequality for mean first passage cost, and exhibited a connection between the (normalized) mean first passage time and the discretized solution of a certain Dirichlet-Poisson problem.

Acknowledgments

We thank Raphy Coifman for his continuous generosity in sharing his enthusiasm for mathematics and his ideas about diffusion and other topics. We would also like to thank the anonymous reviewer for his/her thorough critique of this paper and many helpful suggestions.