On Hölder projective divergences*

Frank Nielsen† (École Polytechnique, France; Sony Computer Science Laboratories, Japan)
Ke Sun (Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Saudi Arabia)
Stéphane Marchand-Maillet (Viper Group, Computer Vision and Multimedia Laboratory, University of Geneva, Switzerland)

Abstract

We describe a framework to build distances by measuring the tightness of inequalities, and introduce the notion of proper statistical divergences and improper pseudo-divergences. We then consider the Hölder ordinary and reverse inequalities, and present two novel classes of Hölder divergences and pseudo-divergences that both encapsulate the special case of the Cauchy-Schwarz divergence. We report closed-form formulas for those statistical dissimilarities when considering distributions belonging to the same exponential family provided that the natural parameter space is a cone (e.g., multivariate Gaussians) or affine (e.g., categorical distributions). Those new classes of Hölder distances are invariant to rescaling, and thus do not require distributions to be normalized. Finally, we show how to compute statistical Hölder centroids with respect to those divergences, and carry out center-based clustering toy experiments on a set of Gaussian distributions that demonstrate empirically that symmetrized Hölder divergences outperform the symmetric Cauchy-Schwarz divergence.

Keywords: Hölder inequalities; Hölder divergences; projective divergences; Cauchy-Schwarz divergence; Hölder escort divergences; skew Bhattacharyya divergences; exponential families; conic exponential families; escort distribution; clustering.

1 Introduction: Inequality, proper divergence and improper pseudo-divergence

1.1 Statistical divergences from inequality gaps

An inequality is denoted mathematically by lhs ≤ rhs, where lhs and rhs denote respectively the left-hand side and right-hand side of the inequality. One can build dissimilarity measures from inequalities lhs ≤ rhs by measuring the inequality tightness: For example, we may quantify the tightness of an inequality by its difference gap:

\[
\Delta = \mathrm{rhs} - \mathrm{lhs} \geq 0. \tag{1}
\]

When lhs > 0, the inequality tightness can also be gauged by the log-ratio gap:

\[
D = \log\frac{\mathrm{rhs}}{\mathrm{lhs}} = -\log\frac{\mathrm{lhs}}{\mathrm{rhs}} \geq 0. \tag{2}
\]

We may further compose this inequality tightness value measuring non-negative gaps with a strictly monotonically increasing function f (with f(0) = 0).

A bi-parametric inequality lhs(p,q) ≤ rhs(p,q) is called proper if it is strict for p ≠ q (i.e., lhs(p,q) < rhs(p,q), ∀p ≠ q) and tight if and only if (iff) p = q (i.e., lhs(p,q) = rhs(p,q), ∀p = q). Thus a proper bi-parametric inequality allows one to define dissimilarities such that D(p,q) = 0 iff p = q. Such a dissimilarity is called proper. Otherwise, an inequality or dissimilarity is said to be improper. Note that there are many equivalent words used in the literature instead of (dis-)similarity: distance (although often assumed to have metric properties), pseudo-distance, discrimination, proximity, information deviation, etc.

*Reproducible source code available at https://www.lix.polytechnique.fr/~nielsen/HPD/
†Contact author: [email protected]
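To make the difference gap (1) and the log-ratio gap (2) concrete, here is a minimal numerical sketch (our own illustration, not part of the paper's companion code; Python with NumPy is assumed). It uses the arithmetic-geometric mean inequality √(pq) ≤ (p+q)/2, a proper bi-parametric inequality that is tight iff p = q, as a toy instance.

```python
import numpy as np

def difference_gap(lhs, rhs):
    # Eq. (1): Delta = rhs - lhs >= 0 for a valid inequality lhs <= rhs.
    return rhs - lhs

def log_ratio_gap(lhs, rhs):
    # Eq. (2): D = log(rhs / lhs) >= 0, defined when lhs > 0.
    return np.log(rhs) - np.log(lhs)

# Toy proper bi-parametric inequality: arithmetic-geometric mean, tight iff p = q.
p, q = 2.0, 8.0
lhs, rhs = np.sqrt(p * q), 0.5 * (p + q)
print(difference_gap(lhs, rhs))                               # 1.0 (> 0 since p != q)
print(log_ratio_gap(lhs, rhs))                                # log(5/4) ~= 0.223
print(log_ratio_gap(np.sqrt(3.0 * 3.0), 0.5 * (3.0 + 3.0)))   # 0.0 when p = q
```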
A statistical dissimilarity between two discrete or continuous distributions p(x) and q(x) on a support X can thus be defined from inequalities by summing up or taking the integral of the inequalities instantiated on the observation space X:

\[
\forall x \in \mathcal{X}, \quad D_x(p,q) = \mathrm{rhs}(p(x),q(x)) - \mathrm{lhs}(p(x),q(x)) \;\Rightarrow \tag{3}
\]
\[
D(p,q) = \begin{cases} \sum_{x \in \mathcal{X}} \left[ \mathrm{rhs}(p(x),q(x)) - \mathrm{lhs}(p(x),q(x)) \right] & \text{discrete case}, \\ \int_{\mathcal{X}} \left[ \mathrm{rhs}(p(x),q(x)) - \mathrm{lhs}(p(x),q(x)) \right] \mathrm{d}x & \text{continuous case}. \end{cases} \tag{4}
\]

In such a case, we get a separable divergence. Some non-separable inequalities induce a non-separable divergence. For example, the renowned Cauchy-Schwarz divergence [8] is not separable because in the inequality:

\[
\int_{\mathcal{X}} p(x) q(x) \mathrm{d}x \leq \sqrt{\left( \int_{\mathcal{X}} p(x)^2 \mathrm{d}x \right) \left( \int_{\mathcal{X}} q(x)^2 \mathrm{d}x \right)}, \tag{5}
\]

the rhs is not separable. Furthermore, a proper dissimilarity is called a divergence in information geometry [2] when it is C^3 (i.e., three times differentiable, thus allowing one to define a metric tensor [34] and a cubic tensor [2]).

Many familiar distances can be reinterpreted as inequality gaps in disguise. For example, Bregman divergences [4] and Jensen divergences [28] (also called Burbea-Rao divergences [9, 22]) can be reinterpreted as inequality difference gaps, and the Cauchy-Schwarz distance [8] as an inequality log-ratio gap:

Example 1 (Bregman divergence as a Bregman score-induced gap divergence). A proper score function [14] S(p:q) induces a gap divergence D(p:q) = S(p:q) − S(p:p) ≥ 0. A Bregman divergence [4] B_F(p:q) for a strictly convex and differentiable real-valued generator F(x) is induced by the Bregman score S_F(p:q). Let S_F(p:q) = −F(q) − ⟨p−q, ∇F(q)⟩ denote the Bregman proper score minimized for p = q. Then the Bregman divergence is a gap divergence: B_F(p:q) = S_F(p:q) − S_F(p:p) ≥ 0. When F is strictly convex, the Bregman score is proper, and the Bregman divergence is proper.

Example 2 (Cauchy-Schwarz distance as a log-ratio gap divergence). Consider the Cauchy-Schwarz inequality ∫_X p(x)q(x)dx ≤ √((∫_X p(x)² dx)(∫_X q(x)² dx)). Then the Cauchy-Schwarz distance [8] between two continuous distributions is defined by

\[
\mathrm{CS}(p(x):q(x)) = -\log \frac{\int_{\mathcal{X}} p(x) q(x) \mathrm{d}x}{\sqrt{\left(\int_{\mathcal{X}} p(x)^2 \mathrm{d}x\right)\left(\int_{\mathcal{X}} q(x)^2 \mathrm{d}x\right)}} \geq 0.
\]

Note that we use the modern notation D(p(x):q(x)) to emphasize that the divergence is potentially asymmetric: D(p(x):q(x)) ≠ D(q(x):p(x)), see [2]. In information theory [11], the older notation "||" is often used instead of the ":" that is used in information geometry [2].

To conclude this introduction, let us finally introduce the notion of projective statistical distances. A statistical distance D(p(x):q(x)) is said to be projective when:

\[
D(\lambda p(x) : \lambda' q(x)) = D(p(x):q(x)), \quad \forall \lambda, \lambda' > 0. \tag{6}
\]

The Cauchy-Schwarz distance is a projective divergence. Another example of such a projective divergence is the parametric γ-divergence [13].

Example 3 (γ-divergence as a projective score-induced gap divergence). The γ-divergence [13, 29] D_γ(p(x):q(x)) for γ > 0 is projective:

\[
D_\gamma(p(x):q(x)) = S_\gamma(p(x):q(x)) - S_\gamma(p(x):p(x)), \quad \text{with} \quad
S_\gamma(p(x):q(x)) = -\frac{1}{\gamma(1+\gamma)} \, \frac{\int p(x)\, q(x)^{\gamma} \mathrm{d}x}{\left(\int q(x)^{1+\gamma} \mathrm{d}x\right)^{\frac{\gamma}{1+\gamma}}}.
\]

The γ-divergence is related to the proper pseudo-spherical score [13]. The γ-divergences have proven useful for robust statistical inference [13] in the presence of heavy outlier contamination.
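The following sketch (ours; Python/NumPy assumed, integrals approximated by Riemann sums on a discretized support) illustrates Example 2 and Eq. (6): it computes the Cauchy-Schwarz distance between two Gaussian densities and checks its projective invariance under rescaling of either argument.

```python
import numpy as np

def cauchy_schwarz_divergence(p, q, dx):
    # CS(p:q) = -log( <p,q> / (||p||_2 ||q||_2) ), with integrals as Riemann sums.
    inner = np.sum(p * q) * dx
    norm_p = np.sqrt(np.sum(p ** 2) * dx)
    norm_q = np.sqrt(np.sum(q ** 2) * dx)
    return -np.log(inner / (norm_p * norm_q))

x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]
gauss = lambda mu, s: np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
p, q = gauss(0.0, 1.0), gauss(2.0, 1.5)

print(cauchy_schwarz_divergence(p, q, dx))   # > 0 since p != q
print(cauchy_schwarz_divergence(p, p, dx))   # ~ 0 (the inequality is tight for q = p)
# Projectivity, Eq. (6): rescaling either density leaves the value unchanged.
print(np.isclose(cauchy_schwarz_divergence(3.0 * p, 0.5 * q, dx),
                 cauchy_schwarz_divergence(p, q, dx)))   # True
```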
1.2 Pseudo-divergences and the axiom of indiscernibility

Consider a broader class of statistical pseudo-divergences based on improper inequalities, where the tightness of lhs(p,q) ≤ rhs(p,q) does not imply that p = q. This family of dissimilarity measures has interesting properties which have not been studied before. Formally, statistical pseudo-divergences are defined with respect to density measures p(x) and q(x) with x ∈ X, where X denotes the support. By definition, pseudo-divergences satisfy the following three fundamental properties:

1. Non-negativity: D(p(x):q(x)) ≥ 0 for any p(x), q(x);
2. Reachable indiscernibility:
   • ∀p(x), there exists q(x) such that D(p(x):q(x)) = 0,
   • ∀q(x), there exists p(x) such that D(p(x):q(x)) = 0.
3. Positive correlation: If D(p:q) = 0, then (p(x_1) − p(x_2))(q(x_1) − q(x_2)) ≥ 0 for any x_1, x_2 ∈ X.

As compared to statistical divergence measures such as the Kullback-Leibler (KL) divergence:

\[
\mathrm{KL}(p(x):q(x)) = \int_{\mathcal{X}} p(x) \log\frac{p(x)}{q(x)} \mathrm{d}x, \tag{7}
\]

pseudo-divergences do not require D(p(x):p(x)) = 0. Instead, any pair of distributions p(x) and q(x) with D(p(x):q(x)) = 0 only has to be "positively correlated", such that p(x_1) ≤ p(x_2) implies q(x_1) ≤ q(x_2), and vice versa. Any divergence with D(p(x):q(x)) = 0 ⇒ p(x) = q(x) (law of indiscernibles) automatically satisfies this weaker condition, and therefore any divergence belongs to the broader class of pseudo-divergences. Indeed, if p(x) = q(x) then (p(x_1) − p(x_2))(q(x_1) − q(x_2)) = (p(x_1) − p(x_2))² ≥ 0. However, the converse is not true. As we shall describe in the remainder, the family of pseudo-divergences is not limited to proper divergence measures. In the remainder, the term "pseudo-divergence" refers to such divergences that are not proper divergence measures.

We study two novel statistical dissimilarity families: One family of statistical improper pseudo-divergences and one family of proper statistical divergences. Within the class of pseudo-divergences, this work concentrates on defining a one-parameter family of dissimilarities called the Hölder log-ratio gap divergence that we concisely abbreviate as HPD for "Hölder pseudo-divergence" in the remainder. We also study its proper divergence counterpart termed HD for "Hölder divergence."

1.3 Prior work and contributions

The term "Hölder divergence" was first coined in 2014 based on the definition of the Hölder score [20, 19]: The score-induced Hölder divergence D(p(x):q(x)) is a proper gap divergence that yields a scale-invariant divergence [20, 19]. Let p_{a,σ}(x) = aσ p(σx) for a, σ > 0 be a transformation. Then a scale-invariant divergence [20, 19] satisfies D(p_{a,σ}(x) : q_{a,σ}(x)) = κ(a,σ) D(p(x):q(x)) for a function κ(a,σ) > 0. This gap divergence is proper since it is based on the so-called Hölder score [20, 19], but it is not projective and does not include the Cauchy-Schwarz divergence. Due to these differences, the Hölder log-ratio gap divergence introduced here shall not be confused with the Hölder gap divergence induced by the Hölder score [19, 20], which relies both on a scalar γ and a function φ(·).

We shall introduce two novel families of log-ratio projective gap divergences based on the Hölder ordinary (or forward) and reverse inequalities that extend the Cauchy-Schwarz divergence, study their properties, and consider as an application the clustering of Gaussian distributions: We experimentally show better clustering results when using symmetrized Hölder divergences than when using the Cauchy-Schwarz divergence.
To contrast with the "Hölder composite score-induced divergences" of [19], our Hölder divergences admit closed-form expressions between distributions belonging to the same exponential family [23] provided that the natural parameter space is a cone or affine. Our main contributions are summarized as follows:

• Define the uni-parametric family of Hölder improper pseudo-divergences (HPDs) in §2 and the bi-parametric family of Hölder proper divergences (HDs) in §3 for positive and probability measures, and study their properties (including their relationships with skewed Bhattacharyya distances [22] via escort distributions);
• Report closed-form expressions of those divergences for exponential families when the natural parameter space is a cone or affine (including but not limited to the cases of categorical distributions and multivariate Gaussian distributions) in §4;
• Provide approximation techniques to compute those divergences between mixtures based on log-sum-exp inequalities in §4.6;
• Describe a variational center-based clustering technique based on the convex-concave procedure for computing Hölder centroids, and report our experimental results in §5.

1.4 Organization

This paper is organized as follows: §2 introduces the definition and properties of Hölder pseudo-divergences (HPDs). It is followed by §3 that describes Hölder proper divergences (HDs). In §4, closed-form expressions for those novel families of divergences are reported for the categorical, multivariate Gaussian, Bernoulli, Laplace and Wishart distributions. §5 defines Hölder statistical centroids and presents a variational k-means clustering technique: We show experimentally that using Hölder divergences improves over the Cauchy-Schwarz divergence. Finally, §6 concludes this work and hints at further perspectives from the viewpoint of statistical estimation and manifold learning. In Appendix A, we recall the proof of the ordinary and reverse Hölder inequalities.

2 Hölder pseudo-divergence: Definition and properties

Hölder's inequality (see [18] and Appendix A for a proof) states for positive real-valued functions¹ p(x) and q(x) defined on the support X that:

\[
\int_{\mathcal{X}} p(x) q(x) \mathrm{d}x \leq \left( \int_{\mathcal{X}} p(x)^{\alpha} \mathrm{d}x \right)^{\frac{1}{\alpha}} \left( \int_{\mathcal{X}} q(x)^{\beta} \mathrm{d}x \right)^{\frac{1}{\beta}}, \tag{8}
\]

where the exponents α and β satisfy αβ > 0 as well as the exponent conjugacy condition 1/α + 1/β = 1. We also write β = ᾱ = α/(α−1), meaning that α and β are conjugate Hölder exponents. We check that α > 1 and β > 1. Hölder's inequality holds even if the lhs is infinite (meaning that the integral diverges) since the rhs is also infinite in that case.

The reverse Hölder inequality holds for conjugate exponents 1/α + 1/β = 1 with αβ < 0 (then 0 < α < 1 and β < 0, or α < 0 and 0 < β < 1):

\[
\int_{\mathcal{X}} p(x) q(x) \mathrm{d}x \geq \left( \int_{\mathcal{X}} p(x)^{\alpha} \mathrm{d}x \right)^{\frac{1}{\alpha}} \left( \int_{\mathcal{X}} q(x)^{\beta} \mathrm{d}x \right)^{\frac{1}{\beta}}. \tag{9}
\]

Both Hölder's inequality and the reverse Hölder inequality turn tight when p(x)^α ∝ q(x)^β (see the proof in Appendix A).

2.1 Definition

Let (X, F, µ) be a measurable space where µ is the Lebesgue measure, and let L^γ(X, µ) denote the Lebesgue space of functions that have their γ-th power of absolute value Lebesgue integrable, for any γ > 0 (when γ ≥ 1, L^γ(X, µ) is a Banach space). We define the following pseudo-divergence:

¹In a more general form, Hölder's inequality holds for any real and complex valued functions. In this work, we only focus on real positive functions that are densities of positive measures.

Definition 1 (Hölder statistical pseudo-divergence, HPD).
For conjugate exponents α and β with αβ > 0, the Hölder pseudo-divergence (HPD) between two densities p(x) ∈ L^α(X, µ) and q(x) ∈ L^β(X, µ) of positive measures absolutely continuous with respect to (wrt.) µ is defined by the following log-ratio gap:

\[
D_{\alpha}^{H}(p(x):q(x)) = -\log\left( \frac{\int_{\mathcal{X}} p(x) q(x) \mathrm{d}x}{\left(\int_{\mathcal{X}} p(x)^{\alpha} \mathrm{d}x\right)^{\frac{1}{\alpha}} \left(\int_{\mathcal{X}} q(x)^{\beta} \mathrm{d}x\right)^{\frac{1}{\beta}}} \right). \tag{10}
\]

When 0 < α < 1 and β = ᾱ = α/(α−1) < 0, or α < 0 and 0 < β < 1, the reverse HPD is defined by:

\[
D_{\alpha}^{H}(p(x):q(x)) = \log\left( \frac{\int_{\mathcal{X}} p(x) q(x) \mathrm{d}x}{\left(\int_{\mathcal{X}} p(x)^{\alpha} \mathrm{d}x\right)^{\frac{1}{\alpha}} \left(\int_{\mathcal{X}} q(x)^{\beta} \mathrm{d}x\right)^{\frac{1}{\beta}}} \right). \tag{11}
\]

By Hölder's inequality and the reverse Hölder inequality, D_α^H(p(x):q(x)) ≥ 0 with D_α^H(p(x):q(x)) = 0 iff p(x)^α ∝ q(x)^β, or equivalently q(x) ∝ p(x)^{α/β} = p(x)^{α−1}. When α > 1, x^{α−1} is monotonically increasing, and D_α^H is indeed a pseudo-divergence. However, the reverse HPD is not a pseudo-divergence because x^{α−1} is monotonically decreasing when α < 0 or 0 < α < 1. Therefore we only consider the HPD with α > 1 in the remainder, and leave aside the notion of reverse Hölder divergence.

When α = β = 2, the HPD becomes the Cauchy-Schwarz divergence CS [16]:

\[
D_{2}^{H}(p(x):q(x)) = \mathrm{CS}(p(x):q(x)) = -\log\left( \frac{\int_{\mathcal{X}} p(x) q(x) \mathrm{d}x}{\left(\int_{\mathcal{X}} p(x)^{2} \mathrm{d}x\right)^{\frac{1}{2}} \left(\int_{\mathcal{X}} q(x)^{2} \mathrm{d}x\right)^{\frac{1}{2}}} \right), \tag{12}
\]

which has been proven useful to get closed-form divergence formulas between mixtures of exponential families with conic or affine natural parameter spaces [21].

The Cauchy-Schwarz divergence is proper for probability densities since the Cauchy-Schwarz inequality becomes an equality iff q(x) = λ p(x)^{α−1} = λ p(x), which implies that λ = ∫_X λ p(x) dx = ∫_X q(x) dx = 1. It is however not proper for positive densities.

Fact 1 (CS is only proper for probability densities). The Cauchy-Schwarz divergence CS(p(x):q(x)) is proper for square-integrable probability densities p(x), q(x) ∈ L²(X, µ) but not proper for positive square-integrable densities.

2.2 Properness and improperness

In the general case, when α ≠ 2, the divergence D_α^H is not even proper for normalized (probability) densities, not to mention general unnormalized (positive) densities. Indeed, when p(x) = q(x), we have:

\[
D_{\alpha}^{H}(p(x):p(x)) = -\log\left( \frac{\int_{\mathcal{X}} p(x)^{2} \mathrm{d}x}{\left(\int_{\mathcal{X}} p(x)^{\alpha} \mathrm{d}x\right)^{\frac{1}{\alpha}} \left(\int_{\mathcal{X}} p(x)^{\frac{\alpha}{\alpha-1}} \mathrm{d}x\right)^{\frac{\alpha-1}{\alpha}}} \right) \neq 0 \quad \text{when } \alpha \neq 2. \tag{13}
\]

Let us consider the general case. For unnormalized positive distributions p̃(x) and q̃(x) (the tilde notation stems from the notation of homogeneous coordinates in projective geometry), the inequality becomes an equality when p̃(x)^α ∝ q̃(x)^β, i.e., p(x)^α ∝ q(x)^β, or q(x) ∝ p(x)^{α/ᾱ} = p(x)^{α−1}. We can check that D_α^H(p(x) : λ p(x)^{α−1}) = 0 for any λ > 0:

\[
-\log\left( \frac{\int p(x)\, \lambda p(x)^{\alpha-1} \mathrm{d}x}{\left(\int p(x)^{\alpha} \mathrm{d}x\right)^{\frac{1}{\alpha}} \left(\int \lambda^{\beta} p(x)^{(\alpha-1)\beta} \mathrm{d}x\right)^{\frac{1}{\beta}}} \right)
= -\log\left( \frac{\int p(x)^{\alpha} \mathrm{d}x}{\left(\int p(x)^{\alpha} \mathrm{d}x\right)^{\frac{1}{\alpha}} \left(\int p(x)^{\alpha} \mathrm{d}x\right)^{\frac{1}{\beta}}} \right) = 0, \tag{14}
\]

since (α−1)β = (α−1)ᾱ = (α−1) α/(α−1) = α.

For α = 2, we find indeed that D_2^H(p(x) : λ p(x)) = CS(p(x) : p(x)) = 0 for any λ ≠ 0.

Fact 2 (HPD is improper). The Hölder pseudo-divergences are improper statistical distances.

2.3 Reference duality

In general, Hölder divergences are asymmetric when α ≠ β (≠ 2) but enjoy the following reference duality [39]:

\[
D_{\alpha}^{H}(p(x):q(x)) = D_{\beta}^{H}(q(x):p(x)) = D_{\frac{\alpha}{\alpha-1}}^{H}(q(x):p(x)). \tag{15}
\]

Fact 3 (Reference duality of HPD). The Hölder pseudo-divergences satisfy the reference duality α ↔ β = α/(α−1): D_α^H(p(x):q(x)) = D_β^H(q(x):p(x)) = D_{α/(α−1)}^H(q(x):p(x)).
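A hedged numerical sketch of Definition 1 (our own code, Python/NumPy, Riemann-sum integration; not the authors' reference implementation): it recovers the Cauchy-Schwarz divergence at α = 2, exhibits the improperness of Fact 2 since D_α^H(p:p) ≠ 0 for α ≠ 2, and verifies Eq. (14), i.e., that the HPD vanishes when q ∝ p^{α−1}.

```python
import numpy as np

def holder_pseudo_divergence(p, q, alpha, dx):
    # HPD of Eq. (10) with conjugate exponent beta = alpha / (alpha - 1), alpha > 1.
    beta = alpha / (alpha - 1.0)
    inner = np.sum(p * q) * dx
    norm_p = (np.sum(p ** alpha) * dx) ** (1.0 / alpha)
    norm_q = (np.sum(q ** beta) * dx) ** (1.0 / beta)
    return -np.log(inner / (norm_p * norm_q))

x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]
gauss = lambda mu, s: np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
p, q = gauss(0.0, 1.0), gauss(1.0, 2.0)

print(holder_pseudo_divergence(p, q, 2.0, dx))   # alpha = 2: the Cauchy-Schwarz divergence
print(holder_pseudo_divergence(p, p, 3.0, dx))   # != 0: the HPD is improper when alpha != 2
alpha = 3.0
# Eq. (14): the HPD vanishes on the pair (p, lambda * p^(alpha - 1)) for any lambda > 0.
print(holder_pseudo_divergence(p, 5.0 * p ** (alpha - 1.0), alpha, dx))   # ~ 0
```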
An arithmetic symmetrization of the HPD yields a symmetric HPD S_α^H, given by:

\[
S_{\alpha}^{H}(p(x):q(x)) = S_{\alpha}^{H}(q(x):p(x)) = \frac{D_{\alpha}^{H}(p(x):q(x)) + D_{\alpha}^{H}(q(x):p(x))}{2}
= -\log\left( \frac{\int p(x) q(x) \mathrm{d}x}{\sqrt{\left(\int p(x)^{\alpha} \mathrm{d}x\right)^{\frac{1}{\alpha}} \left(\int p(x)^{\bar{\alpha}} \mathrm{d}x\right)^{\frac{1}{\bar{\alpha}}} \left(\int q(x)^{\alpha} \mathrm{d}x\right)^{\frac{1}{\alpha}} \left(\int q(x)^{\bar{\alpha}} \mathrm{d}x\right)^{\frac{1}{\bar{\alpha}}}}} \right). \tag{16}
\]

2.4 HPD is a projective divergence

In the above definition, densities p(x) and q(x) can either be positive or normalized probability distributions. Let p̃(x) and q̃(x) denote positive (not necessarily normalized) measures, and w(p̃) = ∫_X p̃(x) dx the overall mass, so that p(x) = p̃(x)/w(p̃) is the corresponding normalized probability measure. Then we check that the HPD is a projective divergence [13] since:

\[
D_{\alpha}^{H}(\tilde{p}(x):\tilde{q}(x)) = D_{\alpha}^{H}(p(x):q(x)), \tag{17}
\]

or, in general:

\[
D_{\alpha}^{H}(\lambda p(x):\lambda' q(x)) = D_{\alpha}^{H}(p(x):q(x)) \tag{18}
\]

for all prescribed constants λ, λ' > 0. Projective divergences may also be called "angular divergences" or "cosine divergences" since they do not depend on the total mass of the measure densities.

Fact 4 (HPD is projective). The Hölder pseudo-divergences are projective distances.

2.5 Escort distributions and skew Bhattacharyya divergences

Let us define, with respect to the probability measures p(x) ∈ L^{1/α}(X, µ) and q(x) ∈ L^{1/β}(X, µ), the following escort probability distributions [2]:

\[
p_{\alpha}^{E}(x) = \frac{p(x)^{\frac{1}{\alpha}}}{\int p(x)^{\frac{1}{\alpha}} \mathrm{d}x}, \tag{19}
\]

and

\[
q_{\beta}^{E}(x) = \frac{q(x)^{\frac{1}{\beta}}}{\int q(x)^{\frac{1}{\beta}} \mathrm{d}x}. \tag{20}
\]

Since the HPD is a projective divergence, we compute with respect to the conjugate exponents α and β the Hölder escort divergence (HED):

\[
D_{\alpha}^{HE}(p(x):q(x)) = D_{\alpha}^{H}(p_{\alpha}^{E}(x):q_{\beta}^{E}(x)) = -\log \int_{\mathcal{X}} p(x)^{1/\alpha} q(x)^{1-1/\alpha} \mathrm{d}x = B_{1/\alpha}(p(x):q(x)), \tag{21}
\]

which turns out to be the familiar skew Bhattacharyya divergence B_{1/α}(p(x):q(x)), see [22].

Fact 5 (HED as a skew Bhattacharyya divergence). The Hölder escort divergence amounts to a skew Bhattacharyya divergence: D_α^{HE}(p(x):q(x)) = B_{1/α}(p(x):q(x)) for any α > 0.

In particular, the Cauchy-Schwarz escort divergence CS^{HE}(p(x):q(x)) amounts to the Bhattacharyya distance [6] B(p(x):q(x)) = −log ∫_X √(p(x)q(x)) dx:

\[
\mathrm{CS}^{HE}(p(x):q(x)) = D_{2}^{HE}(p(x):q(x)) = D_{2}^{H}(p_{2}^{E}(x):q_{2}^{E}(x)) = B_{1/2}(p(x):q(x)) = B(p(x):q(x)). \tag{22}
\]

Observe that the Cauchy-Schwarz escort distributions are the square root density representations [36] of distributions.

3 Proper Hölder divergence

3.1 Definition

Let p(x) and q(x) be positive measures in L^γ(X, µ) for a prescribed scalar value γ > 0. Plugging the positive measures p(x)^{γ/α} and q(x)^{γ/β} into the definition of the HPD D_α^H, we get the following definition:

Definition 2 (Proper Hölder divergence, HD). For conjugate exponents α, β > 0 and γ > 0, the proper Hölder divergence between two densities p(x) and q(x) is defined by:

\[
D_{\alpha,\gamma}^{H}(p(x):q(x)) = D_{\alpha}^{H}\!\left(p(x)^{\gamma/\alpha} : q(x)^{\gamma/\beta}\right) = -\log\left( \frac{\int_{\mathcal{X}} p(x)^{\gamma/\alpha} q(x)^{\gamma/\beta} \mathrm{d}x}{\left(\int_{\mathcal{X}} p(x)^{\gamma} \mathrm{d}x\right)^{1/\alpha} \left(\int_{\mathcal{X}} q(x)^{\gamma} \mathrm{d}x\right)^{1/\beta}} \right). \tag{23}
\]

By definition, D_{α,γ}^H(p:q) is a two-parameter family of statistical dissimilarity measures. Following Hölder's inequality, we can check that D_{α,γ}^H(p(x):q(x)) ≥ 0 and D_{α,γ}^H(p(x):q(x)) = 0 iff p(x)^γ ∝ q(x)^γ, i.e., p(x) ∝ q(x) (see Appendix A). If p(x) and q(x) belong to the statistical probability manifold, then D_{α,γ}^H(p(x):q(x)) = 0 iff p(x) = q(x) almost everywhere. This says that the HD is a proper divergence for probability measures, and it becomes a pseudo-divergence for positive measures. Note that we have abused the notation D^H to denote both the Hölder pseudo-divergence (with one subscript) and the Hölder divergence (with two subscripts).
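To illustrate Definition 2 and Fact 5 numerically (again our own sketch under the same Python/NumPy, Riemann-sum assumptions), the snippet below checks that the HD is proper on probability densities, and that the Hölder escort divergence of Eq. (21) coincides with the skew Bhattacharyya divergence B_{1/α}.

```python
import numpy as np

def hpd(p, q, alpha, dx):
    # Hoelder pseudo-divergence, Eq. (10), with beta = alpha / (alpha - 1).
    beta = alpha / (alpha - 1.0)
    inner = np.sum(p * q) * dx
    return -np.log(inner / ((np.sum(p ** alpha) * dx) ** (1.0 / alpha)
                            * (np.sum(q ** beta) * dx) ** (1.0 / beta)))

def hd(p, q, alpha, gamma, dx):
    # Proper Hoelder divergence, Eq. (23): the HPD applied to p^(gamma/alpha) and q^(gamma/beta).
    beta = alpha / (alpha - 1.0)
    return hpd(p ** (gamma / alpha), q ** (gamma / beta), alpha, dx)

def skew_bhattacharyya(p, q, lam, dx):
    # B_lambda(p:q) = -log int p^lambda q^(1 - lambda) dx.
    return -np.log(np.sum(p ** lam * q ** (1.0 - lam)) * dx)

def escort(p, power, dx):
    # Escort distribution: p^power renormalized to a probability density.
    e = p ** power
    return e / (np.sum(e) * dx)

x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]
gauss = lambda mu, s: np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
p, q = gauss(0.0, 1.0), gauss(1.0, 2.0)
alpha, gamma = 3.0, 2.0

print(hd(p, q, alpha, gamma, dx))   # > 0
print(hd(p, p, alpha, gamma, dx))   # ~ 0: the HD is proper for probability densities
# Fact 5 / Eq. (21): the HPD between the escort distributions equals B_{1/alpha}(p:q).
print(hpd(escort(p, 1.0 / alpha, dx), escort(q, 1.0 - 1.0 / alpha, dx), alpha, dx))
print(skew_bhattacharyya(p, q, 1.0 / alpha, dx))   # same value up to discretization error
```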
Similar to the HPD, the HD is asymmetric when α ≠ β, with the following reference duality:

\[
D_{\alpha,\gamma}^{H}(p(x):q(x)) = D_{\bar{\alpha},\gamma}^{H}(q(x):p(x)). \tag{24}
\]

The HD can be symmetrized as:

\[
S_{\alpha,\gamma}^{H}(p:q) = \frac{D_{\alpha,\gamma}^{H}(p:q) + D_{\alpha,\gamma}^{H}(q:p)}{2}
= -\log\sqrt{ \frac{\int_{\mathcal{X}} p(x)^{\gamma/\alpha} q(x)^{\gamma/\beta} \mathrm{d}x \, \int_{\mathcal{X}} p(x)^{\gamma/\beta} q(x)^{\gamma/\alpha} \mathrm{d}x}{\int_{\mathcal{X}} p(x)^{\gamma} \mathrm{d}x \, \int_{\mathcal{X}} q(x)^{\gamma} \mathrm{d}x} }. \tag{25}
\]

Furthermore, one can easily check that the HD is a projective divergence. For conjugate exponents α, β > 0 and γ > 0, we rewrite the definition of the HD as:

\[
D_{\alpha,\gamma}^{H}(p(x):q(x)) = -\log \int_{\mathcal{X}} \left( \frac{p(x)^{\gamma}}{\int_{\mathcal{X}} p(x)^{\gamma} \mathrm{d}x} \right)^{1/\alpha} \left( \frac{q(x)^{\gamma}}{\int_{\mathcal{X}} q(x)^{\gamma} \mathrm{d}x} \right)^{1/\beta} \mathrm{d}x
= -\log \int \left( p_{1/\gamma}^{E}(x) \right)^{1/\alpha} \left( q_{1/\gamma}^{E}(x) \right)^{1/\beta} \mathrm{d}x = B_{1/\alpha}\!\left(p_{1/\gamma}^{E}(x) : q_{1/\gamma}^{E}(x)\right).
\]

Therefore the HD can be reinterpreted as the skew Bhattacharyya divergence [22] between the escort distributions. In particular, when γ = 1, we get:

\[
D_{\alpha,1}^{H}(p(x):q(x)) = -\log\left( \int_{\mathcal{X}} p(x)^{1/\alpha} q(x)^{1/\beta} \mathrm{d}x \right) = B_{1/\alpha}(p(x):q(x)). \tag{26}
\]

Fact 6. The two-parametric family of statistical Hölder divergences D_{α,γ}^H passes through the one-parametric family of skew Bhattacharyya divergences when γ = 1.

3.2 Special case: The Cauchy-Schwarz divergence

We consider the intersection of the uni-parametric class of Hölder pseudo-divergences (HPD) with the bi-parametric class of proper Hölder divergences (HD): That is, the class of divergences which belong to both HPD and HD. Then we must have γ/α = γ/β = 1. Since 1/α + 1/β = 1, we get α = β = γ = 2. Therefore the Cauchy-Schwarz (CS) divergence is the unique divergence belonging to both the HPD and HD classes:

\[
D_{2,2}^{H}(p(x):q(x)) = D_{2}^{H}(p(x):q(x)) = \mathrm{CS}(p(x):q(x)). \tag{27}
\]

In fact, the CS divergence is the intersection of the four classes HPD, symmetric HPD, HD, and symmetric HD. Figure 1 displays a diagram of those divergence classes with their inclusion relationships.

[Figure 1: Hölder proper divergence (bi-parametric) and Hölder improper pseudo-divergence (uni-parametric) intersect at the unique non-parametric Cauchy-Schwarz divergence. By using escort distributions, Hölder divergences encapsulate the skew Bhattacharyya distances.]

As stated earlier, notice that the Cauchy-Schwarz inequality:

\[
\int p(x) q(x) \mathrm{d}x \leq \sqrt{\left( \int p(x)^{2} \mathrm{d}x \right) \left( \int q(x)^{2} \mathrm{d}x \right)}, \tag{28}
\]

is not proper as it is an equality when p(x) and q(x) are linearly dependent (i.e., p(x) = λq(x) for λ > 0). The arguments of the CS divergence are square-integrable real-valued density functions p(x) and q(x). Thus the Cauchy-Schwarz divergence is not proper for positive measures but is proper for normalized probability distributions since ∫ p(x) dx = ∫ λ q(x) dx = 1 implies that λ = 1.

3.3 Limit cases of Hölder divergences and statistical estimation

Let us define the inner product of unnormalized densities as:

\[
\langle \tilde{p}(x), \tilde{q}(x) \rangle = \int_{\mathcal{X}} \tilde{p}(x) \tilde{q}(x) \mathrm{d}x \tag{29}
\]

(for L²(X, µ) integrable functions), and define the L_α norm of densities as ‖p̃(x)‖_α = (∫_X p̃(x)^α dx)^{1/α} for α ≥ 1. Then the CS divergence can be concisely written as:

\[
\mathrm{CS}(\tilde{p}(x),\tilde{q}(x)) = -\log \frac{\langle \tilde{p}(x), \tilde{q}(x) \rangle}{\|\tilde{p}(x)\|_{2} \, \|\tilde{q}(x)\|_{2}}, \tag{30}
\]

and the Hölder pseudo-divergence writes as:

\[
D_{\alpha}^{H}(\tilde{p}(x),\tilde{q}(x)) = -\log \frac{\langle \tilde{p}(x), \tilde{q}(x) \rangle}{\|\tilde{p}(x)\|_{\alpha} \, \|\tilde{q}(x)\|_{\bar{\alpha}}}. \tag{31}
\]

When α → 1⁺, we have ᾱ = α/(α−1) → +∞.
It then follows that:

\[
\lim_{\alpha \to 1^{+}} D_{\alpha}^{H}(\tilde{p}(x),\tilde{q}(x)) = -\log \frac{\langle \tilde{p}(x), \tilde{q}(x) \rangle}{\|\tilde{p}(x)\|_{1} \|\tilde{q}(x)\|_{\infty}} = -\log\langle \tilde{p}(x), \tilde{q}(x) \rangle + \log \int_{\mathcal{X}} \tilde{p}(x) \mathrm{d}x + \log \max_{x \in \mathcal{X}} \tilde{q}(x). \tag{32}
\]

When α → +∞ and ᾱ → 1⁺, we have:

\[
\lim_{\alpha \to +\infty} D_{\alpha}^{H}(\tilde{p}(x),\tilde{q}(x)) = -\log \frac{\langle \tilde{p}(x), \tilde{q}(x) \rangle}{\|\tilde{p}(x)\|_{\infty} \|\tilde{q}(x)\|_{1}} = -\log\langle \tilde{p}(x), \tilde{q}(x) \rangle + \log \max_{x \in \mathcal{X}} \tilde{p}(x) + \log \int_{\mathcal{X}} \tilde{q}(x) \mathrm{d}x. \tag{33}
\]

Now consider a pair of probability densities p(x) and q(x). We have:

\[
\lim_{\alpha \to 1^{+}} D_{\alpha}^{H}(p(x),q(x)) = -\log\langle p(x), q(x) \rangle + \max_{x \in \mathcal{X}} \log q(x),
\]
\[
\lim_{\alpha \to +\infty} D_{\alpha}^{H}(p,q) = -\log\langle p(x), q(x) \rangle + \max_{x \in \mathcal{X}} \log p(x),
\]
\[
D_{2}^{H}(p,q) = -\log\langle p(x), q(x) \rangle + \log\|p(x)\|_{2} + \log\|q(x)\|_{2}. \tag{34}
\]

In an estimation scenario, p(x) is fixed and q(x|θ) = q_θ(x) is free along a parametric manifold M; then minimizing the Hölder divergence reduces to:

\[
\arg\min_{\theta \in M} \lim_{\alpha \to 1^{+}} D_{\alpha}^{H}(p(x),q_{\theta}(x)) = \arg\min_{\theta \in M} \left( -\log\langle p(x), q_{\theta}(x) \rangle + \max_{x \in \mathcal{X}} \log q_{\theta}(x) \right),
\]
\[
\arg\min_{\theta \in M} \lim_{\alpha \to +\infty} D_{\alpha}^{H}(p(x),q_{\theta}(x)) = \arg\min_{\theta \in M} \left( -\log\langle p(x), q_{\theta}(x) \rangle \right),
\]
\[
\arg\min_{\theta \in M} D_{2}^{H}(p(x),q_{\theta}(x)) = \arg\min_{\theta \in M} \left( -\log\langle p(x), q_{\theta}(x) \rangle + \log\|q_{\theta}(x)\|_{2} \right). \tag{35}
\]

Therefore, as α varies from 1 to +∞, only the regularizer in the minimization problem changes. In any case, the Hölder divergence always has the term −log⟨p(x), q(x)⟩, which shares a similar form with the Bhattacharyya distance [6]:

\[
B(p(x):q(x)) = -\log \int_{\mathcal{X}} \sqrt{p(x)} \sqrt{q(x)} \, \mathrm{d}x = -\log\langle \sqrt{p(x)}, \sqrt{q(x)} \rangle. \tag{36}
\]

The HPD between p̃(x) and q̃(x) is also closely related to their cosine similarity ⟨p̃(x), q̃(x)⟩ / (‖p̃(x)‖₂ ‖q̃(x)‖₂). When α = 2, HD is exactly the cosine similarity after a non-linear transformation.

4 Closed-form expressions of HPD and HD for conic and affine exponential families

We report closed-form formulas for the HPD and HD between two distributions belonging to the same exponential family provided that the natural parameter space is a cone or affine. A cone Ω is a convex domain such that for P, Q ∈ Ω and any λ > 0, we have P + λQ ∈ Ω. For example, the set of positive measures absolutely continuous with respect to a base measure µ is a cone. Recall that an exponential family [23] has a density function p(x;θ) that can be written canonically as:

\[
p(x;\theta) = \exp\left(\langle t(x), \theta \rangle - F(\theta) + k(x)\right). \tag{37}
\]

In this work, we consider the auxiliary carrier measure term k(x) = 0. The base measure is either the Lebesgue measure µ or the counting measure µ_C. A conic or affine exponential family (CAEF) is an exponential family with the natural parameter space Θ a cone or affine. The log-normalizer F(θ) is a strictly convex function also called the cumulant generating function [2].

Lemma 1 (HPD and HD for CAEFs). For distributions p(x;θ_p) and p(x;θ_q) belonging to the same exponential family with conic or affine natural parameter space [21], both the HPD and HD are available in closed form:

\[
D_{\alpha}^{H}(p:q) = \frac{1}{\alpha} F(\alpha\theta_{p}) + \frac{1}{\beta} F(\beta\theta_{q}) - F(\theta_{p} + \theta_{q}), \tag{38}
\]
\[
D_{\alpha,\gamma}^{H}(p:q) = \frac{1}{\alpha} F(\gamma\theta_{p}) + \frac{1}{\beta} F(\gamma\theta_{q}) - F\left( \frac{\gamma}{\alpha}\theta_{p} + \frac{\gamma}{\beta}\theta_{q} \right). \tag{39}
\]

Proof. Consider k(x) = 0 and a conic or affine natural parameter space Θ (see [21]); then for all a, b > 0, we have:

\[
\left( \int p(x)^{a} \mathrm{d}x \right)^{\frac{1}{b}} = \exp\left( \frac{1}{b} F(a\theta_{p}) - \frac{a}{b} F(\theta_{p}) \right), \tag{40}
\]

since aθ_p ∈ Θ.
Indeed, we have:

\[
\left( \int p(x)^{a} \mathrm{d}x \right)^{1/b} = \left( \int \exp(\langle a\theta, t(x) \rangle - aF(\theta)) \mathrm{d}x \right)^{1/b}
= \left( \int \exp(\langle a\theta, t(x) \rangle - F(a\theta) + F(a\theta) - aF(\theta)) \mathrm{d}x \right)^{1/b}
= \exp\left( \frac{1}{b} F(a\theta) - \frac{a}{b} F(\theta) \right) \underbrace{\left( \int \exp(\langle a\theta, t(x) \rangle - F(a\theta)) \mathrm{d}x \right)^{1/b}}_{=1}.
\]

Similarly, we have for all a, b > 0 (details omitted):

\[
\int p(x)^{a} q(x)^{b} \mathrm{d}x = \exp\left( F(a\theta_{p} + b\theta_{q}) - aF(\theta_{p}) - bF(\theta_{q}) \right), \tag{41}
\]

since aθ_p + bθ_q ∈ Θ. Therefore, we get:

\[
D_{\alpha}^{H}(p(x):q(x)) = -\log \frac{\int p(x) q(x) \mathrm{d}x}{\left(\int p(x)^{\alpha} \mathrm{d}x\right)^{\frac{1}{\alpha}} \left(\int q(x)^{\beta} \mathrm{d}x\right)^{\frac{1}{\beta}}} \tag{42}
\]
\[
= -F(\theta_{p} + \theta_{q}) + F(\theta_{p}) + F(\theta_{q}) + \frac{1}{\alpha} F(\alpha\theta_{p}) - F(\theta_{p}) + \frac{1}{\beta} F(\beta\theta_{q}) - F(\theta_{q}) \tag{43}
\]
\[
= \frac{1}{\alpha} F(\alpha\theta_{p}) + \frac{1}{\beta} F(\beta\theta_{q}) - F(\theta_{p} + \theta_{q}) \geq 0, \tag{44}
\]

\[
D_{\alpha,\gamma}^{H}(p(x):q(x)) = -\log \frac{\int p(x)^{\gamma/\alpha} q(x)^{\gamma/\beta} \mathrm{d}x}{\left(\int p(x)^{\gamma} \mathrm{d}x\right)^{\frac{1}{\alpha}} \left(\int q(x)^{\gamma} \mathrm{d}x\right)^{\frac{1}{\beta}}} \tag{45}
\]
\[
= -F\left( \frac{\gamma}{\alpha}\theta_{p} + \frac{\gamma}{\beta}\theta_{q} \right) + \frac{\gamma}{\alpha} F(\theta_{p}) + \frac{\gamma}{\beta} F(\theta_{q}) + \frac{1}{\alpha} F(\gamma\theta_{p}) - \frac{\gamma}{\alpha} F(\theta_{p}) + \frac{1}{\beta} F(\gamma\theta_{q}) - \frac{\gamma}{\beta} F(\theta_{q}) \tag{46}
\]
\[
= \frac{1}{\alpha} F(\gamma\theta_{p}) + \frac{1}{\beta} F(\gamma\theta_{q}) - F\left( \frac{\gamma}{\alpha}\theta_{p} + \frac{\gamma}{\beta}\theta_{q} \right) \geq 0. \tag{47}
\]

When 1 > α > 0, we have β = α/(α−1) < 0. To get similar results for the reverse Hölder divergence, we need the natural parameter space to be affine (e.g., isotropic Gaussians or multinomials, see [27]).

In particular, if p(x) and q(x) belong to the same exponential family, so that p(x) = exp(⟨θ_p, t(x)⟩ − F(θ_p)) and q(x) = exp(⟨θ_q, t(x)⟩ − F(θ_q)), one can easily check that D_α^H(p(x;θ_p) : p(x;θ_q)) = 0 iff θ_q = (α−1)θ_p. For the HD, we can check that D_{α,γ}^H(p(x):p(x)) = 0, consistent with properness, since 1/α + 1/β = 1.

The following result is straightforward from Lemma 1.

Lemma 2 (Symmetric HPD and HD for CAEFs). For distributions p(x;θ_p) and p(x;θ_q) belonging to the same exponential family with conic or affine natural parameter space [21], the symmetric HPD and HD are available in closed form:

\[
S_{\alpha}^{H}(p(x):q(x)) = \frac{1}{2}\left[ \frac{1}{\alpha} F(\alpha\theta_{p}) + \frac{1}{\beta} F(\beta\theta_{p}) + \frac{1}{\alpha} F(\alpha\theta_{q}) + \frac{1}{\beta} F(\beta\theta_{q}) \right] - F(\theta_{p} + \theta_{q}); \tag{48}
\]
\[
S_{\alpha,\gamma}^{H}(p(x):q(x)) = \frac{1}{2}\left[ F(\gamma\theta_{p}) + F(\gamma\theta_{q}) - F\left( \frac{\gamma}{\alpha}\theta_{p} + \frac{\gamma}{\beta}\theta_{q} \right) - F\left( \frac{\gamma}{\beta}\theta_{p} + \frac{\gamma}{\alpha}\theta_{q} \right) \right]. \tag{49}
\]

Remark 1. By reference duality,

\[
S_{\alpha}^{H}(p(x):q(x)) = S_{\bar{\alpha}}^{H}(p(x):q(x)); \qquad S_{\alpha,\gamma}^{H}(p(x):q(x)) = S_{\bar{\alpha},\gamma}^{H}(p(x):q(x)).
\]
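As a hedged illustration of Lemma 1 (our own sketch, assuming Python with NumPy and SciPy; not the authors' reference code): for univariate Gaussians, whose natural parameter space {(θ₁, θ₂) : θ₂ < 0} is a cone, Eq. (38) is evaluated through the log-normalizer F(θ) = −θ₁²/(4θ₂) + ½ log(−π/θ₂) and cross-checked against direct numerical integration of Eq. (10).

```python
import numpy as np
from scipy.integrate import quad

def F(theta):
    # Log-normalizer of the univariate Gaussian exponential family,
    # theta = (mu / sigma^2, -1 / (2 sigma^2)); natural space {theta2 < 0} is a cone.
    t1, t2 = theta
    return -t1 ** 2 / (4.0 * t2) + 0.5 * np.log(-np.pi / t2)

def natural(mu, sigma):
    return np.array([mu / sigma ** 2, -1.0 / (2.0 * sigma ** 2)])

def hpd_closed_form(theta_p, theta_q, alpha):
    # Eq. (38): D_alpha^H(p:q) = F(alpha theta_p)/alpha + F(beta theta_q)/beta - F(theta_p + theta_q).
    beta = alpha / (alpha - 1.0)
    return F(alpha * theta_p) / alpha + F(beta * theta_q) / beta - F(theta_p + theta_q)

def hpd_numerical(mu_p, s_p, mu_q, s_q, alpha):
    # Direct evaluation of Eq. (10) by numerical quadrature, for cross-checking.
    beta = alpha / (alpha - 1.0)
    gauss = lambda x, mu, s: np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
    inner = quad(lambda x: gauss(x, mu_p, s_p) * gauss(x, mu_q, s_q), -np.inf, np.inf)[0]
    norm_p = quad(lambda x: gauss(x, mu_p, s_p) ** alpha, -np.inf, np.inf)[0] ** (1.0 / alpha)
    norm_q = quad(lambda x: gauss(x, mu_q, s_q) ** beta, -np.inf, np.inf)[0] ** (1.0 / beta)
    return -np.log(inner / (norm_p * norm_q))

alpha = 3.0
theta_p, theta_q = natural(0.0, 1.0), natural(1.5, 2.0)
print(hpd_closed_form(theta_p, theta_q, alpha))   # closed form via the log-normalizer
print(hpd_numerical(0.0, 1.0, 1.5, 2.0, alpha))   # matches up to quadrature error
```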
