On Hölder projective divergences*

Frank Nielsen† (École Polytechnique, France; Sony Computer Science Laboratories, Japan)
Ke Sun (Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Saudi Arabia)
Stéphane Marchand-Maillet (Viper Group, Computer Vision and Multimedia Laboratory, University of Geneva, Switzerland)

* Reproducible source code available at https://www.lix.polytechnique.fr/~nielsen/HPD/
† Contact author: [email protected]

Abstract

We describe a framework to build distances by measuring the tightness of inequalities, and introduce the notion of proper statistical divergences and improper pseudo-divergences. We then consider the Hölder ordinary and reverse inequalities, and present two novel classes of Hölder divergences and pseudo-divergences that both encapsulate the special case of the Cauchy-Schwarz divergence. We report closed-form formulas for those statistical dissimilarities when considering distributions belonging to the same exponential family provided that the natural parameter space is a cone (e.g., multivariate Gaussians) or affine (e.g., categorical distributions). Those new classes of Hölder distances are invariant to rescaling, and thus do not require distributions to be normalized. Finally, we show how to compute statistical Hölder centroids with respect to those divergences, and carry out center-based clustering toy experiments on a set of Gaussian distributions that demonstrate empirically that symmetrized Hölder divergences outperform the symmetric Cauchy-Schwarz divergence.

Keywords: Hölder inequalities; Hölder divergences; projective divergences; Cauchy-Schwarz divergence; Hölder escort divergences; skew Bhattacharyya divergences; exponential families; conic exponential families; escort distribution; clustering.

1 Introduction: Inequality, proper divergence and improper pseudo-divergence

1.1 Statistical divergences from inequality gaps

An inequality is denoted mathematically by lhs ≤ rhs, where lhs and rhs denote respectively the left-hand-side and right-hand-side of the inequality. One can build dissimilarity measures from inequalities lhs ≤ rhs by measuring the inequality tightness: For example, we may quantify the tightness of an inequality by its difference gap:

$$\Delta = \mathrm{rhs} - \mathrm{lhs} \geq 0. \tag{1}$$

When lhs > 0, the inequality tightness can also be gauged by the log-ratio gap:

$$D = \log\frac{\mathrm{rhs}}{\mathrm{lhs}} = -\log\frac{\mathrm{lhs}}{\mathrm{rhs}} \geq 0. \tag{2}$$

We may further compose this inequality tightness value measuring non-negative gaps with a strictly monotonically increasing function f (with f(0) = 0).

A bi-parametric inequality lhs(p,q) ≤ rhs(p,q) is called proper if it is strict for p ≠ q (i.e., lhs(p,q) < rhs(p,q), ∀p ≠ q) and tight if and only if (iff) p = q (i.e., lhs(p,q) = rhs(p,q), ∀p = q). Thus a proper bi-parametric inequality allows one to define dissimilarities such that D(p,q) = 0 iff p = q. Such a dissimilarity is called proper. Otherwise, the inequality or dissimilarity is said to be improper. Note that there are many equivalent words used in the literature instead of (dis-)similarity: distance (although often assumed to have metric properties), pseudo-distance, discrimination, proximity, information deviation, etc.
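To make the gap construction concrete, the following short numerical sketch (an illustration added here, not part of the original text) instantiates the difference gap (1) and the log-ratio gap (2) on the finite-dimensional Cauchy-Schwarz inequality ⟨p,q⟩ ≤ ‖p‖₂‖q‖₂; NumPy and the particular vectors are arbitrary choices. Note that for this inequality both gaps vanish whenever p and q are proportional, not only when p = q, so the induced dissimilarity is improper in the sense just defined.

```python
import numpy as np

# Cauchy-Schwarz inequality for nonnegative vectors: lhs = <p, q> <= rhs = ||p||_2 ||q||_2.
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.1, 0.6, 0.3])

lhs = float(np.dot(p, q))
rhs = float(np.linalg.norm(p) * np.linalg.norm(q))

difference_gap = rhs - lhs           # Eq. (1): >= 0, and 0 iff p and q are proportional
log_ratio_gap = np.log(rhs / lhs)    # Eq. (2): >= 0, and 0 iff p and q are proportional

print(difference_gap, log_ratio_gap)
```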
A statistical dissimilarity between two discrete or continuous distributions p(x) and q(x) on a support X can thus be defined from inequalities by summing up or taking the integral of the inequalities instantiated on the observation space X:

$$\forall x\in\mathcal{X},\quad D_x(p,q) = \mathrm{rhs}(p(x),q(x)) - \mathrm{lhs}(p(x),q(x)) \quad\Rightarrow \tag{3}$$

$$D(p,q) = \begin{cases}\sum_{x\in\mathcal{X}}\left[\mathrm{rhs}(p(x),q(x)) - \mathrm{lhs}(p(x),q(x))\right] & \text{discrete case},\\ \int_{\mathcal{X}}\left[\mathrm{rhs}(p(x),q(x)) - \mathrm{lhs}(p(x),q(x))\right]\mathrm{d}x & \text{continuous case}.\end{cases} \tag{4}$$

In such a case, we get a separable divergence. Some non-separable inequalities induce a non-separable divergence. For example, the renowned Cauchy-Schwarz divergence [8] is not separable because in the inequality:

$$\int_{\mathcal{X}} p(x)q(x)\mathrm{d}x \leq \sqrt{\left(\int_{\mathcal{X}} p(x)^2\mathrm{d}x\right)\left(\int_{\mathcal{X}} q(x)^2\mathrm{d}x\right)}, \tag{5}$$

the rhs is not separable. Furthermore, a proper dissimilarity is called a divergence in information geometry [2] when it is C^3 (i.e., three times differentiable, thus allowing one to define a metric tensor [34] and a cubic tensor [2]).

Many familiar distances can be reinterpreted as inequality gaps in disguise. For example, Bregman divergences [4] and Jensen divergences [28] (also called Burbea-Rao divergences [9, 22]) can be reinterpreted as inequality difference gaps, and the Cauchy-Schwarz distance [8] as an inequality log-ratio gap:

Example 1 (Bregman divergence as a Bregman score-induced gap divergence). A proper score function [14] S(p:q) induces a gap divergence D(p:q) = S(p:q) − S(p:p) ≥ 0. A Bregman divergence [4] B_F(p:q) for a strictly convex and differentiable real-valued generator F(x) is induced by the Bregman score S_F(p:q). Let S_F(p:q) = −F(q) − ⟨p−q, ∇F(q)⟩ denote the Bregman proper score, minimized for p = q. Then the Bregman divergence is a gap divergence: B_F(p:q) = S_F(p:q) − S_F(p:p) ≥ 0. When F is strictly convex, the Bregman score is proper, and the Bregman divergence is proper.

Example 2 (Cauchy-Schwarz distance as a log-ratio gap divergence). Consider the Cauchy-Schwarz inequality ∫_X p(x)q(x)dx ≤ √((∫_X p(x)²dx)(∫_X q(x)²dx)). Then the Cauchy-Schwarz distance [8] between two continuous distributions is defined by

$$\mathrm{CS}(p(x):q(x)) = -\log\frac{\int_{\mathcal{X}} p(x)q(x)\mathrm{d}x}{\sqrt{\left(\int_{\mathcal{X}} p(x)^2\mathrm{d}x\right)\left(\int_{\mathcal{X}} q(x)^2\mathrm{d}x\right)}} \geq 0.$$

Note that we use the modern notation D(p(x):q(x)) to emphasize that the divergence is potentially asymmetric: D(p(x):q(x)) ≠ D(q(x):p(x)), see [2]. In information theory [11], the older notation "||" is often used instead of the ":" that is used in information geometry [2].

To conclude this introduction, let us finally introduce the notion of projective statistical distances. A statistical distance D(p(x):q(x)) is said to be projective when:

$$D(\lambda p(x):\lambda' q(x)) = D(p(x):q(x)), \quad \forall \lambda,\lambda' > 0. \tag{6}$$

The Cauchy-Schwarz distance is a projective divergence. Another example of such a projective divergence is the parametric γ-divergence [13].

Example 3 (γ-divergence as a projective score-induced gap divergence). The γ-divergence [13, 29] D_γ(p(x):q(x)) for γ > 0 is projective:

$$D_\gamma(p(x):q(x)) = S_\gamma(p(x):q(x)) - S_\gamma(p(x):p(x)), \quad\text{with}\quad S_\gamma(p(x):q(x)) = -\frac{1}{\gamma(1+\gamma)}\,\frac{\int p(x)q(x)^\gamma \mathrm{d}x}{\left(\int q(x)^{1+\gamma}\mathrm{d}x\right)^{\frac{\gamma}{1+\gamma}}}.$$

The γ-divergence is related to the proper pseudo-spherical score [13]. The γ-divergences have been proven useful for robust statistical inference [13] in the presence of heavy outlier contamination.
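As an illustration of Example 2 and of the projectivity property (6), the following sketch (our numerical illustration, assuming SciPy quadrature with arbitrary Gaussian parameters and integration bounds) evaluates the Cauchy-Schwarz distance between two unnormalized Gaussian-shaped densities and checks that rescaling either argument leaves the value unchanged.

```python
import numpy as np
from scipy.integrate import quad

def cauchy_schwarz_div(p, q, lo=-20.0, hi=20.0):
    """Cauchy-Schwarz distance CS(p:q) computed by numerical integration."""
    pq = quad(lambda x: p(x) * q(x), lo, hi)[0]
    pp = quad(lambda x: p(x) ** 2, lo, hi)[0]
    qq = quad(lambda x: q(x) ** 2, lo, hi)[0]
    return -np.log(pq / np.sqrt(pp * qq))

# Two unnormalized Gaussian-shaped densities (proportional to N(0,1) and N(1,1)).
p = lambda x: np.exp(-0.5 * x ** 2)
q = lambda x: np.exp(-0.5 * (x - 1.0) ** 2)

d1 = cauchy_schwarz_div(p, q)
# Projectivity (Eq. 6): rescaling either argument leaves the divergence unchanged.
d2 = cauchy_schwarz_div(lambda x: 3.0 * p(x), lambda x: 0.5 * q(x))
print(d1, d2)  # both approximately 0.25 for this pair of Gaussians
```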
1.2 Pseudo-divergences and the axiom of indiscernibility

Consider a broader class of statistical pseudo-divergences based on improper inequalities, where the tightness of lhs(p,q) ≤ rhs(p,q) does not imply that p = q. This family of dissimilarity measures has interesting properties which have not been studied before. Formally, statistical pseudo-divergences are defined with respect to density measures p(x) and q(x) with x ∈ X, where X denotes the support. By definition, pseudo-divergences satisfy the following three fundamental properties:

1. Non-negativity: D(p(x):q(x)) ≥ 0 for any p(x), q(x);

2. Reachable indiscernibility:
   • ∀p(x), there exists q(x) such that D(p(x):q(x)) = 0,
   • ∀q(x), there exists p(x) such that D(p(x):q(x)) = 0.

3. Positive correlation: If D(p:q) = 0, then (p(x_1) − p(x_2))(q(x_1) − q(x_2)) ≥ 0 for any x_1, x_2 ∈ X.

As compared to statistical divergence measures such as the Kullback-Leibler (KL) divergence:

$$\mathrm{KL}(p(x):q(x)) = \int_{\mathcal{X}} p(x)\log\frac{p(x)}{q(x)}\mathrm{d}x, \tag{7}$$

pseudo-divergences do not require D(p(x):p(x)) = 0. Instead, any pair of distributions p(x) and q(x) with D(p(x):q(x)) = 0 only has to be "positively correlated" such that p(x_1) ≤ p(x_2) implies q(x_1) ≤ q(x_2), and vice versa. Any divergence with D(p(x):q(x)) = 0 ⇒ p(x) = q(x) (law of indiscernibles) automatically satisfies this weaker condition, and therefore any divergence belongs to the broader class of pseudo-divergences. Indeed, if p(x) = q(x) then (p(x_1) − p(x_2))(q(x_1) − q(x_2)) = (p(x_1) − p(x_2))² ≥ 0. However, the converse is not true. As we shall describe in the remainder, the family of pseudo-divergences is not limited to proper divergence measures. In the remainder, the term "pseudo-divergence" refers to such divergences that are not proper divergence measures.

We study two novel statistical dissimilarity families: One family of statistical improper pseudo-divergences and one family of proper statistical divergences. Within the class of pseudo-divergences, this work concentrates on defining a one-parameter family of dissimilarities called the Hölder log-ratio gap divergence that we concisely abbreviate as HPD for "Hölder pseudo-divergence" in the remainder. We also study its proper divergence counterpart termed HD for "Hölder divergence."

1.3 Prior work and contributions

The term "Hölder divergence" was first coined in 2014 based on the definition of the Hölder score [20, 19]: The score-induced Hölder divergence D(p(x):q(x)) is a proper gap divergence that yields a scale-invariant divergence [20, 19]. Let p_{a,σ}(x) = aσ p(σx) for a, σ > 0 be a transformation. Then a scale-invariant divergence [20, 19] satisfies D(p_{a,σ}(x):q_{a,σ}(x)) = κ(a,σ) D(p(x):q(x)) for a function κ(a,σ) > 0. This gap divergence is proper since it is based on the so-called Hölder score [20, 19], but it is not projective and does not include the Cauchy-Schwarz divergence. Due to these differences, the Hölder log-ratio gap divergence introduced here shall not be confused with the Hölder gap divergence induced by the Hölder score [19, 20], which relies both on a scalar γ and a function φ(·).

We shall introduce two novel families of log-ratio projective gap divergences based on Hölder ordinary (or forward) and reverse inequalities that extend the Cauchy-Schwarz divergence, study their properties, and consider as an application the clustering of Gaussian distributions: We experimentally show better clustering results when using symmetrized Hölder divergences than when using the Cauchy-Schwarz divergence.
To contrast with the "Hölder composite score-induced divergences" of [19], our Hölder divergences admit closed-form expressions between distributions belonging to the same exponential families [23] provided that the natural parameter space is a cone or affine. Our main contributions are summarized as follows:

• Define the uni-parametric family of Hölder improper pseudo-divergences (HPDs) in §2 and the bi-parametric family of Hölder proper divergences (HDs) in §3 for positive and probability measures, and study their properties (including their relationships with skewed Bhattacharyya distances [22] via escort distributions);

• Report closed-form expressions of those divergences for exponential families when the natural parameter space is a cone or affine (including but not limited to the cases of categorical distributions and multivariate Gaussian distributions) in §4;

• Provide approximation techniques to compute those divergences between mixtures based on log-sum-exp inequalities in §4.6;

• Describe a variational center-based clustering technique based on the convex-concave procedure for computing Hölder centroids, and report our experimental results in §5.

1.4 Organization

This paper is organized as follows: §2 introduces the definition and properties of Hölder pseudo-divergences (HPDs). It is followed by §3 that describes Hölder proper divergences (HDs). In §4, closed-form expressions for those novel families of divergences are reported for the categorical, multivariate Gaussian, Bernoulli, Laplace and Wishart distributions. §5 defines Hölder statistical centroids and presents a variational k-means clustering technique: We show experimentally that using Hölder divergences improves over the Cauchy-Schwarz divergence. Finally, §6 concludes this work and hints at further perspectives from the viewpoint of statistical estimation and manifold learning. In Appendix A, we recall the proof of the ordinary and reverse Hölder inequalities.

2 Hölder pseudo-divergence: Definition and properties

Hölder's inequality (see [18] and Appendix A for a proof) states for positive real-valued functions¹ p(x) and q(x) defined on the support X that:

$$\int_{\mathcal{X}} p(x)q(x)\mathrm{d}x \leq \left(\int_{\mathcal{X}} p(x)^\alpha \mathrm{d}x\right)^{\frac{1}{\alpha}} \left(\int_{\mathcal{X}} q(x)^\beta \mathrm{d}x\right)^{\frac{1}{\beta}}, \tag{8}$$

where the exponents α and β satisfy αβ > 0 as well as the exponent conjugacy condition 1/α + 1/β = 1. We also write β = ᾱ = α/(α−1), meaning that α and β are conjugate Hölder exponents. We check that α > 1 and β > 1. Hölder's inequality holds even if the lhs is infinite (meaning that the integral diverges) since the rhs is also infinite in that case.

The reverse Hölder inequality holds for conjugate exponents 1/α + 1/β = 1 with αβ < 0 (then 0 < α < 1 and β < 0, or α < 0 and 0 < β < 1):

$$\int_{\mathcal{X}} p(x)q(x)\mathrm{d}x \geq \left(\int_{\mathcal{X}} p(x)^\alpha \mathrm{d}x\right)^{\frac{1}{\alpha}} \left(\int_{\mathcal{X}} q(x)^\beta \mathrm{d}x\right)^{\frac{1}{\beta}}. \tag{9}$$

Both Hölder's inequality and the reverse Hölder inequality become tight when p(x)^α ∝ q(x)^β (see the proof in Appendix A).

¹ In a more general form, Hölder's inequality holds for any real- and complex-valued functions. In this work, we only focus on real positive functions that are densities of positive measures.
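The following small sanity check (an added illustration, not from the paper; it assumes SciPy quadrature on the unit interval with two arbitrarily chosen densities) evaluates both sides of the ordinary inequality (8) and of the reverse inequality (9), to make the role of the conjugate exponents explicit.

```python
import numpy as np
from scipy.integrate import quad

def holder_sides(p, q, alpha, lo=0.0, hi=1.0):
    """Return (lhs, rhs) of Hölder's inequality for positive functions p, q on [lo, hi]."""
    beta = alpha / (alpha - 1.0)  # conjugate exponent: 1/alpha + 1/beta = 1
    lhs = quad(lambda x: p(x) * q(x), lo, hi)[0]
    rhs = (quad(lambda x: p(x) ** alpha, lo, hi)[0] ** (1.0 / alpha)
           * quad(lambda x: q(x) ** beta, lo, hi)[0] ** (1.0 / beta))
    return lhs, rhs

p = lambda x: 1.0        # uniform density on [0, 1]
q = lambda x: 0.5 + x    # another density on [0, 1], bounded away from 0

print(holder_sides(p, q, alpha=3.0))   # ordinary Hölder (alpha, beta > 1): lhs <= rhs
print(holder_sides(p, q, alpha=0.5))   # reverse Hölder (0 < alpha < 1, beta < 0): lhs >= rhs
```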
2.1 Definition

Let (X, F, µ) be a measurable space where µ is the Lebesgue measure, and let L^γ(X, µ) denote the Lebesgue space of functions that have their γ-th power of absolute value Lebesgue integrable, for any γ > 0 (when γ ≥ 1, L^γ(X, µ) is a Banach space). We define the following pseudo-divergence:

Definition 1 (Hölder statistical pseudo-divergence, HPD). For conjugate exponents α and β with αβ > 0, the Hölder pseudo-divergence (HPD) between two densities p(x) ∈ L^α(X, µ) and q(x) ∈ L^β(X, µ) of positive measures absolutely continuous with respect to (wrt.) µ is defined by the following log-ratio gap:

$$D^H_\alpha(p(x):q(x)) = -\log\frac{\int_{\mathcal{X}} p(x)q(x)\mathrm{d}x}{\left(\int_{\mathcal{X}} p(x)^\alpha \mathrm{d}x\right)^{\frac{1}{\alpha}}\left(\int_{\mathcal{X}} q(x)^\beta \mathrm{d}x\right)^{\frac{1}{\beta}}}. \tag{10}$$

When 0 < α < 1 and β = ᾱ = α/(α−1) < 0, or α < 0 and 0 < β < 1, the reverse HPD is defined by:

$$D^H_\alpha(p(x):q(x)) = \log\frac{\int_{\mathcal{X}} p(x)q(x)\mathrm{d}x}{\left(\int_{\mathcal{X}} p(x)^\alpha \mathrm{d}x\right)^{\frac{1}{\alpha}}\left(\int_{\mathcal{X}} q(x)^\beta \mathrm{d}x\right)^{\frac{1}{\beta}}}. \tag{11}$$

By Hölder's inequality and the reverse Hölder inequality, D^H_α(p(x):q(x)) ≥ 0 with D^H_α(p(x):q(x)) = 0 iff p(x)^α ∝ q(x)^β, or equivalently q(x) ∝ p(x)^{α/β} = p(x)^{α−1}. When α > 1, x^{α−1} is monotonically increasing, and D^H_α is indeed a pseudo-divergence. However, the reverse HPD is not a pseudo-divergence because x^{α−1} is monotonically decreasing when α < 0 or 0 < α < 1. Therefore we only consider the HPD with α > 1 in the remainder, and leave aside the notion of reverse Hölder divergence.

When α = β = 2, the HPD becomes the Cauchy-Schwarz divergence CS [16]:

$$D^H_2(p(x):q(x)) = \mathrm{CS}(p(x):q(x)) = -\log\frac{\int_{\mathcal{X}} p(x)q(x)\mathrm{d}x}{\left(\int_{\mathcal{X}} p(x)^2 \mathrm{d}x\right)^{\frac{1}{2}}\left(\int_{\mathcal{X}} q(x)^2 \mathrm{d}x\right)^{\frac{1}{2}}}, \tag{12}$$

which has been proved useful to get closed-form divergence formulas between mixtures of exponential families with conic or affine natural parameter spaces [21].

The Cauchy-Schwarz divergence is proper for probability densities since the Cauchy-Schwarz inequality becomes an equality iff q(x) = λ p(x)^{α−1} = λ p(x), which implies that λ = ∫_X λ p(x) dx = ∫_X q(x) dx = 1. It is however not proper for positive densities.

Fact 1 (CS is only proper for probability densities). The Cauchy-Schwarz divergence CS(p(x):q(x)) is proper for square-integrable probability densities p(x), q(x) ∈ L²(X, µ) but not proper for positive square-integrable densities.

2.2 Properness and improperness

In the general case, when α ≠ 2, the divergence D^H_α is not even proper for normalized (probability) densities, not to mention general unnormalized (positive) densities. Indeed, when p(x) = q(x), we have:

$$D^H_\alpha(p(x):p(x)) = -\log\frac{\int p(x)^2\mathrm{d}x}{\left(\int p(x)^\alpha \mathrm{d}x\right)^{\frac{1}{\alpha}}\left(\int p(x)^{\frac{\alpha}{\alpha-1}} \mathrm{d}x\right)^{\frac{\alpha-1}{\alpha}}} \neq 0 \quad\text{when } \alpha \neq 2. \tag{13}$$

Let us consider the general case. For unnormalized positive distributions p̃(x) and q̃(x) (the tilde notation stems from the notation of homogeneous coordinates in projective geometry), the inequality becomes an equality when p̃(x)^α ∝ q̃(x)^β, i.e., p(x)^α ∝ q(x)^β, or q(x) ∝ p(x)^{α/β} = p(x)^{α−1}. We can check that D^H_α(p(x):λ p(x)^{α−1}) = 0 for any λ > 0:

$$-\log\frac{\int p(x)\,\lambda p(x)^{\alpha-1}\mathrm{d}x}{\left(\int p(x)^\alpha \mathrm{d}x\right)^{\frac{1}{\alpha}}\left(\int \lambda^\beta p(x)^{(\alpha-1)\beta} \mathrm{d}x\right)^{\frac{1}{\beta}}} = -\log\frac{\int p(x)^\alpha \mathrm{d}x}{\left(\int p(x)^\alpha \mathrm{d}x\right)^{\frac{1}{\alpha}}\left(\int p(x)^\alpha \mathrm{d}x\right)^{\frac{1}{\beta}}} = 0, \tag{14}$$

since (α−1)β = (α−1)ᾱ = (α−1) α/(α−1) = α.

For α = 2, we find indeed that D^H_2(p(x):λ p(x)) = CS(p(x):p(x)) = 0 for any λ > 0.

Fact 2 (HPD is improper). The Hölder pseudo-divergences are improper statistical distances.

2.3 Reference duality

In general, Hölder divergences are asymmetric when α ≠ β (≠ 2) but enjoy the following reference duality [39]:

$$D^H_\alpha(p(x):q(x)) = D^H_\beta(q(x):p(x)) = D^H_{\frac{\alpha}{\alpha-1}}(q(x):p(x)). \tag{15}$$

Fact 3 (Reference duality of HPD). The Hölder pseudo-divergences satisfy the reference duality α ↔ β = α/(α−1): D^H_α(p(x):q(x)) = D^H_β(q(x):p(x)) = D^H_{α/(α−1)}(q(x):p(x)).
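The improperness of the HPD and the reference duality of Fact 3 are easy to observe numerically. The sketch below is an added illustration (SciPy quadrature, arbitrary Gaussian parameters and integration bounds), not part of the original derivation: it shows D^H_α(p:p) > 0 for α ≠ 2 as in (13), tightness for q ∝ p^{α−1} as in (14), and the duality (15).

```python
import numpy as np
from scipy.integrate import quad

def hpd(p, q, alpha, lo=-20.0, hi=20.0):
    """Hölder pseudo-divergence D^H_alpha of Eq. (10), by numerical quadrature."""
    beta = alpha / (alpha - 1.0)
    pq = quad(lambda x: p(x) * q(x), lo, hi)[0]
    pa = quad(lambda x: p(x) ** alpha, lo, hi)[0]
    qb = quad(lambda x: q(x) ** beta, lo, hi)[0]
    return -np.log(pq / (pa ** (1.0 / alpha) * qb ** (1.0 / beta)))

p = lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)           # N(0, 1)
q = lambda x: np.exp(-0.25 * (x - 1.0) ** 2) / np.sqrt(4.0 * np.pi)  # N(1, 2)
alpha = 3.0

print(hpd(p, p, alpha))                                      # > 0 although p = q: HPD is improper (Eq. 13)
print(hpd(p, lambda x: 2.0 * p(x) ** (alpha - 1.0), alpha))  # ~ 0: tight when q ∝ p^(alpha-1) (Eq. 14)
print(hpd(p, q, alpha), hpd(q, p, alpha / (alpha - 1.0)))    # equal values: reference duality (Eq. 15)
```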
An arithmetic symmetrization of the HPD yields a symmetric HPD S^H_α, given by:

$$S^H_\alpha(p(x):q(x)) = S^H_{\bar{\alpha}}(q(x):p(x)) = \frac{D^H_\alpha(p(x):q(x)) + D^H_\alpha(q(x):p(x))}{2} = -\log\frac{\int p(x)q(x)\mathrm{d}x}{\sqrt{\left(\int p(x)^\alpha\mathrm{d}x\right)^{\frac{1}{\alpha}}\left(\int p(x)^{\bar{\alpha}}\mathrm{d}x\right)^{\frac{1}{\bar{\alpha}}}\left(\int q(x)^\alpha\mathrm{d}x\right)^{\frac{1}{\alpha}}\left(\int q(x)^{\bar{\alpha}}\mathrm{d}x\right)^{\frac{1}{\bar{\alpha}}}}}. \tag{16}$$

2.4 HPD is a projective divergence

In the above definition, the densities p(x) and q(x) can either be positive or normalized probability distributions. Let p̃(x) and q̃(x) denote positive (not necessarily normalized) measures, and w(p̃) = ∫_X p̃(x) dx the overall mass, so that p(x) = p̃(x)/w(p̃) is the corresponding normalized probability measure. Then we check that the HPD is a projective divergence [13] since:

$$D^H_\alpha(\tilde{p}(x):\tilde{q}(x)) = D^H_\alpha(p(x):q(x)), \tag{17}$$

or in general:

$$D^H_\alpha(\lambda p(x):\lambda' q(x)) = D^H_\alpha(p(x):q(x)) \tag{18}$$

for all prescribed constants λ, λ' > 0. Projective divergences may also be called "angular divergences" or "cosine divergences" since they do not depend on the total mass of the measure densities.

Fact 4 (HPD is projective). The Hölder pseudo-divergences are projective distances.

2.5 Escort distributions and skew Bhattacharyya divergences

Let us define with respect to the probability measures p(x) ∈ L^{1/α}(X, µ) and q(x) ∈ L^{1/β}(X, µ) the following escort probability distributions [2]:

$$p^E_\alpha(x) = \frac{p(x)^{\frac{1}{\alpha}}}{\int p(x)^{\frac{1}{\alpha}}\mathrm{d}x}, \tag{19}$$

and

$$q^E_\beta(x) = \frac{q(x)^{\frac{1}{\beta}}}{\int q(x)^{\frac{1}{\beta}}\mathrm{d}x}. \tag{20}$$

Since the HPD is a projective divergence, we compute with respect to the conjugate exponents α and β the Hölder escort divergence (HED):

$$D^{HE}_\alpha(p(x):q(x)) = D^H_\alpha(p^E_\alpha(x):q^E_\beta(x)) = -\log\int_{\mathcal{X}} p(x)^{1/\alpha} q(x)^{1-1/\alpha}\mathrm{d}x = B_{1/\alpha}(p(x):q(x)), \tag{21}$$

which turns out to be the familiar skew Bhattacharyya divergence B_{1/α}(p(x):q(x)), see [22].

Fact 5 (HED as a skew Bhattacharyya divergence). The Hölder escort divergence amounts to a skew Bhattacharyya divergence: D^{HE}_α(p(x):q(x)) = B_{1/α}(p(x):q(x)) for any α > 0.

In particular, the Cauchy-Schwarz escort divergence CS^{HE}(p(x):q(x)) amounts to the Bhattacharyya distance [6] B(p(x):q(x)) = −log ∫_X √(p(x)q(x)) dx:

$$\mathrm{CS}^{HE}(p(x):q(x)) = D^{HE}_2(p(x):q(x)) = D^H_2(p^E_2(x):q^E_2(x)) = B_{1/2}(p(x):q(x)) = B(p(x):q(x)). \tag{22}$$

Observe that the Cauchy-Schwarz escort distributions are the square root density representations [36] of distributions.

3 Proper Hölder divergence

3.1 Definition

Let p(x) and q(x) be positive measures in L^γ(X, µ) for a prescribed scalar value γ > 0. Plugging the positive measures p(x)^{γ/α} and q(x)^{γ/β} into the definition of the HPD D^H_α, we get the following definition:

Definition 2 (Proper Hölder divergence, HD). For conjugate exponents α, β > 0 and γ > 0, the proper Hölder divergence between two densities p(x) and q(x) is defined by:

$$D^H_{\alpha,\gamma}(p(x):q(x)) = D^H_\alpha\!\left(p(x)^{\gamma/\alpha}:q(x)^{\gamma/\beta}\right) = -\log\left(\frac{\int_{\mathcal{X}} p(x)^{\gamma/\alpha}q(x)^{\gamma/\beta}\mathrm{d}x}{\left(\int_{\mathcal{X}} p(x)^\gamma\mathrm{d}x\right)^{1/\alpha}\left(\int_{\mathcal{X}} q(x)^\gamma\mathrm{d}x\right)^{1/\beta}}\right). \tag{23}$$

By definition, D^H_{α,γ}(p:q) is a two-parameter family of statistical dissimilarity measures. Following Hölder's inequality, we can check that D^H_{α,γ}(p(x):q(x)) ≥ 0 and D^H_{α,γ}(p(x):q(x)) = 0 iff p(x)^γ ∝ q(x)^γ, i.e., p(x) ∝ q(x) (see Appendix A). If p(x) and q(x) belong to the statistical probability manifold, then D^H_{α,γ}(p(x):q(x)) = 0 iff p(x) = q(x) almost everywhere. This says that HD is a proper divergence for probability measures, and it becomes a pseudo-divergence for positive measures. Note that we have abused the notation D^H to denote both the Hölder pseudo-divergence (with one subscript) and the Hölder divergence (with two subscripts).
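To contrast with the improper HPD, the following added sketch (again SciPy-based, with arbitrary parameter choices) evaluates the proper Hölder divergence D^H_{α,γ} of (23) by quadrature: it vanishes on the diagonal, is positive for distinct probability densities, and is unchanged under rescaling of an argument (its projectivity is discussed next).

```python
import numpy as np
from scipy.integrate import quad

def hd(p, q, alpha, gamma, lo=-20.0, hi=20.0):
    """Proper Hölder divergence D^H_{alpha,gamma} of Eq. (23), by numerical quadrature."""
    beta = alpha / (alpha - 1.0)
    num = quad(lambda x: p(x) ** (gamma / alpha) * q(x) ** (gamma / beta), lo, hi)[0]
    pg = quad(lambda x: p(x) ** gamma, lo, hi)[0]
    qg = quad(lambda x: q(x) ** gamma, lo, hi)[0]
    return -np.log(num / (pg ** (1.0 / alpha) * qg ** (1.0 / beta)))

p = lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)           # N(0, 1)
q = lambda x: np.exp(-0.25 * (x - 1.0) ** 2) / np.sqrt(4.0 * np.pi)  # N(1, 2)

alpha, gamma = 3.0, 1.5
print(hd(p, q, alpha, gamma))                     # > 0 for distinct probability densities
print(hd(p, p, alpha, gamma))                     # ~ 0: HD is proper (unlike the HPD)
print(hd(lambda x: 5.0 * p(x), q, alpha, gamma))  # unchanged under rescaling: HD is projective
```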
Similar to the HPD, the HD is asymmetric when α ≠ β, with the following reference duality:

$$D^H_{\alpha,\gamma}(p(x):q(x)) = D^H_{\bar{\alpha},\gamma}(q(x):p(x)). \tag{24}$$

The HD can be symmetrized as:

$$S^H_{\alpha,\gamma}(p:q) = \frac{D^H_{\alpha,\gamma}(p:q) + D^H_{\alpha,\gamma}(q:p)}{2} = -\log\sqrt{\frac{\int_{\mathcal{X}} p(x)^{\gamma/\alpha}q(x)^{\gamma/\beta}\mathrm{d}x\,\int_{\mathcal{X}} p(x)^{\gamma/\beta}q(x)^{\gamma/\alpha}\mathrm{d}x}{\int_{\mathcal{X}} p(x)^\gamma\mathrm{d}x\,\int_{\mathcal{X}} q(x)^\gamma\mathrm{d}x}}. \tag{25}$$

Furthermore, one can easily check that the HD is a projective divergence. For conjugate exponents α, β > 0 and γ > 0, we rewrite the definition of the HD as:

$$D^H_{\alpha,\gamma}(p(x):q(x)) = -\log\int_{\mathcal{X}}\left(\frac{p(x)^\gamma}{\int_{\mathcal{X}} p(x)^\gamma\mathrm{d}x}\right)^{1/\alpha}\left(\frac{q(x)^\gamma}{\int_{\mathcal{X}} q(x)^\gamma\mathrm{d}x}\right)^{1/\beta}\mathrm{d}x = -\log\int\left(p^E_{1/\gamma}(x)\right)^{1/\alpha}\left(q^E_{1/\gamma}(x)\right)^{1/\beta}\mathrm{d}x = B_{1/\alpha}\!\left(p^E_{1/\gamma}(x):q^E_{1/\gamma}(x)\right).$$

Therefore the HD can be reinterpreted as the skew Bhattacharyya divergence [22] between the escort distributions. In particular, when γ = 1, we get:

$$D^H_{\alpha,1}(p(x):q(x)) = -\log\left(\int_{\mathcal{X}} p(x)^{1/\alpha}q(x)^{1/\beta}\mathrm{d}x\right) = B_{1/\alpha}(p(x):q(x)). \tag{26}$$

Fact 6. The two-parametric family of statistical Hölder divergences D^H_{α,γ} passes through the one-parametric family of skew Bhattacharyya divergences when γ = 1.

3.2 Special case: The Cauchy-Schwarz divergence

We consider the intersection of the uni-parametric class of Hölder pseudo-divergences (HPD) with the bi-parametric class of proper Hölder divergences (HD): That is, the class of divergences which belong to both HPD and HD. Then we must have γ/α = γ/β = 1. Since 1/α + 1/β = 1, we get α = β = γ = 2. Therefore the Cauchy-Schwarz (CS) divergence is the unique divergence belonging to both the HPD and HD classes:

$$D^H_{2,2}(p(x):q(x)) = D^H_2(p(x):q(x)) = \mathrm{CS}(p(x):q(x)). \tag{27}$$

In fact, the CS divergence is the intersection of the four classes HPD, symmetric HPD, HD, and symmetric HD. Figure 1 displays a diagram of those divergence classes with their inclusion relationships.

[Figure 1: Hölder proper divergence (bi-parametric) and Hölder improper pseudo-divergence (uni-parametric) intersect at the unique non-parametric Cauchy-Schwarz divergence. By using escort distributions, Hölder divergences encapsulate the skew Bhattacharyya distances.]

As stated earlier, notice that the Cauchy-Schwarz inequality:

$$\int p(x)q(x)\mathrm{d}x \leq \sqrt{\left(\int p(x)^2\mathrm{d}x\right)\left(\int q(x)^2\mathrm{d}x\right)}, \tag{28}$$

is not proper as it is an equality when p(x) and q(x) are linearly dependent (i.e., p(x) = λ q(x) for λ > 0). The arguments of the CS divergence are square-integrable real-valued density functions p(x) and q(x). Thus the Cauchy-Schwarz divergence is not proper for positive measures but is proper for normalized probability distributions since ∫ p(x) dx = ∫ λ q(x) dx = 1 implies that λ = 1.

3.3 Limit cases of Hölder divergences and statistical estimation

Let us define the inner product of unnormalized densities as:

$$\langle \tilde{p}(x),\tilde{q}(x)\rangle = \int_{\mathcal{X}} \tilde{p}(x)\tilde{q}(x)\mathrm{d}x \tag{29}$$

(for L²(X, µ) integrable functions), and define the L_α norm of densities as ‖p̃(x)‖_α = (∫_X p̃(x)^α dx)^{1/α} for α ≥ 1. Then the CS divergence can be concisely written as:

$$\mathrm{CS}(\tilde{p}(x),\tilde{q}(x)) = -\log\frac{\langle \tilde{p}(x),\tilde{q}(x)\rangle}{\|\tilde{p}(x)\|_2\,\|\tilde{q}(x)\|_2}, \tag{30}$$

and the Hölder pseudo-divergence writes as:

$$D^H_\alpha(\tilde{p}(x),\tilde{q}(x)) = -\log\frac{\langle \tilde{p}(x),\tilde{q}(x)\rangle}{\|\tilde{p}(x)\|_\alpha\,\|\tilde{q}(x)\|_{\bar{\alpha}}}. \tag{31}$$

When α → 1⁺, we have ᾱ = α/(α−1) → +∞.
Then it comes that:

$$\lim_{\alpha\to 1^+} D^H_\alpha(\tilde{p}(x),\tilde{q}(x)) = -\log\frac{\langle \tilde{p}(x),\tilde{q}(x)\rangle}{\|\tilde{p}(x)\|_1\,\|\tilde{q}(x)\|_\infty} = -\log\langle \tilde{p}(x),\tilde{q}(x)\rangle + \log\int_{\mathcal{X}} \tilde{p}(x)\mathrm{d}x + \log\max_{x\in\mathcal{X}} \tilde{q}(x). \tag{32}$$

When α → +∞ and ᾱ → 1⁺, we have:

$$\lim_{\alpha\to +\infty} D^H_\alpha(\tilde{p}(x),\tilde{q}(x)) = -\log\frac{\langle \tilde{p}(x),\tilde{q}(x)\rangle}{\|\tilde{p}(x)\|_\infty\,\|\tilde{q}(x)\|_1} = -\log\langle \tilde{p}(x),\tilde{q}(x)\rangle + \log\max_{x\in\mathcal{X}} \tilde{p}(x) + \log\int_{\mathcal{X}} \tilde{q}(x)\mathrm{d}x. \tag{33}$$

Now consider a pair of probability densities p(x) and q(x). We have:

$$\lim_{\alpha\to 1^+} D^H_\alpha(p(x),q(x)) = -\log\langle p(x),q(x)\rangle + \max_{x\in\mathcal{X}}\log q(x),$$
$$\lim_{\alpha\to +\infty} D^H_\alpha(p,q) = -\log\langle p(x),q(x)\rangle + \max_{x\in\mathcal{X}}\log p(x),$$
$$D^H_2(p,q) = -\log\langle p(x),q(x)\rangle + \log\|p(x)\|_2 + \log\|q(x)\|_2. \tag{34}$$

In an estimation scenario, p(x) is fixed and q(x|θ) = q_θ(x) is free along a parametric manifold M; then minimizing the Hölder divergence reduces to:

$$\arg\min_{\theta\in M}\ \lim_{\alpha\to 1^+} D^H_\alpha(p(x),q_\theta(x)) = \arg\min_{\theta\in M}\left(-\log\langle p(x),q_\theta(x)\rangle + \max_{x\in\mathcal{X}}\log q_\theta(x)\right),$$
$$\arg\min_{\theta\in M}\ \lim_{\alpha\to +\infty} D^H_\alpha(p(x),q_\theta(x)) = \arg\min_{\theta\in M}\left(-\log\langle p(x),q_\theta(x)\rangle\right),$$
$$\arg\min_{\theta\in M}\ D^H_2(p(x),q_\theta(x)) = \arg\min_{\theta\in M}\left(-\log\langle p(x),q_\theta(x)\rangle + \log\|q_\theta(x)\|_2\right). \tag{35}$$

Therefore when α varies from 1 to +∞, only the regularizer in the minimization problem changes. In any case, the Hölder divergence always has the term −log⟨p(x),q(x)⟩, which shares a similar form with the Bhattacharyya distance [6]:

$$B(p(x):q(x)) = -\log\int_{\mathcal{X}}\sqrt{p(x)}\sqrt{q(x)}\,\mathrm{d}x = -\log\left\langle\sqrt{p(x)},\sqrt{q(x)}\right\rangle. \tag{36}$$

The HPD between p̃(x) and q̃(x) is also closely related to their cosine similarity ⟨p̃(x),q̃(x)⟩/(‖p̃(x)‖₂‖q̃(x)‖₂). When α = 2, the HD is exactly the cosine similarity after a non-linear transformation.

4 Closed-form expressions of HPD and HD for conic and affine exponential families

We report closed-form formulas for the HPD and HD between two distributions belonging to the same exponential family provided that the natural parameter space is a cone or affine. A cone Ω is a convex domain such that for P, Q ∈ Ω and any λ > 0, we have P + λQ ∈ Ω. For example, the set of positive measures absolutely continuous with a base measure µ is a cone. Recall that an exponential family [23] has a density function p(x;θ) that can be written canonically as:

$$p(x;\theta) = \exp\left(\langle t(x),\theta\rangle - F(\theta) + k(x)\right). \tag{37}$$

In this work, we consider the auxiliary carrier measure term k(x) = 0. The base measure is either the Lebesgue measure µ or the counting measure µ_C. A Conic or Affine Exponential Family (CAEF) is an exponential family with the natural parameter space Θ a cone or affine. The log-normalizer F(θ) is a strictly convex function also called the cumulant generating function [2].

Lemma 1 (HPD and HD for CAEFs). For distributions p(x;θ_p) and p(x;θ_q) belonging to the same exponential family with conic or affine natural parameter space [21], both the HPD and HD are available in closed form:

$$D^H_\alpha(p:q) = \frac{1}{\alpha}F(\alpha\theta_p) + \frac{1}{\beta}F(\beta\theta_q) - F(\theta_p+\theta_q), \tag{38}$$

$$D^H_{\alpha,\gamma}(p:q) = \frac{1}{\alpha}F(\gamma\theta_p) + \frac{1}{\beta}F(\gamma\theta_q) - F\!\left(\frac{\gamma}{\alpha}\theta_p + \frac{\gamma}{\beta}\theta_q\right). \tag{39}$$

Proof. Consider k(x) = 0 and a conic or affine natural parameter space Θ (see [21]). Then for all a, b > 0, we have:

$$\left(\int p(x)^a\mathrm{d}x\right)^{\frac{1}{b}} = \exp\left(\frac{1}{b}F(a\theta_p) - \frac{a}{b}F(\theta_p)\right), \tag{40}$$

since aθ_p ∈ Θ.
Indeed, we have:

$$\left(\int p(x)^a\mathrm{d}x\right)^{1/b} = \left(\int \exp\left(\langle a\theta,t(x)\rangle - aF(\theta)\right)\mathrm{d}x\right)^{1/b} = \left(\int \exp\left(\langle a\theta,t(x)\rangle - F(a\theta) + F(a\theta) - aF(\theta)\right)\mathrm{d}x\right)^{1/b} = \exp\left(\frac{1}{b}F(a\theta) - \frac{a}{b}F(\theta)\right)\underbrace{\left(\int \exp\left(\langle a\theta,t(x)\rangle - F(a\theta)\right)\mathrm{d}x\right)^{1/b}}_{=1}.$$

Similarly, we have for all a, b > 0 (details omitted):

$$\int p(x)^a q(x)^b\mathrm{d}x = \exp\left(F(a\theta_p + b\theta_q) - aF(\theta_p) - bF(\theta_q)\right), \tag{41}$$

since aθ_p + bθ_q ∈ Θ. Therefore, we get:

$$D^H_\alpha(p(x):q(x)) = -\log\frac{\int p(x)q(x)\mathrm{d}x}{\left(\int p(x)^\alpha\mathrm{d}x\right)^{\frac{1}{\alpha}}\left(\int q(x)^\beta\mathrm{d}x\right)^{\frac{1}{\beta}}} \tag{42}$$
$$= -F(\theta_p+\theta_q) + F(\theta_p) + F(\theta_q) + \frac{1}{\alpha}F(\alpha\theta_p) - F(\theta_p) + \frac{1}{\beta}F(\beta\theta_q) - F(\theta_q) \tag{43}$$
$$= \frac{1}{\alpha}F(\alpha\theta_p) + \frac{1}{\beta}F(\beta\theta_q) - F(\theta_p+\theta_q) \geq 0, \tag{44}$$

$$D^H_{\alpha,\gamma}(p(x):q(x)) = -\log\frac{\int p(x)^{\gamma/\alpha}q(x)^{\gamma/\beta}\mathrm{d}x}{\left(\int p(x)^\gamma\mathrm{d}x\right)^{\frac{1}{\alpha}}\left(\int q(x)^\gamma\mathrm{d}x\right)^{\frac{1}{\beta}}} \tag{45}$$
$$= -F\!\left(\frac{\gamma}{\alpha}\theta_p + \frac{\gamma}{\beta}\theta_q\right) + \frac{\gamma}{\alpha}F(\theta_p) + \frac{\gamma}{\beta}F(\theta_q) + \frac{1}{\alpha}F(\gamma\theta_p) - \frac{\gamma}{\alpha}F(\theta_p) + \frac{1}{\beta}F(\gamma\theta_q) - \frac{\gamma}{\beta}F(\theta_q) \tag{46}$$
$$= \frac{1}{\alpha}F(\gamma\theta_p) + \frac{1}{\beta}F(\gamma\theta_q) - F\!\left(\frac{\gamma}{\alpha}\theta_p + \frac{\gamma}{\beta}\theta_q\right) \geq 0. \tag{47}$$

When 1 > α > 0, we have β = α/(α−1) < 0. To get similar results for the reverse Hölder divergence, we need the natural parameter space to be affine (e.g., isotropic Gaussians or multinomials, see [27]).

In particular, if p(x) and q(x) belong to the same exponential family so that p(x) = exp(⟨θ_p, t(x)⟩ − F(θ_p)) and q(x) = exp(⟨θ_q, t(x)⟩ − F(θ_q)), one can easily check that D^H_α(p(x;θ_p):p(x;θ_q)) = 0 iff θ_q = (α−1)θ_p. For the HD, we can check that D^H_{α,γ}(p(x):p(x)) = 0 (properness) since 1/α + 1/β = 1.

The following result is straightforward from Lemma 1.

Lemma 2 (Symmetric HPD and HD for CAEFs). For distributions p(x;θ_p) and p(x;θ_q) belonging to the same exponential family with conic or affine natural parameter space [21], the symmetric HPD and HD are available in closed form:

$$S^H_\alpha(p(x):q(x)) = \frac{1}{2}\left[\frac{1}{\alpha}F(\alpha\theta_p) + \frac{1}{\beta}F(\beta\theta_p) + \frac{1}{\alpha}F(\alpha\theta_q) + \frac{1}{\beta}F(\beta\theta_q)\right] - F(\theta_p+\theta_q); \tag{48}$$

$$S^H_{\alpha,\gamma}(p(x):q(x)) = \frac{1}{2}\left[F(\gamma\theta_p) + F(\gamma\theta_q) - F\!\left(\frac{\gamma}{\alpha}\theta_p + \frac{\gamma}{\beta}\theta_q\right) - F\!\left(\frac{\gamma}{\beta}\theta_p + \frac{\gamma}{\alpha}\theta_q\right)\right]. \tag{49}$$

Remark 1. By reference duality,

$$S^H_\alpha(p(x):q(x)) = S^H_{\bar{\alpha}}(p(x):q(x));\qquad S^H_{\alpha,\gamma}(p(x):q(x)) = S^H_{\bar{\alpha},\gamma}(p(x):q(x)).$$
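As a worked instance of Lemmas 1 and 2, the sketch below (an added illustration) specializes the closed forms (38), (39) and (48) to the univariate Gaussian family, whose natural parameters θ = (μ/σ², −1/(2σ²)) lie in a cone, and cross-checks the closed-form HPD against direct quadrature of (10). The log-normalizer F written below is the standard Gaussian one; the means and variances are arbitrary choices.

```python
import numpy as np
from scipy.integrate import quad

# Univariate Gaussian as an exponential family: p(x; theta) = exp(theta1*x + theta2*x^2 - F(theta)),
# with theta = (mu/sigma^2, -1/(2 sigma^2)); the natural parameter space {theta2 < 0} is a cone.
def F(t):
    t1, t2 = t
    return -t1 ** 2 / (4.0 * t2) + 0.5 * np.log(-np.pi / t2)

def natural(mu, var):
    return np.array([mu / var, -0.5 / var])

def hpd_closed_form(tp, tq, alpha):
    """Hölder pseudo-divergence between same-family Gaussians via Eq. (38)."""
    beta = alpha / (alpha - 1.0)
    return F(alpha * tp) / alpha + F(beta * tq) / beta - F(tp + tq)

def hd_closed_form(tp, tq, alpha, gamma):
    """Proper Hölder divergence via Eq. (39)."""
    beta = alpha / (alpha - 1.0)
    return (F(gamma * tp) / alpha + F(gamma * tq) / beta
            - F((gamma / alpha) * tp + (gamma / beta) * tq))

def sym_hpd_closed_form(tp, tq, alpha):
    """Symmetric HPD via Eq. (48)."""
    beta = alpha / (alpha - 1.0)
    return 0.5 * (F(alpha * tp) / alpha + F(beta * tp) / beta
                  + F(alpha * tq) / alpha + F(beta * tq) / beta) - F(tp + tq)

def hpd_numeric(mu_p, var_p, mu_q, var_q, alpha, lo=-30.0, hi=30.0):
    """Sanity check: the same HPD via Eq. (10) and numerical quadrature."""
    beta = alpha / (alpha - 1.0)
    p = lambda x: np.exp(-0.5 * (x - mu_p) ** 2 / var_p) / np.sqrt(2 * np.pi * var_p)
    q = lambda x: np.exp(-0.5 * (x - mu_q) ** 2 / var_q) / np.sqrt(2 * np.pi * var_q)
    pq = quad(lambda x: p(x) * q(x), lo, hi)[0]
    pa = quad(lambda x: p(x) ** alpha, lo, hi)[0]
    qb = quad(lambda x: q(x) ** beta, lo, hi)[0]
    return -np.log(pq / (pa ** (1.0 / alpha) * qb ** (1.0 / beta)))

tp, tq = natural(0.0, 1.0), natural(1.0, 2.0)    # N(0, 1) and N(1, 2)
print(hpd_closed_form(tp, tq, 3.0), hpd_numeric(0.0, 1.0, 1.0, 2.0, 3.0))  # closed form vs quadrature: should agree
print(hpd_closed_form(natural(0.0, 1.0), natural(1.0, 1.0), 2.0))          # Cauchy-Schwarz case: 0.25
print(hd_closed_form(tp, tp, 3.0, 2.0))                                    # HD is proper: 0 on the diagonal
print(sym_hpd_closed_form(tp, tq, 3.0), sym_hpd_closed_form(tp, tq, 1.5))  # Remark 1: invariant under alpha <-> alpha/(alpha-1)
```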