Self annealing and self annihilation: Unifying deterministic annealing and relaxation labeling

Anand Rangarajan
Image Processing and Analysis Group
Department of Diagnostic Radiology
Yale University School of Medicine
New Haven, CT, USA

Abstract

Deterministic annealing and relaxation labeling algorithms for classification and matching are presented and discussed. A new approach, self annealing, is introduced to bring deterministic annealing and relaxation labeling into accord. Self annealing results in an emergent linear schedule for winner-take-all and linear assignment problems. Self annihilation, a generalization of self annealing, is capable of performing the useful function of symmetry breaking. The original relaxation labeling algorithm is then shown to arise from an approximation to either the self annealing energy function or the corresponding dynamical system. With this relationship in place, self annihilation can be introduced into the relaxation labeling framework. Experimental results on synthetic matching and labeling problems clearly demonstrate the three-way relationship between deterministic annealing, relaxation labeling and self annealing.

Keywords: Deterministic annealing, relaxation labeling, self annealing, self amplification, self annihilation, softmax, softassign.

1 Introduction

Labeling and matching problems abound in computer vision and pattern recognition (CVPR). It is not an exaggeration to state that some form or the other of the basic problems of template matching and data clustering has remained central to the CVPR and neural networks (NN) communities for about three decades [1]. Due to the somewhat disparate natures of these communities, different frameworks for formulating and solving these two problems have emerged, and it is not immediately obvious how to go about reconciling some of the differences between these frameworks so that they can benefit from each other.
In this paper, we pick two such frameworks, deterministic annealing [2] and relaxation labeling [3], which arose mainly in the neural networks and pattern recognition communities respectively. Deterministic annealing has its origins in statistical physics and more recently in Hopfield networks [4]. It has been applied with varying degrees of success to a variety of image matching and labeling problems. In the field of neural networks, deterministic annealing developed from its somewhat crude origins in the Hopfield-Tank networks [4] to include fairly sophisticated treatment of constraint satisfaction and energy minimization by drawing on well established principles in statistical physics [5]. Recently, for both matching [6] and classification [7] problems, a fairly coherent framework and set of algorithms have emerged. These algorithms range from using the softmax [8] or softassign [9] for constraint satisfaction, to dynamics that are directly derived from or merely mimic the Expectation-Maximization (EM) approach [10].

The term relaxation labeling (RL) originally referred to a heuristic dynamical system developed in [11]. RL specified a discrete time dynamical system in which class labels (typically in image segmentation problems) were refined while taking relationships in the pixel and label arrays into account. As interest in the technique grew, many bifurcations, offshoots and generalizations of the basic idea developed; examples are the product combination rule [12], the optimization approach [13], projected gradient descent [3], discrete relaxation [14], and probabilistic relaxation [15]. RL in its basic form is a discrete time update equation that is suitably (and fairly obviously) modified depending on the problem of interest: image matching, segmentation, or classification. The more principled deviations from the basic form of RL replaced the discrete time update rule by gradient descent and projected gradient descent [3, 13] on energy functions.
However, recently it has been shown [16] that the original heuristic RL dynamical system minimizes the labeling energy function. It is now fairly clear that both continuous time projected gradient descent and discrete time RL dynamical systems can be used to minimize the same labeling energy function.

Much of this development prefigured or ran parallel to the evolution of deterministic annealing (DA) dynamical systems, with at least one major difference. While the concerns of continuous time versus discrete time dynamics were common to both RL and DA approaches, within the DA approaches a fundamental distinction was drawn between matching and labeling problems [17]. This distinction was almost never emphasized in RL. In labeling problems, a set of labels has to be assigned to a set of nodes with the constraint that a node should be assigned only one label. A variety of problems not necessarily restricted to CVPR require labeling constraints; some examples are central and pairwise clustering [7, 18], consistent labeling [3], and graph coloring. In matching problems, on the other hand, a set of model nodes has to be assigned to a set of data nodes with the constraint that each model node should be assigned to one and only one data node and vice versa. A variety of problems require matching constraints; some examples are quadratic assignment [2, 19], the traveling salesman problem (TSP) [20, 9], graph matching [21, 22], graph partitioning (with minor differences) [20, 23] and point matching [24, 25]. The original neural network approaches used a penalty function approach at fixed temperature [4]. With the importance of deterministic annealing and exact constraint satisfaction becoming clear, these approaches quickly gave way to the softmax [26, 20, 23, 27, 28], softassign [29, 9, 22], Lagrangian relaxation [29, 30] and projected gradient descent [31, 32, 33, 34] approaches, usually performed within deterministic annealing.
Here, we return to the original relaxation labeling dynamical system since, ironically, it is in the RL discrete time dynamical system that we find the closest parallel to recent discrete-time deterministic annealing algorithms. Even after restricting our focus to discrete time dynamical systems, important differences remain, such as the manner in which constraint satisfaction is performed, relaxation at a fixed temperature, and the nature of the update mechanism. A new approach, self annealing, is presented to unify relaxation labeling and deterministic annealing. We show that the self annealing dynamical system, which is derived from a corresponding energy function, corresponds to deterministic annealing with a linear schedule. Also, the original RL update equation can be derived from the self annealing dynamical system via a Taylor series approximation. This suggests that a close three-way relationship exists between DA, RL and self annealing, with self annealing acting as a bridge between DA and RL.

2 Deterministic Annealing

Deterministic annealing arose as a computational shortcut to simulated annealing. Closely related to mean field theory, the method consists of minimizing the free energy at each temperature setting. The free energy is separately constructed for each problem. The temperature is reduced according to a pre-specified annealing schedule. Deterministic annealing has been applied, with varied success, to a variety of combinatorial optimization problems (winner-take-all (WTA), linear assignment, quadratic assignment including the traveling salesman problem, graph matching and graph partitioning, clustering (central and pairwise), the Ising model, etc.) and to nonlinear optimization problems as well. In this paper, we focus on the relationship between deterministic annealing and relaxation labeling with emphasis on matching and labeling problems. The archetypal problem at the heart of labeling problems is the winner-take-all, and similarly for matching problems, it is linear assignment that is central.
Consequently, our development dwells considerably on these two problems.

2.1 The winner-take-all

The winner-take-all problem is stated as follows: Given a set of numbers $\{T_i,\ i \in \{1,\ldots,N\}\}$, find $i^* = \arg\max_i T_i$, or in other words, find the index of the maximum number. Using binary variables $\{s_i,\ i \in \{1,\ldots,N\}\}$, the problem is restated as:

    \max_s \sum_i T_i s_i    (1)

s.t.

    \sum_i s_i = 1, \quad s_i \in \{0,1\}\ \forall i.    (2)

The deterministic annealing free energy is written as follows:

    F_{wta}(v) = -\sum_i T_i v_i + \lambda \Big( \sum_i v_i - 1 \Big) + \frac{1}{\beta} \sum_i v_i \log v_i.    (3)

In equation (3), $v$ is a new set of analog mean field variables summing to one. The transition from binary variables $s$ to analog variables $v$ is deliberately highlighted here. Also, $\beta$ is the inverse temperature to be varied according to an annealing schedule. $\lambda$ is a Lagrange parameter satisfying the WTA constraint. The $x \log x$ form of the barrier function keeps the $v$ variables positive and is also referred to as an entropy term.

We now proceed to solve for the $v$ variables and the Lagrange parameter $\lambda$.
We get (after eliminating $\lambda$)

    v_i = \frac{\exp(\beta T_i)}{\sum_j \exp(\beta T_j)}, \quad \forall i \in \{1,\ldots,N\}.    (4)

This is referred to as the softmax nonlinearity [8]. Deterministic annealing WTA uses the nonlinearity within an annealing schedule. (Here, we gloss over the technical issue of propagating the solution at a given temperature $\beta_n$ to be the initial condition at the next temperature $\beta_{n+1}$.) When there are no ties, this algorithm finds the single winner for any reasonable annealing schedule; quenching at high $\beta$ being one example of an "unreasonable" schedule.

2.2 The linear assignment problem

The linear assignment problem is written as follows: Given a matrix of numbers $\{T_{ai},\ a, i \in \{1,\ldots,N\}\}$, find the permutation that maximizes the assignment.
Using binary variables $\{s_{ai},\ a, i \in \{1,\ldots,N\}\}$, the problem is restated as:

    \max_s \sum_{ai} T_{ai} s_{ai}    (5)

s.t.

    \sum_a s_{ai} = 1, \quad \sum_i s_{ai} = 1, \quad s_{ai} \in \{0,1\}\ \forall a, i.    (6)

The deterministic annealing free energy is written as follows:

    F_{ap}(v) = -\sum_{ai} T_{ai} v_{ai} + \sum_a \mu_a \Big( \sum_i v_{ai} - 1 \Big) + \sum_i \nu_i \Big( \sum_a v_{ai} - 1 \Big) + \frac{1}{\beta} \sum_{ai} v_{ai} \log v_{ai}.    (7)

In equation (7), $v$ is a doubly stochastic mean field matrix with rows and columns summing to one. $(\mu, \nu)$ are Lagrange parameters satisfying the row and column WTA constraints. As in the WTA case, the $x \log x$ form of the barrier function keeps the $v$ variables positive.

We now proceed to solve for the $v$ variables and the Lagrange parameters $(\mu, \nu)$ [29, 2]. We get

    v_{ai} = \exp\big( \beta (T_{ai} - \mu_a - \nu_i) \big), \quad \forall a, i \in \{1,\ldots,N\}.    (8)

The assignment problem is distinguished from the WTA by requiring the satisfaction of two-way WTA constraints as opposed to one.
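Concretely, the deterministic annealing WTA amounts to re-evaluating the softmax (4) under an increasing schedule for $\beta$; since (4) gives the fixed point at each temperature in closed form, the annealing loop simply sharpens the softmax until a single winner emerges. A minimal numpy sketch (the benefit vector T and the geometric schedule are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def softmax(x):
    """Softmax nonlinearity of eq. (4), shifted for numerical stability."""
    e = np.exp(x - x.max())
    return e / e.sum()

def da_wta(T, beta0=0.1, beta_rate=1.5, beta_max=1000.0):
    """Deterministic annealing winner-take-all: apply the softmax while
    slowly increasing the inverse temperature beta."""
    beta = beta0
    v = softmax(beta * T)
    while beta < beta_max:
        beta *= beta_rate         # annealing schedule
        v = softmax(beta * T)     # eq. (4) at the new temperature
    return v

T = np.array([0.3, 1.2, 0.9, 1.1])
v = da_wta(T)
winner = int(v.argmax())          # index of the maximum of T
```

At high $\beta$, $v$ concentrates essentially all of its mass on the winning index, mirroring the transition from the analog variables back to a binary solution.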
Consequently, the Lagrange parameters cannot be solved for in closed form. Rather than solving for the Lagrange parameters using steepest ascent, an iterated row and column normalization method is used to obtain a doubly stochastic matrix at each temperature [29, 9]. Sinkhorn's theorem [35] guarantees the convergence of this method. (This method can be independently derived as coordinate ascent w.r.t. the Lagrange parameters.) With Sinkhorn's method in place, the overall dynamics at each temperature is referred to as the softassign [9]. Deterministic annealing assignment uses the softassign within an annealing schedule. (Here, we gloss over the technical issue of propagating the solution at a given temperature $\beta_n$ to be the initial condition at the next temperature $\beta_{n+1}$.) When there are no ties, this algorithm finds the optimal permutation for any reasonable annealing schedule.

2.3 Related problems

Having specified the two archetypal problems, the winner-take-all and assignment, we turn to other optimization problems which frequently arise in computer vision, pattern recognition and neural networks.

2.3.1 Clustering and labeling

Clustering is a very old problem in pattern recognition [1, 36]. In its simplest form, the problem is to separate a set of $N$ vectors in dimension $d$ into $K$ categories. The precise statement of the problem depends on whether central or pairwise clustering is the goal. In central clustering, prototypes are required; in pairwise clustering, a distance measure between any two patterns is needed [37, 18]. Closely related to pairwise clustering is the labeling problem where a set of compatibility coefficients are given and we are asked to assign one unique label to each pattern vector.
In both cases, we can write down the following general energy function:

    \min_s E_{qwta}(s) = -\frac{1}{2} \sum_{ai,bj} C_{aibj}\, s_{ai} s_{bj}    (9)

s.t. $\sum_a s_{ai} = 1\ \forall i$ and $s_{ai} \in \{0,1\}\ \forall a, i$. (This energy function is a simplification of the pairwise clustering objective function used in [37, 18], but it serves our purpose here.) If the set of compatibility coefficients $C_{aibj}$ is positive definite in the subspace of the one-way WTA constraint, the local minima are WTAs with binary entries. We call this the quadratic WTA (QWTA) problem, emphasizing the quadratic objective with a one-way WTA constraint.

For the first time, we have gone beyond objective functions that are linear in the binary variables $s$ to objective functions quadratic in $s$. This transition is very important and entirely orthogonal to the earlier transition from the WTA constraint to the permutation constraint. Quadratic objectives with binary variables obeying simplex-like constraints are usually much more difficult to minimize than their linear objective counterparts. Notwithstanding the increased difficulty of this problem, a deterministic annealing algorithm which is fairly adept at avoiding poor local minima is:

    q_{ai} \stackrel{\mathrm{def}}{=} -\frac{\partial E_{qwta}(v)}{\partial v_{ai}} = \sum_{bj} C_{aibj}\, v_{bj}    (10)

    v_{ai} = \frac{\exp(\beta q_{ai})}{\sum_b \exp(\beta q_{bi})}.    (11)

The intermediate $q$ variables have an increased significance in our later discussion on relaxation labeling. The algorithm consists of iterating the above equations at each temperature. Central and pairwise clustering energy functions have been used in image classification and segmentation or labeling problems in general [18].

2.3.2 Matching

Template matching is also one of the oldest problems in vision and pattern recognition. Consequently, the subfield of image matching has become increasingly variegated over the years. In our discussion, we restrict ourselves to feature matching. Akin to labeling or clustering, there are two different styles of matching depending on whether a spatial mapping exists between the features in one image and the other. When a spatial mapping exists (or is explicitly modeled), it acts as a strong constraint on the matching [24]. The situation when no spatial mapping is known between the features is similar to the pairwise clustering case. Instead, a distance measure between pairs of features in the model and pairs of features in the image is assumed.
This results in the quadratic assignment objective function; for more details see [22]:

    \min_s E_{qap}(s) = -\frac{1}{2} \sum_{ai,bj} C_{aibj}\, s_{ai} s_{bj}    (12)

s.t. $\sum_a s_{ai} = 1$, $\sum_i s_{ai} = 1$ and $s_{ai} \in \{0,1\}\ \forall a, i$. If the quadratic benefit matrix $C_{aibj}$ is positive definite in the subspace spanned by the row and column constraints, the minima are permutation matrices. This result was shown in [2]. Once again, a deterministic annealing free energy and algorithm can be written down after spotting the basic form (linear or quadratic objective, one-way or two-way constraint):

    q_{ai} \stackrel{\mathrm{def}}{=} -\frac{\partial E_{qap}(v)}{\partial v_{ai}} = \sum_{bj} C_{aibj}\, v_{bj}    (13)

    v_{ai} = \exp\big( \beta (q_{ai} - \mu_a - \nu_i) \big).    (14)

The two Lagrange parameters $\mu$ and $\nu$ are specified by Sinkhorn's theorem and the softassign. These two equations (one for the $q$ and one for the $v$) are iterated until convergence at each temperature. The softassign quadratic assignment algorithm is guaranteed to converge to a local minimum provided the Sinkhorn procedure always returns a doubly stochastic matrix [19].

We have written down deterministic annealing algorithms for two problems (QWTA and QAP) while drawing on the basic forms given by the WTA and linear assignment problems. The common features in the two deterministic annealing algorithms and their differences (one-way versus two-way constraints) [17] have been highlighted as well. We now turn to relaxation labeling.

3 Relaxation labeling

Relaxation labeling, as the name suggests, began as a method for solving labeling problems [11]. While the framework has been extended to many applications [38, 39, 40, 41, 16, 15], the basic feature of the framework remains: Start with a set of nodes $i$ (in feature or image space) and a set of labels $\lambda$. Derive a set of compatibility coefficients $r_{ij}(\lambda, \lambda')$ (as in Section 2.3.1) for each problem of interest and then apply the basic recipe of relaxation labeling for updating the node-label ($i$ to $\lambda$) assignments:

    q_i^{(n)}(\lambda) = \sum_j \sum_{\lambda'} r_{ij}(\lambda, \lambda')\, p_j^{(n)}(\lambda')    (15)

    p_i^{(n+1)}(\lambda) = \frac{p_i^{(n)}(\lambda) \big( 1 + \alpha q_i^{(n)}(\lambda) \big)}{\sum_{\lambda'} p_i^{(n)}(\lambda') \big( 1 + \alpha q_i^{(n)}(\lambda') \big)}.    (16)

Here the $p$'s are the node-label ($i$ to $\lambda$) label variables, and the $q$'s are intermediate variables similar to the $q$'s defined earlier in deterministic annealing. $\alpha$ is a parameter greater than zero used to make the numerator positive (and keep the probabilities positive). The update equation is typically written in the form of a discrete dynamical system. In particular, note the multiplicative update and the normalization step involved in the transition from step $n$ to step $(n+1)$. We have deliberately written the relaxation labeling update equation in a quasi-canonical form while suggesting (at this point) similarities most notably to the pairwise clustering discrete time update equation.
To make the semantic connection to deterministic annealing more obvious, we now switch to the old usage of the $v$ variables rather than the $p$'s in relaxation labeling:

    q_{ai}^{(n)} = \sum_{bj} C_{aibj}\, v_{bj}^{(n)}    (17)

    v_{ai}^{(n+1)} = \frac{v_{ai}^{(n)} \big( 1 + \alpha q_{ai}^{(n)} \big)}{\sum_b v_{bi}^{(n)} \big( 1 + \alpha q_{bi}^{(n)} \big)}.    (18)

As in the QAP and QWTA deterministic annealing algorithms, a Lyapunov function exists [42, 43] for relaxation labeling.

We can now proceed in the reverse order from the previous section on deterministic annealing. Having written down the basic recipe for relaxation labeling, specialize to WTA, AP, QWTA and QAP. While the contraction to WTA and QWTA may be obvious, the case of AP and QAP is not so clear. The reason: two-way constraints in AP are not handled by relaxation labeling. We have to invoke something analogous to the Sinkhorn procedure. Also, there is no clear analog to the iterative algorithms obtained at each temperature setting. Instead the label variables directly and multiplicatively depend on their previous state, which is never encountered in deterministic annealing. How do we reconcile this situation so that we can clearly state just where these two algorithms are in accord? The introduction of self annealing promises to answer some of these questions and we now turn to its development.

4 Self annealing

Self annealing has one goal, namely, the elimination of a temperature schedule. As a by-product we show that the resulting algorithm bears a close similarity to both deterministic annealing and
Theselfannealingupdateequationforanyofthe(matchingorlabeling)prob- lemswehavediscussedsofarisderivedbyminimizing[44] (cid:11) ?(cid:24)#%D&(cid:3)(cid:1)(cid:0)7.G(cid:23) (cid:6) #EDF.7I #ED&(cid:3)(cid:2)(cid:0)(cid:2). (cid:4) (cid:0) (19) #%D (cid:3)(cid:2)(cid:0)7. D (cid:0) where (cid:0) isadistancemeasurebetween andan(cid:147)old(cid:148)value . (Theexplanationofthe(cid:147)old(cid:148) ? D valuewillfollowshortly.) When isminimizedw.r.t ,bothtermsin(19)comeintoplay. Indeed, #ED&(cid:3)(cid:2)(cid:0)(cid:2). D thedistancemeasure (cid:0) serves asan(cid:147)inertia(cid:148)term withthed(cid:18)egree of(cid:2)delitybetween and (cid:0) determined by the parameter (cid:4) . For example, when (cid:0) #%D (cid:3)(cid:2)(cid:0)7. is (cid:25) (cid:3) D H(cid:4)(cid:0) (cid:3) (cid:25) , the update equation D (cid:0) obtainedaftertakingderivativesw.r.t. and andsettingtheresultstozerois (cid:14) (cid:0):(cid:1) (cid:23) D (cid:1)(cid:1) (cid:5) (cid:14)(cid:17)(cid:16)(cid:19)(cid:18) (cid:6) #%D:. (cid:5) D (cid:1)(cid:1) (cid:5) (cid:23) (cid:0):(cid:1)-H (cid:4) (cid:24) D/(cid:1) (cid:5)(cid:5) (cid:14) (20) (cid:5)(cid:6)(cid:8)(cid:7)(cid:9)(cid:6)(cid:11)(cid:10)(cid:12)(cid:13)(cid:12)(cid:15)(cid:14)(cid:17)(cid:16) (cid:24) (cid:6) (cid:5) (cid:1) (cid:5) (cid:5) Thisupdateequationreducesto(cid:147)vanilla(cid:148)gradientdescentprovidedweapproximate (cid:18)(cid:20)(cid:19) (cid:6)(cid:13)(cid:21) (cid:5)(cid:6)(cid:11)(cid:7)(cid:22)(cid:6) (cid:10)(cid:12)(cid:23)(cid:12)(cid:24)(cid:14)(cid:25)(cid:16) (cid:6) (cid:5) (cid:18) by (cid:18)(cid:8)(cid:19) (cid:6)(cid:1)(cid:21) (cid:5) (cid:5)(cid:5)(cid:6)(cid:11)(cid:7)(cid:22)(cid:6) (cid:10)(cid:12)(cid:13)(cid:16) . (cid:4) becomes a step-size parameter. However, the distance measure is not re- (cid:18) D stricted tojust quadratic error measures. Especially, when positivity ofthe variablesis desired, #ED&(cid:3)(cid:2)(cid:0)7. aKullback-Leibler(KL)distancemeasurecanbeusedfor (cid:0) . 
In [44], the authors derive many linear on-line prediction algorithms using the KL divergence. Here, we apply the same approach to the QWTA and QAP.

Examine the following QAP objective function using the KL divergence as the distance measure:

    F_{saqap}(v, \sigma) = -\frac{1}{2} \sum_{ai,bj} C_{aibj}\, v_{ai} v_{bj} + \frac{1}{\alpha} \sum_{ai} \Big( v_{ai} \log \frac{v_{ai}}{\sigma_{ai}} - v_{ai} + \sigma_{ai} \Big) + \sum_a \mu_a \Big( \sum_i v_{ai} - 1 \Big) + \sum_i \nu_i \Big( \sum_a v_{ai} - 1 \Big).    (21)

We have used the generalized KL divergence $d(x, y) = \sum_i \big( x_i \log \frac{x_i}{y_i} - x_i + y_i \big)$, which is guaranteed to be greater than or equal to zero without requiring the usual constraints $\sum_i x_i = \sum_i y_i = 1$. This energy function looks very similar to the earlier deterministic annealing energy function (12) for QAP. However, it has no temperature parameter. The parameter $\alpha$ is fixed and positive. Instead of the entropy barrier function, this energy function has a new KL measure between $v$ and a new variable $\sigma$. Without trying to explain the self annealing algorithm in its most complex form (QAP), we specialize immediately to the WTA:

    F_{sawta}(v, \sigma) = -\sum_i T_i v_i + \lambda \Big( \sum_i v_i - 1 \Big) + \frac{1}{\alpha} \sum_i \Big( v_i \log \frac{v_i}{\sigma_i} - v_i + \sigma_i \Big).    (22)

Equation (22) can be alternately minimized w.r.t. $v$
and $\sigma$ (using a closed form solution for the Lagrange parameter $\lambda$), resulting in

    v_i^{(n+1)} = \frac{v_i^{(n)} \exp(\alpha T_i)}{\sum_j v_j^{(n)} \exp(\alpha T_j)}, \quad n \ge 0, \quad \forall i \in \{1,\ldots,N\}.    (23)

The new variable $\sigma$ is identified with $v^{(n)}$ in (23). When an alternating minimization (between $v$ and $\sigma$) is prescribed for $F_{sawta}$, the update equation (23) results. Initial conditions are an important factor. A reasonable choice is $v_i^{(0)} = \frac{1}{N} + \epsilon_i$, $\sigma_i^{(0)} = v_i^{(0)}$, $\forall i \in \{1,\ldots,N\}$, but other initial conditions may work as well. A small random factor $\epsilon_i$ is included in the initial condition specification. To summarize, in the WTA, the new variable $\sigma$ is identified with the "past" value of $v$. We have not yet shown any relationship to deterministic annealing or relaxation labeling.

We now write down the quadratic assignment self annealing algorithm:

Pseudo-code for self annealing QAP

Initialize $v_{ai}$ to $\frac{1}{N} + \epsilon_{ai}$, $\sigma_{ai}$ to $v_{ai}$
Begin A: Do A until integrality condition is met or number of iterations $> I_A$
  Begin B: Do B until all $v_{ai}$ converge or number of iterations $> I_B$
    $q_{ai} \leftarrow \sum_{bj} C_{aibj}\, v_{bj}$
    $v_{ai} \leftarrow \sigma_{ai} \exp(\alpha q_{ai})$
    Begin C: Do C until all $v_{ai}$ converge or number of iterations $> I_C$
      Update $v_{ai}$ by normalizing the rows: $v_{ai} \leftarrow \frac{v_{ai}}{\sum_i v_{ai}}$
      Update $v_{ai}$ by normalizing the columns: $v_{ai} \leftarrow \frac{v_{ai}}{\sum_a v_{ai}}$
    End C
  End B
  $\sigma_{ai} \leftarrow v_{ai}$
End A

This is the full-blown self annealing QAP algorithm, with Sinkhorn's method and the softassign used for the constraints but, more importantly, a built-in delay between the "old" value of $v$, namely $\sigma$, and the current value of $v$. The main update equation used by the algorithm is

    \frac{1}{\alpha} \log v_{ai}^{(n+1)} = \sum_{bj} C_{aibj}\, v_{bj}^{(n)} - \mu_a - \nu_i + \frac{1}{\alpha} \log \sigma_{ai}.    (24)

Convergence of the self annealing quadratic assignment algorithm to a local minimum can be easily shown when we assume that the Sinkhorn procedure always returns a doubly stochastic matrix. Our treatment follows [19]. A discrete-time Lyapunov function for the self annealing quadratic assignment algorithm is (21). (The Lagrange parameter terms can be eliminated since we are restricting $v$ to be doubly stochastic.)
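The pseudo-code above admits a compact runnable sketch. In the version below, loop B is collapsed to a single pass per outer iteration, and the iteration caps, toy benefit tensor and initialization noise are illustrative assumptions, not values from the paper:

```python
import numpy as np

def self_annealing_qap(C, N, alpha=0.5, outer=200, n_sinkhorn=50):
    """Self annealing QAP following the pseudo-code: no temperature
    schedule; the "old" value sigma supplies the annealing implicitly."""
    rng = np.random.default_rng(0)
    V = np.full((N, N), 1.0 / N) + 1e-3 * rng.random((N, N))
    sigma = V.copy()
    for _ in range(outer):                     # loop A
        Q = np.einsum('aibj,bj->ai', C, V)     # q_ai <- sum_bj C_aibj v_bj
        V = sigma * np.exp(alpha * Q)          # v_ai <- sigma_ai exp(alpha q_ai)
        for _ in range(n_sinkhorn):            # loop C: Sinkhorn balancing
            V /= V.sum(axis=1, keepdims=True)  # rows
            V /= V.sum(axis=0, keepdims=True)  # columns
        sigma = V.copy()                       # sigma_ai <- v_ai
    return V

# Toy benefit tensor built from a pair of isomorphic weighted graphs:
# C[a, i, b, j] = A[a, b] * B[i, j], with B a row/column permutation of A.
N = 4
rng = np.random.default_rng(1)
A = rng.random((N, N)); A = (A + A.T) / 2.0
perm = np.array([2, 0, 3, 1])
B = A[np.ix_(perm, perm)]
C = np.einsum('ab,ij->aibj', A, B)
V = self_annealing_qap(C, N)
match = V.argmax(axis=1)   # V approaches a permutation matrix
```

The multiplication by sigma at every outer iteration accumulates the exponentiated benefits, so the matrix sharpens toward a permutation without any explicit temperature schedule.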
The change in energy is written as

$$
\Delta E \;\stackrel{\mathrm{def}}{=}\; E_{\mathrm{sa}}^{(n)} - E_{\mathrm{sa}}^{(n+1)}
= -\frac{1}{2}\sum_{aibj} \hat{C}_{aibj} M_{ai}^{(n)} M_{bj}^{(n)}
+ \frac{1}{\alpha}\sum_{ai} M_{ai}^{(n)} \log\frac{M_{ai}^{(n)}}{M_{ai}^{(n-1)}}
+ \frac{1}{2}\sum_{aibj} \hat{C}_{aibj} M_{ai}^{(n+1)} M_{bj}^{(n+1)}
- \frac{1}{\alpha}\sum_{ai} M_{ai}^{(n+1)} \log\frac{M_{ai}^{(n+1)}}{M_{ai}^{(n)}} \qquad (25)
$$

The Lyapunov energy difference has been simplified using the relations $\sum_a M_{ai} = \sum_i M_{ai} = 1$. Using the update equation for self annealing in (24), the energy difference is rewritten as

$$
\Delta E = \frac{1}{2}\sum_{aibj} \hat{C}_{aibj}\, \Delta M_{ai}\, \Delta M_{bj}
+ \frac{1}{\alpha}\sum_{ai} M_{ai}^{(n+1)} \log\frac{M_{ai}^{(n+1)}}{M_{ai}^{(n)}} \;\ge\; 0 \qquad (26)
$$

where $\Delta M_{ai} \stackrel{\mathrm{def}}{=} M_{ai}^{(n+1)} - M_{ai}^{(n)}$. The first term in (26) is non-negative due to the positive definiteness of $\hat{C}_{aibj}$ in the subspace spanned by the row and column constraints. The second term is non-negative by virtue of being a Kullback-Leibler distance measure. We have shown the convergence to a fixed point of the self annealing QAP algorithm.

5 Self annealing and deterministic annealing

Self annealing and deterministic annealing are closely related. To see this, we return to our favorite example, the winner-take-all (WTA). The self annealing and deterministic annealing WTAs are now brought into accord. Assume uniform rather than random initial conditions for self annealing: $\sigma_i(0) = 1/N,\ i \in \{1, \ldots, N\}$. With uniform initial conditions, it is trivial to solve for $\sigma_i(n)$:

$$
\sigma_i(n) = \frac{\exp(n \alpha T_i)}{\sum_{j=1}^{N} \exp(n \alpha T_j)}, \qquad i \in \{1, \ldots, N\} \qquad (27)
$$
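The closed form above is easy to check numerically: iterating the self annealing WTA from uniform initial conditions reproduces, at step n, a softmax at inverse temperature nα. A small sketch (the benefit values T and the parameter α are arbitrary example choices):

```python
import numpy as np

T = np.array([0.3, 1.2, 0.7])   # WTA benefits (arbitrary example values)
alpha, n_steps = 0.8, 6
N = T.size

# self annealing WTA: uniform init, multiplicative update plus normalization
sigma = np.full(N, 1.0 / N)
for _ in range(n_steps):
    sigma = sigma * np.exp(alpha * T)
    sigma = sigma / sigma.sum()

# closed form: softmax at the emergent inverse temperature n*alpha
closed = np.exp(n_steps * alpha * T)
closed = closed / closed.sum()

print(np.allclose(sigma, closed))   # the two trajectories coincide
```

Each multiplicative update adds one factor of exp(αT_i) per component before normalization, so after n steps the iterate is proportional to exp(nαT_i), which is exactly the deterministic annealing WTA at β = nα.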
The correspondence between self annealing and deterministic annealing is clearly established by setting $\beta_n = n\alpha,\ n = 1, 2, 3, \ldots$ We have shown that the self annealing WTA corresponds to a particular linear schedule for the deterministic annealing WTA.

Since the case of AP is more involved than the WTA, we present anecdotal experimental evidence that self annealing and deterministic annealing are closely related. In Figure 1, we have shown the evolution of the permutation norm $\sum_{ai} M_{ai}(1 - M_{ai})$ and the AP free energies. A linear schedule is used for the inverse temperature $\beta$, with the initial inverse temperature $\beta_0$ and the linear increment $\Delta\beta$ both set to $\alpha$. The correspondence between DA and SA is nearly exact for the permutation norm, despite the fact that the free energies evolve in a different manner. The correspondence is exact only when we match the DA linear schedule parameter $\Delta\beta$ to the self annealing parameter $\alpha$. It is important that SA and DA be in lockstep; otherwise we cannot make the claim that SA corresponds to DA with an emergent linear schedule.

The self annealing and deterministic annealing QAP objective functions are quite general. The QAP benefit matrix $\hat{C}_{aibj}$ is preset based on the chosen problem: inexact weighted graph matching or pairwise clustering. The deterministic annealing pseudo-code follows (we have already written down the self annealing pseudo-code in the previous section):
Description: