Speeding up sum-of-squares for tensor decomposition and planted sparse vectors

Samuel B. Hopkins*   Tselil Schramm†   Jonathan Shi‡   David Steurer§

January 17, 2016

Abstract

We consider two problems that arise in machine learning applications: the problem of recovering a planted sparse vector in a random linear subspace and the problem of decomposing a random low-rank overcomplete 3-tensor. For both problems, the best known guarantees are based on the sum-of-squares method. We develop new algorithms inspired by analyses of the sum-of-squares method. Our algorithms achieve the same or similar guarantees as sum-of-squares for these problems, but the running time is significantly faster.

For the planted sparse vector problem, we give an algorithm with running time nearly linear in the input size that approximately recovers a planted sparse vector with up to constant relative sparsity in a random subspace of ℝ^n of dimension up to Ω̃(√n). These recovery guarantees match the best known ones of Barak, Kelner, and Steurer (STOC 2014) up to logarithmic factors. For tensor decomposition, we give an algorithm with running time close to linear in the input size (with exponent ≈ 1.086) that approximately recovers a component of a random 3-tensor over ℝ^n of rank up to Ω̃(n^{4/3}). The best previous algorithm for this problem due to Ge and Ma (RANDOM 2015) works up to rank Ω̃(n^{3/2}) but requires quasipolynomial time.

* Cornell University, [email protected]. Supported by an NSF Graduate Research Fellowship (NSF award no. 1144153) and by David Steurer's NSF CAREER award.
† UC Berkeley, [email protected]. Supported by an NSF Graduate Research Fellowship (NSF award no. 1106400).
‡ Cornell University, [email protected]. Supported by David Steurer's NSF CAREER award.
§ Cornell University, [email protected]. Supported by a Microsoft Research Fellowship, an Alfred P. Sloan Fellowship, an NSF CAREER award, and the Simons Collaboration for Algorithms and Geometry.

Contents

1 Introduction
  1.1 Planted sparse vector in random linear subspace
  1.2 Overcomplete tensor decomposition
  1.3 Tensor principal component analysis
  1.4 Related work
2 Techniques
  2.1 Planted sparse vector in random linear subspace
  2.2 Overcomplete tensor decomposition
  2.3 Tensor principal component analysis
3 Preliminaries
4 Planted sparse vector in random linear subspace
  4.1 Algorithm succeeds on good basis
  4.2 Closeness of input basis and good basis
5 Overcomplete tensor decomposition
  5.1 Proof of Theorem 5.3
  5.2 Spectral gap for diagonal terms: proof of Proposition 5.4
  5.3 Bound for cross terms: proof of Proposition 5.5
  5.4 Full algorithm and proof of Theorem 5.2
6 Tensor principal component analysis
  6.1 Spiked tensor model
  6.2 Linear-time algorithm
References
A Additional preliminaries
  A.1 Linear algebra
  A.2 Concentration tools
B Concentration bounds for planted sparse vector in random linear subspace
C Concentration bounds for overcomplete tensor decomposition
D Concentration bounds for tensor principal component analysis

1 Introduction

The sum-of-squares (SoS) method (also known as the Lasserre hierarchy) [Sho87, Par00, Nes00, Las01] is a powerful, semidefinite-programming-based meta-algorithm that applies to a wide range of optimization problems. The method has been studied extensively for moderate-size polynomial optimization problems that arise, for example, in control theory and in the context of approximation algorithms for combinatorial optimization problems, especially constraint satisfaction and graph partitioning (see e.g. the survey [BS14]). For the latter, the SoS method captures and generalizes the best known approximation algorithms based on linear programming (LP), semidefinite programming (SDP), or spectral methods, and it is in many cases the most promising approach to obtain algorithms with better guarantees—especially in the context of Khot's Unique Games Conjecture [BBH+12].

A sequence of recent works applies the sum-of-squares method to basic problems that arise in unsupervised machine learning: in particular, recovering sparse vectors in linear subspaces and decomposing tensors in a robust way [BKS14, BKS15, HSS15, BM15, GM15]. For a wide range of parameters of these problems, SoS achieves significantly stronger guarantees than other methods, in polynomial or quasi-polynomial time.

Like other LP and SDP hierarchies, the sum-of-squares method comes with a degree parameter d ∈ ℕ that allows for trading off running time and solution quality. This trade-off is appealing because for applications the additional utility of better solutions could vastly outweigh the additional computational costs. Unfortunately, the computational cost grows rather steeply in terms of the parameter d: the running time is n^{O(d)}, where n is the number of variables (usually comparable to the instance size). Further, even when the SDP has size polynomial in the input (when d = O(1)), solving the underlying semidefinite programs is prohibitively slow for large instances.

In this work, we introduce spectral algorithms for planted sparse vector, tensor decomposition, and tensor principal component analysis (PCA) that exploit the same high-degree information as the corresponding sum-of-squares algorithms without relying on semidefinite programming, and achieve the same (or close to the same) guarantees. The resulting algorithms are quite simple (a couple of lines of Matlab code) and have considerably faster running times—quasi-linear or close to linear in the input size.

A surprising implication of our work is that for some problems, spectral algorithms can exploit information from larger values of the parameter d without spending time n^{O(d)}. For example, one of our algorithms runs in nearly-linear time in the input size, even though it uses properties that the sum-of-squares method can only use for degree parameter d ≥ 4. (In particular, the guarantees that the algorithm achieves are strictly stronger than the guarantees that SoS achieves for values of d < 4.)

The initial successes of SoS in the machine learning setting gave hope that techniques developed in the theory of approximation algorithms, specifically the techniques of hierarchies of convex relaxations and rounding convex relaxations, could broadly impact the practice of machine learning. This hope was dampened by the fact that, in general, algorithms that rely on solving large semidefinite programs are too slow to be practical for the large-scale problems that arise in machine learning. Our work brings this hope back into focus by demonstrating for the first time that, with some care, SoS algorithms can be made practical for large-scale problems.
In the following subsections we describe each of the problems that we consider, the prior best-known guarantee via the SoS hierarchy, and our results.

1.1 Planted sparse vector in random linear subspace

The problem of finding a sparse vector planted in a random linear subspace was introduced by Spielman, Wang, and Wright as a way of learning sparse dictionaries [SWW12]. Subsequent works have found further applications and begun studying the problem in its own right [DH14, BKS14, QSW14]. In this problem, we are given a basis for a d-dimensional linear subspace of ℝ^n that is random except for one planted sparse direction, and the goal is to recover this sparse direction. The computational challenge is to solve this problem even when the planted vector is only mildly sparse (a constant fraction of non-zero coordinates) and the subspace dimension is large compared to the ambient dimension (d ≥ n^{Ω(1)}).

Several kinds of algorithms have been proposed for this problem based on linear programming (LP), basic semidefinite programming (SDP), sum-of-squares, and non-convex gradient descent (alternating directions method).

An inherent limitation of simpler convex methods (LP and basic SDP) [SWW12, dGJL04] is that they require the relative sparsity of the planted vector to be polynomial in the subspace dimension (less than n/√d non-zero coordinates).

Sum-of-squares and non-convex methods do not share this limitation. They can recover planted vectors with constant relative sparsity even if the subspace has polynomial dimension (up to dimension O(n^{1/2}) for sum-of-squares [BKS14] and up to O(n^{1/4}) for non-convex methods [QSW14]).

We state the problem formally:

Problem 1.1 (Planted sparse vector problem with ambient dimension n ∈ ℕ, subspace dimension d ≤ n, sparsity ε > 0, and accuracy η > 0). Given an arbitrary orthogonal basis of a subspace spanned by vectors v_0, v_1, ..., v_{d−1} ∈ ℝ^n, where v_0 is a vector with at most εn non-zero entries and v_1, ..., v_{d−1} are vectors sampled independently at random from the standard Gaussian distribution on ℝ^n, output a unit vector v ∈ ℝ^n that has correlation ⟨v, v_0⟩^2 ≥ 1 − η with the sparse vector v_0.

Our results. Our algorithm runs in nearly linear time in the input size, and matches the best-known guarantees up to a polylogarithmic factor in the subspace dimension [BKS14].

Theorem 1.2 (Planted sparse vector in nearly-linear time). There exists an algorithm that, for every sparsity ε > 0, ambient dimension n, and subspace dimension d with d ≤ √n/(log n)^{O(1)}, solves the planted sparse vector problem with high probability for some accuracy η ≤ O(ε^{1/4}) + o_{n→∞}(1). The running time of the algorithm is Õ(nd).

We give a technical overview of the proof in Section 2, and a full proof in Section 4.

Previous work also showed how to recover the planted sparse vector exactly. The task of going from an approximate solution to an exact one is a special case of standard compressed sensing (see e.g. [BKS14]).

Table 1: Comparison of algorithms for the planted sparse vector problem with ambient dimension n, subspace dimension d, and relative sparsity ε.

  Reference                      | Technique                | Runtime     | Largest d  | Largest ε
  Demanet, Hand [DH14]           | linear programming       | poly        | any        | Ω(1/√d)
  Barak, Kelner, Steurer [BKS14] | SoS, general SDP         | poly        | Ω(√n)      | Ω(1)
  Qu, Sun, Wright [QSW14]        | alternating minimization | Õ(n^2 d^5)  | Ω(n^{1/4}) | Ω(1)
  this work                      | SoS, partial traces      | Õ(nd)       | Ω̃(√n)      | Ω(1)
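To make the input format of Problem 1.1 concrete, the following is a minimal NumPy sketch of sampling an instance and evaluating the success criterion. It is our illustration rather than code from the paper; the function names and the parameter choices at the bottom are ours.

```python
import numpy as np

def planted_sparse_vector_instance(n, d, eps, seed=0):
    """Sample an instance of Problem 1.1: an orthogonal basis of a subspace
    spanned by one (eps*n)-sparse unit vector and d-1 Gaussian vectors."""
    rng = np.random.default_rng(seed)
    # Planted sparse direction v_0 with at most eps*n non-zero entries.
    v0 = np.zeros(n)
    support = rng.choice(n, size=max(1, int(eps * n)), replace=False)
    v0[support] = rng.standard_normal(support.size)
    v0 /= np.linalg.norm(v0)
    # Remaining directions sampled from the standard Gaussian distribution on R^n.
    V = np.column_stack([v0, rng.standard_normal((n, d - 1))])
    Q, _ = np.linalg.qr(V)                 # orthonormal basis of the span
    # Rotate by a random orthogonal matrix so the basis does not reveal v_0 directly.
    O, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Q @ O, v0

def success(v, v0):
    # Recovery criterion of Problem 1.1: <v, v_0>^2 >= 1 - eta.
    return np.dot(v, v0) ** 2

# Example: a 30-dimensional subspace of R^4096 with relative sparsity 0.05.
A, v0 = planted_sparse_vector_instance(n=4096, d=30, eps=0.05)
```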
1.2 Overcomplete tensor decomposition

Tensors naturally represent multilinear relationships in data. Algorithms for tensor decompositions have long been studied as a tool for data analysis across a wide range of disciplines (see the early work of Harshman [Har70] and the survey [KB09]). While the problem is NP-hard in the worst case [Hås90, HL13], algorithms for special cases of tensor decomposition have recently led to new provable algorithmic results for several unsupervised learning problems [AGH+14, BCMV14, GVX14, AGHK14], including independent component analysis, learning mixtures of Gaussians [GHK15], Latent Dirichlet topic modeling [AFH+15], and dictionary learning [BKS15]. Some previous learning algorithms can also be reinterpreted in terms of tensor decomposition [Cha96, MR06, NR09].

A key algorithmic challenge for tensor decompositions is overcompleteness, when the number of components is larger than their dimension (i.e., the components are linearly dependent). Most algorithms that work in this regime require tensors of order 4 or higher [LCC07, BCMV14]. For example, the FOOBI algorithm of [LCC07] can recover up to Ω(d^2) components given an order-4 tensor in dimension d under mild algebraic independence assumptions for the components—satisfied with high probability by random components. For overcomplete 3-tensors, which arise in many applications of tensor decompositions, such a result remains elusive.

Researchers have therefore turned to investigate average-case versions of the problem, where the components of the overcomplete 3-tensor are random: given a 3-tensor T ∈ ℝ^{d^3} of the form

  T = Σ_{i=1}^{n} a_i ⊗ a_i ⊗ a_i,

where a_1, ..., a_n are random unit or Gaussian vectors, the goal is to approximately recover the components a_1, ..., a_n.

Algorithms based on tensor power iteration—a gradient-descent approach for tensor decomposition—solve this problem in polynomial time when n ≤ C·d for any constant C ≥ 1 (the running time is exponential in C) [AGJ15]. Tensor power iteration also admits local convergence analyses for up to n ≤ Ω̃(d^{1.5}) components [AGJ15, AGJ14]. Unfortunately, these analyses do not give polynomial-time algorithms because it is not known how to efficiently obtain the kind of initializations assumed by the analyses.

Recently, Ge and Ma [GM15] were able to show that a tensor-decomposition algorithm [BKS15] based on sum-of-squares solves the above problem for n ≤ Ω̃(d^{1.5}) in quasi-polynomial time n^{O(log n)}. The key ingredient of their elegant analysis is a subtle spectral concentration bound for a particular degree-4 matrix-valued polynomial associated with the decomposition problem of random overcomplete 3-tensors.

Table 2: Comparison of decomposition algorithms for overcomplete 3-tensors with n components in dimension d.

  Reference                   | Technique              | Runtime      | Largest n  | Components
  Anandkumar et al. [AGJ15]^a | tensor power iteration | poly         | C·d        | incoherent
  Ge, Ma [GM15]               | SoS, general SDP       | n^{O(log n)} | Ω̃(d^{3/2}) | N(0, (1/d)·Id_d)
  this work^b                 | SoS, partial traces    | Õ(n·d^{1+ω}) | Ω̃(d^{4/3}) | N(0, (1/d)·Id_d)

^a The analysis shows that for every constant C ≥ 1, the running time is polynomial for n ≤ C·d components. The analysis assumes that the components also satisfy other random-like properties besides incoherence.
^b Here, ω ≤ 2.373 is the constant so that d×d matrices can be multiplied in O(d^ω) arithmetic operations.

We state the problem formally:

Problem 1.3 (Random tensor decomposition with dimension d, rank n, and accuracy η). Let a_1, ..., a_n ∈ ℝ^d be independently sampled vectors from the Gaussian distribution N(0, (1/d)·Id_d), and let T ∈ (ℝ^d)^{⊗3} be the 3-tensor T = Σ_{i=1}^{n} a_i^{⊗3}.

Single component: Given T sampled as above, find a unit vector b that has correlation max_i ⟨a_i, b⟩ ≥ 1 − η with one of the vectors a_i.

All components: Given T sampled as above, find a set of unit vectors {b_1, ..., b_n} such that ⟨a_i, b_i⟩ ≥ 1 − η for every i ∈ [n].
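As a point of reference, here is a minimal NumPy sketch (our illustration, not the paper's code) of sampling the input of Problem 1.3 and evaluating the single-component success criterion; note that materializing T explicitly takes d^3 memory.

```python
import numpy as np

def random_tensor_instance(d, n, seed=0):
    """Sample Problem 1.3: components a_1,...,a_n ~ N(0, Id_d/d) and the
    3-tensor T = sum_i a_i (x) a_i (x) a_i, stored as a d x d x d array."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d)) / np.sqrt(d)   # rows are the components a_i
    T = np.einsum('ia,ib,ic->abc', A, A, A)        # sum of the rank-one cubes
    return T, A

def single_component_accuracy(b, A):
    # Single-component criterion: max_i <a_i, b> >= 1 - eta.
    return np.max(A @ b)

T, A = random_tensor_instance(d=50, n=100)
```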
Our results. We give the first polynomial-time algorithm for decomposing random overcomplete 3-tensors with up to ω(d) components. Our algorithm works as long as the number of components satisfies n ≤ Ω̃(d^{4/3}), which comes close to the bound Ω̃(d^{1.5}) achieved by the aforementioned quasi-polynomial algorithm of Ge and Ma. For the single-component version of the problem, our algorithm runs in time close to linear in the input size.

Theorem 1.4 (Fast random tensor decomposition). There exist randomized algorithms that, for every dimension d and rank n with d ≤ n ≤ d^{4/3}/(log n)^{O(1)}, solve the random tensor decomposition problem with probability 1 − o(1) for some accuracy η ≤ Õ(n^3/d^4)^{1/2}. The running time for the single-component version of the problem is Õ(min{d^{1+ω}, d^{3.257}}), where d^ω is the time to multiply two d-by-d matrices. The running time for the all-components version of the problem is Õ(n · min{d^{1+ω}, d^{3.257}}).

We give a technical overview of the proof in Section 2, and a full proof in Section 5.

We remark that the above algorithm only requires access to the input tensor with some fixed inverse polynomial accuracy because each of its four steps amplifies errors by at most a polynomial factor (see Algorithm 5.17). In this sense, the algorithm is robust.

1.3 Tensor principal component analysis

The problem of tensor principal component analysis is similar to the tensor decomposition problem. However, here the focus is not on the number of components in the tensor, but on recovery in the presence of a large amount of random noise. We are given as input a tensor τ·v^{⊗3} + A, where v ∈ ℝ^n is a unit vector and the entries of A are chosen iid from N(0, 1). This spiked tensor model was introduced by Montanari and Richard [RM14], who also obtained the first algorithms to solve the model with provable statistical guarantees. The spiked tensor model was subsequently addressed by a subset of the present authors [HSS15], who applied the SoS approach to improve the signal-to-noise ratio required for recovery from odd-order tensors.

Table 3: Comparison of algorithms for principal component analysis of 3-tensors in dimension d and with signal-to-noise ratio τ.

  Reference                     | Technique           | Runtime | Smallest τ
  Montanari, Richard [RM14]     | spectral            | Õ(d^3)  | n
  Hopkins, Shi, Steurer [HSS15] | SoS, spectral       | Õ(d^3)  | Ω(n^{3/4})
  this work                     | SoS, partial traces | O(d^3)  | Ω̃(n^{3/4})

We state the problem formally:

Problem 1.5 (Tensor principal component analysis with signal-to-noise ratio τ). Let T ∈ (ℝ^d)^{⊗3} be a tensor so that T = τ·v^{⊗3} + A, where A is a tensor with independent standard Gaussian entries and v ∈ ℝ^d is a unit vector. Given T, recover a unit vector v′ ∈ ℝ^d such that ⟨v′, v⟩ ≥ 1 − η.

Our results. For this problem, our improvements over the previous results are more modest—we achieve signal-to-noise guarantees matching [HSS15], but with an algorithm that runs in linear time rather than near-linear time (time O(d^3) rather than O(d^3 polylog d), for an input of size d^3).

Theorem 1.6 (Tensor principal component analysis in linear time). There is an algorithm which solves the tensor principal component analysis problem when τ ≥ n^{3/4} log^{1/2} n / ε, with η ≤ O(ε). Furthermore, the algorithm runs in time O(d^3).

Though for tensor PCA our improvement over previous work is modest, we include the results here as this problem is a pedagogically poignant illustration of our techniques. We give a technical overview of the proof in Section 2, and a full proof in Section 6.
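For concreteness, the following is a minimal NumPy sketch (ours, not from the paper) of sampling the spiked tensor model of Problem 1.5; the dimension and the choice of τ in the example are arbitrary.

```python
import numpy as np

def spiked_tensor_instance(d, tau, seed=0):
    """Sample T = tau * v^{(x)3} + A with v a random unit vector in R^d
    and A a d x d x d tensor of independent standard Gaussian entries."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    A = rng.standard_normal((d, d, d))
    T = tau * np.einsum('a,b,c->abc', v, v, v) + A
    return T, v

# Example: dimension 50 with signal-to-noise ratio around d^{3/4}.
T, v = spiked_tensor_instance(d=50, tau=50 ** 0.75)
```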
1.4 Related work

Foremost, this work builds upon the SoS algorithms of [BKS14, BKS15, GM15, HSS15]. In each of these previous works, a machine learning decision problem is solved using an SDP relaxation for SoS. In these works, the SDP value is large in the yes case and small in the no case, and the SDP value can be bounded using the spectrum of a specific matrix. This was implicit in [BKS14, BKS15], and in [HSS15] it was used to obtain a fast algorithm as well. In our work, we design spectral algorithms which use smaller matrices, inspired by the SoS certificates in previous works, to solve these machine-learning problems much faster, with almost matching guarantees.

A key idea in our work is that given a large matrix with information encoded in the matrix's spectral gap, one can often efficiently "compress" the matrix to a much smaller one without losing that information. This is particularly true for problems with planted solutions. In this way, we are able to improve running time by replacing an n^{O(d)}-sized SDP with an eigenvector computation for an n^k × n^k matrix, for some k < d.

The idea of speeding up LP and SDP hierarchies for specific problems has been investigated in a series of previous works [dlVK07, BRS11, GS12], which show that with respect to local analyses of the sum-of-squares algorithm it is sometimes possible to improve the running time from n^{O(d)} to 2^{O(d)}·n^{O(1)}. However, the scopes and strategies of these works are completely different from ours. First, the notion of local analysis from these works does not apply to the problems considered here. Second, these works employ the ellipsoid method with a separation oracle inspired by rounding algorithms, whereas we reduce the problem to an ordinary eigenvector computation.

It would also be interesting to see if our methods can be used to speed up some of the other recent successful applications of SoS to machine-learning-type problems, such as [BM15], or the application of [BKS14] to tensor decomposition with components that are well-separated (rather than random). Finally, we would be remiss not to mention that SoS lower bounds exist for several of these problems, specifically for tensor principal component analysis, tensor prediction, and sparse PCA [HSS15, BM15, MW15]. The lower bounds in the SoS framework are a good indication that we cannot expect spectral algorithms to achieve better guarantees.

2 Techniques

Sum-of-squares method (for polynomial optimization over the sphere). The problems we consider are connected to optimization problems of the following form: given a homogeneous n-variate real polynomial f of constant degree, find a unit vector x ∈ ℝ^n so as to maximize f(x). The sum-of-squares method allows us to efficiently compute upper bounds on the maximum value of such a polynomial f over the unit sphere.

For the case that k = deg(f) is even, the most basic upper bound of this kind is the largest eigenvalue of a matrix representation of f. A matrix representation of a polynomial f is a symmetric matrix M with rows and columns indexed by monomials of degree k/2 so that f(x) can be written as the quadratic form f(x) = ⟨x^{⊗k/2}, M x^{⊗k/2}⟩, where x^{⊗k/2} is the k/2-fold tensor power of x. The largest eigenvalue of a matrix representation M is an upper bound on the value of f(x) over unit x ∈ ℝ^n because

  f(x) = ⟨x^{⊗k/2}, M x^{⊗k/2}⟩ ≤ λ_max(M)·‖x^{⊗k/2}‖_2^2 = λ_max(M).

The sum-of-squares method improves on this basic spectral bound systematically by associating a large family of polynomials (potentially of degree higher than deg(f)) with the input polynomial f and computing the best possible spectral bound within this family of polynomials. Concretely, the sum-of-squares method with degree parameter d applied to a polynomial f with deg(f) ≤ d considers the affine subspace of polynomials {f + (1 − ‖x‖_2^2)·g | deg(g) ≤ d − 2} ⊆ ℝ[x] and minimizes λ_max(M) among all matrix representations¹ M of polynomials in this space.² The problem of searching through this affine linear space of polynomials and their matrix representations and finding the one of smallest maximum eigenvalue can be solved using semidefinite programming.

¹ Earlier we defined matrix representations only for homogeneous polynomials of even degree. In general, a matrix representation of a polynomial g is a symmetric matrix M with rows and columns indexed by monomials of degree at most ℓ = deg(g)/2 such that g(x) = ⟨x^{⊗≤ℓ}, M x^{⊗≤ℓ}⟩ (as a polynomial identity), where x^{⊗≤ℓ} = (1/√(ℓ+1))·(x^{⊗0}, x^{⊗1}, ..., x^{⊗ℓ}) is the vector of all monomials of degree at most ℓ. Note that ‖x^{⊗≤ℓ}‖ = 1 for all x with ‖x‖ = 1.

² The name of the method stems from the fact that this last step is equivalent to finding the minimum number λ such that the space contains a polynomial of the form λ − (g_1^2 + ··· + g_t^2), where g_1, ..., g_t are polynomials of degree at most d/2.
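The basic spectral bound above is easy to check numerically. The following is a small NumPy sketch of our own (not from the paper) for the degree-4 polynomial f(x) = Σ_i ⟨a_i, x⟩^4 with the matrix representation M = Σ_i (a_i a_i^T)^{⊗2}; the dimensions and the random choice of the a_i are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 20
a = rng.standard_normal((m, d))             # coefficient vectors a_1, ..., a_m

# Matrix representation of f(x) = sum_i <a_i, x>^4:
# M = sum_i (a_i a_i^T) (x) (a_i a_i^T), so f(x) = <x^{(x)2}, M x^{(x)2}>.
M = sum(np.kron(np.outer(ai, ai), np.outer(ai, ai)) for ai in a)

x = rng.standard_normal(d)
x /= np.linalg.norm(x)                      # a unit vector
xx = np.kron(x, x)                          # x^{(x)2} as a vector in R^{d^2}

f_direct = np.sum((a @ x) ** 4)             # f(x) evaluated directly
f_quadratic = xx @ M @ xx                   # the quadratic form in M
lam_max = np.linalg.eigvalsh(M)[-1]         # largest eigenvalue of M

assert np.isclose(f_direct, f_quadratic)    # matrix representation identity
assert f_direct <= lam_max + 1e-9           # basic spectral upper bound
```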
Our approach for faster algorithms based on SoS algorithms is to construct specific matrices (polynomials) in this affine linear space, then compute their top eigenvectors. By designing our matrices carefully, we ensure that our algorithms have access to the same higher-degree information that the sum-of-squares algorithm can access, and this information affords an advantage over the basic spectral methods for these problems. At the same time, our algorithms avoid searching for the best polynomial and matrix representation, which gives us faster running times since we avoid semidefinite programming. This approach is well suited to average-case problems where we avoid the problem of adversarial choice of input; in particular, it is applicable to machine learning problems where noise and inputs are assumed to be random.

Compressing matrices with partial traces. A serious limitation of the above approach is that the representation of a degree-d polynomial requires size roughly n^d. Hence, even avoiding the use of semidefinite programming, improving upon running time O(n^d) requires additional ideas.

In each of the problems that we consider, we have a large matrix (suggested by an SoS algorithm) with a "signal" planted in some amount of "noise". We show that in some situations, this large matrix can be compressed significantly without loss in the signal by applying partial trace operations. In these situations, the partial trace yields a smaller matrix with the same signal-to-noise ratio as the large matrix suggested by the SoS algorithm, even in situations when lower-degree sum-of-squares approaches are known to fail.

The partial trace Tr_{ℝ^d} : ℝ^{d^2×d^2} → ℝ^{d×d} is the linear operator that satisfies Tr_{ℝ^d}(A ⊗ B) = (Tr A)·B for all A, B ∈ ℝ^{d×d}. To see how the partial trace can be used to compress large matrices to smaller ones with little loss, consider the following problem: given a matrix M ∈ ℝ^{d^2×d^2} of the form M = τ·(v ⊗ v)(v ⊗ v)^T + A ⊗ B for some unit vector v ∈ ℝ^d and matrices A, B ∈ ℝ^{d×d}, we wish to recover the vector v. (This is a simplified version of the planted problems that we consider in this paper, where τ·(v ⊗ v)(v ⊗ v)^T is the signal and A ⊗ B plays the role of noise.)

It is straightforward to see that the matrix A ⊗ B has spectral norm ‖A ⊗ B‖ = ‖A‖·‖B‖, and so when τ ≫ ‖A‖·‖B‖, the matrix M has a noticeable spectral gap, and the top eigenvector of M will be close to v ⊗ v. If |Tr A| ≈ ‖A‖, the matrix Tr_{ℝ^d} M = τ·vv^T + Tr(A)·B has a matching spectral gap, and we can still recover v, but now we only need to compute the top eigenvector of a d×d (as opposed to d^2×d^2) matrix.³

³ In some of our applications, the matrix M is only represented implicitly and has size superlinear in the size of the input, but nevertheless we can compute the top eigenvector of the partial trace Tr_{ℝ^d} M in nearly-linear time.

If A is a Wigner matrix (e.g. a symmetric matrix with iid ±1 entries), then both Tr(A), ‖A‖ ≈ √d, and the above condition is indeed met. In our average-case/machine-learning settings the "noise" component is not as simple as A ⊗ B with A a Wigner matrix. Nonetheless, we are able to ensure that the noise displays a similar behavior under partial trace operations. In some cases, this requires additional algorithmic steps, such as random projection in the case of tensor decomposition, or centering the matrix eigenvalue distribution in the case of the planted sparse vector.
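The following is a minimal NumPy sketch of the compression step just described, with synthetic Wigner-style noise factors; it is our illustration (the constants and dimensions are arbitrary), not the algorithm from the paper. The top eigenvector of the d×d partial trace recovers v once τ dominates |Tr A|·‖B‖.

```python
import numpy as np

rng = np.random.default_rng(0)
d, tau = 60, 500.0

v = rng.standard_normal(d)
v /= np.linalg.norm(v)
# Wigner-style noise factors: symmetric matrices with iid +/-1 entries.
A = np.triu(rng.choice([-1.0, 1.0], (d, d))); A = A + np.triu(A, 1).T
B = np.triu(rng.choice([-1.0, 1.0], (d, d))); B = B + np.triu(B, 1).T

# The d^2 x d^2 matrix M = tau * (v(x)v)(v(x)v)^T + A (x) B.
vv = np.kron(v, v)
M = tau * np.outer(vv, vv) + np.kron(A, B)

# Partial trace over the first tensor factor, computed from the diagonal blocks:
# Tr_{R^d}(M) = tau * v v^T + Tr(A) * B, a d x d matrix.
partial = sum(M[i * d:(i + 1) * d, i * d:(i + 1) * d] for i in range(d))

# Its top eigenvector is close to +/- v whenever tau >> |Tr A| * ||B||.
eigenvalues, eigenvectors = np.linalg.eigh(partial)
top = eigenvectors[:, -1]
print(abs(np.dot(top, v)))   # close to 1
```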
It is an interesting question whether there are general theorems describing the behavior of spectral norms under partial trace operations. In the current work, we compute the partial traces explicitly and estimate their norms directly. Indeed, our analyses boil down to concentration bounds for special matrix polynomials. A general theory for the concentration of matrix polynomials is a notorious open problem (see [MW13]).

Partial trace operations have previously been applied for rounding SoS relaxations. Specifically, the operation of reweighing and conditioning, used in rounding algorithms for sum-of-squares such as [BRS11, RT12, BKS14, BKS15, LR15], corresponds to applying a partial trace operation to the moments matrix returned by the sum-of-squares relaxation.

We now give a technical overview of our algorithmic approach for each problem, and some broad strokes of the analysis for each case. Our most substantial improvements in runtime are for the planted sparse vector and overcomplete tensor decomposition problems (Section 2.1 and Section 2.2 respectively). Our algorithm for tensor PCA is the simplest application of our techniques, and it may be instructive to skip ahead and read about tensor PCA first (Section 2.3).

2.1 Planted sparse vector in random linear subspace

Recall that in this problem we are given a linear subspace U (represented by some basis) that is spanned by a k-sparse unit vector v_0 ∈ ℝ^n and random unit vectors v_1, ..., v_{d−1} ∈ ℝ^n. The goal is to recover the vector v_0 approximately.

Background and SoS analysis. Let A ∈ ℝ^{n×d} be a matrix whose columns form an orthonormal basis for U. Our starting point is the polynomial f(x) = ‖Ax‖_4^4 = Σ_{i=1}^{n} (Ax)_i^4. Previous work showed that for d ≪ √n the maximizer of this polynomial over the sphere corresponds to a vector close to v_0, and that degree-4 sum-of-squares is able to capture this fact [BBH+12, BKS14]. Indeed, typical random unit vectors v in ℝ^n satisfy ‖v‖_4^4 ≈ 1/n, whereas our planted vector satisfies ‖v_0‖_4^4 ≥ 1/k ≫ 1/n, and this degree-4 information is leveraged by the SoS algorithms.

The polynomial f has a convenient matrix representation M = Σ_{i=1}^{n} (a_i a_i^T)^{⊗2}, where a_1, ..., a_n are the rows of the generator matrix A. It turns out that the eigenvalues of this matrix indeed give information about the planted sparse vector v_0. In particular, the vector x_0 ∈ ℝ^d with Ax_0 = v_0 witnesses that M has an eigenvalue of at least 1/k, because M's quadratic form with the vector x_0^{⊗2} satisfies ⟨x_0^{⊗2}, M x_0^{⊗2}⟩ = ‖v_0‖_4^4 ≥ 1/k. If we let M′ be the corresponding matrix for the subspace U without the planted sparse vector, M′ turns out to have only eigenvalues of at most O(1/n), up to a single spurious eigenvalue with eigenvector far from any vector of the form x ⊗ x [BBH+12].

It follows that in order to distinguish between a random subspace with a planted sparse vector (yes case) and a completely random subspace (no case), it is enough to compute the second-largest eigenvalue of a d^2-by-d^2 matrix (representing the 4-norm polynomial over the subspace as above). This decision version of the problem, while strictly speaking easier than the search version above, is at the heart of the matter: one can show that the large eigenvalue in the yes case corresponds to an eigenvector which encodes the coefficients of the sparse planted vector in the basis.
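A minimal NumPy sketch of this distinguishing computation follows; it is our illustration at toy dimensions (with a dense d^2 × d^2 eigendecomposition rather than the fast method developed in Section 4), and all names and parameter values are ours.

```python
import numpy as np

def degree4_matrix(A):
    """M = sum_i (a_i a_i^T)^{(x)2}, a matrix representation of x -> ||Ax||_4^4,
    where the a_i are the rows of the orthonormal basis matrix A (shape n x d)."""
    d = A.shape[1]
    M = np.zeros((d * d, d * d))
    for ai in A:
        outer = np.outer(ai, ai)
        M += np.kron(outer, outer)
    return M

rng = np.random.default_rng(0)
n, d, k = 4096, 16, 80                      # k = eps*n non-zero coordinates

# Yes case: orthonormal basis of span(v_0, g_1, ..., g_{d-1}) with v_0 k-sparse.
v0 = np.zeros(n)
v0[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
v0 /= np.linalg.norm(v0)
A_yes, _ = np.linalg.qr(np.column_stack([v0, rng.standard_normal((n, d - 1))]))

# No case: orthonormal basis of a completely random d-dimensional subspace.
A_no, _ = np.linalg.qr(rng.standard_normal((n, d)))

for A in (A_yes, A_no):
    eigenvalues = np.linalg.eigvalsh(degree4_matrix(A))
    # Yes case: besides one spurious eigenvalue there is an eigenvalue of at
    # least ||v_0||_4^4 >= 1/k, so the second-largest eigenvalue stands out.
    # No case: all eigenvalues except the single spurious one are O(1/n).
    print("top two eigenvalues:", eigenvalues[-1], eigenvalues[-2])
```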
Improvements. The best running time we can hope for with this basic approach is O(d^4) (the size of the matrix). Since we are interested in d ≤ O(√n), the resulting running time O(nd^2) would be subquadratic but still super-linear in the input size n·d (for representing a d-dimensional subspace of ℝ^n). To speed things up, we use the partial trace approach outlined above. We will begin by applying the partial trace approach naively, obtaining reasonable bounds, and then show that a small modification to the matrix before the partial trace operation allows us to achieve even smaller signal-to-noise ratios.

In the planted case, we may approximate M ≈ (1/k)·(x_0 x_0^T)^{⊗2} + Z, where x_0 is the vector of coefficients of v_0 in the basis representation given by A (so that Ax_0 = v_0), and Z is the noise