A NEW GEOMETRIC APPROACH TO LATENT TOPIC MODELING AND DISCOVERY

Weicong Ding, Mohammad H. Rohban, Prakash Ishwar, Venkatesh Saligrama
Department of Electrical and Computer Engineering, Boston University, Boston, MA, USA

ABSTRACT

A new geometrically-motivated algorithm for nonnegative matrix factorization is developed and applied to the discovery of latent "topics" for text and image "document" corpora. The algorithm is based on robustly finding and clustering extreme points of empirical cross-document word-frequencies that correspond to novel "words" unique to each topic. In contrast to related approaches that are based on solving non-convex optimization problems using suboptimal approximations, locally-optimal methods, or heuristics, the new algorithm is convex, has polynomial complexity, and has competitive qualitative and quantitative performance compared to the current state-of-the-art approaches on synthetic and real-world datasets.

Index Terms: Topic modeling, nonnegative matrix factorization (NMF), extreme points, subspace clustering.

1. INTRODUCTION

Topic modeling is a statistical tool for the automatic discovery and comprehension of latent thematic structure, or topics, assumed to pervade a corpus of documents.

Suppose that we have a corpus of M documents composed of words from a vocabulary of W distinct words indexed by w = 1, ..., W. In the classic "bag of words" modeling paradigm widely used in Probabilistic Latent Semantic Analysis [1] and Latent Dirichlet Allocation (LDA) [2, 3], each document is modeled as being generated by N independent and identically distributed (iid) drawings of words from an unknown W × 1 document word-distribution vector. Each document word-distribution vector is itself modeled as an unknown probabilistic mixture of K < min(M, W) unknown W × 1 latent topic word-distribution vectors that are shared among the M documents in the corpus. The goal of topic modeling then is to estimate the latent topic word-distribution vectors, and possibly the topic mixing weights for each document, from the empirical word-frequency vectors of all documents. Topic modeling has also been applied to various types of data other than text, e.g., images, videos (with photometric and spatio-temporal feature-vectors interpreted as the words), genetic sequences, hyper-spectral images, voice, and music, for signal separation and blind deconvolution.

If β denotes the unknown W × K topic matrix whose K columns are the latent topic word-distribution vectors, and θ denotes the K × M weight matrix whose M columns are the mixing weights over the K topics for the M documents, then each column of the W × M matrix A = βθ corresponds to a document word-distribution vector. Let X denote the observed W × M words-by-documents matrix whose M columns are the empirical word-frequency vectors of the M documents when each document is generated by N iid drawings of words from the corresponding column of the A matrix. Then, given only X and K, the goal is to estimate the topic matrix β and possibly the weight matrix θ. This can be formulated as a nonnegative matrix factorization (NMF) problem [4, 5, 6, 7] where the typical solution strategy is to minimize a cost function of the form

    \|X - \beta\theta\|^2 + \psi(\beta, \theta)    (1)

where the regularization term ψ is introduced to enforce desirable properties in the solution such as uniqueness of the factorization, sparsity, etc. The joint optimization of (1) with respect to (β, θ) is, however, non-convex and necessitates the use of suboptimal strategies such as alternating minimization, greedy gradient descent, local search, approximations, and heuristics. These are also typically sensitive to small sample sizes (words per document) N, especially when N ≪ W, because many words may not be sampled and X may be far from A in Euclidean distance. In LDA, the columns of β and θ are modeled as iid random drawings from Dirichlet prior distributions. The resulting maximum a posteriori probability estimation of (β, θ), however, turns out to be a fairly complex non-convex problem. One then takes recourse to sub-optimal solutions based on variational Bayes approximations of the posterior distribution and other methods based on Gibbs sampling and expectation propagation.
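To make the notation above concrete, the following is a minimal simulation sketch (our own illustrative code, not the authors' implementation) of the generative model just described: given a column-stochastic topic matrix β and weight matrix θ, each document's word distribution is the corresponding column of A = βθ, and X collects the empirical word frequencies from N iid word draws per document. The function name and the use of numpy are our assumptions.

```python
import numpy as np

def simulate_corpus(beta, theta, N, seed=None):
    """Simulate the observed words-by-documents matrix X (illustrative sketch).

    beta  : (W, K) column-stochastic topic matrix
    theta : (K, M) column-stochastic topic-weight matrix
    N     : number of iid word draws per document
    Returns the (W, M) empirical word-frequency matrix X.
    """
    rng = np.random.default_rng(seed)
    A = beta @ theta                           # (W, M); column d is document d's word distribution
    W, M = A.shape
    X = np.zeros((W, M))
    for d in range(M):
        counts = rng.multinomial(N, A[:, d])   # N iid word draws from column d of A
        X[:, d] = counts / N                   # empirical word frequencies
    return X
```

When N ≪ W, most entries of each column of X are zero, which is exactly the small-sample regime discussed above in which X can be far from A.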
In contrast to these approaches, we adopt the nonnegative matrix factorization framework and propose a new geometrically motivated algorithm that has competitive performance compared to the current state of the art and is free of heuristics and approximations.

2. A NEW GEOMETRIC APPROACH

A key ingredient of the new approach is the so-called "separability" assumption introduced in [5] to ensure the uniqueness of nonnegative matrix factorization. Applied to β, this means that each topic contains "novel" words which appear only in that topic, a property that has been found to hold in the estimates of topic matrices produced by several algorithms [8]. More precisely, a W × K topic matrix β is separable if for each k ∈ [1, K] there exists a row of β that has a single nonzero entry which is in the k-th column. Figure 1 shows an example of a separable topic matrix with three topics. Words 1 and 2 are unique (novel) to topic 1, words 3 and 4 to topic 2, and word 5 to topic 3.

Fig. 1. A separable topic matrix and the underlying geometric structure on the probability simplex. Solid circles represent rows of Ã; empty circles represent rows of X̃.

Let C_k be the set of novel words of topic k for k ∈ [1, K] and let C_0 be the remaining words in the vocabulary. Let A_w and θ_k denote the w-th and k-th row-vectors of A and θ respectively. Observe that all the row-vectors of A that correspond to the novel words of the same topic are just differently scaled versions of the same row-vector of θ: for each w ∈ C_k, A_w = β_{wk} θ_k. Thus, if Ã, β̃, and θ̃ denote the row-normalized versions (i.e., unit row sums) of A, β, and θ respectively, then Ã = β̃θ̃ and, for all w ∈ C_k, Ã_w = θ̃_k (e.g., in Fig. 1, Ã_1 = Ã_2 = θ̃_1 and Ã_3 = Ã_4 = θ̃_2), while for all w ∈ C_0, Ã_w lives in the convex hull of the θ̃_k's (in Fig. 1, Ã_6 is in the convex hull of θ̃_1, θ̃_2, θ̃_3).
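The geometric fact stated above, that the row-normalized rows of A indexed by novel words coincide with rows of the row-normalized θ while non-novel rows lie in their convex hull, is easy to check numerically. The snippet below is a small illustrative example in the spirit of Fig. 1; the particular matrices are ours, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small separable topic matrix (columns sum to 1): words 0-1 are novel to
# topic 0, words 2-3 to topic 1, word 4 to topic 2; word 5 is non-novel.
beta = np.array([[0.3, 0.0, 0.0],
                 [0.2, 0.0, 0.0],
                 [0.0, 0.4, 0.0],
                 [0.0, 0.1, 0.0],
                 [0.0, 0.0, 0.6],
                 [0.5, 0.5, 0.4]])

theta = rng.dirichlet(np.ones(3), size=8).T   # (K=3, M=8), columns on the simplex
A = beta @ theta                              # (W=6, M=8)

row_norm = lambda Z: Z / Z.sum(axis=1, keepdims=True)
A_tilde, theta_tilde = row_norm(A), row_norm(theta)

# Rows of A-tilde for novel words coincide with the corresponding rows of theta-tilde.
assert np.allclose(A_tilde[0], theta_tilde[0])   # word 0, novel to topic 0
assert np.allclose(A_tilde[3], theta_tilde[1])   # word 3, novel to topic 1
```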
This geometric viewpoint reveals how to extract the topic matrix β from Ã's row-vectors: (1) Row-normalize A to Ã. (2) Find the extreme points of Ã's rows. (3) Cluster the row-vectors of Ã that correspond to the same extreme point into the same group; there will be K disjoint groups and each group will correspond to the novel words of the same topic. (4) Express the remaining row-vectors of Ã as convex combinations of the extreme points; this gives us β̃. (5) Finally, renormalize β̃ to obtain β.

The reality, however, is that we only have access to X̃, not Ã. The above procedure, when applied to X̃, would work well if X̃ is close to Ã, which would happen if N is large. When N is small, two problems arise: (i) points corresponding to novel words of the same topic may become multiple extreme points and may be far from each other (e.g., X̃_1, X̃_2 and X̃_3, X̃_4 in Fig. 1); (ii) points in the convex hull may also become "outlier" extreme points (e.g., X̃_6 in Fig. 1).

As a step towards overcoming these difficulties, we observe that in practice the unique words of any topic occur in only a few documents. This implies that the rows of θ are sparse and that the row-vectors of X corresponding to the novel words of the same topic are likely to form a low-dimensional subspace (e.g., S_1, S_2 in Fig. 1), since their supports are subsets of the support of the same row-vector of θ. If we make the further assumption that for any pair of distinct topics there are several documents in which their novel words do not co-occur, then the row subspaces of X corresponding to the novel words of any two distinct topics are likely to be significantly disjoint (although they might share a common low-dimensional subspace). Finally, the row-vectors of X corresponding to non-novel words are unlikely to be close to the row subspaces of X corresponding to the novel words of any one topic (e.g., X̃_6 in Fig. 1). These observations and assumptions motivate the revised 5-step Algorithm 1 for extracting β from X.

Algorithm 1: Topic Discovery
Input: W × M word-document matrix X; number of topics K.
Output: Estimate β̂ of the W × K topic matrix β.
1: Row-normalize X to get X̃. Let N_w := \sum_{d=1}^{M} X_{wd}.
2: Apply Algorithm 2 to the rows of X̃ to obtain a subset of rows E that correspond to candidate novel words. Let Ĉ_0 be the remaining row indices.
3: Apply the sparse subspace clustering algorithm of [9, 10] to E with parameters λ_1, γ to obtain K clusters {Ĉ_k}_{k=1}^{K} of novel words and a cluster C_out of outlier words. Rearrange the rows of X̃ indexed by Ĉ_k into a matrix Ŷ_k.
4: For each w ∈ Ĉ_0 ∪ C_out, solve

    \min_{\{b_{wl} \in \mathbb{R}_+^{|\hat{C}_l|}\}_{l=1}^{K}} \; \Big\| \tilde{X}_w - \sum_{l=1}^{K} b_{wl} \hat{Y}_l \Big\|_2^2 + \lambda_2 \sum_{l=1}^{K} \| b_{wl} \|_\infty

for some λ_2 ≥ 0. Let {b*_{wl}}_{l=1}^{K} be the optimal solution.
5: For w = 1, ..., W and k = 1, ..., K, set

    \hat{\beta}_{wk} = \begin{cases} N_w \, \mathbf{1}(w \in \hat{C}_k) & \text{for } w \in \bigcup_{l=1}^{K} \hat{C}_l \\ N_w \, \| b^{*}_{wk} \|_1 & \text{for } w \in \hat{C}_0 \cup C_{\mathrm{out}} \end{cases}

and normalize each column of β̂ to be column-stochastic.

Algorithm 2: Find candidate novel words
Input: Set of 1 × M probability row-vectors x̃_1, ..., x̃_W; number of projections P; tolerance δ.
Output: Set E of candidate novel row-vectors.
1: Set E = ∅.
2: Generate a row-vector d ~ Uniform(unit sphere in R^M).
3: i_max := arg max_i x̃_i d^T, i_min := arg min_i x̃_i d^T.
4: E ← E ∪ {x̃_i : ‖x̃_i − x̃_{i_max}‖_1 ≤ δ or ‖x̃_i − x̃_{i_min}‖_1 ≤ δ}.
5: Repeat steps 2 through 4, P times.

Step (2) of Algorithm 1 finds rows of X̃, many of which are likely to correspond to the novel words of topics and some to outliers (non-novel words). This step uses Algorithm 2, which is a linear-complexity procedure for finding, with high probability, extreme points and points close to them (the candidate novel words of topics) using a small number P of random projections. Step (3) uses the state-of-the-art sparse subspace clustering algorithm from [9, 10] to identify K clusters of novel words, one for each topic, and an additional cluster containing the outliers (non-novel words). Step (4) expresses rows of X̃ corresponding to non-novel words as convex combinations of these K groups of rows, and step (5) estimates the entries of the topic matrix and normalizes it to make it column-stochastic. In many applications, non-novel words occur in only a few topics. The group-sparsity penalty λ_2 \sum_{l=1}^{K} ‖b_{wl}‖_∞ proposed in [11] is used in step (4) of Algorithm 1 to favor solutions in which the row-vectors of non-novel words are convex combinations of as few groups of novel words as possible. Our proposed algorithm runs in time polynomial in W, M, and K, and all the optimization problems involved are convex.
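Algorithm 2 is straightforward to implement with a few lines of numpy. The sketch below is our reading of the pseudocode above (function and variable names are ours, and it is meant only as an illustration, not the authors' reference code): for each of P random directions it marks the rows with the largest and smallest projections, together with all rows within l1 distance δ of them, as candidate novel words.

```python
import numpy as np

def candidate_novel_words(X_tilde, P, delta, seed=None):
    """Algorithm 2 (sketch): indices of candidate novel-word rows of X_tilde.

    X_tilde : (W, M) row-normalized word-document matrix
    P       : number of random projections
    delta   : l1 tolerance for points near a projection extremum
    """
    rng = np.random.default_rng(seed)
    W, M = X_tilde.shape
    E = set()
    for _ in range(P):
        d = rng.normal(size=M)
        d /= np.linalg.norm(d)                     # direction uniform on the unit sphere
        proj = X_tilde @ d
        for i_ext in (int(np.argmax(proj)), int(np.argmin(proj))):
            dists = np.abs(X_tilde - X_tilde[i_ext]).sum(axis=1)   # l1 distances to the extremum
            E.update(np.flatnonzero(dists <= delta).tolist())
    return sorted(E)
```

Each projection costs O(WM) operations, so for fixed P the step is linear in the size of X̃, consistent with the linear-complexity claim above.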
3. EXPERIMENTAL RESULTS

3.1. Synthetic Dataset

In this section, we validate our algorithm on some synthetic examples. We generate a W × K separable topic matrix β with W_1/K > 1 novel words per topic as follows: first, iid 1 × K row-vectors corresponding to non-novel words are generated uniformly on the probability simplex. Then, W_1 iid Uniform[0,1] values are generated for the nonzero entries in the rows of novel words. The resulting matrix is then column-normalized to get one realization of β. Let ρ := W_1/W. Next, M iid K × 1 column-vectors are generated for the θ matrix according to a Dirichlet prior c \prod_{i=1}^{K} \theta_i^{\alpha_i - 1}. Following [12], we set α_i = 0.1 for all i. Finally, we obtain X by generating N iid words for each document.

For different settings of W, ρ, K, M, and N, we calculate the error of the estimated topic matrix β̂ as ‖β̂ − β‖_F. For each setting we average the error over 50 random samples. In sparse subspace clustering, the value of λ_1 is set as in [10] (it depends on the size of the candidate set) and the value of γ as in [9] (it depends on the values of N and M). In step 4 of Algorithm 1, we set λ_2 = 0.01 for all settings.

We compare our algorithm against the LDA algorithm [2] and a state-of-the-art NMF-based algorithm [13]. This NMF algorithm is chosen because it compensates for the type of noise we use in our topic model. Our LDA implementation uses Gibbs sampling for inference. Figure 2 depicts the estimation error as a function of the number of documents M (top) and the number of words per document N (bottom). Evidently, our algorithm is uniformly better than the comparable techniques. Specifically, while NMF has similar error to our algorithm for large M, it performs relatively poorly as a function of N. On the other hand, LDA has similar error performance to ours for large N but performs poorly as a function of M. Note that both of these algorithms have comparably high error rates for small M and N.

Fig. 2. Error of the estimated topic matrix in Frobenius norm for the proposed algorithm, NMF (KL-divergence), and LDA (Gibbs sampling). Upper: W = 500, ρ = 0.2, N = 50, K = 5; Lower: W = 500, ρ = 0.2, K = 10, M = 500.
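For completeness, here is a sketch of the synthetic data generation described at the start of this subsection (our own code following the description in the text; the function and parameter names are illustrative): novel-word rows with a single Uniform[0,1] entry arranged in a separable pattern, non-novel rows drawn uniformly on the probability simplex, column normalization, and a symmetric Dirichlet(0.1) prior for the columns of θ.

```python
import numpy as np

def synthetic_separable_beta(W, K, rho, seed=None):
    """Generate one realization of a W x K separable topic matrix (sketch)."""
    rng = np.random.default_rng(seed)
    per_topic = int(rho * W) // K            # W1/K novel words per topic
    W1 = per_topic * K                       # total number of novel words
    beta = np.zeros((W, K))
    for k in range(K):
        rows = slice(k * per_topic, (k + 1) * per_topic)
        beta[rows, k] = rng.uniform(size=per_topic)         # single nonzero entry per novel row
    beta[W1:, :] = rng.dirichlet(np.ones(K), size=W - W1)   # non-novel rows: uniform on the simplex
    return beta / beta.sum(axis=0, keepdims=True)           # column-normalize (column-stochastic)

W, K, M, rho = 500, 5, 1000, 0.2
beta = synthetic_separable_beta(W, K, rho, seed=0)
theta = np.random.default_rng(1).dirichlet(0.1 * np.ones(K), size=M).T   # K x M, Dirichlet(0.1) columns
```

One realization of X can then be drawn with the simulation sketch given in Section 1, and the reported error ‖β̂ − β‖_F computed against this β (presumably after aligning the estimated topic columns with the ground-truth ones).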
3.2. Swimmer Image Dataset

In this section we apply our algorithm to the synthetic swimmer image dataset introduced in [5]. There are M = 256 binary images, each of W = 32 × 32 = 1024 pixels. Each image represents a swimmer composed of four limbs, each of which can be in one of 4 distinct positions, and a torso.

We interpret pixel positions (i, j), 1 ≤ i, j ≤ 32, as words in a dictionary. Documents are images, where an image is interpreted as a collection of pixel positions with non-zero values. Since each of the four limbs can independently take one of four positions, it turns out that the topic matrix β satisfies the separability assumption with K = 16 "ground truth" topics that correspond to the 16 single-limb positions. Following the setting of [13], we set body pixel values to 10 and background pixel values to 1. We then take each "clean" image, suitably normalized, as an underlying distribution across pixels and generate a "noisy" document of N = 200 iid "words" according to the topic model. Examples are shown in Fig. 3. We then apply our algorithm to the "noisy" dataset. We again compare our algorithm against LDA and the NMF algorithm from [13]. Results are shown in Figures 4 and 5. Values of the tuning parameters λ_1, γ, and λ_2 are set as in Sec. 3.1; specifically, λ_1 = 0.1 and λ_2 = 0.01 for the results in Figs. 4 and 5.

Fig. 3. (a) Example "clean" images (columns of A) in the Swimmer dataset; (b) corresponding images with sampling "noise" (columns of X); (c) examples of ideal topics (columns of β).

This dataset is a good validation test for different algorithms since the ground truth topics are known and are unique. As we see in Fig. 5, both LDA and NMF produce topics that do not correspond to any pure left/right arm/leg positions. Indeed, many estimated topics are composed of multiple limbs. No such errors are observed for our algorithm, and our topic estimates are closer to the ground truth images.

Fig. 4. Topics estimated for the noisy swimmer dataset by (a) the proposed algorithm, (b) LDA inference using the code in [12], and (c) the NMF algorithm using the code in [13]. The topics closest to the 16 ideal (ground truth) topics LA1, LA2, etc., are shown. LDA misses 5 and NMF misses 6 of the ground truth topics, while our algorithm recovers all 16 and our topic estimates look less noisy.

Fig. 5. Topic errors for (a) the LDA algorithm [12] and (b) the NMF algorithm [13] on the Swimmer dataset. The figure depicts topics that are extracted by LDA and NMF but are not close to any "ground truth" topic. The ground truth topics correspond to 16 different positions of the left/right arms and legs.

3.3. Text Corpora

In this section, we apply our algorithm to two different text corpora, namely the NIPS dataset [14] and the New York (NY) Times dataset [15]. In the NIPS dataset there are M = 2484 documents with W = 14036 words in the vocabulary and, on average, N ≈ 900 words per document. In the NY Times dataset, M = 3000, W = 9340, and N ≈ 270. The vocabulary is obtained by deleting a standard "stop" word list used in computational linguistics, including numbers, individual characters, and some common English words such as "the". Words that occur fewer than 5 times in the dataset and words that occur in fewer than 5 documents are removed from the vocabulary as well. The tuning parameters λ_1, γ, and λ_2 are set in the same way as in Sec. 3.1 (specifically, λ_1 = 0.1 and λ_2 = 0.1).

Table 1 depicts typical topics extracted by our algorithm. For each topic we show its most frequent words, listed in descending order of estimated probability. Although there is no "ground truth" to compare with, the most frequent words in the estimated topics do form recognizable themes. For example, in the NIPS dataset, the set of (most frequent) words "chip", "circuit", etc., can be annotated as "IC design", and the words "visual", "cells", etc., can be labeled as "human visual system". As a point of comparison, we also experimented with related convex programming algorithms [8, 7] that have recently appeared in the literature. We found that they fail to produce meaningful results for these datasets.

Table 1. Most frequent words in examples of estimated topics. Upper: NIPS, with K = 40 topics; Lower: NY Times, with K = 20 topics.

  "chips"     "vision"    "networks"   "learning"
  chip        visual      network      learning
  circuit     cells       routing      training
  analog      ocular      system       error
  current     cortical    delay        SVM
  gate        activity    load         model

  "election"  "law"       "market"     "game"
  state       case        market       game
  politics    law         executive    play
  election    lawyer      industry     team
  campaign    charge      sell         run
  vote        court       business     season
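The vocabulary pruning used for the text corpora (dropping stop words, words occurring fewer than 5 times overall, and words occurring in fewer than 5 documents) reduces to simple masking of a raw count matrix. The sketch below is ours and assumes a word-by-document count matrix and a stop-word list are already available; all names are illustrative.

```python
import numpy as np

def prune_vocabulary(counts, vocab, stop_words, min_count=5, min_docs=5):
    """Prune a (W, M) word-by-document count matrix as described in Sec. 3.3 (sketch)."""
    stop = {w.lower() for w in stop_words}
    total = counts.sum(axis=1)               # total occurrences of each word
    doc_freq = (counts > 0).sum(axis=1)      # number of documents containing each word
    keep = np.array([
        w.lower() not in stop and total[i] >= min_count and doc_freq[i] >= min_docs
        for i, w in enumerate(vocab)
    ])
    return counts[keep, :], [w for w, kept in zip(vocab, keep) if kept]
```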
4. REFERENCES

[1] T. Hofmann, "Probabilistic latent semantic analysis," in Uncertainty in Artificial Intelligence, San Francisco, CA, 1999, pp. 289–296, Morgan Kaufmann Publishers.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993–1022, Mar. 2003.
[3] D. M. Blei, "Probabilistic topic models," Commun. ACM, vol. 55, no. 4, pp. 77–84, Apr. 2012.
[4] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, Oct. 1999.
[5] D. Donoho and V. Stodden, "When does non-negative matrix factorization give a correct decomposition into parts?," in Advances in Neural Information Processing Systems 16, Cambridge, MA, 2004, MIT Press.
[6] A. Cichocki, R. Zdunek, A. H. Phan, and S. Amari, Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation, Wiley, 2009.
[7] B. Recht, C. Re, J. Tropp, and V. Bittorf, "Factoring nonnegative matrices with linear programs," in Advances in Neural Information Processing Systems 25, 2012, pp. 1223–1231.
[8] S. Arora, R. Ge, and A. Moitra, "Learning topic models -- going beyond SVD," arXiv:1204.1956v2 [cs.LG], 2012.
[9] M. Soltanolkotabi and E. J. Candès, "A geometric analysis of subspace clustering with outliers," arXiv e-prints, Dec. 2011.
[10] E. Elhamifar and R. Vidal, "Sparse subspace clustering: Algorithm, theory, and applications," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.
[11] E. Esser, M. Möller, S. Osher, G. Sapiro, and J. Xin, "A convex model for nonnegative matrix factorization and dimensionality reduction on physical space," IEEE Trans. Image Processing, vol. 21, pp. 3239–3252, Jul. 2012.
[12] T. Griffiths and M. Steyvers, "Finding scientific topics," Proceedings of the National Academy of Sciences, vol. 101, pp. 5228–5235, 2004.
[13] V. Y. F. Tan and C. Févotte, "Automatic relevance determination in nonnegative matrix factorization with the beta-divergence," IEEE Transactions on Pattern Analysis and Machine Intelligence, in press.
[14] A. Globerson, G. Chechik, F. Pereira, and N. Tishby, "Euclidean embedding of co-occurrence data," The Journal of Machine Learning Research, vol. 8, pp. 2265–2295, 2007.
[15] A. Chaney and D. M. Blei, "Visualizing topic models," in International AAAI Conference on Weblogs and Social Media, 2012.
