Parsing entire discourses as very long strings: Capturing topic continuity in grounded language learning

Thang Luong Minh (Department of Computer Science, Stanford University, Stanford, California)
Michael C. Frank (Department of Psychology, Stanford University, Stanford, California)
Mark Johnson (Department of Computing, Macquarie University, Sydney, Australia)

Abstract

Grounded language learning, the task of mapping from natural language to a representation of meaning, has attracted increasing interest in recent years. In most work on this topic, however, utterances in a conversation are treated independently and discourse structure information is largely ignored. In the context of language acquisition, this independence assumption discards cues that are important to the learner, e.g., the fact that consecutive utterances are likely to share the same referent (Frank et al., 2013). The current paper describes an approach to the problem of simultaneously modeling grounded language at the sentence and discourse levels. We combine ideas from parsing and grammar induction to produce a parser that can handle long input strings with thousands of tokens, creating parse trees that represent full discourses. By casting grounded language learning as a grammatical inference task, we use our parser to extend the work of Johnson et al. (2012), investigating the importance of discourse continuity in children's language acquisition and its interaction with social cues. Our model boosts performance in a language acquisition task and yields good discourse segmentations compared with human annotators.

1 Introduction

Learning mappings between natural language (NL) and meaning representations (MR) is an important goal for both computational linguistics and cognitive science. Accurately learning novel mappings is crucial in grounded language understanding tasks, and such systems can suggest insights into the nature of children's language learning.

Two influential examples of grounded language learning tasks are the sportscasting task, RoboCup, where the NL is the set of running commentary and the MR is the set of logical forms representing actions like kicking or passing (Chen and Mooney, 2008), and the cross-situational word-learning task, where the NL is the caregiver's utterances and the MR is the set of objects present in the context (Siskind, 1996; Yu and Ballard, 2007). Work in these domains suggests that, based on the co-occurrence between words and their referents in context, it is possible to learn mappings between NL and MR even under substantial ambiguity.

Nevertheless, contexts like RoboCup, in which every single utterance is grounded, are extremely rare. Much more common are cases where a single topic is introduced and then discussed at length throughout a discourse. In a television news show, for example, a topic might be introduced by presenting a relevant picture or video clip. Once the topic is introduced, the anchors can discuss it by name or even using a pronoun, without showing a picture. The discourse is grounded without having to ground every utterance.

Moreover, although previous work has largely treated utterances as independent of their order, the order of utterances is critical in grounded discourse contexts: if the order is scrambled, it can become impossible to recover the topic. Supporting this idea, Frank et al. (2013) found that topic continuity, the tendency to talk about the same topic in multiple utterances that are contiguous in time, is both prevalent and informative for word learning. This paper examines the importance of topic continuity through a grammatical inference problem.
We build on Johnson et al. (2012)'s work that used grammatical inference to learn word-object mappings and to investigate the role of social information (cues like eye-gaze and pointing) in a child language acquisition task.

[Figure 1: Unigram Social Cue PCFGs (Johnson et al., 2012) – shown is a parse tree of the input utterance "wheres the piggie" accompanied by social cue prefixes, indicating that the caregiver is holding a pig toy while the child is looking at it; at the same time, a dog toy is present in the scene.]

Our main contribution lies in the novel integration of existing techniques and algorithms in parsing and grammar induction to offer a complete solution for simultaneously modeling grounded language at the sentence and discourse levels. Specifically, we: (1) use the Earley algorithm to exploit the special structure of our grammars (which are deterministic or have at most bounded ambiguity) to achieve approximately linear parsing time; (2) suggest a rescaling approach that enables us to build a PCFG parser capable of handling very long strings consisting of thousands of tokens; and (3) employ Variational Bayes for grammatical inference to obtain better grammars than those given by the Expectation Maximization algorithm.

By parsing entire discourses at once, we shed light on a scientifically interesting question about why the child's own gaze is a positive cue for word learning (Johnson et al., 2012). Our data provide support for the hypothesis (from previous work) that caregivers "follow in": they name objects that the child is already looking at (Tomasello and Farrar, 1986). In addition, our discourse model produces a performance improvement in a language acquisition task and yields good discourse segmentations compared with human annotators.

2 Related Work

Supervised semantic parsers. Previous work has developed supervised semantic parsers to map sentences to meaning representations of various forms, including meaning hierarchies (Lu et al., 2008) and, most dominantly, λ-calculus expressions (Zettlemoyer and Collins, 2005; Zettlemoyer, 2007; Wong and Mooney, 2007; Kwiatkowski et al., 2010). These approaches rely on training data of annotated sentence-meaning pairs, however. Such data are costly to obtain and are quite different from the experience of language learners.

Grounded language learning. In contrast to semantic parsers, grounded language learning systems aim to learn the meanings of words and sentences given an observed world state (Yu and Ballard, 2004; Gorniak and Roy, 2007). A growing body of work in this field employs distinct techniques from a wide variety of perspectives, from text-to-record alignment using structured classification (Barzilay and Lapata, 2005; Snyder and Barzilay, 2007), iterative retraining (Chen et al., 2010), and generative models of segmentation and alignment (Liang et al., 2009), to text-to-interaction mapping using reinforcement learning (Branavan et al., 2009; Vogel and Jurafsky, 2010), graphical-model semantics representations (Tellex et al., 2011a; Tellex et al., 2011b), and Combinatory Categorial Grammar (Artzi and Zettlemoyer, 2013). A number of systems have also used alternative forms of supervision, including sentences paired with responses (Clarke et al., 2010; Goldwasser and Roth, 2011; Liang et al., 2011) and no supervision at all (Poon and Domingos, 2009; Goldwasser et al., 2011).
Recent work has also introduced an alternative approach to grounded learning by reducing it to a grammatical inference problem. Börschinger et al. (2011) cast the problem of learning a semantic parser as a PCFG induction task, achieving state-of-the-art performance in the RoboCup domain. Kim and Mooney (2012) extended the technique to make it tractable for more complex problems. Later, Kim and Mooney (2013) adapted discriminative reranking to the grounded learning problem using a form of weak supervision. We employ this general grammatical inference approach in the current work.

Child language acquisition. In the context of language acquisition, Frank et al. (2008) proposed a system that learned words and jointly inferred speakers' intended referents (utterance topics) using graphical models. Johnson et al. (2012) used grammatical inference to demonstrate the importance of social cues in children's early word learning. We extend this body of work by capturing discourse-based dependencies among utterances rather than treating each utterance independently.

Discourse parsing. A substantial literature has examined formal representations of discourse across a wide variety of theoretical perspectives (Mann and Thompson, 1988; Scha and Polanyi, 1988; Hobbs, 1990; Lascarides and Asher, 1993; Knott and Sanders, 1997). Although much of this work was highly influential, Marcu (1997)'s work on discourse parsing brought this task to special prominence. Since then, increasingly sophisticated models of discourse analysis have been developed, e.g., (Marcu, 1999; Marcu, 2000; Soricut and Marcu, 2003; Forbes et al., 2003; Polanyi et al., 2004; Baldridge and Lascarides, 2005; Subba and Di Eugenio, 2009; Hernault et al., 2010; Lin et al., 2012; Feng and Hirst, 2012). Our contribution to work on this task is to examine latent discourse structure specifically in grounded language learning.

3 A Grounded Learning Task

Our focus in this paper is to develop computational models that help us better understand children's language acquisition. The goal is to learn both the long-term lexicon of mappings between words and objects (language learning) and the intended topic of individual utterances (language comprehension). We consider a corpus of child-directed speech annotated with social cues, described in (Frank et al., 2013). There are a total of 4,763 utterances in the corpus, each of which is orthographically transcribed from videos of caregivers playing with pre-linguistic children of various ages (6, 12, and 18 months) during home visits.[1] Each utterance was hand-annotated with the objects present in the (non-linguistic) context, e.g. dog and pig (Figure 1), together with sets of social cues, one set per object. The social cues describe objects the caregiver is looking at (mom.eyes), holding onto (mom.hands), or pointing to (mom.point); similarly for (child.eyes) and (child.hands).

[1] Caregivers were given pairs of toys to play with, e.g. a stuffed dog and pig, or a wooden car and truck.

3.1 Sentence-level Models

Motivated by the importance of social information in children's early language acquisition (Carpenter et al., 1998), Johnson et al. (2012) proposed a joint model of non-linguistic information, including the physical context and social cues, and the linguistic content of individual utterances. They framed the joint inference problem of inferring word-object mappings and inferring sentence topics as a grammar induction task in which input strings are utterances prefixed with non-linguistic information. Objects present in the non-linguistic context of an utterance are considered its potential topics. There is also a special null topic, None, to indicate non-topical utterances. The goal of the model is then to select the most probable topic for each utterance.

Top-level rules, Sentence → Topic_t Words_t (unigram PCFG) or Sentence → Topic_t Collocs_t (collocation Adaptor Grammar), are tailored to link the two modalities (t ranges over T′, the set of all available topics T plus None). These rules enforce sharing of topics between prefixes (Topic_t) and words (Words_t or Collocs_t). Each word in the utterance is drawn from either a topic-specific distribution Word_t or a general "null" distribution Word_None.
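To make the rule schema concrete, the sketch below enumerates these rules for a toy topic set. It is only an illustration under stated assumptions, not the grammar-construction code used for the experiments: the helper name make_unigram_rules, the plain-tuple rule format, and the exact expansion of Words_t are ours.

```python
# Illustrative sketch (not the authors' grammar code): enumerate the rule
# schema of a unigram topic PCFG in the spirit of Johnson et al. (2012).
# T' is the topic set T plus the special null topic "None"; each word token
# is drawn either from a topic-specific distribution Word_t or from the
# general null distribution Word_None.
def make_unigram_rules(topics, vocabulary):
    rules = []
    for t in list(topics) + ["None"]:                     # t ranges over T'
        # top-level rule linking the non-linguistic prefix and the words:
        rules.append(("Sentence", ("Topic_" + t, "Words_" + t)))
        # Words_t is a word sequence whose tokens come from Word_t or Word_None
        for emitter in ("Word_" + t, "Word_None"):
            rules.append(("Words_" + t, (emitter, "Words_" + t)))
            rules.append(("Words_" + t, (emitter,)))
    for t in list(topics) + ["None"]:
        for w in vocabulary:                              # lexical rules Word_t -> w
            rules.append(("Word_" + t, (w,)))
    return sorted(set(rules))

if __name__ == "__main__":
    for lhs, rhs in make_unigram_rules(["pig", "dog"], ["wheres", "the", "piggie"]):
        print(lhs, "->", " ".join(rhs))
```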
As illustrated in Figure 1, the selected topic, pig, is propagated down to the input string through two paths: (a) through topical nodes until an object is reached, in this case the .pig object, and (b) through lexical nodes to topical word tokens, e.g. piggie. Social cues are then generated by a series of binary decisions, as detailed in Johnson et al. (2012). The key feature of these grammars is that parameter inference corresponds both to learning word-topic relations and to learning the salience of social cues in grounded learning.

In the current work, we restrict our attention to only the unigram PCFG model, in order to focus on investigating the role of topic continuity. Unlike the approach of Johnson et al. (2012), which uses Markov Chain Monte Carlo techniques to perform grammatical inference, we experiment with Variational Bayes methods, detailed in Section 6.

3.2 A Discourse-level Model

Topic continuity, the tendency to group utterances into coherent discourses about a single topic, may be an important source of information for children learning the meanings of words (Frank et al., 2013). To address this issue, we consider a new discourse-level model of grounded language that captures dependencies between utterances. By linking multiple utterances in a single parse, our proposed grammatical formalism approximates a bigram Markov process that models transitions among utterance topics.

Our grammar starts with a root symbol Discourse, which selects a starting topic through a set of discourse-initial rules, Discourse → Discourse_t for t ∈ T′. Each of the Discourse_t nodes generates an utterance of the same topic, and advances into other topics through transition rules, Discourse_t → Sentence_t Discourse_t′ for t′ ∈ T′. Discourses terminate with ending rules, Discourse_t → Sentence_t. The other rules of the unigram PCFG model by Johnson et al. are reused, except for the top-level rules, in which we replace the non-terminal Sentence by topic-specific ones, Sentence_t.
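A matching sketch for the discourse level, again purely illustrative (the helper make_discourse_rules and the tuple-based rule format are assumptions, not Earleyx's internal representation), shows how these three rule families encode a bigram process over topics:

```python
# Illustrative sketch (not the authors' code): discourse-level rules that
# approximate a bigram Markov process over utterance topics.  Sentence-level
# rules are assumed to be produced separately, with Sentence replaced by
# topic-specific Sentence_t symbols.
def make_discourse_rules(topics):
    all_topics = list(topics) + ["None"]                      # T'
    rules = []
    for t in all_topics:
        rules.append(("Discourse", ("Discourse_" + t,)))      # discourse-initial rules
        rules.append(("Discourse_" + t, ("Sentence_" + t,)))  # ending rules
        for t2 in all_topics:                                 # transition rules
            rules.append(("Discourse_" + t, ("Sentence_" + t, "Discourse_" + t2)))
    return rules

# e.g. make_discourse_rules(["pig", "dog"]) yields rules such as
#   Discourse_pig -> Sentence_pig Discourse_dog
# whose learned probabilities play the role of topic-transition probabilities.
```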
3.3 Parsing Discourses and Challenges

Using a discourse-level grammar, we must parse a concatenation of all the utterances (with annotations) in each conversation. This concatenation results in an extremely long string: in the social-cue corpus (Frank et al., 2013), the average length of these per-recording concatenations is 2152 tokens (σ = 972). Parsing such strings poses many challenges for existing algorithms.

For familiar algorithms such as CYK, runtime quickly becomes enormous: the time complexity of CYK is O(n^3) for an input of length n. Fortunately, we can take advantage of a special structural property of our grammars. The shape of the parse tree is completely determined by the input string; the only variation is in the topic annotations in the nonterminal labels. So even though the number of possible parses grows exponentially with input length n, the number of possible constituents grows only linearly with input length, and the possible constituents can be identified from the left context.[2] These constraints ensure that the Earley algorithm[3] (Earley, 1970) will parse an input of length n with this grammar in time O(n).

[2] The prefix markers # and ## and the topic markers such as ".dog" enable a left-to-right parser to unambiguously identify its location in the input string.
[3] In order to achieve linear time, the parsing chart must have suitable indexing; see Aho and Ullman (1972), Leo (1991) and Aycock and Horspool (2002) for details.

A second challenge in parsing very long strings is that the probability of a parse is the product of the probabilities of the rules involved in its derivation. As the length of a derivation grows linearly with the length of the input, parse probabilities decrease exponentially as a function of sentence length, causing floating-point underflow on inputs of even moderate length. The standard method for handling this is to compute log probabilities (which decrease linearly as a function of input length, rather than exponentially), but as we explain later (Section 5), we can instead use the ability of the Earley algorithm to compute prefix probabilities (Stolcke, 1995) to rescale the probability of the parse incrementally and avoid floating-point underflows.

In the next section, we provide background information on the Earley algorithm for PCFGs, the prefix probability scheme we use, and the inside-outside algorithm in the Earley context.

4 Background

4.1 Earley Algorithm for PCFGs

The Earley algorithm was developed by Earley (1970) and is known to be efficient for certain kinds of CFGs (Aho and Ullman, 1972). An Earley parser constructs left-most derivations of strings, using dotted productions to keep track of partial derivations. Specifically, each state in an Earley parser is represented as [l,r]: X→α.β to indicate that the input symbols x_l, ..., x_{r−1} have been processed and the parser is expecting to expand β. States are generated on the fly using three transition operations: predict (add states to charts), scan (shift dots across terminals), and complete (merge two states). Figure 2 shows an example of a completion step, which also illustrates the implicit binarization done automatically in the Earley algorithm.

[Figure 2: Completion step – merging two states [l,m]: X→α.Yβ and [m,r]: Y→ν. to produce a new state [l,r]: X→αY.β.]

In order to handle PCFGs, Stolcke (1995) extends the Earley parsing algorithm with the notion of an Earley path, a sequence of states linked by Earley operations. By establishing a one-to-one mapping between partial derivations and Earley paths, Stolcke could then assign each path a derivation probability, the product of all rule probabilities used in the predicted states of that path. Here, each production X→ν corresponds to a predicted state [l,l]: X→.ν.

Besides parsing, being able to compute string and prefix probabilities by summing derivation probabilities is also of great importance. To compute these sums efficiently, each Earley state is associated with a forward and an inner probability, which are updated incrementally as new states are spawned by the three transition operations.
4.2 Forward and Prefix Probabilities

Intuitively, the forward probability of a state [l,r]: X→α.β is the probability of an Earley path through that state, generating the input up to position r−1. This probability generalizes a similar concept in HMMs and lends itself to the computation of prefix probabilities: sums of forward probabilities over scanned states yielding a prefix x.

Prefix probabilities are important because they enable probabilistic prediction of possible follow-words x_{i+1} as P(x_{i+1} | x_0 ... x_i) = P(x_0 ... x_i x_{i+1}) / P(x_0 ... x_i) (Jelinek and Lafferty, 1991). These conditional probabilities allow estimation of the incremental costs in a stack decoder, e.g. (Bahl et al., 1983); similarly, in (Huang and Sagae, 2010), a conceptually related prefix cost is defined to order states in a beam-search decoder. Moreover, the negative logarithm of such a conditional probability is termed the surprisal value in the psycholinguistics literature, e.g. (Levy, 2008), describing how difficult a word is in a given context. Interestingly, as we show next, prefix probabilities lead us to a parser that can handle extremely long strings.
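As a small illustration of how prefix probabilities are used, the sketch below turns an assumed prefix-probability function into conditional word probabilities and surprisal values. Here prefix_prob merely stands in for the quantity the Earley parser computes by summing forward probabilities; it is not a real API of any parser discussed in this paper, and the toy unigram model at the end exists only to make the example runnable.

```python
import math

def conditional_prob(prefix_prob, tokens, i):
    # P(x_{i+1} | x_0 ... x_i) = P(x_0 ... x_{i+1}) / P(x_0 ... x_i)
    return prefix_prob(tokens[: i + 2]) / prefix_prob(tokens[: i + 1])

def surprisal(prefix_prob, tokens, i):
    # Surprisal of x_{i+1} in context x_0 ... x_i (cf. Levy, 2008)
    return -math.log(conditional_prob(prefix_prob, tokens, i))

# Toy stand-in for a parser's prefix probability: a unigram model.
unigram = {"wheres": 0.2, "the": 0.5, "piggie": 0.3}
toy_prefix_prob = lambda toks: math.prod(unigram[t] for t in toks)
print(surprisal(toy_prefix_prob, ["wheres", "the", "piggie"], 1))  # surprisal of "piggie"
```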
4.3 Inside Outside Algorithm

To extend the Inside Outside (IO) algorithm (Baker, 1979) to the Earley context, Stolcke introduced inner and outer probabilities, which generalize the inside and outside probabilities of the IO algorithm. Specifically, the inner probability of a state [l,r]: X→α.β is the probability of generating the input substring x_l, ..., x_{r−1} from a non-terminal X using a production X→αβ.[4]

[4] Summing up the inner probabilities of all states Y→ν. exactly yields Baker's inside probability for Y.

[Figure 3: Inner and outer probabilities. The outer probability of X→α.Yβ is a sum of all products of its parent outer probability (X→αY.β) and its sibling inner probability (Y→ν.). Similarly, the outer probability of Y→ν. is derived from the outer probability of X→αY.β and the inner probability of X→α.Yβ.]

Once all inner probabilities have been populated in a forward pass, outer probabilities are derived backward, starting from the outer probability of the goal state [0,n]: →S., which is 1. Here, each Earley state is associated with an outer probability, which complements the inner probability by referring precisely to those parts of the complete paths generating the input string x that are not covered by the corresponding inner probability. The implicit binarization in Earley parsing allows outer probabilities to be accumulated in a similar way as their counterparts in the IO algorithm (see Figure 3).

These quantities allow for efficient grammatical inference, in which the expected count of each rule X→λ given a string x is computed as:

c(X→λ | x) = ( Σ_{s:[l,r]: X→λ.} outer(s) · inner(s) ) / P(S ⇒* x).   (1)
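The following sketch mirrors Eq. (1) directly. The flat list of completed-state records (rule, outer, inner) is an assumed data layout for illustration, not the chart structure of an actual Earley implementation.

```python
from collections import defaultdict

# Sketch of the expected-count computation in Eq. (1), under an assumed
# state format: one record per completed state [l,r]: X -> lambda . ,
# carrying the rule it uses plus its outer and inner scores.
def expected_rule_counts(states, string_prob):
    """states: iterable of (rule, outer, inner); string_prob: P(S =>* x).
    Returns c(X -> lambda | x) for every rule seen in the chart."""
    totals = defaultdict(float)
    for rule, outer, inner in states:
        totals[rule] += outer * inner
    return {rule: total / string_prob for rule, total in totals.items()}
```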
5 A Rescaling Approach for Parsing

Our parser originated from the prefix probability parser by Levy (2008), but has diverged markedly since then. The parser, called Earleyx, is capable of producing Viterbi parses and performing grammatical induction based on the expectation-maximization and variational Bayes algorithms. Parser code is available at http://url.

To tackle the underflow problem posed when parsing discourses (§3.3), we borrow the rescaling concept from HMMs (Rabiner, 1990) to extend the probabilistic Earley algorithm. Specifically, the probability of each Earley path is scaled by a constant c_i each time it passes through a scanned state generating the input symbol x_i. In fact, each path passes through each scanned state exactly once, so we consistently accumulate scaling factors c_0 ... c_{r−1} and c_l ... c_{r−1} for the forward and inner probabilities, respectively, of a state [l,r]: X→α.β.

Arguably, the most intuitive choice of scaling factors is the prefix probabilities, which essentially reset the probability of any Earley path starting at any position i to 1. Concretely, we set c_0 = 1 / P(x_0) and c_i = P(x_0 ... x_{i−1}) / P(x_0 ... x_i) for i = 1, ..., n−1, where n is the input length. It is interesting to point out that the logarithm of c_i gives us the surprisal (§4.2) value for the input symbol x_i.

Rescaling factors are only introduced in the forward pass, during which the outer probability of a state [l,r]: X→α.β has already been scaled by factors c_0 ... c_{l−1} c_r ... c_{n−1}.[5] More importantly, when computing expected counts, scaling factors in the outer and inner terms cancel out with those in the string probability in Eq. (1), implying that rule probability estimation is unaffected by rescaling.

[5] The outer probability of a state is essentially the product of inner probabilities covering all input symbols outside the span of that state. For grammars containing cyclic unit productions, we also need to multiply with terms from the unit-production relation matrix (Stolcke, 1995).

5.1 Parsing Time on Dense Grammars

We compare in Table 1 the parsing time (on a 2.4GHz Xeon CPU) of our parser (Earleyx) and Levy's. The task is to compute surprisal values for a 22-word sentence over a dense grammar.[6] Given that our parser is now capable of performing scaling to avoid underflow, we avoid converting probabilities to logarithmic form, which yields a speedup of about 4 times compared to Levy's parser.

Table 1: Parsing time (dense grammars) – time to compute surprisal values for a 22-word sentence using Levy's parser and ours (Earleyx).

Parser             Time (s)
(Levy, 2008)       640
Earleyx + scaling  145

[6] MLE estimated from the English Penn Treebank.

5.2 Parsing Time on Sparse Grammars

[Figure 4: Parsing time (sparse grammars) – time in seconds (y-axis) against input length in words (x-axis) to compute Viterbi parses for sentences of increasing lengths; series: FastComplete and Normal.]

Figure 4 shows the time taken (as a function of the input length) for Earleyx to compute Viterbi parses over our sparse grammars (§3.2). The plot confirms our analysis in that the special structure of our grammars yields approximately linear parsing time in the input length (see §3.3).
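The sketch below spells out the scaling factors defined in this section and their connection to surprisal. The list prefix_probs is assumed to be supplied by the parser (prefix_probs[i] = P(x_0 ... x_i)); the code illustrates the arithmetic only and is not Earleyx's implementation.

```python
import math

def scaling_factors(prefix_probs):
    c = [1.0 / prefix_probs[0]]                          # c_0 = 1 / P(x_0)
    for i in range(1, len(prefix_probs)):
        c.append(prefix_probs[i - 1] / prefix_probs[i])  # c_i = P(x_0..x_{i-1}) / P(x_0..x_i)
    return c

def surprisals(prefix_probs):
    # log(c_i) = -log P(x_i | x_0..x_{i-1}): the surprisal of x_i (cf. Section 4.2)
    return [math.log(ci) for ci in scaling_factors(prefix_probs)]

# Because the product c_0 * ... * c_{n-1} telescopes to 1 / P(x_0..x_{n-1}),
# a complete derivation that is scaled once at every scanned position has
# scaled probability P(derivation) / P(x), which stays in a safe
# floating-point range even for inputs with thousands of tokens.
```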
6 Grammar Induction

We employ a Variational Bayes (VB) approach to perform grammatical inference instead of the standard Inside Outside (IO) algorithm, or equivalently the Expectation Maximization (EM) algorithm, for several reasons: (1) it has been shown to be less likely to cause over-fitting for PCFGs than EM (Kurihara and Sato, 2004), and (2) implementation-wise, VB is a straightforward extension of EM, as they both share the same process of computing the expected counts (the IO part) and only differ in how rule probabilities are reestimated. At the same time, VB has also been demonstrated to do well on large data sets and is competitive with Gibbs samplers while having the fastest convergence time among these estimators (Gao and Johnson, 2008).

The rule reestimation in VB is carried out as follows. Let α_r be the prior hyperparameter of a rule r in the rule set R and c_r be its expected count accumulated over the entire corpus after an IO iteration. The posterior hyperparameter for r is α*_r = α_r + c_r. Let ψ be the digamma function; the rule parameter update formula is:

θ_{r:X→λ} = exp[ ψ(α*_r) − ψ( Σ_{r′:X→λ′} α*_{r′} ) ].

Whereas IO minimizes the negative log likelihood of the observed data (sentences), −log p(x), VB minimizes a quantity called the free energy, which we will use later to monitor convergence. Here x denotes the observed data and θ represents the model parameters (PCFG rule probabilities). Following (Kurihara and Sato, 2006), we compute the free energy as:

F(x, θ) = −log p(x) + Σ_{X∈N} log [ Γ(Σ_{r:X→λ} α*_r) / Γ(Σ_{r:X→λ} α_r) ] − Σ_{r∈R} ( log [ Γ(α*_r) / Γ(α_r) ] + c_r log θ_r ),

where Γ denotes the gamma function.

6.1 Sparse Dirichlet Priors

In our application, since each topic should only be associated with a few words rather than the entire vocabulary, we impose sparse Dirichlet priors over the Word_t distributions by setting a symmetric prior α < 1 for all rules Word_t → w (∀t ∈ T, w ∈ W), where W is the set of all words in the corpus. This biases the model to select only a few rules per non-terminal Word_t.[7] For all other rules, a uniform hyperparameter value of 1 is used. We initialized rule probabilities with uniform distributions plus random noise.

[7] It is important not to sparsify the Word_None distribution, since Word_None could expand into many non-topical words.
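For concreteness, here is a sketch of the VB re-estimation step just described. The dictionary-based rule representation is an assumption made for the example; only the digamma update itself follows the formula above.

```python
import math
from collections import defaultdict
from scipy.special import digamma

# Sketch of the VB rule re-estimation step (illustrative data layout, not the
# Earleyx code).  `prior` maps each rule r = (lhs, rhs) to its hyperparameter
# alpha_r; `expected_counts` maps r to c_r from one Inside-Outside pass.
def vb_update(prior, expected_counts):
    posterior = {r: prior[r] + expected_counts.get(r, 0.0) for r in prior}  # alpha*_r
    lhs_totals = defaultdict(float)
    for (lhs, _), a_star in posterior.items():
        lhs_totals[lhs] += a_star            # sum of alpha*_r' over rules sharing the lhs
    # theta_r = exp( psi(alpha*_r) - psi( sum_{r'} alpha*_{r'} ) )
    return {r: math.exp(digamma(a_star) - digamma(lhs_totals[r[0]]))
            for r, a_star in posterior.items()}
```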
From for child.eyes is consistent with the hypothesis theperspectiveofahumanlistener,however,“there” thatcaregiversarefollowingin,ordiscussingtheob- ispartofabroaderdiscourseaboutthecar,andla- jectthatchildrenareinterestedin–evencontrolling belingitwiththesametopiccapturesthefactthatit for the continuity of discourse, a confound in pre- encodes useful information for learners. To differ- vious analyses. In other words, the importance of entia these cases, Frank and Rohde (under review) 10Itissomewhatsurprisingwhenchild.eyehasmuchless added a new set of annotations (to the dataset used influenceonVBthanonMCMCintheablationtest. Though in Section 7) based on the discourse structure per- resultsintheadd-onetestrevealthatVBcouldgeneralizemuch ceivedbyhuman,similartocolumndiscourse,. betterthanMCMCwhenpresentedwithasinglesocialcue,it We utilize these new annotations to compare the remainsinterestingtofindoutinternallywhatcausesthediffer- ence,whichweleaveforfutureanalysis. topics predicted by our discourse model with those assigned by human annotators. We also adopt their modeling grounded language at the sentence and suggestedmetricsfordiscoursesegmentationevalu- discourse levels. Specifically, we used the Ear- ation: a=b–asimpleproportionequivalenceofdis- ley algorithm to exploit the special structure of our courseassignments;p –amovingwindowmethod grammars to achieve approximately linear parsing k (Beeferman et al., 1999) to measure the probability time,introducedarescalingapproachtohandlevery that two random utterances are correctly classified longinputstrings,andutilizedVariationalBayesfor as being in the same discourse segment; and Win- grammar induction to obtain better solutions than dowDiff (Pevzner and Hearst, 2002) – an improved theExpectationMaximizationalgorithm. version of p which gives “partial credit” to bound- By transforming a grounded language learning k ariesclosetothecorrectones. problemintoagrammaticalinferencetask,weused Results in Table 7 demonstrate that our model is our parser to study how discourse structure could in better agreement with human annotation (model- facilitate children’s language acquisition. In ad- human)thantherawannotation(raw-human)across dition, we investigate the interaction between dis- all metrics. As is visible from the limited change course structure and social cues, both important in the a=b metric, relatively few topic assignments and complementary sources of information in lan- are altered; yet these alterations create much more guage learning (Baldwin, 1993; Frank et al., 2013). coherentdiscoursesthatallowforfarbettersegmen- We also examined why individual children’s gaze tationperformanceunderp andWindowDiff. was an important predictor of reference in previ- k ous work (Johnson et al., 2012). Using ablation raw-human model-human tests, we showed that information provided by the a=b 63.6 69.3 child’s gaze is still valuable even in the presence of p 57.0 83.6 k discourse continuity, supporting the hypothesis that WindowDiff 36.2 61.2 parents “follow in” on the particular focus of chil- Table7: DiscourseevaluationSingleannotatorsample, dren’sattention(TomaselloandFarrar,1986). comparison between topics assigned by the raw annota- Lastly, we showed that our models can produce tion,ourdiscoursemodel,andahumancoder. accuratediscoursesegmentations. 
Results in Table 7 demonstrate that our model is in better agreement with the human annotation (model-human) than the raw annotation is (raw-human) across all metrics. As is visible from the limited change in the a=b metric, relatively few topic assignments are altered; yet these alterations create much more coherent discourses that allow for far better segmentation performance under p_k and WindowDiff.

Table 7: Discourse evaluation. Single-annotator sample: comparison between topics assigned by the raw annotation, our discourse model, and a human coder.

             raw-human   model-human
a=b          63.6        69.3
p_k          57.0        83.6
WindowDiff   36.2        61.2

To put an upper bound on possible discourse segmentation results, we further evaluated performance on a subset of 634 utterances for which multiple annotations were collected. Results in Table 8 demonstrate that our model predicts discourse topics (m-h1, m-h2) at a level quite close to the level of agreement between human annotators (column h1-h2).

Table 8: Discourse evaluation. Multiple-annotator sample: comparison between raw annotations (r), our model (m), and two independent human coders (h1, h2).

             r-h1   r-h2   m-h1   m-h2   h1-h2
a=b          60.1   65.6   70.4   72.4   81.7
p_k          50.7   51.8   85.1   84.9   89.7
WindowDiff   29.0   30.1   60.1   66.9   72.7

8 Conclusion and Future Work

In this paper, we proposed a novel integration of existing techniques in parsing and grammar induction to offer a complete solution for simultaneously modeling grounded language at the sentence and discourse levels. Specifically, we used the Earley algorithm to exploit the special structure of our grammars to achieve approximately linear parsing time, introduced a rescaling approach to handle very long input strings, and utilized Variational Bayes for grammar induction to obtain better solutions than the Expectation Maximization algorithm.

By transforming a grounded language learning problem into a grammatical inference task, we used our parser to study how discourse structure can facilitate children's language acquisition. In addition, we investigated the interaction between discourse structure and social cues, both important and complementary sources of information in language learning (Baldwin, 1993; Frank et al., 2013). We also examined why individual children's gaze was an important predictor of reference in previous work (Johnson et al., 2012). Using ablation tests, we showed that the information provided by the child's gaze is still valuable even in the presence of discourse continuity, supporting the hypothesis that parents "follow in" on the particular focus of children's attention (Tomasello and Farrar, 1986).

Lastly, we showed that our models can produce accurate discourse segmentations. Our system's output is considerably better than the raw topic annotations provided in the previous social-cue corpus (Frank et al., 2013) and is in good agreement with the discourse topics assigned by human annotators in Frank and Rohde (under review).

In conclusion, although previous work on grounded language learning has treated individual utterances as independent from one another, we have shown here that the ability to incorporate discourse information can be quite useful for such problems. Discourse continuity is an important source of information in children's language acquisition, and it may be a valuable part of future grounded language learning systems.

References

Alfred V. Aho and Jeffery D. Ullman. 1972. The Theory of Parsing, Translation and Compiling; Volume 1: Parsing. Prentice-Hall, Englewood Cliffs, New Jersey.

Yoav Artzi and Luke S. Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping in-
