Collocations
Lecture #12, Computational Linguistics
CMPSCI 591N, Spring 2006
University of Massachusetts Amherst
Andrew McCallum

Words and their meaning
• Word disambiguation – one word, multiple meanings
• Word clustering – multiple words, “same” meaning
• Collocations – this lecture
  – multiple words together, with a meaning different from the sum of its parts
  – simple measures on text, yielding interesting insights into language, meaning, and culture

Today’s Main Points
• What is a collocation?
• Why do people care?
• Three ways of finding them automatically.

Collocations
• An expression consisting of two or more words that corresponds to some conventional way of saying things.
• Characterized by limited compositionality.
  – compositional: the meaning of an expression can be predicted from the meaning of its parts
  – “strong tea”, “rich in calcium”
  – “weapons of mass destruction”
  – “kick the bucket”, “hear it through the grapevine”

Collocations important for…
• Terminology extraction
  – finding special phrases in technical domains
• Natural language generation
  – to produce natural-sounding output
• Computational lexicography
  – to automatically identify phrases to be listed in a dictionary
• Parsing
  – to give preference to parses with natural collocations
• Study of social phenomena
  – like the reinforcement of cultural stereotypes through language (Stubbs 1996)

Contextual Theory of Meaning
• In contrast with “structural linguistics”, which emphasizes abstractions and properties of sentences
• The Contextual Theory of Meaning emphasizes the importance of context
  – context of the social setting (not an idealized speaker)
  – context of discourse (not the sentence in isolation)
  – context of surrounding words
• Firth: “a word is characterized by the company it keeps”
• Example [Halliday]
  – “strong tea”, coffee, cigarettes
  – “powerful drugs”, heroin, cocaine
  – important for idiomatically correct English, but also for the social implications of language use

Method #1: Frequency
Most frequent bigrams:
  80871  of the
  58841  in the
  26430  to the
  21842  on the
  21839  for the
  18568  and the
  16121  that the
  15630  at the
  15494  to be
  13899  in a
  13689  of a
  13361  by the
  13183  with the
  12622  from the
  11428  New York
  10007  he said

Method #1: Frequency with POS Filter
Tag patterns kept: A N, N N, A A N, A N N, N A N, N N N, N P N
(see the code sketch following the time-shifted bigram example below)
  11487  New York         A N
   7261  United States    A N
   5412  Los Angeles      N N
   3301  last year        A N
   3191  Saudi Arabia     N N
   2699  last week        A N
   2514  vice president   A N
   2378  Persian Gulf     A N
   2161  San Francisco    N N
   2106  President Bush   N N
   2001  Middle East      A N
   1942  Saddam Hussein   N N
   1867  Soviet Union     A N
   1850  White House      A N
   1633  United Nations   A N
   1328  oil prices       N N
   1210  next year        A N
   1074  chief executive  A N
   1073  real estate      A N

Method #2: Mean and Variance
• Some collocations are not of adjacent words, but of words in a more flexible distance relationship:
  – she knocked on his door
  – they knocked at the door
  – 100 women knocked on Donaldson’s door
  – a man knocked on the metal front door
• Not a constant distance relationship
• But enough evidence that “knock” is a better choice than “hit”, “punch”, etc. for co-occurring with “door”

Method #2: Mean and Variance
Sentence: Stocks crash as rescue plan teeters.
Time-shifted bigrams (offsets 1–3):
  offset 1       offset 2       offset 3
  stocks crash   stocks as      stocks rescue
  crash as       crash rescue   crash plan
  as rescue      as plan        as teeters
  ...
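Before continuing with Method #2, here is the promised sketch of Method #1: count bigrams and keep only those matching a noun-phrase-like tag pattern. This is a minimal sketch, not the script behind the tables above; the NLTK tokenizer and tagger, the coarse tag mapping, and the sample text are illustrative assumptions, and only the two bigram patterns (A N, N N) are checked.

```python
# Minimal sketch of Method #1: bigram frequency with a part-of-speech filter.
# Requires the NLTK data packages "punkt" and "averaged_perceptron_tagger".
from collections import Counter
import nltk

# Collapse Penn Treebank tags into the coarse classes used on the slide:
# A = adjective, N = noun, P = preposition.
COARSE = {"JJ": "A", "JJR": "A", "JJS": "A",
          "NN": "N", "NNS": "N", "NNP": "N", "NNPS": "N",
          "IN": "P"}

# Only the bigram patterns are checked here; the slide also lists
# trigram patterns (A A N, A N N, N A N, N N N, N P N).
KEEP = {("A", "N"), ("N", "N")}

def candidate_bigrams(text, top=20):
    """Return the most frequent bigrams whose tag pattern passes the filter."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    counts = Counter()
    for (w1, t1), (w2, t2) in nltk.bigrams(tagged):
        if (COARSE.get(t1), COARSE.get(t2)) in KEEP:
            counts[(w1, w2)] += 1
    return counts.most_common(top)

if __name__ == "__main__":
    sample = ("The vice president flew to New York last week, "
              "and oil prices rose in New York again.")
    for (w1, w2), n in candidate_bigrams(sample):
        print(n, w1, w2)
```

Raw frequency alone mostly surfaces function-word pairs such as “of the” (the first table); the POS filter is what makes the output look like the second table.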
• To ask about the relationship between “stocks” and “crash”, gather many such pairs and compute the mean and variance of their offsets.
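A minimal sketch of that computation, assuming simple whitespace tokenization and a five-word window (the time-shifted bigram slide above uses a window of three); the “knocked … door” sentences from the earlier slide serve as the toy corpus:

```python
# Minimal sketch of Method #2: collect the offsets at which two words
# co-occur within a window, then summarize them by mean and variance.
from math import sqrt

def offsets(tokens, w1, w2, window=3):
    """Yield every offset d (1..window) at which w2 appears d words after w1."""
    for i, tok in enumerate(tokens):
        if tok == w1:
            for d in range(1, window + 1):
                if i + d < len(tokens) and tokens[i + d] == w2:
                    yield d

def mean_and_variance(values):
    n = len(values)
    mean = sum(values) / n
    # Sample variance; 0.0 when there is only a single observation.
    var = sum((v - mean) ** 2 for v in values) / (n - 1) if n > 1 else 0.0
    return mean, var

if __name__ == "__main__":
    sentences = [
        "she knocked on his door",
        "they knocked at the door",
        "100 women knocked on Donaldson's door",
        "a man knocked on the metal front door",
    ]
    ds = [d for s in sentences
            for d in offsets(s.lower().split(), "knocked", "door", window=5)]
    mean, var = mean_and_variance(ds)
    print("offsets:", ds)
    print("mean: %.2f  std dev: %.2f" % (mean, sqrt(var)))
```

Roughly, if the offsets cluster tightly around their mean (low variance), the pair behaves like a collocation with a characteristic, if not fixed, distance; if the offsets are spread evenly across the window (high variance), the two words probably have no special relationship.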