
Collocations - University of Massachusetts Amherst PDF

20 Pages·2006·0.46 MB·English


Collocations
Lecture #12
Computational Linguistics, CMPSCI 591N, Spring 2006
University of Massachusetts Amherst
Andrew McCallum

Andrew McCallum, UMass Amherst

Words and their meaning
• Word disambiguation
  – one word, multiple meanings
• Word clustering
  – multiple words, “same” meaning
• Collocations (this lecture)
  – multiple words together, different meaning than the sum of its parts
  – Simple measures on text, yielding interesting insights into language, meaning, culture.

Today’s Main Points
• What is collocation?
• Why do people care?
• Three ways of finding them automatically.

Collocations
• An expression consisting of two or more words that corresponds to some conventional way of saying things.
• Characterized by limited compositionality.
  – compositional: meaning of expression can be predicted by meaning of its parts.
  – “strong tea”, “rich in calcium”
  – “weapons of mass destruction”
  – “kick the bucket”, “hear it through the grapevine”

Collocations important for…
• Terminology extraction
  – Finding special phrases in technical domains
• Natural language generation
  – To make natural output
• Computational lexicography
  – To automatically identify phrases to be listed in a dictionary
• Parsing
  – To give preference to parses with natural collocations
• Study of social phenomena
  – Like the reinforcement of cultural stereotypes through language (Stubbs 1996)

Contextual Theory of Meaning
• In contrast with “structural linguistics”, which emphasizes abstractions, properties of sentences
• Contextual Theory of Meaning emphasizes the importance of context
  – context of the social setting (not idealized speaker)
  – context of discourse (not sentence in isolation)
  – context of surrounding words
Firth: “a word is characterized by the company it keeps”
• Example [Halliday]
  – “strong tea”, coffee, cigarettes
  – “powerful drugs”, heroin, cocaine
  – Important for idiomatically correct English, but also social implications of language use

Method #1 Frequency

  Count   Bigram
  80871   of the
  58841   in the
  26430   to the
  21842   on the
  21839   for the
  18568   and the
  16121   that the
  15630   at the
  15494   to be
  13899   in a
  13689   of a
  13361   by the
  13183   with the
  12622   from the
  11428   New York
  10007   he said

Method #1 Frequency with POS Filter
Tag patterns: AN, NN, AAN, ANN, NAN, NNN, NPN

  Count   Bigram            Tags
  11487   New York          A N
   7261   United States     A N
   5412   Los Angeles       A N
   3301   last year         N N
   3191   Saudi Arabia      N N
   2699   last week         A N
   2514   vice president    A N
   2378   Persian Gulf      A N
   2161   San Francisco     N N
   2106   President Bush    N N
   2001   Middle East       A N
   1942   Saddam Hussein    N N
   1867   Soviet Union      A N
   1850   White House       A N
   1633   United Nations    A N
   1328   oil prices        N N
   1210   next year         A N
   1074   chief executive   A N
   1073   real estate       A N

Method #2 Mean and Variance
• Some collocations are not of adjacent words, but words in more flexible distance relationship
  – she knocked on his door
  – they knocked at the door
  – 100 women knocked on Donaldson’s door
  – a man knocked on the metal front door
• Not a constant distance relationship
• But enough evidence that “knock” is better than “hit”, “punch”, etc.

Method #2 Mean and Variance
Sentence: Stocks crash as rescue plan teeters.
Time-shifted bigrams:

  Offset 1       Offset 2       Offset 3
  stocks crash   stocks as      stocks rescue
  crash as       crash rescue   crash plan
  as rescue      as plan        as teeters
  ...

• To ask about relationship between “stocks” and “crash”, gather many such pairs, and calculate the mean and variance of their offset.
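
The two methods above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function names (tokenize, shifted_bigrams, adjacent_bigram_counts, offset_stats), the whitespace tokenizer, and the window size of 5 are all assumptions made for the example. The POS filter is left out because it needs a tagger; in the tag patterns, A, N and P are read here as adjective, noun and preposition. The variance is the sample variance (dividing by n - 1).

```python
from collections import Counter
from statistics import mean, variance


def tokenize(sentence):
    """Lowercase, split on whitespace, strip surrounding punctuation (assumed tokenizer)."""
    words = (w.strip(".,;:!?\"'()") for w in sentence.lower().split())
    return [w for w in words if w]


def shifted_bigrams(tokens, max_offset=3):
    """Yield (first, second, offset) for every ordered pair at most max_offset apart."""
    for i, first in enumerate(tokens):
        for offset in range(1, max_offset + 1):
            if i + offset < len(tokens):
                yield first, tokens[i + offset], offset


def adjacent_bigram_counts(sentences):
    """Method #1 without the POS filter: raw counts of adjacent word pairs."""
    counts = Counter()
    for sentence in sentences:
        for first, second, _ in shifted_bigrams(tokenize(sentence), max_offset=1):
            counts[(first, second)] += 1
    return counts


def offset_stats(sentences, first, second, max_offset=5):
    """Method #2: mean and sample variance of the offset from `first` to `second`.

    Returns None when fewer than two co-occurrences fall inside the window.
    """
    offsets = [
        offset
        for sentence in sentences
        for a, b, offset in shifted_bigrams(tokenize(sentence), max_offset)
        if a == first and b == second
    ]
    if len(offsets) < 2:
        return None
    return mean(offsets), variance(offsets)


if __name__ == "__main__":
    knock_sentences = [
        "she knocked on his door",
        "they knocked at the door",
        "100 women knocked on Donaldson’s door",
        "a man knocked on the metal front door",
    ]
    m, v = offset_stats(knock_sentences, "knocked", "door")
    # Offsets are 3, 3, 3, 5: mean offset 3.50, sample variance 1.00
    print(f"mean offset {m:.2f}, sample variance {v:.2f}")
```

A low variance suggests the two words tend to appear at a fixed distance from each other, while a high variance suggests a looser relationship. On a real corpus the counts and offsets would be gathered over many more sentences than the four slide examples used here.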

