<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1006">
<Title>Violeta.Seretan@latl.unige.ch</Title>
<Section position="4" start_page="0" end_page="40" type="intro">
<SectionTitle> 2 Motivation </SectionTitle>
<Paragraph position="0"> The key importance of collocations in text production tasks such as machine translation and natural language generation has been stressed many times. It has also been shown that collocations are useful in a range of other applications, such as word sense disambiguation (Brown et al., 1991) and parsing (Alshawi and Carter, 1994).</Paragraph>
<Paragraph position="1"> The NLP community has fully acknowledged the need for an appropriate treatment of multi-word expressions in general (Sag et al., 2002). Collocations are particularly important because of their prevalence in language, regardless of the domain or genre. According to Jackendoff (1997, 156) and Mel'čuk (1998, 24), collocations constitute the bulk of a language's lexicon.</Paragraph>
<Paragraph position="2"> The last decades have witnessed a considerable development of collocation extraction techniques, which concern both monolingual and (parallel) multilingual corpora. We can mention here only part of this work: (Berry-Rogghe, 1973; Church et al., 1989; [...]sumoto, 1996; Melamed, 1997) for bilingual extraction via alignment. Traditionally, collocation extraction was considered a language-independent task. Since collocations are recurrent, typical lexical combinations, a wide range of statistical methods based on word co-occurrence frequency have been heavily used for detecting them in text corpora.
Among the most often used types of lexical association measures (henceforth AMs) we mention: statistical hypothesis tests (e.g., binomial, Poisson, Fisher, z-score, chi-squared, t-score, and log-likelihood ratio tests), which measure the significance of the association between two words based on a contingency table listing their joint and marginal frequencies, and information-theoretic measures (Mutual Information, henceforth MI, and its variants), which measure the quantity of 'information' shared by two random variables. A detailed review of the statistical methods employed in collocation extraction can be found, for instance, in (Evert, 2004). A comprehensive list of AMs is given in (Pecina, 2005).</Paragraph>
<Paragraph position="3"> Very often, in addition to the information on co-occurrence frequency, language-specific information is also integrated in a collocation extraction system (as will be seen in section 3):
- morphological information, in order to count inflected word forms as instances of the same base form. For instance, ask questions, asks question, asked question are all instances of the same word pair, ask - question;
- syntactic information, in order to recognize a word pair even if it is subject to (complex) syntactic transformations: ask multiple questions, question asked, questions that one might ask.</Paragraph>
<Paragraph position="4"> The language-specific modules thus aim at coping with the problem of morphosyntactic variation, in order to improve the accuracy of frequency information. This becomes especially important for free word order and highly inflected languages, for which the token (form)-based frequency figures become too skewed due to the high lexical dispersion. Not only does data scattering modify the frequency counts used by AMs, it also alters the performance of AMs when the probabilities in the contingency table become very low.</Paragraph>
<Paragraph position="5"> Morphosyntactic information has in fact been shown to significantly improve the extraction results (Breidt, 1993; Smadja, 1993; Zajac et al., 2003).
Morphological tools such as lemmatizers and POS taggers are commonly used in extraction systems; they are employed both for dealing with text variation and for validating the candidate pairs: combinations of function words are typically ruled out (Justeson and Katz, 1995), as are the ungrammatical combinations in the sys-</Paragraph>
<Paragraph position="7"> Given the motivations for performing a linguistically informed extraction, which were also put forth, among others, by Church and Hanks (1990, 25), Smadja (1993, 151) and Heid (1994), and given the recent development of linguistic analysis tools, it seems plausible that linguistic structure will be increasingly taken into account by collocation extraction systems.</Paragraph>
<Paragraph position="8"> The rest of the paper is organized as follows. In section 3 we provide a language-oriented review of the existing collocation extraction work. Then we highlight, in section 4, a series of problems that arise in the transfer of methodology to a new language, and we propose a strategy for dealing with them. Section 5 describes an extraction system, and, finally, section 6 presents a case study on the collocations extracted for four languages, illustrating the cross-lingual variation in the performance of a particular AM.</Paragraph>
</Section>
</Paper>