<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1006">
  <Title>Violeta.Seretan@latl.unige.ch</Title>
  <Section position="5" start_page="40" end_page="42" type="metho">
    <SectionTitle>
3 Overview of Extraction Work
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="40" end_page="41" type="sub_section">
      <SectionTitle>
3.1 English
</SectionTitle>
      <Paragraph position="0"> As one might expect, the bulk of the collocation extraction work concerns the English language:</Paragraph>
      <Paragraph position="2"> jacent words) only, by simply computing the co-occurrence frequency. Justeson and Katz (1995) apply a POS filter on the pairs they extract. As in (Kjellmer, 1994), the AM they use is the simple frequency.</Paragraph>
      <Paragraph position="3"> Smadja (1993) employs the z-score in conjunction with several heuristics (e.g., the systematic occurrence of two lexical items at the same distance in text) and extracts predicative collocations, rigid noun phrases and phrasal templates. He then uses a parser in order to validate the results. The parsing is shown to lead to an increase in accuracy from 40% to 80%.</Paragraph>
      <Paragraph position="4"> (Church et al., 1989) and (Church and Hanks, 1990) use POS information and a parser to extract verb-object pairs, which they then rank according to the mutual information (MI) measure they introduce.</Paragraph>
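The MI ranking described above can be illustrated with a minimal sketch. It assumes the usual pointwise formulation, log2 of P(x, y) divided by P(x) times P(y), estimated from raw pair counts; the function name `pointwise_mi` and the toy verb-object pairs are hypothetical and are not taken from the cited work.

```python
import math
from collections import Counter


def pointwise_mi(pairs):
    """Score each (verb, object) pair with log2(P(x, y) / (P(x) * P(y)))."""
    pair_counts = Counter(pairs)
    left_counts = Counter(left for left, _ in pairs)
    right_counts = Counter(right for _, right in pairs)
    total = len(pairs)
    scores = {}
    for (left, right), joint in pair_counts.items():
        p_joint = joint / total
        p_left = left_counts[left] / total
        p_right = right_counts[right] / total
        scores[(left, right)] = math.log2(p_joint / (p_left * p_right))
    return scores


# Toy data (assumed): verb-object pairs as a parser might deliver them.
toy_pairs = [
    ("drink", "tea"), ("drink", "tea"), ("drink", "water"),
    ("have", "tea"), ("have", "time"), ("have", "time"),
]
mi_scores = pointwise_mi(toy_pairs)
ranked = sorted(mi_scores, key=mi_scores.get, reverse=True)
```

In this toy sample the pair that occurs less often than chance would predict, ("have", "tea"), receives a negative score and ends up last in `ranked`.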
    </Section>
    <Section position="2" start_page="41" end_page="41" type="sub_section">
      <SectionTitle>
Lin's (1998) is also a hybrid approach that relies
</SectionTitle>
      <Paragraph position="0"> on a dependency parser. The candidates extracted are then ranked with MI.</Paragraph>
    </Section>
    <Section position="3" start_page="41" end_page="41" type="sub_section">
      <SectionTitle>
3.2 German
</SectionTitle>
      <Paragraph position="0"> German is the second most investigated language, thanks to the early work of Breidt (1993) and, more recently, to that of Krenn and Evert, such as (Krenn and Evert, 2001; Evert and Krenn, 2001; Evert, 2004), centered on evaluation.</Paragraph>
      <Paragraph position="1"> Breidt uses MI and t-score and compares the accuracy of the results when various parameters vary, such as the window size, presence vs. absence of lemmatization, corpus size, and presence vs.</Paragraph>
      <Paragraph position="2"> absence of POS and syntactic information. She focuses on N-V pairs (see footnote 2) and, despite the lack of syntactic analysis tools at the time, by simulating parsing she comes to the conclusion that &quot;Very high precision rates, which are an indispensable requirement for lexical acquisition, can only realistically be envisaged for German with parsed corpora&quot; (Breidt, 1993, 82).</Paragraph>
      <Paragraph position="3"> Later, Krenn and Evert (2001) used a German chunker to extract syntactic pairs such as P-N-V.</Paragraph>
      <Paragraph position="4"> Their work laid the basis for formal and systematic methods in collocation extraction evaluation. Zinsmeister and Heid (2003; 2004) focused on N-V and A-N-V combinations identified using a stochastic parser. They applied machine learning techniques in combination with the log-likelihood measure (henceforth LL) for distinguishing trivial compounds from lexicalized ones.</Paragraph>
      <Paragraph position="5"> Finally, Wermter and Hahn (2004) identified PP-V combinations using a POS tagger and a chunker. They based their method on a linguistic criterion (that of limited modifiability) and compared their results with those obtained using the t-score and LL tests.</Paragraph>
      <Paragraph position="6"> Footnote 2: The following abbreviations are used in this paper: N - noun, V - verb, A - adjective, Adv - adverb, Det - determiner, Conj - conjunction, P - preposition.</Paragraph>
    </Section>
    <Section position="4" start_page="41" end_page="41" type="sub_section">
      <SectionTitle>
3.3 French
</SectionTitle>
      <Paragraph position="0"> Thanks to the outstanding work of Gross on lexicon-grammar (1984), French is one of the most studied languages in terms of distributional and transformational potential of words. This work was carried out before the computer era and the advent of corpus linguistics, while automatic extraction was later performed, for instance,</Paragraph>
      <Paragraph position="2"> Daille (1994) aimed at extracting compound nouns, defined a priori by means of certain syntactic patterns, like N-A, N-N, N-à-N, N-de-N, N P Det N. She used a lemmatizer and a POS tagger before applying a series of AMs, which she then evaluated against a domain-specific terminology dictionary and against a gold standard manually created from the extraction corpus.</Paragraph>
      <Paragraph position="3"> Similarly, Bourigault (1992) extracted noun phrases from shallow-parsed text, and Goldman et al. (2001) extracted syntactic collocations by using a full parser and applying the LL test.</Paragraph>
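For readers unfamiliar with the LL test mentioned in this overview, the following is a minimal sketch of the log-likelihood ratio statistic computed on a 2x2 contingency table of co-occurrence counts. The function name and the cell naming (k11 to k22) are assumptions made for illustration; this is not the implementation used by any of the cited systems.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G2 statistic for a 2x2 contingency table of co-occurrence counts.

    k11: pairs (w1, w2); k12: (w1, not w2); k21: (not w1, w2); k22: neither.
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            obs = observed[i][j]
            if obs > 0:  # an empty cell contributes nothing to the sum
                expected = rows[i] * cols[j] / total
                g2 += obs * math.log(obs / expected)
    return 2.0 * g2


# Toy tables (assumed counts): independent words vs. perfectly associated ones.
independent = log_likelihood_ratio(10, 10, 10, 10)
associated = log_likelihood_ratio(10, 0, 0, 10)
```

A table whose observed counts equal the counts expected under independence scores zero, and the score grows as the two words co-occur more often than chance predicts.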
    </Section>
    <Section position="5" start_page="41" end_page="41" type="sub_section">
      <SectionTitle>
3.4 Other Languages
</SectionTitle>
      <Paragraph position="0"> In addition to English, German and French, other ... an improved n-gram method.</Paragraph>
      <Paragraph position="1"> As for multilingual extraction via alignment (where collocations are first detected in one language and then matched with their translation in another language), most of the existing work concerns the English-French language pair, and the</Paragraph>
    </Section>
    <Section position="6" start_page="41" end_page="42" type="sub_section">
      <SectionTitle>
Hansard corpus of Canadian Parliament proceedings.
</SectionTitle>
      <Paragraph position="0"> Wu (1994) signals a number of problems that non-Indo-European languages pose for the existing alignment methods based on word and sentence length: in Chinese, for instance, most of the words are just one or two characters long, and there are no word delimiters. This result suggests that the portability of existing alignment methods to new language pairs is questionable.</Paragraph>
      <Paragraph position="1"> We are not concerned here with extraction via alignment. We assume, instead, that multilingual support in collocation extraction means the customization of the extraction procedure for each language. This topic will be addressed in the next sections.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="42" end_page="44" type="metho">
    <SectionTitle>
4 Multilingualism: Why and How?
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="42" end_page="42" type="sub_section">
      <SectionTitle>
4.1 Some Issues
</SectionTitle>
      <Paragraph position="0"> As the previous section showed, many systems of collocation extraction rely on the linguistic preprocessing of source corpora in order to support the candidate identification process. Language-specific information, such as the one derived from morphological and syntactic analysis, was shown to be highly beneficial for extraction. Moreover, the possibility to apply the association measures on syntactically homogeneous material is argued to benefit extraction, as the performance of association measures might vary with the syntactic configurations because of the differences in distribution (Krenn and Evert, 2001). The lexical distribution is therefore a relevant issue from the perspective of multilingual collocation extraction. Different languages show different proportions of lexical categories (N, V, A, Adv, P, etc.) which are evenly distributed across syntactic types (see footnote 3). Depending on the frequency numbers, a given AM could be more suited for a specific syntactic configuration in one language, and less suited for the same configuration in another.</Paragraph>
      <Paragraph position="1"> Ideally, each language should be assigned a suitable set of AMs to be applied on syntactically homogeneous data. Another issue that is relevant in the multilingualism perspective is that of the syntactic configurations characterizing collocations. Several such relations (e.g., noun-adjectival modifier, predicate-argument) are likely to remain constant across languages, i.e., to be judged as collocationally interesting in many languages. However, other configurations could be language-specific (like P-N-V in German, whose English equivalent is V-P-N). Yet other configurations might have no counterpart at all in another language (e.g., the French P-A pair à neuf is translated into English as a Conj-A pair, as new). Footnote 3: For instance, V-P pairs are more represented in English than in other languages (as phrasal verbs or verb-particle constructions).</Paragraph>
      <Paragraph position="2"> Finding all the collocationally relevant syntactic types for a language is therefore another problem that has to be solved in multilingual extraction. Since defining these types a priori based on intuition does not ensure the necessary coverage, an alternative proposal is to induce them from POS data and dependency relations, as in (Seretan, 2005).</Paragraph>
      <Paragraph position="3"> The morphosyntactic differences between languages also have to be taken into account. With English as the most investigated language, several hypotheses were put forth in extraction and became commonplace. For instance, using a 5-word window as search space for collocation pairs is a usual practice, since this span length was shown sufficient to cover a high percentage of syntactic co-occurrences in English. But, as suggested by other researchers, e.g., (Goldman et al., 2001), this assumption does not necessarily hold for other languages.</Paragraph>
      <Paragraph position="4"> Similarly, the higher inflection and the higher transformation potential shown by some languages pose additional problems in extraction, which were rather ignored for English. As Kim et al. (1999) notice, collocation extraction is particularly difficult in free-order languages like Korean, where arguments scramble freely. Breidt (1993) also pointed out a couple of problems that make extraction for German more difficult than for English: the strong inflection for verbs, the variable word order, and the positional ambiguity of the arguments. She shows that even distinguishing subjects from objects is very difficult without parsing.</Paragraph>
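The window-based search space discussed in this section can be sketched as follows. This is a minimal illustration under the assumption that a window of n words pairs each token with the n - 1 tokens that follow it; the function name `window_pairs` and the example sentence are hypothetical.

```python
from collections import Counter


def window_pairs(tokens, window=5):
    """Count ordered word pairs whose positions lie within one window."""
    counts = Counter()
    for i, left in enumerate(tokens):
        # the window spans `window` tokens, so look at most window - 1 ahead
        for right in tokens[i + 1:i + window]:
            counts[(left, right)] += 1
    return counts


# Toy sentence (assumed) in which the object precedes its verb at a distance.
sentence = "the policy that the council has pursued for years".split()
near = window_pairs(sentence, window=5)
wide = window_pairs(sentence, window=9)
```

In this toy sentence the object policy and its verb pursued are five positions apart, so the pair is missing from `near` and only appears once the window is widened. This is the kind of long-distance dependency for which a fixed window tuned on English may be too narrow.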
    </Section>
    <Section position="2" start_page="42" end_page="44" type="sub_section">
      <SectionTitle>
4.2 A Strategy for Multilingual Extraction
</SectionTitle>
      <Paragraph position="0"> Summing up the previous discussion, the customization of collocation extraction for a given language needs to take into account: - the syntactic configurations characterizing collocations, - the lexical distribution over syntactic configurations, - the adequacy of AMs to these configurations. These are language-specific parameters which need to be set in a successful multilingual extraction procedure. Truly multilingual systems have not been developed yet, but we suggest the following strategy for building such a system: A. parse the source corpus, extract all the syntactic pairs (e.g., head-modifier, predicate-argument) and rank them with a given AM, B. analyze the results and find the syntactic configurations characterizing collocations, C. evaluate the adequacy of AMs for ranking collocations in each syntactic configuration, and find the most convenient mapping configurations - AMs.</Paragraph>
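The outcome of steps A to C, a mapping from syntactic configurations to AMs, can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy candidates, and the use of relative frequency as a stand-in AM. The paper does not prescribe this implementation.

```python
from collections import Counter, defaultdict


def rank_by_configuration(candidates, measures, default_measure):
    """Rank pairs inside each syntactic configuration with its own AM."""
    groups = defaultdict(Counter)
    for config, pair in candidates:
        groups[config][pair] += 1
    ranked = {}
    for config, counts in groups.items():
        # per-configuration AM, falling back to a default when none is mapped
        measure = measures.get(config, default_measure)
        total = sum(counts.values())
        ranked[config] = sorted(
            counts, key=lambda pair: measure(counts[pair], total), reverse=True
        )
    return ranked


def relative_frequency(count, total):
    # placeholder AM: share of the configuration's tokens taken by the pair
    return count / total


# Toy candidates (assumed): (configuration, lexical pair) tuples from a parser.
toy_candidates = [
    ("V-O", ("pursue", "policy")), ("V-O", ("pursue", "policy")),
    ("V-O", ("draft", "policy")), ("A-N", ("common", "policy")),
]
ranking = rank_by_configuration(toy_candidates, {}, relative_frequency)
```

Passing a non-empty `measures` dictionary, for example one keyed by "V-O", would let each language assign a different AM to each configuration, which is the customization the strategy calls for.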
      <Paragraph position="1"> Once customized for a language, the extraction procedure involves: Stage 1. parsing the source corpus for extracting the lexical pairs in the relevant, language-specific syntactic configurations found in step B; Stage 2. ranking the pairs from each syntactic ... dependency parsers or on chunking.</Paragraph>
      <Paragraph position="2"> It is based on a symbolic parser that was developed over the last decade (Wehrli, 2004) and achieves a high level of performance, in terms of accuracy, speed and robustness. The languages it supports are, for the time being, French, English, Italian, Spanish and German. A few other languages are also being implemented in the framework of a multilingualism project.</Paragraph>
      <Paragraph position="3"> as a two-stage process (where, in stage 1, collocation candidates are identified in the text corpora, and in stage 2, they are ranked according to a given AM, cf. section 4.2), the role of the parser is to support the first stage. A pair of lexical items is selected as a candidate only if there exists a syntactic relation holding between the two items. Unlike the traditional, window-based methods, candidate selection is based on syntactic proximity (as opposed to textual proximity). Another peculiarity of our system is that candidate pairs are identified as the parsing goes on; in other approaches, they are extracted by post-processing the output of syntactic tools.</Paragraph>
      <Paragraph position="4"> The candidate pairs identified are classified into syntactically homogeneous sets, according to the syntactic relations holding between the two items.</Paragraph>
      <Paragraph position="5"> Only certain predefined syntactic relations are kept, which were judged as collocationally relevant after multiple experiments of extraction and data analysis (e.g., adjective-noun, verb-object, subject-verb, noun-noun, verb-preposition-noun). The sets obtained are then ranked using the log-likelihood ratios test (Dunning, 1993). More details about the system and its perfor- Posso pertanto accogliere in parte e in linea di principio gli emendamenti nn. 43-46 e l'emendamento n. 85.</Paragraph>
      <Paragraph position="6"> 1.c) reforzar cooperación (Es): Queremos permitir a los países que lo deseen reforzar, en un contexto unitario, su cooperación en cierto número de sectores.</Paragraph>
      <Paragraph position="7"> The collocation extractor is part of a bigger system (Seretan et al., 2004) that integrates a concordancer and a sentence aligner, and that supports the visualization, the manual validation and the management of a multilingual terminology database. The validated collocations are used for populating the lexicon of the parser and that of a</Paragraph>
      <Paragraph position="9"> A collocation extraction experiment concerning four different languages (English, Spanish, French, Italian) has been conducted on a parallel subcorpus of 42 files from the European Parliament proceedings. Several statistics and extraction results are reported in Table 1.</Paragraph>
      <Paragraph position="10"> We computed the distribution of pair tokens according to the syntactic type and noted that the most marked distributional difference among</Paragraph>
      <Paragraph position="12"> Unsurprisingly, the Romance languages are less different in terms of syntactic co-occurrence distribution, and the deviation of English from the Romance mean is more pronounced, in particular for N-A (9.72), V-P (5.63), A-N (5.25), N-P-N ... We performed a contrastive analysis of results, by carrying out a case study aimed at checking the LL performance variability across languages.</Paragraph>
      <Paragraph position="13"> The study concerned the verb-object collocations having the noun policy as the direct object. We specifically focused on the best-scored collocation extracted from the French corpus, namely mener une politique (lit., conduct a policy).</Paragraph>
      <Paragraph position="14"> We looked at the translation equivalents of its 74 instances identified by our extraction system in the corpus. The analysis revealed that, at least in this particular case, the verbal collocates of this noun are highly scattered: pursue, implement, conduct, adopt, apply, develop, have, draft, launch, run, carry out for English; practicar, llevar a cabo, desarrollar, realizar, aplicar, seguir, hacer, adoptar, ejercer for Spanish; condurre, attuare, portare avanti, perseguire, pratticare, adottare, fare for Italian (among several others). Some of the collocates (those listed first) are more prominently used. But generally they are highly dispersed, and this might indicate a bigger difficulty for LL to pinpoint the best collocate in one language vs. another.</Paragraph>
      <Paragraph position="15"> with our intuition that the lower-scored pairs observed manifest less collocational strength. It happens to be situated around the LL value of 20 for each language (and is of course specific to the size of our corpus and to the number of V-O tokens identified therein).</Paragraph>
      <Paragraph position="16"> If we consider the LL rank as the success measure for collocate detection, we can infer that the collocates of the word under investigation are easier to find in French, as compared to English, Italian or Spanish, because the value in the first row of the last column is smaller. This holds if we are interested in only one (the most salient) collocate for a word.</Paragraph>
      <Paragraph position="17"> If we measure the success of retrieving all the collocates (by considering, for instance, the speed to access them in the results list; the higher the rank, the better), then French can again be considered the easiest because overall, the positions in the V-O list are higher (i.e., the mean of the rank column is smaller) with respect to Spanish, Italian and, respectively, English.</Paragraph>
      <Paragraph position="18"> This latter result corresponds, approximately, to the order given by the relative proportion of V-O pairs in each language (Spanish 15.12%, French 15.14%, Italian 17.06%, and English 20.82%).</Paragraph>
      <Paragraph position="19"> Given that in English V-O pairs are more numerous and the verbs also participate in V-P constructions, it might seem reasonable to expect lower LL scores for V-O collocations in English vs. the other 3 languages.</Paragraph>
      <Paragraph position="20"> In general, we expect a correlation between extraction difficulty and the distributional properties of co-occurrence types.</Paragraph>
    </Section>
  </Section>
</Paper>