<?xml version="1.0" standalone="yes"?>
<Paper uid="W93-0309">
  <Title>Extraction of V-N-Collocations from Text Corpora: A Feasibility Study for German</Title>
  <Section position="2" start_page="0" end_page="74" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Collocations are important both for lexicography, where their coverage in modern dictionaries needs to improve, and for lexical acquisition in computational linguistics, where the goal is to build either large reusable lexical databases (LDBs) or specific lexica for specialized NLP applications. We have tested Mutual Information (MI), a statistical measure introduced to linguistics by Church and Hanks (1989), for the (semi-)automatic extraction of verb-noun (V-N) collocations from untagged German text corpora. We try to answer the question of how much can be done with an untagged corpus and what might be gained by lemmatizing, POS-tagging or even superficial parsing.</Paragraph>
    <Paragraph position="1"> Choueka (1988) describes how to automatically extract word combinations from English corpora as a preselection of collocation candidates to ease a lexicographer's search for collocations. He uses only quantitative selection criteria, not statistical ones; his main extraction criterion is frequency, with a lower threshold of at least one occurrence of the collocation in one million words. He mentions plans to define a "binding degree" expressing how strongly the words of a collocation attract each other, which would be similar in spirit to what is calculated with MI.</Paragraph>
    <Paragraph position="2"> The work described in Smadja and McKeown (1990) and Smadja (1991a,b) is along the same lines as ours, though it uses a different statistical calculation, a z-score, and tagged, lemmatized corpora. Some properties specific to German, however, lead to a type of problem that needs different treatment (section a.a). Calzolari and Bindi (1990) use MI to extract compounds, fixed expressions and collocations from an Italian corpus, but to our knowledge have not evaluated their results so far. (Footnote: My thanks go to the IdS, which made the two corpora available for research purposes, to Angelika Storrer for her steady encouragement and many fruitful discussions, and to Mats Rooth and Matthias Heyn, who introduced me to the corpora tools. I am also grateful to the anonymous reviewers for their helpful comments and constructive criticism.)</Paragraph>
  </Section>
</Paper>