<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2406">
  <Title>Collocation Extraction: Needs, Feeds and Results of an Extraction System for German</Title>
  <Section position="3" start_page="0" end_page="42" type="metho">
    <SectionTitle>
2 Collocation Extraction Tools: Requirements
</SectionTitle>
    <Paragraph position="0"> The development of a collocation extraction tool depends on the following conditions:
1. the properties of the targeted language
2. the targeted application
3. the kinds of collocations to be extracted
4. the degree of detail
Whereas issues 1 to 3 deal with the collocation itself, issue 4 focuses on the collocation in context, i.e. its behaviour (from the point of view of syntagmatic analysis) or, respectively, its use (from a generation perspective).</Paragraph>
    <Section position="1" start_page="41" end_page="41" type="sub_section">
      <SectionTitle>
2.1 Language factors
</SectionTitle>
      <Paragraph position="0"> One of the most important factors is, of course, the targeted language and its main characteristics with respect to word formation and word order. Depending on word and constituent order, the pros and cons of positional vs. relational extraction patterns need to be considered. Positional patterns (based on adjacency or a 'window') are adequate for configurational languages, but in languages with rather free word order, words belonging to a phrase or collocation do not necessarily occur within a predefined span (see footnote 1).</Paragraph>
      <Paragraph position="1"> Extracting word combinations using relational patterns (represented by part of speech (PoS) tags or dependency rules) offers a higher level of abstraction and improves the results (cf. (Krenn, 2000b; Smadja, 1993)). However, this requires part of speech tagging and possibly partial parsing.</Paragraph>
      <Paragraph position="2"> A system extracting word combinations by applying relational patterns obviously profits from language-specific knowledge about phrase and sentence structure and word formation. One example is the order of adjective + noun pairs: in English and German, the adjective occurs left of the noun, whereas in French, the adjective can occur left or right of the noun. Another example is compounding, which is handled differently in different languages: noun + noun in English, typically separated by white space (e.g. death penalty) vs. noun + prepositional phrase in French (e.g. peine de mort) vs. compound noun in German (e.g. Todesstrafe).</Paragraph>
      <Paragraph position="3"> Consequently, language-specific word formation rules need to be considered when designing extraction patterns. For languages with a rich inflectional morphology, where the individual word forms are rather rare, frequency counts and the results of statistical analyses are not very reliable. To allow a grouping of words sharing the same lemma, lemmatisation is crucial.</Paragraph>
      <Paragraph position="4"> Footnote 1: In German, e.g., in usual verb-second constructions with a full verb in the left sentence bracket (topological field theory, see (Wöllstein-Leisten et al., 1997)), particles of particle verbs appear in the right sentence bracket. The middle field (containing arguments and possibly adjuncts of the verb) is of undetermined length.</Paragraph>
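The point about lemmatisation made in this subsection can be illustrated with a minimal Python sketch. The word forms and the lemma lookup table are invented for illustration only; a real system would obtain lemmas from a tagger or a morphological analyser rather than from a hand-written dictionary.

```python
from collections import Counter

# Hypothetical lookup table from inflected forms to lemmas (illustration only).
LEMMAS = {
    "stellt": "stellen",
    "stellte": "stellen",
    "gestellt": "stellen",
    "stellen": "stellen",
    "Antrag": "Antrag",
    "Anträge": "Antrag",
}


def count_pairs(pairs, lemmatise=True):
    """Count noun + verb pairs, optionally mapping each form to its lemma."""
    counts = Counter()
    for noun, verb in pairs:
        if lemmatise:
            # Unknown forms are kept unchanged.
            noun = LEMMAS.get(noun, noun)
            verb = LEMMAS.get(verb, verb)
        counts[(noun, verb)] += 1
    return counts


observed = [
    ("Antrag", "stellt"),
    ("Anträge", "gestellt"),
    ("Antrag", "stellte"),
    ("Anträge", "stellen"),
]

form_counts = count_pairs(observed, lemmatise=False)
lemma_counts = count_pairs(observed, lemmatise=True)

print(form_counts)   # four different surface pairs, each seen once
print(lemma_counts)  # one lemma pair carrying all four occurrences
```

Counted by surface form, each pair occurs only once; counted by lemma, the four occurrences fall into a single group, which is what makes frequency-based statistics usable.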
    </Section>
    <Section position="2" start_page="41" end_page="41" type="sub_section">
      <SectionTitle>
2.2 Application factors
</SectionTitle>
      <Paragraph position="0"> Other important factors are the targeted application (i.e. analysis vs. generation) and, to some extent resulting from it, conditions 3 and 4 above.</Paragraph>
      <Paragraph position="1"> Depending on the purpose of the tool (or lexicon, respectively), the collocation definition chosen as an outline may vary, e.g. including transparent and regular collocations (cf. (Tutin, 2004)) for generation purposes, but excluding them for analysis purposes. In addition, a more detailed description of the use of collocations in context (e.g. information about preferences with respect to the determiner, etc.) is needed for generation purposes than for text analysis.</Paragraph>
    </Section>
    <Section position="3" start_page="41" end_page="41" type="sub_section">
      <SectionTitle>
2.3 Factors of collocation definition
</SectionTitle>
      <Paragraph position="0"> Collocations can be distinguished on two levels: the formal level and the content level. On the formal level, a collocation can be classified according to the structural relation between its elements. Typical patterns are shown in table 1 (see footnote 2; taken from (Heid and Gouws, 2005)).</Paragraph>
      <Paragraph position="1"> On the content level, there are regular, transparent, and opaque collocations (according to (Tutin, 2004)) and, taking definition (b) into account, idioms as well. However, as a classification at the content level needs a detailed semantic description, we see no means of accomplishing this goal other than manually at the moment.</Paragraph>
    </Section>
    <Section position="4" start_page="41" end_page="42" type="sub_section">
      <SectionTitle>
2.4 Contextual factors
</SectionTitle>
      <Paragraph position="0"> (Hausmann, 2003; Heid and Gouws, 2005; Evert et al., 2004) argue that collocations have strong preferences with respect to their morphosyntax (see examples (1) and (2)) and may be combined (see example (3)). The collocation in example (1) ('to charge somebody') is restricted with respect to the determiner of the base (null determiner), whereas the same base shows a strong preference for a (definite or indefinite) determiner when used with a different collocate (example (2), 'to drop a lawsuit'). Example (3) shows that two collocations sharing the same base can form a collocational sequence (example taken from (Heid and Gouws, 2005)).
(1) ∅ Anklage erheben
(2) die/eine Anklage fallenlassen
(3) Kritik üben + scharfe Kritik → scharfe Kritik üben
For both natural language generation systems and lexicography, such information is highly relevant. Therefore, the extraction of contextual information (called 'context parameters' in the following) should be integrated into the collocation extraction process.
Footnote 2: Abbreviations in table 1: advl - adverbial; prd - predicative; subj - subject; obj - object; pobj - prepositional object; dat - dative case; gen - genitive case; quant - quantifying.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="42" end_page="45" type="metho">
    <SectionTitle>
3 Extracting noun + verb collocations from German
</SectionTitle>
    <Paragraph position="0"> The standard architecture for collocation extraction systems contains three stages (cf. (Krenn, 2000)): a more or less detailed linguistic analysis of the corpus text (preprocessing), an extraction step, and a statistical filtering of the extracted word combinations. We follow this architecture (see figure 1). However, our hypothesis differs from other approaches: collocations are often restricted with respect to their morphosyntax, and we test to what extent they can be identified via these restrictions.</Paragraph>
    <Section position="1" start_page="42" end_page="42" type="sub_section">
      <SectionTitle>
3.1 Approach
</SectionTitle>
      <Paragraph position="0"> In an experiment, we extracted relational word combinations (verb + subject/object pairs) from German newspaper texts.</Paragraph>
      <Paragraph position="1"> The syntactic patterns for the extraction of these combinations concentrate on verb-final constructions, as in example (4), and on verb-second constructions with a modal verb in the left sentence bracket according to topological field theory (see (Wöllstein-Leisten et al., 1997)), as in example (5). The reason is that, in these constructions, the particle forms one word with the verb (see example (6)), as opposed to usual verb-second constructions (see example (7)). Thus, we need not recombine verb + particle groups that appear separately.</Paragraph>
      <Paragraph position="2"> (4) ... wenn Wien einen Antrag auf Vollmitgliedschaft stellt.
('if Vienna an application for full membership puts') (if Vienna applies for full membership)</Paragraph>
      <Paragraph position="3"> (5) ... kann Wien einen Antrag auf Vollmitgliedschaft stellen.
('might Vienna an application for full membership put.') (Vienna might apply for full membership.)</Paragraph>
      <Paragraph position="4"> (6) ..., daß er ein Schild aufstellt.
('that he a sign upputs') (that he puts up a sign)
(7) Er stellt ein Schild auf.
('He puts a sign up.') (He puts up a sign.)</Paragraph>
      <Paragraph position="5"> As data, we used a collection of 300 million words from German newspaper texts dating from 1987 to 1993. The corpus is tokenized and PoS-tagged by the Treetagger (Schmid, 1994), then chunk-annotated by YAC (Kermes, 2003). The chunker YAC determines phrase boundaries and heads, and disambiguates agreement information as far as possible. It is based on the corpus query language cqp (Christ et al., 1999; see footnote 3), which can in turn be used to query the chunk annotations.</Paragraph>
    </Section>
    <Section position="2" start_page="42" end_page="44" type="sub_section">
      <SectionTitle>
Data Extraction
</SectionTitle>
      <Paragraph position="0"> The syntactic patterns used to extract verb + subject/object combinations are based on PoS tags and chunk information. These patterns are represented using cqp macros (see figure 2). The cqp syntax largely overlaps with regular expressions.</Paragraph>
      <Paragraph position="2"> macro and the number of its parameters. In line (3), a word PoS tagged KOUS (subordinating conjunction) or VMFIN (finite modal verb) is requested, followed by an arbitrary number ('*') of words without any restrictions (line (4)). Line (5) indicates the start of a nominal phrase (np), line (11) its end. The elements within this np (one or more words, as indicated by '+') must not be part of a prepositional phrase (pp), to avoid the extraction of pp + verb (line (6), see example (8)). In addition, the np must be neither a named entity (ne, see line (7)) nor a pronoun (pron, line (8)) nor an np of measure (meas, line (9), see example (9)), nor must its head be a cardinal number (card, line (10), see example (10)). An arbitrary number of words may follow the np (punctuation marks (PoS tagged $.), subordinating conjunctions and finite modal verbs excluded). At least one verb is required (line (13); all PoS tags for verbs start with a capital 'V', see footnote 4). Line (14) indicates the end of the subclause or sentence.</Paragraph>
      <Paragraph position="3"> (8) ... kann [zur Verfügung]_PP gestellt werden.
(9) ... weil davon jährlich [3,5 Tonnen]_NP-meas eingeführt werden.</Paragraph>
      <Paragraph position="4"> (10) ... obwohl er [1989]_NP-card noch dort arbeitete.</Paragraph>
      <Paragraph position="5"> By applying the macro to the corpus, all sequences of words matching the pattern are extracted. From these sequences, the following information is made explicit (cf. (Heid and Ritz, 2005)):</Paragraph>
      <Paragraph position="6">
• determination of the noun (definite, indefinite, null, demonstrative, quantifier)
• modification of the noun (adjective, cardinal number, genitive np, compound noun etc.)
• negation (yes/no)
• auxiliaries and modal verbs
• original phrase from the corpus
For each instance found, the lemmas of noun and verb along with all the context parameters mentioned above are stored as feature-value pairs in a relational database. The database can be queried via SQL. See figure 3 for a sample query asking for distinct lemma pairs, ordered by frequency (in descending order), and figures 4 and 5 for more specific queries and some of their results.
Footnote 4: [...] the finiteness and the role of the verb (auxiliary, modal or full verb). Thus, line (13) matches verbal complexes. It also covers cases where full verbs are accidentally PoS tagged as modal or auxiliary verbs.</Paragraph>
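A query for distinct lemma pairs ordered by descending frequency, as described in this subsection, might look roughly like the following Python sketch using the standard sqlite3 module. The table name, the column names and the sample rows are assumptions made for illustration; the paper does not give the actual database schema, and the original query (figure 3) is not reproduced here.

```python
import sqlite3

# In-memory database with a hypothetical schema: one row per extracted instance.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instances ("
    "noun_lemma TEXT, verb_lemma TEXT, determiner TEXT, negation INTEGER)"
)

# Invented sample instances (not corpus data).
rows = [
    ("Antrag", "stellen", "indefinite", 0),
    ("Antrag", "stellen", "indefinite", 0),
    ("Antrag", "stellen", "definite", 0),
    ("Anklage", "erheben", "null", 0),
    ("Anklage", "erheben", "null", 0),
    ("Schild", "aufstellen", "indefinite", 0),
]
conn.executemany("INSERT INTO instances VALUES (?, ?, ?, ?)", rows)

# Distinct lemma pairs with their frequency, most frequent first.
query = (
    "SELECT noun_lemma, verb_lemma, COUNT(*) AS freq "
    "FROM instances "
    "GROUP BY noun_lemma, verb_lemma "
    "ORDER BY freq DESC, noun_lemma ASC"
)
pairs = conn.execute(query).fetchall()
for noun, verb, freq in pairs:
    print(noun, verb, freq)
```

Storing one row per instance keeps the context parameters available for more specific queries, for example restricting the count to negated instances or to a particular determiner value.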
    </Section>
    <Section position="3" start_page="44" end_page="44" type="sub_section">
      <SectionTitle>
Filtering
</SectionTitle>
      <Paragraph position="0"> The instances extracted in the previous step are grouped according to noun and verb lemmas, i.e. instances of the same lemma pair form one group.</Paragraph>
      <Paragraph position="1"> Within these groups, a relative frequency distribution is computed for each of the features. So that they can be queried, the results of this postprocessing are also stored in the database, as shown in figure 1.</Paragraph>
      <Paragraph position="2"> A word combination is chosen as a collocation candidate if a preference (specified by a threshold of e.g. 60% of the occurrences) for a certain feature value (singular / plural, presence / absence of a determiner, definite / indefinite / demonstrative / possessive / quantifying determiner, presence of modifying elements) is discovered.</Paragraph>
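The filtering step described in this subsection can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature names, the sample instances and the decision to treat a share equal to the threshold as a preference are all assumptions.

```python
from collections import Counter, defaultdict

THRESHOLD = 0.6  # example threshold from the text (60% of the occurrences)


def preferences(instances, threshold=THRESHOLD):
    """Per lemma pair, return the feature values whose relative frequency
    within the group reaches the threshold."""
    groups = defaultdict(list)
    for inst in instances:
        groups[(inst["noun"], inst["verb"])].append(inst["features"])

    result = {}
    for pair, feature_dicts in groups.items():
        total = len(feature_dicts)
        preferred = {}
        for feature in sorted({f for d in feature_dicts for f in d}):
            # Frequency distribution of this feature within the group.
            dist = Counter(d.get(feature) for d in feature_dicts)
            value, count = dist.most_common(1)[0]
            if count / total >= threshold:
                preferred[feature] = (value, count / total)
        if preferred:
            result[pair] = preferred
    return result


def inst(noun, verb, determiner, number):
    """Build one hypothetical extracted instance."""
    return {"noun": noun, "verb": verb,
            "features": {"determiner": determiner, "number": number}}


# Invented sample data (not corpus counts).
sample = (
    [inst("Anklage", "erheben", "null", "singular")] * 4
    + [inst("Anklage", "erheben", "definite", "singular")]
    + [inst("Antrag", "stellen", "indefinite", "singular")] * 2
    + [inst("Antrag", "stellen", "definite", "plural")]
    + [inst("Antrag", "stellen", "null", "plural")]
)

candidates = preferences(sample)
print(candidates)
```

In this toy data, only the first lemma pair shows a preference (null determiner in 80% of its instances, singular in all of them), so only that pair would be kept as a collocation candidate; the second pair has no feature value above the threshold.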
    </Section>
    <Section position="4" start_page="44" end_page="45" type="sub_section">
      <SectionTitle>
3.2 Results
</SectionTitle>
      <Paragraph position="0"> From 300 million words, we extracted more than 1.3 million noun + verb combinations, the instances of 726,488 different lemma pairs. 10,934 of these lemma pairs appeared with a minimum frequency of 10.</Paragraph>
      <Paragraph position="2"> Sample results are shown in figure 6 (see footnote 5).</Paragraph>
      <Paragraph position="3"> We evaluated collocation candidates with a frequency of at least 100. Within the 323 most frequent collocation candidates, we found 213 collocations (including 11 idioms). This corresponds to a precision of 66% (see table 2 and footnote 6). As a comparison, a window-based study was carried out on the same (PoS-tagged) data. In this study, the window was defined in such a way that up to two tokens (excluding sentence boundaries and finite full verbs) were allowed to appear between a noun (PoS tagged NN) and a finite full verb (PoS tagged VVFIN).</Paragraph>
      <Paragraph position="4"> Log-likelihood (see footnote 7) was used as an association measure. The precision of this approach is 41% (see footnote 8).</Paragraph>
      <Paragraph position="5">  For choosing collocation candidates, a threshold of 60% is used. However, additional preferences are displayed for values greater than 50%.</Paragraph>
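The window definition used for the comparison can be sketched in Python as follows. This is a simplified illustration under assumptions: it only pairs a noun with a finite full verb that follows it (the text does not state the direction), it treats the tag $. as the sentence boundary, and the tagged sentence is taken from example (4) with tags assigned by hand. The association measure is not modelled.

```python
def window_pairs(tagged, max_gap=2):
    """Pair each noun (NN) with the next finite full verb (VVFIN) that follows
    it with at most max_gap intervening tokens; the search stops at a sentence
    boundary ($.) or at the first VVFIN."""
    pairs = []
    for i, (word, tag) in enumerate(tagged):
        if tag != "NN":
            continue
        for j in range(i + 1, min(i + max_gap + 2, len(tagged))):
            next_word, next_tag = tagged[j]
            if next_tag == "$.":
                break
            if next_tag == "VVFIN":
                pairs.append((word, next_word))
                break
    return pairs


# Example (4), tagged by hand for illustration.
sentence = [
    ("wenn", "KOUS"), ("Wien", "NE"), ("einen", "ART"), ("Antrag", "NN"),
    ("auf", "APPR"), ("Vollmitgliedschaft", "NN"), ("stellt", "VVFIN"),
    (".", "$."),
]
pairs = window_pairs(sentence)
print(pairs)

# Precision of the relational approach as reported in the text: 213 of 323.
precision = 213 / 323
print(f"{precision:.0%}")
```

On this sentence the window yields both Antrag + stellt and Vollmitgliedschaft + stellt. The second pair comes from a noun inside a prepositional phrase, which is the kind of match that the chunk-based patterns exclude.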
      <Paragraph position="7"> tory for some of the results. First, there is the problem of semantic equivalence: does the combination express more than its elements (consider example (11))? Secondly, definitions (a) and (b) may judge the same example differently: Anteil nehmen (example (12)) is usually agreed to be a support verb construction, but the distinction of the noun Anteil as the base (making the main contribution to the meaning) is questionable. On the other hand, its unpredictable syntactic properties (e.g. null determiner) and semantics (partial loss of meaning of the collocate nehmen) make it clear that this combination has to be listed in a lexicon. For evaluation purposes, combinations judged to be collocations by either (or both) of the definitions were marked as correct matches. In cases like example (11), combinations were marked as correct matches if no alternative collocate existed for describing the denoted situation or event. [...] (without the corresponding preposition), have been treated as [...]</Paragraph>
      <Paragraph position="8">  (11) Chance + haben ('to have a/the chance') (12) Anteil + nehmen ('to commiserate')</Paragraph>
    </Section>
  </Section>
</Paper>