<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2408">
  <Title>Heiki-Jaan.Kaalep@ut.ee</Title>
  <Section position="3" start_page="0" end_page="59" type="metho">
    <SectionTitle>
2 Types of verbal multi-word expressions in Estonian
</SectionTitle>
    <Paragraph position="0"> A VMWE consists of a verb and 1) a particle or 2) a nominal phrase (usually, but not always, consisting of one noun) in more or less frozen inflectional form, or 3) a non-finite form of a verb. This last combination - verb plus a non-finite verb - remains outside the scope of this paper.</Paragraph>
    <Paragraph position="1">  The first combination results in a particle verb. The particle can express location or direction (1), perfectivity (2) etc.</Paragraph>
    <Paragraph position="2">  'What to do now?' The combinations of a verb and a nominal phrase can be divided into three groups depending on how the components form the meaning of the expression: 1) idiomatic expressions; 2) support verb constructions; 3) collocations.</Paragraph>
    <Paragraph position="3"> Idiomatic expressions are usually defined as word combinations whose meaning is not the sum or combination of the meanings of their parts. It is useful to distinguish between opaque idioms (e.g. English kick the bucket) and transparent idioms (e.g. English pull strings), as they allow different degrees of internal variability.</Paragraph>
    <Paragraph position="4"> Support verb constructions, sometimes also called light verb constructions, are combinations of a verb and its object or, rarely, some other argument, where the nominal component denotes an action of some kind and the verb is semantically empty in this context, e.g. English make a speech, take a walk. Collocations are the fuzziest category. They can be described as VMWEs that do not fit into the previous categories but, for some reason, have often been included in dictionaries or are statistically significant combinations of a verb and its argument(s) in the corpus.</Paragraph>
    <Paragraph position="5"> In all three groups the non-verbal component is a nominal phrase (not a particle); it can formally be either the object of the verb as in (4), or some other argument as in (5).</Paragraph>
    <Section position="1" start_page="57" end_page="57" type="sub_section">
      <SectionTitle>
3.1 Database of VMWEs
</SectionTitle>
      <Paragraph position="0"> Prior to the corpus tagging experiment, a database of Estonian VMWEs (DB) with 12,200 entries had been compiled, with the aim of creating a comprehensive resource of VMWEs. First, it contained VMWEs from six human-created dictionaries: the Explanatory Dictionary of Estonian (EKSS, 1988-2000), Index of the Thesaurus of Estonian (Saareste, 1979), a list of particle verbs (Hasselblatt, 1990), Dictionary of Phrases (Oim, 1991), Dictionary of Synonyms (Oim, 1993) and the Filosoft thesaurus (http://www.filosoft.ee/thes_et/). In addition, the database had been enriched with VMWEs that were extracted semi-automatically from corpora totaling 20 million words and were missing from all of the aforementioned human-made dictionaries. This collocation extraction experiment is described in (Kaalep, Muischnek 2003).</Paragraph>
    </Section>
    <Section position="2" start_page="57" end_page="59" type="sub_section">
      <SectionTitle>
3.2 Corpus
</SectionTitle>
      <Paragraph position="0"> We have a corpus in which all VMWEs have been tagged by hand. Table 1 shows the composition of the corpus and the number of VMWE instances, compared with the number of sentences and simplex verb instances.</Paragraph>
      <Paragraph position="1">  The fiction texts are 2000-word excerpts from Estonian authors from the 1980s. The press files represent various Estonian newspapers (nation-wide and local, dailies and weeklies, quality and tabloid press) from 1995-1999.</Paragraph>
      <Paragraph position="2"> Popular science comes from the journal "Horisont", from 1996-2003.</Paragraph>
      <Paragraph position="3"> Before the VMWEs were tagged, the corpus had been morphologically analyzed and manually disambiguated (Kaalep, Muischnek 2005). This made it possible to pre-process the text automatically by tagging candidate VMWEs according to which VMWEs were present in the DB. It was then the task of a human annotator to select the right VMWEs, and occasionally to tag new VMWEs that were missing from the database and had therefore not been tagged automatically. The tagged version was checked by another person in order to minimize accidental mistakes.</Paragraph>
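The pre-processing step described above can be illustrated with a minimal sketch. It assumes that the DB is a set of (non-verbal component, verb lemma) pairs and that each sentence is available as a list of disambiguated lemmas; the function name, the mini-DB and the hand-written lemmas (given without diacritics, as elsewhere in this text) are hypothetical and are not the authors' actual preprocessor.

```python
# Hypothetical mini-DB: each entry is (non-verbal component, verb lemma).
DB = {("ule", "vaatama"), ("soda", "pidama"), ("jalga", "laskma")}


def tag_candidates(lemmas, db):
    """Return the DB entries whose components all occur among the lemmas."""
    present = set(lemmas)
    return sorted(entry for entry in db if set(entry).issubset(present))


# Hand-written lemmas for 'Ta vaatas ule aia ja nagi oma naabrit.'
sentence = ["tema", "vaatama", "ule", "aed", "ja", "nagema", "oma", "naaber"]
candidates = tag_candidates(sentence, DB)
print(candidates)
```

In this sentence ule is a preposition rather than a particle, so the single candidate returned is a false hit of the kind the human annotator was expected to reject.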
      <Paragraph position="4"> Table 1 shows that the amount and proportion of VMWEs depend on the text class. Table 2 serves to compare the lexicon of VMWEs based on the corpus with the entries of the DB (the VMWEs from the corpus have been converted to the base form they have in the DB).
Table 2. VMWEs in the DB and corpus.
Row  Description              Count
A    DB entries               12200
B    A, found in the corpus    2300
C    hapax legomena of B       1200
D    new VMWEs                 1100
E    hapax legomena of D        900
</Paragraph>
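The proportions implied by Table 2 can be worked out with a short sketch. The counts are those of Table 2; the variable names are illustrative, and treating B + D as the size of the corpus lexicon is our reading of the table, not a figure stated in the text.

```python
# Counts taken from Table 2.
table2 = {"A": 12200, "B": 2300, "C": 1200, "D": 1100, "E": 900}


def share(part, whole):
    """Percentage of `whole` made up by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


# Share of DB entries that occur in the corpus (row B against row A).
db_coverage = share(table2["B"], table2["A"])
# Share of hapax legomena among DB entries found in the corpus (C against B).
hapax_known = share(table2["C"], table2["B"])
# Share of hapax legomena among the new VMWEs (E against D).
hapax_new = share(table2["E"], table2["D"])
# Share of new VMWEs in the corpus lexicon, assuming the lexicon is B + D.
new_share = share(table2["D"], table2["B"] + table2["D"])

print(db_coverage, hapax_known, hapax_new, new_share)
```

On these counts, roughly 19% of the DB entries occur in the corpus, about half of those occur only once, and about four fifths of the new VMWEs are hapax legomena.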
      <Paragraph position="5"> First, from rows A, B and D we see that the intersection of the DB and the corpus lexicon is surprisingly small.</Paragraph>
      <Paragraph position="6"> The small proportion of VMWEs of the DB that can be found in real texts (compare row B with row A) may be explained, first, by the small size of the corpus. The second reason is that the human-oriented dictionaries that were used when building the DB implicitly aimed at showing the phraseological richness of the language and thus contained a lot of idiomatic expressions that are well known to be rare in real-life texts.</Paragraph>
      <Paragraph position="7"> The fact that so many VMWEs were missing from the DB was a surprise (compare row D with row A), because, as mentioned earlier, the DB had been enriched with VMWEs from real texts in order to be comprehensive. At the moment, it is not clear what the exact reason is.</Paragraph>
      <Paragraph position="8"> The number of hapax legomena among the new VMWEs also deserves some explanation (compare rows B and C versus D and E).</Paragraph>
      <Paragraph position="9"> In the literature, one may find a number of experiments on extracting MWUs or collocations from a corpus which show that the extraction method yields many items missing from the available pre-compiled lexicons. Some of the items may be false hits, but the authors (whose aim has been to present good extraction methods) tend to claim that a large number of them should be added to the lexicon.</Paragraph>
      <Paragraph position="10"> Evert (2005) lists a number of authors who have found that lexical resources (machine-readable or paper dictionaries, including terminological resources) are not suitable for serving as a gold standard for the set of MWUs (for a given language or domain). According to Evert (2005), manual annotation of MWUs in a corpus would be more trustworthy if one wants to compare the findings of a human (the gold standard) with those of a collocation extraction algorithm.</Paragraph>
      <Paragraph position="11"> In lexicography, we may find a slightly conflicting view: not everything found in real texts deserves to be included in a dictionary. Producing a text is a creative process, sometimes resulting in ad hoc neologisms and MWUs that are never picked up and re-used after the final full stop of the text they were born in.</Paragraph>
      <Paragraph position="12"> Unfortunately these two conflicting views mean that there is no general, simple solution for the problem of finding a gold standard for automatic treatment (extraction or tagging) of MWUs. It is normal that there is a discrepancy between a stand-alone lexicon and the vocabulary of a text.</Paragraph>
      <Paragraph position="13"> We believe that, in our case, the surprisingly high proportion of hapax legomena in the set of newly found VMWEs manifests this normal discrepancy between a precompiled lexicon and a text corpus.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="59" end_page="59" type="metho">
    <SectionTitle>
4 Behavior of the VMWEs in the corpus and the problems of their automatic analysis
</SectionTitle>
    <Section position="1" start_page="59" end_page="59" type="sub_section">
      <SectionTitle>
4.1 Particle verbs
</SectionTitle>
      <Paragraph position="0"> There are two main problems encountered in the automatic identification of the particle verbs.</Paragraph>
      <Paragraph position="1"> First, as shown in (6-7), the order of the components may vary, and the verb and the particle need not be adjacent to each other, behaving much like particle verbs in German.</Paragraph>
      <Paragraph position="2"> This varying order and disjuncture of the components is actually characteristic for all the
over(particle) look then is everything ready
'Once you have looked over those papers, we will be done.'
The second main problem is that most of the particles are homonymous with pre- or postpositions (Estonian has both of them), creating a disambiguation problem similar to the one concerning the English word over in the following examples.</Paragraph>
      <Paragraph position="3">  (8) He looked over the papers in less than 10 minutes.</Paragraph>
      <Paragraph position="4"> (9) He looked over the fence  and saw his neighbor.</Paragraph>
      <Paragraph position="5"> Just as the English word-forms look and over form the phrasal verb look over in example (8) but do not belong together in the same way in example (9), the Estonian verb vaatama 'to look' and the adverb ule 'over' form a particle verb in examples (6) and (7), but not in the following example, where ule is a preposition:
(10) Ta vaatas ule aia ja nagi oma naabrit.</Paragraph>
      <Paragraph position="6"> s/he looked over fence-GEN and saw own neighbor-PART
'S/he looked over the fence and saw his/her neighbor.'
As pre- and postpositions have to be adjacent to the noun phrase that is the constituent of the adpositional phrase, they are usually easier to detect. In (11), however, the invariable word ule, which can function both as a particle and as a preposition, is positioned before the noun jou 'force' in the genitive, as if ule were a preposition in the prepositional phrase ule jou 'exceeding capabilities'. Actually, it functions as a particle in this clause, forming the particle verb laks ule 'went over'.</Paragraph>
      <Paragraph position="7"> (11) Meelitustelt laks ta ule jou kasutamisele.</Paragraph>
      <Paragraph position="8"> Flattery-PL-ABL went s/he over force-GEN utilization-ALL
'S/he switched from flattery to violence'</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="59" end_page="62" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> Many of these invariable words that can function either as particles or as pre- and postpositions are quite frequent in the texts. The most frequent simplex verbs are also the most frequent verbal components, forming various VMWEs. The sentences of the written language tend to consist of several clauses. All this results in sentences like (12), where the possible components of particle verbs are scattered across several clauses. In this sentence there are four possible candidate particle verbs: ule jaama 'to have no choice but, lit. remain over', ule tegema 'to redo, lit. do over', ara jaama 'be canceled, lit. remain away', ara tegema 'to accomplish, lit. do away'.
(12) Tal ei jaa muud ule, kui too ise ara teha.</Paragraph>
    <Paragraph position="1"> S/he-ALL not remain else over(particle) than work-GEN self away(particle) do-INF
'S/he has no choice but to accomplish the work by her/himself.'
Our preprocessor took only sentence boundaries into account, which resulted in serious overgeneration of possible particle verbs. After experimental tagging of clause boundaries in the texts, the precision of the preprocessor in tagging particle verbs improved from 40% to 74%.</Paragraph>
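The effect of restricting the search context to a clause can be sketched on example (12). The sketch assumes that the clause boundaries are already given (here the sentence is split by hand at the comma) and that the lemmas are available; the function and variable names are hypothetical, and the toy figures are not meant to reproduce the 40% and 74% reported above.

```python
# The four candidate particle verbs listed for example (12).
DB = {("ule", "jaama"), ("ule", "tegema"), ("ara", "jaama"), ("ara", "tegema")}


def candidates(lemmas, db):
    """Return the DB entries whose components all occur among the lemmas."""
    present = set(lemmas)
    return {entry for entry in db if set(entry).issubset(present)}


# Hand-written lemmas for (12), split into its two clauses at the comma.
clauses = [
    ["tema", "ei", "jaama", "muu", "ule"],
    ["kui", "too", "ise", "ara", "tegema"],
]
sentence = [lemma for clause in clauses for lemma in clause]

# Context = whole sentence: every particle pairs with every verb.
by_sentence = candidates(sentence, DB)
# Context = single clause: only clause-internal pairs survive.
by_clause = set().union(*(candidates(clause, DB) for clause in clauses))

print(len(by_sentence), sorted(by_clause))
```

With the sentence as context all four candidates are proposed; with the clause as context only ule jaama and ara tegema remain, which are the two particle verbs reflected in the translation of (12).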
    <Paragraph position="2"> For other types of VMWEs, clause boundary detection is less essential. The nominal components of opaque idioms are not so frequent. Some transparent idioms, all support verb constructions and collocations can stretch across clause boundaries, as in (13).</Paragraph>
    <Paragraph position="3"> (13) Kone, mille president pidas, on mojutanud meie valispoliitikat.</Paragraph>
    <Paragraph position="4"> Speech that-GEN president held is influenced our foreign-policy-PART 'The speech held by the president has influenced our foreign policy.'</Paragraph>
    <Section position="1" start_page="60" end_page="60" type="sub_section">
      <SectionTitle>
4.2 VMWEs consisting of a verb and a nominal component
</SectionTitle>
      <Paragraph position="0"> In section 2 we differentiated between three types of VMWEs consisting of a verb and a nominal component, namely idioms, support verb constructions and collocations. All these constructions show considerable variability in the manually annotated corpus. Unlike in English, there are no special restrictions on the morphological or syntactic behavior of the verb that is part of an idiom. A VP-idiom, for example the opaque idiom jalga laskma 'to run off, lit. to shoot the foot', combines freely with all the morphological categories relevant for the verb, including person, number, tense, mood, non-finite forms and (impersonal) passive. (The latter differs from the English passive: it can be formed from all verbs that can have a human agent.) The other types of VMWEs - support verb constructions and collocations - also have no restrictions with respect to verbal inflection.</Paragraph>
      <Paragraph position="1"> In this section we will concentrate on the variability of the nominal components of VMWEs - their case and number alternations as registered in the corpus. The case alternation is relevant only for the nominal components that are syntactically in the object position. Our interest in case alternation is motivated by the observation that multiword units generally and cross-linguistically tend to be frozen in form.</Paragraph>
      <Paragraph position="2"> The less variability there is in form, the easier the computational treatment is. We may also draw an analogy between simplex words and multiword units as items in a lexicon. For an inflectional language, every word has an inflectional paradigm, and words with similar paradigms form an inflectional type or class. Variability of VMWEs can be analyzed from the same viewpoint.</Paragraph>
      <Paragraph position="3"> Of these three types of VMWEs, the variation of idioms has received the most attention in the literature. Idioms have been regarded as units that cannot be given a compositional analysis (e.g. Katz 1973 among others). This view was later opposed (e.g. Nunberg et al. 1994).</Paragraph>
      <Paragraph position="4"> Riehemann (2001) has pointed out that English idioms show considerable variability in text corpora.</Paragraph>
      <Paragraph position="5"> Describing the automatic treatment of multiword expressions in Basque, Alegria et al. (2004) show that the support verb constructions in Basque can have significant morphosyntactic variability, including modification of the noun and case alternation. A similar phenomenon (number and case alternation) in Turkish is described in (Oflazer et al. 2004).</Paragraph>
      <Paragraph position="6"> In the following subsections we will briefly describe the phenomenon of the case alternation of the object in Estonian and then discuss the variation of the nominal component of idioms and support verb constructions. Then we will describe the number alternations of the nominal components.</Paragraph>
    </Section>
    <Section position="2" start_page="60" end_page="61" type="sub_section">
      <SectionTitle>
4.3 The case alternation of the object in
Estonian
</SectionTitle>
      <Paragraph position="0"> A VMWE often consists of a verb and a noun phrase that is syntactically its object. A few words should be said about the case alternation of the object in Estonian in general (cf. also Erelt 2003: 96-97). Three case forms are possible for the object: in the singular the object can be in the nominative, genitive or partitive; in the plural it can be in the nominative or partitive. Often the nominative and genitive forms are grouped together under the label 'total object'.</Paragraph>
      <Paragraph position="1"> Partitive is the unmarked form of the object. The partial object, as it is often called, alternates with the total object only in the affirmative clause. In the negative clause only the partial object can be used. In the affirmative clause the total object is used only if it denotes a definite quantity (is quantitatively bounded) and the clause expresses perfective activity. So, in Estonian, the case alternation of the object is used to express the aspect of the clause - the total object can be used if the action described in the clause is perfective:
In idioms and support verb constructions the nominal component is only formally or syntactically the object of the verb; semantically it is a part of the predicate. So it would not be surprising if such objects did not undergo the case alternations characteristic of the object and were frozen in the partitive as the unmarked case for the object. Indeed, that is true for the opaque idioms. But for transparent idioms and support verb constructions this is not the case: our corpus data show that their nominal components can alternate between the forms of total and partial object.</Paragraph>
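The general rule for the object case stated above can be written down as a small decision function. This is only a sketch: the function name and the three boolean parameters are our own, the text gives boundedness and perfectivity as necessary conditions for the total object (the sketch treats them as sufficient for simplicity), and the choice between nominative and genitive in the singular is not covered here, so both are returned.

```python
def object_cases(number, negative, bounded, perfective):
    """Return the case forms available to a regular object.

    number      -- "singular" or "plural"
    negative    -- True if the clause is negative
    bounded     -- True if the object denotes a definite quantity
    perfective  -- True if the clause expresses perfective activity
    """
    # Negative clauses, and clauses that are unbounded or imperfective,
    # allow only the partial object (partitive).
    if negative or not (bounded and perfective):
        return ["partitive"]
    # Total object: nominative or genitive in the singular,
    # nominative in the plural.
    if number == "singular":
        return ["nominative", "genitive"]
    return ["nominative"]


print(object_cases("singular", negative=False, bounded=True, perfective=True))
print(object_cases("plural", negative=True, bounded=True, perfective=True))
```

An opaque idiom would bypass such a function entirely, since its nominal component stays frozen in the form recorded in the DB.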
      <Paragraph position="2"> About 25% of the transparent idioms in our corpus have their nominal components in the case of the total object:
(16) Esinemisele pani punkti ilutulestik.</Paragraph>
      <Paragraph position="3"> Show-ALL put full-stop-GEN firework
'The fireworks put an end to the show.'
In example (16) the transparent idiom with the nominal component in the form of the total object was used to describe a perfective action. But the transparent idioms do not form a homogeneous group with respect to the case alternation of the nominal component. Some of them behave like regular verb-object combinations; others show irregular variation; and the nominal components of many of them are frozen in the partitive case.</Paragraph>
      <Paragraph position="4"> In support verb constructions the case alternation of the object is regularly used to express the aspect of the clause, although the noun denoting an action is non-referential.</Paragraph>
      <Paragraph position="5"> Some support verb constructions are generally used to refer to the imperfective aspect, to emphasize the process of the action (atelic action), not its result. Such expressions are e.g. tood tegema 'to work, lit. do work-PART' or soda pidama 'fight a war, lit. hold a war-PART'. But when the nominal component is modified by an appropriate attribute, it can also be in the case of the total object, and the support verb construction as a whole then refers to a perfective event:</Paragraph>
      <Paragraph position="6"> (19) X ja Y pidasid viimase omavahelise soja 17. sajandil.</Paragraph>
      <Paragraph position="7"> X and Y held last-GEN mutual-GEN war-GEN 17. century-ADE</Paragraph>
      <Paragraph position="8"> 'X and Y fought the last war in the 17th century.'</Paragraph>
    </Section>
    <Section position="3" start_page="61" end_page="62" type="sub_section">
      <SectionTitle>
4.4 Number alternations of the nominal components of VMWEs
</SectionTitle>
      <Paragraph position="0"> The nominal component of an opaque idiom in the corpus was always in the same number (singular or plural) as its base form in the DB. For the transparent idioms, the picture was clearly different. Although the nominal component of many transparent idioms does not alternate between singular and plural, there are exceptions, and 14% of the nominal components in the object position and 4% in some other position were in the plural.</Paragraph>
      <Paragraph position="1"> Support verb constructions, in turn, make extensive use of the number alternation of the nominal component, where the plural form of the noun denoting an action can really refer to several events, as in (20).</Paragraph>
    </Section>
    <Section position="4" start_page="62" end_page="62" type="sub_section">
      <SectionTitle>
4.5 The conclusions for the automatic analysis of VMWEs
</SectionTitle>
      <Paragraph position="0"> The conclusions of the corpus findings for the automatic detection of the VMWEs are the following: 1) Because of the free word order, the automatic detection of particle verbs in a text should use a single clause as the possible context for co-occurrences. Using the whole sentence as the possible context would create too much noise, so the detection of clause boundaries is a must.</Paragraph>
      <Paragraph position="1"> 2) We can treat opaque idioms much like the particle verbs - multi-word units consisting of an inflecting verb and a frozen nominal component that don't cross the clause boundaries.</Paragraph>
      <Paragraph position="2"> 3) Transparent idioms in the database have to be divided into those that allow their nominal component to appear in the cases of the total object and those whose nominal component is always in the partitive. But can the annotator rely on her/his intuition while making such decisions? Rather not, but carrying out corpus research separately on each item is a time-consuming task. A better solution for the transparent idioms could be to generate all the case forms possible for the object, as the nouns that are part of the idioms are not as frequent as the non-inflecting words that may be particles as well as pre- and postpositions.</Paragraph>
      <Paragraph position="3"> 4) The nominal component of any support verb construction can, under certain circumstances, be in the form of the total object. The nouns denoting action in support verb constructions can also be pluralized. So the best solution for them is to generate all forms of the object, both in singular and in plural, in the database.</Paragraph>
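Conclusions 2) to 4) can be summarized as a sketch of which inflectional slots of the nominal component would be generated for a DB entry. The type labels, the function name and the default slot are hypothetical. The sketch applies only to nominal components in the object position and returns (number, case) slots rather than word forms, which would have to come from a morphological generator.

```python
# Case forms possible for the object, as described in section 4.3.
OBJECT_CASES = {
    "singular": ["nominative", "genitive", "partitive"],
    "plural": ["nominative", "partitive"],
}


def slots_to_generate(vmwe_type, db_number="singular", db_case="partitive"):
    """Return the (number, case) slots to generate for a DB entry."""
    if vmwe_type == "opaque_idiom":
        # Frozen: keep only the form recorded in the DB.
        return [(db_number, db_case)]
    if vmwe_type == "transparent_idiom":
        # All object cases, in the number recorded in the DB.
        return [(db_number, case) for case in OBJECT_CASES[db_number]]
    if vmwe_type == "support_verb":
        # All object cases, both in singular and in plural.
        return [(number, case)
                for number, cases in OBJECT_CASES.items()
                for case in cases]
    raise ValueError("unknown VMWE type: " + vmwe_type)


print(slots_to_generate("opaque_idiom"))
print(slots_to_generate("transparent_idiom"))
print(slots_to_generate("support_verb"))
```

Keeping the number of a transparent idiom fixed follows conclusion 3), which mentions only case forms; since section 4.4 reports some plural uses of transparent idioms, that restriction is a simplification of the sketch.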
    </Section>
  </Section>
</Paper>