<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-6008">
  <Title>Linguistically enriched corpora for establishing variation in support verb constructions</Title>
  <Section position="3" start_page="64" end_page="67" type="metho">
    <SectionTitle>
3 A corpus-based method to infer variation
</SectionTitle>
    <Paragraph position="0"> With access to automatically parsed data, subcategorization frames and a standard search query language such as dt search, we can extract all instances of an LVC that satisfy rather specific morphosyntactic features and head-complement dependencies; these requirements, expressed as dt search queries, are applied to XML-encoded syntactic dependency trees. For a more detailed description of the corpus-based method refer to (Villada Moirón, 2005).</Paragraph>
    <Section position="1" start_page="64" end_page="65" type="sub_section">
      <SectionTitle>
3.1 Corpus annotation
</SectionTitle>
      <Paragraph position="0"> A list of P N V triples was automatically acquired from a syntactically annotated corpus using collocation statistics and linguistic diagnostics (Villada Moirón, 2004). A P N V triple represents an abstraction of a support verb construction (LVC).</Paragraph>
      <Paragraph position="1"> For each automatically extracted triple, all sentences containing the three component lexemes found in the Twente Nieuws Corpus (TwNC) (Ordelman, 2002) were collected in a subcorpus. For example, for the expression uit zijn dak gaan 'go crazy', all sentences that include the preposition uit 'out', the noun dak 'roof' and the verb gaan 'go' or one of its inflectional variants are collected in a subcorpus.</Paragraph>
      <Paragraph position="2"> The Alpino parser (van der Beek et al., 2002) was used to annotate the subcorpora. This is a wide-coverage parser for Dutch. Based on a lexicalist constraint-based grammar framework (Head-Driven Phrase Structure Grammar) (Pollard and Sag, 1994), the Alpino grammar licenses a wide variety of syntactic constructions.</Paragraph>
      <Paragraph position="3"> All parsed data is stored as XML-dependency trees. To illustrate the annotation, the result of parsing example (2-b) is the dependency structure tree shown in figure 1.</Paragraph>
      <Paragraph position="4"> Among the information contained in the parsed trees, we use: (i) categorical information (phrasal categories), (ii) dependency information (grammatical function or dependency relation: subject su, direct object obj1, locative or directive complement ld, head hd, determiner det) and (iii) lexical information (lexemes and word forms). Dependency nodes are crucial in stating daughter-ancestor relations between constituents and sub-constituents in an LVC.</Paragraph>
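      <Paragraph position="5"> The subcorpus collection step described earlier in this section can be illustrated with a minimal sketch in Python. It is not the code used in this work: the function names, the toy sentences and the tiny lemma table (which stands in for a real lemmatizer) are all assumptions made for illustration. The sketch keeps every sentence whose lemmas include the preposition, the noun and the verb of a triple.
<![CDATA[
```python
# Hypothetical sketch: collect the sentences that contain all three
# component lexemes of a P N V triple (here: uit, dak, gaan).
# TOY_LEMMAS is an assumption; it stands in for a real Dutch lemmatizer.
TOY_LEMMAS = {
    "gaat": "gaan",
    "ging": "gaan",
    "gegaan": "gaan",
    "gaan": "gaan",
}


def lemmas(sentence):
    """Return the set of (toy) lemmas of a whitespace-tokenized sentence."""
    cleaned = sentence.lower().replace(".", " ").replace(",", " ")
    return {TOY_LEMMAS.get(token, token) for token in cleaned.split()}


def collect_subcorpus(sentences, triple):
    """Keep sentences whose lemma set contains preposition, noun and verb."""
    wanted = set(triple)
    return [s for s in sentences if wanted <= lemmas(s)]


# Invented example sentences, for illustration only.
SENTENCES = [
    "Het publiek ging uit zijn dak.",
    "Hij gaat naar huis.",
    "De kat zit op het dak.",
]

SUBCORPUS = collect_subcorpus(SENTENCES, ("uit", "dak", "gaan"))
```
]]>
      Only the first toy sentence contains all three lexemes, so it is the only one retained. Co-occurrence alone does not guarantee an LVC reading, which is why the collected sentences are parsed and queried afterwards.</Paragraph>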
    </Section>
    <Section position="2" start_page="65" end_page="65" type="sub_section">
      <SectionTitle>
3.2 Extraction
</SectionTitle>
      <Paragraph position="0"> dt search (Bouma and Kloosterman, 2002), a treebank query tool based on XPath, is used to extract evidence from the annotated subcorpora. A dt search query applied to the corresponding parsed subcorpus searches for all LVC instances. [Footnote 2: Nevertheless, other XML-based query tools are also freely available, e.g. XSLT or the TIGERSearch kit.]</Paragraph>
      <Paragraph position="1"> Two types of queries are needed: narrow search and wide search queries. Narrow search queries seek instances of a head-dependent relation between a VERB and a PP sibling, given the necessary lexical restrictions as input. Wide searches state that the PP is embedded (somewhere) under a clausal node whose head is VERB.</Paragraph>
      <Paragraph position="2"> Wide searches are needed because the parser may wrongly attach the sought PP to a preceding noun. (Thus, in the annotated data the PP and VERB do not satisfy a head-dependent relation.)</Paragraph>
      <Paragraph position="3"> Finally, the vaguest search states that a given PP needs to occur within the same sentence as the verb. This type of search is used in case the other two types fail to retrieve any evidence. The query in figure 2 seeks NP-internal adjectival modification.</Paragraph>
      <Paragraph position="4"> [Figure 2: Query seeking NP-internal adjectival modification in the expression iemand op gedachten brengen.]</Paragraph>
      <Paragraph position="5"> Among the constraints expressed in the search queries there are: parent-child relations between nodes, phrase category (@cat), dependency relation (@rel), word base form (@root) or surface form (@word). Queries need to capture deeply embedded LVCs. A verbal complement embedded under several modal or auxiliary verbs is rather common. To allow uncertainty about the location of the PP argument node with respect to its head verb, disjunctive constraints are introduced in the queries (figure 2).</Paragraph>
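      <Paragraph position="6"> The difference between narrow and wide searches can be illustrated with a minimal sketch in Python. It does not reproduce dt search or its query syntax: it uses the standard xml.etree.ElementTree module, and the two toy trees, the attribute values and the function names are assumptions made for illustration. Only the attribute names cat, rel, root and word follow the description above.
<![CDATA[
```python
import xml.etree.ElementTree as ET

# Toy tree 1 (hypothetical): the LVC is embedded under a modal verb.
EMBEDDED = ET.fromstring("""
<node cat="smain" rel="top">
  <node rel="su" root="hij" word="Hij"/>
  <node rel="hd" root="willen" word="wil"/>
  <node cat="inf" rel="vc">
    <node rel="hd" root="brengen" word="brengen"/>
    <node rel="obj1" root="haar" word="haar"/>
    <node cat="pp" rel="ld">
      <node rel="hd" root="op" word="op"/>
      <node cat="np" rel="obj1">
        <node rel="mod" root="ander" word="andere"/>
        <node rel="hd" root="gedachte" word="gedachten"/>
      </node>
    </node>
  </node>
</node>
""")

# Toy tree 2 (hypothetical): the PP is wrongly attached to a preceding noun.
MISATTACHED = ET.fromstring("""
<node cat="smain" rel="top">
  <node rel="su" root="hij" word="Hij"/>
  <node rel="hd" root="brengen" word="bracht"/>
  <node cat="np" rel="obj1">
    <node rel="det" root="de" word="de"/>
    <node rel="hd" root="minister" word="minister"/>
    <node cat="pp" rel="mod">
      <node rel="hd" root="op" word="op"/>
      <node cat="np" rel="obj1">
        <node rel="mod" root="ander" word="andere"/>
        <node rel="hd" root="gedachte" word="gedachten"/>
      </node>
    </node>
  </node>
</node>
""")


def head_root(node):
    """Return the base form (root) of the head daughter, or None."""
    head = node.find("node[@rel='hd']")
    return head.get("root") if head is not None else None


def is_target_pp(node, prep, noun):
    """True if node is a PP headed by prep whose object is headed by noun."""
    if node.get("cat") != "pp" or head_root(node) != prep:
        return False
    obj = node.find("node[@rel='obj1']")
    if obj is None:
        return False
    # The object is either a bare noun (leaf) or an NP with a head daughter.
    return (obj.get("root") or head_root(obj)) == noun


def narrow_search(tree, prep, noun, verb):
    """Nodes headed by verb that have the target PP as a direct daughter."""
    return [c for c in tree.iter("node")
            if head_root(c) == verb
            and any(is_target_pp(d, prep, noun) for d in c.findall("node"))]


def wide_search(tree, prep, noun, verb):
    """Nodes headed by verb with the target PP embedded anywhere below."""
    return [c for c in tree.iter("node")
            if head_root(c) == verb
            and any(is_target_pp(d, prep, noun)
                    for d in c.iter("node") if d is not c)]
```
]]>
      On the first toy tree both searches return the infinitival clause embedded under the modal. On the second, the narrow search returns nothing because the PP is a daughter of the object NP, whereas the wide search still finds the clause headed by the verb.</Paragraph>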
    </Section>
    <Section position="3" start_page="65" end_page="65" type="sub_section">
      <SectionTitle>
3.3 Retrieved corpus evidence
</SectionTitle>
      <Paragraph position="0"> A search query retrieves each LVC realization that satisfies the query requirements, as well as the LVC frequency in the subcorpora.</Paragraph>
      <Paragraph position="1"> Figure 3 gives an excerpt from the observed adjectival modification in iemand op gedachten brengen 'give s.o. the idea'. Op andere gedachten brengen 'change s.o.'s idea' is the most frequent realization with 634 out of a total of 682 occurrences. This suggests that the adjective andere is almost frozen in the expression.</Paragraph>
      <Paragraph position="2"> The method extracts evidence of morphological productivity, variation of specifiers and adjectival modification; the retrieved evidence is both positive and negative. A description of the positive evidence follows. [Figure 3: Excerpt of the observed adjectival modification in the LVC iemand op gedachten brengen.]</Paragraph>
      <Paragraph position="3"> We investigated 107 Dutch LVCs. Of these, 94 expressions require a PP argument, and some of them also show an open NPacc slot; lexical restrictions affect the verb and the PP argument. The remaining 13 expressions are made up of a (partially) lexicalized NP and a PP argument.</Paragraph>
      <Paragraph position="4"> LVCs fall into one of three groups: (a) totally fixed, (b) semi-fixed and (c) flexible. Fixed LVCs show no variation and no modification in the lexicalized NP (if present) and PP constituent(s); 42% of the LVCs studied are fixed. Semi-fixed LVCs show partially lexicalized constituent(s) (20.5% of the studied LVCs). Variation affects the lexeme's morphology and/or the specifier slot; rarely, a singular noun appears in the plural. Expressions whose lexicalized argument requires a reflexive are included in this group.</Paragraph>
      <Paragraph position="5"> Flexible LVCs allow adjectival modification (37.5% of the studied LVCs). The data is rather varied. There are LVCs that show (i) non-productive morphology and no specifier variation, but a limited number of adjectives, and (ii) specifier variation (some show compounding) and limited adjectival variation. Borderline cases exhibit no morphological productivity and either definite/possessive determiner alternation or no specifier variation; modification involves a unique adjective (e.g. in (verzekerde) bewaring stellen 'put into custody').</Paragraph>
      <Paragraph position="6"> Negative evidence (noise) typically includes sentences where the VERB and the PP occur within the same clause but are used literally rather than as an LVC. Often, the PP is an adjunct or a complement of another verb. This noise can be attributed to the uncertainty built into the search queries or to errors in the annotated data.</Paragraph>
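      <Paragraph position="7"> The frequency figures reported above for op andere gedachten brengen can be turned into relative frequencies with a few lines of Python. This is only an illustrative sketch: the names are hypothetical, and since the text does not say how the remaining 48 instances are distributed, they are lumped into a single placeholder entry.
<![CDATA[
```python
from collections import Counter

# Counts taken from the text: 634 of the 682 retrieved instances
# contain the adjective 'andere'. The distribution of the remaining
# instances is not stated, so they are grouped under one placeholder.
TOTAL = 682
COUNTS = Counter({"andere": 634, "(other realizations)": TOTAL - 634})


def relative_frequencies(counts):
    """Map each realization to its share of all retrieved instances."""
    total = sum(counts.values())
    return {form: n / total for form, n in counts.items()}


SHARES = relative_frequencies(COUNTS)
TOP_FORM, TOP_COUNT = COUNTS.most_common(1)[0]
print(f"{TOP_FORM}: {TOP_COUNT}/{TOTAL} = {SHARES[TOP_FORM]:.1%}")
# prints "andere: 634/682 = 93.0%"
```
]]>
      A share of about 93% for a single adjective is the kind of figure that supports calling the adjective almost frozen in the expression.</Paragraph>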
    </Section>
    <Section position="4" start_page="65" end_page="67" type="sub_section">
      <SectionTitle>
3.4 Discussion
</SectionTitle>
      <Paragraph position="0"> We argue that the corpus-based method is efficient in extracting the linguistic contexts where variation and internal modification are found inside LVCs. Examining the evidence retrieved by the corpus-based method, a researcher quickly forms an impression about which expressions are totally fixed and which expressions allow some variation and/or modification. One also has direct access to the realizations of the variable slots, the LVC frequency and relevant examples in the corpus. Next, we discuss some limitations posed by the corpus annotation, extraction procedure and the nature of the idiosyncratic data.</Paragraph>
      <Paragraph position="1"> Finding specific constructions in corpora of free word order languages such as Dutch is not trivial. Corpus annotation enriched with grammatical functions and/or dependency relations facilitates the search task.3 Thus, we are able to explore LVC occurrences in any syntactic structure (main or subordinate clauses, questions, etc.) without stating linear precedence constraints. Furthermore, in most sentences, the annotation correctly identifies the clause containing the LVC, thus granting access to all siblings of the head verb.</Paragraph>
      <Paragraph position="2"> In general, knowledge of the grammar and the lexicon used by the parser is helpful. In particular, it helps to know whether some LVCs or idiosyncratic phrases are already annotated in the lexicon as lexicalized phrases. If an LVC is described in the lexicon, the parser analyzes the expression either as an LVC or as a regular verb phrase. This uncertainty needs to be taken into account in the extraction queries.</Paragraph>
      <Paragraph position="3"> The corpus-based method requires information about the subcategorization requirements of the LVCs. This information was manually entered for each expression. Once we have a list of PREPOSITION NOUN VERB triples, methods described in the literature on automatic acquisition of subcategorization information might be successful in finding out the remaining LVC syntactic requirements. This is an open issue for future research, but a starting point would be the approach by (Briscoe and Carroll, 1997). [Footnote 3: Preliminary experiments were done on chunked data. A corpus-based method applied to phrasal chunks was impractical. A lot of noise needed to be manually discarded.]</Paragraph>
      <Paragraph position="4"> The success of the search queries is dependent on parsing accuracy. Sometimes extracted evidence shows the specific PP we seek but misanalyzed as a dependent of another verb. Parsing accuracy introduces another shortcoming: evidence of relative clauses and PP post-nominal modifiers cannot be automatically retrieved. Because of structural ambiguity, attachment decisions are still a hard parsing problem. This led us to ignore these two types of modification in our research.</Paragraph>
      <Paragraph position="5"> Some limitations due to the nature of the support verb constructions emerged. Specifier changes or insertion of modification may destroy the LVC reading. The queries could extract evidence that looks like a variant of the LVC base form although, in practice, the LVC interpretation does not apply. For example, in most of the instances of the expression de hand boven het hoofd houden 'to protect s.o.' (lit. the hand above the head hold), hoofd is preceded by the definite determiner; there are also a few instances with a reciprocal elkaars 'each other's' and some instances with possessive determiners. The query results suggest that all three specifiers are possible; however, the instances with possessive determiners are literal uses.</Paragraph>
      <Paragraph position="6"> Occasionally, a PREPOSITION NOUN VERB triple clusters homonymous expressions. A search that specifies the triple base form IN HAND HOUDEN could match any of the following: iets in één hand houden 'to be the boss', het heft in handen houden 'remain in control', de touwtjes in handen houden, iets in handen houden 'have control over sth' or iets in de handen houden 'to hold sth in one's hands (lit.)'. Access to the subcategorization requirements of the LVC use, which differ from those of the regular phrase (e.g. iemand van de straat houden 'keep s.o. off the street' vs. van de straat houden 'to love the street'), would solve some cases.</Paragraph>
      <Paragraph position="7"> The corpus-based method cannot be fully automated: the retrieved evidence of variation and modification needs to be manually inspected. At least one instance of each variation and modification type requires manual inspection, because the researcher needs to establish whether the LVC interpretation is present or only a literal reading applies. Yet all the tools we used facilitated this process, and they provide plenty of relevant empirical linguistic evidence.</Paragraph>
      <Paragraph position="8"> A last limitation affecting most corpus-based research is that having found no evidence of variation and modification does not mean that it is not possible in LVCs. Some LVCs are rare in the corpus; LVCs that exhibit variation and/or modification are even more infrequent. A larger corpus is desirable.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="67" end_page="68" type="metho">
    <SectionTitle>
4 Lexicon representation in Alpino
</SectionTitle>
    <Paragraph position="0"> The Alpino lexicon entries specify (if applicable) subcategorization frames enriched with dependency relations and some lexical restrictions. Support verb constructions and idiomatic expressions are treated similarly; neither type of expression constitutes a lexical entry on its own (cf. Breidt et al., 1996). We concentrate on the LVC annotation in the remainder.</Paragraph>
    <Paragraph position="1"> Support verb constructions are lexicalized combinations of a support verb. Main verbs exhibit the same form (lemma) as their related support verb. We distinguish between a main verb and a support verb by specifying the distributional context of the support verb. This context is captured as an extended subcategorization frame. An extended subcategorization frame consists of two parts: (a) a list of syntactic dependents and (b) the syntactic operations that the LVC (dis)allows. [Footnote 4: This working implementation assumes that the verb selects the dependents of the LVC, thus departing from other proposals (Abeillé, 1995) where the complement noun selects the support verb. Although the semantics layer is left out, this approach echoes lexicalist HPSG proposals such as (Krenn and Erbach, 1994; Sailer, 2000).]</Paragraph>
    <Paragraph position="2"> Among syntactic dependents, we include those lexemes and/or phrases necessary to derive the predicational content of the LVC. The syntactic dependents may be realized by three types of phrases: (i) fully lexicalized, (ii) partially lexicalized and (iii) variable argument slots. Next, the description of the phrase types is supported with expressions encountered earlier in the paper. [Footnote 5: Each example displays the light verb followed by its syntactic dependents given within &lt;&gt;. The subject is omitted.]</Paragraph>
    <Paragraph position="3"> Fully lexicalized phrases exist as individual lexical entries. No variation, modification or extraction out of these phrases is possible. A fully lexicalized phrase is a string of lexemes, each in its surface form, and is represented within '[]':
houden &lt; dat,[de,hand],[boven,het,hoofd]&gt;
houden &lt; refl,[van,de,domme]&gt;</Paragraph>
    <Paragraph position="4"> Partially lexicalized phrases declare the type of argument they introduce, e.g. accusative, semi-fixed prepositional phrase, predicative argument. These phrases also specify lexical restrictions on the head lexeme and allow alternation of specifiers and morphological productivity in nouns. Partially lexicalized PPs list the head preposition and its object NP head:
houden &lt; acc(rekening),pc(met)&gt;
brengen &lt; acc,pp(op,gedachten)&gt;</Paragraph>
    <Paragraph position="5"> Finally, open argument slots state what sort of argument is required (e.g. acc(usative), refl(exive), dat(ive)). No lexical restrictions are declared:
stellen &lt; acc,pp(in,bewaring)&gt;</Paragraph>
    <Paragraph position="6"> Concerning the syntactic behavior of LVCs, Alpino currently only declares whether the expressions allow passive or not and the type of passive. The current representation allows intervening adjuncts and other material between the syntactic dependents. No explicit constraints are stated with regard to topicalization, wh-extraction, coordination, clefting, etc.</Paragraph>
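    <Paragraph position="7"> The frame notation used in this section can be mirrored in a small data structure. The following minimal sketch in Python is not Alpino's lexicon format or code: the class names Fixed, Partial, Slot and Frame, the render method and the allows_passive field are hypothetical, introduced only to show how the three phrase types and the two parts of an extended subcategorization frame fit together.
<![CDATA[
```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Fixed:
    """Fully lexicalized phrase: a fixed string of surface forms."""
    words: Tuple[str, ...]

    def render(self) -> str:
        return "[" + ",".join(self.words) + "]"


@dataclass(frozen=True)
class Partial:
    """Partially lexicalized phrase: argument type plus lexical heads."""
    kind: str
    heads: Tuple[str, ...]

    def render(self) -> str:
        return self.kind + "(" + ",".join(self.heads) + ")"


@dataclass(frozen=True)
class Slot:
    """Open argument slot: only the argument type is stated."""
    kind: str

    def render(self) -> str:
        return self.kind


Dependent = Union[Fixed, Partial, Slot]


@dataclass(frozen=True)
class Frame:
    """Extended subcategorization frame: dependents plus operations."""
    verb: str
    dependents: Tuple[Dependent, ...]
    # Hypothetical encoding of part (b); None means "not specified".
    allows_passive: Optional[bool] = None

    def render(self) -> str:
        inner = ",".join(d.render() for d in self.dependents)
        return self.verb + " < " + inner + ">"


# The example frames from the text, rebuilt with the sketch above.
HAND = Frame("houden", (Slot("dat"), Fixed(("de", "hand")),
                        Fixed(("boven", "het", "hoofd"))))
REKENING = Frame("houden", (Partial("acc", ("rekening",)),
                            Partial("pc", ("met",))))
GEDACHTEN = Frame("brengen", (Slot("acc"),
                              Partial("pp", ("op", "gedachten"))))
```
]]>
    Rendering these objects gives back the notation used in the examples, e.g. the string for HAND is the first houden frame shown above. Keeping the phrase types as separate classes makes it explicit which dependents tolerate variation and which do not.</Paragraph>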
  </Section>
  <Section position="5" start_page="68" end_page="68" type="metho">
    <SectionTitle>
5 Related work
</SectionTitle>
    <Paragraph position="0"> Automatically annotated corpora have been used before to identify (prepositional) support verb constructions and to assess their variation and modification potential. Led by (Krenn, 2000) and continued by (Spranger, 2004) (among others), most work focused on German support verb constructions and figurative expressions. Our use of fully parsed corpora and the treebank query tool to extract relevant evidence introduces a fundamental difference with the cited work.</Paragraph>
    <Paragraph position="1"> Analytic techniques to annotate syntactically flexible (but idiosyncratic) expressions in lexical resources are discussed in (Breidt et al., 1996; Sag et al., 2001) and (Odijk, 2004). Within a similar line of work, (Sag et al., 2001) propose lexical selection, inheritance hierarchies of constructions and the notion of idiomatic construction to formalize the syntax and semantics of truly fixed, semi-fixed and syntactically flexible expressions. Assuming a regular syntactic behavior and having checked that component lexemes satisfy certain predicate-argument relationships, the semantics layer assigns the idiomatic interpretation to syntactically flexible expressions. (Sag et al., 2001) only mention light verb plus noun constructions. Supposedly, the Dutch prepositional LVCs fall into the syntactically flexible group.</Paragraph>
  </Section>
</Paper>