<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1051"> <Title>Acquiring Lexical Generalizations from Corpora: A Case Study for Diathesis Alternations</Title> <Section position="3" start_page="0" end_page="399" type="intro"> <SectionTitle> 2 Method </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="397" type="sub_section"> <SectionTitle> 2.1 The parser </SectionTitle> <Paragraph position="0"> The part-of-speech tagged version of the British National Corpus (BNC), a 100 million word collection of written and spoken British English (Burnard, 1995), was used to acquire the frames characteristic of the dative and benefactive alternations. Surface syntactic structure was identified using Gsearch (Keller et al., 1999), a tool which allows the search of arbitrary POS-tagged corpora for shallow syntactic patterns based on a user-specified context-free grammar and a syntactic query. It achieves this by combining a left-corner parser with a regular expression matcher.</Paragraph> <Paragraph position="1"> Depending on the grammar specification (i.e., recursive or not), Gsearch can be used as a full context-free parser or a chunk parser. Depending on the syntactic query, Gsearch can parse full sentences, identify syntactic relations (e.g., verb-object, adjective-noun) or even single words (e.g., all indefinite pronouns in the corpus).</Paragraph> <Paragraph position="2"> Gsearch outputs all corpus sentences containing substrings that match a given syntactic query. Given two possible parses that begin at the same point in the sentence, the parser chooses the longest match.</Paragraph> <Paragraph position="3"> If two possible parses can be produced for the same substring, only one parse is returned.</Paragraph> <Paragraph position="4"> This means that if the number of ambiguous rules in the grammar is large, the correctness of the parsed output is not guaranteed.</Paragraph> </Section> <Section position="2" start_page="397" end_page="397" type="sub_section"> <SectionTitle> 2.2 Acquisition </SectionTitle> <Paragraph position="0"> We used Gsearch to extract tokens matching the patterns 'V NP1 NP2', 'V NP1 to NP2', and 'V NP1 for NP2' by specifying a chunk grammar for recognizing the verbal complex and NPs. POS-tags were retained in the parser's output, which was post-processed to remove adverbials and interjections. Examples of the parser's output are given in (3).</Paragraph>
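For illustration only, the extraction step can be thought of as shallow pattern matching over POS-tagged text. The sketch below is not Gsearch itself and does not reproduce its grammar or query syntax; the tagset, the chunk definitions and the extract_frames helper are assumptions introduced here.

```python
import re

# Illustrative chunk patterns over "word_TAG" tokens. The tagset and the
# NP/verb definitions are simplifications, not the BNC tagset or Gsearch's
# context-free grammar formalism.
NP = r"(?:\S+_(?:DT|PRP\$|JJ|CD)\s+)*\S+_(?:NNS|NNP|NN|PRP)"
V = r"\S+_VB[DZPGN]?"

PATTERNS = {
    "V NP1 NP2":     re.compile(rf"({V})\s+({NP})\s+({NP})"),
    "V NP1 to NP2":  re.compile(rf"({V})\s+({NP})\s+to_TO\s+({NP})"),
    "V NP1 for NP2": re.compile(rf"({V})\s+({NP})\s+for_IN\s+({NP})"),
}

def extract_frames(tagged_sentence):
    """Return (frame, matched substring) pairs, keeping the longest match
    that starts at any given position (as Gsearch is described as doing)."""
    candidates = []
    for frame, pattern in PATTERNS.items():
        for m in pattern.finditer(tagged_sentence):
            candidates.append((m.start(), -len(m.group(0)), frame, m.group(0)))
    best = {}
    for start, _neg_len, frame, text in sorted(candidates):
        best.setdefault(start, (frame, text))  # longest match per start position
    return list(best.values())

print(extract_frames("he_PRP sent_VBD Bailey_NNP a_DT postcard_NN"))
# [('V NP1 NP2', 'sent_VBD Bailey_NNP a_DT postcard_NN')]
```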
<Paragraph position="1"> Although there are cases where Gsearch produces the right parse (cf. (3a)), the parser wrongly identifies tokens containing compounds (cf. (3b)), bare relative clauses (cf. (3c)), and NPs in apposition (cf. (3d)) as instances of the double object frame.</Paragraph> <Paragraph position="2"> Sometimes the parser attaches prepositional phrases to the wrong site (cf. (3e)) and cannot distinguish between arguments and adjuncts (cf. (3f)) or between different types of adjuncts (e.g., temporal (cf. (3f)) versus benefactive (cf. (3g))). Erroneous output also arises from tagging mistakes.</Paragraph> <Paragraph position="3"> (3) a. The police driver [v shot] [NP Jamie] [NP a look of enquiry] which he missed.</Paragraph> <Paragraph position="4"> b. Some also [v offer] [NP a free bus] [NP service], to encourage customers who do not have their own transport.</Paragraph> <Paragraph position="5"> c. A Jaffna schoolboy [v shows] [NP a drawing] [NP he] made of helicopters strafing his home town.</Paragraph> <Paragraph position="6"> d. For the latter catalogue Barr [v chose] [NP the Surrealist writer] [NP Georges Hugnet] to write a historical essay.</Paragraph> <Paragraph position="7"> e. It [v controlled] [NP access] [PP to [NP the vault]].</Paragraph> <Paragraph position="8"> f. Yesterday he [v rang] [NP the bell] [PP for [NP a long time]].</Paragraph> <Paragraph position="9"> g. Don't [v save] [NP the bread] [PP for [NP the birds]].</Paragraph> <Paragraph position="10"> We identified erroneous subcategorization frames (cf. (3b)-(3d)) by using linguistic heuristics and a process for compound noun detection (cf. section 2.3). We disambiguated the attachment site of PPs (cf. (3e)) using Hindle and Rooth's (1993) lexical association score (cf. section 2.4). Finally, we recognized benefactive PPs (cf. (3g)) by exploiting the WordNet taxonomy (cf. section 2.5).</Paragraph> </Section> <Section position="3" start_page="397" end_page="398" type="sub_section"> <SectionTitle> 2.3 Guessing the double object frame </SectionTitle> <Paragraph position="0"> We developed a process which assesses whether the syntactic patterns (called cues below) derived from the corpus are instances of the double object frame.</Paragraph> <Paragraph position="1"> Linguistic Heuristics. We applied several heuristics to the parser's output to determine whether corpus tokens were instances of the double object frame. The 'Reject' heuristics below identified erroneous matches (cf. (3b-d)), whereas the 'Accept' heuristics identified true instances of the double object frame (cf. (3a)); a sketch of one possible implementation of these heuristics is given below.</Paragraph> <Paragraph position="2"> 1. Reject if cue contains at least two proper names adjacent to each other (e.g., killed Henry Phipps).</Paragraph> <Paragraph position="3"> 2. Reject if cue contains possessive noun phrases (e.g., give a showman's award).</Paragraph> <Paragraph position="4"> 3. Reject if cue's last word is a pronoun or an anaphor (e.g., ask the subjects themselves).</Paragraph> <Paragraph position="5"> 4. Accept if verb is followed by a personal or indefinite pronoun (e.g., found him a home).</Paragraph> <Paragraph position="6"> 5. Accept if verb is followed by an anaphor (e.g., made herself a snack).</Paragraph> <Paragraph position="7"> 6. Accept if cue's surface structure is either 'V MOD NP MOD NP' or 'V NP MOD NP' (e.g., send Bailey a postcard), where MOD represents any prenominal modifier (e.g., articles, pronouns, adjectives, quantifiers, ordinals).</Paragraph> <Paragraph position="8"> 7. Cannot decide if cue's surface structure is 'V MOD* N N+' (e.g., offer a free bus service).</Paragraph> <Paragraph position="9"> Compound Noun Detection. Tokens identified by heuristic (7) were dealt with separately by a procedure which guesses whether the nouns following the verb are two distinct arguments or parts of a compound. This procedure was applied only to noun sequences of length 2 and 3, which were extracted from the parser's output and compared against a compound noun dictionary (48,661 entries) compiled from WordNet. 13.9% of the noun sequences were identified as compounds in the dictionary.</Paragraph>
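The following is a minimal sketch of how heuristics (1)-(7) above could be operationalised. The representation of a cue as a list of (word, POS-tag) pairs following the verb, the tag names, and the closed word lists are assumptions of this illustration, not the authors' implementation.

```python
# Illustrative encoding of heuristics (1)-(7): classify the material that
# follows the verb in a 'V NP1 NP2' cue as accept / reject / undecided.
PRONOUNS = {"him", "her", "them", "me", "us", "you", "it",
            "someone", "something", "anyone", "anything"}
ANAPHORS = {"himself", "herself", "themselves", "myself", "itself", "ourselves"}
MODIFIER_TAGS = {"DT", "PRP$", "JJ", "CD"}  # articles, possessives, adjectives, numbers

def classify_cue(cue):
    words = [w.lower() for w, _ in cue]
    tags = [t for _, t in cue]

    # (1) Reject: two adjacent proper names (e.g., "killed Henry Phipps")
    if any(t1 == t2 == "NNP" for t1, t2 in zip(tags, tags[1:])):
        return "reject"
    # (2) Reject: possessive noun phrase (e.g., "give a showman's award")
    if "POS" in tags:
        return "reject"
    # (3) Reject: cue ends in a pronoun or anaphor (e.g., "ask the subjects themselves")
    if words[-1] in PRONOUNS | ANAPHORS:
        return "reject"
    # (4)/(5) Accept: verb directly followed by a pronoun or anaphor
    #         (e.g., "found him a home", "made herself a snack")
    if words[0] in PRONOUNS | ANAPHORS:
        return "accept"
    # (6) Accept: a prenominal modifier occurring after the first noun head
    #     signals a second, distinct NP ('V MOD NP MOD NP' or 'V NP MOD NP',
    #     e.g., "send Bailey a postcard")
    noun_positions = [i for i, t in enumerate(tags) if t.startswith("NN")]
    if noun_positions and any(t in MODIFIER_TAGS for t in tags[noun_positions[0] + 1:]):
        return "accept"
    # (7) Cannot decide: bare noun sequence 'V MOD* N N+', handed over to
    #     the compound noun detection procedure described above
    return "undecided"

print(classify_cue([("Bailey", "NNP"), ("a", "DT"), ("postcard", "NN")]))              # accept
print(classify_cue([("a", "DT"), ("free", "JJ"), ("bus", "NN"), ("service", "NN")]))   # undecided
```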
<Paragraph position="10"> Table 2 (random sample of three-word compounds found by the method): [[energy efficiency] office], [[council tax] bills], [alcohol [education course]], [hospital [out-patient department]], [[turnout suppressor] function], [[nature conservation] resources], [[quality amplifier] circuits].</Paragraph> <Paragraph position="11"> For sequences of length 2 not found in WordNet, we used the log-likelihood ratio (G-score) to estimate the lexical association between the nouns, in order to determine if they formed a compound noun (an illustrative sketch of this test is given at the end of this subsection). We preferred the log-likelihood ratio to other statistical scores, such as the association ratio (Church and Hanks, 1990) or χ2, since it adequately takes into account the frequency of the co-occurring words and is less sensitive to rare events and corpus size (Dunning, 1993; Daille, 1996). We assumed that two nouns cannot be disjoint arguments of the verb if they are lexically associated. On this basis, tokens were rejected as instances of the double object frame if they contained two nouns whose G-score had a p-value less than 0.05.</Paragraph> <Paragraph position="12"> A two-step process was applied to noun sequences of length 3: first their bracketing was determined and second the G-score was computed between the single noun and the 2-noun sequence. We inferred the bracketing by modifying an algorithm initially proposed by Pustejovsky et al. (1993). Given three nouns n1, n2, n3, if either [n1 n2] or [n2 n3] is in the compound noun dictionary, we built the structure [[n1 n2] n3] or [n1 [n2 n3]] accordingly; if both [n1 n2] and [n2 n3] appear in the dictionary, we chose the most frequent pair; if neither [n1 n2] nor [n2 n3] appears in WordNet, we computed the G-score for [n1 n2] and [n2 n3] and chose the pair with the highest value (p < 0.05). Tables 1 and 2 display a random sample of the compounds the method found (p < 0.05).</Paragraph> <Paragraph position="13"> The performance of the linguistic heuristics and the compound detection procedure was evaluated by randomly selecting approximately 3,000 corpus tokens which were previously accepted or rejected as instances of the double object frame. Two judges decided whether the tokens were classified correctly. The judges' agreement on the classification task was calculated using the Kappa coefficient (Siegel and Castellan, 1988), which measures inter-rater agreement among a set of coders making category judgments. The Kappa coefficient of agreement (K) compares the proportion of times, P(A), that k raters agree with the proportion of times, P(E), that the raters would be expected to agree by chance, as shown in (4): (4) K = (P(A) - P(E)) / (1 - P(E)). If there is complete agreement among the raters, K equals 1; if the observed agreement is exactly what would be expected by chance, K equals 0.</Paragraph> <Paragraph position="14"> Precision figures (Prec) and inter-judge agreement (Kappa) are summarized in table 3. In sum, the heuristics achieved high accuracy in classifying cues for the double object frame. Agreement on the classification was good given that the judges were given minimal instructions and no prior training.</Paragraph> </Section>
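As a concrete illustration of the association test used above, the sketch below computes the log-likelihood ratio (G-score) for a noun pair from a 2x2 contingency table in the standard way (Dunning, 1993). The function names and the way the counts are assumed to be collected are illustrative; the 3.84 cut-off corresponds to p < 0.05 for one degree of freedom.

```python
from math import log

def g_score(c12, c1, c2, n):
    """Log-likelihood ratio (G^2) for a noun bigram n1 n2.

    c12: frequency of the bigram n1 n2
    c1 : frequency of n1 as the first element of any noun bigram
    c2 : frequency of n2 as the second element of any noun bigram
    n  : total number of noun bigrams in the corpus
    """
    # observed counts of the 2x2 contingency table
    obs = [c12, c1 - c12, c2 - c12, n - c1 - c2 + c12]
    # expected counts under the independence assumption
    exp = [c1 * c2 / n, c1 * (n - c2) / n,
           (n - c1) * c2 / n, (n - c1) * (n - c2) / n]
    return 2 * sum(o * log(o / e) for o, e in zip(obs, exp) if o > 0)

# G^2 is asymptotically chi-square distributed with one degree of freedom,
# so a pair counts as lexically associated at p < 0.05 when G^2 > 3.84.
def is_compound(c12, c1, c2, n, threshold=3.84):
    return g_score(c12, c1, c2, n) > threshold
```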
<Section position="4" start_page="398" end_page="399" type="sub_section"> <SectionTitle> 2.4 Guessing the prepositional frames </SectionTitle> <Paragraph position="0"> In order to consider verbs with prepositional frames as candidates for the dative and benefactive alternations, the following requirements needed to be met: 1. the PP must be attached to the verb rather than to the preceding noun; 2. in the case of the 'V NP1 to NP2' structure, the to-PP must be an argument of the verb; 3. in the case of the 'V NP1 for NP2' structure, the for-PP must be benefactive. (Syntactically speaking, benefactive for-PPs are not arguments but adjuncts (Jackendoff, 1990) and can appear on any verb with which they are semantically compatible.)</Paragraph> <Paragraph position="1"> In order to meet requirements (1)-(3), we first determined the attachment site (e.g., verb or noun) of the PP and secondly developed a procedure for distinguishing benefactive from non-benefactive PPs. Several approaches have statistically addressed the problem of prepositional phrase ambiguity, with comparable results (Hindle and Rooth, 1993; Collins and Brooks, 1995; Ratnaparkhi, 1998). Hindle and Rooth (1993) used a partial parser to extract (v, n, p) tuples from a corpus, where p is the preposition whose attachment is ambiguous between the verb v and the noun n. We used a variant of the method described in Hindle and Rooth (1993), the main difference being that we applied their lexical association score (a log-likelihood ratio which compares the probability of noun versus verb attachment) in an unsupervised, non-iterative manner (an illustrative sketch is given at the end of this subsection). Furthermore, the procedure was applied to the special case of tuples containing the prepositions to and for only.</Paragraph> <Paragraph position="2"> We evaluated the procedure by randomly selecting 2,124 tokens containing to-PPs and for-PPs for which the procedure guessed verb or noun attachment. The tokens were disambiguated by two judges. Precision figures are reported in table 3. The lexical association score was highly accurate at guessing both verb and noun attachment for to-PPs. Further evaluation revealed that for 98.6% (K = 0.9, N = 494, k = 2) of the tokens classified as instances of verb attachment, the to-PP was an argument of the verb, which meant that the log-likelihood ratio satisfied both requirements (1) and (2) for to-PPs.</Paragraph> <Paragraph position="3"> A low precision of 36% was achieved in detecting instances of noun attachment for for-PPs. One reason for this is the polysemy of the preposition for: for-PPs can be temporal, purposive, benefactive or causal adjuncts and consequently can attach to various sites. Another difficulty is that benefactive for-PPs semantically license both attachment sites. To further analyze the poor performance of the log-likelihood ratio on this task, 500 tokens containing for-PPs were randomly selected from the parser's output and disambiguated. Of these, 73.9% (K = 0.9, N = 500, k = 2) were instances of verb attachment, which indicates that verb attachments outnumber noun attachments for for-PPs; a higher precision for verb attachment (cf. requirement (1)) can therefore be achieved not by applying the log-likelihood ratio but simply by classifying all instances as verb attachment.</Paragraph> </Section>
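To make the attachment decision above concrete, the sketch below computes a simplified lexical association score for an ambiguous (v, n, p) tuple: the log of the ratio between the probability of seeing the preposition with the verb and the probability of seeing it with the noun. This mirrors the spirit of Hindle and Rooth's (1993) score, but the smoothing constants and function names are assumptions of this sketch, not the exact formulation used in the paper.

```python
from math import log2

def lexical_association(f_vp, f_v, f_np, f_n, smooth=0.5):
    """Simplified lexical association score for an ambiguous (v, n, p) tuple.

    f_vp: corpus frequency of verb v occurring with a PP headed by p
    f_v : corpus frequency of verb v
    f_np: corpus frequency of noun n occurring with a PP headed by p
    f_n : corpus frequency of noun n
    Positive scores favour verb attachment, negative scores noun attachment.
    """
    p_prep_given_verb = (f_vp + smooth) / (f_v + 1.0)
    p_prep_given_noun = (f_np + smooth) / (f_n + 1.0)
    return log2(p_prep_given_verb / p_prep_given_noun)

def attach(f_vp, f_v, f_np, f_n):
    """Classify an ambiguous to-PP or for-PP as a verb or noun attachment."""
    return "verb" if lexical_association(f_vp, f_v, f_np, f_n) > 0 else "noun"
```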
<Section position="5" start_page="399" end_page="399" type="sub_section"> <SectionTitle> 2.5 Benefactive PPs </SectionTitle> <Paragraph position="0"> Although surface syntactic cues can be important for determining the attachment site of prepositional phrases, they provide no indication of the semantic role of the preposition in question. This is particularly the case for the preposition for, which can have several roles besides the benefactive.</Paragraph> <Paragraph position="1"> Two judges discriminated benefactive from non-benefactive PPs for 500 tokens randomly selected from the parser's output. Only 18.5% (K = 0.73, N = 500, k = 2) of the sample contained benefactive PPs. An analysis of the nouns headed by the preposition for revealed that 59.6% were animate, 17% were collective, 4.9% denoted locations, and the remaining 18.5% denoted events, artifacts, body parts, or actions. Animate, collective and location nouns account for 81.5% of the benefactive data.</Paragraph> <Paragraph position="2"> We used the WordNet taxonomy (Miller et al., 1990) to recognize benefactive PPs (cf. requirement (3)). Nouns in WordNet are organized into an inheritance system defined by hypernymic relations. Instead of being contained in a single hierarchy, nouns are partitioned into a set of semantic primitives (e.g., act, animal, time) which are treated as the unique beginners of separate hierarchies. We compiled a &quot;concept dictionary&quot; from WordNet (87,642 entries), where each entry consists of a noun and the semantic primitive distinguishing each of its senses (cf. table 4).</Paragraph> <Paragraph position="3"> We considered a for-PP to be benefactive if the noun headed by for was listed in the concept dictionary and the semantic primitive of its prime sense (Sense 1) was person, animal, group or location. PPs with head nouns not listed in the dictionary were considered benefactive only if their head nouns were proper names. Tokens containing personal, indefinite and anaphoric pronouns were also considered benefactive (e.g., build a home for him).</Paragraph> <Paragraph position="4"> Two judges evaluated the procedure by judging 1,000 randomly selected tokens, which were accepted or rejected as benefactive. The procedure achieved a precision of 48.8% (K = 0.89, N =</Paragraph> </Section> </Section> </Paper>