File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/w06-1602_metho.xml
Size: 21,732 bytes
Last Modified: 2025-10-06 14:10:39
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1602"> <Title>Sydney, July 2006. c(c)2006 Association for Computational Linguistics An Empirical Approach to the Interpretation of Superlatives</Title> <Section position="5" start_page="10" end_page="11" type="metho"> <SectionTitle> 3 Annotated Corpus of Superlatives </SectionTitle> <Paragraph position="0"> In order to develop and evaluate our system we manually annotated a collection of newspaper article and questions withoccurrences of superlatives.</Paragraph> <Paragraph position="1"> The design of the corpus and its characteristics are described in this section.</Paragraph> <Section position="1" start_page="10" end_page="10" type="sub_section"> <SectionTitle> 3.1 Classification and Annotation Scheme </SectionTitle> <Paragraph position="0"> Instances of superlatives are identified in text and classified into one of four possible classes: attributive, predicative, adverbial, or idiomatic: its rates will be among the highest (predicative) the strongest dividend growth (attributive) free to do the task most quickly (adverbial) who won the TONY for best featured actor? (idiom) For all cases, we annotate the span of the superlative adjective in terms of the position of the tokens in the sentence. For instance, in &quot;its1 rates2 will3 be4 among5 the6 highest7&quot;, the superlative span would be 7-7.</Paragraph> <Paragraph position="1"> Additional information is encoded for the attributive case: type of determiner (possessive, definite, bare, demonstrative, quantifier), number (sg, pl, mass), cardinality (yes, no), modification (adjective, ordinal, intensifier, none). Table 1 shows some examples from the WSJ with annotation values. null Not included in this study are adjectives such as &quot;next&quot;, &quot;past&quot;, &quot;last&quot;, nor the ordinal &quot;first&quot;, although they somewhat resemble superlatives in their semantics. Also excluded are adjectives that lexicalise a superlative meaning but are not superlatives morphologically, like &quot;main&quot;, &quot;principal&quot;, and the like. For etymological reasons we however include &quot;foremost&quot; and &quot;uttermost.&quot;</Paragraph> </Section> <Section position="2" start_page="10" end_page="11" type="sub_section"> <SectionTitle> 3.2 Data and Annotation </SectionTitle> <Paragraph position="0"> Our corpus consists of a collection of newswire articles from the Wall Street Journal (Sections 00, 01, 02, 03, 04, 10, and 15) and the Glasgow Herald (GH950110 from the CLEF evaluation forum), and a large set of questions from the TREC QA evaluation exercise (years 2002 and 2003) and natural language queries submitted to the Excite search engine (Jansen and Spink, 2000). The data was automatically tokenised, but all typos and extra-grammaticalities were preserved. The corpus was split into a development set used for tuning the system and a test set for evaluation. The size of each sub-corpus is shown in Table 2.</Paragraph> <Paragraph position="1"> The annotation was performed by two trained linguists. One section of the WSJ was annotated by both annotators independently to calculate inter-annotator agreement. All other documents were first annotated by one judge and then checked by the second, in order to ensure maximum correctness. 
<Section position="2" start_page="10" end_page="11" type="sub_section"> <SectionTitle> 3.2 Data and Annotation </SectionTitle> <Paragraph position="0"> Our corpus consists of a collection of newswire articles from the Wall Street Journal (Sections 00, 01, 02, 03, 04, 10, and 15) and the Glasgow Herald (GH950110 from the CLEF evaluation forum), and a large set of questions from the TREC QA evaluation exercise (years 2002 and 2003) and natural language queries submitted to the Excite search engine (Jansen and Spink, 2000). The data was automatically tokenised, but all typos and extra-grammaticalities were preserved. The corpus was split into a development set used for tuning the system and a test set for evaluation. The size of each sub-corpus is shown in Table 2.</Paragraph> <Paragraph position="1"> The annotation was performed by two trained linguists. One section of the WSJ was annotated by both annotators independently to calculate inter-annotator agreement. All other documents were first annotated by one judge and then checked by the second, in order to ensure maximum correctness. All disagreements were discussed and resolved for the creation of a gold standard corpus.</Paragraph> <Paragraph position="2"> Inter-annotator agreement was assessed mainly using f-score and percentage agreement, as well as the kappa statistic (K) where applicable (Carletta, 1996). In using f-score, we arbitrarily take one of the annotators' decisions (A) as the gold standard and compare them with the other annotator's decisions (B). Note that here f-score is symmetric, since precision(A,B) = recall(B,A), and (balanced) f-score is the harmonic mean of precision and recall (Tjong Kim Sang, 2002; Hachey et al., 2005; see also Section 5).</Paragraph> <Paragraph position="3"> We evaluated three levels of agreement on a sample of 1967 sentences (one full WSJ section). The first level concerns superlative detection: to what extent different human judges can agree on what constitutes a superlative. For this task, f-score was measured at 0.963, with a total of 79 superlative phrases agreed upon.</Paragraph> <Paragraph position="4"> The second level of agreement is relative to type identification (attributive, predicative, adverbial, idiomatic), and is only calculated on the subset of cases both annotators recognised as superlatives (79 instances, as mentioned). The overall f-score for the classification task is 0.974, with 77 cases where both annotators assigned the same type to a superlative phrase. We also assessed agreement for each class, and the attributive type proved the most reliable, with an f-score of 1 (total agreement on 64 cases), whereas there was some disagreement in classifying predicative and adverbial cases (0.9 and 0.8 f-score, respectively). Idiomatic uses were not detected in this portion of the data. To assess this classification task we also used the kappa statistic, which yielded KCo=0.922 (following (Eugenio and Glass, 2004) we report K as KCo, indicating that we calculate K à la Cohen (Cohen, 1960)). KCo over 0.9 is considered to signal very good agreement (Krippendorff, 1980).</Paragraph> <Paragraph position="5"> The third and last level of agreement deals with the span of the comparison set and only concerns attributive cases (64 out of 79). Percentage agreement was used, since this is not a classification task, and was measured at 95.31%.</Paragraph> <Paragraph position="6"> The agreement results show that the task is quite easy for trained linguists to perform. Despite the limited number of instances compared, this also emerged from the annotators' own perception of the difficulty of the task.</Paragraph> </Section>
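The agreement measures used above are standard; for concreteness, a minimal sketch of how they can be computed is given below, assuming each annotator's decisions are represented as a set (of superlative spans, or of (span, type) pairs) and, for kappa, as aligned lists of type labels. The helper functions are illustrative and not part of the annotation tooling.

from collections import Counter

def pairwise_f_score(decisions_a, decisions_b):
    """Symmetric pairwise f-score: treat A as gold and B as response (or vice versa)."""
    if not decisions_a or not decisions_b:
        return 0.0
    agreed = len(decisions_a & decisions_b)
    precision = agreed / len(decisions_b)
    recall = agreed / len(decisions_a)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa over the items both annotators marked as superlatives."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

Restricted to the 79 instances both annotators recognised as superlatives, the kappa computation above corresponds to the KCo figure reported in Section 3.2.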
<Section position="3" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 3.3 Distribution </SectionTitle> <Paragraph position="0"> The gold standard corpus comprises a total of 3,045 superlatives, which roughly amounts to one superlative in every 25 sentences/questions. The overwhelming majority of superlatives are attributive (89.1%), and only a few are used in a predicative way (6.9%), adverbially (3.0%), or in idiomatic expressions (0.9%). Table 3 shows the detailed distribution according to data source and experimental sets. Although the corpus also includes annotation about determination, modification, grammatical number, and cardinality of attributive superlatives (see Section 3.1), this information is not used by the system described in this paper.</Paragraph> </Section> </Section> <Section position="6" start_page="11" end_page="13" type="metho"> <SectionTitle> 4 Automatic Analysis of Superlatives </SectionTitle> <Paragraph position="0"> The system that we use to analyse superlatives is based on two linguistic formalisms: Combinatory Categorial Grammar (CCG), for a theory of syntax, and Discourse Representation Theory (DRT), for a theory of semantics. In this section we will illustrate how we extend these theories to deal with superlatives and how we implemented this in a working system.</Paragraph> <Section position="1" start_page="12" end_page="12" type="sub_section"> <SectionTitle> 4.1 Combinatory Categorial Grammar (CCG) </SectionTitle> <Paragraph position="0"> CCG is a lexicalised theory of grammar (Steedman, 2001). We used Clark & Curran's wide-coverage statistical parser (Clark and Curran, 2004) trained on CCGbank, which in turn is derived from the Penn Treebank (Hockenmaier and Steedman, 2002). In CCGbank, the majority of superlative adjectives are analysed as follows: [example CCG derivation not preserved in this version]. Besides the CCG derivation of the input sentence, the parser also produces a part-of-speech (POS) tag and a lemmatised form for each input token. To recognise attributive superlatives in the output of the parser, we look both at the POS tag and at the CCG category assigned to a word. Words with POS tag JJS and CCG category N/N, (N/N)/(N/N), or (N/N)\(N/N) are considered attributive superlative adjectives, and so are the words &quot;most&quot; and &quot;least&quot; with CCG category (N/N)/(N/N).</Paragraph> <Paragraph position="1"> However, most hyphenated superlatives are not recognised, because the parser tags them as JJ instead of JJS (the Penn Treebank annotation guidelines prescribe that all hyphenated adjectives be tagged as JJ); these are corrected in a post-processing step. Examples that fall in this category are &quot;most-recent wave&quot; and &quot;third-highest&quot;.</Paragraph> </Section> <Section position="2" start_page="12" end_page="13" type="sub_section"> <SectionTitle> 4.2 Discourse Representation Theory (DRT) </SectionTitle> <Paragraph position="0"> The output of the parser, a CCG derivation of the input sentence, is used to construct a Discourse Representation Structure (DRS, the semantic representation proposed by DRT (Kamp and Reyle, 1993)). We follow (Bos et al., 2004; Bos, 2005) in automatically building semantic representations on the basis of CCG derivations in a compositional fashion. We briefly summarise the approach here.</Paragraph> <Paragraph position="2"> The semantic representation for a word is determined by its CCG category, POS tag, and lemma.</Paragraph> <Paragraph position="3"> Consider the following lexical entries:

  tallest: [lexical entry not preserved in this version]
  man: λx. man(x)

These lexical entries are combined in a compositional fashion following the CCG derivation, using the λ-calculus as a glue language:

  tallest man: λx. [resulting DRS not preserved in this version]</Paragraph> <Paragraph position="7"> In this way DRSs can be produced in a robust way, achieving high coverage. An example output representation of the complete system is shown in the accompanying figure [not preserved in this version].
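For concreteness, the recognition step described in Section 4.1 amounts to a filter over the parser's per-token output along the following lines. This is an illustrative sketch, not the actual implementation; the function name is ours, while the POS tags and category strings follow the description above.

ATTRIBUTIVE_CATS = {r"N/N", r"(N/N)/(N/N)", r"(N/N)\(N/N)"}

def is_attributive_superlative(pos: str, ccg_cat: str, lemma: str) -> bool:
    # superlative adjectives: POS tag JJS plus an adnominal CCG category
    if pos == "JJS" and ccg_cat in ATTRIBUTIVE_CATS:
        return True
    # "most" and "least" also mark attributive superlatives when they carry this category
    if lemma in {"most", "least"} and ccg_cat == "(N/N)/(N/N)":
        return True
    # hyphenated superlatives (e.g. "third-highest") come out tagged JJ and are
    # handled by the separate post-processing step described in Section 4.1
    return False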
As is often the case, the output of the parser is not always what one needs to construct a meaningful semantic representation. There are two cases where we alter the CCG derivation output by the parser in order to improve the resulting DRSs. The first case concerns modifiers following a superlative construction that are attached to the NP node rather than to N. A case in point is ... the largest toxicology lab in New England ...</Paragraph> <Paragraph position="8"> where the PP in New England has the CCG category NP\NP rather than N\N. This would result in a comparison set of toxicology labs in general, rather than the set of toxicology labs in New England. The second case concerns possessive NPs preceding a superlative construction. An example here is ... Jaguar's largest shareholder ...</Paragraph> <Paragraph position="10"> where a correct interpretation of the superlative requires a comparison set of shareholders from Jaguar, rather than just any shareholder. However, the parser outputs a derivation where &quot;largest&quot; is combined with &quot;shareholder&quot;, and then with the possessive construction, yielding the wrong semantic interpretation. To deal with this, we analyse such cases differently [the revised CCG analysis shown at this point is not preserved in this version]. This analysis yields the correct comparison set for superlatives that follow a possessive noun phrase, given the following lexical semantics for the genitive (DRS box notation not preserved in this version):

  λn.λS.λp.λq.( u ; S(λx.(p(x) ; n(λy. of(y,x))(u) ; q(u))) )

For both cases, we apply some simple post-processing rules to the output of the parser to obtain the required derivations. The effect of these rules is reported in the next section, where we assess the accuracy of the semantic representations produced for superlatives by comparing the automatic analysis with the gold standard.</Paragraph> </Section> </Section> <Section position="7" start_page="13" end_page="14" type="metho"> <SectionTitle> 5 Evaluation </SectionTitle> <Paragraph position="0"> The automatic analysis of superlatives we present in the following experiments consists of two sequential tasks: superlative detection, and comparison set determination.</Paragraph> <Paragraph position="1"> The first task is concerned with finding a superlative in text and its exact span (&quot;largest&quot;, &quot;most beautiful&quot;, &quot;10 biggest&quot;). For a found string to be judged as correct, its whole span must correspond to the gold standard. The task is evaluated using precision (P), recall (R), and f-score (F), where precision is the proportion of detected superlatives that are correct, recall is the proportion of gold-standard superlatives that are detected, and F is their harmonic mean.</Paragraph> <Paragraph position="3"> The second task is conditional on the first: once a superlative is found, its comparison set must also be identified (&quot;rarest flower in New Zealand&quot;, &quot;New York's tallest building&quot;, see Section 2.2). A selected comparison set is evaluated as correct if it corresponds exactly to the gold standard annotation: partial matches are counted as wrong. Assignments are evaluated using accuracy (the proportion of correct decisions made), computed only on the subset of previously correctly identified superlatives.</Paragraph> <Paragraph position="4"> For both tasks we developed simple baseline systems based on part-of-speech tags, and a more sophisticated linguistic analysis based on CCG and DRT (i.e. the system described in Section 4).</Paragraph> <Paragraph position="5"> In the remainder of the paper we refer to the latter system as DLA (Deep Linguistic Analysis).</Paragraph>
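Concretely, the exact-match scoring just described can be sketched as follows, assuming the gold standard and system output are available as sets of token spans (for detection) and as mappings from superlative spans to comparison-set spans (for comparison set determination). The code and names are illustrative, not the evaluation scripts actually used.

def detection_prf(gold_spans, system_spans):
    """Exact-match precision/recall/f-score for superlative detection."""
    correct = gold_spans & system_spans          # a span counts only if it matches exactly
    p = len(correct) / len(system_spans) if system_spans else 0.0
    r = len(correct) / len(gold_spans) if gold_spans else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def comparison_set_accuracy(gold_comp, system_comp, correctly_detected):
    """Accuracy of comparison-set determination, restricted to correctly detected superlatives."""
    hits = sum(1 for span in correctly_detected
               if system_comp.get(span) == gold_comp.get(span))   # partial matches count as wrong
    return hits / len(correctly_detected) if correctly_detected else 0.0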
<Section position="1" start_page="13" end_page="14" type="sub_section"> <SectionTitle> 5.1 Superlative Detection </SectionTitle> <Paragraph position="0"> Baseline system. For superlative detection we generated a baseline that solely relies on part-of-speech information. The data was tagged using TnT (Brants, 2000), using a model trained on the Wall Street Journal. In the WSJ tagset, superlatives can be marked in two different ways, depending on whether the adjective is inflected or modified by most/least. So, &quot;largest&quot;, for instance, is tagged as JJS, whereas &quot;most beautiful&quot; is a sequence of RBS (most) and JJ (beautiful). We also checked that the superlative is followed by a common or proper noun (NN.*), allowing one word to occur in between. To cover more complex cases, we also considered pre-modification by adjectives (JJ) and cardinals (CD). In summary, we matched on sequences found by the following pattern, where * stands for at most one intervening token: [(CD || JJ)* (JJS || (RBS JJ)) * NN.*] This rather simple baseline is capable of detecting superlatives such as &quot;100 biggest banks&quot;, &quot;fourth largest investors&quot;, and &quot;most important element&quot;, but will fail on expressions such as &quot;fastest growing segments&quot; or &quot;Scotland 's lowest permitted 1995-96 increase&quot;.</Paragraph> <Paragraph position="1"> DLA system. For evaluation, we extracted superlatives from the DRSs output by the system.</Paragraph> <Paragraph position="2"> Each superlative introduces an implicational DRS condition, but not all implicational DRS conditions are introduced by superlatives. Hence, for the purposes of this experiment, superlative DRS conditions were assigned a special mark. While traversing the DRS, we use this mark to retrieve superlative instances. In order to retrieve the original string that gave rise to the superlative interpretation, we exploit the meta information encoded in each DRS about the relation between input tokens and semantic information. The obtained string position can in turn be evaluated against the gold standard.</Paragraph> <Paragraph position="3"> Table 4 lists the results achieved by the baseline system and the DLA system on the detection task. The DLA system outperforms the baseline system on precision in all sub-corpora. However, the baseline achieves a higher recall on the Excite queries. This is not entirely surprising, given that the coverage of the parser is between 90 and 95% on unseen data. Moreover, Excite queries are often ungrammatical, thus further affecting the performance of parsing.</Paragraph> </Section> <Section position="2" start_page="14" end_page="14" type="sub_section"> <SectionTitle> 5.2 Comparison Set Determination </SectionTitle> <Paragraph position="0"> Baseline. For comparison set determination we developed two baseline systems. Both use the same match on sequences of part-of-speech tags described above. For Baseline 1, the beginning of the comparison set is the first word following the superlative. The end of the comparison set is the first word tagged as NN.* in that sequence (the same word could be both the beginning and the end of the comparison set, as often happens).</Paragraph> <Paragraph position="1"> The second baseline takes the first word after the superlative as the beginning of the comparison set, and the end of the sentence (or question) as the end (excluding the final punctuation mark). We expect this strategy to perform well on questions, as the following examples show.</Paragraph> <Paragraph position="2"> Where is the oldest synagogue in the United States? What was the largest crowd to ever come see Michael Jordan? This approach is obviously likely to generate comparison sets much wider than required.</Paragraph>
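Both baselines can be stated in a few lines. The sketch below assumes a POS-tagged sentence and the index of the last token of a detected superlative; the function and variable names are ours, for illustration only.

import re

def baseline1_comparison_set(tags, sup_end):
    """Baseline 1: from the word after the superlative up to the first NN.* tag."""
    start = sup_end + 1
    for i in range(start, len(tags)):
        if re.match(r"NN", tags[i]):
            return (start, i)        # start and end may be the same token
    return None

def baseline2_comparison_set(tags, sup_end):
    """Baseline 2: from the word after the superlative to the end of the sentence,
    excluding a final punctuation mark."""
    start = sup_end + 1
    end = len(tags) - 1
    if tags[end] in {".", "?", "!"}:
        end -= 1
    return (start, end) if end >= start else None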
<Paragraph position="3"> More complex examples that neither baseline can tackle involve possessives, since on the surface the comparison set lies at both ends of the superlative adjective:

  The nation's largest pension fund
  the world's most corrupt organizations</Paragraph> </Section> </Section> <Section position="8" start_page="14" end_page="15" type="metho"> <SectionTitle> DLA 1 </SectionTitle> <Paragraph position="0"> We first extract superlatives from the DRS output by the system (see procedure above).</Paragraph> <Paragraph position="1"> Then, we exploit the semantic representation to select the comparison set: it is determined by the information encoded in the antecedent of the DRS conditional introduced by the superlative. Again, we exploit meta information to reconstruct the original span, and we match it against the gold standard for evaluation.</Paragraph> <Paragraph position="2"> DLA 2. DLA 2 builds on DLA 1, to which it adds post-processing rules applied to the CCG derivation, i.e. before the DRSs are constructed. This set of rules deals with NP post-modification of the superlative (see Section 4).</Paragraph> <Paragraph position="4"> DLA 3. In this version we include a set of post-processing rules that apply to the CCG derivation to deal with possessives preceding the superlative (see Section 4).</Paragraph> <Paragraph position="5"> DLA 4. This is a combination of DLA 2 and DLA 3, and is expected to perform best.</Paragraph> <Paragraph position="6"> Results for both baseline systems and all versions of DLA are shown in Table 5. On text documents, DLA 2/3/4 outperform the baseline systems. DLA 4 achieves the best performance, with an accuracy of 69-83%. On questions, however, DLA 4 competes with the baseline: whereas it is better on TREC questions, it performs worse on Excite questions. One of the obvious reasons for this is that the parser's model for questions was trained on TREC data. Additionally, as noted earlier, Excite questions are often ungrammatical and make parsing less likely to succeed. However, the baseline system, by definition, does not output semantic representations, so its outcome is of little use for further reasoning, as required by question answering or general information extraction systems.</Paragraph> </Section> <Section position="9" start_page="15" end_page="15" type="metho"> <SectionTitle> 6 Conclusions </SectionTitle> <Paragraph position="0"> We have presented the first empirically grounded study of superlatives, and shown the feasibility of their semantic interpretation in an automatic fashion. Using Combinatory Categorial Grammar and Discourse Representation Theory, we have implemented a system that is able to recognise a superlative expression and its comparison set with high accuracy.</Paragraph> <Paragraph position="1"> For developing and testing our system, we have created a collection of over 3,000 instances of superlatives, both in newswire text and in natural language questions. This very first corpus of superlatives allows us to get a comprehensive picture of the behaviour and distribution of superlatives in naturally occurring data. Thanks to such a broad view of the phenomenon, we were able to discover issues previously unnoted in the formal semantics literature, such as the interaction of prenominal possessives and superlatives, which causes problems at the syntax-semantics interface in the determination of the comparison set.
Similarly problematic are hyphenated superlatives, which are tagged as normal adjectives in the Penn Treebank.</Paragraph> <Paragraph position="2"> Moreover, this work provides a concrete way of evaluating the output of a stochastic wide-coverage parser trained on CCGbank (Hockenmaier and Steedman, 2002). With respect to superlatives, our experiments show that the quality of the raw output is not entirely satisfactory. However, we have also shown that some simple post-processing rules can increase the performance considerably. This might indicate that the way superlatives are annotated in CCGbank, although consistent, is not fully adequate for the purpose of generating meaningful semantic representations, but is probably easy to amend.</Paragraph> </Section> </Paper>