<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1901">
  <Title>The Hinoki Treebank: Working Toward Text Understanding</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The Lexeed Semantic Database of Japanese
</SectionTitle>
    <Paragraph position="0"> The Lexeed Semantic Database of Japanese consists of all Japanese words with a familiarity greater than or equal to five on a seven-point scale (Kasahara et al., 2004). This gives 28,000 words in all, with 46,347 different senses. Definition sentences for these senses were rewritten to use only the 28,000 familiar words (and some function words). The defining vocabulary is actually 16,900 different words (60% of all possible words). An example entry for the first two senses of the word doraibā "driver" is given in Figure 1, with English glosses added (underlined features are those added by Hinoki).</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="2" type="metho">
    <SectionTitle>
3 The Hinoki Treebank
</SectionTitle>
    <Paragraph position="0"> The structure of our treebank is inspired by the Redwoods treebank of English, in which utterances are parsed and the annotator selects the best parse from the full analyses derived by the grammar (Oepen et al., 2002). We had four main reasons for selecting this approach. The first was that we wanted to develop a precise broad-coverage grammar in tandem with the treebank, as part of our research into natural language understanding. Treebanking the output of the parser allows us to immediately identify problems in the grammar, and improving the grammar directly improves the quality of the treebank in a mutually beneficial feedback loop (Oepen et al., 2004).</Paragraph>
    <Paragraph position="1"> The second reason is that we wanted to annotate to a high level of detail, marking not only dependency and constituent structure but also detailed semantic relations. By using a Japanese grammar (JACY: Siegel and Bender (2002)) based on a monostratal theory of grammar (HPSG: Pollard and Sag (1994)) we could simultaneously annotate syntactic and semantic structure without overburdening the annotator. The treebank records the complete syntacto-semantic analysis provided by the HPSG grammar, along with an annotator's choice of the most appropriate parse. From this record, all kinds of information can be extracted at various levels of granularity, in particular traditional syntactic structure (e.g., in the form of labeled trees), dependency relations between words, and full meaning representations using minimal recursion semantics (MRS: Copestake et al. (1999)). A simplified example of the labeled tree, MRS and dependency views for the definition of doraibā2 "driver" is given in Figure 2.</Paragraph>
    <Paragraph position="2"> The third reason was that we expect the use of the grammar as a base to aid in enforcing consistency: all sentences annotated are guaranteed to have well-formed parses. Experience with semi-automatically constructed grammars, such as the Penn Treebank, shows that many inconsistencies remain (around 4,500 types estimated by Dickinson and Meurers (2003)), and the treebank does not allow them to be identified automatically.</Paragraph>
    <Paragraph position="3"> The last reason was the availability of a reasonably robust existing HPSG of Japanese (JACY), and a wide range of open source tools for developing the grammars. We made extensive use of the LKB (Copestake, 2002), a grammar development environment, in order to extend JACY to the domain of defining sentences. We also used the extremely efficient PET parser (Callmeier, 2000), which handles grammars developed using the LKB, to parse large test sets for regression testing, treebanking and finally knowledge acquisition. Most of our development was done within the [incr tsdb()] profiling environment (Oepen and Carroll, 2000). The existing resources enabled us to rapidly develop and test our approach.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.1 Creating and Maintaining the Treebank
</SectionTitle>
      <Paragraph position="0"> The construction of the treebank is a two-stage process. First, the corpus is parsed (in our case using JACY with the PET parser), and then the annotator selects the correct analysis (or occasionally rejects all analyses). Selection is done through a choice of discriminants. The system selects features that distinguish between different parses, and the annotator selects or rejects the features until only one parse is left. The number of decisions for each sentence is proportional to log2 of the number of parses, although sometimes a single decision can reduce the number of remaining parses by more or less than half. In general, even a sentence with 5,000 parses only requires around 12 decisions.</Paragraph>
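The logarithmic number of decisions can be illustrated with a minimal Python sketch of discriminant-based selection. It is not the [incr tsdb()] implementation: the function name `select_parse`, the representation of a parse as a set of properties, and the `bit0`..`bit2` property names are all assumptions made for illustration, and the simulated annotator always answers according to a known correct parse.

```python
import math


def select_parse(parses, oracle):
    """Narrow a set of parses to one by answering yes/no discriminant questions.

    parses: dict mapping parse id to a frozenset of properties (hypothetical features)
    oracle: id of the parse the simulated annotator considers correct
    Returns (sorted remaining parse ids, number of decisions made).
    """
    remaining = dict(parses)
    decisions = 0
    while len(remaining) > 1:
        # Count how many of the remaining parses carry each property.
        counts = {}
        for props in remaining.values():
            for p in props:
                counts[p] = counts.get(p, 0) + 1
        total = len(remaining)
        # A discriminant holds for some remaining parses but not for all of them.
        candidates = [p for p, c in counts.items() if c != total]
        if not candidates:
            break  # the remaining parses cannot be told apart
        # Ask about the discriminant closest to an even split (ties broken by name).
        best = min(candidates, key=lambda p: (abs(2 * counts[p] - total), p))
        keep = best in parses[oracle]
        remaining = {i: props for i, props in remaining.items()
                     if (best in props) == keep}
        decisions += 1
    return sorted(remaining), decisions


# Toy forest: 8 parses, each described by three hypothetical binary properties.
parses = {
    i: frozenset(f"bit{k}" for k in range(3) if (i // 2**k) % 2 == 1)
    for i in range(8)
}
result = select_parse(parses, oracle=5)
print(result)                      # ([5], 3): three decisions for 2**3 parses
print(round(math.log2(5000), 2))   # 12.29, in line with "around 12 decisions"
```

When every discriminant splits the remaining parses evenly, the loop needs exactly log2 of the number of parses; uneven splits give more or fewer decisions, as noted above.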
      <Paragraph position="1"> Because the disambiguating choices made by the annotators are saved, it is possible to update the treebank when the grammar changes (Oepen et al., 2004). Although the trees depend on the grammar, re-annotation is only necessary in cases where either the parse has become more ambiguous, so new decisions have to be made, or existing rules or lexical items have changed so much that the system cannot reconstruct the parse.</Paragraph>
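The update step can be sketched as replaying the stored decisions against the parses produced by the revised grammar. This is a hedged illustration only: `replay_decisions`, the three status labels, and the property names such as `np-attach` are hypothetical and do not come from the Redwoods or [incr tsdb()] tools.

```python
def replay_decisions(new_parses, saved_decisions):
    """Re-apply stored discriminant decisions to the parses of an updated grammar.

    new_parses: dict mapping parse id to a set of properties (hypothetical)
    saved_decisions: list of (property, accepted) pairs recorded from the annotator
    Returns (status, sorted remaining ids), where status is
      "resolved"   exactly one parse survives, so no re-annotation is needed
      "ambiguous"  several parses survive, so new decisions have to be made
      "rejected"   no parse survives, so the old parse cannot be reconstructed
    """
    remaining = set(new_parses)
    for prop, accepted in saved_decisions:
        remaining = {i for i in remaining if (prop in new_parses[i]) == accepted}
    if len(remaining) == 1:
        status = "resolved"
    elif remaining:
        status = "ambiguous"
    else:
        status = "rejected"
    return status, sorted(remaining)


# Decisions saved from an earlier annotation pass (illustrative property names).
saved = [("np-attach", True), ("verb-sense-2", False)]

unchanged = {"p1": {"np-attach"},
             "p2": {"np-attach", "verb-sense-2"},
             "p3": {"vp-attach"}}
more_ambiguous = dict(unchanged, p4={"np-attach", "new-rule"})
rules_changed = {"q1": {"vp-attach"}, "q2": {"verb-sense-2"}}

print(replay_decisions(unchanged, saved))       # ('resolved', ['p1'])
print(replay_decisions(more_ambiguous, saved))  # ('ambiguous', ['p1', 'p4'])
print(replay_decisions(rules_changed, saved))   # ('rejected', [])
```

Only the second and third outcomes would send the sentence back to an annotator, which corresponds to the two re-annotation cases described above.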
      <Paragraph position="2"> One concern that has been raised with Redwoods-style treebanking is the fact that the treebank is tied to a particular implementation of a grammar. The ability to update the treebank alleviates this concern to a large extent. A more serious concern is that it is only possible to annotate those trees that the grammar can parse. Sentences for which no analysis had been implemented in the grammar or which fail to parse due to processing constraints are left unannotated. This makes grammar coverage an urgent issue. However, dictionary definition sentences are more repetitive than newspaper text. In addition, there is little reference to outside context, and Lexeed has a fixed defining vocabulary. This makes it a relatively easy domain to work with.</Paragraph>
      <Paragraph position="5"> We extended JACY by adding the defining vocabulary, and added some new rules and lexical types (more detail is given in Bond et al. (2004a)).1 Almost none of the rules are specific to the dictionary domain. The grammatical coverage over all sentences when we began to treebank was 84%, and it is currently being increased further as we work on the grammar. We have now treebanked all definition sentences for words with a familiarity greater than or equal to 6.0. This came to 38,900 sentences with an average length of 6.7 words/sentence. The extended JACY grammar is available for download from www.dfki.uni-sb.de/~siegel/grammar-download/JACY-grammar.html.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="2" end_page="2" type="metho">
    <SectionTitle>
4 Applications
</SectionTitle>
    <Paragraph position="0"> The treebanked data and grammar have been tested in two ways. The first is in training a stochastic model for parse selection. The second is in building a thesaurus from the parsed data.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.1 Stochastic Parse Ranking
</SectionTitle>
      <Paragraph position="0"> Using the treebanked data, we built a stochastic parse ranking model with [incr tsdb()]. The ranker uses a maximum entropy learner to train a PCFG over the parse derivation trees, with the current node as a conditioning feature. The correct parse is selected 61.7% of the time (training on 4,000 sentences and testing on another 1,000; evaluated per sentence). More feature-rich models using parent and grandparent nodes along with models trained on the MRS representations have been proposed and implemented with an English grammar and the Redwoods treebank (Oepen et al., 2002). We intend to include such features, as well as adding our own extensions to train on constituent weight and semantic class.</Paragraph>
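As a rough illustration of how such a ranker is applied and evaluated, the following Python sketch scores candidate derivation trees with a log-linear model over local-rule features and computes per-sentence exact-match accuracy. It does not reproduce the [incr tsdb()] learner: the tree encoding, the function names and the hand-set weights are assumptions, and no training is performed.

```python
import math
from collections import Counter


def rule_features(tree):
    """Count local rules (node label plus child labels) in a derivation tree.

    A tree is a (label, children) pair; leaves have an empty child list.
    """
    label, children = tree
    feats = Counter()
    if children:
        feats[(label, tuple(child[0] for child in children))] += 1
        for child in children:
            feats.update(rule_features(child))
    return feats


def parse_probabilities(candidates, weights):
    """Conditional log-linear distribution over the candidate parses of one sentence."""
    scores = [sum(weights.get(f, 0.0) * n for f, n in rule_features(t).items())
              for t in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def exact_match_accuracy(items, weights):
    """Fraction of sentences whose top-ranked parse is the annotated one."""
    correct = 0
    for candidates, gold_index in items:
        probs = parse_probabilities(candidates, weights)
        if probs.index(max(probs)) == gold_index:
            correct += 1
    return correct / len(items)


# Two toy analyses of the same string, differing in attachment.
t_a = ("S", [("NP", []), ("VP", [("V", []), ("NP", [])])])
t_b = ("S", [("NP", []), ("VP", [("V", [])]), ("NP", [])])

# Hand-set illustrative weights; a real model would learn these from the treebank.
weights = {("VP", ("V", "NP")): 1.0, ("S", ("NP", "VP", "NP")): -0.5}

# Each item pairs the candidate list with the index of the annotated parse.
items = [([t_a, t_b], 0), ([t_b, t_a], 0)]
probs = parse_probabilities([t_a, t_b], weights)
print(exact_match_accuracy(items, weights))  # 0.5
```

Richer feature sets, such as the parent and grandparent features mentioned above, would only change `rule_features`; the scoring and evaluation steps stay the same.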
      <Paragraph position="1"> 1 We benefited greatly from advice from the main JACY developers: Melanie Siegel and Emily Bender.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.2 Knowledge Acquisition
</SectionTitle>
      <Paragraph position="0"> We selected dictionary definitions as our first corpus in order to use them to acquire lexical and ontological knowledge. Currently we are classifying hypernym, hyponym, synonym and domain relationships in addition to linking senses to an existing ontology. Our approach is described in more detail in Bond et al. (2004b). The main difference between our research and earlier approaches, such as Tsurumaru et al. (1991), is that we are fully parsing the input, not just using regular expressions. Parsing sentences to a semantic representation (Minimal Recursion Semantics, Copestake et al. (1999)) has three advantages. The first is that it makes our knowledge acquisition somewhat language independent: if we have a parser for some language that can produce MRS, and a dictionary, the algorithm can easily be ported. The second is that we can go on to use the same system to acquire knowledge from non-dictionary sources, which will not be as regular as dictionaries and thus harder to parse using only regular expressions. The third is that we can more easily acquire knowledge beyond simple hypernyms, for example, identifying synonyms through common definition patterns (Tsuchiya et al., 2001).</Paragraph>
      <Paragraph position="1"> To extract hypernyms, we parse the first definition sentence for each sense. The parser uses the stochastic parse ranking model learned from the Hinoki treebank, and returns the MRS of the first-ranked parse. Currently, 84% of the sentences can be parsed. In most cases, the word with the highest scope in the MRS representation will be the hypernym. For example, for doraibā1 the hypernym is dōgu "tool" and for doraibā2 the hypernym is hito "person" (see Figure 1). Although the actual hypernym is in very different positions in the Japanese and English definition sentences, it takes the highest scope in both their semantic representations. For some definition sentences (around 20%), further parsing of the semantic representation is necessary. For example, ana1 is defined as "the abbreviation of 'announcer'" (translated into English). In this case abbreviation has the highest scope but is an explicit relation. We therefore parse to find its complement and extract the relationship abbreviation(ana1,announcer1). The semantic representation is largely language independent. In order to port the extraction to another language, we only have to know the semantic relation for abbreviation.</Paragraph>
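The extraction rule can be sketched in Python on a deliberately simplified stand-in for the semantic representation. This is not real MRS and not the authors' code: the dictionary layout with `top` and `args`, the function `extract_relation`, and the contents of `EXPLICIT_RELATIONS` are assumptions for illustration, with the examples taken from the text above.

```python
# Assumed inventory of explicit relations; the text names only "abbreviation".
EXPLICIT_RELATIONS = {"abbreviation"}


def extract_relation(headword, definition):
    """Return (relation, headword, target) from a simplified definition analysis.

    definition: dict with
      "top":  the predicate that takes the highest scope
      "args": dict mapping a predicate to its complement predicate, if any
    Returns None when an explicit relation has no recoverable complement.
    """
    top = definition["top"]
    if top in EXPLICIT_RELATIONS:
        # The top predicate names the relation; its complement is the target.
        target = definition["args"].get(top)
        if target is None:
            return None
        return (top, headword, target)
    # Default case: the highest-scoped word is taken to be the hypernym.
    return ("hypernym", headword, top)


print(extract_relation("doraibā1", {"top": "tool", "args": {}}))
# ('hypernym', 'doraibā1', 'tool')
print(extract_relation("doraibā2", {"top": "person", "args": {}}))
# ('hypernym', 'doraibā2', 'person')
print(extract_relation(
    "ana1", {"top": "abbreviation", "args": {"abbreviation": "announcer1"}}))
# ('abbreviation', 'ana1', 'announcer1')
```

Porting this sketch to another language would only require changing the entries of `EXPLICIT_RELATIONS`, which mirrors the language-independence argument made above.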
      <Paragraph position="2"> We evaluate the extracted pairs by comparison with an existing thesaurus: Goi-Taikei (Ikehara et al., 1997). Currently 58.5% of the pairs extracted for nouns are linked to nodes in the Goi-Taikei ontology (Bond et al., 2004b). In general, we are extracting pairs with more information than the Goi-Taikei hierarchy of 2,710 classes. In particular, many classes contain a mixture of class names and instance names: buta niku "pork" and niku "meat" are in the same class, as are "drum" and dagakki "percussion instrument", which we can now distinguish.</Paragraph>
    </Section>
  </Section>
</Paper>