<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1110">
  <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics. Towards Case-Based Parsing: Are Chunks Reliable Indicators for Syntax Trees?</Title>
  <Section position="4" start_page="0" end_page="74" type="intro">
    <SectionTitle>
2 A Memory-Based Parser
</SectionTitle>
    <Paragraph position="0"> The parser in (Kübler, 2004a; Kübler, 2004b) approaches parsing as the task of finding a complete syntax tree rather than incrementally building the tree by rule applications, as in standard PCFGs. Despite this holistic approach of selecting the most similar tree, the parser performs reasonably well: the first column of Table 1 shows the parser's evaluation on German spontaneous speech dialog data. This approach profits from the fact that it has a more global view of parsing than a PCFG parser. In this respect, the memory-based parser employs a strategy similar to the one in Data-Oriented Parsing (DOP) (Bod, 1998; Scha et al., 1999). Both parsers use larger tree fragments than the standard trees. The two approaches differ mainly in two respects: 1) DOP allows different tree fragments to be extracted from one tree, thus making different combinations of fragments available for the assembly of a specific tree. Our parser, in contrast, allows only one clearly defined tree fragment for each tree, in which only the phrase-internal structure is variable. 2) Our parser does not use a probabilistic model, but a simple cost function instead. Both factors in combination result in a nearly deterministic, and thus highly efficient, parsing strategy.

      Table 1: ... (Müller and Ule, 2002; Müller, 2005). The evaluation of KaRoPars is based on chunk annotations only.

                                                    memory-based parser   KaRoPars
      labeled recall (syntactic categories)         82.45%                90.86%
      labeled precision (syntactic categories)      87.25%                90.17%
      F1                                            84.78                 90.51
      labeled recall (incl. gramm. functions)       71.72%                -
      labeled precision (incl. gramm. functions)    75.79%                -
    </Paragraph>
    <Paragraph position="1"> Since the complete tree structure in the memory-based parser is produced in two steps (retrieval of the syntax tree belonging to the most similar sentence and adaptation of this tree to the input sentence), the parser must rely on more information than the local information on which a PCFG parser suggests the next constituent. For this reason, we suggested a backing-off architecture, in which each module used a different type of easily obtainable linguistic information, such as the sequence of words, the sequence of POS tags, or the sequence of chunks. Chunk parsing is a partial parsing approach (Abney, 1991), which is generally implemented as a cascade of finite-state transducers. A chunk parser generally gives an analysis on the clause level and on the phrase level. However, it does not make any decisions concerning the attachment of locally ambiguous phrases. Thus, the German sentence in (1a) receives the chunk annotation in (1b).</Paragraph>
    <Paragraph position="2"> (1) a. In der bewussten Wahrnehmung des Lebens sieht der international angesehene Künstler den Ursprung aller Kreativität.</Paragraph>
    <Paragraph position="3"> 'The internationally recognized artist discerns the origin of all creativity in the conscious perception of life.' b. [PC In der bewussten Wahrnehmung des Lebens] [VCL sieht] [NC der international angesehene Künstler] [NC den Ursprung] [NC aller Kreativität].</Paragraph>
    <Paragraph position="4"> NCs are noun chunks, PC is a prepositional chunk, and VCL is the finite verb chunk. While no attachment decision could be made for the chunks to the right of the verb chunk, the genitive noun phrase des Lebens could be grouped with the PC because of German word order regularities, which allow exactly one constituent in front of the finite verb.</Paragraph>
    <Paragraph position="5"> It can be hypothesized that the selection of the most similar sentence based on sequences of words or POS tags works best for dialog data because of the repetitive nature of such dialogs. The strategy with the greatest potential for generalization to newspaper texts is thus the use of chunk sequences. In the remainder of this paper, we will therefore concentrate on this approach.</Paragraph>
    <Paragraph position="6"> The proposed parser is based on the following architecture: The parser needs a syntactically annotated treebank for training. In the learning phase, the training data are chunk parsed, the chunk sequences are extracted from the chunk parse and fitted to the syntax trees; then the trees are stored in memory. In the annotation phase, the new sentence is chunk parsed. Based on the sequence of chunks, the group of most similar sentences, which all share the same chunk analysis, is retrieved from memory. In a second step, the best sentence from this group needs to be selected, and the corresponding tree needs to be adapted to the input sentence.</Paragraph>
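    <Paragraph> The learning phase and the first step of the annotation phase described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the parser's actual implementation: the representation of a chunk parse as a list of (label, words) pairs, the class name ChunkMemory, and the function names are all hypothetical, and trees are treated as opaque values. The second step (selecting the best sentence from the retrieved group and adapting its tree) is deliberately left out.

```python
from collections import defaultdict


def chunk_sequence(chunk_parse):
    """Reduce a chunk parse to its sequence of chunk labels.

    Assumption: a chunk parse is a list of (label, words) pairs,
    e.g. [("PC", [...]), ("VCL", ["sieht"]), ("NC", [...])].
    """
    return tuple(label for label, _words in chunk_parse)


class ChunkMemory:
    """Hypothetical memory that indexes training trees by chunk sequence."""

    def __init__(self):
        self._by_sequence = defaultdict(list)

    def learn(self, chunk_parse, tree):
        """Learning phase: store the tree under its sentence's chunk sequence."""
        self._by_sequence[chunk_sequence(chunk_parse)].append(tree)

    def retrieve(self, chunk_parse):
        """Annotation phase, first step: return all stored trees whose
        sentences share the chunk sequence of the new sentence."""
        return list(self._by_sequence.get(chunk_sequence(chunk_parse), []))


if __name__ == "__main__":
    memory = ChunkMemory()
    training_parse = [
        ("PC", ["In", "der", "bewussten", "Wahrnehmung", "des", "Lebens"]),
        ("VCL", ["sieht"]),
        ("NC", ["der", "international", "angesehene", "Künstler"]),
        ("NC", ["den", "Ursprung"]),
        ("NC", ["aller", "Kreativität"]),
    ]
    # "tree-1" is a placeholder for the treebank tree of this sentence.
    memory.learn(training_parse, "tree-1")
    print(memory.retrieve(training_parse))
```

    Indexing by the tuple of chunk labels makes retrieval of the candidate group a single dictionary lookup; an empty result corresponds to a chunk sequence that is not present in the training data.</Paragraph>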
    <Paragraph position="7"> The complexity of such a parser crucially depends on the question of whether these chunk sequences are reliable indicators for the correct syntax trees. Basically, there exist two extreme possibilities: 1) most chunk sequences are associated with exactly one sentence, and 2) there is only a small number of different chunk sequences, which are each associated with many sentences. In the first case, the selection of the correct tree based on a chunk sequence is trivial but the coverage of the parser would be rather low. The parser would encounter many sentences with chunk sequences which are not present in the training data. In the second case, in contrast, the coverage of chunk sequences would be good, but then such a chunk sequence would correspond to many different trees. As a consequence, the tree selection process would have to be more elaborate. Both extremes would be extremely difficult for a parser to handle, so in the optimal case, we should have a good coverage of chunk sequences combined with a reasonable number of trees associated with a chunk sequence.</Paragraph>
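    <Paragraph> The trade-off between the two extremes can be made concrete with two simple quantities. The sketch below is an illustration only: the function name sequence_statistics and both definitions are assumptions made here, namely coverage as the fraction of test chunk sequences that also occur in the training data, and ambiguity as the mean number of distinct trees per training chunk sequence. Trees are assumed to be hashable values such as bracketed strings.

```python
from collections import defaultdict


def sequence_statistics(training, test_sequences):
    """Return (coverage, ambiguity) for a set of chunk sequences.

    training:       iterable of (chunk_sequence, tree) pairs
    test_sequences: iterable of chunk sequences from unseen sentences

    coverage:  fraction of test sequences that occur in the training data
    ambiguity: mean number of distinct trees per training chunk sequence
    """
    trees_by_sequence = defaultdict(set)
    for sequence, tree in training:
        trees_by_sequence[sequence].add(tree)

    test_sequences = list(test_sequences)
    seen = sum(1 for sequence in test_sequences if sequence in trees_by_sequence)
    coverage = seen / len(test_sequences) if test_sequences else 0.0

    total_trees = sum(len(trees) for trees in trees_by_sequence.values())
    ambiguity = total_trees / len(trees_by_sequence) if trees_by_sequence else 0.0
    return coverage, ambiguity


if __name__ == "__main__":
    # Toy data; the tree names are placeholders, not treebank trees.
    training = [
        (("NC", "VCL"), "tree-a"),
        (("NC", "VCL"), "tree-b"),
        (("PC", "VCL", "NC"), "tree-c"),
    ]
    test = [("NC", "VCL"), ("VCL",)]
    print(sequence_statistics(training, test))
```

    In the first extreme, ambiguity stays close to 1 while coverage is low; in the second, coverage is high while ambiguity grows large.</Paragraph>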
    <Paragraph position="8"> The investigation of the usefulness of chunk sequences was performed on the data of the German treebank TüBa-D/Z (Telljohann et al., 2004) and on output from KaRoPars, a partial parser for German (Müller and Ule, 2002). In principle, however, the parsing approach is valid for languages ranging from a fixed to a more flexible word order. The German data will be described in more detail in the following section.</Paragraph>
  </Section>
</Paper>