<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1307">
  <Title>A Uniform Method of Grammar Extraction and Its Applications</Title>
  <Section position="4" start_page="0" end_page="53" type="metho">
    <SectionTitle>
2 LTAG formalism
</SectionTitle>
    <Paragraph position="0"> The primitive elements of an LTAG are elementary trees (etrees). Each etree is associated with a lexical item (called the anchor of the tree) on its frontier. We choose LTAGs as our target grammars (i.e., the grammars to be extracted) because LTAGs possess many desirable properties, such as the Extended Domain of Locality, which allows the encapsulation of all arguments of the anchor associated with an etree. There are two types of etrees:</Paragraph>
    <Paragraph position="1"> initial trees and auxiliary trees. An auxiliary tree represents recursive structure and has a unique leaf node, called the foot node, which has the same syntactic category as the root node. Leaf nodes other than anchor nodes and foot nodes are substitution nodes. Etrees are combined by two operations: substitution and adjunction, as in Figures 1 and 2. The resulting structure of the combined etrees is called a derived tree. The combination process is expressed as a derivation tree.</Paragraph>
    <Paragraph position="2">  Figure 3 shows the etrees, the derived tree, and the derivation tree for the sentence underwriters still draft policies. Foot and substitution nodes are marked by * and $, respectively. The dashed and solid lines in the derivation tree are for adjunction and substitution operations, respectively.</Paragraph>
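The two combination operations can be sketched in a few lines of Python. This is an illustrative toy, not LexTract's code: the Node class, its field names, and the example trees for underwriters still draft policies are our own inventions.

```python
class Node:
    def __init__(self, label, children=None, subst=False, foot=False, anchor=None):
        self.label = label              # syntactic category, e.g. "NP"
        self.children = children or []  # empty for leaves
        self.subst = subst              # substitution node (marked "$")
        self.foot = foot                # foot node (marked "*") of an auxiliary tree
        self.anchor = anchor            # lexical anchor, e.g. "draft"

    def leaves(self):
        if not self.children:
            return [self.anchor] if self.anchor else []
        return [w for c in self.children for w in c.leaves()]

def substitute(target, initial):
    """Replace substitution node `target` with the root of `initial` (labels match)."""
    assert target.subst and target.label == initial.label
    target.children, target.anchor, target.subst = initial.children, initial.anchor, False

def adjoin(target, aux):
    """Splice auxiliary tree `aux` in at node `target`: the subtree originally
    below `target` moves under aux's foot node (same category as aux's root)."""
    def find_foot(n):
        if n.foot:
            return n
        for c in n.children:
            f = find_foot(c)
            if f is not None:
                return f
        return None
    foot = find_foot(aux)
    assert foot.label == target.label == aux.label
    foot.children, foot.anchor, foot.foot = target.children, target.anchor, False
    target.children, target.anchor = aux.children, aux.anchor

# Roughly the etrees of Figure 3: a tree anchored by "draft" with two NP
# substitution slots, two NP trees anchored by the nouns, and "still" adjoining at VP.
np1 = Node("NP", subst=True)
np2 = Node("NP", subst=True)
vp = Node("VP", [Node("V", anchor="draft"), np2])
alpha1 = Node("S", [np1, vp])
substitute(np1, Node("NP", [Node("N", anchor="underwriters")]))
substitute(np2, Node("NP", [Node("N", anchor="policies")]))
beta = Node("VP", [Node("ADVP", anchor="still"), Node("VP", foot=True)])
adjoin(vp, beta)
```

Combining the four etrees this way yields a derived tree whose frontier is the original sentence.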
  </Section>
  <Section position="5" start_page="53" end_page="57" type="metho">
    <SectionTitle>
3 System Overview
</SectionTitle>
    <Paragraph position="0"> We have built a system, called LexTract, for grammar extraction. The architecture of LexTract is shown in Figure 4 (the parts that will be discussed in this paper are in bold). The core of LexTract is an extraction algorithm that takes a Treebank sentence such as the one in Figure 5 and produces the trees (elementary trees, derived trees and derivation trees) such as the ones in Figure 3.</Paragraph>
    <Section position="1" start_page="53" end_page="54" type="sub_section">
      <SectionTitle>
3.1 The Form of Target Grammars
</SectionTitle>
      <Paragraph position="0"> Without further constraints, the etrees in the target grammar could be of various shapes.</Paragraph>
      <Paragraph position="1"> [Figure 3: the etrees, the derived tree, and the derivation tree for underwriters still draft policies]</Paragraph>
      <Paragraph position="5"> Our system recognizes three types of relation (namely, predicate-argument, modification, and coordination relations) between the anchor of an etree and other nodes in the etree, and imposes the constraint that all the etrees to be extracted should fall into exactly one of the three patterns in Figure 6.</Paragraph>
      <Paragraph position="6"> * The spine-etrees for predicate-argument relations. X^0 is the head of X^m and the anchor of the etree. The etree is formed by a spine X^m → X^{m-1} → ... → X^0 and the arguments of X^0.</Paragraph>
      <Paragraph position="7"> * The mod-etrees for modification relations. The root of the etree has two children: one is a foot node with the label W^q, and the other node X^m is a modifier of the foot node. X^m is further expanded into a spine-etree whose head X^0 is the anchor of the whole mod-etree.</Paragraph>
      <Paragraph position="8"> * The conj-etrees for coordination relations. In a conj-etree, the children of the root are two conjoined constituents and a node for a coordination conjunction. One conjoined constituent is marked as the foot node, and the other is expanded into a spine-etree whose head is the anchor of the whole tree.</Paragraph>
      <Paragraph position="9"> Spine-etrees are initial trees, whereas mod-etrees and conj-etrees are auxiliary trees.</Paragraph>
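The three shape constraints above can be checked mechanically. The sketch below is a rough illustration, not LexTract's implementation: it uses an invented dict encoding and inspects only the root configuration of a template.

```python
def etree_type(root):
    """Classify a template by its root configuration (a rough check only)."""
    kids = root.get("kids", [])
    feet = [k for k in kids if k.get("foot")]
    if not feet:
        return "spine"                  # initial tree: predicate-argument
    if any(k["cat"] == "CC" for k in kids) and len(kids) >= 3 \
            and feet[0]["cat"] == root["cat"]:
        return "conj"                   # coordination: foot + conjunction + spine
    if len(kids) == 2 and feet[0]["cat"] == root["cat"]:
        return "mod"                    # modification: foot + modifier spine
    return "unknown"

# Hypothetical examples of the three patterns:
spine_ex = {"cat": "S", "kids": [{"cat": "NP"}, {"cat": "VP", "kids": [{"cat": "V"}]}]}
mod_ex = {"cat": "VP", "kids": [{"cat": "ADVP"}, {"cat": "VP", "foot": True}]}
conj_ex = {"cat": "NP",
           "kids": [{"cat": "NP", "foot": True}, {"cat": "CC"}, {"cat": "NP"}]}
```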
    </Section>
    <Section position="2" start_page="54" end_page="54" type="sub_section">
      <SectionTitle>
3.2 Treebank-specific information
</SectionTitle>
      <Paragraph position="0"> The phrase structures in the Treebank (ttrees for short) are partially bracketed in the sense that arguments and modifiers are not structurally distinguished. In order to construct the etrees, which make such a distinction, LexTract requires its user to provide additional information in the form of three tables: a Head Percolation Table, an Argument Table, and a Tagset Table.</Paragraph>
      <Paragraph position="1"> A Head Percolation Table has previously been used in several statistical parsers (Magerman, 1995; Collins, 1997) to find the heads of phrases. Our strategy for choosing heads is similar to the one in (Collins, 1997). An Argument Table informs LexTract what types of arguments a head can take. The Tagset Table specifies which function tags always mark arguments (or adjuncts, or heads, respectively). LexTract marks each sibling of a head as an argument if the sibling can be an argument of the head according to the Argument Table and none of the function tags of the sibling indicates that it is an adjunct. For example, in Figure 5, the head of the root S is the verb draft, and the verb has two siblings: the noun phrase policies is marked as an argument of the verb because from the Argument Table we know that verbs in general can take an NP object; the clause is marked as a modifier of the verb because, although verbs in general can take a sentential argument, the Tagset Table informs LexTract that the function tag -MNR (manner) always marks a modifier.</Paragraph>
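As a rough illustration of how the three tables might drive this marking, here is a hedged sketch. The table contents, function names, and fallback behavior are invented; the real tables are much larger and Treebank-specific.

```python
# Hypothetical fragments of the three user-supplied tables:
HEAD_TABLE = {"S": ["VP", "VBP"], "VP": ["VBP", "VB"], "NP": ["NN", "NNS"]}
ARG_TABLE = {"VBP": {"NP", "S"}}        # a VBP head may take NP or S arguments
ADJUNCT_FTAGS = {"MNR", "TMP", "LOC"}   # function tags that always mark adjuncts

def find_head(label, child_labels):
    """Pick the head child by scanning the percolation list for `label`."""
    for cand in HEAD_TABLE.get(label, []):
        for i, c in enumerate(child_labels):
            if c.split("-")[0] == cand:
                return i
    return 0  # fall back to the leftmost child (an assumption of this sketch)

def mark_siblings(head_label, sibling_labels):
    """Mark each sibling of the head as 'arg' or 'adjunct'."""
    marks = []
    for lab in sibling_labels:
        cat, *ftags = lab.split("-")
        if cat in ARG_TABLE.get(head_label, set()) and not (set(ftags) & ADJUNCT_FTAGS):
            marks.append("arg")
        else:
            marks.append("adjunct")
    return marks
```

On the example from Figure 5, `mark_siblings("VBP", ["NP", "S-MNR"])` marks the NP as an argument and the -MNR clause as an adjunct.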
    </Section>
    <Section position="3" start_page="54" end_page="55" type="sub_section">
      <SectionTitle>
3.3 Overview of the Extraction
Algorithm
</SectionTitle>
      <Paragraph position="0"> The extraction process has three steps: first, LexTract fully brackets each ttree; second, LexTract decomposes the fully bracketed ttree into a set of etrees; third, LexTract builds the derivation tree for the ttree.</Paragraph>
      <Paragraph position="1">  3.3.1 Fully bracketing ttrees  As just mentioned, the ttrees in the Treebank do not explicitly distinguish arguments and modifiers, whereas etrees do. To account for this difference, we first fully bracket the ttrees by adding intermediate nodes so that at each level, one of the following relations holds between the head and its siblings: (1) head-argument relation, (2) modification relation, and (3) coordination relation. LexTract achieves this by first choosing the head-child at each level and distinguishing arguments from adjuncts with the help of the three tables mentioned in Section 3.2, then adding intermediate nodes so that the modifiers and arguments of a head attach to different levels.  3.3.2 Building etrees  In this step, LexTract removes recursive structures - which will become mod-etrees or conj-etrees - from the fully bracketed ttrees and builds spine-etrees for the non-recursive structures. Starting from the root of a fully bracketed ttree, LexTract first finds a unique path from the root to its head. It then checks each node e on the path. If a sibling of e in the ttree is marked as a modifier, LexTract marks e and e's parent, and builds a mod-etree (or a conj-etree if e has another sibling which is a conjunction) with e's parent as the root node, e as the foot node, and e's siblings as the modifier. Next, LexTract creates a spine-etree with the remaining unmarked nodes on the path and their siblings. Finally, LexTract repeats this process for the nodes that are not on the path. In Figure 8, where each node is split into top (.t) and bottom (.b) parts, the path from the root S1 to the head VBP is:</Paragraph>
      <Paragraph position="3"> Along the path, the PP at FNX is a modifier of S2; therefore, S1.b, S2.t, and the spine-etree rooted at PP form a mod-etree #1. Similarly, the ADVP still is a modifier of VP2 and S3 is a modifier of VP3, and the corresponding structures form mod-etrees #4 and #7. On the path from the root to VBP, S1.t and S2.b are merged (and so are VP1.t and VP3.b) to form the spine-etree #5. Repeating this process for other nodes will generate other trees such as trees #2, #3 and #6. The whole ttree yields twelve etrees as shown in Figure 9.</Paragraph>
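The walk down the head path can be sketched in code. This is a toy under stated assumptions: it handles only the head path of a tree whose levels are pre-annotated as head-argument ("args") or modification ("mod") levels, and the tuple encoding is our own, not LexTract's.

```python
def factor(node, spine=None, mods=None):
    """Walk the head path; 'mod' levels peel off as (root_label, modifier_label)
    mod-etrees, other levels stay on the spine of the anchor's spine-etree."""
    spine = [] if spine is None else spine
    mods = [] if mods is None else mods
    if isinstance(node, str):               # reached the lexical anchor
        return spine, mods
    label, kind, head_ix, kids = node       # (category, level kind, head index, children)
    if kind == "mod":
        modifier = [k for i, k in enumerate(kids) if i != head_ix][0]
        mods.append((label, modifier[0] if isinstance(modifier, tuple) else modifier))
    else:
        spine.append(label)
    return factor(kids[head_ix], spine, mods)

# A fully bracketed ttree for "underwriters still draft policies":
v = ("V", "args", 0, ["draft"])
np_obj = ("NP", "args", 0, [("N", "args", 0, ["policies"])])
vp_inner = ("VP", "args", 0, [v, np_obj])
advp = ("ADVP", "args", 0, ["still"])
vp_mod = ("VP", "mod", 1, [advp, vp_inner])
np_subj = ("NP", "args", 0, [("N", "args", 0, ["underwriters"])])
s_tree = ("S", "args", 1, [np_subj, vp_mod])
```

In the full algorithm the off-path subtrees (here the subject NP and the ADVP) would be factored recursively in the same way; this sketch only shows the split into one spine plus one mod-etree per modifier level.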
      <Paragraph position="4"> 3.3.3 Building derivation trees  The fully bracketed ttree is in fact a derived tree of the sentence if the sentence is parsed with the etrees extracted by LexTract. [Footnote 1: When two etrees are combined during parsing, the root of one etree is merged with a node in the other etree. Splitting nodes into top and bottom pairs during the decomposition of the derived tree is the reverse process of merging nodes during parsing. For the sake of simplicity, we show the top and the bottom parts of a node only when the two parts will end up in different etrees.]</Paragraph>
      <Paragraph position="5"> In addition to these etrees and the derived tree, we also need derivation trees to train statistical LTAG parsers. Recall that, in general, given a derived tree, the derivation tree that can generate the derived tree may not be unique.</Paragraph>
      <Paragraph position="6"> Nevertheless, given the fully bracketed ttree, the etrees, and the positions of the etrees in the ttree (see Figure 8), the derivation tree becomes unique if we choose either one of the following: * We adopt the traditional definition of derivation trees (which allows at most one adjunction at any node) and add an additional constraint which says that no adjunction operation is allowed at the foot node of any auxiliary tree. * We adopt the definition of derivation trees in (Schabes and Shieber, 1992) (which allows multiple adjunctions at any node) and require that all mod-etrees adjoin to the etree that they modify.</Paragraph>
      <Paragraph position="7"> The user of LexTract can choose either option and inform LexTract about his choice by setting a parameter. Figure 10 shows the derivation tree based on the second option.</Paragraph>
      <Paragraph position="9"/>
    </Section>
    <Section position="4" start_page="55" end_page="56" type="sub_section">
      <SectionTitle>
3.4 Uniqueness of decomposition
</SectionTitle>
      <Paragraph position="0"> To summarize, LexTract is a language-independent grammar extraction system, which takes Treebank-specific information (see Section 3.2) and a ttree T, and creates (1) a fully bracketed ttree T*, (2) a set Eset of etrees, and (3) a derivation tree D for T*.</Paragraph>
      <Paragraph position="1"> [Footnote 2: If adjunction were allowed at foot nodes, #4 could adjoin to #7 at VP2.b, and #7 would adjoin to #5 at VP3.b. An alternative is for #4 to adjoin to #5 at VP3.b and for #7 to adjoin to #4 at VP2.t. The no-adjunction-at-foot-node constraint would rule out the latter alternative and make the derivation tree unique. Note that this constraint has been adopted by several hand-crafted grammars such as the XTAG grammar for English (XTAG-Group, 1998), because it eliminates this source of spurious ambiguity.]</Paragraph>
      <Paragraph position="2"> [Footnote 3: This decision may affect the parsing accuracy of an LTAG parser which uses the derivation trees for training, but it will not affect the results reported in this paper.]</Paragraph>
      <Paragraph position="3"> Furthermore, Eset is the only tree set that satisfies all the following conditions: (C1) Decomposition: The tree set is a decomposition of T*, that is, T* would be generated if the trees in the set were combined via the substitution and adjunction operations.</Paragraph>
      <Paragraph position="4"> (C2) LTAG formalism: Each tree in the set is a valid etree, according to the LTAG  formalism. For instance, each tree should be lexicalized and the arguments of the anchor should be encapsulated in the same etree.</Paragraph>
      <Paragraph position="5"> (C3) Target grammar: Each tree in the set falls into one of the three types as specified in Section 3.1.</Paragraph>
      <Paragraph position="6">  This uniqueness of the tree set may be quite surprising at first sight, considering that the number of possible decompositions of T* is Ω(2^n), where n is the number of nodes in T*. [Footnote 4: Recall that the process of building etrees has two steps. First, LexTract treats each node as a pair of top and bottom parts. The ttree is cut into pieces along the boundaries of the top and bottom parts of some nodes. The top and the bottom parts of each node belong to either two distinct pieces or one piece; as a result, there are 2^n distinct partitions. Second, some non-adjacent pieces in a partition can be glued together to form a bigger piece. Therefore, each partition will result in one or more decompositions of the ttree. In total, there are at least 2^n decompositions of the ttree.] Instead of giving a proof of the uniqueness,</Paragraph>
      <Paragraph position="7"> we use an example to illustrate how the conditions (C1)-(C4) rule out all the decompositions except the one produced by LexTract. In Figure 11, the ttree T* has 5 nodes (i.e., S, NP, N, VP, and V). There are 32 distinct decompositions for T*, 6 of which are shown in the same figure. Out of these 32 decompositions, only five (i.e., E2-E6) are fully lexicalized -- that is, each tree in these tree sets is anchored by a lexical item. The rest, including E1, are not fully lexicalized, and are therefore ruled out by the condition (C2). Of the remaining five etree sets, E2-E4 are ruled out by the condition (C3), because each of these tree sets has one tree that violates the constraint which says that in a spine-etree an argument of the anchor should be a substitution node, rather than an internal node. Of the remaining two, E5 is ruled out by (C4) because according to the Head Percolation Table provided by the user, the head of the S node should be V, not N. Therefore, E6, the tree set that is produced by LexTract, is the only etree set for T* that satisfies (C1)-(C4).</Paragraph>
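Condition (C2) in particular is easy to state as code. The sketch below uses a toy tuple encoding and our own convention that lexical items are lowercase; it checks whether every tree in a candidate decomposition carries an anchor.

```python
def frontier(tree):
    """Leaf labels of a (label, children) tree, left to right."""
    label, kids = tree
    return [label] if not kids else [w for k in kids for w in frontier(k)]

def is_lexical(leaf):
    return leaf.islower()   # toy convention: words lowercase, categories uppercase

def fully_lexicalized(decomposition):
    """(C2): every tree must have at least one lexical item on its frontier."""
    return all(any(is_lexical(l) for l in frontier(t)) for t in decomposition)

# Like E1 in Figure 11: one tree has no lexical anchor, so the set is ruled out.
e1_set = [("S", [("NP", []), ("VP", [])]),
          ("NP", [("N", [("underwriters", [])])])]
# Like E6: every tree is anchored.
e6_set = [("S", [("NP", []), ("VP", [("V", [("draft", [])]), ("NP", [])])]),
          ("NP", [("N", [("underwriters", [])])])]
```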
    </Section>
    <Section position="5" start_page="56" end_page="57" type="sub_section">
      <SectionTitle>
3.5 The Experiments
</SectionTitle>
      <Paragraph position="0"> We have run LexTract on the one-million-word English Penn Treebank (Marcus et al., 1993) and obtained two Treebank grammars.</Paragraph>
      <Paragraph position="1"> The first one, G1, uses the Treebank's tagset. The second Treebank grammar, G2, uses a reduced tagset, where some tags in the Treebank tagset are merged into a single tag. For example, the tags for verbs, MD/VB/VBP/VBZ/VBN/VBD/VBG, are merged into a single tag V. The reduced tagset is basically the same as the tagset used in the XTAG grammar (XTAG-Group, 1998). G2 is built so that we can compare it with the XTAG grammar, as will be discussed in the next section. We also ran the system on the 100-thousand-word Chinese Penn Treebank (Xia et al., 2000b) and on a 30-thousand-word Korean Penn Treebank.</Paragraph>
      <Paragraph position="2"> The sizes of the extracted grammars are shown in Table 1. (For more discussion on the Chinese and the Korean Treebanks and the comparison between these Treebank grammars, see (Xia et al., 2000a).) The second column of the table lists the numbers of unique templates in each grammar, where templates are etrees with the lexical items removed. The third column shows the numbers of unique etrees. The average numbers of etrees for each word type in G1 and G2 are 2.67 and 2.38 respectively. Because frequent words often anchor many etrees, the numbers increase by more than 10 times when we consider word tokens, as shown in the fifth and sixth columns of the table. G3 and G4 are much smaller than G1 and G2 because the Chinese and the Korean Treebanks are much smaller than the English Treebank.</Paragraph>
      <Paragraph position="3"> In addition to LTAGs, by reading context-free rules off the etrees of a Treebank LTAG, LexTract also produces CFGs. The numbers of unlexicalized context-free rules from G1-G4 are shown in the last column of Table 1. Compared with other CFG extraction algorithms such as the one in (Krotov et al., 1998), the CFGs produced by LexTract have several good properties. For example, they allow unary rules and epsilon rules; they are more compact; and the size of the grammar remains monotonic as the Treebank grows.</Paragraph>
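Reading context-free rules off a template is a straightforward traversal: every internal node contributes one rule. A possible sketch, using a toy (label, children) tuple encoding and unlexicalized labels only:

```python
def cfg_rules(tree, rules=None):
    """Collect one rule (parent, children) per internal node of a template."""
    rules = set() if rules is None else rules
    label, kids = tree
    if kids:
        rules.add((label, tuple(k[0] for k in kids)))
        for k in kids:
            cfg_rules(k, rules)
    return rules

# A hypothetical template S(NP, VP(V, NP)) yields two unlexicalized rules.
template = ("S", [("NP", []), ("VP", [("V", []), ("NP", [])])])
```

Accumulating these rule sets over all templates in a grammar gives the counts in the last column of Table 1; duplicates across templates collapse because the rules are kept in a set.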
      <Paragraph position="4"> Figure 12 shows the log frequency of templates and the percentage of template tokens covered by template types in G1. In both cases, template types are sorted according to their frequencies and plotted on the X-axes.</Paragraph>
      <Paragraph position="5"> The figure indicates that a small portion of template types, which can be seen as the core of the grammar, cover the majority of template tokens in the Treebank. For example, the first 100 (500, 1000 and 1500, resp.) templates cover 87.1% (96.6%, 98.4% and 99.0%, resp.) of the tokens, whereas about half (3411) of the templates each occur only once, accounting for only 0.29% of template tokens in total.</Paragraph>
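The coverage curve behind Figure 12 is simple to reproduce for any frequency list; a sketch with made-up frequencies:

```python
def coverage(freqs, k):
    """Fraction of template tokens covered by the k most frequent types."""
    return sum(sorted(freqs, reverse=True)[:k]) / sum(freqs)

# Hypothetical frequency list: 5 template types, 100 template tokens in total.
toy_freqs = [50, 30, 15, 4, 1]
```

Sorting the real 6000-odd template frequencies of G1 and plotting `coverage` against k yields the right-hand curve in Figure 12.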
    </Section>
  </Section>
  <Section position="6" start_page="57" end_page="60" type="metho">
    <SectionTitle>
4 Applications of LexTract
</SectionTitle>
    <Paragraph position="0"> In addition to extracting LTAGs and CFGs, LexTract has been used to perform the following tasks:  * We use the Treebank grammars produced by LexTract to evaluate the coverage of hand-crafted grammars.</Paragraph>
    <Paragraph position="1"> * We use the (word, template) sequence produced by LexTract to re-train Srinivas' Supertaggers (Srinivas, 1997). * The derivation trees created by LexTract are used to train a statistical LTAG  parser (Sarkar, 2000). LexTract output has also been used to train an LR LTAG parser (Prolo, 2000).</Paragraph>
    <Paragraph position="2">  * LexTract checks the plausibility of extracted etrees by decomposing each etree into substructures and checking them. Implausible etrees are often caused by Treebank annotation errors. Because LexTract maintains the mappings between etree nodes and ttree nodes, it can detect certain types of annotation errors. We have used LexTract for the final cleanup of the Penn Chinese Treebank.</Paragraph>
    <Paragraph position="3"> Due to space limitation, in this paper we will only discuss the first two tasks.</Paragraph>
    <Section position="1" start_page="57" end_page="59" type="sub_section">
      <SectionTitle>
4.1 Evaluating the coverage of hand-crafted grammars
</SectionTitle>
      <Paragraph position="0"> The XTAG grammar (XTAG-Group, 1998) is a hand-crafted large-scale grammar for English, which has been developed at the University of Pennsylvania over the last decade. It has been used in many NLP applications such as generation (Stone and Doran, 1997). Evaluating the coverage of such a grammar is important for both its developers and its users.</Paragraph>
      <Paragraph position="1"> Previous evaluations (Doran et al., 1994; Srinivas et al., 1998) of the XTAG grammar use raw data (i.e., a set of sentences without syntactic bracketing). The data are first parsed by an LTAG parser and the coverage of the grammar is measured as the percentage of sentences in the data that get at least one parse, which is not necessarily the correct parse. For more discussion on this approach, see (Prasad and Sarkar, 2000).</Paragraph>
      <Paragraph position="2"> We propose a new evaluation method that takes advantage of Treebanks and LexTract.</Paragraph>
      <Paragraph position="3"> The idea is as follows: given a Treebank T and a hand-crafted grammar Gh, the coverage of Gh on T can be measured by the overlap of Gh and a Treebank grammar Gt that is produced by LexTract from T. In this case, we will estimate the coverage of the XTAG grammar on the English Penn Treebank (PTB) using the Treebank grammar G2.</Paragraph>
      <Paragraph position="4"> There are obvious differences between these two grammars. For example, feature structures and multi-anchor etrees are present only in the XTAG grammar, whereas frequency information is available only in G2. When we match templates in the two grammars, we disregard the type of information that is present only in one grammar. As a result, the mapping between the two grammars is not one-to-one. A subset of the templates in the XTAG grammar and 215 templates in G2 match, and the latter account for 82.1% of the template tokens in the PTB. The remaining 17.9% of template tokens in the PTB do not match any template in the XTAG grammar because of one of the following reasons: (T1) Incorrect templates in G2: These templates result from Treebank annotation errors, and therefore, are not in XTAG.</Paragraph>
      <Paragraph position="5"> (T2) Coordination in XTAG: the templates for coordinations in XTAG are generated on the fly while parsing (Sarkar and Joshi, 1996), and are not part of the 1004 templates. Therefore, the conj-etrees in G2, which account for 3.4% of the template tokens in the Treebank, do not match any templates in XTAG.</Paragraph>
      <Paragraph position="6"> (T3) Alternative analyses: XTAG and PTB sometimes choose different analyses for the same phenomenon. For example, the two grammars treat reduced relative clauses differently. As a result, the templates used to handle those phenomena in these two grammars do not match according to our definition. null (T4) Constructions not covered by XTAG: Some of such constructions are the unlike coordination phrase (UCP), parenthetical (PRN), and ellipsis.</Paragraph>
      <Paragraph position="7"> For (T1)-(T3), the XTAG grammar can handle the corresponding constructions although the templates used in the two grammars look very different. To find out which constructions are not covered by XTAG, we manually classified 289 of the most frequent unmatched templates in G2 according to the reason why they are absent from XTAG. These 289 templates account for 93.9% of all the unmatched template tokens in the Treebank. The results are shown in Table 3, where the percentage is with respect to all the tokens in the Treebank. From the table, it is clear that the most common reason for mismatches is (T3). Combining the results in Tables 2 and 3, we conclude that 97.2% of template tokens in the Treebank are covered by XTAG, while another 1.7% are not. For the remaining 1.1% of template tokens, we do not know whether or not they are covered by XTAG because we have not checked the remaining 2416 unmatched templates in G2. [Footnote 7: The number 97.2% is the sum of two numbers: the first is the percentage of matched template tokens (82.1% from Table 2); the second is the percentage of template tokens which fall under (T1)-(T3), i.e., 16.8%-1.7%=15.1% from Table 3.] To summarize, we have just shown that, by comparing templates in the XTAG grammar with the Treebank grammar produced by LexTract, we estimate that the XTAG grammar covers 97.2% of template tokens in the English Treebank. Compared with the previous evaluation approach, this method has several advantages. First, the whole process is semi-automatic and requires little human effort. Second, the coverage can be calculated at either the sentence level or the etree level, which is more fine-grained. Third, the method provides a list of etrees that can be added to the grammar to improve its coverage. Fourth, there is no need to parse the whole corpus, which could have been very time-consuming.</Paragraph>
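The token-weighted overlap underlying this coverage estimate can be sketched as follows. Template names and counts here are hypothetical; in the real comparison the matching itself is the hard part, since feature structures and multi-anchor etrees must be disregarded first.

```python
def token_coverage(hand_templates, extracted_freq):
    """Fraction of template tokens whose template also appears in the
    hand-crafted grammar; extracted_freq maps template -> token count."""
    total = sum(extracted_freq.values())
    matched = sum(c for t, c in extracted_freq.items() if t in hand_templates)
    return matched / total

# Toy numbers: two of three extracted templates match the hand-crafted grammar.
hand = {"t1", "t2"}
extracted = {"t1": 70, "t2": 12, "t3": 18}
```

Weighting by token frequency is what makes a small matched core (215 templates) account for 82.1% of the tokens while thousands of rare unmatched templates account for the rest.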
    </Section>
    <Section position="2" start_page="59" end_page="60" type="sub_section">
      <SectionTitle>
4.2 Training Supertaggers
</SectionTitle>
      <Paragraph position="0"> A Supertagger (Joshi and Srinivas, 1994; Srinivas, 1997) assigns an etree template to each word in a sentence. The templates are also called Supertags because they include more information than Part-of-Speech tags. Srinivas implemented the first Supertagger, and he also built a Lightweight Dependency Analyzer that assembles the Supertags of words to create an almost-parse for the sentence. Supertaggers have been found useful for several applications, such as information retrieval (Chandrasekar and Srinivas, 1997).</Paragraph>
      <Paragraph position="1"> To use a Treebank to train a Supertagger, the phrase structures in the Treebank have to be converted into (word, Supertag) sequences first. Producing such sequences is exactly one of LexTract's main functions, as shown previously in Section 3.3.2 and Figure 9.</Paragraph>
      <Paragraph position="2"> Besides LexTract, there are two other attempts at converting the English Penn Treebank to train a Supertagger. Srinivas (1997) uses heuristics to map structural information in the Treebank into Supertags. His method differs from LexTract in that the set of Supertags in his method is chosen from the pre-existing XTAG grammar before the conversion starts, whereas LexTract extracts the Supertag set from Treebanks. His conversion program is also designed for this particular Supertag set, and it is not very easy to port it to another Supertag set. A third difference is that the Supertags in his converted data do not always fit together, due to the discrepancy between the XTAG grammar and the Treebank annotation and the fact that the XTAG grammar does not cover all the templates in the Treebank (see Section 4.1). In other words, even if the Supertagger is 100% accurate, it is possible that the correct parse for a sentence cannot be produced by combining those Supertags in the sentence.</Paragraph>
      <Paragraph position="3"> Another work in converting Treebanks into LTAGs is described in (Chen and Vijay-Shanker, 2000). The method is similar to ours in that both works use Head Percolation Tables to find the head and both distinguish arguments from adjuncts using syntactic tags and functional tags. Nevertheless, there are several differences: only LexTract explicitly creates fully bracketed ttrees, which are identical to the derived trees for the sentences. As a result, building etrees can be seen as a task of decomposing the fully bracketed ttrees. The mapping between the nodes in fully bracketed ttrees and etrees makes LexTract a useful tool for Treebank annotation and error detection.</Paragraph>
      <Paragraph position="4"> The two approaches also differ in how they distinguish arguments from adjuncts and how they handle coordinations.</Paragraph>
      <Paragraph position="5"> Table 4 lists the tagging accuracy of the same trigram Supertagger (Srinivas, 1997) trained and tested on the same original PTB data. The difference in tagging accuracy is caused by the different conversion algorithms that convert the original PTB data into the (word, template) sequences, which are fed to the Supertagger. The results of Chen &amp; Vijay-Shanker's method come from their paper (Chen and Vijay-Shanker, 2000). They built eight grammars. We just list the two of them which seem to be most relevant: C4 uses a reduced tagset while C3 uses the PTB tagset.</Paragraph>
      <Paragraph position="6"> As for Srinivas' results, we did not use the results reported in (Srinivas, 1997) and (Chen et al., 1999) because they are based on different training and testing data. [Footnote 8: All use Sections 2-21 of the PTB for training, and Section 22 or 23 for testing. We chose those sections because several state-of-the-art parsers (Collins, 1997; Ratnaparkhi, 1998; Charniak, 1997) are trained on Sections 2-21 and tested on Section 23. We include the results for Section 22 because (Chen and Vijay-Shanker, 2000) is tested on that section. For Srinivas' and our grammars, the first line is the result tested on Section 23, and the second line is the one for Section 22. Chen &amp; Vijay-Shanker's results are for Section 22 only.] Instead, we re-ran</Paragraph>
      <Paragraph position="7"> [Footnote 9: He used Sections 0-24 minus Section 20 for training and Section 20 for testing.]</Paragraph>
      <Paragraph position="8">  his Supertagger using his data on the sections that we have chosen. We have calculated two baselines for each set of data. The first one tags each word in the testing data with the most common Supertag w.r.t. the word in the training data; for an unknown word, it uses the most common Supertag. For the second baseline, we use a trigram POS tagger to tag the words first, and then for each word we use the most common Supertag w.r.t. the (word, POS tag) pair.</Paragraph>
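The two baselines can be sketched as follows. The data here are toy examples (template names "t1"/"t2" are invented); the real experiments train on Treebank-scale (word, Supertag) sequences.

```python
from collections import Counter

def train_baseline(pairs):
    """Map each key (a word, or a (word, POS) pair) to its most common Supertag."""
    counts = {}
    for key, tag in pairs:
        counts.setdefault(key, Counter())[tag] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

def baseline1(model_w, word, unk_tag):
    """Baseline 1: most common Supertag for the word, else a default for unknowns."""
    return model_w.get(word, unk_tag)

def baseline2(model_wp, model_w, word, pos, unk_tag):
    """Baseline 2: most common Supertag for the (word, POS) pair, backing off
    to baseline 1 when the pair was unseen in training."""
    return model_wp.get((word, pos), model_w.get(word, unk_tag))

# Toy training data: "draft" is usually tagged t1, but t2 when it is a noun.
model_w = train_baseline([("draft", "t1"), ("draft", "t1"), ("draft", "t2")])
model_wp = train_baseline([(("draft", "VBP"), "t1"), (("draft", "NN"), "t2")])
```

The second baseline can beat the first exactly when the POS tag disambiguates words like "draft" that anchor different templates as verb and noun.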
      <Paragraph position="9"> [Table 4: Supertagging accuracy using data converted by different conversion algorithms] A few observations are in order. First, the baselines for Supertagging are lower than the one for POS tagging, which is 91%, indicating that Supertagging is harder than POS tagging. Second, the second baseline is slightly better than the first baseline, indicating that using POS tags may improve the Supertagging accuracy. [Footnote 10: Noticeably, the results we report on Srinivas' data, 85.78% on Section 23 and 85.53% on Section 22, are lower than the 92.2% reported in (Srinivas, 1997) and the 91.37% in (Chen et al., 1999). There are several reasons for the difference. First, the size of the training data in our report is smaller than the one for his previous work. Second, we treat punctuation marks as normal words during evaluation because, like other words, punctuation marks can anchor etrees, whereas he treats the Supertags for punctuation marks as always correct. Third, he used some equivalence classes during evaluation. If a word is mis-tagged as x, while the correct Supertag is y, he considers that not to be an error if x and y appear in the same equivalence class. We suspect that the reason those Supertagging errors are disregarded is that they might not affect parsing results when the Supertags are combined. For example, both adjectives and nouns can modify other nouns. The two templates (i.e., Supertags) representing these modification relations look the same except for the POS tags of the anchors. If a word which should be tagged with one Supertag is mis-tagged with the other Supertag, it is likely that the wrong Supertag can still fit with the other Supertags in the sentence and produce the right parse. We did not use these equivalence classes in this experiment because we are not aware of a systematic way to find all the cases in which Supertagging errors do not affect the final parsing results.]</Paragraph>
      <Paragraph position="10"> Third, the Supertagging accuracy using G2 is 1.3-1.9% lower than the one using Srinivas' data. This is not surprising since the size of G2 is 6 times that of Srinivas' grammar.</Paragraph>
      <Paragraph position="11"> Notice that G1 is twice the size of G2, and the accuracy using G1 is 2% lower. Fourth, higher Supertagging accuracy does not necessarily mean that the quality of the converted data is better, since the underlying grammars differ a lot with respect to size and coverage.</Paragraph>
      <Paragraph position="12"> A better measure would be the parsing accuracy (i.e., the converted data should be fed to a common LTAG parser and the evaluations should be based on parsing results). We are currently working on that. Nevertheless, the experiments show that the (word, template) sequences produced by LexTract are useful for training Supertaggers. Our results are slightly lower than the ones trained on Srinivas' data, but our conversion algorithm has several appealing properties: LexTract does not use a pre-existing Supertag set; LexTract is language-independent; and the (word, Supertag) sequences produced by LexTract fit together.</Paragraph>
    </Section>
  </Section>
</Paper>