<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1509"> <Title>Coping with problems in grammars automatically extracted from treebanks</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The extracted grammar </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 TAGs </SectionTitle> <Paragraph position="0"> A TAG is a set of lexicalized elementary trees that can be combined, through the operations of tree adjunction and tree substitution, to derive syntactic structures for sentences. We follow a common approach to grammar development for natural language using TAGs, under which, driven by locality principles, each elementary tree for a given lexical head is expected to contain its projection and slots for its arguments (e.g., Frank, 2002). Figure 1 shows typical grammar template trees that can be selected by lexical items and combined to generate the structure in Figure 2. The derivation tree, to the right, contains the history of the tree-grafting process that generated the derived tree, to the left.3</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 LexTract </SectionTitle> <Paragraph position="0"> Given an annotated sentence from the PTB as input, Xia's LexTract tool (Xia, 1999; Xia, 2001) first performs a rebracketing. More precisely, additional nodes are inserted to separate arguments and modifiers and to structure the modifying process as binary branching. A typical rebracketed PTB tree is shown in Figure 3,4 in which we have distinguished the tree nodes inserted by LexTract.</Paragraph> <Paragraph position="1"> The second stage is the extraction of the grammar trees proper, shown in Figure 4. In particular, recursive modifier structures have to be detected and factored out of the derived tree to compose the auxiliary trees, the rest becoming an initial tree. 
The process is also recursive in the sense that factored subtree structures still undergo the spinning-off process until we have all modifiers with their own trees, all the arguments of a head as substitution nodes of the tree containing their head, and the material under the argument nodes defining additional initial trees for themselves. Auxiliary trees are extracted from parent-child pairs with matching labels if the child is elected the parent's head and the child's sibling is marked as modifier: the parent is mapped into the root of an auxiliary tree, the head-child into its foot, with the sibling subtree (after being recursively processed) carried along into the auxiliary tree. Notice that the auxiliary trees are therefore either strictly right- or left-branching, with the foot always immediately under the root node. Other kinds of auxiliary trees are not allowed.</Paragraph> <Paragraph position="2"> Footnote 4: Figures 3 and 4 are thanks to Fei Xia. We are also grateful to her for allowing us to use LexTract and to change its source code to suit our needs.</Paragraph> <Paragraph position="3"> To extract a grammar with Xia's tool one has to define tables for finding: the head child of a constituent expansion; which of the siblings of a head are acceptable arguments; and which constituent labels are plausible modifiers of another. Special provisions are made for handling coordination. For additional information see (Xia, 2001). In this paper we refer to the table settings and extracted grammar of (Xia, 1999), which we used as our starting point, as Xia's sample. 
We used a customized version of LexTract, plus additional pre-processing of the PTB input and post-processing of the extracted trees.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Extraction Problems </SectionTitle> <Paragraph position="0"> Extraction problems arise from several sources, including: (1) lack of a proper linguistic account,5 (2) the (Penn Treebank) annotation style, (3) the (LexTract) extraction tool, (4) possible unsuitability of the (TAG) model, and (5) annotation errors. We refrained from making a rigid classification of the problems we present according to these sources. In particular, it is often difficult to decide whether to blame sources (1), (3), or (5) for a certain problem.</Paragraph> <Paragraph position="1"> We will not discuss in this paper problems due to annotation errors. As for the PTB-style problems, we discuss only one, the first listed below.</Paragraph> <Paragraph position="2"> Footnote 5: This includes the (occasional) inability on the part of grammar developers to find or make use of an existing account.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Free Relatives </SectionTitle> <Paragraph position="0"> Free relatives are annotated in the Penn Treebank as sentential complements, as in Figure 5.a. The extracted tree corresponding to the occurrence of &quot;make&quot; would be that of a verb that takes a sentential complement (SBAR). This does not seem to be correct,6 as the proper subcategorization of the verb occurrence is transitive. (Footnote 6: [...] discussed in the literature, the presence of the NP (or DP) is clear.)</Paragraph> <Paragraph position="1"> In fact, free relatives may occur wherever an NP argument may occur. So, the only reasonable extraction account consistent with maintaining them as SBARs would be one in which every NP substitution node in an extracted tree would admit the 
existence of a counterpart tree, identical to the first, except that the NP argument label is replaced with an SBAR. Instead we opted to reflect the NP character of the free relatives by pre-processing the corpus (using the Head-analysis, for practical convenience). The annotated example is then automatically replaced with the one in Figure 5.b. Other cases of free relatives (non-NP) are rare and not likely to interfere with verb subcategorization.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Wh percolation up </SectionTitle> <Paragraph position="0"> In the Penn Treebank the same constituent is annotated with different syntactic categories depending on whether or not it possesses the wh feature. For instance, a regular noun phrase has the syntactic category NP, whereas when the constituent is wh-marked, and is in the landing site of wh-movement, it carries the label WHNP.7 While that might look appealing, since the two constituents seem to have distinct distributional properties, it poses a design problem. Whereas regular constituents inherit their syntactic categorial feature (i.e., their label) from their heads, wh projections are often formed by inheritance from their modifiers. For instance, &quot;the father&quot; is an NP, but modified by a wh expression (&quot;the father of whom&quot;, &quot;whose father&quot;, &quot;which father&quot;), it becomes a WHNP. 
The only solution we see is to allow nouns and NPs to freely project up to WHNPs during extraction.8 On the other hand, in cases when the wh constituent is in a non-wh position, we need the opposite effect: a WHNP (or wh-noun POS tag) is allowed to project up to an NP.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Unlike Coordinated Phrases (UCP) </SectionTitle> <Paragraph position="0"> This is the expression used in the PTB to denote coordinated phrases in which the coordinated constituents are not of the same syntactic category. The rationale for the existence of such constructions is that the coordinated constituents are alternative realizations of the same grammatical function with respect to a lexical head. In Figure 6.a, both a noun and an adjective are allowed to modify another noun, and therefore they can be conjoined while realizing that function. Two other common cases are coordination of predicates in copular constructions (Figure 6.b) and adverbial modification (Figure 6.c).</Paragraph> <Paragraph position="1"> We deal with the problem as follows. First, we allow a UCP to be extracted as an argument when the head is a verb and the UCP is marked predicative (PRD function tag) in the training example, or whenever the head is seen to have an obligatory argument requirement (e.g., prepositions: &quot;They come from ((NP the house) and (PP behind the tree))&quot;). Second, a UCP is allowed to modify (adjoin to) most of the nodes, according to evidence in the corpus and common sense (in the first and third examples above we had NP and VP modification).</Paragraph> <Paragraph position="2"> With respect to the host tree, when attached as an argument a UCP is treated like any other non-terminal: as a substitution node.</Paragraph> <Paragraph position="4"> The left tree in Figure 7 shows the case where the UCP is treated as a modifier. 
In fact the trees are both for the example in Figure 6.a.</Paragraph> <Paragraph position="5"> Notice that the tree is non-lexicalized to avoid effects of sparseness. The UCP is then expanded as in the right tree in Figure 7: an initial tree anchored by the conjunction (the tree attaches either to a tree like the one on the left or as a true argument - the latter would be the case for the example in Figure 6.b).</Paragraph> <Paragraph position="6"> Now, the caveats. First, we are giving the UCP the status of an independent non-terminal, as if it had some intrinsic categorial significance (as a syntactic projection). The assumption of independence of expansion, which for context-free grammars is inherent to each non-terminal, is in TAGs further restricted to the substitution nodes. For example, when an NP appears as a substitution node, in a subject or object position, or as an argument of a preposition or a genitive marker, we are stating that any possible expansion for the NP is licensed there. The same happens for other labels in argument positions as well. While that is an overgenerating assumption (e.g., the expletive &quot;there&quot; cannot be the realization of an NP in object position), it is generally true. For the UCP, however, we know that its expansion is in fact strongly dependent on where the substitution node is, as we have argued before. In fact it is lexically dependent (cf. &quot;I know ((the problem) and (that there is no solution to it))&quot;, where the conjuncts are licensed by the subcategorizations of the verb &quot;know&quot;). On the other hand, it does not seem reasonable to expand the UCP node at the hosting tree, since that would cause a cross-product explosion. A possible way of alleviating this effect could be to expand only the auxiliary trees (a UCP modifying a VP is distinct from a UCP modifying an NP, and moreover they are independent of lexical items). 
But for true argument positions there seems to be no clear solution.</Paragraph> <Paragraph position="7"> Second, the oddity of the UCP as a label becomes apparent once again when there are multiple conjuncts, as in Figure 8: it is enough for one of them to be distinct to turn the entire constituent into a UCP.</Paragraph> <Paragraph position="8"> Recursive decomposition in the grammar in these situations clearly leads to some non-standard trees.</Paragraph> <Paragraph position="9"> Finally, and more crucially, we have omitted one case in our discussion: the case in which the UCP is the natural head-child of some node.</Paragraph> <Paragraph position="11"> Under some accounts of grammar development this never happens: we have observed that the UCP does not appear as head child in the account where the head is the syntactic head of a node. We have not always followed this rule. With respect to the VP head, so far we have followed one major tendency in the computational implementation of lexicalized grammars, according to which lexical verbs are preferred to auxiliary verbs to head the VP. Now, consider the pair of sentences in Figure 9.</Paragraph> <Paragraph position="12"> Under the lexical verb paradigm, in the first sentence the derivation would start with an initial tree anchored by the past participle verb (&quot;rated&quot;). But then we have an interesting problem in the second sentence, for which we do not currently have a neat solution. Following Xia's sample settings of LexTract parameters, in these cases the extraction is rescued by switching to the other paradigm: the initial tree is extracted anchored by the auxiliary verb with a UCP argument, and the VP is accepted as a possible conjunct. 
A systematic move to the syntactic head paradigm, which we may indeed try, would have important consequences for the locality assumptions of grammar development.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 VP topicalization </SectionTitle> <Paragraph position="0"> Another problem with the lexical verb paradigm (see also the discussion under UCP above) is VP topicalization, as in the sentence in Figure 10. The solution currently adopted (again, inherited from Xia's sample settings) is as above: the paradigm is switched and the auxiliary verb (&quot;be&quot;) is chosen as the anchor of the initial tree.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.5 Extraposition and Verb Subcategorization </SectionTitle> <Paragraph position="0"> One of the key design principles that have been guiding grammar development with TAGs is to keep verb arguments as substitution slots local to the tree anchored by the verb. It is widely known that the Penn Treebank does not distinguish verb objects from adjuncts. So heuristics of some sort are needed to decide, among the candidates, which are to be taken as arguments (Kinyon and Prolo, 2002); the rest are extracted as separate VP modifier trees.</Paragraph> <Paragraph position="1"> However, this step is not enough for the trees to correctly reflect verb subcategorizations. The occurrence of discontinuous arguments, frequently explained as argument extraposition (the argument is raised past the adjunct), creates a problem. 
In the sentence in Figure 11 the verb &quot;pass&quot; should anchor a tree with one NP object.</Paragraph> <Paragraph position="2"> However, in such a tree it would be impossible to adjoin the tree for the intervening ADVP &quot;quickly&quot; as a VP modifier and still have it between the verb and the NP.9 LexTract would then instead extract an intransitive tree for the VB &quot;pass&quot;, onto which the ADVP modifier tree would adjoin. A second oddity is that the NP object would also be extracted as a VP modifier tree. In a nutshell, objects in extracted trees are restricted to those which are not extraposed, and hence the trees may not truly reflect the proper domain of locality. One view is that the set of trees for a certain subcategorization frame would include these degenerate cases. LexTract has an option to allow limited discontinuity, i.e., a non-argument sequence between the verb and the first object (but not between two objects). The non-arguments would then be adjoined to the V node.10 So far we have used only the latter alternative. (Footnote 9: A striking use of sister adjunction in (Chiang, 2000) is exactly the elegant way it solves this problem: the non-argument tree can be adjoined onto a node (say, VP), positioning itself in between the VP's children, which is not possible with TAGs.) [Figure residue: (NP (NP the 3 billion New Zealand dollars)]</Paragraph> <Paragraph position="3"> It is worth mentioning two other cases of extraposition. Subject extraposition is handled by having the extraposed subject, usually a sentential form, adjoin at the VP of which it is the logical subject (the original position is still occupied by an NP with the expletive pronoun &quot;it&quot;). 
Relative clause extraposition is modeled by a relative clause tree, except that it adjoins at a VP instead of at an NP, as is usual.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.6 Parentheticals </SectionTitle> <Paragraph position="0"> Parenthetical expressions are ubiquitous in language: they may appear almost everywhere in a sentence and can be of almost any category (Fig. 12).</Paragraph> <Paragraph position="1"> We model them as adjoining, either to the left or right of the constituent they are dominated by, depending on whether they are to the left or right of the head child of the parent's node. Occasionally such trees can also be initial. The respective trees for the examples of Figure 12 are drawn in Figure 13. It is always the case that the label PRN dominates a single substitution node. Whenever this was not the case in the training corpus, heuristics based on observation were used to enforce it by inserting an appropriate missing node.</Paragraph> <Paragraph position="2"> Footnote 10: Of course, although the solution covers most of the occurrences, and apart from any linguistic concern, there are still uncovered cases, e.g., when a parenthetical expression intervenes between the first and the second argument.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.7 Projection labels </SectionTitle> <Paragraph position="0"> LexTract extracts trees with no concern for the appropriate projective structure of constituents when it is not explicitly marked in the PTB. Figure 14 shows two examples of NP modification where the modifiers are single lexical items. The extracted modifier trees, shown on the right, do not have the projection for the modifiers JJR &quot;stronger&quot; and the NNP &quot;October&quot; (which should be, respectively, an ADJP and an NP). 
That is so because those nodes are not found in the annotation.</Paragraph> <Paragraph position="1"> However, if the modifiers are complex, that is, if the modifiers are themselves modified, the PTB inserts their respective projections, and therefore they appear in the extracted trees, as shown in Figure 15. There seems to be no reason for the two pairs of extracted trees to be different. Much of this is caused by the acknowledged flatness of the Penn Treebank annotation. That said, trees like those in the second pair should be preferred. The projection node (ADJP or NP) is understood to dominate its head even when there is no further modification, and it should be a concern of a good extraction process to insert the missing node into the grammar. Since LexTract does not allow us to specify the insertion of &quot;obligatory&quot; projections, we had to accomplish this through a somewhat complicated post-processing step using a projection table. Some of our current projections are: nouns, personal pronouns and the existential expletive to NP; adjectives to ADJP; adverbs to ADVP; sentences either to SBAR (S, SINV) or to SBARQ (SQ); cardinals (CD) to Quantifier Phrases (QP), which themselves project to NP. Notice that not all categories are forcibly projected. For instance, verbs are not, which allows for simple auxiliary extraction. IN is also not projected, due to its double role as PP head (true preposition) and subordinate conjunction, which should project onto SBARs.</Paragraph> </Section> </Section> </Paper>