<?xml version="1.0" standalone="yes"?>
<Paper uid="H93-1047">
  <Title>Automatic Grammar Induction and Parsing Free Text: A Transformation-Based Approach</Title>
  <Section position="3" start_page="0" end_page="239" type="metho">
    <SectionTitle>
2. THE ALGORITHM
</SectionTitle>
    <Paragraph position="0"> The learning algorithm is trained on a small corpus of partially bracketed text which is also annotated with part-of-speech information. All of the experiments presented below were done using the Penn Treebank annotated corpus [11]. The learner begins in a naive initial state, knowing very little about the phrase structure of the target corpus. In particular, all that is initially known is that English tends to be right branching and that final punctuation is final punctuation. Transformations are then learned automatically which transform the output of the naive parser into output which better resembles the phrase structure found in the training corpus. Once a set of transformations has been learned, the system is capable of taking sentences tagged with parts of speech and returning a binary-branching structure with nonterminals unlabelled (footnote 3).</Paragraph>
    <Paragraph position="1"> 2.1. The Initial State Of The Parser
Initially, the parser operates by assigning a right-linear structure to all sentences. The only exception is that final punctuation is attached high. So, the sentence &quot;The dog and old cat ate .&quot; would be incorrectly bracketed as: ( ( The ( dog ( and ( old ( cat ate ) ) ) ) ) . )</Paragraph>
    <Paragraph position="3"> see [5, 4].</Paragraph>
    <Paragraph position="4">  The parser in its initial state will obviously not bracket sentences with great accuracy. In some experiments below, we begin with an even more naive initial state of knowledge: sentences are parsed by assigning them a random binary-branching structure with final punctuation always attached high.</Paragraph>
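The initial state described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the original implementation: trees are nested tuples of words, the names `right_linear`, `naive_parse` and `show` are hypothetical, and the set of tokens treated as final punctuation is an assumption.

```python
FINAL_PUNCTUATION = {".", "?", "!"}  # assumption: the source does not list these


def right_linear(words):
    """Nest words to the right: [a, b, c] becomes (a, (b, c))."""
    if len(words) == 1:
        return words[0]
    return (words[0], right_linear(words[1:]))


def naive_parse(tokens):
    """Right-linear structure, with final punctuation attached high."""
    if len(tokens) > 1 and tokens[-1] in FINAL_PUNCTUATION:
        return (right_linear(tokens[:-1]), tokens[-1])
    return right_linear(tokens)


def show(tree):
    """Render a nested-tuple tree as a space-separated bracketing."""
    if isinstance(tree, str):
        return tree
    return "( " + " ".join(show(child) for child in tree) + " )"


print(show(naive_parse("The dog and old cat ate .".split())))
```

The random initial state mentioned above would replace `right_linear` with a function that picks a random split point at each level while still attaching the final punctuation high.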
    <Section position="1" start_page="237" end_page="238" type="sub_section">
      <SectionTitle>
2.2. Structural Transformations
</SectionTitle>
      <Paragraph position="0"> The next stage involves learning a set of transformations that can be applied to the output of the naive parser to make these sentences better conform to the proper structure specified in the training corpus. The list of possible transformation types is prespecified. Transformations involve making a simple change triggered by a simple environment. In the current implementation, there are twelve allowable transformation types:</Paragraph>
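One way to picture the twelve types is as the product of an action, a parenthesis, and a triggering environment. The sketch below is an assumption inferred from the learned transformations listed in Section 2.3 (for example "Delete a left paren between two proper nouns") and from the footnote on triggering environments; the exact wording of each template, and the names `ACTIONS`, `PARENS`, `ENVIRONMENTS` and `transformation_templates`, are hypothetical.

```python
from itertools import product

ACTIONS = ("Add", "Delete")
PARENS = ("left", "right")
ENVIRONMENTS = (
    "to the left of tag X",
    "to the right of tag X",
    "between tags X and Y",
)


def transformation_templates():
    """Enumerate the assumed action x paren x environment combinations."""
    return [
        f"{action} a {paren} paren {env}"
        for action, paren, env in product(ACTIONS, PARENS, ENVIRONMENTS)
    ]


for number, template in enumerate(transformation_templates(), start=1):
    print(number, template)
```

Instantiating a template means filling X (and Y) with concrete part-of-speech tags, so the number of candidate transformations depends on the size of the tag set.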
      <Paragraph position="2"> To carry out a transformation by adding or deleting a parenthesis, a number of additional simple changes must take place to preserve balanced parentheses and binary branching. To give an example, to delete a left paren in a particular environment, the following operations take place (assuming, of course, that there is a left paren to delete):
1. Delete the left paren.</Paragraph>
      <Paragraph position="3"> 2. Delete the right paren that matches the just deleted paren.
3. Add a left paren to the left of the constituent immediately to the left of the deleted left paren.</Paragraph>
      <Paragraph position="4"> 4. Add a right paren to the right of the constituent immediately to the right of the deleted paren.</Paragraph>
      <Paragraph position="5"> 5. If there is no constituent immediately to the right, or none immediately to the left, then the transformation fails to apply.</Paragraph>
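Steps 1 to 5 above can be followed literally on a flat list of tokens. The sketch below is one reading of those steps, not the original code: the bracketing is assumed to be a balanced list of "(", ")" and word tokens, the caller supplies the index of the left paren to delete, and the names `match_right`, `match_left` and `delete_left_paren` are hypothetical.

```python
def match_right(toks, i):
    """Index of the right paren that matches the left paren at index i."""
    depth = 0
    for j in range(i, len(toks)):
        if toks[j] == "(":
            depth += 1
        elif toks[j] == ")":
            depth -= 1
            if depth == 0:
                return j
    raise ValueError("unbalanced bracketing")


def match_left(toks, j):
    """Index of the left paren that matches the right paren at index j."""
    depth = 0
    for i in range(j, -1, -1):
        if toks[i] == ")":
            depth += 1
        elif toks[i] == "(":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced bracketing")


def delete_left_paren(toks, i):
    """Return the transformed token list, or None if the change cannot apply."""
    # Step 5: fail when there is no paren here or no constituent to its left.
    if toks[i] != "(" or i == 0 or toks[i - 1] == "(":
        return None
    close = match_right(toks, i)  # the paren removed in step 2
    # Constituent immediately to the left: a single word or a bracketed group.
    left_start = match_left(toks, i - 1) if toks[i - 1] == ")" else i - 1
    # Constituent immediately to the right: a single word or a bracketed group.
    right_end = match_right(toks, i + 1) if toks[i + 1] == "(" else i + 1
    return (
        toks[:left_start] + ["("] + toks[left_start:i]  # step 3, then step 1
        + toks[i + 1:right_end + 1] + [")"]             # step 4
        + toks[right_end + 1:close] + toks[close + 1:]  # step 2
    )


before = "( ( The ( dog barked ) ) . )".split()
after = delete_left_paren(before, before.index("The") + 1)
print(" ".join(after))
```

In a balanced binary bracketing a left paren is always followed by a constituent, so the only failure case checked here is a missing constituent on the left.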
      <Paragraph position="6"> Structurally, the transformation can be seen as follows. If we wish to delete a left paren to the right of constituent X (footnote 4), where</Paragraph>
      <Paragraph position="8"/>
      <Paragraph position="10"> Given the sentence (footnote 6): The dog barked.</Paragraph>
      <Paragraph position="11"> this would initially be bracketed by the naive parser as: ( ( The ( dog barked ) ) . )</Paragraph>
      <Paragraph position="13"> If the transformation delete a left paren to the right of a determiner is applied, the structure would be transformed to the correct bracketing: ( ( ( The dog ) barked ) . )
To add a right parenthesis to the right of YY, YY must once again be in a subtree of the form:</Paragraph>
      <Paragraph position="15"> If it is, the following steps are carried out to add the right paren:
1. Add the right paren.</Paragraph>
      <Paragraph position="16"> 2. Delete the left paren that now matches the newly added paren.</Paragraph>
      <Paragraph position="17"> 3. Find the right paren that used to match the just deleted paren and delete it.</Paragraph>
      <Paragraph position="18"> 4. Add a left paren to match the added right paren.
Footnote 5: The twelve transformations can be decomposed into two structural transformations, that shown here and its converse, along with six triggering environments.</Paragraph>
      <Paragraph position="19"> Footnote 6: Input sentences are also labelled with parts of speech.
This results in the same structural change as deleting a left paren to the right of X in this particular structure. Applying the transformation add a right paren to the right of a noun to the bracketing:</Paragraph>
      <Paragraph position="21"/>
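Read as operations on binary trees, both procedures in this section produce the same change: a subtree holding X next to a bracketed pair ( YY Z ) becomes one holding the bracketed pair ( X YY ) next to Z. The sketch below shows that change and its converse on nested tuples. The tuple representation and the names `rotate_left` and `rotate_right` are assumptions made for illustration; the source does not describe its data structures.

```python
def rotate_left(node):
    """(X, (YY, Z)) becomes ((X, YY), Z); None if the right child is a word."""
    x, right = node
    if isinstance(right, str):
        return None  # nothing to regroup, so the transformation fails to apply
    yy, z = right
    return ((x, yy), z)


def rotate_right(node):
    """The converse: ((X, YY), Z) becomes (X, (YY, Z)); None if it cannot apply."""
    left, z = node
    if isinstance(left, str):
        return None
    x, yy = left
    return (x, (yy, z))


subtree = ("The", ("dog", "barked"))
print(rotate_left(subtree))
print(rotate_right(rotate_left(subtree)))
```

Under this reading a triggering environment only selects which node the rotation is applied to; the structural change itself is always one of these two functions.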
    </Section>
    <Section position="2" start_page="238" end_page="239" type="sub_section">
      <SectionTitle>
2.3. Learning Transformations
</SectionTitle>
      <Paragraph position="0"> Learning proceeds as follows. Sentences in the training set are first parsed using the naive parser which assigns right-linear structure to all sentences, attaching final punctuation high.</Paragraph>
      <Paragraph position="1"> Next, for each possible instantiation of the twelve transformation templates, that particular transformation is applied to the naively parsed sentences. The resulting structures are then scored using some measure of success which compares these parses to the correct structural descriptions for the sentences provided in the training corpus. The transformation which results in the best scoring structures then becomes the first transformation of the ordered set of transformations that are to be learned. That transformation is applied to the right-linear structures, and then learning proceeds on the corpus of improved sentence bracketings. The following procedure is carried out repeatedly on the training corpus until no more transformations can be found whose application reduces the error in parsing the training corpus:
1. The best transformation is found for the structures output by the parser in its current state (footnote 7).
2. The transformation is applied to the output resulting from bracketing the corpus using the parser in its current state.</Paragraph>
      <Paragraph position="2"> 3. This transformation is added to the end of the ordered list of transformations.</Paragraph>
      <Paragraph position="3"> 4. Go to 1.</Paragraph>
      <Paragraph position="4">  After a set of transformations has been learned, it can be used to effectively parse fresh text. To parse fresh text, the text is first naively parsed and then every transformation is applied, in order, to the naively parsed text.</Paragraph>
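The learning loop and the parsing step above can be sketched as a greedy search. This is a minimal sketch under stated assumptions rather than the original implementation: `candidates` is any list of instantiated transformations, `apply_fn(transformation, tree)` returns the transformed tree, and `score(trees, gold)` returns a number where larger is better. All of these names are hypothetical.

```python
def learn(trees, gold, candidates, apply_fn, score):
    """Return the ordered list of transformations found by greedy search."""
    learned = []
    current = score(trees, gold)
    while True:
        best, best_trees, best_score = None, None, current
        # Step 1: find the transformation that most improves the score.
        for transformation in candidates:
            new_trees = [apply_fn(transformation, tree) for tree in trees]
            new_score = score(new_trees, gold)
            if new_score > best_score:
                best, best_trees, best_score = transformation, new_trees, new_score
        if best is None:
            return learned  # no transformation reduces the training error
        # Steps 2 and 3: update the corpus and extend the ordered list.
        trees, current = best_trees, best_score
        learned.append(best)


def parse(naive_tree, learned, apply_fn):
    """Apply every learned transformation, in order, to a naively parsed tree."""
    tree = naive_tree
    for transformation in learned:
        tree = apply_fn(transformation, tree)
    return tree
```

Because a transformation is kept only when it strictly improves the training score, the loop stops as soon as no candidate helps; this is also where a minimum-improvement threshold could be added.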
      <Paragraph position="5"> One nice feature of this method is that different measures of bracketing success can be used: learning can proceed in such a way as to try to optimize any specified measure of success.
Footnote 7: The state of the parser is defined as naive initial-state knowledge plus all transformations that currently have been learned.</Paragraph>
      <Paragraph position="6"> The measure we have chosen for our experiments is the same measure described in [12], which is one of the measures that arose out of a parser evaluation workshop [2]. The measure is the percentage of constituents (strings of words between matching parentheses) from sentences output by our system which do not cross any constituents in the Penn Treebank structural description of the sentence. For example, if our</Paragraph>
      <Paragraph position="8"> then the constituent the big would be judged correct whereas the constituent dog ate would not.</Paragraph>
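The measure can be sketched as follows. The code assumes trees are nested tuples of words, counts every bracketed constituent including the whole sentence, and treats two constituents as crossing when their word spans overlap without one containing the other. The two example trees are hypothetical, chosen only to agree with the judgments stated above for "the big" and "dog ate"; the function names are also assumptions.

```python
def spans(tree, start=0):
    """Return (list of (start, end) word spans for each constituent, end index)."""
    if isinstance(tree, str):
        return [], start + 1
    found, pos = [], start
    for child in tree:
        child_spans, pos = spans(child, pos)
        found.extend(child_spans)
    found.append((start, pos))
    return found, pos


def crosses(a, b):
    """True when spans a and b overlap but neither contains the other."""
    (s1, e1), (s2, e2) = a, b
    return (s2 > s1 and e1 > s2 and e2 > e1) or (s1 > s2 and e2 > s1 and e1 > e2)


def noncrossing_rate(system_tree, gold_tree):
    """Percentage of system constituents that cross no gold constituent."""
    system_spans, _ = spans(system_tree)
    gold_spans, _ = spans(gold_tree)
    good = [s for s in system_spans if not any(crosses(s, g) for g in gold_spans)]
    return 100.0 * len(good) / len(system_spans)


system = (("the", "big"), ("dog", "ate"))  # hypothetical system output
gold = (("the", "big", "dog"), "ate")      # hypothetical treebank bracketing
print(noncrossing_rate(system, gold))
```

Here "the big" lies inside the gold constituent "the big dog" and so counts as correct, while "dog ate" starts inside that constituent and ends outside it, so it crosses.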
      <Paragraph position="9"> Below are the first seven transformations found from one run of training on the Wall Street Journal corpus, which was initially bracketed using the right-linear initial-state parser.
1. Delete a left paren to the left of a singular noun.
2. Delete a left paren to the left of a plural noun.</Paragraph>
      <Paragraph position="10"> 3. Delete a left paren between two proper nouns.</Paragraph>
      <Paragraph position="11"> 4. Delete a left paren to the right of a determiner.</Paragraph>
      <Paragraph position="12"> 5. Add a right paren to the left of a comma.</Paragraph>
      <Paragraph position="13"> 6. Add a right paren to the left of a period.</Paragraph>
      <Paragraph position="14"> 7. Delete a right paren to the left of a plural noun.
Applying the fifth transformation to the bracketing: ( ( We ( ran ( , ( and ( they walked ) ) ) ) ) . ) would result in: ( ( ( We ran ) ( , ( and ( they walked ) ) ) ) . )
[...] improve performance in the test corpus. One way around this overtraining would be to set a threshold: specify a minimum level of improvement that must result for a transformation to be learned. Another possibility is to use additional training material to prune the set of learned transformations.</Paragraph>
    </Section>
  </Section>
</Paper>