<?xml version="1.0" standalone="yes"?>
<Paper uid="P97-1041">
  <Title>A Trainable Rule-based Algorithm for Word Segmentation</Title>
  <Section position="3" start_page="321" end_page="322" type="intro">
    <SectionTitle>
2 Transformation-based
Segmentation
</SectionTitle>
    <Paragraph position="0"> The key component of our trainable segmentation algorithm is Transformation-based Error-driven Learning, the corpus-based language processing method introduced by Brill (1993a). This technique provides a simple algorithm for learning a sequence of rules that can be applied to various NLP tasks.</Paragraph>
    <Paragraph position="1"> It differs from other common corpus-based methods in several ways. For one, it is weakly statistical, but not probabilistic; transformation-based approaches consequently require far less training data than most statistical approaches (see, for example, Sproat et al. (1996)). It is rule-based, but relies on</Paragraph>
    <Paragraph position="2"> machine learning to acquire the rules, rather than expensive manual knowledge engineering. The rules produced can be inspected, which is useful for gaining insight into the nature of the rule sequence and for manual improvement and debugging of the sequence. The learning algorithm also considers the entire training set at all learning steps, rather than decreasing the size of the training data as learning progresses, such as is the case in decision-tree induction (Quinlan, 1986). For a thorough discussion of transformation-based learning, see Ramshaw and Marcus (1996).</Paragraph>
    <Paragraph position="3"> Brill's work provides a proof of viability of transformation-based techniques in the form of a number of processors, including a (widely distributed) part-of-speech tagger (Brill, 1994), a procedure for prepositional phrase attachment (Brill and Resnik, 1994), and a bracketing parser (Brill, 1993b). All of these provided performance comparable to or better than previous attempts.</Paragraph>
    <Paragraph position="4"> Transformation-based learning has also been successfully applied to text chunking (Ramshaw and Marcus, 1995), morphological disambiguation (Oflazer and Tur, 1996), and phrase parsing (Vilain and Day, 1996).</Paragraph>
    <Section position="1" start_page="321" end_page="322" type="sub_section">
      <SectionTitle>
2.1 Training
</SectionTitle>
      <Paragraph position="0"> Word segmentation can easily be cast as a transformation-based problem, which requires an initial model, a goal state into which we wish to transform the initial model (the &amp;quot;gold standard&amp;quot;), and a series of transformations to effect this improvement. The transformation-based algorithm involves applying and scoring all the possible rules to training data and determining which rule improves the model the most. This rule is then applied to all applicable sentences, and the process is repeated until no rule improves the score of the training data. In this manner a sequence of rules is built for iteratively improving the initial model. Evaluation of the rule sequence is carried out on a test set of data which is independent of the training data.</Paragraph>
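The greedy training loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: here a "rule" simply toggles the boundary after a trigger character, `score` counts agreement with the gold-standard boundary set, and the loop repeatedly applies the single best-scoring rule until no rule improves the training data.

```python
# Minimal sketch of transformation-based error-driven learning for
# segmentation. Boundaries are represented as a set of positions; a boundary
# at i falls between characters i and i+1. Rule representation and scoring
# are illustrative assumptions, not the paper's actual rule syntax.

def score(boundaries, gold):
    """Reward boundaries matching the gold standard, penalize spurious ones."""
    return len(boundaries & gold) - len(boundaries - gold)

def apply_rule(rule, text, boundaries):
    """Toggle the boundary after every occurrence of the trigger character."""
    trigger, action = rule  # action is "insert" or "delete"
    new = set(boundaries)
    for i, ch in enumerate(text):
        if ch == trigger:
            new.add(i) if action == "insert" else new.discard(i)
    return new

def learn_rule_sequence(text, initial, gold, candidate_rules):
    """Greedily pick the rule that most improves the score until none helps."""
    state, sequence = set(initial), []
    while True:
        best_rule, best_gain = None, 0
        for rule in candidate_rules:
            gain = score(apply_rule(rule, text, state), gold) - score(state, gold)
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None:  # no rule improves the training data; stop
            break
        state = apply_rule(best_rule, text, state)
        sequence.append(best_rule)
    return sequence, state
```

The learned rule sequence, applied in order to fresh data, reproduces the iterative improvement of the initial model; evaluation would use a held-out test set, as the paragraph notes.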
      <Paragraph position="1"> If we treat the output of an existing segmentation algorithm as the initial state and the desired segmentation as the goal state, we can perform a series of transformations on the initial state - removing extraneous boundaries and inserting new boundaries - to obtain a more accurate approximation of the goal state.</Paragraph>
      <Paragraph position="2"> For our experiments, we obtained corpora which had been manually segmented by native or near-native speakers of Chinese and Thai. We divided the hand-segmented data randomly into training and test sets. Roughly 80% of the data was used to train the segmentation algorithm, and 20% was used as a blind test set to score the rules learned from the training data. In addition to Chinese and Thai, we also performed segmentation experiments using a large corpus of English in which all the spaces had been removed from the texts. Most of our English experiments were performed using training and test sets with roughly the same 80-20 ratio, but in Section 3.4.3 we discuss results of English experiments with different amounts of training data. Unfortunately, we could not repeat these experiments with Chinese and Thai due to the small amount of hand-segmented data available.</Paragraph>
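The random 80/20 division of the hand-segmented data might be prepared along these lines (a hypothetical helper; the fraction and seeding are parameterized assumptions, not details from the paper):

```python
import random

def split_corpus(sentences, train_frac=0.8, seed=0):
    """Randomly split hand-segmented sentences into a training set and a
    blind test set, roughly 80/20 as in the experiments described above."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```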
    </Section>
    <Section position="2" start_page="322" end_page="322" type="sub_section">
      <SectionTitle>
2.2 Rule syntax
</SectionTitle>
      <Paragraph position="0"> There are three main types of transformations which can act on the current state of an imperfect segmentation:
* Insert - place a new boundary between two characters
* Delete - remove an existing boundary between two characters
* Slide - move an existing boundary from its current location between two characters to a location 1, 2, or 3 characters to the left or right
In our syntax, Insert and Delete transformations can be triggered by any two adjacent characters (a bigram) and one character to the left or right of the bigram. Slide transformations can be triggered by a sequence of one, two, or three characters over which the boundary is to be moved. Figure 1 enumerates the 22 segmentation transformations we define.</Paragraph>
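The three transformation types can be illustrated on a boundary-set representation of a segmentation (the function names and representation are illustrative assumptions, not the paper's implementation):

```python
# A segmentation is a set of boundary positions; a boundary at i falls
# between characters i and i+1.

def insert_boundary(boundaries, i):
    """Insert: place a new boundary between two characters."""
    return boundaries | {i}

def delete_boundary(boundaries, i):
    """Delete: remove an existing boundary between two characters."""
    return boundaries - {i}

def slide_boundary(boundaries, i, offset):
    """Slide: move an existing boundary 1, 2, or 3 characters left or right."""
    assert i in boundaries and 1 <= abs(offset) <= 3
    return (boundaries - {i}) | {i + offset}
```

In a full system the trigger contexts (the bigram and its neighbors for Insert/Delete, the crossed character sequence for Slide) would determine where each transformation fires.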
    </Section>
  </Section>
</Paper>