<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-3021">
  <Title>Compiling Boostexter Rules into a Finite-state Transducer</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Resolving Ambiguity by Classification
</SectionTitle>
    <Paragraph position="0"> In general, we can characterize all these tagging problems as search problems formulated as shown 2Furthermore, software implementing the nite-state calculus is available for research purposes.</Paragraph>
    <Paragraph position="1"> in Equation (1). We notatea0 to be the input vocabulary,a1 to be the vocabulary ofa2 tags, ana3 word input sequence asa4 (a5 a0a7a6 ) and tag sequence asa8</Paragraph>
    <Paragraph position="3"> a8a11a10 , the most likely tag sequence out of the possible tag sequences (a8 ) that can be associated toa4 .</Paragraph>
    <Paragraph position="5"> Following the techniques of Hidden Markov Models (HMM) applied to speech recognition, these tagging problems have been previously modeled indirectly through the transformation of the Bayes rule as in Equation 2. The problem is then approximated for sequence classi cation by a ka33a35a34 -order Markov model as shown in Equation (3).</Paragraph>
    <Paragraph position="7"> (3)Although the HMM approach to tagging can easily be represented as a WFST, it has a drawback in that the use of large contexts and richer features results in sparseness leading to unreliable estimation of the parameters of the model.</Paragraph>
    <Paragraph position="8"> An alternate approach to arriving at a8a58a10 is to model Equation 1 directly. There are many examples in recent literature (Breiman et al., 1984; Freund and Schapire, 1996; Roth, 1998; Lafferty et al., 2001; McCallum et al., 2000) which take this approach and are well equipped to handle large number of features. The general framework for these approaches is to learn a model from pairs of associations of the form (a23 a42a60a59a62a61a63a42) where a23 a42 is a feature representation of a4 and a61a21a42 (a5a64a1 ) is one of the members of the tag set. Although these approaches have been more effective than HMMs, there have not been many attempts to represent these models as a WFST, with the exception of the work on compiling decision trees (Sproat and Riley, 1996). In this paper, we consider the boosting (Freund and Schapire, 1996) approach (which outperforms decision trees) to Equation 1 and present a technique for compiling the classi er model into a WFST.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Boostexter
</SectionTitle>
    <Paragraph position="0"> Boostexter is a machine learning tool which is based on the boosting family of algorithms rst proposed in (Freund and Schapire, 1996). The basic idea of boosting is to build a highly accurate classi er by combining many weak or simple base learner, each one of which may only be moderately accurate.</Paragraph>
    <Paragraph position="1"> A weak learner or a rulea65 is a triplea26a67a66 a59a69a68a70 a59 a68a71 a31 , which tests a predicate (a66 ) of the input (a23 ) and assigns a</Paragraph>
    <Paragraph position="3"> is assumed that a pool of such weak learners a75 a12  a65a78a77 can be constructed easily. From the pool of weak learners, the selection the weak learner to be combined is performed iteratively. At each iteration a48 , a weak learner a65 a33is selected that minimizes a prediction error loss function on the training corpus which takes into account the weighta46  assigned to each training example. Intuitively, the weights encode how important it is that a65  correctly classi es each training example. Generally, the examples that were most often misclassi ed by the preceding base classi ers will be given the most weight so as to force the base learner to focus on the hardest examples. As described in (Schapire and Singer, 1999), Boostexter uses con dence rated classi ers a65</Paragraph>
    <Paragraph position="5"> a31 whose sign (-1 or +1) is interpreted as a prediction, and whose magnitude a28a65</Paragraph>
    <Paragraph position="7"> is a measure of con dence . The iterative algorithm for combining weak learners stops after a prespeci ed number of iterations or when the training set accuracy saturates.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Weak Learners
</SectionTitle>
      <Paragraph position="0"> In the case of text classi cation applications, the set of possible weak learners is instantiated from simplea2 -grams of the input text (a4 ). Thus, ifa79a81a80 is a function to produce alla2 -grams up toa2 of its argument, then the set of predicates for the weak learn-</Paragraph>
      <Paragraph position="2"> problems, which take into account the left and right context, we extend the set of weak learners created from the word features with those created from the left and right context features. Thus features of the left context (a83 a42a84 ), features of the right context (a83 a42a85 ) and the features of the word itself (a83 a42a86a57a87) constitute the features at positiona72. The predicates for the pool of weak learners are created from these set of features and are typicallya2 -grams on the feature representations. Thus the set of predicates resulting from the word level features is a75a89a88 a12a91a90 a42a79a45a80 a26a83 a42a86a57a87a31 , from left context features is a75</Paragraph>
      <Paragraph position="4"> To date, decoding using the boosted rule sets is restricted to cases where the test input is unambiguous such as strings or words (not word graphs). By compiling these rule sets into WFSTs, we intend to extend their applicability to packed representations of ambiguous input such as word graphs.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Compilation
</SectionTitle>
    <Paragraph position="0"> We note that the weak learners selected at the end of the training process can be partitioned into one of three types based on the features that the learners  : test features of the right context We use the representation of context-dependent rewrite rules (Johnson, 1972; Kaplan and Kay, 1994) and their weighted version (Mohri and Sproat, 1996) to represent these weak learners. The (weighted) context-dependent rewrite rules have the general form a83a119a118a64a120a121a28a97a122 a123 (6) where a83 ,a120 ,a122 and a123 are regular expressions on the alphabet of the rules. The interpretation of these rules are as follows: Rewrite a83 by a120 when it is preceded by a122 and followed by a123 . Furthermore, a120 can be extended to a rational power series which are weighted regular expressions where the weights encode preferences over the paths ina120 (Mohri and Sproat, 1996).</Paragraph>
    <Paragraph position="1"> Each weak learner can then be viewed as a set of weighted rewrite rules mapping the input word into each membera61a124a42 (a5a125a1 ) with a weight a70 a42 when the predicate of the weak learner is true and with weight a71 a42 when the predicate of the weak learner is false. The translation between the three types of weak learners and the weighted context-dependency rules is shown in Table 13.</Paragraph>
    <Paragraph position="2"> We note that these rules apply left to right on an input and do not repeatedly apply at the same point in an input since the output vocabularya1 would typically be disjoint from the input vocabulary a0 . We use the technique described in (Mohri and Sproat, 1996) to compile each weighted context-dependency rules into an WFST. The compilation is accomplished by the introduction of context symbols which are used as markers to identify locations for rewrites of a83 with a120 . After the rewrites, the markers are deleted. The compilation process is represented as a composition of ve transducers.</Paragraph>
    <Paragraph position="3"> The WFSTs resulting from the compilation of each selected weak learner (a126 a42) are unioned to create the WFST to be used for decoding. The weights of paths with the same input and output labels are added during the union operation.</Paragraph>
    <Paragraph position="5"> We note that the due to the difference in the nature of the learning algorithm, compiling decision trees results in a composition of WFSTs representing the rules on the path from the root to a leaf node (Sproat and Riley, 1996), while compiling boosted rules results in a union of WFSTs, which is expected to result in smaller transducers.</Paragraph>
    <Paragraph position="6"> In order to apply the WFST for decoding, we simply compose the model with the input represented as an WFST (a126 a105 ) and search for the best path (if we are interested in the single best classi cation result).</Paragraph>
    <Paragraph position="8"> We have compiled the rules resulting from boostexter trained on transcriptions of speech utterances from a call routing task with a vocabulary (a28a0 a28) of 2912 and 40 classes (a2 a12a64a132a124a133 ). There were a total of 1800 rules comprising of 900 positive rules and their negative counterparts. The WFST resulting from compiling these rules has a 14372 states and 5.7 million arcs. The accuracy of the WFST on a random set of 7013 sentences was the same (85% accuracy) as the accuracy with the decoder that accompanies the boostexter program. This validates the compilation procedure.</Paragraph>
  </Section>
class="xml-element"></Paper>