<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1069">
  <Title>Text Chunking using Regularized Winnow</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 CoNLL-2000 chunking task
</SectionTitle>
    <Paragraph position="0"> The text chunking task is to divide text into syntactically related non-overlapping groups of words (chunks). It is considered an important problem in natural language processing. As an example of text chunking, the sentence &amp;quot;Balcor, which has interests in real estate, said the position is newly created.&amp;quot; can be divided as follows: [NP Balcor], [NP which] [VP has] [NP interests] [PP in] [NP real estate], [VP said] [NP the position] [VP is newly created].</Paragraph>
    <Paragraph position="1"> In this example, NP denotes non phrase, VP denotes verb phrase, and PP denotes prepositional phrase.</Paragraph>
    <Paragraph position="2"> The CoNLL-2000 shared task (Sang and Buchholz, 2000), introduced last year, is an attempt to set up a standard dataset so that researchers can compare different statistical chunking methods. The data are extracted from sections of the Penn Treebank. The training set consists of WSJ sections 15-18 of the Penn Treebank, and the test set consists of WSJ sections 20. Additionally, a part-of-speech (POS) tag was assigned to each token by a standard POS tagger (Brill, 1994) that was trained on the Penn Treebank. These POS tags can be used as features in a machine learning based chunking algorithm. See Section 4 for detail.</Paragraph>
    <Paragraph position="3"> The data contains eleven different chunk types.</Paragraph>
    <Paragraph position="4"> However, except for the most frequent three types: NP (noun phrase), VP (verb phrase), and PP (prepositional phrase), each of the remaining chunks has less thana130a8a131 occurrences. The chunks are represented by the following three types of tags: B-X first word of a chunk of type X I-X non-initial word in an X chunk O word outside of any chunk A standard software program has been provided (which is available from http://lcgwww.uia.ac.be/conll2000/chunking) to compute the performance of each algorithm. For each chunk, three figures of merit are computed: precision (the percentage of detected phrases that are correct), recall (the percentage of phrases in the data that are found), and the a132a76a133 a110a76a37 metric which is the harmonic mean of the precision and the recall. The overall precision, recall and a132a76a133 a110a76a37metric on all chunks are also computed. The overall a132a76a133 a110a76a37 metric gives a single number that can be used to compare different algorithms.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 System description
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Encoding of basic features
</SectionTitle>
      <Paragraph position="0"> An advantage of regularized Winnow is its robustness to irrelevant features. We can thus include as many features as possible, and let the algorithm itself find the relevant ones. This strategy ensures that we do not miss any features that are important. However, using more features requires more memory and slows down the algorithm. Therefore in practice it is still necessary to limit the number of features used.</Paragraph>
      <Paragraph position="1">  be a string of tokenized text (each token is a word or punctuation). We want to predict the chunk type of the current token a134a111a135a136a121a8a140 . For each word</Paragraph>
      <Paragraph position="3"> denote the associated POS tag, which is assumed to be given in the CoNLL-2000 shared task. The following is a list of the features we use as input to the regularized Winnow (where we choose a143a144a25a124a145 ): a146 first order features:</Paragraph>
      <Paragraph position="5"> In addition, since in a sequential process, the predicted chunk tags a134  For each data point (corresponding to the current token a134a71a135a136a121 a140 ), the associated features are encoded as a binary vector a16 , which is the input to Winnow. Each component of a16 corresponds to a possible feature value a161 of a feature a162 in one of the above feature lists. The value of the component corresponds to a test which has value one if the corresponding feature a162 achieves value a161 , or value zero if the corresponding featurea162 achieves another feature value.</Paragraph>
      <Paragraph position="6"> For example, since a141a38a135a136a142a43a140 is in our feature list, each of the possible POS value a161 of a141a99a135a12a142a139a140 corresponds to a component of a16 : the component has value one if a141a38a135a136a142a139a140a163a25a164a161 (the feature value represented by the component is active), and value zero otherwise. Similarly for a second order feature in our feature list such as a141a38a135a136a142a43a140  feature value represented by the component is active), and value zero otherwise. The same encoding is applied to all other first order and second order features, with each possible test of &amp;quot;feature = feature value&amp;quot; corresponds to a unique component in a16 .</Paragraph>
      <Paragraph position="7"> Clearly, in this representation, the high order features are conjunction features that become active when all of their components are active. In principle, one may also consider disjunction features that become active when some of their components are active. However, such features are not considered in this work. Note that the above representation leads to a sparse, but very large dimensional vector. This explains why we do not include all possible second order features since this will quickly consume more memory than we can handle.</Paragraph>
      <Paragraph position="8"> Also the above list of features are not necessarily the best available. We only included the most straight-forward features and pair-wise feature interactions. One might try even higher order features to obtain better results.</Paragraph>
      <Paragraph position="9"> Since Winnow is relatively robust to irrelevant features, it is usually helpful to provide the algorithm with as many features as possible, and let the algorithm pick up relevant ones. The main problem that prohibits us from using more features in the Winnow algorithm is memory consumption (mainly in training). The time complexity of the Winnow algorithm does not depend on the number of features, but rather on the average number of non-zero features per data, which is usually quite small.</Paragraph>
      <Paragraph position="10"> Due to the memory problem, in our implementation we have to limit the number of token features (words or punctuation) to a130a12a34a136a34a136a34 : we sort the tokens by their frequencies in the training set from high frequency to low frequency; we then treat tokens of rank a130a12a34a136a34a136a34 or higher as the same token. Since the number a130a12a34a136a34a136a34 is still reasonably large, this restriction is relatively minor.</Paragraph>
      <Paragraph position="11"> There are possible remedies to the memory consumption problem, although we have not implemented them in our current system. One solution comes from noticing that although the feature vector is of very high dimension, most dimensions are empty. Therefore one may create a hash table for the features, which can significantly reduce the memory consumption.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Using enhanced linguistic features
</SectionTitle>
      <Paragraph position="0"> We were interested in determining if additional features with more linguistic content would lead to even better performance. The ESG (English Slot Grammar) system in (McCord, 1989) is not directly comparable to the phrase structure grammar implicit in the WSJ treebank. ESG is a dependency grammar in which each phrase has a head and dependent elements, each marked with a syntactic role. ESG normally produces multiple parses for a sentence, but has the capability, which we used, to output only the highest ranked parse, where rank is determined by a system-defined measure.</Paragraph>
      <Paragraph position="1"> There are a number of incompatibilities between the treebank and ESG in tokenization, which had to be compensated for in order to transfer the syntactic role features to the tokens in the standard training and test sets. We also transferred the ESG part-of-speech codes (different from those in the WSJ corpus) and made an attempt to attach B-PP, B-NP and I-NP tags as inferred from the ESG dependency structure. In the end, the latter two tags did not prove useful. ESG is also very fast, parsing several thousand sentences on an IBM RS/6000 in a few minutes of clock time.</Paragraph>
      <Paragraph position="2"> It might seem odd to use a parser output as input to a machine learning system to find syntactic chunks. As noted above, ESG or any other parser normally produces many analyses, whereas in the kind of applications for which chunking is used, e.g., information extraction, only one solution is normally desired. In addition, due to many incompatibilities between ESG and WSJ treebank, less than a165a12a34a73a131 of ESG generated syntactic role tags are in agreement with WSJ chunks. However, the ESG syntactic role tags can be regarded as features in a statistical chunker. Another view is that the statistical chunker can be regarded as a machine learned transformation that maps ESG syntactic role tags into WSJ chunks.</Paragraph>
      <Paragraph position="3"> We denote by a162  the syntactic role tag associated with token a134a71a135a136a121  . Each tag takes one of 138 possible values. The following features are added to our system.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Dynamic programming
</SectionTitle>
      <Paragraph position="0"> In text chunking, we predict hidden states (chunk types) based on a sequence of observed states (text). This resembles hidden Markov models where dynamic programming has been widely employed. Our approach is related to ideas described in (Punyakanok and Roth, 2001). Similar methods have also appeared in other natural language processing systems (for example, in (Kudoh and Matsumoto, 2000)).</Paragraph>
      <Paragraph position="1"> Given input vectors a16 consisting of features constructed as above, we apply the regularized Winnow algorithm to train linear weight vectors.</Paragraph>
      <Paragraph position="2"> Since the Winnow algorithm only produces positive weights, we employ the balanced version of Winnow with a16 being transformed into a65a16a169a25  problems, we train one linear classifier for each chunk type. In this way, we obtain twenty-three linear classifiers, one for each chunk type a134 . Denote bya17a27a175 the weight associated with typea134 , then a straight-forward method to classify an incoming datum is to assign the chunk tag as the one with the highest score a173 a36a17 a175a11a16a82a39 .</Paragraph>
      <Paragraph position="3"> However, there are constraints in any valid sequence of chunk types: if the current chunk is of type I-X, then the previous chunk type can only be either B-X or I-X. This constraint can be explored to improve chunking performance. We denote by a176 the set of all valid chunk sequences (that is, the sequence satisfies the above chunk type constraint). null  The truncation onto the interval a66a7a10a9a12a11a13a9 a68 is to make sure that no single point contributes too much in the summation.</Paragraph>
      <Paragraph position="4"> The optimization problem  can be solved by using dynamic programming.</Paragraph>
      <Paragraph position="5"> We build a table of all chunk types for every token  a9 , and ran the regularized Winnow update formula (4) repeatedly thirty times over the training data. The algorithm is not very sensitive to these parameter choices. Some other aspects of the system design (such as dynamic programming, features used, etc) have more impact on the performance. However, due to the limitation of space, we will not discuss their impact in detail.</Paragraph>
      <Paragraph position="6"> Table 1 gives results obtained with the basic features. This representation gives a total number</Paragraph>
      <Paragraph position="8"> a34a136a205 binary features. However, the number of non-zero features per datum is a206a73a165 , which determines the time complexity of our system. The training time on a 400Mhz Pentium machine running Linux is about sixteen minutes, which corresponds to less than one minute per category.</Paragraph>
      <Paragraph position="9"> The time using the dynamic programming to produce chunk predictions, excluding tokenization, is less than ten seconds. There are about a207</Paragraph>
      <Paragraph position="11"> non-zero linear weight components per chunktype, which corresponds to a sparsity of more than a209 a165a8a131 . Most features are thus irrelevant. All previous systems achieving a similar performance are significantly more complex. For example, the previous best result in the literature was achieved by a combination of 231 kernel support vector machines (Kudoh and Matsumoto, 2000) with an overall a132a76a133 a110a76a37 value of a209 a204 a41a206a73a165 . Each kernel support vector machine is computationally significantly more expensive than a corresponding Winnow classifier, and they use an order of magnitude more classifiers. This implies that their system should be orders of magnitudes more expensive than ours. This point can be verified from their training time of about one day on a 500Mhz Linux machine. The previously second best system was a combination of five different WPDV models, with an overall a132a21a133 a110a76a37 value of a209 a204 a41a204a136a145 (van Halteren, 2000). This system is again more complex than the regularized Winnow approach we propose (their best single classifier performance is a132a76a133  value of a209 a145 a41a130a12a34 . The rest of the eleven reported systems employed a variety of statistical techniques such as maximum entropy, Hidden Markov models, and transformation based rule learners. Interested readers are referred to the summary paper (Sang and Buchholz, 2000) which contains the references to all systems being tested. testdata precision recall a132a76a133  The above comparison implies that the regularized Winnow approach achieves state of the art performance with significant less computation. The success of this method relies on regularized Winnow's ability to tolerate irrelevant features. This allows us to use a very large feature space and let the algorithm to pick the relevant ones. In addition, the algorithm presented in this paper is simple. Unlike some other approaches, there is little ad hoc engineering tuning involved in our system. This simplicity allows other researchers to reproduce our results easily.</Paragraph>
      <Paragraph position="12"> In Table 2, we report the results of our system with the basic features enhanced by using ESG syntactic roles, showing that using more linguistic features can enhance the performance of the system. In addition, since regularized Winnow is able to pick up relevant features automatically, we can easily integrate different features into our system in a systematic way without concerning ourselves with the semantics of the features. The resulting overalla132a21a133 a110a76a37 value ofa209 a206 a41a75a9 a204 is appreciably better than any previous system. The overall complexity of the system is still quite reasonable. The total number of features is about a206 a41a145  hanced features It is also interesting to compare the regularized Winnow results with those of the original Winnow method. We only report results with the basic linguistic features in Table 3. In this experiment, we use the same setup as in the regularized Winnow approach. We start with a uniform  a9 . The Winnow update (1) is performed thirty times repeatedly over the data. The training time is about sixteen minutes, which is approximately the same as that of the regularized Winnow method.</Paragraph>
      <Paragraph position="13"> Clearly regularized Winnow method has indeed enhanced the performance of the original Winnow method. The improvement is more or less consistent over all chunk types. It can also be seen that the improvement is not dramatic. This is not too surprising since the data is very close to linearly separable. Even on the testset, the multi-class classification accuracy is around a209a136a212 a131 . On average, the binary classification accuracy on the training set (note that we train one binary classifier for each chunk type) is close to a9 a34a136a34a73a131 . This means that the training data is close to linearly separable. Since the benefit of regularized Winnow is more significant with noisy data, the improvement in this case is not dramatic. We shall mention that for some other more noisy problems which we have tested on, the improvement of regularized Winnow method over the original Winnow method can be much more significant.</Paragraph>
      <Paragraph position="14"> testdata precision recall a132a76a133</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>