<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1647"> <Title>Lexicon Acquisition for Dialectal Arabic Using Transductive Learning</Title>
<Section position="5" start_page="400" end_page="402" type="metho"> <SectionTitle> 3 Learning Frameworks and Algorithms </SectionTitle>
<Paragraph position="0"> Let us formally define the lexicon learning problem. We have a wordlist of size m + u. A portion of these words (m) are annotated with POS-set labels, which may be acquired by manual annotation or an automatic analysis tool. The set of labeled words {Xm} is the training set, also referred to as the partial lexicon. The task is to predict the POS-sets of the remaining u unlabeled words {Xu}, the test set. The goal of lexicon learning is to label {Xu} with low error. The final result is a full lexicon that contains POS-sets for all m + u words.</Paragraph>
<Section position="1" start_page="400" end_page="401" type="sub_section"> <SectionTitle> 3.1 Transductive Learning with Structured Outputs </SectionTitle>
<Paragraph position="0"> We argue that the above problem formulation lends itself to a transductive learning framework.</Paragraph>
<Paragraph position="1"> Standard inductive learning uses a training set of fully labeled samples in order to learn a classification function. After completion of the training phase, the learned model is then used to classify samples from a new, previously unseen test set. Semi-supervised inductive learning exploits unlabeled data in addition to labeled data to better learn a classification function. Transductive learning, first described by Vapnik (Vapnik, 1998), also refers to a setting where both labeled and unlabeled data are used jointly to decide on a label assignment to the unlabeled data points. However, the goal here is not to learn a general classification function that can be applied to new test sets multiple times, but to achieve a high-quality one-time labeling of a particular data set. Transductive learning and inductive semi-supervised learning are sometimes confused in the literature. Both approaches use unlabeled data in learning; the key difference is that a transductive classifier only optimizes the performance on the given unlabeled data, while an inductive semi-supervised classifier is trained to perform well on any new unlabeled data.</Paragraph>
<Paragraph position="2"> Lexicon learning fits into the transductive learning framework as follows: the test set {Xu}, i.e. the unlabeled words, is static and known during learning time; we are not interested in inferring POS-sets for any words outside the word list.</Paragraph>
[Figure 1: single or compound labels]
<Paragraph position="3"> An additional characterization of the lexicon learning problem is that it is a problem of learning with complex, structured outputs. The label for each word is its POS-set, which may contain one to K POS tags (where K is the size of the tagset, K=20 in our case). This differs from traditional classification tasks where the output is a single scalar variable.</Paragraph>
<Paragraph position="4"> Structured output problems like lexicon learning can be characterized by the granularity of the basic unit of labels. We define two cases: single-label and compound-label. In the single-label framework (see Figure 1), each individual POS tag is the target of classification and we have K binary classifiers, each hypothesizing whether a word has POS tag k (k = 1,...,K). A second-stage classifier takes the results of the K individual classifiers and outputs a POS-set. This classifier can simply take all POS tags hypothesized positive by the individual binary classifiers to form the POS-set, or use a more sophisticated scheme for determining the number of POS tags (Elisseeff and Weston, 2002).</Paragraph>
<Paragraph position="5"> The alternative compound-label framework treats each POS-set as an atomic label for classification. A POS-set such as {NN, VB} is compounded into one label NN-VB, which is a different label than, say, NN or NN-JJ. Suppose there exist N distinct POS-sets in the training data; then we have N atomic units for labeling. Thus an (N-ary) multi-class classifier is employed to directly predict the POS-set of a word. If only binary classifiers are available (as in the case of Support Vector Machines), one can use one-vs-rest, pairwise, or error-correcting code schemes to implement the multi-class classification.</Paragraph>
<Paragraph position="6"> The single-label framework is potentially ill-suited for capturing the dependencies between POS tags. Such dependencies arise because some tags, such as NN and NNP, can often be assigned to the same word and therefore co-occur in the POS-set label. The compound-label framework implicitly captures tag co-occurrence, but potentially suffers from training data fragmentation as well as the inability to hypothesize POS-sets that do not already exist in the training data. In our initial experiments, the compound-label framework gave better classification results; thus we implemented all of our algorithms in the multi-class framework (using the one-vs-rest scheme and choosing the argmax as the final decision).</Paragraph>
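As a concrete illustration of this compound-label setup, the following is a minimal sketch, not the paper's implementation: scikit-learn, a linear SVM, and the dict-based features are assumptions. Every distinct POS-set seen in training becomes one atomic class, a classifier is trained one-vs-rest over those classes, and the predicted class is decoded back into a POS-set.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.svm import LinearSVC

    def compound_label_classify(train_feats, train_possets, test_feats):
        """One-vs-rest multi-class classification over compound POS-set labels.

        Each distinct POS-set (e.g. {'NN', 'VB'}) is encoded as one atomic label
        ('NN-VB'); as noted above, such a classifier can only predict POS-sets
        that already occur in the training data.
        """
        vec = DictVectorizer()
        X_train = vec.fit_transform(train_feats)   # train_feats: list of feature dicts
        X_test = vec.transform(test_feats)
        y_train = ["-".join(sorted(s)) for s in train_possets]   # compound labels
        clf = LinearSVC().fit(X_train, y_train)    # one-vs-rest over the N compound labels
        return [set(label.split("-")) for label in clf.predict(X_test)]

The inductive ISVM baseline of Section 6 has exactly this shape; the transductive methods below change the learner, not the label encoding.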
</Section>
<Section position="2" start_page="401" end_page="401" type="sub_section"> <SectionTitle> 3.2 Transductive Clustering </SectionTitle>
<Paragraph position="0"> How does a transductive algorithm effectively utilize unlabeled samples in the learning process? One popular approach is the application of the so-called cluster assumption, which intuitively states that samples close to each other (i.e. samples that form a cluster) should have similar labels.</Paragraph>
<Paragraph position="1"> Transductive clustering (TC) is a simple algorithm that directly implements the cluster assumption. The algorithm clusters labeled and unlabeled samples jointly, then uses the labels of the labeled samples to infer the labels of unlabeled words in the same cluster. This idea is relatively straightforward, yet what is needed is a principled way of deciding the correct number of clusters and the precise way of label transduction (e.g. based on majority vote vs. probability thresholds). Typically, such parameters are decided heuristically (e.g. (Duh and Kirchhoff, 2005a)) or by tuning on a labeled development set; for resource-poor languages, however, no such set may be available.</Paragraph>
<Paragraph position="2"> As suggested by (El-Yaniv and Gerzon, 2005), the TC algorithm can utilize a theoretical error bound as a principled way of determining these parameters. Let ^R_h(Xm) be the empirical risk of a given hypothesis (i.e. classifier) on the training set, and let R_h(Xu) be the test risk. (Derbeko et al., 2004) derive an error bound which states that, with probability 1 - δ, the risk on the test samples is bounded: the test risk is at most the empirical risk on the labeled data, ^R_h(Xm), plus a term that varies with the prior p(h) of the hypothesis or classifier.</Paragraph>
<Paragraph position="3"> This is a PAC-Bayesian bound (McAllester, 1999).</Paragraph>
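For orientation, the bound has the generic PAC-Bayesian shape sketched below. This is a schematic only: the exact slack term and constants are those given by Derbeko et al. (2004) and in the paper's Eq. 2, which is not reproduced in this extract. With probability at least 1 - δ,

    R_h(X_u) \;\lesssim\; \hat{R}_h(X_m) \;+\; \sqrt{\frac{\ln\frac{1}{p(h)} + \ln\frac{1}{\delta}}{2m}}

The qualitative behaviour described next (a loose bound whenever the prior p(h) is small or the empirical risk is high) can be read directly off this form.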
<Paragraph position="4"> The prior p(h) indicates one's prior belief in the hypothesis h over the set of all possible hypotheses. If the prior is low or the empirical risk is high, then the bound is large, implying that the test risk may be large. A good hypothesis (i.e. classifier) will ideally have a small value for the bound, thus predicting a small expected test risk.</Paragraph>
<Paragraph position="5"> The PAC-Bayesian bound is important because it provides a theoretical guarantee on the quality of a hypothesis. Moreover, the bound in Eq. 2 is particularly useful because it is easily computable for any hypothesis h, assuming that one is given the value of p(h). Given two hypothesized labelings of the test set, h1 and h2, the one with the lower PAC-Bayesian bound will achieve the lower expected test risk. Therefore, one can use the bound as a principled way of choosing the parameters in the Transductive Clustering algorithm: first, a large number of different clusterings is created; then the one that achieves the lowest PAC-Bayesian bound is chosen. The pseudo-code is given in Figure 2.</Paragraph>
<Paragraph position="6"> (El-Yaniv and Gerzon, 2005) have applied the Transductive Clustering algorithm successfully to binary classification problems and demonstrated improvements over the current state-of-the-art Spectral Graph Transducers (Section 3.4). We use the algorithm as described in (Duh and Kirchhoff, 2005b), which adapts it to structured output problems. In particular, the modification involves a different estimate of the prior p(h), which was assumed to be uniform in (El-Yaniv and Gerzon, 2005). Since there are many possible h, adopting a uniform prior will lead to small values of p(h) and thus a loose bound for all h. Probability mass should only be spent on POS-sets that are possible, and as such, we calculate p(h) based on the frequencies of compound-labels in the training data (i.e. an empirical prior).</Paragraph>
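Putting the pieces of this subsection together, here is a minimal sketch of the TC loop under stated assumptions: k-means as the joint clustering step, majority-vote transduction, the schematic bound shown earlier as the selection criterion, and the empirical prior over compound labels. It is an illustration, not the pseudo-code of Figure 2.

    import math
    from collections import Counter

    import numpy as np
    from sklearn.cluster import KMeans

    def transductive_clustering(X_lab, y_lab, X_unlab, cluster_range, delta=0.05):
        """Transductive Clustering with bound-based model selection (a sketch).

        y_lab holds compound labels (e.g. 'NN-VB').  KMeans, the majority-vote
        transduction rule, and the schematic bound are illustrative choices; the
        paper's exact clustering procedure and Eq. 2 may differ.
        """
        X_all = np.vstack([X_lab, X_unlab])
        m = len(y_lab)
        freq = Counter(y_lab)
        p = {lab: c / m for lab, c in freq.items()}   # empirical prior over compound labels

        best_bound, best_labels = float("inf"), None
        for k in cluster_range:                       # candidate numbers of clusters
            assign = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_all)
            lab_assign, unlab_assign = assign[:m], assign[m:]

            # transduction: each cluster takes the majority label of its labeled members
            cluster_label = {}
            for c in set(assign):
                votes = Counter(y_lab[i] for i in range(m) if lab_assign[i] == c)
                cluster_label[c] = (votes or freq).most_common(1)[0][0]

            emp_risk = sum(cluster_label[c] != y for c, y in zip(lab_assign, y_lab)) / m

            # p(h) from compound-label frequencies, one factor per cluster
            log_p_h = sum(math.log(p[cluster_label[c]]) for c in set(assign))
            bound = emp_risk + math.sqrt((-log_p_h + math.log(1.0 / delta)) / (2 * m))

            if bound < best_bound:
                best_bound, best_labels = bound, [cluster_label[c] for c in unlab_assign]
        return best_labels

Only the number of clusters is searched here; the same loop can equally search over the transduction rule.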
</Section>
<Section position="3" start_page="401" end_page="402" type="sub_section"> <SectionTitle> 3.3 Transductive SVM </SectionTitle>
<Paragraph position="0"> Transductive SVM (TSVM) (Joachims, 1999) is an algorithm that implicitly implements the cluster assumption. In a standard (inductive) SVM, the learning algorithm seeks to maximize the margin subject to misclassification constraints on the training samples. In TSVM, this optimization is generalized to include additional constraints on the unlabeled samples. The resulting optimization algorithm seeks to maximize the margin on both labeled and unlabeled samples and creates a hyperplane that avoids high-density regions (e.g. clusters).</Paragraph>
</Section>
<Section position="4" start_page="402" end_page="402" type="sub_section"> <SectionTitle> 3.4 Spectral Graph Transducer </SectionTitle>
<Paragraph position="0"> The Spectral Graph Transducer (SGT) (Joachims, 2003) achieves transduction via an extension of the normalized mincut clustering criterion. First, a data graph is constructed where the vertices are labeled or unlabeled samples and the edge weights represent similarities between samples. The mincut criterion seeks to partition the graph such that the sum of cut edges is minimized. SGT extends this idea to transductive learning by incorporating constraints that require samples of the same label to be in the same cluster. The resulting partitions decide the labels of the unlabeled samples.</Paragraph>
</Section> </Section>
<Section position="6" start_page="402" end_page="402" type="metho"> <SectionTitle> 4 Data </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="402" end_page="402" type="sub_section"> <SectionTitle> 4.1 Corpus </SectionTitle>
<Paragraph position="0"> The dialect addressed in this work is Levantine Colloquial Arabic (LCA), primarily spoken in Jordan, Lebanon, Palestine, and Syria. Our development/test data comes from the Levantine Arabic CTS Treebank provided by LDC. The training data comes from the Levantine CTS Audio Transcripts. Both are from the Fisher collection of conversational telephone speech between Levantine speakers previously unknown to each other.</Paragraph>
<Paragraph position="1"> The LCA data was transcribed in standard MSA script and transliterated into ASCII characters using the Buckwalter transliteration scheme. No diacritics are used in either the training or development/test data. Speech effects such as disfluencies and noises were removed prior to our experiments.</Paragraph>
<Paragraph position="2"> The training set consists of 476k tokens and 16.6k types. It is not annotated with POS tags; this is the raw text we use to train the unsupervised HMM tagger. The test set consists of 15k tokens and 2.4k types, and is manually annotated with POS tags. The development set is also POS-annotated, and contains 16k tokens and 2.4k types. We used the reduced tagset known as the Bies tagset (Maamouri et al., 2004), which focuses on major part-of-speech categories and excludes detailed morphological information.</Paragraph>
<Paragraph position="3"> Using the compound-label framework, we observe 220 and 67 distinct compound-labels (i.e. POS-sets) in the training and test sets, respectively. As mentioned in Section 3.1, a classifier in the compound-label framework can never hypothesize POS-sets that do not already exist in the training data: 43% of the test vocabulary (and 8.5% by token frequency) fall under this category.</Paragraph>
</Section>
<Section position="2" start_page="402" end_page="402" type="sub_section"> <SectionTitle> 4.2 Morphological Analyzer </SectionTitle>
<Paragraph position="0"> We employ the LDC-distributed Buckwalter analyzer for morphological analyses of Arabic words.</Paragraph>
<Paragraph position="1"> For a given word, the analyzer outputs all possible morphological analyses, including stems, POS tags, and diacritizations. The information regarding the possible POS tags for a given word is crucial for constraining the unsupervised learning process in HMM taggers.</Paragraph>
<Paragraph position="2"> The Buckwalter analyzer is based on an internal stem lexicon combined with rules for affixation. It was originally developed for MSA, so only a certain percentage of Levantine words can be correctly analyzed. Table 1 shows the percentages of words in the LCA training text that received N possible POS tags from the Buckwalter analyzer.</Paragraph>
<Paragraph position="3"> Roughly 23% of types and 28% of tokens received no tags (N=0) and are considered un-analyzable.</Paragraph>
[Table 1: Percentage of words with N possible tags, as determined by the Buckwalter analyzer. Words with 0 tags are un-analyzable.]
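Step 1 of the system described in the next section reduces to collapsing these analyses into a POS-set per word. A minimal sketch follows, in which analyze is a hypothetical stand-in for the Buckwalter analyzer interface (the real tool's API is not shown in this extract):

    def build_partial_lexicon(wordlist, analyze):
        """Collapse morphological analyses into a POS-set per word (Step 1).

        `analyze(word)` is a hypothetical stand-in for the Buckwalter analyzer
        and is assumed to return a list of analyses, each a dict with a 'pos'
        field; words with no analyses are returned separately as un-analyzable.
        """
        partial_lexicon = {}   # word -> set of possible POS tags
        unanalyzable = []      # words left for lexicon learning (Step 2)
        for word in wordlist:
            tags = {a["pos"] for a in analyze(word)}
            if tags:
                partial_lexicon[word] = tags
            else:
                unanalyzable.append(word)
        return partial_lexicon, unanalyzable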
</Section> </Section>
<Section position="7" start_page="402" end_page="403" type="metho"> <SectionTitle> 5 System </SectionTitle>
<Paragraph position="0"> Our overall system looks as follows (see Figure 3): In Step 1, the MSA (Buckwalter) analyzer is applied to the word list derived from the raw training text. The result is a partial POS lexicon, which lists the set of possible POS tags for those words for which the analyzer provided some output. All possibilities suggested by the analyzer are included.</Paragraph>
<Paragraph position="1"> The focus of Step 2 is to infer the POS-sets of the remaining, unannotated words using one of the automatic learning procedures described in Section 3. Finally, Step 3 involves training an HMM tagger using the learned lexicon. This is the standard unsupervised learning component of the system. We use a trigram HMM, although modifications such as the addition of affixes and variables modeling speech effects may improve tagging accuracy. Our concern here is the evaluation of the lexicon learning component in Step 2.</Paragraph>
<Paragraph position="2"> An important problem in this system setup is the possibility of error propagation. In Step 1, the MSA analyzer may give incorrect POS-sets to analyzable words. It may not posit the correct tag (low recall), or it may give too many tags (low precision). Both have a negative effect on lexicon learning and EM training. For lexicon learning, Step 1 errors represent corrupt training data; for EM training, Step 1 errors may cause the HMM tagger to never hypothesize the correct tag (low recall) or to face too much confusability during training (low precision). We attempted to measure the extent of this error by calculating the tag precision/recall on words that occur in the test set: among the 12k words analyzed by the analyzer, 1483 words occur in the test data. We used the annotations in the test data and collected the oracle POS-sets for each of these 1483 words. (Since the test set is small, these oracle POS-sets may be missing some tags; thus the true precision may be higher, and the recall lower, than measured.) The average precision of the analyzer-generated POS-sets against the oracle is 56.46%. The average recall is 81.25%. Note that precision is low; this implies that the partial lexicon is not very constrained. The recall of 81.25% means that 18.75% of the words may never receive the correct tag in tagging. In the experiments, we will investigate to what extent this kind of error affects lexicon learning and EM training.</Paragraph>
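The precision/recall figures above can be reproduced directly from the POS-sets. A minimal sketch follows; macro-averaging the per-word precision and recall is an assumption, since the exact averaging is not spelled out above:

    def posset_precision_recall(predicted, oracle):
        """Average per-word precision/recall of predicted POS-sets vs. oracle POS-sets.

        `predicted` and `oracle` map word -> set of POS tags; only words present
        in both are scored, and the sets are assumed non-empty.
        """
        words = [w for w in predicted if w in oracle]
        precision = sum(len(predicted[w] & oracle[w]) / len(predicted[w]) for w in words) / len(words)
        recall = sum(len(predicted[w] & oracle[w]) / len(oracle[w]) for w in words) / len(words)
        return precision, recall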
</Section>
<Section position="8" start_page="403" end_page="406" type="metho"> <SectionTitle> 6 Experiments </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="403" end_page="405" type="sub_section"> <SectionTitle> 6.1 Lexicon learning experiments </SectionTitle>
<Paragraph position="0"> We seek to answer the following three questions in our experiments: * How useful is the lexicon learning step in an end-to-end POS tagging system? Do the machine learning algorithms produce lexicons that result in higher tagging accuracies, when compared to a baseline lexicon that simply hypothesizes all POS tags for un-analyzable words? The answer is a definitive yes.</Paragraph>
<Paragraph position="1"> * Which machine learning algorithms perform best on this task? Does transductive learning outperform inductive learning? The empirical answer is that TSVM performs best, SGT performs worst, and TC and ISVM are in the middle.</Paragraph>
<Paragraph position="3"> * What is the relative impact of errors from the MSA analyzer on lexicon learning and EM training? The answer is that Step 1 errors affect EM training more, and lexicon learning is comparably robust to these errors.</Paragraph>
<Paragraph position="4"> In our problem, we have 12k labeled samples and 3970 unlabeled samples. We define the features of each sample as listed in Table 2. The contextual features are generated from co-occurrence statistics gleaned from the training data. For instance, for a word foo, we collect all bigrams containing foo from the raw text; all features [w_{t-1} = voc] that correspond to the bigrams (voc, foo) are set to 1. The idea is that words with similar orthographic and/or contextual features should receive similar POS-sets.</Paragraph>
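A minimal sketch of this feature extraction follows. Only the left-context bigram feature is spelled out above; the right-context and prefix/suffix features below are illustrative stand-ins, since Table 2 itself is not reproduced in this extract. The resulting dicts are what the DictVectorizer in the earlier compound-label sketch consumes.

    def extract_features(word, bigrams):
        """Binary features for one word, returned as a dict of feature-name -> 1.

        `bigrams` is the set of (w1, w2) pairs observed in the raw training text.
        The left-context feature follows the description above; the right-context
        and prefix/suffix features are illustrative stand-ins for Table 2.
        """
        feats = {}
        for w1, w2 in bigrams:
            if w2 == word:
                feats["w_prev=" + w1] = 1   # [w_{t-1} = w1]
            if w1 == word:
                feats["w_next=" + w2] = 1   # [w_{t+1} = w2]
        feats["prefix=" + word[:2]] = 1
        feats["suffix=" + word[-2:]] = 1
        return feats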
<Paragraph position="5"> All results, unless otherwise noted, are tagging accuracies on the test set obtained by training an HMM tagger on a specific lexicon. Table 3 gives tagging accuracies of the four machine learning methods (TSVM, TC, ISVM, SGT) as well as two baseline approaches for generating a lexicon: (all tags) gives all 20 possible tags to the un-analyzable words, whereas (open class) gives only the subset of open-class POS tags. The results are given in descending order of overall tagging accuracy. With the exception of TSVM (63.54%) vs. TC (62.89%), all differences are statistically significant. As seen in the table, applying a machine learning step for lexicon learning is a worthwhile effort, since it always leads to better tagging accuracies than the baseline methods.</Paragraph>
<Paragraph position="6"> The accuracies on unknown words (UnkAcc in Table 3) follow the same trend and are generally quite low. This might be due to the fact that POS tags of unknown words are usually best predicted by the HMM's transition probabilities, which may not be as robust due to the noisy lexicon.</Paragraph>
<Paragraph position="7"> [Table 3: Tagging accuracies for lexicons generated by machine learning (TSVM, TC, ISVM, SGT) and baseline methods. Accuracy = overall accuracy; UnkAcc = accuracy on unknown words.]</Paragraph>
<Paragraph position="8"> The poor performance of SGT is somewhat surprising, since it is contrary to results presented in other papers. We attribute this to the difficulty of constructing the data graph. For instance, we constructed k-nearest-neighbor graphs based on the cosine distance between feature vectors, but it is difficult to decide the best distance metric or number of neighbors (a graph-construction sketch follows this paragraph). Finally, we note that, setting aside the performance of SGT, the transductive learning methods (TSVM, TC) outperform the inductive ISVM. We also compute precision/recall statistics of the final lexicon on the test set words (similar to Section 5) and measure the average size of the POS-sets (||POS-set||). As seen in Table 4, the POS-set sizes of the machine-learned lexicons are a factor of 2 to 3 smaller than those of the baseline lexicons. On the other hand, recall is better for the baseline lexicons. These observations, combined with the fact that the machine-learned lexicons gave better tagging accuracy, suggest that we have a constrained lexicon effect here: for EM training, it is better to constrain the lexicon with small POS-sets than to achieve high recall.</Paragraph>
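As referenced in the paragraph above, most of the SGT-specific choices live in the graph construction. A minimal sketch of a cosine-similarity k-nearest-neighbor graph follows; scikit-learn is an assumption, and k and the metric are exactly the free parameters discussed above:

    from sklearn.neighbors import kneighbors_graph

    def build_knn_graph(X, k=10):
        """k-NN graph over feature vectors with cosine-similarity edge weights.

        Returns a sparse, symmetric adjacency matrix suitable as SGT input; the
        choice of k and of the distance metric are the free parameters noted above.
        """
        dist = kneighbors_graph(X, n_neighbors=k, metric="cosine", mode="distance")
        sim = dist.copy()
        sim.data = 1.0 - sim.data      # cosine distance -> cosine similarity
        return sim.maximum(sim.T)      # symmetrize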
<Paragraph position="9"> Next, we examined the effects of error propagation from the MSA analyzer in Step 1. We attempted to correct these errors by using the POS-sets of words derived from the development data. In particular, of the 1562 partial lexicon words that also occur in the development set, we found 1044 words whose POS-sets did not entirely match. These POS-sets were replaced with the oracle POS-sets derived from the development data, and the result is treated as the (corrected) partial lexicon of Step 1. In this procedure, the average POS-set size of the partial lexicon decreased from 2.13 to 1.10, recall increased from 82.44% to 100%, and precision increased from 57.15% to 64.31%. We apply lexicon learning to this corrected partial lexicon and evaluate the tagging results, shown in Table 5. The fact that all numbers in Table 5 represent significant improvements over Table 3 implies that error propagation is not a trivial problem, and automatic error correction methods may be desired.</Paragraph>
<Paragraph position="10"> [Table 5: Tagging accuracies when correcting the partial lexicon prior to lexicon learning.] Interestingly, we note that ISVM outperforms TC here, which differs from Table 3.</Paragraph>
<Paragraph position="11"> Finally, we determine whether error propagation impacts lexicon learning (Step 2) or EM training (Step 3) more. Table 6 shows the results of TSVM for four scenarios: correcting analyzer errors in the lexicon (A) prior to lexicon learning, (B) prior to EM training, (C) both, or (D) neither. [Table 6: Results of correcting the lexicon at different steps. Y = yes, lexicon corrected; N = no, the POS-set remains the same as the analyzer's output.] As seen in Table 6, correcting the lexicon at Step 3 (EM training) gives the largest improvement, indicating that analyzer errors affect EM training more than lexicon learning. This implies that lexicon learning is relatively robust to training data corruption, and that one can mainly focus on improved estimation techniques for EM training (Wang and Schuurmans, 2005) if the goal is to alleviate the impact of analyzer errors. The same evaluation on the other machine learning methods (TC, ISVM, SGT) shows similar results.</Paragraph>
</Section>
<Section position="2" start_page="405" end_page="406" type="sub_section"> <SectionTitle> 6.2 Comparison experiments: Expert lexicon and supervised learning </SectionTitle>
<Paragraph position="0"> Our approach to building a resource-poor POS tagger involves (a) lexicon learning and (b) unsupervised training. In this section we examine cases where (a) an expert lexicon is available, so that lexicon learning is not required, and (b) sentences are annotated with POS information, so that supervised training is possible. The goal of these experiments is to determine when alternative approaches involving additional human annotation become worthwhile for this task.</Paragraph>
<Paragraph position="1"> (a) Expert lexicon: First, we build an expert lexicon by collecting all tags per word in the development set (i.e. oracle POS-sets). Then, the tagger is trained using EM by treating the development set as raw text (i.e. ignoring the POS annotations). This achieves an accuracy of 74.45% on the test set. Note that this accuracy is significantly higher than the ones in Table 3, which represent unsupervised training on more raw text (the training set), but with non-expert lexicons derived from the MSA analyzer and a machine learner. This result further demonstrates the importance of obtaining an accurate lexicon in unsupervised training. If one were to build this expert lexicon by hand, one would need an annotator to label the POS-sets of 2450 distinct lexicon items.</Paragraph>
<Paragraph position="2"> (b) Supervised training: We build a supervised tagger by training on the POS annotations of the development set, which achieves 82.93% accuracy. This improved accuracy comes at the cost of annotating 2.2k sentences (16k tokens) with complete POS information.</Paragraph>
<Paragraph position="3"> Finally, we present the same results with reduced data, taking the first 50, 100, 200, etc. sentences in the development set for lexicon or POS annotation. The learning curve is shown in Table 7. [Table 7: (1) Accuracies of supervised training using varying numbers of sentences. (2) Accuracies of unsupervised training using an expert lexicon of different vocabulary sizes.] One may be tempted to draw conclusions regarding supervised vs. unsupervised approaches by directly comparing this table with the results in Section 6.1; we avoid doing so, since the taggers in Sections 6.1 and 6.2 are trained on different data sets (training vs. development set) and the accuracy differences are compounded by issues such as n-gram coverage, data-set selection, and the way the annotations are done.</Paragraph>
</Section> </Section>
<Section position="9" start_page="406" end_page="406" type="metho"> <SectionTitle> 7 Related Work </SectionTitle>
<Paragraph position="0"> There is an increasing amount of work on NLP tools for Arabic. In supervised POS tagging, (Diab et al., 2004) achieve high accuracy on MSA with the direct application of SVM classifiers. (Habash and Rambow, 2005) argue that the rich morphology of Arabic necessitates the use of a morphological analyzer in combination with POS tagging. This can be considered similar in spirit to the learning of lexicons for unsupervised tagging.</Paragraph>
<Paragraph position="1"> The work done at a recent JHU Workshop (Rambow and others, 2005) is very relevant in that it investigates a method for improving LCA tagging that is orthogonal to our approach. They do not use the raw LCA text as we have done. Instead, they train an MSA supervised tagger and adapt it to LCA through a combination of methods, such as using an MSA-LCA translation lexicon and redistributing the probability mass of MSA words to LCA words.</Paragraph>
</Section> </Paper>