<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1647">
  <Title>Lexicon Acquisition for Dialectal Arabic Using Transductive Learning</Title>
  <Section position="4" start_page="399" end_page="400" type="intro">
    <SectionTitle>
2 The Importance of Lexicons in Resource-poor POS Tagging
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="399" end_page="399" type="sub_section">
      <SectionTitle>
2.1 Unsupervised Tagging
</SectionTitle>
      <Paragraph position="0"> The lack of annotated training data in resource-poor languages necessitates the use of unsupervised taggers. One commonly-used unsupervised tagger is the Hidden Markov model (HMM), which models the joint distribution of a word sequence w0:M and tag sequence t0:M as:</Paragraph>
      <Paragraph position="2"> This is a trigram HMM. Unsupervised learning is performed by running the Expectation-Maximization (EM) algorithm on raw text. In this procedure, the tag sequences are unknown, and the probability tables p(wi|ti) and p(ti|ti[?]1,ti[?]2) are iteratively updated to maximize the likelihood of the observed word sequences.</Paragraph>
      <Paragraph position="3"> Although previous research in unsupervised tagging have achieved high accuracies rivaling supervised methods (Kupiec, 1992; Brill, 1995), much of the success is due to the use of arti cially constrained lexicons. Speci cally, the lexicon is a wordlist where each word is annotated with the set of all its possible tags. (We will call the set of possible tags of a given word the POS-set of that word; an example: POS-set of the English word bank may be {NN,VB}.) Banko and Moore (2004) showed that unsupervised tagger accuracies on English degrade from 96% to 77% if the lexicon is not constrained such that only high frequency tags exist in the POS-set for each word.</Paragraph>
      <Paragraph position="4"> Why is the lexicon so critical in unsupervised tagging? The answer is that it provides additional knowledge about word-tag distributions that may otherwise be dif cult to glean from raw text alone. In the case of unsupervised HMM taggers, the lexicon provides constraints on the probability tables p(wi|ti) and p(ti|ti[?]1,ti[?]2). Speci cally, the lexical probability table is initialized such that p(wi|ti) = 0 if and only if tag ti is not included in the POS-set of word wi. The transition probability table is initialized such that p(ti|ti[?]1,ti[?]2) = 0 if and only if the tag sequence (ti,ti[?]1,ti[?]2) never occurs in the tag lattice induced by the lexicon on the raw text. The effect of these zero-probability initialization is that they will always stay zero throughout the EM procedure (modulo the effects of smoothing). This therefore acts as hard constraints and biases the EM algorithm to avoid certain solutions when maximizing likelihood. If the lexicon is accurate, then the EM algorithm can learn very good predictive distributions from raw text only; conversely, if the lexicon is poor, EM will be faced with more confusability during training and may not produce a good tagger. In general, the addition of rare tags, even if they are correct, creates a harder learning problem for EM.</Paragraph>
      <Paragraph position="5"> Thus, a critical aspect of resource-poor POS tagging is the acquisition of a high-quality lexicon. This task is challenging because the lexicon learning algorithm must not be resource-intensive.</Paragraph>
      <Paragraph position="6"> In practice, one may be able to nd analysis tools or incomplete annotations such that only a partial lexicon is available. The focus is therefore on effective machine learning algorithms for inferring a full high-quality lexicon from a partial, possibly noisy initial lexicon. We shall now discuss this situation in the context of dialectal Arabic.</Paragraph>
    </Section>
    <Section position="2" start_page="399" end_page="400" type="sub_section">
      <SectionTitle>
2.2 Dialectal Arabic
</SectionTitle>
      <Paragraph position="0"> The Arabic language consist of a collection of spoken dialects and a standard written language (Modern Standard Arabic, or MSA). The dialects of Arabic are of considerable importance since they are used extensively in almost all everyday conversations. NLP technology for dialectal Arabic is still in its infancy, however, due to the lack of data and resources. Apart from small amounts of written dialectal material in e.g. plays, novels, chat rooms, etc., data can only be obtained by recording and manually transcribing actual conversations. Annotated corpora are scarce because annotation requires another stage of manual effort beyond transcription work. In addition, basic resources such as lexicons, morphological analyzers, tokenizers, etc. have been developed for MSA, but are virtually non-existent for dialectal Arabic.</Paragraph>
      <Paragraph position="1"> In this study, we address lexicon learning for Levantine Colloquial Arabic. We assume that only two resources are available during training: (1) raw text transcriptions of Levantine speech and (2) a morphological analyzer developed for MSA.</Paragraph>
      <Paragraph position="2"> The lexicon learning task begins with a partial lexicon generated by applying the MSA analyzer to the Levantine wordlist. Since MSA differs from Levantine considerably in terms of syntax, morphology, and lexical choice, not all Levantine words receive an analysis. In our data, 23% of the words are un-analyzable. Thus, the  goal of lexicon learning is to infer the POS-sets of the un-analyzable words, given the partiallyannotated lexicon and raw text.</Paragraph>
      <Paragraph position="3"> Details on the Levantine data and overall system are provided in Sections 4 and 5. We discuss the learning algorithms in the next section.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>