<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2005">
  <Title>A Phrase-based Statistical Model for SMS Text Normalization</Title>
  <Section position="5" start_page="33" end_page="34" type="metho">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> In most of these systems, a SMS lingo (i.e., SMS short form) dictionary is provided to replace SMS short-forms with normal English words (e.g., http://www.etranslator.ro and http://www.transl8bit.com). Most of the systems do not handle OOV (out-of-vocabulary) items and ambiguous inputs. The following compares SMS text normalization with other similar or related applications.</Paragraph>
    <Section position="1" start_page="33" end_page="33" type="sub_section">
      <SectionTitle>
2.1 SMS Normalization versus General
Text Normalization
</SectionTitle>
      <Paragraph position="0"> General text normalization deals with Non-Standard Words (NSWs) and has been well studied in text-to-speech (Sproat et al., 2001), while SMS normalization deals with Non-Words (NWs), or lingoes, and has seldom been studied before. NSWs, such as digit sequences, acronyms, mixed-case words (WinNT, SunOS) and abbreviations, are grammatically correct in linguistics. However, lingoes such as "b4" (before) and "bf" (boyfriend), which are usually self-created and only accepted by young SMS users, are not yet formalized in linguistics. Therefore, the special phenomena in SMS texts pose a significant challenge to SMS normalization.</Paragraph>
    </Section>
    <Section position="2" start_page="33" end_page="33" type="sub_section">
      <SectionTitle>
2.2 SMS Normalization versus Spelling
Correction Problem
</SectionTitle>
      <Paragraph position="0"> Intuitively, many would regard SMS normalization as a spelling correction problem, where the lingoes are erroneous words or non-words to be replaced by English words. Research on spelling correction concentrates on typographic and cognitive/orthographic errors (Kukich, 1992) and uses approaches (Kernighan, Church and Gale, 1991) that mostly model the edit operations using distance measures (Damerau, 1964; Levenshtein, 1966), specific word-set confusions (Golding and Roth, 1999) and pronunciation modeling (Brill and Moore, 2000; Toutanova and Moore, 2002). These models are mostly character-based or string-based and do not consider the context.</Paragraph>
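As an illustration of the distance measures cited above, here is a minimal Python sketch (ours, not from the paper) of the Levenshtein edit distance, which counts the character insertions, deletions and substitutions these spelling-correction models score:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    # prev[j] holds the distance between a[:i-1] and b[:j] for the previous row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute ca -> cb (free if equal)
            ))
        prev = curr
    return prev[-1]

print(edit_distance("b4", "before"))  # 5
```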
      <Paragraph position="1"> In addition, the author might not be aware of errors introduced during the edit operations, as most errors are due to mistyping characters that are near each other on the keyboard, or to homophones such as "poor" and "pour". In SMS, by contrast, errors are not isolated within a word and are usually not surrounded by clean context.</Paragraph>
      <Paragraph position="2"> Words are altered deliberately to reflect the sender's distinct creativity and idiosyncrasies. A character can be deleted on purpose, as in "wat" (what) and "hv" (have). SMS text also contains short-forms such as "b4" (before) and "bf" (boyfriend). In addition, normalizing SMS text might require context spanning more than one lexical unit, as in "lemme" (let me) and "ur" (you are). Therefore, the models used in spelling correction are inadequate for providing a complete solution for SMS normalization.</Paragraph>
    </Section>
    <Section position="3" start_page="33" end_page="34" type="sub_section">
      <SectionTitle>
2.3 SMS Normalization versus Text Para-
phrasing Problem
</SectionTitle>
      <Paragraph position="0"> Others may regard SMS normalization as a paraphrasing problem. Broadly speaking, paraphrases capture core aspects of variability in language by representing equivalences between different expressions that correspond to the same meaning.</Paragraph>
      <Paragraph position="1"> In most recent work (Barzilay and McKeown, 2001; Shimohata, 2002), paraphrases are acquired (semi-)automatically from large comparable or parallel corpora using lexical and morpho-syntactic information.</Paragraph>
      <Paragraph position="2"> Text paraphrasing works on clean texts from which contextual and lexical-syntactic features can be extracted and used to find "approximate conceptual equivalence". In SMS normalization, we are dealing with non-words and "ungrammatical" sentences, with the purpose of normalizing or standardizing these words and forming better sentences. The SMS normalization problem is thus different from text paraphrasing. On the other hand, it bears some similarities with MT, as we are trying to "convert" text from one language to another. However, it is a simpler problem: most of the time, we can find the same word in both the source and target text, making alignment easier.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="34" end_page="35" type="metho">
    <SectionTitle>
3 Characteristics of English SMS
</SectionTitle>
    <Paragraph position="0"> Our corpus consists of 55,000 messages collected from two sources: a SMS chat room and correspondence between university students. The content is mostly related to football matches, making friends and casual conversations on "how, what and where about". We summarize the text behaviors into two categories, as below.</Paragraph>
    <Section position="1" start_page="34" end_page="34" type="sub_section">
      <SectionTitle>
3.1 Orthographic Variation
</SectionTitle>
      <Paragraph position="0"> The most significant orthographic variation in SMS texts is the use of non-standard, self-created short-forms. Usually, senders take advantage of phonetic spellings, initial letters or number homophones to mimic spoken conversation or to shorten words or phrases (hw vs. homework or how, b4 vs. before, cu vs. see you, 2u vs. to you, oic vs. oh I see, etc.) in an attempt to minimize keystrokes. In addition, senders create new forms of written representation to express their oral utterances. Emoticons, such as ":(" symbolizing sad, ":)" symbolizing smiling and ":()" symbolizing shocked, are representations of body language. Verbal effects such as "hehe" for laughter, and emphatic discourse particles such as "lor", "lah" and "meh" for colloquial English, are prevalent in the text collection.</Paragraph>
      <Paragraph position="1"> The loss of "alpha-case" information poses another challenge for lexical disambiguation and introduces difficulty in identifying sentence boundaries, proper nouns and acronyms. With the flexible use of punctuation, or no punctuation at all, translation of SMS messages without prior processing is even more difficult.</Paragraph>
    </Section>
    <Section position="2" start_page="34" end_page="34" type="sub_section">
      <SectionTitle>
3.2 Grammar Variation
</SectionTitle>
      <Paragraph position="0"> SMS messages are short and concise and convey much information within the limited space quota (160 letters for English); they thus tend to be implicit and shaped by pragmatic and situational factors. These inadequacies of language expression, such as deletion of articles and subject pronouns, as well as problems with number agreement or tense, make SMS normalization more challenging. Table 1 illustrates some orthographic and grammar variations of SMS texts.</Paragraph>
    </Section>
    <Section position="3" start_page="34" end_page="35" type="sub_section">
      <SectionTitle>
3.3 Corpus Statistics
</SectionTitle>
      <Paragraph position="0"> We investigate the corpus to assess the feasibility of replacing the lingoes with normal English words and performing limited adjustment to the text structure. Similarly to Aw et al. (2005), we focus on the three major cases of transformation found in the corpus: (1) replacement of OOV words and non-standard SMS lingoes; (2) removal of slang; and (3) insertion of auxiliary or copula verbs and subject pronouns.</Paragraph>
      <Paragraph position="1"> Examples from Table 1 include: "i hv cm to c my luv." (I have come to see my love.); introducing local flavor: "yar lor where u go juz now" (yes, where did you go just now?); and dropping verbs: "I hv 2 go. Dinner w parents." (I have to go. Have dinner with parents.)</Paragraph>
      <Paragraph position="2"> Table 2 shows the statistics of these transformations based on 700 randomly selected messages, of which 621 (88.71%) required normalization, with a total of 2,300 transformations. Substitution accounts for almost 86% of all transformations; deletion and insertion make up the rest. Table 3 shows the top 10 most common transformations.</Paragraph>
      <Paragraph position="3"> Table 3. Top 10 most common Substitution, Deletion and Insertion:
Substitution: u - you; 2 - to; n - and; r - are; ur - your; dun - don't; man - Manchester; no - number; intro - introduce; wat - what
Deletion: m; lah; t; ah; leh; 1; huh; one; lor; ahh
Insertion: are; am; is; you; to; do; a; in; yourself; will</Paragraph>
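To make the three transformation types concrete, here is a small illustrative sketch (assuming a token-aligned data format of our own devising, not the authors' tooling) that tallies substitutions, deletions and insertions from aligned token pairs, with "null" marking a dropped SMS token or an inserted English word:

```python
from collections import Counter

def count_transformations(aligned_pairs):
    """Tally transformation types from (sms_token, english_token) pairs."""
    counts = Counter()
    for sms_tok, eng_tok in aligned_pairs:
        if eng_tok == "null":
            counts["deletion"] += 1      # SMS token dropped, e.g. "lor" -> null
        elif sms_tok == "null":
            counts["insertion"] += 1     # English word added, e.g. null -> "are"
        elif sms_tok != eng_tok:
            counts["substitution"] += 1  # e.g. "b4" -> "before"
    return counts

pairs = [("u", "you"), ("lor", "null"), ("null", "are"), ("b4", "before")]
print(count_transformations(pairs))
# Counter({'substitution': 2, 'deletion': 1, 'insertion': 1})
```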
    </Section>
  </Section>
  <Section position="7" start_page="35" end_page="35" type="metho">
    <SectionTitle>
4 SMS Normalization
</SectionTitle>
    <Paragraph position="0"> We view the SMS language as a variant of the English language with some deviations in vocabulary and grammar. Therefore, we can treat SMS normalization as an MT problem in which the SMS language is translated into normal English.</Paragraph>
    <Paragraph position="1"> We thus propose to adapt the statistical machine translation model (Brown et al., 1993; Zens and Ney, 2004) for SMS text normalization. In this section, we discuss the three components of our method: modeling, training and decoding for SMS text normalization.</Paragraph>
  </Section>
  <Section position="8" start_page="35" end_page="111" type="metho">
    <SectionTitle>
4.1 Basic Word-based Model
</SectionTitle>
    <Paragraph position="0"> The SMS normalization model is based on the source channel model (Shannon, 1948). Assuming that an English sentence e of length N is "corrupted" by a noisy channel to produce an SMS message s of length M, the English sentence e can be recovered through a posteriori distribution for the channel target text given the source text, P(s|e), and a prior distribution for the channel source text, P(e):

\hat{e} = \arg\max_{e} P(e \mid s) = \arg\max_{e} P(s \mid e)\,P(e)  (1)</Paragraph>
    <Paragraph position="1"> Assuming that one SMS word is mapped exactly to one English word in the channel model under an alignment A, we need to consider only two types of probabilities: the alignment probabilities, denoted by P(a_m \mid m), and the lexicon mapping probabilities, denoted by P(s_m \mid e_{a_m}) (Brown et al., 1993). The channel model can be written as in equation (2), where m is the position of a word in s and a_m is its aligned position in e:

P(s \mid e) = \sum_{A} P(s, A \mid e) \approx \prod_{m=1}^{M} P(a_m \mid m)\,P(s_m \mid e_{a_m})  (2)</Paragraph>
    <Paragraph position="2"> If we include the word "null" in the English vocabulary, the above model can fully address the deletion and substitution transformations, but it is inadequate for the insertion transformation. For example, the lingoes "duno" and "ysnite" have to be normalized using an insertion transformation to become "don't know" and "yesterday night". Moreover, we also want the normalization to have better lexical affinity and linguistic equivalence, so we extend the model to allow many-to-many alignment, in which a sequence of SMS words can be normalized to a sequence of contiguous English words. We call this updated model a phrase-based normalization model.</Paragraph>
    <Section position="1" start_page="35" end_page="111" type="sub_section">
      <SectionTitle>
4.2 Phrase-based Model
</SectionTitle>
      <Paragraph position="0"> Given an English sentence e and an SMS sentence s, if we assume that e can be decomposed into K phrases with a segmentation T, such that each phrase \bar{e}_k in e corresponds to one phrase \bar{s}_k in s under a segmentation S, we have e = \bar{e}_1 \bar{e}_2 ... \bar{e}_K and s = \bar{s}_1 \bar{s}_2 ... \bar{s}_K. The channel model can then be rewritten as in equation (3):

P(s \mid e) = \sum_{T,S} P(s, T, S \mid e)  (3)</Paragraph>
      <Paragraph position="1"> This is the basic function of the channel model for the phrase-based SMS normalization model, where we use the maximum approximation for the sum over all segmentations. We then further decompose the probability through phrase alignment, as done in the previous word-based model.</Paragraph>
      <Paragraph position="2"> The statistics in our training corpus show that, with an appropriate phrase segmentation, position re-ordering at the phrase level occurs rarely. This is not surprising, since most English words or phrases in normal English text are replaced with lingoes in SMS messages without position change, keeping SMS text short and concise while retaining the meaning. Thus we need to consider only monotone alignment at the phrase level, i.e., a_k = k, as in equation (4); the word-level reordering within a phrase is learned during training. We can now further decompose the channel model as:

P(s \mid e) \approx \max_{T,S} \prod_{k=1}^{K} P(\bar{s}_k \mid \bar{e}_k)  (4)</Paragraph>
      <Paragraph position="3"> For the above equation, we assume the segmentation probability P(T \mid e) to be constant. Finally, the SMS normalization model consists of two sub-models: a word-based language model (LM), characterized by P(e_n \mid e_{n-1}), and a phrase-based lexical mapping model (channel model), characterized by P(\bar{s}_k \mid \bar{e}_k).</Paragraph>
    </Section>
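For concreteness, here is a toy Python sketch of how the two sub-models combine to score one candidate normalization under equation (4) plus a bigram LM; the probability tables and names are invented for illustration, not taken from the paper:

```python
import math

# Toy phrase-based lexical mapping model P(s_phrase | e_phrase)
channel = {("r u", "are you"): 0.8, ("b4", "before"): 0.95}
# Toy bigram LM P(e_n | e_{n-1})
bigram = {("<s>", "are"): 0.1, ("are", "you"): 0.4, ("you", "</s>"): 0.2}

def model_logprob(segmentation, eng_words):
    """Score log P(s|e) + log P(e) for one monotone phrase segmentation."""
    # channel score of equation (4): sum over aligned phrase pairs
    log_channel = sum(math.log(channel[(s_ph, e_ph)])
                      for s_ph, e_ph in segmentation)
    # bigram LM score over the English side
    padded = ["<s>"] + list(eng_words) + ["</s>"]
    log_lm = sum(math.log(bigram[(a, b)]) for a, b in zip(padded, padded[1:]))
    return log_channel + log_lm

print(model_logprob([("r u", "are you")], ["are", "you"]))  # about -5.05
```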
    <Section position="2" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
4.3 Training Issues
</SectionTitle>
      <Paragraph position="0"> For the phrase-based model training, the sentence-aligned SMS corpus needs to be aligned first at the phrase level. The maximum likelihood approach, through the EM algorithm and Viterbi search (Dempster et al., 1977), is employed to infer such an alignment. Here, we make a reasonable assumption on the alignment unit: a single SMS word can be mapped to a sequence of contiguous English words, but not vice versa.</Paragraph>
      <Paragraph position="1"> The EM algorithm for phrase alignment is illustrated in Figure 1 and is formulated by equation (8).</Paragraph>
    </Section>
    <Section position="3" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
The Expectation-Maximization Algorithm
</SectionTitle>
      <Paragraph position="0"> (1) Bootstrap an initial alignment using orthographic similarities. (2) Expectation: update the joint probabilities P(\bar{s}_k, \bar{e}_k). (3) Maximization: apply the joint probabilities to get a new alignment using equation (8). Repeat from step (2) until the alignment converges. In order to speed up convergence and find a nearly global optimization, a string matching technique is exploited at the initialization step to identify the most probable normalization pairs. The orthographic similarities, captured by edit distance, and a SMS lingo dictionary which contains the commonly used short-forms are first used to establish phrase mapping boundary candidates. Heuristics are then exploited to match tokens within the pairs of boundary candidates, by trying to combine consecutive tokens within the boundary candidates when the numbers of tokens do not agree.</Paragraph>
      <Paragraph position="1"> The alignment process given in equation (8) differs from the normalization given in equation (7) in that here we have an aligned input sentence pair, s and e. The alignment process simply finds the alignment segmentation (T, S) between the two sentences that maximizes the joint probability:

(T, S)^{*} = \arg\max_{T,S} \prod_{k=1}^{K} P(\bar{s}_k, \bar{e}_k)  (8)

Therefore, in step (2) of the EM algorithm given in Figure 1, only the joint probabilities are involved and updated.</Paragraph>
      <Paragraph position="2"> Finally, a filtering process is carried out to manually remove low-frequency noisy alignment pairs. Table 4 shows some of the extracted normalization pairs. As can be seen from the table, our algorithm automatically discovers ambiguous mappings that are otherwise missing from most lingo dictionaries.</Paragraph>
      <Paragraph position="3"> Given the phrase-aligned SMS corpus, the lexical mapping model, characterized by P(\bar{s}_k \mid \bar{e}_k), is estimated from the aligned phrase pairs. The language model is trained on the English Gigaword corpus provided by LDC using the SRILM language modeling toolkit (Stolcke, 2002). Backoff smoothing (Jelinek, 1991) is used to adjust and assign a non-zero probability to unseen words, to address data sparseness.</Paragraph>
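The following is a condensed, illustrative Python sketch of the Figure 1 loop under the stated one-SMS-word-to-many-English-words assumption; the data structures, the smoothing floor and the helper names are ours, not the released implementation:

```python
import math
from collections import defaultdict
from functools import lru_cache

def viterbi_align(sms, eng, joint, max_len=2, floor=1e-6):
    """Monotone phrase alignment maximizing the product of joint
    probabilities P(s_phrase, e_phrase), as in equation (8)."""
    @lru_cache(maxsize=None)
    def best(i, j):  # best (score, pairs) covering sms[i:] and eng[j:]
        if i == len(sms) and j == len(eng):
            return 0.0, ()
        cand = (float("-inf"), ())
        for di in range(1, min(max_len, len(sms) - i) + 1):
            for dj in range(1, min(max_len, len(eng) - j) + 1):
                sp, ep = " ".join(sms[i:i+di]), " ".join(eng[j:j+dj])
                tail_score, tail_pairs = best(i + di, j + dj)
                score = math.log(joint.get((sp, ep), floor)) + tail_score
                if score > cand[0]:
                    cand = (score, ((sp, ep),) + tail_pairs)
        return cand
    return list(best(0, 0)[1])

def em_phrase_align(corpus, init_alignments, iterations=5):
    alignments = init_alignments  # bootstrapped from orthographic similarity
    for _ in range(iterations):
        # Expectation: update the joint probabilities from current alignments
        counts, total = defaultdict(float), 0
        for pairs in alignments:
            for sp_ep in pairs:
                counts[sp_ep] += 1
                total += 1
        joint = {k: v / total for k, v in counts.items()}
        # Maximization: apply the joint probabilities to get new alignments
        alignments = [viterbi_align(tuple(s), tuple(e), joint)
                      for s, e in corpus]
    return joint, alignments

corpus = [(("u", "r"), ("you", "are"))]
init = [[("u", "you"), ("r", "are")]]
joint, aligns = em_phrase_align(corpus, init)
print(aligns[0])  # [('u', 'you'), ('r', 'are')]
```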
    </Section>
    <Section position="4" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
4.4 Monotone Search
</SectionTitle>
      <Paragraph position="0"> Given an input s, the search, characterized in equation (7), is to find the sentence e that maximizes P(e|s) under the normalization model:

\hat{e} = \arg\max_{e} P(e)\,\max_{T,S} \prod_{k=1}^{K} P(\bar{s}_k \mid \bar{e}_k)  (7)

In this paper, the maximization problem in equation (7) is solved using a monotone search, implemented as a Viterbi search through dynamic programming.</Paragraph>
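A minimal sketch of such a monotone Viterbi decoder, assuming a toy phrase table and bigram LM (all names and probabilities here are illustrative, not the paper's models):

```python
import math

PHRASES = {"u": [("you", 0.9)], "r u": [("are you", 0.8)], "b4": [("before", 0.95)]}
BIGRAM = {("<s>", "are"): 0.1, ("are", "you"): 0.4}

def lm_logprob(prev, words):
    """Bigram LM score of words given the previous word, with a small floor."""
    score, p = 0.0, prev
    for w in words:
        score += math.log(BIGRAM.get((p, w), 1e-6))
        p = w
    return score, p

def normalize(sms_words, max_phrase=2):
    n = len(sms_words)
    # best[j] = (score, english_so_far, last_word) covering sms_words[:j]
    best = {0: (0.0, [], "<s>")}
    for j in range(1, n + 1):          # Viterbi DP over SMS positions
        for i in range(max(0, j - max_phrase), j):
            if i not in best:
                continue
            s_ph = " ".join(sms_words[i:j])
            for e_ph, p in PHRASES.get(s_ph, []):
                e_words = e_ph.split()
                lm, last = lm_logprob(best[i][2], e_words)
                score = best[i][0] + math.log(p) + lm
                if j not in best or score > best[j][0]:
                    best[j] = (score, best[i][1] + e_words, last)
    return " ".join(best[n][1]) if n in best else None

print(normalize(["r", "u"]))  # -> "are you"
```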
    </Section>
  </Section>
  <Section position="9" start_page="111" end_page="111" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> The aim of our experiment is to verify the effectiveness of the proposed statistical model for SMS normalization and the impact of SMS normalization on MT.</Paragraph>
    <Paragraph position="1"> A set of 5000 parallel SMS messages, consisting of raw (un-normalized) SMS messages and reference messages manually prepared by two project members with inter-normalization agreement checked, was prepared for training and testing. For evaluation, we use IBM's BLEU score (Papineni et al., 2002) to measure the performance of SMS normalization. BLEU measures the similarity between two sentences using n-gram statistics, with a penalty for overly short sentences, and is already widely used in MT evaluation. The baseline experiment normalizes the texts using a lingo dictionary comprising 142 normalization pairs; this dictionary is also used in bootstrapping the phrase alignment learning process.</Paragraph>
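For reference, a raw-versus-reference BLEU comparison like the one in Table 5 could be approximated with NLTK as below; NLTK is our stand-in here, since the paper used IBM's BLEU implementation:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# hypotheses: raw SMS messages; references: manually normalized messages
raw = [["i", "hv", "cm", "to", "c", "my", "luv"]]
refs = [[["i", "have", "come", "to", "see", "my", "love"]]]

# 3-gram cumulative BLEU, as in the MT evaluation of Section 5.3
score = corpus_bleu(refs, raw, weights=(1/3, 1/3, 1/3),
                    smoothing_function=SmoothingFunction().method1)
print(round(score, 4))
```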
    <Paragraph position="2"> Table 5 compares the performance of the different setups of the baseline experiments. We first measure the complexity of the SMS normalization task by directly computing the similarity between the raw SMS text and the normalized English text. The first row of Table 5 reports this similarity as 0.5784 in BLEU score, which implies that quite a number of English word 3-grams are common to the raw and normalized messages. The second experiment is carried out using only simple dictionary look-up (with entries collected from various websites such as http://www.handphones.info/sms-dictionary/sms-lingo.php and http://www.funsms.net/sms_dictionary.htm). Lexical ambiguity is addressed by selecting the highest-frequency normalization candidate, i.e., only a unigram LM is used. The performance of the second experiment is 0.6958 in BLEU score, suggesting that the lingo dictionary plus the unigram LM is very useful for SMS normalization. Finally, we carry out the third experiment using dictionary look-up plus a bigram LM. Only a slight improvement of 0.0128 (0.7086 - 0.6958) is obtained. This is largely because the English words in the lingo dictionary are mostly high-frequency and commonly used, so the bigram does not show much more discriminative ability than the unigram without the help of the phrase-based lexical mapping model.</Paragraph>
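A minimal sketch of this dictionary look-up baseline with unigram (highest-frequency) disambiguation; the dictionary entries and frequencies here are illustrative:

```python
# Toy lingo dictionary: SMS token -> list of (candidate, frequency)
LINGO = {"u": [("you", 0.9), ("your", 0.1)],
         "2": [("to", 0.7), ("two", 0.3)],
         "w": [("with", 0.8), ("who", 0.2)]}

def baseline_normalize(tokens):
    out = []
    for t in tokens:
        candidates = LINGO.get(t.lower())
        if candidates:
            # pick the highest-frequency candidate (unigram disambiguation)
            out.append(max(candidates, key=lambda c: c[1])[0])
        else:
            out.append(t)  # OOV tokens pass through unchanged
    return out

print(" ".join(baseline_normalize("i hv 2 go w u".split())))
# -> "i hv to go with you"  ("hv" is OOV in this toy dictionary)
```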
    <Paragraph position="3"> Analysis of the experimental results reveals that the strength of our model lies in its ability to disambiguate mappings such as "2" to "two" or "to" and "w" to "with" or "who". Error analysis shows that the challenge for the model lies in the proper insertion of subject pronouns and auxiliary or copula verbs, which serve to give further semantic information about the main verb; this, however, requires significant context understanding. For example, a message such as "u smart" gives few clues on whether it should be normalized to "Are you smart?" or "You are smart." unless the full conversation is studied.</Paragraph>
    <Paragraph position="4"> Table 7. Examples of normalization results (raw SMS message, followed by its normalized form):
- "Takako w r u?" -> "Takako who are you?"
- "Im in ns, lik soccer, clubbin hangin w frenz! Wat bout u mee?" -> "I'm in ns, like soccer, clubbing hanging with friends! What about you?"
- "fancy getting excited w others' boredom" -> "Fancy getting excited with others' boredom"
- "If u ask me b4 he ask me then i'll go out w u all lor. N u still can act so real." -> "If you ask me before he asked me then I'll go out with you all. And you still can act so real."
- "Doing nothing, then u not having dinner w us?" -> "Doing nothing, then you do not having dinner with us?"
- "Aiyar sorry lor forgot 2 tell u... Mtg at 2 pm." -> "Sorry forgot to tell you... Meeting at two pm."
- "tat's y I said it's bad dat all e gals know u... Wat u doing now?" -> "That's why I said it's bad that all the girls know you... What you doing now?"</Paragraph>
    <Section position="1" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
5.2 Using Phrase-based Model
</SectionTitle>
      <Paragraph position="0"> We then conducted the experiment using the proposed method (Bi-gram LM plus a phrase-based lexical mapping model) through a five-fold cross validation on the 5000 parallel SMS messages.</Paragraph>
      <Paragraph position="1"> Table 6 shows the results. An average score of 0.8070 is obtained. Compared with the baseline performance in Table 5, the improvement is very significant. It suggests that the phrase-based lexical mapping model is very useful and our method is effective for SMS text normalization.</Paragraph>
      <Paragraph position="2"> Figure 2 shows the learning curve: our algorithm converges when the training data is increased to 3000 SMS parallel messages. This suggests that our collected corpus is representative and sufficient for training our model. Table 7 illustrates some examples of the normalization results.</Paragraph>
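The five-fold protocol can be sketched as follows (the splitting scheme is our assumption; the paper does not spell it out):

```python
def five_fold(pairs, k=5):
    """Yield (train, test) splits over the parallel message pairs."""
    fold = len(pairs) // k
    for i in range(k):
        test = pairs[i * fold:(i + 1) * fold]
        train = pairs[:i * fold] + pairs[(i + 1) * fold:]
        yield train, test

# With 5000 message pairs, each round trains on 4000 and tests on 1000.
```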
    </Section>
    <Section position="2" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
5.3 Effect on English-Chinese MT
</SectionTitle>
      <Paragraph position="0"> An experiment was also conducted to study the effect of normalization on MT using 402 messages randomly selected from the text corpus.</Paragraph>
      <Paragraph position="1"> We compare three types of SMS message: raw SMS messages, normalized messages using simple dictionary look-up, and normalized messages using our method. The messages are passed to two different English-to-Chinese translation systems provided by Systran(R) separately to produce three sets of translation output. The translation quality is measured using the 3-gram cumulative BLEU score against two reference messages; 3-gram is used because most of the messages are short, with an average length of seven words. Table 8 shows the details of the BLEU scores. We obtain an average BLEU score of 0.3770 for normalized messages, against 0.1926 for raw messages. The significant performance improvement suggests that normalizing SMS text with our method before MT is an effective way to adapt a general MT system to the SMS domain.</Paragraph>
    </Section>
  </Section>
</Paper>