<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-2113">
  <Title>An empirical method for identifying and translating technical terminology</Title>
  <Section position="3" start_page="0" end_page="782" type="metho">
    <SectionTitle>
2 Pattern-based MT system
</SectionTitle>
    <Paragraph position="0"> h pattern-ha.seal MT system uses a set of bilingua.1 pa.tterns(CFG rules) (Abeille et a.l., 1990) (Ta.keda., 1.996) (Shimohata. et a.l., 1.999). In the pa.rsing process, the engine performs a. CFGparsing for a.n input sentence and rewrites trees by a.pplying the source pa.tterns. 3'erminals and non-terminals are processed under the sa.me fra.lnework but lexicalized pa.tterns ha.re priority over symbolized pa.tterns 1 A plausible parse We define a symbolized pattern as a pattern without a. terminal and ~L lexicalizcd pattern as that with more than one terminal, we prepares 1000 symbolized patterns a.nd 130,000 lexicalizcd patterns as a system  tree will be selected among possible parse trees by the number of l)atterns applied. Then the pa.rse tree is tr~msferred into target language by using target patterns which correspond to the source patterns.</Paragraph>
    <Paragraph position="1"> Figure 1 shows an example of translation patterns between Fmglish and .lapanese. Each C1 G rule) has col English pattern(a left-half ' ,' ' responding aal)anese pattern(a right-half CFG rule). Non-terminals are bracketed with index numbers which represents correspondence of non-terminals between the source and target pattern.</Paragraph>
    <Paragraph position="3"> The pattern ibrmat is simple but highly descriptive. It can represent complicated linguistic phenomena and even correspondences between the languages with quite different structures, l)'urthermore, a.l\] the knowledge necessary fl)r the translation, whether syntactic or lexical, are compiled in the same pattern tbrmat. Owing to these fea.tures, we can easily apply the retrieved technical terms to a real MT system.</Paragraph>
  </Section>
  <Section position="4" start_page="782" end_page="784" type="metho">
    <SectionTitle>
3 Algorithm
</SectionTitle>
    <Paragraph position="0"> 1,'igure 2 shows an outline of the l)roposed nlethod. The inpu t is an untagged :~nonolingu al corpus, while the output is a dolnain dictionary for machine translation. The process is con&gt; prised of 3 phases: retrieving local patterns, assigning their syntactic categories with part-ofspeech(POS) templates, and making translation patterns. The dictionary is used when an MT system translates a text in the same domain as the corpus.</Paragraph>
    <Paragraph position="1"> We assume that the input is an English corpus and the dictionary is used for an English-Japanese MT system. In the remainder of this section, we will explain each phase in detail with English and Japanese examples.</Paragraph>
    <Paragraph position="2"> dictiona.ry.</Paragraph>
    <Section position="1" start_page="782" end_page="782" type="sub_section">
      <SectionTitle>
3.1 Retrieving local patterns
</SectionTitle>
      <Paragraph position="0"> We have ah'eady proposed a method for retrieving word sequences (Shimohata et al., 1997).</Paragraph>
      <Paragraph position="1"> This method generates all n-character (or nword) strings appearing in a text and tilters out ffagl-nenta.1 strings with the distribution of words adjacent to the strings. This is based on the idea. that adjacent words are widely distributed if the string is meaningful, m~d are localized if the string is a substring of a meaningful string.</Paragraph>
      <Paragraph position="2"> The method introduces entropy value to measure the word distribution. Let the string t)e 8tr, the adjacent words Wl...w,~, and the frequency of str frcq(.slr). The probability of each possible adjacent word p(wi) is then:</Paragraph>
      <Paragraph position="4"> Calculating the entropy of both sides of ,qtr, the lower one is used as ll(,tr). Then the strings whose entropy is larger than a given threshold are retrieved as local pattexns.</Paragraph>
    </Section>
    <Section position="2" start_page="782" end_page="784" type="sub_section">
      <SectionTitle>
3.2 Identifying syntactic categories
</SectionTitle>
      <Paragraph position="0"> Since the strings are just word sequences, the l)rocess gives tllem syntactic categories. For each str .str~  1. assign pa.rt-ofspeech tags tl, ... t~. to the coH\]ponent words Wl, ... /vr~ 2. match tag sequence tl, ... t,~ with part-of-speech templates 7~ 3. give sir corresponding syntactic category ,5'6'i, it' it matches Ti 3.2.1 Assigning part-of-speech tags  The process uses a simplified part-of speech set shown in table 1. l?unction words are assigned as they are, while content words except for adverb are fallen into only one part of speech word. Four kinds of words &amp;quot;be&amp;quot;, &amp;quot;do&amp;quot;, &amp;quot;'not&amp;quot;, and &amp;quot;to&amp;quot; are assigned to speciM tags be, do, not, and to respectively.</Paragraph>
      <Paragraph position="1"> There are several reasons to use the simplitied POS tags:  * it may sometimes be difl3cult to identify precise parts of speech in such a local pattern. null  * words are often used beyond parts of speech in technical terminology * it is eml)irically found that word sequences retrieved through n-gram statistics have distributional concentration on several syntactic categories.</Paragraph>
      <Paragraph position="2"> Theretbre, we think the simplified POS tags are sufficient to identify syntactic categories. The word sequence w~, ... w,~ is represented for a part-of-speech tag sequence tl, ... ti. Figure 3 shows examples of POS tagging. Italic the fuel tank art word word do this step * do det,prn word punc to oprn the to word art  lines are given word sequences and bold lines are POS tag sequences. If a word falls into two or more parts of speech, all possible POSs wi\]\] be assigned like &amp;quot;this&amp;quot; in the second example.  The process identifies a syntactic category(SC) of sir by checking if str's tag sequence tl, ... tn matches a given POS template 7}. If they  match, str is given a syntactic category ,5'Ci corresponding to 5/). Table 2 shows examt)les of I)OS teml)la.tes and corresl)onding SCs 2</Paragraph>
      <Paragraph position="4"> The templaPSes are described in the l'orm of regula.r expressions(Rl~;) a. The first templ~te in table 2, for exanrple, :m~tches a string whose tag sequence begins with an article, contains 0 or m ore rel)etitions of content word s or conj u n ctions, a.nd ends with a content word. &amp;quot;the fuel ta,nk&amp;quot; in tigure 3 is applied to this templa.tes aald given a SC &amp;quot;N&amp;quot;.</Paragraph>
    </Section>
    <Section position="3" start_page="784" end_page="784" type="sub_section">
      <SectionTitle>
3.3 Making translation patterns
</SectionTitle>
      <Paragraph position="0"> The process converts the strings into translation l)a.tterns. The l)roblem here is that we need to generate bilingual translation l)al;terns from monolingua\] strings. We use heuristic rules on borr0wing word s from foreign \]angu ages ..1</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="784" end_page="786" type="metho">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> VVe have tested our algorithln in building a doma.in dictionary and malting a. translation with it. A corpus used in the exl)eriment is a COml)uter nlanual comprising 167,023 words (in 22,0d i sentences).</Paragraph>
    <Paragraph position="1"> The corl)us contains 24,7137 n-grooms which appear more than twice. Among them, 7,6116 strings are extracted over the entropy threshold 1. Table 3 is a list of top 20 strings (except for single words and function word sequences) retrieved from the test c()rptlS.</Paragraph>
    <Paragraph position="2"> These strings a.re c~tego:rized into 1,239 POS patterns. Table 4 is a. list of to I) 10 POS l)at;terns aim the numl)ers of strings classitied into thenl, hi this experiment, the top 10 POS patterns a.ccount for dg.d % of a.ll 1'OS patterns. It substantiates the fa.ct that the retrieved strings tend to concentr~te in certa.in POS patterns.</Paragraph>
    <Paragraph position="4"> 2 Note that tile POS templates are strongly dependent on tile features of n-gram strings.</Paragraph>
    <Paragraph position="5"> a ,.,, causes tile resulting RP, to match 0 or more repetitions of the preceding I{E. &amp;quot;+&amp;quot; causes the resulting RE to match I or more rel)etitions of the preceding RI!'.</Paragraph>
    <Paragraph position="6"> &amp;quot;1:&amp;quot; creates a RE exl)ression that will match either right o,: left of &amp;quot;l&amp;quot;- &amp;quot;(...)&amp;quot; indicates the start and end of ~L group.</Paragraph>
    <Paragraph position="7"> 4 In Japanese, foreign words, especially in technical  terminology, are often used as they are in katakana (tiLe phonetic spelling for foreign words) followed by function words which indicate their parts of speech For example, English verbs are followed by &amp;quot;suru&amp;quot;, a verb wliich means &amp;quot;do&amp;quot; in English.</Paragraph>
    <Paragraph position="8"> frcq POS  In the matching process, we prepared 15 templates and 6 SCs. Table 5 is a result of SC identification. 2,462 strings(32.3 %) are not lnatched to any templates. The table indicates that most strings retrieved in this method are identified as N and NP. It is quite reasonable because the majority of the technical terms are supposed to be nouns and noun phrases.</Paragraph>
    <Paragraph position="9"> improved in parsing 104 improved in word selection 467 about the same 160  The retrieved translation patterns total 1,21.9. Figure 5 shows an example of translation patterns retrieved by our method. We, then, converted them to an MT dictionary and made a translation with and without it. Table 6 summarizes the evaluation results translating randomly selected 1.,000 sentences fi'om the test corpus. Compared with the translations without the dictionary, the translations with the dictionary improved 571 in parsing and word selection.</Paragraph>
    <Paragraph position="10"> Figure 6 illustrates changes in translations.</Paragraph>
    <Paragraph position="11"> Each column consists of an input sentence, a translation without the dictionary, and a translation with the dictionary. Bold English words  correspond to underlined a apanese.</Paragraph>
    <Paragraph position="12"> First two examples show improvement in word selection. The transl ations of&amp;quot; map(verb)&amp;quot; and &amp;quot;exec&amp;quot; are changed from word-for-word transla.tions to non-translation word sequences. Although &amp;quot;to make a map&amp;quot; and &amp;quot;exective&amp;quot; are not wrong translations, they are irrelevant in the computer manual context. On the contrary, the domain dictionary reduces confltsion caused by the wrong word selection.</Paragraph>
    <Paragraph position="13"> Wrong parsing and incomplete p~rsing are also reduced as shown in the next two examples. In the third example, &amp;quot;Next&amp;quot; should be a noun, while it is usually used as an adverb. The domain dictionary solved the syntactic ambiguity properly because it has exclusive priority over system dictionaries. In the forth example, &amp;quot;double-click&amp;quot; is an unknown word which could cause incomplete parsing. But the phrase was parsed as a verb correctly.</Paragraph>
    <Paragraph position="14"> The last one is an wrong example of Japanese verb selection. That was a main cause of errors and declines. The reason why the undesirable Japanese verbs were selected is that  to the retrieved nouns and noun phrases. We hope to overcome it by a. model tha.t cla.ssilies noun pllrases, for example using verb-noun or a,djective-n ou :n relation s.</Paragraph>
  </Section>
  <Section position="6" start_page="786" end_page="787" type="metho">
    <SectionTitle>
5 Related work
</SectionTitle>
    <Paragraph position="0"> As mentioned in section 1, there are two approaches in corpus-based technica.l term retrievah a rule-based approach and a statistical a~pproach. Major ditlhre:nces between the two 3,re: * the former uses a tagged corlItls while the latter uses an untagged one.</Paragraph>
    <Paragraph position="1"> * the former retrieves words and phrases with a designated syntactic category while the bttter :retrieves that with various syntactic categories at the same time.</Paragraph>
    <Paragraph position="2"> Our method uses the latter ~pproa, ch because we think it more practical both in resources and in applications.</Paragraph>
    <Paragraph position="3"> For colnparison~ we refer here to Smadja's method (1993) because this method and the proposed method have much in connnon. In both cases, technicaJ terms are retrieved from a.n untagged corpus with n-gram statistics and given syntactic categories for NI,P applica.tions. The methods are diflhrent in that Sma.dja uses a  parser for syntactic category identification while we use POS templates. A parser may add more precise syntactic category than I?OS templates. However, we consider it not to be critical under the specific condition that the variety of input patterns is very small. In terms of portability, the proposed method has an advantage. Actually, adding POS templates is not so time consuming as developing a parser.</Paragraph>
    <Paragraph position="4"> We have applied the translation patterns retrieved by this method to a real MT system. As a result, 57.1. % of translations were improved with 1,219 translation patterns. To our knowledge, little work has gone into quantifying its effectiveness to NLP applications. We recognize that the method leaves room for improvement in making translation patterns. We, therefore, plan to introduce techniques for finding translational equivalent from bilingual corpora (Me\]amed, 1998) to our method.</Paragraph>
  </Section>
class="xml-element"></Paper>