<?xml version="1.0" standalone="yes"?> <Paper uid="I05-2012"> <Title>Automatic Extraction of English-Korean Translations for Constituents of Technical Terms</Title> <Section position="3" start_page="67" end_page="68" type="metho"> <SectionTitle> 2 Related Works </SectionTitle> <Paragraph position="0"> One of the well-known alignment techniques is the one based on statistical machine translation models. It was initially proposed by Brown et al. (1993) and, more recently, has been intensively studied by several research groups (Germann et al., 2001; Och et al., 2003). It is used for finding sentence-, phrase-, and word-level correspondences in parallel texts. It can be formulated as equation (1). For the given source text, S, it finds the most probable alignment set, A, and target text, T.</Paragraph> <Paragraph position="2"> Brown et al. (1993) proposed five alignment models, called the IBM Models, for an English-French alignment task based on equation (1). Equation (2) describes IBM Model 1. It is modeled by two assumptions - P(F|E) depends on word translation probability t(f</Paragraph> <Paragraph position="4"> (2) where m represents the length of F, l represents the length of E, and C_{l,m} is a constant value determined by l and m.</Paragraph> <Paragraph position="5"> IBM Model 2 considers distortion (how likely a source language word in position i is to align to a target language word in position j). IBM Model 3 adopts fertility (how likely a source language word is to align to k target language words) as its parameter for 1:n alignment. IBM Models 4 and 5 make use of relative distortion, word classes and variables to avoid deficiency. There is another stream of studies on alignment. (Chen et al., 1993; Gale et al., 1993) proposed sentence alignment techniques based on dynamic programming, using sentence length and lexical mapping information. 
(Haruno et al., 1996; Kay et al., 1993) applied iterative refinement algorithms to sentence-level alignment tasks.</Paragraph> <Paragraph position="6"> In this paper, we propose an alignment algorithm between English and Korean conceptual units (or between English and Korean term constituents) in English-Korean technical term pairs, based on the IBM Models (Brown et al., 1993).</Paragraph> <Paragraph position="7"> Unlike the IBM Models, our alignment model can deal with n:1 alignment. While the IBM Models aim at word-level alignment of parallel texts, our method focuses on word- and morphology-level alignment of English-Korean term pairs. Moreover, our algorithm reflects the translation properties of English-to-Korean technical term pairs in a bilingual dictionary.</Paragraph> </Section> <Section position="4" start_page="68" end_page="70" type="metho"> <SectionTitle> 3 Term Constituent Alignment </SectionTitle> <Paragraph position="0"> For term constituent alignment, we use biology, chemistry and physics dictionaries in which term constituents are manually segmented and their parts of speech are manually assigned. For example, the Korean counterpart of crop growth rate is 'jak-mul + seng-jang + yul', and its three term constituents are 'jak-mul', 'seng-jang', and 'yul', where the first two are nouns and the last one is a suffix.</Paragraph> <Paragraph position="1"> The problem can be defined as finding the correspondence between English and Korean term constituents, as described in equation (3). 
For a given English term E=e</Paragraph> <Section position="1" start_page="68" end_page="69" type="sub_section"> <SectionTitle> 3.1 Statistical Modeling </SectionTitle> <Paragraph position="0"> In this section, we first describe two translation properties (or constraints) derived from an analysis of the alignment tendency between English and Korean term constituents, and then describe how to apply these properties to statistical modeling of term constituent alignment.</Paragraph> <Paragraph position="1"> We randomly sample 20% of the English-Korean term pairs in each technical dictionary and, by analyzing the sampled data, find two properties, such as &quot;Cross alignment appears in some conditions&quot;.</Paragraph> <Paragraph position="2"> Constraint 1: Cross alignment is partly allowed. Let alignment units in a source language be s</Paragraph> <Paragraph position="4"> (i &lt; j), where i and j are indices in the source language, and those in a target language be t</Paragraph> <Paragraph position="6"> (q &lt; r), where q and r are indices in the target language. Then alignment a</Paragraph> <Paragraph position="8"> ) are called cross alignment. (Among the analyzed data, 1.3% for biology, 0.1% for physics and 5.65% for chemistry show cross alignment, while 0.8% for biology, 0.2% for physics and 0.1% for chemistry show null alignment.) Because the sentence structure of Korean is different from that of English, cross-alignment between English and Korean words frequently occurs in parallel sentences (Shin et al., 1995). For alignment between term constituents, however, most alignment relations are sequential, because technical terms, which are usually noun phrases, share a similar structure, namely modifier and modifiee, in both languages. Sometimes cross-alignment arises because of a preposition, such as of, in the English term. In that case, we allow cross-alignment. 
For example, there is a cross-alignment relation between the English term clotting of blood and its Korean translation 'hyeol-aek + eung-go', in which clotting is aligned to 'eung-go' and blood to 'hyeol-aek'. Note that we do not consider the preposition of as an alignment unit in that case. English-Korean term pairs that name chemical compounds usually show cross-alignment and 1:1 alignment. To deal with this case, we allow cross-alignment when the number of English term constituents and that of Korean term constituents are the same. Under constraint 1, sequential alignment is performed except in the above two cases.</Paragraph> <Paragraph position="9"> Constraint 2 means that all English and Korean term constituents should be aligned. Because a term pair consists of an English term and its Korean translation, we assume that all constituents should be aligned. Null alignment means that an alignment unit on one side is aligned to nothing on the other side. For example, for Dutch elm disease and 'ne-deol-lan-deu (Dutch) / neu-leup-na-mu (elm) / che-gwan (sieve tube) / byeong (disease)', there is no English term constituent to be aligned to the Korean term constituent 'che-gwan (sieve tube)'. However, because null alignment appears infrequently in term constituent alignment (only 0.1% to 0.8% of the analyzed data), we do not consider null alignment in our algorithm.</Paragraph> <Paragraph position="11"> By the constraints, equation (3) can be represented as equation (4). In equation (4), n, m, and t represent the number of English term constituents, the number of Korean term constituents, and the number of alignment relations between term constituents, respectively. 
In equation (4), a(i|j,n,m,t) represents position information, which is a binary-valued function and supports the constraint</Paragraph> <Paragraph position="13"> cross-alignment, which is not allowed by constraint 1, otherwise a(i|j,n,m,t) = 1.</Paragraph> <Paragraph position="14"> In equation (4), p(a</Paragraph> <Paragraph position="16"> are lexical information and part-of-speech information of the j-th Korean term constituent, respectively.</Paragraph> </Section> <Section position="2" start_page="69" end_page="70" type="sub_section"> <SectionTitle> 3.2 Parameter Estimation with EM Algorithm </SectionTitle> <Paragraph position="0"/> <Paragraph position="2"> equation (5) are estimated with the EM (Expectation-Maximization) algorithm. The EM algorithm is a technique for parameter estimation of generic statistical distributions in the presence of incomplete data (Dempster et al., 1977). The main goal of EM is to obtain the estimated parameters that give maximum likelihood to the input (incomplete) data. The basic idea underlying the EM algorithm is to iterate through a series of expectation (E-step) and maximization (M-step) steps in which the estimate of the model parameters is progressively refined until convergence (Lopez et al., 1999).</Paragraph> <Paragraph position="3"> In this paper, parameters are estimated in two steps, called &quot;initial parameter estimation&quot; and &quot;iterative parameter estimation&quot;. In the initial parameter estimation step, the initial parameters are determined from seed data.</Paragraph> <Paragraph position="4"> Seed data, which contain alignment relations where n = 1 or m = 1, were selected from the data for term constituent alignment. When n = 1 or m = 1, the English technical term or the Korean technical term is a conceptual unit by itself. In other words, alignment relations can be directly extracted from the English-Korean term pairs if there is only one English term constituent or only one Korean term constituent. 
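A minimal sketch of this seed selection, assuming each term pair is given as two lists of already segmented constituents; the function name extract_seed_alignments and the toy pairs (including the placeholders E1, E2 and K1) are hypothetical illustrations and are not taken from the dictionaries:

```python
from typing import List, Tuple

# A term pair: (English constituents, Korean constituents), already segmented.
TermPair = Tuple[List[str], List[str]]
# A seed relation keeps all constituents of each side as one aligned unit.
SeedRelation = Tuple[Tuple[str, ...], Tuple[str, ...]]


def extract_seed_alignments(term_pairs: List[TermPair]) -> List[SeedRelation]:
    """Collect the pairs where n == 1 or m == 1 (hypothetical helper)."""
    seeds: List[SeedRelation] = []
    for english, korean in term_pairs:
        if not english or not korean:
            continue  # skip malformed entries with an empty side
        if len(english) == 1 or len(korean) == 1:
            seeds.append((tuple(english), tuple(korean)))
    return seeds


# Toy data for illustration only (assumed, not from the dictionaries).
pairs: List[TermPair] = [
    (["rate"], ["yul"]),
    (["crop", "growth", "rate"], ["jak-mul", "seng-jang", "yul"]),
    (["E1", "E2"], ["K1"]),
]
seed = extract_seed_alignments(pairs)
print(seed)
```

In this toy run the first and third pairs are selected as seed relations, while the three-constituent pair is left for the iterative step. 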
With the seed data we can get the initial alignment relation set A(0), and then the initial parameter θ(0) is estimated with A(0), where A(k) represents the alignment relation set and θ(k) represents the estimated parameter set derived from the k-th iteration. In the iterative parameter estimation step, A(k) is determined by θ(k-1) in the E-step and θ(k) is estimated from A(k) in the M-step, using the whole data, until θ(k) converges. E-step and M-step can be represented as equation (6)</Paragraph> <Paragraph position="6"> iteration as equation (7) and (8), respectively.</Paragraph> <Paragraph position="7"> In order to prevent zero probability, the Laplace smoothing method (Manning et al., 1999) is applied to equations (7) and (8).</Paragraph> <Paragraph position="8"> where C(x) represents the frequency of x, |E| represents the number of unique English term constituents in A(k), and |T| represents the number of unique POS tags of Korean term constituents in A(k).</Paragraph> </Section> </Section> </Paper>