<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1801"> <Title>Extraction of Translation Unit from Chinese-English Parallel Corpora</Title> <Section position="4" start_page="0" end_page="2" type="metho"> <SectionTitle> 3 Statistical measurements used </SectionTitle> <Paragraph position="0"> Four statistical measurements were used to identify unilingual multi-word units and the correspondences between bilingual translation units. All four formulas measure the degree of association of two random events. Given two random events X and Y, these might be two Chinese words appearing in the Chinese texts, or two translation units appearing in an aligned region of the corpus. The distribution of the two events can be depicted by a 2-by-2 contingency table.</Paragraph> <Paragraph position="1"> Based on this contingency table, different kinds of measurements can be used. We have tried four of them, namely point-wise mutual information, the DICE coefficient, the χ² score and the log-likelihood score. One other</Paragraph> <Paragraph position="3"> which is equivalent to the χ² score. All four measurements can easily be calculated using the following formula.</Paragraph> <Paragraph position="5"> 4. Identification of multi-word units What constitutes a multi-word unit is a question critical to their identification. 
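As an illustration only (not part of the original paper), the four association measures can be computed directly from the cells of the 2-by-2 contingency table. The sketch below assumes the usual cell layout a = freq(X, Y), b = freq(X, not-Y), c = freq(not-X, Y), d = freq(not-X, not-Y):

```python
import math

def association_scores(a, b, c, d):
    """Association measures over a 2x2 contingency table.

    a = freq(X, Y), b = freq(X, not-Y),
    c = freq(not-X, Y), d = freq(not-X, not-Y).
    """
    n = a + b + c + d
    # point-wise mutual information: log2(P(X,Y) / (P(X)P(Y)))
    pmi = math.log2(a * n / ((a + b) * (a + c)))
    # DICE coefficient: 2*freq(X,Y) / (freq(X) + freq(Y))
    dice = 2 * a / (2 * a + b + c)
    # chi-square score for a 2x2 table
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # log-likelihood score: 2 * sum over cells of O * ln(O / E)
    ll = 0.0
    for obs, row, col in ((a, a + b, a + c), (b, a + b, b + d),
                          (c, c + d, a + c), (d, c + d, b + d)):
        exp = row * col / n
        if obs > 0:
            ll += obs * math.log(obs / exp)
    ll *= 2
    return {"pmi": pmi, "dice": dice, "chi2": chi2, "ll": ll}
```

All four scores grow as the joint count a exceeds what independence would predict, which is why any of them can serve as the coherence test in the following sections.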
It seems rational to assume that multi-word units are something between phrases and words, with the following properties: 1) The component words of a multi-word unit should tend to co-occur frequently.</Paragraph> <Paragraph position="6"> In statistical terms, a multi-word unit should be a word group that co-occurs more frequently than expected by chance.</Paragraph> <Paragraph position="7"> 2) Multi-word units are not arbitrary combinations of arbitrary words; they should form valid syntactic structures in the linguistic sense.</Paragraph> <Paragraph position="8"> Based on these observations, we used an iterative algorithm combining statistical and linguistic means. The algorithm runs as follows: first, it finds all word pairs that show strong coherence, which can be done using the measurements listed in Section 3. After this step, all word pairs in both the Chinese and the English texts whose association value exceeds a predefined threshold are marked. A single run, however, can only produce word groups of length 2; word groups of three or more words cannot be found in one pass. They can be found by a series of runs, continuing until no word group's association value exceeds the threshold. The algorithm has a recursive structure: it marks longer word groups by treating the shorter word groups marked in the previous run as single words.</Paragraph> <Paragraph position="9"> There is no doubt that pure statistics cannot perform very reliably; some word groups found by the algorithm are awkward to accept as multi-word units. The result of the algorithm should therefore be viewed as a candidate list of multi-word units, and some refinement of the results is required. 
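The iterative marking procedure might be sketched as below; this is a simplified reconstruction, where the PMI scorer, the minimum pair frequency of 2, and the threshold value are illustrative assumptions rather than the paper's actual settings:

```python
from collections import Counter
import math

def pmi(pair_f, left_f, right_f, n):
    # point-wise mutual information as the coherence score (one of
    # the four measures from Section 3; an illustrative choice here)
    return math.log2(pair_f * n / (left_f * right_f))

def mark_mwus(sentences, score=pmi, threshold=3.0, max_rounds=5):
    """Iteratively merge adjacent token pairs whose association score
    exceeds the threshold. A merged pair acts as a single token in the
    next round, so units longer than two words emerge over repeated
    runs, mirroring the recursive structure described above."""
    sents = [list(s) for s in sentences]
    for _ in range(max_rounds):
        uni = Counter(t for s in sents for t in s)
        big = Counter(p for s in sents for p in zip(s, s[1:]))
        n = sum(uni.values())
        # pairs co-occurring more often than chance (min frequency 2)
        good = {p for p, f in big.items()
                if f >= 2 and score(f, uni[p[0]], uni[p[1]], n) > threshold}
        if not good:
            break  # no group scores above the threshold any more
        merged = []
        for s in sents:
            out, skip = [], False
            for j, tok in enumerate(s):
                if skip:
                    skip = False
                    continue
                if j + 1 != len(s) and (tok, s[j + 1]) in good:
                    out.append(tok + "_" + s[j + 1])  # treat pair as one word
                    skip = True
                else:
                    out.append(tok)
            merged.append(out)
        sents = merged
    return sents
```

The output of such a pass is exactly the candidate list mentioned above: statistically coherent word groups that still need the syntactic filtering described next.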
Since a multi-word unit should form a valid syntactic pattern, we use a filter module that checks all the word groups found and verifies whether they fall into a set of predefined syntactic patterns.</Paragraph> <Paragraph position="10"> Patterns on the left side are for Chinese, while those on the right side are for English.</Paragraph> <Paragraph position="11"> 5. Extracting the bilingual translation units We adopt the same hypothesis-testing approach to establish the correspondences between Chinese-English translation units. It follows the observation that words which are translations of each other are more likely to appear in aligned regions (Gale, W. (1991); Tufis, D. (2001)). We also take multi-word units into consideration.</Paragraph> <Paragraph position="12"> The whole procedure can be divided logically into two phases. The first phase can be called a generative phase, which lists all possible translation equivalent pairs from the aligned corpus. The second phase can be viewed as a testing operation, which selects as translation equivalence pairs those candidate correspondences that show an association measure higher than expected under the independence assumption. Again we use the DICE coefficient, point-wise mutual information, the log-likelihood score and the χ² score to measure the degree of association.</Paragraph> <Paragraph position="13"> One problem with this approach is its inefficiency in processing a large corpus: in the generative phase it lists all translation equivalent pairs, which can lead to a huge search space. To make the approach more efficient, we adopted the following assumption: source translation units tend to be translated into translation units of the same syntactic categories. For example, English nouns tend to be translated into Chinese nouns, and the English pattern &quot;JJ+NN&quot; tends to be translated into the Chinese pattern &quot;a+n&quot; or &quot;b+n&quot;. 
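A minimal sketch of the generate-and-test procedure with the categorial restriction might look like the following; the COMPATIBLE category table, the DICE threshold, and the data layout are hypothetical illustrations, not the paper's actual configuration:

```python
from collections import Counter

# Hypothetical map from English categories to compatible Chinese ones,
# e.g. "JJ+NN" translating to "a+n" or "b+n" as in the example above.
COMPATIBLE = {"NN": {"n"}, "JJ+NN": {"a+n", "b+n"}, "VB": {"v"}}

def extract_teps(aligned, threshold=0.3):
    """Generate-and-test extraction of translation equivalent pairs.

    `aligned` is a list of (english_units, chinese_units) per aligned
    region, each unit a (form, category) tuple. Generative phase:
    candidates are produced only for category-compatible pairs (the
    categorial hypothesis), pruning the search space. Testing phase:
    candidates are kept if their DICE coefficient reaches the threshold."""
    pair_f, en_f, zh_f = Counter(), Counter(), Counter()
    for en_units, zh_units in aligned:
        for e in set(en_units):
            en_f[e] += 1
        for z in set(zh_units):
            zh_f[z] += 1
        for e in set(en_units):
            for z in set(zh_units):
                if z[1] in COMPATIBLE.get(e[1], set()):
                    pair_f[(e, z)] += 1  # co-occurrence in an aligned region
    results = []
    for (e, z), f in pair_f.items():
        dice = 2 * f / (en_f[e] + zh_f[z])
        if dice >= threshold:
            results.append((e[0], z[0], round(dice, 3)))
    return sorted(results, key=lambda r: -r[2])
```

Dropping the category check from the inner loop recovers the unrestricted generative phase, at the cost of the much larger candidate space discussed above.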
Obviously, this assumption is not always true for translation from Chinese into English and vice versa. But it makes the algorithm much more efficient, while precision does not fall severely.</Paragraph> </Section> <Section position="5" start_page="2" end_page="2" type="metho"> <SectionTitle> 6. Experiments and Results </SectionTitle> <Paragraph position="0"> We have performed some preliminary experiments to test the performance of the different statistical measurements, and the change in performance when the categorial hypothesis is used.</Paragraph> <Paragraph position="1"> For the experiments, we used a very small portion of the corpus: 500 sentence pairs.</Paragraph> <Paragraph position="2"> Figure 5 shows the performance of Chinese multi-word unit identification; we count how many correct MWUs there are among the first hundred candidate MWUs produced by the program.</Paragraph> </Section> <Section position="6" start_page="2" end_page="2" type="metho"> <SectionTitle> MI DICE LL </SectionTitle> <Paragraph position="0"> [Figure 5: performance of the statistical measurements for identification of MWU]</Paragraph> </Section> <Section position="7" start_page="2" end_page="2" type="metho"> <SectionTitle> MWU </SectionTitle> <Paragraph position="0"> Figure 6 shows the performance of the TEP extraction using different statistical means. We count how many correct and partially correct correspondences there are among the first hundred translation equivalent pairs produced by the algorithm.</Paragraph> </Section> </Paper>