<?xml version="1.0" standalone="yes"?>
<Paper uid="H91-1026">
  <Title>Identifying Word Correspondences in Parallel Texts</Title>
  <Section position="5" start_page="156" end_page="156" type="concl">
    <SectionTitle>
10. Conclusions
</SectionTitle>
    <Paragraph position="0"> We have been studying how to find corresponding words in parallel texts given aligned regions. We have introduced several novel techniques that make substantial progress toward this goal. The philosophy underlying all our techniques is to keep errors of commission low. Whatever words are matched by these robust techniques should almost always be correct. Then, at any stage, the words that are matched can be used confidently for further research.</Paragraph>
    <Paragraph position="1"> The first technique we have introduced is the measurement of association of pairs of words by φ², based on a two by two contingency table. This measure does better than mutual information at showing which pairs of words are translations, because it accounts for the cases in which one of the words occurs and the other does not. We apply this measure iteratively. Our caution is expressed by selecting at most one pair of words containing a given word on each iteration. The φ² measure for a selected pair must be significantly greater than the φ² measures for each of the words of the pair and any other suggested translation.</Paragraph>
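The association measure above can be illustrated with a minimal sketch. It assumes the standard φ² statistic for a two by two table; the function name and the cell names a, b, c, d are chosen here for illustration and are not taken from this section.

```python
def phi_squared(a, b, c, d):
    """Standard phi-squared statistic for a two by two contingency table.

    Assumed cell layout (illustrative naming):
      a: aligned regions containing both words
      b: regions containing only the first word
      c: regions containing only the second word
      d: regions containing neither word
    """
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    if denominator == 0:
        # A row or column total is zero, so the statistic is undefined;
        # this sketch reports no association in that case.
        return 0.0
    return (a * d - b * c) ** 2 / denominator


# Two words that always occur together in 10 of 100 regions.
print(phi_squared(10, 0, 0, 90))
# Two words whose co-occurrence matches what independence would predict.
print(phi_squared(1, 9, 9, 81))
```

The cells b and c count the regions where one word occurs without the other, which is the information the paragraph above says mutual information does not account for.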
    <Paragraph position="2"> The iteration is accompanied by a progressive enlargement of possibly interesting pairs. We could not study all pairs of words, or even all occurring pairs of words. Rather, we take all the occurring pairs in a progressively enlarged sample of regions. This does propose the most frequently cooccurring pairs first. On each iteration we delete the pairs of words that have already been selected, thereby reducing the confusion among collocates. Our caution was expressed by hand checking the accuracy of selected pairs after each iteration. We chose techniques which could give 98 percent accuracy on the selected pairs. This has not been a blind automatic procedure, but one controlled at each step by human expertise.</Paragraph>
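One iteration of the cautious selection described in this section might be sketched as follows. This is only an illustration under stated assumptions: the function name select_pairs is hypothetical, and the margin ratio is a simple stand-in for the significance test, not the variance-based comparison used in the work itself.

```python
def select_pairs(scores, margin=2.0):
    """Select at most one pair per word from scored candidate pairs.

    scores: dict mapping (source_word, target_word) to an association score.
    margin: hypothetical ratio by which a pair must beat every rival pair
            that shares exactly one of its words (a stand-in for a real
            significance test).
    """
    selected = []
    used_source, used_target = set(), set()
    # Visit candidates from the strongest association to the weakest.
    for (src, tgt), score in sorted(scores.items(), key=lambda kv: -kv[1]):
        if src in used_source or tgt in used_target:
            continue  # a pair containing this word was already selected
        # Rivals share the source word or the target word, but not both.
        rivals = [s for (s2, t2), s in scores.items()
                  if (s2 == src) != (t2 == tgt)]
        if all(score >= margin * r for r in rivals):
            selected.append((src, tgt))
            used_source.add(src)
            used_target.add(tgt)
    return selected


def remove_selected(scores, selected):
    """Drop already selected pairs before the next iteration."""
    chosen = set(selected)
    return {pair: s for pair, s in scores.items() if pair not in chosen}


candidates = {
    ("house", "maison"): 0.9,
    ("house", "la"): 0.1,
    ("the", "la"): 0.5,
    ("the", "le"): 0.4,
}
picked = select_pairs(candidates)
print(picked)
print(remove_selected(candidates, picked))
```

In this toy input only the pair with a clear margin over its rivals is selected; the two candidates for "the" are too close to each other, so neither is accepted on this iteration.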
    <Paragraph position="3"> When we observed that many of the pairs considered contained morphological variants of a pair selected, we allowed such pairs to be accepted if they also had a φ² significantly greater than chance.</Paragraph>
    <Paragraph position="4"> Several of our tests acknowledge that any function, such as φ², of noisy data, such as frequencies, is itself a noisy measure. Therefore our caution is to require not just that one measure be greater than another, but that it be significantly greater. This calculation is made using an estimate of the variance of φ².</Paragraph>
    <Paragraph position="5"> We then used the selected word pairs to suggest word correspondences within a given aligned region. The alignment was done by a dynamic programming technique with a parameter that controlled how certain we should be before accepting a specific pair of words as corresponding. We set the parameter to give results that are quite likely to be correct. Currently we suggest correspondences for about 60 percent of the words, and when we do suggest a correspondence we are correct in about 95 percent of cases.</Paragraph>
    <Paragraph position="6"> This is work in progress. We expect that in the future the coverage can be increased substantially above 60% while errors can be decreased somewhat from 5%. We believe that errors of omission are much less important than errors of commission and expect to continue choosing techniques accordingly.</Paragraph>
  </Section>
</Paper>