<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1803"> <Title>Noun-Noun Compound Machine Translation: A Feasibility Study on Shallow Processing</Title> <Section position="6" start_page="4" end_page="4" type="evalu"> <SectionTitle> 5 Analysis and extensions </SectionTitle> <Paragraph position="0"> In this section, we offer a qualitative analysis of the unaligned translation pairs (i.e. members of classes B, C and D in Table 3) with an eye to improving the coverage of DMTCOMP. We make a tentative step in this direction by suggesting one extension to the basic DMTCOMP paradigm based on synonym substitution.</Paragraph> <Section position="1" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 5.1 Analysis of unaligned translation pairs </SectionTitle> <Paragraph position="0"> We consider there to be 6 basic types of misalignment in the translation pairs, each of which we illustrate with examples (in which underlined words are aligned and boldface words are the focus of discussion). In listing each misalignment type, we indicate the corresponding alignment classes in § 4.2.</Paragraph> <Paragraph position="1"> (a) Missing template (B) An example of misalignment due to a missing template (but where all component words align) is: (a1) kesshou·shiNshutsu &quot;advancement to finals&quot; Simply extending the coverage of translation templates would allow DMTCOMP to capture examples such as this.</Paragraph> <Paragraph position="2"> In (b1), the misalignment is caused by English disclosure encoding information by default; a similar case can be made for (b2), although here summit does not align with kaidaN. DMTCOMP could potentially cope with these given a lexical inference module interfacing with a semantically rich lexicon (particularly in the case of (b1), where translation selection at least partially succeeds), but DMTINTERP seems the more natural model for coping with this type of translation. 
(b3) is slightly different again, in that riritsu can be analysed as a two-character abbreviation derived from risoku &quot;interest&quot; and ritsu &quot;rate&quot;, which aligns fully with interest rate. Explicit abbreviation expansion could unearth the full wordform and facilitate alignment.</Paragraph> <Paragraph position="3"> (c) Synonym and association pairs (C1) This class contains translation pairs where one or more pairs of component nouns do not align under exact translation, but are conceptually similar: In (c1), although zaisei &quot;finance&quot; is not an exact translation of budget, they are both general financial terms. It may be possible to align such words using word similarity, which would enable DMTCOMP to translate some portion of the C1 data. In (c2), on the other hand, kamei &quot;affiliation&quot; is lexically associated with the English membership, although here the link becomes more tenuous.</Paragraph> <Paragraph position="4"> (d) Mismatch in semantic explicitness (C1) This translation class is essentially the same as class (b) above, in that semantic content explicitly described in the source NN compound is made implicit in the translation. The only difference is that the translation is not a single word, so there is at least the potential for word-to-word compositionality to hold: NN compound and translation express the same concept differently due to a shift in semantic focus: (e1) shuushoku·katsudou &quot;(lit.) activity for getting new employment&quot;, translated as job hunting.</Paragraph> <Paragraph position="5"> Here, the mismatch is in the level of directed participation in the process of finding a job. 
In Japanese, katsudou &quot;activity&quot; describes simple involvement, whereas hunting signifies a more goal-oriented process.</Paragraph> <Paragraph position="6"> (f) Lexical gaps (C3, D2) Members of this class cannot be translated compositionally, as they are either non-compositional expressions or, more commonly, there is no conventionalised way of expressing the denoted concept in the target language: (f1) zoku·giiN &quot;legislators championing the causes of selected industries&quot; These translation pairs pose an insurmountable obstacle for DMTCOMP.</Paragraph> <Paragraph position="7"> Of these types, (a), (b) and (c) are the most realistically achievable for DMTCOMP. Combined, they account for about 20% of coverage, suggesting that it would be worthwhile investing effort into resolving them.</Paragraph> </Section> <Section position="2" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 5.2 Performance vs. translation fan-out </SectionTitle> <Paragraph position="0"> As mentioned in § 5.1, there are a number of avenues for enhancing the performance of DMTCOMP. Here, we propose synonym-based substitution as a means of dealing with synonym pairs from class (c).</Paragraph> <Paragraph position="1"> The basic model of word substitution can be extended simply by inserting synonym translations as well as direct word translations into the translation templates. We test-run this extended method for the JE translation task, using the Nihongo Goi-taikei thesaurus (Ikehara et al., 1997) as our source of source-language synonyms, and ALTDIC as our translation dictionary. The Nihongo Goi-taikei thesaurus classifies the contents of ALTDIC into 2,700 semantic classes. We consider words occurring in the same class to be synonyms, and add in the translations for each. 
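As a rough illustration of the extended substitution step just described, the following Python sketch fills templates with direct translations and, optionally, with the translations of same-class words. The dictionary entries, the single synonym class and the two templates are toy assumptions made for this example; they are not drawn from ALTDIC, the Nihongo Goi-taikei thesaurus or the template set used in the experiments.

```python
from itertools import product

# Toy, hypothetical resources for illustration only; these are not
# entries from ALTDIC or classes from the Nihongo Goi-taikei thesaurus.
TRANSLATIONS = {
    "zaisei": ["finance"],
    "yosaN": ["budget"],
    "kaikaku": ["reform"],
}
SYNONYM_CLASSES = [{"zaisei", "yosaN"}]
# Hypothetical translation templates over the two component nouns.
TEMPLATES = ["{n1} {n2}", "{n2} of {n1}"]


def word_translations(word, use_synonyms):
    """Return direct translations, plus those of same-class words if requested."""
    words = [word]
    if use_synonyms:
        for cls in SYNONYM_CLASSES:
            if word in cls:
                words.extend(sorted(cls - {word}))
    out = []
    for w in words:
        for t in TRANSLATIONS.get(w, []):
            if t not in out:  # keep first occurrence, drop duplicates
                out.append(t)
    return out


def candidates(n1, n2, use_synonyms=False):
    """Fill every template with every pairing of component translations."""
    t1 = word_translations(n1, use_synonyms)
    t2 = word_translations(n2, use_synonyms)
    return [
        tpl.format(n1=a, n2=b)
        for tpl, (a, b) in product(TEMPLATES, product(t1, t2))
    ]


plain = candidates("zaisei", "kaikaku")
extended = candidates("zaisei", "kaikaku", use_synonyms=True)
print(len(plain), plain)
print(len(extended), extended)
```

With these toy entries, consulting the synonym class adds candidates containing budget alongside those containing finance, so the candidate list grows from 2 to 4 entries. This is a small-scale instance of the fan-out effect discussed in this section.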
Note that we test this configuration over only C1-type compounds due to the huge fan-out in translation candidates generated by the extended method (although performance is evaluated over the full dataset, with results for non-C1 compounds remaining constant throughout).</Paragraph> <Paragraph position="2"> One significant disadvantage of synonym-based substitution is that it leads to an exponential increase in the number of translation candidates. If we analyse the complexity of simple word-based substitution to be $O(n^2)$, where $n$ is the average number of translations per word, the complexity of synonym-based substitution [...] Table 5 shows the translation performance and also translation fan-out (average number of translation candidates) for DMTCOMP with and without synonym-based substitution (±sim) over the top 6 and 13 translation templates (TTs). As baselines, we also present the results for MBMTDICT (MBMTDICT (orig)) and DMTCOMP (DMTCOMP (orig)) in their original configurations (over the full 23 templates and without synonym substitution for DMTCOMP). From this, the exponential translation fan-out for synonym-based substitution is immediately evident, but accuracy can also be seen to increase by over 4 percentage points through the introduction of synonym substitution. Indeed, the accuracy when using synonym substitution over only the top 6 translation templates is greater than that for the basic DMTCOMP method, although the number of translation candidates is clearly greater. Note the marked difference in fan-out for MBMTDICT vs. the various incarnations of DMTCOMP, and that considerable faith is placed in the ability of translation selection with DMTCOMP.</Paragraph> <Paragraph position="3"> While the large number of translation candidates produced by synonym substitution makes translation selection appear intractable, most candidates are meaningless word sequences, which can easily be filtered out based on target language corpus evidence. 
Indeed, Tanaka (2002) successfully combines synonym-substitution with translation selection and achieves appreciable gains in accuracy.</Paragraph> </Section> </Section> </Paper>