File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/03/w03-1803_intro.xml
Size: 5,282 bytes
Last Modified: 2025-10-06 14:02:04
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1803"> <Title>Noun-Noun Compound Machine Translation: A Feasibility Study on Shallow Processing</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Multiword expressions are problematic in machine translation (MT) due to the idiomaticity and overgeneration problems (Sag et al., 2002). Idiomaticity is the problem of compositional semantic unpredictability and/or syntactic markedness, as seen in expressions such as kick the bucket (= die) and by and large, respectively. Overgeneration occurs as a result of a system failing to capture idiosyncratic lexical affinities between words, such as the blocking of seemingly equivalent word combinations (e.g. many thanks vs.</Paragraph> <Paragraph position="1"> *several thanks). In this paper, we target the particular task of the Japanese↔English machine translation of noun-noun compounds to outline the various techniques that have been proposed to tackle idiomaticity and overgeneration, and carry out detailed analysis of their viability over naturally-occurring data.</Paragraph> <Paragraph position="2"> Noun-noun (NN) compounds (e.g. web server, car park) characteristically occur with high frequency and high lexical and semantic variability. A summary examination of the 90m-word written component of the British National Corpus (BNC, Burnard (2000)) unearthed over 400,000 NN compound types, with a combined token frequency of 1.3m; that is, over 1% of words in the BNC are NN compounds. 
Moreover, if we plot the relative token coverage of the most frequently-occurring NN compound types, we find that the low-frequency types account for a significant proportion of the type count (see Figure 1).</Paragraph> <Paragraph position="3"> To achieve 50% token coverage, e.g., we require coverage of the top 5% most-frequent NN compounds, amounting to roughly 70,000 types with a minimum token frequency of 10. NN compounds are especially prevalent in technical domains, often with idiosyncratic semantics: Tanaka and Matsuo (1999) found that NN compounds accounted for almost 20% of entries in a Japanese-English financial terminological dictionary.</Paragraph> <Paragraph position="4"> Various claims have been made about the level of processing complexity required to translate NN compounds, and proposed translation methods range over a broad spectrum of processing complexity. There is a clear division between the proposed methods based on whether they attempt to interpret the semantics of the NN compound (i.e. use deep processing), or simply use the source language word forms to carry out the translation task (i.e. use shallow processing). It is not hard to find examples of semantic mismatch in NN compounds to motivate deep translation methods: the Japanese 井戸端·会議 idobata·kaigi &quot;(lit.) well-side meeting&quot;, e.g., translates most naturally into English as &quot;idle gossip&quot;, which a shallow method would be hard put to predict. Our interest is in the relative occurrence of such NN compounds and their impact on the performance of shallow translation methods. In particular, we seek to determine what proportion of NN compounds shallow translation methods can reasonably translate, and to answer the question: do shallow methods perform well enough to preclude the need for deep processing? 
The answer to this question takes the form of an estimation of the upper bound on translation performance for shallow translation methods.</Paragraph> <Paragraph position="5"> In order to answer this question, we have selected the language pair of English and Japanese, due to the high linguistic disparity between the two languages. We consider the tasks of both English-to-Japanese (EJ) and Japanese-to-English (JE) NN compound translation over fixed datasets of NN compounds, and apply representative shallow MT methods to the data.</Paragraph> <Paragraph position="6"> segment the compound into its component nouns through the use of the &quot;·&quot; symbol.</Paragraph> <Paragraph position="7"> While stating that English and Japanese are highly linguistically differentiated, we recognise that there are strong syntactic parallels between the two languages with respect to the compound noun construction. At the same time, there are large volumes of subtle lexical and expressional divergences between the two languages, as evidenced between 自転車·選手 jiteNsha·seNshu &quot;(lit.) bicycle athlete&quot; and its translation competitive cyclist. In this sense, we claim that English and Japanese are representative of the inherent difficulty of NN compound translation.</Paragraph> <Paragraph position="8"> The remainder of this paper is structured as follows.</Paragraph> <Paragraph position="9"> In §2, we outline the basic MT strategies that exist for translating NN compounds, and in §3 we describe the method by which we evaluate each method. We then present the results in §4, and analyse the results and suggest an extension to the basic method in §5.</Paragraph> <Paragraph position="10"> Finally, we conclude in §6.</Paragraph> </Section> </Paper>