File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/01/h01-1041_metho.xml

Size: 16,800 bytes

Last Modified: 2025-10-06 14:07:32

<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1041">
  <Title>[Figure labels: OTHER LANGUAGES; SEMANTIC FRAMES (COMMON COALITION LANGUAGE); UNDERSTANDING; GENERATION]</Title>
  <Section position="2" start_page="0" end_page="1" type="metho">
    <SectionTitle>
1. SYSTEM OVERVIEW
</SectionTitle>
    <Paragraph position="0"> The CCLINC Korean-to-English translation system is a component of the CCLINC Translingual Information System, the focus languages of which are English and Korean. Given the input text or speech, the language understanding system parses the input and transforms the parsing output into a language-neutral meaning representation called a semantic frame [16,17]. The semantic frame, the key properties of which will be discussed in Section 2.3, becomes the input to the generation system. The generation system produces the target language translation output after word order arrangement, vocabulary replacement, and the appropriate surface form realization in the target language [6]. Besides serving as the input to the generation system, the semantic frame can be utilized for other applications such as translingual information extraction and question-answering [12].</Paragraph>
    <Paragraph position="1"> In this paper, we focus on the Korean-to-English text translation component of CCLINC.</Paragraph>
  </Section>
  <Section position="3" start_page="1" end_page="1" type="metho">
    <SectionTitle>
2. ROBUST PARSING, MEANING
REPRESENTATION, AND AUTOMATED
GRAMMAR ACQUISITION
</SectionTitle>
    <Paragraph position="0"> This work was sponsored by the Defense Advanced Research Projects Agency under contract number F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the [...]
For other approaches to Korean-to-English translation, readers are referred to Egedi, Palmer, Park and Joshi 1994, a transfer-based approach using synchronous tree adjoining grammar [5], and Dorr 1997, a small-scale interlingua-based approach using Jackendoff's lexical conceptual structure as the interlingua [4].</Paragraph>
  </Section>
  <Section position="4" start_page="1" end_page="6" type="metho">
    <SectionTitle>
[Figure labels: OTHER LANGUAGES; SEMANTIC FRAMES (COMMON COALITION LANGUAGE); UNDERSTANDING; GENERATION]
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1" end_page="4" type="sub_section">
      <SectionTitle>
2.1 Robust Parsing
</SectionTitle>
      <Paragraph position="0"> The CCLINC parsing module, TINA [16], implements top-down chart parsing and best-first search, driven by context-free grammar rules compiled into a recursive transition network augmented by features [8]. The following properties of Korean induce a great degree of ambiguity in the grammar: (i) relatively free word order for arguments (given a sentence with three arguments, subject, object, and indirect object, all 6 logical word order permutations are possible in reality), (ii) frequent omission of subjects and objects, and (iii) strict verb finality [10]. Due to the free word order and argument omissions, the first word of an input sentence can be many-way ambiguous: it can be part of a subject, an object, or any other post-positional phrase.</Paragraph>
      <Paragraph position="1">  The ambiguity introduced by the first input word grows rapidly as the parser processes subsequent input words. Verbs, which in English usually play a crucial role in reducing ambiguity through their subcategorization frame information, are not available until the end of the sentence [1,3,11].</Paragraph>
      <Paragraph position="2"> Our solution to the ambiguity problem lies in a novel grammar writing technique, which reduces the ambiguity of the first input word. We hypothesize that (i) the initial symbol in the grammar (i.e. Sentence) always starts with the single category generic_np, the grammatical function (subject, object) of which is undetermined. This ensures that the ambiguity of the first input word is reduced to the number of different ways the category generic_np can be rewritten. (ii) The grammatical function of the generic_np is determined after the parser processes the following case marker via a trace mechanism.</Paragraph>
      <Paragraph position="3">  Figure 2 illustrates a set of sample context-free grammar rules, and Figure 3 (on the next page) is a sample parse tree for the input sentence "URi Ga EoRyeoUn MunJe Reul PulEox Da (We solved a difficult problem)."
(i) sentence → generic_np clause sentence_marker
(ii) clause → subject generic_np object verbs
(iii) subject → subj_marker np_trace
Post-positional phrases in Korean correspond to prepositional phrases in English. We use the term post-positional phrase to indicate that the function words at issue are located after the head noun.</Paragraph>
      <Paragraph position="4">  The hypothesis that all sentences start with a single category generic_np is clearly oversimplified. We can easily find sentences starting with other elements, such as coordination markers, which do not fall under generic_np. When a sentence does not start with the category generic_np, we discard these initial elements for parsing purposes. This method has proven quite effective in the overall design of the translation system, especially because most non-generic_np sentence-initial elements (e.g. coordination markers, adverbs, etc.) do not contribute to the core meaning of the input sentence.</Paragraph>
      <Paragraph position="5">  Throughout this paper, "subj_marker" stands for "subject marker", and "obj_marker" for "object marker". The generic_np dominated by the initial symbol sentence in (i) of Figure 2 is parsed as an element moved from the position occupied by np_trace in (iii), and therefore corresponds to the category np_trace dominated by subject in Figure 3 (placed on the next page for space reasons). All subsequent generic_np's, which are part of a direct object, an indirect object, a post-positional phrase, etc., are handled uniformly by the same trace mechanism. By hypothesizing that all sentences start with generic_np, the system can parse Korean robustly and efficiently. The trace mechanism determines the grammatical function of generic_np by repositioning it after the appropriate case marker.</Paragraph>
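      <Paragraph> The idea that a case marker, rather than the verb, fixes the grammatical function of the preceding generic_np can be illustrated with a small sketch. This is not the TINA implementation: it is a flat left-to-right scan under strong simplifying assumptions, and the marker tables and the function name assign_functions are hypothetical. Only the markers Ga and Reul and the sentence marker Da are taken from the example sentence above.

```python
# Hypothetical marker tables; only Ga, Reul and Da come from the example sentence.
CASE_MARKERS = {"Ga": "subject", "Reul": "object"}
SENTENCE_MARKERS = {"Da"}


def assign_functions(tokens):
    """Collect words into a generic_np and label it once its case marker is seen."""
    phrases = []  # list of (function, words) pairs in surface order
    pending = []  # words of the current generic_np, function still undetermined
    for token in tokens:
        if token in CASE_MARKERS:
            # The marker determines the function of the generic_np just read,
            # mimicking the role the trace mechanism plays in the grammar.
            phrases.append((CASE_MARKERS[token], pending))
            pending = []
        elif token in SENTENCE_MARKERS:
            # Simplification: whatever precedes the sentence marker is the verb.
            phrases.append(("verbs", pending))
            pending = []
        else:
            pending.append(token)
    return phrases


canonical = assign_functions("URi Ga EoRyeoUn MunJe Reul PulEox Da".split())
scrambled = assign_functions("EoRyeoUn MunJe Reul URi Ga PulEox Da".split())
print(canonical)
print(scrambled)
```

Both word orders yield the same subject, object, and verb assignments, differing only in surface order, which is the property the case-marker-driven analysis is meant to provide.</Paragraph>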
      <Paragraph position="6"> Utilization of overt case markers to improve parsing efficiency precisely captures the commonly shared intuition about parsing relatively free word order languages with overt case markers, such as Korean and Japanese, compared with parsing relatively strict word order languages with no overt case markers, such as English. In languages like English, the verb of a sentence plays the crucial role in reducing ambiguity via the verb subcategorization frame information on the co-occurring noun phrases [1,3,11]. In languages like Korean, however, it is typically the case marker that identifies the grammatical function of the co-occurring noun phrase, assuming a role similar to that of verbs in English. The current proposal is the first explicit implementation of this intuition, instantiated by the novel idea that all noun phrases are moved out of the case marked phrases immediately following them.</Paragraph>
    </Section>
    <Section position="2" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
2.2 Meaning Representation and Generation
</SectionTitle>
      <Paragraph position="0"> The CCLINC Korean-to-English translation system achieves high quality translation by (i) robust mapping of the parsing output into the semantic frame, and (ii) word sense disambiguation on the basis of the selection preference between two grammatical relations (verb-object, subject-verb, head-modifier) easily identifiable from the semantic frame [13]. The former facilitates accurate word order generation for various target language sentences, and the latter the accurate choice of the target language word given multiple translation candidates for the same source language word. Given the parsing output in Figure 3, the system produces the semantic frame in Figure 4.
Strictly speaking, the meaning representation in Figure 4 is not truly language neutral, in that the terminal vocabularies are represented in Korean rather than in an interlingua vocabulary. It is fairly straightforward to adapt our system to produce the meaning representation with the terminal vocabularies specified by an interlingua. However, we have made a deliberate decision to leave the Korean vocabularies in the representation, largely (1) to retain the system efficiency for mapping parsing output into meaning representation, and (2) for unified execution of automation algorithms for both Korean-to-English and English-to-Korean translation. We would like to point out that this minor compromise in meaning representation still ensures the major benefit of the interlingua approach to machine translation, namely, 2 x N sets of grammar rules for N language pairs, as [...]
[Figure 4: semantic frame for the input sentence "URi Ga EoRyeoUn MunJe Reul PulEox Da."]</Paragraph>
      <Paragraph position="1"> The semantic frame captures the core predicate-argument structure of the input sentence in a hierarchical manner [9,10] (i.e. the internal argument, typically the object, is embedded under the verb, and the external argument, typically the subject, is at the same hierarchical level as the main predicate, i.e. the verb phrase in syntactic terms). The predicate and the arguments, along with their representation categories, are bold-faced in Figure 4. With the semantic frame as input, the generation system generates the English translation using the grammar rules in (1), and the Korean paraphrase using the grammar rules in (2).</Paragraph>
      <Paragraph position="2">
(1) a. statement :topic :predicate
    b. pul_v :predicate :topic
(2) a. statement :topic :predicate
    b. pul_v :topic :predicate
(1b) and (2b) state that the topic category for the object follows the verb predicate in English, whereas it precedes the verb predicate in Korean.</Paragraph>
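      <Paragraph> How the ordering rules in (1) and (2) produce different surface orders from one semantic frame can be sketched as follows. The rule tables mirror (1) and (2); everything else is an assumption made for illustration: the nested-dictionary frame (Figure 4 is not reproduced here), the concept keys, the two small word tables, and the function linearize are hypothetical and do not describe the actual generation system.

```python
# Ordering rules copied from (1) for English and (2) for Korean.
ENGLISH_RULES = {"statement": [":topic", ":predicate"], "pul_v": [":predicate", ":topic"]}
KOREAN_RULES = {"statement": [":topic", ":predicate"], "pul_v": [":topic", ":predicate"]}

# Hypothetical surface forms for each concept key (assumed, not from the paper's lexicon).
ENGLISH_WORDS = {"uri": "we", "pul_v": "solved", "eoryeoun_munje": "a difficult problem"}
KOREAN_WORDS = {"uri": "URi Ga", "pul_v": "PulEox Da", "eoryeoun_munje": "EoRyeoUn MunJe Reul"}

# Hypothetical frame: the subject is a sibling of the predicate,
# and the object is embedded under the verb.
FRAME = {
    "name": "statement",
    ":topic": "uri",
    ":predicate": {"name": "pul_v", ":topic": "eoryeoun_munje"},
}


def linearize(frame, rules, words):
    """Emit the frame's constituents in the order given by the rule for its name."""
    output = []
    for slot in rules[frame["name"]]:
        child = frame.get(slot)
        if child is None:
            # No filler for this slot: the frame's own head word is emitted here.
            output.append(words[frame["name"]])
        elif isinstance(child, dict):
            output.extend(linearize(child, rules, words))
        else:
            output.append(words[child])
    return output


english = " ".join(linearize(FRAME, ENGLISH_RULES, ENGLISH_WORDS))
korean = " ".join(linearize(FRAME, KOREAN_RULES, KOREAN_WORDS))
print(english)  # we solved a difficult problem
print(korean)   # URi Ga EoRyeoUn MunJe Reul PulEox Da
```

Only the rule for pul_v differs between the two tables, so the same frame yields verb-object order for English and object-verb order for Korean.</Paragraph>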
      <Paragraph position="3"> The predicate-argument structure also provides a means for word sense disambiguation [13,15]. The verb pul_v is at least two-way ambiguous between solve and untie. Word sense disambiguation is performed by applying rules such as those in (3).</Paragraph>
      <Paragraph position="4">
(3) a. .pul_v
       problem pul+solve_v
    b. .pul_v
       thread pul+untie_v
(3a) states that if the verb pul_v occurs with an object of type problem, it is disambiguated as pul+solve_v. (3b) states that the verb occurring with an object of type thread is disambiguated as pul+untie_v. The disambiguated verbs are translated into solve and untie, respectively, in the Korean-to-English translation lexicon.</Paragraph>
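      <Paragraph> The disambiguation rules in (3) amount to a lookup keyed by the verb and the semantic type of its object. A minimal sketch, assuming a dictionary encoding of the rules: the table contents come from (3) and the surrounding text, while the function names disambiguate and translate, and the fallback of leaving the verb unchanged when no rule applies, are assumptions for illustration.

```python
# Rules from (3): verb -> {object type -> disambiguated verb}.
SENSE_RULES = {
    "pul_v": {"problem": "pul+solve_v", "thread": "pul+untie_v"},
}

# Korean-to-English lexicon entries for the disambiguated verbs, as stated in the text.
TRANSLATION_LEXICON = {"pul+solve_v": "solve", "pul+untie_v": "untie"}


def disambiguate(verb, object_type):
    """Return the sense selected by the object type, or the verb itself if no rule applies."""
    return SENSE_RULES.get(verb, {}).get(object_type, verb)


def translate(verb, object_type):
    """Translate a verb after disambiguation; None if the lexicon has no entry (assumed fallback)."""
    return TRANSLATION_LEXICON.get(disambiguate(verb, object_type))


print(translate("pul_v", "problem"))  # solve
print(translate("pul_v", "thread"))   # untie
```

Because the object is embedded directly under the verb in the semantic frame, the object type needed for this lookup is available without further search.</Paragraph>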
    </Section>
    <Section position="3" start_page="4" end_page="6" type="sub_section">
      <SectionTitle>
2.3 Knowledge-Based Automated Acquisition
of Grammars
</SectionTitle>
      <Paragraph position="0"> To overcome the knowledge bottleneck for robust translation and efficient system porting in an interlingua-based system [7], we have developed a technique for automated acquisition of grammar rules which leads to a simultaneous acquisition of rules for (i) the parser, (ii) the mapper between the parser and the semantic frame, and (iii) the generator.</Paragraph>
      <Paragraph position="1"> The technique utilizes a list of words and their corresponding parts-of-speech in the corpus as the knowledge source, presupposes a set of knowledge-based rules to be derived from a word and its part-of-speech pair, and gets executed according to the procedure given in Figure 5. The rationale behind the technique is that (i) given a word and its part-of-speech, most of the syntactic rules associated with the word can be automatically derived according to the projection principle (the syntactic  structures) in linguistic theories, [2], and (ii) the mapping from the syntactic structure to the semantic frame representation is algorithmic. The specific rules to be acquired for a language largely depend on the grammar of the language for parsing.</Paragraph>
      <Paragraph position="2"> Some example rules acquired for the verb BaiChiHa (arrange) in Korean consistent with the parsing technique discussed in Section 2.1 are given in (4) through (7).</Paragraph>
      <Paragraph position="3"> Initialization: Create the list of words and their parts-of-speech in the corpus.</Paragraph>
      <Paragraph position="4"> Grammar Update: For each word and its associated part-of-speech, check whether the word and the rules associated with the corresponding part-of-speech occur in each lexicon and grammar.</Paragraph>
      <Paragraph position="5"> If they already occur, do nothing.</Paragraph>
      <Paragraph position="6"> If not: (i) Create the appropriate rules and vocabulary items for each entry.</Paragraph>
      <Paragraph position="7"> (ii) Insert the newly created rules and vocabulary items into the appropriate positions of the grammar/lexicon files for the parser, the grammar file for the mapper between the parser and the [...]
The rules for the parser for the verb tell in English are given below, to illustrate the dependency of the acquired rules on the specific implementation of the grammar of the language for parsing:
.vp_tell
vtell [adverb_phrase] dir_object [v_pp]
vtell [adverb_phrase] indir_object dir_object
vtell [adverb_phrase] dir_object v_to_pp [v_pp]
vtell [adverb_phrase] dir_object that_clause
vtell [and_verb] [or_verb] [adverb_phrase] dir_object wh_clause
The contrast in complexity between the verb rules in (4) for Korean and those in (i) for English reflects the relative importance of the role played by verbs for parsing in each language. That is, verbs play a minimal role in Korean and a major role in English for ambiguity reduction and efficiency improvement.</Paragraph>
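      <Paragraph> The Initialization and Grammar Update steps of the procedure can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the system's actual code: a single in-memory lexicon and rule list stand in for the separate parser, mapper, and generator files, and the rule templates, the lower-cased stem naming, and the function update_grammar are hypothetical.

```python
# Hypothetical knowledge-based templates: one list of rule patterns per part-of-speech.
RULE_TEMPLATES = {
    "verb": ["{stem}_v :predicate :topic"],
    "noun": ["{stem}_n :noun_phrase"],
}


def update_grammar(word_pos_pairs, lexicon, grammar):
    """Add entries and rules only for (word, part-of-speech) pairs not yet covered."""
    added = []
    for word, pos in word_pos_pairs:
        entry = (word, pos)
        if entry in lexicon:
            continue  # already present: do nothing
        # (i) create the vocabulary item and the rules derived from the templates
        lexicon.add(entry)
        for template in RULE_TEMPLATES.get(pos, []):
            rule = template.format(stem=word.lower())
            # (ii) insert each new rule into the grammar, skipping duplicates
            if rule not in grammar:
                grammar.append(rule)
        added.append(entry)
    return added


# Initialization: the list of words and their parts-of-speech found in the corpus.
corpus_pairs = [("BaiChiHa", "verb"), ("MunJe", "noun"), ("BaiChiHa", "verb")]
lexicon = {("MunJe", "noun")}
grammar = ["munje_n :noun_phrase"]

added = update_grammar(corpus_pairs, lexicon, grammar)
print(added)    # [('BaiChiHa', 'verb')]
print(grammar)  # ['munje_n :noun_phrase', 'baichiha_v :predicate :topic']
```

Entries that already exist are left untouched, so the update can be rerun safely whenever the corpus word list grows.</Paragraph>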
      <Paragraph position="8">  (6) Lexicon for the generation vocabulary</Paragraph>
      <Paragraph position="10"> (7) Rules for the generation grammar
baichiha_v :predicate :conj :topic :sub_clause
np-baichiha_v :noun_phrase :predicate :conj :topic :sub_clause
The system presupposes a flat phrase structure for a sentence in Korean, as shown in Figure 3, and therefore the rules for the verbs do not require the verb subcategorization information, as in (4). The optional elements such as [negation], [tense], etc. are possible prefixes and suffixes to be attached to the verb stem, illustrating the fairly complex verb morphology of this language. The rules for the generation grammar in (7) are the subcategorization frames for the verb arrange in English, which is the translation of the Korean verb baichiha_v, as given in (6).</Paragraph>
      <Paragraph position="11"> The current technique is quite effective in expanding the system's capability when there is no large syntactically annotated corpus available from which we can derive and train the grammar rules [14], and it is applicable across languages insofar as the notion of part-of-speech, the projection principle, and the X-bar schema are language independent. With this technique, manual acquisition of the knowledge database for the overall translation system is reduced to the acquisition of (i) the bilingual lexicon, and (ii) the corpus-specific top-level grammar rules, which constitute less than 20% of the total grammar rules in our system. This has enabled us to produce a fairly large-scale interlingua-based translation system within a short period of time. One apparent limitation of the technique, however, is that it still requires the manual acquisition of corpus-specific rules (i.e. the patterns which do not fall under the linguistic generalization). We are currently developing a technique for automatically deriving grammar rules and obtaining the rule production probabilities from a syntactically annotated corpus.</Paragraph>
    </Section>
  </Section>
</Paper>