<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2108">
  <Title>An Empirical Architecture for Verb Subcategorization Frame - A Lexicon for a Real-world Scale Japanese-English Interlingual MT</Title>
  <Section position="3" start_page="640" end_page="642" type="metho">
    <SectionTitle>
2. Linguistic Requirements of Verb Subcategorization Frames
</SectionTitle>
    <Paragraph position="0"> The most notable syntactic phenomenon of Japanese is so-called scrambling. Any verb-modifying NP in a simple Japanese sentence can appear at any position, or need not appear at all, regardless of its surface case and deep case. In other words, word order is almost free for the major syntactic elements of a simple Japanese sentence, except for the predicate itself, which is placed at the end of the sentence. Case postpositions, not word order, mark the case of these NPs in relation to the main verb. All the examples in e.g. 2-1 lead to the same event structure interpretation, which is shared by the English translation &amp;quot;X gives Y to Z.&amp;quot; 2 e.g. 2-1 a. ok &amp;quot;X-ga Y-wo Z-ni age-ru&amp;quot;</Paragraph>
    <Paragraph position="2"> Particles &amp;quot;-ga&amp;quot;, &amp;quot;-wo&amp;quot; and &amp;quot;-ni&amp;quot; are the most frequently used case postpositions; they often mark the nominative, accusative, and dative cases, respectively. 3 There are various other alternative postpositions that can replace or be added to these case postpositions.</Paragraph>
    <Paragraph position="3"> Discourse-representing postpositions such as &amp;quot;-ha&amp;quot;, which had been erroneously treated as a general subject marker, can replace &amp;quot;-wo&amp;quot; and &amp;quot;-ga&amp;quot;, and can be added to &amp;quot;-ni&amp;quot;. These replacements or additions do not alter the basic event structure of the sentences (2-1 a, b, ..., f), but sometimes add ambiguities to the syntactic structure, as is shown in e.g. 2-2. 3 The case postpositions &amp;quot;-ga&amp;quot;, &amp;quot;-wo&amp;quot; and &amp;quot;-ni&amp;quot; also mark various other cases and semantic roles, including &amp;quot;locative&amp;quot; and &amp;quot;time.&amp;quot;</Paragraph>
    <Paragraph position="4"> e.g. 2-2 a. ok &amp;quot;X-ha Y-ha Z-ni age-ru&amp;quot; b. ok &amp;quot;Y-ha X-ha Z-ni age-ru&amp;quot; Practical Japanese sentence analyzers need some semantic inference and default inference to plausibly identify X and Y, using the semantic restrictions on each case element and the standard word order. The nominative case of most Japanese verbs, including &amp;quot;age-ru&amp;quot; ('to give'), strongly prefers a human/animate attribute, so a meaningful difference between the semantic similarity of X to an animate object and the similarity of Y to some other kind of concrete object leads the analyzer to allocate the nominative case to X and the accusative case to Y in either 2-2 a or 2-2 b. If there is no such difference in the semantic restriction score, the standard word order &amp;quot;-ga,&amp;quot; &amp;quot;-wo,&amp;quot; and &amp;quot;-ni&amp;quot; seems to lead the listener to interpret e.g. 2-2 a as e.g. 2-1 a.</Paragraph>
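The inference described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the animacy scores, the threshold value, and the function name `assign_cases` are hypothetical and are not taken from the lexicon or the analyzer.

```python
def assign_cases(first_np, second_np, animacy, threshold=0.2):
    """Assign NOM and ACC to two NPs that are both marked with "-ha".

    `animacy` maps each NP to a hypothetical similarity score to
    'animate' (0.0 to 1.0). When the two scores differ by at least
    `threshold`, the more animate NP becomes nominative; otherwise the
    standard word order (nominative before accusative) decides.
    """
    difference = animacy[first_np] - animacy[second_np]
    if abs(difference) >= threshold:
        nominative = first_np if difference > 0 else second_np
    else:
        nominative = first_np  # default inference: standard word order
    accusative = second_np if nominative == first_np else first_np
    return {"NOM": nominative, "ACC": accusative}
```

With scores such as 0.9 for a person and 0.1 for a book, both orderings of the two "-ha" phrases yield the same assignment; with two near-equal scores, the first phrase is read as nominative.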
    <Paragraph position="5"> Japanese not only has typical voice conversions such as passivization but also appears to have similar phenomena that alter the surface case markings, as in the causative construction. This variation roughly corresponds to the variety of English auxiliary verbs and higher predicate verbs such as &amp;quot;dekiru (can),&amp;quot; &amp;quot;rareru (be pp./ can),&amp;quot; &amp;quot;kotoga-dekiru (be able to),&amp;quot; &amp;quot;tai (want to),&amp;quot; &amp;quot;seru (make/let),&amp;quot; &amp;quot;garu (feel complement)&amp;quot; etc.</Paragraph>
    <Paragraph position="6"> e.g. 2-3 a. X-ga Y-wo tabe-ru. X-NOM Y-ACC eat. &amp;quot;X eats Y.&amp;quot; b. X-ni Y-ga tabe-rareru. X-DAT Y-NOM eat-PASSIVE. &amp;quot;X can eat Y&amp;quot; xor &amp;quot;X is eaten by Y&amp;quot;</Paragraph>
    <Paragraph position="7"> As is observed in e.g. 2-3, the nominative case marker &amp;quot;-ga&amp;quot; turns into the dative case marker &amp;quot;-ni&amp;quot; and the accusative case marker &amp;quot;-wo&amp;quot; turns into the nominative case marker &amp;quot;-ga&amp;quot; when the passive/potential auxiliary verb &amp;quot;rareru&amp;quot; is attached.</Paragraph>
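The marker change observed in e.g. 2-3 can be written as a small mapping. The sketch below is illustrative only; the names `RARERU_PERMUTATION` and `permute_markers` are assumptions, and the frame is simplified to a list of (argument, marker) pairs.

```python
# Marker changes observed in e.g. 2-3 when "rareru" is attached.
RARERU_PERMUTATION = {"ga": "ni", "wo": "ga"}


def permute_markers(frame, permutation):
    """Return a new frame whose surface case markers are rewritten.

    `frame` is a list of (argument, marker) pairs; markers that the
    permutation does not mention are kept unchanged.
    """
    return [(arg, permutation.get(marker, marker)) for arg, marker in frame]


active = [("X", "ga"), ("Y", "wo")]
passive = permute_markers(active, RARERU_PERMUTATION)
print(passive)  # [('X', 'ni'), ('Y', 'ga')]
```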
    <Paragraph position="8"> Multiple voice conversions can often occur for a single verb phrase as is shown in e.g. 2-3c.</Paragraph>
    <Paragraph position="9"> e.g.2-3 c. X-ga Y-wo Z-ni tabe-sase-rare-taku-nai.</Paragraph>
    <Paragraph position="10"> X-NOM Y-ACC Z-DAT eat-CAUS-PASS-WANT-NOT.</Paragraph>
    <Paragraph position="11"> &amp;quot;X does not want to be forced to eat Y by Z&amp;quot;  Since the three auxiliary verb forms CAUS, PASS and WANT appear in this order in e.g. 2-3 c, a simple, natural way to correctly recognize the scope of complex modality features is to apply the permutations of the surface case set recursively, as is described in the following section.</Paragraph>
    <Paragraph position="12">  3. Combining Surface and Deep Frames  3-1. Basic Representation for Japanese Verb Subcategorization Frame Empirical studies such as those observed in the previous section suggest that combining syntactic and semantic frames could lead to optimum efficiency of lexicon descriptions. Thus we created the basic description framework in the verb lexicon, as is shown in Fig. 1.</Paragraph>
    <Section position="1" start_page="641" end_page="642" type="sub_section">
      <SectionTitle>
Fig. 1 Subcategorization Frame for &amp;quot;ageru&amp;quot;
</SectionTitle>
      <Paragraph position="0"> The example content in Fig. 1 is that of the verb &amp;quot;ageru (give)&amp;quot; used in e.g. 2-1 and e.g. 2-2. A natural language analyzer in an MT system is supposed to convert the case elements X, Y and Z in e.g. 2-1 a, ..., f to AGenT, PATient and GOAl, respectively, as shown in e.g. 3-1.</Paragraph>
      <Paragraph position="1"> e.g. 3-1 a. ok &amp;quot;X-ga Y-wo Z-ni age-ru&amp;quot;</Paragraph>
      <Paragraph position="3"> The analyzer looks up the slots in the surface case frame and finds the match for the case postposition; for &amp;quot;-ga&amp;quot;, GA in the NOM case slot matches, and the deep case stored in the NOM slot is taken out of the subcategorization frame. The analyzer then checks whether the semantic restriction 'animate' matches the case element X (John). If it fails, the analyzer looks for other slots, then the other subcategorization frames of the same verb, and then the frames of other verbs that appear elsewhere in the sentence.</Paragraph>
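A minimal sketch of this lookup, in Python, is given here. Fig. 1 is not reproduced in this text, so the frame below is a hypothetical partial reconstruction: the slot names, markers and deep cases follow the prose, while the restrictions on ACC and DAT and the function name `lookup_deep_case` are assumptions.

```python
# Hypothetical partial frame for "ageru" (give); the restrictions on ACC
# and DAT are assumptions, since Fig. 1 is not reproduced here.
AGERU_FRAME = {
    "NOM": {"markers": {"ga"}, "deep_case": "AGenT", "restriction": "animate"},
    "ACC": {"markers": {"wo"}, "deep_case": "PATient", "restriction": "concrete"},
    "DAT": {"markers": {"ni", "he"}, "deep_case": "GOAl", "restriction": "animate"},
}


def lookup_deep_case(frame, marker, features):
    """Return (slot name, deep case) for a case element, or None.

    A slot is accepted only if it lists the postposition and its
    semantic restriction is among the features of the case element.
    """
    for slot_name, slot in frame.items():
        if marker in slot["markers"] and slot["restriction"] in features:
            return slot_name, slot["deep_case"]
    return None  # the caller would try another frame or another verb
```

Returning None corresponds to the failure case in the text, where the analyzer moves on to other frames of the same verb or to other verbs in the sentence.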
      <Paragraph position="4"> Fig. 1 shows a fixed frame with seven case slots, and this is exactly the record format of our Japanese lexicon. Why is it not necessary to have more slots, though there are definitely more than seven case postpositions in Japanese? One of the reasons 4 is that other postpositions that can be mapped into a thematic role are supposed to fall into one of the seven slots and take their position as alternative case markers. For example, &amp;quot;deha&amp;quot; in Fig. 1 can only be used with animate plural nouns such as &amp;quot;kotira ('our side'),&amp;quot; but it certainly can mark the nominative case.</Paragraph>
      <Paragraph position="5"> 4 The other reason is that there are case postpositions that are not mapped into a thematic role. They constitute not argument structure but just adjuncts (free elements), as is explained in modern linguistics.</Paragraph>
      <Paragraph position="6"> The alternative case postposition &amp;quot;deha&amp;quot; also complies with the Unique Case Principle 5, which prohibits other case elements from filling the NOM slot once it is filled by &amp;quot;X-deha&amp;quot;. This is why this use of the case postposition &amp;quot;deha,&amp;quot; with a different semantic restriction, is supposed to occupy the same slot NOM as the major case postposition &amp;quot;-ga&amp;quot;. Another slot, DAT in Fig. 1, shows that &amp;quot;-he&amp;quot; can replace the major case postposition &amp;quot;-ni&amp;quot; and be assigned the thematic role GOAl. Again, the following ungrammatical example e.g. 3-2, which violates the Unique Case Principle, shows that &amp;quot;-ni&amp;quot; and &amp;quot;-he&amp;quot; for the verb &amp;quot;ageru&amp;quot; have to share the same slot in the subcategorization frame.</Paragraph>
      <Paragraph position="8"> &amp;quot;X gave Y to Z.&amp;quot; For a simple Japanese analyzer that tries to fill as many slots as possible for a verb, the Unique Case Principle is virtually embedded in the subcategorization frame of our architecture for the computational lexicon.</Paragraph>
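How slot sharing enforces the Unique Case Principle can be illustrated with a short Python sketch. The marker-to-slot table covers only the postpositions mentioned in the text, and the names `MARKER_TO_SLOT` and `fill_slots` are hypothetical.

```python
# Alternative postpositions share a slot with the major postposition.
MARKER_TO_SLOT = {"ga": "NOM", "deha": "NOM", "wo": "ACC", "ni": "DAT", "he": "DAT"}


def fill_slots(case_elements):
    """Fill slots from (argument, marker) pairs; return None on a clash."""
    filled = {}
    for argument, marker in case_elements:
        slot = MARKER_TO_SLOT[marker]
        if slot in filled:
            return None  # a second filler for the same slot is rejected
        filled[slot] = argument
    return filled
```

Because "-ni" and "-he" map to the same DAT slot, a clause that marks one element with "-ni" and another with "-he" cannot be analyzed against this frame, which is the effect the principle is meant to have.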
      <Paragraph position="9"> The slot name NOM2 represents the typical case of two-term adjectival predicates 6 that require two nominative cases.</Paragraph>
      <Paragraph position="11"> 3-2. Generating Permutational Subcategorization Frame Triggered by AUX Verbs We have generalized the notion of voice conversion for Japanese auxiliary verbs and equivalents by abstracting 14 codes of case frame permutation.</Paragraph>
      <Paragraph position="12"> These codes, the contents of which are elaborated in Fig. 2 and Fig. 3, are assigned to the extended category of auxiliary verbs &amp;quot;dekiru (can),&amp;quot; &amp;quot;rareru (be pp./can),&amp;quot; &amp;quot;kotoga-dekiru (be able to),&amp;quot; &amp;quot;tai (want to),&amp;quot; &amp;quot;seru (make/let),&amp;quot; &amp;quot;garu (feel complement)&amp;quot;. Below is the description of the procedure by which the Japanese analyzer performs the permutation of the verb subcategorization frame.</Paragraph>
      <Paragraph position="13"> When the morphological analyzer detects an auxiliary verb or an equivalent while checking the information contained in the predicate phrase, the analyzer develops the verb subcategorization frame from the code in the verb's lexicon and reads from the auxiliary verbs' lexicon what we call the Surface Case Permutation Frame code (SCPF code). The analyzer generates the subcategorization frame for the entire predicate by applying the permutation commands developed from the SCPF code for one auxiliary verb at a time. The first permutation is performed for the first auxiliary verb next to the main verb, and the focus moves on from the main verb to the first auxiliary verb. The second permutation is performed for the second auxiliary verb next to the first auxiliary verb, and the focus moves on to the second auxiliary verb. And so on: the N-th permutation is performed for the N-th auxiliary verb next to the (N-1)-th auxiliary verb, and the focus moves on from the (N-1)-th auxiliary verb to the N-th auxiliary verb. The maximum value of N is set to three in our MT system, reflecting the number of auxiliary verbs found in real utterances and written sentences.</Paragraph>
      <Paragraph position="14"> 5 The Unique Case Principle in Case Grammar and empirical studies is formulated and explained by the Lexicalist Hypothesis about thematic roles and the X-bar Theory in the school of Universal Grammar (Chomsky88). 6 There is only one verb, &amp;quot;komaru (be in trouble)&amp;quot;, the active voice of which shows two nominative cases (&amp;quot;GA&amp;quot;s).</Paragraph>
      <Paragraph position="15"> e.g.3-3 a. X-ga Y-wo taberu.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="642" end_page="643" type="metho">
    <SectionTitle>
X-NOM Y-ACC eat.
</SectionTitle>
    <Paragraph position="0"> &amp;quot;X was made to eat Y by Z&amp;quot; A correct process would generate the subcategorization frames represented in the example sentences from e.g. 3-3a via e.g. 3-3b to e.g. 3-3c, where all case elements X, Y and Z are consistent across these three sentences. The SCPF code of the causative auxiliary verb &amp;quot;saseru&amp;quot; is ambiguous between two sets of permutation commands, as is shown in Fig. 2.</Paragraph>
    <Paragraph position="1">  The set of original postpositions is described in the left term (in the source direction of an arc) of the permutation commands. A 'NULL' term matches unconditionally and adds an extra case slot with a new deep case, described within the brackets in the right term ([CAUser]). If all the left-term conditions for matching the case markers are met, the permutation frame is valid, and the number of subcategorization frames for a predicate sometimes increases. In the above example, however, the permutation commands of Causative B result in two identical case markers 'WO', violating the Unique Surface Case Principle as is shown in e.g. 3-4, so the permutation is blocked.</Paragraph>
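The matching and blocking behavior described above can be sketched in Python. The two command sets are hypothetical reconstructions from the prose (Fig. 2 is not reproduced here), and the names `CAUSATIVE_A`, `CAUSATIVE_B` and `apply_commands` are assumptions.

```python
# Each command is (source marker, target marker, new deep case).
# A source of None stands for the 'NULL' term, which always matches.
CAUSATIVE_A = [("ga", "ni", None), (None, "ga", "CAUser")]
CAUSATIVE_B = [("ga", "wo", None), (None, "ga", "CAUser")]


def apply_commands(frame, commands):
    """Return the permuted frame, or None if the permutation is blocked.

    `frame` is a list of (marker, deep case) pairs.
    """
    markers = [marker for marker, _ in frame]
    # Every non-NULL left term must match an existing marker.
    if any(src is not None and src not in markers for src, _, _ in commands):
        return None
    rename = {src: dst for src, dst, _ in commands if src is not None}
    result = [(rename.get(marker, marker), role) for marker, role in frame]
    result += [(dst, role) for src, dst, role in commands if src is None]
    new_markers = [marker for marker, _ in result]
    if len(new_markers) != len(set(new_markers)):
        return None  # two identical surface markers: blocked
    return result
```

For a transitive frame that already contains "wo", the second command set produces a duplicate "wo" and is rejected, while for an intransitive frame both sets succeed, so the number of candidate frames grows.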
    <Paragraph position="2">  The analyzer follows the permutation procedure described above for the second auxiliary verb. All the permutation commands in Fig. 3 can actually match the original surface case frame of e.g. 3-3b. It is a set of independent semantic heuristic rules that drops the 'Autonomous' reading of &amp;quot;rareru&amp;quot; and almost drops the 'Honorific' reading of &amp;quot;rareru&amp;quot;. All the other subcategorization frames for 'Direct Passive,' 'Indirect Passive,' 'Dative Passive,' and 'Possibility' are generated with slightly different variations of alternative surface case markers. The sentence in e.g. 3-3c can mean any of these but 'Indirect Passive,' the whole sentence of which is shown in e.g. 3-3d.</Paragraph>
    <Paragraph position="4"> &amp;quot;E experienced that X was made to eat Y by Z&amp;quot; It is still grammatical, but its meaning is much more difficult to grasp because it has four arguments for a single verb. This factor alone can be used by the analyzer to restrain the application of the generated subcategorization frame for the 'Indirect Passive' interpretation.</Paragraph>
    <Paragraph position="5"> 3-3. Two Cases of Multiple Deep Cases in a Single Slot There are two kinds of description by which multiple deep cases are described in a deep case slot of a subcategorization frame (Fig. 1). One is selectional, and the other is overlapping. The selectional kind uses alternative deep cases; it keeps the lexicon description economical and manageable.</Paragraph>
    <Paragraph position="6">  a. &amp;quot;The typhoon [REAson] has broken apart the city block.&amp;quot; b. &amp;quot;A monster [AGenT] has broken apart the city block.&amp;quot;  In these examples, not the verb &amp;quot;break&amp;quot; but the semantics of the subject decides which deep case the subject should be allocated. So, instead of assigning only one deep case to the deep case slot and creating a bunch of whole subcategorization frames, we introduced ambiguity markers such as AIRM (AGenT/INStrument/REAson/MEAns) to be assigned to the case slot. In this case, the analyzer does not have to decide the deep case until it becomes necessary, at whatever point in the phases of MT. 7</Paragraph>
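Deferred resolution of an ambiguity marker can be illustrated with the following Python sketch. The expansion of AIRM follows the text; the feature names in `FEATURE_TO_DEEP_CASE` and the function name `resolve` are hypothetical.

```python
AMBIGUITY_MARKERS = {"AIRM": ("AGenT", "INStrument", "REAson", "MEAns")}

# Hypothetical mapping from a semantic feature of the filler to a deep case.
FEATURE_TO_DEEP_CASE = {
    "animate": "AGenT",
    "tool": "INStrument",
    "natural_event": "REAson",
    "method": "MEAns",
}


def resolve(slot_value, filler_feature):
    """Resolve an ambiguity marker only when a decision is requested."""
    candidates = AMBIGUITY_MARKERS.get(slot_value)
    if candidates is None:
        return slot_value  # already a single deep case
    choice = FEATURE_TO_DEEP_CASE.get(filler_feature)
    return choice if choice in candidates else slot_value  # stay ambiguous
```

A single frame carrying AIRM in its subject slot thus covers both the typhoon and the monster example, and the marker can be left unresolved when no phase of the system needs the distinction.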
    <Paragraph position="7"> The other kind of description is 'overlapping' of deep cases.</Paragraph>
    <Paragraph position="8"> Slot name: NOM | ACC | DAT | WITH. Deep case frame: AGenT &amp; [SOuRce] | PATient | GOAl | n/a. Fig. 4 Deep Case Overlap. Fig. 4 shows almost the same deep case frame as Fig. 1, which shows the subcategorization frame of the verb &amp;quot;ageru (give)&amp;quot;. The only difference is the deep case [SOuRce] added to the AGenT. Other kinds of transitive verbs such as &amp;quot;taberu (eat)&amp;quot; may not let [SOuRce] be added, because the AGenT there is not the SOuRce position of the PATient in the event/action. &amp;quot;Taberu (eat)&amp;quot; may let [GOAl] be added to the AGenT. This distinction may work in the later knowledge-based inference phases of the MT system.</Paragraph>
    <Section position="1" start_page="643" end_page="643" type="sub_section">
      <SectionTitle>
3-4. Lexicon Structure in Relation to Word Senses
</SectionTitle>
      <Paragraph position="0"> There are cases in which one word sense corresponds to multiple subcategorization frames, cases in which each word sense corresponds to exactly one subcategorization frame, and cases in which multiple word senses correspond to a smaller number of subcategorization frames. Since our approach is rather empirical and benefits from any guidelines that help keep the lexicon uniform in quality, we take advantage of other literature that aims at an exhaustive listing of interesting cases. The example sentences in e.g. 3-6 are cited from Fillmore68, and those in e.g. 3-7 from Levin93.</Paragraph>
      <Paragraph position="1">  e.g.3-6 a. John opened the door with the key.</Paragraph>
      <Paragraph position="2"> b. The key opened the door.</Paragraph>
      <Paragraph position="3"> c. The door opened.</Paragraph>
      <Paragraph position="4"> d. John ate the meal. e. John ate.</Paragraph>
      <Paragraph position="5"> e.g.3-7 a. John pounded the metal \[flat\].</Paragraph>
      <Paragraph position="6"> b. Metal pounded flat.</Paragraph>
      <Paragraph position="7"> c. * Metal pounded.</Paragraph>
      <Paragraph position="8"> 7 The decision point could be even delayed into the  generation phase of MT.</Paragraph>
      <Paragraph position="9"> As is briefly mentioned in the previous sections, an entry in our lexicon is composed of three blocks: M (Morphology)-Block, S (Syntax)-Block and C (Concept)-Block. The M-Block contains the very surface information and can in general be linked to multiple S-Blocks. A whole subcategorization frame is described and stored in an S-Block, coupled with other corresponding syntactic features such as aspect features. A C-Block linked from one or more S-Blocks represents an independent word sense and, ideally, is also linked to by other S-Blocks that are linked to by other M-Blocks, in effect, by other words of the same or a different language.</Paragraph>
      <Paragraph position="10"> The basic principle of the lexicon requires a one-to-one correspondence between a subcategorization frame and an S-Block. So, each sentence in e.g. 3-6 or e.g. 3-7 corresponds to a different S-Block from the others (except for e.g. 3-7c, which does not exist). Any two S-Blocks can share the same word sense (C-Block) as long as the deep case (thematic role) frame is consistent. That is, all the case roles that appear in e.g. 3-6 a, b and c are assigned different deep cases from one another: {John = AGenT, door = PATient, key = INStrument}. So are all the case roles in e.g. 3-6 a and b, and all the case roles in e.g. 3-7. Thus, our lexicon system can guarantee that these subcategorization frames of intuitively the same word sense share the same C-Block.</Paragraph>
      <Paragraph position="11"> There are other cases in which the above criteria require separating an intuitively single word sense, as is shown in e.g. 3-8. e.g. 3-8 a. John smeared the window [PATient] with the paint. b. John smeared the paint [PATient] on the window. If a lexicographer is asked to fill in the deep cases as usual in the S-Blocks of e.g. 3-8 a and b, he or she will assign PATient to window in e.g. 3-8 a, and to paint in e.g. 3-8 b. This inconsistency in the case assignment does not allow the lexicon to allocate the same C-Block to both e.g. 3-8 a and b. NJ&amp;B94 gives a solution to this kind of case by introducing conceptual primitives deeper than our deep cases.</Paragraph>
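The consistency criterion for sharing a C-Block can be sketched in Python as follows. The class name `SBlock` and the function `consistent` are hypothetical, and the deep cases MATerial and GOAl given to the non-PATient arguments of e.g. 3-8 are assumptions made only to complete the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SBlock:
    """One subcategorization frame, reduced to its deep case assignments."""
    sentence: str
    roles: tuple  # tuple of (participant, deep case) pairs


def consistent(first, second):
    """True if every participant shared by two S-Blocks has one deep case."""
    first_roles = dict(first.roles)
    return all(
        first_roles.get(participant, deep_case) == deep_case
        for participant, deep_case in second.roles
    )


open_a = SBlock("John opened the door with the key.",
                (("John", "AGenT"), ("door", "PATient"), ("key", "INStrument")))
open_b = SBlock("The key opened the door.",
                (("key", "INStrument"), ("door", "PATient")))
smear_a = SBlock("John smeared the window with the paint.",
                 (("John", "AGenT"), ("window", "PATient"), ("paint", "MATerial")))
smear_b = SBlock("John smeared the paint on the window.",
                 (("John", "AGenT"), ("paint", "PATient"), ("window", "GOAl")))

print(consistent(open_a, open_b))    # True
print(consistent(smear_a, smear_b))  # False
```

Two S-Blocks that pass this check may point to the same C-Block; the pair from e.g. 3-8 fails it, so the two frames would need separate C-Blocks under the criterion described above.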
    </Section>
  </Section>
  <Section position="5" start_page="643" end_page="644" type="metho">
    <SectionTitle>
4. The Development Results of a Computational Japanese Lexicon for MT
</SectionTitle>
    <Paragraph position="0"> We have developed a computational Japanese lexicon with more than 80 thousand words, 30 thousand of which are verbs and their derivations. A key part of the development was to establish word senses by comparing synonymous vocabulary sets of Japanese and English [Nomura89]. Each verb subcategorization frame is coded in what we call an S-Block, which is placed between the M-Block, which contains the very surface-level information, and the C-Block, which is supposed to contain language-independent, purely conceptual information.</Paragraph>
    <Paragraph position="1"> Among the 34 deep cases we defined for the Interlingua, which is fewer than in previous work (e.g. Nagao80), 16 are currently used in the deep frame of our Japanese verb frame system (Fig. 5). PATient, EXPeriencer, AGenT, INStrument, MEAns, REAson, SOuRce, GOAl, LOCation, BENeficiary, TARget, PaRTicipant, CAPacity, FoCuS, MATerial, ELMenT. Fig. 5 The 16 Deep Cases Described in the Subcategorization Frames for Japanese Verbs and Predicative Adjectives. The coding of subcategorization frames for 30 thousand verbs has listed 18 case postpositions and standard word order used for verbs, and six used for predicative adjectives. These figures are much smaller than the number of simple combinations: 7^7 = 823,543. The number of voice conversion types that affect the surface case pattern was 14. The non-weighted mean number of case slots for each lexicon entry is 1.6; 30% of verbs are listed as taking multiple case patterns. This figure seems appropriate, considering the fact that most Japanese transitive verbs and intransitive verbs take separate word forms.</Paragraph>
    <Paragraph position="2"> The number of variations of subcategorization frames in the lexicon was about 250 for ordinary verbs and adjectives, and we have 150 more for idiomatic ones. These are the figures after disregarding, of course, the variation of selectional restrictions. The sum of these figures is also much smaller than the number of simple combinations: (16C7) * (7!) = 57,657,600.</Paragraph>
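The combinatorial figure can be checked with a few lines of Python. The function name `naive_frame_count` is illustrative; it counts the ways to choose 7 of the 16 deep cases and arrange them over the seven slots. The last line shows that 823,543, the figure quoted for the simple combinations of case postpositions, equals 7 raised to the 7th power.

```python
import math


def naive_frame_count(deep_cases, slots):
    """Ways to choose `slots` deep cases out of `deep_cases` and order them."""
    return math.comb(deep_cases, slots) * math.factorial(slots)


print(naive_frame_count(16, 7))  # 57657600
print(7 ** 7)                    # 823543
```

Against these upper bounds, the roughly 400 frames actually listed in the lexicon are a very small subset.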
    <Paragraph position="3"> Exhaustive listing of the 400 combinatorial subcategorization frames has contributed much to improving the accuracy of the contents of the lexicon. The lexicon specification by the proposed verb subcategorization codes and SCPF codes has improved uniformity in quality, as well as the speed of lexicon development.</Paragraph>
  </Section>
</Paper>