<?xml version="1.0" standalone="yes"?>
<Paper uid="C86-1105">
  <Title>AN ATTEMPT TO AUTOMATIC THESAURUS CONSTRUCTION FROM AN ORDINARY JAPANESE LANGUAGE DICTIONARY</Title>
  <Section position="1" start_page="0" end_page="446" type="metho">
    <SectionTitle>
AN ATTEMPT TO AUTOMATIC THESAURUS CONSTRUCTION FROM AN ORDINARY
JAPANESE LANGUAGE DICTIONARY
</SectionTitle>
    <Paragraph position="0"> How to obtain hierarchical relations (e.g., the superordinate-hyponym relation and the synonym relation) is one of the most important problems in thesaurus construction. A pilot system for extracting these relations automatically from an ordinary Japanese language dictionary (Shinmeikai Kokugojiten, published by Sanseido, in machine-readable form) is presented. The features of the definition sentences in the dictionary, the mechanical extraction of the hierarchical relations, and the evaluation of the results are discussed.</Paragraph>
    <Paragraph position="1"> 1. INTRODUCTION A practical-sized semantic dictionary (a thesaurus in the wide sense) is necessary for advanced natural language processing. We have been studying how to obtain semantic information for such a semantic dictionary from a Japanese language dictionary (Shinmeikai Kokugojiten, published by Sanseido, in machine-readable form)(1) containing about 60,000 entries.</Paragraph>
    <Paragraph position="2"> A dictionary contains the meanings and usages of a practical number of general words. In particular, definition sentences (abbreviated DS) are important sources of information about the meanings of general words. Generally, the DS of an entry word (abbreviated EW) defines it by qualifying its superordinate word, synonyms, or hyponyms.</Paragraph>
    <Paragraph position="3"> We call these words definition words (abbreviated DW).</Paragraph>
    <Paragraph position="4"> We have been developing a system that automatically extracts the DW related to an EW from its DS and decides the DW-EW relation (e). With this system, the (hierarchical) relations among the entry words in the dictionary are to be established.</Paragraph>
    <Paragraph position="5"> We constructed a sub-system for extracting the DSs corresponding to the part of speech, inflected form, and meaning (definition) number of each entry word (v).</Paragraph>
    <Paragraph position="6"> In this paper, the features of DSs in the Japanese dictionary, an outline of the pilot system, and the results of an experiment are discussed.</Paragraph>
    <Paragraph position="7">  Here the brackets ([...]), underlining, and parentheses ((...)) denote the EW, the DW, and an English translation of the preceding Japanese phrase, respectively.</Paragraph>
    <Paragraph position="8"> In (1), the final word is the DW, and the superordinate-hyponym relation (DW&gt;EW) holds between the DW and the EW.</Paragraph>
    <Paragraph position="9"> In (2), the DW is the final word in hook brackets (「...」) and DW&gt;EW holds. The expression &quot;「...」O)~6)~$1~{~&quot; is called a functional expression (abbreviated FE). The (compound) word &quot;~b'J~&quot; in the FE is called a functional word (abbreviated FW). In this case, the FW denotes a usage of the EW.</Paragraph>
    <Paragraph position="10"> In (3), the DW is just before the FE &quot;0)--~&quot; and DW&gt;EW holds. In this case, the FE prescribes DW&gt;EW explicitly. The word &quot;--N&quot; is the FW.</Paragraph>
    <Paragraph position="11"> In (4), the DW is just before the FE &quot;t:~b~C/~J~#&quot; and the synonymous relation (DW~EW) holds between the DW and the EW. The FW &quot;~&quot; denotes a usage of the EW.</Paragraph>
    <Paragraph position="12"> In (5), two DW&lt;EW relations hold, that is, the DW &quot;NZ'~/~/&quot; &lt; the EW &quot;$~&quot; and the DW &quot;~&quot; &lt; the EW &quot;$~&quot;. In this case, there is more than one DW, the DWs are not modified, and the FE is the word &quot;~*&quot;. The FW is identical to the FE. (Note: &quot;~'&quot; is a sub-postpositive signifying exemplification.) The features of DSs in the dictionary are as follows:  (a) Ordinarily, the final word in the DS is the DW.</Paragraph>
    <Paragraph position="13"> (b) In some cases, the final expression in the DS is an FE that assigns the semantic relation between the DW and the EW, and the DW is just before the FE.</Paragraph>
    <Paragraph position="14"> (c) Generally, the DW is modified by another phrase (a modifier).</Paragraph>
    <Paragraph position="15"> (d) In some cases, the DS contains more than one DW.</Paragraph>
    <Paragraph position="16"> The following general structure is obtained according to these features.</Paragraph>
    <Paragraph position="17"> ... ([MODIFIER] · DW)* · [FE]。 Notes: [...] : optional constituent; (...) : required constituent; * : sequence of coordinate constituents; · : concatenation symbol, which is different from the coordinate constituent. For convenience of explanation, the general structure is divided into the following two types.</Paragraph>
    <Paragraph position="18"> (I) TYPE I: ... ([MODIFIER] · DW)*。 (II) TYPE II: ... ([MODIFIER] · DW)* · FE。  In TYPE I, the final word is the DW. In TYPE II, the final expression is the FE, and the DW is just before the FE.</Paragraph>
    <Section position="1" start_page="445" end_page="446" type="sub_section">
      <SectionTitle>
2.2 DW-EW RELATION IN DS
</SectionTitle>
      <Paragraph position="0"> Based on the features above, we propose the following assumptions in order to extract the DW-EW relations from DSs of the general structure.</Paragraph>
      <Paragraph position="1">  (α) When the DS is in TYPE I, DS~EW, because the DS is a phrase (or a compound word) in the wide sense.</Paragraph>
      <Paragraph position="2"> (β) When the DS is in TYPE II, SS pFE EW.</Paragraph>
      <Paragraph position="3"> Here pFE is the binary relation assigned by the FE, and SS is the shortened DS corresponding to ([MODIFIER] · DW)*. (γ) [MODIFIER] · W &lt; W (δ) ([MODIFIER_i] · W_i)* &gt; [MODIFIER_j] · W_j  Here i, j = 1 to n, and W is an arbitrary word.</Paragraph>
      <Paragraph position="4"> The following general algorithm for deciding the DW-EW relations is obtained from these assumptions. (I) The DS is in TYPE I (the DS doesn't include an FE). (A) The DW is modified: (α) if there is only one DW, then DW&gt;EW; (β) if there is more than one DW, then CD. (B) The DW isn't modified: (α) if there is only one DW, then DW~EW; (β) if there is more than one DW, then DW&lt;EW. (II) The DS is in TYPE II (the DS includes an FE). (A) The DW is modified: (α) if there is only one DW and pFE is '&gt;' or '~', then DW&gt;EW, otherwise CD; (β) if there is more than one DW, then CD. (B) The DW isn't modified: (α) if there is only one DW, then DW pFE EW; (β) if there is more than one DW and pFE is '&lt;', then DW&lt;EW, otherwise CD.</Paragraph>
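      <Paragraph> The decision procedure above can be sketched as a small function. This is a minimal illustration in Python, not the authors' PL/I implementation: the function name, its parameters, and the string encodings of the relations are assumptions made for the example, and '~' stands here for the relation assigned to a single unmodified DW. The value CD marks a case that is not decided mechanically and is left for human review. <![CDATA[
```python
from typing import Optional

CD = "CD"  # relation not decided mechanically; needs human support


def decide_relation(num_dw: int, modified: bool, p_fe: Optional[str] = None) -> str:
    """Return the DW-EW relation symbol, or CD when it cannot be decided.

    num_dw   -- number of definition words (DW) found in the DS, at least 1
    modified -- True if the DW is qualified by a modifier
    p_fe     -- relation assigned by the functional expression (FE);
                None means the DS has no FE (TYPE I)
    """
    if num_dw < 1:
        raise ValueError("a DS needs at least one DW")
    single = num_dw == 1

    if p_fe is None:  # TYPE I: the DS ends with the DW
        if modified:
            return ">" if single else CD
        return "~" if single else "<"

    # TYPE II: the DS ends with an FE that assigns p_fe
    if modified:
        return ">" if single and p_fe in (">", "~") else CD
    if single:
        return p_fe
    return "<" if p_fe == "<" else CD
```
]]> For example, a TYPE I definition with one modified DW yields the superordinate relation, while a definition with several modified DWs yields CD in both types.</Paragraph>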
      <Paragraph position="5"> CD denotes that the DW-EW relation cannot be extracted mechanically from the DS. In this case, the extraction of the DW-EW relation needs human support at this stage. 2.3 FEATURES OF FE An FE prescribes a hierarchical relation (e.g., DW&gt;EW, DW&lt;EW, DW=EW, or DW~EW) or a whole-part relation. (E.g., in &quot;[間脳 (interbrain)]: ... 脳 (brain) の一部 (a part of)。&quot;, the FE &quot;の一部&quot; prescribes the whole-part relation explicitly.)  Besides these relations, another relation between the DW and the EW is prescribed by special FEs (e.g., &quot;~T (under)&quot;), which is called the associative relation (R).</Paragraph>
      <Paragraph position="6"> There are so many FEs that they are divided into four main patterns, called functional patterns (abbreviated FP). An FP is expressed as a regular expression. The FP is necessary for extracting the FE and the DW-EW relation information (i.e., the information necessary for deciding the DW-EW relation) assigned by the FE. The FP also designates the place of the DW in the DS. The main features of the FPs are as follows: (1) Type100: 「...DW」 · ~ · FW (2) Type200: ...DW · (~9 · FW)* (3) Type300: ...DW · P FW (4) Type400: ...DW · e~&quot;</Paragraph>

      <Paragraph position="8"> Here · is the concatenation symbol.</Paragraph>
      <Paragraph position="9"> We obtained about 170 FWs, which are classified into two groups. In one group (64 FWs), the FWs explicitly contain DW-EW relation information. In the other group (105 FWs), some of the FWs indicate usages of the EWs, which are also important for a thesaurus. We have constructed an FW dictionary that includes the FPs and the DW-EW relation information corresponding to each FP.</Paragraph>
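      <Paragraph> One way to picture such an FW dictionary is as a table that maps each FW to a regular expression and a relation. The sketch below is a hedged illustration in Python only: the two entries, their Japanese strings, the relation labels, and all names (FWEntry, FW_DICTIONARY, match_fe) are assumptions chosen for the example and are not entries from the authors' dictionary. <![CDATA[
```python
import re
from typing import NamedTuple, Optional, Tuple


class FWEntry(NamedTuple):
    pattern: re.Pattern  # functional pattern (FP) written as a regular expression
    relation: str        # DW-EW relation information assigned by the FE


# Hypothetical entries, for illustration only.
FW_DICTIONARY = {
    "一種": FWEntry(re.compile(r"(?P<body>.+)の一種。?$"), ">"),
    "一部": FWEntry(re.compile(r"(?P<body>.+)の一部。?$"), "part-of"),
}


def match_fe(ds: str) -> Optional[Tuple[str, str]]:
    """Return (text before the FE, relation) if the DS ends with a known FE."""
    for entry in FW_DICTIONARY.values():
        m = entry.pattern.match(ds)
        if m:
            return m.group("body"), entry.relation
    return None  # no FE found: the DS would be treated as TYPE I
```
]]> The text returned before the FE is where the DW would then be searched for, which reflects the remark that an FP also designates the place of the DW in the DS.</Paragraph>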
    </Section>
  </Section>
  <Section position="2" start_page="446" end_page="2318" type="metho">
    <SectionTitle>
3. SYSTEM FOR EXTRACTING DW-EW (HIERARCHICAL) RELATION
</SectionTitle>
    <Paragraph position="0"> The system consists of the following four steps.</Paragraph>
    <Paragraph position="1">  (1) Extraction of EW and DS (a) Extraction of the EW, its DS, the part of speech of the EW, and the definition number of the DS from the dictionary. (b) Transformation of the extracted DS into the form of an ordinary Japanese sentence (called the normalized DS). This is necessary because several contents (meanings) are packed into one DS by means of parentheses or dots in the dictionary. (2) Extraction of FE and DW-EW relation information The FW dictionary is used.</Paragraph>
    <Paragraph position="2">  (a) When the DS doesn't include an FW, the DS is in TYPE I. (b) When the DS includes an FW and conforms to an FP, the DS is in TYPE II. (c) When the DS includes an FW but doesn't conform to any FP, or when the DS includes more than one FW, the DS is picked out as check data, because it is difficult to distinguish between DW and FW or to extract the DW-EW relation information mechanically. (3) Extraction of DW and DW-EW relation information  A general word dictionary (containing about 75,000 nouns)(5) is used, in which the character strings of the entry words are arranged in inverse order (from right to left). DWs are basically extracted by the longest matching method, because there is ordinarily no space between two adjacent words in a Japanese sentence. In addition, there are the following problems.</Paragraph>
    <Paragraph position="3">  (a) The 'hiragana' notation is often used (e.g. ~O)~b [~b] ).</Paragraph>
    <Paragraph position="4"> (b) The names of animals and plants are written in 'katakana' (e.g. ~)P [~] ).</Paragraph>
    <Paragraph position="5"> (c) Unknown (compound) words are often used. (d) In some cases, the DS contains more than one DW.  The extraction procedure has to be constructed with these problems in mind.</Paragraph>
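    <Paragraph> The basic longest matching step with a right-to-left dictionary can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names and the sample words are hypothetical, the input is taken to be the part of the DS that ends with the DW (without the final period), and none of the special handling for hiragana, katakana, or unknown compounds listed above is included. <![CDATA[
```python
from typing import Iterable, Optional, Set


def build_reversed_index(words: Iterable[str]) -> Set[str]:
    """Store every dictionary word with its characters in inverse order."""
    return {w[::-1] for w in words}


def extract_final_dw(text: str, reversed_index: Set[str]) -> Optional[str]:
    """Return the longest dictionary word that ends the text, or None."""
    rev = text[::-1]  # read the sentence from right to left
    longest = max((len(w) for w in reversed_index), default=0)
    # Try the longest candidate first, then shorter ones.
    for length in range(min(longest, len(rev)), 0, -1):
        candidate = rev[:length]
        if candidate in reversed_index:
            return candidate[::-1]
    return None  # no DW found: the DS would go to check data
```
]]> With the hypothetical word list 犬, 番犬, 動物, the text 家を守る番犬 yields 番犬 rather than the shorter 犬, which is the behaviour that longest matching is meant to give.</Paragraph>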
    <Paragraph position="6"> The relation information is also extracted, that is, 'the DW isn't modified' and 'there is more than one DW'. When no DW is extracted from the DS (that is, the DW is neither a 'katakana' string, nor a 'kanji' string, nor any entry word in the word  dictionary), the DS is picked out as check data. (4) Decision of DW-EW relation  The DW-EW relations are decided according to the conditions mentioned above.</Paragraph>
    <Paragraph position="7"> When the extracted relation information is ambiguous, the DS is picked out as check data.</Paragraph>
    <Paragraph position="8"> 4. EXPERIMENTAL RESULTS A pilot system has been implemented on a FACOM M-360 (Nagasaki University Computer Center) and a FACOM M-382 (Kyushu University Computer Center), mostly in PL/I.</Paragraph>
    <Paragraph position="9"> The experimental input data for the first step (2,824 DSs) are normalized DSs.</Paragraph>
    <Paragraph position="10"> Table 1 shows the number of input, output, and check data in each step, and the number of correct and incorrect data in the output data.</Paragraph>
    <Paragraph position="11"> Table 2 shows the extracted DW-EW relations and the number of output data corresponding to each relation. The experimental results are as follows:  (1) The ratio of TYPE I (2,374) to output data (2,711) is about 87.6%.</Paragraph>
    <Paragraph position="12"> (2) The ratio of TYPE II (337) to output data (2,711) is about 12.4%.</Paragraph>
    <Paragraph position="13"> (3) The ratio of output data (2,434) to input data (2,824) is about 85%.</Paragraph>
    <Paragraph position="14"> (a) The ratio (called system precision) of correct output data (2,311) to output data (2,434) is about 95%. (b) The ratio (called error ratio) of incorrect output data (123) to output data (2,434) is about 5%.</Paragraph>
    <Paragraph position="15"> (4) The ratio of check data (390) to input data (2,824) is about 14%.</Paragraph>
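    <Paragraph> The ratios above can be recomputed directly from the reported counts. The short Python sketch below does only that arithmetic; the variable names and the helper pct are introduced for the example. Computed this way, ratio (3) comes out at 86.2%, so the stated figure of about 85% appears to be a looser rounding, while the other ratios agree with the text. <![CDATA[
```python
# Counts as reported in the experimental results.
INPUT_DS = 2824
TYPE_I, TYPE_II = 2374, 337
OUTPUT, CORRECT, INCORRECT, CHECK = 2434, 2311, 123, 390


def pct(part: int, whole: int) -> float:
    """Percentage of part in whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


typed_total = TYPE_I + TYPE_II  # DSs classified as TYPE I or TYPE II

ratios = {
    "TYPE I share": pct(TYPE_I, typed_total),
    "TYPE II share": pct(TYPE_II, typed_total),
    "output / input": pct(OUTPUT, INPUT_DS),
    "system precision": pct(CORRECT, OUTPUT),
    "error ratio": pct(INCORRECT, OUTPUT),
    "check data / input": pct(CHECK, INPUT_DS),
}

for name, value in ratios.items():
    print(f"{name}: {value}%")
```
]]> The counts are also internally consistent: output data plus check data equals the input data, and correct plus incorrect output equals the output data.</Paragraph>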
    <Paragraph position="16">  Most of the incorrect output data occur in the step that extracts DWs written in 'hiragana' notation, because of the limitations of the longest matching method. Improving the results requires (a) further analysis of the DSs and (b) reinforcement of the general word dictionary used for extracting the DWs.</Paragraph>
    <Paragraph position="17"> 5. CONCLUDING REMARKS (1) Similar research has been carried out on several English dictionaries (e.g., LONGMAN)(2)(~); however, there is scarcely any on Japanese dictionaries. (2) We have automatically extracted DW&lt;EW, DW~EW, and the whole-part relation, in addition to DW&gt;EW, as the DW-EW relations.</Paragraph>
    <Paragraph position="18">  (3) Input data that do not satisfy the conditions are picked out as check data in each step.</Paragraph>
    <Paragraph position="19"> (4) There is a shortage of semantic information (e.g.,  lack of an adequate DW) in the dictionary, because the dictionary assumes human users.</Paragraph>
    <Paragraph position="20"> We have been investigating the following.</Paragraph>
    <Paragraph position="21"> I. Development of a system for utilizing the dictionary (v). II. Development of a system for hierarchically structuring the entry words in the dictionary (~=).</Paragraph>
    <Paragraph position="22"> III. Development of a man-assisted system for constructing a practical-sized semantic dictionary (4).</Paragraph>
  </Section>
  <Section position="3" start_page="2318" end_page="2318" type="metho">
    <SectionTitle>
ACKNOWLEDGEMENT
</SectionTitle>
    <Paragraph position="0"> We would like to thank the members of Turumaru's laboratory at Nagasaki University, and in particular Mr. A. Uchida and Mr. K. Mizuno, for their implementation efforts.</Paragraph>
  </Section>
</Paper>