<?xml version="1.0" standalone="yes"?>
<Paper uid="C80-1081">
  <Title>AN ATTEMPT TO COMPUTERIZED DICTIONARY DATA BASES</Title>
  <Section position="1" start_page="0" end_page="10516" type="metho">
    <SectionTitle>
AN ATTEMPT TO COMPUTERIZED DICTIONARY DATA BASES
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Summary
</SectionTitle>
      <Paragraph position="0"> Two dictionary data base systems developed at Kyoto University are presented in this paper.</Paragraph>
      <Paragraph position="1"> One is the system for a Japanese dictionary ( Shinmeikai Kokugojiten, published by Sansei-do) and the other is for an English-Japanese dictionary (New Concise English-Japanese Dictionary, also published by Sansei-do). Both are medium size dictionaries which contain about 60,000 lexical items. The topics discussed in this paper are divided into two sub-topics. The first topic is about data translation problem of large, unformatted linguistic data. Up to now, no serious attempts have been made to this problem, though several systems have been proposed to translate data in a certain format into another. A universal data translator/verifier, called DTV, has been developed and used for data translation of the two dictionaries. The detailed construction of DTV will be given. The other sub-topic is about the problem of data organization which is appropriate for dictionaries. It is emphasized that the distinction between 'external structures' and 'internal structures' is important in a dictionary system. Though the external structures can be easily managed by general DBMS's, the internal (or linguistic) structures cannot be well manipulated. Some additional, linguistic oriented operations should be incorprated in dictionary data base systems with universal DBMS operations. Some examples of applications of the dictionary systems will also be given.</Paragraph>
      <Paragraph position="2"> i. Introduction To computerize large ordinary dictionaries is significant from various reasons: i) Dictionaries are rich sources of reference in linguistic processings of words, phrases and text. Algorithms for natural language p~ocessing should be verified by a large corpus of text data, and therefore, dictionaries to be prepared should be large enough to cover large vocabulary.</Paragraph>
      <Paragraph position="3"> 2) Dictionaries themselves are rich sources, as linguistic corpora. A data base system, when a dictionary data is stored in it, enables us to examine the data by making cross references from various view points. This will lead us to new discoveries of linguistic facts which are almost impossible by the printed version.</Paragraph>
      <Paragraph position="4"> 3) Computerized dictionaries have various applications in such areas as language teaching by computer, machine aided human translation, automatic key word extraction etc.(3) We have been engaged in the construction of dictionary data base systems these three years, and have almost completed two such systems. One is the system for a Japanese dictionary (Shinmeikai Kokugojiten, Published by Sansei-do) and the other for an English-Japanese dictionary (New Concise English-Japanese Dictionary, also Published by Sansei-do). Both are medium size dictionaries which contain about 60,000 items.</Paragraph>
      <Paragraph position="5"> In Addition to these two dictionary systems, we are now developing a system for an English dictionary (Longman dictionary of Contemporary English, Published by Longman Publishing Company, England). (4) Two topics will be discussed in this paper. The first is about the problem of data translation, that is, how to obtain formatted data which are more suitable for computer processings than their printed versions. The second is the problem of data organization, that is, how to organize the formatted data into data base systems. We will also give some examples of applications of these systems.</Paragraph>
      <Paragraph position="6">  2. Data Translation from Printed Imag~ to Formatted Data  We decided to input the dectionary contents almost as they are printed, and to translate them into certain formatted structures by computer programs rather than by hand.</Paragraph>
      <Paragraph position="7"> Ordinary dictionaries usually contain varieties of information. The description in English-Japanese dictionary, for example, consists of  I. parts of speech 2. inflection forms 3. pronunciations 4. derivatives 5. compounds 6. translation equivalents in Japanese (Usually several equivalents exist and they correspond tO different aspects of meaning of the entry word) 7. idioms and their translations 8. typical usages and their translations 9. antonyms and synonyms  etc.</Paragraph>
      <Paragraph position="8"> An entry may have several different parts of speech (homograms) and to each part of speech, the other information 2-9 is described (even pronunciation may change depending on the parts of speech). 7, 8 and 9 are usually attached to one of the translation equivalents (see Fig. i). In such a way, the description for a dictionary entry has a certain structure, and the several parts of the dictionary descriptions are related to each other. In the printed dictionaries, these relationships are expressed implicitly in linearized forms. Various ingenious conventions  are used to distinguish the relationships , including several kinds of parentheses, specially designed symbols ( ~ , ~ , ~ etc.) and character types (italic, gothic etc.). However, in order to utilize these relationships by programs, we should recognize them in the printed versions, and reorganize them appropriately so that the programs can manage them effectively. Instead of special symbols or character types, we should use formatted records, links or pointers to express such relationships explicitly. We call this sort of translation from the printed versions to computer oriented formats as data translation. null &amp;quot;~ The printed version of a dictionary highly relies on human ability of intuitive understanding, and consists of many uncertain conventions. Gothic characters, for example, indicate that the phrases printed by them are idioms, and italic characters show that the entry words have foreign origins. In the input texts for computer, these different types of characters are indicated by a set of shift codes. Shift codes, together with various special symbols such as ( , ), \[ , \], ~ etc, give us useful clues for the data translation.</Paragraph>
      <Paragraph position="9"> However, these codes should be interpreted differently, when they are used in different parts of descriptions. &amp;quot; (&amp;quot; shows the beginning of the pronunciation when it appears just after a head word, and, on the other hand, when it is used in the midst of an idiomatic expression, it shows the beginning of an alternative expression of just the preceeding one. Such conventions, however, have many exceptions.</Paragraph>
      <Paragraph position="10"> Moreover, the fact that there may be errors in the input texts makes the translation process more difficult.</Paragraph>
      <Paragraph position="11"> If we use an ordinary programming language like PL/I, the program for this purpose becomes a collection of tricky mechanisms and hard to debug. Data translation of this kind is inevitable whenever we want to process unformatted linguistic data by computer. It would be very useful if we could develop a universal system for data translation (in fact, our system described below has been used not only for dictionary data translations but also for the translations of bibliographic data in ethnology at the National Museum of Ethnology).</Paragraph>
      <Paragraph position="12"> 2-1. Data Translator/Verifier -- DTV The data translation can be seen as atranslation from linearized character strings to certain organized structures. The relationships implicitly expressed in the linearized strings should be recovered and explicitly represented in the organized forms. It is basically a process of parsing. It has many similarities with parsing of sentences in artificial or natural languages. It has more similarities with natural language parsings in the sense that both are defined by many uncertain rules. Therefore, it is reasonable to expect that we can apply the same techniques to this problem that have beenproven useful in natural language parsings. Several propos'als have been made to define data syntax by using ordinary BNF notations (or CFG)(~'~)How ever, we adopted here the framework of ATN (Augmented Transition Network) instead of CFG by the following reasons: (i) CFG is essentially a model of a recognizer. Although it is possible to check the syntactic correctness of input texts by CFG rules, we need another component that transduces the parsed trees into formatted records we want.</Paragraph>
      <Paragraph position="13"> ATN gives us an adequate model of a data transducer. It has provisions for setting up intermediate translation results in registers (registers in ATN are called 'buffers' in our system) and building them up into a single structure (called BUILDQ operation in ATN).</Paragraph>
      <Paragraph position="14"> (2) CFG provides an adequate framework for managing recursive structures such as embedded sentences in natural languages. Though recursive structures are also found in dictionary data, they are not so conspicuous. The structures in dictionaries are rather flat. In this sense, CFG is too powerful to define data syntax of dictionaries.</Paragraph>
      <Paragraph position="15"> (3) ATN provides a more procedural framework than CFG. Because a CFG based system assumes a general algorithm that applies the rules to the input text, the user who defines the rules cannot control the algorithm. This is a fatal disadvantage of CFG, when the input text contains many input errors. Whenever an input error is encountered during the translation process, a CFG system fails to produce a parsed tree. The system or the human user should trace back the whole process to find the input error which causes the failure. It would be a formidable task.</Paragraph>
      <Paragraph position="16"> 2-2. Definition of Rules ~ Codes~ Buffers and</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Files
</SectionTitle>
      <Paragraph position="0"> Based on ATN model, we modified it for data translation. In this section, we will explain the detailed syntax for the DTV (the formal definition of the DVT syntax is given in (8).</Paragraph>
      <Paragraph position="1"> (A) Definition of Codes In the case of syntactic analyses of natural language sentences, the basic units are parts  of speech of individual words or individual words themselves. Special checking functions such as CAT and WORD are prepared in the original ATN model. On the other hand, in the case of data translation, the basic units are individual characters.</Paragraph>
      <Paragraph position="2"> A restricted set of characters such as the character set defined by ISO or ASCII are used and sufficient for ordinary computer applications. However, when we want to process real documents or linguistic data like dictionaries, we need much richer set of characters. Though, in principle, a single kind of parenthesis is sufficient for expressing tree-like structures, several different sorts of parentheses suqh as ( , \[ , ~ , { , ~ etc. are used to identify different parts of descriptions in the published dictionaries. We also found out that a certain set of characters, for example, phonetic symbols, appear only in a certain specific position (the pronunciation in the case of phonetic symbols) in the dictionary descriptions. If we could recognize the scope of the pronunciatioL~ parts, we would not need to have extra sets of character codes for phonetic symbols. We could interpret ordinary ASCII codes in the pronunciation part not as usual alpha-numeric characters but as phonetic symbols, according to certain pre-defined rules.</Paragraph>
      <Paragraph position="3"> However, these redundancies of descriptions are especially useful for detecting input errors.</Paragraph>
      <Paragraph position="4"> Whenever we find out the codes for phonetic symbols in the positions other than the pronunciation fields, or inversely, when we encounter, in the pronunciation fields, the codes for the characters other than phonetic symbols, something wron~ would be in the input texts.</Paragraph>
      <Paragraph position="5"> Because we have about i0,000 or more different Kanji-(Chinese) characters in Japanese, a standard code system such as ISO, ASCII etc. is no more adequate, and therefore, a special code system has been determined as JIS (Japanese Industrial Standard). The code system assigns a 2 byte code to each character. We have 752 extra codes which are not pre-defined by JIS and to which the user can assign arbitrary characters. Various types of parentheses, shift code, phonetic symbols etc. have been defined by using these extra codes. Because each character, including alpha-numeric, Kanji, specifically designed symbols, shift codes etc.,correspond to a 2 byte code, we can assign a decimal</Paragraph>
      <Paragraph position="7"> Note : The lower case alphabet characters are defined as the decimal numbers between 9057 and 9087. The alphabet characters are defined as the union of ALPHA-SMALL and ALPHA-LARGE.</Paragraph>
      <Paragraph position="8"> Fig. 2 Code Definition by Decimal Numbers number to each character by interpreting the 2 byte code as an integer representation. By using this decimal number notation, we can define arbitrary subsets of characters as shown in Fig.</Paragraph>
      <Paragraph position="9"> 2. These subsets of characters play the same role for the data translation as the syntactic categories for sentence analysis. Notice that a character is allowed to belong to more than one character set.</Paragraph>
      <Paragraph position="10"> (B) Definition of Rules A rule of DTV is defined by a triplet as (condition action next-state).</Paragraph>
      <Paragraph position="11"> The condition-part is specified by using the code sets defined in (A), Two forms of specifications are possible.</Paragraph>
      <Paragraph position="12"> i. &lt; subset-l, ..., subset-n &gt; 2. ( subset-l, .... subset-n ) The first notation means that the characters in the specified subsets should appear in this order. The second is the notation for specifying OR-conditions, that is, a character in one of the specified subsets should appear. Arbitrary combinations of these two bracketing notations are allowed such as &lt;( &lt; &gt; ) ( )&gt;.</Paragraph>
      <Paragraph position="13"> The action parts, which will be carried out when the condition parts are satisfied, are described by using a set of built-in functions. These are the functions for manipulating buffers and files. Some examples of such built-in functions are shown in Table i.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Function Argument Result
</SectionTitle>
      <Paragraph position="0"> WRITE *\[-number\] the currently scanned char-BUF(Buf-name) acter(or the 'number' preceding chracter) iswitten in  the buffer.</Paragraph>
      <Paragraph position="1"> RECNO the ID number of the current BUF(Buf-name) input record is written in the buffer.</Paragraph>
      <Paragraph position="2"> PTR the position of the scanned BUF(Buf-name) character in the input record is written in the buffer.</Paragraph>
      <Paragraph position="3"> 'arbitrary char- the specified character acter string' string is written in the BUF(Buf-name) buffer.</Paragraph>
      <Paragraph position="4"> BUF(Buf-name) the content of the buffer is FILE(File-name) written out to the external file.</Paragraph>
      <Paragraph position="5"> MERGE BUF(Buf-nam~l,.. the contents of the n buffers ., Btlf-namen ) are merged into a single BUF(Buf-name) buffer specified by the second arguement.</Paragraph>
      <Paragraph position="6"> CLEAR CTR(Counter- the counter is cleared to 0 name) or BUF( or the buffer is cleared hy Buf-name) blank characters.</Paragraph>
      <Paragraph position="7"> ADD CTR(Counter- the counter is counted up by name) the number.</Paragraph>
      <Paragraph position="8">  Several actions can be specified and they will be executed in sequence.</Paragraph>
      <Paragraph position="9"> The next-state specifies the state to which the control is to be transferred after the current rule is applied. A typical state-diagram is shown in Fig. 3.</Paragraph>
      <Paragraph position="10">  One of the typical input errors is the omissions of delimiters which cause serious problems in data translation. Various characters play the roles of delimiters. They are shift codes, several sorts of parentheses, etc. and they are used in pairs (right vs. left parentheses, shiftin vs. shift-out etc.) to delimit scopes of specific data fields. When one of the pair is missing, two situations would occur: the buffers corresponding to the fields may overflow or illegal characters for the fields may be scanned. The latter case can be easily detected because no transition rules are defined for that character. DTV put a message to the error message file which tells at which position the illegal character is found. The former case is rather troublesome. Checking overflow conditions by rules makes the whole definition Very clumsy. We can specify in the definition of a buffer, to which state the control makes a transition if the buffer overflows. In that state, some error messages are printed out.</Paragraph>
      <Paragraph position="11"> 2-3. System Configuration Fig. 4 shows the overall construction of DTV. By the compiling component, the definitions of  codes, buffers, files, formats of input and Output, and translation rules are compiled into several internal tables. Based on these tables, the executer scans the input characters one by one from the input file and applies the rules. During the execution, the system will report various error messages such as 'buffer overflow', 'illegal characters' etc. into the error message files. Because the detailed information, such as the position of the error in the input text, is associated with these messages, human proofreaders can easily recognize the input errors. A flexible editor has been developed for correcting input errors. Because this editor has a special command to call DTV, the reviser can check the data syntax immediately after the correction (see Pig. 5).</Paragraph>
      <Paragraph position="13"> sponding error message. The human proofreader can easily recognize the input error and revise it. After the revision, he/she can check whether the input record contain no more errors, by calling the DTV.</Paragraph>
      <Paragraph position="14"> Fig. 5 Data Editor Accompanied by DTV 2-4. Experience with DTV We used DTV for data translation of the English-Japanese dictionary. About 500 rules and 150 states were necessary to manage exceptional description format of the dictionary. Because DTV should scan and check every input character  --537-and because the dictionary consists of 6,500,000 characters, the whole process was very time consuming (it took about 130 min. for translating the whole dictionary by FACOM M200 at Kyoto University Computing Center).</Paragraph>
      <Paragraph position="15"> In order to show the effectiveness of DTV, Table 2 is prepared, which shows the input errors detected in the initial input. Some of them can be corrected automatically only by augmenting DTV rules. Moreover, the data editor accompanied by DTV was so effective that all of the detected errors were completely removed by 3 man-month efforts. However, DTV can check mainly the consistency of delimiting characters. There still remain a lot of input errors in the text such as errors in spellings of words. The detection of such input errors requires certain semantic knowledge and is hardly done by DTV rules. Human proofreader should do it. Human proofreaders can easily recognize these errors, but tend to overlock the errors such as omissions of delimiting characters. Certain effective cooperations between man and machine seem to be inevitable in correcting errors in a large amount of linguistic corpora like dictionaries.</Paragraph>
      <Paragraph position="16"> Another point to discuss is the relations between DTV and data entry systems. Though our attempt here is highly batch-oriented, some considerations about intractive data entry systems will be necessary in future to augment the dictionary data in evolutional ways. An ordinary data entry system usually guides the</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="10516" type="sub_section">
      <SectionTitle>
Explanations Frequencies
</SectionTitle>
      <Paragraph position="0"> Shift-out codes (the code for 792 normal characters) are often missin$.</Paragraph>
      <Paragraph position="1"> The phonetic symbol '~' is, 5434 for example, often confused with the number character 3.</Paragraph>
      <Paragraph position="2"> Certain functional character 1166 sequences can express a same thing. It is impossible to standardize them in the case that several key punchers work in parallel.</Paragraph>
      <Paragraph position="3"> The description formats for 550 acronyms, for example, are quite different from those of ordinary words.</Paragraph>
      <Paragraph position="4"> Though the key punchers consented to several standardization rules for input, some of them misunderstood them.</Paragraph>
      <Paragraph position="5">  Note i: ~ shows that the errors of that type can be automatically corrected only by augmenting DTV rules. * ~ shows that some of them can be corrected automatically by augmenting DTV rules. Note 2: The exceptional format errors are not input errors in a true sense.</Paragraph>
      <Paragraph position="6"> Table 2 Error Frequencies in the Initial Data user as to what he should input next, by printing prompting messages such as 'input the next word', 'input the part of speech of the word' etc. However, in the case that the input data have rich varieties in their description formats like the dictionary here, such a system becomes infeasible. Though some guidance by the entry system would be necessary, it is natural for the user to input data in arbitrary fashion. The data entry system should have the abilities of translating the texts into certain formatted structure, and of checking the existence of input errors. Our data editor accompanied by the DTV is the first step toward developing such data entry systems.</Paragraph>
    </Section>
  </Section>
  <Section position="2" start_page="10516" end_page="10516" type="metho">
    <SectionTitle>
3. Data Base Systems for Dictionaries
</SectionTitle>
    <Paragraph position="0"> A dictionary description has a certain hierarchical structure such as previously shown in Fig. 1. Such a structure can be well represented by a framework provided by ordinary DBMS's, because it is just a simple tree structure.</Paragraph>
    <Paragraph position="1"> However, the primitive data (or records) from which the whole structure is built have certain internal structures of their own. For example, idioms or typical usages in English-Japanese dictionary are the primitive records which are located at certain fixed positions in the whole structure and related to the other records such as translation equivalents, head words etc.</Paragraph>
    <Paragraph position="2"> They ean be accessed as basic units through usual DBMS operations. At the same time, they are composite expressions which consist of several component words. These component words are related to each other inside the idioms. We call such structures inside the primitive records ' internal structures'. (See Fig. 6) In other words, the primitive records in a dictionary data base system are not primitive in a  usual sense of DBMS. Though the external structures among primitive records can be managed by an ordinary DBMS, the internal linguistic structure,in some sense, cannot be well manipulated. Moreover, what we want to do on the dictionary data base systems is not only concerned with external structures, but also, in many cases, concerned with their internal, linguistic structures. Some additional operations should be incorporated with the usual DBMS operations for treating such intermixed structures.</Paragraph>
    <Paragraph position="3"> 3-1. Japanese Dicti0nary Data Base The first thing we have to do is to incorporate morpho/graphemic level of processings. Because Japanese has a very peculiar writing method, special techniques are required to utilize the dictionary. The main difficulty we encountered in developing the dictionary consultation system is from the fact that dictionary entries of Japanese usually have more than one spelling.</Paragraph>
    <Paragraph position="4"> They have basically two different forms of spellings, Kana-spellings(spellings by Japanese phonetic symbols) and Kanji-spellings (spellings by ideographs -- Chinese characters). Corresponding to these two spellings, we have two types of printed dictionaries, one for Kana and the other for Kanji spellings. However, in actual sentences, there often appear mixed forms of these two spellings. (See Fig. 7) Though these mixed Kanj i-Spellin~ Kana-Spc \] \] ing Mixed Spell ing Mean \[ng  forms are not entered in the ordinary, printed dictionaries of both types, human readers are intelligent enough for converting them into one of the two basic spellings. As for a computerized dictionary system, a certain graphemic level of processing is necessary for consulting the system from these mixed forms.</Paragraph>
    <Paragraph position="5"> In our system, the intermediate indexing structures are provided for both Kana and Kanji- null spellings (Fig. 8). The dotted line shows the access path for Kana-spellings and the bold line is for Kanji-spellings. The relationships among FCT, FFCT, SCT and IT are illustrated in Fig. 9, and the required memory spaces for these structures are given in Table 3.</Paragraph>
    <Paragraph position="7"> Note : Each record in IT(Item Table) contains a pointer to the meaning description of the word.</Paragraph>
    <Paragraph position="8"> IT, FFCT and SCT are blocked and stored in the secondary memory (disc file). Each block contains 50 records. A SCT record contains a set of Kanji-characters which follow the same(first)  Mixed spellings are normalized into one of these basic spellings. We can obtain Kana-spellings from mixed ones, by systematically changing the Kanji-characters in the mixed spellings into corresponding Kana-strings. However, because each Kanji-character corresponds to three or four (or more) different Kana-strings (each Kanji-character has several pronunciations ), the resultant Kana-strings are to be matched against the Kana-spellings in the dictionary. Some examples of retrieval results are shown in Fig. i0. Another problem is the incorporation of the morphological analysis component. Because the word inflection system for Japanese is much richer than English, the morphological analysSs component is indispensable for the Japanese dictionary system. The morphological analysis program developed for our another project, i.e., Machine Translation Project ( 7 ), has been incorporated into the system. The retrieval program has Japanese inflection rules in it and can convert inflectional variants to their  infinitive forms. The rules are almost perfect, and more than 98% inflectional variants can be converted correctly to their infinitive forms.</Paragraph>
    <Paragraph position="9"> 3-2. E_n.glish-Japanese Dictionary Data Base The morpho/graphemic processings which are required for utilizing this dictionary are much simpler than for the Japanese dictionary.</Paragraph>
    <Paragraph position="10"> Because most of derivatives are generally adopted in the dictionary as head words, the processings for derivational suffixes are not necessary. We can retrieve the corresponding lexical entries from their own spellings. When we want to see the original word from which the derivative is derived, it is required only to traverse the external structure (that is, a record for a derivative always contains a pointer to its original word). Therefore, the current system only recognizes the inflectional suffixes of English to convert inflected forms to their infinitive forms. As for the irregularyinflected words, all of their inflectional variants are extracted from the dictionary, and are stored in the inverted files (all of the head words are also stored in the inverted files). Some results of retrieval are shown in Fig. ii.</Paragraph>
    <Paragraph position="11"> As we described at the beginning of this section, some linguistic operations should be incorporated in a dictionary data base system besides usual operations provided by ordinary DBMS's.</Paragraph>
    <Paragraph position="12"> Morpho/graphemic processings are one of such operations. Another example is the retrieval of 'similar' expressions. English-Japanese  dictionary contains English idioms and their translations, and typical usages of each word and their translations. The effective utilization of these by computer is a very interesting topic, because this is one of the essential &amp;quot;reason-d'etres&amp;quot; of the dictionary. We have been developping some elementary programs to utilize the idioms in the dictionary. The system can retrieve idioms or usages which bear certain similarities to the input phrases. For example, when the user input a sentence such as 'He wore a long face', the system retrieves the idiom 'pull \[make, wear\] a long face' which have the highest similarity with the input. In this process, all of the words in the input are reduced to their infinitive forms (in this case, 'wore' is reduced to 'wear'), and all of the idioms and typical usages in the individual word entries are retrieved for the comparison with the input phrases. The comparison is currently performed as follows: i. Each word in the retrieved idioms and usages are reduced to their infinitive forms.</Paragraph>
    <Paragraph position="13"> 2. Literal string matching is performed. In the matching process, extra words in the input and retrieved idioms or usages are ignored. Only the oder of words is taken into consideration.</Paragraph>
    <Paragraph position="14"> 3. Similarity value is computed for each idioms and usages.</Paragraph>
    <Paragraph position="15"> The expressions with the highest value are printed out. In the current system, the similarity value is determined by a simple formula as \[the number of matched words\] / \[the number of words in idioms or typical usages\].</Paragraph>
    <Paragraph position="16"> Some results are shown in Fig. 12. We should develop more sophisticated method of computing the similarity value. Especially, information  about semantic relationships among words should be taken into consideration. Computerized thesauri will be useful. Certain words in idioms and usages play the role of variables. 'Oneself' in such a idiom as 'be a law unto oneself' is looked as a variable and should be able to be matched with 'myself', 'himself' etc. 'person' in 'take a person about a town' should be matched with any person such as John, he, and so on. In the latter case, 'a town' can also be replaced by many other words that have certain semantic features in common, for example, 'placehood' We are now designing such semantically guided pattern mat chings.</Paragraph>
    <Paragraph position="17"> file HEADWORD KEY HEADWORD IMPORT- pointer number POS-I -2 -3 -4 -5 spell TANCE to of</Paragraph>
  </Section>
  <Section position="3" start_page="10516" end_page="10516" type="metho">
    <SectionTitle>
COMPOUND POS' s
WORD
</SectionTitle>
    <Paragraph position="0"> 11T2 I 21 2 I 2 POS pointer pointer pointer to pointer pointer pointer code to to JAPANESE to to to</Paragraph>
  </Section>
class="xml-element"></Paper>