<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-1040">
  <Title>Noun Phrasal Entries in the EDR English Word Dictionary</Title>
  <Section position="3" start_page="0" end_page="257" type="metho">
    <SectionTitle>
2. The EDR Dictionaries
</SectionTitle>
    <Paragraph position="0"> The EDR Electronic Dictionary (EDR, 1993; Yokoi, 1990) is designed as the first true machine-readable dictionary that contains, in a readily accessible form, the information required for a computer to understand and generate natural language. As such, the EDR Electronic Dictionary is intended to be universally applicable and is not restricted to a particular application system. The part of the dictionary that handles surface information is kept separate from the section that handles semantic information: surface information that is heavily dependent upon a particular language is stored in the Word Dictionary, and semantic information is stored in the Concept Dictionary.</Paragraph>
    <Paragraph position="1"> There are four different dictionaries that comprise the EDR Electronic Dictionary: the Word Dictionary, the Concept Dictionary, the Co-Occurrence Dictionary and the Bilingual Dictionary. The different dictionaries that make up the EDR Electronic Dictionary and the EDR Corpus are set in a structure of mutual interrelatedness. Four types of constituent data are contained in the EDR electronic dictionaries: word entries, concept entries, co-occurrence entries and bilingual entries. Word entries consist of headwords, grammatical information that indicates the grammatical characteristics of the word, and concept identifiers that indicate the concepts represented by a given word in different contexts.</Paragraph>
    <Paragraph position="2"> Concept entries represent the relationship between two different concepts. Co-occurrence entries use co-occurrence relation labels to describe the possible co-occurrence relations between headwords. Bilingual entries describe the word correspondences between headwords in different languages. Thus, each of the EDR electronic dictionaries is related to the others, and by using the different component dictionaries as a single entity, they can usefully be applied to many forms of natural language processing.</Paragraph>
  </Section>
  <Section position="4" start_page="257" end_page="259" type="metho">
    <SectionTitle>
3. The EDR Word Dictionary
</SectionTitle>
    <Paragraph position="0"> The role of the Word Dictionary is to provide morphological, syntactic and some semantic information. The Word Dictionary is divided into a General Vocabulary Dictionary and a Technical Terminology Dictionary, and the former is further subdivided into a Japanese General Vocabulary Dictionary and an English General Vocabulary Dictionary, each of which contains 200,000 words. The vocabulary covers words, compounds, and idioms used in ordinary documents. The Technical Terminology Dictionary covers words or terms that are specific to information processing and related fields, and is also split into a Japanese Technical Terminology Dictionary and an English Technical Terminology Dictionary. Each contains 100,000 words.</Paragraph>
    <Paragraph position="1"> The main characteristics of the General Vocabulary Dictionary are: (1) surface level information and deep (semantic) level information are stored separately; (2) surface level information is described independent of any specific application system or algorithm; (3) a large-scale vocabulary contains lexical items used in general writing.</Paragraph>
    <Paragraph position="2"> The Word Dictionary is a collection of word entries that contain entry information as shown in Fig. 1.</Paragraph>
    <Paragraph position="3"> Fig. 1 Structure and Content of Word Entries. [...] notation, and pronunciation. A headword consists of notation (the orthographic spelling of a word, containing all the characters common to all inflected forms of the word) and adjacency attributes. For phrasal entries, the headword is a list of the pairs of notation and adjacency attributes of each constituent of the phrasal entry. The adjacency attributes indicate the possibility of joining one morpheme to another and are used to create adjacency rules for morphological analysis and generation. EDR employs a bidirectional connection grammar which divides the adjacency constraint attributes into possible connectivity to the left of the word and possible connectivity to the right of the word. This information is not normally described in this form for English, but EDR employs the same method in both Word Dictionaries so that morphological analysis of Japanese and English can be performed by the same algorithm. The extra notation information stores headwords in kana for entries in the Japanese Word Dictionary and in a character string form with syllable markers for hyphenation for entries in the English Word Dictionary.</Paragraph>
    <Paragraph position="4"> Grammatical information consists of part of speech, syntax tree, inflection information, grammatical attributes and function word information. The grammatical information can be used to find the syntactic structure of a sentence in syntactic analysis. A syntax tree is provided for compound words or idioms consisting of multiple words. The function word code corresponding to the notation of the headword is provided for function words.</Paragraph>
    <Paragraph position="5"> A concept (listed under semantic information), in addition to being a fundamental component of the Concept Dictionary, describes the semantic content of any word entry in the Word Dictionary. If the same headword has two or more different concepts, separate word entries are used in the Word Dictionaries. This information is used to distinguish between the various meanings a given word may have. The concept is the link between the Word Dictionary and the Concept Dictionary.</Paragraph>
    <Paragraph position="6"> Supplementary information provides information on the usage as well as the frequency of the headword entry.</Paragraph>
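    <Paragraph> The word entry described above can be pictured as a small record type. The following is only a minimal sketch: the class names, field names, adjacency values and the concept identifier are hypothetical placeholders chosen for illustration, not EDR's actual record layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Constituent:
    """One (notation, adjacency attributes) pair of a headword."""
    notation: str          # orthographic spelling
    left_adjacency: str    # connectivity to the left (placeholder value)
    right_adjacency: str   # connectivity to the right (placeholder value)


@dataclass
class WordEntry:
    """Hypothetical container for the information groups of a word entry."""
    headword: List[Constituent]     # one pair per constituent
    part_of_speech: str             # grammatical information
    concept_id: str                 # link into the Concept Dictionary
    syntax_tree: Optional[str] = None   # only for multi-word entries
    grammatical_attributes: List[str] = field(default_factory=list)
    usage: str = ""                 # supplementary information
    frequency: int = 0              # supplementary information

    def is_phrasal(self) -> bool:
        # A phrasal headword lists more than one constituent.
        return len(self.headword) > 1


# Illustrative entry; every attribute value here is made up.
post_office = WordEntry(
    headword=[
        Constituent("post", "L-placeholder", "R-placeholder"),
        Constituent("office", "L-placeholder", "R-placeholder"),
    ],
    part_of_speech="EN1",
    concept_id="concept-placeholder",
)
```

    Because a headword with several concepts is stored as several entries, each record in this sketch carries exactly one concept identifier.
    </Paragraph>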
    <Paragraph position="7"> 4. Noun Phrase Entries in the EDR English Word Dictionary
    Portions of the English Word Dictionary have been subjected to rigorous verification through projects at the Computing Research Laboratory (CRL) at New Mexico State University, and the University of Sheffield. The following is a report on one phase of the verification project at CRL, which was aimed at describing the grammatical information as well as verifying the morphological information. The objects of this phase of the verification project consisted of 37,039 entries initially coded by EDR as noun phrase expressions. Among these entries, 2,389 were treated as single word entries and 34,650 were treated as phrasal entries (see below for the distinction between single word and phrasal entries).</Paragraph>
    <Section position="1" start_page="258" end_page="258" type="sub_section">
      <SectionTitle>
4.1 Phrasal Entries vs. Single Word Entries in the EDR English Word Dictionary
</SectionTitle>
      <Paragraph position="0"> In the EDR English Word Dictionary, headwords are treated as either single word entries or phrasal entries: 'single word' refers to a word that is composed of a single lexical item, while 'phrasal' refers to a word that is composed of more than one lexical item.</Paragraph>
      <Paragraph position="1"> In addition to this distinction based on lexical units, some words are 'treated as' single word entries even though they are composed of more than one lexical item. The type of information that is provided for headwords varies according to the type of headword. Phrasal entries are given the same information given to single word entries, but they are also coded with additional information that indicates their internal syntactic structure. The adjacency attributes and the grammatical attributes are given to each of the constituents of the phrasal. Phrasal expressions treated as single word entries are not segmented into constituent words. Included under those words that are treated as single word entries are the following types of words: foreign words; proper nouns; common nouns derived from proper nouns (&quot;New Mexican&quot;); idiomatic expressions which do not fit into a generalized phrase structure pattern (&quot;on the cheap,&quot; &quot;open sesame&quot;); and function word equivalents.</Paragraph>
    </Section>
    <Section position="2" start_page="258" end_page="259" type="sub_section">
      <SectionTitle>
4.2 Information Provided for Noun Phrase Entries
</SectionTitle>
      <Paragraph position="0"> For noun phrase entries in the English Word Dictionary, information is provided for the phrase as a whole as well as for the individual constituents that comprise the phrase.</Paragraph>
      <Paragraph position="1"> Whole phrase information includes designation as either a common noun or proper noun (proper nouns are treated as single word entries and constituents are not separately analyzed), countability, collectivity, gender, verb agreement and article usage. In addition, the head noun is designated.</Paragraph>
      <Paragraph position="2"> The constituent information provided for phrasal entries includes left and right adjacency attributes, part of speech, inflection information, and grammatical attributes. The grammatical attributes that are provided for each of the constituents vary according to the part of speech of the constituent. Information regarding collectivity and countability is provided for nouns, and information on possible comparative, superlative, or positive degree forms is given for both adjectives and adverbs that appear as constituents of the phrase. Constituent information is provided within the context of the whole phrase. Syntax trees are also described for noun phrase entries.</Paragraph>
      <Paragraph position="3"> 4.3 Aim of Verification
      The primary objective of the verification project was to code the grammatical information for the noun phrase entries. The specific information coded for the noun phrase entries included the intra-phrasal structure of the phrasal, the grammatical attributes of the constituents, and the grammatical attributes of the entire noun phrase.</Paragraph>
      <Paragraph position="4"> 4.3.1.1 Syntactic Relationship Between Constituents
      The basic principle used in coding intra-phrasal syntactic information is that the information should clarify the syntactic structure of the phrasal entry. For example, the following phrases look similar on the surface, i.e. adjective + noun + noun, but actually the internal syntactic structure of each phrasal is different.</Paragraph>
      <Paragraph position="5"> (iii) traveling post office (iv) dead letter box
      The adjective &quot;traveling&quot; modifies the noun phrase &quot;post office&quot; in the phrase &quot;traveling post office&quot;, while the noun phrase &quot;dead letter&quot;, composed of an adjective and a noun, modifies the noun &quot;box&quot; in the phrase &quot;dead letter box.&quot; For building a source of lexical information to be used in language processing, indication of the head noun of a phrase is useful as hypernym information. Location of the head noun cannot be determined automatically from the phrase structure. That is to say, it is often the case that the head noun is the noun occurring in the final position of a phrasal composed of two (or more) lexical items, but this rule does not always apply, as there are also cases in which the head noun is the first noun of the phrasal, e.g., court martial.</Paragraph>
      <Paragraph position="6"> During the actual task, the syntactic relationships between constituents were distinguished by indicating the intra-phrasal syntax, parenthesizing the immediate constituents with categorical labels. The categorical labels used to mark the grouping of the phrasal are shown in the example below:</Paragraph>
      <Paragraph position="8"> In this syntactic notation a slash (/) divides constituents at the same level; EN1 is an English common noun, EAJ an English adjective, etc.; and the bracketing structure is a linearized tree in a standard form, e.g., in (iii) above the tree expands to the right, while in (iv) it expands to the left. The symbol &quot;@&quot; indicates the head noun.</Paragraph>
      <Paragraph position="9">  Once the intra-phrasal syntax structure of the phrasal has been determined, the inflection information and grammatical attributes of the constituents are determined. The grammatical attributes of the constituents are determined by considering the constituent as part of the phrase. Given the information of the constituent words coded as separate dictionary entries in the EDR English Word Dictionary, the coding is given based on the behavior of the constituent when it is used in the phrase.</Paragraph>
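      <Paragraph> The difference between the right-expanding and left-expanding trees can be made concrete with a small nested structure. This is a minimal sketch only: the original bracketed example is not reproduced in this text, so the output format of the hypothetical linearize function (parentheses, slashes between siblings, a trailing @ on the head) is an illustrative assumption rather than EDR's exact notation.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Leaf:
    label: str          # category label, e.g. "EAJ" or "EN1"
    word: str
    head: bool = False  # True for the head noun of the phrase


@dataclass
class Group:
    children: List[Union["Leaf", "Group"]]


def linearize(node) -> str:
    """Render a tree as a bracketed string of category labels."""
    if isinstance(node, Leaf):
        return node.label + ("@" if node.head else "")
    return "(" + "/".join(linearize(c) for c in node.children) + ")"


def head_word(node) -> Optional[str]:
    """Return the word marked as head, or None if no head is marked."""
    if isinstance(node, Leaf):
        return node.word if node.head else None
    for child in node.children:
        found = head_word(child)
        if found is not None:
            return found
    return None


# (iii) expands to the right: traveling + (post office)
traveling_post_office = Group([
    Leaf("EAJ", "traveling"),
    Group([Leaf("EN1", "post"), Leaf("EN1", "office", head=True)]),
])

# (iv) expands to the left: (dead letter) + box
dead_letter_box = Group([
    Group([Leaf("EAJ", "dead"), Leaf("EN1", "letter")]),
    Leaf("EN1", "box", head=True),
])

# Head in first position; the EAJ label for "martial" is an assumption.
court_martial = Group([
    Leaf("EN1", "court", head=True),
    Leaf("EAJ", "martial"),
])
```

      Marking the head explicitly on a leaf, instead of inferring it from position, reflects the point made above that the head noun cannot be located from the phrase structure alone.
      </Paragraph>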
    </Section>
  </Section>
  <Section position="5" start_page="259" end_page="260" type="metho">
    <SectionTitle>
traveling: EAPOS;EANOCMP;EANOSUP
-&gt; EAPOS;EANOCMP;EANOSUP
</SectionTitle>
    <Paragraph position="0"> post: ENSG;ECN1;ENC -&gt; ENSG;ENU
    office: ENSG;ECN1;ENC -&gt; ENSG;ECN1;ENC
    For &quot;post&quot;, the codes ECN1 (takes plural ending -s) and ENC (countable) are replaced by ENU (uncountable) to indicate that the word, when used in the context of the phrase, does not inflect.</Paragraph>
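    <Paragraph> The recoding of constituents in context can be sketched as a lookup followed by an override. The attribute codes below are the ones quoted in the example; the table name, the function name and the inflects flag are hypothetical, and the rule shown covers only the countable-noun case illustrated by &quot;post&quot;.

```python
# Stand-alone codings of the constituents, as quoted in the example.
STANDALONE = {
    "traveling": ["EAPOS", "EANOCMP", "EANOSUP"],
    "post": ["ENSG", "ECN1", "ENC"],
    "office": ["ENSG", "ECN1", "ENC"],
}


def code_in_phrase(word, inflects):
    """Return the attributes of a constituent as used inside a phrase.

    A countable noun that does not inflect within the phrase loses
    ECN1 (plural in -s) and ENC (countable) and is marked ENU instead.
    """
    attrs = STANDALONE[word]
    if "ENC" in attrs and not inflects:
        kept = [a for a in attrs if a not in ("ECN1", "ENC")]
        return kept + ["ENU"]
    return list(attrs)


# "traveling post office": only the head noun "office" inflects.
coded = {
    "traveling": code_in_phrase("traveling", inflects=False),
    "post": code_in_phrase("post", inflects=False),
    "office": code_in_phrase("office", inflects=True),
}
```

    In the verification work the judgment of whether a constituent inflects was made by the coders; here it is simply passed in as a flag.
    </Paragraph>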
    <Paragraph position="1"> 4.3.1.3 Grammatical Information for the Noun Phrase Unit
    The final process in the coding of the syntactic information for noun phrasals involves marking the grammatical attributes for the noun phrase as a whole. The grammatical attributes marked for the whole phrase include: part of speech, countability, collectivity, gender, verb agreement and article usage. The example given in the previous section, &quot;traveling post office&quot;, was coded as a common countable noun that may be preceded by both the definite and indefinite articles and is referred to by the pronoun 'it'. Since the phrase &quot;traveling post office&quot; does not have any special requirements on verb agreement (that is, when the noun is used in the singular form it is followed by a singular verb and, conversely, when it is used in the plural form it is followed by a plural verb), the verb agreement marking is left blank for the entry.</Paragraph>
    <Paragraph position="2">  Although decisions for the descriptions of the intra-phrasal syntax structure were based on initial coding phases of the EDR Word Dictionary development, verification and correction of that morphological information during the verification project was essential. The coding of the syntactic information for the phrasal may affect the morphological information for the entry, thus requiring the verification of morphological information as well. The decisions regarding the morphological information, including segmentation and part of speech of the constituents, could be made with more precision if the syntactic structure of the phrase was taken into consideration.</Paragraph>
    <Paragraph position="3"> The basic principles of headword determination and segmentation are as follows: (1) a headword unit should be determined on the basis of whether the phrasal expression comprises a single unit of meaning; (2) phrasal headwords should be segmented into those constituents which are also found as single headwords in EDR's English Word Dictionary.</Paragraph>
    <Paragraph position="4"> In view of the second basic principle, the part of speech of a phrasal constituent is decided according to the lexical part of speech consistent with the part of speech of the constituent as a single word entry in the dictionary. For example, nouns functioning as adjectives in phrases like &quot;corn stalk&quot; are coded as nouns, and verbs modifying nouns, as in the phrase &quot;jam session&quot;, are coded as verbs. The treatment of hyphenated words as single words or as words which should be broken down into separate constituents is a significant segmentation issue. Hyphenated words which are used on their own in Standard English should be treated as single constituents, and hyphenated words which are not used on their own should be broken down into separate constituents. In the examples below, the constituents are separated by the slash (/) notation.</Paragraph>
    <Paragraph position="5"> &quot;X-ray//spectroscopy&quot; &quot;deep-sea//angler&quot; &quot;directed/-/energy//weapon&quot; &quot;Bose/-/Einstein//statistics&quot;
    A decision on segmentation for some hyphenated words in phrasal entries is difficult to make purely by intuition from looking at the individual phrasal entries. These types of entries have to be looked at as a whole, with attention being given to wider usage, and in particular to consistency with other headwords in the dictionary. For example, a decision to correct &quot;yellow/-/green&quot; to &quot;yellow-green&quot; cannot be made purely by intuition. The decision here is more an issue of the selection of headwords than one of hyphenation. The verification task of morphological information also included raising possible additional headwords to be added to the EDR English Word Dictionary through the analysis of the entries. After a decision has been made regarding entering the hyphenated word of the phrasal as a headword in the dictionary, the phrasal is fed back to the segmentation process.</Paragraph>
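    <Paragraph> The hyphenation rule above can be sketched as a lookup against the set of known headwords. This is a minimal illustration under stated assumptions: the function name and the small headword set are hypothetical, the set stands in for the judgment of whether a hyphenated word is used on its own, and the output simply mimics the slash notation of the four examples.

```python
def segment(phrase, headwords):
    """Segment a phrasal headword into slash-separated constituents.

    A hyphenated word stays whole if it is itself a known headword;
    otherwise it is split at the hyphen, with the hyphen kept as a
    constituent of its own. Words are joined with a double slash.
    """
    pieces = []
    for word in phrase.split():
        if "-" in word and word not in headwords:
            pieces.append("/-/".join(word.split("-")))
        else:
            pieces.append(word)
    return "//".join(pieces)


# Hypothetical stand-in for single headwords already in the dictionary.
KNOWN_HEADWORDS = {"X-ray", "deep-sea"}

examples = [
    segment("X-ray spectroscopy", KNOWN_HEADWORDS),
    segment("deep-sea angler", KNOWN_HEADWORDS),
    segment("directed-energy weapon", KNOWN_HEADWORDS),
    segment("Bose-Einstein statistics", KNOWN_HEADWORDS),
]
```

    Adding a word such as &quot;yellow-green&quot; to the headword set and running the phrase through the function again corresponds to feeding the phrasal back to the segmentation process once the headword decision has been made.
    </Paragraph>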
    <Section position="1" start_page="259" end_page="260" type="sub_section">
      <SectionTitle>
4.4 Some Results of the Work
</SectionTitle>
      <Paragraph position="0"> The result of the coding shows that 98% of the 34,650 phrasal entries could be covered in approximately 40 different patterns. The following seven patterns are the most frequent and cover over 80% of the total entries.</Paragraph>
      <Paragraph position="1">  *EN1 denotes a common noun, EN2 denotes a proper noun, EEN:ENPOS a noun possessive ending ('s and '), EPR a preposition, and EPP a prepositional phrase. The location of the head of the phrasal is indicated by the @ notation.</Paragraph>
      <Paragraph position="2">  As mentioned earlier, one of the tasks of the coding was to indicate the grammatical attributes of the constituents. The data show that in the adjective + noun pattern (EAJ/EN1), the adjective constituent of the noun phrase did not inflect to form the superlative or comparative degree forms, but rather most often occurred in the positive degree form.</Paragraph>
      <Paragraph position="3"> The grammatical attributes for nouns other than the head noun also showed some interesting results. Nouns other than those designated as the head noun do not inflect in most of the cases. One of the exceptional cases is &quot;the time of w#one's life,&quot; where &quot;w#one's&quot; is a word class name for any noun in the possessive form. In this example, &quot;life&quot; inflects in accordance with the content of the &quot;w#one's&quot; word class, though it is not the head noun of the phrase. Since phrases like this are very rare, it is also possible to treat &quot;the time of w#one's life&quot; and &quot;the time of w#one's lives&quot; as individual headwords and not as the inflected forms of the same headword. Another exceptional case in which more than one constituent could inflect would be phrases containing the conjunction 'and.' However, most of the phrasal entries in the form of 'A and B' are uncountable, and the final noun inflects if the phrase is countable, such as &quot;gin and tonics.&quot; Therefore, we can assume that the grammatical behavior of constituents of noun phrase entries can be properly described by indicating the head noun and coding the inflection information and grammatical attributes of the head noun.
      4.4.3 Grammatical Attributes for the Noun Phrase Unit
      The coding of grammatical attributes for the entire noun phrase unit also provided some interesting results on countability and the usage of articles with the noun phrase. As expected, the most typical combinations of countability and article usage are the following: if the noun is countable, it may be preceded by the definite article or the indefinite article; if the noun is uncountable, it may be preceded by the definite article or no article.</Paragraph>
      <Paragraph position="4"> Approximately 10% of the nouns coded as countable showed a variation on the aforementioned pattern. These countable nouns were coded as allowing the definite article, the indefinite article, as well as no article. Nouns with this type of coding included mass nouns, names of plants and animals, metals, food, titles, etc., or other nouns which could refer to both the group and a member of the group.</Paragraph>
      <Paragraph position="5"> Examples of such nouns included &quot;Leconte's sparrow&quot;, &quot;Madagascar jasmine&quot;, &quot;assembler language&quot;, and &quot;atomic weight&quot;. Though this held for the majority of these types of nouns, it was not universally applicable; the use of no article with &quot;Nubian goat&quot;, &quot;Oregon grape&quot; and &quot;arctic loon&quot; is questionable.</Paragraph>
      <Paragraph position="6"> The significance of this data is that it implies that perhaps a new code is necessary to cover cases of countable (ENC) nouns becoming uncountable (ENU) nouns and vice versa.</Paragraph>
      <Paragraph position="7"> Instead of coding a single entry as both, or providing two entries which correspond to the ENC and ENU usage, we might better express the grammatical behaviors which are commonly shared by particular types of nouns by using a new code.</Paragraph>
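      <Paragraph> The typical countability and article combinations, and the variation found for some countable nouns, can be sketched as a simple comparison. The function names and the string labels for the articles are hypothetical; only the two typical combinations and the three-way variant come from the description above.

```python
def typical_articles(countable):
    """Article options in the typical pattern for a noun phrase."""
    if countable:
        return {"definite", "indefinite"}
    return {"definite", "none"}


def is_variation(countable, allowed):
    """True if the coded article usage departs from the typical pattern."""
    return set(allowed) != typical_articles(countable)


# A countable entry coded as also allowing no article, as reported
# for names of plants and animals such as "Leconte's sparrow".
sparrow_flagged = is_variation(True, ["definite", "indefinite", "none"])

# An entry following the typical countable pattern.
post_office_flagged = is_variation(True, ["definite", "indefinite"])
```

      A check of this kind only flags entries for review; whether a flagged coding is acceptable remains a linguistic judgment.
      </Paragraph>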
    </Section>
    <Section position="2" start_page="260" end_page="260" type="sub_section">
      <SectionTitle>
4.4.4 Verification of Morphological Information
</SectionTitle>
      <Paragraph position="0"> In the morphological data, some entries of the original data were segmented into constituents and some were not. This was particularly the case with '-ing' and '-ed' forms of words. The segmentation was not always consistent. But through syntactic analysis, verification of the segmentation and part of speech assignment could be carried out.</Paragraph>
      <Paragraph position="1"> The EDR English Word Dictionary does not contain gerunds or participle forms of a verb as separate headword entries (except for irregular inflected forms). If a word in the '-ing' form is regarded as a gerund or a present participle, it is to be segmented into a verb and a verb ending.</Paragraph>
      <Paragraph position="2"> There are some cases where gerund forms or participle forms have been accepted as lexical items and not as inflected forms of a verb. In such cases, they are identified not as verbs, but as nouns or adjectives.</Paragraph>
      <Paragraph position="3"> Noun phrases consisting of a word in the '-ing' form and another noun are treated by using one of the following four patterns, where EVE denotes an English verb and EEV a [...]  If a phrasal in the form of '-ing + noun' could be reworded as 'a noun that is v-ing' or 'a noun that v-s', the entry was coded using either pattern 2 or pattern 3.</Paragraph>
      <Paragraph position="4"> Through the verification of the morphological information we were able to gain more consistency in the segmentation of the constituents of phrasal headwords. Also we were able to indicate possible additional headword entries through the verification of the constituents that comprise the phrasal.</Paragraph>
    </Section>
  </Section>
</Paper>