<?xml version="1.0" standalone="yes"?>
<Paper uid="C69-3601">
  <Title>THE LEXICON: A SYSTEM OF MATRICES OF LEXICAL UNITS AND THEIR PROPERTIES*</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
THE LEXICON: A SYSTEM OF MATRICES OF LEXICAL UNITS AND
THEIR PROPERTIES*
</SectionTitle>
    <Paragraph position="0"> Harry H. Josselson
Uriel Weinreich /1/, in discussing the fact that at one time many American scholars relied on either the discipline of psychology or sociology for the resolution of semantic problems, comments: "In Soviet lexicology, it seems, neither the traditionalists, who have been content to work with the categories of classical rhetoric and 19th-century historical semantics, nor the critical lexicologists in search of better conceptual tools, have ever found reason to doubt that linguistics alone is centrally responsible for the investigation of the vocabulary of languages." /2/ This paper deals with a certain conceptual tool, the matrix, which linguists can use for organizing a lexicon to ensure that words will be described (coded) with consistency, that is, to ensure that questions which have been asked about certain words will be asked for all words in the same class, regardless of the fact that they may be more difficult to answer for some than for others. The paper will also discuss certain new categories, beyond those of classical rhetoric, which have been introduced into lexicology.</Paragraph>
    <Paragraph position="1"> *The research described herein has been supported by the Information Systems Branch of the Office of Naval Research. The present work is an amplification of a paper, "The Lexicon: A Matrix of Lexemes and Their Properties", contributed to the Conference on Mathematical Linguistics at Budapest-Balatonszabadi, September 6-10, 1968.</Paragraph>
    <Paragraph position="2"> 1. INTRODUCTION
The research in automatic translation brought about by the introduction of computers into the technology has engendered a change in linguistic thinking, techniques, and output. The essence of this change is that vague generalizations cast into such phrases as 'words which have this general meaning are often encountered in these and similar structures' have been replaced by the precise definition of rules and the enumeration of complete sets of words defined by a given property. Whereas once it was acceptable to say (e.g., about Russian) that 'certain short forms which are modals tend to govern a чтобы clause', now it is required that: (a) the term 'modal' be defined, either by criteria so precise that any modal could be easily identified, or if that is not possible, by a list containing all of the modals of the language, and (b) the 'certain short forms which are modals' which actually do govern a чтобы clause be likewise identified, either by precise criteria, or by a list.</Paragraph>
    <Paragraph position="3"> Linguistic research into Russian has led to and will continue to yield many discoveries about the language, and the problem of recording and recalling the content of these discoveries is not trivial. A system is required to organize the information which has been ascertained, so that this information can be conveniently retrieved when it is required; such a system is realized as a lexicon.</Paragraph>
    <Paragraph position="4"> Fillmore /3/ has defined a lexicon as follows: "I conceive of a lexicon as a list of minimally redundant descriptions of the syntactic, semantic, and phonological properties of lexical items, accompanied by a system of redundancy rules, the latter conceivable as a set of instructions on how to interpret the lexical entries."</Paragraph>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. DESCRIPTION OF THE LEXICON
</SectionTitle>
    <Paragraph position="0"> The steps in the construction of a lexicon may be detailed as follows:
a) deciding which words to enter, i.e., the lexical stock;
b) deciding what are the subsets of the lexical stock;
c) deciding what information to code about each subset;
d) compiling the information;
e) structuring the storage of the information.
These steps are interdependent. We shall discuss each of them, especially in relation to the Russian language.</Paragraph>
    <Paragraph position="1"> a) The Lexical Stock
Ideally a Russian lexicon should contain all of the words in the Russian language, but 'all Russian words' is a set whose contents are not universally agreed upon, since some words are gradually dropped from usage, while others are continually being formed and added to the lexical stock. The words to be entered in the lexicon could be obtained from existing sources, i.e., lexicons and technical dictionaries, and be supplemented by neologisms found in written works. The lexicographer must also be alert for new meanings and contexts in which 'old' words may appear.</Paragraph>
    <Paragraph position="2"> b) Subsets of the Lexical Stock
The lexical stock of Russian may be subdivided into word classes, i.e., sets of words having certain properties in common. These properties may be morphological and/or functional. In Russian, nouns are not marked for tense and predicatives are not marked for the property of animateness; hence they are in different word classes. The subsets may coincide with those of traditional grammar, or they may be different if the grammar to which the lexicographer refers is not the traditional one.</Paragraph>
    <Paragraph position="3"> c) Information to be Coded
The choice of information to be coded in a particular lexicon is a function of its intended use; in other words, one should code the information that will be necessary for a particular purpose or set of purposes, or information that has a foreseeable application. For example, one of the tasks for the Wayne State University Machine Translation group was to program a routine to group each nominal in a Russian sentence with its preceding (dependent) modifiers. This procedure, called blocking, requires that the computer-stored lexicon contain 1) word class information for identifying nominals and modifiers, as well as conjunctions, punctuation, and adverbs intervening between the modifiers, and other word classes, tokens of which mark the boundaries of a block; and 2) case, number, and gender information for establishing an agreement relation between the nominal and the preceding modifiers.</Paragraph>
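    <Paragraph> The blocking procedure just described can be sketched in a few lines. This is only an illustration under stated assumptions, not the Wayne State routine: the Token record, the word-class labels, and the function names are hypothetical, and the rule that gender is ignored in the plural is a simplification introduced here.

```python
from dataclasses import dataclass

# Hypothetical word-class labels; the actual lexicon codes are not given in the text.
MODIFIER, NOMINAL = "modifier", "nominal"
# Classes that may intervene between modifiers without closing a block.
TRANSPARENT = {"conjunction", "punctuation", "adverb"}


@dataclass(frozen=True)
class Token:
    form: str
    word_class: str
    case: str = ""
    number: str = ""
    gender: str = ""


def agrees(modifier, nominal):
    """Agreement in case, number, and gender (gender is ignored in the plural)."""
    if modifier.case != nominal.case or modifier.number != nominal.number:
        return False
    return nominal.number == "pl" or modifier.gender == nominal.gender


def block_nominals(tokens):
    """Group each nominal with the agreeing modifiers that precede it."""
    blocks = []
    pending = []  # modifiers and intervening tokens collected so far
    for tok in tokens:
        if tok.word_class == MODIFIER or (pending and tok.word_class in TRANSPARENT):
            pending.append(tok)
        elif tok.word_class == NOMINAL:
            modifiers = [t for t in pending if t.word_class == MODIFIER]
            if all(agrees(m, tok) for m in modifiers):
                blocks.append([t.form for t in pending] + [tok.form])
            else:
                blocks.append([tok.form])  # no agreement: the nominal stands alone
            pending = []
        else:
            pending = []  # any other word class marks a block boundary
    return blocks


sentence = [
    Token("новая", MODIFIER, "nom", "sg", "f"),
    Token("и", "conjunction"),
    Token("интересная", MODIFIER, "nom", "sg", "f"),
    Token("книга", NOMINAL, "nom", "sg", "f"),
    Token("лежит", "verb"),
]
blocks = block_nominals(sentence)
```

Both kinds of lexical information named in the text are visible here: the word class decides whether a token opens, extends, or closes a block, and the case, number, and gender codes decide whether the collected modifiers actually belong to the nominal.</Paragraph>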
    <Paragraph position="4"> Most existing Russian lexicons contain the usual morphological information for members of inflected word classes: person, gender, number, case, animation, paradigm, aspect, etc. Certain syntactic information, such as impersonality and government of cases, prepositions, the infinitive, and clauses, is indicated for verbal word classes. This indication may be explicit or sometimes only implicit in an example; it is not consistent. It is not unusual to chance upon one of the complements of a certain predicative under the entry head of another predicative for which the example is given. In the Academy of Sciences dictionary /4/, the entry head нелепый contains the example Стало вдруг обидно и досадно, что приходится играть такую нелепую роль. In the same lexicon, under обидный, the form обидно is shown to govern a что clause in an example; however, under досада there is neither coding nor example to indicate that досадно takes a что clause.</Paragraph>
    <Paragraph position="5"> Each lexical entry should include all of the existing phonological, morphological, and syntactic information about the head word; the discussion and presentation of this information will entail the introduction of concepts from semantics and stylistics. When using a Russian lexicon, one should be able to discover whether можно is a modal (if the grammar of Russian uses the concept 'modal' for the word class of which можно is a member) by looking under the entry head можно and finding the position where the property 'modal' is coded for that word. Furthermore, one should be able to determine whether можно takes an infinitive complement, whether it takes a subject, or whether it has a corresponding long form, etc.</Paragraph>
    <Paragraph position="6"> Since the predicate is the sentence fulcrum, i.e., since it contains the most information necessary for analysis of the structure of the sentence, the coding of the complements of predicative words is one of the main tasks for the data input to automatic parsing of Russian sentences. Machine translation oriented lexicographers have done a great deal of work in coding the complements of many lexemes, especially the predicatives, in an explicit and thorough way.</Paragraph>
    <Paragraph position="7"> Iordanskaja /5/ suggested 126 different complementation patterns to account for the "strong government" of 7000 Russian stems. She recognized that the meanings of the stems could be associated with different patterns; e.g., следовать has the following meanings with the following complements: 1) 'to go after' with за + instr.</Paragraph>
    <Paragraph position="8"> 2) 'to ensue' with из + gen.</Paragraph>
    <Paragraph position="9"> 3) 'to be guided by something' with dative without prep.</Paragraph>
    <Paragraph position="10"> She recommended that the stems with different meanings be treated as different, and that a model be composed separately for each item.</Paragraph>
    <Paragraph position="11"> Rakhmankulova /6/ has written 12 models of complements for sentences containing any of ten different German verbs denoting position in space, and she illustrates, in a matrix, which verbs can appear in which models.</Paragraph>
    <Paragraph position="12"> Machine translation groups have examined Russian texts and from them compiled lists of nominals and predicatives which take an infinitive complement or a что or чтобы clause complement, and lists of governing modifiers with their complementary structures, many of which are not shown in any lexicon. The Wayne State University group has done extensive coding of the complementation of predicatives (verbs and short form modifiers), modifiers (participles and adjectives which govern complementary structures), and nouns. The group has created an auxiliary dictionary which is structured so that every complementation pattern (where the pattern includes an indication of the optional presence or obligatory absence of a subject) associated with each predicative in the dictionary is written out explicitly. For example, the entry for потребовать in this auxiliary dictionary reads as follows:  For translation purposes, it will be necessary to indicate the translation(s) of the predicative corresponding to each pattern, as well as those of the prepositions and case endings in each pattern. A language example of pattern 3 is Уже около полудня его потребовали к околоточному. - 'Already around midday they summoned him to the police.', while pattern 4 is illustrated by От писателя мы потребуем художественной правды. - 'Of a writer we shall demand artistic truth.' The above entry may not be complete. For instance, it does not show the entry у+gen which reflects a phrase in Smirnitsky's dictionary /7/, требовать объяснения у кого - 'to demand an explanation from somebody', which is not shown in three Russian lexicons /8/. Furthermore, it can be seen that the patterns with к+dat can be extended so that that phrase is replaced by в+acc or even by an adverbial домой - 'home'. New information will always be added.</Paragraph>
    <Paragraph position="13"> One transformationalist technique is to specify a syntactic construction along with a list of (all of the) lexemes which can occur in a certain position of that construction. The set of lexemes which can be tokens in a certain position of a construction is the domain of that construction with respect to the position. Once a lexeme is in that domain, the pair consisting of the position and the construction becomes part of the definition of that lexeme. The lexeme is completely defined by all of the pairs in which it is a token, and ideally the contents of a lexical entry would include all such pairs.</Paragraph>
    <Paragraph position="14"> Fillmore /9/ has compiled a list of English verbs which take a to-phrase complement, i.e. lexemes which occur in the position VERB of the construction SUBJ + VERB + to-phrase. He is careful to point out that the to-phrase must function as the complement of the verbs on the list (e.g., agree, endeavor, hope, want) and not as a purposive adverbial phrase (as with 'wait' in 'He waited to see her.' where 'to' can be replaced by 'in order to'), since, as he states, "The appearance of purpose adverbial to-phrases ... does not appear to be statable in terms of contextual verb type." /10/ This indicates that the formal construction is not always sufficient to define a property, and that the deep structure function of the construction may have to be specified as well.</Paragraph>
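    <Paragraph> The notions of a domain and of a lexeme defined by its (construction, position) pairs can be made concrete with a small sketch. The function names and the second construction, SUBJ + VERB + OBJ, are hypothetical and are introduced only so that one verb has a two-pair definition; the four verbs are the examples cited from Fillmore's list.

```python
from collections import defaultdict

# Maps a (construction, position) pair to the set of lexemes that can fill it.
domains = defaultdict(set)


def admit(lexeme, construction, position):
    """Record that the lexeme can be a token in this position of the construction."""
    domains[(construction, position)].add(lexeme)


def domain(construction, position):
    """The domain of a construction with respect to one of its positions."""
    return set(domains.get((construction, position), set()))


def definition(lexeme):
    """All (construction, position) pairs in which the lexeme is a token."""
    return {pair for pair, members in domains.items() if lexeme in members}


TO_COMPLEMENT = "SUBJ + VERB + to-phrase"
for verb in ("agree", "endeavor", "hope", "want"):
    admit(verb, TO_COMPLEMENT, "VERB")

# Hypothetical second construction, added only to show a two-pair definition.
TRANSITIVE = "SUBJ + VERB + OBJ"
admit("want", TRANSITIVE, "VERB")
```

Note that 'wait' is deliberately absent from the domain of the to-phrase construction: as the text explains, its to-phrase is a purpose adverbial, so a purely formal test would admit it wrongly.</Paragraph>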
    <Paragraph position="15"> The fact that statements which are formally identical can have distinct deep structures is illustrated by the following Russian language examples, which are not only formally identical, but identical in content except for one  Five persons were elected by referendum.</Paragraph>
    <Paragraph position="16"> In the first example, the (pro)noun in the instrumental case is the subject of the active transform</Paragraph>
    <Paragraph position="18"> We elected five persons.</Paragraph>
    <Paragraph position="19"> while in the second example, first interpretation, the instrumental noun remains in the instrumental case in the active transform, where (X) stands for some subject: (X) выбрали пять человек делегатами.</Paragraph>
    <Paragraph position="20"> (X) elected five people as delegates.</Paragraph>
    <Paragraph position="21"> The instrumental noun in the third sentence also remains in the instrumental case in the active transform, but is shown to have a different function from the noun in the second sentence by the fact that it is possible, albeit not elegant, to say Пять человек было выбрано делегатами референдумом ('Five people were elected as delegates by referendum.') and correspondingly (X) выбрали пять человек делегатами референдумом ('(X) elected five people as delegates by referendum.'); i.e., both words can coexist in a sentence.</Paragraph>
    <Paragraph position="22"> Kiefer /11/ has shown for Hungarian that the meaning of the verb can change within a given construction when the definition of the construction is formal and does not consider semantic properties of the components.</Paragraph>
    <Paragraph position="23"> Pénz van nála.</Paragraph>
    <Paragraph position="24"> He has money on him.</Paragraph>
    <Paragraph position="25"> is contrasted with Péter van nála.</Paragraph>
    <Paragraph position="26"> Peter is with him.</Paragraph>
    <Paragraph position="27"> where the animate status of the subject distinguishes the possessive and locational meanings.</Paragraph>
    <Paragraph position="28"> Lehiste /12/ has shown that the distinction between 'being' and 'having' in Estonian is one of different complements taken, under special conditions, by the same verb. Although Isal on raamat.</Paragraph>
    <Paragraph position="29"> Father has (a) book.</Paragraph>
    <Paragraph position="30"> and Laual on raamat.</Paragraph>
    <Paragraph position="31"> On the table is (a) book.</Paragraph>
    <Paragraph position="32"> are structurally identical, since morphologically isal and laual are both in the adessive case and raamat is in the nominative case, when functional (semantic) case names are used, isal is dative and laual is locative, while raamat is in both sentences in the objective case. As researchers work in the area of discovering and codifying syntactic properties, they find out that semantic considerations are impossible to avoid. Much of the new work in lexicology involves the analysis of predicates and their arguments (i.e., subjects, and complements such as clauses, noun/adjective phrases, and prepositional phrases). The transition from purely syntactic coding (i.e., specifying the complements and their morphological cases if applicable) to semantic coding has been made by Fillmore /13/ with his grammatical cases (e.g., agent, instrument, object).
d) Compiling the Information
Compiling a dictionary entails discovering facts about a language and arranging these facts in such a way that they may be conveniently retrieved. The key factor is that once a statement is made about a certain member of a word class, all the other members of that class must be coded for the way that statement applies to them.</Paragraph>
    <Paragraph position="34"> If the statement is irrelevant for certain members, it may be desirable to create a new word class for the latter.</Paragraph>
    <Paragraph position="35"> A lexicon without lacunae can be compiled by the following procedure: For each word class construct a matrix such that each column head is a bit of information pertinent to this class, and the row heads are all of the words in this class. Each intersection must be filled with some code indicating whether or not the word has the property, and the codes of the properties must be such that they allow the entire spectrum of possible answers. For example, since the Russian сирота - 'orphan' - can be feminine or masculine, the gender code must include also combinations of the basic components (masculine, feminine, and neuter); since the Russian дифференцировать - 'to differentiate' - is both perfective and imperfective, the code for aspect must comprise entries for 'perfective', 'imperfective', and 'both'. When a verb is marked 'both', it may be desirable to specify the distribution of the aspects over meaning and/or tenses.</Paragraph>
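    <Paragraph> One way to picture such a matrix is the following minimal sketch. It assumes one matrix per word class, with each column tied to a closed set of codes; the class name WordClassMatrix, its methods, and the code values are hypothetical and do not reproduce the coding scheme used in the research described here.

```python
from enum import Enum


class Gender(Enum):
    MASCULINE = "m"
    FEMININE = "f"
    NEUTER = "n"
    MASCULINE_OR_FEMININE = "m/f"  # needed for сирота 'orphan'


class Aspect(Enum):
    PERFECTIVE = "pf"
    IMPERFECTIVE = "impf"
    BOTH = "both"  # needed for дифференцировать 'to differentiate'


class WordClassMatrix:
    """Rows are the words of one class, columns its properties.

    Every intersection must hold an explicit code, so no lacunae can arise.
    """

    def __init__(self, columns):
        # columns maps each property name to the Enum of its admissible codes
        self.columns = dict(columns)
        self.rows = {}

    def add_word(self, word, **codes):
        """Add a row; every column of the class must be coded."""
        if set(codes) != set(self.columns):
            odd = sorted(set(self.columns) ^ set(codes))
            raise ValueError(f"{word}: missing or unknown properties {odd}")
        for name, code in codes.items():
            if not isinstance(code, self.columns[name]):
                raise TypeError(f"{code!r} is not an admissible code for {name}")
        self.rows[word] = dict(codes)

    def add_column(self, name, code_type, codes_by_word):
        """Add a property; it must be coded for every word already entered."""
        if set(codes_by_word) != set(self.rows):
            raise ValueError(f"{name}: every word in the class must be coded")
        for code in codes_by_word.values():
            if not isinstance(code, code_type):
                raise TypeError(f"{code!r} is not an admissible code for {name}")
        self.columns[name] = code_type
        for word, code in codes_by_word.items():
            self.rows[word][name] = code

    def words_with(self, name, code):
        """Retrieve all words of the class that carry a given code."""
        return sorted(w for w, row in self.rows.items() if row[name] is code)


nouns = WordClassMatrix({"gender": Gender})
nouns.add_word("сирота", gender=Gender.MASCULINE_OR_FEMININE)
nouns.add_word("село", gender=Gender.NEUTER)

verbs = WordClassMatrix({"aspect": Aspect})
verbs.add_word("дифференцировать", aspect=Aspect.BOTH)
verbs.add_word("сесть", aspect=Aspect.PERFECTIVE)
```

The design choice worth noting is that an incomplete row or column is rejected outright rather than stored with a silent default, which is the software counterpart of forcing the lexicographer to commit to a code for every intersection.</Paragraph>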
    <Paragraph position="36"> This matrix format forces the lexicographer to commit himself about the way each property applies to each member of the word class. It precludes the old-fashioned quasi-coding, where the lexicographer coded what he knew and omitted what he did not know or had never thought to consider. In some Russian lexicons, certain nouns were coded for having no plural, but the absence of this coding in other entries did not necessarily imply that they did have a plural; the only inference that could be drawn was that most nouns not coded for having no plural did indeed have one, but this information is not meaningful when definite information about a particular entry is required.</Paragraph>
    <Paragraph position="37"> When using the matrix format, with its demands for consistency, one faces the problem of how to get the information to fill its intersections. Naturally, if the information is already in a dictionary, or if the lexicographer has an example, from some text, of the phenomenon to be coded, there is no problem in filling the intersection. However, if the example is lacking, this is not always sufficient ground for coding the non-existence of the property. Sometimes, despite the absence of an example, the lexicographer feels that the property holds, and he may consult with a native informant, using the caution offered by Zellig Harris /14/: If the linguist has in his corpus ax, bx, but not cx (where a, b, c are elements with general distributional similarity), he may wish to check with the informant as to whether cx occurs at all.</Paragraph>
    <Paragraph position="38"> The eliciting of forms from an informant has to be planned with care because of suggestibility in certain interpersonal and intercultural relations and because it may not always be possible for the informant to say whether a form which is proposed by the linguist occurs in his language.</Paragraph>
    <Paragraph position="39"> Rather than constructing a form cx and asking the informant 'Do you say cx?' or the like, the linguist can in most cases ask questions which should lead the informant to use cx if the form occurs in the informant's speech. At its most innocent, eliciting consists of devising situations in which the form in question is likely to occur in the informant's speech.</Paragraph>
    <Paragraph position="40"> Work at Wayne State University on the complementation of certain Russian -o forms by что/чтобы clauses /15/ supports Harris' observation. The difficulties of working with a native informant became evident when, on different occasions, the native accepted and then rejected certain constructions. Sometimes the acceptance depended on the construction of contexts which eluded the native on the second perusal.</Paragraph>
    <Paragraph position="41"> The matrix approach is currently being used in Russian lexicon research at Wayne State University, where the information in the Academy dictionary /16/ and in Ushakov /17/ is being coded.* The omissions and inconsistencies of presenting lexical information in the lexicons are discussed in a paper by Alexander Vitek /18/. Grammatical profiles have been produced for all Russian substantives, adjectives, and verbs, including their derivative participles and gerunds. The profiles contain primarily morphological properties, but some syntactic coding, mainly of complementation patterns, has also been started.</Paragraph>
    <Paragraph position="42"> A sample of the coding format developed for Russian verbs (in Ushakov) in this research appears in Figures 1 and 2.</Paragraph>
    <Paragraph position="43"> In Figure 1, the coding form for Russian verb morphology, separate fields are denoted by a single slash mark. Each field has codes for certain morphological properties of the Russian verb. The following chart explains the codes for the verb добыть - 'to obtain', which appears on the first line of Figure 1.</Paragraph>
    <Paragraph position="44"> *This work is supported by a grant from the National Science Foundation.</Paragraph>
    <Paragraph position="46"> aspect); subaspect (i.e., iterative/noniterative) does not apply.</Paragraph>
    <Paragraph position="47"> 2 11 First conjugation verb: 1st person singular ends in -у; 3rd person plural ends in -ут.</Paragraph>
    <Paragraph position="48"> 3 200 Stress is fixed on the stem throughout the conjugation; there are no alternate stress patterns.</Paragraph>
    <Paragraph position="49"> 4 0000 No changes occur in the stem in the present tense conjugation.</Paragraph>
    <Paragraph position="50"> 5 99 LIST TYPE: -ть is dropped, and бы- is replaced by буд-.</Paragraph>
    <Paragraph position="51"> 6 0000 There are no consonantal mutations.
7 00 There are no restrictions in usage of present and future tense.</Paragraph>
    <Paragraph position="52"> 8 00 Regular past tense marker: drop -ть and add -л to stem.</Paragraph>
    <Paragraph position="53"> 9 7000 Stress is on stem in all past tense forms except the feminine, where it is on the ending.
10 00 There are no restrictions in usage of past tense.</Paragraph>
    <Paragraph position="54"> In Figure 2, the coding form for Russian verb government, separate fields are denoted by double slash marks, with single slash marks used for separation within a given field. The codes are explained once again with the verb добыть - 'to obtain', which appears on the first coding line.</Paragraph>
    <Paragraph position="56"> [Figures 1 and 2: coding forms for Russian verb morphology and Russian verb government.]</Paragraph>
    <Paragraph position="58"> in the lexicon; a general government marker precedes the first Arabic numeral; there are language examples given in the lexicon; the entry is not a -ся verb.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 AA The general government marker indicated
</SectionTitle>
    <Paragraph position="0"> above is the accusative case - кого/что.</Paragraph>
    <Paragraph position="1"> 3 00 There is no government indicated under Arabic numeral #1.</Paragraph>
    <Paragraph position="2"> 4 00 There is no government indicated under Arabic numeral #2.</Paragraph>
    <Paragraph position="3"> 5 R1 The entry contains a cross-reference to some other verb.</Paragraph>
    <Paragraph position="4"> e) Structuring the Storage of the Information
There are many ways of storing the words of a language. With respect to sequencing, alphabetical order is the most popular method, although reverse dictionaries exist, and dictionaries where words are sequenced by their length and only alphabetized within a given word length, or where word class is the primary division, are conceivable. The entry heads can be stems, canonical forms, or all of the forms that exist in the language. Note that a canonical form could be a particular form of a paradigm, such as the masculine singular form of an adjective or the infinitive form of a verb, or it could be a certain verb from which other verbs are derived by certain rules. Binnick /19/ has illustrated the latter by suggesting that be could be an entry head having, as part of its contents, put and make, which are the causative forms of the locational and existential meanings, respectively, of be. Fillmore /20/ has mentioned that strike and touch differ primarily only in relative intensity of impact. It is interesting to note that Hebrew has for some verbs a basic form which is conjugated through seven 'constructions', two of which are labeled 'causative' and 'intensive'.</Paragraph>
    <Paragraph position="5"> A lexicon whose entry heads are stems or canonical forms has the advantage of compactness and the advantage that the whole paradigm associated with these forms is indicated. It has the disadvantage that the user must know the rules of derivation in order to look up words which are not in canonical form. If every form in the language is an entry head, then the lexicon is much longer, but the homographic properties of the word are conveniently recorded; one might never realize, using a canonical form lexicon, that сел is both the past tense of сесть - 'to sit down' - and the genitive plural of село - 'village' - but this property would be immediately evident if сел were an entry head. In the Wayne State University machine translation research, Russian text to be translated or analyzed is 'read in' one sentence at a time; proceeding from left to right, segments of the sentence are 'looked up' in order to obtain whatever information about them has been stored in the machine translation lexicon. The minimum segment is one word; the maximum segment is an entire sentence; no segment is terminated inside a word. The entry heads of the lexicon were designed to correspond to the segments, and therefore are words or sequences of words (idioms). The entry heads could be canonical forms or stems, but this would require automatic procedures for transforming any inflected form into its canonical form, and for finding the stem of any form in text. Space can be saved in a full form lexicon by entering only once, perhaps under the canonical form, the information which all members of a paradigm share, and cross-referencing this information under the related entry heads. In the Wayne State University machine translation research, sets of complementation patterns are stored in an auxiliary dictionary and any set can be referenced by any verbal form.
The sequence of entry heads in the lexicon is alphabetical, since the shape of the text word to be looked up is its only identification. Naturally, if the set of Russian words could be put into a one-to-one correspondence with some subset of the positive integers by a function whose value on any word in its domain could be determined only by information deducible from the graphemic structure of that word, then the entry heads of the lexicon would not have to be in alphabetic order; in this case, the lookup would be simpler and faster, since the entries could be randomly accessed.</Paragraph>
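    <Paragraph> The lookup scheme described in this paragraph can be sketched as follows. Several points are assumptions made for illustration rather than details taken from the Wayne State system: the longest matching segment is tried first, the search is capped at the length of the longest entry head, the field names are invented, and a Python dictionary (a hash table keyed on the written form) stands in for the random-access function that the text mentions as a possibility.

```python
# Information shared by a whole paradigm is entered once, under the canonical
# form, and is cross-referenced from each full-form entry head.
PARADIGMS = {
    "сесть": {"word_class": "verb", "gloss": "to sit down"},
    "село": {"word_class": "noun", "gloss": "village"},
}

# Full-form lexicon: each entry head is a word or a sequence of words (idiom).
# A homograph lists one reading for each paradigm it belongs to.
LEXICON = {
    ("сел",): [
        {"paradigm": "сесть", "form": "past tense"},
        {"paradigm": "село", "form": "genitive plural"},
    ],
    ("потому", "что"): [{"word_class": "idiom", "gloss": "because"}],
}

# Assumption: no segment longer than the longest entry head needs to be tried.
MAX_SEGMENT = max(len(head) for head in LEXICON)


def readings(head):
    """Merge each reading with the information stored once for its paradigm."""
    merged = []
    for entry in LEXICON[head]:
        shared = PARADIGMS.get(entry.get("paradigm"), {})
        merged.append({**shared, **entry})
    return merged


def look_up(words):
    """Scan a sentence left to right, taking the longest segment found."""
    results = []
    i = 0
    while i != len(words):
        for n in range(min(MAX_SEGMENT, len(words) - i), 0, -1):
            segment = tuple(words[i:i + n])
            if segment in LEXICON:
                results.append((segment, readings(segment)))
                i += n
                break
        else:
            results.append(((words[i],), []))  # word not in the lexicon
            i += 1
    return results


found = look_up(["он", "сел", "потому", "что"])
```

Because segments are built from whole words, no segment can end inside a word, and the homography of сел appears directly as two readings under a single entry head.</Paragraph>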
    <Paragraph position="6"> The number of columns in the matrix of any word class should be without limit so that new information can be entered. Similarly, the number of rows should be without limit to allow additions as the lexical stock of the language grows.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. CONCLUSION
</SectionTitle>
    <Paragraph position="0"> Lexical information is the consummation and thereby also the obviation of research through grammars and articles which discuss certain questions and present a few examples of lexical items which have certain properties. A lexicon must reflect the grammatical system used to describe the language, and it should carry the system through to every lexical item in the language. It is clear that the matrix format is the only one which will ensure consistency and completeness.</Paragraph>
    <Paragraph position="1"> This format is eminently machinable and thereby convenient for the retrieval of lists of all words in the language which have a certain property.</Paragraph>
  </Section>
</Paper>