<?xml version="1.0" standalone="yes"?>
<Paper uid="C82-1048">
  <Title>EMPIRICAL DATA AND AUTOMATIC ANALYSIS</Title>
  <Section position="2" start_page="301" end_page="301" type="abstr">
    <SectionTitle>
302 F. PAPP
</SectionTitle>
    <Paragraph position="0"> dictionary we are currently preparing. This is not a trivial task if one considers that in the highly inflective, agglutinative Hungarian language an inflectable word has not just one or two paradigmatic forms (cf. English: table - tables; read - reading - reads, etc.) or even about ten (e.g. Latin, Russian), but hundreds of them. It has long been known in theory, and it is now also shown by the reverse alphabetized concordances discussed later, that this unbelievably large number of forms contrasts with the prepositions of the neighbouring Indo-European languages.</Paragraph>
    <Paragraph position="1"> Thus, whereas in English, Russian, German, etc. certain grammatical elements are "comfortably" separated by spaces, in the agglutinative Hungarian language they have, so to say, merged with the word stem.</Paragraph>
    <Paragraph position="2"> It is not the lack of a space to mark them that causes the principal difficulty. Being within a single word form, they can undergo various changes themselves and can also cause various changes in the word stem, depending on the grammatical nature of the lexeme, on vowel harmony, etc. To solve this purely practical task a mass of empirical data is needed. It is precisely the reverse alphabetized concordances that furnish them, but the data gained from dictionaries are also indispensable.</Paragraph>
    <Paragraph position="3"> 1.2. It was at a conference in Prague some years ago that I gave an account of the general results gained from the study of the material provided by the Hungarian dictionary mentioned above. The matter in question is that a large group of Hungarian nouns derive one of their important forms, namely the third person possessive form, in a way that seems to be rhapsodic, sometimes inserting a "j" element and sometimes not: ablaka 'his/her window' - barackja 'his/her apricot'; türelme 'his/her patience' - filmje 'his/her film', etc. This possessive form was registered with each entry word, so we had the opportunity to group the tens of thousands of nouns of our dictionary from greatly varying points of view, keeping in mind the main question of "with j - without j". Thus, the two pairs of examples quoted above were used to illustrate the point that the appearance of the "j" element could not depend directly on the final consonant: in the first pair both stems ended in "k" and in the second in "m"; it could not even depend on the final consonant group, because in the second pair "j" - not "j" was preceded by "lm" in both cases; but, of course, we nevertheless had such a sorting made. The investigation of the full noun stock of the dictionary in different combinations led to the following conclusion: the seemingly rhapsodic appearance or omission of "j" can easily be explained if we suppose that the Hungarian language as a natural code is structured in such a way that automatic analysis can be carried out using a minimal vocabulary.</Paragraph>
    <Paragraph position="4"> Concretely: if the non-possessive stem is typical and "well-formed" from the viewpoint of the whole vocabulary, i.e. if it has a frequently occurring end, the "j" element does not appear after it in the possessive form; and the other way round: where the stem would not be recognized automatically because it has a rare ending, and therefore the bare "a/e" ending would be linked automatically with the stem, "j" emerges, so to say, in order to stop this and to emphasize the end of the bare stem. Thus in the first pair of examples quoted above "j" did not appear after the bare final "k" because this is frequent for the end of stems in Hungarian; whereas "j" appeared after the "bad" "ck" group, which is not a typical stem end in Hungarian. Our second pair of examples is also subject to the same rule, although in a slightly different way: the word türelme did not require "j" because a typical change of the stem of a productive Hungarian suffix is hidden in it (the bare nominative is türelem, so here the stem was automatically emphasized by the türelem/türelm- opposition), whereas filmje required a special denotation of the stem by "j", there being no *filem/film opposition. It must be added that the behaviour of suffixes was diagnostic from our point of view. If the suffix is productive, there is no "j" after it; if it is not productive, or it is not a nominal suffix but e.g. an adjectival one and is used with a noun only occasionally, the appearance of "j" is more or less necessary.
It is "more or less" so because it represents the linguistic manifestation of a regularity that is practically a statistical one; the several examples of instability and parallel forms are explained by the fact that the linguistic instinct is not a computer: it is not always possible for a whole community to decide unequivocally whether a stem end is frequent or rare, whether a suffix is productive or not. From this point of view the behaviour of the different historico-etymological strata is characteristic. The nearer we move towards the younger loanwords, the more frequently "j" appears: a loanword often has a "wrong", "atypical" end. But, of course, we can only say "more frequently", "in many cases", etc.: e.g. after a great number of words ending in the easily perceptible -um, -(t)or had been borrowed from Latin, they gradually became "good", i.e. recognizable, and so they did not necessarily have to take the special sign "j". A similar phenomenon can also be observed with compound words.</Paragraph>
    <Paragraph position="5"> A root word having a "wrong" end requires "j" - but if it is often used as the second part of a compound, it becomes something like a suffix, "we have got used to it" at the ends of compounds, and this is why the "j" will sooner disappear from there. This is easily noticeable even on the basis of a simple reverse alphabetized list, as the root word, often having a morphological code with "j", stands in the first place there and is immediately followed by the compounds in which this root word is the second part, and in many cases their code no longer has a "j". (NB. when the coding was going on this regularity was not even guessed at, so this theoretical consideration could not have influenced the coders; in the case of root words they were compelled to take over the corresponding code of the source dictionary. Of course, no one could see these compounds in one group before the publication of the reverse alphabetized dictionary!) All this, however, was nothing more than a plausible hypothesis supported by evidence from the dictionary. Its real confirmation could be achieved by the study of texts. Reverse alphabetized concordances based on Hungarian linguistic material afford a very good opportunity to do this.</Paragraph>
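    <Paragraph> The hypothesis of section 1.2 can be pictured as a small frequency test. The following Python sketch is only an illustration under stated assumptions, not the procedure used for the dictionary: the function names, the two-letter definition of a "stem end" and the threshold min_count are all hypothetical, and the three-word vocabulary is a toy list ("patak" is added here only to make the "ak" end recur).

```python
from collections import Counter


def ending_counts(stems, n=2):
    """Count how often each final n-letter group closes a stem."""
    return Counter(stem[-n:] for stem in stems)


def expects_j(stem, counts, min_count=2, n=2):
    """Guess that the possessive takes "j" when the stem end is rare.

    min_count is an arbitrary illustrative threshold (an assumption),
    not a value taken from the paper.
    """
    return min_count > counts[stem[-n:]]


# Toy vocabulary: "ablak" and "barack" echo the examples in the text.
stems = ["ablak", "patak", "barack"]
counts = ending_counts(stems)

for stem in ("ablak", "barack"):
    print(stem, "-> j expected:", expects_j(stem, counts))
```

With a real vocabulary the counts would come from the full reverse alphabetized dictionary, and, as the text stresses, the outcome would remain a statistical tendency rather than a rule without exceptions.</Paragraph>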
    <Paragraph position="6"> 2. For the last couple of years a number of normal and reverse alphabetized concordances have been made at the L. Kossuth University on the basis of English, French, Swahili and mainly Hungarian and Russian texts. Relying upon the material provided by the last two languages, we are going to show what kind of empirical data can be provided for the analysis. Properly made concordances from texts in different natural languages have features of their own. Thus, it is clear that a normal Swahili concordance works the other way round in the sense that the material is divided into different groups according to the grammatical indices; that in a French concordance it is not expedient to print running words consisting of three or fewer letters. These technical details will not be discussed here; we only note that in Hungarian concordances the article "a", which makes up a comparatively high percentage of running words in the different stylistic strata, has been left out of consideration.</Paragraph>
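    <Paragraph> As an illustration of what a reverse alphabetized concordance is, the following Python sketch groups the running words of a token list by word form, keeps a little context for each occurrence, and orders the entries by their spelling read from the end. It is a minimal sketch, not the program used at the university: the function name, the skip and width parameters and the token list are assumptions, and Python's default string ordering (by code point) stands in for a proper Hungarian or Russian collation.

```python
from collections import defaultdict


def reverse_concordance(tokens, skip=("a",), width=1):
    """Group occurrences by word form, ordered by the reversed spelling."""
    entries = defaultdict(list)
    for i, token in enumerate(tokens):
        word = token.lower()
        if word in skip:  # e.g. the Hungarian article "a" is left out
            continue
        start = max(0, i - width)
        # Keep `width` neighbouring tokens on each side as context.
        entries[word].append(" ".join(tokens[start:i + width + 1]))
    # Sorting on the reversed string puts words with the same ending together.
    return sorted(entries.items(), key=lambda item: item[0][::-1])


# Toy token list built from word forms quoted in the text (not a real sentence).
tokens = ["a", "filmje", "ablaka", "barackja"]
for word, contexts in reverse_concordance(tokens):
    print(word.rjust(10), contexts)
```

Right-aligning the printed word forms, as in the loop above, makes the shared endings line up visually, which is the practical point of a reverse alphabetized list.</Paragraph>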
    <Paragraph position="7"> 2.1. Having made a reverse alphabetized concordance from Hungarian newspaper texts consisting of approximately 26 thousand running words, we can arrive at the following conclusions. Of the 64 phonemes in contemporary Hungarian only 49 actually occur at the ends of words, half of all the word ends being occupied by the first five of these (the percentage of word ends taken up by each of these phonemes in our material: /t/ 13, /k/ 12, /n/ 9, /s/ 9, /a/ 7). As will be shown in the next section by a comparison with Russian data, this distribution is very similar to that in Russian. Within this, the agglutinative character of Hungarian becomes clear from the quantitative point of view: within the different final phonemes, large blocks with the same long agglutinative ending group can be seen. Thus, 18 % of all words ending in /t/ are made up by those ending in /et/ (non-possessive acc. sing.; verbs 3rd pers. sing.), 10 % by those ending in /~t/ (possessive acc. sing., non-possessive acc. sing., verb, verbal prefix), etc.; the end /k/ has final groups sometimes containing as many as three phonemes: /nak/ and /nek/, each taking up 11 % (dat/gen, verb 3rd pers. plural); the same can be observed with the end /n/: /ban/ - /ben/ (32 % altogether, inessive), and so on. All this suggests that morphological analysis in Hungarian should be started at the ends of words: much useful grammatical information is concentrated there. These final clusters of two, three or four phonemes are, of course, not completely homogeneous, but the number of words to be analysed in another way is insignificant. Thus, e.g. the acc.</Paragraph>
    <Paragraph position="8"> sing. forms ending in /ot/ make up 3 % of those ending in /t/ and there are only two running words among them in which this is in the form of nom.</Paragraph>
    <Paragraph position="9"> sing. (the two occurrences of the lexeme ~11~); or, to mention another example: among the dozens of occurrences of the final quadruplet /ének/ (3rd pers. poss. dat/gen), only two running words have to be analysed in another way: békének and versikének, used as the non-possessive dat/gen of the lexemes béke 'peace' and versike 'little verse'. We have already mentioned above that the final empirical evidence for our hypothesis on the possessive /j/ was provided by these reverse alphabetized concordances. It was found that of all the words ending in /a/ 11 % ended in /ja/, mainly owing to this possessive form, whereas the words ending in /je/ did not make up even 3 % of those ending in /e/: both the contemporary Hungarian vocabulary and the contemporary texts, with their frequent /a/ end and less frequent /e/ end, will sooner require the special denotation of the end of the stem with /j/ immediately before the /a/. Other Hungarian texts presented a very similar picture of the distribution of final phonemes, especially concerning /a/ - /e/ at the ends of words. Thus, e.g. there was not a single noun with the ending /je/ in this possessive form among the thousands of nouns of the approximately 20 000 running words of "Toldi" (an epic poem written by János Arany in the middle of the last century); it goes without saying that at the same time a number of them took the endings /a/, /ja/ and /e/ in this grammatical form: the various possessive forms in their sum total proved to be even more productive than the plural ones. (By the way, all this testifies that Hungarian texts can be considered to have been "contemporary" from this point of view since at least the middle of the last century.) Concerning analysis let us make one more essential remark in connection with the /ja/ ending.
Forms like barackja 'his/her apricot' can already be well differentiated in the nominal declension, but the same /ja/ ending has created a new homonymy at the ends of words: the 3rd person singular forms of velar verbs take the same ending in their objective conjugation: e.g. this form of the verb vág 'to cut' is vágja 'he/she cuts (it)'. The prominence of the /ja/ ending can be explained by this fact as well, which at the same time makes our evidence weaker: in the case of palatal harmony there is another ending (cf. néz 'to look at': nézi 'he/she looks at (it)' and not the expectable *nézje or something like this). The whole morphology of Hungarian, however, is dominated by a particular feature: namely that no difference is made between the parts of speech: the /m/ at the ends of words refers to the first person of verbs, nouns, pronouns, etc.; the /k/ refers to some kind of plural. This, of course, makes morphological analysis based on the word end more difficult: how practical it is to know that in Russian endings containing the element /y/ (ye, ~:.x, Z_~_~, etc.) belong to an adjective; that the overwhelming majority of verbal word ends (e._~', et, em, ete, etc.) is characteristic only of verbs, etc. (It is interesting to note that English, a language with an extremely poor system of endings and hardly comparable with Hungarian from this point of view, shows a similar indifference towards parts of speech and even grammatical meaning: it is only the simple /s/ that forms the plural of nouns, the 3rd person singular of verbs and even the genitive of nouns; such a polysemy, of course, could hardly be imagined in Hungarian.) 2.2. Here are the five most frequent final phonemes of "Onegin", containing about 22 000 running words; the percentage is indicated in brackets: /j/ (10), /i/ (10), /a/ (9), /o/ (8), /e/ (8); 45 % of the running words end in one of these phonemes.
Within the most frequent word ends, however, one can find fewer final pairs (not to speak of triplets or quadruplets), and what is important is that if an ending can still be brought into prominence, it can bear many varied and unrelated functions.</Paragraph>
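    <Paragraph> The percentages quoted in sections 2.1 and 2.2 are of two kinds: the share of all running words closed by a given final segment, and the share of one ending within a shorter one (e.g. /ja/ among the words in /a/). The Python sketch below shows both computations. It is a simplified illustration under stated assumptions: the function names are hypothetical, the word list is a toy sample ("filmnek" is added for illustration), and written letters are counted instead of phonemes, which matters for Hungarian digraphs such as "sz" or "cs".

```python
from collections import Counter


def final_segment_shares(words, n=1):
    """Percentage of words closed by each final n-character group."""
    counts = Counter(word[-n:] for word in words if len(word) >= n)
    total = sum(counts.values())
    return {seg: 100 * c / total for seg, c in counts.most_common()}


def share_within(words, outer, inner):
    """Percentage of the words ending in `outer` that also end in `inner`."""
    base = [w for w in words if w.endswith(outer)]
    if not base:
        return 0.0
    return 100 * sum(w.endswith(inner) for w in base) / len(base)


# Toy sample of running words, not data from the concordances.
words = ["ablaka", "barackja", "filmje", "filmnek"]
print(final_segment_shares(words))        # shares of the final letters
print(share_within(words, "a", "ja"))     # share of "ja" among words in "a"
```

Applied to a full text, the first function would yield a ranking comparable to the lists of most frequent final phonemes given above, provided the text is first transcribed phonemically.</Paragraph>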
    <Paragraph position="10"> Thus, e.g. in this material the word end /ej/ accounts for 480 running words, more than one fifth of all those ending in /j/. Within the limits of this material the following proportions have been established (100 % = 480): 1. pronouns like ej, sej, vsej (48 %), 2. gostej-type genitive plural (14 %), 3. poslednej-type adjectival forms (10 %), 4. nočej-type genitive plural (10 %), 5.</Paragraph>
    <Paragraph position="11"> lenivej-type comparative forms (8 %), and so on. The remaining 10 % are spread over a dozen functions (parts of speech, grammatical cases, moods of verbs, etc.). It should be noted that none of the five most frequent types enumerated here is homogeneous from the point of view of grammatical analysis, cf. especially types 1 and 3 with their vl..fv grammatical polysemy. Some of the more "fortunate" endings, such as 1j, may have only half as many functions, but even in this case the mass of empirical data yielded by a reverse alphabetized concordance may be indispensable to make the analysing algorithm as exact and elegant as necessary.</Paragraph>
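    <Paragraph> Because one ending can carry several functions, an analysis that starts at the end of the word can at best return a list of candidate readings. The Python sketch below illustrates this with a longest-match lookup. It is a hedged illustration, not the analysing algorithm referred to in the text: the function name and the table layout are assumptions, the table holds only the Hungarian endings and labels quoted in section 2.1, and "filmnek" is a word form chosen here for the example.

```python
# Endings and their readings as listed in section 2.1 (far from complete).
ENDINGS = {
    "nak": ["dat/gen", "verb 3rd pers. plural"],
    "nek": ["dat/gen", "verb 3rd pers. plural"],
    "ban": ["inessive"],
    "ben": ["inessive"],
    "et": ["non-possessive acc. sing.", "verb 3rd pers. sing."],
}


def candidate_analyses(word, table=ENDINGS):
    """Return (ending, readings) for the longest listed ending closing the word."""
    for ending in sorted(table, key=len, reverse=True):
        # Require at least one character of stem before the ending.
        if word.endswith(ending) and len(word) > len(ending):
            return ending, table[ending]
    return None, []


print(candidate_analyses("filmnek"))
print(candidate_analyses("ablaka"))
```

The lookup deliberately returns every reading of the matched ending; choosing among them, and handling exceptional stems, is where the empirical data from the concordances would be needed.</Paragraph>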
    <Paragraph position="12"> 2.3. It was quite clear, even in the early stages of mechanical translation, that the separable Hungarian verbal prefixes would present a special problem for the analysis. (Thus, the very first step of the very first Hungarian-Russian MT algorithm in word-finding was the search for verbal prefixes that might have been separated, cf. Mel'čuk 1958, 231.) According to the</Paragraph>
  </Section>
</Paper>