<?xml version="1.0" standalone="yes"?>
<Paper uid="E89-1020">
  <Title>It Would Be Much Easier If WENT Were GOED</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2.2. THEMATIC FAMILIES
</SectionTitle>
    <Paragraph position="0"> We call a thematic family (TF) the set of all word-forms of a given (lemma) word, obtained by grammatical inflection: TF = {W1, ..., Wm}. We assume that a TF is always lexicographically sorted. Let &lt;X&gt; denote an arbitrary string of characters and &lt;X&gt;&lt;Y&gt; the string obtained by concatenating the substrings &lt;X&gt; and &lt;Y&gt;.</Paragraph>
    <Paragraph position="1"> We say that a TF is regular iff there is a q-letter substring &lt;Rq&gt;, called the root, common to all the m words of TF so that:</Paragraph>
    <Paragraph position="3"> and {&lt;Wl+1&gt;, ..., &lt;Wm&gt;} give the roots &lt;Rq1&gt; and &lt;Rq2&gt;, with &lt;Rq1&gt; being a substring of &lt;Rq2&gt; or at most equal to &lt;Rq2&gt;. The remaining part of a word in TF after removing the root is called an ending (we use the term 'ending' to include both desinences and suffixes).</Paragraph>
    <Paragraph position="4"> The list of all endings obtained from a TF is called a paradigmatic endings family (PEF).</Paragraph>
    <Paragraph position="5"> A thematic family is called partial regular if there is a partition TF = {TF1, TF2, ..., TFk} so that:</Paragraph>
    <Paragraph position="7"> According to the above definition, a partial-regular TF is characterized by k roots. A thematic family which is neither regular nor partial-regular is called irregular.</Paragraph>
    <Paragraph position="8"> In the following, to simplify notation, when referring to strings of characters we use angle brackets only where we need to show the composition/decomposition of a word-form.</Paragraph>
    <Paragraph position="9"> A central notion of our approach is that of flexioning paradigm. Its meaning is similar to that used by most morphologists.</Paragraph>
    <Paragraph position="10"> We define a flexioning paradigm Q as a list of pairs: Q = {(e1 p1)(e2 p2)...(ek pk)} where the 'ei' are endings extracted from a thematic family (irrespective of their regularity) and the 'pi' are appropriate points in P (the appropriateness will be revealed in the fourth chapter).</Paragraph>
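    <Paragraph> The notions of root, ending and PEF can be illustrated with a minimal sketch. It assumes that the root of a regular TF is simply the longest common prefix of its word-forms, which is a simplification, since the formal conditions of the definition are not reproduced here. The function name and the diacritic-free sample forms are illustrative only.

```python
from os.path import commonprefix


def root_and_endings(word_forms):
    """Return (root, PEF) for a thematic family given as a list of word-forms."""
    family = sorted(word_forms)      # a TF is kept lexicographically sorted
    root = commonprefix(family)      # assumption: root = longest common prefix
    endings = [w[len(root):] for w in family]
    return root, endings


# Illustrative, diacritic-free forms of the Romanian noun "mod" (mode, manner).
root, pef = root_and_endings(["modul", "mod", "moduri", "modului", "modurile"])
```

    With this input the sketch yields the root 'mod' and one ending per word-form, the first of them being the empty ending.</Paragraph>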
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3. UNINTERPRETED LEXICON
</SectionTitle>
      <Paragraph position="0"> Let LS be a set of words obtained from the union of k thematic families, called a lexical stock: LS = TF1 ∪ TF2 ∪ ... ∪ TFk. We call an uninterpreted lexicon of the lexical stock LS a set UL = {R1, R2, ..., Rp} so that for any i ∈ [1,p] Ri is a root of a certain TFi in LS. The mapping I: LS --&gt; PS(UL x P) is called an interpretation of a UL within a morphological model MM (recall that P is the paradigmatic flexioning space of a certain MM). Let us observe that the I mapping allows a word to be ambiguously interpreted, which is quite natural at the level of isolated word-form analysis. Such a common ambiguity is illustrated, for instance, by the Romanian word "modul", which may stand either for the unarticulated nominative/accusative form of "modul" (module) or for the articulated nominative/accusative form of "mod" (mode, manner). The I mapping abstracts the process of word-form analysis. The abstraction of the reverse process, the generation of word-forms, is represented by the mapping G defined as follows: G: UL x P --&gt; LS. As opposed to I, G is a univocal function, that is, for a given root and a specific point in the paradigmatic flexioning space P, a unique word-form will result.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. BUILDING A MORPHOLOGICAL MODEL
</SectionTitle>
    <Paragraph position="0"> To build a morphological model, the designer starts by specifying the categories of interest in his/her application. The traditional categories are NOUN, ADJECTIVE, VERB, PRONOUN and so forth, but this categorial system is by no means obligatory (for instance, one might think of using semantically flavoured categories such as OBJECT, PROPERTY, ACTION, STATE, ANAPHOR, etc.).</Paragraph>
    <Paragraph position="1"> For each defined category in C, the designer will be asked to provide the desired sub-categories (for instance COMMON-NOUN and PROPER-NOUN for NOUN). This activity is equivalent in the formal model to defining the SC subset and the F1 function. Further, for each sub-category in SC the system asks the designer to enter the specific features along which the inflexional behaviour of the words is relevant. In Romanian, for instance, while number, case and enclitic articulation are relevant for COMMON-NOUN, for feminine PROPER-NOUN only the case is significant (but this is not always true: feminine proper nouns ending in a consonant, whatever their etymology, do not inflect at all). By entering all sub-category-feature associations, the designer implicitly provides the system with the M set and the F2 function. Finally, for each feature in M, the designer will be asked to define the possible values the feature may take (e.g. 'singular' and 'plural' for the 'number' feature). When the list of features is exhausted, the system has already learnt the V set and the F3 mapping. At this point, the activity of the designer is theoretically finished and it is the system itself which will generate, based on these definitions, the paradigmatic flexioning space (P), thus completing the MM internal representation. From this internal representation, the system generates for each defined sub-category a graphic tabular menu (we call it an Acquisition Scenario, AS), partially filled in. The only blank column in an AS is called the WORD-FORM column and is accessible for writing by the trainer (tutor) of the system. Each line in an AS is filled (except the last field, corresponding to the WORD-FORM column) with the information uniquely identifying a point in P.</Paragraph>
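    <Paragraph> As an illustration of how P and an acquisition scenario could be derived from the designer's declarations, the sketch below enumerates, for every sub-category, all combinations of feature values. The nested-dictionary layout, the function names and the particular feature values are assumptions made for this example; they are not the internal representation used by the system.

```python
from itertools import product

# Hypothetical designer input: category -> sub-category -> feature -> values.
MODEL = {
    "NOUN": {
        "COMMON-NOUN": {
            "number": ["singular", "plural"],
            "case": ["nom-acc", "gen-dat"],
            "articulation": ["unarticulated", "articulated"],
        },
        "PROPER-NOUN": {"case": ["nom-acc", "gen-dat"]},
    },
}


def flexioning_space(model):
    """Enumerate points of P as (category, sub-category, feature-value pairs)."""
    points = []
    for category, subcats in model.items():
        for subcat, features in subcats.items():
            names = list(features)
            for values in product(*(features[n] for n in names)):
                points.append((category, subcat, tuple(zip(names, values))))
    return points


def acquisition_scenario(points, subcat):
    """Rows of an AS: one per point of the sub-category, WORD-FORM left blank."""
    return [(p, "") for p in points if p[1] == subcat]


P = flexioning_space(MODEL)
scenario = acquisition_scenario(P, "COMMON-NOUN")
```

    Under these sample declarations the scenario for COMMON-NOUN has one row per combination of number, case and articulation, each waiting for the tutor's word-form.</Paragraph>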
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4. KNOWLEDGE ACQUISITION
</SectionTitle>
    <Paragraph position="0"> When the tutor chooses a defined sub-category of a category in C to be exemplified, the system answers by displaying the associated acquisition scenario. The tutor is asked to fill in the blanks of the WORD-FORM column with the inflected forms of the thematic word. Each word-form must obey the restrictions imposed by the combination of feature values displayed on the line in which the tutor is writing.</Paragraph>
    <Paragraph position="2"> Once the WORD-FORM column of the current AS is completely filled in, the root detection phase is activated. The word-forms are ordered lexicographically, with the provision that the initial association ((ci sci Mi Vi) Wi) will be remembered. In the general case of a partially regular TF containing n word-forms, the result of the root detection phase is represented as follows: LA = (((e1...ek) root1)...((em...en) rootj)).</Paragraph>
    <Paragraph position="3"> The n endings in the above list inherit the morphological features associated with the word-forms from which they were extracted. That is, if the word Wi = rootj + ei was associated with pj = (cj scj Mj Vj), then ei is also associated with pj. As a result, a possible new flexioning paradigm appears: Q = ((e1 p1)(e2 p2)...(en pn)). While the pi are all distinct, this is not necessarily the case for the endings. The Q paradigm is looked up in a list of already known paradigms and, if not found there, is marked for interning. To intern a paradigm means to integrate it into an associative structure (a discrimination tree) appropriate for word-form morphological analysis (see further).</Paragraph>
    <Paragraph position="4"> For the generation of word-forms, the above representation is very suitable (Tufis, 1988). A paradigm is interned immediately after its marking only if it was learnt from a regular TF. Otherwise, this process is delayed until the roots of the TF are processed. The discrimination tree internally represents all the known endings and their morphological feature values. Its nodes are labelled with letters appearing in different endings.</Paragraph>
    <Paragraph position="5"> A proper ending is represented by the concatenation of the letters labelling the nodes along a certain path, starting from a terminal node towards the root of the tree. Due to the retrograde strategy used in our system, possible endings, which are searched for in a word from right to left, are checked in the discrimination tree from top to bottom. This explains the ordering of the label letters in the tree. A terminal node (T-node) is not necessarily a leaf node, because one ending may be included in a longer one (the reverse is always true). All T-nodes are associated with the paradigmatic information specific to the ending which they stand for. This information is represented by a list of pairs: ((Q1 p1)(Q2 p2)...(Qm pm)) where the Qi are paradigm identifiers and the pi are (identifiers for) points in the paradigmatic flexioning space P. If a T-node (hence an ending) has more than a single pair (Q p) associated with it, it is called extrinsically ambiguous. Another type of ambiguity is induced by endings containing shorter embedded endings. We call such endings intrinsically ambiguous. Let us suppose two endings &lt;X&gt; and &lt;Y&gt; so that &lt;X&gt; may be written as &lt;Z&gt;&lt;Y&gt;. In the case of a word-form &lt;A&gt;&lt;Z&gt;&lt;Y&gt;, without additional information one cannot definitely decide whether the word should be segmented as &lt;A&gt;&lt;Z&gt; + &lt;Y&gt; or as &lt;A&gt; + &lt;Z&gt;&lt;Y&gt;. For both types of ambiguities there are sound methods of resolution if the decision procedure has access to the root lexicon or to some syntactic rules.</Paragraph>
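    <Paragraph> The retrograde discrimination tree described above can be approximated by a letter trie in which endings are stored right to left. The following is only a sketch under that assumption: the class and function names, and the identifiers Q1, Q2, p1, p5 and p6, are hypothetical, and returning the longest ending first is a choice made for this example.

```python
class Node:
    def __init__(self):
        self.children = {}   # letter -> Node
        self.pairs = []      # non-empty on T-nodes: list of (Q, p) pairs


def intern(tree, ending, paradigm, point):
    """Insert an ending, its letters stored right to left (retrograde order)."""
    node = tree
    for letter in reversed(ending):
        node = node.children.setdefault(letter, Node())
    node.pairs.append((paradigm, point))


def segmentations(tree, word):
    """Return (root, ending, pairs) candidates, longest ending first."""
    found = []
    node = tree
    if node.pairs:                           # the empty ending is known
        found.append((word, "", node.pairs))
    for i in range(len(word) - 1, 0, -1):    # keep at least one root letter
        node = node.children.get(word[i])
        if node is None:
            break
        if node.pairs:                       # T-node, not necessarily a leaf
            found.append((word[:i], word[i:], node.pairs))
    return found[::-1]


tree = Node()
intern(tree, "", "Q1", "p1")       # hypothetical paradigm and point names
intern(tree, "ul", "Q2", "p5")
intern(tree, "ului", "Q2", "p6")
candidates = segmentations(tree, "modul")
```

    For "modul" the sketch proposes both 'mod' + 'ul' and 'modul' with the empty ending, which mirrors the ambiguity between "mod" and "modul" mentioned in section 2.3.</Paragraph>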
    <Paragraph position="6"> Anyway, for intrinsically ambiguous cases, our system has found that for Romanian, in almost all cases, the segmentation with the longest ending is the correct one. For extrinsically ambiguous endings, the system uses some statistics, extracted from the training data, which proved to be valuable. For instance, the system updates for each paradigm a so-called local counter with each new thematic family behaving according to that paradigm. The value of this counter, specific to each paradigm, is considered in sorting the interpretations of an ending: ((Q1 p1)...(Qr pr)). According to this sorting, an interpretation (Qi pi) is considered more likely than another one (Qj pj) if in the lexicon there are more roots "belonging" to Qi than to Qj. This preference heuristic does not take into account the frequency of the words in running texts but only their paradigmatic classification. We plan to introduce "dynamic counters", which are supposed to provide qualitative estimation based on word-form frequencies. It is clear that in order to provide valuable preferences, the values of the static/dynamic counters must result from large sets of examples and running texts. This requirement may be fulfilled by using in parallel many PARADIGM incarnations and finally by merging their outputs. It is important to note that the preference heuristics we talk about are intended only for getting a plausibility ordering criterion for the possible interpretations of an ending or segmentations of a word-form. It means that no interpretation variant is rejected at this level, so that if a preferred (according to the preference heuristics) interpretation or segmentation was wrong, the correct one may still be found. Root processing and, eventually, paradigm modification or absorption (see further) are based on some similarity criteria. If no similarity is detected between the roots of a TF, the corresponding paradigm, if marked as new, is interned as it was initially constructed. But if the roots are similar, the system tries to reduce the differences between them, either by modifying the inflexional paradigm or by inferring rules for root modification. The first approach is generally taken if the differences between roots appear at their boundary with the endings. The second method is usually tried in the case of differences appearing inside the roots. The similarity criteria are declaratively specified, so that it is easy to modify, augment or adapt them to specific needs. The notion of similarity, as used in our approach, is very simple. We have developed a similarity description language in which one may describe the conditions under which two strings are to be considered similar. In the current version of the system, we use only three simple similarity rules:</Paragraph>
    <Paragraph position="8"> In the above rules the metasymbol &lt;X&gt; stands for an arbitrary string, the question mark for zero or one letter, the exclamation mark for exactly one letter and == for the similarity relation. Their readings are: rs1) two strings are similar if they differ by at most one embedded letter (calculatoAr == calculator); rs2) two strings are similar if they are the same except the last letter of one or both of them (copiL == copiI); rs3) two strings are similar if they differ by at most one embedded letter and by the last letter of one or both of them (fereAstrA == ferestrE).</Paragraph>
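    <Paragraph> Since the rule patterns themselves are not reproduced here, the sketch below implements the three rules from their prose readings only. It assumes that "differ by at most one embedded letter" means that deleting one non-initial, non-final letter from the longer string gives the shorter one; the function names are illustrative and the code is not the similarity description language of the system.

```python
def last_letter_stems(s):
    """The string itself and the string without its last letter."""
    return {s, s[:-1]}


def rs1(a, b):
    """Equal, or equal after deleting one embedded letter from the longer string."""
    if a == b:
        return True
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    if len(longer) - len(shorter) != 1:
        return False
    return any(longer[:i] + longer[i + 1:] == shorter
               for i in range(1, len(longer) - 1))


def rs2(a, b):
    """The same except for the last letter of one or both strings."""
    return bool(last_letter_stems(a).intersection(last_letter_stems(b)))


def rs3(a, b):
    """rs1 applied after optionally dropping the last letter of each string."""
    return any(rs1(x, y)
               for x in last_letter_stems(a)
               for y in last_letter_stems(b))
```

    Under these readings rs1 accepts 'calculatoar' and 'calculator', rs2 accepts 'copil' and 'copii', and only rs3 accepts 'fereastra' and 'ferestre'.</Paragraph>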
    <Paragraph position="9"> Actually, the similarity description language is more powerful than suggested above. For instance, one may impose restrictions on an &lt;X&gt; construction such as minimal or exact number of characters in X, prosodic restrictions such as presence or absence of accent, and so on. If two roots are similar, the system attempts to generalize their similarity beyond the particular TF currently processed. The similarity between two roots is necessary but not sufficient for making a generalization. What is needed is an explanation, in terms of morphological features, accounting for root modification. This explanation, if found, will be used as a precondition for the root modification rule to be synthesized. The explanation justifies the difference between the two roots (of the same TF), and consists of discriminant descriptions (in terms of morphological features) of the endings associated with them. In the current version, the system looks for the morphological features which have the same value for all the word-forms obtainable from the first root and another, different value for all word-forms derived from the second root. For instance, with the similar roots 'copil' and 'copii' (child), the system discovered that all the singular forms are produced by the first root while the second generates all the plural forms.</Paragraph>
    <Paragraph position="10"> Using this fact, the system built the following rule, entering only one root (copil) in the lexicon: "If a root X behaves according to the paradigm Q39 and its last letter is 'l' then in all plural forms 'l' must be replaced by the letter 'i'".</Paragraph>
    <Paragraph position="11"> The "generative" flavour of this rule should not be misleading: one must not infer that it is good only for generation. The same rule applies to analysis: "If a root was discovered according to the paradigm Q39 and its last letter was 'i', the root may be recorded in the lexicon with its 'i' replaced by the letter 'l'".</Paragraph>
    <Paragraph position="12"> As more data sets are provided the rules may be generalized further in order to cover the new cases.</Paragraph>
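    <Paragraph> The search for an explanation, that is, for features with one value on all word-forms of the first root and a different value on all word-forms of the second, might be sketched as follows. The representation of a word-form's features as a dictionary, the function name and the sample feature values are assumptions made for this example.

```python
def discriminating_features(forms_a, forms_b):
    """forms_a, forms_b: lists of feature dicts, one dict per word-form of a root.

    Return {feature: (value_a, value_b)} for the features that are constant
    within each group and take different values in the two groups.
    """
    result = {}
    shared = set.intersection(*(set(f) for f in forms_a + forms_b))
    for feature in sorted(shared):
        values_a = {f[feature] for f in forms_a}
        values_b = {f[feature] for f in forms_b}
        if len(values_a) == 1 and len(values_b) == 1 and values_a != values_b:
            result[feature] = (values_a.pop(), values_b.pop())
    return result


# Hypothetical feature descriptions for the word-forms of two similar roots.
first_root = [{"number": "singular", "case": c, "gender": "masculine"}
              for c in ("nom-acc", "gen-dat")]
second_root = [{"number": "plural", "case": c, "gender": "masculine"}
               for c in ("nom-acc", "gen-dat")]
explanation = discriminating_features(first_root, second_root)
```

    Here only 'number' qualifies: 'gender' has the same value for both roots and 'case' varies within each root, so neither can serve as a precondition.</Paragraph>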
    <Paragraph position="13"> We said before that the interning of a marked paradigm is delayed until the roots of a partial regular TF are processed. As we shall see in the example below, the delay is justified by the possibility of altering the initial endings (hence the paradigm) in order to minimize the differences between the considered roots. A paradigm modification may appear if the last letter of each of the roots taken into account is transferred to the front of all their corresponding endings (recall the LA list at the beginning of this chapter). If the system finds no feature-based justification for root modification and if the difference between the roots is given by their last letters, it decides to transfer these "faulty" letters into the appropriate endings, thus "regularizing" the TF. As a side-effect, the initial paradigm is modified; if the new one is already known, the decision is considered sound and the older paradigm is forgotten. If the new paradigm is not known to the system, then both paradigms (the initial and the modified one) are kept until further evidence allows the system to choose between them. If no such evidence is obtained in favour of one or the other paradigm, it will be the task of the knowledge base designer to decide on the matter.</Paragraph>
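    <Paragraph> The transfer of the last root letters into the endings can be sketched on an LA-like list of (endings, root) groups. The function name is illustrative, and the endings in the sample data are assumed for the sake of the example; only the two roots are taken from the text.

```python
def transfer_last_letters(la):
    """la: list of (endings, root) groups, as in the LA list.

    Move the last letter of every root to the front of each of its endings.
    """
    return [([root[-1] + e for e in endings], root[:-1])
            for endings, root in la]


# Roots from the text; the endings are assumed, diacritic-free illustrations.
la = [([""], "fereastra"), (["", "i", "le", "lor"], "ferestre")]
moved = transfer_last_letters(la)
new_roots = [root for _, root in moved]
```

    After the transfer the roots become 'fereastr' and 'ferestr', and every ending has grown by one letter, so the paradigm built from these endings differs from the initial one.</Paragraph>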
    <Paragraph position="14"> Let us follow, on an example, the process of learning a root modification rule. Consider that the trainer provided the thematic family for the thematic word "fereastra" (window). The root detection process will generate the following segmentations:  Two roots are identified: 'fereastra' and 'ferestre'. According to the rule rs3) they are similar, with &lt;X&gt; and &lt;Y&gt; bound to 'fere' and 'str' respectively. The faulty letters are associated with their appearance context: &gt; e I a I s &lt;, &gt; r I a I and</Paragraph>
    <Paragraph position="16"> The first decision made in order to minimize the differences between the two roots is to transfer their last character into the endings, thus resulting in the segmentations:  A second step towards difference elimination is to consider the deletion of the letter 'a' between &lt;fere&gt; and &lt;str&gt;. But because this operation does not contribute to paradigm modification, it must be generalized (if possible) as a rule for root modification. By inspecting the morphological features of the word-forms, the system finds out that the root 'fereastr' is characterized by the feature values feminine, singular and nom-acc, while the root 'ferestr' is characterized in all its appearances only by the 'feminine' feature. Because the 'feminine' value is common to all word-forms of the thematic family, it is considered irrelevant with respect to root modification. Moreover, no word-form derivable from the 'ferestr' root has attached the "singular + nom-acc" feature value combination. Therefore, this is taken as a possible condition for the faulty letter deletion, and the synthesized rule is the following:</Paragraph>
    <Paragraph position="18"> The reading of this rule is: "If a root of a word-form which inflects according to the P00007 paradigm, in singular and nom-acc, contains the embedded string 'eas', then for all combinations of morphological features not containing both singular and nom-acc values, the 'eas' string is replaced by 'es'".</Paragraph>
    <Paragraph position="19"> Let us notice that the rule is more specific than it should be, imposing that all eligible words behave according to the P00007 paradigm and requiring the letter 's' to follow the diphthong 'ea'. But the system cannot infer more from this single example. If provided with another example, let's say 'ceapa' (onion), with a similar behaviour, the system synthesizes a rule very similar to RMR1:</Paragraph>
    <Paragraph position="21"> The only difference between RMR1 and RMR2 is the condition that the diphthong 'ea' must be followed by 's' and 'p' respectively. By considering this condition a particular one, the system drops it and obtains a more general rule subsuming both previous ones:</Paragraph>
    <Paragraph position="23"> The rule RMR3 is still too specific. The processing of the thematic family for the word 'seara' (evening) produces a further generalization of RMR3.</Paragraph>
    <Paragraph position="24"> Firstly, the system generates the following rule:</Paragraph>
    <Paragraph position="26"> The difference between RMR3 and RMR4 is the required flexioning paradigm (P00007 versus P00008). To generalize these rules, the system investigates the feature values of the two paradigms involved. Their common properties are SC = COMMON-NOUN, GENDER = FEMININE, so the system is able to propose a new rule subsuming the RMR3 and RMR4 rules:</Paragraph>
    <Paragraph position="28"> Because the correctness of generalization over incomplete data cannot be guaranteed, each synthesized rule has two associated lists, one containing positive examples (initially only the prototype root which generated the rule) and the other containing exceptions (initially empty).</Paragraph>
    <Paragraph position="29"> A similar point of view, that is, attaching exception lists to general rules, may be found in (Bear, 1988).</Paragraph>
    <Paragraph position="30"> The roots are entered into the root lexicon. For partial regular thematic families, the two or more roots are linked together bidirectionally. The first of them, in lexicographic order, is attached to the relevant common morpho-lexical information: paradigm name and the label for the semantic description. This information is inherited by all linked roots. There is also root-specific morphological information such as selectional restrictions and phonemic patterns. The selectional restrictions are contributed by the system and they refer to the constraints to be satisfied in order that a root be selected in word-form generation. For the regularly modifying roots, links to the rules they obey and the position(s) in the root where letter insertion or deletion is to be performed are also recorded in this field.</Paragraph>
    <Paragraph position="31"> The lexicon-building side-effect of the tutorial sessions is not the main interest of the research reported here (for this purpose we developed the MORPHO lexicon management system (Tufis, 1987a)). This feature was implemented only for testing the PARADIGM system in learning and using learnt knowledge. Also, we were interested in experimenting with some generation strategies at the level of morphology (for instance, choosing the least ambiguous or the most commonly used root from a synonymy set; see (Tufis, 1988)). It was possible, in this way, to test the functionality of PARADIGM without coupling it to MORPHO, an operation which would have required a greater programming effort. The embedding of PARADIGM into MORPHO is planned for the immediate future.</Paragraph>
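    <Paragraph> A root modification rule with its two associated lists might be represented as sketched below. The rule notation used by the system is not reproduced here, so the field names, the class name and the way the precondition is tested are assumptions; the sample rule follows the prose description of the most general rule (feminine common nouns, 'ea' replaced by 'e' except in singular nom-acc).

```python
from dataclasses import dataclass, field


@dataclass
class RootModificationRule:
    old: str                    # string to be replaced inside the root
    new: str                    # replacement string
    unless: frozenset           # feature values that block the rewrite
    paradigm_features: dict     # properties required of the paradigm
    positive: list = field(default_factory=list)    # roots known to obey
    exceptions: list = field(default_factory=list)  # roots known not to obey

    def apply(self, root, paradigm, features):
        """Return the root to use for a word-form with the given features."""
        if root in self.exceptions:
            return root
        if any(paradigm.get(k) != v for k, v in self.paradigm_features.items()):
            return root
        if self.unless.issubset(features):   # e.g. singular + nom-acc: keep root
            return root
        return root.replace(self.old, self.new, 1)


rule = RootModificationRule(
    old="ea", new="e",
    unless=frozenset({"singular", "nom-acc"}),
    paradigm_features={"SC": "COMMON-NOUN", "GENDER": "FEMININE"},
    positive=["fereastr"],
)
paradigm = {"SC": "COMMON-NOUN", "GENDER": "FEMININE"}   # hypothetical description
kept = rule.apply("fereastr", paradigm, {"feminine", "singular", "nom-acc"})
changed = rule.apply("fereastr", paradigm, {"feminine", "plural", "nom-acc"})
```

    In this sketch the root stays 'fereastr' for singular nom-acc and becomes 'ferestr' otherwise; adding a root to the exceptions list switches the rule off for that root without changing the rule itself.</Paragraph>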
    <Paragraph position="32"> At the end of the system's apprenticeship, a processing phase is activated which we call paradigmatic absorption. A paradigm Q1 may be absorbed into another paradigm Q2 iff: ab1) they describe the same sub-category; ab2) for each ending 'e1i' from Q1 and the corresponding ending 'e2i' from Q2 the following are true: 'e1i' is a suffix of 'e2i': &lt;e2i&gt; = &lt;x&gt;&lt;e1i&gt;, and the &lt;x&gt; prefix of 'e2i' exists as a suffix in all the roots in the lexicon which, from the flexioning point of view, behave according to Q1.</Paragraph>
    <Paragraph position="33"> The implementation of paradigm absorption is computationally motivated: firstly, by decreasing the number of paradigms, the search space is narrowed and consequently word-form processing time is improved; secondly, by lengthening the endings, they become more discriminating and therefore the ambiguity is reduced. In Romanian, the longer an ending, the less ambiguous its interpretation. For instance, the 'i' ending has 19 possible interpretations (in our model), while the ending 'ului' has only one. We think that this is a general property of inflexional languages and therefore we consider paradigmatic absorption not to be specific to Romanian. Paradigmatic absorption limits both types of ambiguity discussed earlier: intrinsic (due to different possible segmentations of a word) and extrinsic (due to the different interpretations an ending may have).</Paragraph>
    <Paragraph position="34"> In order to obtain complete morphological knowledge in a relatively short time, PARADIGM is accompanied by a merging utility program, (partially) able to unify two or more knowledge bases developed with different copies of the system.</Paragraph>
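    <Paragraph> The absorption test ab1) and ab2) can be sketched as a predicate over two paradigms. The sketch assumes that "corresponding" endings are those attached to the same point of P and that a single prefix x is shared by all endings of Q2; both are readings of the conditions, not statements about the implementation. The dictionary layout and the toy data are hypothetical.

```python
def can_absorb(q1, q2, roots_of_q1):
    """q1, q2: dicts with 'subcat' and 'endings' (a mapping point -> ending)."""
    if q1["subcat"] != q2["subcat"]:
        return False                          # ab1: same sub-category
    if q1["endings"].keys() != q2["endings"].keys():
        return False                          # endings must correspond pairwise
    prefixes = set()
    for point, e1 in q1["endings"].items():
        e2 = q2["endings"][point]
        if not e2.endswith(e1):
            return False                      # e1i must be a suffix of e2i
        prefixes.add(e2[: len(e2) - len(e1)])
    if len(prefixes) != 1:
        return False                          # assumption: one common prefix x
    x = prefixes.pop()
    # ab2: x must end every root that inflects according to Q1
    return all(root.endswith(x) for root in roots_of_q1)


# Toy paradigms with made-up point names and endings.
q1 = {"subcat": "COMMON-NOUN", "endings": {"p1": "", "p2": "i"}}
q2 = {"subcat": "COMMON-NOUN", "endings": {"p1": "r", "p2": "ri"}}
absorbable = can_absorb(q1, q2, ["calculator", "motor"])
```

    With the toy data Q1 can be absorbed into Q2 because every Q1 root ends in the prefix 'r'; a single Q1 root without that final letter is enough to block the absorption.</Paragraph>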
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5. FINAL REMARKS
</SectionTitle>
    <Paragraph position="0"> One of our earlier goals, some years ago, was to establish, by manual procedures, a reasonable set of flexioning paradigms for Romanian, in order to implement a reliable morphological processor, general enough to cover the requirements of technical texts. The task was taken on by seven colleagues with different backgrounds (linguists, logicians, engineers and mathematicians) and lasted for almost half a year (Cristea, 1982). I used the examples from the material written then in order to test the PARADIGM system. While differently organized, the equivalent (in linguistic coverage) knowledge base was obtained in a four-hour session. Moreover, the number of paradigms discovered by PARADIGM was smaller (97 paradigms versus 123). The rest were absorbed without any loss. By running test data on the manually discovered knowledge base and on the PARADIGM-acquired knowledge base we noted up to 10% improvement in analysis time. In hypothesizing the lexical status and morphological features of unknown words, based only on ending analysis, the PARADIGM-generated knowledge base was also better.</Paragraph>
    <Paragraph position="1"> A morphological knowledge base for Russian and another one for Spanish are under development. Experiments have also been made with French, Slovak and Hungarian. In the near future, we plan to develop the system in two important directions: - learning compound word-form rules (proclitic articulation of nouns and adjectives, verb compound tenses, degrees of comparison for adjectives); - learning lexical affixes (that is, meaning-modifying prefixes and suffixes (Tufis, 1988)).</Paragraph>
    <Paragraph position="2"> Related work is reported in (Wothke, 1986), (Trost, 1986), but they are either concerned with English (not a very exciting language from the morphological point of view) or address generation or analysis only. The very popular two-level morphology model of Koskenniemi (1983), intended primarily for derivational morphology, is, from our point of view, too expensive for grammatically oriented processing.</Paragraph>
    <Paragraph position="3"> Recent work reported in (Goertz, 1988), (Wothke, 1986), (Zock, 1988) shares some points with our approach.</Paragraph>
    <Paragraph position="4"> We consider that the main contributions of our work stem from the following features: - freedom in defining the categorial system for the model; - independence of a specific natural language, provided it is within our "root + ending" approach; - applicability of the synthesized rules both in analysing and generating word-forms; - possibility of rapid development of morphological knowledge bases, by merging the results of many parallel acquisition sessions; - duality of system behaviour (apprentice - expert), which allows immediate checking of the acquired knowledge; - low level of linguistic competence required of the trainers.</Paragraph>
  </Section>
</Paper>