<?xml version="1.0" standalone="yes"?>
<Paper uid="C86-1064">
  <Title>Synergy of syntax and morphology in automatic parsing of French language with a minimum of data: Feasibility study of the method</Title>
  <Section position="2" start_page="0" end_page="270" type="metho">
    <SectionTitle>
III The means:
</SectionTitle>
    <Paragraph position="0"> A. General method: The main method is pattern matching. The principle is the following: the form to recognize is compared to a set of patterns, until a match is found. But what exactly is a form in a natural language? The classical terms are &quot;morphology&quot; for the word, and &quot;syntax&quot; for the sentence. We could say that morphology is the shape of the word and syntax the shape of the sentence; moreover, we propose here to fill conceptually the gap between the level morphology-word and the level syntax-sentence: [...] = shape of the text.</Paragraph>
    <Paragraph position="1"> We can also remember that our habit of the written word, properly delimited by spaces or punctuation, makes us forget that the spoken string is a continuum that we cut while understanding: we use simultaneously morphological, syntactical, semantic and pragmatic information, with which we make deductions and inferences, run into deadlocks and use intuition (about the word see Tesnière page 27, §§ 11 to 15).</Paragraph>
    <Paragraph position="2"> It is possible to classify the pattern matching methods in two categories: the statistical and the structural methods (see Miclet and Fu). More precisely, we use here a string pattern matching method: every word of the sentence is replaced by its category, coded by a character, and by a question mark for unknown words: l'électricité cérébrale --&gt; d?? (d=determiner) &lt;the cerebral electricity&gt;; maladies mentales et lésions cérébrales --&gt; ??c?? (c=coordination) &lt;mental illnesses and cerebral lesions&gt;. Let us call this string the pattern by word; it will be used in the grammar and in the parsing. The information used in the parsing is composed of three types of data: a small lexicon of the words in finite number (about 80 forms), morphological deduction rules for each word, a set of patterns of the noun phrase for pattern matching, and of course the text to analyse itself.</Paragraph>
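    <Paragraph>The coding of a phrase into its pattern by word can be sketched in a few lines of Python (the work itself is implemented in UCSD Pascal). This is only a minimal sketch under stated assumptions: the LEXICON table holds a handful of illustrative forms, the name pattern_by_word is hypothetical, and every form is given a single category here, whereas in the parser a form such as et has several possible categories.

```python
# Hypothetical mini-lexicon: form -> one-character category code.
# Only a few of the roughly 80 forms are listed; the codes follow the text
# (d = determiner, c = co-ordination, p = preposition).
LEXICON = {
    "l'": "d", "le": "d", "la": "d", "les": "d", "un": "d", "une": "d",
    "et": "c", "ou": "c",
    "dans": "p", "sur": "p",
}

def pattern_by_word(phrase):
    """Replace every word by its category code, or '?' if it is unknown."""
    # Insert a space after apostrophes so that l' counts as one word.
    words = phrase.lower().replace("'", "' ").split()
    return "".join(LEXICON.get(word, "?") for word in words)

print(pattern_by_word("l'électricité cérébrale"))                  # d??
print(pattern_by_word("maladies mentales et lésions cérébrales"))  # ??c??
```

Any word absent from the table, here every noun and adjective, is coded as a question mark, which is what later steps try to resolve.</Paragraph>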
    <Paragraph position="3"> The first step of this study is the noun or prepositional phrase. The following steps are the recognition of these phrases in the sentence, and the whole parsing of the sentence.</Paragraph>
    <Paragraph position="4"> This work is implemented on an Apple Macintosh, and the programming language is UCSD Pascal, which is suitable for developing such a parser, whose algorithms are rarely recursive.</Paragraph>
    <Paragraph position="5"> B. The small lexicon: It contains about 80 forms (not lemmas): determiners (articles, possessive, demonstrative and indefinite adjectives), prepositions, coordinations, some punctuation signs (considered as words). These words are the anchoring points for pattern matching.</Paragraph>
    <Paragraph position="6">  But we realize that it is impossible to keep the general position of having only the words in finite number in this lexicon, and that the problem becomes pragmatic: what is the minimum of data necessary to recognize the other words correctly? We have added the first numeral adjectives, indefinite adjectives, some very common adjectives often placed before the noun (petit, autre, même), and adverbs not derived from an adjective (bien, mal, très). Every form has its possible categories, and possibly gender and number; the list of the possible categories can be open: a form can have another category in a particular sentence: le bien et le mal, le la de ma clarinette. For example, et can be: - conjunction which coordinates adjectives (b): valeur localisatrice et pronostique --&gt; ??b? &lt;localizing and prognostic value&gt; - conjunction which coordinates noun phrases (c): maladies mentales et lésions cérébrales --&gt; ??c?? &lt;mental illnesses and cerebral lesions&gt; - conjunction which coordinates nouns (e): création et renouvellement lexicaux --&gt; ?e?? &lt;lexical creation and renewing&gt; - conjunction which coordinates prepositional phrases (C): l'influence de l'inductance et de la capacité --&gt; d?pd?Cpd? &lt;the influence of inductance and capacity&gt; We have distinguished two categories of preposition according to the &quot;attraction&quot; between the two np: high attraction: le système d'unités &lt;the unit system&gt; (sort of compound noun), or low attraction: un chat sur un toit &lt;a cat on a roof&gt; (optional &quot;circumstant&quot;), whence two kinds of prepositional phrases: internal to the np: à de en (o), or external to the np: à de en sur dans chez vers (p). So de can be: - internal preposition (o) in: une théorie de la morphogénèse --&gt; d?od? &lt;a morphogenesis theory&gt;; le système d' unités international --&gt; d?o?? &lt;the international unit system&gt; - external preposition (p) in: de l'animal à l'homme --&gt; pd?pd? &lt;from animal to human&gt; - preposition (q) in: les différents moyens de faire les mesures --&gt; d??qid? &lt;the different ways to make measures&gt; C. The morphological approach: Our attitude is to explore all possibilities to extract or deduce information from the mere morphology of the word, without dictionaries, information which can be used at any time of the [...]. For example, let us observe the words ending with -ité: -icité -ivité -abilité -ibilité -ubilité -arité -alité; we have a regular alternation adjective/noun: électrique / électricité, combatif / combativité, portable / portabilité, particulier / particularité; from these endings, we can deduce that the word means a quality (semantic aspect) and is a singular feminine noun (category).</Paragraph>
    <Paragraph position="7"> At the semantic opposite, endings such as -ification and -isation suggest an action, for example: classification comes from the noun class(e) + suffix -ification, nationalisation comes from the adjective national + suffix -isation, climatisation &lt;air-conditioning&gt; comes from the noun climat + -isation; these words have been derived in the same way, with the same semantic aspect: the suffix -is- + -er (verbal ending) = -iser or -is- + -ation (noun ending) = -isation has the property of making a verb or a noun which expresses an action, from adjectives (national) or nouns (climat); words ending with -ification or -isation are always feminine nouns. In some cases, at first sight, the morphology does not give reliable information: a word ending in -ement can be an adverb (derived from an adjective) or a masculine noun: for example lâchement &lt;slackly&gt; is an adverb and relâchement &lt;slackening&gt; a masculine noun, but a more precise study brings the following information: -ément ==&gt; adverb, except 3 roots: agrément complément incrément, and except the word élément; -ûment ==&gt; adverb: assidûment; -ublement -iblement -ivement ==&gt; adverb derived from an adjective: indissolublement visiblement hâtivement; -oment -rment -gment ==&gt; noun: moment sarment fragment; -issement -ionnement ==&gt; noun derived from a verb: vagissement positionnement. Finally, since neological production uses these elements and these rules to create new words, neologisms are analysed exactly as the other words (see Guilbert and Kokourek).</Paragraph>
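    <Paragraph>The ending rules described above lend themselves to a simple table lookup. The following Python sketch is an illustration under stated assumptions, not the authors' Pascal code: the table layout and the names SUFFIX_RULES, NOUN_EXCEPTIONS and deduce are hypothetical, the masculine singular value given to the -ement family of nouns follows the general statement that such nouns are masculine, and the ambiguous bare ending -ement is deliberately left out.

```python
# Endings transcribed from the text. The table layout and the names
# SUFFIX_RULES, NOUN_EXCEPTIONS and deduce are illustrative assumptions.
# Each rule: (ending, category, gender, number); None means "not deduced".
SUFFIX_RULES = [
    ("ification", "noun", "f", "s"),
    ("isation", "noun", "f", "s"),
    ("icité", "noun", "f", "s"),
    ("ivité", "noun", "f", "s"),
    ("abilité", "noun", "f", "s"),
    ("ibilité", "noun", "f", "s"),
    ("ubilité", "noun", "f", "s"),
    ("arité", "noun", "f", "s"),
    ("alité", "noun", "f", "s"),
    ("issement", "noun", "m", "s"),
    ("ionnement", "noun", "m", "s"),
    ("oment", "noun", "m", "s"),
    ("rment", "noun", "m", "s"),
    ("gment", "noun", "m", "s"),
    ("ublement", "adverb", None, None),
    ("iblement", "adverb", None, None),
    ("ivement", "adverb", None, None),
    ("ément", "adverb", None, None),
    ("ûment", "adverb", None, None),
]

# Words in -ément that are nouns, not adverbs (three roots plus one word).
NOUN_EXCEPTIONS = ("agrément", "complément", "incrément", "élément")

def deduce(word):
    """Return (category, gender, number) deduced from the ending, or None."""
    word = word.lower()
    if word.endswith(NOUN_EXCEPTIONS):
        return ("noun", "m", "s")
    for ending, category, gender, number in SUFFIX_RULES:
        if word.endswith(ending):
            return (category, gender, number)
    return None

print(deduce("électricité"))   # ('noun', 'f', 's')
print(deduce("hâtivement"))    # ('adverb', None, None)
print(deduce("élément"))       # ('noun', 'm', 's')
```

A word for which the function returns None keeps its question mark until syntax can decide.</Paragraph>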
    <Paragraph position="8"> These morphological properties of each word are the second kind of anchoring points for pattern matching. D. The grammar: 1. the grammar of the complex noun or prepositional phrase: The phrase is considered as a three-level hierarchical structure (finite number of levels): the grammar is not recursive (on that point joining Tesnière and leaving Chomsky): phrase = complex noun or prepositional phrase, which is composed of simple noun phrases, which are composed of words or &quot;agglutinated&quot; words.</Paragraph>
    <Paragraph position="9"> A complex noun phrase (cnp) is: - either a simple noun phrase alone (G=snp) - or a train of simple noun phrases separated by: an external preposition (p=de, dans, pour), or a conjunction co-ordinating snp (c=et, ou), or a conjunction co-ordinating prepositional phrases (C=et, ou) and followed by a preposition (Cp=et de, ou avec), or a preposition preceding an infinitive (q=de, à, pour) or a present participle (r=-ant).</Paragraph>
    <Paragraph position="10"> These snp have between them relations of subordination or co-ordination. 2. the grammar of the simple noun phrase: A snp is a train of words obeying two types of constraints: - syntactical, when the phrase agrees with a dependency tree - morphological, which is usually named gender-number agreement. A pattern of a snp is a horizontal projection of a sub-stemma of a canonical stemma. Let us remember that a stemma (word introduced by Tesnière) is a dependency tree. The canonical stemma represents an abstract of [...].</Paragraph>
    <Paragraph position="11"> A sub-stemma is: - either the unchanged canonical stemma, - or the canonical stemma without a leaf, - or a sub-stemma without a leaf.</Paragraph>
    <Paragraph position="12"> A stemma is a two-dimensional diagram: the vertical dimension of the hierarchical levels and the horizontal dimension of the written words; a stemma can be horizontally projected to obtain the one-dimensional train of words as they are written. Here is an example of a possible canonical stemma of the snp, and its projection: [Figure: canonical stemma, projected as: det adv adv adj noun adv adv adj adv adv adj. Dependency relations: adj --&gt; noun (depends on). Projection relations: det --&gt; det (is projected as).] There is a snp pattern for every sub-stemma of the canonical stemma. The three canonical stemmas now used are equivalent to about 2000 rewriting rules.</Paragraph>
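    <Paragraph>The recursive definition of a sub-stemma (remove one leaf at a time) and its horizontal projection can be illustrated with a small Python sketch. The tree below is a hypothetical toy stemma chosen for brevity, not one of the three canonical stemmas of the parser, and the names STEMMA, leaves, sub_stemmas and projection are assumptions made for the illustration.

```python
# Hypothetical toy stemma (not the paper's canonical stemma): each node is
# (position, label, position of its governor); the noun is the root.
STEMMA = [(0, "det", 3), (1, "adv", 2), (2, "adj", 3), (3, "noun", None)]

def leaves(nodes):
    """Nodes on which no other node depends."""
    governors = {governor for _, _, governor in nodes}
    return [node for node in nodes if node[0] not in governors]

def sub_stemmas(nodes):
    """All stemmas reachable by removing leaves one at a time."""
    found = set()
    todo = [tuple(nodes)]
    while todo:
        current = todo.pop()
        if current in found:
            continue
        found.add(current)
        for leaf in leaves(current):
            if leaf[2] is not None:  # the root noun is never removed
                todo.append(tuple(n for n in current if n != leaf))
    return found

def projection(nodes):
    """Horizontal projection: the labels in written order."""
    return " ".join(label for _, label, _ in sorted(nodes))

for pattern in sorted(projection(s) for s in sub_stemmas(STEMMA)):
    print(pattern)
```

With this four-node tree the sketch yields six patterns; adv noun is not among them, because the adverb depends on the adjective and cannot survive its removal. The same combinatorics, applied to much larger canonical stemmas, is what makes a few stemmas stand for a large number of rewriting rules.</Paragraph>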
    <Paragraph position="13"> The &quot;agglutination&quot; rules are applied [...] from right to left and are the bottom-up aspect of the algorithm: - every adjective can be an agglutinated adjective (A) as the result of the co-ordination of several adjectives: ?b?-&gt;A Bb?-&gt;A ==&gt; ?=adjectives - in the forms: noun de noun or: noun à noun, we have internal prepositional phrases, working like adjectives (B), which are included in the snp, and processed as adjectives in the parsing of the snp: o?-&gt;B od?-&gt;B ==&gt; ?=noun.</Paragraph>
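    <Paragraph>The agglutination step can be read as string rewriting on the pattern by word. The Python sketch below is one possible reading, given as an assumption rather than as the authors' algorithm: the four rules are those quoted above, but the control strategy (rewrite the rightmost match, then start again from the right end) and the names RULES and agglutinate are hypothetical.

```python
# The four rules quoted in the text: (left-hand side, agglutinated code).
RULES = [("od?", "B"), ("?b?", "A"), ("Bb?", "A"), ("o?", "B")]

def agglutinate(pattern):
    """Rewrite the rightmost matching rule until no rule applies."""
    changed = True
    while changed:
        changed = False
        # Try every end position, starting from the right end of the pattern.
        for end in range(len(pattern), 0, -1):
            for left, code in RULES:
                if pattern.endswith(left, 0, end):
                    start = end - len(left)
                    pattern = pattern[:start] + code + pattern[end:]
                    changed = True
                    break
            if changed:
                break
    return pattern

print(agglutinate("??b?"))   # ?A
print(agglutinate("d?od?"))  # d?B
```

The first call corresponds to valeur localisatrice et pronostique, where the two co-ordinated adjectives collapse into a single A; the second to une théorie de la morphogénèse, where the internal prepositional phrase collapses into B.</Paragraph>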
    <Paragraph position="14"> We recognize here Tesnière's &quot;translation&quot; concept: the &quot;translation&quot; of the [...] (see Tesnière pages 443 and seq.). The prepositional phrases which can be considered as adjectives are not only preceded by: de à en, but potentially by every preposition, for example in: [...] par domaine ou lexicographique usuel ou selon la théorie [...] heuristique dans les graphes appliquée à la reconnaissance. We can remark here that two co-ordinated objects have fundamentally the same function, that is [...]. We shall now deduce linguistic information from the form of the snp. For example, if we analyse the form: un [unknown word] (d?), we can deduce that this unknown word is a masculine singular noun for two reasons: it matches the pattern: determiner - noun, and the whole snp inherits its gender and number from the determiner. Here is an ambiguous case: [unknown word 1] [unknown word 2] (??); we have here three solutions: either noun - adjective, or adjective - noun, or noun - apposed noun. It is often possible to decide by a morphological study of each word: l'électricité cérébrale --&gt; d?? and -icité ==&gt; singular feminine noun ==&gt; noun - adjective; une nouvelle conception --&gt; d?? and -tion ==&gt; singular feminine noun ==&gt; adjective - noun; les ondes alpha --&gt; d?? and no number agreement ==&gt; noun - apposed noun. In the texts now processed (tables of contents of scientific books), we have noun - adjective in about 97 % of cases, probably because French is a centrifugal language: the governor first, then the dependants (see Wagner et Pinchon page 155, Tesnière pages 33 and 147). For example, la linguistique informatique and l'informatique linguistique, which are morphologically ambiguous, are both understood by a native speaker as a form: noun - adjective.</Paragraph>
    <Paragraph position="15"> If we choose to obtain only one analysis to get one deduction, we must have an order of trial, barring syntactical or morphological impossibility: this order is now: noun - adjective, then adjective - noun, then noun - apposed noun, with three stemmas tried in this order.</Paragraph>
    <Paragraph position="16"> 3. some parsing difficulties: Wrong deductions upon the category occur in the cases where we have several possible analyses and when morphology does not imply the category: - what does et coordinate? valeur [(localisatrice) et (pronostique)] noun [(adjective) et (adjective)]; [(valeur localisatrice) et (pronostique)] [(snp) et (noun)]: two possible analyses according to whether et co-ordinates two snp or two adjectives; then pronostique can be analysed as noun or adjective; - if the pattern is ?? without any possible morphological deduction, the form noun - adjective will be chosen, and that may be wrong in some rare cases. But at the end of the parsing of the text, the lexicon is extracted and it is possible to consult it to reparse the ambiguous phrases.</Paragraph>
    <Paragraph position="17"> A. How syntax and morphology work together: In such a parser, the parsed language implies a parsing strategy: in French, syntax gives more information than morphology; by comparison, in English morphology is poor and syntax becomes more important, while in German morphology is richer because of declensions and the three genders. So, in French, [...] by morphology: - at the beginning, we look if it is possible to deduce the category, gender and number of the word, and the deduction is marked sure or not sure, for example: -icité ==&gt; feminine singular noun, sure (électricité); -ement ==&gt; masc. singular noun (enregistrement) or adverb (purement); -ant ==&gt; present participle, not sure (concernant or passant) - in the study of each snp, every category and some genders and numbers are known and the gender-number agreement is verified, for example: -al and adjective (deduced by syntax) ==&gt; masculine singular (principal); -ives and adjective ==&gt; feminine plural (qualitatives). If a snp does not agree in gender and number, the analysis fails and the next stemma is tested.</Paragraph>
    <Paragraph position="18"> B. General case: First, some replacements are made in the phrase submitted to the analysis, for example: a space is inserted after the apostrophes to isolate l' or d' as one word, autour de --&gt; autour.de (one word), du --&gt; de le, des --&gt; de les, au --&gt; à le, aux --&gt; à les.</Paragraph>
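    <Paragraph>These preliminary replacements amount to a small normalisation pass over the input string. A minimal Python sketch follows; the function name preprocess and the table name CONTRACTIONS are hypothetical, only the replacements listed above are covered, and lower-case input is assumed.

```python
# Contractions expanded before lexicon lookup, as listed in the text.
CONTRACTIONS = {"du": "de le", "des": "de les", "au": "à le", "aux": "à les"}

def preprocess(phrase):
    """Apply the replacements that precede the lexicon lookup."""
    # Insert a space after apostrophes so that l' and d' become words.
    phrase = phrase.replace("'", "' ")
    # Treat the compound preposition as one word (the trailing space
    # keeps "autour des" from being caught by this replacement).
    phrase = phrase.replace("autour de ", "autour.de ")
    return " ".join(CONTRACTIONS.get(word, word) for word in phrase.split())

print(preprocess("lésions du cerveau"))
# lésions de le cerveau
print(preprocess("l'activation par fermeture des yeux"))
# l' activation par fermeture de les yeux
```

After this pass every contracted article is visible as a separate determiner, which is why des yeux is coded od? in the patterns by word.</Paragraph>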
    <Paragraph position="19"> Then for each word, the lexicon is consulted, and if the word is not found, the first morphological study is made (see above), whence the set of the possible categories of each word; this set is sorted in the order of trial. Then the list of the possible patterns is made from the combinations of the possible categories of each word, and from contextual constraints on each letter of the pattern; these constraints are as severe as possible to reduce the number of combinations as much and as soon as possible: for example, for the phrase: évolution de l'électro-encéphalogramme d'un malade atteint de paralysie générale selon les effets du traitement, the number of possible patterns is reduced from 1250 to 8.</Paragraph>
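    <Paragraph>The construction of the candidate patterns can be sketched as a filtered Cartesian product. In the Python sketch below, the function name possible_patterns is hypothetical and the only contextual constraint encoded is the one stated in the worked example of this paper (C must be followed by a preposition p); the parser applies more constraints, and applies them as early as possible rather than after generating every combination as this simplified version does.

```python
from itertools import product

def possible_patterns(categories_per_word):
    """Enumerate patterns by word, dropping those that break a constraint."""
    patterns = []
    for combination in product(*categories_per_word):
        pattern = "".join(combination)
        # Pair every code with its right neighbour (a space after the last).
        pairs = zip(pattern, pattern[1:] + " ")
        # Constraint from the worked example: C (co-ordination of
        # prepositional phrases) must be followed by a preposition p.
        if any(code == "C" and following != "p" for code, following in pairs):
            continue
        patterns.append(pattern)
    return patterns

# valeur localisatrice et pronostique: et may be b, c or C.
print(possible_patterns(["?", "?", "bcC", "?"]))  # ['??b?', '??c?']
```

Each word contributes the string of its possible category codes, listed in the order of trial, so the resulting patterns come out in that order as well.</Paragraph>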
    <Paragraph position="20"> Then, each pattern is tested until the first successful analysis, except if there are possible adverbs, infinitives or present participles. In that case, a measure of the quality of the analysis is made to get only the best analysis. The test of a pattern is made in the following way: the pattern by snp is calculated: l'électricité cérébrale --&gt; d?? --&gt; G (G=snp); maladies mentales et lésions du cerveau --&gt; ??c?od? --&gt; GcG (o=preposition internal to snp); l'activation par fermeture des yeux --&gt; d?p?od? --&gt; GpG (the activation by closing eyes) (p=preposition external to snp).</Paragraph>
    <Paragraph position="21"> We verify that this pattern by snp can constitute a cnp (top-down aspect). The patterns by snp may be for example: G (snp), GcG (co-ordinated snp), GpG (subordinated snp), GrG (two snp separated by a present participle).</Paragraph>
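    <Paragraph>The passage from the pattern by word to the pattern by snp, and the top-down check that the result can be a cnp, can be approximated with two regular expressions. This Python sketch rests on simplifying assumptions: every maximal run of codes other than the separators c, p, C, q and r is taken to be one snp, the cnp grammar is reduced to the definition given in the grammar section, and the names SEPARATORS, pattern_by_snp, CNP and is_cnp are hypothetical.

```python
import re

# Codes that separate simple noun phrases inside a complex noun phrase.
SEPARATORS = "cpCqr"

def pattern_by_snp(pattern_by_word):
    """Collapse every maximal run of non-separator codes into one G."""
    return re.sub("[^" + SEPARATORS + "]+", "G", pattern_by_word)

# A cnp is one G, or several G joined by c, p, Cp, q or r.
CNP = re.compile(r"G(?:(?:Cp|c|p|q|r)G)*")

def is_cnp(snp_pattern):
    """Top-down check: can this pattern by snp constitute a cnp?"""
    return CNP.fullmatch(snp_pattern) is not None

print(pattern_by_snp("d??"))        # G
print(pattern_by_snp("??c?od?"))    # GcG
print(pattern_by_snp("d?p?od?"))    # GpG
print(is_cnp(pattern_by_snp("d?pd?Cpd?")))  # True
```

The internal preposition o stays inside its snp, which is why ??c?od? reduces to GcG and not to three G.</Paragraph>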
    <Paragraph position="22"> We try to apply the agglutination rules (bottom-up aspect: see above). Then we study each snp: - we test if it is possible to find a match with one of the three stemmas tested in the order: noun - adjective, then adjective - noun, then noun - apposed noun, whence a deduced or confirmed category (noun or adjective) for every question mark; - we test if we have a gender-number agreement between the governing noun and its possible dependent determiner and adjectives; this is done by a set intersection algorithm and by getting gender and number of the determiners from the lexicon, and by a morphological study (see above) of adjectives and nouns whose category has just been deduced.</Paragraph>
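    <Paragraph>The gender-number agreement test by set intersection can be sketched as follows. The representation is an assumption made for the illustration: each word carries the set of (gender, number) pairs it may take, an undetermined feature simply leaves several pairs in the set, and the names ALL_PAIRS and agreement are hypothetical.

```python
# Every (gender, number) pair; a word about which nothing is known
# would carry the whole set.
ALL_PAIRS = {("m", "s"), ("m", "p"), ("f", "s"), ("f", "p")}

def agreement(*possibilities):
    """Intersect the gender-number sets of the words of one snp."""
    common = set(ALL_PAIRS)
    for possible in possibilities:
        common = common.intersection(possible)
    return common

singular = {("m", "s"), ("f", "s")}
# valeur (singular), localisatrice (feminine singular), pronostique (singular)
print(agreement(singular, {("f", "s")}, singular))   # {('f', 's')}
# un (masculine singular) with a feminine plural adjective: no agreement
print(agreement({("m", "s")}, {("f", "p")}))          # set()
```

An empty result means the constraint is not satisfied, so the current stemma fails and the next one is tried; a non-empty result gives the gender and number inherited by the whole snp.</Paragraph>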
    <Paragraph position="23"> At any moment, if a constraint is not satisfied, the test of this pattern is stopped and the next one is tested.</Paragraph>
    <Paragraph position="24"> A bracketed phrase gives the history of the analysis.</Paragraph>
    <Paragraph position="25"> C. A parsing example: valeur localisatrice et pronostique process by word:</Paragraph>
  </Section>
  <Section position="3" start_page="270" end_page="270" type="metho">
    <SectionTitle>
? valeur
? localisatrice
</SectionTitle>
    <Paragraph position="0"> bcC et: et co-ordinates adjectives (b), snp (c) or prep. phrases (C); ? pronostique; C is impossible because et is not followed by a preposition; possible patterns: ??b? ??c?; test of ??b?: calculation of the pattern by snp: ??b? -&gt; G: possible complex phrase; agglutination: applicable rule: ?b? --&gt; A (co-ordinated adjectives); localisatrice: ?=adjective; pronostique: ?=adjective; bracketed structure: valeur +A(localisatrice +et +pronostique ); new pattern: ?A and end of agglutination; study of the single snp: syntactical constraint: ?A matches with the stemma 1 (noun-adjective); valeur: ?=noun; morphological constraints: valeur: singular (by morphological study); localisatrice: feminine singular (-trice by morphological study); pronostique: singular (by morphological study); gender-number agreement: feminine singular; this snp is correct and of course the cnp is correct: valeur &gt;noun/f/s can be adjective elsewhere; localisatrice &gt;adj/f/s can be noun elsewhere, qualifies valeur; pronostique &gt;adj/f/s can be noun elsewhere, qualifies valeur; bracketed structure: G( valeur +A(localisatrice +et +pronostique ) ); if we ask for all possible analyses, we get also: G( valeur +localisatrice ) + et + G( pronostique ). V Conclusion: In the texts now processed, tables of contents and diagrams in scientific books and articles (about 10 000 words), the recognition of categories is correct to 99 %, and the lexicon of the text is correctly extracted, but the deduction of the hierarchy of the snp and of relations between snp cannot be achieved only by using syntactical or morphological data, because semantic and pragmatic information is lacking.</Paragraph>
    <Paragraph position="1"> The original assumptions are verified: - it is possible to deduce categories of words by using pattern matching, with no dictionary and [...], by simultaneous use of syntactical and morphological information. - the concept of category is really a functional concept.</Paragraph>
  </Section>
</Paper>