File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/96/c96-2121_metho.xml
Size: 21,648 bytes
Last Modified: 2025-10-06 14:14:11
<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2121"> <Title>Saussurian analogy: a theoretical account and its application</Title> <Section position="3" start_page="0" end_page="717" type="metho"> <SectionTitle> 2 Saussurian analogy </SectionTitle> <Paragraph position="0"> In Chapter four, part III of the Cours de linguistique générale [1], Saussure points out what he calls analogy: given two forms of a given word, and only one form of a second word, it is possible to coin the missing form [2].</Paragraph> <Paragraph position="1"> [1] All examples in this section are from the Cours.</Paragraph> <Paragraph position="3"> In this particular case, Saussure was interested in explaining the competition of honor with the older form honos. Honor is not a phonetic transformation of honos by rhotacism, but simply the result of analogy.</Paragraph> <Paragraph position="4"> Analogy is very general, and its effects are seen in a number of other places. It may explain all flexional paradigms, from conjugation to declension [3].</Paragraph> <Paragraph position="5"> German: setzen : setzte = lachen : x, x = lachte. Analogy also explains what is called the productivity of language, i.e., the fact that understandable words can be coined which are not registered in dictionaries, and may never have been uttered before by the speaker nor heard before by the listener [4].</Paragraph> <Paragraph position="6"> French: réaction : réactionnaire = répression : x</Paragraph> <Paragraph position="8"> Finally, analogy also explains incorrect forms or barbarisms, examples of which are frequent in child language [5].</Paragraph> <Paragraph position="9"> French: éteindrai : éteindre = viendrai : x, x = viendre. Our goal is to give one possible account of this phenomenon in computational terms, and to show that, given a tree-bank, a possible application may be the analysis or generation of sentences.</Paragraph> <Paragraph position="10"> 
[2] orator (orator, speaker) and honor (honour) nominative singular, oratorem and honorem accusative singular.</Paragraph> <Paragraph position="12"> nouns, réactionnaire (reactionary) adjective; répressionnaire sounds perfectly French, but will not be found in a dictionary.</Paragraph> <Paragraph position="13"> [5] éteindre (to extinguish; to turn off) infinitive, éteindrai and viendrai future tense; viendre is a barbarism in place of venir (to come) (compare, in English, goed for went).</Paragraph> </Section> <Section position="4" start_page="717" end_page="719" type="metho"> <SectionTitle> 3 A possible account </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="717" end_page="717" type="sub_section"> <SectionTitle> 3.1 Notations </SectionTitle> <Paragraph position="0"> Let V be a non-empty finite set, called the vocabulary. (V*, .) is the monoid over V where . denotes concatenation. V* is also the infinite union of all V^n for n ∈ ℕ. By convention, V^0 = {ε} with ε being the empty string.</Paragraph> <Paragraph position="1"> Using these notations, analogies are equations with one unknown on V*: u : v = w : x. To be able to solve analogies, it is necessary to give a meaning to such a notation.</Paragraph> </Section> <Section position="2" start_page="717" end_page="717" type="sub_section"> <SectionTitle> 3.2 A geometrical view </SectionTitle> <Paragraph position="0"> In our attempt to discover a mathematical explanation of analogy, we were long hindered by the notation itself. Of course, the idea behind it is that analogy could be considered a psychological process similar to the one intervening in proportions:</Paragraph> <Paragraph position="2"> But ℚ, the set of rationals, is mathematically well equipped. Addition defines a commutative group, and multiplication makes it a field. Proportions in ℚ are thus well understood, and safely solved. What is true for ℚ is not for V*. 
The basic operation, concatenation, is not commutative and does not define a group, but a relaxed structure, that of a monoid. And no one knows what the meaning of u : v could possibly be.</Paragraph> <Paragraph position="3"> In fact, looking at analogy from the previous point of view is misleading because, intentionally or not, we think of numbers, which enforces too many constraints. A better, more relaxed view of the problem is that of a rectangle. In a rectangle, opposite sides and diagonals are equal (see</Paragraph> </Section> <Section position="3" start_page="717" end_page="718" type="sub_section"> <SectionTitle> 3.3 Formalisation </SectionTitle> <Paragraph position="0"> This view makes explicit that analogy sets a relation between an unknown on one hand, and three terms on the other hand. Now, carrying on with the geometrical parallel, analogy may be interpreted in terms of distances as follows: the distance of any term to the unknown is the same as the distance between the two remaining terms.</Paragraph> <Paragraph position="1"> We thus posit the following equivalence.</Paragraph> <Paragraph position="2"> Definition 1 (Analogy) $u : v = w : x \;\overset{\Delta}{\iff}\; \begin{cases} dist(u,v) = dist(w,x) \\ dist(u,w) = dist(v,x) \\ dist(v,w) = dist(u,x) \end{cases}$ The rectangle view does not forbid commutativity for dist, a notable difference with division on numbers, where 2/4 is not the same as 4/2.</Paragraph> <Paragraph position="3"> Let us linguistically interpret the previous system of equations. Suppose we get the following analogy to solve: mathematics : mathematical = physics : x. Of course, x = physical.</Paragraph> <Paragraph position="4"> The first two equations show that the terms on the diagonal may be exchanged. A linguistic interpretation is that analogy involves two orthogonal dimensions reflecting the duality of the lexeme/morpheme (or root/affix, or meaning/function, etc.) 
separation.</Paragraph> <Paragraph position="6"> On each side of the equal sign something is conserved (one dimension), and something changes (second dimension).</Paragraph> <Paragraph position="7"> * In the example, the first equation stands for a conservation in meaning (&quot;mathematics&quot; as opposed to &quot;physics&quot;) and a change in categories, * whereas the second equation stands for a conservation of grammatical categories (N as opposed to A), but a change in meanings.</Paragraph> <Paragraph position="8"> The third equation means that, somehow, analogy neutralises changes performed at the same time along the two previous dimensions.</Paragraph> <Paragraph position="9"> dist(mathematics, physical) = dist(physics, mathematical). On each side of the equality sign, both changes in meanings and categories, performed at the same time, leave the proportion unchanged.</Paragraph> <Paragraph position="10"> In order to complete the formalisation, dist remains to be defined. Edit distances, which have been proposed in many works (Levenshtein 65), (Wagner &amp; Fischer 74), (Selkow 77), etc., are a good candidate. They are mathematically sound as well as intuitively relevant: they reflect a sensible notion, that of keystrokes, and turn out to be metrics under some hypotheses. They answer the correction problem: what is the minimal number of edit operations needed to transform one word into another one? In our example, how many characters need to be changed to transform mathematical into physics? Edit operations are insertion (for instance, ε → p), deletion (like l → ε) and replacement (like a → s). A distance can be defined by assigning weights to these three operations, 1 for each of them, for simplification. 
The edit distance is then a simple extension from edit operations to strings.</Paragraph> <Paragraph position="11"> Definition 2 (Edit distance) Let V be a vocabulary. dist is defined on V* as a commutative operation, in the following way: ∀(a, b) ∈ V², ∀(u, v) ∈ (V*)²</Paragraph> <Paragraph position="13"> With this definition and a weight of 1 for each of the three edit operations, the distance between mathematical and physics becomes 9.</Paragraph> <Paragraph position="14"> dist(mathematical, physics) = 9. As a mathematical result, with more general weights, it can be proved that, if the edit operations define a metric on V ∪ {ε}, then the edit distance on V* is also a metric. We recall the formal definition of a metric.</Paragraph> <Paragraph position="15"> Definition 3 (Metric) Let S be a set, dist a function from S × S to ℝ⁺, the set of non-negative real numbers. dist is a metric on S if and only if</Paragraph> <Paragraph position="17"/> </Section> <Section position="4" start_page="718" end_page="719" type="sub_section"> <SectionTitle> 3.4 Coverage </SectionTitle> <Paragraph position="0"> Having defined what we understand by analogy in a formal way, we inspect some of its properties. We first make a very strong but necessary assumption about the nature of the solution of an analogy. Following the linguistic feeling, we impose that the solution of an analogy be built only with the elements of the vocabulary present in the three given terms. In other words, no material from outside should be used.</Paragraph> <Paragraph position="1"> This constraint does not prevent analogies from having multiple solutions. It suffices that the distances become too large relative to the lengths of the words; a : the = of : x is such a case. The constraint eliminates, for instance, all words of the form txy, with x and y two letters outside of the set {a, e, f, h, o, t}, but does not bar ttt, hhh, eee, which are solutions of this analogy. 
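As an illustration of Definitions 1 and 2 with unit weights, the following Python sketch checks the three distance equalities and enumerates, by brute force, the candidate solutions of a : the = of : x built only from the letters of the three given terms. The function names (edit_distance, is_analogy, solve_by_enumeration) and the length bound max_len are assumptions made for this sketch and are not part of the original formalisation; the exhaustive search is only meant for toy inputs.

```python
from itertools import product


def edit_distance(s, t):
    """Unit-cost edit distance (insertion, deletion, replacement cost 1 each)."""
    previous = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        current = [i]
        for j, b in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,             # delete a
                current[j - 1] + 1,          # insert b
                previous[j - 1] + (a != b),  # replace a by b (free if equal)
            ))
        previous = current
    return previous[-1]


def is_analogy(u, v, w, x):
    """Definition 1: the three distance equalities of the rectangle view."""
    d = edit_distance
    return (d(u, v) == d(w, x)
            and d(u, w) == d(v, x)
            and d(v, w) == d(u, x))


def solve_by_enumeration(u, v, w, max_len):
    """Hypothetical brute-force solver: try every string over the letters
    of u, v and w, up to max_len characters (exponential, toy use only)."""
    alphabet = sorted(set(u + v + w))
    solutions = []
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            x = "".join(letters)
            if is_analogy(u, v, w, x):
                solutions.append(x)
    return solutions


print(edit_distance("mathematical", "physics"))
print(is_analogy("mathematics", "mathematical", "physics", "physical"))
solutions = solve_by_enumeration("a", "the", "of", max_len=3)
print(all(x in solutions for x in ("ttt", "hhh", "eee")))
```

Under these assumptions the three lines printed are 9, True and True, in agreement with the distance quoted above and with the multiple solutions ttt, hhh and eee.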
But, as a matter of fact, this kind of example does not make much linguistic sense.</Paragraph> <Paragraph position="2"> A degenerate case of analogy is when two of the three terms are equal. The only possible solution is then the third term. In other words, nothing new can really be said. This meets common sense.</Paragraph> <Paragraph position="3"> ∀(u, v) ∈ (V*)², u : u = v : x ⇒ x = v. This property is always true. It is proved thanks to the equality property of a metric: u : u = v : x ⇒ dist(u, u) = 0 = dist(v, x) ⇒ x = v.</Paragraph> <Paragraph position="4"> Some important linguistic phenomena are covered by our proposal for linguistic examples. But the corresponding mathematical properties appear not to hold in the general case. In fact, studying the necessary and sufficient conditions under which they are true remains an open problem. It seems that, in all cases, it has to do with some &quot;weakest links&quot; along the pair of strings considered (minimisation of a sum of distances).</Paragraph> <Paragraph position="5"> An important property which works in many cases, and at least on linguistic examples, but may not be true in the general case [6], is transitivity: u : v = u' : v' ∧ u' : v' = w : x ⇒ u : v = w : x. This accounts for the fact that any representative in a group of conjugation/declension/etc. may be chosen as the model. In Ancient Greek, λόγος is always taken as a model for the declension of the 1st group of masculine nouns, although any other word from the same group would have been as good.</Paragraph> <Paragraph position="6"> Our definition of analogy fortunately captures linguistic cases where prefixes (or suffixes) are involved. u.t : u.v = w.t : x ⇒ x = w.v. This is not true in the general case. At least,</Paragraph> <Paragraph position="8"> thanks to a property of edit distances, which we give here without a proof: ∀(u, v, w) ∈ (V*)³, [6] Counter-example: the : ttt = a : of ∧ a : of = the : hhh 
⇏ the : ttt = the : hhh because</Paragraph> <Paragraph position="10"> dist(u.v, u.w) = dist(v, w). But the third equation may not always be verified. A sufficient condition for it to hold is that the joints between prefixes and suffixes minimise some sums of distances:</Paragraph> <Paragraph position="12"> This is the case when prefixes and suffixes are dissimilar enough, as in our example with mathemat-ics and phys-ical, but in the general case, only dist(u.v, w.t) ≤ dist(u, w) + dist(v, t) holds.</Paragraph> <Paragraph position="13"> Similarly to prefixing and suffixing, our formalisation accounts for linguistic examples of infixing, a phenomenon well illustrated by Semitic languages [7] (here, the replacement of an a by an i). Arabic: arsala : mursilun = aslama : x, x = muslimun. It also accounts for some (not all) examples of sound changes, like umlaut in German [8].</Paragraph> <Paragraph position="14"> German: Balg : Bälge = Hals : x</Paragraph> <Paragraph position="16"> These linguistic cases work partly thanks to the previous property of distances with prefixes.</Paragraph> </Section> <Section position="5" start_page="719" end_page="719" type="sub_section"> <SectionTitle> 3.4.5 Reduplication </SectionTitle> <Paragraph position="0"> Unfortunately, our proposal does not render an account of reduplication. This would be necessary if we wanted to describe, for example, the formation of plurals in Malay/Indonesian: orang → orang-orang [9]. Here, a speculative remark would link the power of analogy with some class of languages; our proposal seems not to go beyond regular languages.</Paragraph> </Section> </Section> <Section position="5" start_page="719" end_page="721" type="metho"> <SectionTitle> 4 Application </SectionTitle> <Paragraph position="0"> In the sequel, we will apply the principle of analogy not on words anymore, but on sentences. In the same way as words are strings of characters, sentences are strings of words. 
So, the shift from words to sentences is just a matter of reformulation. We also recall that edit distances and edit operations are not confined to strings; they extend in a natural way to forests, and hence to trees.</Paragraph> <Paragraph position="1"> In fact, it is possible to give a definition of an edit distance on forests which generalises the definitions on strings (Wagner &amp; Fischer 74) and on trees (Selkow 77). Hence the possibility of applying analogy to trees.</Paragraph> <Paragraph position="2"> [7] arsala (he sent) and aslama (he became converted) verbs 3rd person singular past; mursilun (a sender) and muslimun (a convert) agent nouns.</Paragraph> <Paragraph position="3"> [8] Balg (pelt, skin) and Hals (neck) singular; Bälge and Hälse plural.</Paragraph> <Paragraph position="4"> [9] orang (human being) singular, orang-orang plural.</Paragraph> <Paragraph position="5"> The example-based approach in machine translation, inaugurated by (Nagao 84) and illustrated by (Sadler and Vendelmans 90) or (Sato 90), for instance, relies on the assumption that, if two sentences are &quot;close&quot;, then their analyses should be &quot;close&quot; too. Consequently, if the analysis of a first sentence is known, the analysis of the second one could be obtained by performing slight &quot;modifications&quot; on it. A problem arises: where are the slight &quot;modifications&quot; to be performed, and what are they? 
In that matter, edit distances could help a lot: &quot;close&quot; means at a distance not too large, and &quot;modifications&quot; are edit operations.</Paragraph> <Section position="1" start_page="719" end_page="721" type="sub_section"> <SectionTitle> 4.1 Analysis by analogy </SectionTitle> <Paragraph position="0"> Suppose we have a collection of sentences (a data-base) already analysed (in fact, a tree-bank).</Paragraph> <Paragraph position="1"> For a new sentence, called the prototype, our goal is to build its analysis, i.e., a corresponding tree.</Paragraph> <Paragraph position="2"> Of course, the ideal case is when the prototype is already present in the tree-bank, which means that the analysis is found there too.</Paragraph> <Paragraph position="3"> In general, the prototype will not be found in the tree-bank. The search may thus be relaxed to similar sentences. Now, if at least two different sentences are retrieved by approximate matching, a fourth one can be built by analogy. Figure 2 illustrates this: the prototype is in the upper left corner; the two sentences on its right and under it have been obtained by approximate matching.</Paragraph> <Paragraph position="4"> Knowing the respective distances between these three sentences (on the arrows), sentence x can be computed by analogy.</Paragraph> <Paragraph position="5"> If by chance sentence x belongs to the tree-bank, its analysis is also in the tree-bank. Now, a reverse process on trees delivers an analysis for the prototype, as illustrated in Figure 3. The three trees in the right and bottom corners are the corresponding analyses of the sentences of Figure 2. They were taken from the tree-bank. The distances are given on the arrows. Tree y is the solution of the analogy, and we claim that it is the analysis of the prototype sentence.</Paragraph> <Paragraph position="6"> Approximate matching is retrieval of all sentences at a distance less than a threshold from a given prototype. 
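A naive reading of this definition of approximate matching can be sketched in Python as follows. The sketch scans the whole collection with a word-level, unit-cost edit distance; it is not the don't care symbol method used in the paper, only a baseline stating what is being retrieved. The function names, the threshold value and the toy sentences (modelled on the lamp/signal example of Figure 2, plus one unrelated sentence) are assumptions made for illustration.

```python
def word_distance(s, t):
    """Unit-cost edit distance between two sequences of words."""
    previous = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        current = [i]
        for j, b in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,             # delete word a
                current[j - 1] + 1,          # insert word b
                previous[j - 1] + (a != b),  # replace a by b (free if equal)
            ))
        previous = current
    return previous[-1]


def approximate_matches(prototype, sentences, threshold):
    """Return sorted (distance, sentence) pairs whose distance to the
    prototype is strictly below the threshold (linear scan, toy use only)."""
    words = prototype.split()
    scored = [(word_distance(words, s.split()), s) for s in sentences]
    return sorted(pair for pair in scored if threshold > pair[0])


# Hypothetical miniature collection standing in for a tree-bank.
toy_sentences = [
    "the lamp turns on",
    "the green signal is on",
    "the signal is off",
    "a completely unrelated sentence about something else",
]

for distance, sentence in approximate_matches(
        "the green lamp turns off", toy_sentences, threshold=4):
    print(distance, sentence)
```

With a threshold of 4 words, the three lamp/signal sentences are retrieved and the unrelated one is discarded; a real system would replace the linear scan by an indexed search.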
Efficient algorithms, using dynamic programming, have been proposed to perform approximate matching (Ukkonen 83) and (Landau &amp; Vishkin 88). Our method is somewhat different. We do as if we wanted to generate the entire set of sentences at a distance less than or equal to the threshold. In doing that, we introduce a don't care symbol representing any possible word. Pattern-matching with don't care symbols has already been studied (Pinter 85).</Paragraph> <Paragraph position="8"> Figure 2: Prototype (upper left corner: the green lamp turns off), sentences obtained by approximate matching (the lamp turns on; the green signal is on) and x, sentence obtained by analogy (x = the signal is off), and retrieved from the tree-bank. Distances are given in words. Figure 3: Distances are given in nodes.</Paragraph> <Paragraph position="9"> Of course, this naive solution implies an exponential explosion, but, fortunately, it is not necessary to consider the entire set of sentences, nor to generate them. Only sentences which are substrings of other strings may be coded. This allows us to use a simple non-deterministic version of the Aho &amp; Corasick algorithm (Aho &amp; Corasick 75), which only checks the possible presence of patterns on an array of integer triples. This algorithm competes well with one of the most efficient algorithms, agrep (Wu &amp; Manber 92), as it is faster on average.</Paragraph> <Paragraph position="10"> In a first implementation, rather than really computing solutions of analogies on trees, we retrieve them from the tree-bank using approximate matching. Execution times are below one second for the analysis of short chunks of text (about 5 words). This technique helps a lot in the construction of tree-banks. Firstly, building new linguistic structures for new sentences is definitely made faster. 
Secondly, this technique enforces consistency, a sensitive issue in tree-bank construction, especially if tree-banks are to be used to train probabilistic models.</Paragraph> <Paragraph position="11"> To have a more precise idea about the power of the method, we carried out some experiments on an excerpt of the tree-bank of the University of Pennsylvania (787 sentences with their corresponding analyses). For all possible 4-tuples of sentences which verify the analogy definition, we computed the analysis of the first sentence by analogy. We recall that there may be no solution, one solution, or several solutions. As a restriction in this experiment, we did not consider distances between objects over half of the lengths of the objects. Recall. In document retrieval, recall is defined as the ratio of the number of relevant documents retrieved over the total number of relevant documents in the data base. Here, we define the recall as the number of times the exact structure was computed by analogy, divided by the number of sentence pairs having the same structure in the tree-bank. In one experiment, the recall is 0.69, a quite good figure, which shows that the technique is promising.</Paragraph> <Paragraph position="12"> Precision. Again, in document retrieval, precision is defined as the ratio of the number of relevant documents retrieved over the total number of documents retrieved. Here, we define the precision as the number of times the exact structure was computed by analogy, divided by the number of solutions delivered.</Paragraph> <Paragraph position="13"> In the experiment, the precision is 0.43, which means that in almost half of the cases, one of the structures delivered is the right one. Now, on average, the structures delivered are far from the exact structure by 1.61 nodes, with a standard deviation of 1.86. 
This means that on average fewer than two nodes have to be edited in order to get the exact structure, the size of a structure in the tree-bank being 9.8 ± 5.4 nodes.</Paragraph> </Section> <Section position="2" start_page="721" end_page="721" type="sub_section"> <SectionTitle> 4.2 Generation by analogy </SectionTitle> <Paragraph position="0"> Generation may be performed in the same way as analysis, the difference being that the prototype is a tree and pattern-matching is performed on trees. The overall process is similar to the one for analysis, but in the opposite direction. The tool we have built for the editing of text with trees allows approximate matching on trees, and generation is performed using the same functions as for analysis.</Paragraph> </Section> </Section> </Paper>