<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0616">
  <Title>An Analogical Learner for Morphological Analysis</Title>
  <Section position="4" start_page="120" end_page="124" type="metho">
    <SectionTitle>
2 Principles of analogical learning
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="120" end_page="120" type="sub_section">
      <SectionTitle>
2.1 Analogical reasoning
</SectionTitle>
      <Paragraph position="0"> The ability to identify analogical relationships between what look like unrelated situations, and to use these relationships to solve complex problems, lies at the core of human cognition (Gentner et al., 2001). A number of models of this ability have been proposed, based on symbolic (e.g. (Falkenheimer and Gentner, 1986; Thagard et al., 1990; Hofstadter and the Fluid Analogies Research group, 1995)) or subsymbolic (e.g. (Plate, 2000; Holyoak and Hummel, 2001)) approaches. The main focus of these models is the dynamic process of analogy making, which involves the identification of a structural mapping between a memorized and a new situation. Structural mapping relates situations which, while being apparently very different, share a set of common high-level relationships. The building of a structural mapping between two situations utilizes several subparts of their descriptions and the relationships between them.</Paragraph>
      <Paragraph position="1"> Analogy-making seems to play a central role in our reasoning ability; it is also invoked to explain some human skills which do not involve any sort of conscious reasoning. This is the case for many tasks related to the perception and production of language: lexical access, morphological parsing, word pronunciation, etc. In this context, analogical models have been proposed as a viable alternative to rule-based models, and many implementations of these low-level analogical processes have been proposed, such as decision trees, neural networks or instance-based learning methods (see e.g. (Skousen, 1989; Daelemans et al., 1999)). These models share a conception of analogy which mainly relies on surface similarities between instances.</Paragraph>
      <Paragraph position="2"> Our learner tries to bridge the gap between these approaches and attempts to remain faithful to the idea of structural analogies, which prevails in the AI literature, while also exploiting the intuitions of large-scale, instance-based learning models.</Paragraph>
    </Section>
    <Section position="2" start_page="120" end_page="121" type="sub_section">
      <SectionTitle>
2.2 Analogical learning
</SectionTitle>
      <Paragraph position="0"> We consider the following supervised learning task: a learner is given a set S of training instances {X1,...,Xn}independently drawn from some unknown distribution. Each instance Xi is a vector containing m features: &lt;Xi1,...,Xim&gt; . Given S, the task is to predict the missing features of partially informed new instances. Put in more standard terms, the set of known (resp. unknown) features for a new value X forms the input space (resp. output space): the projections of X onto the input (resp. output) space will be denoted I(X) (resp. O(X)). This setting is more general than the simpler classification task, in which only one feature (the class label) is unknown, and covers many other interesting tasks.</Paragraph>
      <Paragraph position="1"> The inference procedure can be sketched as follows: training examples are simply stored for future use; no generalization (abstraction) of the data is performed, which is characteristic of lazy learning (Aha, 1997). Given a new instance X, we identify formal analogical proportions involving X in the input space; known objects involved in these proportions are then used to infer the missing features. An analogical proportion is a relation involving four objects A, B, C and D, denoted by A : B :: C : D, which reads A is to B as C is to D. The definition and computation of these proportions are studied in Section 3. For the moment, we contend that it is possible to construct analogical proportions between (possibly partially informed) objects in S. Let I(X) be a partially described object not seen during training. The analogical inference process is formalized as: 1. Construct the set T(X) ⊆ S³ defined as: T(X) = {(A, B, C) ∈ S³ | I(A) : I(B) :: I(C) : I(X)}; 2. For each (A, B, C) ∈ T(X), compute the hypotheses Ô(X) as the solutions of the analogical equation O(A) : O(B) :: O(C) : ?.</Paragraph>
      <Paragraph position="3"> This inference procedure bears many similarities with the k-nearest neighbors classifier (k-NN) which, given a new instance, (i) searches the training set for close neighbors, (ii) computes the unknown class label according to the neighbors' labels. Our model, however, does not use any metric between objects: we only rely on the definition of analogical proportions, which reveal systemic, rather than superficial, similarities. Moreover, inputs and outputs are regarded in a symmetrical way: outputs are not restricted to a set of labels, and can also be structured objects such as sequences. The implementation of the model still has to address two specific issues.</Paragraph>
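The store-then-infer procedure described above can be sketched in Python. This is a schematic rendering rather than the paper's implementation: the proportion test and the equation solver are pluggable arguments, and the defaults shown implement only the trivial set-level proportion (x : y :: z : t iff x = y and z = t, or x = z and y = t) from Section 3.1; all function names here are ours.

```python
from collections import Counter
from itertools import product

def trivial_proportion(a, b, c, d):
    # Base-case proportion in a plain set with no internal structure:
    # a : b :: c : d holds iff (a = b and c = d) or (a = c and b = d).
    return (a == b and c == d) or (a == c and b == d)

def trivial_solve(a, b, c):
    # Solutions of a : b :: c : ? under the trivial proportion.
    out = set()
    if a == b:
        out.add(c)
    if a == c:
        out.add(b)
    return out

def analogical_infer(train, x_in, is_proportion=trivial_proportion,
                     solve=trivial_solve):
    # train: list of (input, output) pairs; x_in: input projection I(X).
    # Step 1: collect triples (A, B, C) with I(A) : I(B) :: I(C) : I(X);
    # Step 2: solve O(A) : O(B) :: O(C) : ? and rank hypotheses by frequency.
    votes = Counter()
    for (ai, ao), (bi, bo), (ci, co) in product(train, repeat=3):
        if is_proportion(ai, bi, ci, x_in):
            for t in solve(ao, bo, co):
                votes[t] += 1
    return [t for t, _ in votes.most_common()]
```

With real word-level proportions plugged in, the same loop performs the analyses of Section 4; the trivial versions here only illustrate the control flow and the frequency-based ranking of hypotheses.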
      <Paragraph position="4"> * When exploring S³, an exhaustive search evaluates |S|³ triples, which can prove intractable. Moreover, objects in S may be unequally relevant, and we might expect the search procedure to treat them accordingly.</Paragraph>
      <Paragraph position="5"> * Whenever several competing hypotheses are proposed for Ô(X), a ranking must be performed. In our current implementation, hypotheses are ranked based on frequency counts.</Paragraph>
      <Paragraph position="6"> These issues are well-known problems for k-NN classifiers. The second one does not appear to be critical and is usually solved based on a majority rule. In contrast, a considerable amount of effort has been devoted to reduce and optimize the search process, via editing and condensing methods, as studied e.g. in (Dasarathy, 1990; Wilson and Martinez, 2000). Proposals for solving this problem are discussed in Section 3.4.</Paragraph>
      <Paragraph position="7"> 3 An algebraic framework for analogical proportions
Our inductive model requires the availability of a device for computing analogical proportions on feature vectors. We consider that an analogical proportion holds between four feature vectors when the proportion holds for all components. In this section, we propose a unified algebraic framework for defining analogical proportions between individual features.</Paragraph>
      <Paragraph position="8"> After giving the general definition, we present its instantiation for two types of features: words over a finite alphabet and sets of labelled trees.</Paragraph>
    </Section>
    <Section position="3" start_page="121" end_page="122" type="sub_section">
      <SectionTitle>
3.1 Analogical proportions
</SectionTitle>
      <Paragraph position="0"> Our starting point will be analogical proportions in a set U, which we define as follows: ∀x, y, z, t ∈ U, x : y :: z : t if and only if either x = y and z = t, or x = z and y = t. In the sequel, we assume that U is additionally provided with an associative internal composition law ⊕, which makes (U, ⊕) a semigroup. The generalization of proportions to semigroups involves two key ideas: the decomposition of objects into smaller parts, subject to alternation constraints. To formalize the idea of decomposition, we define the factorization of an element u in U as: Definition 1 (Factorization) A factorization of u ∈ U is a sequence u1 ... un, with ∀i, ui ∈ U, such that: u1 ⊕ ... ⊕ un = u.</Paragraph>
      <Paragraph position="1"> Each term ui is a factor of u.</Paragraph>
      <Paragraph position="2"> The alternation constraint expresses the fact that analogically related objects should be made of alternating factors: for x : y :: z : t to hold, each factor in x should be found alternatively in y and in z. This yields a first definition of analogical proportions: Definition 2 (Analogical proportion) (x, y, z, t) ∈ U⁴ form an analogical proportion, denoted by x : y :: z : t, if and only if there exist some factorizations x1 ⊕ ... ⊕ xd = x, y1 ⊕ ... ⊕ yd = y, z1 ⊕ ... ⊕ zd = z, t1 ⊕ ... ⊕ td = t such that ∀i, (yi, zi) ∈ {(xi, ti), (ti, xi)}. The smallest d for which such factorizations exist is termed the degree of the analogical proportion.</Paragraph>
      <Paragraph position="3"> This definition is valid for any semigroup, and a fortiori for any richer algebraic structure. Thus, it readily applies to the case of groups, vector spaces, free monoids, sets and attribute-value structures.</Paragraph>
    </Section>
    <Section position="4" start_page="122" end_page="123" type="sub_section">
      <SectionTitle>
3.2 Words over Finite Alphabets
3.2.1 Analogical Proportions between Words
</SectionTitle>
      <Paragraph position="0"> Let Σ be a finite alphabet. Σ* denotes the set of finite sequences of elements of Σ, called words over Σ. Σ*, provided with the concatenation operation ·, is a free monoid whose identity element is the empty word ε.</Paragraph>
      <Paragraph position="1"> For w ∈ Σ*, w(i) denotes the ith symbol in w. In this context, definition (2) can be restated as: Definition 3 (Analogical proportion in (Σ*, ·)) (x, y, z, t) ∈ (Σ*)⁴ form an analogical proportion, denoted by x : y :: z : t, if and only if there exist some integer d and some factorizations x1 ... xd = x, y1 ... yd = y, z1 ... zd = z, t1 ... td = t such that ∀i, (yi, zi) ∈ {(xi, ti), (ti, xi)}.</Paragraph>
      <Paragraph position="2"> An example of analogy between words is: wolf : leaf :: wolves : leaves.</Paragraph>
      <Paragraph position="4"> This definition generalizes the proposal of (Lepage, 1998). It does not ensure the existence of a solution to an analogical equation, nor its uniqueness when it exists. (Lepage, 1998) gives a set of necessary conditions for a solution to exist. These conditions also apply here. In particular, if t is a solution of x : y :: z : ?, then t contains, in the same relative order, all the symbols in y and z that are not in x. As a consequence, all solutions of an equation have the same length.</Paragraph>
      <Paragraph position="5">  Definition (3) yields an efficient procedure for solving analogical equations, based on finite-state transducers. The main steps of the procedure are sketched here. A full description can be found in (Yvon, 2003). To start with, let us introduce the notions of complementary set and shuffle product.</Paragraph>
      <Paragraph position="6"> Complementary set If v is a subword of w, the complementary set of v with respect to w, denoted by w\v, is the set of subwords of w obtained by removing from w, in a left-to-right fashion, the symbols in v. For example, eea belongs to the complementary set of xmplr with respect to exemplar. When v is not a subword of w, w\v is empty. This notion can be generalized to any regular language.</Paragraph>
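For short words, the complementary set can be computed by a small recursion that enumerates every left-to-right embedding of v in w; this brute-force sketch is ours, standing in for the transducer Tw mentioned below.

```python
def complement(w, v):
    # w \ v: the set of subwords of w left over after removing the
    # symbols of v, matched left to right as a subsequence of w.
    if not v:
        return {w}
    if not w:
        return set()
    # Keep w[0] in the result, matching all of v in the remainder...
    out = {w[0] + r for r in complement(w[1:], v)}
    # ...or consume w[0] as the next symbol of v, when they agree.
    if w[0] == v[0]:
        out |= complement(w[1:], v[1:])
    return out
```

On the example above, complement("exemplar", "xmplr") yields the singleton {"eea"}; when v is not a subword of w, the result is empty.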
      <Paragraph position="7"> The complementary set of v with respect to w is a regular set: it is the output language of the finite-state transducer Tw (see Figure 1) for the input v.</Paragraph>
      <Paragraph position="9"> Shuffle The shuffle u*v of two words u and v is introduced e.g. in (Sakarovitch, 2003) as follows:</Paragraph>
      <Paragraph position="11"> The shuffle of two words u and v contains all the words w which can be composed using all the symbols in u and v, subject to the condition that if a symbol a precedes a symbol b in u (or in v), then a precedes b in w.</Paragraph>
      <Paragraph position="12"> Taking, for instance, u = abc and v = def, the words abcdef, abdefc, adbecf are in u*v; this is not the case with abefcd. This operation generalizes straightforwardly to languages. The shuffle of two regular languages is regular (Sakarovitch, 2003); the automaton A, computing K*L, is derived from the automata AK = (Σ, QK, q0K, FK, δK) and AL = (Σ, QL, q0L, FL, δL).</Paragraph>
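The shuffle product can likewise be computed by a small recursion for illustration purposes; this is exponential in general and only usable on short words, which is precisely the blow-up the automaton construction avoids.

```python
def shuffle(u, v):
    # u * v: all interleavings of u and v that preserve the relative
    # order of the symbols within each word.
    if not u:
        return {v}
    if not v:
        return {u}
    return ({u[0] + w for w in shuffle(u[1:], v)} |
            {v[0] + w for w in shuffle(u, v[1:])})
```

For u = abc and v = def the result contains the 20 interleavings of the two words, including abcdef, abdefc and adbecf, but not abefcd.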
      <Paragraph position="14"> The notions of complementary set and shuffle are related through the following property, which is a direct consequence of the definitions.</Paragraph>
      <Paragraph position="15"> w ∈ u * v ⇔ u ∈ w\v. Solving analogical equations The notions of shuffle and complementary sets yield another characterization of analogical proportion between words, based on the following proposition:</Paragraph>
      <Paragraph position="17"> An analogical proportion is thus established if the symbols in x and t are also found in y and z, and appear in the same relative order. A corollary follows: t is a solution of x : y :: z : ? ⇔ t ∈ (y * z)\x. The set of solutions of an analogical equation x : y :: z : ? is a regular set, which can be computed with a finite-state transducer. It can also be shown that this analogical solver generalizes the approach based on edit distance proposed in (Lepage, 1998).</Paragraph>
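Combining the two notions gives a direct, brute-force equation solver: enumerate the shuffle of y and z, then remove x. This naive sketch (ours) follows the corollary literally and is far less efficient than the finite-state construction, but it is enough to check small examples.

```python
def shuffle(u, v):
    # All order-preserving interleavings of u and v.
    if not u:
        return {v}
    if not v:
        return {u}
    return ({u[0] + w for w in shuffle(u[1:], v)} |
            {v[0] + w for w in shuffle(u, v[1:])})

def complement(w, v):
    # w \ v: subwords of w left after removing v as a subsequence.
    if not v:
        return {w}
    if not w:
        return set()
    out = {w[0] + r for r in complement(w[1:], v)}
    if w[0] == v[0]:
        out |= complement(w[1:], v[1:])
    return out

def solve_analogy(x, y, z):
    # Solutions of x : y :: z : ?, computed as (y * z) \ x.
    sols = set()
    for w in shuffle(y, z):
        sols |= complement(w, x)
    return sols
```

Every solution has length |y| + |z| - |x|, in line with the remark above that all solutions of an equation share the same length.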
    </Section>
    <Section position="5" start_page="123" end_page="123" type="sub_section">
      <SectionTitle>
3.3 Trees
</SectionTitle>
      <Paragraph position="0"> Labelled trees are very common structures in NLP tasks: they can represent syntactic structures, or terms in a logical representation of a sentence. To express the definition of analogical proportion between trees, we introduce the notion of substitution.</Paragraph>
      <Paragraph position="1"> Definition 4 (Substitution) A (single) substitution is a pair (variable, tree).</Paragraph>
      <Paragraph position="2"> The application of the substitution (v, t′) to a tree t consists in replacing each leaf of t labelled by v by the tree t′. The result of this operation is denoted t(v ← t′). For each variable v, we define the binary operator ◁v as t ◁v t′ = t(v ← t′). Definition 2 can then be extended as: Definition 5 (Analogical proportion (trees)) (x, y, z, t) ∈ U⁴ form an analogical proportion, denoted by x : y :: z : t, iff there exist some variables (v1, ..., vn−1) and some factorizations x1 ◁v1 ... ◁vn−1 xn = x, y1 ◁v1 ... ◁vn−1 yn = y, z1 ◁v1 ... ◁vn−1 zn = z, t1 ◁v1 ... ◁vn−1 tn = t such that ∀i, (yi, zi) ∈ {(xi, ti), (ti, xi)}.</Paragraph>
      <Paragraph position="3"> An example of such a proportion is illustrated in Figure 2 with syntactic parse trees.</Paragraph>
      <Paragraph position="4"> This definition yields an effective algorithm computing analogical proportions between trees (Stroppa and Yvon, 2005). We consider here a simpler heuristic approach, consisting in (i) linearizing labelled trees into parenthesized sequences of symbols and (ii) using the analogical solver for words introduced above. This approach yields a faster, albeit approximate, algorithm, which makes analogical inference tractable even for large tree databases.</Paragraph>
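The linearization step of this heuristic can be as simple as a preorder traversal producing a parenthesized string, after which the word-level solver is applied to the resulting sequences. The tuple encoding below (label followed by children; leaves are strings) is our own convention, not the paper's.

```python
def linearize(tree):
    # Flatten a labelled tree into a parenthesized sequence of symbols,
    # so that the analogical solver for words can be reused on trees.
    if isinstance(tree, str):
        return tree
    head, rest = tree[0], tree[1:]
    return "(" + " ".join([head] + [linearize(c) for c in rest]) + ")"
```

For example, a small parse tree linearizes to "(NP (Det the) (N cat))"; proportions are then sought between such strings rather than between the trees themselves, which is what makes the method approximate.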
    </Section>
    <Section position="6" start_page="123" end_page="124" type="sub_section">
      <SectionTitle>
3.4 Algorithmic issues
</SectionTitle>
      <Paragraph position="0"> We have seen how to compute analogical relationships for features whose values are words and trees.  If we use, for trees, the solver based on tree linearizations, the resolution of an equation amounts, in both cases, to solving analogies on words.</Paragraph>
      <Paragraph position="1"> The learning algorithm introduced in Section 2.2 is a two-step procedure: a search step and a transfer step. The latter step only involves the resolution of (a restricted number of) analogical equations. When x, y and z are known, solving x : y :: z : ? amounts to computing the output language of the transducer representing (y * z)\x: the automaton for this language has a number of states bounded by |x| × |y| × |z|. Given the typical length of words in our experiments, and given that the worst-case exponential bound for determinizing this automaton is hardly met, the solving procedure is quite efficient.</Paragraph>
      <Paragraph position="2"> The problem faced during the search procedure is more challenging: given x, we need to retrieve all possible triples (y,z,t) in a finite set L such that x : y :: z : t. An exhaustive search requires the computation of the intersection of the finite-state automaton representing the output language of (L*L)\x with the automaton for L. Given the size of L in our experiments (several hundred thousand words), a complete search is intractable and we resort to the following heuristic approach.</Paragraph>
      <Paragraph position="3"> L is first split into K bins {L1, ..., LK}, with |Li| small with respect to |L|. We then randomly select k bins and compute, for each bin Li, the output language of (Li*Li)\x, which is then intersected with L: we thus only consider triples containing at least two words from the same bin. It has to be noted that the bins are not randomly constructed: training examples are grouped into inflectional or derivational families. To further speed up the search, we also impose an upper bound on the degree of proportions.</Paragraph>
      <Paragraph position="4"> All triples retrieved during these k partial searches are then merged and considered for the transfer step.</Paragraph>
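The bin-based search can be sketched as follows. The brute-force solver below is a stand-in for the transducer computing (y * z)\x, the degree bound on proportions is omitted, and all names are ours; in the paper, bins group inflectional or derivational families and the per-bin computation is done with automata rather than explicit enumeration.

```python
import random

def _shuffle(u, v):
    # Order-preserving interleavings of u and v (brute force).
    if not u:
        return {v}
    if not v:
        return {u}
    return ({u[0] + w for w in _shuffle(u[1:], v)} |
            {v[0] + w for w in _shuffle(u, v[1:])})

def _complement(w, v):
    # w \ v: subwords of w left after removing v as a subsequence.
    if not v:
        return {w}
    if not w:
        return set()
    out = {w[0] + r for r in _complement(w[1:], v)}
    if w[0] == v[0]:
        out |= _complement(w[1:], v[1:])
    return out

def solve(x, y, z):
    # Brute-force stand-in for the transducer computing (y * z) \ x.
    sols = set()
    for w in _shuffle(y, z):
        sols |= _complement(w, x)
    return sols

def binned_search(x, bins, k, lexicon):
    # Sample k bins; within each bin, consider pairs (y, z) and keep
    # the triples (y, z, t) whose solution t belongs to the lexicon L.
    triples = set()
    for bin_ in random.sample(bins, min(k, len(bins))):
        for y in bin_:
            for z in bin_:
                for t in solve(x, y, z):
                    if t in lexicon:
                        triples.add((y, z, t))
    return triples
```

With bins built from morphological families, a query such as x = leaves against the bin containing leaf and leaves retrieves triples like (leaf, leaves, leaf); the triples found across the k sampled bins are then handed to the transfer step.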
      <Paragraph position="5"> The computation of analogical relationships has been implemented in a generic analogical solver; this solver is based on Vaucanson, an automata manipulation library using high performance generic programming (Lombardy et al., 2003).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="124" end_page="125" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="124" end_page="125" type="sub_section">
      <SectionTitle>
4.1 Methodology
</SectionTitle>
      <Paragraph position="0"> The main purpose of these experiments is to demonstrate the flexibility of the analogical learner. We considered two different supervised learning tasks, both aimed at performing the lexical analysis of isolated word forms. Each of these tasks represents a possible instantiation of the learning procedure introduced in Section 2.2.</Paragraph>
      <Paragraph position="1"> The first experiment consists in computing one or several vector(s) of morphosyntactic features to be associated with a form. Each vector comprises the lemma, the part-of-speech, and, based on the part-of-speech, additional features such as number, gender, case, tense, mood, etc. An (English) input/output pair for this task thus looks like: input=replying; output={reply; V-pp--}, where the placeholder '-' denotes irrelevant features. Lexical analysis is useful for many applications: a POS tagger, for instance, needs to "guess" the possible part(s)-of-speech of unknown words (Mikheev, 1997). For this task, we use the definition of analogical proportions for "flat" feature vectors (see section 3.1) and for word strings (section 3.2). The training data is a list of fully informed lexical entries; the test data is a list of isolated word forms not represented in the lexicon. Bins are constructed based on inflectional families.</Paragraph>
      <Paragraph position="2"> The second experiment consists in computing a morphological parse of unknown lemmas: for each input lemma, the output of the system is one or several parse trees representing a possible hierarchical decomposition of the input into (morphologically categorized) morphemes (see Figure 3). This kind of analysis makes it possible to reconstruct the series of morphological operations deriving a lemma, to compute its root, its part-of-speech, and to identify morpheme boundaries. This information is required, for instance, to compute the pronunciation of an unknown word; or to infer the compositional meaning of a complex (derived or compound) lemma. Bins gather entries sharing a common root.</Paragraph>
      <Paragraph position="3"> (Figure 3 caption, partially recovered: morphemes have a compositional type; B|A. denotes a suffix that turns adjectives into adverbs.)</Paragraph>
      <Paragraph position="4"> These experiments use the English, German, and Dutch morphological tables of the CELEX database (Burnage, 1990). For task 1, these tables contain respectively 89 000, 342 000 and 324 000 different word forms, and the number of features to predict is respectively 6, 12, and 10. For task 2, which was only conducted with English lemmas, the total number of different entries is 48 407.</Paragraph>
      <Paragraph position="5"> For each experiment, we perform 10 runs, using 1 000 randomly selected entries for testing (see footnote 1). Generalization performance is measured as follows: the system's output is compared with the reference values (due to lexical ambiguity, a form may be associated in the database with several feature vectors or parse trees). Per instance precision is computed as the relative number of correct hypotheses, i.e.</Paragraph>
      <Paragraph position="6"> hypotheses which exactly match the reference: for task 1, all features have to be correct; for task 2, the parse tree has to be identical to the reference tree.</Paragraph>
      <Paragraph position="7"> Per instance recall is the relative number of reference values that were actually hypothesized. Precision and recall are averaged over the test set; numbers reported below are averaged over the 10 runs.</Paragraph>
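Under the exact-match criterion described here, per-instance precision and recall reduce to comparisons between the set of hypotheses and the set of reference analyses; a minimal sketch with our own naming:

```python
def per_instance_scores(hypotheses, references):
    # hypotheses, references: sets of predicted / gold analyses for one
    # form; an analysis counts only if it exactly matches a reference.
    correct = len(hypotheses.intersection(references))
    precision = correct / len(hypotheses) if hypotheses else 0.0
    recall = correct / len(references) if references else 0.0
    return precision, recall

def averaged(pairs):
    # pairs: list of (hypotheses, references), one entry per test form;
    # precision and recall are averaged over the test set.
    scores = [per_instance_scores(h, r) for h, r in pairs]
    p = sum(s[0] for s in scores) / len(scores)
    r = sum(s[1] for s in scores) / len(scores)
    return p, r
```

For task 1 the compared items would be full feature vectors, for task 2 whole parse trees; averaging these pairs over the 10 runs gives the figures reported below.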
      <Paragraph position="8"> Various parameters affect the performance: k, the number of randomly selected bins considered during the search step (see Section 3.4), and d, the upper bound of the degree of extracted proportions. (Footnote 1: Due to lexical ambiguity, the number of tested instances is usually greater than 1 000.)</Paragraph>
    </Section>
    <Section position="2" start_page="125" end_page="125" type="sub_section">
      <SectionTitle>
4.2 Experimental results
</SectionTitle>
      <Paragraph position="0"> Experimental results for task 1 are given in Tables 1, 2 and 3. For each main category, two recall and precision scores are computed: one for the sole lemma and POS attributes (left column); and one for the lemma and all the morpho-syntactic features (on the right). In these experiments, parameters are set as follows: k = 150 and d = 3. As k grows, both recall and precision increase (up to a limit); k = 150 appears to be a reasonable trade-off between efficiency and accuracy. A further increase of d does not significantly improve accuracy: taking d = 3 or d = 4 yields very comparable results.</Paragraph>
      <Paragraph position="1">  As a general comment, one can note that high generalization performance is achieved for languages and categories involving rich inflectional paradigms: this is exemplified by the performance on all German categories. English adjectives, at the other end of this spectrum, are very difficult to analyze. A simple and effective workaround for this problem consists in increasing the size of the sub-lexicons (Li in Section 3.4) so as to incorporate in a given bin all the members of the same derivational (rather than inflectional) family. For Dutch, these results are comparable with those reported in (van den Bosch and Daelemans, 1999), who report an accuracy of about 92% on the task of predicting the main syntactic category.</Paragraph>
      <Paragraph position="3">  The second task is more challenging since the exact parse tree of a lemma must be computed. For morphologically complex lemmas (involving affixation or compounding), it is nevertheless possible to obtain acceptable results (see Table 4), showing that some derivational phenomena have been captured.</Paragraph>
      <Paragraph position="4"> Further analysis is required to assess more precisely the potential of this method.</Paragraph>
      <Paragraph position="5"> From a theoretical perspective, it is important to realize that our model does not commit us to a morpheme-based view of morphological processes. This is obvious in task 1; and even if task 2 aims at predicting a morphematic parse of input lemmas, this goal is achieved without segmenting the input lemma into smaller units. For instance, our learner parses the lemma enigmatically as: [[[.N enigma][.A|N ical]]B|A.ly], that is, without trying to decide to which morph the orthographic t should belong. In this model, input and output spaces are treated symmetrically and correspond to distinct levels of representation.</Paragraph>
    </Section>
  </Section>
</Paper>