<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0605">
  <Title>Using eigenvectors of the bigram graph to infer morpheme identity</Title>
  <Section position="5" start_page="2" end_page="2" type="metho">
    <SectionTitle>
2 Method
</SectionTitle>
    <Paragraph position="0"> The construction of the nearest-neighbor graph is a process which allows for many linguistic and practical choices. Some of these we have experimented with, and others we have not, simply using parameter values that seemed to us to be reasonable. Our goal is to develop a graph in which vertices represent words, and edges represent pairs of words whose distribution in a corpus is similar. We then develop a representation of the graph by a symmetric matrix, and compute a small number of the eigenvectors of the normalized laplacian for  We are grateful to Yali Amit for drawing our attention to Shi and Malik 1997, to Partha Niyogi for helpful comments throughout the development of this material, and to Jessie Pinkham for suggestions on an earlier draft of this paper.</Paragraph>
    <Paragraph position="1"> July 2002, pp. 41-47. Association for Computational Linguistics.</Paragraph>
  </Section>
  <Section position="6" start_page="2" end_page="2" type="metho">
    <SectionTitle>
ACL Special Interest Group in Computational Phonology (SIGPHON), Philadelphia,
</SectionTitle>
    <Paragraph position="0"> Morphological and Phonological Learning: Proceedings of the 6th Workshop of the which the eigenvalues are smallest. These eigenvectors provide us with the coordinates necessary for our desired planar representation, as explained in section 2.2.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.1 Nearest-neighbor graph construction
</SectionTitle>
      <Paragraph position="0"> construction.</Paragraph>
      <Paragraph position="1"> We begin with the reasonable working assumption that to determine the syntactic category of a given word w, it is the set of words which appears immediate before w, and the set of words that appears immediately after w, that gives the best immediate evidence of a word's syntactic behavior. In a natural sense, under that assumption, an explicit description of the behavior of a word w in a corpus is a sparse</Paragraph>
      <Paragraph position="3"> ], of length V (where &amp;quot;V&amp;quot; is the number of words in the vocabulary of the corpus), indicating by l</Paragraph>
      <Paragraph position="5"> occurs immediately to the left of w, and also an similar vector R, also of length V, indicating how often each word occurs immediately to the right of w. Paraphrasing this, we may view the syntactic behavior of a word in a corpus as being expressed by its location in a space of 2V dimensions, or a vector from the origin to this location; this space has a natural decomposition into two spaces, called Left and Right, each of dimension V.</Paragraph>
      <Paragraph position="6"> Needless to say, such a representation is not directly illuminating -- nor does it provide a way to cogently present similarities or clusterings among words. We now construct a symmetrical graph (&amp;quot;LeftGraph&amp;quot;), whose vertices are the K most frequent words in the corpus. (We have experimented with K = 500 and K = 1000). For each word w, we compute the cosine of the angle between the vector w and the K-1 other words w</Paragraph>
      <Paragraph position="8"> , and use this figure to select the N words closest to w. We have experimented with N = 5,10,20 and 50. We insert an edge (v</Paragraph>
      <Paragraph position="10"> .. We follow the same construction for RightGraph in the parallel fashion. In much of the discussion that follows, the reader may take whatever we say about LeftGraph to hold equally true of RightGraph when not otherwise stated.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.2 Projection of nearest-neighbor graph by spectral decomposition
</SectionTitle>
      <Paragraph position="0"> graph by spectral decomposition In the canonical matrix representation of a (unweighted) graph, an entry M(i,j), with i distinct from j, is 1 if the graph includes an edge (i,j) and 0 otherwise. All diagonal elements are zero. The degree of a vertex of a graph is the number of edges adjacent to it; the degree of the</Paragraph>
      <Paragraph position="2"> )is thus the sum of the values in the m th row of M. If we define D as the diagonal matrix whose entry D(m,m) is d(v m ), the degree of v m , then the laplacian of the graph is defined as D - M. The normalized laplacian L is defined as D</Paragraph>
      <Paragraph position="4"> . The effect of normalization on the laplacian is to divide the weight of an entry M(i,j) that represents the edge between v</Paragraph>
      <Paragraph position="6"> , and to set the values of the diagonal elements to 1.</Paragraph>
      <Paragraph position="7">  The laplacian is a symmetric matrix which is known to be positive semi-definite (Chung 1997). Therefore all the eigenvalues of the laplacian are non-negative. We return to the space of our observations by premultiplying the eigenvectors by D  . We will refer to these eigenvectors derived from LeftGraph (pre null provide us with very useful information. They each consist of a vector with one coordinate for each word among the K most frequent words in the corpus, and thus can be conceived of as a 1-dimensional representation of the vocabulary. In particular,</Paragraph>
      <Paragraph position="9"> is the 1-dimensional representation that optimally preserves the notion of locality described by the graph we have just constructed, and the choice of the top N eigenvectors provides a representation which optimally preserves the graph-locality in N-space. By virtue of being eigenvectors in the same  is the 2-dimensional representation that best  Our attention was drawn to the relevance of the normalized laplacian by Shi and Malik 1997, who explore a problem in the domain of vision. We are indebted to Chung 1997 on spectral graph theory. preserves the locality described by the graph in question (Chung 1997, Belkin and Niyogi 2002).</Paragraph>
      <Paragraph position="10"> Thus, to the extent that the syntactic behavior of a word can be characterized by the set of its immediate right- and left-hand neighbors (which is, to be sure, a great simplification of syntactic reality), using the lowest-valued eigenvectors provides a good graphical representation of words, in the sense that words with similar left-hand neighbors will be close together in the representation derived from the LeftGraph (and similarly for RightGraph).</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.3 Choice of graphs
</SectionTitle>
      <Paragraph position="0"> We explore below two types of projection to 2 dimensions: plotting the 1  eigenvectors of LeftGraph and RightGraph against each other. In all of these cases, we have built a graph using the 20 nearest neighbors. In future work, we would like to look at varying the number of nearest neighbors that are linked to a given word. From manual inspection, one can see that in all cases, the nearest two or three words are very similar; but the depth of the nearest neighbor list that reflects words of truly similar behavior is, roughly, inversely proportional to the frequency of the word. This is not surprising, in the sense that higher frequency words tend to be grammatical words, and for such words there are fewer members of the same category.</Paragraph>
    </Section>
    <Section position="4" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.4 English
</SectionTitle>
      <Paragraph position="0"> Figure 1 illustrates the results of plotting the 1 st and 2 nd eigenvectors of LeftGraph based on the first 1,000,000 words of the Brown corpus, and using the 1,000 most frequent words and constructing a graph based on the 20 nearest neighbors. Figure 2 illustrates the results derived from the first two eigenvectors of RightGraph. Figures 1 and 2 suggest natural clusterings, based both on density and on the extreme values of the coordinates. In Figure 1 (LeftGraph), the bottom corner consists primarily of non-finite verbs (be, do, make); the left corner of finite verbs (was, had, has); the right corner primarily of nouns (world, way, system); while the top shows little homogeneity, though it includes the prepositions. See Appendix 1 for details; the words given in the appendix are a complete list of the words in a neighborhood that includes the extreme tip of the representation. As we move away from the extremes, in some cases we find a less homogeneous distribution of categories, while in others we find local pockets of linguistic homogeneity: for example, regions containing names of cities, others containing names of countries or languages.</Paragraph>
      <Paragraph position="1">  In Figure 2, the bottom corner consists of adjectives (social, national, white), the left corner of words that often are followed by of (most, number, kind, secretary), the right corner primarily by prepositions (of, in for, on by) and the top corner of words that often are followed by to (going, wants, according), (See Appendix 2 for details).</Paragraph>
    </Section>
    <Section position="5" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.5 French
</SectionTitle>
      <Paragraph position="0"> Figure 3 illustrates the results of plotting the 1 st and 2 nd eigenvectors of LeftGraph based on the first 1,000,000 words of a French encyclopedia, using the 1,000 most frequent words and constructing a graph based on the 20 nearest neighbors.</Paragraph>
      <Paragraph position="1"> The bottom left tip of the figure consists entirely of feminine nouns (guerre, population, fin), the right tip of plural nouns (annees, etatsunis, regions), the top tip of finite verbs (est, fut, a, avait) plus se and y. A bit under the top tip one finds two sharp-tipped clusters; the one on the left consists of masculine nouns (pays, sud, monde). Other internal clusters, not surprisingly, are composed of words which, with high frequency, are preceded by a specific preposition (e.g., preceded by a: peu, l'est, Paris; by en: particulier, effet, and feminine names of geographical areas such as France).</Paragraph>
      <Paragraph position="2"> Figure 4 illustrates plotting the 1 st eigenvector of LeftGraph against the 1 st eigenvector of RightGraph. We find a striking &amp;quot;striped&amp;quot; effect which is due to the masculine/feminine gender system of French. There are three stripes that stand out at the top of the figure. The one on the extreme left consists of singular feminine nouns, the one to its right, but left of center, consists of singular masculine nouns, and the one on the extreme right consists of plural nouns of both genders. The lowest region of the graph, somewhat left of center, contains grammatical morphemes. At the very bottom are found relative and subordinating conjunctions (ou, car, lequel, laquel, lesquelles, lesquels, quand, si), and just above them are the prepositions: selon, durant, malgre, pendant, apres, entre, jusqu'a, contre, sur, etc.) We find it striking that the gender system of French has such a pervasive impact upon the global form of the 1</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="2" end_page="3" type="metho">
    <SectionTitle>
3 Identifying syntactic behavior of automatically identified suffixes
</SectionTitle>
    <Paragraph position="0"> automatically identified suffixes Interesting as they are, the representations we have seen are not capable of specifying membership in grammatical categories in an absolute sense. In this section, we explore the application of this representation to a text which has been morphologically analyzed by a language-neutral morphological analyzer. For this purpose, we employ the algorithm described in Goldsmith (2001), which takes an unanalyzed corpus and provides an analysis of the words into stems and suffixes. What is useful about that algorithm for our purposes is that it shares the same commitment to analysis based only on a raw (untreated) natural text, and neither hand-coding nor prior linguistic knowledge.</Paragraph>
    <Paragraph position="1"> The algorithm in Goldsmith (2001) links each stem in the corpus to the set of suffixes (called its signature) with which it appears in the corpus. Thus the stem jump might appear with the three suffixes ed-ing-s in a given corpus.</Paragraph>
    <Paragraph position="2"> But a morphological analyzer alone is not capable of determining whether the -ed that appears in the signature ed-ing-s is the same -ed suffix that appears in the signature ed-ing (for example), or whether the suffix -s in ed-ing-s is the same suffix that appears in the signature NULL-s-'s (this last signature is the one associated with the stem boy in a corpus containing the words boy-boys-boy's). A moment's reflection shows that the suffix -ed is indeed the same verbal past tense suffix in both cases, but the suffix -s is different: in the first case, it is a verbal suffix, while in the second it is a noun suffix.</Paragraph>
    <Paragraph position="3"> In general, morphological information alone will not be able to settle these questions, and thus automatic morphology alone will not be able to determine which signatures should be &amp;quot;collapsed&amp;quot; (that is, ed-ing-s should be viewed as a special sub-case of the signature NULL-eding-s, but NULL-s is not to be treated as a special case of NULL-ed-ing-s).</Paragraph>
    <Paragraph position="4"> We therefore have asked whether the rudimentary syntactic analysis described in the present paper could provide the information needed for the automatic morphological analyzer.</Paragraph>
    <Paragraph position="5"> The answer appears to be that if a suffix has an unambiguous syntactic function, then that suffix's identity can be detected automatically even when it appears in several different signatures. As we will see momentarily, the clear example of this is English -ed, which is (almost entirely) a verbal suffix. When a suffix is not syntactically homogeneous, then the words in which that suffix appears are scattered over a much larger region, and this difference appears to be quite sharply measurable.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.1 The case of the verbal suffix -ed
</SectionTitle>
      <Paragraph position="0"> In the automatic morphological analysis of the first 1,000,000 words of the Brown corpus that we produced, there are 26 signatures that contain the suffix -ed: NULL.ed.s, e.ed.ing, NULL.ed.er.es.ing, and 23 others of similar sort.</Paragraph>
      <Paragraph position="1"> We calculated a nearest neighbor graph as described above, with a slight variation. We considered the 1000 most frequent words to be atomic and unanalyzed morphologically, and then of the remaining words, we automatically replaced each stem with its corresponding signature. Thus as jumped is analyzed as jump+ed, and jump is assigned the signature NULL.ed.er.s.ing (based on the actual forms of the stem found in the corpus), the word jumped is replaced in the bigram calculations by the pseudo-word NULL.ed.er.s.ing_ed: the stem jump is replaced by its signature, and the actual suffix -ed remains unchanged, but is separated from its stem by an underscore _. Thus all words ending in -ed whose stems show the same morphological variations are treated as a single element, from the point of view of our present syntactic analysis.</Paragraph>
      <Paragraph position="2"> We hoped, therefore, that these 26 signatures with -ed appended to them would appear very close to each other in our 2-dimensional representation, and this was exactly what we found.</Paragraph>
      <Paragraph position="3"> To quantify this result, we calculated the coordinates of these 26 signatures in the following way. We normalize coordinates so that the lowest x-coordinate value is 0.0 and the highest is 1.0; likewise for the y-coordinates. Using these natural units, then, on the LeftGraph data, the average distance from each of the signatures to the center of these 26 points is 0.050. While we do not have at present a criterion to evaluate the closeness of this clustering, this appears to us at this point to be well within the range that an eventual criterion will establish. (A distance of 0.05 by this measure is a distance equal to 5% along either one of the axes, a fairly small distance.) On the RightGraph data, the average distance is 0.054.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.2 The cases of -s and -ing
</SectionTitle>
      <Paragraph position="0"> By contrast, when we look at the range of the 19 signatures that contain the suffix -s, the average distance to mean in the LeftGraph is 0.265, and in the RightGraph, 0.145; these points are much more widely scattered. We interpret this as being due to the fact that -s serves at least two functions: it marks the 3rd person present form of the verb, as well as the nominal plural.</Paragraph>
      <Paragraph position="1"> Similarly, the suffix -ing marks both the verbal progressive form as well as the gerundive, used both as an adjective and as a noun, and we expect a scattering of these forms as a result. We find an average to mean of 0.096 in the LeftGraph, and of 0.143 in the RightGraph.</Paragraph>
      <Paragraph position="2"> By way of even greater contrast, we can calculate the scatter of the NULL suffix, which is identified in all stems that appear without a suffix (e.g., the verb play, the noun boy). This &amp;quot;suffix&amp;quot; has an average distance to mean of 0.312 in the LeftGraph, and 0.192 in the RightGraph. This is the scatter we would expect of a group of words that have no linguistic coherence.</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="3" type="sub_section">
      <SectionTitle>
3.3 Additional suffixes tested
</SectionTitle>
      <Paragraph position="0"> Suffix -ly occurs with five signatures, and an average distance to mean of 0.032 in LeftGraph, and 0.100 in RightGraph.</Paragraph>
      <Paragraph position="1">  The suffix 's occurs in only two signatures, but their average distance to mean is 0.000 [sic] in LeftGraph, and 0.012 in RightGraph. Similarly, the suffix -al appears in two signatures (NULL.al.s and NULL.al), and their average distance to mean is 0.002 in LeftGraph, and also 0.002 in RightGraph. The suffix -ate appears in three signatures, with an average distance to mean of 0.069 in LeftGraph, and 0.080 in RightGraph. The suffix -ment appears in two signatures, with an average distance to mean of 0.012 in LeftGraph, and 0.009 in RightGraph.</Paragraph>
      <Paragraph position="2"> 3.4 French suffixes -ait, -er, -a, -ant, -e We performed the same calculation for the French suffix -ait as for the English suffixes discussed above. -ait is the highest frequency 3rd person singular imperfect verbal suffix, and as such is one of the most common verbal suffixes, and it has no other syntactic functions. It appears in seven signatures composed of verbal suffixes, and they cluster well in the spaces of both LeftGraph and RightGraph, with an average distance to mean of 0.068 in the LeftGraph, and 0.034 in the RightGraph.</Paragraph>
      <Paragraph position="3"> The French suffix -er is by far the most frequent infinitival marker, and it appears in 14 signatures, with an average distance to mean of 0.055 in LeftGraph, and 0.071 in RightGraph.</Paragraph>
      <Paragraph position="4"> The 3 rd singular simple past suffix -a appears in 11 signatures, and has an average distance to mean of 0.023 in LeftGraph, and 0.029 in RightGraph.</Paragraph>
      <Paragraph position="5"> The present participle verbal suffix -ant appears in 10 suffixes, and has an average  This latter figure deserves a bit more scrutiny; one of the five is an outlier: if we restricted our attention to four of them, the average distance to mean is 0.014.</Paragraph>
      <Paragraph position="6"> distance to mean of 0.063 in LeftGraph, and of 0.088 in RightGraph.</Paragraph>
      <Paragraph position="7"> On the other hand, the suffix -e appears as the last suffix in a syntactically heterogeneous set of words: nouns, verbs, and adjectives. It has an average distance to mean of 0.290 in LeftGraph and of 0.130 in RightGraph. This is as we expect: it is syntactically heterogeneous, and therefore shows a large average distance to mean.</Paragraph>
    </Section>
    <Section position="4" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.5 Summary
</SectionTitle>
      <Paragraph position="0"> Here are the average distances to mean for the cases where we expect syntactic coherence and the cases where we do not expect syntactic coherence. Our hypothesis is that the numbers will be small for the suffixes where we expect coherence, and large for those where we do not expect coherence, and this hypothesis is strongly borne out. We note empirically that we may take an average value of the two columns of .10</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>