<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-1003">
  <Title>A Preliminary Study of Word Clustering Based on Syntactic Behavior</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Similarity Measure and Algorithm
</SectionTitle>
    <Paragraph position="0"> The choice of the clustering algorithm is to some extent independent from the way data is collected, but as mentioned clustering is carried out on the basis of distributional similarity, and methods using Mutual Information are not applicable. The algorithm we present here is meant to demonstrate how syntactic behavior can be used for clustering. However, we feel the optimal choice for the clustering method depends on the application it will be used for.</Paragraph>
    <Paragraph position="1"> Studies in distribution based clustering often use the KuUback-Leibler (KL) distance, see for example (Pereira, Tishby, and Lee, 1993, Dagan, Pereira, and Lee, 1994). However, this distance is not symmetrical, and since we are (for the time being) interested in hard clustering it is desirable to have a symmetrical measure. We could possibly use Jeffery's Information, i.e. the sum of the KL-distances:</Paragraph>
    <Paragraph position="3"> We have tried this distance measure, but in many cases we have found it to have undesirable effects, primarily because the goal of our algorithm is joining words (and their statistics) together to make one cluster, and a distorted image results from this measure when words have different total frequencies. Furthermore, Jeffery's Information is undefined if either distribution has a value of 0 and the other not. For this reason they would have to be smoothed with, for example, a part of speech based distribution, such as</Paragraph>
    <Paragraph position="5"> but we wanted to avoid using an unlexical distribution since we believe lexical information is more valuable.</Paragraph>
    <Paragraph position="6"> Instead we suggest a different measure. Assume there are a number of patterns i = 1...n, and observed frequencies al...an for word wata, and bl...bn for word Wbt b. Also, let A = ~i ai and B = ~i bi. The Maximum-Likelihood estimates for Wa are thus calculated as pa(x) = a~/A and likewise for Wb.</Paragraph>
    <Paragraph position="7"> We now define the distance between words as</Paragraph>
    <Paragraph position="9"> which can be interpreted as the sum of KL-distances between a hypothetical word that would be created if the observations of the words Wata and Wbtb would be joined together, and Wata and Wbtb respectively.</Paragraph>
    <Paragraph position="10"> Like Jeffery's Information, this measure is symmetrical, although not a true distance since it does not obey the triangle inequality.</Paragraph>
    <Paragraph position="11"> This measure is more appropriate for two reasons.</Paragraph>
    <Paragraph position="12"> First, this distribution is better tailored toward making clusters where observations will be joined together. Second, we take this sum to be zero for values of i when ai = bi = 0 (no observations for either word), therefore pre-smoothing is not necessary. null The equation can easily be transformed into the</Paragraph>
    <Paragraph position="14"> which makes calculation significantly faster since patterns for which only one word has a non-zero frequency do not need to be calculated within the summation, as they always becomes zero.</Paragraph>
    <Paragraph position="15"> The Algorithm The algorithm initially regards every word as a 1-element cluster, and works bottom up towards a set of clusters. The strategy of a greedy algorithm is followed, every time finding the two clusters that have the least distance between them and merging them until the desired number of clusters is reached. However, only words with the Hogenhout ~ Matsumoto 19 Word Clustering from Syntactic Behavior same part of speech may be merged, so distances between words that have different parts of speech are never calculated. Words can therefore receive a 'combined tag' consisting of their part of speech tag, and a syntactic behavior tag. This is similar to what McMahon (1996) refers to as a structural tag.</Paragraph>
    <Paragraph position="16"> The algorithm is actually applied twice, once to clustering for dependent-context (1) and once to clustering for head-context (2).</Paragraph>
    <Paragraph position="17"> An obvious problem with this sort of clustering is low frequency words. For many words only a one or a few observations are available, which may give some information about what Sort of word it is, but which does not give a reliable estimate of the distributions. We will mention a solution to this problem later. In the example we present only words for which at least 25 observations are available.</Paragraph>
    <Paragraph position="18"> One problem with co-occurrence based clustering that has been pointed out in the past is that of almost-linear dendrograms, caused by the properties of Mutual Information. We have not encountered this problem with the described algorithm.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Result: the Case of Prepositions
</SectionTitle>
    <Paragraph position="0"> We present a binary word tree that was produced by the algorithm described in the previous section.</Paragraph>
    <Paragraph position="1"> The main goal of this is to show what sort of properties are revealed by this clustering, and what kind of words are problematic. Even in situations where words are clustered by syntactic behavior without making a binary tree, it can be useful to study the type of properties that decide syntactic behavior.</Paragraph>
    <Paragraph position="2"> Please refer to figure 2 for an example of the results obtained with clustering. This is a dendrogram that reflects the clustering process from loose words until the point were they are all merged into one cluster. The dendrogram shows the result for prepositions, although only those prepositions were considered for which at least 25 observations were available. In the division of words over the parts of speech we follow the tagging scheme of the Wall Street Journal Treebank, and for example subordinators such as while, if and because are included in the prepositions. Of course it is possible to use a more fine grained tag set, when available. On the other hand, as will be shown later, the algorithm does decide to classify most subordinators into one  Hogenhout ~ Matsumoto 20 Word Clustering from Syntactic Behavior We will discuss the major distinctions made by the algorithm. At first it may not be clear why words should be divided in this way, but inspection of the data from the corpus shows that many of these choices are very natural. We also discuss in which cases the dendrogram does not form natural categories. null The first partition, marked A, is a quite natural division. The upper branch (from off through About) are prepositions that usually cover some phrase themselves, whereas the prepositions in the lower branch usually do not cover any phrase.</Paragraph>
    <Paragraph position="3"> The preposition whether occurs, for example, in structures such as '' We have no useful information on (SBAR whether (S users are at risk)), ~ ~ said James A. Talcott</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
of Boston's Dana-Farber Cancer
</SectionTitle>
      <Paragraph position="0"> Institute.</Paragraph>
      <Paragraph position="1"> where in our headword-selection scheme whether depends on the headword are. (Even if this is changed, they still become one cluster because of the typical patterns with S and SBAR.) For comparison, the preposition below usually occurs in structures such as Magna recently cut its quarterly dividend in half and the company's Class A shares are (VP wallowing (PP-LOC far below their 52-week high of 16.125 Canadian dollars (US$ 13.73))).</Paragraph>
      <Paragraph position="2"> where it is the headword of a prepositional phrase before it modifies the verb.</Paragraph>
      <Paragraph position="3"> The partition marked with B is not a natural division; it rather separates a set of prepositions that do not fit in elsewhere. The prepositions from per through About are not similar to each other or to other prepositions in their behavior.</Paragraph>
      <Paragraph position="4"> Partition C again resembles to groups that can be characterized easily. The prepositions by through After, the lower branch of C, depended almost exclusively on verbs. The prepositions from off through about, the upper branch of C, depend on more varied headwords. Most of these frequently depend on both nouns and verbs. The following example shows around depending on a noun, although around also tends to depend on cardinal numbers.</Paragraph>
      <Paragraph position="5"> You now may drop by the Voice of America offices in Washington and read the text of what the Voice is broadcasting to those 130 million people (PP-LOC around the world) who tune in to it each week.</Paragraph>
      <Paragraph position="6"> An example for the lower branch of C is A plan to bring the stock to market before year end apparently (VP was upset (PP by the recent weakness of Frankfurt share prices) ).</Paragraph>
      <Paragraph position="7"> The prepositions at the upper branch of partition D tend to form a higher amount of PP-TMP type phrases, as in And in each case, he says, a sharp drop in stock prices (VP began (PP-TMP within a year)).</Paragraph>
      <Paragraph position="8"> although, while this is strongly the case for the prepositions within and throughout, it is not the case for behind.</Paragraph>
      <Paragraph position="9"> At partition E prepositions with a preference for verbs are at the upper branch. Prepositions that almost exclusively deal with verbs were separated at C, but here the distinction is less absolute. The prepositions at the upper branch of E have a chance of about two thirds to depend on a verb, whereas this is only one third at the lower branch.</Paragraph>
      <Paragraph position="10"> Partition F is once again a very clear, natural division. The prepositions in, on and at have a strong tendency to form phrases of the type PP-LOC as in Mr. Nixon is traveling (PP-LOC in China) as a private citizen, but he has made clear that he is an unofficial envoy for the Bush administrat ion.</Paragraph>
      <Paragraph position="11"> while the prepositions at the lower branch, of through about have much lower frequencies for these locative phrases.</Paragraph>
      <Paragraph position="12"> The division at G is also very clear when the data are inspected. The upper branch reflects prepositions for which the covering phrase (the middle part of the triple representing the grammatical relation) is mostly VP or NP. The prepositions For through After at the lower branch of G are mainly covered by phrases of type S. A preposition such as during is found in structures such as Fujitsu said it (VP bid the equivalent of less than a U.S.</Paragraph>
      <Paragraph position="13"> penny on three separate municipal contracts (PP-TMP during the past two years)).</Paragraph>
      <Paragraph position="14"> while a preposition such as without is usually found in the PP-S-VP pattern: Hogenhout ~ Matsumoto 21 Word Clustering from Syntactic Behavior (S In fact, (PP without Wall Street firms trading for their own accounts), the stock-index arbitrage trading opportunities for the big funds (VP may be all the more abundant). ) At H this is further divided in words that tend more to depend on loose words, PP type phrases (such as without in the last example) or S type phrases, at the lower branch, and those that usually depend on heads of a VP.</Paragraph>
      <Paragraph position="15"> As for the division at point I, the prepositions next through Although share the property that their covering phrase (the middle part of the triple representing the grammatical relation) is often of the type SBAR-ADV or SBAR-PRP. The prepositions at the upper branch, whether through down, mainly share not having this property.</Paragraph>
      <Paragraph position="16"> While the status of the upper branch of J is somewhat unclear, the lower branch of J is a perfectly clear and intuitive group. All of the words from though through Although appear almost exclusively in the patterns (-,SBAR-ADV,S), (-,S,S), (-,SBAR-PRP,S) and (-,SBAR-PRP,-). An example is The group says standardized achievement test scores are greatly inflated because teachers often 'Cteach the test'' as Mrs. Yeargin did, (SBAR-ADV although (S most are never caught)) .</Paragraph>
      <Paragraph position="17"> where in our headword scheme are becomes the headword of the SBAR-ADV type phrase.</Paragraph>
      <Paragraph position="18"> Concluding, many of the divisions made by the algorithm are quite natural. There are some parts of speech (such as nouns and verbs) were a much larger number of words is included in the hierarchy, while some other parts of speech, for example personal pronouns, produce very small hierarchies. In general the hierarchy is more interesting for parts of speech that are used in a varied way, and less interesting for, for example, symbols such as the percentage sign, that are used in a monotone way. It is interesting to see that capitalization turns out to be a meaningful predictor about the way a word will be used for some words, but not for others. The word pair so and So, and the pair because and Because are clustered next to each other, which indicates that they modify the same kind of structures, independent of whether they are at the beginning of the sentence. The word pair under and Under, and the pair after and After on the other hand are rather far apart, indicating that their usage changes substantially when they become the first word of the sentence.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Applications
</SectionTitle>
    <Paragraph position="0"> A first application of this work, of which we carried out a first step in this article, is the lexicographical one of studying word behavior. Some properties of words, such as the peculiar behavior of the gerund including or the similarities between prepositions such as though and while only becomes clear once the corpus data is analyzed in the way we described. When inspecting manually, the binary word tree representation appears to be the most easy to understand.</Paragraph>
    <Paragraph position="1"> A second application of the binary word tree can be found in decision-tree based systems such as the SPATTER parser (Magerman, 1995) or the ATR Decision-Tree Part-Of-Speech Tagger, as described by Ushioda (Ushioda, 1996). In this case it is necessary to use a hard-clustering method, such that a binary word tree can be constructed by the clustering process, as we did in the example in the previous sections.</Paragraph>
    <Paragraph position="2"> A decision tree classifies data according to its properties by asking successive (often binary) questions. In the case of a part of speech tagger or a parsing system, it is particularly important for the system to ask lexicalizing questions. However, questions about individual words such as &amp;quot;Is this the word display?&amp;quot; are not efficient on a large scale since it would easily require thousands of questions. A binary tree allows one to separate the vocabulary into two parts at every question, which is efficient when these two parts are maximally different. In that case it is possible to obtain as much information as possible with a small number of questions. A condition for this application is that trees may not be very unbalanced, as the extreme case of a linear tree becomes equal to asking word-by-word. As mentioned, the method we suggest did not produce a very unbalanced tree for the parts of speech in the Wall Street Journal Treebank.</Paragraph>
    <Paragraph position="3"> A third application can be found in Information Retrieval. This can be seen from the example of including: words with such behavior have little content because they have a rather functional role in the sentence. This can be seen in the sentence &amp;quot;Safety advocates, including some members of Congress,...&amp;quot; where terms such as Safety advocates or members of Congress indicate much more about the topic of the sentence than the relatively empty word including. It is possible to cluster words and decide which clusters are likely to indicate the topic, and which are not likely to do so. For this application a wider Hogenhout 84 Matsumoto 22 Word Clustering from Syntactic Behavior variety of algorithms can be applied; words can for example be exchanged or shuffled between classes to improve the entire model.</Paragraph>
    <Paragraph position="4"> A fourth application is class-based smoothing of interpolated n-gram models. The co-occurrence based classes described in the literature are, of course, created with this as objective function, but on the other hand the classes we suggest clearly contain information that is inaccessible to co-occurrence based classes. It is possible that a combination of co-occurrence based classes and classes of syntactic behavior would give better results, but this would have to be demonstrated experimentally.</Paragraph>
    <Paragraph position="5"> In some of these applications words with a low frequency cannot be ignored because of their quantity, but at the same time the algorithm cannot rely too heavily on their observations. A possible solution is to carry out clustering without these words, and distribute the low-frequency words over the leaves of the tree afterwards. A solution along this line was chosen for co-occurrence based clustering in (McMahon and Smith, 1996), where a first algorithm handles more frequent words, and a second algorithm adds the low-frequency words afterwards.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML