<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1516">
  <Title>Strictly Lexical Dependency Parsing</Title>
  <Section position="4" start_page="152" end_page="154" type="metho">
    <SectionTitle>
2 A Probabilistic Dependency Model
</SectionTitle>
    <Paragraph position="0"> Let S be a sentence. The dependency structure T of S is a directed tree connecting the words in S.</Paragraph>
    <Paragraph position="1"> Each link in the tree represents a dependency relationship between two words, known as the head and the modifier. The direction of the link is from the head to the modifier. We add an artificial root node ([?]) at the beginning of each sentence and a dependency link from [?] to the head of the sentence so that the head of the sentence can be treated in the same way as other words. Figure 1 shows an example dependency tree.</Paragraph>
    <Paragraph position="2"> We denote a dependency link l by a triple (u, v, d), where u and v are the indices (u &lt; v) of the words connected by l, and d specifies the direction of the link l. The value of d is either L or R. If d = L, v is the index of the head word; otherwise, u is the index of the head word.</Paragraph>
    <Paragraph position="3"> Dependency trees are typically assumed to be projective (without crossing arcs), which means that if there is an arc from h to m, h is an ancestor of all the words between h and m. Let F(S) be the set of possible directed, projective trees spanning on S. The parsing problem is to find  Generative parsing models are usually defined recursively from top down, even though the decoders (parsers) for such models almost always take a bottom-up approach. The model proposed here is a bottom-up one. Like previous approaches, we decompose the generation of a parse tree into a sequence of steps and define the probability of each step. The probability of the tree is simply the product of the probabilities of the steps involved in the generation process. This scheme requires that different sequences of steps must not lead to the same tree. We achieve this by defining a canonical ordering of the links in a dependency tree. Each generation step corresponds to the construction of a dependency link in the canonical order.</Paragraph>
    <Paragraph position="4"> Given two dependency links l and l' with the heads being h and h' and the modifiers being m and m', respectively, the order between l and l' are determined as follows:  * If h [?] h' and there is a directed path from one (say h) to the other (say h'), then l' precedes l. * If h [?] h' and there does not exist a directed path between h and h', the order between l and l' is determined by the order of h and h' in the sentence (h precedes h' = l precedes l').</Paragraph>
    <Paragraph position="5"> * If h = h' and the modifiers m and m' are on different sides of h, the link with modifier on the right precedes the other.</Paragraph>
    <Paragraph position="6"> * If h = h' and the modifiers m and m' are on the same side of the head h, the link with its modifier closer to h precedes the other one.</Paragraph>
    <Paragraph position="7">  For example, the canonical order of the links in the dependency tree in Figure 1 is: (1, 2, L), (5, 6, R), (8, 9, L), (7, 9, R), (5, 7, R), (4, 5, R), (3, 4, R), (2, 3, L), (0, 3, L).</Paragraph>
    <Paragraph position="8"> The generation process according to the canonical order is similar to the head outward generation process in (Collins, 1999), except that it is bottom-up whereas Collins' models are top-down.</Paragraph>
    <Paragraph position="9"> Suppose the dependency tree T is constructed in steps G</Paragraph>
    <Paragraph position="11"> in the canonical order of the dependency links, where N is the number of words in the sentence. We can compute the probability of T as follows:</Paragraph>
    <Paragraph position="13"/>
    <Paragraph position="15"> Following (Klein and Manning, 2004), we require that the creation of a dependency link from head h to modifier m be preceded by placing a left STOP and a right STOP around the modifier m and !STOP between h and m.</Paragraph>
    <Paragraph position="16"> Let</Paragraph>
    <Paragraph position="18"> E ) denote the event that there are no more modifiers on the left (and right) of a word w. Suppose the dependency link created in the step i is (u, v, d). If d = L, G i is the conjunction of the four events:</Paragraph>
    <Paragraph position="20"> which are the words in the sentence and a forest of trees constructed up to step i-1. Let</Paragraph>
    <Paragraph position="22"> be the number of modifiers of w on its left (and right). We make the following independence assumptions: null * Whether there is any more modifier of w on the d side depends only on the number of modifiers already found on the d side of w.</Paragraph>
    <Paragraph position="23">  * Whether there is a dependency link from a word h to another word m depends only on the words h and m and the number of modifiers of h between m and h. That is,</Paragraph>
    <Paragraph position="25"> corresponds to a dependency link (u, v, L). The probability ( )  Manning, 2004). They are crucial for modeling the number of dependents. Without them, the parse trees often contain some 'obvious' errors, such as determiners taking arguments, or prepositions having arguments on their left (instead of right).</Paragraph>
    <Paragraph position="26"> Our model requires three types of parameters:</Paragraph>
    <Paragraph position="28"> CwEP , |, where w is a word, d is a direction (left or right). This is the probability of a STOP after taking</Paragraph>
    <Paragraph position="30"> C )'th modifier of v on the left.</Paragraph>
    <Paragraph position="31"> The Maximum Likelihood estimations of these parameters can be obtained from the frequency counts in the training corpus: * C(w, c, d): the frequency count of w with c modifiers on the d side.</Paragraph>
    <Paragraph position="32"> * C(u, v, c, d): If d = L, this is the frequency count words u and v co-occurring in a sentence and v has c modifiers between itself and u. If d = R, this is the frequency count words u and v co-occurring in a sentence and u has c modifiers between itself and v.</Paragraph>
    <Paragraph position="33">  * K(u, v, c, d): similar to C(u, v, c, d) with an additional constraint that link d (u, v) is true.</Paragraph>
    <Paragraph position="34">  We compute the probability of the tree conditioned on the words. All parameters in our model are conditional probabilities where the left sides of the conditioning bar are binary variables. In contrast, most previous approaches compute joint probability of the tree and the words in the tree. Many of their model parameters consist of the probability of a word in a given context. We use a dynamic programming algorithm similar to chart parsing as the decoder for this model. The algorithm builds a packed parse forest from bottom up in the canonical order of the parser trees. It attaches all the right children before attaching the left ones to maintain the canonical order as required by our model.</Paragraph>
  </Section>
  <Section position="5" start_page="154" end_page="155" type="metho">
    <SectionTitle>
3 Similarity-based Smoothing
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="154" end_page="154" type="sub_section">
      <SectionTitle>
3.1 Distributional Word Similarity
</SectionTitle>
      <Paragraph position="0"> Words that tend to appear in the same contexts tend to have similar meanings. This is known as the Distributional Hypothesis in linguistics (Harris, 1968). For example, the words test and exam are similar because both of them follow verbs such as administer, cancel, cheat on, conduct, ... and both of them can be preceded by adjectives such as academic, comprehensive, diagnostic, difficult, ...</Paragraph>
      <Paragraph position="1"> Many methods have been proposed to compute distributional similarity between words (Hindle, 1990; Pereira et al., 1993; Grefenstette, 1994; Lin, 1998). Almost all of the methods represent a word by a feature vector where each feature corresponds to a type of context in which the word appeared. They differ in how the feature vectors are constructed and how the similarity between two feature vectors is computed.</Paragraph>
      <Paragraph position="2"> We define the features of a word w to be the set of words that occurred within a small context window of w in a large corpus. The context window of an instance of w consists of the closest non-stop-word on each side of w and the stop-words in between. In our experiments, the set of stop-words are defined as the top 100 most frequent words in the corpus. The value of a feature w' is defined as the point-wise mutual information between the w'</Paragraph>
      <Paragraph position="4"> where P(w, w') is the probability of w and w' co-occur in a context window.</Paragraph>
      <Paragraph position="5"> The similarity between two vectors is computed as the cosine of the angle between the vectors. The following are the top similar words for the</Paragraph>
    </Section>
    <Section position="2" start_page="154" end_page="155" type="sub_section">
      <SectionTitle>
3.2 Similarity-based Smoothing
</SectionTitle>
      <Paragraph position="0"> The parameters in our model consist of conditional probabilities P(E|C) where E is the binary  or two words in the input sentence. Due to the sparseness of natural language data, the contexts observed in the training data only covers a tiny fraction of the contexts whose probability distribution are needed during parsing. The standard approach is to back off the probability to word classes (such as part-of-speech tags). We have taken a different approach. We search in the train- null ing data to find a set of similar contexts to C and estimate the probability of E based on its probabilities in the similar contexts that are observed in the training corpus.</Paragraph>
      <Paragraph position="1"> Similarity-based smoothing was used in (Dagan et al., 1999) to estimate word co-occurrence probabilities. Their method performed almost 40% better than the more commonly used back-off method. Unfortunately, similarity-based smoothing has not been successfully applied to statistical parsing up to now.</Paragraph>
      <Paragraph position="2"> In (Dagan et al., 1999), the bigram probability</Paragraph>
      <Paragraph position="4"> ', wwsim denotes the similarity (or an increasing function of the similarity) between w</Paragraph>
      <Paragraph position="6"> The underlying assumption of this smoothing scheme is that a word is more likely to occur after</Paragraph>
      <Paragraph position="8"> if it tends to occur after similar words of w</Paragraph>
      <Paragraph position="10"> We make a similar assumption: the probability P(E|C) of event E given the context C is computed as the weight average of P(E|C') where C' is a similar context of C and is attested in the training  where S(C) is the set of top-K most similar contexts of C (in the experiments reported in this paper, K = 50); O is the set of contexts observed in the training corpus, sim(C,C') is the similarity between two contexts and norm(C) is the normalization factor.</Paragraph>
      <Paragraph position="11"> In our model, a context is either [ ]  where S(w) is the set of top-K similar words of w (K = 50).</Paragraph>
      <Paragraph position="12"> Since all contexts used in our model contain at least one word, we compute the similarity between two contexts, sim(C, C'), as the geometric average of the similarities between corresponding words:  sary when the frequency count of the context C in the training corpus is low. We therefore compute</Paragraph>
      <Paragraph position="14"> where the smoothing factor  A difference between similarity-based smoothing in (Dagan et al., 1999) and our approach is that our model only computes probability distributions of binary variables. Words only appear as parts of contexts on the right side of the conditioning bar. This has two important implications. Firstly, when a context contains two words, we are able to use the cross product of the similar words, whereas (Dagan et al., 1999) can only use the similar words of one of the words. This turns out to have significant impact on the performance (see Section 4).</Paragraph>
      <Paragraph position="15"> Secondly, in (Dagan et al., 1999), the distribu- null is 0, it is often due to data sparseness. Their smoothing scheme therefore tends to under-estimate the probability values. This problem is avoided in our approach. If a context did not occur in the training data, we do not include it in the average. If it did occur, the Maximum Likelihood estimation is reasonably accurate even if the context only occurred a few times, since the entropy of the probability distribution is upper-bounded by log 2.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="155" end_page="157" type="metho">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"> We experimented with our parser on the Chinese Treebank (CTB) 3.0. We used the same data split as (Bikel, 2004): Sections 1-270 and 400-931 as  the training set, Sections 271-300 as testing and Sections 301-325 as the development set. The CTB contains constituency trees. We converted them to dependency trees using the same method and the head table as (Bikel, 2004). Parsing Chinese generally involve segmentation as a pre-processing step. We used the gold standard segmentation in the CTB.</Paragraph>
    <Paragraph position="1"> The distributional similarities between the Chinese words are computed using the Chinese Gigaword corpus. We did not segment the Chinese corpus when computing the word similarity.</Paragraph>
    <Paragraph position="2"> We measure the quality of the parser by the undirected accuracy, which is defined as the number of correct undirected dependency links divided by the total number of dependency links in the corpus (the treebank parse and the parser output always have the same number of links). The results are summarized in Table 1. It can be seen that the performance of the parser is highly correlated with the length of the sentences.</Paragraph>
    <Paragraph position="3">  We also experimented with several alternative models for dependency parsing. Table 2 summerizes the results of these models on the test corpus with sentences up to 40 words long.</Paragraph>
    <Paragraph position="4"> One of the characteristics of our parser is that it uses the similar words of both the head and the modifier for smoothing. The similarity-based smoothing method in (Dagan et al., 1999) uses the similar words of one of the words in a bigram. We can change the definition of similar context as follows so that only one word in a similar context of C may be different from a word in C (see  where w is either v or u depending on whether d is L or R. This change led to a 2.2% drop in accuracy (compared with Model (a) in Table 2), which we attribute to the fact that many contexts do not have similar contexts in the training corpus.</Paragraph>
    <Paragraph position="5"> Since most previous parsing models maximize the joint probability of the parse tree and the sentence P(T, S) instead of P(T  |S), we also implemented a joint model (see Model (c) in Table 2):</Paragraph>
    <Paragraph position="7"> are the head and the modifier of the i'th dependency link. The probability  , as in (Dagan et al., 1999). The result was a dramatic decrease in accuracy from the conditional model's 79.9%. to 66.3%.</Paragraph>
    <Paragraph position="8"> Our use of distributional word similarity can be viewed as assigning soft clusters to words. In contrast, parts-of-speech can be viewed as hard clusters of words. We can modify both the conditional and joint models to use part-of-speech tags, instead of words. Since there are only a small number of tags, the modified models used MLE without any smoothing except using a small constant as the probability of unseen events. Without smoothing, maximizing the conditional model is equivalent to maximizing the joint model. The accuracy of the unlexicalized models (see Model (d) and Model (e) in Table 2) is 71.1% which is considerably lower than the strictly lexicalized conditional model, but higher than the strictly lexicalized joint model. This demonstrated that soft clusters obtained through distributional word similarity perform better than the part-of-speech tags when used appropriately.</Paragraph>
  </Section>
class="xml-element"></Paper>