<?xml version="1.0" standalone="yes"?> <Paper uid="A97-1046"> <Title>Fast Statistical Parsing of Noun Phrases for Document Indexing</Title> <Section position="4" start_page="313" end_page="315" type="metho"> <SectionTitle> 3 Fast Noun Phrase Parsing </SectionTitle> <Paragraph position="0"> A fast and robust noun phrase parser is a key to the exploration of syntactic phrase indexing. Noun phrase parsing, or noun phrase structure analysis ( also known as compound noun analysisS), is itself an important research issue in computational linguistics and natural language processing. Long noun phrases, especially long compound nouns such as &quot;information retrieval technique&quot;, generally have ambiguous structures. For instance, &quot;information retrieval technique&quot; has two possible structures: &quot;\[\[information retrieval\] technique\]' and &quot;\[information \[retrieval technique\]\]'. A principal difficulty in noun phrase structure analysis is to resolve such structural ambiguity. When a large corpus is available, which is true for an IR task, statistical preference of word combination or word modification can be a good clue for such disambiguation. As summarized in (Lauer 95), there are two different models for corpus-based parsing of noun phrases: the adjacency model and the dependency model. The difference between the two models can be illustrated by the example compound noun &quot;informationsretrieval technique&quot;. In the adjacency model, the structure would be decided by looking at the adjacency association of &quot;information retrievaF and &quot;retrieval technique&quot;. &quot;information retrievat' will be grouped first, if &quot;information retrievaF has a stronger association than &quot;retrieval technique&quot;, otherwise, &quot;retrieval technique&quot; will be grouped first. In the dependency model, however, the structure would be decided by looking at the dependency between &quot;information&quot; and &quot;retrievaP (i.e., the tendency for &quot;information&quot; to modify &quot;retrievat') and the dependency between &quot;information&quot; and &quot;technique&quot;. If &quot;information&quot; has a stronger dependency association with &quot;retrievaP than with &quot;technique&quot;, &quot;information retrievat' will be grouped first, otherwise, &quot;retrieval technique&quot; will be grouped first. The adjacency model dates at least from (Marcus 80) and has been explored recently in (Liberman and Sproat 92; Pustejovsky et al. 93; Resnik and Hearst 93; Lauer 95; Strzalkowski et al. 95; Evans and Zhai 96); The dependency model has mainly been studied in (Lauer 94). Evans and Zhai (Evans and Zhai 96) use primarily the adjacency model, but the association score also takes into account some degree of dependency. Lauer (Lauer 95) compared the adjacency model and the dependency model for compound noun disambiguation, and concluded that the SStrictly speaking, however, compound noun analysis is a special case of noun phrase analysis, but the same technique can oRen be used for both.</Paragraph> <Paragraph position="1"> dependency model provides a substantial advantage over the adjacency model.</Paragraph> <Paragraph position="2"> We now propose a probabilistic model in which the dependency structure, or the modification structure, of a noun phrase is treated as &quot;hidden&quot;, similar to the tree structure in the probabilistic context-free grammar (Jelinek et al. 90). 
<Paragraph position="2"> We now propose a probabilistic model in which the dependency structure, or the modification structure, of a noun phrase is treated as "hidden", similar to the tree structure in the probabilistic context-free grammar (Jelinek et al. 90). The basic idea is as follows. A noun phrase can be assumed to be generated from a word modification structure (i.e., a dependency structure). Since noun phrases with more than two words are structurally ambiguous, if we only observe the noun phrase, then the actual structure that generated it is "hidden". We treat the noun phrases with their possible structures as the complete data and the noun phrases occurring in the corpus (without the structures) as the observed incomplete data. In the training phase, an Expectation Maximization (EM) algorithm (Dempster et al. 77) can be used to estimate the word modification probabilities by iteratively maximizing the conditional expectation of the likelihood of the complete data given the observed incomplete data and a previous estimate of the parameters. In the parsing phase, a noun phrase is assigned the structure that has the maximum conditional probability given the noun phrase.</Paragraph> <Paragraph position="3"> Formally, assume that each noun phrase is generated using a word modification structure. For example, "information retrieval technique" may be generated using either the structure "[X1 [X2 X3]]" or the structure "[[X1 X2] X3]". The log likelihood of generating the set of noun phrases NP = {np_i} observed in a corpus can be written as:

L(NP) = \sum_{np_i \in NP} c(np_i) \log \sum_{s_j \in S} P_\phi(np_i, s_j)

where S is the set of all possible modification structures; c(np_i) is the count of the noun phrase np_i in the corpus; and P_\phi(np_i, s_j) gives the probability of deriving the noun phrase np_i using the modification structure s_j.</Paragraph> <Paragraph position="4"> With the simplification that generating a noun phrase from a modification structure is the same as generating all the corresponding word modification pairs in the noun phrase, and with the assumption that each word modification pair is generated independently, P_\phi(np_i, s_j) can further be written as

P_\phi(np_i, s_j) = P_\phi(s_j) \prod_{(u,v) \in M(np_i, s_j)} P_\phi(u, v)^{c(u, v;\, np_i, s_j)}

where M(np_i, s_j) is the set of all word pairs (u, v) in np_i such that u modifies (i.e., depends on) v according to s_j;[6] c(u, v; np_i, s_j) is the number of times the modification pair (u, v) is generated when np_i is derived from s_j; P_\phi(s_j) is the probability of structure s_j; and P_\phi(u, v) is the probability of generating the word pair (u, v) given any word modification relation. P_\phi(s_j) and P_\phi(u, v) are subject to the constraint of summing to 1 over all modification structures and over all possible word combinations, respectively.[7] The model is clearly a special case of the class of algebraic language models, in which the probabilities are expressed as polynomials in the parameters (Lafferty 95). For such models, the M-step in the EM algorithm can be carried out exactly, and the parameter update formulas are:

P_\phi^{(n+1)}(s_j) = \frac{1}{\lambda_1} \sum_{np_i \in NP} c(np_i)\, P_{\phi^{(n)}}(s_j \mid np_i)

P_\phi^{(n+1)}(u, v) = \frac{1}{\lambda_2} \sum_{np_i \in NP} \sum_{s_j \in S} c(np_i)\, P_{\phi^{(n)}}(s_j \mid np_i)\, c(u, v;\, np_i, s_j)

where \lambda_1 and \lambda_2 are the Lagrange multipliers corresponding to the two constraints mentioned above, and are given by:

\lambda_1 = \sum_{np_i \in NP} c(np_i), \qquad \lambda_2 = \sum_{np_i \in NP} \sum_{s_j \in S} c(np_i)\, P_{\phi^{(n)}}(s_j \mid np_i) \sum_{(u,v) \in M(np_i, s_j)} c(u, v;\, np_i, s_j)

[Footnote 6] For example, if np_i is "information retrieval technique" and s_j is "[[X1 X2] X3]", then M(np_i, s_j) = {(information, retrieval), (retrieval, technique)}.

[Footnote 7] One problem with this simplification is that the model may generate a set of word modification pairs that does not form a noun phrase, although such "illegal noun phrases" are never observed. A better model would be to write the probability of each word modification pair as the conditional probability of the modifier (i.e., the modifying word) given the head (i.e., the word being modified), that is,

P_\phi(np_i, s_j) = P_\phi(s_j)\, P(h(np_i)) \prod_{(u,v) \in M(np_i, s_j)} P_\phi(u \mid v)^{c(u, v;\, np_i, s_j)}

where h(np_i) is the head (i.e., the last word) of the noun phrase np_i (Lafferty 96).</Paragraph>
<Paragraph position="5"> The EM algorithm ensures that L^{(n+1)} >= L^{(n)}; in other words, every parameter update increases the likelihood. Thus, at training time, the parser can first randomly initialize the parameters and then iteratively update them according to the update formulas until the increase in the likelihood is smaller than some pre-set threshold.[8] In the implementation described here, the maximum length of any noun phrase is limited to six words. In practice, this is not a very tight limit, since simple noun phrases with more than six words are quite rare. The sum over all possible structures for a noun phrase is computed by enumerating all the possible structures of the same length as the noun phrase. For example, in the case of a three-word noun phrase, only two structures need to be enumerated.</Paragraph> <Paragraph position="6"> At parsing time, the structure S(np) assigned to a noun phrase np is determined by

S(np) = \arg\max_{s \in S} P_\phi(s \mid np) = \arg\max_{s \in S} P_\phi(np, s)

We found that the parameters may easily be biased owing to data sparseness. For example, when the data is sparse, the modification structure parameters naturally prefer left association to right association in the case of three-word noun phrases. Such bias in the modification structure parameters is propagated to the word modification parameters when the parameters are iteratively updated with the EM algorithm. In the experiments reported in this paper, an over-simplified solution is adopted: we simply fix the modification structure parameters and assume that every dependency structure is equally likely.</Paragraph> <Paragraph position="7"> Fast training is achieved by reading all the noun phrase instances into memory.[9] This forces us to split the whole noun phrase corpus into small chunks for training. In the experiments reported in this paper, we split the corpus into chunks of about 4 megabytes each. Each chunk contains about 170,000 raw (about 100,000 unique) multiple-word noun phrases. The parameters estimated on each sub-corpus are then merged (averaged). We do not know how much this merging affects the parameter estimation, but based on a rough check of the parsing results, the majority of phrases appear to be parsed correctly with the merged parameters. With this approach, it takes a 133-MHz DEC Alpha workstation about 5 hours to train the parser over the noun phrases from a 250-megabyte text corpus. Parsing is much faster, taking less than 1 hour to parse all the noun phrases in a 250-megabyte text corpus. The parsing speed can thus be scaled up to gigabytes of text, even when the parser needs to be re-trained over the noun phrases in the whole corpus. However, these figures do not include the time required to extract the noun phrases for training.</Paragraph>

[Footnote 8] For the experiments reported in this paper, the threshold is 2.

[Footnote 9] An alternative would be to keep the corpus on disk. In that case, it is not necessary to split the corpus unless it is extremely large.
In the experiments described in the following section, the CLARIT noun phrase extractor is used to extract all the noun phrases from the 250-megabyte text corpus.</Paragraph> <Paragraph position="8"> After the training on each chunk, the estimated word modification parameters are smoothed to account for unseen word modification pairs.</Paragraph> <Paragraph position="9"> Smoothing is done by "dropping" a certain number of parameters that have the lowest probabilities, taking out the probabilities of the dropped parameters, and evenly distributing these probabilities among all the unseen word pairs as well as the pairs of the dropped parameters. It is unnecessary to keep the dropped parameters after smoothing, so this method of smoothing also helps reduce the memory load when merging parameters. In the experiments reported in this paper, nearly half of the total number of word pairs seen in a training chunk were dropped. Since word pairs with the lowest probabilities generally occur quite rarely in the corpus and usually represent semantically illegal word combinations, dropping such word pairs does not affect the parsing output as significantly as it might seem; in fact, it may not affect the parsing decisions for the majority of noun phrases in the corpus at all.</Paragraph> <Paragraph position="10"> The potential parameter space for the probabilistic model can become extremely large as the training corpus grows. One solution to this problem is to use a class-based model similar to the one proposed in (Brown et al. 92), or to use parameters of conceptual association rather than word association, as discussed in (Lauer 94; Lauer 95).</Paragraph> </Section> <Section position="5" start_page="315" end_page="316" type="metho"> <SectionTitle> 4 Experiment Design </SectionTitle> <Paragraph position="0"> We used the CLARIT commercial retrieval system as a retrieval engine to test the effectiveness of different indexing sets. The CLARIT system uses the vector space retrieval model (Salton and McGill 83), in which documents and the query are all represented by vectors of weighted terms (either single words or phrases), and the relevancy judgment is based on the similarity (measured by the cosine measure) between the query vector and any document vector (Evans et al. 93; Evans and Lefferts 95; Evans et al. 96). The experiment procedure is described by Figure 1.</Paragraph> <Paragraph position="1"> First, the original database is parsed to form different sets of indexing terms (say, using different combinations of phrases). Then, each indexing set is passed to the CLARIT retrieval engine as a source document set. The CLARIT system is configured to accept each indexing set as is, to ensure that the actual indexing terms used inside the CLARIT system are exactly those generated.</Paragraph>
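For reference, cosine similarity over weighted term vectors, the core of the vector space model just described, can be sketched as follows. The dict-of-weights representation is an illustrative assumption; CLARIT's internal representation and term weighting scheme are not described in this passage.

```python
import math

def cosine(doc, query):
    # doc, query: dicts mapping an indexing term (word or phrase) to its
    # weight; relevance is the cosine of the angle between the vectors.
    dot = sum(w * query[t] for t, w in doc.items() if t in query)
    norm = math.sqrt(sum(w * w for w in doc.values())) * \
           math.sqrt(sum(w * w for w in query.values()))
    return dot / norm if norm else 0.0
```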
<Paragraph position="2"> It is possible to generate three different kinds/levels of indexing units from a noun phrase: (1) single words; (2) head modifier pairs (i.e., any word pair in the noun phrase that has a linguistic modification relation); and (3) the full noun phrase. For example, from the phrase structure "[[[heavy construction] industry] group]" (a real example from WSJ90), it is possible to generate the following candidate terms:

SINGLE WORDS: heavy, construction, industry, group
HEAD MODIFIERS: construction industry, industry group, heavy construction
FULL NP: heavy construction industry group

Different combinations of the three kinds of terms can be selected for indexing. In particular, the indexing set formed solely of single words is used as a baseline to test the effect of using phrases. In the experiments reported here, we generated four different combinations of phrases. The results from these different phrase sets are discussed in the next section.</Paragraph> </Section> </Paper>