<?xml version="1.0" standalone="yes"?>
<Paper uid="A97-1046">
  <Title>Fast Statistical Parsing of Noun Phrases for Document Indexing</Title>
  <Section position="3" start_page="312" end_page="313" type="intro">
    <SectionTitle>
2 Phrases for Document Indexing
</SectionTitle>
    <Paragraph position="0"> In most current IR systems, documents are primarily indexed by single words, sometimes supplemented by phrases obtained with statistical approaches, such as frequency counting of adjacent word pairs. However, single words are often ambiguous and not specific enough for accurate discrimination of documents.</Paragraph>
    <Paragraph position="1"> For example, using only the words &quot;bank&quot; and &quot;terminology&quot; for indexing is not enough to distinguish &quot;bank terminology&quot; from &quot;terminology bank&quot;. More specific indexing units are needed. Syntactic phrases (i.e., phrases with certain syntactic relations) are almost always more specific than single words and thus are intuitively attractive for indexing. For example, if &quot;bank terminology&quot; occurs in the document, then we can use the phrase &quot;bank terminology&quot; as an additional unit to supplement the single words &quot;bank&quot; and &quot;terminology&quot; for indexing. In this way, a query with &quot;bank terminology&quot; will match the document better than one with &quot;terminology bank&quot;, since the indexing phrase &quot;bank terminology&quot; provides extra discrimination.</Paragraph>
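    <Paragraph> As a minimal sketch of this idea (not the paper's system), the following Python example indexes a toy document by single words plus phrases. Adjacent word pairs stand in for syntactic phrases here, which is an assumption made for brevity; the names index_units, build_index, and score are hypothetical.

```python
from collections import defaultdict


def index_units(text):
    """Return single words plus adjacent word pairs (a stand-in for parsed phrases)."""
    words = text.lower().split()
    pairs = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + pairs


def build_index(docs):
    """Map each indexing unit (word or phrase) to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for unit in index_units(text):
            index[unit].add(doc_id)
    return index


def score(query, doc_id, index):
    """Count the distinct query units (words and pairs) that the document is indexed by."""
    return sum(1 for unit in set(index_units(query)) if doc_id in index.get(unit, ()))


docs = {"d1": "bank terminology"}  # toy document, not taken from the paper
index = build_index(docs)
print(score("bank terminology", "d1", index))  # two words plus the phrase
print(score("terminology bank", "d1", index))  # two words only
```

    With the phrase as an extra unit, the query in the document's word order matches three units, while the reversed query matches only the two single words.</Paragraph>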
    <Paragraph position="2"> Despite the intuitive appeal of using phrases for indexing, syntactic phrases have been reported to yield no significant improvement in retrieval performance (Lewis 91; Belkin and Croft 87; Fagan 87). Moreover, Fagan (Fagan 87) found that syntactic phrases are not superior to simple statistical phrases. Lewis discussed why syntactic phrase indexing has not worked and concluded that the problems with syntactic phrases are for the most part statistical (Lewis 91). Indeed, many (perhaps most) syntactic phrases have very low frequency and tend to be over-weighted by the normal weighting method.</Paragraph>
    <Paragraph position="4"> However, the collections used in these early experiments were relatively small. We want to see whether a much larger collection will make a difference. It is possible that a larger document collection might increase the frequency of most phrases, and thus alleviate the problem of low frequency.</Paragraph>
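    <Paragraph> The over-weighting effect can be illustrated with a small sketch. It assumes that the "normal weighting method" behaves like inverse document frequency, log(N / df), which the text does not specify; the collection size and document frequencies below are hypothetical.

```python
import math


def idf(num_docs, doc_freq):
    """Inverse document frequency log(N / df), assumed here as the weighting scheme."""
    return math.log(num_docs / doc_freq)


NUM_DOCS = 10_000  # hypothetical collection size
WORD_DF = 500      # hypothetical: a single word found in 500 documents
PHRASE_DF = 2      # hypothetical: a syntactic phrase found in only 2 documents

word_weight = idf(NUM_DOCS, WORD_DF)
phrase_weight = idf(NUM_DOCS, PHRASE_DF)
print(f"word weight:   {word_weight:.2f}")
print(f"phrase weight: {phrase_weight:.2f}")
```

    Under this assumption the rare phrase receives a much larger weight than the word, even though that weight rests on only two occurrences.</Paragraph>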
    <Paragraph position="5"> We consider only noun phrases and the subphrases derived from them. Specifically, we want to obtain the full modification structure of each noun phrase in the documents and the query. From the viewpoint of NLP, the task is noun phrase parsing (i.e., the analysis of noun phrase structure). When the phrases are used only to supplement, not replace, the single words for indexing, some parsing errors may be tolerable; that is, the penalty for a parsing error may not be significant. The challenge, however, is to parse gigabytes of text in a practically feasible time and as accurately as possible. Previous work taking on this challenge includes (Evans et al. 91; Evans et al. 96; Evans and Zhai 96; Strzalkowski and Carballo 94; Strzalkowski et al. 95), among others. Evans et al. exploited the &quot;attestedness&quot; of subphrases to partially reveal the structure of long noun phrases (Evans et al. 91; Evans et al. 96). Strzalkowski et al. adopted a fast Tagged Text Parser (TTP) to extract head-modifier pairs, including those in a noun phrase (Strzalkowski 92; Strzalkowski and Vauthey 92; Strzalkowski and Carballo 94; Strzalkowski et al. 95). In (Strzalkowski et al. 95), the structure of a noun phrase is disambiguated based on certain statistical heuristics, but there seems to be no effort to assign a full structure to every noun phrase.</Paragraph>
    <Paragraph position="6"> Furthermore, manual effort is needed in constructing grammar rules. Thus, the approach in (Strzalkowski et al. 95) does not address the special need for scalability and robustness along with speed. Evans and Zhai explored a hybrid noun phrase analysis method and used a quite rich set of phrases for document indexing (Evans and Zhai 96). The indexing method was evaluated using the Associated Press newswire 89 (AP89) database in Tipster Disk1, and a general improvement in retrieval performance over indexing with single words and full noun phrases was reported. However, the phrase extraction system reported in (Evans and Zhai 96) is still not fast enough to deal with document collections measured in gigabytes. We propose here a probabilistic model of noun phrase parsing. A fast statistical noun phrase parser has been developed based on the probabilistic model.</Paragraph>
    <Paragraph position="7"> The parser is fast and can be scaled up to parse gigabytes of text within an acceptable time. Our goal is to generate different kinds of candidate syntactic phrases from the structure of a noun phrase so that the effectiveness of different combinations of phrases and single words can be tested.</Paragraph>
  </Section>
</Paper>