<?xml version="1.0" standalone="yes"?>
<Paper uid="A94-1028">
  <Title>Robust Text Processing in Automated Information Retrieval</Title>
  <Section position="3" start_page="169" end_page="169" type="metho">
    <SectionTitle>
2 Overall Design
</SectionTitle>
    <Paragraph position="0"> We have established the general architecture of a NLP-IR system, depicted schematically below, in which an advanced NLP module is inserted between the textual input (new documents, user queries) and the database search engine (in our case, NIST's PRISE system). This design has already shown some promise in producing a better performance than the base statistical system (Strzalkowski, 1993b). We would like to point out at the outset that this system is completely automated, including the statistical core, and the natural language processing components, and no human intervention or manual encoding is required.</Paragraph>
    <Paragraph position="1"> NIP: TAGGER PARSER terms In our system the database text is first processed with a sequence of programs that include a part-of-speech tagger, a lexicon-based morphological stemmer and a fast syntactic parser. Subsequently certain types of phrases are extracted from the parse trees and used as compound indexing terms in addition to single-word terms. The extracted phrases are statistically analyzed as syntactic contexts in order to discover a variety of similarity links between smaller subphrases and words occurring in them. A further filtering process maps these similarity links onto semantic relations (generalization, specialization, synonymy, etc.) after which they are used to transform a user's request into a search query.</Paragraph>
    <Paragraph position="2"> The user's natural language request is also parsed, and all indexing terms occurring in it are identified. Certain highly ambiguous, usually single-word terms may be dropped, provided that they also occur as elements in some compound terms. For example, &amp;quot;natural&amp;quot; may be deleted from a query already containing &amp;quot;natural language&amp;quot; because &amp;quot;natural&amp;quot; occurs in many unrelated contexts: &amp;quot;natural number&amp;quot;, &amp;quot;natural logarithm&amp;quot;, &amp;quot;natural approach&amp;quot;, etc. At the same time, other terms may be added, namely those which are linked to some query term through admissible similarity relations. For example, &amp;quot;unlawful activity&amp;quot; is added to a query (TREC topic 055) containing the compound term &amp;quot;illegal activity&amp;quot; via a synonymy link between &amp;quot;illegal&amp;quot; and &amp;quot;unlawful&amp;quot;.</Paragraph>
    <Paragraph position="3"> One of the observations made during the course of TREC-2 was to note that removing low-quality terms from the queries is at least as important (and often more so) as adding synonyms and specializations.</Paragraph>
    <Paragraph position="4"> In some instances (e.g., routing runs) low-quality terms had to be removed (or inhibited) before similar terms could be added to the query or else the effect of query expansion was all but drowned out by the increased noise.</Paragraph>
    <Paragraph position="5"> After the final query is constructed, the database search follows, and a ranked list of documents is returned. It should be noted that all the processing steps, those performed by the backbone system, and those performed by the natural language processing components, are fully automated, and no human intervention or manual encoding is required.</Paragraph>
  </Section>
  <Section position="4" start_page="169" end_page="170" type="metho">
    <SectionTitle>
3 Fast Parsing with TTP Parser
</SectionTitle>
    <Paragraph position="0"> T/'P (Tagged Text Parser) is a full-grammar parser based on the Linguistic String Grammar developed by Sager (1981). It currently encompasses most of the grammar productions and many of the restrictions, but it is by no means complete. Unlike a conventional parser, TYP's output is a regularized representation of each sentence which reflects its logical predicate-argument structure, e.g., logical subject and logical objects are identified depending upon the main verb subcategorization frame. For example, the verb abide has, among others, a subcategorization frame in which the object is a prepositional phrase with by, as in he'll abide by the court's decision, i.e., ABIDE: subject NP object PREP by NP Subcategorization information is read from the on-line Oxford Advanced Learner's Dictionary (OALD) which TI~P uses.</Paragraph>
    <Paragraph position="1"> Also unlike a conventional parser, TTP is equipped with a powerful skip-and-fit recovery mechanism that allows it to operate effectively in the face of ill-formed input or under severe time pressure. A built-in timer regulates the amount of time allowed for parsing any one sentence: if a parse is not returned before the allotted time elapses, TTP enters the skipping mode in which it will try to &amp;quot;fit&amp;quot; the parse. While in the skip-and-fit mode, the parser attempts to forcibly reduce incomplete constituents, possibly skimming over portions of input in order to restart processing at a next unattempted constituent; in other words, it will favor reduction over backtracking. The result of this strategy is an approximate parse, partially fitted using top-down predictions. In runs with approximately 130 million words of TREC's Wall Street Journal and San Jose Mercury texts, the parser's speed averaged 30 minutes per Megabyte or about 80 words per second, on a Sun SparcStationl0. In addition, TIP has been shown to produce parse structures which are no worse  than those generated by full-scale linguistic parsers when compared to hand-coded parse trees. 5 Full details of TTP parser have been described in the TREC-1 report (Strzalkowski, 1993a), as well as in other works (Strzalkowski, 1992; Strzalkowski &amp; Scheyen, 1993).</Paragraph>
    <Paragraph position="2"> As may be expected, the skip-and-fit strategy will only be effective if the input skipping can be performed with a degree of determinism. This means that most of the lexical level ambiguity must be removed from the input text, prior to parsing. We achieve this using a stochastic parts of speech tagger to preprocess the text prior to parsing. In order to streamline the processing, we also perform morphological normalization of words on the tagged text, before parsing. This is possible because the part-of-speech tags retain the information about each word's original form. Thus the sentence The Soviets have been notified is transformed into the/dt soviet/nps have/vbp be/vbn notify/vbn before parsing commences. 6</Paragraph>
  </Section>
  <Section position="5" start_page="170" end_page="171" type="metho">
    <SectionTitle>
4 Head-Modifier Structures
</SectionTitle>
    <Paragraph position="0"> Syntactic phrases extracted from TIP parse structures are represented as head-modifier pairs. The head in such a pair is a central element of a phrase (main verb, main noun, etc.), while the modifier is one of the adjuncts or arguments of the head. In the TREC experiments reported here we extracted head-modifier word pairs only, i.e., nested pairs were not used even though this was warranted by the size of the database.</Paragraph>
    <Paragraph position="1"> Figure 1 shows all stages of the initial linguistic analysis of a sample sentence from the WSJ database.</Paragraph>
    <Paragraph position="2"> The reader may note that the parser's output is a predicate-argument structure centered around the main elements of various phrases. For example, BE is the main predicate (modified by HAVE) with 2 arguments (subject, object) and 2 adjuncts (adv, sub_ord).</Paragraph>
    <Paragraph position="3"> INVADE is the predicate in the subordinate clause with 2 arguments (subject, object). The subject of BE is a noun phrase with PRESIDENT as the head element, two modifiers (FORMER, SOVIET) and a determiner (THE). From this structure, we extract head-modifier pairs that become candidates for compound terms. In general, the following types of pairs are considered: (1) a head noun of a noun phrase and its left adjective or noun adjunct, (2) a head noun and the head of its right adjunct, (3) the main verb of a clause and the head of its object phrase, and (4) the head of the subject phrase s Hand-coded parse trees were obtained from the University of Pennsylvania Treebank Project database.</Paragraph>
    <Paragraph position="4"> s The tags are read as follows: dt is determiner, nps is a proper name, vbp is a tensed plural verb, vbn is a past participle. and the main verb. These types of pairs account for most of the syntactic variants for relating two words (or simple phrases) into pairs carrying compatible semantic content. For example, the pair retrieve+information will be extracted from any of the following fragments: information retrieval system; retrieval of information from databases; and information that can be retrieved by a user-controlled interactive search process. We also attempted to identify and remove any terms which were explicitly negated in order to prevent matches against their positive counterparts, either in the data-base or in the queries.</Paragraph>
    <Paragraph position="5"> One difficulty in obtaining head-modifier pairs of highest accuracy is the notorious ambiguity of nominal compounds. The pair extractor looks at the distribution statistics of the compound terms to decide whether the association between any two words (nouns and adjecfives) in a noun phrase is both syntactically valid and semantically significant. For example, we may accept language+natural and processing+language from  natural language processing as correct, however, case+trading would make a mediocre term when extracted from insider trading case. On the other hand, it is important to extract trading+insider to be able to match documents containing phrases insider trading sanctions act or insider trading activity.</Paragraph>
  </Section>
  <Section position="6" start_page="171" end_page="171" type="metho">
    <SectionTitle>
5 Term Weighting Issues
</SectionTitle>
    <Paragraph position="0"> Finding a proper term weighting scheme is critical in term-based retrieval since the rank of a document is determined by the weights of the terms it shares with the query. One popular term weighting scheme, known as ffidf, weights terms proportionately to their inverted document frequency scores and to their in-document frequencies (tO. The in-document frequency factor is usually normalized by the document length, that is, it is more significant for a term to occur in a short 100-word abstract, than in a 5000-word article. 7 A standard ff.idf weighting scheme (see Buckley, 1993 for details) may be inappropriate for mixed term sets, consisting of ordinary concepts, proper names, and phrases, because: (1) It favors terms that occur fairly frequently in a document, which supports only general-type queries (e.g., &amp;quot;all you know about X&amp;quot;). Such queries were not typical in TREC.</Paragraph>
    <Paragraph position="1"> (2) It attaches low weights to infrequent, highly specific terms, such as names and phrases, whose only occurrences in a document are often decisive for relevance. Note that such terms cannot be reliably distinguished using their distribution in the database as the sole factor, and therefore syntactic and lexical information is required.</Paragraph>
    <Paragraph position="2"> (3) It does not address the problem of inter-term dependencies arising when phrasal terms and their component single-word terms are all included in a document representation, i.e., launch+satellite and satellite are not independent, and it is unclear whether they should be counted as two terms.</Paragraph>
    <Paragraph position="3"> In our post-TREC-2 experiments we considered (1) and (2) only. We changed the weighting scheme so that the phrases (but not the names, which we did not distinguish in TREC-2) were more heavily weighted by their idf scores while the in-document frequency scores were replaced by logarithms multiplied by sufficiently large constants. In addition, the top N highest-idf matching terms (simple or compound) were counted 7 This is not always true, for example when all occurrences of a term are concentrated in a single section or a paragraph rather than spread around the article.</Paragraph>
    <Paragraph position="4"> more toward the document score than the remaining terms.</Paragraph>
    <Paragraph position="5"> Schematically, these new weights for phrasal and highly specific terms are obtained using the following formula, while weights for most of the single-word terms remain unchanged: weight (Ti)=( C1 *log (0c)+C 2 * ot(N,i) )*idf In the above, ~t(N,i) is 1 for i &lt;N and is 0 otherwise. The selection of a weighting formula was partly constrained by the fact that document-length-normalized tf weights were precomputed at the indexing stage and could not be altered without re-indexing of the entire database. The intuitive interpretation of the oL(N,i) factor is as follows. We restrict the maximum number of terms on which a query is permitted to match a document to N highest weight terms, where N can be the same for all queries or may vary from one query to another. Note that this is not the same as simply taking the N top terms from each query. Rather, for each document for which there are M matching terms with the query, only min(M,N) of them, namely those which have highest weights, will be considered when computing the document score. Moreover, only the global importance weights for terms are considered (such as idf), while local in-document frequency (eg., tO is suppressed by either taking a log or replacing it with a constant.</Paragraph>
    <Paragraph position="6"> Changing the weighting scheme for compound terms, along with other minor improvements (such as expanding the stopword list for topics, or correcting a few parsing bugs) has lead to an overall increase of precision of more than 20% over our official TREC-2 ad-hoc results. Table 1 includes statistics of these new runs for 50 queries (numbered 101-150) against the WSJ database. The gap between the precision levels in columns txt2 and con reflects the difference in the quality of the queries obtained from the narrative parts of the topics (txt2 = title + desc + narr), and those obtained primarily from expert's formulation (title + desc + con). The column txt2+nlp represents the improvement of txt2 queries thanks to NLP, with as much as 70% of the gap closed. Similar improvements have been obtained for other sets of queries.</Paragraph>
  </Section>
class="xml-element"></Paper>