File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/94/c94-1100_intro.xml

Size: 4,103 bytes

Last Modified: 2025-10-06 14:05:40

<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-1100">
  <Title>BUILDING A LEXICAL DOMAIN MAP FROM TEXT CORPORA</Title>
  <Section position="3" start_page="0" end_page="604" type="intro">
    <SectionTitle>
INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> The task of information retrieval is to extract relevant documents from a large collection of documents in response to a user's query. When the documents contain primarily unrestricted text (e.g., newspaper articles, legal documents, etc.) the relevance of a document is established through 'full-text' retrieval. This has usually been accomplished by identifying key terms in the documents (the process known as 'indexing') which could then be matched against terms in queries (Salton, 1989). The effectiveness of any such term-based approach is directly related to the accuracy with which a set of terms represents the content of a document, as well as how well it contrasts a given document with respect to other documents. In other words, we are looking for a representation R such that for any text items D1 and D2, R(D1) = R(D2) iff meaning(D1) = meaning(D2), at an appropriate level of abstraction (which may depend on types and character of anticipated queries).</Paragraph>
    <Paragraph position="1"> For all kinds of terms that can be assigned to the representation of a document, e.g., words, operator-argument pairs, fixed phrases, and proper names, various levels of &amp;quot;regularization&amp;quot; are needed to assure that syntactic or lexical variations of input do not obscure underlying semantic uniformity. Without actually doing semantic analysis, this kind of normalization can be achieved through the following processes (see footnote 1): (1) morphological stemming: e.g., retrieving is reduced to retriev; Footnote 1: An alternative, but less efficient, method is to generate all variants (lexical, syntactic, etc.) of words/phrases in the queries (Sparck-Jones &amp; Tait, 1984).</Paragraph>
    <Paragraph position="2">  (2) lexicon-based word normalization: e.g., retrieval is reduced to retrieve; (3) operator-argument representation of phrases: e.g., information retrieval, retrieving of information, and retrieve relevant information are all assigned the same representation, retrieve+information; (4) context-based term clustering into synonymy classes and subsumption hierarchies: e.g., takeover is a kind of acquisition (in business), and Fortran is a programming language.</Paragraph>
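Processes (1) to (3) above can be illustrated with a minimal Python sketch. This is not the system's implementation. The suffix list, the small lexicon, the operator set, the ignored-word list, and the function names (stem, lexicon_normalize, operator_argument) are all assumptions made for illustration, and the operator-argument step is a toy word filter standing in for a real parse.

```python
# Illustrative only: toy stand-ins for normalization steps (1)-(3).
# Every table and function name below is hypothetical.

SUFFIXES = ("ing", "al", "ed", "e", "s")  # assumed, far from complete

# Assumed lexicon mapping word forms to a canonical root form.
LEXICON = {
    "retrieval": "retrieve",
    "retrieving": "retrieve",
    "retrieve": "retrieve",
    "information": "information",
}

OPERATORS = {"retrieve"}        # assumed set of predicate (operator) roots
IGNORED = {"of", "relevant"}    # toy filter; a real system would use a parser


def stem(word):
    """Step (1): strip the first matching suffix (crude stemming)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lexicon_normalize(word):
    """Step (2): map a word form to its root using the lexicon, if listed."""
    return LEXICON.get(word, word)


def operator_argument(phrase):
    """Step (3): reduce a phrase to an operator+argument pair of roots."""
    roots = [lexicon_normalize(w) for w in phrase.lower().split()]
    content = [r for r in roots if r not in IGNORED]
    operator = next(r for r in content if r in OPERATORS)
    argument = next(r for r in content if r not in OPERATORS)
    return operator + "+" + argument
```

Under these assumptions, stem("retrieving") yields retriev, lexicon_normalize("retrieval") yields retrieve, and the three phrase variants from item (3) all map to retrieve+information. Step (4), term clustering, depends on corpus statistics and is not sketched here.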
    <Paragraph position="3"> We have established the general architecture of an NLP-IR system that accommodates these considerations. In a general view of this design, depicted schematically below, an advanced NLP module is inserted between the textual input (new documents, user queries) and the database search engine (in our case, NIST's PRISE system).</Paragraph>
    <Paragraph position="4"> [Figure: NLP: TAGGER, PARSER, terms] This design has already shown some promise in producing significantly better performance than the base statistical system (Strzalkowski, 1993). Its practical significance stems in no small part from the use of a fast and robust parser, TTP (see footnote 2), which can process unrestricted text at speeds below 0.2 sec per sentence. TTP's output is a regularized representation of each sentence which reflects logical predicate-argument structure, e.g., logical subject and logical objects are identified depending upon the main verb subcategorization frame. For example, the verb abide has, among others, a subcategorization frame in which the object is a prepositional phrase with by, i.e., ABIDE: subject NP object PREP by NP. Subcategorization information is read from the on-line</Paragraph>
    <Section position="1" start_page="0" end_page="604" type="sub_section">
      <SectionTitle>
Oxford Advanced Learner's Dictionary (OALD) which
</SectionTitle>
      <Paragraph position="0"> TTP uses.</Paragraph>
      <Paragraph position="1"> Footnote 2: TTP stands for Tagged Text Parser, and it has been described in detail in (Strzalkowski, 1992) and evaluated in (Strzalkowski &amp; Scheyen, 1993).</Paragraph>
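The subcategorization frame given for abide can be pictured as a small data structure. The Python sketch below is an illustration under stated assumptions, not the format used by TTP or the OALD: the Frame class, its field names, and the logical_roles helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Frame:
    """Hypothetical encoding of one subcategorization frame."""
    verb: str
    subject: str                       # category of the logical subject, e.g. "NP"
    obj: str                           # category of the logical object, e.g. "NP"
    object_prep: Optional[str] = None  # preposition introducing the object, if any


# ABIDE: subject NP object PREP by NP
ABIDE_BY = Frame(verb="abide", subject="NP", obj="NP", object_prep="by")


def logical_roles(frame, subject_np, prep, object_np):
    """Return a predicate-argument record if the preposition fits the frame, else None."""
    if prep != frame.object_prep:
        return None
    return {"predicate": frame.verb, "subject": subject_np, "object": object_np}
```

With this encoding, a clause such as "the company abides by the rules" would be matched by calling logical_roles(ABIDE_BY, "the company", "by", "the rules"), while a clause with a different preposition would not fit this particular frame.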
    </Section>
  </Section>
</Paper>