<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-2104">
  <Title>Experiments in Automated Lexicon Building for Text Searching</Title>
  <Section position="2" start_page="0" end_page="719" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In develot)ing a syst'em to find answers in text to user questions, we mmovered a major obstacle: Doemnent sentences t'hat contained answers dkl not of_ ten use the same expressions as the question. While an:;wers in documents and questiolts llse terms that' are relat'e(l to each other, a system that sear(:hes for answers based on the quesl:ion wording will often fail. 3.b address t'his probleln, we develol)ed techniques to al,tomatically build a lexicon of associated terms t'hat can be used to hell) lind al)lIrol/riate bext' seglllent,s.</Paragraph>
    <Paragraph position="1"> The mismatch })et'ween (tuestion an(l doctlttlent wording was I)rought home to us in an analysis of a testbed of question/answer l/airs. \~Ze had a collection of newswire articles about the Clinton impeachment t'() use as a small-scale corl)uS fin' development of ;_t system. V~Ze asked several l)eol)le to 1)ose questions about this well-known t'opic, but we (lid not make the corpus availal)le to our cont'ril)utors. \~Ze wanted to avoid quest'ions that tracked t'he terminology in t'he corlms too (:losely to sinnllate quest'ions t'o a real-world syst'em. The result was a set of questions that used language that' rarely nmtched t'he phrasing in the. corl)us. \,Ve had expected t'hat' we would be able to make most of these lexical connections with the hel l) of V~rordnet (Miller, 1990).</Paragraph>
    <Paragraph position="2"> For example, consider a simple quest'ion al)out testimony: &amp;quot;Did Secret Service agents give testimony about' Bill Clinton?&amp;quot; There is no reason t'o expect that' the answer would appear 1)aldly st'ated as &amp;quot;Secret Service. agents dkl testi(y ...&amp;quot; What we need to know is what' testimony is about', where it: occurs, who gives it. The answer would lie likely to be found in a passage ment'ioning juries, or 1)roseeut'ors, like these tbund in our Clinton corl)uS: Starr immediately brought Secret Service employees before tim grand jury for questioning. null Prosecutors repeat'edly asked Secret Serviee 1)ersonnel to rel)eat' gossil) they may have heard.</Paragraph>
    <Paragraph position="3"> Yet, tile V~ordnet synsets fbr &amp;quot;testinlony&amp;quot; offer: &amp;quot;evidence, assertion, averment alia asseveration,&amp;quot; not a very hell)tiff selection here. -Wordnet hypernyms become general quickly: &amp;quot;declarat'ion,&amp;quot; &amp;quot;indicat'ion&amp;quot; and &amp;quot;inforlnation&amp;quot; are only one st, eli u 1) in t'lle hierarehy. Following these does not lead us into a courtroom.</Paragraph>
    <Paragraph position="4"> We asked our cont'ril)ut'ors for a second round of questions, but this time made the corpus available to them, exl)laining t'hat we wanted to be sure the answers were contained in t'he collection of articles.</Paragraph>
    <Paragraph position="5"> 'J'he result was a set of questions that' mueh more closely matched t'he wording in the corpus. This was~ in t'aet, what' the 1999 DARPA question-answering (:oml)et'ition did in order t'o ensure that their questions couhl be answered (Singhal, 1!199). The sectrod question-answering conference adopted a new approach to gathering questions and verifying separately that' they a.re answerable.</Paragraph>
    <Paragraph position="6"> Our intuition is t'hat if we can lind the tyl)ical lexical neighborhoods of concept's, we can efficiently locate a concept described in a query or a question without needing to know the precise way the answer is phrased and without relying on a cost'ly, hand-built concept' hierarchy.</Paragraph>
    <Paragraph position="7"> The example above illustrat'es the 1)oint. Testimony is given 1) 3, wit'nesses, defendant's, eyewitnesses. It is solicited by 1)rosecutors, counsels, lawyers. It is heard by judges, juries at trials, hearings, and recorded in depositions and transcripts.</Paragraph>
    <Paragraph position="8"> What' we wanted was a complete description of t'he world of testimony - the who, what, when and where of the word. Or, in other words, the &amp;quot;metaaboutness&amp;quot; of terms.</Paragraph>
    <Paragraph position="9"> To this end, we exl)erimented /tSitlg shallow linguist.k: techniques t'o gat'her and analyze word co-occurrence data in various configurat'ions. Unlike previous collocation research, we were int'erested in an expansive set' of relationships between words  rather than a specific relationship. More important, we felt that the information we needed could be derived from an analysis that crossed clause and sentence boundaries. We hyl)othesized that news articles would be coherent so that the sequences of sentences and clauses would be linked conceptually.</Paragraph>
    <Paragraph position="10"> We exanfined the nouns in a number of configurations - paragraphs, sentences, clauses and sequences of clauses - and obtained tile strongest results from configurations that count co-occurrences across the surface subjects of sequences of two to six clauses.</Paragraph>
    <Paragraph position="11"> Exl)eriments with multi-clause configurations were generally more accurate in a variety of experiments.</Paragraph>
    <Paragraph position="12"> In the next section, we briefly review related research. In section 3 we describe our experiments.</Paragraph>
    <Paragraph position="13"> In section 4, we discuss the problem of evaluation, and look ahead to future directions in the concluding sections.</Paragraph>
  </Section>
class="xml-element"></Paper>