<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-3004">
  <Title>CL Research's Knowledge Management System</Title>
  <Section position="5" start_page="13" end_page="13" type="metho">
    <SectionTitle>
2 Parsing and Creation of XML Tagging
</SectionTitle>
    <Paragraph position="0"> KMS and each of its application areas are based on parsing text and then transforming parse trees into an XML representation. CL Research uses the Proximity Parser, developed by an inventor of top-down syntax-directed parsing (Irons, 1961). The parser output consists of bracketed parse trees, with leaf nodes describing the part of speech and lexical entry for each word of the sentence. Annotations, such as number and tense information, may be included at any node.</Paragraph>
    <Paragraph position="1"> (Litkowski (2002) and references therein provide more details on the parser.) After each sentence is parsed, its parse tree is traversed by a depth-first recursive function. During this traversal, each non-terminal and terminal node is analyzed to identify discourse segments (sentences and clauses), noun phrases, verbs, adjectives, and prepositional phrases. These items are maintained in lists; the growing lists constitute a document's discourse structure and are used, e.g., in resolving anaphora and establishing coreferents (implementing techniques inspired by Marcu (2000) and Tetreault (2001)). As these items are identified, they are analyzed extensively to characterize them syntactically and semantically. The analysis includes word-sense disambiguation of nouns, verbs (including subcategorization identification), and adjectives, as well as semantic analysis of prepositions to establish their semantic roles (such as described in Gildea &amp; Jurafsky, 2002).</Paragraph>
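    <Paragraph> The traversal described above can be sketched roughly as follows. This is a minimal illustration in Python, not the KMS implementation (which is written in C++). The nested-tuple tree format, the node labels (S, NP, PP, VERB, ADJ), and the function names are assumptions made for this sketch only.
<![CDATA[
```python
from collections import defaultdict

# Hypothetical tree format: (label, child, child, ...) for non-terminals,
# (part_of_speech, word) for leaves. Labels are illustrative only.
TRACKED = {"S": "segments", "NP": "noun_phrases", "PP": "prep_phrases",
           "VERB": "verbs", "ADJ": "adjectives"}


def leaves(node):
    """Return the words under a node, left to right."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]
    words = []
    for child in children:
        words.extend(leaves(child))
    return words


def collect_components(node, lists=None):
    """Depth-first traversal that appends each tracked node to its list."""
    if lists is None:
        lists = defaultdict(list)
    label, *children = node
    if label in TRACKED:
        lists[TRACKED[label]].append(" ".join(leaves(node)))
    for child in children:
        if isinstance(child, tuple):
            collect_components(child, lists)
    return lists


tree = ("S",
        ("NP", ("DET", "the"), ("ADJ", "old"), ("NOUN", "parser")),
        ("VP", ("VERB", "reads"),
         ("NP", ("NOUN", "text")),
         ("PP", ("PREP", "from"), ("NP", ("NOUN", "files")))))

result = collect_components(tree)
for name, items in result.items():
    print(name, items)
```
]]>
    Because the lists grow in traversal order, later steps such as anaphora resolution can look back over the components seen so far.</Paragraph>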
    <Paragraph position="2"> When all sentences of a document have been parsed and components identified and analyzed, the various lists are used to generate the XML representation. Most of the properties of the components are used as the basis for establishing XML attributes and values in the final representation. (Litkowski 2003a provides further details on this process.) This representation then becomes the basis for question answering, summarization, information extraction, and document exploration.</Paragraph>
    <Paragraph position="3"> The utility of the XML representation does not stem from an ability to use XML manipulation technologies, such as XSLT and XQuery. In fact, these technologies seem to involve too much overhead. Instead, the utility arises within a Windows-based C++ development environment with a set of XML functions that facilitate working with node sets from a document's XML tree.</Paragraph>
  </Section>
  <Section position="6" start_page="13" end_page="14" type="metho">
    <SectionTitle>
3 Question Answering
</SectionTitle>
    <Paragraph position="0"> As indicated above, the initial implementation of the question-answering component of KMS was designed primarily to determine if suitable XPath expressions could be created for answering questions. CL Research's XML Analyzer was developed for this purpose. XML Analyzer is constructed in a C++ Windows development environment to which a capability for examining XML nodes has been added.</Paragraph>
    <Paragraph position="1"> With this capability, a document can be loaded with one instruction and an XPath expression can be applied against this document in one more instruction to obtain a set of nodes which can be examined in more detail. Crucially, this enables low-level control over subsequent analysis steps (e.g., examining the text of a node with Perl regular expressions).</Paragraph>
    <Paragraph position="2"> XML Analyzer first loads an XML file (which can include many documents, such as the &quot;top 50&quot; used in TREC). The user then presents an XPath expression, and the discourse components (typically, noun phrases) satisfying that expression are returned. XML Analyzer includes the document number, the sentence number, and the full sentence for each noun phrase.</Paragraph>
    <Paragraph position="3"> Several other features were added to XML Analyzer to examine characteristics of the documents and sentences (particularly to identify why an answer wasn't retrieved by an XPath expression).</Paragraph>
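    <Paragraph> The load-then-query workflow can be illustrated with a small sketch. It uses Python's standard xml.etree.ElementTree module, which supports only a subset of XPath, rather than the C++ functions in XML Analyzer. The element names (collection, document, sentence, np), the attributes (id, num, sem), and the function name are hypothetical, since the actual KMS markup is not given here.
<![CDATA[
```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element and attribute names are assumptions,
# not the actual KMS representation.
SAMPLE = """
<collection>
  <document id="D1">
    <sentence num="1">
      <np sem="person">Irons</np> developed the parser in
      <np sem="time">1961</np>.
    </sentence>
  </document>
  <document id="D2">
    <sentence num="1">
      <np sem="group">CL Research</np> entered
      <np sem="event">TREC</np> in <np sem="time">2003</np>.
    </sentence>
  </document>
</collection>
"""


def query_noun_phrases(root, np_xpath):
    """Apply an XPath to each sentence and report matches in context."""
    hits = []
    for doc in root.iter("document"):
        for sent in doc.iter("sentence"):
            # Collapse whitespace so the full sentence reads as one line.
            full_text = " ".join("".join(sent.itertext()).split())
            for np in sent.findall(np_xpath):
                hits.append((doc.get("id"), sent.get("num"), np.text, full_text))
    return hits


root = ET.fromstring(SAMPLE)                          # load in one step
hits = query_noun_phrases(root, "./np[@sem='time']")  # query in one more
for hit in hits:
    print(hit)
```
]]>
    Each hit carries the document identifier, the sentence number, and the full sentence alongside the matching noun phrase, mirroring the context that XML Analyzer reports.</Paragraph>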
    <Paragraph position="4"> XML Analyzer does not include the automatic creation of an XPath expression. KMS was created for TREC 2003 as the initial implementation of a complete question-answering system. In KMS, the question itself is parsed and transformed into an XML representation (using the same underlying functionality for processing documents) and then used to construct an XPath expression.</Paragraph>
    <Paragraph position="5"> An XPath expression consists of two parts. The first part is a &quot;passage retrieval&quot; component, designed to retrieve sentences likely to contain the answer. This basic XPath is then extended for each question type with additional specifications, e.g., to ask for noun phrases that have time, location, or other semantic attributes. Experiments have shown that there is a tradeoff involved in these specifications. If they are very exacting, few possible answers are returned.</Paragraph>
    <Paragraph position="6"> Backoff strategies are used to return a larger set of potential answers and to analyze the context of these potential answers in more detail. The development of routines for automatic creation of XPath expressions is an ongoing process, but has begun to yield more consistent results (Litkowski, 2005).</Paragraph>
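    <Paragraph> The tradeoff and the backoff idea can be sketched as an ordered list of XPath expressions, tried from most to least exacting. This is only an illustration using Python's xml.etree.ElementTree; the markup, the attribute names (sem, head), and the function answer_with_backoff are hypothetical, and the backoff strategies actually used in KMS are not detailed here.
<![CDATA[
```python
import xml.etree.ElementTree as ET

# Hypothetical markup; "sem" and "head" are assumed attribute names.
DOC = ET.fromstring(
    "<document>"
    "<sentence num='1'><np sem='person' head='Irons'>Irons</np>"
    " wrote a parser in <np sem='number' head='1961'>1961</np>.</sentence>"
    "</document>"
)


def answer_with_backoff(root, xpaths):
    """Return (xpath, matches) for the first expression that finds anything."""
    for xpath in xpaths:
        matches = [np.text for np in root.findall(xpath)]
        if matches:
            return xpath, matches
    return None, []


# "When did Irons write a parser?" Each expression keeps the passage
# retrieval part (a sentence mentioning Irons) and relaxes the answer part.
passage = ".//sentence[np='Irons']"
candidates = [
    passage + "/np[@sem='time']",    # exacting: must be tagged as a time
    passage + "/np[@sem='number']",  # looser: accept a bare number
    passage + "/np",                 # loosest: any noun phrase
]

used, answers = answer_with_backoff(DOC, candidates)
print(used, answers)
```
]]>
    Looser expressions return more candidates, so in practice the context of each candidate would need further analysis before an answer is chosen.</Paragraph>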
    <Paragraph position="7"> In preparation for TREC 2004, KMS was further extended to incorporate a web-based component.</Paragraph>
    <Paragraph position="8"> With a check box to indicate whether the web or a document repository should be used, additional functionality was used to pose questions to Google. In web mode, an XML representation of a question is still developed, but it is then analyzed to present an optimal query to Google, typically a pattern that will provide an answer. This involves the use of an integrated dictionary, particularly for creating appropriate inflected forms in the search query. KMS uses only the first page of Google results and does not retrieve the source documents; it extracts sentences from the results page and treats these as the documents. (A user can create a new &quot;document repository&quot; consisting of the documents from which answers have been obtained.) Many additional possibilities have emerged from initial explorations in using web-based question answering.</Paragraph>
  </Section>
  <Section position="7" start_page="14" end_page="14" type="metho">
    <SectionTitle>
4 Summarization
</SectionTitle>
    <Paragraph position="0"> Litkowski (2003a) indicated the possibility that the XML representation of documents could be used for summarization. To investigate this possibility, XML Analyzer was extended to include summarization capabilities for both general and topic-based summaries, including headline and keyword generation. Summarization techniques crucially take into account anaphora, coreferent, and definite noun phrase resolutions. As intimated in the analysis of the parse output, the XML representation for a referring expression is tagged with antecedent information, including both an identifying number and the full text of the antecedent. As a result, in examining a sentence, it is possible to consider the import of all its antecedents, instead of simply the surface form.</Paragraph>
    <Paragraph position="1"> At the present time, only extractive summarization is performed in KMS. The basis for identifying important sentences is simply a frequency count of their words, but using antecedents instead of referring expressions. Stopwords and some other items are eliminated from this count.</Paragraph>
    <Paragraph position="2"> In KMS, the user can create several kinds of summaries. The user specifies the type of summary (general, topic-based, headline, or keyword), which documents to summarize (one or many), and the length. Topic-based summaries require the user to enter search terms. The search terms can be as simple as a person's name or a few keywords or can be several sentences in length.</Paragraph>
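    <Paragraph> A frequency-based extractive scheme of this kind can be sketched as follows. The sketch is not the KMS code: the token format (surface form paired with an optional antecedent), the stopword list, and the choice to score a sentence by summing the document-wide counts of its words are all assumptions made for illustration.
<![CDATA[
```python
from collections import Counter

# A tiny illustrative stopword list (an assumption, not the KMS list).
STOPWORDS = {"the", "a", "an", "of", "in", "it", "is", "was", "and", "to"}


def content_words(sentence):
    """Yield lower-cased words, replacing referring expressions by antecedents.

    A sentence is a list of (surface, antecedent) pairs; antecedent is None
    when the token is not a referring expression (a hypothetical format).
    """
    for surface, antecedent in sentence:
        for word in (antecedent or surface).lower().split():
            if word not in STOPWORDS:
                yield word


def summarize(sentences, length):
    """Pick the `length` highest-scoring sentences, kept in document order."""
    counts = Counter(w for s in sentences for w in content_words(s))
    scores = [sum(counts[w] for w in content_words(s)) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
    return sorted(ranked[:length])


doc = [
    [("The", None), ("parser", None), ("reads", None), ("text", None)],
    [("It", "the parser"), ("builds", None), ("trees", None)],
    [("Users", None), ("ask", None), ("questions", None)],
]
chosen = summarize(doc, 2)
print(chosen)
```
]]>
    In this toy document the second sentence is selected only because "It" is counted as its antecedent "the parser"; counting the surface form would leave it with a lower score than the third sentence. Summing counts favors longer sentences, so a real system might normalize by length.</Paragraph>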
    <Paragraph position="3"> Topic-based summaries use the search terms to give extra weight to sentences containing the search terms. Sentences are also evaluated for their novelty, with redundancy and overlap measures based on examining their noun phrases. KMS summarization procedures are described in more detail in Litkowski (2003b); novelty techniques are described in Litkowski (2005).</Paragraph>
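    <Paragraph> Topic weighting and novelty filtering might be combined along the following lines. The details are assumptions rather than the procedures of Litkowski (2003b, 2005): the sketch boosts a sentence's score by a fixed factor when it contains a search term, and measures redundancy as the Jaccard overlap between sets of noun phrases, with a hypothetical threshold of 0.5.
<![CDATA[
```python
def jaccard(a, b):
    """Overlap between two sets of noun phrases (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_novel(candidates, terms, max_overlap=0.5, boost=2.0):
    """Rank by score (boosted for topic terms), skipping redundant sentences.

    Each candidate is (text, base_score, noun_phrase_set); the boost factor
    and the overlap threshold are illustrative values only.
    """
    def weighted(cand):
        text, score, _ = cand
        hit = any(t.lower() in text.lower() for t in terms)
        return score * boost if hit else score

    kept = []
    for cand in sorted(candidates, key=weighted, reverse=True):
        # Keep a sentence only if it is sufficiently unlike those already kept.
        if all(jaccard(cand[2], k[2]) <= max_overlap for k in kept):
            kept.append(cand)
    return [text for text, _, _ in kept]


candidates = [
    ("KMS parses text into XML.", 3.0, {"kms", "text", "xml"}),
    ("KMS turns text into XML.", 2.5, {"kms", "text", "xml"}),
    ("Summaries are saved in files.", 2.0, {"summaries", "files"}),
    ("Google results are used.", 2.8, {"google results"}),
]
picked = select_novel(candidates, terms=["summaries"])
for text in picked:
    print(text)
```
]]>
    The sentence containing the search term moves to the front, and the near-duplicate second sentence is dropped because its noun phrases fully overlap with a sentence already selected.</Paragraph>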
    <Paragraph position="4"> In KMS, summaries are saved in XML files as sets of sentences, each characterized by its source and sentence number. Each summary uses XML attributes containing the user's specifications and the documents included in the search. Summaries are generated quickly but in whole form.</Paragraph>
  </Section>
  <Section position="8" start_page="14" end_page="15" type="metho">
    <SectionTitle>
5 Document Exploration
</SectionTitle>
    <Paragraph position="0"> KMS includes two major components for exploring the contents of a document. The first is based on the semantic types attached to nouns and verbs. The second is based on analyzing noun phrases to construct a document hierarchy or ontology.</Paragraph>
    <Paragraph position="1"> As noted above, each noun phrase and each verb is tagged with its semantic class, based on WordNet. A user can explore one or more documents in three stages. First, a semantic category is specified. Second, the user pushes a button to obtain all the instances in the documents in that category. The phraseology in the documents is examined so that similar words (e.g., plurals and singulars and/or synonyms) are grouped together and then presented in a drop-down box by frequency. Finally, the user can select any term set and obtain all the sentences in the documents containing any of the terms.</Paragraph>
    <Paragraph position="2"> KMS provides the capability for viewing a &quot;dynamic&quot; noun ontology of a document set. All noun phrases are analyzed into groups in a tree structure that portrays the ontology that is instantiated by these phrases. Noun phrases are reduced to their base forms (in cases of plurals) and grouped together first on the basis of their heads. Synonym sets are then generated and a further grouping is made. Algorithms from Navigli &amp; Velardi (2004) are being modified and implemented in KMS. The user can then select a node in the ontology hierarchy and create a summary based on sentences containing any of its terms or children.</Paragraph>
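    <Paragraph> The first grouping steps (base forms, heads, then synonyms) can be sketched as below. The sketch assumes the head of an English noun phrase is its final word and uses two toy lookup tables in place of the integrated dictionary and synonym sets; the table contents and function names are hypothetical, and the Navigli and Velardi algorithms are not represented.
<![CDATA[
```python
from collections import defaultdict

# Toy stand-ins for the integrated dictionary and synonym sets (assumptions).
PLURALS = {"parsers": "parser", "trees": "tree"}
SYNONYMS = {"programme": "program"}


def head_of(noun_phrase):
    """Assume the head is the final word, reduced to its base form."""
    last = noun_phrase.lower().split()[-1]
    base = PLURALS.get(last, last)
    return SYNONYMS.get(base, base)


def build_ontology(noun_phrases):
    """Group phrases under their head: {head: sorted list of phrases}."""
    groups = defaultdict(set)
    for phrase in noun_phrases:
        groups[head_of(phrase)].add(phrase.lower())
    return {head: sorted(members) for head, members in sorted(groups.items())}


phrases = ["parse trees", "the Proximity Parser", "bracketed parse trees",
           "top-down parsers", "a tree", "the programme", "a program"]
ontology = build_ontology(phrases)
for head, members in ontology.items():
    print(head, members)
```
]]>
    Each key of the resulting dictionary plays the role of a node in the hierarchy, and its members are the terms a user could select to retrieve or summarize the sentences that contain them.</Paragraph>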
  </Section>
</Paper>