<?xml version="1.0" standalone="yes"?>
<Paper uid="M93-1009">
  <Title>The Generic Information Extraction System</Title>
  <Section position="3" start_page="87" end_page="88" type="intro">
    <SectionTitle>
PREPROCESSOR
</SectionTitle>
    <Paragraph position="0"> This module takes the text as a character sequence, locates the sentence boundaries, and produces for each sentence a sequence of lexical items. The lexical items are generally the words together with the lexical attributes for them that are contained in the lexicon. This module minimally determines the possible parts of speech for each word, and may choose a single part of speech. It makes the lexical attributes in the lexicon available to subsequent processing. It recognizes multiwords. It recognizes and normalizes certain basic types that occur in the genre, such as dates, times, personal and company names, locations, currency amounts, and so on. It handles unknown words, minimally by ignoring them, or more generally by trying to guess from their morphology or their immediate context as much information about them as possible.</Paragraph>
    <Paragraph position="1"> Spelling correction is done in this module as well.</Paragraph>
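    As a rough illustration of the first steps described above (locating sentence boundaries, then producing lexical items with their possible parts of speech), here is a minimal Python sketch. Everything in it is an assumption made for illustration and not part of the original system: the toy LEXICON, the tag names, the naive punctuation-based boundary rule, and the function names split_sentences, lexical_items and preprocess. Unknown words are simply marked rather than guessed at.

```python
import re

# Hypothetical toy lexicon: word -> possible parts of speech (illustrative only).
LEXICON = {
    "the": {"DET"},
    "company": {"NOUN"},
    "reports": {"NOUN", "VERB"},
    "profits": {"NOUN", "VERB"},
    "rose": {"NOUN", "VERB"},
    "sharply": {"ADV"},
}


def split_sentences(text):
    """Naive boundary rule: a sentence runs up to the next '.', '!' or '?'.

    This is an assumption for the sketch; it will mis-split abbreviations.
    """
    pieces = re.findall(r"[^.!?]+[.!?]*", text)
    return [piece.strip() for piece in pieces if piece.strip()]


def lexical_items(sentence):
    """Return (token, sorted list of possible tags) pairs for one sentence."""
    items = []
    for token in re.findall(r"\w+|[^\w\s]", sentence):
        if not token[0].isalnum():
            tags = {"PUNCT"}
        else:
            # Minimal unknown-word handling: mark the word, do not guess.
            tags = LEXICON.get(token.lower(), {"UNKNOWN"})
        items.append((token, sorted(tags)))
    return items


def preprocess(text):
    """Character sequence in, one list of lexical items per sentence out."""
    return [lexical_items(sentence) for sentence in split_sentences(text)]


if __name__ == "__main__":
    for sentence in preprocess("The company reports profits. Profits rose quickly!"):
        print(sentence)
```

    The sketch keeps every possible part of speech for an ambiguous word such as "reports", which corresponds to the minimal behavior described above; choosing a single part of speech would require a further disambiguation step that is not shown.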
    <Paragraph position="2"> The methods used here are lexical lookup, perhaps in conjunction with morphological analysis; perhaps statistical part-of-speech tagging; finite-state pattern-matching for recognizing and normalizing basic entities; standard spelling correction techniques; and a variety of heuristics for handling unknown words.</Paragraph>
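    To make the idea of finite-state pattern-matching for basic entities more concrete, the following Python sketch uses regular expressions as a stand-in for finite-state patterns. It recognizes two of the basic types mentioned earlier, dates and currency amounts, and normalizes them to an ISO-style date string and a plain number. The patterns, the entity labels, the SCALES table and the function name normalize_entities are all hypothetical; a real system would cover many more surface forms.

```python
import re

MONTH_NAMES = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}

# Assumed surface form for dates: "March 5, 1993".
DATE = re.compile(
    r"\b(" + "|".join(MONTH_NAMES) + r")\s+(\d{1,2}),\s*(\d{4})\b",
    re.IGNORECASE,
)

# Assumed surface form for money: "$2.5 million", "$40".
MONEY = re.compile(r"\$(\d+(?:\.\d+)?)\s*(million|billion)?", re.IGNORECASE)
SCALES = {"million": 1_000_000, "billion": 1_000_000_000}


def normalize_entities(text):
    """Return (label, matched text, normalized value) triples found in text."""
    entities = []
    for match in DATE.finditer(text):
        month = MONTHS[match.group(1).lower()]
        iso = "%04d-%02d-%02d" % (int(match.group(3)), month, int(match.group(2)))
        entities.append(("DATE", match.group(0), iso))
    for match in MONEY.finditer(text):
        scale = SCALES.get((match.group(2) or "").lower(), 1)
        amount = float(match.group(1)) * scale
        entities.append(("MONEY", match.group(0).strip(), amount))
    return entities


if __name__ == "__main__":
    print(normalize_entities("On March 5, 1993, the firm paid $2.5 million."))
```

    Normalizing at this stage means that later modules can compare dates and amounts directly, without needing to know how they were written in the text.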
    <Paragraph position="3"> The lexicon might have been developed manually or borrowed from another site, but more and more lexicons are adapted from already existing machine-readable dictionaries and augmented automatically by statistical techniques operating on the key templates and/or the corpus.</Paragraph>
  </Section>
</Paper>