<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-1044">
  <Title>Named Entity Extraction from Noisy Input: Speech and OCR</Title>
  <Section position="3" start_page="316" end_page="316" type="intro">
    <SectionTitle>2 Algorithms and Data</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="316" end_page="316" type="sub_section">
      <SectionTitle>2.1 Task Definition and Data</SectionTitle>
      <Paragraph position="0">The named entity (NE) task used for this evaluation requires the system to identify all named locations, named persons, named organizations, dates, times, monetary amounts, and percentages. The task definition is given in Chinchor et al. (1998).</Paragraph>
      <Paragraph position="1">For speech recognition, roughly 175 hours of news broadcasts (about 1.2M words of audio) were available from the National Institute of Standards and Technology (NIST) for training.</Paragraph>
      <Paragraph position="2">All of the training data includes both the audio and a manual transcription. The test set consisted of 3 hours of news (roughly 25K words).</Paragraph>
      <Paragraph position="3">For the combined OCR/NE system, the OCR component was trained on the University of Washington English Image Database, which consists primarily of technical journal articles. The NE system was trained separately on 690K words of 1993 Wall Street Journal (WSJ) data (roughly 1,250 articles), including development data from the Sixth Message Understanding Conference (MUC-6) Named Entity evaluation. The test set was approximately 20K words of separate WSJ data (roughly 45 articles), also taken from the MUC-6 data set. Both test and training texts were original text (no OCR errors) in mixed case with normal punctuation. The images for OCR were produced by printing the on-line text, rather than using the original newsprint, and were all scanned at 600 DPI.</Paragraph>
    </Section>
    <Section position="2" start_page="316" end_page="316" type="sub_section">
      <SectionTitle>2.2 Algorithms</SectionTitle>
      <Paragraph position="0">The information extraction system tested is IdentiFinder(TM), which has previously been described in detail in Bikel et al. (1997, 1999). In that system, an HMM labels each word either with one of the desired classes (e.g., person, organization) or with the label NOT-A-NAME (representing "none of the desired classes"). The states of the HMM fall into regions, one region for each desired class plus one for NOT-A-NAME. (See Figure 2-1.) The HMM thus has a model of each desired class and of the other text. Note that the implementation is not confined to the seven name classes used in the NE task; the particular classes to be recognized can easily be changed via a parameter.</Paragraph>
      <Paragraph position="1">Within each of the regions, we use a statistical bigram language model and emit exactly one word upon entering each state. Therefore, the number of states in each of the name-class regions is equal to the vocabulary size.</Paragraph>
      <Paragraph position="2">Additionally, there are two special states, the</Paragraph>
    </Section>
  </Section>
</Paper>
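The sketch below is a minimal illustration (not the authors' implementation) of the name-class HMM structure described in Section 2.2: each state corresponds to a (name class, word) pair, so a state emits exactly one word and each class region contains one state per vocabulary item; transitions inside a region follow a class-conditional bigram model, while crossing into a new region incurs a class-transition score plus a region-initial word score. The function and table names (viterbi_decode, log_bigram, log_class_change), the START pseudo-class, the <s> boundary token, and the flat floor score for unseen events are illustrative assumptions; the actual system of Bikel et al. (1997, 1999) estimates its probabilities from annotated training data with back-off smoothing and unknown-word modeling.

```python
# Illustrative sketch of an IdentiFinder-style name-class HMM decoder.
# Parameter tables are toy placeholders, not the paper's trained model.
import math

NAME_CLASSES = ["PERSON", "ORGANIZATION", "LOCATION", "DATE", "TIME",
                "MONEY", "PERCENT", "NOT-A-NAME"]


def viterbi_decode(words, log_bigram, log_class_change):
    """Assign one name class per word with a Viterbi search.

    log_bigram[(cls, prev_word, word)] : log P(word | prev_word, cls),
                                         the within-region bigram model
    log_class_change[(prev_cls, cls)]  : log P(cls | prev_cls), charged
                                         only when the label changes region
    Unseen events fall back to a small floor score (a crude stand-in for
    the back-off smoothing used in the real system).
    """
    FLOOR = math.log(1e-8)
    # best[i][cls] = (score, previous_class) for the best path that ends
    # at word i carrying label cls.
    best = [dict() for _ in words]
    for cls in NAME_CLASSES:
        score = (log_class_change.get(("START", cls), FLOOR)
                 + log_bigram.get((cls, "<s>", words[0]), FLOOR))
        best[0][cls] = (score, None)
    for i in range(1, len(words)):
        for cls in NAME_CLASSES:
            candidates = []
            for prev_cls, (prev_score, _) in best[i - 1].items():
                if prev_cls == cls:
                    # Stay inside the same class region: ordinary bigram step.
                    step = log_bigram.get((cls, words[i - 1], words[i]), FLOOR)
                else:
                    # Enter a new region: pay the class-transition cost and
                    # the region-initial word probability.
                    step = (log_class_change.get((prev_cls, cls), FLOOR)
                            + log_bigram.get((cls, "<s>", words[i]), FLOOR))
                candidates.append((prev_score + step, prev_cls))
            best[i][cls] = max(candidates)
    # Trace back the highest-scoring label sequence.
    cls = max(best[-1], key=lambda c: best[-1][c][0])
    labels = [cls]
    for i in range(len(words) - 1, 0, -1):
        cls = best[i][cls][1]
        labels.append(cls)
    return list(reversed(labels))


if __name__ == "__main__":
    # Toy parameters chosen so that "Smith" comes out labelled PERSON.
    bigrams = {
        ("NOT-A-NAME", "<s>", "Mr."): math.log(0.9),
        ("NOT-A-NAME", "<s>", "resigned"): math.log(0.9),
        ("NOT-A-NAME", "Smith", "resigned"): math.log(0.9),
        ("PERSON", "<s>", "Smith"): math.log(0.9),
    }
    class_changes = {
        ("START", "NOT-A-NAME"): math.log(0.9),
        ("NOT-A-NAME", "PERSON"): math.log(0.3),
        ("PERSON", "NOT-A-NAME"): math.log(0.3),
    }
    # Prints ['NOT-A-NAME', 'PERSON', 'NOT-A-NAME'].
    print(viterbi_decode(["Mr.", "Smith", "resigned"], bigrams, class_changes))
```

Because every state emits exactly one word, changing the set of recognized classes only changes the set of regions (as the paper notes, the classes are a parameter of the model), and the per-region state count tracks the vocabulary size.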