<?xml version="1.0" standalone="yes"?>
<Paper uid="J97-2002">
  <Title>Adaptive Multilingual Sentence Boundary Disambiguation</Title>
  <Section position="9" start_page="262" end_page="264" type="concl">
    <SectionTitle>
7. Summary
</SectionTitle>
    <Paragraph position="0"> This paper has presented Satz, a sentence boundary disambiguation system with several desirable characteristics, including the ability to flag punctuation marks it cannot confidently disambiguate rather than making under-informed guesses. The Satz system offers a robust, rapidly trainable alternative to existing systems, which usually require extensive manual effort and are tailored to a specific text genre or a particular language. By using part-of-speech frequency data to represent the context in which the punctuation mark appears, the system offers significant savings in parameter estimation and training time over word-based methods, while at the same time producing a very low error rate (see Table 10 for a summary of the best results for each language).</Paragraph>
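The core representation described above can be sketched as follows. This is a minimal illustration of the general idea, not the authors' implementation: each token near a punctuation mark is mapped to a part-of-speech frequency distribution drawn from a lexicon, and the distributions of the surrounding tokens are concatenated into a context vector for classification. The toy lexicon, tag set, and function names here are hypothetical.

```python
# Toy lexicon (hypothetical): word -> {POS tag: relative frequency}.
LEXICON = {
    "Mr.": {"abbrev": 1.0},
    "Smith": {"proper_noun": 1.0},
    "arrived": {"verb": 1.0},
    "The": {"determiner": 1.0},
}

POS_TAGS = ["abbrev", "proper_noun", "verb", "determiner", "unknown"]

def pos_vector(word):
    """POS frequency estimates for one token; unseen words map to 'unknown'."""
    dist = LEXICON.get(word, {"unknown": 1.0})
    return [dist.get(tag, 0.0) for tag in POS_TAGS]

def context_vector(tokens, i, window=1):
    """Concatenate POS vectors for `window` tokens on each side of position i.

    Positions outside the token sequence contribute zero vectors, so the
    vector length is fixed and can feed a trainable classifier.
    """
    vec = []
    for j in range(i - window, i + window + 1):
        if j == i:
            continue  # skip the punctuation mark itself
        if 0 <= j < len(tokens):
            vec.extend(pos_vector(tokens[j]))
        else:
            vec.extend([0.0] * len(POS_TAGS))
    return vec
```

Because the vector encodes only rough POS frequency estimates rather than word identities, the number of parameters to estimate stays small, which is what makes the small training sets and fast training times described above possible.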
    <Paragraph position="1"> The systems of Riley (1989) and Wasson report what seem to be slightly better error rates, but these results are not directly comparable since they were evaluated on other collections. Furthermore, the Wasson system required nine staff months of development, and the Riley system required 25 million word tokens for training and storage of probabilities for all corresponding word types. By comparison, the Satz approach has the advantages of flexibility for application to new text genres, small training sets (and thereby fast training times), relatively small storage requirements, and little manual effort. The training time on a workstation (in our case a DEC Alpha 3000) is less than one minute, and the system can perform the sentence boundary disambiguation at a rate exceeding 10,000 sentences/minute. Because the system is lightweight, it can be incorporated into the tokenization stage of many natural language processing systems without substantial penalty. For example, combining our system with a fast sentence alignment program such as that of Gale and Church (1993), which performs alignment at a rate of up to 1,000 sentences/minute, would make it possible to rapidly and accurately create a bilingual aligned corpus from raw parallel texts. Because the system is adaptive, it can be focused on especially difficult cases and combined with existing systems to achieve still better error rates, as shown in Section 6.</Paragraph>
    <Paragraph position="2"> The system was designed to be easily portable to new natural languages, assuming the accessibility of lexical part-of-speech information. The lexicon itself need not be exhaustive, as shown by the success of adapting Satz to German and French with limited lexica, and by the experiments in English lexicon size described in Section 4.4. The heuristics used within the system to classify unknown words can compensate for inadequacies in the lexicon, and these heuristics can be easily adjusted.</Paragraph>
    <Paragraph position="3"> It is interesting that the system performs so well using only estimates of the parts of speech of the tokens surrounding the punctuation mark and using very rough estimates at that. In the future it may be fruitful to apply a technique that uses such simple information to more complex problems.</Paragraph>
  </Section>
</Paper>