<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-2035">
  <Title>Tagging Sentence Boundaries</Title>
  <Section position="3" start_page="0" end_page="265" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Sentence boundary disambiguation (SBD) is an important aspect in developing virtually any practical text processing application - syntactic parsing, Information Extraction, Machine Translation, Text Alignment, Document Summarization, etc. Segmenting text into sentences in most cases is a simple matter- a period, an exclamation mark or a question mark usually signal a sentence boundary.</Paragraph>
    <Paragraph position="1"> However, there are cases when a period denotes a decimal point or is a part of an abbreviation and thus it does not signal a sentence break. Furthermore, an abbreviation itself can be the last token in a sentence, in which case its period acts at the same time as part of this abbreviation and as the end-of-sentence indicator (fullstop).</Paragraph>
    <Paragraph position="2"> The first large class of sentence boundary disambiguators uses manually built rules which are usually encoded in terms of regular expression grammars supplemented with lists of abbreviations, common words, proper names, etc. For instance, the Alembic workbench (Aberdeen et al., 1995) contains a sentence splitting module which employs over 100 regular-expression rules written in Flex. To put together a few rules which do a job is fast and easy, but to develop a good rule-based system is quite a labour consuming enterprise. Another potential shortcoming is that such systems are usually closely tailored to a particular corpus and are not easily portable across domains.</Paragraph>
    <Paragraph position="3"> Automatically trainable software is generally seen as a way of producing systems quickly re-trainable for a new corpus, domain or even for another language. Thus, the second class of SBD systems employs machine learning techniques such as decision tree classifiers (Riley, 1989), maximum entropy modeling (MAXTERMINATOR) (Reynar and Ratnaparkhi, 1997), neural networks (SATZ) (Palmer and Hearst, 1997), etc.. Machine learning systems treat the SBD task as a classification problem, using features such as word spelling, capitalization, suffix, word class, etc., found in the local context of potentim sentence breaking punctuation. There is, however, one catch - all machine learning approaches to the SBD task known to us require labeled examples for training. This implies an investment in the annotation phase.</Paragraph>
    <Paragraph position="4"> There are two corpora normally used for evaluation and development in a number of text processing tasks and in the SBD task in particular: the Brown Corpus and the Wall Street Journal (WSJ) corpus - both part of the Penn Treebank (Marcus, Marcinkiewicz, and Santorini, 1993). Words in both these corpora are annotated with part-of-speech (POS) information and the text is split into documents, paragraphs and sentences. This gives all necessary information for the development of an SBD system and its evaluation. State-of-the-art machine-learning and rule-based SBD systems achieve the error rate of about 0.8-1.5% measured on the Brown Corpus and the WSJ. The best performance on the WSJ was achieved by a combination of the SATZ system with the Alembic system - 0.5% error rate. The best performance on the Brown Corpus, 0.2% error rate, was reported by (Riley, 1989), who trained a decision tree classifier on a 25 million word corpus.</Paragraph>
    <Section position="1" start_page="0" end_page="264" type="sub_section">
      <SectionTitle>
1.1 Word-based vs. Syntactic Methods
</SectionTitle>
      <Paragraph position="0"> The first source of ambiguity in end-of-sentence marking is introduced by abbreviations: if we know that the word which precedes a period is not an abbreviation, then almost certainly this period denotes a sentence break. However, if this word is an abbreviation, then it is not that easy to make a clear decision. The second major source of information  for approaching the SBD task comes from the word which follows the period or other sentence splitting punctuation. In general, when the following word is punctuation, number or a lowercased word - the abbreviation is not sentence terminal. When the following word is capitalized the situation is less clear. If this word is a capitalized common word - this signals start of another sentence, but if this word is a proper name and the previous word is an abbreviation, then the situation is truly ambiguous.</Paragraph>
      <Paragraph position="1"> Most of the existing SBD systems are word-based.</Paragraph>
      <Paragraph position="2"> They employ only lexical information (word capitalization, spelling, suffix, etc.) to predict whether a capitalized word-token which follows a period is a proper name or is a common word. Usually this is implemented by applying the lexical lookup method where a word is assigned its category according to which word-list it belongs to. This, however, is clearly an oversimplification. For instance, the word &amp;quot;Black&amp;quot; is a frequent surname and at the same time it is a frequent common word, thus the lexical information is not very reliable in this case. But by employing local context one can more robustly predict that in the context &amp;quot;Black described..&amp;quot; this word acts as a proper name and in the context &amp;quot;Black umbrella..&amp;quot; this word acts as a common word.</Paragraph>
      <Paragraph position="3"> It is almost impossible to robustly estimate contexts larger than single focal word using word-based methods - even bigrams of words are too sparse. For instance, there are more than 50,000 distinct words in the Brown Corpus, thus there are 250`0o0 potential word bigrams, but only a tiny fraction of them can be observed in the corpus. This is why words are often grouped into semantic classes. This, however, requires large manual effort, is not scalable and still covers only a fraction of the lexica. Syntactic context is much easier to estimate because the number of syntactic categories is much smaller than the number of distinct words.</Paragraph>
      <Paragraph position="4"> A standard way to identify syntactic categories for word-tokens is part-of-speech (POS) tagging. There syntactic categories are represented as POS tags e.g.</Paragraph>
      <Paragraph position="5"> NNS - plural noun, VBD - verb past form, J JR - comparative adjective, etc. There exist several tag-sets which are currently in use - some of them reflect only the major syntactic information such as partof-speech, number, tense, etc., whereas others reflect more refined information such as verb subcategorization, distinction between mass and plural nouns, etc. Depending on the level of detail one tag-set can incorporate a few dozen tags where another can incorporate a few hundred, but still such tags will be considerably less sparse than individual words. For instance, there are only about 40 POS tags in the Penn Treebank tag-set, therefore there are only 240 potential POS bigrams. Of course, not every word combination and POS tag combination is possible, but these numbers give a rough estimation of the magnitude of required data for observing necessary contexts for words and POS tags. This is why the &amp;quot;lexical lookup&amp;quot; method is the major source of information for word-based methods.</Paragraph>
      <Paragraph position="6"> The &amp;quot;lexical lookup&amp;quot; method for deciding whether a capitalized word in a position where capitalization is expected (e.g. after a fullstop) is a proper name or a common word gives about an 87o error rate on the Brown Corpus. We developed and trained a POS tagger which reduced this error more than by halfachieving just above a 3% error rate. On the WSJ corpus the POS tagging advantage was even greater: our tagger reduced the error rate from 1570 of the lexical lookup approach to 5%. This suggests that the error rate of a sentence splitter can be reduced proportionally by using the POS tagging methodology to predict whether a capitalized word after a period is a proper name or a common word.</Paragraph>
    </Section>
    <Section position="2" start_page="264" end_page="265" type="sub_section">
      <SectionTitle>
1.2 The SATZ System
</SectionTitle>
      <Paragraph position="0"> (Palmer and Hearst, 1997) described an approach which recognized the potential of the local syntactic context for the SBD problem. Their, system, SATZ, used POS information for words in the local context of potential sentence splitting punctuation. However, what is interesting is that they found difficulty in applying a standard POS tagging framework for determining POS information for the words: &amp;quot;However, requiring a single part-of-speech assignment for each word introduces a processing circularity: because most part-of-speech taggers require predetermined sentence boundaries, the boundary disambiguation must be done before tagging. But if the disambiguations done before tagging, no part-of-speech assignments are available for the boundary determination system&amp;quot;.</Paragraph>
      <Paragraph position="1"> Instead, they applied a simplified method. The SATZ system mapped Penn Treebank POS tags into a set of 18 generic POS categories such as noun, article, verb, proper noun, preposition, etc. Each word was replaced with a set of these generic categories that it can take on. Such sets of generic syntactic categories for three tokens before and three tokens after the period constituted a context which was then fed into two kinds of classifiers (decision trees and neural networks) to make the predictions.</Paragraph>
      <Paragraph position="2"> This system demonstrated reasonable accm'acy (1.0% error rate on the WSJ corpus) and also exhibited robustness and portability when applied to other domains and languages. However, the N-grams of syntactic category sets have two important disadvantages in comparison to the traditional POS tagging which is usually largely based (directly or indirectly) on the N-grams of POS tags. First, syntactic category sets are much sparser than syntactic categories (POS tags) and, thus, require more data for training. Second, in the N-grams-only method  . . . &lt;W . . . &lt;W . . . &lt;W C='RB' A='N'&gt;soon&lt;/W&gt;&lt;W C='.'&gt;.&lt;/W&gt; &lt;W A='Y' C='NNP'&gt;Mr&lt;/W&gt;&lt;W C='A'&gt;.&lt;/W&gt;... C='VBD'&gt;said&lt;/W&gt; &lt;W C='NNP' A='Y'&gt;Mr&lt;/W&gt;&lt;W C='A'&gt;.&lt;/W&gt; &lt;W C='NNP'&gt;Brown&lt;/W&gt;.,, C=','&gt;,&lt;/W&gt; &lt;W C='NNP' A='Y'&gt;Tex&lt;/W&gt;&lt;W C='*'&gt;.&lt;/W&gt; &lt;W C='DT'&gt;The&lt;/W&gt;...  with attributes: A='Y' - abbreviation, A=' N'- not abbreviation, C - part-of-speech tag attribute, C='. ' fullstop, C='A' - part of abbreviation, C='*' - a fullstop and part of abbreviation at the same time. no influence from the words outside the N-grams can be traced, thus, one has to adopt N-grams of sufficient length which in its turn leads either to sparse contexts or otherwise to sub-optimal discrimination.</Paragraph>
      <Paragraph position="3"> The SATZ system adopted N-grams of length six.</Paragraph>
      <Paragraph position="4"> In contrast to this, POS taggers can capture influence of the words beyond an immediate N-gram and, thus, usually operate with N-grams of length two (bigrams) or three (three-grams). Furthermore, in the POS tagging field there exist standard methods to cope with N-gram sparseness and unknown words.</Paragraph>
      <Paragraph position="5"> Also there have been developed methods for unsupervised training for some classes of POS taggers.</Paragraph>
    </Section>
    <Section position="3" start_page="265" end_page="265" type="sub_section">
      <SectionTitle>
1.3 This Paper
</SectionTitle>
      <Paragraph position="0"> In this paper we report on the integration of the sentence boundary disambiguation functionality into the POS tagging framework. We show that Sentence splitting can be handled during POS tagging and the above mentioned &amp;quot;circularity&amp;quot; can be tackled by using a non-traditional tokenization and markup conventions for the periods. We also investigate reducing the importance of pre-existing abbreviation lists and describe guessing strategies for unknown abbreviations. null</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>