<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3210"> <Title>Automatic Paragraph Identification: A Study across Languages and Domains</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Written texts are usually broken up into sentences and paragraphs. Sentence splitting is a necessary pre-processing step for a number of Natural Language Processing (NLP) tasks including part-of-speech tagging and parsing. Since sentence-final punctuation can be ambiguous (e.g., a period can also be used in an abbreviation as well as to mark the end of a sentence), the task is not trivial and has consequently attracted a lot of attention (e.g., Reynar and Ratnaparkhi (1997)). In contrast, there has been virtually no previous research on inferring paragraph boundaries automatically. One reason for this is that paragraph boundaries are usually marked unambiguously by a new line and extra white space.</Paragraph> <Paragraph position="1"> However, a number of applications could benefit from a paragraph detection mechanism. Text-to-text generation applications such as single- and multidocument summarisation as well as text simplification usually take naturally occurring texts as input and transform them into new texts satisfying specific constraints (e.g., length, style, language).</Paragraph> <Paragraph position="2"> The output texts do not always preserve the structure and editing conventions of the original text.</Paragraph> <Paragraph position="3"> In summarisation, for example, sentences are typically extracted verbatim and concatenated to form a summary. 
Insertion of paragraph breaks could improve the readability of the summaries by indicating topic shifts and providing visual targets to the reader (Stark, 1988).</Paragraph> <Paragraph position="4"> Machine translation is another application for which automatic paragraph detection is relevant.</Paragraph> <Paragraph position="5"> Current systems deal with paragraph boundary insertion in the target language simply by preserving the boundaries from the source language. However, there is evidence for cross-linguistic variation in paragraph formation and placement, particularly for languages that are not closely related, such as English and Chinese (Zhu, 1999). A paragraph insertion mechanism that is specific to the target language, instead of one that relies solely on the source language, may therefore yield more readable texts.</Paragraph> <Paragraph position="6"> Paragraph boundary detection is also relevant for speech-to-text applications. The output of automatic speech recognition systems is usually raw text without any punctuation or paragraph breaks.</Paragraph> <Paragraph position="7"> This naturally makes the text very hard to read, which can cause processing difficulties, especially if speech recognition is used to provide deaf students with real-time transcripts of lectures. Furthermore, sometimes the output of a speech recogniser needs to be processed automatically by applications such as information extraction or summarisation.</Paragraph> <Paragraph position="8"> Most of these applications (e.g., Christensen et al. (2004)) port techniques developed for written texts to spoken texts and therefore require input that is punctuated and broken into paragraphs. 
While there has been some research on finding sentence boundaries in spoken text (Stevenson and Gaizauskas, 2000), there has been little research on determining paragraph boundaries. [Footnote 1: There has been research on using phonetic cues to segment speech into &quot;acoustic paragraphs&quot; (Hauptmann and Smith, 1995). However, these do not necessarily correspond to written paragraphs. Even if they did, textual cues could complement phonetic information to identify paragraphs.] If paragraph boundaries were mainly an aesthetic device for visually breaking up long texts into smaller chunks, as has previously been suggested (see Longacre (1979)), paragraph boundaries could be easily inserted by splitting a text into several equal-size segments. Psycho-linguistic research, however, indicates that paragraph boundaries are not purely aesthetic.</Paragraph> <Paragraph position="9"> For example, Stark (1988) asked her subjects to reinstate paragraph boundaries into fiction texts from which all boundaries had been removed and found that humans are able to do so with an accuracy that is higher than would be expected by chance. 
Crucially, she also found that (a) individual subjects did not make all their paragraphs the same length and (b) paragraphs in the original text whose length deviated significantly from the average paragraph length were still identified correctly by a large proportion of subjects.</Paragraph> <Paragraph position="10"> These results show that people are often able to identify paragraphs correctly even if they are exceptionally short or long without defaulting to a simple template of average paragraph length.</Paragraph> <Paragraph position="11"> Human agreement on the task suggests that the text itself provides cues for paragraph insertion, even though there is some disagreement over which specific cues are used by humans (see Stark (1988)).</Paragraph> <Paragraph position="12"> Possible cues include repeated content words, pronoun coreference, paragraph length, and local semantic connectedness.</Paragraph> <Paragraph position="13"> In this paper, we investigate whether it is possible to exploit some of these textual cues together with syntactic and discourse-related information to determine paragraph boundaries automatically. We treat paragraph boundary identification as a classification task and examine whether the difficulty of the task and the utility of individual textual cues vary across languages and across domains. We also assess human performance on the same task and whether it differs across domains.</Paragraph> </Section> </Paper>