<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0305">
  <Title>Detecting Subject Boundaries Within Text: A Language Independent Statistical Approach</Title>
  <Section position="5" start_page="49" end_page="53" type="metho">
    <SectionTitle>
3 Results
</SectionTitle>
    <Paragraph position="0"> Figure 6 shows the result of processing the first 800 sentences from an edition of The Times newspaper. The sentence number (x-axis) is plotted against the correspondence (y-axis) between the two windows of text on either side of that sentence.</Paragraph>
    <Paragraph position="1">  A large negative value indicates a low degree of correspondence and a small negative value or a positive value indicates a high degree of correspondence. The vertical lines mark actual article boundaries. The advantage of using a text such as this is that there can be no doubt from any human judge as to where the boundaries occur, i.e. between articles.</Paragraph>
    <Paragraph position="2"> The local minima on the graph signify the boundaries as determined by the algorithm. The vertical bars signify the actual article boundaries. The results of the first 400 sentences are summarised in table 1.</Paragraph>
    <Paragraph position="3"> The algorithm located 53% of the article boundaries precisely and 95% of the boundaries to within an accuracy of a single sentence. Every article boundary was identified to within an accuracy of two sentences. The algorithm made no use of endof-paragraph markers. It also found some additional subject boundaries in the middle of articles. These are denoted by a '+' in the error column. Many extra subject boundaries were found in the long article (starting at sentence 430). It is worth noting that the minima occurring within this article are not as pronounced as the actual article boundaries themselves. This section of the graph reflects a long article which contains a number of different subtopics. A newspaper is an easy test for such an algorithm though. Figure 7 shows a graph for an expository text - a 200 sentence psychology paper written by a fellow student. Again the local minima indicate where the algorithm considers a subject boundary to occur and the vertical lines are the obvious breaks in the text (mainly before new headings) as judged by the author. The results are summarised in table 2.</Paragraph>
    <Paragraph position="4"> This time the algorithm precisely located 50% of the boundaries. It found 63% of the boundaries to within an accuracy of a single sentence and 88% to  within an accuracy of two sentences. This level of accuracy was obtained consistently for a variety of different texts. Again, it should be mentioned that the algorithm found more breaks than were immediately obvious to a human judge. However, it should be noted that these extra breaks were usually denoted by smaller minima, and on inspection the vast majority of them were in sensible places.</Paragraph>
    <Paragraph position="5"> The algorithm has a certain resolving power. As the subject matter becomes more and more homogeneous, the number of subject breaks the algorithm finds decreases. For some texts, this results in very few divisions being made. By taking a smaller window size (the number of sentences to look at either side of each possible sentence break), the resolving power 'of the algorithm can be increased making it more sensitive to changes in the vocabulary. However, the reliability of the algorithm decreases with the increased resolving power. The default window size is fifteen sentences and this works well for all but the most homogeneous of texts. In this case a window size of around six is more effective. A lower window size increases the resolving power, but decreases the accuracy of the algorithm. The window size was a parameter of our implementation.</Paragraph>
  </Section>
  <Section position="6" start_page="53" end_page="53" type="metho">
    <SectionTitle>
4 Summary
</SectionTitle>
    <Paragraph position="0"> Based on our investigation, we believe that Hearst's original intuition that lexical correspondences can be exploited to identify subject boundaries is a sound one. The addition of the significance measure represents an improvement on Hearst's algorithm implemented by the Berkeley Digital Library Project.</Paragraph>
    <Paragraph position="1"> Furthermore, this algorithm is language independent except for the preprocessing stage (which can be omitted with only a modest degradation in performance). In order to improve accuracy, language dependent methods could be considered. Such methods might include the insertion of conventional discourse markers in order to detect preferred breaking points (e.g. repetition of the same syntactic structure, and conventional paragraph openings such as: &amp;quot;On the other hand...&amp;quot;, &amp;quot;The above...&amp;quot;, etc.). Another method would be to make use of a thesaurus, since we have found that human judgement is often based on synonymous information such as real synonyms or anaphora. The above issues are discussed in various articles (Morris and Hirst, 1991); (Morris, 1988) and (Givon, 1983) which study discourse markers and synonymous information.</Paragraph>
    <Paragraph position="2"> Another interesting line of research would be to use the information from stage two of the algorithm to discover the significant words of a section, and thereby attach a label to it. This would be particularly useful for information retrieval applications.</Paragraph>
  </Section>
class="xml-element"></Paper>