<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0702">
  <Title>Experiments in Unsupervised Entropy-Based Corpus Segmentation</Title>
  <Section position="5" start_page="11" end_page="11" type="evalu">
    <SectionTitle>
4 Conclusion and Future Investigations
</SectionTitle>
    <Paragraph position="0"> The paper attempted to show that entropy and information can be used to segment a corpus into words, when no additional knowledge about the corpus or the language, and no other resources such as a lexicon or grammar are available.</Paragraph>
    <Paragraph position="1"> To segment the corpus, the algorithm searches for separators without knowing a priori which symbols or sequences of symbols constitute them.</Paragraph>
    <Paragraph position="2"> Good results were obtained with a German and an English corpus with &quot;clearly perceptible&quot; separators (blank and new-line). Precision and recall decrease if the original separators of these corpora are removed or changed into a set of different co-occurring separators.</Paragraph>
    <Paragraph position="3"> So far, only separators and their frequencies have been taken into account. Future investigations may include: (1) the use of frequencies of tokens and their different alternative contexts, to validate these tokens and the adjacent separators; and (2) a search for criteria (based on the corpus itself and on the obtained result) to evaluate the &quot;quality&quot; of segmentation, thus enabling a self-optimizing approach.</Paragraph>
  </Section>
</Paper>