File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/99/w99-0702_evalu.xml
Size: 1,543 bytes
Last Modified: 2025-10-06 14:00:43
<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0702"> <Title>Experiments in Unsupervised Entropy-Based Corpus Segmentation</Title> <Section position="5" start_page="11" end_page="11" type="evalu"> <SectionTitle> 4 Conclusion and Future Investigations </SectionTitle> <Paragraph position="0"> The paper attempted to show that entropy and information can be used to segment a corpus into words when neither additional knowledge about the corpus or the language nor other resources, such as a lexicon or a grammar, are available.</Paragraph> <Paragraph position="1"> To segment the corpus, the algorithm searches for separators without knowing a priori which symbols or sequences of symbols constitute them.</Paragraph> <Paragraph position="2"> Good results were obtained with a German and an English corpus with &quot;clearly perceptible&quot; separators (blank and new-line). Precision and recall decrease if the original separators of these corpora are removed or changed into a set of different co-occurring separators.</Paragraph> <Paragraph position="3"> So far, only separators and their frequencies have been taken into account. Future investigations may include: * the use of frequencies of tokens and their different alternative contexts, to validate these tokens and the adjacent separators, and * a search for criteria (based on the corpus itself and on the obtained result) to evaluate the &quot;quality&quot; of segmentation, thus enabling a self-optimizing approach.</Paragraph> </Section> </Paper>
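<!-- Editorial sketch, not part of the paper. The conclusion reports precision and recall for the segmentation but does not say in this section how they were counted. The Python sketch below shows one common convention, scoring predicted word boundaries against reference boundaries by character offset. The function names boundary_positions and precision_recall are hypothetical, and the boundary-based counting is an assumption rather than the authors' stated procedure.

```python
def boundary_positions(tokens):
    """Return the character offsets at which one token ends and the next begins.

    The end of the last token is excluded, because every segmentation of the
    same text trivially shares it.
    """
    positions = set()
    offset = 0
    for token in tokens[:-1]:
        offset += len(token)
        positions.add(offset)
    return positions


def precision_recall(predicted_tokens, reference_tokens):
    """Score predicted boundaries against reference boundaries.

    Assumes both token lists concatenate to the same separator-free text.
    Returns (precision, recall); a side with no boundaries scores 0.0.
    """
    predicted = boundary_positions(predicted_tokens)
    reference = boundary_positions(reference_tokens)
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


if __name__ == "__main__":
    reference = ["the", "cat", "sat"]
    predicted = ["the", "ca", "tsat"]
    print(precision_recall(predicted, reference))
```

Under this convention, a segmenter that proposes too few boundaries tends to keep precision high while losing recall, and one that proposes too many shows the opposite pattern. -->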