File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/00/w00-0705_intro.xml

Size: 1,738 bytes

Last Modified: 2025-10-06 14:01:01

<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0705">
  <Title>Increasing our Ignorance of Language: Identifying Language Structure in an Unknown 'Signal'</Title>
  <Section position="2" start_page="0" end_page="25" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> A useful thought experiment is to imagine eavesdropping on a signal from outer space.</Paragraph>
    <Paragraph position="1"> How can you decide that it is a message between intelligent life forms? We need a 'language detector' or, to put it more accurately, something that separates language from non-language. But what is special about the language signal that separates it from non-language? Is it, indeed, separable? The goal is to separate language from non-language without dialogue, and to learn something about the structure of language in passing. The language may not be human (animals, aliens, computers...), the perceptual space may be unknown, and we cannot assume human language structure, yet we must begin somewhere. We need to approach the language signal from a naive viewpoint, in effect increasing our ignorance and assuming as little as possible.</Paragraph>
    <Paragraph position="2"> Given this standpoint, an informal description of 'language' might include that it: We assume that a language-like signal will be encoded symbolically, i.e. with some kind of character-stream. Our language-detection algorithm for symbolic input uses a number of statistical clues such as entropy, &quot;chunking&quot; to find character bit-length and boundaries, and matching against a Zipfian type-token distribution for &quot;letters&quot; and &quot;words&quot;.</Paragraph>
  </Section>
</Paper>
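
The introduction names entropy and a Zipfian type-token distribution among its statistical clues. As a rough illustration only, and not the paper's algorithm, the Python sketch below computes two such measurements on a symbol stream. The function names `shannon_entropy` and `zipf_slope` are hypothetical, and the sketch assumes the stream has already been segmented into symbols or tokens; the "chunking" step is not covered.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Zeroth-order Shannon entropy, in bits per symbol, of a sequence.

    Hypothetical helper: treats symbols as independent draws.
    """
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    Hypothetical helper: a slope near -1 is the classic Zipfian
    signature. Requires at least two distinct token types.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct token types")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var
```

For example, `shannon_entropy("abab")` gives 1.0 bit per symbol, since the two symbols are equally likely. Either measurement alone is weak evidence; a detector of the kind the text describes would presumably combine several such clues.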