<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0705">
<Title>Increasing our Ignorance of Language: Identifying Language Structure in an Unknown 'Signal'</Title>
<Section position="7" start_page="28" end_page="28" type="concl">
<SectionTitle> 6 Summary and future developments </SectionTitle>
<Paragraph position="0"> To summarise, our achievements to date include:
- a method for splitting a binary digit-stream into characters, by using entropy to diagnose byte-length;
- a method for tokenising unknown character-streams into words of language;
- an approach to chunking words into phrase-like sub-sequences, by assuming high-frequency function words act as phrase-delimiters;
- a visualisation tool for exploring word-combination patterns, where word-pairs need not be immediate neighbours but characteristically combine despite several intervening words.</Paragraph>
<Paragraph position="1"> So far, our approaches have involved working with languages with which we are most familiar and, to a certain extent, making use of linguistic 'knowns' such as pre-tagged corpora. It is early days yet and we make no apology for this initial approach. However, we feel that by deliberately reducing our dependence on prior knowledge ('increasing our ignorance of language') and by treating language as a 'signal', we might be contributing a novel approach to natural language processing which might ultimately lead to a better, more fundamental understanding of what distinguishes language from the rest of the signal universe.</Paragraph>
</Section>
</Paper>