<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0705">
  <Title>Increasing our Ignorance of Language: Identifying Language Structure in an Unknown 'Signal'</Title>
  <Section position="3" start_page="25" end_page="25" type="metho">
    <SectionTitle>
2 Identifying Structure and the
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="25" end_page="25" type="sub_section">
      <SectionTitle>
'Character Set'
</SectionTitle>
      <Paragraph position="0"> The initial task, given an incoming bit-stream, is to identify if a language-like structure exists and if detected what are the unique patterns/symbols, which constitute its 'character set'. A visualisation of the alternative possible byte-lengths is gleaned by plotting the entropy calculated for a range of possible byte-lengths (fig 1).</Paragraph>
      <Paragraph position="1"> In 'real' decoding of unknown scripts it is accepted that identifying the correct set of discrete symbols is no mean feat (Chadwick, 1967). To make life simple for ourselves we assume a digital signal with a fixed number of bits per character. Very different techniques are required to deal with audio or analogue equivalent waveforms (Elliott and Atwell, 2000; Elliott and Atwell, 1999). We have reason to believe that the following method can be modified to relax this constraint, but this needs to be tested further. The task then reduces to trying to identify the number of bits per character. Given the probability of a bit is Pi; the message entropy of a string of length N will be given by the first</Paragraph>
      <Paragraph position="3"> If the signal contains merely a set of random digits, the expected value of this function will rise monotonically as N increases. However, if the string contains a set of symbols of fixed length representing a character set used for communication, it is likely to show some decrease in entropy when analysed in blocks of this length, because the signal is 'less random' when thus blocked. Of course, we need to analyse blocks that begin and end at character boundaries. We simply carry out the measurements in sliding windows along the data. In figure 1, we see what happens when we applied this to samples of 8-bit ASCII text. We notice a clear drop, as predicted, for a bit length of 8. Modest progress though it may be, it is not unreasonable to assume that the first piece of evidence for the presence of language-like structure, would be the identification of a low-entropy, character set within the signal. null The next task, still below the stages normally tackled by NLL researchers, is to chunk the incoming character-stream into words. Looking at a range of (admittedly human language) text, if the text includes a space-like word-separator character, this will be the most frequent character. So, a plausible hypothesis would be that the most frequent character is a word-separator1; then plot type-token frequency distributions for words, and for word-lengths. If the distributions are Zipfian, and there are no significant 'outliers' (very large gaps between 'spaces' signifying very long words) then we have evidence corroborating our space hypothesis; this also corroborates our byte-length hypothesis, since the two are interdependent.</Paragraph>
  </Section>
  <Section position="4" start_page="25" end_page="26" type="metho">
    <SectionTitle>
3 Identifying 'Words'
</SectionTitle>
    <Paragraph position="0"> Again, work by crytopaleologists suggests that, once the character set has been found, the separation into word-like units, is not trivial and again we cheat, slightly: we assume that the language possesses something akin to a 'space' character. Taking our entropy measurement described above as a way of separating characters, we now try to identify which character represents 'space'. It is not unreasonable to believe that, in a word-based language, it is likely to be one of the most frequently used characters.</Paragraph>
    <Paragraph position="1"> Using a number of texts in a variety of languages, we first identified the top three most used characters. For each of these we hypothesised in turn that it represented 'space'.</Paragraph>
    <Paragraph position="2"> This then allowed us to segment the signal into words-like units ('words' for simplicity). We could then compute the frequency distribution of words as a function of word length, for each of the three candidate 'space' characters (fig 2).</Paragraph>
    <Paragraph position="3"> It can be seen that one 'separator' candidate (unsurprisingly, in fact, the most frequent character of all) results in a very varied distribution of word lengths. This is an interesting distribution, which, on the right hand side of the peak, approximately follows the well-known 'law' according to Zipf (1949), which predicts this behaviour on the grounds of minimum ef- null fort in a communication act. Conversely, results obtained similar to the 'flatter' distributions above, when using the most frequent character, is likely to indicate the absence of word separators in the signal.</Paragraph>
    <Paragraph position="4"> To ascertain whether the word-length frequency distribution holds for language in general, multiple samples from 20 different languages from Indo-European, Bantu, Semitic, Finno-Ugrian and Malayo-Polynesian groups were analysed (fig 3). Using statistical measures of significance, it was found that most groups fell well within 5- only two individual languages were near exceeding these limits - of the proposed Human language word-length profile (E1liott et al., 2000).</Paragraph>
    <Paragraph position="5"> Zipf's law is a strong indication of language-like behaviour. It can be used to segment the signal provided a 'space' character exists. However, we should not assume Zipf to be an infallible language detector. Natural phenomena such as molecular distribution in yeast DNA possess characteristics of power laws (Jenson, 1998). Nevertheless, it is worth noting, that such non-language possessors of power law characteristics generally display distribution ranges far greater than language with long repeats far from each other (Baldi and Brunak, 1998); characteristics detectable at this level or at least higher order entropic evaluation.</Paragraph>
  </Section>
  <Section position="5" start_page="26" end_page="26" type="metho">
    <SectionTitle>
4 Identifying 'Phrase-like' chunks
</SectionTitle>
    <Paragraph position="0"> Having detected a signal which satisfies criteria indicating language-like structures at a physical level (Elliott and Atwell, 2000; Elliott and Atwell, 1999), second stage analysis is required to begin the process of identifying internal grammatical components, which constitute the basic building blocks of the symbol system.</Paragraph>
    <Paragraph position="1"> With the use of embedded clauses and phrases, humans are able to represent an expression or description, however complex, as a single component of another description. This allows us to build up complex structures far beyond our otherwise restrictive cognitive capabilities (Minsky, 1984). Without committing ourselves to a formal phrase structure approach, (in the Chomskian sense) or even to a less formal 'chunking' of language (Sparkle Project, 2000), it is this universal hierarchical structure, evident in all human languages and believed necessary for any advanced communicator, that constitutes the next phase in our signal analysis (Elliott and Atwell, 2000). It is from these 'discovered' basic syntactic units that analysis of behavioural trends and inter-relationships amongst terminals and non-terminals alike can begin to unlock the encoded internal grammatical structure and indicate candidate parts of speech. To do this, we make use of a particular feature common to many known languages, the 'function' words, which occur in corpora with approximately the same statistics. These tend to act as boundaries to fairly self-contained semantic/syntactic 'chunks.' They can be identified in corpora by their usually high frequency of occurrence and cross-corpora invariance, as opposed to 'content' words which are usually less frequent and much more context dependent.</Paragraph>
    <Paragraph position="2"> Now suppose the function words arrived in a text independent of the other words, then they would have a Poisson distribution, with some long tails (distance between successive function words.) But this is NOT what happens. Instead, there is empirical evidence that function word separation is constrained to within short limits, with very few more than nine words apart (see fig 4). We conjecture that this is highly suggestive of chunking.</Paragraph>
  </Section>
  <Section position="6" start_page="26" end_page="28" type="metho">
    <SectionTitle>
5 Clustering into syntactico-semantic classes
</SectionTitle>
    <Paragraph position="0"> syntactico-semantic classes Unlike traditional natural language processing, a solution cannot be assisted using vast amounts of training data with well-documented 'legal' syntax and semantic interpretation or known statistical behaviour of speech categories. Therefore, at this stage we are endeavouring to extract the syntactic elements without a 'Rossetta' stone and by making as few assumptions as possible. Given this, a generic system is required to facilitate the analysis of behavioural trends amongst selected pairs of terminals and non-terminals alike, regardless of the target language.</Paragraph>
    <Paragraph position="1"> Therefore, an intermediate research goal is to apply Natural Language Learning techniques to the identification of &amp;quot;higher-level&amp;quot; lexical and grammatical patterns and structure in a linguistic signal. We have begun the development of tools to visualise the correlation profiles be- null tween pairs of words or parts of speech, as a precursor to deducing general principles for 'typing' and clustering into syntactico-semantic lexical classes. Linguists have long known that collocation and combinational patterns are characteristic features of natural languages, which set them apart (Sinclair, 1991). Speech and language technology researchers have used word-bigram and n-gram models in speech recognition, and variants of PoS-bigram models for Part-of-Speech tagging. In general, these models focus on immediate neighbouring words, but pairs of words may have bonds despite separation by intervening words; this is more relevant in semantic analysis, eg Wilson and Rayson (1993), Demetriou (1997). We sought to investigate possible bonding between type tokens (i.e., pairs of words or between parts of speech tags) at a range of separations, by mapping the correlation profile between a pair of words or tags. This can be computed for given word-pair type (wl,w2) by recording each word-pair token (wl,w2,d) in a corpus, where d is the distance or number of intervening words. The distribution of these word-pair tokens can be visualised by plotting d (distance between wl and w2) against frequency (how many (wl,w2,d) tokens found at this distance). Distance can be negative, meaning that w2 occurred be/ore wl and for any size window (i.e., 2 to n). In other words, we postulate that it might be possible to deduce part-of-speech membership and, indeed, identify a set of part-of-speech classes, using the joint probability of words themselves. But is this possible? One test would be to take an already tagged corpus and see if the parts-of-speech did indeed fall into separable clusters.</Paragraph>
    <Paragraph position="2"> Using a five thousand-word extract from the LOB corpus (Johansson et al., 1986) to test this tool, a number of parts-of-speech pairings were analysed for their cohesive profiles. The arbitrary figure of five thousand was chosen, as it both represents a sample large enough to reflect trends seen in samples much larger (without loosing any valuable data) and a sample size, which we see as at least plausible when analysing ancient or extra-terrestrial languages where data is at a premium.</Paragraph>
    <Paragraph position="3"> Figure 5 shows the results for the relationship between a pair of content and function words, so identified by looking at their cross-corpus statistics. It can be seen that the function word has a high probability of preceding the content word but has no instance of directly following it. At least metaphorically, the graph can be considered to show the 'binding force' between the two words varying with their separation. We are looking at how this metaphor might be used in order to describe language as a molecular structure, whose 'inter-molecular forces' can be related to part-of-speech interaction and the development of potential semantic categories for the unknown language.</Paragraph>
    <Paragraph position="4"> Examining language in such a manner also lends itself to summarising ('compressing') the behaviour to its more notable features when forming profiles. Figure 6 depicts a 3D representation of results obtained from profiling VBtags with six other major syntactic categories; figure 7 shows the main syntactic behavioural features found for the co-occurrence of some of the major syntactic classes ranging over the chosen window of ten words.</Paragraph>
    <Paragraph position="5"> Such a tool may also be useful in other areas, such a lexico-grammatical analysis or tagging of corpora. Data-oriented approaches to corpus annotation use statistical n-grams and/or constraint-based models; n-grams or constraints with wider windows can improve error-rates, by examining the topology of the annotationcombination space. Such information could be used to guide development of Constraint Grammars. The English Constraint Grammar described in (1995) includes constraint rules up to 4 words either side of the current word (see Table 16, p352); the peaks and troughs in the visualisation tool might be used to find candidate patterns for such long-distance constraints. Our research topic NLL4SETI (Natural Language Learning for the Search for Extra-Terrestrial Intelligence) is distinctive in that it is potentially a VERY useful application of unsupervised NLL; - it starts from more basic assumptions than most NLL research: we do not assume tokenisation into characters and words, and have no tagged/parsed training corpus; - it focuses on utilising statistical distributional universals of language which are computable and diagnostic; - this focus has led us to develop distributional visualisation tools to explore type/token combination distributions; the goal is NOT learning algorithms which anal- null yse/annotate human language in a way which human experts would approve of (eg phrasechunking corresponding to a human linguist's parsing of English text); but algorithms which recognise language-like structuring in a potentially much wider range of digital data sets.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML