<?xml version="1.0" standalone="yes"?>
<Paper uid="C86-1098">
  <Title>STORING TEXT USING INTEGER CODES</Title>
  <Section position="3" start_page="0" end_page="419" type="intro">
    <SectionTitle>
1.0 Introduction
</SectionTitle>
    <Paragraph position="0"> This research aims at storing text in a form that facilitates word manipulation whilst saving storage space. Although there are many text compression algorithnts currently in use, the word manipulation capability has yet to be incorporated. Harris \[2\], in his research, compiled a 40,000 word list with syntactic linear ordering and suggested that words in a text be given two-byte integer codes that point to their respective positions in this list. In this way the coded text has inherent syntactic information, thereby making it useful for many applications including statistical linguistics and question-answering systems. In this paper, we show how such a scheme can achieve optimal compression results and present its efficient implementation.</Paragraph>
    <Paragraph position="1"> Three text compression techniques that have been tested on English texts include (i) the Huffman variable-length encoding schemes \[3\] which achieve tight packing of data by giving variable-length bit code for each character, with more frequently-occurring words having shorter codes, (2) the Adaptive pattern substitution algorithm, also known as the Lempel-Ziv or LZW algorithm (see \[7\], \[8\], \[9\]) which converts variable-length strings of input symbols into fixed length codes by first looking for con~non patterns of two or more bytes occurring frequently, and then substituting an unused byte for the common long one, and (3) another technique, due to Hahn \[i\] encodes non-blank characters in groups of a fixed size as unique fixed point ntunbers.</Paragraph>
    <Paragraph position="2"> For measuring text compression two definitions are used in this paper. One is the compression ratio defined as the ratio of size of the original text to that of the coded text. ~he other is the compression percentage, defined as (Size of original text - Size of coded text)% (Size of original text) of its size, is treated a~ a unit and is represented in computers in the form of a computer word (usually 4 bytes) or part of a word. It is addressable, hence it does not require a delii~ter like a space to distinguish it from a neighbouring number.</Paragraph>
    <Paragraph position="3"> For this encoding scheme, we endeavour to store text as a stream of fixed length computer words which is distinguishable by the computer. This can be achieved by keeping an external list of all words in the dictionary includ\]mg derived ones, and assigning a unique integer code for each entry.</Paragraph>
    <Paragraph position="4"> Instead of words separated by delimiters the coded text represents words as numbers, thereby dispensing the need for representing delimiters.</Paragraph>
    <Paragraph position="5"> In using two bytes to represent an integer it is possible to have 216 - 1 = 65536 distinct codes. However, since it is impossible to have codes for all the words in the English Lan~lage, it is necessary to include a mechanism that allows for the representation of words without codes by their individual characters. Keeping one bit for that purpose \].eaves 215 - i = 32767 possible number of combinations. Two adjoining character codes (ASCII or EBDIC) always have zero as the first bit and is therefore read as a positive integer. The first bit, being a sicrn bit, can be used to indicate whether the two bytes represent a code (negative integer) or two characters, as follows: IXXXXXXX XXXXXXXX a code OXXXXXXX XXXXXXXX two characters It is also necessary to show that compression can indeed be achieved for this encoding scheme. In several studies it has been shown (see Kucera \[4\], for example)that the word frequency distribution in natural language analysis is highly-skewed.</Paragraph>
    <Paragraph position="6"> Assu~ing a skewed distribution it may be seen that the 32000 most-frequently occurring words from the Cobuild* corpus constituting 60% of the corpus, account for 99% of the total word usage. Including these 32000 words in the list will imply that words without codes (not included in the list) makes up 1% of the text. Assuming that the average size of an English word \[4\] consists of 4.7 characters we would expect thst an average occurring word would occupy, taking one more byte for a trailing space, 5.7 bytes. Thus the compression ratio is</Paragraph>
    <Paragraph position="8"> coded text is 35.7% of the original text.</Paragraph>
    <Section position="1" start_page="0" end_page="418" type="sub_section">
      <SectionTitle>
2.0 The Two-Byte-Word Encoding Scheme
</SectionTitle>
      <Paragraph position="0"> In most computers, text is stored as a stream of characters - made up of alphabtes, spaces, digits and punctuation ma~s. Fmch word is separated from neigh~\]uring words by delimiters such as spaces, ~peciaL characters or punctuation nmrks. On the other 3and, a number (integer or floating point) regardless *~he corpus used in the Cobuild Project, a project in the EnglishDepartment, University of Birmingham, is made of 6 million words of written text and 1.3 million words of transcribed speech.</Paragraph>
      <Paragraph position="1">  The encoding scheme was implemented on a Honeywell computer using MULTICS operating system. Since MULTICS represents characters using ASCII with the nine bits per character, two bits are not being utilized. For our implementation, one bit is used to indicate words beginning with upper-case characters (proper names, etc) and the other available bit is kept for future develo~nent.</Paragraph>
    </Section>
    <Section position="2" start_page="418" end_page="418" type="sub_section">
      <SectionTitle>
3.0 The Word List
</SectionTitle>
      <Paragraph position="0"> I~Lrris \[2\] has constructed a word list with linear ordering in a sense that all words in a group are derived forms of a baseword (the first word in the group), and its relative position ~nplies its syntactic information (see figure 1). Because the n~nber of derived forms is not regular, the size of a group varies. Even though the positional notation cannot be used in a strict sense as in numbers because of the length variability of the groups, the relative position of a member in a group can still provide its syntactic infomnation.</Paragraph>
      <Paragraph position="1"> For a comprehensive and consistent word list, some words which are not in the top 32000 have had to be included, resulting in a larger word list.</Paragraph>
      <Paragraph position="2"> As the size of the integer codes cannot exceed 32768, the whole word list r~%y have to be reduced by excluding s~ne words in the top 32000.</Paragraph>
      <Paragraph position="3"> In using the above word list, it is found that two problems may arise due to (1) the occurrence of homographs (words identical in spelling but different in pronunciation and meaning) and homologs (words identical in spelling and pronun~i=~tion but different in meaning~ and (2)the occurrence of words having the same meaning but different syntax. If different codes are given to the duplicate words the encoding process would need an intelligent parser to be able to differentiate between the two. Such a parser, though not impossible to implement, is not cost-justifiable for this study; hence the only alternative is to assign the code for both words and to allow for ambiguity when the coded text is used for analysis.</Paragraph>
      <Paragraph position="4">  For encoding, a code table c~nprising more than 32000 distinct words in the word list indicating their codes is stored in an indexed file using the words as keys. For faster encoding, the top 200 words from dle Cobuild corpus are stored in a hash table. During encoding this hash table is searched before searching the code table, therefore saving execution time when enccx\]ing con~non words.</Paragraph>
      <Paragraph position="5"> For word mani~:~lation of the encoded text the word list needs to k~ structured in order that the syntactic informaticm is captured. That is, the group and the set (set-~Type)to which the word belongs and the relative position of the word in the group needs robe ir~icated. For our ~llplementation a l~ed list is employed to store a word which has liJ{s to the ~mse word (the first word in the group) and the next word in the group and information containing its set-type. Each node in the linked list is stored as a record in an indexed file using the codes as keys. In this way each word in a group can be retrieved individually and the group can be synthesized from tracing the links to the next word. There is further gain in using the codes as the search key in that the same file can be used for ~le decoding process.</Paragraph>
    </Section>
    <Section position="3" start_page="418" end_page="419" type="sub_section">
      <SectionTitle>
5.0 Some Sample Statistics
</SectionTitle>
      <Paragraph position="0"> Table 1 gives the perfomnance statistics of the two-byte-word encoding scheme on four English texts. Comparison of compression ratios with other techniques is sheik) in Table 2.</Paragraph>
      <Paragraph position="1">  a. &amp;quot;S~nall is Beautiful&amp;quot; by E.F. Scht~nacher b. &amp;quot;Baby and Child Care&amp;quot; by Dr. B. Speck c. &amp;quot;She Third World War&amp;quot; by Sir John Hacket d. &amp;quot;The ArLlericans&amp;quot; by A. Cooke  From Table 1 it is observed that on the average the compression percentage is 82%. Although short of the 99% mentioned in section 2.0, this is expected,since not all of the top 32000 words have been included in the word list. The compression ratio as shown in the fourth row has a mean of 2.0. Comparing the last row with the sl~m of the 7th and 8th row it is seen that the speed for word frequency count for the coded texts is much faster than that of the coded texts.</Paragraph>
    </Section>
    <Section position="4" start_page="419" end_page="419" type="sub_section">
      <SectionTitle>
6.0 Some Practical Applications
</SectionTitle>
      <Paragraph position="0"> For purposes of storing compressed text, the two-byte-word encoding scheme can be used independent of the word list and some results are show\]\] in the previous section. It maybe seen that the performance of the scheme is comparable with other well-known techniques (see Table 2). With the word list the scheme is capable of word nmniDulation and therefore it can be used for more intelligent applications. For example, the scheme has been used for obtaining the le~natized word count of several large texts - an almost impossible task when done manually. No comparison results is shown s~nplybecause no similar system currently exists.</Paragraph>
      <Paragraph position="1"> Because the words in the coded text are represented by integers, word comparison - a common task in linguistic research involving comparison of character by character - becomes comparison of two numbers which is a c~lick and simple operation on digital computers. This is reflected quite dramatically on obtaining the word frequency count of the coded text.</Paragraph>
      <Paragraph position="2"> Agknowled~ements I wish to thank Professor John Sinclair of the English Department, University of Birmingham for the use of research facilities in the Cobuild project. I a~l also grateful to the University of Malaya for providing the funds for my stay in Britain.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>