<?xml version="1.0" standalone="yes"?>
<Paper uid="A94-1004">
  <Title>Modeling Content Identification from Document Images</Title>
  <Section position="3" start_page="22" end_page="26" type="intro">
    <SectionTitle>
3 Content identification
</SectionTitle>
    <Paragraph position="0"> Text characterization is an important domain for natural language processing. Many published techniques utilize word frequencies of a text for information retrieval and text categorization (Jacobs, 1992; Cutting et al., 1993).</Paragraph>
    <Paragraph position="1"> We also characterize the content of the document image by finding words that seem to specify the topic of the document. Briefly, our strategy is to identify the frequently occurring word shape tokens.</Paragraph>
    <Paragraph position="2"> In this section, we first describe a process of cleaning the input sequence which precedes the main procedures. Then, we illustrate how to collect the important tokens, introducing a stop list of common word shape tokens which is used to remove the tokens that are insufficiently specific to represent the content of the documents.</Paragraph>
    <Section position="1" start_page="22" end_page="23" type="sub_section">
      <SectionTitle>
3.1 Cleaning input sequence
</SectionTitle>
      <Paragraph position="0"> Given a sequence of word shape tokens, the system removes the specific character shape codes '-', ',', '.', ':', and '!' that do not contribute important linguistic information to the words to which they adhere, but that change the shape of the tokens. Otherwise, word shape would vary according to position and punctuation, which would interfere with token distribution analysis downstream. We ignore possible sentence-initial word shape alteration by capitalization simply because it is almost impossible to recover the original shape. In this paper, capitalized words are counted separately from their uncapitalized counterparts.</Paragraph>
      <Paragraph position="1"> Our cleaning process concatenates word shape tokens before and after the hyphen at the end of line. The process also deletes intended hyphens (e.g., AxxxA-xixAxA [broad-minded] --&gt; AxxxAxixAxA).</Paragraph>
      <Paragraph position="2"> Eliminating hyphens reduces the variation of word shape tokens. We measured this effect using the aforementioned 10,000 most frequent words, of which forty-two are hyphenated. In character shape code representation, the 10,000 words map into 3,679 word shape tokens (figure 2). When hyphens are eliminated, the 10,000 words fall into 3,670 word shape tokens. This small reduction (9 tokens) implies that eliminating hyphens has practically no effect on the subsequent processing.</Paragraph>
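The cleaning step of Section 3.1 can be sketched as follows. This is a minimal illustration rather than the system's actual implementation: it assumes the input arrives as text lines of whitespace-separated word shape tokens, and the names PUNCT_CODES and clean_lines are hypothetical.

```python
# Character shape codes removed during cleaning (listed in Section 3.1).
PUNCT_CODES = set("-,.:!")


def clean_lines(lines):
    """Join tokens split by an end-of-line hyphen, then strip the codes above.

    `lines` is assumed to be a list of strings, one per text line, each
    holding whitespace-separated word shape tokens.
    """
    tokens = []
    carry = ""  # token fragment waiting for its continuation on the next line
    for line in lines:
        parts = line.split()
        for idx, tok in enumerate(parts):
            tok = carry + tok
            carry = ""
            # A hyphen at the very end of a line: hold the fragment back.
            if idx == len(parts) - 1 and tok.endswith("-"):
                carry = tok
                continue
            tokens.append(tok)
    if carry:
        tokens.append(carry)
    # Delete the punctuation codes, including intended hyphens.
    cleaned = ["".join(c for c in t if c not in PUNCT_CODES) for t in tokens]
    return [t for t in cleaned if t]
```

With this sketch, the token AxxxA-xixAxA becomes AxxxAxixAxA, and a token broken across two lines is counted once.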
    </Section>
    <Section position="2" start_page="23" end_page="24" type="sub_section">
      <SectionTitle>
3.2 Introducing a word shape token stop list
</SectionTitle>
      <Paragraph position="0"> After cleaning is done, the system analyzes word shape token distribution. Word shape tokens are counted on the hypothesis that frequent ones correspond to words that represent content; however, tokens that correspond to function words are also very frequent. One problem awaiting solution is that of developing a technique to separate these two classes.</Paragraph>
      <Paragraph position="1"> In this paper, we define function words as the set of {prepositions, determiners, conjunctions, pronouns, modals, be and have surface forms }, and content words as the set of {nouns, verbs (excluding modals and be and have surface forms), adjectives }. Words that belong in both categories are defined as function words. We exclude adverbs from both, because they sometimes behave as function words and sometimes as content words. Words that can be adverbs but also can be either a function or a content word are not counted as adverbs.</Paragraph>
      <Paragraph position="2"> In English, function words tend to be short whereas content words tend to be long. For the purpose of investigating characteristics of function and content words in character shape code representation, we compiled a lexicon of 71,372 distinct word shape token entries from an ASCII-represented lexicon of 245,085 word entries which was provided by Xerox PARC and was modified in our laboratory. 254 word shape token entries of the lexicon correspond to 515 function words, 63,356 entries correspond to 226,648 content words, and 209 entries correspond to both function and content words.</Paragraph>
      <Paragraph position="3"> Finally, 8,921 word shape token entries correspond to 17,922 adverbs. Figure 3 shows the distribution of word shape token length. Frequency of occurrence of word shape tokens was not taken into account; that is, we simply counted the length of each entry and computed the population ratio. The distribution of content words is apparently different from that of function words. In the figure, we also record the distribution of word shape tokens corresponding to the 100 most frequent words (75 function words, 16 content words, and 9 adverbs) from the source (Carroll et al., 1971). It illustrates that very common words are short.</Paragraph>
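The length statistic described above can be illustrated with a short sketch. The character classes used here are an assumption inferred from the word and token pairs in Figure 4 and from the statement in Section 3.3 that numerals map to A; Table 1 of the paper remains the authoritative definition. The separate code for j is a further assumption, and the function names are hypothetical.

```python
from collections import Counter

# Assumed character classes, inferred from examples such as "through" -> AAxxxgA.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")


def word_shape(word):
    """Map an ASCII word to a word shape token under the assumed classes."""
    codes = []
    for ch in word:
        if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
            codes.append("A")
        elif ch in DESCENDERS:
            codes.append("g")
        elif ch in "ij":
            codes.append(ch)  # assumption: i and j keep their own codes
        elif ch.islower():
            codes.append("x")
        else:
            codes.append(ch)  # punctuation is passed through unchanged
    return "".join(codes)


def length_ratio(words):
    """Population ratio of token lengths over distinct token entries.

    Frequency of occurrence is ignored: each distinct entry counts once.
    """
    entries = {word_shape(w) for w in words}
    counts = Counter(len(e) for e in entries)
    total = len(entries)
    return {n: counts[n] / total for n in sorted(counts)}
```

Calling length_ratio separately on a function word list and a content word list gives two distributions of the kind compared in Figure 3.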
      <Paragraph position="4">  A stop list of the most common function word shape tokens was constructed so that they could be removed from sequences of word shape tokens. It is important to select the right word shape tokens for this list, which must selectively remove more function words than content words. In general, the larger the list, the more it removes both function and content words.</Paragraph>
      <Paragraph position="5"> Thinking back to our goal of finding frequent content words, we don't need to try to remove all function words. We need only to remove the function words that are usually more frequent than the frequent content-representing words in the text, on the assumption that the frequency of individual function words is almost independent of topic. Infrequent function words that remain after using the word shape token stop list are distinguishable from frequent content words by comparing their frequencies.</Paragraph>
      <Paragraph position="6"> We generated a word shape token stop list using Carroll's list of frequent function words. We selected several sets of the most frequent function words, by limiting the minimum frequency of words in the set to 1%, 0.5%, 0.1%, 0.09%, 0.08%, ..., 0%, then converted them into word shape tokens. We tested these word shape tokens on the aforementioned lexicon to count the number of matching entries. Table 2 gives part of the results, where Freq.FW stands for frequencies of the selected function words, # FW for the number of them, # stop-tokens for the number of word shape tokens derived from them, FW.Match for the ratio of the number of matching function words to the total number of function words in the lexicon (515), and CW.Match for the ratio of the number of matching content words to the total number of content words (226,648). A word shape token stop list, for instance, from function words whose frequencies are more than 0.5% removes 0.4% of content words and 18% of function words from the lexicon; a word shape token stop list from function words with frequencies more than 0.01% removes 4.2% of content and 56% of function words; and a word shape token stop list from all function words in the lexicon removes 9.5% of content words.</Paragraph>
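The construction and evaluation of a stop list for one frequency threshold can be sketched as below. This is an illustration under stated assumptions, not the authors' code: shape_of is a hypothetical dictionary from each word to its word shape token, freq_by_word is a hypothetical dictionary of relative frequencies for function words, and both function names are invented for the example.

```python
def build_stop_list(freq_by_word, min_freq, shape_of):
    """Shape tokens of the function words whose frequency exceeds min_freq.

    min_freq is a relative frequency, e.g. 0.0005 for the 0.05% threshold.
    """
    return {shape_of[w] for w, f in freq_by_word.items() if f > min_freq}


def match_ratio(words, stop_tokens, shape_of):
    """Fraction of `words` whose shape token is on the stop list.

    Applied to the function words of the lexicon this corresponds to
    FW.Match; applied to the content words it corresponds to CW.Match.
    """
    words = list(words)
    hits = sum(1 for w in words if shape_of[w] in stop_tokens)
    return hits / len(words)
```

Repeating both calls for each threshold (1%, 0.5%, 0.1%, and so on) would yield one row of a table like Table 2 per threshold.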
      <Paragraph position="7"> [Figure 4] Selected function words:
the of and a to in is you that it he for was on are as with his they at be this from I have or by one had but what all were when we there can an your which their if will each about up out them she many some so these would other into has her like him could no than been its who now my over down only may after where most through before our me any same around another must because such off every between should under us along while might next below something both few those
Word shape token stop list from the above words:
AAx xA xxA x Ax ix gxx AAxA iA Axx xxx xx xiAA Aix AAxg AAix Axxx A Ag AxA xAxA xAA xxxx xAxx AAxxx gxxx xAixA AAxix xxxA xAxxA xg AAxx xAx xxxg xxxAA xAAxx ixAx AiAx iAx xxAg xxg xAxxx AAxxxgA
We also tested these word shape token stop lists on ASCII encoded documents, and discovered that good results are obtained with the list derived from function words with frequencies of more than 0.05%. This list identifies all words that occur more than 5 times per 10,000 in the document. Figure 4 shows the selected function words and the corresponding word shape token stop list. The number of stop tokens is 57 for 101 function words. Table 2 shows that the list removes 2.9% of content words and 44% of function words from the lexicon.</Paragraph>
    </Section>
    <Section position="3" start_page="24" end_page="24" type="sub_section">
      <SectionTitle>
3.3 Augmentation of the word shape token stop list
</SectionTitle>
      <Paragraph position="0"> In our character classification, all numeric characters are represented by the character shape code A (Table 1).</Paragraph>
      <Paragraph position="1"> Therefore, after cleaning is done, all numerals in a text fall into word shape tokens A*, where * means zero or more repetitions of A. This sometimes makes the frequency of A* unreasonably high though numerals are often of little importance in content identification.</Paragraph>
      <Paragraph position="2"> A* matches all numerals, but since it matches few content words except for acronyms in capital letters, we decided to add A* to the word shape token stop list.</Paragraph>
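The augmented stop list test can be sketched as a set lookup combined with a pattern for all-A tokens. This is a minimal sketch: it reads A* as A followed by zero or more further A codes (the regular expression A+), and the names NUMERAL_SHAPE and is_stopped are hypothetical.

```python
import re

# One or more A codes and nothing else: the shape of every cleaned numeral,
# and also of acronyms written in capital letters.
NUMERAL_SHAPE = re.compile(r"A+")


def is_stopped(token, stop_tokens):
    """True if the token is on the stop list or consists only of A codes."""
    return token in stop_tokens or NUMERAL_SHAPE.fullmatch(token) is not None
```

Keeping the pattern separate from the explicit list means the list itself stays finite while still covering numerals of any length.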
    </Section>
    <Section position="4" start_page="24" end_page="26" type="sub_section">
      <SectionTitle>
3.4 Testing the word shape token stop list
</SectionTitle>
      <Paragraph position="0"> Our word shape token stop list was tested on 20 ASCII encoded sample documents, ranging in length from 571 to 13,756 words, from a variety of sources including business reports, travel guides, advertisements, technical reports, and private letters. First, we generated word shape tokens and cleaned them as described earlier. Next, we removed tokens that were on the word shape token stop list. Table 3 shows the number of content and function words in the documents before and after using the list. In the table, CW.1 and CW.2 stand for the number of distinct content words in the original document and the number after using the word shape token stop list, respectively. CW.R stands for the ratio of (CW.1 - CW.2) to CW.1. Similarly, FW.1 and FW.2 stand for the number of distinct function words before and after using the list, and FW.R is the ratio of (FW.1 - FW.2) to FW.1. FW.R is much larger than CW.R in all sample documents, which shows the good performance of the word shape token stop list. We should note that the values of CW.R are larger than the 2.9% that we get from testing the list on the lexicon. This is because the lexicon includes many uncommon words and these tend to be longer than the function words selected to make the word shape token stop list. This implies that our list removes more ordinary content words than uncommon ones. We believe that removing such words affects content identification little, since ordinary content words in many topics usually don't specify the content of the document well (Salton et al., 1975). Likewise, the values of FW.R are larger than 44% for the same reason.
[Figure 5] Word shape token ranking and corresponding words: {the, The} {to, be, Fr, In, As, On, An, (e} {1988, 1989, 2000, 1990, 1987, +7%), +5%), +28%, +27%, +18%, +14%} {of, at} {and, out, not, end, act} {in, is} {for, due, ten, low, For, Rnz} {some, over, were, more, same, rose, ease} {6%, 5%, At, 9%, 7%, 4%, 3%, 1%, 8%, 2%, 11, 10, +6} {building, Building} {4, 9, 8, 6, 3, 2, 1, 0, R, A, 5} {work, real, cost, such, much, most} {90%, 83%, 847, 80%, 5%), 49%, 29%, 27%, 26%, 23%, 21%, 19%, 175, 14%, 13%} {a, s} {year, pace, grew} {by, By} {was, are, new, can, saw, own, one} {construction} {expanded, expected, reported} {on, as, or}
After using the word shape token stop list: [...]</Paragraph>
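The ratios CW.R and FW.R follow directly from their definitions. The sketch below assumes a hypothetical dictionary shape_of from each word to its word shape token; the function name removal_ratio is likewise invented for the example.

```python
def removal_ratio(distinct_words, stop_tokens, shape_of):
    """Ratio of (N1 - N2) to N1 for one word class in one document.

    N1 is the number of distinct words before using the stop list and
    N2 the number that remain afterwards. Passing the content words of a
    document gives CW.R; passing its function words gives FW.R.
    """
    before = set(distinct_words)
    after = {w for w in before if shape_of[w] not in stop_tokens}
    return (len(before) - len(after)) / len(before)
```

Because the counts are over distinct words, a word that occurs many times in the document still contributes once to N1 and at most once to N2.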
      <Paragraph position="1"> After using the word shape token stop list, we counted the remaining tokens to obtain content-representing words in their frequency order. All samples successfully indicated their content by appropriate words. Data for a sample document reporting the growth of the building industry in Switzerland in 1988 and its outlook for 1989, consisting of 1013 words, are shown in figure 5. It shows the ranking of the most frequent word shape tokens in the original document and the new ranking after using the word shape token stop list. The number of removed tokens was 544. Most of them represented common words and numerals. The top ranking after using the word shape token stop list consists of content words and represents the content of the document much better than the ranking before using it.</Paragraph>
      <Paragraph position="2"> Figure 5 also suggests that we can inexpensively locate key words by performing OCR on the few frequent word shape tokens.</Paragraph>
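The final ranking step can be sketched with a frequency counter. This is an illustrative sketch only: rank_tokens and its top_n parameter are hypothetical names, and the input is assumed to be the cleaned token sequence of one document.

```python
from collections import Counter


def rank_tokens(tokens, stop_tokens, top_n=10):
    """Rank the tokens that survive the stop list by frequency.

    Returns (token, count) pairs, most frequent first. Only these few
    top-ranked tokens would then need OCR to recover the key words.
    """
    kept = [t for t in tokens if t not in stop_tokens]
    return Counter(kept).most_common(top_n)
```

Running the same function with an empty stop list gives the ranking of the original document, which allows a before and after comparison of the kind shown in figure 5.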
    </Section>
  </Section>
</Paper>