<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-1085">
  <Title>Fax: An Alternative to SGML</Title>
  <Section position="2" start_page="0" end_page="526" type="abstr">
    <SectionTitle>
2. Fax-a-Query: the Ultimate in WYSIWYG
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="526" type="sub_section">
      <SectionTitle>
Interfaces for Information Retrieval (IR)
</SectionTitle>
      <Paragraph position="0"> Users will need to search bitmaps for sections of interest. Traditionally, most IR systems have been developed for collections of text files rather than bitmaps. The user types in a query and the system retrieves a set of matching documents. Some of these systems depend on manual indexing, e.g., subject terms or hypertext links. Others allow the user to type in an arbitrary piece of text as input.</Paragraph>
      <Paragraph position="1"> Documents are retrieved by matching words against the query and weighting appropriately (Salton, 1989).</Paragraph>
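As a rough illustration of matching words against a query and weighting them, here is a minimal Python sketch. The text only says the words are weighted "appropriately", so the tf-idf weighting used here is an assumption, and the function name tfidf_rank and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_rank(query, docs):
    """Rank documents (dict of name -> text) against a free-text query."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: the number of documents each word appears in.
    df = Counter(word for words in tokenized.values() for word in set(words))
    query_words = set(query.lower().split())
    scores = {}
    for name, words in tokenized.items():
        tf = Counter(words)
        # Sum tf * idf over the query words that occur in this document.
        scores[name] = sum(
            tf[w] * math.log(n_docs / df[w]) for w in query_words if w in tf
        )
    # Highest score first.
    return sorted(scores, key=scores.get, reverse=True)

# Toy collection (hypothetical).
docs = {
    "a": "fax machines send bitmaps over phone lines",
    "b": "markup languages describe document structure",
    "c": "ocr converts bitmaps into text",
}
ranking = tfidf_rank("fax bitmaps", docs)
```

Rare query words count for more than common ones, which is one common reading of "weighting appropriately"; other weighting schemes would slot into the same loop.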
      <Paragraph position="2"> These systems have been extended to retrieve bitmaps, by first pre-processing the bitmaps with an OCR program. Although the OCR results are far from perfect, and users would complain about the OCR errors if they saw them, the OCR output has been shown to be more than adequate for retrieval purposes (Smith (1990), Taghva et al (to appear)).</Paragraph>
      <Paragraph position="3"> But why should a user have to type in a query? Why not provide a complete round trip capability? If OCR were used on the queries as well as on the documents, then the query could be a page of a book, an article, a fax, or whatever. As far as the user is concerned, the system is just faxes (or bitmaps), through and through.</Paragraph>
      <Paragraph position="4"> Figure 1: [...] panel), and comes across a topic of interest. The user sweeps a box over an interesting section of the bitmap (inverse video at bottom of left panel), which causes the corresponding words (produced by OCR) to be sent to an information retrieval system. A relevant document pops up in another bitmap browser (right panel).</Paragraph>
      <Paragraph position="5"> We call this proposal Fax-a-Query, and illustrate it in Figure 1. A user is reading a document in a bitmap browser, and comes across a topic of interest. The user sweeps a box over an interesting section of the bitmap, which causes the corresponding words (produced by OCR) to be sent to an information retrieval system. A relevant document pops up in another bitmap browser.</Paragraph>
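The box-sweeping step can be sketched as follows in Python, assuming the OCR program reports a bounding box for each recognized word. The OcrWord record, the coordinate convention, and the function names are hypothetical; the paper does not describe its data structures.

```python
from typing import NamedTuple

class OcrWord(NamedTuple):
    """One OCR result: recognized text plus its pixel bounding box (assumed layout)."""
    text: str
    x0: int
    y0: int
    x1: int
    y1: int

def words_in_box(words, box):
    """Return the text of every word whose bounding box lies inside the swept box."""
    bx0, by0, bx1, by1 = box
    selected = []
    for w in words:
        if w.x0 >= bx0 and w.y0 >= by0 and bx1 >= w.x1 and by1 >= w.y1:
            selected.append(w.text)
    return selected

def build_query(words, box):
    """Join the selected words into a free-text query for the retrieval system."""
    return " ".join(words_in_box(words, box))

# Toy page (hypothetical coordinates).
page = [
    OcrWord("Group", 10, 10, 60, 25),
    OcrWord("4", 65, 10, 75, 25),
    OcrWord("compression", 80, 10, 180, 25),
    OcrWord("unrelated", 10, 200, 90, 215),
]
query = build_query(page, (0, 0, 200, 50))
```

The resulting string would then be handed to the retrieval system exactly as a typed query would be.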
      <Paragraph position="6"> Fax-a-Query is useful for retrieving pictures as well as text. Most picture retrieval systems require manual indexing, which can be very expensive. However, since a picture is often surrounded by useful text such as a caption, one can find the picture by matching on the text.</Paragraph>
      <Paragraph position="7"> We have applied a prototype Fax-a-Query system to our database of 15,000 AT&amp;T internal documents. These documents were scanned into the computer by the AT&amp;T library for archival purposes. They are stored in TIFF format at 400 dots per inch, using Group 4 fax compression. It took us about a minute per page, or a year of real time, to OCR the collection, and 40 hours of real time to index the collection with the SMART information retrieval system (Salton and McGill, 1983, chapter 4) [footnote 1]. The bitmap browser was borrowed from the Ferret system (Katseff, personal communication).</Paragraph>
      <Paragraph position="8"> Footnote 1: The OCR errors slow the indexing process considerably since they make the vocabulary too large to fit in main memory. Our data has a huge vocabulary (3 million words), most of which are OCR errors. By comparison, the TREC text collection (Dumais, 1994) has a much smaller vocabulary (1 million words). The difference in vocabulary sizes is especially significant given that TREC is considerably larger (2 gigabytes) than our OCR output (1 gigabyte).</Paragraph>
      <Paragraph position="9"> Fax-a-Query was also designed to be usable from a standard fax machine, for users who may be on the road and don't have access to a terminal with a window system. A user could fax a query to the system, and the system would fax back some relevant documents. In this way, a user could call the home office from any public fax machine anywhere and access documents in a fax mailbox, a private file computer, or a public library. (This capability is currently limited by the fact that OCR doesn't work very well on low resolution faxes.)  3. Do We Need OCR?  Fax-a-Query makes heavy use of OCR, but does so in such a way that users are often unaware of what is actually happening behind the scenes.</Paragraph>
      <Paragraph position="10"> Image EMACS works directly on the pixels, in order to avoid OCR errors. Even though users can be fairly well shielded from the limitations of the OCR program, the OCR errors are frustrating nonetheless.</Paragraph>
      <Paragraph position="11"> Two examples of the word &quot;pair&quot; are shown in Figure 2. Both examples were extracted from the same document, but from different pages. One of them was recognized correctly and the other was misrecognized as &quot;liair&quot;. As can be seen in Figure 2, the two images are almost identical.</Paragraph>
      <Paragraph position="12"> Even a very simple-minded measure such as Hamming distance would have worked better than OCR, at least in this case.</Paragraph>
      <Paragraph position="13"> The &quot;liair&quot; error was probably caused by incorrectly segmenting the &quot;p&quot; into two letters, and then labeling the left half of the &quot;p&quot; as an 'l' and the second half as an 'i'. This error is particularly inexcusable since the spacing of the letters within a word is completely determined by the font. There is no way that &quot;li&quot; should be confusable with &quot;p&quot; since it would require shifting the &quot;l&quot; with respect to the &quot;i&quot; in both the horizontal and vertical dimensions in ways that are extremely unlikely. The Hamming distance approach would not make this kind of error because it works at the word-level rather than the character-level, and so it would not try to shift parts of words (or letters) around in crazy ways.</Paragraph>
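A minimal Python sketch of a word-level Hamming comparison follows. It assumes each word image is a binary bitmap stored as a list of rows and that the two boxes have already been cropped to the same dimensions; the same_word predicate and its 10 percent threshold are hypothetical choices, not values from the paper.

```python
def hamming_distance(a, b):
    """Count the pixels that differ between two equal-sized binary bitmaps."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("bitmaps must have the same dimensions")
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def same_word(a, b, max_fraction=0.1):
    """Hypothetical predicate: two boxes match if few enough pixels differ."""
    total = len(a) * len(a[0])
    return max_fraction * total >= hamming_distance(a, b)

# Toy 3x4 bitmaps (hypothetical): a clean word image, a copy with one
# flipped pixel, and an unrelated image.
clean = [[0, 1, 1, 0], [1, 0, 0, 1], [0, 1, 1, 0]]
noisy = [[0, 1, 1, 0], [1, 0, 0, 1], [0, 1, 0, 0]]
other = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 0, 0, 1]]
```

Because the whole box is compared at once, nothing in this measure can split a letter in two or slide letters against each other, which is the failure attributed to the OCR program above.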
      <Paragraph position="14"> In general, we have found that two instances of the same word in the same document are often very similar to one another, much more so than two instances from different documents. Figure 3, for example, shows a number of examples of the word &quot;using&quot; selected from two different documents. If we sum all of the instances of &quot;using&quot; across the two documents, as shown in the bottom-most panel, we get a mess, indicating that we can't use Hamming distance, or anything like it, for comparing across two documents. But if we sum within a single document, as shown in the two panels just above the bottom-most panel, then we find much better agreement, indicating that something like Hamming distance ought to work fairly well, as long as we restrict the search to a single document.</Paragraph>
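The summing experiment can be imitated on toy data. The Python sketch below adds aligned binary bitmaps pixel by pixel and reports the fraction of pixels on which every instance agrees; this agreement score is an assumption introduced for illustration, since the paper presents the sums only as pictures, and the tiny bitmaps are hypothetical.

```python
def sum_bitmaps(instances):
    """Pixel-wise sum of equal-sized binary bitmaps."""
    rows, cols = len(instances[0]), len(instances[0][0])
    return [
        [sum(inst[r][c] for inst in instances) for c in range(cols)]
        for r in range(rows)
    ]

def agreement(instances):
    """Fraction of pixels where all instances agree (the sum is 0 or n)."""
    n = len(instances)
    summed = sum_bitmaps(instances)
    agree = sum(1 for row in summed for v in row if v in (0, n))
    return agree / (len(summed) * len(summed[0]))

# Two instances from each of two hypothetical documents; the second
# document renders the same word shifted to the right.
doc_a = [[[1, 1, 0, 0], [1, 1, 0, 0]], [[1, 1, 0, 0], [1, 0, 0, 0]]]
doc_b = [[[0, 0, 1, 1], [0, 0, 1, 1]], [[0, 0, 1, 1], [0, 1, 1, 1]]]

within_a = agreement(doc_a)
within_b = agreement(doc_b)
across = agreement(doc_a + doc_b)
```

On this toy data the instances agree on most pixels within each document and on none once the two documents are pooled, mirroring the contrast between the panels described for Figure 3.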
      <Paragraph position="15"> The strong document effect should not be surprising. Chances are that all of the instances of &quot;using&quot; have been distorted in more or less the same way. They were probably all Xeroxed about equally often. The gain control on the scanner was probably fairly consistent throughout. The font is likely to be the same. The point size is likely to be the same, and so on. Some authors refer to these factors as defects (Baird, 1992), but we prefer to think of them as document-specific properties.</Paragraph>
      <Paragraph position="16"> We have used this Hamming distance approach to build a predicate that compares two boxes and tests whether the pixels in the two boxes correspond to the same word. In the case of the two &quot;pairs&quot; in Figure 2, for example, the predicate produces the desired result. This distance measure has been used to implement a search command. When the user clicks on an instance of a word, the system highlights the next instance of the same word, by searching the bitmap for the next place that has almost the same pixels [footnote 2]. It is remarkable that this search command manages to accomplish much of what we had been doing with OCR, but without the C (it is word-based rather than character-based) and without the R (it doesn't need to recognize the words in order to search for the next instance of the same thing). This opens an interesting question: how much natural language processing can be done without the C and without the R? For example, could we count ngram statistics at the pixel-level without giving the OCR program a chance to screw up the Cs and the Rs?</Paragraph>
    </Section>
    <Section position="2" start_page="526" end_page="526" type="sub_section">
      <SectionTitle>
4. Conclusions: Bitmaps are The Way of The Future
</SectionTitle>
      <Paragraph position="0"> We have been working with a large corpus of faxes (15,000 documents or 500,000 pages or 100,000,000 words). Faxes raise a number of interesting technical challenges: we need editors, search engines, and much more. Of course, we wouldn't have to work on these hard problems if only people would use SGML. But people aren't using SGML. SGML may be more convenient for us, but the world is using fax because it is more convenient for them.</Paragraph>
      <Paragraph position="1"> Footnote 2: It is possible to implement this search much more efficiently by pre-computing a few moments for each of the words in the bitmap and using these moments to quickly exclude words that are too big or too small, or too spread out or not spread out enough.</Paragraph>
      <Paragraph position="2"> Fax hardware and software are everywhere: hotels, airports, news stands, etc. Everyone knows how to use a fax machine. Word processors are more expensive, and require more training and skill. The markup issues, for example, are very demanding on the users. Part of the problem may be the fault of the markup languages, but the real problem is that the concepts are just plain hard. Most users don't want to know about tables, figures, floating displays, headers, footers, footnotes, columns, fonts, point sizes, character sets, and so on. Libraries are scanning large numbers of documents because scanning has become cheaper and more convenient than microfiche. Our library is scanning 10^5 pages per year. Our library has also been trying to archive &quot;machine readable&quot; text files in addition to the bitmaps, but with somewhat less success. Because it is too expensive to re-key the text, they have been asking authors for text files, but most authors aren't very cooperative.</Paragraph>
      <Paragraph position="3"> Even when the text file is available, we should archive the bitmap as well, because the bitmap is more likely to survive the test of time. We tend to think of the text file as the master copy, and the bitmap and the hardcopy as byproducts, when in fact, it should probably be the other way around. When the first author was finishing his Ph.D., he had to generate a copy of the thesis for archival purposes. At the time, it seemed that the school library was stuck in the stone age, because they insisted on a hardcopy printed on good paper, and they were not interested in his wonderful &quot;machine readable&quot; electronic version. In retrospect, they made the right decision. Even if the tapes had not rotted in his basement, he still couldn't read them because the tape reader is long gone, and the tape format is now obsolete. The markup language is also probably dead (does anyone remember R?), along with the computer (a PDP-10), the operating system (ITS), and most other aspects of the hardware and software that would be needed to read the electronic version.</Paragraph>
      <Paragraph position="4"> The debate between text files and bitmaps is analogous to the old debate between character-based terminals such as a VT100 and bitmap terminals. At the time, bitmap terminals seemed wasteful to some because they required what was then a lot of memory, but nowadays, it is hard to find a character-based terminal anywhere, and it is hard to remember why anyone would have wanted one. How could you run a window system on such a terminal? How could you do any interesting graphics? There were solutions, of course, but they weren't pretty.</Paragraph>
      <Paragraph position="5"> So too, there might soon be a day when people find it hard to imagine why anyone would want a text file. How could you do any interesting graphics? Equations? There are solutions (markup and include files), but they aren't pretty. Of course, bitmaps require a little more space (a 400 dpi G4 fax takes about 20 times the space of the equivalent text file), but the bitmap is so much more powerful and so much easier to use that it is well worth the extra space.</Paragraph>
    </Section>
  </Section>
</Paper>