File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/94/c94-2108_concl.xml
Size: 3,912 bytes
Last Modified: 2025-10-06 13:57:12
<?xml version="1.0" standalone="yes"?> <Paper uid="C94-2108"> <Title>CONTENT CHARACTERIZATION USING WORD SHAPE TOKENS</Title> <Section position="13" start_page="689" end_page="689" type="concl"> <SectionTitle> 10 DISCUSSION </SectionTitle> <Paragraph position="0"> Although the word shape tagger deals with greater ambiguity, it can still extract significant information from a text. The increase in ambiguity is not as high as might be expected: a large number of word shapes remain unambiguous after the lexicon has been shape converted.</Paragraph>
<Paragraph position="1"> As noted above, the creation of the word shape lexicon from the standard lexicon reduces the number of distinct entries to approximately one-third. For example, distinct words such as &quot;cat&quot; and &quot;rat&quot; map onto the same word shape token xxA. Nevertheless, the complexity of English spelling still allows a large proportion of surface forms to be distinguished merely by their word shapes.</Paragraph>
<Paragraph position="2"> Several improvements on our technique remain to be fully implemented. We do not yet have a principled way to determine the optimal tagset for a given corpus of text.</Paragraph>
<Paragraph position="3"> As noted above, there is a tension between the size of the tagset and the amount of syntactic information that is available in the word shape tokens.</Paragraph>
<Paragraph position="4"> We are also investigating computationally inexpensive ways of making finer distinctions between characters that map to the character shape codes x and A. Initially, parentheses and brackets were always classified as A and distorted any word shape they were adjacent to: for example, &quot;(USA)&quot; would be shape converted to A A A A A. Recently we have made progress in recognizing these non-alphabetic characters as word shape token delimiters, rather than parts of the word shape tokens themselves. It may also be useful to distinguish more alphabetic character classes by mapping scanned character images to a larger set of character shape codes. We can extract more useful information by distinguishing upper case letters from lower case letters, such as &quot;h&quot; and &quot;k&quot;, which map to the character shape code A. A larger number of character shape codes gives us more information about the word shape tokens, and helps to reduce ambiguity. However, we must be careful to choose character shape features which can be easily detected in the image and quickly classified by a character shape code.</Paragraph>
<Paragraph position="5"> In keeping with Fuji Xerox's multilingual document emphasis, we are also exploring ways in which this method may be applied to other Roman-alphabet languages, such as French, German, Dutch, and Spanish.</Paragraph>
<Paragraph position="6"> The technique will need to be evaluated separately for each language, however, to better understand how each language's typographic conventions may be reflected in its word shape.</Paragraph>
<Paragraph position="7"> 11 CONCLUSION We have presented a new technique for the understanding of English document images without optical character recognition. By scanning and categorizing character shapes, it is possible to extract word shapes from the document text; these word shape tokens can then be used as input to a tagger which determines part-of-speech information. This part-of-speech information can then be used to inform other document understanding techniques, including noun phrase recognition and topic identification.
The lack of OCR means we cannot extract all of the information contained in the scanned document's image; nevertheless, the information from the word shape tokens allows us to characterize the document's content with significant accuracy, and more quickly than if we performed OCR.</Paragraph> </Section> </Paper>
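
The word shape conversion discussed above can be illustrated with a short sketch. This is not the authors' implementation, which operates on scanned character images rather than plain text; the character-to-shape-code table below is an assumption reconstructed only from the examples given in the section (&quot;cat&quot;/&quot;rat&quot; mapping to xxA, &quot;h&quot; and &quot;k&quot; mapping to A, and parentheses treated as word shape token delimiters rather than token characters). It is written in Python purely for concreteness.

# Minimal sketch of word shape tokenization, assuming a shape-code
# alphabet reconstructed from the paper's examples; the actual code
# set used by the authors may differ.
ASCENDERS = set("bdfhklt")        # lowercase letters with ascenders -> A (assumed)
X_HEIGHT = set("acemnorsuvwxz")   # x-height lowercase letters -> x (assumed)
DESCENDERS = set("gjpqy")         # letters with descenders -> g (assumed)
DELIMITERS = set("()[] \t\n")     # brackets and whitespace end a token (per the paper's recent change)

def char_shape(ch: str) -> str:
    """Map a single character to its character shape code."""
    if ch.isupper():
        return "A"                # capitals share the ascender code A
    if ch in ASCENDERS:
        return "A"
    if ch in X_HEIGHT:
        return "x"
    if ch in DESCENDERS:
        return "g"
    if ch == "i":
        return "i"                # dotted character gets its own code (assumed)
    return "x"                    # fallback for characters not listed above

def word_shape_tokens(text: str) -> list[str]:
    """Split text on delimiters and convert each word to a word shape token."""
    tokens, current = [], []
    for ch in text:
        if ch in DELIMITERS:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(char_shape(ch))
    if current:
        tokens.append("".join(current))
    return tokens

print(word_shape_tokens("cat"))    # ['xxA']
print(word_shape_tokens("rat"))    # ['xxA'] -- same token as "cat"
print(word_shape_tokens("(USA)"))  # ['AAA'] -- parentheses act as delimiters, not token characters

A tagger would then assign part-of-speech tags to sequences of these tokens in place of the original surface forms, which is where the increased ambiguity noted in the discussion arises.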