<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-1018">
  <Title>A Generative Probabilistic OCR Model for NLP Applications</Title>
  <Section position="8" start_page="0" end_page="0" type="concl">
    <SectionTitle>
6 Conclusions and Future Work
</SectionTitle>
    <Paragraph position="0"> We have presented a flexible, modular, probabilistic generative OCR model designed specifically for ease of integration with probabilistic models of the sort commonly found in recent NLP work, and for rapid retargeting of OCR and NLP technology to new languages.</Paragraph>
    <Paragraph position="1"> In a rigorous evaluation of post-OCR error correction on real data, illustrating a scenario where a black-box commercial English OCR system is retargeted to work with French data, we obtained a 70% reduction in word error rate over the English-on-French baseline, with a resulting word accuracy of 97%. Notably, our post-OCR correction of the English OCR output on French text outperformed a commercial French OCR system run on the same text.</Paragraph>
    <Paragraph position="2"> We also evaluated the impact of error correction in a resource-acquisition scenario involving translation lexicon acquisition from OCR output. The results show that our post-OCR correction framework significantly improves performance. We anticipate applying the technique to retarget cross-language IR technology: the results of Resnik et al. (2001) demonstrate that even noisy extensions to dictionary-based translation lexicons, acquired from parallel text, can have a positive impact on cross-language information retrieval performance. We are currently working on improving the correction performance of the system and on extending our error model implementation to include character context and to allow for character merge/split errors. We also intend to relax the requirement of having a word list, so that the model can handle valid-word errors.</Paragraph>
    <Paragraph position="3"> We are also exploring the possibility of tuning a statistical machine translation model to be used with our model to exploit parallel text. If a translation of the OCR'd text is available, a translation model can be used to provide us with a candidate-word list that contains most of the correct words, and very few irrelevant words.</Paragraph>
    <Paragraph position="4"> Finally, we plan to challenge our model with other languages, starting with Arabic, Turkish, and Chinese. Arabic and Turkish have phonetic alphabets, but also pose the problem of rich morphology. Chinese will require more work due to the size of its character set. We are optimistic that the power and flexibility of our modeling framework will allow us to develop the necessary techniques for these languages, as well as many others.</Paragraph>
  </Section>
</Paper>