<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-2006">
  <Title>An Annotation System for Enhancing Quality of Natural Language Processing</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Recently there has been increasing interest in applying natural language processing (NLP) systems, such as keyword extraction, automatic text summarization, and machine translation, to Internet documents. However, there are various obstacles that make it difficult for them to produce good results. It is true that NLP technologies are not perfect, but some of the difficulties result from problems in HTML. Further, in general, if linguistic information is added to source texts, it greatly helps NLP programs to produce better results. In what follows, we would like to show some examples related to machine translation.</Paragraph>
    <Paragraph position="1"> In general, it is very helpful for machine translation programs to know boundaries on many levels (such as sentences, phrases, and words) and to know word-to-word dependency relations. For instance, since "St." has two possible meanings, "street" and "saint," it is difficult to determine whether the following example consists of one or two sentences.</Paragraph>
    <Paragraph position="2"> I went to Newark St. Paul lived there two years ago.</Paragraph>
    <Paragraph position="3"> As another example, the following sentence has two interpretations; one interpretation is that what he likes is people and the other interpretation is that what he likes is accommodating.</Paragraph>
    <Paragraph position="4"> He likes accommodating people.</Paragraph>
    <Paragraph position="5"> If there are tags indicating the direct-object modifier of the word "like," then the correct interpretation is possible. NLP may be able to resolve these ambiguities eventually by using advanced context processing techniques, but current NLP technology generally needs a hint from the author for these sorts of ambiguities.</Paragraph>
    <Paragraph position="6"> Further, there are issues in HTML/XML. When MT systems are applied to Web pages, most of the errors are generated by the linguistic incompleteness of MT technology, but some are generated by problems in HTML and XML tag usage. For instance, writers often use a &lt;br&gt; tag for sentence termination. Sometimes writers intend that a &lt;br&gt; tag should terminate the sentence (even without terminating punctuation such as a period), and in other cases writers intend &lt;br&gt; only as a formatting device. In the HTML &lt;table&gt; shown in Figure 1, the writer intends each line of a cell to express one linguistic unit. The MT program cannot tell whether each line is a unit for translation or whether, instead, the two lines form one unit. In this example, some MT programs would try to produce a translation of a unit "NetVista Models ThinkPad News." As shown in the above examples, NLP applications do not achieve their full potential, on account of problems unrelated to the essential NLP processes. If tags expressing linguistic information</Paragraph>
    <Paragraph position="8"> are inserted into source documents, they help NLP programs recognize document and linguistic structures properly, allowing the programs to produce much better results. At the same time, it is true that NLP technologies are incomplete, but their deficiencies can sometimes be circumvented through the use of such tags. Therefore, this paper proposes a set of tags for helping NLP programs, called Linguistic Annotation Language (or LAL).</Paragraph>
  </Section>
</Paper>