<?xml version="1.0" standalone="yes"?> <Paper uid="C02-2006"> <Title>An Annotation System for Enhancing Quality of Natural Language Processing</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Linguistic Annotation Language </SectionTitle> <Paragraph position="0"> LAL is an XML-compliant tag set and its XML namespace prefix is lal.</Paragraph> <Paragraph position="1"> The LAL tag set is designed to be as simple as possible for the following reasons: (1) A simple tag set is easier for developers to check manually. (2) An easy-to-use annotation tool is mandatory for this annotation scheme. Simplicity is important for making an easy-to-use annotation tool, since if we use a feature-rich tag set, the user must check many annotation items.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Basic Tags </SectionTitle> <Paragraph position="0"> The sentence tag s is used to delimit a sentence.</Paragraph> <Paragraph position="1"> &lt;lal:s&gt;This is the first sentence.&lt;/lal:s&gt; &lt;lal:s&gt;This is the second sentence.&lt;/lal:s&gt; The attribute type=&quot;hdr&quot; means that the sentence is a title or header.</Paragraph> <Paragraph position="2"> The word tag w is used to delimit a word. It can have attributes for additional information such as base form (lex), part-of-speech (pos), features (ftrs), and sense (sense) of a word. The values of these attributes are language-dependent, and are not described in this paper because of space limitations. 
The following example illustrates some of these tags and attributes.</Paragraph> <Paragraph position="4"> The dependency (or word-to-word modification) relationship can be expressed by using the id and mod attributes of a word tag; that is, a word can have the ID value of its modifiee in a mod attribute.</Paragraph> <Paragraph position="5"> The ID value of a mod attribute must be an ID value of a word tag or a segment tag. For instance, the following example contains attributes showing that the word &quot;with&quot; modifies the word &quot;saw,&quot; meaning that &quot;she&quot; has a telescope.</Paragraph> <Paragraph position="6"> She &lt;lal:w id=&quot;w1&quot; lex=&quot;see&quot; pos=&quot;v&quot; sense=&quot;see1&quot;&gt;saw&lt;/lal:w&gt; a man &lt;lal:w mod=&quot;w1&quot;&gt;with&lt;/lal:w&gt; a telescope.</Paragraph> <Paragraph position="7"> The phrase (or segment) tag seg is used to specify a phrase scope on any level. In addition, you can specify the syntactic category for a phrase by using an optional attribute cat. The following example marks the span &quot;a man ... a telescope&quot; as a noun phrase. This also implies that the prepositional phrase &quot;with a telescope&quot; modifies the noun phrase &quot;a man.&quot; She saw &lt;lal:seg cat=&quot;np&quot;&gt;a man with a telescope&lt;/lal:seg&gt;.</Paragraph> <Paragraph position="8"> The attribute para=&quot;yes&quot; means that the segment is a coordinated segment. The following example shows that the word &quot;software&quot; and the word &quot;hardware&quot; are coordinated.</Paragraph> <Paragraph position="9"> This company deals with &lt;lal:seg cat=&quot;np&quot; para=&quot;yes&quot;&gt;software and hardware&lt;/lal:seg&gt; for networking.</Paragraph> <Paragraph position="10"> The ref attribute has the ID value of the referent of the current word. 
This can be used to specify a pronoun referent, for instance:</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Expressing Multiple Parses </SectionTitle> <Paragraph position="0"> As mentioned earlier, since natural language contains ambiguities, it is useful for LAL annotation to have a mechanism for expressing syntactic ambiguities. We have introduced a parse identifier (or PID) in attribute values for distinguishing parses. An attribute value that may change from parse to parse can be expressed as multiple space-separated values, each of which consists of a PID prefix followed by a colon and an attribute value.</Paragraph> <Paragraph position="2"> This example shows that there are two interpretations whose PIDs are p1 and p2, and that the p1 interpretation is &quot;He likes people&quot; and p2 is &quot;He likes accommodating.&quot;</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 LAL-Aware NLP Programs </SectionTitle> <Paragraph position="0"> We have modified certain NLP systems to be LAL-aware. ESG [5, 6] is an English parsing system developed by the IBM Watson Research Center, and it has been updated to accept and generate LAL-annotated English. We have also developed a Japanese parsing system with LAL output functionality. 
These LAL-aware versions of parsers are used as a back-end process to show users the system's default interpretation for a given sentence in the LAL-annotation editor described below.</Paragraph> <Paragraph position="1"> Further, the English to German, French, Spanish, Italian and Portuguese translation engines [6,</Paragraph> </Section> <Section position="5" start_page="0" end_page="2" type="metho"> <SectionTitle> 4 The LAL-Annotation Editor </SectionTitle> <Paragraph position="0"> Since inserting tags into documents manually is not generally an easy task for end users, it is important to provide an easy-to-use GUI-based editing environment. In developing such an environment, we took into consideration the following points: (1) Users should not have to see any tags. (2) Users should not have to see internal representations expressing linguistic information. (3) Users should be able to view and modify linguistic information such as feature values, but only if they want to.</Paragraph> <Paragraph position="1"> Considering these points, we have found that most of the errors made by NLP programs result from their failure to recognize the phrasal structures of sentences. Therefore, we have decided to show only a structural view of a sentence in the initial screen; other information is shown only if the user requests it.</Paragraph> <Paragraph position="2"> In addition, Watanabe [11] reported on an algorithm for accelerating CFG-parsing by using LAL tag information, and it is implemented in the above English-to-Japanese translation engine.</Paragraph> <Paragraph position="3"> The important issue here is how to represent the syntactic structure of a sentence to the user. 
NLP programs normally deal with a linguistic structure by means of a syntactic tree, but such a structure is not necessarily easy for end users to understand.</Paragraph> <Paragraph position="4"> For instance, Figure 2 shows the dependency structure of the English sentence &quot;IBM announced a new computer system for children with voice function.&quot; This dependency structure is difficult for end users, partly because a dependency tree does not keep the surface word order, so that it is difficult to map it to the original sentence quickly.</Paragraph> <Paragraph position="5"> Therefore, an important property for the linguistic structural view is that users can easily reconstruct the original surface sentence string.</Paragraph> <Paragraph position="6"> The next important issue is how easily a user can understand the overall linguistic structure. If a user is, at first, presented with detailed linguistic structure at the word level, then it is difficult to grasp the important linguistic skeleton of a sentence. Therefore, another necessary property is to give users a view in which the overall sentence With these requirements in mind, we have developed a GUI tool called the LAL Editor. To satisfy the last requirement, this editor has two presentation modes: the reduced presentation view and the expanded presentation view. In the reduced presentation view, a main verb and its modifiers are basic units for presenting dependencies, and they are located on different lines, keeping the surface order. Figure 3 shows an example of this reduced presentation view. In this view, since dependencies that are obvious for native speakers (e.g. &quot;a&quot; and &quot;computer&quot;) are not displayed explicitly, the user can concentrate on dependencies between key</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> View </SectionTitle> <Paragraph position="0"> units (or phrases). 
If the user finds any dependency errors in the reduced view, he or she can enter the expanded view mode in which all words are basic units for presenting dependencies. Figure 4 (a, b) shows examples of this expanded view. In these views, to satisfy the former requirement, dependencies between basic units are expressed by using indentation. Therefore you can easily reconstruct the surface sentence string by just looking at words from top to bottom and from left to right, and easily know dependencies of words by looking at words located in the same column. For details of the algorithm, see [12].</Paragraph> <Paragraph position="1"> In Figure 3, you can easily grasp the overall structure. In this case, since the dependencies between &quot;for&quot; and &quot;announced,&quot; and &quot;with&quot; and &quot;announced&quot; are wrong, the user can change the mode to the expanded view (as shown in Figure 4 (a)).</Paragraph> <Paragraph position="2"> In this view, the user can change dependencies by dragging a modifier to the correct modifiee using a mouse. The corrected dependency structure is shown in Figure 4 (b).</Paragraph> <Paragraph position="3"> In addition, the LAL Editor has the capability of testing translation by using LAL annotation. Figure 5 shows a window in which the top pane shows the input sentence, the second pane shows the LAL-annotation of the input, the third pane shows the translation result using the LAL annotation, and the fourth pane shows the default translation without using the LAL annotation. The user can easily check whether the current annotation can improve translations.</Paragraph> </Section> </Section> <Section position="6" start_page="2" end_page="2" type="metho"> <SectionTitle> 5 Experiment </SectionTitle> <Paragraph position="0"> We have conducted a small experiment for evaluating LAL annotation to our English-to-Japanese machine translation system [9]. 
We gathered about 60 sentences from Web pages in the computer domain, and added LAL annotation to these sentences</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> View </SectionTitle> <Paragraph position="0"> with the LAL annotation editor. In this experiment, only word-to-word modifications were corrected. Due to severe parsing errors and glitches of the annotation editor, only 53 of the 60 sentences were used in this experiment. The average sentence length for this test set was 21 words. Two evaluators assigned a quality evaluation ranging from 1 (worst) to 5 (best) for each translation, with and without use of annotation.</Paragraph> <Paragraph position="1"> Translation results for 18 sentences (about 34%) were better for the annotated case than the non-annotated case. These better sentences were 1.16 points better (27% better in quality score). On the other hand, 26 sentences (about 49%) were not changed, and 9 sentences (about 17%) were worse. The main reason why these 9 sentences were worse was the structural mismatch between the output of the LAL Editor and the expected structure of the EtoJ translation system, since the LAL Editor and the EtoJ MT system use different parsing systems.</Paragraph> <Paragraph position="2"> We have developed a structure conversion routine from LAL editor output to EtoJ input, but it does not yet cover all situations. 
This is the reason why these 9 sentences become worse.</Paragraph> <Paragraph position="3"> Note that this experiment only uses word-to-word modification corrections, so there is room for producing better translations if we use other types of annotation such as part-of-speech and word sense.</Paragraph> </Section> </Section> <Section position="7" start_page="2" end_page="2" type="metho"> <SectionTitle> 6 Discussion </SectionTitle> <Paragraph position="0"> There have been several efforts to define tags for describing language resources, such as TEI [10], OpenTag [8], CES [1], EAGLES [2], and GDA [3]. The main focus of these efforts other than GDA has been to share linguistic resources by expressing them in a standard tag set, and therefore they define very detailed levels of tags for expressing linguistic details. GDA has almost the same purposes but it has also defined a very complex tag set. This complexity discourages people from using these tag sets when writing documents, and it also makes it difficult to build an annotation tool for these tags.</Paragraph> <Paragraph position="1"> LAL is not opposed to these previous efforts, but attempts to strike a useful balance between expressiveness and simplicity, so that annotation can be used widely.</Paragraph> <Paragraph position="2"> As mentioned in the discussion of the experiment, there is an issue when the parsing system of the LAL editor and the parsing system of an NLP tool which accepts the output of the LAL editor are different. As mentioned before, we used the ESG parser for producing LAL-annotated English, and the English-to-Japanese MT system for accepting LAL-annotated English. Since these systems have been independently developed based on different approaches by different developers, we found there are some structural differences. For instance, given a prepositional phrase Prep N, ESG's head word of the prepositional phrase is Prep, but EtoJ MT engine's head is N. 
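</Paragraph> <Paragraph position="3"> The Prep/N head difference described above can be made concrete with a minimal sketch. It is an illustration only, not the conversion routine used in the experiment: the Word container, the function name prep_head_to_noun_head, and the part-of-speech values prep and n are assumptions introduced here, since the paper does not list its attribute values. The sketch rewrites mod links so that a phrase headed by Prep becomes headed by N.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass
class Word:
    """Hypothetical container for one word and its LAL-style attributes."""
    id: str
    text: str
    pos: str             # assumed values: "v", "n", "prep"
    mod: Optional[str]   # ID of the modifiee, or None for the root


def prep_head_to_noun_head(words: Dict[str, Word]) -> Dict[str, Word]:
    """Return a copy in which each 'Prep N' phrase is headed by N, not Prep."""
    result = {wid: replace(w) for wid, w in words.items()}
    for prep in words.values():
        if prep.pos != "prep":
            continue
        # In the Prep-headed analysis, the object noun modifies the preposition.
        objects = [w for w in words.values() if w.mod == prep.id and w.pos == "n"]
        if len(objects) != 1:
            continue  # no object or several candidates: leave unchanged
        noun = objects[0]
        result[noun.id].mod = prep.mod  # N takes over the outside attachment
        result[prep.id].mod = noun.id   # Prep now modifies N
    return result


# "saw a man with a telescope", determiners omitted, Prep-headed input.
example = {
    "w1": Word("w1", "saw", "v", None),
    "w2": Word("w2", "man", "n", "w1"),
    "w3": Word("w3", "with", "prep", "w1"),
    "w4": Word("w4", "telescope", "n", "w3"),
}
converted = prep_head_to_noun_head(example)
print({wid: w.mod for wid, w in converted.items()})
```

A rule of this kind handles only the simple case: a preposition with no object, or with several candidate objects, is left unchanged, and a phrase attached to another preposition would need further treatment.</Paragraph> <Paragraph position="4">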
In most cases, we can make systematic conversion routines for different structures. In fact, for most of the sentences whose translation is worse when annotation is used, we can provide structural conversion routines for the linguistic structures included in them. The basic idea of LAL-awareness for NLP tools is that an NLP tool uses LAL information as much as possible, but if LAL information produces a severe conflict with the internal processing, then such information should not be used. Our EtoJ MT program was basically implemented this way based on the algorithm described in [11], but we seem to need more research on this issue.</Paragraph> </Section> </Paper>