<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-3018">
  <Title>Word Alignment and Cross-Lingual Resource Acquisition</Title>
  <Section position="4" start_page="69" end_page="69" type="metho">
    <SectionTitle>
2 Annotation Interface
</SectionTitle>
    <Paragraph position="0"> One way to acquire annotations quickly is to appeal to users across the Internet. First, we are more likely to find annotators with the necessary qualifications.</Paragraph>
    <Paragraph position="1"> Second, many more users can work simultaneously than would be feasible to physically host in a lab.</Paragraph>
    <Paragraph position="2"> Third, having many users annotate the same data allows us to easily identify systematic problems as well as spurious mistakes. The OpenMind Initiative (Stork, 2001) has had success collecting information that could not be obtained from data mining tools or with a local small group of annotators.</Paragraph>
    <Paragraph position="3"> Collecting data from users over the Internet introduces complications. Since we cannot ascertain the computer skills of the annotators, the interface must be easy to use. Our interface is a JAVA applet on a webpage so that it is platform independent. An online tutorial is also provided (and required for first-time users). Another problem of soliciting unknown users for data is the possibility of receiving garbage data created by users who do not have sufficient knowledge or are maliciously entering random input. Our system minimizes this risk in several ways. First, new users are required to work through the tutorial, which also serves as a short guide to reduce stylistic differences between the annotators. Second, we require the same data to be labeled by multiple people to ensure reliability, and researchers can use the visualization tool (see Section 3) to compare the agreement rates between annotators. Finally, our program is designed with a filter for malicious users. After completing the tutorial, the user is given a randomly selected sample sentence (for which we already have verified alignments) to annotate. The user must obtain an F-measure agreement of 60% with the &amp;quot;correct&amp;quot; alignments in order to be allowed to annotate sentences.1 Because word alignment annotation is a useful resource for both training and testing, quite a few interfaces have already been developed. The earliest  trained annotators who had an average agreement rate of about 85%. We chose 60% to be the figure of merit because this level is nearly impossible to obtain through random guessing but is lenient enough to allow for the inexperience of first time users. Automatic computer alignments average around 50%.</Paragraph>
    <Paragraph position="4"> is the Blinker Project (Melamed, 1998); more recent systems have been released to support more languages and visualization features (Ahrenberg et al., 2003; Lambert and Castell, 2004). 2 Our interface does share some similarities with these systems, but it is designed with additional features to support our experimental goals of guideline development, active learning and resource projection. Following the experimental design proposed by Och and Ney (2000), we instruct the annotators to indicate their level of confidence by choosing sure or unsure for each alignment they made. This allows researchers to identify areas where the translation may be unclear or difficult. We provide a text area for comments on each sentence so that the annotator may explain any assumptions or problems. A hidden timer records how long each user spends on each sentence in order to gauge the difficulty of the sentence; this information will be a useful measurement of the effectiveness of different active learning algorithms. Finally, our interface supports cross projection annotation. As an initial study, we have focused on POS tagging, but the framework can be extended for other types of resources such as syntactic and semantic trees and can be configured for languages other than English and Chinese. When words are aligned, the known and displayed English POS tag of the last English word involved in the alignment group is automatically projected onto all Chinese words involved, but a drop-down menu allows the user to correct this if the projection is erroneous. A screenshot of the interface is provided in Figure 1a.</Paragraph>
  </Section>
  <Section position="5" start_page="69" end_page="71" type="metho">
    <SectionTitle>
3 Tools for Researchers
</SectionTitle>
    <Paragraph position="0"> Good training examples for NLP learning systems should have a high level of consistency and accuracy.</Paragraph>
    <Paragraph position="1"> We have developed a set of tools for researchers to visualize, compare, and analyze the work of the annotators. The main interface is a JAVA applet that provides a visual representation of all the alignments superimposed onto each other in a grid.</Paragraph>
    <Paragraph position="2"> For the purposes of error detection, our system provides statistics for researchers to determine the agreement rates between the annotators. The metric we use is Cohen's K (1960), which is computed for every sentence across all users' alignments. Cohen's K is a measure of agreement that takes the total probability of agreement, subtracts the probability the agreement is due to chance, and divides by the maximum agreement possible. We use a variant of the  contains other downloadable interface packages that do not have companion papers.</Paragraph>
    <Paragraph position="3">  for analyzing multiple annotators' alignments.</Paragraph>
    <Paragraph position="4"> equation that allows for having three or more judges (Davies and Fleiss, 1982). The measurement ranges from 0 (chance agreement) to 1 (perfect agreement). For any selected sentence, we also compute for each annotator an average pair-wise Cohen's K against all other users who aligned this sentence.3 This statistic may be useful in several ways. First, someone with a consistently low score may not have enough knowledge to perform the task (or is malicious). Second, if an annotator received an unusually low score for a particular sentence, it might indicate that the per-son made mistakes in that sentence. Third, if there is too much disagreement among all users, the sentence might be a poor example to be included.</Paragraph>
    <Paragraph position="5"> In addition to catching individual annotation errors, it is also important to minimize stylistic inconsistencies. These are differences in the ways different annotators (consistently) handle the same phenomena. A common scenario is that some function words in one language do not have an equivalent counterpart in the other language. Without a precise guideline ruling, some annotators always leave the function words unaligned while others always group the function words together with nearby content words.</Paragraph>
    <Paragraph position="6"> Our tool can be useful in developing and improving style guides. It highlights the potential areas that need further clarifications in the guidelines with an at-a-glance visual summary of where and how the annotators differed in their work. Each cell in the grid represents an alignment between one particular word in the English sentence and one particular word in the Chinese sentence. A white cell means no one proposed an alignment between the words. Each colored cell has two components: an upper green portion in3not shown in the screenshot here.</Paragraph>
    <Paragraph position="7"> dicating a sure alignment and a lower yellow portion indicating an unsure alignment. The proportion of these components indicates the ratio of the number of people who marked this alignment as sure to those who were unsure (thus, an all-green cell means that everyone who aligned these words together is sure).</Paragraph>
    <Paragraph position="8"> Moreover, we use different saturation in the cells to indicate the percentage of people who aligned the two words together. A cell with faint colors means that most people did not chose to align these words together. Furthermore, researchers can elect to view the annotation decisions of a particular user by clicking on the radio buttons below. Only the selected user's annotation decisions would be highlighted by red outlines (i.e., only around the green portions of those cells that the person chose sure and around the yellow portions of this person's unsure alignments).</Paragraph>
    <Paragraph position="9"> Figure 1b displays the result of three annotators' alignments of a sample sentence pair. This sentence seems reasonably easy to annotate. Most of the colored cells have a high saturation, showing that the annotators agree on the words to be aligned. Most of the cells are only green, showing that the annotators are sure of their decisions. Three out of the four unsure alignments coincide with the other annotators' sure alignments, and even in those cases, more annotators are sure than unsure (the green areas are 2/3 of the cells while the yellow areas are 1/3). The colored cells with low saturation indicate potential outliers. Comparing individual annotator's alignments against the composite, we find that one annotator, rh, may be a potential outlier annotator since this person generated the most number of lightly saturated cells. The person does not appear to be malicious since the three people's overall agreements are high. To determine whether the conflict  arises from stylistic differences or from careless mistakes, researchers can click on the disputed cell (a cross will appear) to see the corresponding English and Chinese words in the text boxes in the top and left margin.</Paragraph>
    <Paragraph position="10"> Different patterns in the visualization will indicate different problems. If the visualization patterns reveal a great deal of disagreement and unsure alignments overall, we might conclude that the sentence pair is a bad translation; if the disagreement is localized, this may indicate the presence of an idiom or a structure that does not translate word-for-word.</Paragraph>
    <Paragraph position="11"> Repeated occurrences of a pattern may suggest a stylistic inconsistency that should be addressed in the guidelines. Ultimately, each area of wide disagreement will require further analysis in order to determine which of these problems is occurring.</Paragraph>
  </Section>
class="xml-element"></Paper>