<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1070">
  <Title>Inducing Information Extraction Systems for New Languages via Cross-Language Projection</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Cross-Language Projection
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Motivation and Previous Projection Work
</SectionTitle>
      <Paragraph position="0"> Not all languages have received equal investment in linguistic resources and tool development. For a select few, resource-rich languages such as English, annotated corpora and text analysis tools are readily available. However, for the large majority of the world's languages, resources such as treebanks, part-of-speech taggers, and parsers do not exist. And even for many of the better-supported languages, cutting edge analysis tools in areas such as information extraction are not readily available.</Paragraph>
      <Paragraph position="1"> One solution to this NLP-resource disparity is to transfer linguistic resources, tools, and domain knowledge from resource-rich languages to resource-impoverished ones. In recent years, there has been a burst of projects based on this paradigm.</Paragraph>
      <Paragraph position="2"> Yarowsky et al. (2001) developed cross-language projection models for part-of-speech tags, base noun phrases, named-entity tags, and morphological analysis (lemmatization) for four languages. Resnik et al. (2001) developed related models for projecting dependency parsers from English to Chinese. There has also been extensive work on the cross-language transfer and development of ontologies and WordNets (e.g., (Atserias et al., 1997)).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Mechanics of Projection
</SectionTitle>
      <Paragraph position="0"> The cross-language projection methodology employed in this paper is based on Yarowsky et al.</Paragraph>
      <Paragraph position="1"> (2001), with one important exception. Given the absence of available naturally occurring bilingual  corpora in our target domain, we employ commercial, off-the-shelf machine translation to generate an artificial parallel corpus. While machine translation errors present substantial problems, MT offers great opportunities because it frees cross-language projection research from the relatively few large existing bilingual corpora (such as the Canadian Hansards). MT allows projection to be performed on any corpus, such as the domain-specific planecrash news stories employed here. Section 5 gives the details of the MT system and corpora that we used.</Paragraph>
      <Paragraph position="2"> Once the artificial parallel corpus has been created, we apply an English IE system to the English texts and transfer the IE annotations to the target language as follows:  1. Sentence align the parallel corpus.</Paragraph>
      <Paragraph position="3"> 1 2. Word-align the parallel corpus using the Giza++ system (Och and Ney, 2000).</Paragraph>
      <Paragraph position="4"> 3. Transfer English IE annotations and noun null phrase boundaries to French via the mechanism described in Yarowsky et al. (2001), yielding annotated sentence pairs as illustrated in Figure 1.</Paragraph>
      <Paragraph position="5"> 4. Train a stand-alone IE tagger on these projected annotations (described in Section 4).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="7" type="metho">
    <SectionTitle>
4 Transformation-Based Learning
</SectionTitle>
    <Paragraph position="0"> We used transformation-based learning (TBL) (Brill, 1995) to learn information extraction rules for French. TBL is well-suited for this task because it uses rule templates as the basis for learning, which can be easily modeled after English extraction patterns. However, information extraction systems typically rely on a shallow parser to identify syntactic elements (e.g., subjects and direct objects) and verb  This is trivial because each sentence has a numbered anchor preserved by the MT system.</Paragraph>
    <Paragraph position="1"> constructions (e.g., passive vs. active voice). Our hope was that the rules learned by TBL would be applicable to new French texts without the need for a French parser. One of our challenges was to design rule templates that could approximate the recognition of syntactic structures well enough to duplicate most of the functionality of a French shallow parser.</Paragraph>
    <Paragraph position="2"> When our TBL training begins, the initial state is that no words are annotated. We experimented with two sets of &amp;quot;truth&amp;quot; values: Sundance's annotations and human annotations. We defined 56 language-independent rule templates, which can be broken down into four sets designed to produce different types of behavior. Lexical N-gram rule templates change the annotation of a word if the word(s) immediately surrounding it exactly match the rule. We defined rule templates for 1, 2, and 3-grams. In Table 1, Rules 1-3 are examples of learned Lexical N-gram rules. Lexical+POS N-gram rule templates can match exact words or part-of-speech tags. Rules 4-5 are Lexical+POS N-gram rules. Rule 5 will match verb phrases such as &amp;quot;went down in&amp;quot;, &amp;quot;shot down in&amp;quot;, and &amp;quot;came down in&amp;quot;. One of the most important functions of a parser is to identify the subject of a sentence, which may be several words away from the main verb phrase. This is one of the trickest behaviors to duplicate without the benefit of syntactic parsing. We designed Sub-ject Capture rule templates to identify words that are likely to be a syntactic subject. As an example, Rule 6 looks for an article at the beginning of a sentence and the word &amp;quot;crashed&amp;quot; a few words ahead  , and infers that the article belongs to a vehicle noun phrase. (The NP Chaining rules described next will extend the annotation to include the rest of the noun phrase.) Rule 7 attempts relative pronoun disambiguation when it finds the three tokens &amp;quot;COMMA which crashed&amp;quot; and infers that the word preceding the comma is a vehicle.</Paragraph>
    <Paragraph position="3"> Without the benefit of a parser, another challenge is identifying noun phrase boundaries. We designed NP Chaining rule templates to look at words that have already been labelled and extend the boundaries of the annotation to cover a complete noun phrase. As examples, Rules 8 and 9 extend loca-tion and victim annotations to the right, and Rule 10 extends a vehicle annotation to the left.</Paragraph>
    <Paragraph position="4">  ph is a start-of-sentence token. w  means that the item occurs in the range of word  through word</Paragraph>
  </Section>
  <Section position="6" start_page="7" end_page="7" type="metho">
    <SectionTitle>
5 Resources
</SectionTitle>
    <Paragraph position="0"> The corpora used in these experiments were extracted from English and French AP news stories.</Paragraph>
    <Paragraph position="1"> We created the corpora automatically by searching for articles that contain plane crash keywords. The news streams for the two languages came from different years, so the specific plane crash events described in the two corpora are disjoint. The English corpus contains roughly 420,000 words, and the French corpus contains about 150,000 words.</Paragraph>
    <Paragraph position="2"> For each language, we hired 3 fluent university students to do annotation. We instructed the annotators to read each story and mark relevant entities with SGML-style tags. Possible labels were loca-tion of a plane crash, vehicle involved in a crash, and victim (any persons killed, injured, or surviving a crash). We asked the annotators to align their annotations with noun phrase boundaries. The annotators marked up 1/3 of the English corpus and about 1/2 of the French corpus.</Paragraph>
    <Paragraph position="3"> We used a high-quality commercial machine translation (MT) program (Systran Professional Edition) to generate a translated parallel corpus for each of our English and French corpora. These will henceforth be referred to as MT-French (the Systran translation of the English text) and MT-English (the Systran translation of our French text).</Paragraph>
  </Section>
  <Section position="7" start_page="7" end_page="7" type="metho">
    <SectionTitle>
6 Experiments and Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
6.1 Scoring and Annotator Agreement
</SectionTitle>
      <Paragraph position="0"> We explored two ways of measuring annotator agreement and system performance. (1) The exact-word-match measure considers annotations to match if their start and end positions are exactly the same. (2) The exact-NP-match measure is more forgiving and considers annotations to match if they both include the head noun of the same noun phrase.</Paragraph>
      <Paragraph position="1"> The exact-word-match criterion is very conservative because annotators may disagree about equally acceptable alternatives (e.g., &amp;quot;Boeing 727&amp;quot; vs. &amp;quot;new Boeing 727&amp;quot;). Using the exact-NP-match measure, &amp;quot;Boeing 727&amp;quot; and &amp;quot;new Boeing 727&amp;quot; would constitute a match. We used different tools to identify noun phrases in English and French. For English, we applied the base noun phrase chunker supplied with the fnTBL toolkit (Ngai &amp; Florian, 2001). In French, we ran a part-of-speech tagger (Cucerzan &amp; Yarowsky, 2000) and applied regular-expression heuristics to detect the heads of noun phrases.</Paragraph>
      <Paragraph position="2"> We measured agreement rates among our human annotators to assess the difficulty of the IE task. We computed pairwise agreement scores among our 3 English annotators and among our 3 French annotators. The exact-word-match scores ranged from 16-31% for French and 24-27% for English. These relatively low numbers suggest that the exact-word-match criterion is too strict. The exact-NP-match agreement scores were much higher, ranging from 43-54% for French and 51-59% for English  .</Paragraph>
      <Paragraph position="3"> These agreement numbers are still relatively low, however, which partly reflects the fact that IE is a subjective and difficult task. Inspection of the data revealed some systematic differences of approach among annotators. For example, one of the French annotators marked 4.5 times as many locations as another. On the English side, the largest disparity was a factor of 1.4 in the tagging of victims.</Paragraph>
    </Section>
    <Section position="2" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
6.2 Monolingual English &amp; French Evaluation
</SectionTitle>
      <Paragraph position="0"> As a key baseline for our cross-language projection studies, we first evaluated the AutoSlog-TS and TBL training approaches on monolingual English and French data. Figure 2 shows (1) English training by running AutoSlog-TS on unannotated texts and then applying its patterns to the human-annotated English test data, (2) English training and testing by applying TBL to the human-annotated English data with 5-fold cross-validation, (3) English training by applying TBL to annotations produced by Sundance (using AutoSlog-TS patterns) and then testing the TBL rules on the human-annotated English data, and (4) French training and testing by applying TBL to human annotated French data with 5-fold cross-validation.</Paragraph>
      <Paragraph position="1"> Table 2 shows the performance in terms of Precision (P), Recall (R) and F-measure (F). Through- null Agreement rates were computed on a subset of the data annotated by multiple people; systems were scored against the full corpus, of which each annotator provided the standard for one third.</Paragraph>
      <Paragraph position="2"> out our experiments, AutoSlog-TS training achieves higher precision but lower recall than TBL training.</Paragraph>
      <Paragraph position="3"> This may be due to the exhaustive coverage provided by the human annotations used by TBL, compared to the more labor-efficient but less-complete</Paragraph>
    </Section>
    <Section position="3" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
6.3 TBL-based IE Projection and Induction
</SectionTitle>
      <Paragraph position="0"> As noted in Section 5, both the English and French corpora were divided into unannotated (&amp;quot;plain&amp;quot;) and annotated (&amp;quot;antd&amp;quot; or &amp;quot;Tst&amp;quot;) sections. Figure 3 illustrates these native-language data subsets in white. Each native-language data subset also has a machine-translated mirror in French/English respectively (shown in black), with an identical number of sentences to the original. By word-aligning these 4 native/MT pairs, each becomes a potential vehicle for cross-language information projection.</Paragraph>
      <Paragraph position="2"> resentative example pathway for projection. Here an English TBL classifier is trained on the 140Kword human annotated data and the learned TBL rules are applied to the unannotated English subcorpus. The annotations are then projected across the Giza++ word alignments to their MT-French mirror. Next, a French TBL classifier (TBL1) is trained on the projected MT-French annotations and the learned French TBL rules are subsequently applied to the native-French test data.</Paragraph>
      <Paragraph position="3"> An alternative path (T</Paragraph>
      <Paragraph position="5"> is more direct, in that the English TBL classifier is applied immediately to the word-aligned MT-English translation of the French test data. The MT-English annotations can then be directly projected to the French test data, so no additional training is necessary. Another short direct projection path</Paragraph>
      <Paragraph position="7"> Table 3 shows the results of our TBL-based experiments. The top performing pathway is the</Paragraph>
      <Paragraph position="9"/>
    </Section>
    <Section position="4" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
6.4 Sundance-based IE Projection and
Induction
</SectionTitle>
      <Paragraph position="0"> Figure 4 shows the projection and induction model using Sundance for English IE annotation, which is almost isomorphic to that using TBL. One notable difference is that Sundance was trained by applying AutoSlog-TS to the unannotated English text rather than the human-annotated data. Figure 4 also shows an additional set of experiments (S  data. The motivation was that native-English extraction patterns tend to achieve low recall when applied to MT-English text (given frequent mistranslations such as &amp;quot;to crush&amp;quot; a plane rather than &amp;quot;to crash&amp;quot; a plane). By training AutoSlog-TS on the sentences generated by an MT system (seen in the S  This is a &amp;quot;fair&amp;quot; gain, in that the MT-trained AutoSlog-TS patterns didn't use translations of any of the French test data.  Table 4 shows that the best Sundance pathway achieved an F-measure of .37. Overall, Sundance averaged 7% lower F-measures than TBL on comparable projection pathways. However, AutoSlog-TS training required only 3-4 person hours to review the learned extraction patterns while TBL training required about 150 person-hours of manual IE annotations, so this may be a viable cost-reward tradeoff. However, the investment in manual English IE annotations can be reused for projection to new foreign languages, so the larger time investment is a fixed cost per-domain rather than per-language.</Paragraph>
    </Section>
    <Section position="5" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
6.5 Analysis and Implications
</SectionTitle>
      <Paragraph position="0"> * For both TBL and Sundance, the P1, P2 and P3-family of projection paths all yield stand-alone monolingual French IE taggers not specialized for any particular test set. In contrast, the P4 series of pathways (e.g. P</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="7" end_page="7" type="metho">
    <SectionTitle>
MT
4 for Sundance), were trained
</SectionTitle>
    <Paragraph position="0"> specifically on the MT output of the target test data.</Paragraph>
    <Paragraph position="1"> Running an MT system on test data can be done automatically and requires no additional human language knowledge, but it requires additional time (which can be substantial for MT). Thus, the higher performance of the P4 pathways has some cost.</Paragraph>
    <Paragraph position="2"> * The significant performance gains shown by Sundance when AutoSlog-TS is trained on MT-English rather than native-English are not free because the MT data must be generated for each new language and/or MT system to optimally tune to  S(1+2) combines the training data in S1 (280K) and S2 (140K), yielding a 420K-word sample.</Paragraph>
    <Paragraph position="3"> its peculiar language variants. No target-language knowledge is needed in this process, however, and reviewing AutoSlog-TS' patterns can be done successfully by imaginative English-only speakers. * In general, recall and F-measure drop as the number of experimental steps increases. Averaged over TBL and Sundance pathways, when comparing 2 and 3-step projections, mean recall decreases from 26.8 to 21.8 (5 points), and mean F-measure drops from 32.6 to 28.8 (3.8 points). Viable extraction patterns may simply be lost or corrupted via too many projection and retraining phases.</Paragraph>
    <Paragraph position="4"> * One advantage of the projection path families of P1 and P2 is that no domain-specific documents in the foreign language are required (as they are in the P3 family). A collection of domain-specific English texts can be used to project and induce new IE systems even when no domain-specific documents exist in the foreign language.</Paragraph>
    <Section position="1" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
6.6 Multipath Projection
</SectionTitle>
      <Paragraph position="0"> Finally, we explored the use of classifier combination to produce a premium system. We considered a simple voting scheme over sets of individual IE systems. Every annotation of a head noun was considered a vote. We tried 4 voting combinations: (1) the systems that used Sundance with English extraction patterns, (2) the systems that used Sundance with MT-English extraction patterns, (3) the systems that used TBL trained on English human annotations, (4) all systems. For each combination of n systems, n answer sets were produced using the voting thresholds T v =1..n. For example, for T v  =2every annotation receiving &gt;=2votes (picked by at least 2 individual systems) was output in the answer set. This allowed us to explore a precision/recall tradeoff based on varying levels of consensus. Figure 5 shows the precision/recall curves. Voting yields some improvement in F-measure and provides a way to tune the system for higher precision or higher recall by choosing the T v threshold.</Paragraph>
      <Paragraph position="1"> When using all English knowledge sources, the F-measure at T v =1 (.48) is nearly 3% higher than the strongest individual system. Figure 5 also shows the performance of a 5th system (5), which is a TBL system trained directly from the French annotations under 5-fold cross-validation. It is remarkable that the most effective voting-based projection system from English to French comes within 6% F-measure of the monolingually trained system, given that this cross-validated French monolingual system was trained directly on data in the same language and source as the test data. This suggests that cross-language projection of IE analysis capabilities can successfully approach the performance of dedicated systems in the target language.</Paragraph>
      <Paragraph position="2">  point represents performance for a particular voting threshold. In all cases, precision increases and recall decreases as the threshold is raised.</Paragraph>
      <Paragraph position="3"> French Test-Set Performance P R F Multipath projection from all English resources .43 .54 .48 Table 5: Best multipath English-French Projection Performance (from English TBL and Sundance pathways)</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>