<?xml version="1.0" standalone="yes"?>
<Paper uid="J01-1005">
  <Title>Squibs and Discussions Unsupervised Named Entity Recognition Using Syntactic and Semantic Contextual Evidence</Title>
  <Section position="2" start_page="0" end_page="124" type="metho">
    <SectionTitle>
3 A standard POS tagger augmented with simple heuristics is used to detect possible instances of PNs.
</SectionTitle>
    <Paragraph position="0"> Errors arise only from ambiguous sentence-initial words, as in &quot;Owens Illinois&quot; or &quot;Boots Plc&quot;, which cause partial recognition.</Paragraph>
    <Paragraph position="1">  Cucchiarelli and Velardi Unsupervised Named Entity Recognition where x = w_j or x = w_k and U_PN = w_k or w_j (the unknown PN can be either the head or the modifier), type_i is the syntactic type of the esl (e.g., N-of-N, NAN, V-for-N), and furthermore let pl(esl_i(x, U_PN)) be the plausibility of a detected esl. Plausibility is a measure of the statistical evidence of a detected syntactic relation (Basili, Marziali, and Pazienza 1994; Grishman and Sterling 1994) that depends upon local (i.e., sentence-level) syntactic ambiguity and global corpus evidence. The plausibility accounts for the uncertainty arising from syntactic ambiguity. Finally, let ESLA be a set of esls in PN_esl (the previously learned contextual model) defined as follows: for each esl_i(x, U_PN) in ESL, put in ESLA the set of esl_j(x, PN_j) with type_j = type_i, x in the same position as in esl_i, and PN_j a known proper noun in the same position as U_PN in esl_i.</Paragraph>
    <Paragraph position="2"> ESLB be a set of esls in PN_esl defined as follows: for each esl_i(x, U_PN) in ESL, put in ESLB the set of esl_j(w, PN_j) with type_j = type_i, w in the same position as x in esl_i, Sim(w, x) &gt; δ, and PN_j a known proper noun in the same position as U_PN in esl_i. Sim(w, x) is a similarity measure between x and w. In our experiments, Sim(w, x) &gt; δ iff w and x have a common hyperonym H in WordNet. The generality of H (i.e., the number of levels from x to H) is made parametric, to analyze the effect of generalization.</Paragraph>
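    <Paragraph> The construction of the two context sets can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the Esl record, the toy hypernym table (a stand-in for WordNet), the known proper noun Acme_Corp, and the function names are all hypothetical. Sim(w, x) is read here as "w and x are distinct words sharing a hypernym within max_level links", and strict synonymy (generalization level 0 in the experiments) is not modeled.

```python
from typing import NamedTuple


class Esl(NamedTuple):
    rel_type: str     # syntactic type of the relation, e.g. "N-of-N"
    word: str         # the common word (x or w) in the relation
    pn: str           # the proper noun in the relation
    pn_is_head: bool  # position of the proper noun (head or modifier)


# Hypothetical child-to-parent hypernym links standing in for WordNet.
TOY_HYPERNYM = {
    "division": "unit",
    "branch": "unit",
    "unit": "organization",
    "subsidiary": "organization",
    "president": "leader",
}


def ancestors(word, max_level):
    """Return the word plus its hypernyms up to max_level links away."""
    found = {word}
    for _ in range(max_level):
        word = TOY_HYPERNYM.get(word)
        if word is None:
            break
        found.add(word)
    return found


def sim(w, x, max_level):
    """Assumed reading of Sim(w, x): distinct words that share a hypernym
    reachable from each of them within max_level links."""
    if w == x:
        return False
    hyper_w = ancestors(w, max_level) - {w}
    hyper_x = ancestors(x, max_level) - {x}
    return bool(hyper_w.intersection(hyper_x))


def same_slot(candidate, detected):
    """Same syntactic type and same position of the proper noun."""
    return (candidate.rel_type == detected.rel_type
            and candidate.pn_is_head == detected.pn_is_head)


def build_esl_a(detected, model):
    """Known-PN contexts with the same type, slot and word as a detected esl."""
    return [m for d in detected for m in model
            if same_slot(m, d) and m.word == d.word]


def build_esl_b(detected, model, max_level):
    """Known-PN contexts with the same type and slot and a similar word."""
    return [m for d in detected for m in model
            if same_slot(m, d) and sim(m.word, d.word, max_level)]


# Toy contextual model (PN_esl) and one detected context of an unknown PN.
model = [
    Esl("N-of-N", "division", "Acme_Corp", False),
    Esl("N-of-N", "branch", "Drexel_Burnham_Lambert_Group_Inc", False),
    Esl("N-of-N", "subsidiary", "Nationale_Nederlanden_Group", False),
    Esl("N-of-N", "president", "Acme_Corp", False),
    Esl("V-for-N", "division", "Acme_Corp", False),
]
detected = [Esl("N-of-N", "division", "American_Brands_Inc", False)]

esl_a = build_esl_a(detected, model)
esl_b_level_1 = build_esl_b(detected, model, max_level=1)
esl_b_level_2 = build_esl_b(detected, model, max_level=2)
```

 In this toy setting, raising max_level from 1 to 2 lets the "subsidiary" context join ESLB alongside "branch", which mirrors how a more general common hyperonym admits more, and potentially noisier, contexts.</Paragraph>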
    <Paragraph position="3"> * For each semantic category Cp,j compute evidence(Cp,j) as:</Paragraph>
    <Paragraph position="5"> pl(esl_i(x, PN_j)) is the plausibility and amb(x) is the ambiguity of x in esl_i; k is a constant factor used to incrementally reduce the influence of ambiguous words. The smoothing is tuned to be higher in</Paragraph>
  </Section>
  <Section position="3" start_page="124" end_page="126" type="metho">
    <SectionTitle>
ESLB
</SectionTitle>
    <Paragraph position="0"> α and β are parametric, and can be used to study the evidence provided by ESLA and ESLB.  Computational Linguistics Volume 27, Number 1. D(x, C(PN_j)) is a discrimination factor used to determine the saliency (Yarowsky 1992) of a context esl_i(x, _) for a category C(PN_j), i.e., how good a context is at discriminating between C(PN_j) and the other categories. 4 The selected category for U_PN is</Paragraph>
    <Paragraph position="2"> When grouping all the evidence of a U_PN in a text, the underlying hypothesis is that, in a given linguistic domain (finance, medicine, etc.), a PN has a unique sense. This is a reasonable restriction for Proper Nouns, supported by empirical evidence, though we would be more skeptical about the applicability of the one-sense-per-discourse paradigm (Gale, Church, and Yarowsky 1992) to generic words. We believe that it is precisely this restriction that makes the use of syntactic and semantic contexts effective for PNs.</Paragraph>
    <Paragraph position="3"> Notice that the formula of the evidence has several smoothing factors that work together to reduce the influence of unreliable or uninformative contexts. The formula also has parameters (k, α, β), estimated by running systematic experiments. Standard statistical techniques have been used to balance experimental conditions and the sources of variance.</Paragraph>
    <Paragraph position="4"> 3. Using WordNet for Context Generalization One of the stated objectives of this paper is to investigate the effect of context generalization (the addend ESLB in the formula of the evidence) on our sense tagging task.</Paragraph>
    <Paragraph position="5"> The use of on-line thesauri for context generalization has already been investigated with limited success (Hearst and Schuetze 1993; Brill and Resnik 1994; Resnik 1997; Agirre and Rigau 1996). Though the idea of using thesauri for context expansion is quite common, there are no clear indications that this is actually useful in terms of performance. However, studying the effect of context expansion for a PN tagging task in particular is relevant because: * PNs may be hypothesized to have a unique sense in a text, and even in a domain corpus. Therefore, we can reliably consider as potential sense indicators all the contexts in which a PN appears. The only source of ambiguity is then the word w_i co-occurring in a syntactic context with a PN, esl_i(w_i, U_PN), but since in ESLB we group several contexts, spurious hyperonyms of w_i will hopefully gain lower evidence. For example, consider the context &quot;division of American_Brands_Inc&quot;. Division is a highly ambiguous word, but, when generalizing it, the majority of its senses appearing in the same type of syntactic relation with a Proper Noun (e.g., branch of Drexel_Burnham_Lambert_Group_Inc, part of Nationale_Nederlanden_Group) are indeed pertinent senses.</Paragraph>
    <Paragraph position="6"> 4 For example, a Subject_Verb phrase with the verb make (e.g., Ace made a contract) is found with almost equal probability with Person and Organization names. We used a simple conditional probability model for D(x, C(PN_j)), but we believe that more refined measures could improve performance. * PN categories (e.g., Person, Location, Product) exhibit a more stable and less ambiguous contextual behavior than other more vague categories, such as psychological_feature. 5 * We can study the degree of generalization at which an optimum performance is achieved.</Paragraph>
  </Section>
  <Section position="4" start_page="126" end_page="126" type="metho">
    <SectionTitle>
4. Experimental Discussion
</SectionTitle>
    <Paragraph position="0"> The purpose of the experimental evaluation is twofold: (1) to test the improvement in robustness of a state-of-the-art NE recognizer; and (2) to study the effectiveness of syntactic contexts and of a &quot;cautious&quot; context generalization on the performance of the U_PN tagger, analyzed in isolation. The effect of generalization is studied by gradually relaxing the notion of similarity in the formula of evidence and by tuning, through the factors α and β, the contribution of generalized contexts to the formula of evidence.</Paragraph>
    <Paragraph position="1"> In our experiment, we used the Italian Sole24Ore half-million-word corpus on financial news, the one-million-word Wall Street Journal corpus, and WordNet, as standard on-line available resources, as well as a series of computational tools made available for our research:  NE recognizers, through the use of our tagger. In Figure 1, three testing experiments are shown. The table measures the local performance of the NE tagging task achieved by the early NE recognizer, by our untrained tagger, and finally, the joint performance of the two methods.</Paragraph>
    <Paragraph position="2"> In the first test, we used the Italian Sole24Ore corpus. Due to the unavailability of WordNet in Italian, we used a dictionary of strict synonyms for context expansion. In this test, we &quot;loosely&quot; adapted the English VIE system (as used in MUC-6) to Italian.</Paragraph>
  </Section>
  <Section position="5" start_page="126" end_page="129" type="metho">
    <SectionTitle>
5 In Velardi and Cucchiarelli (2000) we formally studied the relation between category type and
</SectionTitle>
    <Paragraph position="0"> learnability of contextual cues for WSD. 6 We also used the GATE partial parser. We were not as successful with this parser because it is not designed for high-performance VP-PP and NP-PP detection, but prepositional contexts are often the most informative indicators. 7 This method produces a 20-30% reduction of the initial WordNet ambiguity, depending on the specific corpus.</Paragraph>
    <Paragraph position="1">
      A: PNs correctly tagged by the early NE recognizer
      B: Total PNs in the Test Corpus
      C: Local Recall of the early NE recognizer (A/B)
      D: Total PNs detected by the early NE recognizer (D = A + A1 (errors) + G (unknown))
      E: Local Precision of the early NE recognizer (A/D)
      F: UPNs correctly tagged by the UPN tagger in the Test Corpus
      G: Total UPNs not detected by the early NE recognizer
      H: Local recall of UPN tagger (Phase2) (F/G)
      I: Total UPNs for which a decision was possible by the UPN tagger
      J: Local precision of the UPN tagger
      K: Joint Recall of the two methods (A + F)/B
      L: Joint Precision of the two methods (A + F)/D
      Figure 1 Outline of results on the Sole24Ore corpus.</Paragraph>
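    <Paragraph> The ratio columns in the legend of Figure 1 follow directly from the raw counts, as the short Python sketch below shows. The function name and the input counts are invented for illustration and are not results from the paper. Column J is left out because the legend gives no formula for it.

```python
def figure1_metrics(a, a1, b, f, g):
    """Derive the ratio columns of Figure 1 from raw counts.

    a:  PNs correctly tagged by the early NE recognizer
    a1: PNs wrongly tagged by the early NE recognizer (errors)
    b:  total PNs in the test corpus
    f:  UPNs correctly tagged by the UPN tagger
    g:  UPNs not detected by the early NE recognizer
    """
    d = a + a1 + g  # column D as defined in the legend
    return {
        "C": a / b,        # local recall of the early NE recognizer
        "D": d,
        "E": a / d,        # local precision of the early NE recognizer
        "H": f / g,        # local recall of the UPN tagger
        "K": (a + f) / b,  # joint recall of the two methods
        "L": (a + f) / d,  # joint precision of the two methods
    }


# Invented counts, for illustration only (not taken from the paper).
metrics = figure1_metrics(a=60, a1=10, b=100, f=20, g=30)
```

 Because F is added to the numerator of both K and L while the denominators stay fixed, any UPN that the second-phase tagger classifies correctly raises joint recall and joint precision together.</Paragraph>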
    <Paragraph position="2"> We used the English gazetteer as it was and applied simple &quot;language porting&quot; to the NE grammar (e.g., replacing English words and prepositions with corresponding Italian words, and little more). This explains the low performance of the rule-based classifier. Note that our context-based tagger produces a considerable improvement in performance (around 18%); as a result, the global performance (columns K and L) turns out to be comparable with state-of-the-art systems, without a significant readaptation effort.</Paragraph>
    <Paragraph position="3"> In the second test, we again used VIE, this time on the English Wall Street Journal corpus. We used a version of VIE that was designed to detect NEs in a management succession domain (we are testing the effect of a domain shift here). Local performance was somewhat lower than in MUC-6. Again, we measured a 9% improvement using our tagger, and very high global performance.</Paragraph>
    <Paragraph position="4"> The third test was the most demanding. Here, we used only half of the named entity gazetteer used in previous experiments. The purpose of this test was also to verify the effect on performance of a poorly populated gazetteer. In this test, rather than using LASIE, we used a machine learning method described in Paliouras, Karkaletsis, and Spyropoulos (1998). This method uses the available half of the gazetteer as a training set to learn a context-based decision list for NE classification.</Paragraph>
    <Paragraph position="5"> As shown in Test 3, column B, the initial number of PNs in the test corpus is now considerably higher. The decision-list classifier is tuned to classify with high precision and lower recall. Therefore, only the &quot;hardest&quot; cases are submitted to our untrained classifier. In fact, local performance of our classifier is around 10% lower than for previous tests, but nevertheless, global performance (in terms of joint precision and recall) shows an improvement. Finally, we observe that the performance figures reported in Figure 1 say nothing about the various sources of errors. Errors and misses occur both during the off-line learning phase (as we said, NE instances and syntactic contexts are not inspected for correctness, therefore the contextual knowledge base is error prone) and prior to the U_PN tagging phase: a compound PN may be incompletely recognized during POS tagging, causing the generation of an uninformative syntactic context (e.g., &quot;Owens Illinois&quot; at the beginning of a sentence is recognized as &quot;owens Illinois&quot;, causing a spurious N_N(owen, Illinois) context to be generated). Because none of these &quot;external&quot; sources of noise is filtered out, we may reliably conclude that our tagger is effective at improving the robustness of proper noun classification, though clearly the amount of improvement depends upon the baseline performance of the early method used for PN classification.</Paragraph>
    <Paragraph position="6"> Although the classification evidence provided by syntactic contexts is somewhat noise prone, it proves to be useful as a &quot;backup,&quot; when other &quot;simpler&quot; contextual evidence does not allow a reliable decision.</Paragraph>
    <Section position="1" start_page="128" end_page="129" type="sub_section">
      <SectionTitle>
4.2 Effectiveness of Syntactic and Semantic Cues for Semantic Classification
</SectionTitle>
      <Paragraph position="0"> In a second experiment, we used the experimental setup of Test 2 (WSJ+VIE described above) to evaluate the effectiveness of context expansion on system performance. We applied a pruning method on WordNet (Cucchiarelli and Velardi 1998) to reduce the initial ambiguity of contexts. This pruning method allowed an average 27% reduction in the initial ambiguity of the 13,428 common nouns in the Wall Street Journal corpus. The objective of this experiment was to allow a more detailed evaluation of our method, with respect to several parameters.</Paragraph>
      <Paragraph position="1"> We built four test sets with the same distribution of PN categories and frequency distribution as in the application corpus. We selected four frequency ranges (1, 2, 3-9, &gt; 10) and in each range we selected 100 PNs, reflecting the frequency distribution in the corpus of the three main PN semantic categories--Person, Organization, and Location. We then built another test set, called TSAll, with 400 PNs again reflecting the frequency and category distribution of the corpus. The 400 PNs were then removed from the set of 37,018 esls extracted by our parser and from the gazetteer (whenever included).</Paragraph>
      <Paragraph position="2"> In this experiment, we wanted to measure the performance of the U_PN tagger over the 400 words in the test set, in terms of F-measure, according to several varying factors: * the category type; * the amount of initial contextual evidence (i.e., the frequency range, reflected by the different test sets); * the factors α and β, i.e., the influence of local and generalized contexts; * the level of generalization L.</Paragraph>
      <Paragraph position="3"> Figure 2 summarizes the results of the experiment. Figure 2(a) shows the increase in performance as a function of the values of α and β and the generalization level. N means no generalization: only the evidence provided by ESLA is computed; 0 means that ESLB collects the evidence provided by contexts in which w is a strict synonym of x according to WordNet; 1, 2, and 3 refer to incremental levels of generalization in the (pruned) WordNet hierarchy. The figure shows that context generalization produces up to 7% improvement in performance. Best results are obtained with L = 2 and α = 0.7, β = 0.3. Further generalization may cause a drop in performance. High ambiguity is the cause of this behavior, despite WordNet pruning (without WordNet pruning, we observed a performance inversion at level 1; this experiment is not reported due to limitations of space). Figure 2(b) illustrates the influence of initial contextual evidence. Recognition of singleton PNs remains almost constant as the contribution of generalized and nongeneralized contexts varies. Looking in more detail, we observe that recall increases with β = (1 - α), but precision decreases. Generalization on the basis of a unique context does not allow any filtering of spurious senses, while when grouping several contexts, spurious senses gain lower evidence (as anticipated in Section 3). Finally, we designed an experiment to evaluate the influence of the test set composition on the performance of the U_PN tagger. We performed an analysis of variance (ANOVA test [Hoel 1971]) on the results obtained by processing nine different test sets of 400 PNs each, selected randomly. In all our experiments (the details of which we omit for lack of space), we found that the performance of the U_PN tagging method was independent of the variations of the test set.</Paragraph>
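      <Paragraph> The role of the two weighting factors (α for nongeneralized contexts, β = 1 - α for generalized ones) can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the per-category scores are invented inputs, the plausibility, ambiguity smoothing, and discrimination terms of the full evidence formula are not reproduced, and choosing the highest-scoring category is an assumed decision rule. The function names are hypothetical.

```python
def combine_evidence(evidence_a, evidence_b, alpha):
    """Mix per-category evidence from ESLA and ESLB with beta = 1 - alpha."""
    beta = 1.0 - alpha
    categories = set(evidence_a).union(evidence_b)
    return {
        c: alpha * evidence_a.get(c, 0.0) + beta * evidence_b.get(c, 0.0)
        for c in categories
    }


def select_category(evidence):
    """Assumed decision rule: pick the category with the highest evidence."""
    return max(evidence, key=evidence.get)


# Invented per-category scores, for illustration only.
from_esl_a = {"Organization": 0.5, "Person": 0.5}
from_esl_b = {"Organization": 0.8, "Location": 0.2}

mixed = combine_evidence(from_esl_a, from_esl_b, alpha=0.7)
chosen = select_category(mixed)
```

 Here the nongeneralized contexts alone leave Organization and Person tied, and the generalized contexts break the tie. This is the kind of case in which the ESLB addend can raise recall, at the price of admitting noisier evidence.</Paragraph>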
    </Section>
  </Section>
</Paper>