<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2004"> <Title>Improving Name Discrimination: A Language Salad Approach</Title> <Section position="4" start_page="26" end_page="26" type="metho"> <SectionTitle> 3 The Language Salad </SectionTitle> <Paragraph position="0"> In this paper, we explore the creation of a second order representation for a set of evaluation contexts using three different sets of feature selection data. The co-occurrence matrix may be derived from the evaluation contexts themselves, from a separate set of contexts in a different language, or from the combination of these two (the Salad or Mix).</Paragraph> <Paragraph position="1"> For example, suppose we have 100 Romanian evaluation contexts that include an ambiguous name, and that same name also occurs 10,000 times in an English language corpus (we assume that the names either have the same spelling in both languages, or that translations are readily available). Our goal is to cluster the 100 Romanian contexts, which contain all the information that we have about the name in Romanian. While we could derive a second order representation from these contexts alone, the resulting co-occurrence matrix would likely be very small and sparse, and insufficient for making good discrimination decisions. We could instead rely on first order features, that is, look for frequent words or bigrams that occur in the evaluation contexts, try to find evaluation contexts that share some of the same words or phrases, and cluster them based on this type of information. However, again, the small number of contexts available would likely result in very sparse representations for the contexts, and unreliable clustering results.</Paragraph> <Paragraph position="2"> Thus, our method is to derive a co-occurrence matrix from a language for which we have many occurrences of the ambiguous name, and then use that co-occurrence matrix to represent the evaluation contexts. This relies on the assumption that the evaluation contexts will contain at least a few names or words that are also used in the larger corpus (in this case English). In general, we have found that while this is not always true, it is often the case.</Paragraph> <Paragraph position="3"> We have also experimented with combining the English contexts with the evaluation contexts, and building a co-occurrence matrix based on this combined or mixed collection of contexts. This is the language salad that we refer to: a mixture of contexts in two different languages that is used to derive a representation of the evaluation contexts.</Paragraph> </Section>
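To make the second order representation concrete, the following is a minimal sketch of the idea rather than the authors' implementation: the whitespace tokenization, the fixed co-occurrence window, the frequency cutoff, and the function names (cooccurrence_matrix, second_order_vector) are illustrative assumptions.

```python
from collections import defaultdict

def cooccurrence_matrix(feature_contexts, min_freq=2, window=5):
    """Build a word-by-word co-occurrence matrix from the feature selection
    contexts (the evaluation contexts, the English contexts, or the Mix)."""
    counts = defaultdict(lambda: defaultdict(int))
    freq = defaultdict(int)
    for context in feature_contexts:
        tokens = context.lower().split()
        for tok in tokens:
            freq[tok] += 1
        for i, w1 in enumerate(tokens):
            for w2 in tokens[i + 1 : i + 1 + window]:
                counts[w1][w2] += 1
                counts[w2][w1] += 1
    vocab = {w for w, f in freq.items() if f >= min_freq}  # frequency cutoff
    return {w: {v: c for v, c in row.items() if v in vocab}
            for w, row in counts.items() if w in vocab}

def second_order_vector(evaluation_context, matrix):
    """Represent an evaluation context by averaging the co-occurrence vectors
    of its words that appear in the matrix; a context sharing no words with
    the feature selection data gets an empty vector."""
    vec, n = defaultdict(float), 0
    for tok in evaluation_context.lower().split():
        row = matrix.get(tok)
        if row:
            n += 1
            for w, c in row.items():
                vec[w] += c
    return {w: v / n for w, v in vec.items()} if n else {}
```

The resulting context vectors would then be clustered (for example with k-means or an agglomerative method); the Salad or Mix variant simply passes the concatenation of the evaluation and English contexts to cooccurrence_matrix.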
<Section position="5" start_page="26" end_page="28" type="metho"> <SectionTitle> 4 Experimental Data </SectionTitle> <Paragraph position="0"> We use data in four languages in these experiments: Bulgarian, English, Romanian, and Spanish.</Paragraph> <Section position="1" start_page="26" end_page="27" type="sub_section"> <SectionTitle> 4.1 Raw Corpora </SectionTitle> <Paragraph position="0"> The Romanian data comes from the 2004 archives of the newspaper Adevarul (The Truth). This is a daily newspaper that is among the most popular in Romania. While Romanian normally has diacritical markings, this particular newspaper does not include them in its online edition, so the alphabet used was the same as in English.</Paragraph> <Paragraph position="1"> The Bulgarian data is from the Sega 2002 news corpus, which was originally prepared for the CLEF competition. This is a corpus of news articles from the newspaper Sega, which is based in Sofia, Bulgaria. The Bulgarian text was transliterated (phonetically) from Cyrillic to the Roman alphabet. Thus, the alphabet used was the same as in English, although the phonetic transliteration leads to fewer cognates and borrowed English words that are spelled exactly as they are in English text.</Paragraph> <Paragraph position="2"> The Spanish corpus comes from the Spanish news agency EFE, from the years 1994 and 1995.</Paragraph> <Paragraph position="3"> This collection was used in the Question Answering Track at CLEF-2003, and also for CLEF-2005. This text is represented in Latin-1, and includes the usual accents that appear in Spanish.</Paragraph> <Paragraph position="4"> The English data comes from the GigaWord corpus (2nd edition) that is distributed by the Linguistic Data Consortium. It consists of more than 2 billion words of newspaper text from five different news sources between the years 1994 and 2004. In fact, we subdivide the English data into three different corpora, one from 2004, another from 2002, and a third from 1994-95, so that for each of the evaluation languages (Bulgarian, Spanish, and Romanian) we have an English corpus from the same time period.</Paragraph> </Section> <Section position="2" start_page="27" end_page="27" type="sub_section"> <SectionTitle> 4.2 Evaluation Contexts </SectionTitle> <Paragraph position="0"> Our experimental data consists of evaluation contexts derived from the Bulgarian, Romanian, and Spanish corpora mentioned above. We also have English corpora that include the same ambiguous names as found in the evaluation contexts.</Paragraph> <Paragraph position="1"> In order to quickly generate a large volume of experimental data, we created evaluation contexts from the corpora for each of our four languages by conflating together pairs of well known names of people or places that are generally not highly ambiguous (although some might be rather general).</Paragraph> <Paragraph position="2"> For example, one of the pairs of names we conflate is George Bush and Tony Blair. To do that, every occurrence of either of these names is converted to an ambiguous form (GB TB, for example), and the discrimination task is to cluster these contexts such that their original and correct name is re-discovered, as sketched below. We retain a record of the original name for each occurrence so as to evaluate the results of our method; of course we do not use this information anywhere in the process outside of evaluation.</Paragraph> <Paragraph position="3"> The following pairs of names were conflated: George Bush-Tony Blair, Mexico-India, USA-Paris, Ronaldo-David Beckham (2002 and 2004), Diego Maradona-Roberto Baggio (1994-95 only), and NATO-USA. Note that some of these names have different spellings in some of our languages, so we look for and conflate the native spelling of the names in the different language corpora. These pairs were selected because they occur in all four of our languages, and because they represent name distinctions that are commonly of interest, that is, ambiguity in the names of people and places. With these pairs we are also following Nakov and Hearst (2003), who suggest that if one is introducing ambiguity by creating pseudo-words or conflating names, then the conflated words should be related in some way, in order to avoid creating very sharp or obvious sense distinctions.</Paragraph> </Section>
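As an illustration of the conflation step described above, the following is a small sketch under stated assumptions: the pseudo-name token GB_TB, the case-insensitive matching, and the conflate function are illustrative choices, not the exact preprocessing used in the paper.

```python
import re

# Illustrative conflation for one name pair; the gold label is kept only
# so the clustering output can be evaluated afterwards.
PAIR = ("George Bush", "Tony Blair")
PSEUDO = "GB_TB"

def conflate(context):
    """Replace each target name with the ambiguous pseudo-name and record
    which name(s) originally appeared (used only during evaluation)."""
    originals = []
    for name in PAIR:
        pattern = re.compile(re.escape(name), re.IGNORECASE)
        if pattern.search(context):
            originals.append(name)
            context = pattern.sub(PSEUDO, context)
    return context, originals

print(conflate("Tony Blair met with reporters in London."))
# ('GB_TB met with reporters in London.', ['Tony Blair'])
```

Contexts that happen to mention both names of a pair would need an explicit policy (for example, discarding them); this sketch does not decide that.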
<Section position="3" start_page="27" end_page="28" type="sub_section"> <SectionTitle> 4.3 Discussion </SectionTitle> <Paragraph position="0"> For each of the three evaluation languages (Bulgarian, Romanian, and Spanish) we have contexts for five different name conflate pairs that we wish to discriminate. We have corresponding English contexts for each evaluation language, where the dates of both are approximately the same. This temporal consistency between the evaluation language and English is important because the contexts in which a name is used may change over time. In 1994, for example, Tony Blair was not yet Prime Minister of the United Kingdom (he became PM in 1997), and references to George Bush most likely refer to the US President who served from 1989 until 1993, rather than the current US President (who began his term in office in 2001). In 1994 the current (as of 2006) US President had just been elected governor of Texas, and was not yet a national figure. This shows that George Bush is itself an ambiguous name, but our observation has been that in the 2002 and 2004 data (Romanian and Bulgarian) nearly all occurrences are associated with the current president, and that most of the occurrences in 1994-95 (Spanish) refer to the former US President. This illustrates an important point: it is necessary to consider the perspective represented by the different corpora.</Paragraph> <Paragraph position="1"> There is little reason to expect that news articles from Spain in 1994 and 1995 would focus much attention on the newly elected governor of Texas in the United States.</Paragraph> <Paragraph position="2"> Tables 1, 2, and 3 show the number of contexts that have been collected for each name conflate pair. For example, in Table 1 we see that there are 746 Bulgarian contexts that refer to either Mexico or India, and that of these 51.47% truly refer to Mexico and 48.53% to India. There are 149,432 English contexts that mention Mexico or India, and the Mix value shown is simply the sum of the number of Bulgarian and English contexts.</Paragraph> <Paragraph position="3"> In general these tables show that the English contexts are much larger in number; however, there are a few exceptions with the Spanish data.</Paragraph> <Paragraph position="4"> This is because the EFE corpus is relatively large as compared to the Bulgarian and Romanian corpora, and provides frequency counts that are in some cases comparable to those in the English corpus.</Paragraph> </Section> </Section> <Section position="6" start_page="28" end_page="28" type="metho"> <SectionTitle> 5 Experimental Methodology </SectionTitle> <Paragraph position="0"> For each of the three evaluation languages (Bulgarian, Romanian, and Spanish) there are five name conflate pairs.
The same name conflate pairs are used for all three languages, except for Diego Maradona-Roberto Baggio, which is only used with Spanish, and Ronaldo-David Beckham, which is only used with Bulgarian and Romanian.</Paragraph> <Paragraph position="1"> This is because in 1994-95 (the era of the Spanish data) neither Ronaldo nor David Beckham was as famous as they later became, so they were mentioned somewhat less often than in the 2002 and 2004 corpora. The other four name conflate pairs are used in all of the languages.</Paragraph> <Paragraph position="2"> For each name conflate pair we create a second order representation using three different sources of feature selection data: the evaluation contexts themselves, the corresponding English contexts, and the mix of the evaluation contexts and the English contexts (the Mix). The objective of these experiments is to determine which of these sources of feature selection data results in the highest F-Measure, which is the harmonic mean of the precision and recall of an experiment.</Paragraph> <Paragraph position="3"> The precision of each experiment is the number of evaluation contexts clustered correctly, divided by the number of contexts that are clustered. The clustering algorithm may choose not to assign every context to a cluster, which is why that denominator may not be the same as the number of evaluation contexts. The recall of each experiment is the number of correctly clustered evaluation contexts divided by the total number of evaluation contexts. Note that for each of the three variations of a name conflate pair experiment, exactly the same evaluation language contexts are being discriminated; all that changes is the source of the feature selection data. Thus the F-Measures for a name conflate pair in a particular language can be compared directly. Note, however, that F-Measures across languages are harder to compare directly, since different evaluation contexts are used, and different English contexts are used as well.</Paragraph> <Paragraph position="4"> There is a simple baseline that can be used as a point of comparison: place all of the contexts for each name conflate pair into one cluster and say that there is no ambiguity. If that is done, then the resulting F-Measure will be equal to the majority percentage of the true underlying entity, as shown in Tables 1, 2, and 3. For example, for Bulgarian, if the 746 Bulgarian contexts for Mexico and India are all put into the same cluster, the resulting F-Measure would be 51.47%, because we would simply assign all the contexts in the cluster to the more common of the two entities, which is Mexico in this case.</Paragraph> </Section>
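A minimal sketch of how these measures and the majority baseline can be computed, assuming each evaluation context has an identifier mapped to a predicted and a gold entity; the function names and data layout are illustrative, not part of the paper.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def score_clustering(assigned, gold):
    """assigned: context id -> predicted entity (contexts the clustering
    algorithm leaves unassigned are simply absent from this dict);
    gold: context id -> true entity, for every evaluation context."""
    correct = sum(1 for cid, label in assigned.items() if gold[cid] == label)
    precision = correct / len(assigned)  # over clustered contexts only
    recall = correct / len(gold)         # over all evaluation contexts
    return precision, recall, f_measure(precision, recall)

# Majority baseline for the Bulgarian Mexico-India pair: all 746 contexts
# go into one cluster labeled with the majority entity (Mexico, 51.47%).
n_contexts, majority_share = 746, 0.5147
correct = round(n_contexts * majority_share)  # 384 contexts labeled correctly
p = r = correct / n_contexts
print(round(100 * f_measure(p, r), 2))        # ~51.47
```

Running the baseline check reproduces the 51.47% figure, since precision, recall, and their harmonic mean all equal the majority percentage when every context is placed in a single cluster.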
<Section position="7" start_page="28" end_page="30" type="metho"> <SectionTitle> 6 Experimental Results </SectionTitle> <Paragraph position="0"> Tables 1, 2, and 3 show the results for our experiments, language by language. Each table shows the results for the 15 experiments done for each language: five name conflate pairs, each with three different sources of feature selection data.</Paragraph> <Paragraph position="1"> The row labeled with the name of the evaluation language reports the F-Measure for the evaluation contexts (whose number of occurrences is shown in the far right column) when the feature selection data is the evaluation contexts themselves. The rows labeled English and Mix report the F-Measures obtained for the evaluation contexts when the feature selection data is the English contexts, or the Mix of the English and evaluation contexts.</Paragraph> <Section position="1" start_page="28" end_page="29" type="sub_section"> <SectionTitle> 6.1 Bulgarian Results </SectionTitle> <Paragraph position="0"> The Bulgarian results are shown in Table 1. Note that the number of contexts for English is considerably larger than for Bulgarian for all five name conflate pairs. The Bulgarian and English data came from 2002 news reports.</Paragraph> <Paragraph position="1"> The Mix of feature selection data results in the best performance for three of the five name conflate pairs: George Bush - Tony Blair, Ronaldo - David Beckham, and NATO - USA. For the remaining two name conflate pairs (Mexico-India, USA-Paris), just using the Bulgarian evaluation contexts results in the highest F-Measure.</Paragraph> <Paragraph position="2"> We believe that this may be partially due to the fact that the two cases where Bulgarian leads to the best results involve very general or generic underlying entities: Mexico and India, and the USA and Paris. In both cases, contexts that mention these entities could be discussing a wide range of topics, and the larger volume of English data may simply overwhelm the process with a huge number of second order features. In addition, it may be that the English and Bulgarian corpora contain different content that reflects the different interests of their original readerships. For example, news that is reported about India might be rather different in the United States (the source of most of the English data) than in Bulgaria. Thus, the use of the English corpora might not have been as helpful in those cases where the names to be discriminated are not global figures. For example, Tony Blair and George Bush are probably in the news in the USA and Bulgaria for many of the same reasons, so the underlying content is more comparable than that of the more general entities (like Mexico and India) that might have rather different content associated with them.</Paragraph> <Paragraph position="3"> We observed that Bulgarian tends to have fewer cognates or shared names with English than do Romanian and Spanish. This is due to the fact that the Bulgarian text is transliterated. This may account for the fact that the English-only results for Bulgarian are very poor, and it is only in combination with the Bulgarian contexts that the English contexts show any positive effect. This suggests that there are only a few words in the Bulgarian contexts that also occur in English, but those that do have a positive impact on clustering performance.</Paragraph> </Section> <Section position="2" start_page="29" end_page="30" type="sub_section"> <SectionTitle> 6.2 Romanian Results </SectionTitle> <Paragraph position="0"> The Romanian results are shown in Table 2. The Mix of feature selection data results in improvements for two of the five pairs (David Beckham - Ronaldo, and NATO - USA). Using only the English contexts provides the best results for two other pairs (Tony Blair - George Bush, and USA - Paris), although in the latter case the difference among the F-Measures that result from the three sources of data is minimal.
There is one case (Mexico-India) where using the Romanian contexts as feature selection data results in a slightly better F-Measure than when using English contexts.</Paragraph> <Paragraph position="1"> The improvement that the Mix shows for David Beckham-Ronaldo is significant, and is perhaps due to the fact that in both English and Romanian text the content about Beckham and Ronaldo is similar, making it more likely that the mix of English and Romanian contexts will be helpful. However, it is also true that the Mix results in a significant improvement for NATO-USA, and it seems likely that the local perspectives in Romania and the USA would be somewhat different on these two entities. However, NATO-USA has a relatively large number of contexts in Romanian as well as English, so perhaps the difference in perspective had less of an impact in those cases where the number of Romanian contexts was relatively large.</Paragraph> </Section> <Section position="3" start_page="30" end_page="30" type="sub_section"> <SectionTitle> 6.3 Spanish Results </SectionTitle> <Paragraph position="0"> The Spanish results are shown in Table 3. The Spanish and English contexts come from 1994-1995, which puts them in a slightly different historical era than the Bulgarian and Romanian corpora. Due to this temporal difference, we used Diego Maradona and Roberto Baggio as a conflated pair, rather than David Beckham and Ronaldo, who were much younger and somewhat less famous at that time. Also, Ronaldo is a highly ambiguous name in Spanish, as it is a very common first name. This is true in English text as well, although casual inspection of the English text from 2002 and 2004 (where the Ronaldo-Beckham pair was included experimentally) reveals that Ronaldo the soccer player tends to occur more often than any other single entity named Ronaldo, so while there is a bit more noise for Ronaldo, there is not really a significant ambiguity.</Paragraph> <Paragraph position="1"> For the Spanish results we note only one pair (George Bush - Tony Blair) where the Mix of English and Spanish results in the best performance. This again suggests that the perspectives of the Spanish and English corpora were similar with respect to these entities, and that their combination was helpful. In two other cases (Maradona-Baggio, India-Mexico) English-only contexts achieve the highest F-Measure, and in the two remaining cases (USA-Paris, NATO-USA) the Spanish contexts are the best source of features.</Paragraph> <Paragraph position="2"> Note that for Spanish we have reasonably large numbers of contexts (as compared to Bulgarian and Romanian). Given that, it is especially interesting that English-only contexts are the most effective in two of five cases. This suggests that this approach may have merit even when the evaluation language does not suffer from problems of extreme scarcity.
It may simply be that the English corpora provide more discriminating information than the Spanish does, and that their content differs somewhat from the Spanish; otherwise we would expect the Mix of English and Spanish contexts to be the most accurate in more than just one of the five cases.</Paragraph> </Section> </Section> <Section position="8" start_page="30" end_page="31" type="metho"> <SectionTitle> 7 Discussion </SectionTitle> <Paragraph position="0"> Of the 15 name conflate experiments (five pairs, three languages), in only five cases did the use of the evaluation contexts as a source of feature selection data result in better F-Measure scores than did either using the English contexts alone or as a Mix with the evaluation language contexts. Thus, we conclude that there is a clear benefit to using feature selection data that comes from a different language than the one for which discrimination is being performed.</Paragraph> <Paragraph position="1"> We believe that this is due to the volume of the English data, as well as to the nature of the name discrimination task. For example, a person is often best described or identified by observing the people he or she tends to associate with, the places he or she visits, or the companies with which he or she does business. If we observe that George Miller and Mel Gibson occur together, then it seems we can safely infer that George Miller the movie director is being referred to, rather than George Miller the psychologist and father of WordNet.</Paragraph> <Paragraph position="2"> This argument might suggest that first order co-occurrences would be sufficient to discriminate among the names. That is, simply group the evaluation contexts based on the features that occur within them, and essentially cluster evaluation contexts based on the number of features they have in common with other evaluation contexts. In fact, results on word sense discrimination (Purandare and Pedersen, 2004) suggest that first order representations are more effective with larger numbers of contexts than second order methods. However, we see examples in these results that suggest this may not always be the case. In the Bulgarian results, the largest number of Bulgarian contexts are for NATO-USA, but the Mix performs quite a bit better than Bulgarian only. In the case of Romanian, again NATO-USA has the largest number of contexts, but the Mix still does better than Romanian only. And in Spanish, Mexico-India has the largest number of contexts and English-only does better. Thus, even in cases where we have an abundant number of evaluation contexts, the indirect nature of the second order representation provides some added benefit.</Paragraph> <Paragraph position="3"> We believe that the perspective of the news organizations providing the corpora certainly has an impact on the results. For example, in Romanian, the news about David Beckham and Ronaldo is probably much the same as in the United States.</Paragraph> <Paragraph position="4"> These are international figures that are both external to the countries where the news originates, and there is no reason to suppose there would be a unique local perspective represented by any of the news sources. The only difference among them might be in the number of contexts available. In this situation, the addition of the English contexts may provide enough additional information to improve discrimination performance in another language.
For example, in the 162 Romanian contexts for Ronaldo-Beckham, there is one occurrence of Posh, which was the stage name of Beckham's wife Victoria. This is below our frequency cut-off threshold for feature selection, so it would be discarded when using Romanian-only contexts.</Paragraph> <Paragraph position="5"> However, in the English contexts Posh is mentioned 6 times, and is included as a feature. Thus, the one occurrence of Posh in the Romanian corpus can be well represented by information found in the English contexts, allowing that Romanian context to be correctly discriminated.</Paragraph> </Section> </Paper>