File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/01/n01-1007_metho.xml
Size: 19,458 bytes
Last Modified: 2025-10-06 14:07:32
<?xml version="1.0" standalone="yes"?> <Paper uid="N01-1007"> <Title>Unsupervised Learning of Name Structure From Coreference Data</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Problem De nition and Data Preparation </SectionTitle> <Paragraph position="0"> We assume that people's names have six (optional) components as exempli ed in the following somewhat contrived example: Our models make the following assumptions about personal names: all words of label l (the label number) must occur before all words of label l +1 with the exception of descriptors, a maximum of one word may appear for each label every name must include either a rst name or a last name in a loose sense, honori csandclosesare \closed classes&quot;, even if we do not know which words are in the classes. We implement this by requiring that words given these labels must appear in our dictionary only as proper names, and that they must have appeared at least three times in the training set used for the lexicon (sections 2-21 of the Penn Wall Street Journal treebank) null Section 5 discusses these constraints and assumptions. null The input to the name model is a noisy list of personal names. This list is approximately 85% correct; that is, about 15% of the word sequences are not personal names, but rather non-names, or the names of other types of entities. We obtained these names by running a program inspired by that of Collins and Singer [4] for unsupervised learning of named entity recognition. This program takes as input possible names plus contextual information about their occurrences. It then categorizes each name as one of person, place,ororganization. A possible name is considered to be a sequence of one or more proper nouns immediately dominated by a noun-phrase where the last of the proper nouns is the head (rightmost noun) of the noun phrase. We used as input to this program the parsed text found in the BLLIP WSJ 1987-89 WSJ Corpus |Release 1 [6]. Because of a minor error, the parser used in producing this corpus had a unwarranted propensity to label uncapitalized words as proper nouns. To correct for this we only allowed capitalized words to be considered proper nouns. In section 5 we note an unintended consequence of this decision.</Paragraph> <Paragraph position="1"> The coreference model for our tasks is also given a list of all personal names (as dened above) in each Wall Street Journal article. Although the BLLIP corpus has machine-generated coreference markers, these are ignored. null The output of both programs is an assignment from each name to a sequence of labels, one for each word in the name. Performance is measured by the percent of words labeled correctly and percent of names for which all of the labels are correct.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The Probability Models </SectionTitle> <Paragraph position="0"> We now consider the probability models that underlie our learning mechanisms. Both models are generative in that they assign probabilities to all possible labelings of the names. (For the coreference model the model generates all possible labelings given the proposed antecedent.) Let ~ l be a sequence of label assignments to the name ~n (a sequence of words). For the name model we estimate arg max</Paragraph> <Paragraph position="2"> We estimate this latter probability by assuming that the number of words assigned label l, n(l), is independent of which other labels have appeared. 
<Paragraph position="3"> Our assumptions imply that, with the exception of descriptor, all labels may occur zero or one times. We arbitrarily assume that there may be zero to fourteen descriptors. We then assume that the words in the name are independent of one another given their label. Thus we get the following equation:

p(~l, ~n) = prod_l [ p(N(l) = n(l)) prod_i p(w(i) | l) ] (2)

Here w(i) is the ith word from ~n assigned the label l in ~l, and N(l) is a random variable whose value is the number of words in the name with label l. To put this slightly differently, we first guess the number of words with each label l according to the distribution p(N(l) = n(l)). Given the ordering constraints, this completely determines which words in ~n get which label. We then guess each of the words according to the distribution p(w(i) | l). The name model does not use information concerning how often each name occurs. (That is, it implicitly assumes that all names occur equally often.) We have also considered somewhat more complex approximations to p(~l, ~n). See section 5.</Paragraph> <Paragraph position="5"> The coreference model is more complicated. Here we estimate

arg max_~l p(~l | ~n, ~c) = arg max_~l p(~l, ~n | ~c) (3)

where ~c is the possible antecedent of ~n.</Paragraph> <Paragraph position="7"> That is, for each name the program identifies zero or one possible antecedent name. It does this using a very crude filter: the last word of the proposed antecedent (unless that word is "Jr.", in which case it looks at the second-to-last word) must also appear in ~n. If no such name exists, then ~c = ∅ and we estimate the distribution according to equation 2. If more than one such name exists, we choose the first one appearing in the article.</Paragraph> <Paragraph position="8"> Even if there is such a name, the program does not assume that the two names are, in fact, coreferent. Rather, a hidden random variable R determines how the two names relate. There are three possibilities:
- ~c is not coreferent (R = ∅), in which case the probability is estimated according to the ~c = ∅ case.
- ~c is not coreferent but is a member of the same family as ~n (e.g., "John Sebastian Bach" and "Carl Philipp Emmanuel Bach"). This case (R = f) is computed as the non-coreference case, but the second occurrence is given "credit" for the last name.
- ~c is coreferent to ~n, in which case we compute the probability as described below. In this case we assume that any words shared by the two names must appear with the same label, and, except for descriptors, labels may not change between them (e.g., if ~c has a first name, then ~n can be given a first name only if it is the same word as that in ~c). This does not allow for nicknames and other such cases, but they are rare in the Wall Street Journal.</Paragraph> <Paragraph position="11"> More formally, we have

p(~l, ~n | ~c) = sum_R p(R) p(~l, ~n | ~c, R) (4)

where p(~l, ~n | ~c, R = ∅) = p(~l, ~n) as in equation 2; p(~l, ~n | ~c, R = f) = p(~l, ~n) / p(s | ~l(s)), where s is the word in common that caused the previous name to be selected as a possible coreferent and ~l(s) is the label assigned to s according to ~l; and if R = ~c, we use equation 8 below.</Paragraph> <Paragraph position="16"> In equation 4 the R = ∅ case is reasonably straightforward: we simply use equation 2 as the non-coreferent distribution. For R = f, as we noted earlier, we want to claim that the new name is a member of the same family as that of the earlier name. Thus, as we said earlier, we get "credit" for the repeated family name. This is why we take the non-coreferent probability and divide by the probability of what we take to be the family name.</Paragraph>
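As a rough illustration of how the three cases of equation 4 combine, here is a small sketch of our own. The function and argument names are hypothetical, the three input probabilities are assumed to be computed elsewhere (equations 2 and 8), and the p(R) values are the fixed ones reported later in Section 4.

# Fixed relation prior: coreferent, same family, non-coreferent (Section 4).
P_R = {"coref": 0.993, "family": 0.002, "none": 0.005}

def antecedent_score(p_noncoref, p_family_word, p_coref_case):
    """Equation (4): p(l, n | c) = sum over R of p(R) * p(l, n | c, R).

    p_noncoref    -- the non-coreferent probability p(l, n) of equation (2)
    p_family_word -- p(s | l(s)), the probability of the shared family-name
                     word given its label
    p_coref_case  -- the coreferent-case probability of equation (8)
    """
    p_family = p_noncoref / p_family_word   # "credit" for the repeated family name
    return (P_R["none"] * p_noncoref
            + P_R["family"] * p_family
            + P_R["coref"] * p_coref_case)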
<Paragraph position="17"> This leaves the coreferent case. The basic idea is that we view the labeling of the new name as derived from a labeling of the old name ~c. The first of the two resulting terms, the probability of a labeling of ~c, is easy to compute from equation 2 using Bayes' law. We now turn our attention to the second term, the probability of the new name and its labeling given the labeled antecedent.</Paragraph> <Paragraph position="18"> To establish a more detailed relationship between the old and new names we compute possible correspondences between the two names, where a correspondence specifies, for each word in the old name, whether it is retained in the new name, and if so, the word in the new name to which it corresponds. Two words may correspond only if they are the same lexical item. (The converse does not hold.) Since in principle there can be multiple correspondences, we introduce the correspondences by summing the probability over all of them:

p(~l, ~n | ~c, R = ~c) = sum_α p(~l, ~n, α | ~c) ≈ max_α p(~l, ~n, α | ~c)

In the second equation we simplify by making the assumption that the sum will be dominated by one of the correspondences, a very good assumption. Furthermore, as is intuitively plausible, one can identify the maximum without actually computing the probabilities: it is the correspondence with the maximum number of words retained from ~c. Henceforth we use α to denote this maximum-probability correspondence.</Paragraph> <Paragraph position="21"> By specifying α we divide the words of the old name into two groups, those (R) that are retained in the new name and those (S) that are subtracted when going to the new name. Similarly we have divided the words of the new name into two classes, those retained and those added (A). We then assume that the probability of a word being subtracted or retained is independent of the word and depends only on its label (e.g., the probability of a subtraction given the label l is p(s | l)). Furthermore, we assume that the labels of words in R do not change between the old and new names. Once we have pinned down R and S, any words left in ~l must be added. However, we do not yet "know" the labels of those, so we need a probability term p(l | a). Lastly, for words that are added in the new name, we need to guess the particular word corresponding to the label type. This gives us the following distribution:

p(~l, ~n | ~c, R = ~c, α) = prod_{w in R} p(r | ~l(w)) prod_{w in S} p(s | ~l(w)) prod_{w in A} p(~l(w) | a) p(w | ~l(w)) (8)

Taken together, equations 2-8 define our probability model.</Paragraph> </Section> <Section position="5" start_page="0" end_page="1" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> From the work on named entity recognition we obtained a list of 145,670 names, of which 87,809 were marked as personal names. A second program creates an ordered list of names that appear in each article in the corpus. The two files, names and article-name occurrences, are the input to our procedures.</Paragraph> <Paragraph position="1"> With one exception, all the probabilities required by the two models are initialized with flat distributions; that is, if a random variable can take n possible values, each value is given probability 1/n.</Paragraph> <Paragraph position="2"> The probabilities so set are:
1. p(N(l) = n(l)) from equation 2 (the probability that label l appears n(l) times),
2. p(w(i) | l) from equation 2 (the probability of generating w(i) given it has label l),
3. p(s | l), p(r | l), and p(l | a) from equation 8 (the probabilities governing whether a word with a given label is subtracted or retained, and the labels of words added, when going from the old name to the new name).</Paragraph> <Paragraph position="3"> We then used the expectation-maximization (EM) algorithm to re-estimate the values.</Paragraph>
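To picture the training loop, here is a minimal sketch of our own of one EM iteration for the name model alone; the coreference parameters would be re-estimated analogously. All identifiers are hypothetical, the exhaustive enumeration of labelings is purely for clarity, and is_valid_labeling stands for the kind of constraint check sketched in Section 3.

from collections import defaultdict
from itertools import product

LABELS = ["honorific", "first", "middle", "last", "close", "descriptor"]

def name_prob(labeling, words, p_count, p_word):
    # Equation (2): p(N(l) = n(l)) for every label, times p(word | label)
    # for every word.  p_count[l] maps a count to its probability and
    # p_word[l] maps a word to its probability.
    prob = 1.0
    for l in LABELS:
        prob *= p_count[l].get(labeling.count(l), 0.0)
    for lab, w in zip(labeling, words):
        prob *= p_word[lab].get(w, 0.0)
    return prob

def normalize(counts):
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total > 0 else dict(counts)

def em_step(names, p_count, p_word, is_valid_labeling):
    """One EM iteration for the name model: the E-step distributes each name's
    probability mass over its valid labelings; the M-step renormalizes the
    resulting fractional counts."""
    word_counts = defaultdict(lambda: defaultdict(float))
    count_counts = defaultdict(lambda: defaultdict(float))
    for words in names:
        # Exhaustive enumeration for clarity; the ordering constraints would
        # let a real implementation enumerate far fewer candidates.
        candidates = [ls for ls in product(LABELS, repeat=len(words))
                      if is_valid_labeling(ls)]
        scores = [name_prob(ls, words, p_count, p_word) for ls in candidates]
        total = sum(scores)
        if total == 0.0:
            continue
        for ls, score in zip(candidates, scores):
            weight = score / total                  # posterior of this labeling
            for lab, w in zip(ls, words):
                word_counts[lab][w] += weight
            for l in LABELS:
                count_counts[l][ls.count(l)] += weight
    new_p_word = {l: normalize(word_counts[l]) for l in LABELS}
    new_p_count = {l: normalize(count_counts[l]) for l in LABELS}
    return new_p_count, new_p_word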
<Paragraph position="4"> We initially decided to run EM for 100 iterations as our benchmark. In practice no change in performance was observed after about 15 iterations. The one exception to the flat-distribution rule is the probability distribution p(R), the probability of an antecedent being coreferent, a family relation, or non-coreferent. This distribution was set at .993, .002, and .005 respectively for the three alternatives, and the values were not re-estimated by EM. (These were the first values we tried and, as they worked satisfactorily, we simply left them alone.)</Paragraph> <Paragraph position="5"> Figure 1 shows some of the resulting probabilities of words given the possible labels. The results shown in Figure 1 are basically correct, with "Director" having a high probability as a descriptor (0.0059), "Ms." having a high probability as an honorific (0.058), etc. Some of the small non-zero probabilities are due to genuine ambiguity (e.g., "Fisher" does occur as a first name as well as a last name), but more are due to small confusions in particular cases (e.g., "Director" as a last name, or "John" as a descriptor).</Paragraph> <Paragraph position="6"> After EM training we evaluated the program on 309 personal names from our names list that we had annotated by hand. These names were obtained by random selection of names labeled as personal names by the named-entity recognizer. If the named-entity recognizer had mistakenly classified something as a personal name, it was not used in our test data.</Paragraph> <Paragraph position="7"> For the name model we straightforwardly used equation 2 to determine the most probable label sequence ~l for each name. Note, however, that the testing data does not itself include any information on whether or not the test name was a first or subsequent occurrence of an individual in the text. To evaluate the coreference model we looked at the possible coreference data to find whether the test-data name was most common as a first occurrence or, if not, which possible antecedent was the most common. If first occurrence prevailed, ~l was determined from equation 2; otherwise it was determined using equation 3 with ~c set to the most common possible coreferent for this name.</Paragraph> <Paragraph position="8"> We compare the most probable labels ~l for a test example with the hand-labeled test data. We report the percentage of words that are given the correct label and the percentage of names that are completely correct. The results of our experiments show that information about possible coreference was a decided help in this task, leading to an error reduction of 59% for the number of labels correct and 63% for names correct.</Paragraph> </Section>
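The two measures just reported (per-word label accuracy and the percentage of names labeled entirely correctly) can be computed as in the following small sketch of our own; it is not the authors' evaluation code.

def evaluate(predicted, gold):
    """predicted and gold are parallel lists of label sequences, one per test name."""
    word_correct = word_total = names_correct = 0
    for pred, ref in zip(predicted, gold):
        word_correct += sum(p == r for p, r in zip(pred, ref))   # per-word accuracy
        word_total += len(ref)
        names_correct += int(list(pred) == list(ref))            # whole name correct
    return word_correct / word_total, names_correct / len(gold)

# e.g. evaluate([["honorific", "last"]], [["honorific", "last"]]) -> (1.0, 1.0)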
<Section position="6" start_page="1" end_page="1" type="metho"> <SectionTitle> 5 Error Analysis </SectionTitle> <Paragraph position="0"> The errors tend to arise from three situations: the name disobeys the name-structure assumptions upon which the program is based, the name is anomalous in some way, or the data are sparse. We consider each of these in turn.</Paragraph> <Paragraph position="1"> Many of the names we encounter do not obey our assumptions. Probably the most common situation is last names that, contrary to our assumption, are composed of more than one word, e.g., "Van Dam". Actually, a detail of our processing has caused this case to be under-represented in our data and testing examples. As noted in Section 2, uncapitalized proper nouns were not allowed. The most common extra last-name word is probably "van", but all of these names were either truncated or ignored because of our processing step.</Paragraph> <Paragraph position="2"> In principle, it should be possible to allow for multiple last names, or alternatively to have a new label for "first of two last names". In practice, it is of course the case that the more parameters we give EM to fiddle with, the more mischief it can get into. However, for a practical program this is probably the most important extension we envision.</Paragraph> <Paragraph position="3"> Names may be anomalous while obeying our restrictions, at least in the letter if not the spirit. Chinese names have something very much like the first-middle-last name structure we assume, but the family name comes first. This is particularly relevant for the coreference model, since it will be the family name that is repeated. There is nothing in our model that prevents this, but it is sufficiently rare that the program gets "confused". In a similar way, we marked both "Dr." and "Sir" as honorifics in our test data. However, the Wall Street Journal treats them very differently from "Mr." in that the former tend to be included even in the first mention of a name, while the latter is not. Thus in some cases our program labeled "Dr." and "Sir" as descriptors.</Paragraph> <Paragraph position="4"> Lastly, there are situations where we imagine that if the program had more data (or if the learning mechanisms were somehow "better") it would get the example right. For example, the name "Mikio Suzuki" appears only once in our corpus, as does the word "Mikio". "Suzuki" appears twice, the first time in "Yotaro Suzuki", who is mentioned earlier in the same article. Unfortunately, because "Mikio" does not appear elsewhere, the program is at a loss to decide which label to give it. However, because "Yotaro" is assumed to be a first name, the program makes "Mikio Suzuki" coreferent with "Yotaro Suzuki" by labeling "Mikio" a descriptor.</Paragraph> <Paragraph position="5"> As noted briefly in section 3, we have considered more complicated probability models to replace equation 2. The most obvious of these is to allow the distribution over numbers of words for each label to be conditioned on the previous label, e.g., a bi-label model. This model generally performed poorly, although the coreference versions often performed as well as the coreference model reported here. Our hypothesis is that we are seeing problems similar to those that have bedeviled applying EM to tasks like part-of-speech tagging [7]. In such cases EM typically tries to raise the probability of the corpus by using the tags to encode common word-word combinations. As the models corresponding to equations 2 and 8 do not include any label-label probabilities, this problem does not appear in these models.</Paragraph> </Section>
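For concreteness, one way to picture the bi-label variant mentioned at the end of Section 5 is as a label-bigram term replacing the independent label-count term of equation 2. The sketch below is ours and is only meant to show where the extra conditioning enters; the exact parameterization used in the experiments may well differ.

def name_prob_bilabel(labeling, words, p_label_given_prev, p_word):
    # Hypothetical label-bigram rendering of "conditioning on the previous
    # label": each label is generated given the preceding word's label (or a
    # start symbol), then the word is generated given its label.
    prob = 1.0
    prev = "<start>"
    for lab, w in zip(labeling, words):
        prob *= p_label_given_prev.get(prev, {}).get(lab, 0.0)
        prob *= p_word.get(lab, {}).get(w, 0.0)
        prev = lab
    return prob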
<Section position="7" start_page="1" end_page="1" type="metho"> <SectionTitle> 6 Other Applications </SectionTitle> <Paragraph position="0"> It is probably clear to most readers that the structure and probabilities learned by these models, particularly the coreference model, could be used for tasks other than assigning structure to names. For starters, we would imagine that a named entity recognition program that used information about name structure could do a better job. The named entity recognition program used to create the input looks at only a few features of the context in which the name appears, the complete name, and the individual words that appear in the name irrespective of the other words. Since the different kinds of names (person, company, and location) differ in structure from one another, a program that simultaneously establishes both structure and type would have an extra source of information, thus enabling it to do a better job.</Paragraph> <Paragraph position="1"> Our name-structure coreference model is also learning a lot of information that would be useful for a program whose primary purpose is to detect coreference. One way to see this is to look at some of the probabilities that the program learned. Consider the probability that we will have an honorific in a first occurrence of a name; it is very low. Contrast this with the probability that we add an honorific in the second occurrence:

p(a | honorific) = 1 (10)

These dramatic probabilities are not, in fact, accurate, as EM tends to exaggerate the effects by moving words that do not obey the trend out of the honorific category. They are, however, indicative of the fact that in the Wall Street Journal names are introduced without honorifics, but subsequent occurrences tend to have them (a fact we were not aware of at the start of this research).</Paragraph> <Paragraph position="2"> Another way to suggest the usefulness of this research for coreference and named-entity recognition is to consider the cases where our program's crude filter suggests a possible antecedent, but the probabilistic model of equation 4 rejects this analysis. The first 15 cases are given in Figure 2. As can be seen, except for "Mr. President" and "President Reagan", all of the examples are either not coreferent or are not people at all.</Paragraph> </Section> </Paper>