File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/03/n03-2024_metho.xml
Size: 8,892 bytes
Last Modified: 2025-10-06 14:08:15
<?xml version="1.0" standalone="yes"?> <Paper uid="N03-2024"> <Title>References to Named Entities: a Corpus Study</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The Corpus </SectionTitle> <Paragraph position="0"> We used a corpus of news stories, containing 651,000 words drawn from six different newswire agencies, to study the syntactic form of the noun phrases in which references to people are realized. We were interested in features such as the type and number of premodifiers, the presence and type of postmodifiers, and the form of name reference for people.</Paragraph> <Paragraph position="1"> We constructed a large, automatically annotated corpus by merging the output of Charniak's statistical parser (Charniak, 2000) with that of the IBM named entity recognition system Nominator (Wacholder et al., 1997). The corpus contains 6240 references. In this section, we describe the features that were annotated.</Paragraph> <Paragraph position="2"> Given our focus on references to people, we distinguish two types of premodifiers, &quot;titles&quot; and &quot;name-external modifiers&quot;. Titles are capitalized noun premodifiers that are conventionally recognized as part of the name, such as &quot;President&quot; in &quot;President George W. Bush&quot;. Name-external premodifiers are modifiers that do not constitute part of the name, such as &quot;Irish flutist&quot; in &quot;Irish flutist James Galway&quot;. The three major categories of postmodification that we distinguish are apposition, prepositional phrase modification and relative clause. 
All other postmodifications, such as remarks in parentheses and verb-initial modifications, are grouped in a category &quot;others&quot;.</Paragraph> <Paragraph position="3"> There are three categories of names corresponding to the general European and American name structure.</Paragraph> <Paragraph position="4"> They include full name (first+(middle initial)+last), last name only, and nickname (first name or nickname).</Paragraph> <Paragraph position="5"> In sum, the target NP features that we examined were:
• Is the target named entity the head of the phrase or not? Is it in a possessive construction or not?
• If it is the head, what kind of pre- and postmodification does it have?
• How was the name itself realized in the NP?
In order to identify the appropriate sequences of syntactic forms in coreferring noun phrases, we analyze the coreference chains for each entity mentioned in the text. A coreference chain consists of all the mentions of an entity within a document. In a manually built corpus, a coreference chain can include pronouns and common nouns that refer to the person. However, these forms could not be automatically identified, so coreference chains in our corpus only include noun phrases that contain at least one word from the name. There were 3548 coreference chains in the corpus.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 A Markov Chain Model </SectionTitle> <Paragraph position="0"> The initial examination of the data showed that syntactic forms in coreference chains can be effectively modeled by Markov chains.</Paragraph> <Paragraph position="1"> Let $X_n$ be random variables taking values in $I$. We say that $(X_n)_{n \geq 0}$ is a Markov chain with initial distribution $\lambda$ and transition matrix $P$ if $X_0$ has distribution $\lambda$ and, for $n \geq 0$, conditional on $X_n = i$, $X_{n+1}$ has distribution $(p_{ij} : j \in I)$ and is independent of $X_0, \ldots, X_{n-1}$.</Paragraph> <Paragraph position="3"> These properties have very visible counterparts in the behavior of coreference chains. 
The first mention of an entity has a special status, and choosing its form appropriately makes the text more readable. Thus, the initial distribution of a Markov chain corresponds to the probability of choosing a specific syntactic realization for the first mention of a person in the text. For each subsequent mention, the model assumes that its form is determined only by the form of the immediately preceding mention. Moreover, the Markov chain model is more informative than other possible approaches to modelling the same phenomenon. [Figure caption: the first row gives the initial distribution vector; entry $(i, j)$ gives the probability of going from form $i$ to form $j$.]</Paragraph> <Paragraph position="4"> [Figure labels: full name, last name, nickname]</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Model Interpretation </SectionTitle> <Paragraph position="0"> The number of possible syntactic forms, which corresponds to the possible combinations of features, is large, around 160. Because of this, the results are hard to interpret when taken in their full form. We therefore show information for one feature at a time so that the tendencies become clearer.</Paragraph> <Paragraph position="1"> A first mention is very likely to be modified in some way (probability of 0.76, Figure 1), but it is highly unlikely that it will be both pre- and postmodified (probability of 0.17). The Markov model predicts that at each subsequent mention, modification may or may not be used, but once a non-modified form is chosen, the subsequent realizations will most likely not use modification any more.</Paragraph> <Paragraph position="2"> From the Markov chain that models the form of names (Figure 2) we can see that first name or nickname mentions are very unlikely. But it also predicts that once such a reference is chosen, it will most likely continue to be used as a form of reference. 
This is intuitively very appealing as it models cases where journalists call celebrities by their first name (e.g., &quot;Britney&quot; or &quot;Lady Diana&quot; are often used repeatedly within the same article).</Paragraph> <Paragraph position="3"> Prepositional, relative clause and &quot;other&quot; modifications appear with comparable, extremely low probabilities (in the range 0.01-0.04) after any possible previous mention realization. Thus the syntactic structure of the previous mention cannot be used as a predictor of the appearance of any of these kinds of modification, so for the task of rewriting references they should be treated only as &quot;blockers&quot; of further modification. The only type of postmodification with a notably high probability (0.25) is apposition at the first mention.</Paragraph> <Paragraph position="4"> Figure 3 shows the probabilities for transitions between NPs with a different number of premodifiers. The mass above the diagonal is almost zero, showing that each subsequent mention tends to have fewer premodifiers than the previous one. There are exceptions, which are not surprising; for example, a mention with one modifier is usually followed by a mention with one modifier (probability 0.5), which accounts for title modifiers such as &quot;Mr.&quot; and &quot;Mrs.&quot;.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Rewrite Rules </SectionTitle> <Paragraph position="0"> The Markov chain model derived in the manner described above helps us understand what a typical text looks like.</Paragraph> <Paragraph position="1"> The Markov chain transitions give us defeasible preferences that hold for the average text. Human writers also aim for style, so they may use even statistically highly unlikely realizations. For example, even a first mention with a pronoun can be felicitous at times. 
The fact that we were seeking preferences rather than rules allows us to take advantage of the sometimes inaccurate, automatically derived corpus. There have inevitably been parser errors or mistakes in Nominator's output, but these can be ignored since, given the large amount of data, the general preferences in realization can be captured even from imperfect data.</Paragraph> <Paragraph position="2"> We developed a set of rewrite rules that realize the highest probability paths in the Markov chains for name form and modification. In the cases where the name serves as the head of the NP it appears in, the highest probability paths suggest the following: • Name realization: use the full name at the first mention and the last name only at subsequent mentions. The probability of such a sequence of transitions is 0.66, compared with 0.01 for the sequence last name, full name, last name, for example.</Paragraph> <Paragraph position="3"> • Modification: the first mention is modified and subsequent mentions are not. As for the type of modification, premodifiers are preferred; if they cannot be realized, apposition is used. Appositions and premodifiers are removed from any subsequent mention.</Paragraph> <Paragraph position="4"> The required type of NP realization is currently achieved by extracting NPs from the original input documents.</Paragraph> </Section> </Paper>
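<!--
The name-form model and the name realization rule described in the text can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (estimate_markov_chain, path_probability, rewrite_name_forms) and the toy chains are hypothetical, and estimating probabilities by relative frequency is assumed here because the text does not say how the probabilities were computed.

```python
from collections import Counter, defaultdict

# Name-form categories taken from the text; everything else here is illustrative.
FORMS = ("full name", "last name", "nickname")


def estimate_markov_chain(chains):
    """Estimate an initial distribution and a transition matrix by relative frequency.

    `chains` is a list of coreference chains, each a list of name forms in
    order of mention. Relative-frequency estimation is an assumption of this
    sketch; the text does not state the estimation procedure.
    """
    first_counts = Counter()
    pair_counts = defaultdict(Counter)
    for chain in chains:
        if not chain:
            continue
        first_counts[chain[0]] += 1
        # Count each (previous form, current form) pair of adjacent mentions.
        for prev, cur in zip(chain, chain[1:]):
            pair_counts[prev][cur] += 1

    total_first = sum(first_counts.values())
    if total_first == 0:
        raise ValueError("need at least one non-empty chain")

    initial = {form: first_counts[form] / total_first for form in FORMS}
    transition = {}
    for prev in FORMS:
        row_total = sum(pair_counts[prev].values())
        transition[prev] = {
            cur: pair_counts[prev][cur] / row_total if row_total else 0.0
            for cur in FORMS
        }
    return initial, transition


def path_probability(path, initial, transition):
    """Probability of a sequence of forms: initial term times one-step transitions."""
    prob = initial[path[0]]
    for prev, cur in zip(path, path[1:]):
        prob *= transition[prev][cur]
    return prob


def rewrite_name_forms(n_mentions):
    """Rewrite rule from the text: full name first, last name only afterwards."""
    if n_mentions > 0:
        return ["full name"] + ["last name"] * (n_mentions - 1)
    return []


# Hypothetical toy chains, not data from the corpus.
toy_chains = [
    ["full name", "last name", "last name"],
    ["full name", "last name"],
    ["last name", "last name"],
    ["nickname", "nickname"],
]
initial, transition = estimate_markov_chain(toy_chains)
print(path_probability(rewrite_name_forms(3), initial, transition))
```

Comparing path_probability for the rule's output with the value for an alternative ordering, such as last name, full name, last name, mirrors the kind of comparison the text reports for the real corpus. The toy numbers themselves carry no empirical meaning.
-->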