<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1609"> <Title>Paraphrase Acquisition for Information Extraction</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Handling Problems in Real Texts </SectionTitle>
<Paragraph position="0"> In the previous section we described our method for obtaining paraphrases in principle. However, several issues in actual texts pose difficulties for our method.</Paragraph>
<Paragraph position="1"> The first is finding anchors which refer to the same entity. In actual articles, names are sometimes referred to in a slightly different form. For example, &quot;President Bush&quot; can also be referred to as &quot;Mr. Bush&quot;. Additionally, an entity is sometimes referred to by a pronoun, such as &quot;he&quot;. Since our method relies on the fact that those anchors are preserved across articles, anchors which appear in these varied forms may reduce the actual number of obtained paraphrases.</Paragraph>
<Paragraph position="2"> To handle this problem, we extended the notion of anchors to include not just Extended Named Entities, but also pronouns and common nouns such as &quot;the president&quot;. We used a simple coreference resolver after Extended Named Entity tagging. Currently this is done by simply assigning the most recent antecedent to each pronoun and finding the longest common subsequence (LCS) of two noun groups. Since it is possible to form a compound noun such as &quot;President-Bush&quot; in Japanese, we computed the LCS over the characters of the two noun groups. We decided whether two noun groups s1 and s2 are coreferential by thresholding the length of their character-level LCS against the lengths of s1 and s2.</Paragraph>
<Paragraph position="4"> Here |s| denotes the length of noun group s and LCS(s1, s2) is the LCS of the two noun groups s1 and s2.</Paragraph>
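As a concrete illustration of this anchor-matching step, here is a minimal Python sketch of the character-level LCS test; the ratio against the shorter noun group and the 0.6 threshold are assumptions made for illustration only.

# Minimal sketch of the character-level LCS test used to merge noun-group
# anchors. The ratio against the shorter noun group and the 0.6 threshold are
# illustrative assumptions, not values taken from the paper.

def lcs_length(s1: str, s2: str) -> int:
    """Length of the longest common subsequence of two strings, character by character."""
    dp = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i, c1 in enumerate(s1, 1):
        for j, c2 in enumerate(s2, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if c1 == c2 else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(s1)][len(s2)]

def maybe_coreferential(s1: str, s2: str, threshold: float = 0.6) -> bool:
    """Judge two noun groups coreferential when the character LCS covers a large
    enough portion of the shorter noun group (assumed form of the condition)."""
    if not s1 or not s2:
        return False
    return lcs_length(s1, s2) / min(len(s1), len(s2)) >= threshold

# Example: a compound noun and a short form of the same name.
# maybe_coreferential("ブッシュ大統領", "ブッシュ氏")  -> True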
<Paragraph position="5"> The second problem is to extract appropriate portions of sentences as paraphrase expressions. Since we use a tree structure to represent the expressions, finding common subtrees may take an exponential number of steps. For example, if a dependency tree in one article has a single predicate with n arguments, the number of possible subtrees which can be obtained from the tree is 2^n. So the matching process between arbitrary combinations of subtrees may grow exponentially with the length of the sentences. Even worse, it can generate many combinations of sentence portions which don't make sense as paraphrases. For example, from the expressions &quot;two more people have died in Hong Kong&quot; and &quot;Hong Kong reported two more deaths&quot;, we could extract the expressions &quot;in Hong Kong&quot; and &quot;Hong Kong reported&quot;. Although the two share one anchor, they are not a correct paraphrase. To avoid this sort of error, we need to put some additional restrictions on the expressions.</Paragraph>
<Paragraph position="6"> (Shinyama et al., 2002) used the frequency of expressions to filter out these incorrect pairs of expressions. First, the system obtained a set of IE patterns from corpora (Sudo and Sekine, 2001), and then calculated a score for each candidate paraphrase by counting how many times that expression appears as an IE pattern in the whole corpus. However, with this method, obtainable expressions are limited to existing IE patterns. Since we wanted to obtain a broader range of expressions not limited to IE patterns themselves, we tried to use other restrictions which can be acquired independently of the IE system.</Paragraph>
<Paragraph position="7"> We partly solve this problem by calculating the plausibility of each tree structure. In Japanese sentences, the case of each argument which modifies a predicate is represented by a case marker (postposition, or joshi) that follows the noun phrase, analogous to prepositions in English but placed after the noun phrase rather than before it. These arguments include subjects and objects, which are expressed syntactically in English sentences.</Paragraph>
<Paragraph position="8"> In advance, we collected the cases that frequently occur with each predicate. We applied this restriction when generating subtrees from a dependency tree by calculating a score for each predicate as follows.</Paragraph>
<Paragraph position="9"> Let an instance of predicate p have cases C = {c1, c2, ..., cn}, and let Np(I) be the number of instances of p in the corpus whose cases are I = {c1, c2, ..., cm}. We compute the score Sp(C) of the instance as Sp(C) = (Σ_{I ⊆ C} Np(I)) / Np, where Np is the number of instances of p in the corpus.</Paragraph>
<Paragraph position="10"> Using this metric, a predicate which lacks cases that it usually takes is given a lower score. A subtree which includes a predicate whose score is less than a certain threshold is filtered out. This way we can filter out expressions such as &quot;Hong Kong reported&quot; in Japanese, since it would lack the object case which the verb &quot;report&quot; normally takes. Moreover, this greatly reduces the number of possible combinations of subtrees.</Paragraph> </Section>
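As an illustration of this filtering step, the following sketch computes the score from per-predicate counts of observed case sets and filters candidate subtrees against a threshold; the data structures and helper names are assumptions made for illustration, not part of the original system.

# Sketch of the case-restriction score Sp(C) described above, assuming the
# corpus statistics are available as, for each predicate, a count of every
# observed case set (Np(I)). Container and function names are illustrative.
from collections import Counter
from typing import Dict, FrozenSet, Iterable, Tuple

# predicate -> Counter mapping an observed case set (frozenset of case markers)
# to the number of corpus instances with exactly that case set, i.e. Np(I).
CaseStats = Dict[str, Counter]

def case_score(predicate: str, cases: FrozenSet[str], stats: CaseStats) -> float:
    """Sp(C): fraction of corpus instances of the predicate whose case set is
    contained in C. The score drops when C lacks a case the predicate usually takes."""
    observed = stats.get(predicate)
    if not observed:
        return 0.0
    total = sum(observed.values())  # Np: number of instances of p in the corpus
    covered = sum(n for case_set, n in observed.items() if case_set <= cases)
    return covered / total

def keep_subtree(preds: Iterable[Tuple[str, FrozenSet[str]]],
                 stats: CaseStats, threshold: float = 0.3) -> bool:
    """Keep a candidate subtree only if every predicate in it scores above the
    threshold (0.3 is the threshold reported in the experiments below)."""
    return all(case_score(p, c, stats) > threshold for p, c in preds)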
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiments </SectionTitle>
<Paragraph position="0"> We used Japanese news articles for this experiment. First we collected articles for a specific domain from two different newspapers (Mainichi and Nikkei). Then we used a Japanese part-of-speech tagger (Kurohashi and Nagao, 1998) and an Extended Named Entity tagger to process the documents, and put them into a Topic Detection and Tracking system.</Paragraph>
<Paragraph position="1"> In this experiment, we used a modified version of a Japanese Extended Named Entity tagger (Uchimoto et al., 2000). This tagger tags person names, organization names, locations, dates, times and numbers. Next we applied a simple vector space method to obtain pairs of sentences which report the same event. After that, we used a simple coreference resolver to identify anchors. Finally we used a dependency analyzer (Kurohashi, 1998) to extract the portions of sentences that share at least one anchor.</Paragraph>
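As an illustration of the sentence-pairing step, the sketch below scores sentence pairs with a bag-of-words cosine similarity; the tokenization, unweighted counts, and 0.5 cutoff are assumptions for illustration, since the text describes this step only as a simple vector space method.

# Illustrative sketch of pairing sentences that report the same event with a
# bag-of-words vector space model. The unweighted counts and the thresholded
# cosine used here are assumptions, not a verbatim description of the system.
import math
from collections import Counter
from typing import Iterable, List, Tuple

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    if not a or not b:
        return 0.0
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def pair_sentences(doc1: Iterable[List[str]], doc2: Iterable[List[str]],
                   threshold: float = 0.5) -> List[Tuple[int, int, float]]:
    """Return (i, j, similarity) for sentence pairs whose similarity exceeds the
    threshold. Each document is given as a list of tokenized sentences."""
    vecs1 = [Counter(s) for s in doc1]
    vecs2 = [Counter(s) for s in doc2]
    pairs = []
    for i, v1 in enumerate(vecs1):
        for j, v2 in enumerate(vecs2):
            sim = cosine(v1, v2)
            if sim > threshold:
                pairs.append((i, j, sim))
    return pairs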
<Paragraph position="2"> In this experiment, we used a set of articles reporting murder cases. The results are shown in Table 1. First, with Topic Detection and Tracking, there were 156 correct pairs of articles out of the 193 pairs obtained. To simplify the evaluation process, we actually obtained paraphrases from the top 20 pairs of articles which had the highest similarities.</Paragraph>
<Paragraph position="3"> Obtained paraphrases were reviewed manually. We used the following criteria for judging the correctness of paraphrases: 1. They have to describe the same event.</Paragraph>
<Paragraph position="4"> 2. They should capture the same information if we use them in an actual IE application.</Paragraph>
<Paragraph position="5"> We tried several conditions to extract paraphrases. First we tried to extract paraphrases using neither coreference resolution nor case restriction. Then we applied only the case restriction with the threshold 0.3 < Sp(C), and observed that the precision went up from 24% to 56%. Furthermore, we added simple coreference resolution and the precision rose to 62%. We obtained 23 correct paraphrases. Several interesting paraphrases were obtained; some examples are shown in Figure 3 (correct paraphrases) and Figure 4 (incorrect paraphrases). It is hard to say how many paraphrases can ultimately be obtained from these articles. However, it is worth noting that after spending about 5 hours on this corpus we obtained 100 paraphrases manually.</Paragraph> </Section>
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Discussion </SectionTitle>
<Paragraph position="0"> Some paraphrases were incorrectly obtained. There were two major causes. The first was dependency analysis errors. Since our method recognizes the boundaries of expressions using dependency trees, if a predicate in a tree takes extra arguments, extraneous portions of the sentence may be included in the paraphrase. For example, the predicate &quot;lay in ambush&quot; in Sample 3 should have taken a different noun as its subject. In that case the predicate would no longer share the anchors and could be eliminated.</Paragraph>
<Paragraph position="1"> The second cause was a failure to recognize context. In Sample 4, we observed that even if two expressions share multiple anchors, an obtained pair can still be incorrect. We hope that this kind of error can be reduced by considering the contexts around expressions more extensively.</Paragraph> </Section>
<Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Future Work </SectionTitle>
<Paragraph position="0"> We hope to apply our approach further to obtain more varied paraphrases. After a certain number of paraphrases have been obtained, we can use them as anchors to obtain additional paraphrases. For example, if we know that &quot;A dismantle B&quot; and &quot;A destroy B&quot; are paraphrases, we could apply them to &quot;U.N. reported Iraq dismantling more missiles&quot; and &quot;U.N. official says Iraq destroyed more Al-Samoud 2 missiles&quot;, and obtain another pair of paraphrases, &quot;X reports Y&quot; and &quot;X says Y&quot;.</Paragraph>
<Paragraph position="1"> This approach can also be extended in the other direction. Some entities can be referred to by completely different names in certain situations, such as &quot;North Korea&quot; and &quot;Pyongyang&quot;. We are also planning to identify these varied surface forms of a single entity by applying previously obtained paraphrases. For example, if we know &quot;A restarted B&quot; and &quot;A reactivated B&quot; as paraphrases, we could apply them to &quot;North Korea restarted its nuclear facility&quot; and &quot;Pyongyang has reactivated the atomic facility&quot;. This way we learn that &quot;North Korea&quot; and &quot;Pyongyang&quot; can refer to the same entity in a certain context.</Paragraph>
<Paragraph position="2"> In addition, we are planning to assign a credibility score to anchors to improve accuracy. We found that some anchors are less reliable than others even when they are proper expressions. For example, in most U.S. newspapers the word &quot;U.S.&quot; is used in much wider contexts than a word such as &quot;Thailand&quot;, although both of them are country names. So we want to give less credit to these widely used names.</Paragraph>
<Paragraph position="3"> We noticed that there are several issues in generalizing paraphrases. Currently we simply label every Named Entity as a slot. However, expressions such as &quot;the governor of LOCATION&quot; can take only certain kinds of locations. Also, some paraphrases require a narrower context than others and are not truly interchangeable. For example, &quot;PERSON was sworn&quot; can be replaced with &quot;PERSON took office&quot;, but not vice versa.</Paragraph> </Section> </Paper>