<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2002"> <Title>Enrique.Alfonseca@uam.es enrique@lr.pi.titech.ac.jp Pablo.Castells@uam.es oku@pi.titech.ac.jp</Title> <Section position="5" start_page="9" end_page="12" type="metho"> <SectionTitle> 2 Proposed pattern generalization </SectionTitle> <Paragraph position="0"> procedure To begin with, for every appearance of a pair of concepts, we extract a context around them. Next, those contexts are generalized to obtain the parts that are shared by several of them. The procedure is detailed in the following subsections.</Paragraph> <Section position="1" start_page="10" end_page="10" type="sub_section"> <SectionTitle> 2.1 Context extraction procedure </SectionTitle> <Paragraph position="0"> After selecting the sentences for each pair of related words in the training set, these are processed with a part-of-speech tagger and a module for Named Entity Recognition and Classification (NERC) that annotates people, organizations, locations, dates, relative temporal expressions and numbers. Afterward, a context around the two words in the pair is extracted, including (a) at most five words to the left of the first word; (b) all the words in between the pair words; (c) at most five words to the right of the second word. The context never jumps over sentence boundaries, which are marked with the symbols BOS (Beginning of sentence)andEOS(Endofsentence). Thetworelated concepts are marked as <hook> and <target>.</Paragraph> <Paragraph position="1"> Figure 1 shows several example contexts extracted fortherelationshipsbirthyear, birthplace, writer-book and capital-country.</Paragraph> <Paragraph position="2"> Furthermore, for each of the entities in the relationship, the system also stores in a separate file the way in which they are annotated in the training corpus: thesequencesofpart-of-speechtagsofevery appearance, and the entity type (if marked as such). So, for instance, typical PoS sequences for names of authors are &quot;NNP&quot;1 (surname) and &quot;NNP NNP&quot; (first name and surname). A typical entity kind for an author is person.</Paragraph> </Section> <Section position="2" start_page="10" end_page="10" type="sub_section"> <SectionTitle> 2.2 Generalization pseudocode </SectionTitle> <Paragraph position="0"> In order to identify the portions in common between the patterns, and to generalize them, we apply the following pseudocode (Ruiz-Casado et al., 1. Store all the patterns in a setP.</Paragraph> <Paragraph position="1"> 2. Initialize a setRas an empty set.</Paragraph> <Paragraph position="2"> 3. WhileP is not empty, (a) For each possible pair of patterns, calculate the distance between them (described in Section 2.3).</Paragraph> <Paragraph position="3"> (b) Take the two patterns with the smallest distance, pi and pj.</Paragraph> <Paragraph position="4"> (c) Remove them from P, and add them to R.</Paragraph> <Paragraph position="5"> (d) Obtain the generalization of both, pg (Section 2.4).</Paragraph> <Paragraph position="6"> (e) If pg does not have a wildcard adjacent to the hook or the target, add it toP. 4. ReturnR At the end, R contains all the initial patterns andthoseobtainedwhilegeneralizingtheprevious ones. The motivation for step (e) is that, if a patterncontainsawildcardadjacenttoeitherthehook null or the target, it will be impossible to know where it starts or ends. 
<Section position="2" start_page="10" end_page="10" type="sub_section"> <SectionTitle> 2.2 Generalization pseudocode </SectionTitle>
<Paragraph position="0"> In order to identify the portions in common between the patterns, and to generalize them, we apply the following pseudocode (Ruiz-Casado et al.); a runnable sketch of this loop is given after Section 2.4:
1. Store all the patterns in a set P.
2. Initialize a set R as an empty set.
3. While P is not empty:
(a) For each possible pair of patterns, calculate the distance between them (described in Section 2.3).
(b) Take the two patterns with the smallest distance, pi and pj.
(c) Remove them from P, and add them to R.
(d) Obtain the generalization of both, pg (Section 2.4).
(e) If pg does not have a wildcard adjacent to the hook or the target, add it to P.
4. Return R.</Paragraph>
<Paragraph position="1"> At the end, R contains all the initial patterns and those obtained while generalizing the previous ones. The motivation for step (e) is that, if a pattern contains a wildcard adjacent to either the hook or the target, it will be impossible to know where it starts or ends. For instance, when applying the pattern <hook> wrote * <target> to a text, the wildcard prevents the system from guessing where the title of the book starts.</Paragraph> </Section>
<Section position="3" start_page="10" end_page="11" type="sub_section"> <SectionTitle> 2.3 Edit distance calculation </SectionTitle>
<Paragraph position="0"> So as to calculate the similarity between two patterns, a slightly modified version of the dynamic programming algorithm for edit-distance calculation (Wagner and Fischer, 1974) is used. The distance between two patterns A and B is defined as the minimum number of changes (insertions, deletions or replacements) that have to be done to the first one in order to obtain the second one.</Paragraph>
<Paragraph position="1"> The calculation is carried out by filling a matrix M, as shown in Figure 2 (left). At the same time that we calculate the edit distance matrix, it is possible to fill in another matrix D, in which we record which of the four choices was selected at each step (insertion, deletion, replacement or no edit), using one character for each. This will be used later to obtain the generalized pattern. In the example in Figure 2, the two patterns A: wrote the well known novel and B: wrote the classic novel contain respectively 5 and 4 tokens, and M(5,4) has the value 2, indicating the distance between the two complete patterns. For instance, the two edit operations would be replacing well by classic and removing known.
(Figure 2: the edit distance between A and B is calculated; D is the matrix indicating the choice that produced the minimal distance for each cell in M.)</Paragraph> </Section>
<Section position="4" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 2.4 Obtaining the generalized pattern </SectionTitle>
<Paragraph position="0"> After calculating the edit distance between two patterns A and B, we can use matrix D to obtain a generalized pattern, which should maintain the common tokens shared by them. The procedure used is the following: * Every time there is an insertion or a deletion, the generalized pattern will contain a wildcard, indicating that there may be anything in between.</Paragraph>
<Paragraph position="1"> * Every time there is a replacement, the generalized pattern will contain a disjunction of both tokens.</Paragraph>
<Paragraph position="2"> * Finally, in the positions where there is no edit operation, the token that is shared between the two patterns is left unchanged.</Paragraph>
<Paragraph position="3"> The patterns in the example will produce the generalized pattern
Wrote the well known novel
Wrote the classic novel
----------------------
Wrote the well|classic * novel
The generalization of these two patterns produces one that can match a wide variety of sentences, so we should always take care not to over-generalize.</Paragraph> </Section>
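The sketch below puts the previous three subsections together in Python; it also anticipates the part-of-speech restriction that Section 2.5 introduces next (replacements are only allowed between tokens of the same PoS, at cost 0). The data representation is an assumption made for the sketch: a pattern is a list of tokens, each token a pair of a set of word forms and a PoS tag, with ({"*"}, None) as the wildcard and ({"<hook>"}, ...) / ({"<target>"}, ...) as the slots.

    from itertools import product

    INF = float("inf")

    def distance(a, b):
        """PoS-aware edit distance (Sections 2.3 and 2.5): replacements are
        allowed only between same-PoS tokens and then cost 0, which favors
        them over insertions and deletions (cost 1). Returns the distance
        and the choice matrix D of Figure 2."""
        n, m = len(a), len(b)
        M = [[0] * (m + 1) for _ in range(n + 1)]
        D = [[None] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            M[i][0], D[i][0] = i, "del"
        for j in range(1, m + 1):
            M[0][j], D[0][j] = j, "ins"
        for i, j in product(range(1, n + 1), range(1, m + 1)):
            repl = M[i - 1][j - 1] if a[i - 1][1] == b[j - 1][1] else INF
            options = [(repl, "repl"),              # listed first: preferred on ties
                       (M[i - 1][j] + 1, "del"),
                       (M[i][j - 1] + 1, "ins")]
            M[i][j], D[i][j] = min(options, key=lambda o: o[0])
        return M[n][m], D

    def generalise(a, b):
        """Merge two patterns following Section 2.4: insertions/deletions
        become a wildcard, replacements become a disjunction of the word
        sets (trivially the shared token when the words are equal)."""
        _, D = distance(a, b)
        out, i, j = [], len(a), len(b)
        while i > 0 or j > 0:
            if D[i][j] == "repl":
                out.append((a[i - 1][0] | b[j - 1][0], a[i - 1][1]))
                i, j = i - 1, j - 1
            elif D[i][j] == "del":
                out.append(({"*"}, None))
                i -= 1
            else:                                   # "ins"
                out.append(({"*"}, None))
                j -= 1
        out.reverse()
        # Collapse runs of adjacent wildcards into a single one.
        return [t for k, t in enumerate(out)
                if not (t == ({"*"}, None) and k > 0 and out[k - 1] == ({"*"}, None))]

    def wildcard_next_to_slot(pattern):
        """Step (e) of Section 2.2: a wildcard touching the hook or the
        target makes the slot boundary unknowable."""
        slots = ({"<hook>"}, {"<target>"})
        for k, tok in enumerate(pattern):
            if tok[0] == {"*"}:
                for nb in pattern[max(0, k - 1):k] + pattern[k + 1:k + 2]:
                    if nb[0] in slots:
                        return True
        return False

    def generalise_all(patterns):
        """The loop of Section 2.2: repeatedly merge the two closest
        patterns; keep the merge only if it passes step (e)."""
        P, R = list(patterns), []
        while len(P) > 1:
            _, i, j = min((distance(P[x], P[y])[0], x, y)
                          for x in range(len(P)) for y in range(x + 1, len(P)))
            pj, pi = P.pop(j), P.pop(i)             # pop the larger index first
            R += [pi, pj]
            pg = generalise(pi, pj)
            if not wildcard_next_to_slot(pg):
                P.append(pg)
        return R + P                                # P may hold one last pattern

    a = [({"wrote"}, "VBD"), ({"the"}, "DT"), ({"well"}, "RB"),
         ({"known"}, "JJ"), ({"novel"}, "NN")]
    b = [({"wrote"}, "VBD"), ({"the"}, "DT"), ({"classic"}, "JJ"), ({"novel"}, "NN")]
    print(generalise(a, b))
    # wrote the * known|classic novel -- the PoS-aware result of Section 2.5

With the cost-0, same-PoS replacement rule, the merge of the running example deletes the adverb and builds the disjunction of the two adjectives, exactly the behaviour motivated in Section 2.5.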
<Section position="5" start_page="11" end_page="12" type="sub_section"> <SectionTitle> 2.5 Considering part-of-speech tags and Named Entities </SectionTitle>
<Paragraph position="0"> If we consider the result in the previous example, we can see that the disjunction has been made between an adverb and an adjective, while the other adjective has been deleted. A more natural result, with the same number of edit operations as the previous one, would have been to delete the adverb to obtain the following generalization:
Wrote the well known novel
Wrote the classic novel
----------------------
Wrote the * known|classic novel
This is done by taking part-of-speech tags into account in the generalization process. In this way, the edit distance has been modified so that a replacement operation can only be done between words of the same part-of-speech. Furthermore, replacements are given an edit distance of 0. This favors the choice of replacements with respect to deletions and insertions. To illustrate this point, the distance between known and classic will be set to 0, because both tokens are adjectives. In other words, the d function is redefined as:</Paragraph>
<Paragraph position="1"> d(a, b) = 0 if PoS(a) = PoS(b), and d(a, b) = infinity otherwise, so that when the parts of speech differ the cheaper option is always an insertion plus a deletion.</Paragraph>
<Paragraph position="2"> Note that all the entities identified by the NERC module will appear with a PoS tag of entity, so it is possible to have a disjunction such as location|organization/entity in a generalized pattern (see Figure 1).</Paragraph> </Section> </Section>
<Section position="6" start_page="12" end_page="12" type="metho"> <SectionTitle> 3 Proposed pattern scoring procedure </SectionTitle>
<Paragraph position="0"> As indicated above, if we measure the precision of the patterns using a hook corpus-based approach, the score may be inadvertently increased because the patterns are only evaluated using the same terms with which they were extracted. The approach proposed herein is to take advantage of the fact that we are obtaining patterns for several relationships. Thus, the hook corpora for some of the patterns can also be used to identify errors made by other patterns.</Paragraph>
<Paragraph position="1"> The input of the system now is not just a list of related pairs, but a table including several relationships for the same entities. We may consider it a set of mini-biographies, as in Mann and Yarowsky's (2005) system. Table 1 shows a few rows in the input table for the system. The cells for which no data is provided have a default value of None, which means that anything extracted for that cell will be considered incorrect.</Paragraph>
<Paragraph position="2"> Although this table can be written by hand, in our experiments we have chosen to build it automatically from the lists of related pairs. The system receives the seed pairs for the relationships, and mixes the information from all of them in a single table. In this way, if Dickens appears in the seed lists for the birth year, death year, birth place and writer-book relationships, four of the cells in its row will be filled in with values, and all the rest will be set to None. This is probably a very strict evaluation because, for all the cells for which there was no value in any of the lists, any result obtained will be judged incorrect. However, the advantage is that we can study the behavior of the system when working with incomplete data.</Paragraph>
<Paragraph position="3"> The new procedure for calculating the patterns' precisions is as follows:
1. For every relationship, and for every hook, collect a hook corpus from the Internet.
2. Apply the patterns to all the hook corpora collected. Whenever a pattern extracts a relationship from a sentence,
* If the table does not contain a row for the hook, ignore the result.
* If the extracted target appears in the corresponding cell in the table, consider it correct.
* If that cell contains a different value, or None, consider it incorrect.</Paragraph>
<Paragraph position="4"> For instance, the pattern <target> 's <hook> extracted for director-film may find, in the Dickens corpus, book titles. Because these titles do not appear in the table as films directed by Dickens, the pattern will be considered to have a low accuracy. In this step, every pattern that did not apply at least three times in the test corpora was discarded.</Paragraph> </Section>
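A minimal sketch of this scoring procedure in Python. The table layout (a dictionary from hook to a relation-to-value mapping), the pattern interface (the extract method and relation attribute are hypothetical names) and the per-pattern bookkeeping are assumptions made for the illustration.

    from collections import defaultdict

    def score_patterns(patterns, hook_corpora, table, min_applications=3):
        """Score each pattern against every hook corpus (Section 3). An
        extraction is correct only when it equals the table cell for that
        hook and relation; a None cell, or a cell holding a different
        value, makes it incorrect; hooks absent from the table are ignored.
        Patterns applied fewer than `min_applications` times are dropped."""
        correct, applied = defaultdict(int), defaultdict(int)
        for hook, sentences in hook_corpora.items():
            for k, pattern in enumerate(patterns):
                for target in pattern.extract(sentences):   # hypothetical API
                    if hook not in table:
                        continue                            # no row: ignore
                    applied[k] += 1
                    if table[hook].get(pattern.relation) == target:
                        correct[k] += 1
        return {k: correct[k] / applied[k]
                for k in applied if applied[k] >= min_applications}

Because all hook corpora are shared, a director-film pattern that extracts Dickens' book titles in the Dickens corpus is penalized, which is precisely the cross-relationship error detection described above.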
<Section position="7" start_page="12" end_page="14" type="metho"> <SectionTitle> 4 Pattern application </SectionTitle>
<Paragraph position="0"> Finally, given a set of patterns for a particular relation, the procedure for obtaining new pairs is straightforward:
1. For each of the patterns,
2. For each sentence in the test corpus,
(a) Look for the left-hand-side context in the sentence.
(b) Look for the middle context.
(c) Look for the right-hand-side context.
(d) Take the words in between, and check that either the sequence of part-of-speech tags or the entity type had been seen in the training corpus for that role (hook or target). If so, output the relationship.
(Table 2: number of unique patterns after the extraction and the generalization steps, and after calculating their accuracy and filtering those that did not apply 3 times on the test corpus.)</Paragraph>
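The matching step can be sketched in Python as follows, assuming a PoS-tagged sentence and single-token slots; the real system also matches wildcards, disjunctions, multi-token fillers and entity types, which are omitted here for brevity.

    def apply_pattern(pattern, tagged, seen_pos):
        """Match a pattern such as ['<hook>', 'was', 'born', 'in',
        '<target>'] (assumed to contain both slots) against a tagged
        sentence (a list of (word, PoS) pairs), keeping a candidate pair
        only if the PoS filling each slot was seen for that role in the
        training corpus (step (d))."""
        pairs = []
        for start in range(len(tagged) - len(pattern) + 1):
            slots, ok = {}, True
            for off, ptok in enumerate(pattern):
                word, pos = tagged[start + off]
                if ptok in ("<hook>", "<target>"):
                    slots[ptok] = (word, pos)   # capture the slot filler
                elif ptok != word:
                    ok = False                  # context token mismatch
                    break
            if ok and all(pos in seen_pos[slot] for slot, (_, pos) in slots.items()):
                pairs.append((slots["<hook>"][0], slots["<target>"][0]))
        return pairs

    tagged = [("Dickens", "NNP"), ("was", "VBD"), ("born", "VBN"),
              ("in", "IN"), ("1812", "CD"), (".", ".")]
    print(apply_pattern(["<hook>", "was", "born", "in", "<target>"], tagged,
                        {"<hook>": {"NNP", "NNP NNP"}, "<target>": {"CD"}}))
    # [('Dickens', '1812')]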
<Paragraph position="3"> 5 Evaluation and results
The procedure has been tested with 10 different relationships. For each pair in each seed list, a corpus with 500 documents has been collected using Google, from which the patterns are extracted. Table 2 shows the number of patterns obtained. It is interesting to see that for some relations, such as birth-year or birth-place, more than one thousand patterns have been reduced to a few. Table 3 shows the patterns obtained for the relationship birth year, together with the number of times that each pattern extracted information when applied to a test corpus. It can also be seen that some of the patterns with good precision contain the wildcard *, which helped extract the correct birth year on roughly 50 occasions. Especially of interest is the last pattern, which obtained a low precision with the procedure indicated here, but which would have obtained an accuracy of 0.54 using the traditional hook corpus approach. This is because in other test corpora (e.g. in the one containing soccer players and clubs) it is more frequent to find the name of a person followed by a number that is not his/her birth year, while that did not happen so often in the birth year test corpus.</Paragraph>
<Paragraph position="4"> For evaluating the patterns, a new test corpus has been collected for fourteen entities not present in the training corpora, again using Google. The chosen entities are Robert de Niro and Natalie Wood (actors), Isaac Asimov and Alfred Bester (writers), Montevideo and Yaounde (capitals), Gloria Macapagal Arroyo and Hosni Mubarak (country presidents), Bernardo Bertolucci and Federico Fellini (directors), Peter Paul Rubens and Paul Gauguin (painters), and Jens Lehmann and Thierry Henry (soccer players). Table 4 shows the results obtained for each relationship.</Paragraph>
<Paragraph position="5"> We have observed that, for those relationships in which the target does not belong to a Named Entity type, it is common for the patterns to extract additional words together with the right target. For example, rather than extracting The Last Emperor, the patterns may extract this title together with its rating or its length, the title between quotes, or phrases such as The classic The Last Emperor. In the second column in the table, we measured the percentage of times that a correct answer appears inside the extracted target, so these examples would be considered correct. We call this metric inclusion precision.</Paragraph>
<Section position="1" start_page="14" end_page="14" type="sub_section"> <SectionTitle> 5.1 Comparison to related approaches </SectionTitle>
<Paragraph position="0"> Although the above results are not directly comparable to Mann and Yarowsky's (2005), as the corpora used are different, in most cases the precision is equal to or higher than that reported there. On the other hand, we have rerun Ravichandran and Hovy's (2002) algorithm on our corpus. In order to ensure a fair comparison, their algorithm has been slightly modified so that it also takes into account the part-of-speech sequences and entity types while extracting the hooks and the targets during rule application. So, for instance, the relationship birth date is only extracted between a hook tagged as a person and a target tagged as either a date or a number. The results on the test corpus for our approach and Ravichandran and Hovy's (2002) are shown in Table 5. As can be seen, our procedure seems to perform better for all of the relations except birth place. It is interesting to note that, as could be expected, for those targets for which there is no entity type defined (films, books and pictures), Ravichandran and Hovy's (2002) algorithm extracts many errors, because it is not possible to apply the Named Entity Recognizer to clean up the results, and the accuracy remains below 10%. On the other hand, that trend does not seem to affect our system, which had very poor results for painter-picture, but reasonably good ones for actor-film.</Paragraph>
<Paragraph position="1"> Another interesting case is that of birth places. A manual observation of our generalized patterns shows that they often contain disjunctions of verbs such as the one in (1), which detects not just the birth place but also places where the person lived. In this case, Ravichandran and Hovy's (2002) patterns proved more precise, as they do not contain disjunctions or wildcards.</Paragraph> </Section> </Section>
<Section position="8" start_page="14" end_page="14" type="metho"> <SectionTitle> (1) HOOK ,/, returned|travelled|born/VBN to|in/IN TARGET </SectionTitle>
<Paragraph position="0"> It is interesting that, among the three relationships with the smallest number of extracted patterns, one of them did not produce any result, and the other two attained a low precision. Therefore, it should be possible to improve the performance of the system if, while training, we augment the training corpora until the number of extracted patterns exceeds a given threshold.</Paragraph> </Section> </Paper>