<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1113"> <Title>Generating extraction patterns from a large semantic network and an untagged corpus Thierry POIBEAU Thales and LIPN Domaine de Corbeville</Title> <Section position="4" start_page="0" end_page="1" type="metho"> <SectionTitle> 3 The semantic net </SectionTitle> <Paragraph position="0"> The semantic network used in this experiment is a multilingual net providing information for five European languages. We briefly describe the network and then give some detail about its overall structure.</Paragraph> <Section position="1" start_page="0" end_page="1" type="sub_section"> <SectionTitle> 3.1 Overall description </SectionTitle> <Paragraph position="0"> The semantic network we use is called The Integral Dictionary. This database is structured as a merger of three semantic models available for five languages. Coverage is greatest for French, with 185,000 word-meanings encoded in the database. English is the second language in terms of coverage, with 79,000 word-meanings. Three additional languages (Spanish, Italian and German) are represented with about 39,500 senses.</Paragraph> <Paragraph position="1"> These smaller dictionaries, linked by universal identifiers to support translation, constitute the Basic Multilingual Dictionary available from ELRA. Grefenstette (1998) evaluated the corpus coverage of the Basic Multilingual Dictionary. The newspaper corpora defined by the US-government-sponsored Text Retrieval Conference (TREC) were used as test corpora. The result was that the chance of a noun pulled at random from these corpora appearing in the dictionary was on average 92%. 
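This coverage figure amounts to a simple set-membership count, which can be sketched as follows (a minimal sketch; the actual evaluation ran over tagged TREC newspaper corpora, and the function and variable names below are illustrative):

```python
def noun_coverage(corpus_nouns, dictionary):
    """Fraction of noun tokens from a test corpus found in a lexical resource.

    corpus_nouns: list of (lemmatized) noun tokens drawn from the corpus
    dictionary:   set of lemmas encoded in the dictionary
    """
    if not corpus_nouns:
        return 0.0
    found = sum(1 for noun in corpus_nouns if noun in dictionary)
    return found / len(corpus_nouns)
```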
This statistic is given for the Basic Multilingual Dictionary and, of course, the French Integral Dictionary reaches the highest coverage.</Paragraph> </Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.2 Semantic links </SectionTitle> <Paragraph position="0"> The links in the semantic network connect not only word-senses but also classes and concepts. Up to now, more than 100 different kinds of links have been defined. All these links are typed, so that a weight can be allocated to each link according to its type. This mechanism makes it possible to adapt the network to the task very precisely: one does not use the same weighting for lexical acquisition as for word-sense disambiguation. This characteristic makes the network highly adaptive and well suited to exploring lexical tuning.</Paragraph> <Paragraph position="1"> This network includes original strategies to measure the semantic proximity between two words. These measures take into account not only the similarity between words (their common features) but also their differences. The comparison between two words is based on the structure of the graph: the algorithm calculates a score taking into account the common ancestors as well as the distinct ones.</Paragraph> <Paragraph position="2"> This means that, for a target English text, one can assume that 92% of the tokens will be in the semantic graph. We will not detail here the different measures that have been implemented to calculate similarities between words; please refer to (Dutoit and Poibeau, 2002) for more details.</Paragraph> </Section> </Section> <Section position="5" start_page="1" end_page="3" type="metho"> <SectionTitle> 4 Acquisition of semantically equivalent predicative structures </SectionTitle> <Paragraph position="0"> For IE applications, defining an appropriate set of extraction patterns is crucial. 
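The ancestor-based comparison described in Section 3.2 can be illustrated with a small sketch (the actual measures are given in Dutoit and Poibeau (2002); the graph encoding and the Jaccard-style score below are assumptions for illustration, not the system's formula):

```python
def ancestors(graph, node):
    """Collect all ancestors of a node; graph maps each node to its parents."""
    seen, stack = set(), [node]
    while stack:
        for parent in graph.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def proximity(graph, w1, w2):
    """Score two words by their shared ancestors, penalizing distinct ones."""
    a1, a2 = ancestors(graph, w1), ancestors(graph, w2)
    union = a1 | a2
    return len(a1 & a2) / len(union) if union else 0.0
```

With a toy hierarchy where dog and cat both lead to mammal and then animal, the two words score 1.0, while dog and car (under vehicle) share no ancestors and score 0.0.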
This is why we want to validate the proposed measures by using them to extend an initial set of extraction patterns.</Paragraph> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 4.1 The acquisition process </SectionTitle> <Paragraph position="0"> The process begins when the end-user provides the system with a predicative linguistic structure along with a representative corpus. The system tries to discover relevant parts of text in the corpus based on the presence of plain words closely related to those of the example pattern. A syntactic analysis of the sentence is then performed to verify that these plain words correspond to a predicative structure.</Paragraph> <Paragraph position="1"> The method is close to that of Morin and Jacquemin (1999), who first locate pairs of relevant terms and then try to apply relevant patterns to analyse the nature of their relationship. The detailed algorithm is described below: 1. The head noun of the example pattern is compared with the head noun of the candidate pattern using the proximity measure. The result of the measure must be under a threshold fixed by the end-user. 2. The same condition must be fulfilled by the &quot;expansion&quot; element (the complement of the noun or of the verb of the candidate pattern).</Paragraph> <Paragraph position="2"> 3. The structure must be predicative (either a nominal or a verbal predicate; the algorithm makes no distinction at this level).</Paragraph> <Paragraph position="3"> The result of this analysis is a table that represents predicative structures equivalent to the initial example pattern. The process uses the corpus and the semantic net as two complementary knowledge sources: - The semantic net provides information about lexical semantics and relations between words; - 
The corpus attests possible expressions and filters out irrelevant ones.</Paragraph> <Paragraph position="4"> We performed an evaluation on different French corpora, given that the semantic net is especially rich for this language. We took the expression cession de societe (company transfer) as the initial pattern. The system then discovered the following expressions, each of them semantically related to the initial cession de *c-company*...</Paragraph> <Paragraph position="5"> This result includes some phrases containing *c-company*: the corpus was preprocessed beforehand so that each named entity is replaced by its type. This process normalizes the corpus so that the learning process can achieve better performance.</Paragraph> <Paragraph position="6"> The result must be manually validated. Some irrelevant structures are found, due to the activation of spurious links. This is the case for the expression renoncer a se porter acquereur (to give up buying something), which is not relevant. In this case, there was a spurious link between to give up and company in the semantic net.</Paragraph> </Section> <Section position="2" start_page="1" end_page="3" type="sub_section"> <SectionTitle> 4.2 Dealing with syntactic variations </SectionTitle> <Paragraph position="0"> The previous step extracts semantically related predicative structures from a corpus.</Paragraph> <Paragraph position="1"> These structures are found in the corpus in a given linguistic form, but we want the system to be able to find this information even when it appears in other kinds of linguistic sequences. That is why we associate meta-graphs with these linguistic structures, so that different transformations can be recognized. These transformations concern the syntactic level, affecting either the head (H) or the expansions (E) of the linguistic structure. The meta-graphs encode transformations concerning the following structures: 
- Verb -- direct object (especially when introduced by the French prepositions a or de), - Noun -- noun complement. A meta-graph is thus a kind of abstract grammar (see also the notion of metagrammar in TAG theory (Candito, 1999)).</Paragraph> <Paragraph position="2"> These meta-graphs encode the major part of the linguistic structures we are concerned with in the IE process.</Paragraph> <Paragraph position="3"> The graph in Figure 2 recognizes the following sequences (in brackets we indicate the pair of words previously extracted from the corpus): Reprise des activites charter... (H: reprise, E: activite) Reprendre les activites charter... (H: reprendre, E: activite) Reprise de l'ensemble des magasins suisse... (H: reprise, E: magasin) Reprendre l'ensemble des magasins suisse... (H: reprendre, E: magasin) Racheter les differentes activites... (H: racheter, E: activite) Rachat des differentes activites... (H: rachat, E: activite) This kind of graph is not easy to read: it combines linguistic tags with applicability constraints. For example, the first box contains a reference to the @A column in the table of identified structures. This column contains a set of binary constraints, expressed by + or - signs. A + sign means that the identified pattern is of the verb--direct object type: the graph can then be applied to deal with passive structures. In other words, the graph can only be applied if a + sign appears in the @A column of the constraint table. The constraints are removed from the instantiated graph. Even though the resulting graph is normally not visible (the compilation process directly produces a graph in binary format), we can give an equivalent graph.</Paragraph> <Paragraph position="4"> This mechanism, using constraint tables and meta-graphs, has been implemented in the finite-state toolbox INTEX (Silberztein, 1993). 
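The constraint-table mechanism can be sketched as follows (a minimal sketch assuming the table is a list of records; the real system compiles INTEX meta-graphs into binary transducers, and the field names below are illustrative):

```python
# Each row of the constraint table describes one identified pattern;
# the "@A" column holds a binary constraint: "+" means the pattern is of
# verb-direct object type, so the passive meta-graph may be instantiated.
PATTERNS = [
    {"head": "racheter", "expansion": "activite", "@A": "+"},  # verbal pattern
    {"head": "rachat",   "expansion": "activite", "@A": "-"},  # nominal pattern
]

def applicable_graphs(pattern):
    """Always instantiate the active graph; add the passive one only on '+'."""
    graphs = ["active"]
    if pattern["@A"] == "+":
        graphs.append("passive")
    return graphs
```

Here only the verbal pattern racheter/activite would receive a passive variant; the nominal pattern rachat/activite keeps the active graph alone.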
Twenty-six meta-graphs have been defined, modelling linguistic variation for the four predicative structures defined above. The phenomena mainly concern the insertion of modifiers (of the noun or the verb), verbal transformations (passive) and phrasal structures (relative clauses like ...Vivendi, qui a rachete Universal... 'Vivendi, which bought Universal'). The compilation of the set of meta-graphs produces a graph made of 317 states and 526 relations. These graphs are relatively abstract, but the end-user is not expected to manipulate them directly. They generate instantiated graphs, that is to say graphs in which the abstract variables have been replaced by linguistic information as modelled in the constraint tables.</Paragraph> <Paragraph position="5"> This method associates a pair of elements with a set of transformations that cover more examples than those of the training corpus. This generalization process is close to the one proposed by Morin and Jacquemin (1999) for terminology analysis.</Paragraph> </Section> </Section> </Paper>