<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1101"> <Title>Knowledge-Based Multilingual Document Analysis</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The Application </SectionTitle> <Paragraph position="0"> In the 5th Framework NAMIC Project (News Agencies Multilingual Information Categorisation), the defined task of the system was to support the automatic authoring of multilingual news agency texts, where the chosen languages were English, Italian and Spanish. The goal was the hypertextual linking of related articles in one language as well as to related articles in the other project languages. One of the intermediate goals of NAMIC was to categorise incoming news articles in one of the three target languages and to use Natural Language Technology to derive an 'objective representation' of the events and agents contained within the news. This representation, which is initially created once using representative news corpora, is stored in a repository and accessed in the authoring process.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Automatic Authoring </SectionTitle> <Paragraph position="0"> Automatic Authoring is the task of automatically deriving a hypertextual structure from a set of available news articles (in our case, in three different languages: English, Spanish and Italian). It relies on the activity of event matching: the process of selecting the relevant facts in a news article in terms of their general type (e.g. selling or buying companies, winning a football match), their participants and their related roles (e.g. the company sold or the winning football team). Authoring is the activity of generating links between news articles according to relationships established among the facts detected in the previous phase.</Paragraph> <Paragraph position="1"> For instance, a company acquisition can be referred to in one (or more) news items as: - Intel, the world's largest chipmaker, bought a unit of Danish cable maker NKT that designs high-speed computer chips used in products that direct traffic across the internet and corporate networks.</Paragraph> <Paragraph position="2"> - The giant chip maker Intel said it acquired the closely held ICP Vortex Computersysteme, a German maker of systems for storing data on computer networks, to enhance its array of data-storage products.</Paragraph> <Paragraph position="3"> - Intel ha acquistato Xircom inc. per 748 milioni di dollari. [Intel acquired Xircom Inc. for 748 million dollars.]</Paragraph> <Paragraph position="4"> - Le dichiarazioni della Microsoft, infatti, sono state precedute da un certo fermento, dovuto all'interesse verso Linux di grandi ditte quali Corel, Compaq e non ultima Intel (che ha acquistato quote della Red Hat) ... [Microsoft's statements were in fact preceded by a certain ferment, due to the interest in Linux shown by large firms such as Corel, Compaq and, not least, Intel (which has bought shares in Red Hat) ...]</Paragraph> <Paragraph position="5"> The hypothesis underlying Authoring is that all the above news items deal with facts in the same area of interest to a potential class of readers. They should thus be linked, and the links should suggest to the user that the underlying motivation is that they all refer to Intel acquisitions.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The NAMIC Architecture </SectionTitle> <Paragraph position="0"> The NAMIC system uses a modularised IE architecture whose principal components, used to create the IE repository, are morpho-syntactic analysis, categorisation and semantic analysis.</Paragraph>
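As a rough illustration of the kind of repository entry this pipeline produces and the authoring step consumes, the sketch below models an extracted fact as a typed event with role fillers and links articles that share an event type and agent. The names here (Event, link_articles) and the keying scheme are our own illustrative inventions under assumed simplifications, not the NAMIC API; Python is used purely for exposition.

    from dataclasses import dataclass
    from collections import defaultdict

    @dataclass(frozen=True)
    class Event:
        etype: str   # general event type, e.g. "company_acquisition"
        agent: str   # role filler for the acting entity
        obj: str     # role filler for the affected entity

    def link_articles(articles):
        """Link article ids whose events share a type and an agent."""
        index = defaultdict(set)
        for art_id, events in articles.items():
            for ev in events:
                index[(ev.etype, ev.agent)].add(art_id)
        links = set()
        for ids in index.values():
            for a in ids:
                for b in ids:
                    if a != b:
                        links.add(tuple(sorted((a, b))))
        return links

    news = {
        "en-1": [Event("company_acquisition", "Intel", "NKT unit")],
        "en-2": [Event("company_acquisition", "Intel", "ICP Vortex")],
        "it-1": [Event("company_acquisition", "Intel", "Xircom")],
    }
    print(link_articles(news))  # the three Intel items become mutually linked

On this toy model, hyperlinks among the Intel examples of Section 2.1 fall out of a simple shared-key grouping; the real system derives the keys from the analysis described next.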
<Paragraph position="1"> During morpho-syntactic analysis, a modular and lexicalised shallow morpho-syntactic parser (Basili et al., 2000b) extracts dependency graphs from source sentences. Ambiguity is controlled by part-of-speech tagging and by domain verb-subcategorisation frames that guide the dependency recognition phase.</Paragraph> <Paragraph position="2"> It is within the semantic analysis, which relies on the output of this parser, that objects in the text, and their relationships to key events, are captured. This process is explained in more detail in Section 4. In the next two sections, we elaborate on the IE engine. For a full description of the NAMIC architecture see (Basili et al., 2001).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 LaSIE </SectionTitle> <Paragraph position="0"> In NAMIC, we have integrated a key part of the Information Extraction system LaSIE (Large-Scale Information Extraction system (Humphreys et al., 1998)). Specifically, we have taken the Named Entity Matcher and the Discourse Processor from the overall LaSIE architecture. The role of each of these modules is outlined below.</Paragraph> <Paragraph position="1"> The Named Entity (NE) Matcher finds named entities (persons, organisations, locations and dates, in our case) through a secondary phase of parsing which uses an NE grammar and a set of gazetteer lists. It takes as input the parsed text from the first phase of parsing, the NE grammar, which contains rules for finding a predefined set of named entities, and a set of gazetteer lists containing proper nouns. The NE Matcher returns the text with the named entities marked. The NE grammar contains rules for coreferring abbreviations as well as different ways of expressing the same named entity, such as Dr. Smith, John Smith and Mr. Smith occurring in the same article.</Paragraph> <Paragraph position="2"> The Discourse Processor module translates the semantic representation produced by the parser into a representation of instances, their ontological classes and their attributes, in the XI knowledge representation language (Gaizauskas and Humphreys, 1996). XI allows a straightforward definition of cross-classification hierarchies, the association of arbitrary attributes with classes or instances, and a simple mechanism for inheriting attributes from classes or instances higher in the hierarchy.</Paragraph> <Paragraph position="3"> The semantic representation produced by the parser for a single sentence is processed by adding its instances, together with their attributes, to the discourse model constructed for the text. Following the addition of the instances mentioned in the current sentence, together with any presuppositions that they inherit, a coreference algorithm attempts to resolve, or in fact merge, each of the newly added instances with instances already in the discourse model.</Paragraph> <Paragraph position="4"> The merging of instances involves the removal of the least specific instance (i.e. the one higher in the ontology) and the addition of all its attributes to the other instance. This results in a single instance with more than one realisation attribute, which corresponds to a single entity mentioned more than once in the text, i.e. a coreference.</Paragraph>
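A minimal sketch of this merge operation, under assumed simplifications (a toy ontology and attribute lists; our own illustration, not the XI language itself):

    # Toy ontology: each class points to its parent (None for the root).
    ONTOLOGY = {"company": "organisation", "organisation": "object", "object": None}

    def depth(cls):
        """Depth in the hierarchy; deeper means more specific."""
        d = 0
        while ONTOLOGY.get(cls) is not None:
            cls = ONTOLOGY[cls]
            d += 1
        return d

    def merge(inst_a, inst_b):
        """Merge two coreferring instances, keeping the more specific one."""
        specific, general = sorted(
            (inst_a, inst_b), key=lambda i: depth(i["class"]), reverse=True)
        for attr, values in general["attrs"].items():
            specific["attrs"].setdefault(attr, []).extend(values)
        return specific  # one instance, several realisation attributes

    intel_a = {"class": "company",
               "attrs": {"realisation": ["Intel"], "name": ["Intel"]}}
    intel_b = {"class": "organisation",
               "attrs": {"realisation": ["the giant chip maker"]}}
    print(merge(intel_a, intel_b))  # "company" survives with both realisations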
<Paragraph position="5"> The mechanism described here is an extremely powerful tool for accomplishing the IE task. However, in common with all knowledge-based approaches, and as highlighted in the introduction to this paper, the significant overhead in terms of development and deployment lies in the creation of the world model representation.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Large-Scale World Model Acquisition </SectionTitle> <Paragraph position="0"> The traditional limitation of a knowledge-based information extraction system such as LaSIE has been the need to hand-code information for the world model, specifically relating to the event structure of the domain. This also holds for NAMIC. To aid the development of the world model, a semi-automatic bootstrapping process has been developed which creates the event-type component of the world model.</Paragraph> <Paragraph position="1"> In our view, event descriptions can be characterised as a set of regularly occurring verbs within the domain, complete with their subcategorisation information.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Event Hierarchy </SectionTitle> <Paragraph position="0"> The domain verbs can be selected according to statistical techniques and are, for the moment, subject to hand pruning. Once a list of verbs has been extracted, subcategorisation patterns can be generated automatically using a combination of weakly supervised, example-driven machine learning algorithms.</Paragraph> <Paragraph position="1"> There are three main induction steps. First, syntactic properties are derived for each verb, expressing the major subcategorisation information underlying those verbal senses which are most important in the domain. Then, in a second phase, verb usage examples are used to induce the semantic properties of nouns in argumental positions. This information relates to selectional constraints, independently assigned to each verb subcategorisation pattern. Thus, different verb senses are derived, able to describe the main properties of the domain events (e.g. Companies acquire companies). In a third and final phase, event types are derived by grouping verbs according to their syntactic-semantic similarities. Here, shared properties are used to generalise from the lexical level and to generate verbal groups expressing specific semantic (and thus conceptual) aspects. These types are then fed into the event hierarchy as required for their straightforward application within the target IE scenario.</Paragraph> <Paragraph position="2"> Each verb v is processed separately. First, each local context (extracted from sentences in the source corpus) is mapped into a feature vector describing: - the verb v of each vector (i.e. the lexical head of the source clause); - the different grammatical relationships (e.g. Subj and Obj for the grammatical subject and object respectively) as observed in the clause; - the lexical items, usually nouns, occurring in specific grammatical positions in the clause (e.g. the subject Named Entity).</Paragraph> <Paragraph position="3"> Then, vectors are clustered according to the set of shared grammatical (not lexical) properties: only the clauses showing the same relationships (e.g. all the Subj-v-Obj triples) enter the same subset C_i. Each cluster thus expresses a specific grammatical behaviour shared by several contexts (i.e. clauses) in the corpus. The shared properties in C_i characterise the cluster: they are necessary and sufficient membership conditions for the grouped contexts.</Paragraph>
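A minimal sketch of this grouping step (our own simplification of the lattice construction discussed below, with invented example contexts): each clause is reduced to the set of grammatical relations it exhibits, and it supports every cluster whose defining relation set it contains.

    from collections import defaultdict
    from itertools import combinations

    # Each context: the grammatical relations observed in one clause,
    # e.g. "Intel bought a unit of NKT" yields Subj, Obj and a PP.
    contexts = [
        {"Subj": "Intel", "Obj": "unit", "PP_of": "NKT"},
        {"Subj": "Intel", "Obj": "ICP Vortex"},
        {"Subj": "Intel", "Obj": "Xircom"},
    ]

    def subpatterns(relations):
        """All non-empty relation subsets; a context supports each of them."""
        rels = sorted(relations)
        for k in range(1, len(rels) + 1):
            for combo in combinations(rels, k):
                yield frozenset(combo)

    def cluster(contexts):
        """Map each candidate relation set to the contexts exhibiting it."""
        clusters = defaultdict(list)
        for ctx in contexts:
            for pattern in subpatterns(ctx.keys()):
                clusters[pattern].append(ctx)
        return clusters

    for pattern, members in cluster(contexts).items():
        print(sorted(pattern), len(members))

Because a context supports every subset of its own relations, larger relation sets collect fewer contexts, which is exactly the partial order discussed next.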
<Paragraph position="4"> As one context can enter more than one cluster (since it can share all or part of its relations with others), the inclusion property establishes a natural partial order among clusters. A cluster C_i is included in another cluster C_j if its set of properties is larger (i.e. P_i ⊇ P_j) but is exhibited by only a subset of the contexts of the latter, C_j. The larger the set of membership constraints, the smaller the resulting cluster. In this way, clusters are naturally organised into a lattice (called a Galois lattice). The complete property sets express, for each cluster, candidate subcategorisation patterns for the target verb v.</Paragraph> <Paragraph position="5"> Finally, the lattice is traversed top-down and the search stops at the more important clusters (i.e. those showing a large set of members and characterised by linguistically appealing properties): they are retained, and a lexicon of subcategorisation structures (i.e. grammatical patterns describing different usages of the same verb) is compiled for the target verb v. For example, (buy, [Subj:X, Obj:Y]) can be used to describe the transitive usage of the verb buy. More details can be found in (Basili et al., 1997).</Paragraph> <Paragraph position="6"> The lattice can be further refined to express semantic constraints over the syntactic patterns specified at the previous stage. A technique proposed in (Basili et al., 2000a) is adopted, deriving semantic constraints via synsets (i.e. synonymy sets) drawn from the WordNet 1.6 base concepts (part of EuroWordNet). When a given lattice node expresses a set of syntactic properties, this suggests: - a set of grammatical relations r_1, ..., r_n necessary to express a given verb meaning; and - references to the source corpus contexts E where the grammatical relations are realised in texts.</Paragraph> <Paragraph position="7"> This information is used to generalise verb arguments. For each node/pattern, the nouns appearing in the same argumental position i (in at least one of the referred examples in the corpus) are grouped together to form a noun set N_i: a learning algorithm based on EuroWordNet derives the most informative EuroWordNet synset(s) for each argument, activated by the members of N_i. The most informative synsets are those capable of (1) generalising as many nouns as possible in N_i, while (2) preserving their specific semantic properties. A metric based on conceptual density (Agirre and Rigau, 1995) is employed here to detect the promising, most specific generalisations Gen(N_i) of N_i. The derived sets for each argument r_1, ..., r_n are then used to generate the minimal set of semantic patterns G_1, ..., G_n capable of "covering" all the examples in E, with G_i in Gen(N_i) for each argument position i. These sequences express the most promising generalisations of the examples E for the subcategorisation r_1, ..., r_n.</Paragraph>
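The following toy stand-in illustrates the generalisation step: given the nouns observed in one argument position and a small invented hypernym hierarchy, it prefers ancestors that cover many of the nouns while dominating few leaves. The scoring is a deliberately crude proxy, not the conceptual density metric of Agirre and Rigau (1995), and the hierarchy is invented.

    # Invented hypernym fragment: node to parent (None for the root).
    HYPERNYMS = {
        "company": "organisation", "bank": "organisation",
        "organisation": "entity", "person": "entity", "entity": None,
    }

    def ancestors(noun):
        """The noun itself plus all its hypernyms up to the root."""
        chain = []
        while noun is not None:
            chain.append(noun)
            noun = HYPERNYMS.get(noun)
        return chain

    def leaves_under(node):
        """Leaves dominated by a node; a crude proxy for its generality."""
        kids = [n for n, p in HYPERNYMS.items() if p == node]
        if not kids:
            return 1
        return sum(leaves_under(k) for k in kids)

    def best_generalisation(arg_nouns):
        """Cover as many argument nouns as possible while staying specific."""
        candidates = {a for n in arg_nouns for a in ancestors(n)}
        def score(c):
            covered = sum(1 for n in arg_nouns if c in ancestors(n))
            return (covered, -leaves_under(c))
        return max(candidates, key=score)

    print(best_generalisation(["company", "bank"]))  # -> organisation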
<Paragraph position="8"> As an example, (buy, [Agent:Company, Object:Company]) expresses the knowledge required for matching sentences like "Intel buys Vortex". Full details of the above process can be found in (Basili et al., 2000a). Note that Company is a base concept in EuroWordNet and is shared among the three languages. It can thus be activated via the Inter-Lingual-Index from lexical items of any language.</Paragraph> <Paragraph position="9"> If included in the world model (as concepts in the object hierarchy), these base concepts play the role of a multilingual abstraction for the event constraints.</Paragraph> <Paragraph position="10"> The final phase in the development of a large-scale world model aims to link the event matching rules valid for one verb to the suitable event hierarchy nodes. The following semi-automatic process can be applied: - first, a limited set of high-level event types can be defined by studying the corpus and via knowledge engineering techniques (e.g. interactions with experts of the domain); - then, semantic descriptions of verbs can be grouped automatically, according to the similarity among their corresponding patterns; - finally, the obtained verb groups can be mapped to the high-level types, thus resulting in a flat hierarchy.</Paragraph> <Paragraph position="11"> An example of the target event hierarchy is given in the accompanying figure. Currently, a set of event types (8 main groupings in a financial domain, ranging from "Company Acquisitions" and "Company Assets" to "Regulation") has been defined. Within the eight event groupings, we acquired more than 3000 lexicalisations of events.</Paragraph> <Paragraph position="12"> The clustering step has been approached with a technique similar to the Galois lattices, where feature vectors represent the syntactic-semantic properties of the different verbs (i.e. the patterns G_1, ..., G_n derived in the previous phase). All verbs are considered[1], and the obtained clusters represent semantic abstractions valid for more than one verb. One such cluster, for example, groups verbs ranging from acquire to win. Such a cluster expresses a conceptual property able to suggest a specific event subtype. Thus, manual mapping to the correct high-level concept (the "Company Acquisition" event type) is made possible and more intuitive. As the semantic constraints in event types are given by base concepts, translations into Italian and Spanish rules (for example, (acquistare, [Agent:Company, Object:Company])) are possible; they inherit the same topological position in the event ontology. Accordingly, the world model has a structure (i.e. the main object and event hierarchies) which is essentially language independent. Only the lowest levels are representative of each language: here, a language-specific lexicalisation is required. The advantage is that most of the groups derived for English can be retained for the other languages, and a simple translation suffices for most of the patterns. Lexicalisations are thus associated with the language-independent abstractions (i.e. matching rules over parsed texts) which control the behaviour of instances of these events in the discourse processing.</Paragraph>
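A sketch of the shape such a rule might take once base concepts anchor the constraints and only the lexicalisations vary by language. The rule format, the matches helper and the Spanish verb choice are our own illustrative assumptions, not the NAMIC rule syntax.

    # A language-independent event rule: constraints are EWN base concepts.
    ACQUISITION = {
        "event_type": "company_acquisition",
        "constraints": {"Agent": "Company", "Object": "Company"},
        # Per-language lexicalisations attached to the same abstraction.
        "lex": {
            "en": ["buy", "acquire"],
            "it": ["acquistare"],
            "es": ["adquirir"],  # illustrative; the project lexicon may differ
        },
    }

    def matches(rule, verb, lang, arg_classes):
        """Check a parsed clause against the rule for a given language."""
        if verb not in rule["lex"].get(lang, []):
            return False
        return all(arg_classes.get(role) == concept
                   for role, concept in rule["constraints"].items())

    # "Intel ha acquistato Xircom": both arguments resolve to Company.
    print(matches(ACQUISITION, "acquistare", "it",
                  {"Agent": "Company", "Object": "Company"}))  # True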
<Paragraph position="13"> The integrated adoption of EuroWordNet and the automatic acquisition/translation of verb rules is thus the key idea leading to the successful and quick development of the large-scale IE component required for automatic authoring.</Paragraph> <Paragraph position="14"> [1] Initial partitions according to the Levin classification (Levin, 1993) are adopted: a partition of the verbs is built for each of the Levin classes, and conceptual clustering is applied internally to each group.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Object Hierarchy </SectionTitle> <Paragraph position="0"> In typical Information Extraction processing environments, the range of objects in the text is expected to be as limited and constrained as the event types. For example, when processing 'management succession' events (MUC-6, 1995), the object types are the obvious person, location, organisation, time and date.</Paragraph> <Paragraph position="1"> Intuitively, however, if the need is to process the entire output of a news-gathering organisation, it seems clear that we must be able to capture a much wider range of possible objects which interact with the central events. Rather than attempt to acquire all of this object information from the corpus data, we chose instead to use an existing multilingual lexical resource, EuroWordNet.</Paragraph> <Paragraph position="2"> EuroWordNet (Vossen, 1998) is a multilingual lexical knowledge base (KB) comprising hierarchical representations of lexical items for several European languages (Dutch, Italian, Spanish, German, French, Czech and Estonian). The wordnets are structured in the same way as the English WordNet developed at Princeton (Miller, 1990), in terms of synsets (sets of synonymous words) with basic semantic relations between them.</Paragraph> <Paragraph position="3"> In addition, the wordnets are linked to an Inter-Lingual-Index (ILI), based on the Princeton WordNet 1.5 (WordNet 1.6 is also connected to the ILI as another English wordnet (Daude et al., 2000)). Via this index, the languages are interconnected so that it is possible to go from concepts in one language to concepts in any other language having a similar meaning. Such an index also gives access to a shared top ontology and a subset of 1024 Base Concepts (BC). The Base Concepts provide a common semantic framework for all the languages, while language-specific properties are maintained in the individual wordnets. The KB can be used, among other things, for monolingual and cross-lingual information retrieval, as demonstrated by (Gonzalo et al., 1998).</Paragraph>
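A toy rendering of the ILI hop just described, with invented table fragments and synset identifiers standing in for the EWN data: a word is mapped to its synset, the synset to an ILI record, and from there either to a shared base concept or to the equivalent synset of another language.

    # Invented fragments of two wordnets and the Inter-Lingual-Index.
    SYNSET = {("es", "empresa"): "es-00017", ("en", "company"): "en-00042"}
    ILI = {"es-00017": "ili-0007", "en-00042": "ili-0007"}
    BASE_CONCEPT = {"ili-0007": "BC_Company"}
    SYNSET_OF = {("ili-0007", "en"): "en-00042", ("ili-0007", "es"): "es-00017"}

    def base_concept(word, lang):
        """Generalise a word of any covered language to a shared base concept."""
        return BASE_CONCEPT[ILI[SYNSET[(lang, word)]]]

    def cross_lingual(word, src, tgt):
        """Cross the ILI to reach the equivalent synset in another language."""
        return SYNSET_OF[(ILI[SYNSET[(src, word)]], tgt)]

    print(base_concept("empresa", "es"))         # BC_Company
    print(cross_lingual("empresa", "es", "en"))  # en-00042

This is the mechanism exploited below: because both the runtime lookup and the acquired rules bottom out in base concept codes, the same rule can fire on English, Italian or Spanish input.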
<Paragraph position="4"> The example rules shown in the previous section relate to Agents which conveniently belong to a class of Named Entities as would be easily recognised under the MUC competition rules (person, company and location, for example). However, a majority of the rules extracted automatically from the corpus data involve other kinds of semantic classes of information, which play key roles in the subcategorisation patterns of the verbs. In order to be able to work with these patterns, it was necessary to extend the set of semantic classes beyond the usual predefined ones, across a variety of languages.</Paragraph> <Paragraph position="5"> Representing the entirety of EWN in our object hierarchy would be time-consuming and would lead to inefficient processing. Instead, we took advantage of the Base Concepts (Rodriguez et al., 1998) within EWN, a set of approximately 1000 nodes, with hierarchical structure, that can be used to generalise the rest of the EWN hierarchy.</Paragraph> <Paragraph position="6"> These Base Concepts represent a core set of common concepts to be covered for every language defined in EWN. A concept is determined to be important (and is therefore a base concept) if it is widely used, either directly or as a reference for other widely used concepts. Importance is reflected in the ability of a concept to function as an anchor to which other concepts can be attached.</Paragraph> <Paragraph position="7"> The hierarchical representation of the base concepts is added to the object hierarchy of the NAMIC world model. Additionally, a concept lookup function is added to the namematcher module of the NAMIC architecture. This lookup takes all common nouns in the input and translates them into their respective EWN Base Concept codes.</Paragraph> <Paragraph position="8"> This process is mirrored in the event rule acquisition stage, where each occurrence of an object in a rule is translated into a Base Concept code. This has two effects. Firstly, the rules become more generic, creating a more compact rule base. Secondly, given the nature of the inter-lingual index which connects the EWN lexicons, the rules become language independent at the object level. Links between the lexicalisations of events are still required, and at present are hand-coded, but future development of the verb representations of WordNet might eliminate this.</Paragraph> <Paragraph position="9"> In summary, this new, expanded world model covers both the domain-specific events and a wide range of agents; it can be acquired largely automatically from corpus data and used to process large amounts of text across a spectrum of domains by leveraging existing multilingual lexical resources.</Paragraph> </Section> </Section> </Paper>