<?xml version="1.0" standalone="yes"?> <Paper uid="P97-1066"> <Title>Knowledge Acquisition from Texts : Using an Automatic Clustering Method Based on Noun-Modifier Relationship</Title> <Section position="3" start_page="0" end_page="504" type="metho"> <SectionTitle> 2 The morpho-syntactic analysis: the LEXTER software </SectionTitle> <Paragraph position="0"> LEXTER is a terminology extraction tool (Bourigault et al., 1996). A corpus of French texts on any technical subject can be fed into it. LEXTER performs a morpho-syntactic analysis of this corpus and produces a network of noun phrases which are likely to be terminological units.</Paragraph> <Paragraph position="1"> Any complex term is recursively broken up into two parts: head (e.g. PLANNING in the term REGIONAL NETWORK PLANNING), and expansion (e.g.</Paragraph> <Paragraph position="2"> REGIONAL in the term REGIONAL NETWORK). [Footnote 1: All the examples given in this paper are translated from French.] This analysis allows the organisation of all the candidate terms in a network format, known as the</Paragraph> <Paragraph position="3"> &quot;terminological network&quot;. Each analysed complex candidate term is linked to both its head (H-link) and expansion (E-link).</Paragraph> <Paragraph position="4"> LEXTER also extracts phraseological units (PUs), which are &quot;informative collocations of the candidate terms&quot;. For instance, CONSTRUCTION OF THE HIGH-VOLTAGE LINE is a PU built with the candidate term HIGH-VOLTAGE LINE. Like the candidate terms, PUs are recursively broken up into two parts; the corresponding links are called H'-link and E'-link.</Paragraph> <Paragraph position="5"> 3 The data for the clustering module The candidate terms extracted by LEXTER can be NPs or adjectives. In this paper, we focus on NP clustering. An NP is described by its &quot;terminological context&quot;. The four syntactic links of LEXTER can be used to define this terminological context. 
For instance, the &quot;expansion terminological context&quot; (E-terminological context) of an NP is the set of candidate terms appearing in the expansion of the more complex candidate terms that contain the current NP in head position. For example, the candidate terms (NATIONAL NETWORK, REGIONAL NETWORK, DISPATCHING NETWORK) give the context (NATIONAL, REGIONAL, DISPATCHING) for the noun NETWORK.</Paragraph> <Paragraph position="6"> If we suppose that the modifiers represent specialisations of a head NP by giving a specific attribute of it, NPs described by similar E-terminological contexts will be semantically close. These semantic similarities allow the KE to build conceptual fields in the early stages of the KA process.</Paragraph> <Paragraph position="7"> The links around an NP within a PU are also interesting. Candidate terms appearing in head position in a PU containing a given NP could denote properties or actions related to this NP. For instance, the PUs LENGTH OF THE LINE and NOMINAL</Paragraph> </Section> <Section position="4" start_page="504" end_page="504" type="metho"> <SectionTitle> POWER OF THE LINE show two properties (LENGTH </SectionTitle> <Paragraph position="0"> and NOMINAL POWER) of the object LINE; the PU</Paragraph> </Section> <Section position="5" start_page="504" end_page="504" type="metho"> <SectionTitle> CONSTRUCTION OF THE LINE shows an action </SectionTitle> <Paragraph position="0"> (CONSTRUCTION) which can be applied to the object LINE.</Paragraph> <Paragraph position="1"> This definition of the context is original compared with the classical context definitions used in Information Retrieval, where the context of a lexical unit is obtained by examining its neighbours (collocations) within a fixed-size window. 
Given that candidate term extraction in LEXTER is based on a morpho-syntactic analysis, our definition allows us to group collocation information that is scattered through the corpus under different inflections (the candidate terms of LEXTER are lemmatised), and it takes into account the syntactic structure of the candidate terms. For instance, LEXTER extracts the complex candidate term BUILT DISPATCHING LINE and analyses it as (BUILT (DISPATCHING LINE)); the adjective BUILT will appear in the terminological context of DISPATCHING LINE and not in that of DISPATCHING. Only the first context is relevant, since BUILT characterises the DISPATCHING LINE and not the DISPATCHING.</Paragraph> <Paragraph position="2"> To perform NP clustering, we prepared two data sets: in the first, NPs are described by their E-terminological context; in the second, both the E-terminological context and the H'-terminological context (obtained with the H'-link within PUs) are used. The same filtering method (footnote 2) and clustering algorithm are applied in both cases.</Paragraph> <Paragraph position="3"> Table 1 shows an extract from the first data set.</Paragraph> <Paragraph position="4"> The columns are labelled by the expansions (nominal or adjectival) of the NPs being clustered. Each row represents an NP (an individual, in statistical terms): there is a '1' when the term built with the NP and the expansion exists (e.g. 
REGIONAL NETWORK is extracted by LEXTER), and a '0' otherwise (&quot;national line&quot; is not extracted by LEXTER).</Paragraph> </Section> <Section position="6" start_page="504" end_page="504" type="metho"> <SectionTitle> NATIONAL DISPATCHING REGIONAL </SectionTitle> <Paragraph position="0"> In the remainder of this article, we describe the way a KE uses LEXICLASS to build &quot;conceptual fields&quot; and we also compare the clusterings obtained from the two different data sets.</Paragraph> </Section> <Section position="7" start_page="504" end_page="505" type="metho"> <SectionTitle> 4 The conceptual analysis: the LEXICLASS software </SectionTitle> <Paragraph position="0"> LEXICLASS is a clustering tool written in C with specialised data analysis functions from the Splus(TM) software.</Paragraph> <Paragraph position="1"> Given the individuals-variables matrix above, a similarity measure between the individuals is calculated (footnote 3) and a hierarchical clustering method is applied to the resulting similarity matrix. This kind of method produces a classification tree (or dendrogram), which has to be cut at a given level in order to produce clusters. For example, this method, applied to a population of 221 NPs (data set 1), gives 21 clusters; Figure 1 shows an example of such a cluster. [Footnote 2: This filtering method is mandatory, given that the chosen clustering algorithm cannot be applied to the whole terminological network (several thousands of terms) and that the results have to be validated by hand. We have no space to give details about this method, but it is very important for obtaining proper data for clustering.] The interpretation, by the KE, of the results given by the clustering methods applied to the data of Table 1 leads him to define conceptual fields. Figure 1 shows the transition from an automatically found cluster to a conceptual field: the KE constitutes the conceptual field of &quot;the structures&quot;. 
He populates it with concepts by either validating a candidate term (e.g. LINE) or reformulating one (e.g. PRIMARY is an ellipsis and leads the KE to create the concept primary substation). The other candidate terms are not kept because the KE considers them irrelevant. The conceptual fields have to be completed throughout the KA process. At the end of this operation, the candidate terms appearing in a conceptual field are validated. This first stage of the KA process is also the opportunity for the KE to constitute synonym sets: the synonymous terms are grouped, one of them is chosen as the concept label, and the others are kept as the values of a generic attribute, labels, of the concept in question (see Figure 2 for an example).</Paragraph> </Section> </Paper>