<?xml version="1.0" standalone="yes"?> <Paper uid="W94-0104"> <Title>Study and Implementation of Combined Techniques for Automatic Extraction of Terminology Béatrice Daille TALANA</Title> <Section position="1" start_page="0" end_page="29" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> This paper presents an original method, and its implementation, for extracting terminology from corpora by combining linguistic filters and statistical methods. Starting from a linguistic study of the terms of the telecommunications domain, we designed a number of filters which give us a first selection of sequences that may be considered as terms. Various statistical scores are then applied to this selection and the results are evaluated.</Paragraph> <Paragraph position="1"> This method has been applied to French and to English, but this paper deals only with French.</Paragraph> <Paragraph position="2"> Introduction A terminology bank contains the vocabulary of a technical domain: terms, which refer to its concepts.</Paragraph> <Paragraph position="3"> Building a terminology bank requires a lot of time and both linguistic and technical knowledge. The issue at stake is the automatic extraction of the terminology of a specific domain from a corpus. Current research on extracting terminology uses either linguistic specifications or statistical approaches. Concerning the former, [Bourigault, 1992] has proposed a program which automatically extracts from a corpus sequences of lexical units whose morphosyntax characterizes maximal technical noun phrases. This list of sequences is given to a terminologist to be checked. For the latter, several works ([Lafon, 1984], [Church and Hanks, 1990], [Calzolari and Bindi, 1990], [Smadja and McKeown, 1990]) have shown that statistical scores are useful for extracting collocations from corpora. 
The main problem with either approach is the &quot;noise&quot;: morphosyntactic criteria are not sufficient to isolate terms, and collocations extracted by statistical methods belong to various types of associations: functional, semantic, thematic or uncharacterizable ones.</Paragraph> <Paragraph position="4"> Our goal is to use statistical scores for extracting technical compounds only and to disregard the other types of collocations. We proceed in two steps: first, apply a linguistic filter which selects candidates from the corpus; then, apply statistical scores to rank these candidates and select the scores which fit our purpose best, in other words scores that concentrate their high values on terms and their low values on co-occurrences which are not terms.</Paragraph> <Section position="1" start_page="0" end_page="29" type="sub_section"> <SectionTitle> Linguistic Data </SectionTitle> <Paragraph position="0"> In a first part, we therefore study the linguistic specifications on the nature of terms in the technical domain of telecommunications for French. Then, taking these linguistic results into account, we present the method and the program which extracts and counts the candidate terms.</Paragraph> <Paragraph position="1"> Linguistic specifications Terms are mainly multi-word units of nominal type that can be characterized by a range of morphological, syntactic or semantic properties. The main property of nominal terms is the morphosyntactic one: their structure belongs to well-known morphosyntactic structures such as N ADJ, N1 de N2, etc. that have been studied by [Mathieu-Colas, 1988] for French. Some graphic indications (hyphen), morphological indications (restrictions in inflection) and syntactic ones (absence of determiners) can also be good clues that a noun phrase is a term.</Paragraph> <Paragraph position="2"> We have also employed a semantic criterion: the criterion of unique referent. 
A term refers to a unique and universal concept. However, it is not easy to apply this criterion in a technical domain in which we are not experts.</Paragraph> <Paragraph position="3"> So, we have interpreted the criterion of unique referent as one of unique translation. A French term is always identically translated, mostly by a compound or a simple noun in English. We have manually extracted terms following these criteria from our bilingual corpus, available in French and English, the Satellite Communication Handbook (SCH), containing 200 000 words in each language. Then, we have classified terms by length; the length of a term is defined as the number of main items it contains. From this classification, it appears that terms of length 2 are by far the most frequent ones. As statistical methods require a sufficiently large number of samples, we decided to extract in a first round only terms of length 2, which we call base-terms, that match a list of previously determined patterns: N ADJ: station terrienne (Earth station); N1 de (DET) N2: zone de couverture (coverage zone); N1 à (DET) N2: réflecteur à grille (grid reflector); N1 PREP N2: liaison par satellite (satellite link); N1 N2: diode tunnel (tunnel diode). Of course, terms exist whose length is greater than 2.</Paragraph> <Paragraph position="4"> But the majority of terms of length greater than 2 are created recursively from base-terms. We have distinguished three operations that lead to a term of length 3 from a term of length 1 or 2: &quot;overcomposition&quot;, modification, and coordination. We now illustrate these operations with a few examples where the base-terms appear inside brackets: 1. 
Overcomposition Two kinds of overcomposition have been pointed out: overcomposition by juxtaposition and overcomposition by substitution.</Paragraph> <Paragraph position="5"> (a) Juxtaposition A term obtained by juxtaposition is built with at least one base-term whose structure will not be altered. The example below illustrates the juxtaposition of a base-term and a simple noun:</Paragraph> <Paragraph position="7"> Given a base-term, one of its main items is substituted by a base-term whose head is this main item. For example, in the N1 PREP1 N2 structure, N1 is substituted by a base-term of N1 PREP2 N3 structure to create a term of N1 PREP2 N3 PREP1 N2 structure: réseau à satellites + réseau de transit → réseau de transit à satellites (satellite transit network). We notice in the above example that the structure of réseau à satellites (satellite network) is altered.</Paragraph> </Section> </Section> </Paper>