<?xml version="1.0" standalone="yes"?> <Paper uid="C73-2005"> <Title>NICOLETTA CALZOLARI- LAURA PECCHIA- ANTONIO ZAMPOLLI* WORKING ON THE ITALIAN MACHINE DICTIONARY: A SEMANTIC APPROACH</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> 1. GENERAL FRAMEWORK </SectionTitle> <Paragraph position="0"> 1.1. Foreword.</Paragraph> <Paragraph position="1"> The work described by the two co-authors of this article is presented with a double objective: apart from giving specific details on a particular project, they also wish to provide a concrete example of the type of research which has been made possible by the Italian Machine Dictionary (DMI).</Paragraph> <Paragraph position="2"> The DMI is, in fact, one of the principal projects of the Linguistics Division (DL) of CNUCE. Other articles in the first volume of the Proceedings also refer to the DMI. 1 In this introduction I intend to indicate briefly how the DMI project and, in particular, the research described in the article fit into the framework of the whole complex of activities of the DL and into our general conception of linguistic data processing (LDP).</Paragraph> <Paragraph position="3"> As I have already stated in my introduction to these Proceedings, 2 it is my conviction that, at this moment, special care should be taken to promote, both on the theoretical and on the practical level, systematic and ordered interaction among the many different LDP activities. In particular, this cooperation should be realized between those activities which focus on the construction of theoretical models and those focussing on the processing of large corpora of linguistic data. The activity of the DL, especially in recent years, has been increasingly directed towards this goal.</Paragraph> <Paragraph position="4"> * A. 
Zampolli is the author of Part 1; N. Calzolari and L. Pecchia are the authors of Part 2. 1 See Vol. I, 1, pp. 257-262 and 297-301.</Paragraph> <Paragraph position="5"> 2 See Vol. I, 1, pp. xx-xxa.</Paragraph> <Paragraph position="6"> 1.2. Activities of the Linguistics Division (DL).</Paragraph> <Paragraph position="7"> For approximately 10 years all, or almost all, of the research projects in the different fields of linguistic data processing in Italy have been worked out with the collaboration of the DL on the computational side of their work. 3 In the field of lexicography, large corpora of texts have been processed in order to produce the lexical archives necessary to construct extensive historical language dictionaries (see, for example, the Tesoro della lingua italiana delle origini of the Accademia della Crusca), or dictionaries of &quot;languages for special purposes&quot; (e.g. the Dizionario Giuridico of the Istituto per la Documentazione Giuridica). 4 In both modern and classical philological research, the computer is now used with increasing frequency in Italy in order to automate the customary and traditionally time-consuming task of indexing texts and producing concordances from them (e.g. the project for the analysis of the corpus of Grammatici Latini, ed. Keil), 5 and also for a number of more specific, complex operations, such as the automatic comparison of different editions of the same text (e.g. the project for the 'contrastive concordances' of Orlando Furioso of L. Ariosto). 6 Literary criticism and the history of literature are also beginning to make use of similar procedures, employing, in particular, statistical 3 For a more detailed description and the relevant bibliography see ZAMPOLLI, 1973a, 1973b, 1977a. It is necessary to emphasize an important consequence of this fact. Firstly, almost all the projects underway in Italy in this sector adopt the standards introduced by the DL. 
In addition, an automatic library containing over 5000 texts in more than 20 languages has been established. This archive may be processed with general-purpose standardized programs because all the texts have been stored using the same scientific and technical criteria. Thus it is possible to perform some linguistic research operations which would otherwise be impossible. For example, one of our projects aims at constructing a new model of the quantitative aspects of the language, on the basis of the data provided by this archive. The earlier models have been falsified by the new quantitative data produced by the increasing number of text-processing projects underway in different countries. As a first step, we aim at identifying those linguistic facts which have a stable frequency in the texts of a language, those which have a frequency which is stable only within certain subsets of a language (literary genres, single authors, particular themes, etc.), and those whose frequency does not show appreciable regularity. In a second stage, an attempt will be made to construct and verify quantitative models to describe the regularities actually found and to identify the contextual factors connected with such regularities.</Paragraph> <Paragraph position="8"> 4 See A. DURO (1973), C. CIAMPI (1973) and F. DIMITRESCU (1973).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 See GmLLI and others (1978). 6 See SEGRE-ZAMPOLLI (1974). </SectionTitle> <Paragraph position="0"> processing as an auxiliary tool in the study of the style of individual authors, schools, or literary genres. 
7 Linguistic statistics is also adopted in psycho-linguistic studies, for example to &quot;measure&quot; the linguistic alterations introduced by certain nosological categories. 8 A combination of statistical processing and algorithms of the &quot;pattern recognition&quot; type is used in a heuristic way on traditional oral texts to identify clauses, formulae, and, in general, the various elements of the popular repertory. 9 In all the above-mentioned types of projects, electronic data processing essentially aims at organizing, in computer storage or in printed form, all the linguistic units of a certain level (words, syntagms, syntactical structures, etc.) occurring in a text, in order to enable a more efficient, rapid and economic retrieval of them. In other words, the processing basically consists of the following types of operations: to input, store, and manipulate texts of different kinds (which may be considered as facts of la parole); to recognize and explicitly represent in the text the occurrence of linguistic units (phonemes, lemmas, affixes, syntagms, syntactical types, etc.: these units may be considered to be at the level of la langue); to execute some canonical operations (retrieval, ordering, counting, comparing, etc.) 
on such units, in batch or conversational form.</Paragraph> <Paragraph position="1"> We also cooperate with some projects in the field of full-text information retrieval, which also uses lexicographical-type processing for documentary purposes, mainly on juridical and historical texts.</Paragraph> <Paragraph position="2"> All the above-mentioned activities make use of closely inter-related procedures which the DL has developed and put into operation with the collaboration of various Italian Universities and CNR Institutes.</Paragraph> <Paragraph position="3"> More exactly, it could be said that the DL has realized, or is in the process of realizing, a certain number of basic processing &quot;components&quot; and that each of the procedures so far developed consists of the concatenation of some of these components.</Paragraph> <Paragraph position="4"> The functions of each of these components are well-known within the LDP environment: the acquisition of texts in machine readable form; the production of the typical results of lexical analysis (different types of concordances, context-cards, etc.); the representation of the large variety of characters typical of LDP; morphological analyses</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> </SectionTitle> <Paragraph position="0"> and consultation of Machine Dictionaries (DMs); syntactical parsers; phonological transcription; etc.</Paragraph> <Paragraph position="1"> I feel that the following three characteristics of these components should be emphasized.</Paragraph> <Paragraph position="2"> a) They are conceived so as to be, as far as possible, generalized (i.e. 
applicable to all the texts processed at the DL, whatever their nature, language, or the purpose of the processing), 10 flexible (the user can activate, within the set of rules which constitute the &quot;algorithmic linguistic knowledge&quot; of the program, those rules which best respond to his particular needs), 11 and modular (the components must be inter-compatible and open to the inclusion of any new components: the inter-compatibility is ensured by exchange interfaces between the various components; these interfaces consist of a formalism which provides structures, organizations and codes for the representation of linguistic units both at the text and at the linguistic system level).</Paragraph> <Paragraph position="3"> b) These components may be used - at least in principle - with the same basic functions both in lexicographical-philological type applications and in translation, documentation, question-answering, etc. 12 10 For example, the component proposed for the acquisition of texts in machine readable form performs the following functions: accepts, as input, texts in any natural language (as long as they can be transcribed alphabetically) of any period, or literary form or genre (scientific texts, recorded dialogs, protocols, interviews, novels, inventories, etc.); stores the texts on auxiliary memory; produces listings which reproduce the text as near as possible to its original form; supplies text editing facilities for checking and correction of possible errors. 
At the basis of this component is an encoding system which is designed to represent all the different graphemes and graphic features which can appear in printed texts or can be inserted in them in the pre-editing stage.</Paragraph> <Paragraph position="4"> 11 For example, the context of a word can be constructed and delimited by activating and ordering diversely a suitably chosen subset of the available rules from a general contextualisation algorithm (see ZAMPOLLI, 1971): to make the context coincide with a structural unit (verse, strophe, etc.); to delimit the context exclusively on the basis of the punctuation immediately preceding or following it; to assign a specific portion of the syntactic structure as context, etc.</Paragraph> <Paragraph position="5"> 12 In particular, at the beginning of the 60s, attempts were made to classify the different systems for LDP according to the so-called 'depth-parameter' of the linguistic level of operation. Such classifications selected a certain &quot;depth&quot; level along this parameter, and drew in correspondence to this level the demarcation line between the uses of the computer in linguistics which merit the name computational linguistics (CL) and those which do not.</Paragraph> <Paragraph position="6"> Our viewpoint is different. All computational systems which serve linguistic research or which operate on linguistic data belong to CL. Besides, at least in principle, the majority of those systems, regardless of whether they are considered to be below or above an established demarcation line, have a number of components c) They are, as far as possible, the result of studies which are both research and operationally oriented.</Paragraph> <Paragraph position="7"> 1.3. 
The Italian Machine Dictionary (DMI).</Paragraph> <Paragraph position="8"> The DMI has also been realized in accordance with these criteria.</Paragraph> <Paragraph position="9"> It has been conceived and is used as a means for semi-automatic lemmatisation, i.e. for the recognition of the occurrences of the various units of the Italian lexical system within a text. It is used in lexicographical, statistical, philological text processing and is utilized in full-text information retrieval systems in order to identify in the documents all the different forms which belong to the same lemma as a specific form appearing in the &quot;question&quot; asked by the user. It will be used to associate with the words of a text the information required by syntactic and semantic parsers (morpho-syntactical categories, syntactical &quot;valences&quot;, semantic markers, etc.). 13 In the lemmatisation stage, the DMI can be adapted by the user to obtain lexical analyses at different levels of complexity. We think of the definition of a lexical unit (lemma) as a set of pertinent features (morphological, syntactical, graphical, etc.). Different inflected forms in common. For example, a procedure for lexical analysis necessitates: the acquisition in machine readable form and the computer printing of a variety of texts and graphemes; a morphological analyzer and the consultation of a DM for semi-automatic lemmatisation; syntactic and semantic parsers for homograph disambiguation. An automatic translation system requires all these features (in addition to the transfer and generation components).</Paragraph> <Paragraph position="10"> 13 Of course, we have considered whether it would be possible and convenient to compile a DM without having first defined in detail the components which will use the linguistic information contained in it. 
As an example, let us consider the choice and the formalization of grammatical information (morpho-syntactical categories, valences, specification of possible constructs, etc.) to be coded in the dictionary as &quot;input&quot; of a syntactic parser. Obviously, this depends on the grammatical model and the strategy used by the parser. This does not necessarily mean, however, that once a DM has been compiled with specifically chosen grammatical information, it is necessary to replace the grammatical part of the DM if the grammatical model should change. Although there are a number of different opinions on this important point, our experience has suggested that, if anything, it will be necessary to extend and complete the already existing information rather than to replace it. In the majority of cases, independently of the definition of their theoretical status, the basic syntactical properties of a lexical unit may be formulated in a neutral way with respect to the model and systems which use them. This claim can be largely verified, at least for models within the same &quot;scientific paradigm&quot;, e.g. the generative-transformational ones. Nevertheless, there is perhaps enough evidence to assert that the basic information, at the morpho-syntactical level, is still, to a large extent, valid, even when considering other paradigms such as the so-called &quot;artificial intelligence paradigm&quot;.</Paragraph> <Paragraph position="11"> of a text are considered to belong to the same lemma if and only if they have in common all the pertinent features which identify a lemma, distinguishing it from all other lemmas. We have constructed an inventory of features which may be used in the definition of a lexical unit. Such an inventory is based upon a survey of the features used both in lexicographic practice and in linguistic theories. 
Each entry of the DMI is associated with the set of all the possible features of the inventory which may be used in its definition. The user is allowed to deactivate those features which he does not wish to utilize: for example, the differences between nominal and verbal use of participles or those between adjectival pronouns and pronouns, etc. Obviously, if some distinctions are neutralized, the number of lexical units which constitute the DMI, as defined by the user, and very often the number of possible homographs, are reduced. In other words, if we consider the DMI a concrete representation of the Italian lexical system, in which the lexical units are defined using all the features proposed by the different lexicological and lexicographical traditions, the user can modify the structure of this system and the inventory of its lexical units in accordance with his specific linguistic requirements (Zampolli, 1973a).</Paragraph> <Paragraph position="12"> In this perspective, the DMI is used not only as a tool for text processing but also as an object of studies and research in itself.</Paragraph> <Paragraph position="13"> While in studies at the level of la parole the object is given immediately for LDP in the form of corpora of texts, the object in studies on la langue must be specifically constructed. An example which can be given is the first step in research on the functional load of the phonological oppositions of a phonematic system. This step consists of the inventory of the minimal pairs existing in the lexicon for each opposition and therefore it presupposes the existence of an inventory of all the different forms of the studied language in phonological transcription. The burden of creating an inventory of this type and dimension, and the complexity of the operations required in order to discover and count all the minimal pairs, are such that all those tasks are impossible without a computer. 
Another example could be a study on the &quot;rendement&quot; of the different suffixes, which requires an inventory of all the words in which each suffix appears.</Paragraph> <Paragraph position="14"> In order to make research work of this type possible, the DMI has been conceived differently from most of the other DMs in existence.</Paragraph> <Paragraph position="15"> These have usually been realized exclusively as components in translation procedures, information retrieval systems, etc. Such DMs, almost always, include only a limited number of lexical items.</Paragraph> <Paragraph position="16"> The DMI has a structure and dimensions that allow us to consider it as an exhaustive, automatically processable representation of the lexical component of the Italian linguistic system. The DMI is, therefore, intended as an instrument for research studies at the level of la langue where exhaustive inventories, data and observations are necessary.</Paragraph> <Paragraph position="17"> 1.4. Theoretical background.</Paragraph> <Paragraph position="18"> The research project described below by N. Calzolari and L. Pecchia is an example of how the DMI can be used in this direction.</Paragraph> <Paragraph position="19"> The present situation of linguistic theory is one of constant change and development. Not only are the traditional models being continuously modified, but some researchers also affirm that the debate is now between theories which belong to different scientific &quot;paradigms&quot;. Examples usually quoted are the number of different generative-transformational schools (interpretive semantics, generative semantics, etc.), relational grammar, cognitive semantics. 
In this situation, some researchers present the following alternatives: whether the scope of the research work conducted in LDP must, of necessity, be directed towards a specific linguistic theory, or whether LDP can produce results which can be utilized by different linguistic schools.</Paragraph> <Paragraph position="20"> For the sake of simplicity we will examine certain examples from the field of syntax. A clear example of LDP activity directed at a specific linguistic theory is, in my opinion, offered by the so-called 'grammar testers', i.e. those computational systems which apply a lexicon and a grammar for automatic sentence generation. 14 These systems, at least in the intention of their creators, constitute a concrete and precise specification at the computational level of a given linguistic theory; the grammar is considered as a program used to produce sentences; the algorithms which interpret the rules are considered as a part of the meta-theory; the production of concrete sentences serves to verify the coherence of the rules, the completeness and lack of contradiction of the formal apparatus and to indicate, practically, the extension of the subset of language generated by the grammar.</Paragraph> <Paragraph position="21"> Evidently, these systems are intentionally strictly connected with 14 This is not the place to enter into a discussion on the complex and well-known problem of the relation and the differences between &quot;generation&quot; as an abstract calculus of all the possible grammatical objects and the automatic &quot;production&quot; of concrete sentences.</Paragraph> </Section> </Paper>