<?xml version="1.0" standalone="yes"?> <Paper uid="C80-1002"> <Title>AUTOMATIC PROCESSING OF WRITTEN FRENCH LANGUAGE</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> AUTOMATIC PROCESSING OF WRITTEN FRENCH LANGUAGE </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> An automatic processor of written French language is described. This processor uses syntactic and semantic information about words in order to construct a semantic net representing the meaning of the sentences. The structure of the network and the principles of the parser are explained. An application to the processing of medical records is then discussed.</Paragraph> <Paragraph position="1"> 1. Introduction SABA (&quot;Semantic Analyser, Backward Approach&quot;) is an automatic parser of French language currently developed at Liege University, Belgium 1. It is now aimed at the processing of medical records 2. However, the principles of this system were conceived independently of any specific application. SABA has been fundamentally conceived as a general, flexible French language parser, which could be used as a component of a natural language interface between a human user and a given computer process 8. This parser is not limited to the processing of correct, academic French. It is also aimed at processing the casual language of an average user.</Paragraph> <Paragraph position="2"> Though our system is solely concerned with French, we have translated our examples into English whenever possible. In this way, we hope that the non-French-speaking reader will be able to get the flavour of our work.</Paragraph> <Paragraph position="3"> 2. 
General description of the system SABA, as a parsing system, is essentially semantically oriented. Its goal is not to identify the complete syntactic structure of the input sentences, but rather to determine the possible semantic relationships between the terms of these sentences. More specifically, the system tries to characterize the semantic dependencies that appear in a sentence between the complements and the terms which are completed by them (from now on, a term of this last kind will be called a &quot;completee&quot;). We will insist immediately upon the fact that both concepts of &quot;complement&quot; and of &quot;completee&quot; are to be taken in a general way. The syntactic subject of a verb is thus treated as a complement of this verb.</Paragraph> <Paragraph position="4"> To characterize these semantic dependencies, the system uses a small set of relationships like AGENT, OBJECT, INSTRUMENT, LOCUS, and so on. In this way, our system is related to the family of &quot;case systems&quot;, using the now well-known principles of case grammars 3 14. However, in contrast to some authors 3 15 17 18, we don't try to find a complete and minimal set of universal relationships. The only criterion for the choice of our relationships is their practical usefulness. For the time being, about twenty different relationships are used by the system.</Paragraph> <Paragraph position="5"> All the relationships which are identified in an input sentence are summarized in a semantic network, which represents the semantic structure of this sentence. The (simplified) representation of a complete sentence is illustrated by figure 1. The fundamental principles of the network will be described in the next section.</Paragraph> <Paragraph position="6"> The grammar used by the system has two components, syntactic and semantic, which are used interactively. The syntactic component has two main tasks. First, it segments the sentence into syntactic units. 
Second, it defines an order of processing for all these units. This syntactic component, which is implemented in a procedural way, will be described in section 5.</Paragraph> <Paragraph position="7"> The semantic component defines which semantic relationships are acceptable between terms. As we shall see later, its scope is not only the relationships between verbs and nominal groups, but also the dependencies between nouns, between nouns and adjectives, and, in fact, all possible dependencies expressible in the French language. The semantic component will be described in section 4.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 3. A semantic net </SectionTitle> <Paragraph position="0"> For some years now, semantic nets have been well known as tools for expressing knowledge and meaning 13 16 17. Let us briefly recall the principle of such networks: a semantic net is a set of nodes, which may represent the different significant terms of a text or of a domain of knowledge. These nodes are interconnected by labelled arcs, which represent the semantic relationships established between them.</Paragraph> <Paragraph position="1"> A complete semantic network, which must be able to express a great variety of semantic information, is generally a very complex structure 7 9 17. The structure that we use may be somewhat simpler, because it is aimed only at the representation of sentences, and not of general knowledge (at least at this stage of our work). However, it is still fairly complex, as can be seen in figure 1.</Paragraph> <Paragraph position="2"> found today the books that he had been looking for for two months.</Paragraph> <Paragraph position="3"> We will not try here to discuss all the subtleties of our net structure. Rather, we will restrict ourselves to the statement of a few basic principles. 
All these principles can be explained with the help of a very simple example. First of all, in our terminology, verbs are not treated as predicates (the arguments of which would be the different nouns of the sentence), but rather as arguments themselves. We have abandoned the dominant point of view that verbs mainly express relationships, while the other terms express objects or properties. Instead, we admit that a sentence is composed of content words, which we call &quot;semantic entities&quot;, related by semantic relationships. The semantic entities include not only the nouns, but also the verbs, adjectives, adverbs, and some prepositions.</Paragraph> <Paragraph position="4"> Secondly, the semantic relationships are oriented (the positive orientation being denoted in the network by an arrow). By definition, the positive orientation is such that: - the origin of the arc is the node corresponding to the term which appears in the sentence as the complement, and - the extremity of the arc is the node corresponding to the term which appears in the sentence as the completee.</Paragraph> <Paragraph position="5"> Third, a logical interpretation corresponds to the semantic net. We will admit that to the graphic configuration X --R--&gt; Y corresponds the logical expression R(Y, X).</Paragraph> <Paragraph position="7"> We will remark that the relation R is not symmetrical with respect to its arguments: the first argument corresponds to the destination node of the network representation.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 4. A semantic grammar </SectionTitle> <Paragraph position="0"> The task of the semantic component of the grammar is to define which semantic relationships are acceptable between the semantic entities. In order to do that, we shall use semantic knowledge about words. 
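Before introducing the property lists, the net principles of the previous section can be sketched in modern terms. The following is a hypothetical illustration in Python; none of these names or data structures come from SABA itself, which only prescribes oriented arcs read logically as R(completee, complement):

```python
# A minimal sketch of the net principles above: a net is a set of
# labelled, oriented arcs running from the complement node (origin)
# to the "completee" node (destination), and each arc is read
# logically as R(completee, complement), destination argument first.

def add_arc(net, relation, complement, completee):
    """Store one oriented arc: complement --relation--) completee."""
    net.append((relation, complement, completee))

def logical_form(net):
    """Read every arc as R(destination, origin), destination first."""
    return [f"{r}({completee}, {complement})"
            for (r, complement, completee) in net]

net = []
# "Peter eats": PETER is the complement (origin), EATING the completee.
add_arc(net, "AGENT", "PETER", "EATING")
print(logical_form(net))  # ['AGENT(EATING, PETER)']
```

Note that the asymmetry of R is carried entirely by the argument order, exactly as in the logical interpretation above.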
To each content word are assigned two different &quot;property lists&quot;: an &quot;E-list&quot; and a &quot;T-list&quot;. E-lists.</Paragraph> <Paragraph position="1"> The E-list which is associated with a term lists all the relationships where this term may appear as a &quot;completee&quot;. As an alternate definition, we may say that an E-list lists all possible kinds of complements for the associated term.</Paragraph> <Paragraph position="2"> For example, the E-list of the verb &quot;to eat&quot; would be something like this:</Paragraph> <Paragraph position="4"> E-lists appear to be very similar to the traditional case frames used by case grammar theory. There is, however, a distinction: case frames were meant to indicate possible ARGUMENTS for verbs, considered as predicates.</Paragraph> <Paragraph position="5"> E-lists are used to indicate possible RELATIONSHIPS for the associated terms, which are considered as arguments.</Paragraph> <Paragraph position="6"> The E-list associated with a term is a characteristic of this term itself, and cannot be deduced from the context. It must be given by a dictionary.</Paragraph> <Paragraph position="7"> T-lists.</Paragraph> <Paragraph position="8"> The T-list which is associated with a term lists the possible relationships where this term may appear as a complement. We may also understand a T-list as the list of the possible kinds of &quot;completee&quot; of a term. In contrast to the E-list, the T-list of a term is, at least partially, determined by the context of this term in a sentence. The T-list of a noun, for example, is provided by the preposition which begins the nominal group. Each preposition thus introduces a given, fixed T-list. And, to preserve the generality of this rule, the lack of a preposition is treated as the presence of an &quot;empty&quot; preposition, called PHI. 
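The two property lists can be pictured as plain dictionaries. The entries below are invented for the sake of the example (the paper's own example lists were lost in extraction); only the mechanism is taken from the text: the E-list belongs to the word itself, while the T-list is introduced by the preposition of the group, PHI when there is none:

```python
# Hypothetical illustration of E-lists and T-lists. In SABA the E-list
# comes from the dictionary, and the T-list from the preposition
# introducing the nominal group; the list contents below are invented.

E_LISTS = {"EATING": ["AGENT", "OBJECT", "INSTRUMENT", "TIME"]}
T_LISTS = {"par": ["AGENT", "INSTRUMENT"], "PHI": ["AGENT", "OBJECT"]}

def t_list_for_group(preposition):
    """The lack of a preposition is treated as the empty preposition PHI."""
    if preposition is None:
        preposition = "PHI"
    return T_LISTS.get(preposition, [])

print(t_list_for_group(None))  # ['AGENT', 'OBJECT']
```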
For example, the T-list introduced by the French preposition &quot;par&quot; is something like this:</Paragraph> <Paragraph position="10"> Of course, we do not consider that the lists given as examples are complete.</Paragraph> <Paragraph position="11"> They are only an illustration of the real configuration of the system.</Paragraph> <Paragraph position="12"> Some properties of T-lists and E-lists.</Paragraph> <Paragraph position="13"> a) From a logical point of view, the occurrence of a relationship, say AGENT, in the E-list associated with a given term X is equivalent to: (∃y) AGENT(X, y).</Paragraph> <Paragraph position="15"> The occurrence of the same relationship in the T-list associated with X is equivalent to: (∃y) AGENT(y, X). Consequently, the only difference between T-lists and E-lists lies in the orientation given to the relationships described by them.</Paragraph> <Paragraph position="16"> b) For any relationship, such as AGENT, we may define the &quot;inverse relationship&quot;.</Paragraph> <Paragraph position="18"> Given these inverse relationships, we have the following property of E-lists and T-lists: &quot;The occurrence of a given relationship in the E-list associated with a term X is equivalent to the occurrence of the inverse relationship in the T-list associated with the same term, and reciprocally&quot;. This property is used in some complex situations where a term which appears in the input sentence as a complement must be represented in the network as a completee. This is the case, for example, of past and present participles used as adjectives.</Paragraph> <Paragraph position="19"> c) The same relationship may not occur twice in a given list of properties (E-list or T-list). 
Concerning E-lists, this restriction may be translated as: &quot;two different terms cannot play in the same sentence the same role with respect to a given term&quot;, which is a typical restriction in some case systems 3 15.</Paragraph> <Paragraph position="20"> d) Only one of the relationships listed in the T-list of a term may be used in a given sentence. This means that each term in a sentence has a single role to play. This condition is not true for E-lists: all the relationships given by the E-list of a term may be used in a sentence where this term occurs.</Paragraph> <Paragraph position="21"> The properties c) and d) are called the two &quot;exclusivity principles&quot; of the system. Compatibility condition and selectional restrictions. We will now show how we use these property lists in our system. First, we will state a compatibility condition: &quot;given two terms, one of which is a possible complement of the second, a necessary condition to establish a given relationship between them is that this relationship be present both in the E-list of the possible completee and in the T-list of the possible complement&quot;. This condition is a necessary but not a sufficient one. The reason for this can be shown by the following example. Let us admit that we want to establish the AGENT relationship between &quot;eating&quot; and &quot;Peter&quot; in &quot;Peter eats&quot;. We must of course know that AGENT occurs both in the E-list of EATING and in the T-list of PETER,</Paragraph> <Paragraph position="22"> i.e., that the act of eating takes an AGENT, and that Peter may be the AGENT of some activity. But, in order to be allowed to state</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> AGENT(EATING, PETER), </SectionTitle> <Paragraph position="0"> we must also know whether the two assignments</Paragraph> <Paragraph position="2"> are correct.</Paragraph> <Paragraph position="3"> These assignments will be submitted to a set of restrictions. 
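The compatibility condition stated above reduces to a double membership test. A minimal sketch (the helper name and the example property lists are ours, not SABA's):

```python
# The compatibility condition: a relationship may link a complement to
# a completee only if it occurs both in the E-list of the possible
# completee AND in the T-list of the possible complement. This is a
# necessary, not a sufficient, condition.

def compatible(relation, e_list_of_completee, t_list_of_complement):
    """Necessary condition for establishing 'relation' between two terms."""
    return (relation in e_list_of_completee and
            relation in t_list_of_complement)

# "Peter eats": invented property lists for the example.
print(compatible("AGENT", ["AGENT", "OBJECT"], ["AGENT", "OBJECT"]))  # True
print(compatible("INSTRUMENT", ["AGENT", "OBJECT"], ["AGENT"]))       # False
```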
These restrictions are associated with the property lists of the terms. Restrictions concerning the complements of a term are associated with the E-list of this term. Restrictions concerning the completee of a term are associated with its T-list.</Paragraph> <Paragraph position="4"> The system uses different kinds of restrictions, in order to solve different kinds of ambiguities. The main one, which concerns nouns (and adjectives), uses a classification of these terms into a hierarchy of semantic classes. With the help of this classification, we can for example express that &quot;the AGENT of the action of eating must be an Animate being&quot;. With such restrictions, the system can easily parse the sentence &quot;Peter eats an apple with a knife&quot; and produce the three relationships AGENT(EATING, PETER), OBJECT(EATING, APPLE) and INSTRUMENT(EATING, KNIFE).</Paragraph> <Paragraph position="6"> Other kinds of restrictions are based on syntactic classes or on modal properties (of verbs). We shall not discuss them further here.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5. The parser </SectionTitle> <Paragraph position="0"> The grammar, and thus the parser, are not completely free of syntactic considerations. The syntactic part of the parser has two main tasks: - the segmentation of the input text into &quot;syntactic units&quot;, - the determination of a parsing strategy. Syntactic units.</Paragraph> <Paragraph position="1"> Four kinds of syntactic units are defined: words, groups, clauses and sentences. - Words, or atomic symbols, are strings of characters which match dictionary entries (this definition takes into account not only single words, but also locutions, such as &quot;at least&quot;, which will be treated by the system as atomic symbols). To each Word are associated syntactic and semantic properties. 
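Returning for a moment to the selectional restrictions of the previous section, the check against a hierarchy of semantic classes can be sketched as follows. The class hierarchy, the class assignments and the restriction table are all invented for the illustration; only the principle (walk up the hierarchy from the complement's class to the required class) is taken from the text:

```python
# Sketch of a selectional-restriction check over an invented hierarchy
# of semantic classes: each restriction names the class that a
# complement must belong to, directly or through its ancestors.

PARENT = {"HUMAN": "ANIMATE", "ANIMATE": "ENTITY",
          "FOOD": "CONCRETE", "TOOL": "CONCRETE", "CONCRETE": "ENTITY"}

def is_a(klass, target):
    """True if klass equals target or target is one of its ancestors."""
    while klass is not None:
        if klass == target:
            return True
        klass = PARENT.get(klass)
    return False

CLASS_OF = {"PETER": "HUMAN", "APPLE": "FOOD", "KNIFE": "TOOL"}
RESTRICTION = {"AGENT": "ANIMATE", "OBJECT": "CONCRETE",
               "INSTRUMENT": "TOOL"}

def passes(relation, complement):
    """'The AGENT of eating must be an Animate being', and so on."""
    return is_a(CLASS_OF[complement], RESTRICTION[relation])

# "Peter eats an apple with a knife": all three assignments pass.
for rel, term in [("AGENT", "PETER"), ("OBJECT", "APPLE"),
                  ("INSTRUMENT", "KNIFE")]:
    print(rel, term, passes(rel, term))
```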
Words are divided into two main classes: * the semantic entities, or content words, which can be arguments of semantic relationships: nouns, pronouns, verbs, adverbs, adjectives, and some prepositions.</Paragraph> <Paragraph position="2"> * the function words, which cannot be arguments of semantic relationships: coordinate and subordinate conjunctions, articles, some prepositions, negation words, and so on.</Paragraph> <Paragraph position="3"> - Groups are sequences of Words. Each Group has a central term, which may be a noun (nominal Groups), a pronoun, an adjective or an adverb.</Paragraph> <Paragraph position="4"> - A Clause consists of one and only one verb, and of a set of Groups. A Clause also has a central term, which is its verb.</Paragraph> <Paragraph position="5"> - A Sentence is a sequence of Words delimited by a &quot;terminator&quot; (a punctuation mark such as a period, a question mark, ...). A Sentence contains a set of Clauses. The parsing strategy.</Paragraph> <Paragraph position="6"> The parsing strategy is fundamentally a bottom-up strategy, which may be defined recursively as follows: for each syntactic unit (except Words), execute the following steps: - the segmentation of this unit into its own internal syntactic units, - the parsing of these internal units according to a definite order, - the determination of the semantic relationships between the internal units, - the substitution of the given unit, at the next higher level, by a special symbol, which represents the analyzed unit.</Paragraph> <Paragraph position="7"> The semantic relationships are determined according to the semantic grammar defined above. 
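The four recursive steps above can be summarized as a short skeleton. The unit representation and the helper names are ours, not SABA's; the structure of the recursion (segment, parse the internal units in a definite order, relate them, substitute a special symbol) is taken directly from the strategy just described:

```python
# Recursive bottom-up parse of one syntactic unit, as a sketch.
# The segment / order / relate / symbol_for helpers stand for the
# corresponding components of the system and are supplied by the caller.

def parse(unit, segment, order, relate, symbol_for):
    if unit["kind"] == "Word":
        return unit                       # Words are the base case
    internal = segment(unit)              # 1. segment into internal units
    parsed = [parse(u, segment, order, relate, symbol_for)
              for u in order(internal)]   # 2. parse them in a definite order
    relate(parsed)                        # 3. determine the relationships
    return symbol_for(unit, parsed)       # 4. substitute a special symbol

# Tiny demo with stub helpers (all invented):
def segment(unit): return unit.get("parts", [])
def order(units): return units            # here: simple left-to-right
found = []
def relate(units): found.append(len(units))
def symbol_for(unit, parsed): return {"kind": "PR"}  # special symbol

word = {"kind": "Word"}
clause = {"kind": "Clause", "parts": [word, word]}
print(parse(clause, segment, order, relate, symbol_for))  # {'kind': 'PR'}
```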
We now want to insist on the two other crucial points of this algorithm: the segmentation of a given unit, and the order for parsing the internal units.</Paragraph> <Paragraph position="8"> The segmentation procedure has two tasks: breaking down sentences into clauses, and clauses into groups.</Paragraph> <Paragraph position="9"> The segmentation of sentences into clauses is based on the following technique: starting at a verb, and moving one word at a time to the left AND to the right until a delimiter is recognized. For groups, the same technique applies, except that a group is never extended to the right of its main term. Lists of clause-delimiters and of group-delimiters are known by the system. Coordinate conjunctions, which may or may not be delimiters depending on the context, receive a special treatment.</Paragraph> <Paragraph position="10"> An important point concerning the segmentation of the sentence into clauses must be stressed. It is performed each time a clause must be selected to be analyzed by the system. This strategy gives the system the possibility of using information collected at an earlier stage of the analysis. In this way, the structure of very complex sentences can be successfully analyzed.</Paragraph> <Paragraph position="11"> All segmentation procedures are already implemented and function satisfactorily. An example of the segmentation of a French sentence is shown in figure 3. The sentence appears as a list of words, delimited on the left and on the right by the special symbol SB (Sentence Boundary). A syntactic category is assigned to each word. 
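The verb-centred segmentation technique described above can be sketched as follows. The delimiter set and the example word list are invented for the demonstration; only the mechanism (start at the verb, extend one word at a time leftward and rightward until a delimiter) comes from the text:

```python
# Sketch of clause segmentation: starting at a verb and moving one
# word at a time to the left and to the right until a delimiter is
# recognized. SB is the sentence-boundary symbol used by the system;
# the rest of the delimiter list is invented for the demo.

DELIMITERS = {"SB", "PR", "qui", "que"}

def clause_around(words, verb_index):
    """Return the clause containing the verb at verb_index."""
    left = verb_index
    while left > 0 and words[left - 1] not in DELIMITERS:
        left -= 1                         # extend leftward
    right = verb_index
    while (right + 1 in range(len(words))
           and words[right + 1] not in DELIMITERS):
        right += 1                        # extend rightward
    return words[left:right + 1]

sentence = ["SB", "Pierre", "mange", "une", "pomme", "SB"]
print(clause_around(sentence, 2))  # ['Pierre', 'mange', 'une', 'pomme']
```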
The results shown in the figure are a simplification of the output of the system.</Paragraph> <Paragraph position="12"> a) Les chiens auxquels vous vous attachez et qui vous rendent de l'affection deviennent d'inestimables compagnons.</Paragraph> <Paragraph position="13"> b) The dogs that you love and who love you in return become precious companions. a) the original French sentence b) the English translation c) the input of the segmentation procedure d) the state of the sentence after the analysis of the relative clause &quot;auxquels vous vous attachez&quot;, which is replaced by the special symbol PR e) the state of the sentence after the analysis of the relative clause &quot;qui vous rendent de l'affection&quot;, which is replaced by PR f) final state of the sentence: the main clause was found and replaced by PP.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> PP </SectionTitle> <Paragraph position="0"> Concerning the order for parsing the internal units at a given level, two strategies are applied, one for clauses, and one for groups.</Paragraph> <Paragraph position="1"> For clauses, we simply follow the bottom-up strategy, with the following rule: all subordinate clauses (relative, conjunctive, infinitive, ...) are processed before the clauses on which they depend. If two clauses are on the same level, a left-to-right priority is applied.</Paragraph> <Paragraph position="2"> For groups, a backward strategy is applied: the system always starts from the end of the clause, and moves towards the beginning. At each step, the internal structure of a group is parsed,</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> AND THEN THE POSSIBLE RELATIONSHIPS BETWEEN THIS GROUP AND THE FOLLOWING GROUPS (ALREADY PARSED) ARE INVESTIGATED. 
</SectionTitle> <Paragraph position="0"> This particular order (after which the system is named) is of crucial importance.</Paragraph> <Paragraph position="1"> It is based on two facts. The first is related to the structure of the language: in French, complements are nearly always to the right of the terms on which they depend. The second is related to the system: we know that the T-lists of the semantic entities are, at least partially, deduced from the context. Consequently, at the moment when the system investigates the potential relationships between a term and some possible complement, the group in which this complement appears must already have been parsed!</Paragraph> </Section> </Paper>