<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1404">
  <Title>Document Structure and Multilingual Authoring</Title>
  <Section position="3" start_page="24" end_page="25" type="metho">
    <SectionTitle>
2 Our approach to Multilingual
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="24" end_page="25" type="sub_section">
      <SectionTitle>
Document Authoring
</SectionTitle>
      <Paragraph position="0"> ... between XML document authoring and multilingual text authoring/generation (Power and Scott, 1998; Hartley and Paris, 1997; Coch, 1996): the choices made by the author are treated as a kind of interlingua (specific to the class of documents being modelled), and it is the responsibility of appropriate &quot;rendering&quot; mechanisms to produce actual text from these choices in the different languages 3 under consideration. For such a program, existing XML tools suffer however from serious limitations. First, DTD's are too poor in expressive power (they are close to context-free grammars) for expressing dependencies between different parts of the document, an aspect which becomes central as soon as the document micro-structure (its fine-grained semantic structure) starts to play a prominent role, as opposed to simply its macro-structure (its organization in large semantic units, typically larger than a paragraph). Second, current rendering mechanisms such as CSS (Cascading Style Sheets) or XSLT (XSL transformation language) (W3C, 1999b) are ill-adapted for handling even simple linguistic phenomena such as morphological variation or subject-verb agreement. In order to overcome these limitations, we are using a formalism, Interaction Grammars (IG), a specialization of Definite Clause Grammars (Pereira and Warren, 1980) which originates in A. Ranta's Grammatical Framework (GF) (Ranta; Mäenpää and Ranta, 1999; Dymetman et al., 2000), a grammatical formalism based on Martin-Löf's Type Theory (Martin-Löf, 1984) and building on previous experience with interactive mathematical proof editors (Magnusson and Nordström, 1994). In this formalism, the carrier of meaning is a choice tree (called &quot;abstract tree&quot; in GF), a strongly typed object in which dependencies between substructures can be easily stated using the notion of dependent types. The remainder of this paper is organized as follows. In section 2, we give a high-level overview of the Multilingual Document Authoring (MDA) system that we have developed at XRCE. In section 3, we present in some detail the formalism of Interaction Grammars. In section 4, we describe an ... 3 The word &quot;language&quot; should be understood here in an extended sense that not only covers English, French, etc., but also different styles or modes of communication.</Paragraph>
      <Paragraph position="1"> Our Multilingual Document Authoring system has the following main features: First, the authoring process is monolingual, but the results are multilingual. At each point of the process the author can view in his/her own language the text s/he has authored so far, and areas where the text still needs refinement are highlighted. Menus for selecting a refinement are also presented to the author in his/her own language. Thus, the author is always overtly working in the language s/he knows, but is implicitly building a language-independent representation of the document content. From this representation, the system builds multilingual texts in any of several languages simultaneously. This approach characterizes our system as belonging to an emerging paradigm of &quot;natural language authoring&quot; (Power and Scott, 1998; Hartley and Paris, 1997), which is distinguished from natural language generation by the fact that the semantic input is provided interactively by a person rather than by a program accessing digital knowledge representations. Second, the system maintains strong control both over the semantics and the realizations of the document. At the semantic level, dependencies between different parts of the representation of the document content can be imposed: for instance the choice of a certain chemical at a certain point in a maintenance manual may lead to an obligatory warning at another point in the manual. At the realization level, which is not directly manipulated by the author, the system can impose terminological choices (e.g. company-specific nomenclature for a given concept) or stylistic choices (such as choosing between using the infinitive or the imperative mode in French to express an instruction to an operator). Finally, and possibly most distinctively, the semantic representation underlying the authoring process is strongly document-centric and geared towards directly expressing the choices which uniquely characterize a given document in a homogeneous class of documents belonging to the same domain. Our view is document-centric in the sense that it takes as its point of departure the widespread practice of using XML tools for authoring the macro-structure of documents, and extends this practice towards an account of their micro-structure. But the analysis of the micro-structure is only pushed as far as is necessary in order to account for the variability inside the class of documents considered, and not in terms of the ultimate meaning constituents of language. This micro-structure can in general be determined by studying a corpus of documents and by exposing the structure of choices that distinguish a given document from other documents in this class.</Paragraph>
      <Paragraph position="2"> This structure of choices is represented in a choice tree, which is viewed as the semantic representation for the document. 4 One single choice may be associated with text realizations of drastically different granularities: while in a pharmaceutical document the choice of an ingredient may result in the production of a single word, the choice of a &quot;responsibility waiver&quot; may result in a long stereotypical paragraph of text, the further analysis of which would be totally counter-productive.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="25" end_page="27" type="metho">
    <SectionTitle>
3 Interaction Grammars
</SectionTitle>
    <Paragraph position="0"> Let us now give some details about the formalism of Interaction Grammars. We start by explaining the notion of choice tree on the basis of a simple context-free grammar, analogous to a DTD.</Paragraph>
    <Paragraph position="1"> Context-free grammars and choice trees Let's consider the following context-free grammar for describing simple &quot;addresses&quot; in English such as &quot;Paris, France&quot;: 5 address --&gt; city, &quot;,&quot;, country.</Paragraph>
    <Paragraph position="2"> country --&gt; &quot;France&quot;.</Paragraph>
    <Paragraph position="3"> country --&gt; &quot;Germany&quot;.</Paragraph>
    <Paragraph position="4"> city --&gt; &quot;Paris&quot;.</Paragraph>
    <Paragraph position="5"> city --&gt; &quot;Hamburg&quot;.</Paragraph>
    <Paragraph position="6"> city --&gt; &quot;the capital of&quot;, country.</Paragraph>
    <Paragraph position="7"> What does it mean, remembering the XML analogy, to author a &quot;document&quot; with such a CFG? It means that the author is iteratively presented with partial derivation trees relative to the grammar (partial in the sense that leaves can be terminals or nonterminals), and at each given authoring step both selects a certain nonterminal to &quot;refine&quot;, and also a given rule to extend this nonterminal one step further: this action is repeated until the derivation tree is complete.</Paragraph>
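    <Paragraph> As a minimal sketch of this authoring loop (an illustration in Python, not the system described in this paper), the grammar can be stored as a table and each authoring step modelled as one call that expands a chosen nonterminal with a chosen rule. The names GRAMMAR, refine and is_complete are hypothetical, and for brevity the sketch tracks only the leaves of the partial derivation tree rather than the whole tree.

```python
# Hypothetical encoding of the example grammar: every key is a nonterminal,
# every other symbol appearing in a rule body is treated as a terminal.
GRAMMAR = {
    "address": [["city", ",", "country"]],
    "country": [["France"], ["Germany"]],
    "city": [["Paris"], ["Hamburg"], ["the capital of", "country"]],
}


def refine(leaves, position, rule_number):
    """Expand the nonterminal leaf at `position` with its rule `rule_number` (1-based)."""
    symbol = leaves[position]
    if symbol not in GRAMMAR:
        raise ValueError(f"{symbol!r} is a terminal and cannot be refined")
    expansion = GRAMMAR[symbol][rule_number - 1]
    return leaves[:position] + expansion + leaves[position + 1:]


def is_complete(leaves):
    """A derivation is complete once no leaf is a nonterminal."""
    return all(symbol not in GRAMMAR for symbol in leaves)


# One authoring session: each step selects a leaf and a rule for it.
leaves = ["address"]
leaves = refine(leaves, 0, 1)  # address -> city "," country
leaves = refine(leaves, 0, 2)  # city -> "Hamburg"
leaves = refine(leaves, 2, 2)  # country -> "Germany"
print(leaves, is_complete(leaves))
```

In an interactive setting the position and rule number would come from the author's menu selections instead of being hard-coded.</Paragraph>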
    <Paragraph position="8"> If one conventionally uses the identifier nonterminal_i to name the i-th rule expanding the nonterminal nonterminal, then the collection of choices made by the author during a session can be represented by a choice tree labelled with rule identifiers, also called combinators. An example of such a tree is address1(city2,country2) 4 This kind of semantic representation stands in contrast to some representations commonly used in NLP, which tend to emphasize the fine-grained predicate-argument structure of sentences independently of the productivity of such analyses for a given class of documents.</Paragraph>
    <Paragraph position="9"> 5 For compatibility with the notations to follow, we use lowercase to denote nonterminals, and quoted strings to denote terminals, rather than the more usual uppercase/lowercase conventions.</Paragraph>
    <Paragraph position="10"> which corresponds to choices leading to the output &quot;Hamburg, Germany&quot;. 6 In practice, rather than using combinator names which strictly adhere to this numbering scheme, we prefer to use mnemonic names directly relating to the meaning of the choices. In the sequel we will use the names adr, fra, ger, par, ham, cap for the six rules in the example grammar. The choice tree just described is thus written adr(ham,ger).</Paragraph>
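    <Paragraph> The projection of a choice tree onto its output text can be sketched as follows. This is an illustrative Python rendering under stated assumptions: choice trees are nested tuples such as ("adr", ("ham",), ("ger",)), with the city before the country as in the example tree adr(ham,ger), and the names RULES, NONTERMINALS and realize are hypothetical.

```python
# Each combinator names one rule of the example grammar:
# combinator -> (nonterminal it expands, body of the rule).
RULES = {
    "adr": ("address", ["city", ",", "country"]),
    "fra": ("country", ["France"]),
    "ger": ("country", ["Germany"]),
    "par": ("city", ["Paris"]),
    "ham": ("city", ["Hamburg"]),
    "cap": ("city", ["the capital of", "country"]),
}
NONTERMINALS = {"address", "country", "city"}


def realize(tree):
    """Map a choice tree to the text produced by the corresponding derivation."""
    combinator, *subtrees = tree
    _, body = RULES[combinator]
    children = iter(subtrees)
    words = []
    for symbol in body:
        if symbol in NONTERMINALS:
            # Each nonterminal in the body consumes the next subtree.
            words.append(realize(next(children)))
        else:
            words.append(symbol)
    # Join with spaces, then attach commas to the preceding word.
    return " ".join(words).replace(" ,", ",")


print(realize(("adr", ("ham",), ("ger",))))
```

The spacing rule for commas is a convenience of the sketch and not something the grammar itself specifies.</Paragraph>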
    <Paragraph position="11"> Making choice trees explicit As we have argued previously, choice trees are in our view the central repository of document content and we want to manipulate them explicitly. Definite Clause Grammars represent possibly the simplest extension of context-free grammars permitting such manipulation. Our context-free grammar can be extended straightforwardly into the DCG: 7</Paragraph>
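    <Paragraph position="13"> address(adr(Co,C)) --&gt; city(C), &quot;,&quot;, country(Co).
 country(fra) --&gt; &quot;France&quot;.
 country(ger) --&gt; &quot;Germany&quot;.
 city(par) --&gt; &quot;Paris&quot;.
 city(ham) --&gt; &quot;Hamburg&quot;.
 city(cap(Co)) --&gt; &quot;the capital of&quot;, country(Co).</Paragraph>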
    <Paragraph position="14"> What these rules do is simply to construct choice trees recursively. Thus, the first rule says that if the author has described a city through the choice tree C and a country through the choice tree Co, then the choice tree adr(Co,C) represents the description of an address.</Paragraph>
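    <Paragraph> To make the recursive construction concrete, the following Python sketch emulates such a grammar in the analysis direction: each function plays the role of one nonterminal with a tree argument and yields every (choice tree, remaining text) pair it can derive, with generators standing in for Prolog backtracking. The function names, the tuple encoding of trees (city listed before country, as in the example tree adr(ham,ger)) and the exact whitespace handling are assumptions of the sketch.

```python
def parse_country(text):
    """Yield (choice_tree, rest) for every country readable at the start of text."""
    for name, word in (("fra", "France"), ("ger", "Germany")):
        if text.startswith(word):
            yield (name,), text[len(word):]


def parse_city(text):
    """Yield (choice_tree, rest) for every city readable at the start of text."""
    for name, word in (("par", "Paris"), ("ham", "Hamburg")):
        if text.startswith(word):
            yield (name,), text[len(word):]
    prefix = "the capital of "
    if text.startswith(prefix):
        # The tree for the country is built first, then wrapped in cap(...).
        for country, rest in parse_country(text[len(prefix):]):
            yield ("cap", country), rest


def parse_address(text):
    """Yield a choice tree for every complete reading of text as an address."""
    for city, rest in parse_city(text):
        if rest.startswith(", "):
            for country, remainder in parse_country(rest[2:]):
                if remainder == "":
                    yield ("adr", city, country)


print(list(parse_address("Hamburg, Germany")))
```

Note that "Paris, Germany" is accepted as well, since nothing in these rules relates the city to the country.</Paragraph>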
    <Paragraph position="15"> If now, in this DCG, we &quot;forget&quot; all the terminals, which are language-specific, by replacing them with the empty string, we obtain the following &quot;abstract grammar&quot;:</Paragraph>
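    <Paragraph position="17"> address(adr(Co,C)) --&gt; city(C), country(Co).
 country(fra) --&gt; [].
 country(ger) --&gt; [].
 city(par) --&gt; [].
 city(ham) --&gt; [].
 city(cap(Co)) --&gt; country(Co).
 which is in fact equivalent to the definite clause program: 8
 6 Such a choice tree can be projected into a derivation tree in a straightforward way, by mapping a combinator nonterminal_i into the nonterminal name nonterminal, and by introducing terminal material as required by the specific rules.</Paragraph>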
    <Paragraph position="18"> 7 According to the usual logic programming conventions, lowercase letters denote predicates and functors, whereas uppercase letters denote metavariables that can be instantiated with terms.</Paragraph>
    <Paragraph position="19"> 8 In the sense that rewriting the nonterminal goal address(adr(Co,C)) to the empty string in the DCG is equivalent to proving the goal address(adr(Co,C)) in the program,
 address(adr(Co,C)) :- city(C), country(Co).</Paragraph>
    <Paragraph position="20"> country(fra).</Paragraph>
    <Paragraph position="21"> country(ger).</Paragraph>
    <Paragraph position="22"> city(par).</Paragraph>
    <Paragraph position="23"> city(ham).</Paragraph>
    <Paragraph position="24"> city(cap(Co)) :- country(Co).</Paragraph>
    <Paragraph position="25">  This abstract grammar (or, equivalently, this logic program), is language independent and recursively defines a set of well-formed choice trees of different categories, or types. Thus, the tree adr(ham,ger) is well-formed in the type address, and the tree cap(fra) well-formed in the type city.</Paragraph>
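    <Paragraph> Read as a type system, the abstract grammar assigns each combinator a result type and a list of argument types. The Python sketch below checks well-formedness on that basis; it is an illustration only, the names SIGNATURE and type_of are hypothetical, and trees are tuples with the city before the country, as in adr(ham,ger).

```python
# combinator -> (result type, types of its arguments)
SIGNATURE = {
    "adr": ("address", ["city", "country"]),
    "fra": ("country", []),
    "ger": ("country", []),
    "par": ("city", []),
    "ham": ("city", []),
    "cap": ("city", ["country"]),
}


def type_of(tree):
    """Return the type of a choice tree, or None if the tree is ill-formed."""
    combinator, *subtrees = tree
    if combinator not in SIGNATURE:
        return None
    result_type, argument_types = SIGNATURE[combinator]
    if len(subtrees) != len(argument_types):
        return None
    for subtree, expected in zip(subtrees, argument_types):
        if type_of(subtree) != expected:
            return None
    return result_type


print(type_of(("adr", ("ham",), ("ger",))))  # address
print(type_of(("cap", ("fra",))))            # city
```

With these simple types the tree adr(ham,fra) also receives the type address, because the type of a city carries no information about its country.</Paragraph>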
    <Paragraph position="26"> Dependent Types In order to stress the type-related aspects of the previous tree specifications, we are actually using in our current implementation the following notation for the previous abstract grammar:  adr(Co,C)::address --&gt; C::city, Co::country.</Paragraph>
    <Paragraph position="27"> fra::country --&gt; [].</Paragraph>
    <Paragraph position="28"> ger::country --&gt; [].</Paragraph>
    <Paragraph position="29"> par::city --&gt; [].</Paragraph>
    <Paragraph position="30"> ham::city --&gt; [].</Paragraph>
    <Paragraph position="31"> cap(Co)::city --&gt; Co::country.</Paragraph>
    <Paragraph position="32">  The first rule is then read: &quot;if C is a tree of type city, and Co a tree of type country, then adr(Co,C) is a tree of type address&quot;, and similarly for the remaining rules.</Paragraph>
    <Paragraph position="33"> The grammars we have given so far are deficient in one important respect: there is no dependency between the city and the country in the same address, so that the tree adr(ham,fra) is well-formed in the type address. In order to remedy this problem, dependent types (Ranta; Martin-Löf, 1984) can be used. From our point of view, a dependent type is simply a type that can be parametrized by objects of other types. We write: adr(Co,C)::address --&gt; C::city(Co), Co::country.</Paragraph>
    <Paragraph position="34"> fra::country --&gt; [].</Paragraph>
    <Paragraph position="35"> ger::country --&gt; [].</Paragraph>
    <Paragraph position="37"> par::city(fra) --&gt; [].
 ham::city(ger) --&gt; [].
 cap(Co)::city(Co) --&gt; Co::country.
 in which the type city is now parametrized by objects of type country, and where the notation par::city(fra) is read as &quot;par is a tree of the type: city of fra&quot;. 9 which is another way of stating the well-known duality between the rewriting and the goal-proving approaches to the interpretation of Prolog.</Paragraph>
    <Paragraph position="38"> 9 In terms of the underlying Prolog implementation, &quot;::&quot; is simply an infix operator for a predicate of arity 2 which relates an object and its type, and both simple and dependent types are handled straightforwardly.</Paragraph>
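    <Paragraph> The effect of parametrizing city by a country can be sketched with a small checker in which a type is itself a tuple that may contain a tree. This is an illustrative Python version, not the Prolog implementation mentioned in the footnote; the names well_formed and city_of are hypothetical, and trees list the city before the country, as in adr(ham,fra).

```python
FRA, GER = ("fra",), ("ger",)
COUNTRY, ADDRESS = ("country",), ("address",)


def city_of(country):
    """The dependent type 'city of country', parametrized by a country tree."""
    return ("city", country)


def well_formed(tree, expected_type):
    """True if `tree` is well-formed in `expected_type`."""
    combinator, *args = tree
    if combinator in ("fra", "ger"):
        return not args and expected_type == COUNTRY
    if combinator == "par":
        return not args and expected_type == city_of(FRA)
    if combinator == "ham":
        return not args and expected_type == city_of(GER)
    if combinator == "cap" and len(args) == 1:
        (country,) = args
        return expected_type == city_of(country) and well_formed(country, COUNTRY)
    if combinator == "adr" and len(args) == 2:
        city, country = args
        # The city must live in the type parametrized by this very country.
        return (expected_type == ADDRESS
                and well_formed(country, COUNTRY)
                and well_formed(city, city_of(country)))
    return False


print(well_formed(("adr", ("ham",), GER), ADDRESS))  # True
print(well_formed(("adr", ("ham",), FRA), ADDRESS))  # False
```

The tree adr(ham,fra), which simple types accepted, is now rejected because ham belongs to city(ger) and not to city(fra).</Paragraph>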
    <Paragraph position="39">  We have just explained how abstract grammars can be used for specifying well-formed typed trees representing the content of a document.</Paragraph>
    <Paragraph position="40"> In order to produce actual multilingual documents from such specifications, a simple approach is to allow for parallel realization grammars (English, French, ...), which all have the same underlying abstract grammar (program), but which introduce terminals specific to the language at hand. Thus, the following French and English grammars are parallel to the previous abstract grammar: 10  adr(Co,C)::address --&gt; C::city(Co), &quot;,&quot;, Co::country.</Paragraph>
    <Paragraph position="41"> fra::country --&gt; &quot;France&quot;.</Paragraph>
    <Paragraph position="42"> ger::country --&gt; &quot;Germany&quot;.</Paragraph>
    <Paragraph position="43"> par::city(fra) --&gt; &quot;Paris&quot;.</Paragraph>
    <Paragraph position="44"> ham::city(ger) --&gt; &quot;Hamburg&quot;.</Paragraph>
    <Paragraph position="45"> cap(Co)::city(Co) --&gt; &quot;the capital of&quot;, Co::country.</Paragraph>
    <Paragraph position="46"> adr(Co,C)::address --&gt; C::city(Co), &quot;,&quot;, Co::country.</Paragraph>
    <Paragraph position="47"> fra::country --&gt; &quot;la France&quot;.</Paragraph>
    <Paragraph position="48"> ger::country --&gt; &quot;l'Allemagne&quot;.</Paragraph>
    <Paragraph position="49"> par::city(fra) --&gt; &quot;Paris&quot;.</Paragraph>
    <Paragraph position="50"> ham::city(ger) --&gt; &quot;Hambourg&quot;.</Paragraph>
    <Paragraph position="51"> cap(Co)::city(Co) --&gt; &quot;la capitale de&quot;,  Co::country.</Paragraph>
    <Paragraph position="52"> This view of realization is essentially the one we have adopted in the prototype at the time of writing, with some straightforward additions permitting the handling of agreement constraints and morphological variants. This simple approach has proven quite adequate for the class of documents we have been interested in.</Paragraph>
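    <Paragraph> A minimal sketch of parallel realization, assuming a simplified encoding: each language is a table from combinators to a template whose integer entries refer to subtrees by position and whose string entries are terminals. The tables ENGLISH and FRENCH and the function realize are hypothetical names, trees list the city before the country, and neither agreement nor morphological variation is modelled here.

```python
ENGLISH = {
    "adr": [0, ",", 1],
    "fra": ["France"],
    "ger": ["Germany"],
    "par": ["Paris"],
    "ham": ["Hamburg"],
    "cap": ["the capital of", 0],
}
FRENCH = {
    "adr": [0, ",", 1],
    "fra": ["la France"],
    "ger": ["l'Allemagne"],
    "par": ["Paris"],
    "ham": ["Hambourg"],
    "cap": ["la capitale de", 0],
}


def realize(tree, grammar):
    """Realize one language-independent choice tree with one realization grammar."""
    combinator, *subtrees = tree
    parts = []
    for item in grammar[combinator]:
        if isinstance(item, int):
            # An integer selects which subtree is realized at this point.
            parts.append(realize(subtrees[item], grammar))
        else:
            parts.append(item)
    return " ".join(parts).replace(" ,", ",")


tree = ("adr", ("ham",), ("ger",))
for grammar in (ENGLISH, FRENCH):
    print(realize(tree, grammar))
```

Because subtrees are addressed by index rather than by position in the template, two languages may place the same subtrees in different orders, in the spirit of footnote 10.</Paragraph>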
    <Paragraph position="53"> However, such an approach sees the activity of generating text from an abstract structure as basically a compositional process on strings, that is, a process where strings are recursively associated with subtrees and concatenated to produce strings at the next subtree level. But such a direct procedure has well-known limitations when the semantic and syntactic levels do not have a direct correspondence (simple example: ordering a list of modifiers around a noun). We are currently experimenting with a powerful extension of string compositionality where the objects compositionally associated with abstract subtrees are not strings, but syntactic representations with rich internal structure. The text itself is obtained from the syntactic representation associated with the total tree by simply enumerating its leaves. 10 Because the order of goals in the right-hand side of an abstract grammar rule is irrelevant, the goals on the right-hand sides of rules in two parallel realization grammars can appear in a different order, which permits certain reorganizations of the linguistic material (situation not shown in the example).</Paragraph>
    <Paragraph position="54"> In this extended view, realization grammars have rules of the following form:</Paragraph>
    <Paragraph position="56"> general public. Le VIDAL® includes a collection of notices for around 5 500 drugs available in France.</Paragraph>
    <Paragraph position="57"> As the publisher, OVP-Éditions du Vidal has taken care of homogeneity across the notices, reformatting and reformulating source information. The main sources are the New Drug Authorizations (Autorisation de Mise sur le Marché), regulatory documents written by pharmaceutical laboratories and approved by legal authorities.</Paragraph>
    <Paragraph position="58"> Relative to multilingual document authoring, this ... main (for which various terminological resources are available), (2) it is a homogeneous collection of documents all complying to the same division in sections and sub-sections, (3) there is a strong trend in international bodies such as the EEC towards making drug package notices (which are similar to VIDAL notices) available in multilingual versions strictly aligned on a common model. 11</Paragraph>
    <Paragraph position="60"> The rule shown is a rule for English: the syntactic representations are language dependent; parallel rules for the other languages are obtained by replacing the compose_english constraint (which is unique to this rule) by constraints appropriate to the other languages under consideration.</Paragraph>
    <Paragraph position="61"> Heterogeneous Trees and Interactivity Natural language authoring is different from natural language generation in one crucial respect. Whenever the abstract tree to be generated is incomplete (for instance the tree cap(Co)), that is, has some leaves which are yet uninstantiated variables, the generation process should not proceed with nondeterministically enumerating texts for all the possible instantiations of the initial incomplete structure. Instead it should display to the author as much of the text as it can in its present &quot;knowledge state&quot;, and enter into an interaction with the author to allow her to further refine the incomplete structure, that is, to further instantiate some of the uninstantiated leaves.</Paragraph>
    <Paragraph position="62"> To this purpose, it is useful to introduce along with the usual combinators (adr, fra, cap, etc.) new combinators of arity 0 called typenames, which are notated type, and are of type &quot;type&quot;. These combinators are allowed to stand as leaves (e.g. in the tree cap(country)) and the trees thus obtained are said to be heterogeneous. The typenames are treated by the text generation process as if they were standard semantic units, that is, they are associated with text units which are generated &quot;at their proper place&quot; in the generated output. These text units are specially phrased and highlighted to indicate to the author that some choice has to be made to refine the underlying type (e.g. obtaining the text &quot;la capitale de PAYS&quot;). This choice has the effect of further instantiating the incomplete tree with &quot;true&quot; combinators, and the generation process is iterated.</Paragraph>
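    <Paragraph> The treatment of typenames as ordinary leaves can be sketched as follows, again as an illustration in Python under stated assumptions. The placeholder PAYS comes from the example above, whereas VILLE and ADRESSE, the table and function names, and the use of a path of argument indices to designate the leaf being refined are all hypothetical; the sketch does not check that a refinement respects the type of the leaf it replaces.

```python
FRENCH = {
    "adr": [0, ",", 1],
    "fra": ["la France"],
    "ger": ["l'Allemagne"],
    "par": ["Paris"],
    "ham": ["Hambourg"],
    "cap": ["la capitale de", 0],
}
# Typenames are arity-0 leaves realized as highlighted placeholders.
TYPENAMES = {"country": "PAYS", "city": "VILLE", "address": "ADRESSE"}


def realize(tree):
    """Realize a possibly heterogeneous tree, showing placeholders for typenames."""
    head, *subtrees = tree
    if head in TYPENAMES:
        return TYPENAMES[head]
    parts = []
    for item in FRENCH[head]:
        if isinstance(item, int):
            parts.append(realize(subtrees[item]))
        else:
            parts.append(item)
    return " ".join(parts).replace(" ,", ",")


def refine(tree, path, replacement):
    """Return a copy of `tree` with the subtree at `path` replaced."""
    if not path:
        return replacement
    head, *subtrees = tree
    index = path[0]
    subtrees[index] = refine(subtrees[index], path[1:], replacement)
    return (head, *subtrees)


tree = ("cap", ("country",))
print(realize(tree))                   # la capitale de PAYS
tree = refine(tree, [0], ("fra",))
print(realize(tree))                   # la capitale de la France
```

Each call to refine corresponds to one choice by the author, after which the text is regenerated from the updated tree.</Paragraph>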
    <Section position="1" start_page="27" end_page="27" type="sub_section">
      <SectionTitle>
4.2 Corpus analysis
</SectionTitle>
      <Paragraph position="0"> An analysis of a large collection of notices from Le VIDAL® de la famille, describing different drugs from different laboratories, was conducted in order to identify: * the structure of a notice, * the semantic dependencies between elements in the structure.</Paragraph>
      <Paragraph position="1"> For this task, all the meta-information available is useful, in particular: explanations provided by Le VIDAL® de la famille and help of a domain expert. Corpus study was a necessary preliminary task before modeling the notices in the IG formalism presented in section 3.</Paragraph>
      <Paragraph position="2">  Notices from Le VIDAL® are all built on the same model, including a title (the name of the drug, plus some general information about it), followed by sections describing the main characteristics of the drug: general description, composition, indications, contraindications, warnings, drug interactions, pregnancy and breast-feeding, dosage and administration, possible side effects. This initial knowledge about the semantic content of the document is captured with a first, simple context-free rule, such as:</Paragraph>
    </Section>
  </Section>
</Paper>