<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-2004">
  <Title>A Linguistic Discovery Program that Verbalizes its Discoveries</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Overview of UNIVAUTO
</SectionTitle>
    <Paragraph position="0"> Below is a brief description of UNIVAUTO (UNIVersals AUthoringTOol), drawing for illustration on data from Greenberg (1966).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 The input
</SectionTitle>
      <Paragraph position="0"> UNIVAUTO accepts as input the following manually prepared information: (1) A database (=a table), usually comprising a sizable number of languages described in terms of some properties (feature-value pairs), together with a list of the abbreviations used in the database. Below is a (simplified) description of the language Berber in terms of just 4 features: v-order (=the relative order of verb, subject and object), na/an (=the relative order of noun and adjective), cnpn/pncn (=the relative order of common noun and proper noun), and pref/suf (=the presence of prefixes or suffixes): data(berber, [v-order=vso,na/an=na,cnpn/pncn=*,pref/suf=both]).</Paragraph>
      <Paragraph position="1"> The value &quot;*&quot; is special: it marks a feature (here cnpn/pncn) as either inapplicable to the language (here Berber) or of unknown value.</Paragraph>
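      <Paragraph> As a rough illustration of the record format in (1), the Berber entry can be mirrored in Python as a dictionary of feature-value pairs. This is only a sketch: the names UNKNOWN, database and known_value are assumptions introduced here and are not part of UNIVAUTO, whose records are the data(...) terms shown above.

```python
# Hypothetical Python mirror of the data(berber, [...]) record.
UNKNOWN = "*"  # the special value: unknown or inapplicable

database = {
    "berber": {
        "v-order": "vso",
        "na/an": "na",
        "cnpn/pncn": UNKNOWN,
        "pref/suf": "both",
    },
}


def known_value(db, language, feature):
    """Return the value of a feature, or None when it is '*' or absent."""
    value = db.get(language, {}).get(feature, UNKNOWN)
    return None if value == UNKNOWN else value
```

Treating an absent feature like the value * is a simplification made for this sketch; the source only defines the meaning of * itself.</Paragraph>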
      <Paragraph position="2"> (2) A human agent's discoveries (represented as simple logical propositions, even if originally formulated as complex ones); e.g.: discovery(agent=greenberg,no=3,nonstatistical, implication(v-order=vso,pr/po=pr)).</Paragraph>
      <Paragraph position="3"> This record states that a human agent, Greenberg, has found an implicational universal relating two variables, to the effect that, for all languages, if a language has Verb-Subject-Object order then it has prepositions (rather than postpositions); that this universal is non-statistical (holds without exceptions in the studied database); and that it is stated as Universal No. 3 in the human agent's original publication.</Paragraph>
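      <Paragraph> To make the reading of such an implication concrete, the sketch below classifies a single language record against a universal of the form "if antecedent then consequent". It is an illustration under stated assumptions, not the program's code: the function name classify, the four labels, and the decision to treat a language with an unknown antecedent as irrelevant are all choices made here.

```python
UNKNOWN = "*"  # unknown or inapplicable value


def classify(record, antecedent, consequent):
    """Classify one language against 'if antecedent then consequent'.

    antecedent and consequent are (feature, value) pairs; record maps
    features to values. The labels are hypothetical names for this sketch.
    """
    a_feature, a_value = antecedent
    c_feature, c_value = consequent
    if record.get(a_feature, UNKNOWN) != a_value:
        # Antecedent false (or unknown, by assumption): nothing is claimed.
        return "irrelevant"
    actual = record.get(c_feature, UNKNOWN)
    if actual == UNKNOWN:
        return "uncertain"       # the consequent cannot be checked
    if actual == c_value:
        return "positive"        # supporting example
    return "counterexample"      # antecedent true, consequent false
```

A non-statistical universal in the sense described above is then one for which no language in the database is classified as a counterexample.</Paragraph>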
      <Paragraph position="4"> Aside from these basic sources of information, the input includes information on: the origin of the database, if any (the full citation of the work where the database is given); reference name(s) of the database, if any; the kinds of objects that rows and columns represent; etc.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 The task
</SectionTitle>
      <Paragraph position="0"> The task UNIVAUTO addresses can be formulated as follows: Given the input information (as described in 2.1), find the language universals valid in the data, compare them with those discovered by some human agent, and write a report, if appropriate.</Paragraph>
      <Paragraph position="1"> E.g. a query to the system may look like: ?-discover(implication(A,B),non_statistical, positive_examples=4,compare_with=greenberg).</Paragraph>
      <Paragraph position="2"> It amounts to requesting that non-statistical implicational universals holding between two variables and supported by at least 4 positive examples be found, that the results be compared with the findings of Greenberg, and that, if judged interesting enough, a report of these discoveries be written. Other queries may also be formulated (cf. 3.1), but currently only ones involving one type of universal and one database at a time.</Paragraph>
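      <Paragraph> The search requested by such a query can be sketched as a brute-force enumeration over pairs of feature-value statements. The code below is a minimal sketch, not UNIV's algorithm (which the text does not spell out): discover_implications and its parameters are hypothetical names, and rejecting a candidate as soon as a language with a true antecedent has an unknown consequent is an assumption made here.

```python
from itertools import permutations

UNKNOWN = "*"  # unknown or inapplicable value


def discover_implications(db, min_positive=4):
    """Return (antecedent, consequent, supporters) triples that hold
    without exception and have at least min_positive supporters."""
    # All feature-value statements attested in the database.
    statements = sorted({(f, v) for rec in db.values()
                         for f, v in rec.items() if v != UNKNOWN})
    found = []
    for ante, cons in permutations(statements, 2):
        if ante[0] == cons[0]:
            continue  # skip implications within a single feature
        supporters, valid = [], True
        for language, rec in sorted(db.items()):
            if rec.get(ante[0], UNKNOWN) != ante[1]:
                continue  # antecedent not true here
            if rec.get(cons[0], UNKNOWN) == cons[1]:
                supporters.append(language)
            else:
                valid = False  # counterexample or unknown consequent
                break
        if valid and len(supporters) >= min_positive:
            found.append((ante, cons, supporters))
    return found
```

The comparison with the human agent's findings and the decision whether to write a report are separate steps and are not covered by this sketch.</Paragraph>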
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 The output
</SectionTitle>
      <Paragraph position="0"> Below we list some excerpts from Pericliev (1999) as an illustration of the system output.</Paragraph>
      <Paragraph position="1"> The program was run on the data from Greenberg (1966), with the query in the preceding section. It discovered some problems in his analyses (these form the bulk of the text below) as well as 59 novel universals of type &quot;If A then B, non-statistical&quot; (as against 12 found by Greenberg, one of which further turned out to be wrong!). The paragraphs carry boldface numbers for later reference.</Paragraph>
      <Paragraph position="2">  implicational universals in the 30 languages sample of Greenberg 1966 and compare the results of the two studies.* &lt;...&gt; [3] We confirmed the validity of universals [12,13,15-a,15-b,21-a,22-a,27-a].</Paragraph>
      <Paragraph position="3"> [4] Universal [27-b] is also true, however it violates our restriction pertaining to the occurrence of at least 4 positive examples in the dataset. [27-b] is supported in 1 language (Thai).</Paragraph>
      <Paragraph position="4"> [5] Universals [16-a,16-b,16-c] are uncertain, rather than indisputably valid in the database investigated, since they assume properties in languages, which are actually marked in the database as &quot;unknown or inapplicable&quot; (notated with &quot;*&quot; in Table 1). Universal [16-a] would hold only if the feature AuxV/VAux is applicable for Berber, Hebrew, and Maori and in these languages the inflected auxiliary precedes the verb. Universal [16-b] would hold only if the feature AuxV/VAux is applicable for Burmese and Japanese and in these languages the verb precedes the inflected auxiliary. Universal [16-c] would hold only if the feature AuxV/VAux is applicable for Loritja and in this language the verb precedes the inflected auxiliary. [6] Universal [23-a] is false. It is falsified in Basque, Burmese, Burushaski, Finnish, Japanese, Norwegian, Nubian, and Turkish, in which the proper noun precedes the common noun but in which the noun does not precede the genitive.</Paragraph>
      <Paragraph position="5">  [7] We found the following previously undiscovered universals in the data.</Paragraph>
      <Paragraph position="6"> [8] Universal 1. If in a language the adjective precedes the adverb then the main verb precedes the subordinate verb.</Paragraph>
      <Paragraph position="7"> [9] Examples of this universal are 8 languages: Fulani, Guarani, Hebrew, Malay, Swahili, Thai, Yoruba, and Zapotec. &lt;...&gt; [10] Universal 59. If a language has an initial yes-no question particle then this language has the question word or phrase placed first in an interrogative word question.** &lt;...&gt; *The generated text continues with a description of what an implicational universal is, a table of Greenberg's 30-language sample, accompanied by the abbreviations used, and a listing of the universals he found. His universals, verbalized by our program, are listed with their numeration in the original publication. An alpha-numeric numeration means that an originally complex universal has been split into elementary ones of the form &quot;If A then B&quot;. **There follows a conclusion which is a summary of the results.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="2" type="metho">
    <SectionTitle>
3 The UNIVAUTO System
</SectionTitle>
    <Paragraph position="0"> UNIVAUTO comprises two basic modules: one in charge of the discoveries of the program, called UNIV(ersals), and the other in charge of the verbalization of these discoveries, called AU(thoring)TO(ol).</Paragraph>
    <Section position="1" start_page="0" end_page="2" type="sub_section">
      <SectionTitle>
3.1 The discovery module UNIV
</SectionTitle>
      <Paragraph position="0"> universals (holding without exceptions) or &quot;statistical&quot; universals (holding with some user-specified percentage of exceptions).</Paragraph>
      <Paragraph position="1"> Also, UNIV can compute (implicational) universals valid in (at least) a user-specified number of positive examples (=languages), as well as compute the statistical significance of universals (based on the χ² statistic). A minimal set-cover subroutine may guarantee the discovery of the smallest set(s) of universals, generating a typology (Pericliev 2002).</Paragraph>
      <Paragraph position="2"> Importantly, given the discoveries of another, human agent, UNIV employs a diagnostic program to find possible errors in the humanly proposed universals. Currently, we identify as PROBLEMS the following categories: (1) Restriction Problem: Universals found by the human analyst that are &quot;under-supported&quot;, i.e. fall below a user-selected threshold of positive evidence and/or percentage of validity (the latter applying to statistical universals).</Paragraph>
      <Paragraph position="3"> (2) Uncertainty Problem: Universals found by the human analyst that tacitly assume a value for some linguistic property which is actually unknown or inapplicable (marked by '*' in the database).</Paragraph>
      <Paragraph position="4"> (3) Falsity Problem: Universals found by the human analyst that are false or are logically implied by simpler universals.</Paragraph>
      <Paragraph position="5"> The DISCOVERIES of UNIV form two lists: (1) new universals (absolute or implicational, and statistical or non-statistical), and (2) problems (sub-categorized as above).</Paragraph>
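      <Paragraph> The three problem categories can be illustrated for a non-statistical implication by the sketch below. It is a simplified reading, not the diagnostic program itself: the function diagnose and its labels are hypothetical, the order in which the categories are tested is an assumption, and the case of a universal that is logically implied by simpler universals, as well as statistical universals, are left out.

```python
UNKNOWN = "*"  # unknown or inapplicable value


def diagnose(db, antecedent, consequent, min_positive=4):
    """Return (verdict, languages) for a humanly proposed implication."""
    (a_feature, a_value), (c_feature, c_value) = antecedent, consequent
    positive, counter, uncertain = [], [], []
    for language, rec in sorted(db.items()):
        if rec.get(a_feature, UNKNOWN) != a_value:
            continue  # antecedent not true: language is not relevant
        actual = rec.get(c_feature, UNKNOWN)
        if actual == c_value:
            positive.append(language)
        elif actual == UNKNOWN:
            uncertain.append(language)
        else:
            counter.append(language)
    if counter:
        return "falsity", counter            # falsified by these languages
    if uncertain:
        return "uncertainty", uncertain      # assumes values marked '*'
    if min_positive > len(positive):
        return "restriction", positive       # true but under-supported
    return "confirmed", positive
```

The languages returned with each verdict are the ones a report would need to mention: counterexamples for falsity, languages with * for uncertainty, and the few supporters for under-support.</Paragraph>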
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.2 The authoring module AUTO
</SectionTitle>
      <Paragraph position="0"> AUTO accepts as input the discoveries made by UNIV, but also has access to the input data (cf. 2.1) to make further computations as necessary.</Paragraph>
      <Paragraph position="2"> AUTO can generally be characterized as a practical text generation system, of opportunistic type, intended to meet the needs of our particular task, rather than as a system intended to handle, in a general and principled way, scientific articles' composition or surface generation of a wide range of linguistic phenomena (reminiscent of earlier work on generation from formatted data of meteorological bulletins (Kittredge et al.'s RAREAS) or stock market reports (Kukich's Ana)). For applied NLG, cf. e.g. Reiter et al. (1995); also Computational Linguistics 1998 4(23), and elsewhere. Xuang &amp; Fielder (1996) and later work verbalize machine-found mathematical proofs.</Paragraph>
      <Paragraph position="4"> First, AUTO needs to know whether the discoveries of UNIV are interesting enough to warrant a report. To this end, it uses a natural and simple numeric criterion: UNIV's discoveries (new universals+problems) are judged worthy of a report if they are at least as numerous as the published discoveries of the human agent studying the same database.</Paragraph>
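      <Paragraph> The numeric criterion amounts to a single comparison, sketched below. The function name worth_reporting and its parameter names are hypothetical and chosen only for this illustration.

```python
def worth_reporting(n_new_universals, n_problems, n_human_discoveries):
    """True when UNIV's discoveries (new universals plus problems) are at
    least as numerous as the human agent's published discoveries."""
    return n_new_universals + n_problems >= n_human_discoveries
```

With the figures quoted in 2.3, the 59 novel universals alone already outnumber the 12 universals found by Greenberg, so a report is generated whatever the number of problems.</Paragraph>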
      <Paragraph position="5"> Having decided upon report generation, AUTO follows a fixed scenario for DISCOURSE COMPOSITION (in &quot;genre analysis&quot;, scientific papers are known to follow such a fixed structure). The details of this scenario, however, vary with a number of parameters related to the query to the system, the discoveries made in response to this query, and other considerations. The basic components of the scenario (along with some minor elaboration) are given below. Each component is structured as a separate text paragraph (possibly with sub-(sub)-paragraphs). 1. Statement of title. The title is selected according to one of the following foci: (i) new_universals, (ii) problems, (iii) new_universals+problems.</Paragraph>
      <Paragraph position="6"> (Focus (i) selected in Fig. 1, [1].) 2. Introduction of goal. Choice among the same foci. (Focus (iii) selected in Fig. 1, [2].) 3. Elaboration of goal. Logical definition of the type of universal investigated, constructed by our system, plus a message on user-specified constraints (supporting evidence, etc.).</Paragraph>
      <Paragraph position="7"> 4. Description of the investigated data and the human discoveries. Based on data available from input.</Paragraph>
      <Paragraph position="8"> 5. Explaining the problems in the human discoveries. UNIV's diagnostic subroutine feeds to AUTO problems classed in one of three sub-categories (cf. 3.1) for AUTO to decide how to explain them.</Paragraph>
      <Paragraph position="9"> 6. Statement of machine discoveries. Input from the discoveries of UNIV.</Paragraph>
      <Paragraph position="10"> 7. Conclusion. Summary of findings (new_universals and/or problems).</Paragraph>
      <Paragraph position="11"> 8. References. Based on data available from input.</Paragraph>
      <Paragraph position="12"> Below we briefly outline component (5).</Paragraph>
      <Paragraph position="13"> This paragraph comprises 4 sub-paragraphs, in this order: one conveying information on the confirmed humanly found universals (Fig. 1, [3]), and the remaining three on problems of restriction (=under-support), uncertainty and falsity (Fig. 1, [4,5,6]). Each sub-paragraph starts with an intro_part, making a statement about a collection of discoveries (e.g. &quot;Universals [1,2,..] are under-supported/uncertain/false..&quot;). All but the first sub-paragraph (referring to confirmed discoveries) also have a body_part, justifying why these predications hold for each individual discovery in the collection.</Paragraph>
      <Paragraph position="14"> The body_parts appeal either solely to examples (as in Fig. 1, [4], where stating the actual, smaller number of supporting examples immediately after the required number suffices as an explanation) or to both examples and an explanation of why these are indeed examples. The latter situation is illustrated by Fig. 1, [5,6]. Thus, for instance, the examples justifying that a universal is false are actually its counterexamples, and AUTO will find these counterexamples as well as the reason why they are such (in the case of an implication, antecedent true but consequent false).</Paragraph>
      <Paragraph position="15"> AUTO also has a limited SENTENCE-PLANNING FACILITY to decide how to split up a paragraph's content into sentences and clauses. Assume, for the sake of illustration, that we need to verbalize an under-support body_part, like that in Fig. 1, [4], but, say, requiring at least 8 supporting languages. The input to the sentence planning facility of AUTO would look like this (the last constituents indicating the  propositions having an equal amount of supporting evidence. Within each such sentence, the system will group together the propositions supported by the same languages, taking care that the universals with smaller numeration appear first. After some further transformations, the system outputs this: [27-b] is supported in 1 language (Thai). [13] is supported in 6 languages (Burmese, Burushaski, Hindi, Japanese, Kannada, and Turkish), and so are [3,12,15-a] (Berber, Hebrew, Maori, Masai, Welsh, and Zapotec).</Paragraph>
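      <Paragraph> The grouping performed by the sentence planner can be sketched as follows, using the universals and languages from the output just quoted. This is an illustration only: the names numeration_key and plan_support_sentences, the shape of the input, and the choice to keep groups in order of first appearance within a sentence are assumptions, since the source does not fully specify them.

```python
def numeration_key(label):
    """Sort key for labels such as '3', '12' or '15-a'."""
    number, _, letter = label.partition("-")
    return int(number), letter


def plan_support_sentences(support):
    """Group under-supported universals into sentence plans.

    support is a list of (label, languages) pairs. The result has one
    entry per sentence, (count, groups), where every group is a pair
    (labels, languages) of universals supported by the same languages.
    """
    # Step 1: universals supported by exactly the same languages.
    by_languages = {}
    for label, languages in support:
        by_languages.setdefault(tuple(languages), []).append(label)
    # Step 2: one sentence per distinct number of supporting languages.
    by_count = {}
    for languages, labels in by_languages.items():
        group = (sorted(labels, key=numeration_key), list(languages))
        by_count.setdefault(len(languages), []).append(group)
    return [(count, by_count[count]) for count in sorted(by_count)]
```

For the five universals above this yields two sentence plans, one for the single-language case ([27-b]) and one for the six-language case, the latter containing the group [13] and the group [3,12,15-a]. Turning these plans into text is left to surface generation.</Paragraph>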
      <Paragraph position="16"> For SURFACE GENERATION we use a hybrid approach, employing both templates and grammar rules, as required by the needs at the specific portions of text we are producing.</Paragraph>
      <Paragraph position="17"> The templates consist of canned text, interspersed with variables whose values are to be computed. The variables may stand either for individual words or for more abstract entities whose values are computed by grammar rules. E.g., AUTO employs rules for agreement between subject and predicate; between noun and determiner, demonstrative, relative marker, or apposition; and between noun and pronoun (for pronominal reference); as well as rules for external sandhi, etc. If, e.g., a variable stands for a list of languages, it will be handled by a grammar rule for and-coordinated NPs to yield e.g. &quot;Masai, Welsh, and Zapotec&quot;. Also, the templates are often randomly chosen among a set of &quot;synonymous&quot; alternatives in order to increase the variability of the produced texts. We have grammar rules to handle a variety of syntactic constructions, but the most important of them are those responsible for the verbalization of universals (which form by far the largest part of the produced texts). The dictionary part of that grammar is supplied from the input (cf. 2.1). There are diverse ways of expressing implications in English (and we do not confine ourselves to implications), and the grammar tries to attend to this fact. The grammar is a random generator, which ensures that intra-textual repetitions are avoided in the statement of the many universals UNIV usually finds.</Paragraph>
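      <Paragraph> The interplay of templates and small grammar rules can be sketched as below. Everything named here is hypothetical: the two templates (the second wording is invented for illustration), the functions coordinate and verbalize_support, and the reduction of agreement to a choice between singular and plural forms. AUTO's actual templates and grammar are not given in the text.

```python
import random

# Two "synonymous" templates; the wording of the second is an assumption.
TEMPLATES = [
    "{subject} {verb} supported in {n} {noun} ({languages}).",
    "{subject} {verb} attested in {n} {noun} ({languages}).",
]


def coordinate(items):
    """Rule for an and-coordinated list, e.g. 'Masai, Welsh, and Zapotec'."""
    if not items:
        return ""
    if len(items) == 1:
        return items[0]
    if len(items) == 2:
        return items[0] + " and " + items[1]
    return ", ".join(items[:-1]) + ", and " + items[-1]


def verbalize_support(labels, languages, rng=random):
    """Fill a randomly chosen template, with number agreement."""
    template = rng.choice(TEMPLATES)
    return template.format(
        subject="[" + ",".join(labels) + "]",
        verb="is" if len(labels) == 1 else "are",  # subject-predicate agreement
        n=len(languages),
        noun="language" if len(languages) == 1 else "languages",
        languages=coordinate(languages),
    )
```

Passing a seeded random.Random instance as rng makes the template choice reproducible, which is convenient when comparing generated texts.</Paragraph>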
      <Paragraph position="18"> Finally, AUTO also supports formatting facilities, e.g. for capitalization, correct spacing around punctuation marks, etc.</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4 Conclusion
</SectionTitle>
      <Paragraph position="0"> We have shown how a simple text generator can be linked to a linguistic discovery program in order to verbalize its discoveries. Despite the seemingly bizarre nature of the task of article generation, this work was actually inspired by the practical need to verbalize the great number of universals UNIV has systematically found in the various databases we have explored, as well as by the need to compare these with the findings of previous researchers. Presumably, such problems have not confronted previous discovery programs because they searched non-conventional spaces (necessitating additional human interpretation of results), because their solution objects (e.g. numerical laws in physics/mathematics, reaction pathways in chemistry, etc.) are not amenable to verbal expression, or simply because the set of solution objects has been too small to require automated verbalization.</Paragraph>
      <Paragraph position="1"> In sum: UNIVAUTO models scientific domains in which a machine is likely to find numerous and verbalizable solution objects (conceivably, low-level generalisations), and the scientific discourses in these domains are basically limited to description of these findings. We believe that such domains are not exceptional in empirical sciences generally, and hence systems like ours are not unlikely to emerge to aid scientists in these domains.</Paragraph>
      <Paragraph position="2"> Acknowledgment. The writing of this paper was supported through an EC Marie Curie Fellowship MCFI-2001-00689. The author is solely responsible for information communicated and the European Commission is not responsible for any views or results expressed.</Paragraph>
    </Section>
  </Section>
</Paper>