<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1022">
  <Title>Bootstrapping Lexical Choice via Multiple-Sequence Alignment</Title>
  <Section position="5" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> We implemented our system on formal mathematical proofs created by the Nuprl system, which has been used to create thousands of proofs in many mathematical flelds (Constable et al., 1986). Generating natural-language versions of proofs was flrst addressed several decades ago (Chester, 1976). But now,largeformal-mathematicslibrariesareavailable on-line.6 Unfortunately, they are usually encoded in highlytechnicallanguages(seeFigure7(i)). Natural-language versions of these proofs would make them more widely accessible, both to users lacking familiaritywithaspeciflcprover'slanguage, andtosearch engines which at present cannot search the symbolic language of formal proofs.</Paragraph>
    <Paragraph position="1"> Besides these practical beneflts, the formal mathematics domain has the further advantage of being particularly suitable for applying statistical generation techniques. Training data is available because   theorem-proverdevelopersfrequentlyprovideverbalizations of system outputs for explanatory purposes. In our case, a multi-parallel corpus of Nuprl proof verbalizationsalreadyexists(Holland-Minkleyetal., 1999) and forms the core of our training corpus.</Paragraph>
    <Paragraph position="2"> Also, from a research point of view, the examples from Figure 1 show that there is a surprising variety in the data, making the problem quite challenging.</Paragraph>
    <Paragraph position="3"> All evaluations reported here involved judgments from graduate students and researchers in computer science. We authors were not among the judges.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Corpus
</SectionTitle>
      <Paragraph position="0"> Our training corpus consists of 30 Nuprl proofs and 83 verbalizations. On average, each proof consists of 5.08 proof steps, which are the basic semantic unit in Nuprl; Figure 7(i) shows an example of three Nuprl steps. An additional flve proofs, disjoint from the test data, were used as a development set for setting the values of all parameters.7 Pre-processing First, we need to divide the verbalization texts into portions corresponding to individual proof steps, since per-instance alignment handles verbalizations for only one semantic unit at a time. Fortunately, Holland-Minkley et al. (1999)</Paragraph>
      <Paragraph position="2"> Assume that i is an integer, we need to show jij = j!ij.</Paragraph>
      <Paragraph position="3"> From absval eq lemma, jij = j ! ij reduces to i = SSi. By the deflnition of pm equal, i = SSi is proved.</Paragraph>
      <Paragraph position="4"> Assume i is an integer. By the absval eq lemma, the goal becomes jij = j ! ij.</Paragraph>
      <Paragraph position="5"> Now, the original expression can be rewritten as i = SSi.</Paragraph>
      <Paragraph position="6">  Verbalization produced by traditional generation system; note that the initial goal is never specifled, which means that in the phrase \the goal becomes&amp;quot;, we don't know what the goal is. showed that for Nuprl, one proof step roughly corresponds to one sentence in a natural language verbalization. So, we align Nuprl steps with verbalization sentences using dynamic programming based on the number of symbols common to both the step and the verbalization. This produced 382 pairs of Nuprl steps and corresponding verbalizations. We also did some manual cleaning on the training data to reduce noise for subsequent stages.8</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Per-component evaluation
</SectionTitle>
      <Paragraph position="0"> We flrst evaluated three individual components of our system: paraphrase thesaurus induction, argument-value selection in slotted lattice induction, andtemplateinduction. Wealsovalidatedtheutility of multi-parallel, as opposed to one-parallel, data.</Paragraph>
      <Paragraph position="1"> Paraphrase thesaurus We presented two judges withall71paraphrasepairsproducedbyoursystem.</Paragraph>
      <Paragraph position="2"> They identifled 87% and 82%, respectively, as being plausible substitutes within a mathematical context.</Paragraph>
      <Paragraph position="3"> Argument-value selection We next measured  howwelloursystemmatchessemanticargumentvalues with lattice node sequences. We randomly selected20Nuprlstepsandtheircorrespondingverbal- null izations. From this sample, a Nuprl expert identifled the argument values that appeared in at least one corresponding verbalization; of the 46 such values, our system correctly matched lattice node sequences to 91%. To study the relative efiectiveness of using multi-parallel rather than one-parallel data, we also implemented a baseline system that used only one (randomly-selected) verbalization among the multiple possibilities. This single-verbalization baseline matched only 44% of the values correctly, indicating the value of a multi-parallel-corpus approach.</Paragraph>
      <Paragraph position="4"> Templates Thirdly, we randomly selected 20 inducedtemplates;ofthese,aNuprlexpertdetermined null 8We employed pattern-matching tools to flx incorrect sentence boundaries, converted non-ascii symbols to a human-readable format, and discarded a few verbalizations which were unrelated to the underlying proof. that 85% were plausible verbalizations of the corresponding Nuprl. This was a very large improvement over the single-verbalization baseline's 30%, again validating the multi-parallel-corpus approach.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Evaluation of the generated texts
</SectionTitle>
      <Paragraph position="0"> Finally, we evaluated the quality of the text our system generates by comparing its output against the system of Holland-Minkley et al. (1999), which produces accurate and readable Nuprl proof verbalizations. Their system has a hand-crafted lexical chooser derived via manual analysis of the same corpus that our system was trained on. To run the experiments, wereplacedHolland-Minkleyet. al'slexical chooser with the mapping dictionary we induced. (An alternative evaluation would have been to compare our output with human-authored texts. But this wouldn't have allowed us to evaluate the performance of the lexical chooser alone, as human proof generation may difier in aspects other than lexical choice.) The test set serving as input to the two systemsconsistedof20held-outproofs,unseenthrough- null out the entirety of our algorithm development work.</Paragraph>
      <Paragraph position="1"> We evaluated the texts on two dimensions: readability and fldelity to the mathematical semantics.</Paragraph>
      <Paragraph position="2"> Readability We asked 11 judges to compare the readability of the texts produced from the same Nuprl proof input: Figure 7(ii) and (iii) show an example text pair.9 (The judges were not given the original Nuprl proofs.) Figure 8 shows the results.</Paragraph>
      <Paragraph position="3"> Good entries are those that are not preferences for the traditional system, since our goal, after all, is to show that MSA-based techniques can produce output as good or better than a hand-crafted system.</Paragraph>
      <Paragraph position="4"> We see that for every lemma and for every judge, our system performed quite well. Furthermore, for more than half of the lemmas, more than half the 9To prevent judges from identifying the system producing the text, the order of presentation of the two systems' output was randomly chosen anew for each proof.</Paragraph>
      <Paragraph position="5"> Lemma % good Judge a b c d e f g h i j k l m n o p q r s t</Paragraph>
      <Paragraph position="7"> logictodetermine,giventheNuprlproofsandoutput texts, whether the texts preserved the main ideas of the formal proofs without introducing ambiguities.</Paragraph>
      <Paragraph position="8"> All 20 of our system's proofs were judged correct, while only 17 of the traditional system's proofs were judged to be correct.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML