<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1021">
  <Title>Towards Robust Context-Sensitive Sentence Alignment for Monolingual Corpora</Title>
  <Section position="6" start_page="165" end_page="166" type="evalu">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="165" end_page="165" type="sub_section">
      <SectionTitle>
5.1 Britannica corpus
</SectionTitle>
      <Paragraph position="0"> Precision/recall curves for both methods, aggregated over all the documents of the testing portion of the Britannica corpus, are given in Figure 5. To obtain different precision/recall points, we vary the threshold above which a sentence pair is deemed matched. Of course, when practically applying the algorithm, we have to pick a particular threshold, as we have done by choosing th.</Paragraph>
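The threshold sweep described above can be sketched as follows. This is an illustrative snippet, not the paper's implementation: the pair scores, gold matches, and threshold grid are hypothetical placeholders.

```python
# Sketch: trace a precision/recall curve by sweeping the match threshold.
# Pairs scoring strictly above the threshold are deemed matched, mirroring
# the "similarity strictly greater than 0.0" convention in the text.

def pr_at_threshold(scored_pairs, gold, threshold):
    """Precision/recall when pairs scoring above `threshold` are matched."""
    predicted = {pair for pair, score in scored_pairs if score > threshold}
    if not predicted:
        return 1.0, 0.0
    tp = len(predicted & gold)
    return tp / len(predicted), tp / len(gold)

# Hypothetical similarity scores and gold-standard matches.
scored = [(("s1", "t1"), 0.9), (("s2", "t2"), 0.6),
          (("s3", "t3"), 0.3), (("s4", "t4"), 0.1)]
gold = {("s1", "t1"), ("s2", "t2")}

# Each threshold yields one (precision, recall) point on the curve.
curve = [pr_at_threshold(scored, gold, th) for th in (0.0, 0.2, 0.5, 0.8)]
```

Lowering the threshold trades precision for recall, which is what produces the different points along the curves in Figure 5.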
      <Paragraph position="1"> Precision/recall values at this threshold are also indicated in the figure.6 Comparative results with previous algorithms are given in Table 1, in which the results for Barzilay and Elhadad's algorithm and previous ones are taken from Barzilay and Elhadad (2003). The paper reports the precision at 55.8% recall, since the Decomposition method (Jing, 2002) only produced results at this level of recall, as some of the method's parameters were hard-coded.</Paragraph>
      <Paragraph position="2"> Interestingly, the TF*IDF method is highly competitive in determining sentence similarity.</Paragraph>
      <Paragraph position="3"> [Footnote 6: Decreasing the threshold to 0.0 does not yield all pairs, since we only consider pairs with similarity strictly greater than 0.0, and restrict many-to-one matches to 2-to-1.] Despite its simplicity, it achieves the same performance as Barzilay and Elhadad's algorithm,7 and is better than all previous ones. A significant improvement is achieved by adding the global alignment. Clearly, the method is inherently limited in that it can only match sentences with some lexical overlap. For instance, the following sentence pair, which should have been matched, was missed: - Population soared, reaching 756,000 by 1903, and urban services underwent extensive modification.</Paragraph>
      <Paragraph position="4"> - At the beginning of the 20th century, Warsaw had about 700,000 residents.</Paragraph>
      <Paragraph position="5"> Matching &quot;1903&quot; with &quot;the beginning of the 20th century&quot; goes beyond the scope of any method relying predominantly on word identity.</Paragraph>
      <Paragraph position="6"> The hope is, however, that such mappings could be learned by amassing a large corpus of accurately sentence-aligned documents, and then applying a word-alignment algorithm, as proposed by Quirk et al. (2004). Incidentally, when examining sentence pairs with high TF*IDF similarity scores, we found some striking cases that appear to have been missed by the human judges. Of course, we faithfully and conservatively relied on the human annotation in the evaluation, ignoring such cases.</Paragraph>
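The lexical-overlap limitation discussed above can be seen in a minimal sketch of TF*IDF cosine similarity: with no shared tokens, the score is exactly zero. The whitespace tokenization and the particular IDF weighting below are simplifying assumptions for illustration, not the paper's exact formulation.

```python
# Sketch: TF*IDF cosine similarity between sentences. Sentences with no
# lexical overlap (like the Warsaw example) receive a similarity of 0.0,
# regardless of how closely related their meanings are.
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Map each sentence to a sparse {token: tf*idf} vector (simplified IDF)."""
    docs = [Counter(s.lower().split()) for s in sentences]
    n = len(docs)
    df = Counter(w for d in docs for w in d)  # document frequency per token
    return [{w: tf * math.log(1 + n / df[w]) for w, tf in d.items()}
            for d in docs]

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

a = "population soared reaching 756,000 by 1903"
b = "at the beginning of the 20th century warsaw had about 700,000 residents"
vecs = tfidf_vectors([a, b])
sim = cosine(vecs[0], vecs[1])  # no shared tokens, so sim is 0.0
```

Bridging such pairs requires knowledge beyond word identity, which is where the word-alignment idea of Quirk et al. (2004) comes in.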
    </Section>
    <Section position="2" start_page="165" end_page="166" type="sub_section">
      <SectionTitle>
5.2 Gospels
</SectionTitle>
      <Paragraph position="0"> For evaluating our algorithm's accuracy on the gospels, we again have to contend with the fact that the correct alignments are given at the resolution of pericopes, not verses. We cannot rely on the reciprocal best hit method we used for training, since it relies on the TF*IDF similarity scores, which we are attempting to evaluate. We therefore devise an alternative evaluation criterion, counting a pair of verses as correctly aligned if they belong to a matched pericope pair in the gold annotation.</Paragraph>
      <Paragraph position="1"> For recall, we note that not all the verses of a matched pericope should be matched, especially when one pericope has substantially more verses than the other. In general, we may expect the number of verses to be matched to be the minimum of |pg1| and |pg2|, the sizes of the two paired pericopes. We thus define recall as the number of correctly aligned verse pairs divided by the sum of min(|pg1|, |pg2|) over all matched pericope pairs. The results are given in Figure 6, including the word-overlap baseline, TF*IDF ranking with logistic regression, and the added global alignment. Once again, TF*IDF yields a substantial improvement over the baseline, and results are further improved by adding the global alignment.</Paragraph>
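The pericope-level criterion can be sketched as follows. The verse identifiers, pericope names, and sizes are hypothetical; the recall denominator follows the min(|pg1|, |pg2|) expectation stated in the text.

```python
# Sketch: pericope-level evaluation. A predicted verse pair counts as correct
# if its two pericopes form a matched pair in the gold annotation; recall
# divides correct pairs by sum of min(|pg1|, |pg2|) over matched pericopes.

def evaluate(predicted_pairs, verse_to_pericope, gold_pericope_pairs, sizes):
    correct = sum(
        1 for v1, v2 in predicted_pairs
        if (verse_to_pericope[v1], verse_to_pericope[v2]) in gold_pericope_pairs
    )
    precision = correct / len(predicted_pairs) if predicted_pairs else 1.0
    expected = sum(min(sizes[p1], sizes[p2]) for p1, p2 in gold_pericope_pairs)
    recall = correct / expected if expected else 0.0
    return precision, recall

# Hypothetical data: pericopes A (2 verses) and B (2 verses) are matched in
# the gold annotation; pericope C (1 verse) is unmatched.
verse_to_pericope = {"Mt1": "A", "Mt2": "A", "Mk1": "B", "Mk2": "B", "Mk4": "C"}
gold_pairs = {("A", "B")}
sizes = {"A": 2, "B": 2, "C": 1}

p, r = evaluate([("Mt1", "Mk1"), ("Mt2", "Mk4")],
                verse_to_pericope, gold_pairs, sizes)
```

Here one of the two predicted pairs falls inside the matched pericope pair, giving precision 0.5, and recall 0.5 against the expected min(2, 2) = 2 matchable verses.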
    </Section>
  </Section>
</Paper>