<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3241"> <Title>The Entropy Rate Principle as a Predictor of Processing Effort: An Evaluation against Eye-tracking Data</Title> <Section position="5" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4.2 Results </SectionTitle> <Paragraph position="0"> Table 3 shows the results of the correlation analyses on the development set. These results were obtained after excluding all sentences at positions 1 and 2.</Paragraph> <Paragraph position="1"> In the newspaper texts in the BNC, these positions have a special function: position 1 contains the title, and position 2 contains the name of the author. The first sentence of the text is therefore at position 3 (unlike in the Penn Treebank, in which no title or author information is included and texts start at position 1).</Paragraph> <Paragraph position="2"> We then conducted the same correlation analyses on the test set, i.e., on the Embra eye-tracking corpus. The results are tabulated in Table 4. Note that we set no threshold for sentence position in the test set, as the maximum article length in this corpus was only 24 sentences.</Paragraph> <Paragraph position="3"> Finally, we investigated whether the total reading times in the Embra corpus are correlated with sentence position and entropy. We computed regression analyses that partialled out word length, word frequency, and subject effects, as recommended by Lorch and Myers (1990). All variables other than position were normalized by sentence length. Table 5 lists the resulting correlation coefficients. Note that no binning was carried out here. 
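The length normalization just mentioned can be made concrete with a small sketch. This is a toy illustration, not the paper's implementation: sentence entropy is taken, following the formula discussed later in this section, as the negative log-probability of the sentence divided by its length, (1/|X|) * -log P(X), and the bigram probabilities below are uniform stand-ins rather than estimates from the BNC.

```python
# Toy sketch (not the paper's code) of per-sentence entropy normalized
# by sentence length: H(X) = -(1/|X|) log P(X) under a bigram model.
import math

def sentence_entropy(words, bigram_logprob):
    """Mean negative log-probability per word: -(1/|X|) * log P(X)."""
    logp = 0.0
    prev = "<s>"
    for w in words:
        logp += bigram_logprob(prev, w)
        prev = w
    return -logp / len(words)  # the 1/|X| length normalization

def uniform(prev, w):
    # Illustrative stand-in model: every bigram is equally likely
    # over a hypothetical 50-word vocabulary.
    return math.log2(1 / 50)

h = sentence_entropy("the market rallied today".split(), uniform)
# Under the uniform model, h is log2(50), about 5.64 bits per word,
# regardless of sentence length -- which is what the normalization buys.
```

Under a real n-gram model the per-word values would of course differ across sentences; the uniform model is used only to keep the sketch self-contained.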
Figure 3 plots one of the correlations for illustration.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Discussion </SectionTitle> <Paragraph position="0"> The results in Table 3 confirm that the results obtained on the Penn Treebank also hold for the newspaper part of the BNC. The top half of the table lists the correlation coefficients for the binned data. We find a significant correlation between sentence position and entropy for the cut-off values 25 and 76.</Paragraph> <Paragraph position="1"> In both cases, there is also a significant correlation with sentence length; this correlation is particularly high (-0.8584) for c = 25. The entropy rate effect does not seem to hold if there is no cut-off; here, we fail to find a significant correlation (though the correlation with length is again significant). This is probably explained by the fact that the BNC test set contains sentences with a maximum position of 206, and data for these high sentence positions is very sparse.</Paragraph> <Paragraph position="2"> The lower half of Table 3 confirms another result from Experiment 1: there is generally a low but significant correlation between entropy and position, even if the correlation is computed for individual sentences rather than for bins of sentences with the same position. Furthermore, we find that sentence length is again a significant predictor of sentence position, even on the raw data. This is in line with the results of Experiment 1.</Paragraph> <Paragraph position="3"> Table 4 lists the results obtained on the test set (i.e., the Embra corpus). Note that no cut-off was applied here, as the maximum sentence position in this set is only 24. Both on the binned data and on the raw data, we find significant correlations between sentence position and both entropy and sentence length. 
However, compared to the results on the BNC, the signs of the correlations are inverted: there is a significant negative correlation between position and entropy, and a significant positive correlation between position and length. It seems that the Embra corpus is peculiar in that longer sentences appear later in the text, rather than earlier. This is at odds with what we found on the Penn Treebank and on the BNC. Note that the positive correlation of position and length explains the negative correlation of position and entropy: length enters into the entropy calculation as 1/|X|, hence a high |X| will lead to low entropy, and vice versa.</Paragraph> <Paragraph position="4"> We have no immediate explanation for the inversion of the relationship between position and length in the Embra corpus; it might be an idiosyncrasy of this corpus (note that the texts were specifically picked for eye-tracking, and are unlikely to be a random sample; they are also shorter than usual newspaper texts). Note in particular that the Embra corpus is not a subset of the BNC (although it was sampled from UK broadsheet newspapers, and hence should be similar to our development and training corpora).</Paragraph> <Paragraph position="5"> Let us now turn to Table 5, which lists the results of the analyses correlating the total reading time for a sentence with its position and its entropy (derived from n-grams with n = 2,...,5). Note that these correlation analyses were conducted by partialling out word length and word frequency, which are well-known to correlate with reading times. We find that even once these factors have been controlled for, there is still a significant positive correlation between entropy and reading time: sentences with higher entropy are harder to process and hence have higher reading times. This is illustrated in Figure 3 for one of the correlations. 
As we argued in Section 2, this relationship between entropy and processing effort is a crucial prerequisite of the entropy rate principle. The increase of entropy with sentence position observed by G&C (and in our Experiment 1) only makes sense if increased entropy corresponds to increased processing difficulty (e.g., to increased reading time). Note that this result is compatible with previous research by McDonald and Shillcock (2003), who demonstrate a correlation between reading time measures and bigram probability (though their analysis is on the word level, not on the sentence level).</Paragraph> <Paragraph position="6"> The second main finding in Table 5 is that there is no significant correlation between sentence position and reading time. As we argued in Section 2, this is predicted by the entropy rate principle: the optimal way to send information is at a constant rate.</Paragraph> <Paragraph position="7"> In other words, speakers should produce sentences with constant informativeness, which means that if context is taken into account, all sentences should be equally difficult to process, no matter which position they are at. This manifests itself in the absence of a correlation between position and reading time in the eye-tracking corpus.</Paragraph> </Section> </Section> </Paper>