<?xml version="1.0" standalone="yes"?> <Paper uid="P97-1005"> <Title>Automatic Detection of Text Genre</Title> <Section position="5" start_page="34" end_page="35" type="evalu"> <SectionTitle> 4 Results </SectionTitle>
<Paragraph position="0"> Table 1 gives the results of the experiments. For each genre facet, it compares our results using surface cues (both with logistic regression and neural nets) against results using Karlgren and Cutting's structural cues on the one hand (last pair of columns) and against a baseline on the other (first column).</Paragraph>
<Paragraph position="1"> Each text in the evaluation suite was tested for each facet. Thus the number 78 for NARRATIVE under method &quot;LR (Surf.) All&quot; means that when all texts were subjected to the NARRATIVE test, 78% of them were classified correctly.</Paragraph>
<Paragraph position="2"> There are at least two major ways of conceiving what the baseline should be in this experiment. If the machine were to guess randomly among k categories, the probability of a correct guess would be 1/k, i.e., 1/2 for NARRATIVE, 1/6 for GENRE, and 1/4 for BROW. But one could get dramatic improvement just by building a machine that always guesses the most populated category: NONFICT for GENRE, MIDDLE for BROW, and No for NARRATIVE.</Paragraph>
<Paragraph position="3"> The first approach would be fair, because our machines in fact have no prior knowledge of the distribution of genre facets in the evaluation suite, but we decided to be conservative and evaluate our methods against the latter baseline. No matter which approach one takes, however, each of the numbers in the table is significant at p < .05 by a binomial test.</Paragraph>
<Paragraph position="4"> That is, there is less than a 5% chance that a machine guessing at the baseline rate could have produced results this much better than the baseline.</Paragraph>
<Paragraph position="5"> It will be recalled that in the LR models, facets with more than two levels were handled by a binary decision machine for each level; the level whose machine returned the most positive score was then chosen.</Paragraph>
<Paragraph position="6"> Some feeling for the internal functioning of our algorithms can therefore be obtained by examining the performance of each of these binary machines; for comparison, this information is also given for some of the neural net models. Table 2 shows how often each binary machine correctly determined whether a text did or did not fall in a particular facet level. Here again the appropriate baseline could be determined in two ways.</Paragraph>
<Paragraph position="7"> For a machine that chooses randomly, performance would be 50%, and all of the numbers in the table would be significantly better than chance (p < .05, binomial test). But a simple machine that always guesses No would perform much better, and it is against this stricter standard that we computed the baseline in Table 2. Under this baseline, the binomial test shows that some numbers are not significantly better; those that are significantly better than the baseline at p < .05 are starred.</Paragraph>
<Paragraph position="8"> Tables 1 and 2 present aggregate results, when all texts are classified for each facet or level. Table 3, by contrast, shows which classifications are assigned to texts that actually belong to a specific known level. For example, the first row shows that of the 18 texts that really are of the REPORTAGE GENRE level, 83% were correctly classified as REPORTAGE, 6% were misclassified as EDITORIAL, and 11% as NONFICTION. Because of space constraints, we present this amount of detail only for the six GENRE levels, with logistic regression on selected surface variables.</Paragraph>
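<Paragraph position="9"> To make the baseline comparison and the significance test concrete, the following minimal sketch (in Python; the accuracy, text count, and baseline value below are hypothetical placeholders, not figures from Table 1) computes both candidate baselines for a facet and then the one-sided binomial tail probability of an observed result against the stricter majority-class baseline.

from collections import Counter
from math import comb

def baselines(gold_labels):
    """Random-guess baseline (1/k) and majority-class baseline for one facet."""
    counts = Counter(gold_labels)
    random_guess = 1.0 / len(counts)                      # e.g. 1/6 for GENRE
    majority = max(counts.values()) / len(gold_labels)    # always guess the top level
    return random_guess, majority

def binomial_p_value(n_correct, n_total, p_baseline):
    """One-sided tail P(X >= n_correct) for X ~ Binomial(n_total, p_baseline)."""
    return sum(comb(n_total, i) * p_baseline**i * (1 - p_baseline)**(n_total - i)
               for i in range(n_correct, n_total + 1))

# Hypothetical example: 78 of 100 texts correct against a 54% baseline.
p = binomial_p_value(78, 100, 0.54)
print(f"p = {p:.4f}", "significant at p < .05" if p < 0.05 else "not significant")
</Paragraph>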
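<Paragraph position="10"> The decision rule for multi-level facets can likewise be sketched: one binary logistic machine per level, with the most positive score winning. The weights, biases, and feature values here are illustrative placeholders rather than fitted models, and the GENRE level names beyond those discussed above are assumed.

import math

def logistic_score(weights, bias, features):
    """Score from one binary logistic-regression machine."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def classify_facet(machines, features):
    """Run every per-level binary machine and choose the most positive score."""
    scores = {level: logistic_score(w, b, features)
              for level, (w, b) in machines.items()}
    return max(scores, key=scores.get)

# Illustrative per-level machines for GENRE (weights and biases are made up).
machines = {
    "REPORTAGE": ([0.9, -0.2], -0.1),
    "EDITORIAL": ([0.3, 0.4], -0.5),
    "NONFICT":   ([0.1, 0.1], 0.3),
    "FICTION":   ([-0.8, 0.5], -0.2),
    "SCITECH":   ([-0.6, 0.8], 0.0),
    "LEGAL":     ([-0.2, -0.7], 0.2),
}
print(classify_facet(machines, features=[1.2, 0.4]))
</Paragraph>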
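<Paragraph position="11"> Table 3 is, in effect, a row-normalized confusion matrix. A minimal sketch of that computation follows; the gold/predicted pairs below are constructed to mirror the first row of the table, not taken from our data.

from collections import Counter, defaultdict

def confusion_rows(pairs):
    """Percentage of texts from each true level assigned to each predicted level."""
    rows = defaultdict(Counter)
    for gold, predicted in pairs:
        rows[gold][predicted] += 1
    return {gold: {pred: 100.0 * n / sum(preds.values())
                   for pred, n in preds.items()}
            for gold, preds in rows.items()}

# 18 REPORTAGE texts: 15 correct, 1 -> EDITORIAL, 2 -> NONFICTION.
pairs = ([("REPORTAGE", "REPORTAGE")] * 15 +
         [("REPORTAGE", "EDITORIAL")] * 1 +
         [("REPORTAGE", "NONFICTION")] * 2)
print(confusion_rows(pairs)["REPORTAGE"])  # ~83%, ~6%, ~11%
</Paragraph>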
</Section> </Paper>