<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-2174">
  <Title>K. General Fiction L. Mystery M. Science Fiction N. Adventure and Western</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 categories
</SectionTitle>
    <Paragraph position="0"> In the case of two categories, only one function is necessary for determining the category of an item. The function classified 478 cases correctly and misclassified 22, out of the 500 cases, as shown in table 3 and figure 1.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 categories
</SectionTitle>
    <Paragraph position="0"> Using the three functions extracted, 366 cases were correctly classified, and 134 cases were misclassified, out of the 500 cases, as can be seen in table 4 and figure 2.</Paragraph>
    <Paragraph position="1"> &amp;quot;Miscellaneous&amp;quot;, the most problematic category, is a loose grouping of different informative texts. The single most problematic subsubset of texts is a subset of eigh teen non-fiction texts labeled &amp;quot;learned/humalfities&amp;quot;.</Paragraph>
    <Paragraph position="2"> Sixteen of them were misclassified, thirteen as &quot;miscellaneous&quot;.</Paragraph>
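The counts reported for the two-category and four-category runs can be turned into error rates with simple arithmetic. The following is a minimal sketch; the function name and variable names are illustrative and do not come from the paper, while the counts are those given in the text.

```python
def error_rate(misclassified, total):
    """Return the share of misclassified cases as a percentage."""
    if total == 0:
        raise ValueError("total must be positive")
    return 100.0 * misclassified / total

# Counts reported in the text, out of 500 cases each.
TOTAL_CASES = 500
two_category_errors = 22    # 478 cases classified correctly
four_category_errors = 134  # 366 cases classified correctly

two_rate = error_rate(two_category_errors, TOTAL_CASES)
four_rate = error_rate(four_category_errors, TOTAL_CASES)

# The "learned/humanities" subset: 16 of 18 texts misclassified.
learned_rate = error_rate(16, 18)

print(f"2 categories: {two_rate:.1f} % errors")
print(f"4 categories: {four_rate:.1f} % errors")
print(f"learned/humanities: {learned_rate:.1f} % errors")
```

Expressed this way, the "learned/humanities" subset stands out clearly against the overall four-category rate.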
    <Paragraph position="4"> [Table 4, partially recoverable: row 3 (label illegible), 126 items; row 4, Misc., 68 errors (47 %); total, 134 errors (27 %).] [...] correctly classified and 242 cases misclassified out of the 500 cases, as shown in table 5. Trying to distinguish between the different types of fiction is expensive in terms of errors. If the fiction subcategories were collapsed there would only be ten categories, and the error rate for the categorization would improve, as shown in the &quot;revised total&quot; record of the table. The &quot;learned/humanities&quot; subcategory is, as before, problematic: only two of the eighteen items were correctly classified. The others were most often misclassified as &quot;Religion&quot; or &quot;Belles Lettres&quot;. Validation of the Technique It is important to note that this experiment does not claim to show how genres in fact differ. What we show is that this sort of technique can be used to determine which parameters to use, given a set of them. We did not use a test set disjoint from the training set, and we do not claim that the functions we had the method extract from the data are useful in themselves. We discuss how well this method categorizes a set text, given a set of categories, and given a set of parameters.</Paragraph>
    <Paragraph position="5"> The error rates climb steeply with the number of categories tested for in the corpus we used. This may have to do with how the categories are chosen and defined. For instance, distinguishing between different</Paragraph>
    <Paragraph position="6"> types of fiction by formal or stylistic criteria of this kind may just be something we should not attempt: the fiction types are naturally defined in terms of their content, after all.</Paragraph>
    <Paragraph position="7"> The statistical technique of factor analysis can be used to discover categories, like Biber has done. The problem with using automatically derived categories is that even if they are in a sense real, meaning that they are supported by data, they may be difficult to explain for the unenthusiastic layman if the aim is to use the technique in retrieval tools.</Paragraph>
    <Paragraph position="8"> Other criteria that should be studied are second and higher order statistics on the respective parameters. Certain parameters probably vary more in certain text types than others, and they may have a skewed distribution as well. This is not difficult to determine, although the standard methods do not support automatic determination of standard deviation or skewness as discrimination criteria. Together with the investigation of several hitherto untried parameters, this is a next step.</Paragraph>
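To illustrate what second and higher order statistics on a parameter could look like, the sketch below computes the standard deviation and skewness of one parameter within a single text. It is only an illustration under stated assumptions: the choice of sentence length as the parameter, the sample values, and the use of population (rather than sample) formulas are not taken from the paper.

```python
import math

def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)

def standard_deviation(values):
    """Population standard deviation (a second-order statistic)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def skewness(values):
    """Population skewness: third central moment divided by sd cubed."""
    m = mean(values)
    sd = standard_deviation(values)
    if sd == 0:
        return 0.0  # all values identical, no asymmetry to measure
    return sum((v - m) ** 3 for v in values) / (len(values) * sd ** 3)

# Hypothetical sentence lengths (in words) for one text.
sentence_lengths = [4, 5, 5, 6, 7, 8, 30]

spread = standard_deviation(sentence_lengths)
skew = skewness(sentence_lengths)
print(f"standard deviation: {spread:.2f}, skewness: {skew:.2f}")
```

A single very long sentence pulls the skewness well above zero, which is the kind of distributional property the averages alone would hide. Values like these could then be supplied as additional parameters alongside the averages.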
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Readability Indexing
</SectionTitle>
      <Paragraph position="0"> Not unrelated to the study of genre is the study of readability which aims to categorize texts according to their suitability for assumed sets of assumed readers.</Paragraph>
      <Paragraph position="1"> There is a wealth of formulae to compute readability.</Paragraph>
      <Paragraph position="2"> Most commonly they combine easily computed text measures, typically average or sampled average sentence length combined with similarly computed word length, or incidence of words not on a specified &quot;easy word list&quot; (Chall, 1948; Klare, 1963). In spite of Chall's warnings about injudicious application to writing tasks, readability measurement has naively come to be used as a prescriptive metric of good writing as a tool for writers, and has thus come into some disrepute among text researchers. Our small study confirms the basic findings of the early readability studies: the most important factors of the ones we tested are word length, sentence length, and different derivatives of these two parameters. As long as readability indexing schemes are used in descriptive applications they work well to discriminate between text types.</Paragraph>
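The two basic measures named above, sentence length and word length, are cheap to compute. The sketch below shows one naive way to obtain them; the tokenization rules (sentences end at a period, exclamation mark or question mark; words are runs of letters and apostrophes) are simplifying assumptions, and the function does not implement any particular published readability formula.

```python
import re

def text_measures(text):
    """Return (average sentence length in words, average word length in characters)."""
    # Naive splitting: a sentence ends at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive tokenization: a word is a run of letters or apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0, 0.0
    avg_sentence_length = len(words) / len(sentences)
    avg_word_length = sum(len(w) for w in words) / len(words)
    return avg_sentence_length, avg_word_length

sample = "The cat sat. It was a long and tiring day."
sent_len, word_len = text_measures(sample)
print(f"words per sentence: {sent_len:.1f}, characters per word: {word_len:.1f}")
```

Used descriptively, such values serve as parameters for discriminating between text types rather than as a verdict on writing quality.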
      <Paragraph position="3"> Application The technique shows practical promise. The territorial maps shown in figures 1, 2, and 3 are intuitively useful tools for displaying what type a particular text is, compared with other existing texts. The technique demonstrated above has an obvious application in information retrieval, for picking out interesting texts, if content based methods select a too large set for easy manipulation and browsing (Cutting et al, 1992).</Paragraph>
      <Paragraph position="4"> In any specific application area it will be unlikely that the text database to be accessed will be completely free form. The texts under consideration will probably be specific in some way. General text types may be useful, but quite probably there will be a domain or field-specific text typology. In an envisioned application, a user will employ a cascade of filters starting with filtering by topic, and continuing with filters by genre or text type, and ending by filters for text quality, or other tentative finer-grained qualifications.</Paragraph>
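The cascade of filters described above can be sketched as a sequence of predicates applied in order. Everything concrete in this sketch is hypothetical: the record fields ("topic", "genre", "quality"), their values, and the quality threshold are made up for illustration, and in a real system each predicate would be backed by a classifier rather than a stored label.

```python
def cascade(texts, filters):
    """Apply each filter in order, keeping only texts that pass all of them."""
    remaining = list(texts)
    for keep in filters:
        remaining = [t for t in remaining if keep(t)]
    return remaining

# Hypothetical records; field names and values are illustrative only.
texts = [
    {"id": 1, "topic": "linguistics", "genre": "learned", "quality": 0.9},
    {"id": 2, "topic": "linguistics", "genre": "fiction", "quality": 0.8},
    {"id": 3, "topic": "sports", "genre": "press", "quality": 0.7},
    {"id": 4, "topic": "linguistics", "genre": "learned", "quality": 0.2},
]

filters = [
    lambda t: t["topic"] == "linguistics",  # 1. filter by topic
    lambda t: t["genre"] == "learned",      # 2. filter by genre or text type
    lambda t: t["quality"] >= 0.5,          # 3. filter by text quality
]

selected = cascade(texts, filters)
print([t["id"] for t in selected])
```

Ordering the filters from coarse to fine means the later, finer-grained judgments only have to be made on the texts that survive the earlier stages.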
      <Paragraph position="5"> The IntFilter Project The IntFilter Project at the departments of Computer and Systems Sciences, Computational Linguistics, and Psychology at Stockholm University is at present studying</Paragraph>
      <Paragraph position="6"> texts on the USENET News conferencing system. The project at present studies texts which appear on several different types of USENET News conferences, and investigates how well the classification criteria and categories that experienced USENET News users report using (IntFilter, 1993) can be used by a newsreader system. To do this the project applies the method described here. The project uses categories such as &quot;query&quot;, &quot;comment&quot;, &quot;announcement&quot;, &quot;FAQ&quot;, and so forth, categorizing them using parameters such as different types of length measures, form word content, quote level, percentage quoted text and other USENET News specific parameters.</Paragraph>
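Two of the parameters mentioned above, quote level and percentage of quoted text, could be measured roughly as follows. This is a sketch under assumptions and not the project's implementation: it relies on the common convention of marking quoted lines with a leading ">" character, counts quoted material by line rather than by character or word, and uses hypothetical function names and a made-up message.

```python
def quote_level(line):
    """Count leading '>' markers, ignoring spaces between them."""
    level = 0
    for ch in line:
        if ch == ">":
            level += 1
        elif ch != " ":
            break
    return level

def quote_measures(message):
    """Return (percentage of quoted lines, maximum quote level) for a message body."""
    lines = [ln for ln in message.splitlines() if ln.strip()]
    if not lines:
        return 0.0, 0
    levels = [quote_level(ln) for ln in lines]
    quoted = sum(1 for lv in levels if lv > 0)
    return 100.0 * quoted / len(lines), max(levels)

# A made-up message body with two levels of quoting.
message = """> > Is there a FAQ for this group?
> Yes, it is posted monthly.
Thanks, found it."""

percent_quoted, max_level = quote_measures(message)
print(f"quoted lines: {percent_quoted:.1f} %, deepest quote level: {max_level}")
```

Measures like these would sit alongside the length and form word parameters as inputs to the categorization.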
    </Section>
  </Section>
</Paper>