<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2090">
  <Title>Implementing a Characterization of Genre for Automatic Genre Identification of Web Pages</Title>
  <Section position="5" start_page="701" end_page="701" type="metho">
    <SectionTitle>
3 Attributes of the Tuple
</SectionTitle>
    <Paragraph position="0"> The attributes &lt;linguistic features, HTML tags, text types&gt; of the tuple represent the computationally tractable version of the combination &lt;purpose, form&gt; often used to define the concept of genre (e.g. cf. Roussinov et al. 2001).</Paragraph>
    <Paragraph position="1"> In our view, the purpose corresponds to text types, i.e. the rhetorical patterns that indicate what a text has been written for. For example, a text can be produced to narrate, instruct, argue, etc. Narration, instruction, and argumentation are examples of text types. As stressed earlier, text types are usually considered separate entities from genres (cf. Biber, 1988; Lee, 2001).</Paragraph>
    <Paragraph position="2"> Form is a more heterogeneous attribute. It can refer both to linguistic form and to shape (layout, etc.). From a computational point of view, linguistic form is represented by linguistic features, while shape is represented by HTML tags. The functionality attribute introduced by Shepherd and Watters (1998) can also be seen in terms of HTML tags (e.g. tags for links and scripts). While content words or terms show some drawbacks for automatic genre identification (cf. Boese and Howe, 2005), several types of linguistic features return good results, for instance Biberian features (Biber, 1988). In the model presented here we use a mixture of Biberian features and additional syntactic traits. The total number of features used in this implementation of the model is 100.</Paragraph>
    <Paragraph position="3"> These features are available online at: http://www.nltg.brighton.ac.uk/home/Marina.Santini/</Paragraph>
  </Section>
  <Section position="6" start_page="701" end_page="704" type="metho">
    <SectionTitle>
4 Inferential Model
</SectionTitle>
    <Paragraph position="0"> The inferential model presented here (partially discussed in Santini (2006a)) combines the advantages of deductive and inductive approaches. It is deductive because the co-occurrence and combination of features in text types is decided a priori by the linguist on the basis of previous studies, not derived by a statistical procedure, which is too biased towards high frequencies (some linguistic phenomena can be rare but are nonetheless discriminating). It is also inductive because the inference process is corpus-based, i.e. it relies on a pool of data used to predict text types. A few hand-crafted if-then rules combine the inferred text types with other traits (mainly layout and functionality tags) in order to suggest genres. These rules are worked out either on the basis of previous genre studies or of a cursory qualitative analysis. For example, the rules for personal home pages are based on the observations by Roberts (1998) and Dillon and Gushrowski (2000). When previous studies were not available, as in the cases of eshops or search pages, the author briefly analysed these genres to extract generalizations useful for writing a few rules.</Paragraph>
    <Paragraph position="1"> It is important to stress that there is no hand-coding in the model. Web pages were randomly downloaded from genre-specific portals or archives without any further annotation. Web pages were parsed, linguistic features were automatically extracted and counted from the parsed outputs, while frequencies of HTML tags were automatically counted from the raw web pages. All feature frequencies were normalized by the length of web pages (in tokens) and then submitted to the model.</Paragraph>
    <Paragraph position="2"> As stated earlier, the inferential model makes a clear-cut separation between text types and genres. The four text types included in this implementation are: descriptive_narrative, expository_informational, argumentative_persuasive, and instructional. The linguistic features for these text types come from previous (corpus-)linguistic studies (Werlich, 1976; Biber, 1988; etc.), and are not extracted from the corpus using statistical methods. For each web page the model returns the probability of belonging to each of the four text types. For example, a web page can have a probability of 0.9 of being argumentative_persuasive, 0.7 of being instructional, and so on. Probabilities are interpreted in terms of degree or gradation.</Paragraph>
    <Paragraph position="3"> For example, a web page with a probability of 0.9 of being argumentative_persuasive shows a high gradation of argumentation. Gradations/probabilities are ranked for each web page.</Paragraph>
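The ranking of gradations described above can be sketched as follows. The text-type labels come from the paper, but the probability values and the dictionary-based representation are purely illustrative assumptions.

```python
# Rank the four text-type gradations for one web page.
# The probability values below are hypothetical, for illustration only.
text_type_probs = {
    "descriptive_narrative": 0.2,
    "expository_informational": 0.4,
    "argumentative_persuasive": 0.9,
    "instructional": 0.7,
}

# Sort text types by gradation, highest first.
ranked = sorted(text_type_probs.items(), key=lambda kv: kv[1], reverse=True)
for text_type, p in ranked:
    print(f"{text_type}: gradation {p:.1f}")
```

Here the page would be treated primarily as argumentative_persuasive, with instructional as the second predominant text type.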
    <Paragraph position="4"> The computation of text types as an intermediate step between linguistic and non-linguistic features and genres is useful if we see genres as conventionalised and standardized cultural objects that raise expectations. For example, what we expect from an editorial is an 'opinion' or a 'comment' by the editor, which represents, broadly speaking, the view of the newspaper or magazine. Opinions are a form of 'argumentation'. Argumentation is a rhetorical pattern, or text type, expressed by a combination of linguistic features. If a document shows a high probability of being argumentative, i.e. a high gradation of argumentation, it has a good chance of belonging to argumentative genres, such as editorials, sermons, pleadings, academic papers, etc., and a lower chance of being a story, a biography, etc. We suggest that exploiting this knowledge about the textuality of a web page can add flexibility to the model, and this flexibility can capture hybridism and individualization, the key forces behind genre evolution.</Paragraph>
    <Section position="1" start_page="702" end_page="702" type="sub_section">
      <SectionTitle>
4.1 The Web Corpus
</SectionTitle>
      <Paragraph position="0"> The inferential model is based on a corpus representative of the web. In this implementation of the model we approximated one of the possible compositions of a random slice of the web, statistically supported by reliable standard error measures. We built a web corpus with four BBC web genres (editorial, Do-It-Yourself (DIY) mini-guide, short biography, and feature), seven novel web genres (blog, eshop, FAQs, front page, listing, personal home page, search page), and 1,000 unclassified web pages from the SPIRIT collection (Joho and Sanderson, 2004).</Paragraph>
      <Paragraph position="1"> The total number of web pages is 2,480. The four BBC genres represent traditional genres adapted to the functionalities of the web, while the other seven are novel web genres, either unprecedented or showing only a loose kinship with paper genres. Proportions are purely arbitrary and based on the assumption that at least half of web users tend to use recognized genre patterns in order to achieve felicitous communication. Following the Central Limit Theorem, we consider the sampling distribution of the sample mean to be approximately normal. This allows us to make inferences even if the population distribution is irregular or if variables are very skewed or highly discrete. The web corpus is available at: http://www.nltg.brighton.ac.uk/home/Marina.Santini/</Paragraph>
    </Section>
    <Section position="2" start_page="702" end_page="703" type="sub_section">
      <SectionTitle>
4.2 Bayesian Inference: Inferring with
Odds-Likelihood
</SectionTitle>
      <Paragraph position="0"> The inferential model is based on a modified version of Bayes' theorem, called odds-likelihood or the subjective Bayesian method (Duda and Reboh, 1984), which is capable of solving more complex reasoning problems than the basic version. Odds is a number that tells us how much more likely one hypothesis is than another. Odds and probabilities contain exactly the same information and are interconvertible. The main difference from the original Bayes' theorem is that in the modified version much of the effort is devoted to weighing the contributions of different pieces of evidence in establishing the match with a hypothesis. These weights are confidence measures: Logical Sufficiency (LS) and Logical Necessity (LN). LS is used when the evidence is known to exist (a larger value means greater sufficiency), while LN is used when the evidence is known NOT to exist (a smaller value means greater necessity). LS is typically a number &gt; 1, and LN is typically a number &lt; 1.</Paragraph>
      <Paragraph position="1"> Usually LS*LN=1. In this implementation of the model, LS and LN were set to 1.25 and 0.8 respectively, on the basis of previous studies and empirical adjustments. Future work will include further investigation of the tuning of these two parameters.</Paragraph>
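A minimal sketch of the odds-likelihood mechanics described above, using the LS and LN values reported here (1.25 and 0.8). The evidence probabilities and the simple present/absent weighting are illustrative assumptions, not the paper's exact procedure.

```python
LS, LN = 1.25, 0.8  # values used in this implementation; note LS * LN = 1

def prob_to_odds(p):
    # odds: how much more likely the hypothesis is than its negation
    return p / (1 - p)

def odds_to_prob(o):
    # inverse conversion: odds and probabilities carry the same information
    return o / (1 + o)

def multiplier(p_evidence):
    # LS weighs evidence known to exist (prob >= 0.5), LN evidence known not to
    return LS if p_evidence >= 0.5 else LN

# Illustrative update: prior probability 0.25, two pieces of present evidence.
posterior_odds = prob_to_odds(0.25) * multiplier(0.8) * multiplier(0.9)
print(round(odds_to_prob(posterior_odds), 3))  # 0.342
```

Note that one present and one absent piece of evidence cancel out exactly (LS * LN = 1), leaving the prior unchanged.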
      <Paragraph position="2"> The steps included in the model are the following:  1) Representation of the web in a corpus that is approximately normal.</Paragraph>
      <Paragraph position="3"> 2) Extraction, counting, and normalization of genre-revealing features.</Paragraph>
      <Paragraph position="4"> 3) Conversion of normalized counts into z-scores, which represent the deviation from the 'norm' emerging from the web corpus. The concept of &amp;quot;gradation&amp;quot; is based on these deviations from the norm.</Paragraph>
      <Paragraph position="5"> 4) Conversion of z-scores into probabilities, which means that feature frequencies are seen in terms of probability distributions.</Paragraph>
      <Paragraph position="6"> 5) Calculation of prior odds from the prior probability of a text type. The prior probability for each of the four text types was set to 0.25 (all text types were given an equal chance to appear in a web page). Prior odds are calculated with the formula: prOdds(H) = prProb(H) / (1 - prProb(H)). 6) Calculation of weighted features, or multipliers (Mn). If a feature or piece of evidence (E) has a probability &gt;= 0.5, LS is applied, otherwise LN is applied. 7) Combination of the weighted features according to the co-occurrence decided by the analyst on the basis of previous studies in order to infer text types. In this implementation the feature co-occurrence was decided following Werlich (1976) and Biber (1988).</Paragraph>
      <Paragraph position="7"> 8) The posterior odds for each text type are then calculated by multiplying the prior odds (step 5) by the co-occurring weighted features (step 7).</Paragraph>
      <Paragraph position="8"> 9) Finally, the posterior odds are re-converted into a probability value with the formula: prob(H) = odds(H) / (1 + odds(H)).</Paragraph>
      <Paragraph position="10"> Although odds contain exactly the same information as probability values, they are not constrained to the 0-1 range, as probabilities are.</Paragraph>
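The steps above can be sketched end-to-end as follows. This is a simplified reconstruction under stated assumptions: the feature frequencies, corpus statistics, and feature co-occurrence set are hypothetical, and the z-score-to-probability conversion is assumed to use the standard normal CDF, which is not specified here.

```python
import math

LS, LN = 1.25, 0.8   # weights used in this implementation
PRIOR = 0.25         # equal prior probability for each of the four text types

def z_score(norm_freq, corpus_mean, corpus_std):
    # step 3: deviation of a normalized feature frequency from the corpus norm
    return (norm_freq - corpus_mean) / corpus_std

def z_to_prob(z):
    # step 4 (assumed): map a z-score to a probability via the normal CDF
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def infer_text_type(feature_probs):
    # step 5: prior odds from the prior probability
    odds = PRIOR / (1 - PRIOR)
    # steps 6-7: weight each co-occurring feature and combine the multipliers
    for p in feature_probs:
        odds *= LS if p >= 0.5 else LN
    # steps 8-9: posterior odds, converted back to a probability
    return odds / (1 + odds)

# Hypothetical text type revealed by three features, two above the corpus norm.
feature_zs = [z_score(0.012, 0.006, 0.005),   # z = 1.2, well above the norm
              z_score(0.008, 0.006, 0.005),   # z = 0.4, slightly above
              z_score(0.002, 0.006, 0.005)]   # z = -0.8, below the norm
probs = [z_to_prob(z) for z in feature_zs]
print(round(infer_text_type(probs), 3))  # 0.294
```

With two features above the norm and one below, the posterior (0.294) rises only slightly above the 0.25 prior, illustrating how rare but present features nudge the gradation.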
      <Paragraph position="11"> Once text types have been inferred, if-then rules are applied to determine genres. In particular, for each of the seven web genres included in this implementation, a few hand-crafted rules combine the two predominant text types per web genre with additional traits. For example, the actual rules for deriving a blog are as simple as the following:
if (text_type_1 = descr_narrat_1 | argum_pers_1)
if (text_type_2 = descr_narrat_2 | argum_pers_2)
if (page_length = LONG)
if (blog_words &gt;= 0.5 probability)
then good blog candidate.</Paragraph>
      <Paragraph position="12"> That is, if a web page has description_narration and argumentation_persuasion as the two predominant text types, and the page length is &gt; 500 words (LONG), and the probability value for blog words is &gt;=0.5 (blog words are terms such as web log, weblog, blog, journal, diary, posted by, comments, archive plus names of the days and months), then this web page is a good blog candidate.</Paragraph>
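The blog rule above can be transcribed into executable form as follows. The function and argument names are ours, not the paper's; the LONG threshold (&gt; 500 words) and the blog-word probability threshold (&gt;= 0.5) follow the description in the text.

```python
# Hedged transcription of the blog rule; names are ours, thresholds from the text.
PREDOMINANT = {"descriptive_narrative", "argumentative_persuasive"}

def good_blog_candidate(text_type_1, text_type_2, page_length, blog_word_prob):
    # text_type_1/2: the two predominant text types inferred for the page
    # page_length: page length in words; LONG means > 500
    # blog_word_prob: probability value for blog words found on the page
    return (text_type_1 in PREDOMINANT
            and text_type_2 in PREDOMINANT
            and page_length > 500          # LONG
            and blog_word_prob >= 0.5)

print(good_blog_candidate("descriptive_narrative",
                          "argumentative_persuasive", 1200, 0.7))  # True
```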
      <Paragraph position="13"> For other web genres the number of rules is higher, but it is worth noting that in the current implementation the rules are useful for understanding how features interact and correlate.</Paragraph>
      <Paragraph position="14"> One important thing to highlight is that each genre is computed independently for each web page. Therefore a web page can be assigned to different genres (Table 1) or to none (Table 2). Multi-label and no-label classification cannot be evaluated with standard metrics and their evaluation requires further research. In the next subsection we present the evaluation of the single label classification returned by the inferential model.</Paragraph>
    </Section>
    <Section position="3" start_page="703" end_page="704" type="sub_section">
      <SectionTitle>
4.3 Evaluation of the Results
</SectionTitle>
      <Paragraph position="0"> Single-label classification. For the seven web genres we compared the classification accuracy of the inferential model with that of standard classifiers. Two classifiers - SVM and Naive Bayes from the Weka Machine Learning Workbench (Witten and Frank, 2005) - were run on the seven web genres. The stratified cross-validated accuracy returned by these classifiers for one seed is ca. 89% for SVM and ca. 67% for Naive Bayes. The accuracy achieved by the inferential model is ca. 86%.</Paragraph>
      <Paragraph position="1"> An accuracy of 86% is a good achievement for a first implementation, especially considering that the standard Naive Bayes classifier returns an accuracy of about 67%. Although slightly lower than SVM, an accuracy of 86% looks promising because this evaluation covers only a single label. Ideally, the inferential model could be more accurate than SVM if more labels were taken into account. For example, the actual classification returned by the inferential model is shown in Table 1. The web pages in Table 1 are blogs, but they also either contain sequences of questions and answers or are organized like a how-to document, as in the snippet in Figure 1. The snippet shows an example of genre colonization, where the vocabulary and text forms of one genre (FAQs/How-to in this case) are inserted into another (cf. Beghtol, 2001). These strategies are frequent on the web and might give rise to new web genres. The model also captures situations where the genre labels available in the system are not suitable for the web page under analysis, as in the example in Table 2.</Paragraph>
      <Paragraph position="2"> This web page (shown in Figure 2) from the unannotated SPIRIT collection (see Section 4.1) does not receive any of the genre labels currently available in the system. If the pattern shown in Figure 2 keeps recurring even when more web genres are added to the system, a possible interpretation is that this pattern might develop into a stable web genre in the future. If this happens, the system will be ready to host such a novelty: in the current implementation, only a few rules need to be added. In future implementations hand-crafted rules can be replaced by other methods; for example, an interesting adaptive solution has been explored by Segal and Kephart (2000).</Paragraph>
      <Paragraph position="3"> Predictions. The precision of predictions on one web genre is used as an additional evaluation metric. The predictions on the eshop genre issued by the inferential model are compared with the predictions returned by two SVM models built with two different web page collections, the Meyer zu Eissen collection and the 7-web-genre collection (Santini, 2006). Only the predictions on eshops are evaluated, because eshop is the only web genre shared by the three models. The number of predictions is shown in Table 3.</Paragraph>
      <Paragraph position="4"> The number of retrieved web pages (Total Predictions) is higher when the inferential model is used, and so is the precision (Correct Predictions). The manual evaluation of the predictions is available online at: http://www.nltg.brighton.ac.uk/home/Marina.Santini/</Paragraph>
    </Section>
  </Section>
</Paper>