<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1107">
  <Title>Feature Selection in Categorizing Procedural Expressions</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Extraction of Procedural Expressions
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Answering Procedures with Lists
</SectionTitle>
      <Paragraph position="0"> We can easily imagine situations in which people ask procedural questions, for instance a user who wants to know the procedure for installing the RedHat Linux OS. When using a web search engine, the user could employ keywords related to the domain, such as &amp;quot;RedHat&amp;quot; and &amp;quot;install,&amp;quot; or synonyms of &amp;quot;procedure,&amp;quot; such as &amp;quot;method&amp;quot; or &amp;quot;process.&amp;quot; However, the search engine will often return results that do not include the actual procedure, for instance pages containing only lists of hyperlinks to URLs, or simple lists of alternatives that have no intended order.</Paragraph>
      <Paragraph position="1"> This paper addresses this issue; the goal of the proposed solution is to return the actual procedure itself.</Paragraph>
      <Paragraph position="2"> In the initial step of this study, we focused on the case in which a continuous answer-candidate passage exists in the original text, and further restricted the documentation form to lists. A list can be expected to contain important information, because it is a summarization made by a human. Lists also have certain benefits for computer processing: a) a large number of lists appear in FAQs and home pages on the web, b) clues such as titles and leads appear before and after lists, c) extraction is relatively easy using HTML list tags, e.g. &lt;OL&gt; and &lt;UL&gt;.</Paragraph>
      <Paragraph position="3"> In this study, a binary categorization was conducted, which divided a set of lists into two classes of procedures and non-procedures. The purpose is to reveal an effective set of features to extract a list explaining the procedure by examining the results of the categorization.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Collection of Lists from Web Pages
</SectionTitle>
      <Paragraph position="0"> To study the features of lists contained in web pages, sets of lists were built according to the following steps (see Table 1): Step 1 Enter tejun (procedure) and houhou (method) as keywords to Google(Brin and Page, 1998), and obtain a list of URLs that serve as the seeds of the collection in the next step (Gathered).</Paragraph>
      <Paragraph position="1"> Step 2 Recursively search from the top page to the next lower page in the hyperlink structure and gather the HTML pages (Retrieved).</Paragraph>
      <Paragraph position="2"> Step 3 Extract the passages from the pages in Step 2 that are tagged with &lt;OL&gt; or &lt;UL&gt;. If a list has multiple layers with nested tags, each layer is decomposed as an independent list (Valid Pages).</Paragraph>
      <Paragraph position="3"> Step 4 Collect the lists that include at least two items. Each document is created such that one article equals one list.</Paragraph>
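Steps 3 and 4 of the collection procedure can be sketched as follows. This is our illustration, not the authors' implementation: it extracts &lt;OL&gt;/&lt;UL&gt; passages, decomposes each nested layer into an independent list, and keeps only lists with at least two items.

```python
# Sketch of Steps 3-4: extract <OL>/<UL> passages from gathered HTML pages,
# decompose nested lists into independent lists (one per layer), and keep
# only lists with at least two items. Helper names are illustrative.
from html.parser import HTMLParser

class ListExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []      # one item-list per currently open <ol>/<ul> layer
        self.lists = []      # finished lists; each nested layer kept separately
        self.in_item = False

    def handle_starttag(self, tag, attrs):
        if tag in ("ol", "ul"):
            self.stack.append([])
        elif tag == "li" and self.stack:
            self.in_item = True
            self.stack[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("ol", "ul") and self.stack:
            items = [i.strip() for i in self.stack.pop() if i.strip()]
            if len(items) >= 2:          # Step 4: at least two items
                self.lists.append(items)
        elif tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item and self.stack:
            self.stack[-1][-1] += data

def extract_lists(html):
    parser = ListExtractor()
    parser.feed(html)
    return parser.lists
```

On a nested structure such as `<ul><li>a</li><li>b</li><ol><li>c</li><li>d</li></ol></ul>`, both the inner and outer layers are emitted as independent lists, matching the decomposition described in Step 3.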
      <Paragraph position="4"> Subsequently, the document set was divided into procedure-type and non-procedure-type subsets by human judgment. For this categorization, a list explaining a procedure was defined as follows: a) the percentage of items describing actions or operations in the list is at least 50%; b) the contexts before and after the list are ignored in the judgment. Here, an item means an article prefixed by a number or a mark such as a bullet; it generally consists of multiple sentences. Two people categorized the same lists, and a kappa test(Siegel and Castellan, 1988) was applied to the result. We obtained a kappa value of 0.87, i.e., a near-perfect match, in the computer domain and 0.66, i.e., substantial agreement, in the other domains. Next, the documents were categorized by domain, by referring to the page containing each list. Table 2 lists the results. The values in parentheses indicate the numbers of lists before decomposition of nested tags. Documents of the Computer domain were dominant; the other domains each contained only a few documents and were lumped together into a document set named &amp;quot;Others.&amp;quot; This set consists of documents about education, medical treatment, weddings, etc. Instructions for software usage or operations on the home pages of web services were also assigned to the Computer domain.</Paragraph>
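The inter-annotator agreement check above uses the standard Cohen's kappa statistic; a minimal sketch (our implementation, not the paper's) over two annotators' binary procedure / non-procedure labels:

```python
# Cohen's kappa for two annotators' binary (procedure / non-procedure) labels.
# kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
# p_e is the agreement expected by chance from each annotator's label rates.
def cohen_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_e = 0.0
    for c in set(labels_a) | set(labels_b):
        p_e += (labels_a.count(c) / n) * (labels_b.count(c) / n)
    return (p_o - p_e) / (1 - p_e)
```

Perfect agreement yields kappa = 1.0, while labels that agree only at chance level yield kappa = 0; the reported 0.87 and 0.66 fall in the conventional "near-perfect" and "substantial" bands.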
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Procedural Expressions in the Lists
</SectionTitle>
      <Paragraph position="0"> From observations of the lists categorized by humans, the following results were obtained: a) The first sentence in an item often describes an action or an operation. b) There are two types of items, distinguished by how the first sentence terminates: nominalized and non-nominalized. c) In the nominalized type, verbal nouns are very often used at the end of the sentence. d) Arguments marked by ga (a particle marking the nominative) or ha (a particle marking the topic) and negatives are rarely used, while arguments marked by wo (a particle marking the object) appear frequently. e) At the ends of sentences and immediately before punctuation marks, the same expressions appear repeatedly. Verbal nouns are nominal expressions that are verbified by being followed by the light verb suru in Japanese. If the features above are domain-independent characteristics, lists in a minor domain can be categorized using features learned from lists in another, major domain. The function words and inflections appearing at the ends of sentences and before punctuation are known as markers, and they specify the style of description in Japanese. Thus, a list that explains a procedure can be expected to have inherent styles of description. These features are very similar to those used in authorship identification tasks(Mingzhe, 2002; Tsuboi and Matsumoto, 2002), which use word n-grams, the distribution of parts of speech, etc. In recent research on web documents, frequent word sequences have also been examined. Our approach is based on these features.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Features
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Baseline
</SectionTitle>
      <Paragraph position="0"> In addition to features based on the presence of specific words, we examined sequences of words for our task. Tsuboi et al.(2002) used a sequential pattern mining method, PrefixSpan, and a machine learning algorithm, the Support Vector Machine, in addition to morphological N-grams. They proposed making use of frequent sequential patterns of words in sentences. This approach is expected to exploit the relationships between distant words explicitly in the categorization. Whether a list is procedural is reflected in differences such as the omission of certain particles and the frequency of a particle's usage, so such sequential patterns are anticipated to improve the accuracy of categorization. The words in a sentence are passed to PrefixSpan after preprocessing, as follows: Step 1 Using ChaSen(Matsumoto et al., 1999), a Japanese POS (part of speech) tagger, we put the document tags and the POS tags into the list. Table 3 lists the tag set that was used.</Paragraph>
      <Paragraph position="1"> These tags are only used for distinguishing objects. The string of tags was ignored in sequential pattern mining.</Paragraph>
      <Paragraph position="2"> Step 2 After the first n sentences are extracted from each list item, a sequence is made for each sentence. Sequential pattern mining then treats each morpheme as an item (literal) in a sequence. Using these features, we conducted categorization with an SVM, one of the large margin classifiers, which shows high generalization performance even in high-dimensional spaces(Vapnik, 1995).</Paragraph>
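The preprocessing in Steps 1-2 can be sketched as follows. A whitespace tokenizer stands in for the ChaSen morphological analyzer used in the paper (ChaSen is assumed unavailable here), and the sentence delimiter is a plain period rather than the Japanese full stop:

```python
# Sketch of Steps 1-2: take the first n sentences of each list item and turn
# each sentence into a sequence of morphemes (one sequence per sentence).
# tokenize() is a placeholder for the ChaSen POS tagger used in the paper.
def tokenize(sentence):
    return sentence.split()          # placeholder for ChaSen morphemes

def item_to_sequences(item_text, n=1, sentence_end="."):
    sentences = [s.strip() for s in item_text.split(sentence_end) if s.strip()]
    return [tokenize(s) for s in sentences[:n]]
```

With n=1 this matches the experimental setting reported later, where only the first sentence of each list item is used.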
      <Paragraph position="3"> SVM is beneficial for our task because it is unknown which features are effective, and we must use many features in the categorization to investigate their effectiveness, so the dimension of the feature space is relatively high.</Paragraph>
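The overall classification setup, binary presence features over word N-grams fed to a linear-kernel SVM, can be sketched as below. The paper uses TinySVM; scikit-learn's LinearSVC and CountVectorizer are substituted here purely for illustration, and the documents and labels are toy data:

```python
# Binary feature vectors (presence/absence of word N-grams) with a linear
# SVM, mirroring the paper's d = 1 polynomial-kernel setup. scikit-learn
# stands in for TinySVM; the four documents below are invented toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ["click the install button", "open the settings menu",
        "a list of useful links", "other pages about weddings"]
labels = [1, 1, 0, 0]  # 1 = procedure, 0 = non-procedure

# binary=True yields presence/absence features rather than raw counts
vec = CountVectorizer(ngram_range=(1, 2), binary=True)
X = vec.fit_transform(docs)

clf = LinearSVC()
clf.fit(X, labels)
pred = clf.predict(vec.transform(["press the start button"]))
```

The `ngram_range` parameter plays the role of the N=1, 1+2, 1+2+3 feature growth examined in the experiments.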
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Sequential Pattern Mining
</SectionTitle>
      <Paragraph position="0"> Sequential pattern mining consists of finding all frequent subsequences, called sequential patterns, in a database of sequences of literals. Apriori(Agrawal and Srikant, 1994) and PrefixSpan(Pei et al., 2001) are examples of sequential pattern mining methods. The Apriori algorithm is one of the most widely used, but there is a great deal of room for improvement in its calculation cost. The PrefixSpan algorithm succeeds in reducing the cost of calculation by performing an operation called projection, which confines the range of the search to sets of frequent subsequences. Details of the PrefixSpan algorithm are provided in an-</Paragraph>
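The prefix-projection idea can be illustrated with a minimal sketch (our simplification, not the reference implementation by T. Kudo used in the experiments): each frequent pattern is grown by projecting the database onto the suffixes following its last item and recursing on the locally frequent items.

```python
# Minimal PrefixSpan-style frequent sequential pattern mining (illustrative).
# A pattern is frequent if it occurs as a (possibly gapped) subsequence of at
# least min_support sequences. Each frequent pattern is extended by projecting
# the database onto it and recursing on the locally frequent items.
def prefixspan(db, min_support, prefix=None, out=None):
    prefix = prefix or []
    out = out if out is not None else []
    # count, per sequence, which items occur at all in the projected database
    counts = {}
    for seq in db:
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    for item, cnt in sorted(counts.items()):
        if cnt < min_support:
            continue
        pattern = prefix + [item]
        out.append((pattern, cnt))
        # projection: keep the suffix after the first occurrence of the item
        projected = [seq[seq.index(item) + 1:] for seq in db if item in seq]
        prefixspan(projected, min_support, pattern, out)
    return out
```

For example, `prefixspan([["a","b","c"], ["a","c"], ["b","c"]], 2)` yields the patterns ["a"], ["a","c"], ["b"], ["b","c"], and ["c"] with their supports, while the infrequent ["a","b"] is pruned without ever being enumerated, which is where the cost saving over Apriori-style candidate generation comes from.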
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Experiments and Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Experimental Settings
</SectionTitle>
      <Paragraph position="0"> In the first experiment, to determine categorization capability within a domain, we employed the set of lists in the Computer domain and conducted a cross-validation procedure. The document set was divided into five subsets of nearly equal size; five different SVMs were trained, each on four of the subsets, with the remaining subset classified for testing. In the second experiment, to determine categorization capability in an open domain, we employed the set of lists from the Others domain together with the document set from the first experiment. The lists from the Others domain were used for testing and those from the Computer domain for training, and then the training and testing roles were switched. In both experiments, recall, precision, and, occasionally, the F-measure were calculated to evaluate categorization performance. The F-measure is calculated from precision (P) and recall (R) by formula 1.</Paragraph>
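The evaluation measures can be sketched as follows. The body of formula 1 is not reproduced in this extraction; the standard harmonic mean of precision and recall is assumed here:

```python
# Precision, recall, and F-measure for binary list categorization.
# Formula 1 is assumed to be the standard harmonic mean:
#   F = 2 * P * R / (P + R)
def evaluate(gold, pred):
    tp = sum(g == p == 1 for g, p in zip(gold, pred))
    precision = tp / sum(pred) if sum(pred) else 0.0
    recall = tp / sum(gold) if sum(gold) else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```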
      <Paragraph position="2"> The lists in the experiment were gathered from the passages marked by list tags in the pages. To focus on the feasibility of features within the lists for the categorization task, the contexts before and after each list were not targeted. Table 4 lists the four groups, divided by procedure and domain into columns, with the numbers of lists, items, sentences, and characters in each group in the respective rows. The two values in each cell of Table 4 are the mean on the left and the deviation on the right. We employed TinySVM1 and an implementation of PrefixSpan2 by T. Kudo. To observe the direct effect of the features, the feature vectors were binary, constructed from word N-grams and patterns; the polynomial kernel degree d for the SVM was equal to one. Support values for PrefixSpan were determined in an ad hoc manner to produce a sufficient number of patterns under our experimental conditions.</Paragraph>
      <Paragraph position="3"> To investigate effective features for list categorization, the feature sets of the lists were divided into five groups (see Table 5), with consideration given to the difference between content words and function words according to our observations (described in Section 3.3). The values in Table 5 indicate the numbers of distinct words in each domain data set.</Paragraph>
      <Paragraph position="4"> The notation of tags above, such as 'snp', follows the categories in Table 3. F2 and F3 consist of content words, and F4 and F5 consist of function words. F6 is a feature group that adds verbal nouns, based on our observations (described in Section 3.3).</Paragraph>
      <Paragraph position="5"> To observe the performance of the SVM, we compared the results of categorization under conditions F3 and F5 with a decision tree. For decision tree learning, j48.j48, an implementation of the C4.5 algorithm in Weka3, was chosen.</Paragraph>
      <Paragraph position="6"> In these experiments, only the first sentence of each list item was used, because our preliminary experiments obtained the best results when only the first sentence was used in categorization. Up to a thousand patterns from the top of the frequency ranking were selected and used in conditions F1 to F6. For pattern selection, we examined a method based on frequency. In addition, mutual information filtering was conducted under some conditions for comparison with performance based only on pattern frequency. Ranking with mutual information filtering, we selected 100, 300, and 500 patterns from the 1000 patterns. Furthermore, the N-gram features were varied to N=1, 1+2, and 1+2+3 by incrementing N and adding the new N-grams to the features in the experiments.</Paragraph>
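The mutual information filtering step can be sketched as follows; the exact estimator used in the paper is not specified, so a standard plug-in estimate over binary pattern presence and class labels is assumed:

```python
# Mutual information between a binary feature (pattern present / absent) and
# the binary class label, used to rank and keep the top-k patterns. This is
# our sketch of the filtering step; the paper does not give its estimator.
from math import log2

def mutual_information(feature, labels):
    n = len(labels)
    mi = 0.0
    for f in (0, 1):
        for c in (0, 1):
            joint = sum(x == f and y == c for x, y in zip(feature, labels)) / n
            pf = sum(x == f for x in feature) / n
            pc = sum(y == c for y in labels) / n
            if joint > 0:
                mi += joint * log2(joint / (pf * pc))
    return mi

def select_patterns(pattern_vectors, labels, k):
    # pattern_vectors: {pattern: [0/1 presence per document]}
    ranked = sorted(pattern_vectors,
                    key=lambda p: mutual_information(pattern_vectors[p], labels),
                    reverse=True)
    return ranked[:k]
```

A pattern whose presence tracks the class label perfectly scores the full label entropy, while a pattern independent of the label scores zero, so ranking by this value discards uninformative frequent patterns.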
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Experimental Results
</SectionTitle>
      <Paragraph position="0"> Table 6 lists the results of a 5-fold cross-validation evaluation on the Computer domain lists. N-grams and patterns were gradually added to the input feature vectors, in the order N=1, 2, 3, and then patterns. The feature group primarily constructed of content words slightly overtook the function word group, except in recall, when trigrams and patterns were added. In the comparison of F2 and F4, the differences in performance are not as salient as the differences in numbers of features. Incorporating verbal nouns into the categorization slightly improved the results. However, the patterns did not work in this task. The same experiment, switching the roles of the two list sets, the Computer and the Others domains, was then performed (see Tables 7 and 8).</Paragraph>
      <Paragraph position="1"> As N-grams were added, recall became worse for the group of content words. In contrast, the group of function words showed better performance in recall, and its overall balance of precision and recall was good. Calculating the F-measure with formula 1, the function group overtook the content group in most evaluations of the open domain, and this deviation is more salient in the Others domain. In the results of both the Computer domain and the Others domain, the model trained with function words performed better than the model trained with content words. Function words in Japanese characterize the descriptive style of a text, so this result shows the possibility of acquiring various procedural expressions. From another perspective, when trigrams were added as features, recall decreased; adding the patterns, however, improved performance. It is assumed that there are dependencies between words at distances greater than three words, which is beneficial in the categorization. Table 9 compares the results of the SVM and the j48.j48 decision tree. Table 10 lists the effectiveness of mutual information filtering. In both tables, the values are F-measures calculated with formula 1. According to Table 9, the SVM overtook j48.j48 overall. j48.j48 scarcely changes with an increase in the number of features, whereas the SVM gradually improves. For mutual information filtering, the SVM marked its best results with no filter in the Computer domain. However, when learning from the Others domain, mutual information filtering appears effective.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Discussion
</SectionTitle>
      <Paragraph position="0"> The comparison of the SVM and the decision tree shows the high degree of generalization of the SVM in a high-dimensional feature space. From the results of mutual information filtering, we recognize that such simple pre-cleaning methods are not notably effective when learning from documents of the same domain. However, these simple methods work well in our task when learning from documents drawn from a variety of domains.</Paragraph>
      <Paragraph position="1"> Patterns performed well with mutual information filtering on a data set including different domains and genres. It appears that N-grams and credible patterns are effective in acquiring the common characteristics of procedural expressions across different domains. The patterns may thus be effective for moderately narrowing the range of answer candidates in the early stages of QA and web information retrieval. In the Computer domain, categorization performed well overall in every POS group. This is because the domain includes many instruction documents, for instance on software installation, computer settings, online shopping, etc., and these usually use similar and restricted vocabularies. Conversely, the uniformity of procedural expressions in the Computer domain causes poorer performance when learning from documents of the Computer domain than when learning from the Others domain.</Paragraph>
      <Paragraph position="2"> We also often found in these expressions that, for a particular class of content word, special characters were adjusted (see Figure 1). This type of pattern occasionally contributed to correct classification in our experiment. The movement of the performance of content and function words with the addition of N-grams is notable. Making use of the difference in their movement more directly is likely to be useful in the categorization of procedural text.</Paragraph>
      <Paragraph position="3"> By error analysis, the following patterns were obtained: those reflecting common expressions, including multiple appearances of verbs with the case-marking particle wo. These worked well in cases where procedural statements occupied only part of the items of a list. Pattern mismatches were observed where a list contained few characters or POS tagging failed.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Conclusion
</SectionTitle>
    <Paragraph position="0"> The present work has demonstrated effective features for categorizing lists in web pages by whether they explain a procedure. We showed that categorization to extract texts including procedural expressions differs from traditional text categorization tasks with respect to the features and behaviors related to co-occurrences of words. We also showed the possibility of filtering to extract lists including procedural expressions in different domains by exploiting features that primarily consist of function words, together with patterns selected by mutual information filtering. Lists with procedural expressions in the Computer domain can be extracted with higher accuracy.</Paragraph>
  </Section>
</Paper>