<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2059">
  <Title>Automatic Construction of Polarity-tagged Corpus from HTML Documents</Title>
  <Section position="5" start_page="452" end_page="453" type="metho">
    <SectionTitle>
3 The Idea
</SectionTitle>
    <Paragraph position="0"> This Section briefly explains our basic idea, and the detail of our corpus construction method is represented in the next Section.</Paragraph>
    <Paragraph position="1"> Our idea is to use certain layout structures and linguistic pattern in order to extract opinion sentences from HTML documents. More specifically, we used two kinds of layout structures: the itemization and the table. In what follows, we explain examples where opinion sentences can be extracted by using the itemization, table and linguistic pattern.</Paragraph>
    <Section position="1" start_page="452" end_page="453" type="sub_section">
      <SectionTitle>
3.1 Itemization
</SectionTitle>
      <Paragraph position="0"> The first idea is to extract opinion sentences from the itemization (Figure 1). In this Figure, opinions about a music player are itemized and these itemizations have headers such as 'pros' and 'cons'.</Paragraph>
      <Paragraph position="1"> By using the headers, we can recognize that opinion sentences are described in these itemizations.</Paragraph>
      <Paragraph position="2">  ence of opinion sentences are called indicators.</Paragraph>
      <Paragraph position="3"> Indicators for positive sentences are called positive indicators. 'Pros' is an example of positive indicator. Similarly, indicators for negative sentences are called negative indicators.</Paragraph>
    </Section>
    <Section position="2" start_page="453" end_page="453" type="sub_section">
      <SectionTitle>
3.2 Table
</SectionTitle>
      <Paragraph position="0"> The second idea is to use the table structure (Figure 2). In this Figure, a car review is summarized in the table.</Paragraph>
      <Paragraph position="2"> Plus This is a four door car, but it's so cool.</Paragraph>
      <Paragraph position="3"> Minus The seat is ragged and the light  We can predict that there are opinion sentences in this table, because the left column acts as a header and there are indicators (plus and minus) in that column.</Paragraph>
    </Section>
    <Section position="3" start_page="453" end_page="453" type="sub_section">
      <SectionTitle>
3.3 Linguistic pattern
</SectionTitle>
      <Paragraph position="0"> The third idea is based on linguistic pattern. Because we treat Japanese, the pattern that is discussed in this paper depends on Japanese grammar although we think there are similar patterns in other languages including English.</Paragraph>
      <Paragraph position="1"> Consider the Japanese sentences attached with English translations (Figure 3). Japanese sentences are written in italics and '-' denotes that the word is followed by postpositional particles.</Paragraph>
      <Paragraph position="2"> For example, 'software-no' means that 'software' is followed by postpositional particle 'no'. Translations of each word and the entire sentence are represented below the original Japanese sentence.</Paragraph>
      <Paragraph position="3"> '-POST' means postpositional particle.</Paragraph>
      <Paragraph position="4"> In the examples, we focused on the singly underlined phrases. Roughly speaking, they correspond to 'the advantage/weakness is to' in English. In these phrases, indicators ('riten (advantage)' and 'ketten (weakness)') are followed by postpositional particle '-ha', which is topic marker. And hence, we can recognize that something good (or bad) is the topic of the sentence.</Paragraph>
      <Paragraph position="5"> Based on this observation, we crafted a linguistic pattern that can detect the singly underlined phrases. And then, we extracted doubly underlined phrases as opinions. They correspond to 'run quickly' and 'take too much time'. The detail of this process is discussed in the next Section.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="453" end_page="453" type="metho">
    <SectionTitle>
4 Automatic Corpus Construction
</SectionTitle>
    <Paragraph position="0"> This Section represents the detail of the corpus construction procedure.</Paragraph>
    <Paragraph position="1"> As shown in the previous Section, our idea utilizes the indicator, and it is important to recognize indicators in HTML documents. To do this, we manually crafted lexicon, in which positive and negative indicators are listed. This lexicon consists of 303 positive and 433 negative indicators. Using this lexicon, the polarity-tagged corpus is constructed from HTML documents. The method consists of the following three steps:</Paragraph>
  </Section>
  <Section position="7" start_page="453" end_page="453" type="metho">
    <SectionTitle>
1. Preprocessing
</SectionTitle>
    <Paragraph position="0"> Before extracting opinion sentences, HTML documents are preprocessed. This process involves separating texts form HTML tags, recognizing sentence boundary, and complementing omitted HTML tags etc.</Paragraph>
    <Paragraph position="1"> 2. Opinion sentence extraction Opinion sentences are extracted from HTML documents by using the itemization, table and linguistic pattern.</Paragraph>
  </Section>
  <Section position="8" start_page="453" end_page="455" type="metho">
    <SectionTitle>
3. Filtering
</SectionTitle>
    <Paragraph position="0"> Since HTML documents are noisy, some of the extracted opinion sentences are not appropriate. They are removed in this step.</Paragraph>
    <Paragraph position="1"> For the preprocessing, we implemented simple rule-based system. We cannot explain its detail for lack of space. In the remainder of this Section, we describe three extraction methods respectively, and then examine filtering technique.</Paragraph>
    <Section position="1" start_page="453" end_page="454" type="sub_section">
      <SectionTitle>
4.1 Extraction based on itemization
</SectionTitle>
      <Paragraph position="0"> The first method utilizes the itemization. In order to extract opinion sentences, first of all, we have to find such itemization as illustrated in Figure 1.</Paragraph>
      <Paragraph position="1"> They are detected by using indicator lexicon and HTML tags such as BOh1BQ and BOulBQ etc.</Paragraph>
      <Paragraph position="2"> After finding the itemizations, the sentences in the items are extracted as opinion sentences. Their polarity labels are assigned according to whether the header is positive or negative indicator. From the itemization in Figure 1, three positive sentences and three negative ones are extracted.</Paragraph>
      <Paragraph position="3"> The problem here is how to treat such item that has more than one sentences (Figure 4). In this itemization, there are two sentences in each of the  (1) kono software-no riten-ha hayaku ugoku koto  this software-POST advantage-POST quickly run to The advantage of this software is to run quickly. (2) ketten-ha jikan-ga kakarisugiru koto-desu weakness-POST time-POST take too much to-POST The weakness is to take too much time.</Paragraph>
      <Paragraph position="4">  third and fourth item. It is hard to precisely predict the polarity of each sentence in such items, because such item sometimes includes both positive and negative sentences. For example, in the third item of the Figure, there are two sentences. One ('Has high pixel...') is positive and the other ('I was not satisfied...') is negative.</Paragraph>
      <Paragraph position="5"> To get around this problem, we did not use such items. From the itemization in Figure 4, only two positive sentences are extracted ('the color is really good' and 'this camera makes me happy while taking pictures').</Paragraph>
      <Paragraph position="6"> Pros: AF The color is really good.</Paragraph>
      <Paragraph position="7"> AF This camera makes me happy while taking pictures. null AF Has high pixel resolution with 4 million pixels. I was not satisfied with 2 million.</Paragraph>
      <Paragraph position="8"> AF EVF is easy to see. But, compared with SLR, it's hard to see.</Paragraph>
    </Section>
    <Section position="2" start_page="454" end_page="454" type="sub_section">
      <SectionTitle>
4.2 Extraction based on table
</SectionTitle>
      <Paragraph position="0"> The second method extracts opinion sentences from the table. Since the combination of BOtableBQ and other tags can represent various kinds of tables, it is difficult to craft precise rules that can deal with any table.</Paragraph>
      <Paragraph position="1"> Therefore, we consider only two types of tables in which opinion sentences are described (Figure 5). Type A is a table in which the leftmost column acts as a header, and there are indicators in that column. Similarly, type B is a table in which the first row acts as a header. The table illustrated in Figure 2 is categorized into type A.</Paragraph>
      <Paragraph position="2"> The type of the table is decided as follows. The table is categorized into type A if there are both</Paragraph>
      <Paragraph position="4"> positive and negative indicators in the leftmost column. The table is categorized into type B if it is not type A and there are both positive and negative indicators in the first row.</Paragraph>
      <Paragraph position="5"> After the type of the table is decided, we can extract opinion sentences from the cells that correspond to B7 and A0 in the Figure 5. It is obvious which label (positive or negative) should be assigned to the extracted sentence.</Paragraph>
      <Paragraph position="6"> We did not use such cell that contains more than one sentences, because it is difficult to reliably predict the polarity of each sentence. This is similar to the extraction from the itemization.</Paragraph>
    </Section>
    <Section position="3" start_page="454" end_page="455" type="sub_section">
      <SectionTitle>
4.3 Extraction based on linguistic pattern
</SectionTitle>
      <Paragraph position="0"> The third method uses linguistic pattern. The characteristic of this pattern is that it takes dependency structure into consideration.</Paragraph>
      <Paragraph position="1"> First of all, we explain Japanese dependency structure. Figure 6 depicts the dependency representations of the sentences in the Figure 3.</Paragraph>
      <Paragraph position="2"> Japanese sentence is represented by a set of dependencies between phrasal units called bunsetsuphrases. Broadly speaking, bunsetsu-phrase is an unit similar to baseNP in English. In the Figure, square brackets enclose bunsetsu-phrase and arrows show modifier AX head dependencies between bunsetsu-phrases.</Paragraph>
      <Paragraph position="3"> In order to extract opinion sentences from these dependency representations, we crafted the following dependency pattern.</Paragraph>
      <Paragraph position="5"> This pattern matches the singly underlined bunsetsu-phrases in the Figure 6. In the modifier part of this pattern, the indicator is followed by postpositional particle 'ha', which is topic marker  . In the head part, 'koto (to)' is followed by arbitrary numbers of postpositional particles. If we find the dependency that matches this pattern, a phrase between the two bunsetsu-phrases is extracted as opinion sentence. In the Figure 6, the doubly underlined phrases are extracted. This heuristics is based on Japanese word order constraint. null</Paragraph>
    </Section>
    <Section position="4" start_page="455" end_page="455" type="sub_section">
      <SectionTitle>
4.4 Filtering
</SectionTitle>
      <Paragraph position="0"> Sentences extracted by the above methods sometimes include noise text. Such texts have to be filtered out. There are two cases that need filtering process.</Paragraph>
      <Paragraph position="1"> First, some of the extracted sentences do not express opinions. Instead, they represent objects to which the writer's opinion is directed (Table 7). From this table, 'the overall shape' and 'the shape of the taillight' are wrongly extracted as opinion sentences. Since most of the objects are noun phrases, we removed such sentences that have the noun as the head.</Paragraph>
      <Paragraph position="2">  Secondly, we have to treat duplicate opinion sentences because there are mirror sites in the  To be exact, some of the indicators such as 'strong point' consists of more than one bunsetsu-phrase, and the modifier part sometimes consists of more than one bunsetsu-phrase. HTML documents. When there are more than one sentences that are exactly the same, one of them is held and the others are removed.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="455" end_page="457" type="metho">
    <SectionTitle>
5 Experimental Results and Discussion
</SectionTitle>
    <Paragraph position="0"> This Section examines the results of corpus construction experiment. To analyze Japanese sentence we used Juman and KNP  .</Paragraph>
    <Section position="1" start_page="455" end_page="455" type="sub_section">
      <SectionTitle>
5.1 Corpus Construction
</SectionTitle>
      <Paragraph position="0"> About 120 millions HTML documents were processed, and 126,610 opinion sentences were extracted. Before the filtering, there were 224,002 sentences in our corpus. Table2 shows the statistics of our corpus. The first column represents the three extraction methods. The second and third column shows the number of positive and negative sentences by extracted each method. Some examples are illustrated in Table 3.</Paragraph>
      <Paragraph position="1">  The result revealed that more than half of the sentences are extracted by linguistic pattern (see the fourth row). Our method turned out to be effective even in the case where only plain texts are available.</Paragraph>
    </Section>
    <Section position="2" start_page="455" end_page="456" type="sub_section">
      <SectionTitle>
5.2 Quality assessment
</SectionTitle>
      <Paragraph position="0"> In order to check the quality of our corpus, 500 sentences were randomly picked up and two judges manually assessed whether appropriate labels are assigned to the sentences.</Paragraph>
      <Paragraph position="1"> The evaluation procedure is the followings.</Paragraph>
      <Paragraph position="2">  kokoro-ni nokoru ongaku-ga nai impressive music-POST there is no There is no impressive music.</Paragraph>
      <Paragraph position="3"> AF Each of the 500 sentences are shown to the two judges. Throughout this evaluation, We did not present the label automatically tagged by our method. Similarly, we did not show HTML documents from which the opinion sentences are extracted.</Paragraph>
      <Paragraph position="4"> AF The two judges individually categorized each sentence into three groups: positive, negative and neutral/ambiguous. The sentence is classified into the third group, if it does not express opinion (neutral) or if its polarity depends on the context (ambiguous). Thus, two goldstandard sets were created.</Paragraph>
      <Paragraph position="5"> AF The precision is estimated using the goldstandard. In this evaluation, the precision refers to the ratio of sentences where correct labels are assigned by our method. Since we have two goldstandard sets, we can report two different precision values. A sentence that is categorized into neutral/ambiguous by the judge is interpreted as being assigned incorrect label by our method, since our corpus does not have a label that corresponds to neutral/ambiguous. null We investigated the two goldstandard sets, and found that the judges agree with each other in 467 out of 500 sentences (93.4%). The Kappa value was 0.901. From this result, we can say that the goldstandard was reliably created by the judges. Then, we estimated the precision. The precision was 459/500 (91.5%) when one goldstandard was used, and 460/500 (92%) when the other was used. Since these values are nearly equal to the agreement between humans (467/500), we can conclude that our method successfully constructed polarity-tagged corpus.</Paragraph>
      <Paragraph position="6"> After the evaluation, we analyzed errors and found that most of them were caused by the lack of context. The following is a typical example. You see, there is much information.</Paragraph>
      <Paragraph position="7"> In our corpus this sentence is categorized into positive one. The below is a part of the original document from which this sentence was extracted. I recommend this guide book. The Pros.</Paragraph>
      <Paragraph position="8"> of this book is that, you see, there is much information.</Paragraph>
      <Paragraph position="9"> On the other hand, both of the two judges categorized the above sentence into neutral/ambiguous, probably because they can easily assume context where much information is not desirable.</Paragraph>
      <Paragraph position="10"> You see, there is much information. But, it is not at all arranged, and makes me confused.</Paragraph>
      <Paragraph position="11"> In order to precisely treat this kind of sentences, we think discourse analysis is inevitable.</Paragraph>
    </Section>
    <Section position="3" start_page="456" end_page="457" type="sub_section">
      <SectionTitle>
5.3 Application to opinion classification
</SectionTitle>
      <Paragraph position="0"> Next, we applied our corpus to opinion sentence classification. This is a task of classifying sentences into positive and negative. We trained a classifier on our corpus and investigated the result.</Paragraph>
      <Paragraph position="1"> Classifier and data sets As a classifier, we chose Naive Bayes with bag-of-words features, because it is one of the most popular one in this task. Negation was processed in a similar way as previous works (Pang et al., 2002).</Paragraph>
      <Paragraph position="2"> To validate the accuracy of the classifier, three data sets were created from review pages in which the review is associated with meta-data. To build data sets tagged at sentence level, we used such reviews that contain only one sentence. Table 4 represents the domains and the number of sentences in each data set. Note that we confirmed there is no duplicate between our corpus and the these data sets.</Paragraph>
      <Paragraph position="3"> The result and discussion Naive Bayes classifier was trained on our corpus and tested on the three data sets (Table 5). In the Table, the second column represents the accuracy of the classification in each data set. The third and fourth  columns represent precision and recall of positive sentences. The remaining two columns show those of negative sentences. Naive Bayes achieved over 80% accuracy in all the three domains.</Paragraph>
      <Paragraph position="4"> In order to compare our corpus with a small domain specific corpus, we estimated accuracy in each data set using 10 fold crossvalidation (Table 6). In two domains, the result of our corpus outperformed that of the crossvalidation. In the other domain, our corpus is slightly better than the  One finding is that our corpus achieved good accuracy, although it includes various domains and is not accustomed to the target domain. Turney also reported good result without domain customization (Turney, 2002). We think these results can be further improved by domain adaptation technique, and it is one future work.</Paragraph>
      <Paragraph position="5"> Furthermore, we examined the variance of the accuracy between different domains. We trained Naive Bayes on each data set and investigate the accuracy in the other data sets (Table 7). For example, when the classifier is trained on Computer and tested on Restaurant, the accuracy was 0.757.</Paragraph>
      <Paragraph position="6"> This result revealed that the accuracy is quite poor when the training and test sets are in different domains. On the other hand, when Naive Bayes is trained on our corpus, there are little variance in different domains (Table 5). This experiment indicates that our corpus is relatively robust against the change of the domain compared with small domain specific corpus. We think this is because our corpus is large and balanced. Since we cannot always get domain specific corpus in real application, this is the strength of our corpus.</Paragraph>
    </Section>
  </Section>
  <Section position="10" start_page="457" end_page="458" type="metho">
    <SectionTitle>
6 Related Works
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="457" end_page="458" type="sub_section">
      <SectionTitle>
6.1 Learning the polarity of words
</SectionTitle>
      <Paragraph position="0"> There are some works that discuss learning the polarity of words instead of sentences.</Paragraph>
      <Paragraph position="1"> Hatzivassiloglou and McKeown proposed a method of learning the polarity of adjectives from corpus (Hatzivassiloglou and McKeown, 1997).</Paragraph>
      <Paragraph position="2"> They hypothesized that if two adjectives are connected with conjunctions such as 'and/but', they have the same/opposite polarity. Based on this hypothesis, their method predicts the polarity of adjectives by using a small set of adjectives labeled with the polarity.</Paragraph>
      <Paragraph position="3"> Other works rely on linguistic resources such as WordNet (Kamps et al., 2004; Hu and Liu, 2004; Esuli and Sebastiani, 2005; Takamura et al., 2005). For example, Kamps et al. used a graph where nodes correspond to words in the Word-Net, and edges connect synonymous words in the WordNet. The polarity of an adjective is defined by its shortest paths from the node corresponding to 'good' and 'bad'.</Paragraph>
      <Paragraph position="4"> Although those researches are closely related to our work, there is a striking difference. In those researches, the target is limited to the polarity of words and none of them discussed sentences. In addition, most of the works rely on external resources such as the WordNet, and cannot treat words that are not in the resources.</Paragraph>
    </Section>
    <Section position="2" start_page="458" end_page="458" type="sub_section">
      <SectionTitle>
6.2 Learning subjective phrases
</SectionTitle>
      <Paragraph position="0"> Some researchers examined the acquisition of subjective phrases. The subjective phrase is more general concept than opinion and includes both positive and negative expressions.</Paragraph>
      <Paragraph position="1"> Wiebe learned subjective adjectives from a set of seed adjectives. The idea is to automatically identify the synonyms of the seed and to add them to the seed adjectives (Wiebe, 2000). Riloff et al. proposed a bootstrapping approach for learning subjective nouns (Riloff et al., 2003). Their method learns subjective nouns and extraction patterns in turn. First, given seed subjective nouns, the method learns patterns that can extract subjective nouns from corpus. And then, the patterns extract new subjective nouns from corpus, and they are added to the seed nouns. Although this work aims at learning only nouns, in the subsequent work, they also proposed a bootstrapping method that can deal with phrases (Riloff and Wiebe, 2003). Similarly, Wiebe also proposes a bootstrapping approach to create subjective and objective classifier (Wiebe and Riloff, 2005).</Paragraph>
      <Paragraph position="2"> These works are different from ours in a sense that they did not discuss how to determine the polarity of subjective words or phrases.</Paragraph>
    </Section>
    <Section position="3" start_page="458" end_page="458" type="sub_section">
      <SectionTitle>
6.3 Unsupervised sentiment classification
</SectionTitle>
      <Paragraph position="0"> Turney proposed the unsupervised method for sentiment classification (Turney, 2002), and similar method is utilized by many other researchers (Yu and Hatzivassiloglou, 2003). The concept behind Turney's model is that positive/negative phrases co-occur with words like 'excellent/poor'. The co-occurrence statistic is measured by the result of search engine. Since his method relies on search engine, it is difficult to use rich linguistic information such as dependencies.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>