<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2059">
<Title>Automatic Construction of Polarity-tagged Corpus from HTML Documents</Title>
<Section position="4" start_page="452" end_page="452" type="intro">
<SectionTitle> 2 Corpus Design </SectionTitle>
<Paragraph position="0"> This section explains the design of our automatically built corpus. Table 1 shows part of the corpus that was actually constructed in the experiment. Note that this paper deals with Japanese.</Paragraph>
<Paragraph position="1"> The sentences in the table are translations; the original sentences are in Japanese.</Paragraph>
<Paragraph position="2"> The following are characteristics of our corpus: • Our corpus uses two labels, + and −. They denote positive and negative sentences, respectively. Other labels such as 'neutral' are not used.</Paragraph>
<Paragraph position="3"> • Since we do not use a 'neutral' label, sentences that do not convey an opinion are not stored in our corpus.</Paragraph>
<Paragraph position="4"> • A label is assigned not to multiple sentences (or a document) but to a single sentence.</Paragraph>
<Paragraph position="5"> Namely, our corpus is tagged at the sentence level rather than the document level.</Paragraph>
<Paragraph position="6"> It is important to discuss why we build a corpus tagged at the sentence level rather than the document level. The reason is that one document often includes both positive and negative sentences, and hence it is difficult to learn polarity from a corpus tagged at the document level. Consider the following example (Pang et al., 2002): This film should be brilliant. It sounds like a great plot, the actors are first grade, and the supporting cast is good as well, and Stallone is attempting to deliver a good performance. However, it can't hold up.</Paragraph>
<Paragraph position="7"> This document as a whole expresses a negative opinion, and should be labeled 'negative' if it is tagged at the document level.
However, it includes several sentences that express a positive attitude. We would like to point out that polarity-tagged corpora created from reviews tend to be tagged at the document level. This is because meta-data (e.g. stars on AMAZON.COM) is usually associated with a whole review rather than with individual sentences in the review. This is a serious problem in previous work.</Paragraph>
</Section>
</Paper>