File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/05/i05-2001_intro.xml
Size: 8,257 bytes
Last Modified: 2025-10-06 14:02:55
<?xml version="1.0" standalone="yes"?> <Paper uid="I05-2001"> <Title>A Classification-based Algorithm for Consistency Check of Part-of-Speech Tagging for Chinese Corpora</Title> <Section position="3" start_page="1" end_page="4" type="intro"> <SectionTitle> 2 Describing the Context of Multi-category Words </SectionTitle> <Paragraph position="0"> The basic idea of our approach is to use the context information of multi-category words to judge whether they are tagged consistently. In other words, if a multi-category word appears in two locations and the surrounding words in those two locations are tagged similarly, the multi-category word should be assigned the same POS tag in both locations.</Paragraph> <Paragraph position="1"> Hence, we model the context of a multi-category word by looking at a window around it and at the tagging sequence of this window. In the rest of this section, we describe our vector representation of the context of multi-category words and how we determine the parameters of this representation.</Paragraph> <Section position="1" start_page="1" end_page="2" type="sub_section"> <SectionTitle> 2.1 Vector Representation of the Context of Multi-category Words </SectionTitle> <Paragraph position="0"> Our vector representation of context consists of three key components: the POS tag of each word in a context window (POS attribute), the importance of each word to the center multi-category word based on distance (position attribute), and the dependency between the POS tags of the center multi-category word and its surrounding words (dependency attribute).</Paragraph> <Paragraph position="1"> Given a multi-category word and its context window of size $l$, we represent the words in sequential order as $(w_1, \ldots, w_l)$ and their POS tags as $(t_1, \ldots, t_l)$. We also refer to the latter vector as the POS tagging sequence.
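As a minimal sketch of how such a POS tagging sequence might be read off a tagged sentence, assume the sentence is given as a list of (word, tag) pairs. The function name `pos_tagging_sequence`, the `PAD` placeholder for positions beyond the sentence boundary, and the sample sentence are illustrative assumptions; the handling of sentence boundaries in particular is not specified in the text.

```python
# Illustrative sketch only: the names, the PAD convention, and the sample
# data are assumptions, not part of the described method.
def pos_tagging_sequence(tagged_sentence, center, n=3, pad="PAD"):
    """Return the POS tags of the 2n + 1 words centered on index `center`."""
    tags = [tag for _word, tag in tagged_sentence]
    window = []
    for i in range(center - n, center + n + 1):
        # Positions outside the sentence are filled with the placeholder.
        window.append(tags[i] if i in range(len(tags)) else pad)
    return window


# Hypothetical tagged sentence; w4 plays the role of the multi-category word.
sentence = [("w1", "a"), ("w2", "n"), ("w3", "u"), ("w4", "a"),
            ("w5", "n"), ("w6", "d"), ("w7", "v")]
print(pos_tagging_sequence(sentence, center=3))
```

With `n=3` the returned list has seven entries, one per position in the window, with the multi-category word's own tag in the middle.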
In practice, we choose a value of $l$ such that the context window contains a sufficient number of words while the complexity of our algorithm remains relatively low. We will discuss this matter in detail later. In this study, we set $l$ to 7.</Paragraph> <Paragraph position="2"> The POS tagging sequence contains information about the POS of each preceding (following) word as well as the position of each POS tag. The POS of the surrounding words may have different effects on determining the POS of the multi-category word; we refer to this as the POS attribute and represent it using a matrix as follows.</Paragraph> <Paragraph position="3"> Suppose we have a tag set of size $m$ and a POS tagging sequence of length $l$. The POS attribute matrix $Y$ is an $l$ by $m$ matrix, where the rows correspond to the POS tags of the preceding words, the multi-category word, and the following words in the context window, while the columns correspond to the tags in the tag set.</Paragraph> <Paragraph position="5"> For example, consider the POS attribute matrix of &quot;P&quot; in the following sentence:</Paragraph> <Paragraph position="7"> Since we let $l = 7$, we look at the word &quot;P&quot; and its 3 preceding and 3 following words. Hence, the POS tagging sequence is ( a, n, u, a, n, d, v ). In our study, we used a standard tag set that consists of 25 tags.
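A minimal sketch of building such a matrix for the tagging sequence ( a, n, u, a, n, d, v ) is given here. It assumes that an entry is 1 when the tag of the row's word equals the column's tag and 0 otherwise; this 0/1 encoding, the function name `pos_attribute_matrix`, and the ordering of the 25-tag list are assumptions made for illustration.

```python
# Illustrative sketch only: the 0/1 encoding of the entries and the order
# of the tags are assumptions.
TAG_SET = ["n", "v", "a", "d", "u", "p", "r", "m", "q", "c", "w", "I", "f",
           "s", "t", "b", "z", "e", "o", "l", "j", "h", "k", "g", "y"]


def pos_attribute_matrix(tagging_sequence, tag_set=TAG_SET):
    """Build a len(tagging_sequence) by len(tag_set) matrix of 0/1 entries."""
    column = {tag: j for j, tag in enumerate(tag_set)}
    matrix = [[0] * len(tag_set) for _ in tagging_sequence]
    for i, tag in enumerate(tagging_sequence):
        # Mark the column of the tag carried by the i-th word in the window.
        matrix[i][column[tag]] = 1
    return matrix


Y = pos_attribute_matrix(["a", "n", "u", "a", "n", "d", "v"])
print(len(Y), len(Y[0]))
```

Under this assumption every row contains exactly one nonzero entry, so the matrix simply records which tag occupies which position of the window.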
Suppose the tag set is ( n, v, a, d, u, p, r, m, q, c, w, I, f, s, t, b, z, e, o, l, j, h, k, g, y ); then the POS attribute matrix of &quot;P&quot; in this example is a 7 by 25 matrix. Because the surrounding words lie at different distances from the multi-category word, the POS of each word before (after) it in a POS tagging sequence may have a different influence on the POS tagging of the multi-category word, which we refer to as the position attribute.</Paragraph> <Paragraph position="8"> Given a multi-category word with a context window of size $l$, suppose the number of preceding (following) words is $n$ (i.e., $l = 2n + 1$). The position attribute vector $V_P$ has one component per position in the window, and the component for the $i$th preceding (following) word is the value of the position attribute of that word. We further constrain this value for every $i$. We choose a position attribute vector so that the multi-category word itself has the highest weight and the closer a surrounding word is to it, the higher its weight. For a context window of size 7, we chose the position attribute vector based on our preliminary experiments. Note that if a POS tag in the POS tagging sequence is incorrect, the position attribute value of the corresponding position should be made negative, so that the incorrect POS tag has a negative effect on the final context vector.</Paragraph> <Paragraph position="9"> The last attribute we focus on is the dependency attribute, which reflects the mutual dependencies among the POS tags that appear in POS tagging sequences. In particular, we use the transition probability and emission probability of the Hidden Markov Model (HMM) (Leek, 1997) to capture this dependency.</Paragraph> <Paragraph position="10"> Given a tag set of size $m$, $(c_1, \ldots, c_m)$, we build a transition table $T$ and an emission table $E$, where the transition probability is estimated over the entire corpus and $P_E$ is the emission probability.
Note that both $T$ and $E$ are constructed from the entire corpus, and we can easily look up these two tables for the POS tags that appear in POS tagging sequences.</Paragraph> <Paragraph position="11"> Now, when we look at a context window of size 7, there are three types of probabilities we need to take into account.</Paragraph> <Paragraph position="12"> The first one is the probability of the appearance of the POS tag $t$. From these three types of probabilities we can build a seven-dimensional vector, where each dimension corresponds to one POS tag in the sequence.</Paragraph> <Paragraph position="13"> Given a multi-category word with a context window of size 7 and its POS tagging sequence, the dependency attribute vector $V_D$ of the multi-category word is defined as follows:</Paragraph> </Section> <Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 2.1.4 Context Vector of Multi-category Words </SectionTitle> <Paragraph position="0"> Now we are ready to define the context vector of multi-category words.</Paragraph> <Paragraph position="1"> Given a multi-category word with a context window of size $l$ and its POS attribute matrix $Y$, position attribute vector $V_P$, and dependency attribute vector $V_D$, the context vector combines the position attribute and the dependency attribute with two weights.</Paragraph> <Paragraph position="3"> Here $\alpha$ and $\beta$ are the weights of the position attribute and the dependency attribute, respectively.</Paragraph> <Paragraph position="4"> Note that we require $\alpha + \beta = 1$; their optimal values are determined by experiments in our study.</Paragraph> </Section> <Section position="3" start_page="2" end_page="4" type="sub_section"> <SectionTitle> 2.2 Experiment on the Size of the Context Window </SectionTitle> <Paragraph position="0"> Context vectors can be extended by using 4 to 7 preceding (following) words instead of 3 preceding (following) words in context windows and POS tagging sequences.
We conducted experiments with 3 to 7 preceding (following) words in the context window on our sampled 1M-word training corpus and performed a closed test. The experimental results are evaluated in terms of both the precision of consistency check and the complexity of the algorithm.</Paragraph> <Paragraph position="1"> We plot the effect on precision in Figure 1.</Paragraph> <Paragraph position="2"> Figure 1: Effect of the number of preceding (following) words on consistency check precision.</Paragraph> <Paragraph position="3"> As shown in Figure 1, the precision of consistency check increases as we include more preceding (following) words. In particular, the precision improves by 1% when we use 7 preceding (following) words. However, the increase in complexity is much higher than the increase in precision, because the dimensionality of the position attribute vector, POS attribute matrix, and dependency attribute vector more than doubles (from 7 to 15). Hence, we chose 3 as the number of preceding (following) words to form context windows and calculate context vectors.</Paragraph> </Section> <Section position="4" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 2.3 Effect on consistency check precision of </SectionTitle> <Paragraph position="0"/> </Section> </Section> </Paper>