<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1101">
  <Title>Segmentation of Chinese Long Sentences Using Commas</Title>
  <Section position="5" start_page="3" end_page="5" type="metho">
    <SectionTitle>
3 Types of Commas
</SectionTitle>
    <Paragraph position="0"> The comma is the most common punctuation mark, and the one that might be expected to have the greatest effect on syntactic parsing. It also seems natural to break a Chinese sentence at a comma. The procedure for syntactic analysis of a sentence, including the segmentation part, is as follows: 1st step: segment the sentence at a comma; 2nd step: do the dependency analysis for each segment; 3rd step: set the dependency relation between segment pairs. In Chinese dependency parsing, however, not all commas are proper segmentation points. First, segmentation at the comma in some sentences will cause some words to fail to find their heads. As Figure 2 shows, in example (2) two words from the left segment, Bei Hai (BeiHai City) and Zai (preposition), have dependency relations with the word Shi (is) in the right segment. Segmentation at this comma will therefore leave these two words unable to find their heads in the second step of syntactic parsing.</Paragraph>
    <Paragraph position="1"> Second, segmentation at a comma can cause some words to find the wrong head. Example (3) in Figure 3 shows two pairs of words with dependency relations. In each pair, one word is from the left segment and one is from the right segment: Xi Huan (like) from the left and Jiao Shi (teacher) from the right, and Nian Qing (young) from the left and De (of) from the right. Segmentation at the comma will cause the word Nian Qing (young) to take the word Xi Huan (like) as its head, which is wrong.</Paragraph>
    <Paragraph position="2"> Examples (2) and (3) demonstrate improper sentence segmentation at commas. In both Figure 2 and Figure 3, two dependency lines cross the comma. We call this kind of comma a mul_dep_lines_cross comma (multiple lines cross the comma). In Figure 1, only one dependency line crosses the comma. We call this kind of comma a one_dep_line_cross comma.</Paragraph>
    <Paragraph position="3"> Segmentation at a one_dep_line_cross comma helps to reduce parsing complexity and can contribute to accurate parsing results. However, we should avoid segmenting at a mul_dep_lines_cross comma. It is therefore necessary to check each comma according to its context.</Paragraph>
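    <Paragraph> The distinction between the two kinds of comma can be illustrated with a minimal Python sketch that counts the dependency lines crossing a comma. The encoding is an assumption made for illustration only: heads[i] holds the index of the head of word i, with None for the root and for the comma itself, and the function names are hypothetical rather than taken from the paper.

```python
def count_crossing_arcs(heads, comma_index):
    """Count dependency arcs whose two ends lie on opposite sides of the comma.

    heads[i] is the index of the head of word i, or None for the root and
    for the comma (an encoding assumed for this sketch).
    """
    crossing = 0
    for dependent, head in enumerate(heads):
        if head is None:
            continue
        low, high = min(dependent, head), max(dependent, head)
        # The arc crosses the comma when the comma lies strictly between
        # the dependent and its head.
        if high > comma_index > low:
            crossing += 1
    return crossing


def comma_type(heads, comma_index):
    """Name the comma after the number of dependency lines that cross it."""
    crossing = count_crossing_arcs(heads, comma_index)
    if crossing == 0:
        return None  # the text does not name this case
    if crossing == 1:
        return "one_dep_line_cross"
    return "mul_dep_lines_cross"


# Toy sentence: words 0 and 1, a comma at index 2, words 3 and 4.
# Both left-hand words depend on word 3, so two lines cross the comma.
print(comma_type([3, 3, None, None, 3], comma_index=2))
```

    Such a count needs a gold dependency tree, so it can label training data but cannot by itself decide where to segment an unparsed sentence.</Paragraph>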
    <Section position="1" start_page="3" end_page="4" type="sub_section">
      <SectionTitle>
3.1 Delimiter Comma and Separator Comma
</SectionTitle>
      <Paragraph position="0"> Nunberg (1990) classified commas in English into two categories, delimiter commas and separator commas, according to whether or not the comma separates elements of the same type.</Paragraph>
      <Paragraph position="1"> While a delimiter comma is used to separate different syntactic types, a separator comma is used to separate members of conjoined elements. Commas in Chinese can also be classified into these two categories. The commas in examples (3) and (4) are separators, while those in (2) and (5) are delimiters.</Paragraph>
      <Paragraph position="2"> However, both delimiter commas and separator commas can be mul_dep_line_cross commas. In example (2), the comma is a delimiter comma as well as a mul_dep_line_cross comma. The comma in example (3) is a separator comma and also a mul_dep_line_cross comma. Nunberg's classification therefore cannot help to identify mul_dep_line_cross commas.</Paragraph>
      <Paragraph position="3"> We therefore need a different classification of commas. Both delimiter commas and separator commas can occur within a clause or at the end of a clause. Commas that appear at the end of a clause are clearly one_dep_line_cross commas.</Paragraph>
      <Paragraph position="4"> Segmentation at this kind of comma is valid. (Footnote: "same type" means that the elements have the same syntactic role in the sentence; they can be coordinate phrases or coordinate clauses.)</Paragraph>
    </Section>
    <Section position="2" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
3.2 Inter-clause Comma and Intra-clause Comma
</SectionTitle>
      <Paragraph position="0"> Commas occurring within a clause are here called intra-clause commas. Similarly, commas at the end of a clause are called inter-clause commas. Examples (2) and (3) include intra-clause commas, and examples (4) and (5) include inter-clause commas.</Paragraph>
    </Section>
    <Section position="3" start_page="4" end_page="5" type="sub_section">
      <SectionTitle>
Adjoining a Comma
</SectionTitle>
      <Paragraph position="0"> A segment is a group of words between two commas or a group of words from the beginning (or end) of a sentence to its nearest comma.</Paragraph>
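      <Paragraph> The definition of a segment above can be sketched in Python as follows. This is a minimal illustration that assumes the sentence is already tokenized and that the comma is a token of its own; the function names and the default comma token are hypothetical.

```python
def split_segments(tokens, comma=","):
    """Split a tokenized sentence into segments at every comma token."""
    segments, current = [], []
    for token in tokens:
        if token == comma:
            segments.append(current)
            current = []
        else:
            current.append(token)
    segments.append(current)  # words from the last comma to the sentence end
    return segments


def adjoining_segments(tokens, comma=","):
    """Return one (left_segment, right_segment) pair per comma."""
    segments = split_segments(tokens, comma)
    return list(zip(segments[:-1], segments[1:]))


tokens = ["w1", "w2", ",", "w3", ",", "w4", "w5"]
for left, right in adjoining_segments(tokens):
    print(left, right)
```

      For Chinese text the comma argument would be set to the full-width comma character used by the tokenizer.</Paragraph>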
      <Paragraph position="1"> To identify whether a comma is an inter-clause comma or an intra-clause comma, we assign a value to each comma. This value reflects the nature of the two segments next to the comma. Either the left or the right segment of a comma can be deduced as a phrase, as several non-overlapped phrases, or as a clause (see examples (6)~(15)). The value we assign to a comma is a two-dimensional value (left_seg, right_seg). The values of left_seg and right_seg can be p(hrase) or c(lause), so the assigned value for each comma is (p,p), (p,c), (c,p) or (c,c).</Paragraph>
      <Paragraph position="2"> Commas with (p,p) as the assigned value include the cases where the left and right segments of the comma can each be deduced as one phrase, as shown in example (6), or as several non-overlapped phrases, as in example (7).</Paragraph>
      <Paragraph position="3"> We can assign the value (c,p) to the commas in examples (8), (9) and (10), indicating that the left adjoining segment is a clause and the right one is a phrase or several non-overlapped phrases. In a similar way, the commas in examples (11)~(13) are cases of (p,c).</Paragraph>
      <Paragraph position="4"> If a comma has (c,c) as the assigned value, both the left segment and the right segment can be deduced as a clause. The relation between the two clauses can be coordinate (example (14)) or subordinate (example (15)).</Paragraph>
      <Paragraph position="5"> (Footnote: a phrase is a group of words that can be deduced as a phrase in the Chinese Penn Treebank 2.0. A phrase may contain an embedded clause as its adjunct or complement.)</Paragraph>
      <Paragraph position="6"> (a), Zai Ta Men Xie ,Zuo Ye Zhi V, (b) Ta Men --Qu De ,  In example (a), the PP has an embedded clause as its complement. In example (b), the embedded clause is the adjunct of the NP.</Paragraph>
    </Section>
    <Section position="4" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
joining Segments
</SectionTitle>
      <Paragraph position="0"> One or more words in the left segment of a comma may or may not have dependency relations with one or more words in the right segment. For a comma, if at least one word from the left segment has a dependency relation with a word from the right segment, we say that the left segment and the right segment have a syntactic relation. Otherwise the two segments adjoining the comma have no syntactic relation. The function Rel() is defined in Table 1.</Paragraph>
      <Paragraph position="1"> Rel(): checks whether any word of the left segment has a dependency relation with a word of the right segment.</Paragraph>
      <Paragraph position="2"> If there is such a relation, Rel()=1; otherwise Rel()=0.</Paragraph>
      <Paragraph position="3"> Dir(): indicates how many directions the dependency relations between the left and right segments have, when Rel()=1.</Paragraph>
      <Paragraph position="4"> For a one_dep_line_cross comma, Dir()=1. For a mul_dep_line_cross comma, Dir()=1 if the directions of the dependency relations are all the same; otherwise Dir()=2.</Paragraph>
      <Paragraph position="5"> Head(): indicates which segment contains the head of a word in the other segment, when Rel()=1.</Paragraph>
      <Paragraph position="6"> When Dir()=1: if the left segment contains a word that is the head of a word in the right segment, Head()=left; otherwise Head()=right.</Paragraph>
      <Paragraph position="7"> When Dir()=2: (1) following the directions of the dependency relations between the two segments, find the word that has no head;</Paragraph>
      <Paragraph position="8"> (2) if that word is in the left segment, Head()=left; otherwise Head()=right.</Paragraph>
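      <Paragraph> The three functions of Table 1 can be sketched in Python as below. This is one possible reading and not the authors' implementation. It assumes that heads[i] gives the index of the head of word i (None for the root and for the comma) and that left and right are the word indices of the two segments. For the Dir()=2 case it interprets "the word that has no head" as the first word whose head lies outside both segments, which is an assumption. All function names are hypothetical.

```python
def crossing_arcs(heads, left, right):
    """Return the (dependent, head) pairs that link the two segments."""
    left, right = set(left), set(right)
    arcs = []
    for dependent, head in enumerate(heads):
        if head is None:
            continue
        if (dependent in left and head in right) or (dependent in right and head in left):
            arcs.append((dependent, head))
    return arcs


def rel(heads, left, right):
    """Rel(): 1 if any dependency relation links the two segments, else 0."""
    return 1 if crossing_arcs(heads, left, right) else 0


def head_sides(heads, left, right):
    """Set of sides ("left" or "right") holding the head of a crossing arc."""
    left = set(left)
    return {"left" if head in left else "right"
            for _, head in crossing_arcs(heads, left, right)}


def direction_count(heads, left, right):
    """Dir(): number of distinct directions; None when Rel()=0."""
    sides = head_sides(heads, left, right)
    return len(sides) if sides else None


def head_side(heads, left, right):
    """Head(): the side that contains the head; None when Rel()=0."""
    sides = head_sides(heads, left, right)
    if not sides:
        return None
    if len(sides) == 1:
        return next(iter(sides))
    # Dir()=2 (assumed reading): locate the word whose head is outside
    # both segments and report the side it is on.
    left_set = set(left)
    span = left_set | set(right)
    for index in sorted(span):
        if heads[index] is None or heads[index] not in span:
            return "left" if index in left_set else "right"
    return None


# Toy tree: words 0-1 on the left, comma at 2, words 3-4 on the right.
heads = [3, 0, None, None, 1]
print(rel(heads, [0, 1], [3, 4]), direction_count(heads, [0, 1], [3, 4]), head_side(heads, [0, 1], [3, 4]))
```

      In the toy tree, word 0 depends on word 3 and word 4 depends on word 1, so the relations run in both directions and the root, word 3, decides the value of Head().</Paragraph>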
      <Paragraph position="9">  For a one_dep_line_cross comma, the left and right segments have a syntactic relation, and only one word from one segment has a dependency relation with a word from the other segment. For a mul_dep_line_cross comma, at least two pairs of words, each pair with one word from each segment, have dependency relations. We then say that the left and right segments adjacent to the comma have multiple dependency relations. The directions of these relations may or may not differ. We define a function Dir() as follows: if all the relations have the same direction, its value is 1; otherwise its value is 2 (see Table 1). We also define a function Head() to indicate whether the left or the right segment contains the head word of the other when the two segments have a syntactic relation (also shown in Table 1). In example (3), as Figure 3 shows, Rel()=1,  clause Comma For commas assigned the values (p,p) or (c,c), the function Rel() is always 1. Commas with the values (c,p) or (p,c) can be further divided into two sub-cases. Table 2 shows the sub-cases of (c,p), and Table 3 shows the sub-cases of (p,c).</Paragraph>
      <Paragraph position="10">  Commas with the value of (p,p), (c,p)-II and (p,c)-II are used to connect coordinate phrases or to separate two constituents of a clause. These commas are intra-clause commas.</Paragraph>
      <Paragraph position="11"> Commas with the values (c,c), (c,p)-I and (p,c)-I are used as clause boundaries. These are inter-clause commas.</Paragraph>
      <Paragraph position="12"> An inter-clause comma joins clauses together to form a sentence. Commas that belong to the inter-clause category are safe as segmentation points (Kim, 2001).</Paragraph>
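      <Paragraph> The mapping from assigned values to comma roles described above can be written as a small lookup. In this sketch the sub-case label (I or II) of a (c,p) or (p,c) comma is taken as a given input, because the criteria of Tables 2 and 3 are not reproduced here; the function and constant names are hypothetical.

```python
INTER_CLAUSE = "inter-clause"
INTRA_CLAUSE = "intra-clause"


def comma_role(left_seg, right_seg, sub_case=None):
    """Map a comma value (left_seg, right_seg) to its role.

    left_seg and right_seg are "p" (phrase) or "c" (clause). sub_case is
    "I" or "II" and is only consulted for the mixed values (c,p) and (p,c).
    """
    value = (left_seg, right_seg)
    if value == ("p", "p"):
        return INTRA_CLAUSE
    if value == ("c", "c"):
        return INTER_CLAUSE
    if value in (("c", "p"), ("p", "c")):
        if sub_case == "I":
            return INTER_CLAUSE
        if sub_case == "II":
            return INTRA_CLAUSE
        raise ValueError("sub-case I or II is required for (c,p) and (p,c)")
    raise ValueError("segment values must be 'p' or 'c'")


def is_safe_segmentation_point(left_seg, right_seg, sub_case=None):
    """Only inter-clause commas are treated as safe segmentation points."""
    return comma_role(left_seg, right_seg, sub_case) == INTER_CLAUSE


print(comma_role("c", "p", "I"), is_safe_segmentation_point("p", "p"))
```
      </Paragraph>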
    </Section>
  </Section>
  <Section position="6" start_page="5" end_page="5" type="metho">
    <SectionTitle>
4 Feature Selection
</SectionTitle>
    <Paragraph position="0"> To identify the inter-clause or intra-clause role of a comma, we need to assess the left and right segments adjoining the comma, using information from both segments. Any information that helps to identify a segment as a clause, a phrase, or several phrases is useful. Carreras and Marquez (2001) show that using features containing relevant information about a clause leads to more efficient clause identification. Their system outperformed all other systems in the CoNLL-2001 clause identification shared task (Sang &amp; Dejean, 2001). Given this consideration, we select two categories of features, as follows.</Paragraph>
    <Paragraph position="1">  (1) Directly relevant feature category: the predicate and its complements.</Paragraph>
    <Paragraph position="2"> (2) Indirectly relevant feature category: auxiliary words, adverbials, prepositions or clausal conjunctions.</Paragraph>
    <Paragraph position="3"> Directly relevant features
      VC: if a copula Shi appears
      VA: if an adjective appears
      VE: if You as the main verb appears
      VV: if a verb appears
      CS: if a subordinating conjunction appears
      BA_BEI: if Ba or Bei appears
      LC: if a localizer appears
      FIR_PR: if the first word is a pronoun
      LAS_LO: if the last word is a localizer
      LAS_T: if the last word is a time
      LAS_DE_N: if the last word is a noun that follows De
      No_word: if the length of a word is more than 5
      no_verb: if no verb (including VA) appears
      DEC: if there is a relative clause
      ONE: if the segment has only one word
      To detect whether a segment is a clause or a phrase, the verbs are important. However, Chinese has no morphological paradigms, and a verb can take various syntactic roles besides that of predicate without any change in its surface form. This means that information about the verb is not sufficient, in itself, to determine whether a segment is a clause.</Paragraph>
    <Paragraph position="4"> When the verb takes a syntactic role other than predicate, it is frequently accompanied by function words. For example, a verb can be used as the complement of the auxiliary word or De (Xia, 2000), to modify the following verb or noun. In these cases, the auxiliary words help to decide the syntactic role of the verb. Other function words around the verb also help us to estimate its syntactic role. For this reason, we employ all the function words as features; together they make up the indirectly relevant feature category.</Paragraph>
    <Paragraph position="5"> Table 4 gives the entire feature set. The label of each feature type is the same as in the tag set of the Chinese Penn Treebank 2.0 (see Xia (2000) for a more detailed description). If a feature appears in the left segment, we label it L_feature type, and if it appears in the right segment, it is labeled R_feature type, where feature type is the feature shown in Table 4. The value of each feature is either 0 or 1. When extracting the features of a sentence, if a feature in Table 4 appears in the sentence, we assign it the value 1, otherwise 0. The features of example (12) are extracted as Table 5 describes. All of these values are composed as an input feature vector for</Paragraph>
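    <Paragraph> The L_ and R_ labelling scheme with binary values can be illustrated by the following Python sketch. It assumes that the feature types present in each segment have already been detected by some other means, and it uses only the feature names listed above; the function name and the choice of a dictionary as the vector representation are assumptions.

```python
FEATURE_TYPES = [
    "VC", "VA", "VE", "VV", "CS", "BA_BEI", "LC", "FIR_PR",
    "LAS_LO", "LAS_T", "LAS_DE_N", "No_word", "no_verb", "DEC", "ONE",
]


def feature_vector(left_features, right_features):
    """Build the binary feature vector for one comma.

    left_features and right_features are the sets of feature types found
    in the left and right segments. Each feature type yields one L_ entry
    and one R_ entry with value 1 if present and 0 otherwise.
    """
    vector = {}
    for name in FEATURE_TYPES:
        vector["L_" + name] = 1 if name in left_features else 0
        vector["R_" + name] = 1 if name in right_features else 0
    return vector


vector = feature_vector({"VV", "FIR_PR"}, {"ONE"})
print([name for name, value in vector.items() if value == 1])
```
    </Paragraph>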
  </Section>
</Paper>