<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-2002">
  <Title>A Hierarchical Parsing Approach with Punctuation Processing for Long Chinese Sentences</Title>
  <Section position="4" start_page="7" end_page="8" type="metho">
    <SectionTitle>
3 Motivations
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
3.1 Differences between Chinese and
English Punctuations
</SectionTitle>
      <Paragraph position="0"> Chinese has some punctuation marks that do not exist in English. The first is the pair of book-name marks '《' and '》', which indicate that the content between them is the name of a book. The second is the pause mark '、', which replaces the comma as the separator between coordinate components.</Paragraph>
      <Paragraph position="1"> For instance, the sentence &quot;I like to walk, skip, and run.&quot; can be translated into Chinese as &quot;g6117 g2928g8438g17220、g17351、g2656g17317。&quot;. The pause mark is used exclusively to separate coordinate words or simple phrases, so such words and phrases are easier to identify in Chinese sentences.</Paragraph>
    </Section>
    <Section position="2" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
3.2 Special Difficulty in Parsing Long
Chinese Sentences
</SectionTitle>
      <Paragraph position="0"> In essence, English is a hypotactic language, so a sentence corresponds to a complete syntactic structure. When several simple sentences are joined into a compound sentence, explicit conjunctions are expected between them.</Paragraph>
      <Paragraph position="1"> By contrast, Chinese is a paratactic language, in which the unit that expresses a complete thought counts as a complete sentence.</Paragraph>
      <Paragraph position="2"> Therefore, several sentences with related meanings can be connected by punctuation marks alone to form a compound sentence without any conjunctions. This type of sentence is called a 'run-on sentence' and is prevalent in Chinese. For example, we randomly selected 4431 sentences longer than 30 characters from a Chinese corpus named TCT 973.</Paragraph>
      <Paragraph position="3"> Of these, 1830 are run-on sentences, or 41.3%. The Chinese sentence &quot;g6117g10628g5062g8505g1849g1025g5192， g8611g3837g6392g17722，g6642g5483g6117g12946g11142g2159g4625。&quot; is of this kind. Its English meaning is &quot;Now, I am not young and I still have to take the bus to work every day, which makes me very tired&quot;. In the Chinese sentence above, commas are used not only to separate sub-sentences but also to separate components within one sub-sentence. However, the lack of connectives invalidates the methods [7, 8] for segmenting complex sentences. In this situation, a one-level parsing strategy must acquire the boundaries of sub-sentences and the syntactic structure of sub-sentences or phrases simultaneously, which increases the difficulty of parsing long sentences.</Paragraph>
    </Section>
    <Section position="3" start_page="7" end_page="8" type="sub_section">
      <SectionTitle>
3.3 Corresponding Solution
</SectionTitle>
      <Paragraph position="0"> To solve this problem, we propose a hierarchical parsing (HP) approach.</Paragraph>
      <Paragraph position="1"> Nunberg's theory of two categories of grammar provides the theoretical basis of the HP approach.</Paragraph>
      <Paragraph position="2"> According to his definition of the two categories of grammar, described in Section 2, the two grammars can operate independently at different levels. We define punctuation marks that can occur as elements of the text grammar as 'divide' punctuations, and those that can occur as elements of the lexical grammar as 'ordinary' punctuations. The 'divide' punctuations are used to divide the whole sentence into several parts, and parsing is then carried out in two steps: the syntactic structure of sub-sentences or phrases is acquired in first-level parsing, and the boundaries of sub-sentences and the relationships between sub-sentences or phrases are acquired in second-level parsing.</Paragraph>
      <Paragraph position="3"> This is the main idea of the HP approach, which reduces the difficulty of parsing run-on sentences and other types of compound sentences. The framework of the HP approach is shown in Figure 1.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="8" end_page="9" type="metho">
    <SectionTitle>
4 Hierarchical Parsing Approach
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="8" end_page="8" type="sub_section">
      <SectionTitle>
4.1 Classification of Chinese Punctuations
</SectionTitle>
      <Paragraph position="0"> In this paper, the 'divide' punctuations are defined as follows: if the lexical sentences or phrases separated by a punctuation mark must relate to each other as wholes rather than in part, that mark operates at the level of text grammar and is classified as a 'divide' punctuation. Parts (a) and (b) of Figure 2 give examples of the two categories of punctuation (P stands for a punctuation mark).</Paragraph>
      <Paragraph position="1"> In Chinese, the semicolon is used to separate coordinate sub-sentences. The colon separates interpretative phrases or sub-sentences from the preceding sub-sentences. So, according to the definition above, both can be classified as 'divide' punctuations. The comma is a special case: it can also mark the elements of coordinate phrases. Using it as a 'divide' punctuation may therefore cause improper divisions, and a compensatory solution is introduced, which is discussed in detail in Section 4.3.3.</Paragraph>
    </Section>
    <Section position="2" start_page="8" end_page="9" type="sub_section">
      <SectionTitle>
4.2 Grammar Rules
</SectionTitle>
      <Paragraph position="0"> The automatic extraction of grammar rules that include punctuation depends on a large parsed Chinese corpus with ample syntactic phenomena and standard punctuation usage. Fortunately, the Chinese treebank TCT 973 is such a corpus. It contains 1,000,000 words and covers all kinds of text after 1990. The average sentence length is 23.3 words, and sentences longer than 20 words account for half of the corpus.</Paragraph>
      <Paragraph position="1"> First, the original grammar rules are extracted. Then generalizations about the use of the various punctuation marks are drawn from the rule set. For example, as mentioned before, the Chinese book-name marks '《' and '》' indicate that the content between them is the name of a book, whatever its syntactic category. Therefore, a generalized rule can be deduced as follows: $\mathrm{NP} \rightarrow \text{《}\ X\ \text{》}, \quad X \in \{\mathrm{NP}, \mathrm{VP}, \mathrm{S}, \mathrm{PP}, \ldots\}$</Paragraph>
      <Paragraph position="3"> In the generalized rule above, X can be the category of any phrase or the POS of a single word, so possible rules that were not extracted from the treebank are added to the grammar rule set with probability 1.</Paragraph>
      <Paragraph position="4"> Except for the special cases above, the probabilities of all grammar rules are computed by Maximum Likelihood Estimation (MLE). Finally, all rules are combined to form a complete grammar system.</Paragraph>
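      <Paragraph> The MLE step can be illustrated with a minimal sketch, assuming the usual relative-frequency estimate for context-free rules: the probability of a rule is its count divided by the count of all rules with the same left-hand side. The function name, the toy rule list and the POS tags below are hypothetical and are not taken from TCT 973.

```python
from collections import Counter


def mle_rule_probabilities(rule_occurrences):
    """Estimate P(rhs | lhs) = count(lhs -> rhs) / count(lhs)."""
    rule_occurrences = list(rule_occurrences)
    rule_counts = Counter(rule_occurrences)
    lhs_counts = Counter(lhs for lhs, _ in rule_occurrences)
    return {
        (lhs, rhs): count / lhs_counts[lhs]
        for (lhs, rhs), count in rule_counts.items()
    }


# Toy rule occurrences (hypothetical; one entry per occurrence in a treebank).
observed = [
    ("VP", ("PP", "VP")),
    ("VP", ("PP", "VP")),
    ("VP", ("v", "NP")),
    ("NP", ("n",)),
]
probs = mle_rule_probabilities(observed)

# Generalized book-name rules NP -> 《 X 》 are added with probability 1,
# as the text describes, whether or not they were observed.
for x in ("NP", "VP", "S", "PP"):
    probs[("NP", ("《", x, "》"))] = 1.0
```

      In this toy example the rule VP -> PP VP receives probability 2/3 and VP -> v NP receives 1/3. Because the generalized rules are added with a fixed probability of 1, the NP rules no longer sum to 1; the sketch simply follows the description above on this point.</Paragraph>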
    </Section>
    <Section position="3" start_page="9" end_page="9" type="sub_section">
      <SectionTitle>
4.3 Parsing Strategy
</SectionTitle>
      <Paragraph position="0"> Based on the classification above, commas, semicolons and colons are used to divide sentences into a series of sub-sentences. Note that quotation marks and parentheses are treated as transparent and syntactically non-functional.</Paragraph>
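      <Paragraph> The division step might be sketched as follows. This is only an illustration under stated assumptions: it works on a raw character string rather than on the POS-tagged input the parser uses, the names divide_sentence, DIVIDE_MARKS and TRANSPARENT_MARKS are hypothetical, and the tokens w1 to w7 are placeholders for Chinese words.

```python
DIVIDE_MARKS = "，；："            # full-width comma, semicolon, colon
TRANSPARENT_MARKS = "“”‘’（）()"  # quotation marks and parentheses


def divide_sentence(sentence):
    """Split a sentence at 'divide' punctuation, skipping transparent marks.

    Returns (segment, following_mark) pairs; the mark is "" for a final
    segment that is not followed by a 'divide' mark.
    """
    segments = []
    current = []
    for ch in sentence:
        if ch in TRANSPARENT_MARKS:
            continue  # treated as syntactically non-functional
        if ch in DIVIDE_MARKS:
            segments.append(("".join(current).strip(), ch))
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        segments.append((tail, ""))
    return segments


parts = divide_sentence("w1 w2，（w3）w4；w5：w6 w7。")
```

      Each segment is kept together with the mark that follows it, so that a later stage can still see which punctuation separated two sub-sentences. The sentence-final period stays attached to the last segment in this sketch.</Paragraph>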
      <Paragraph position="1"> All sub-sentences and phrases obtained from the division step are inputs to first-level parsing. A chart parsing algorithm is used here. The grammar rules and their probabilities are used for parsing and disambiguation. For each sub-sentence or phrase, the parse tree chosen is the one with the highest probability among all possible trees.</Paragraph>
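      <Paragraph> The text does not say which chart parsing algorithm is used. As a minimal sketch, assuming a grammar restricted to binary rules and a CKY-style chart (a stand-in technique, not necessarily the one used here), selecting the highest-probability tree for a POS sequence might look as follows. The function name best_parse, the rule table and the tags are hypothetical.

```python
def best_parse(pos_tags, binary_rules):
    """Viterbi CKY over a POS sequence.

    binary_rules maps (left_label, right_label) to a list of
    (parent_label, rule_probability) pairs. Returns a dict that maps each
    label spanning the whole input to (probability, tree).
    """
    n = len(pos_tags)
    chart = {}
    for i, tag in enumerate(pos_tags):
        chart[(i, i + 1)] = {tag: (1.0, tag)}
    for width in range(2, n + 1):
        for start in range(0, n - width + 1):
            end = start + width
            cell = {}
            for mid in range(start + 1, end):
                for l_label, (l_p, l_tree) in chart[(start, mid)].items():
                    for r_label, (r_p, r_tree) in chart[(mid, end)].items():
                        for parent, rule_p in binary_rules.get((l_label, r_label), []):
                            p = rule_p * l_p * r_p
                            # Keep only the best-scoring tree per label.
                            if parent not in cell or p > cell[parent][0]:
                                cell[parent] = (p, (parent, l_tree, r_tree))
            chart[(start, end)] = cell
    return chart[(0, n)]


# Toy grammar (hypothetical): PP -> p n, VP -> v n, VP -> PP VP.
rules = {
    ("p", "n"): [("PP", 1.0)],
    ("v", "n"): [("VP", 0.5)],
    ("PP", "VP"): [("VP", 0.5)],
}
result = best_parse(["p", "n", "v", "n"], rules)
```

      For the toy input, the only analysis spanning the whole sequence is a VP made of a PP and a VP, with probability 0.5 * 1.0 * 0.5 = 0.25.</Paragraph>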
    </Section>
    <Section position="4" start_page="9" end_page="9" type="sub_section">
      <SectionTitle>
4.3.3 Detection of Improper Division and
Combination
</SectionTitle>
      <Paragraph position="0"> Because the comma is special, using it as a division mark may cause improper divisions. The main cause is division between coordinate phrases that together form a single component of the sentence. For example, the Chinese sentence &quot;g6117g2928g8438g3324g7161g3837g2447g16278 g17187g7703g14469，g3324g3811g3837g2447g8439g17187g14667g14469，g3324g12191g3837g2447g16278g17187 g13430g2506，g1306g7368g2928g8438g3324g1920g3837g2447g8439g17187g19646g7235。&quot; is a typical coordinate structure, similar to &quot;I like to do ..., to do ..., to do ..., but I like better to ...&quot; in English. The first three &quot;g2928g8438&quot; are coordinate predicates of the sentence, and an improper division breaks up this relationship. In this section, we propose a detection and combination method to solve this problem in parsing Chinese sentences.</Paragraph>
      <Paragraph position="1"> Because the lexical expressions surrounding punctuation marks are parsed in first-level parsing, the information about their internal syntactic structure is easy to obtain. Only a simple analysis procedure is needed to judge whether a coordinate relationship exists between the lexical expressions surrounding a comma.</Paragraph>
      <Paragraph position="2"> The analysis strategy is described below using this example.</Paragraph>
      <Paragraph position="3"> As Figure 3 shows, the components after the first comma are parsed as a verb phrase (VP), marked as B. B is composed of a preposition phrase (PP) and a verb phrase. If there is a phrase of minimal length immediately before the first comma that has exactly the same structure as phrase B, then the two are coordinate phrases. In Figure 3, A is such a phrase. The components after the other commas are analyzed similarly. Finally, A, B and C are coordinate phrases.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="9" end_page="10" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> Since the verb phrase D immediately after the third comma has an obviously different structure from A, B and C, it is not a coordinate component with them. (The part-of-speech tags throughout this paper follow [...].) From the analysis above, we can see that the first and second commas actually operate at the level of lexical grammar, and using them as 'divide' punctuations causes the improper division shown in Figure 2(b). Therefore, we present a method that uses a sub-tree adjoining operation, first combining the sub-tree A [...]. The execution conditions and results of this adjoining operation are summarized in the following rules:</Paragraph>
    <Paragraph position="2"> The execution conditions of both Rule (2) and Rule (3) are defined as follows: all X should be coordinate phrases with the same syntactic category.</Paragraph>
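    <Paragraph> The coordinate check that precedes the adjoining operation can be sketched as follows. This is an illustration under assumptions, not the procedure as implemented: trees are (label, children) pairs, 'the same structure' is read as the same root category with the same sequence of immediate child categories, and the candidate phrases before a comma are taken to be the constituents on the right edge of the preceding sub-tree. All function names and POS tags are hypothetical.

```python
def signature(tree):
    """Root label plus the labels of the immediate children."""
    label, children = tree
    return (label, tuple(child[0] for child in children))


def right_spine(tree):
    """Constituents that end at the right edge of the tree, largest first."""
    spine = [tree]
    while spine[-1][1]:
        spine.append(spine[-1][1][-1])
    return spine


def find_coordinate_partner(left_tree, right_phrase):
    """Smallest constituent ending just before the comma whose signature
    matches right_phrase, or None if no such constituent exists."""
    target = signature(right_phrase)
    for node in reversed(right_spine(left_tree)):
        if node[1] and signature(node) == target:
            return node
    return None


def leaf(pos):
    return (pos, [])


# Toy trees (hypothetical tags): A and B are both VP -> PP VP.
pp = ("PP", [leaf("p"), ("NP", [leaf("n")])])
inner_vp = ("VP", [leaf("v"), ("NP", [leaf("n")])])
a = ("VP", [pp, inner_vp])
left = ("S", [("NP", [leaf("r")]), ("VP", [leaf("v"), a])])
b = ("VP", [pp, inner_vp])
d = ("VP", [leaf("d"), ("VP", [leaf("v"), a])])

partner_of_b = find_coordinate_partner(left, b)
partner_of_d = find_coordinate_partner(left, d)
```

    Searching from the smallest right-edge constituent upward returns the minimal matching phrase: here the partner of B is A, while the differently shaped D has no partner, so the comma before it would remain a 'divide' punctuation.</Paragraph>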
    <Paragraph position="3"> The parsing algorithm of this module is exactly the same as in first-level parsing; the only difference is the input string. In the first parsing stage, the input is the POS sequence of the words; in the second parsing stage, the input is the POS sequence of the root nodes of all sub-trees. After this stage, the best parse tree of the whole sentence is constructed.</Paragraph>
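    <Paragraph> How the second-stage input is assembled can be shown with a short sketch. It assumes that each first-level sub-tree is a (label, children) pair and that the 'divide' marks are kept between the root categories, since the grammar rules include punctuation; both points, and the function name, are assumptions made for illustration.

```python
def second_level_input(first_level_trees, marks):
    """Interleave sub-tree root categories with the marks that follow them."""
    sequence = []
    for tree, mark in zip(first_level_trees, marks):
        sequence.append(tree[0])  # root category of the sub-tree
        if mark:
            sequence.append(mark)
    return sequence


# Toy first-level results (hypothetical): three sub-trees and their marks.
trees = [("S", []), ("VP", []), ("S", [])]
marks = ["，", "；", ""]
stage_two = second_level_input(trees, marks)
```

    The resulting sequence S ， VP ； S would then be parsed by the same algorithm as in the first stage, with whole sub-trees playing the role that single words played before.</Paragraph>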
  </Section>
</Paper>