File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/97/w97-0104_intro.xml
Size: 4,395 bytes
Last Modified: 2025-10-06 14:06:21
<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0104"> <Title>A Statistics-Based Chinese Parser</Title> <Section position="4" start_page="0" end_page="5" type="intro"> <SectionTitle> 5. 2 Statistics from treebank </SectionTitle> <Paragraph position="0"> The difficulty in parsing natural language sentences lies in their high ambiguity. Traditionally, disambiguation problems in parsing have been addressed by enumerating possibilities and explicitly declaring knowledge which might aid most interesting natural language processing problems. As large-scale annotated corpora have become available, automatic knowledge acquisition from them has become a new and efficient approach, and it has been widely used in many natural language processing systems.</Paragraph> <Paragraph position="1"> Treebanks are collections of sentences annotated with syntactic constituent structure trees. Statistics extracted from a large-scale treebank reveal useful syntactic distribution principles and are very helpful for disambiguation in a parser. Some of the statistical data and rules used in our parser are briefly described as follows: (1) boundary distribution data (S1) This group of data shows the different influence of context information on the constituent boundaries in a sentence, counted by the co-occurrence frequencies of different constituent boundary labels (b_i) with words (w_i) and part-of-speech (POS) tags (t_i), which include: (a) the co-occurrence frequencies with functional words: f(w_i, b_i); (b) the co-occurrence frequencies with a single POS tag: f(t_i, b_i); (c) the co-occurrence frequencies with local POS tags: f(b_i, t_i, t_{i+1}) or f(t_{i-1}, t_i, b_i). 
They play an important role in the prediction of constituent boundary locations.</Paragraph> <Paragraph position="2"> (2) syntactic tag reduction data (S2) This group of data records the possible syntactic tags to which a constituent structure can be reduced, represented by a set of statistical rules: constituent structure -> {syntactic tag, reduction probability}.</Paragraph> <Paragraph position="3"> For example, the rule v+n -> vp 0.93, np 0.07 indicates that a syntactic constituent composed of a verb (v) and a noun (n) can be reduced to a verb phrase (vp) with probability 0.93, and to a noun phrase (np) with probability only 0.07. Based on these rules, it is easy to determine the suitable syntactic tag for a parsed constituent according to its internal structure components.</Paragraph> <Paragraph position="4"> In Chinese, there is a group of verbs with special syntactic functions. They can directly modify a noun, such as the verb &quot;xunlian (train)&quot; in the phrase &quot;xunlian shouce (training handbook)&quot;. Therefore, noun phrases with the constituent structure &quot;v+n&quot; occur in the Chinese treebank.</Paragraph> <Paragraph position="5"> (3) syntactic tag distribution on a boundary (S3) This group of data expresses the possibilities for an open or a close bracket to be the boundary of a constituent with a certain kind of syntactic tag under different POS contexts. For example, n [ p -> vp 0.531, pp 0.462, np 0.007 indicates that the probability for an open bracket in the context of a noun (n) and a preposition (p) to be the left boundary of a verb phrase (vp) is 0.531, of a prepositional phrase (pp) 0.462, and of a noun phrase (np) 0.007. This kind of data provides the basis for matching brackets and labeling the matched constituents. 
(4) constituent preference data (S4) This group of data records the preference for a constituent to be combined with its left adjacent constituent or its right adjacent one under a local context, counted by the frequencies of different constituent combination cases in the treebank (see Figure 1), which are represented as: {&lt;constituent combination case&gt;, &lt;left combination frequency&gt;, &lt;right combination frequency&gt;} For example, {p+np+vp, 190, 0} indicates that, under the local context &quot;p+np+vp&quot;, the frequency with which the noun phrase (np) combines with the preposition (p) is 190, and the frequency with which it combines with the verb phrase (vp) is 0. These data are used in the preference matching model.</Paragraph> <Paragraph position="6"> where -, is the frequency of the constituent L~ cz ~3 T \] in treebank. It provides useful information for syntactic disambiguation.</Paragraph> <Paragraph position="8"/> </Section></Paper>
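As an illustration only, the rule-style data described in (2), (3) and (4) could be held in simple lookup tables. The Python sketch below is not the parser's implementation: the names REDUCTION_RULES, BOUNDARY_TAG_RULES, PREFERENCE_DATA, most_likely_tag and preferred_side are hypothetical, only the example values quoted in the text are filled in, and resolving a tie in favour of the left side is an assumption made here.

```python
from typing import Dict, Tuple

# S2 (hypothetical layout): constituent structure -> {syntactic tag: reduction probability}
REDUCTION_RULES: Dict[str, Dict[str, float]] = {
    "v+n": {"vp": 0.93, "np": 0.07},
}

# S3 (hypothetical layout): (left POS, bracket, right POS) -> {syntactic tag: probability}
BOUNDARY_TAG_RULES: Dict[Tuple[str, str, str], Dict[str, float]] = {
    ("n", "[", "p"): {"vp": 0.531, "pp": 0.462, "np": 0.007},
}

# S4 (hypothetical layout): combination case -> (left frequency, right frequency)
PREFERENCE_DATA: Dict[str, Tuple[int, int]] = {
    "p+np+vp": (190, 0),
}


def most_likely_tag(distribution: Dict[str, float]) -> str:
    """Return the syntactic tag with the highest probability in a distribution."""
    return max(distribution, key=distribution.get)


def preferred_side(case: str) -> str:
    """Return 'left' or 'right' depending on which combination is more frequent.

    Ties go to 'left'; this tie-breaking rule is an assumption of the sketch.
    """
    left, right = PREFERENCE_DATA[case]
    return "left" if left >= right else "right"


if __name__ == "__main__":
    print(most_likely_tag(REDUCTION_RULES["v+n"]))
    print(most_likely_tag(BOUNDARY_TAG_RULES[("n", "[", "p")]))
    print(preferred_side("p+np+vp"))
```

With the example values from the text, the lookups select vp for the structure v+n, vp for an open bracket between n and p, and the left combination for the case p+np+vp. A real parser would presumably combine these scores rather than take each maximum in isolation.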