<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2206">
  <Title>Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data</Title>
  <Section position="4" start_page="1265" end_page="1268" type="metho">
    <SectionTitle>
2. Principle
</SectionTitle>
    <Paragraph position="0"> 2.1. Mutual information and difference of t-score between characters. Mutual information and t-score, two important concepts in information theory and statistics, have been used to measure the degree of association between two words in an English corpus [4]. We adopt these measures almost unchanged, with one major modification: the variables in the two relevant formulae are no longer words but Chinese characters.</Paragraph>
    <Paragraph position="1"> Definition 1 Given a Chinese character string 'xy', the mutual information between characters x and y (or equivalently, the mutual information of the location between x and y) is defined as: $mi(x:y) = \log_2 \frac{p(x,y)}{p(x)\,p(y)}$ where p(x,y) is the co-occurrence probability of x and y, and p(x), p(y) are the independent probabilities of x and y respectively.</Paragraph>
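    <Paragraph> Definition 1 can be illustrated with a minimal sketch. It assumes that p(x,y), p(x) and p(y) are estimated by relative frequencies of character bigrams and unigrams in a corpus string; the function name and the toy corpus are hypothetical and serve only as illustration, not as the estimation procedure actually used.

```python
import math
from collections import Counter


def mutual_information(text, x, y):
    """mi(x:y) = log2(p(x,y) / (p(x) * p(y))), from relative frequencies.

    Assumption: probabilities are plain relative frequencies of
    character unigrams and adjacent-character bigrams in `text`.
    """
    unigrams = Counter(text)
    bigrams = Counter(zip(text, text[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    if bigrams[(x, y)] == 0:
        return float("-inf")  # the pair never occurs adjacently
    p_xy = bigrams[(x, y)] / n_bi
    p_x = unigrams[x] / n_uni
    p_y = unigrams[y] / n_uni
    return math.log2(p_xy / (p_x * p_y))


toy_corpus = "abababcc"  # hypothetical stand-in for a character corpus
print(mutual_information(toy_corpus, "a", "b"))
```

Python strings are sequences of code points, so the same function applies unchanged to Chinese text.</Paragraph>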
    <Paragraph position="2"> As claimed by Church (1991), the larger the mutual information between x and y, the higher the possibility of x and y being combined together. For example, consider sentence (1).</Paragraph>
    <Paragraph position="4"> The distribution of mi(x:y) for sentence (1) is illustrated in Fig. 1, where the "connect" marker denotes that x and y should be combined and the "break" marker that they should be separated according to human judgment (this convention holds throughout the paper). The correct segmentation for (1) is obtained when every location between x and y in the sentence is treated as 'combined' if its mi value is greater than a threshold and as 'separated' if it is below the threshold (suppose the threshold is 3.0 for this example): economy cooperation will be for current world economy trend of an appropriate answer (Economic cooperation will be an appropriate answer to the trend of economics in the current world.) It is evident that x and y are strongly combined if $mi(x:y) \gg 0$ and are to be separated if $mi(x:y) \ll 0$. But if $mi(x:y) \approx 0$, the association of x and y becomes uncertain.</Paragraph>
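    <Paragraph> The threshold rule described for sentence (1) can be sketched as follows. This is only an illustration under stated assumptions: the mi values are supplied by the caller, a value exactly equal to the threshold is treated as 'separated' (the text does not say which way ties go), and the function name and sample values are hypothetical.

```python
def segment_by_threshold(chars, mi_values, threshold=3.0):
    """Join adjacent characters whose mi exceeds the threshold.

    chars:     the sentence as a string of characters
    mi_values: one mi value per location between adjacent characters
    Assumption: mi equal to the threshold counts as 'separated'.
    """
    if len(mi_values) != len(chars) - 1:
        raise ValueError("need exactly one mi value per location")
    words = []
    current = chars[0]
    for ch, mi in zip(chars[1:], mi_values):
        if mi > threshold:
            current += ch          # location is 'combined'
        else:
            words.append(current)  # location is 'separated'
            current = ch
    words.append(current)
    return words


# Hypothetical mi values for a five-character string.
print(segment_by_threshold("ABCDE", [5.1, 0.4, 4.2, 1.0]))
```
</Paragraph>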
    <Paragraph position="5"> Observe the mi distribution for sentence (2) in Fig. 2.</Paragraph>
    <Paragraph position="7"> In the region 2.0 ≤ mi &lt; 4.0 there is some confusion. Comparing the mi values of eight character pairs in this region gives orderings that conflict with human judgment: four of the pairs should be separated and four should be combined, yet their mi values do not distinguish the two groups [Chinese character pairs not recoverable]. The power of mi is therefore somewhat weak in the 'intermediate' range of its value. To solve this problem, we need additional measures.
[Fig. 1 The distribution of mi (sentence 1). Horizontal axis: character pairs in sentence; legend: connect, break.]
[Fig. 2 The distribution of mi (sentence 2). Vertical axis: mi; horizontal axis: character pairs in sentence; legend: connect, break.]</Paragraph>
    <Paragraph position="8"> Definition 2 Given a Chinese character string 'xyz', the t-score of the character y relative to characters x and z is defined as: $ts_{x,z}(y) = \frac{p(z|y) - p(y|x)}{\sqrt{var(p(z|y)) + var(p(y|x))}}$</Paragraph>
    <Paragraph position="10"> where p(y|x) is the conditional probability of y given x, and p(z|y), of z given y, and var(p(y|x)), var(p(z|y)) are variances of p(y|x) and of p(z|y) respectively.</Paragraph>
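    <Paragraph> As a hedged illustration of Definition 2, the sketch below assumes the t-score has the form (p(z|y) - p(y|x)) / sqrt(var(p(z|y)) + var(p(y|x))), which agrees with the quantities named above and with the sign convention that a positive value means y leans towards z. How the variances are estimated is not specified here, so they are simply passed in; all names and numbers are hypothetical.

```python
import math


def t_score(p_y_given_x, p_z_given_y, var_y_given_x, var_z_given_y):
    """t-score of y in the context 'xyz' (assumed form, see lead-in).

    Positive: y tends to bind with z; negative: y tends to bind with x.
    """
    denom = math.sqrt(var_z_given_y + var_y_given_x)
    if denom == 0.0:
        raise ValueError("the two variances must not both be zero")
    return (p_z_given_y - p_y_given_x) / denom


# Hypothetical probabilities and variances.
print(t_score(0.1, 0.4, 0.01, 0.03))
```
</Paragraph>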
    <Paragraph position="11"> Also as pointed out by Church (1991), $ts_{x,z}(y)$ indicates the binding tendency of y in the context of x and z: if p(z|y) &gt; p(y|x), i.e. $ts_{x,z}(y) &gt; 0$, then y tends to be bound with z rather than with x; if p(y|x) &gt; p(z|y), i.e. $ts_{x,z}(y) &lt; 0$, then y tends to be bound with x rather than with z. A distinct feature of ts is that it is context-dependent (a relative measure), with a certain degree of flexibility with respect to the context, whereas mi is context-independent (an absolute measure). Its drawback is that it attaches to a character rather than to the location between two adjacent characters, which is inconvenient if we want to unify it with mi. We therefore introduce a new measure, dts, in place of ts: Definition 3 Given a Chinese character string 'vxyw', the difference of t-score between characters x and y is defined as: $dts(x:y) = ts_{v,y}(x) - ts_{x,w}(y)$ Now dts(x:y) is allocated to the location between x and y, just like mi(x:y). The context of dts(x:y) becomes 4 characters, 1 character larger than that of $ts_{x,z}(y)$. The value of dts(x:y) reflects the result of the competition among the four adjacent characters v, x, y and w:</Paragraph>
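    <Paragraph position="13"> (1) $ts_{v,y}(x) &gt; 0$, $ts_{x,w}(y) &lt; 0$ (x tends to combine with y, and y tends to combine with x). In this case, x and y attract each other. The location between x and y should be bound.</Paragraph>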
    <Paragraph position="14"> (2) $ts_{v,y}(x) &lt; 0$, $ts_{x,w}(y) &gt; 0$ (x tends to combine with v, and y tends to combine with w).</Paragraph>
    <Paragraph position="16"> In this case, x and y repel each other. The location between x and y should be separated.</Paragraph>
    <Paragraph position="17"> (3a) $ts_{v,y}(x) &gt; 0$, $ts_{x,w}(y) &gt; 0$ (x tends to combine with y, whereas y tends to combine with w). (3b) $ts_{v,y}(x) &lt; 0$, $ts_{x,w}(y) &lt; 0$ (x tends to combine with v, whereas y tends to combine with x). In cases (3a) and (3b), the status of the location between x and y is determined by the competition between $ts_{v,y}(x)$ and $ts_{x,w}(y)$: if dts(x:y) &gt; 0 then it tends to be bound; if dts(x:y) &lt; 0 then it tends to be separated.</Paragraph>
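    <Paragraph> The attract, repel and competition cases can be summarised in a short sketch. It takes the two t-scores of Definition 3 as given numbers; the function names, the returned labels and the treatment of exact zeros (left undecided) are assumptions made for illustration.

```python
def dts(ts_x, ts_y):
    """dts(x:y) = ts_{v,y}(x) - ts_{x,w}(y) for the string 'vxyw'."""
    return ts_x - ts_y


def interpret(ts_x, ts_y):
    """Read off the tendency of the location between x and y.

    ts_x: t-score of x in the context v, y
    ts_y: t-score of y in the context x, w
    """
    if ts_x > 0 and 0 > ts_y:
        return "bound"                  # x and y attract each other
    if 0 > ts_x and ts_y > 0:
        return "separated"              # x and y repel each other
    d = dts(ts_x, ts_y)                 # same-sign cases: competition
    if d > 0:
        return "tends to be bound"
    if 0 > d:
        return "tends to be separated"
    return "undecided"                  # assumption: ties are left open


# Hypothetical t-scores.
print(interpret(1.2, -0.5), interpret(2.0, 0.5))
```
</Paragraph>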
    <Paragraph position="19"> [Fig. 3 The distribution of dts (sentence 2). Horizontal axis: character pairs in sentence.] The general rule governing dts is similar to that governing mi: the higher the difference of t-score between x and y, the stronger the combination strength between them, and vice versa.</Paragraph>
    <Paragraph position="20"> But the role of dts is somewhat different from that of mi: it is capable of complementing the 'blind area' of mi on some occasions.</Paragraph>
    <Paragraph position="21"> Consider sentence (2) again. The distribution of dts for it is shown in Fig. 3. Return to the character pairs whose mi values fall into the region 2.0 ≤ mi &lt; 4.0 in Fig. 2 and compare their dts values [Chinese character pairs not recoverable]: the conclusion drawn from these comparisons is very close to the human judgment.</Paragraph>
    <Paragraph position="22"> 2.2. Local maximum and local minimum of dts. Most of the character pairs in sentence (2) have been satisfactorily explained by their mi and dts values so far. Two character pairs [Chinese characters not recoverable] are among the few exceptions: the first has both a higher mi and a higher dts than the second, yet by human judgment the former should be separated and the latter bound. To address this, we further propose two new concepts: the local maximum and the local minimum of dts.</Paragraph>
    <Paragraph position="23"> Definition 4 Given a Chinese character string 'vxyw', dts(x:y) is said to be a local maximum if dts(x:y) &gt; dts(v:x) and dts(x:y) &gt; dts(y:w). The height of the local maximum dts(x:y) is defined as:</Paragraph>
    <Paragraph position="25"> Definition 5 Given a Chinese character string 'vxyw', dts(x:y) is said to be a local minimum if dts(x:y) &lt; dts(v:x) and dts(x:y) &lt; dts(y:w). The depth of the local minimum dts(x:y) is defined as:</Paragraph>
    <Paragraph position="27"> Two basic hypotheses follow readily from the context-dependence of dts (note that mi has no such property): Hypothesis 1 x and y tend to be bound if dts(x:y) is a local maximum, regardless of the value of dts(x:y) (even if it is low).</Paragraph>
    <Paragraph position="28"> Hypothesis 2 x and y tend to be separated if dts(x:y) is a local minimum, regardless of the value of dts(x:y) (even if it is high).</Paragraph>
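    <Paragraph> Definitions 4 and 5 amount to a strict comparison of each dts value with its two neighbours. The sketch below labels every location in a list of dts values; the function name and labels are hypothetical, and the first and last locations, which lack one neighbour, are assumed to be neither a local maximum nor a local minimum.

```python
def local_extrema(dts_values):
    """Label each location as 'local max', 'local min' or 'other'.

    dts_values: dts of consecutive locations, left to right.
    Assumption: boundary locations are always labelled 'other'.
    """
    labels = ["other"] * len(dts_values)
    for i in range(1, len(dts_values) - 1):
        left = dts_values[i - 1]
        mid = dts_values[i]
        right = dts_values[i + 1]
        if mid > left and mid > right:
            labels[i] = "local max"   # Hypothesis 1: tends to be bound
        elif left > mid and right > mid:
            labels[i] = "local min"   # Hypothesis 2: tends to be separated
    return labels


# Hypothetical dts values for five consecutive locations.
print(local_extrema([0.5, 2.0, 1.0, -1.5, 0.3]))
```
</Paragraph>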
    <Paragraph position="29"> In Fig. 3, of the two exceptional pairs above, the dts of the former is a local minimum whereas that of the latter is not. At least we can say that the former pair is likely to be separated, as suggested by Hypothesis 2 (though we still can say nothing more about the latter pair).</Paragraph>
    <Paragraph position="30"> 2.3. The second local maximum and the second local minimum of dts. We continue by defining four more related concepts: Definition 6 Suppose 'vxyzw' is a Chinese character string, and dts(x:y) is a local maximum. Then dts(y:z) is said to be the right second local maximum of dts(x:y) if dts(y:z) &gt; dts(v:x) and dts(y:z) &gt; dts(z:w). The distance between the local maximum and the second local maximum is defined as: $dis(locmax, y:z) = dts(x:y) - dts(y:z)$ Definition 7 Suppose 'vxyzw' is a Chinese character string, and dts(x:y) is a local minimum. Then dts(y:z) is said to be the right second local minimum of dts(x:y) if dts(y:z) &lt; dts(v:x) and dts(y:z) &lt; dts(z:w). The distance between the local minimum and the second local minimum is defined as: $dis(locmin, y:z) = dts(y:z) - dts(x:y)$ The left second local maximum and the left second local minimum of dts(x:y) can be defined similarly.</Paragraph>
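    <Paragraph> A minimal sketch of Definition 6 follows, assuming the dts values of consecutive locations are held in a list and the local maximum sits at index i, so that dts(v:x), dts(x:y), dts(y:z) and dts(z:w) are the entries i-1, i, i+1 and i+2. The function name is hypothetical; the minimum case of Definition 7 would mirror it with the comparisons reversed.

```python
def right_second_local_max(d, i):
    """Return dis(locmax, y:z) if d[i + 1] is the right second local
    maximum of the local maximum d[i]; otherwise return None.

    d: dts values of consecutive locations, left to right.
    """
    if 1 > i or i + 2 > len(d) - 1:
        return None  # the five-character window 'vxyzw' does not fit
    is_local_max = d[i] > d[i - 1] and d[i] > d[i + 1]
    qualifies = d[i + 1] > d[i - 1] and d[i + 1] > d[i + 2]
    if is_local_max and qualifies:
        return d[i] - d[i + 1]  # dts(x:y) - dts(y:z)
    return None


# Hypothetical dts values: the peak is at index 1.
print(right_second_local_max([0.0, 3.0, 2.5, 1.0], 1))
```

A small distance means the second local maximum sits close to the peak, which is the situation the algorithm later exploits to let the neighbouring location share the peak's status.</Paragraph>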
    <Paragraph position="31"> Refer to Fig. 3 for examples [Chinese character pairs not recoverable]: by definition, one dts value there is the left second local minimum of a neighbouring local minimum, and another is at once the right second local maximum of the local maximum on its left and the left second local minimum of the local minimum on its right.</Paragraph>
    <Paragraph position="32"> These four measures are designed to deal with two common construction types in Chinese word formation: "2 characters + 1 character" and "1 character + 2 characters". We skip the discussion of this owing to space limitations.</Paragraph>
    <Paragraph position="33"> 3. Algorithm. The basic idea is to integrate all of the measures introduced in section 2 into one algorithm, exploiting their advantages and bypassing their disadvantages under different conditions.</Paragraph>
    <Paragraph position="34"> Given an input sentence S, let
$\mu_{mi}$: the mean of mi over all locations in S;
$\sigma_{mi}$: the standard deviation of mi over all locations in S;
$\mu_{dts}$: the mean of dts over all locations in S (in fact, $\mu_{dts} \approx 0$);
$\sigma_{dts}$: the standard deviation of dts over all locations in S.
We divide the distribution graphs of mi and dts into regions.
For dts: dts(x:y) &gt; $\sigma_{dts}$; 0 &lt; dts(x:y) ≤ $\sigma_{dts}$; $-\sigma_{dts}$ &lt; dts(x:y) ≤ 0; dts(x:y) ≤ $-\sigma_{dts}$.
For mi: region a: mi(x:y) &gt; $\mu_{mi} + \sigma_{mi}$; region b: $\mu_{mi}$ &lt; mi(x:y) ≤ $\mu_{mi} + \sigma_{mi}$; region c: $\mu_{mi} - \sigma_{mi}$ &lt; mi(x:y) ≤ $\mu_{mi}$; region d: mi(x:y) ≤ $\mu_{mi} - \sigma_{mi}$.
The algorithm scans the input sentence S from left to right two times.
The first round for S: for any location (x:y) in S, do
Case 1: [rules not recoverable; they mark (x:y) as 'bound', 'separated' or '?' according to its &lt;dts, mi&gt; region and the local maximum or local minimum status of dts(x:y)]
Case 2: if dts(x:y) is the second local maximum then [condition not recoverable] mark (x:y) '←' if (x:y) is the right second local max or '→' if (x:y) is the left second local max; if dts(x:y) is the second local minimum then if dis(locmin, x:y) &lt; 0.5 × lrmin(loc, x:y) then mark (x:y) '←' if (x:y) is the right second local min or '→' if (x:y) is the left second local min.
The second round for S: if (x:y) is marked '?' then if mi(x:y) ≥ θ then mark (x:y) 'bound' else 'separated'; if (x:y) is marked '←' then the status of (x:y) follows that of the adjacent location on the left side; if (x:y) is marked '→' then the status of (x:y) follows that of the adjacent location on the right side.
(The constants δ1, δ2, δ3, ξ1, ξ2, ξ3 are determined by experiments, satisfying δ1 &lt; δ2 &lt; δ3 and ξ1 &lt; ξ2 &lt; ξ3, and θ = 2.5.)
Generally speaking, the lower the &lt;dts(x:y), mi(x:y)&gt; in the distribution graphs, the more restrictive the constraints. Take the 'bound' operation as an example: there is no additional condition in case 1.1; in case 1.6, however, the existence of a local maximum is needed; in case 1.3, a requirement on the height of the local maximum is added; in case 1.4, the required height becomes even higher; and in case 1.5, the worst case for the 'bound' operation, the height must be high enough.
Case 2 says that if the second local maximum is quite near to the corresponding local maximum, then its status ('bound' or 'separated') is likely to be consistent with that of the local maximum; the same holds for the second local minimum. Finally, for locations marked '?', for which we have no further means, we simply decide by comparing mi with θ (we set θ to 2.5, the same as in the system of Sproat and Shih (1993)). Recall sentence (2). One character pair [Chinese characters not recoverable] is successfully regarded as 'separated' by following its neighbouring local minimum under the rule in case 2, although its mi value is rather high (3.4). Another pair is marked '?' in the first round and treated properly by θ in the second round.</Paragraph>
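    <Paragraph> Two parts of the algorithm that can be read off the description are sketched below: assigning each mi value to one of four regions around the sentence mean, and the second round that resolves the marks left by the first round. This is a sketch under several assumptions, not the full algorithm: the region labels a to d follow the usual reading of mean plus or minus one standard deviation, the population standard deviation is used, the strings 'left' and 'right' stand in for the two follow-the-neighbour marks, and the first-round case rules are not reproduced.

```python
import statistics


def mi_regions(mi_values):
    """Assign each mi value of a sentence to region 'a', 'b', 'c' or 'd'."""
    mu = statistics.mean(mi_values)
    sigma = statistics.pstdev(mi_values)  # assumption: population std dev
    regions = []
    for mi in mi_values:
        if mi > mu + sigma:
            regions.append("a")
        elif mi > mu:
            regions.append("b")
        elif mi > mu - sigma:
            regions.append("c")
        else:
            regions.append("d")
    return regions


def second_round(marks, mi_values, theta=2.5):
    """Resolve first-round marks into 'bound' or 'separated'.

    marks: 'bound', 'separated', '?', 'left' (follow the left
           neighbour) or 'right' (follow the right neighbour).
    """
    status = list(marks)
    for i, mark in enumerate(status):
        if mark == "?":
            status[i] = "bound" if mi_values[i] >= theta else "separated"
    for i in range(1, len(status)):               # left to right
        if status[i] == "left":
            status[i] = status[i - 1]
    for i in range(len(status) - 2, -1, -1):      # right to left
        if status[i] == "right":
            status[i] = status[i + 1]
    return status


# Hypothetical first-round marks and mi values for five locations.
marks = ["bound", "left", "?", "right", "separated"]
print(second_round(marks, [5.0, 3.4, 1.0, 3.0, 0.2]))
print(mi_regions([0.0, 2.0, 4.0, 6.0, 8.0]))
```

Resolving '?' first and the follow marks afterwards is a design choice of this sketch; it guarantees that a location never copies an unresolved '?' from its neighbour.</Paragraph>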
    <Paragraph position="35"> The algorithm finally outputs the correct segmentation for sentence (2): France tennis competition today</Paragraph>
    <Paragraph position="37"> in Paris the western suburbs open curtain (The Tennis Competition of France opened in the western suburbs of Paris today.) Note that sentence (2) contains two ambiguous fragments [Chinese characters not recoverable], each with two possible segmentations, as well as two proper nouns, "France" and "Paris".</Paragraph>
  </Section>
  <Section position="5" start_page="1268" end_page="1270" type="metho">
    <SectionTitle>
4. Experimental results
</SectionTitle>
    <Paragraph position="0"> We randomly select 100 Chinese sentences, consisting of 1588 characters (or 1587 locations between character pairs), as testing texts. The statistics required to calculate mi and dts, namely character bigram counts, are derived automatically from a news corpus of about 20M Chinese characters. The testing texts and the training corpus are mutually exclusive.</Paragraph>
    <Paragraph position="1"> Out of 1587 locations in the testing texts, 1456 are correctly marked by our algorithm.</Paragraph>
    <Paragraph position="2"> We define the accuracy of segmentation as: $accuracy = \frac{\#\text{ of locations being correctly marked}}{\#\text{ of locations in texts}}$ Then, the accuracy for the testing texts is 1456/1587 = 91.75%.</Paragraph>
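    <Paragraph> The accuracy measure is a plain ratio over locations, as the following sketch shows. The function name and the label strings are hypothetical; the final line only re-derives the reported figure from the two counts given above.

```python
def segmentation_accuracy(predicted, gold):
    """Fraction of locations whose predicted status matches the gold one.

    predicted, gold: equal-length lists of per-location labels,
    for example 'bound' or 'separated'.
    """
    if not gold or len(predicted) != len(gold):
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


print(segmentation_accuracy(["bound", "separated", "bound", "bound"],
                            ["bound", "separated", "separated", "bound"]))
print(f"{1456 / 1587:.2%}")  # prints 91.75%
```
</Paragraph>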
    <Paragraph position="3"> The distribution of local maximum, local minimum and other types of dts value (including the second local maximum and the second local minimum) of the testing texts over the &lt;dts, mi&gt; regions is summarized in Fig. 4 (Fig. 5 shows the same distribution in percentages). This should help readers understand our algorithm.</Paragraph>
    <Paragraph position="4"> Future work includes: (1) enlarging the scale of the experiments; (2) refining the algorithm by studying the relationship between mi and dts in depth; and (3) integrating it as a module with existing Chinese segmenters so as to improve their performance (especially their ability to cope with unknown words and to adapt to various domains). The last is indeed the ultimate goal of our research here.</Paragraph>
  </Section>
</Paper>