<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-1035">
  <Title>Chinese Word Segmentation based on Maximum Matching and Word Binding Force</Title>
  <Section position="4" start_page="0" end_page="200" type="metho">
    <SectionTitle>
3 Word Frequency Method for
Segmentation
</SectionTitle>
    <Paragraph position="0"> In this statistical approach in terms of word frequencies, a lexicon needs not only a rich repertoire of word entries, but also the usage frequency of each word. To segment a line of text, each possible segmentation alternative is evaluated according to the product of the word frequencies of the words segmented. The word sequence with the highest frequency product is accepted as correct.</Paragraph>
    <Paragraph position="1"> This method is simple but its accuracy depends heavily on the accuracy of the usage frequencies.</Paragraph>
    <Paragraph position="2"> The usage frequency of a word differs greatly from one type of documents to another, say, a passage of world news as against a technical report. Since there are tens of thousands of words actively used, one needs a gigantic collection of texts to make an accurate estimate, but by then, the estimate is just an average and it may not be suitable for any type of document at all. In other words, the variance of such an estimate is too great, making the estimate useless.</Paragraph>
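The frequency-product evaluation described above can be sketched in a few lines of Python. This is an illustration only, not the paper's implementation; the lexicon and its frequency counts are invented for the example.

```python
# Toy lexicon: word -> usage frequency (invented values for illustration).
LEXICON = {"AB": 50, "A": 30, "B": 1, "BC": 40, "C": 20, "ABC": 5}

def segmentations(text):
    """Enumerate every way to split text into lexicon words."""
    if text == "":
        yield []
        return
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in LEXICON:
            for rest in segmentations(text[end:]):
                yield [word] + rest

def best_segmentation(text):
    """Pick the word sequence whose frequency product is highest."""
    best, best_score = None, -1.0
    for seg in segmentations(text):
        score = 1.0
        for w in seg:
            score *= LEXICON[w]
        if score > best_score:
            best, best_score = seg, score
    return best

# "A"+"BC" (30*40=1200) beats "AB"+"C" (50*20=1000) and "ABC" (5).
print(best_segmentation("ABC"))
```

Enumerating all alternatives is exponential in the worst case; a real system would use dynamic programming, but the exhaustive version makes the scoring rule explicit.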
  </Section>
  <Section position="5" start_page="200" end_page="200" type="metho">
    <SectionTitle>
4 The Lexicon
</SectionTitle>
    <Paragraph position="0"> Most Chinese linguists accept the definition of a word as the minimum unit that is semantically complete and can be put together as building blocks to form a sentence. However, in Chinese, words can be united to form compound words, and they in turn can combine further to form yet higher-ordered compound words. As a matter of fact, compound words are extremely common and they exist in large numbers. It is impossible to include all compound words in the lexicon; one can keep only those which are frequently used and have their word components united closely. A lexicon was acquired from the Institute of Information Science, Academia Sinica, in Taiwan. There are 78410 word entries in this lexicon, each associated with a usage frequency. A corpus of over 63 million characters of news lines was acquired from China. Due to cultural differences of the two societies, there are many words encountered in the corpus but not in the lexicon. The latter must therefore be enriched before it can be applied to perform the lexical analysis. The first step towards this end is to merge a lexicon published in China into this one, increasing the number of word entries to 85,855.</Paragraph>
  </Section>
  <Section position="6" start_page="200" end_page="200" type="metho">
    <SectionTitle>
5 The Proposed Word
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="200" end_page="200" type="sub_section">
      <SectionTitle>
Segmentation Algorithm
</SectionTitle>
      <Paragraph position="0"> The proposed algorithm of this paper makes use of a forward maximum matching strategy to identify words. In this respect, this algorithm is a structural approach. Under this strategy, errors are usually associated with single-character words. If the first character of a line is identified as a single-character word, what it means is that there is no multi-character word entry in the lexicon that starts with such a character. In that case, there is not much one can do about it. On the other hand, when a character is identified as a single-character word β following another word α in the line, one cannot help wondering whether the sole character composing β should not be combined with the suffix of α to form another word instead, even if that means changing α into a shorter word. In that case, every possible word sequence alternative corresponding to the sub-sequence of characters from α and β together will be evaluated according to the product of its constituent word binding forces.</Paragraph>
      <Paragraph position="1"> The binding force of a word is a measure of how strongly the characters composing the word are bound together as a single unit. This force is often equated to the usage frequency of the word.</Paragraph>
      <Paragraph position="2"> In this respect, the proposed algorithm is a statistical approach. It is as efficient as the maximum matching method because word binding forces are utilized only in exceptional cases. However, much of the word ambiguity is eliminated, leading to a very high word identification accuracy. Segmentation errors associated with multi-character words can be reduced by adding or deleting words to or from the lexicon as well as adjusting word binding forces.</Paragraph>
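The exceptional case above, where the span covering the preceding word and the suspicious single-character word is re-segmented by binding-force product, can be sketched as follows. The binding-force table and the example strings are invented; force is equated to usage frequency as in the paper.

```python
# Invented binding forces for a toy alphabet (force = usage frequency).
BINDING_FORCE = {"XY": 80, "X": 25, "YZ": 60, "Z": 15, "XYZ": 2}

def force_product(words):
    p = 1.0
    for w in words:
        p *= BINDING_FORCE[w]
    return p

def resolve(span):
    """Re-segment the span covering the preceding word and the suspicious
    single-character word, keeping the alternative with the highest
    binding-force product."""
    best, best_p = None, -1.0
    stack = [([], span)]          # (words so far, remaining characters)
    while stack:
        done, rest = stack.pop()
        if rest == "":
            p = force_product(done)
            if p > best_p:
                best, best_p = done, p
            continue
        for end in range(1, len(rest) + 1):
            if rest[:end] in BINDING_FORCE:
                stack.append((done + [rest[:end]], rest[end:]))
    return best

# Maximum matching gave "XY" + "Z" (80*15=1200); re-evaluation prefers
# "X" + "YZ" (25*60=1500), shortening the first word as described above.
print(resolve("XYZ"))
```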
    </Section>
  </Section>
  <Section position="7" start_page="200" end_page="201" type="metho">
    <SectionTitle>
6 Structure of the Lexicon
</SectionTitle>
    <Paragraph position="0"> Words in the lexicon are divided into 5 groups according to word lengths. They correspond to words of 1, 2, 3, 4, and more than 4 characters with group sizes equal to 7025, 53532, 12939, 11269, and 1090 respectively. Since most of the time spent in analyzing a line of text is in finding a match among the lexicon entries, a clever organization of the lexicon speeds up the searching process tremendously. Most Chinese words are of one or two characters only. Searching for longer words before shorter ones as practised in maximum matching means spending a great deal of time searching for non-existent targets. To overcome this problem, the following measures are taken to organize the lexicon for fast search: * All single-character words are stored in a table of 32768 bins. Since the internal code of a character takes 2 bytes, bits 1-15 are used as the bin address for the word.</Paragraph>
    <Paragraph position="1"> * All 2-character words are stored in a separate table of 65536 bins. The two low-order bytes of the two characters are used as a short integer for the bin address. Should there be other words contesting for the same bin, they are kept in a linked list.</Paragraph>
    <Paragraph position="2"> * Any 3-character word is split into a 2-character prefix and a 1-character suffix. The prefix will be stored in the bin table for 2-character words with clear indication of its prefix status. The suffix will be stored in the bin table for 1-character words, again, with clear indication of its suffix status. All duplicate entries are combined, i.e., if α is a word as well as a suffix, the two entries are combined into one with an indication that it can serve as a word as well as a suffix.</Paragraph>
    <Paragraph position="3"> * Any 4-character word is divided up into a 2-character prefix and a 2-character suffix, both stored in the bin table for 2-character words, with clear indications of their respective status. Each prefix points to a linked list of associated suffixes.</Paragraph>
    <Paragraph position="4">  * Any word longer than 4 characters will be divided into a 2-character prefix, a 2-character infix and a suffix. The prefix and tile infix are stored in the bin table for 2-character words, with clear indications of their status. Each prefix points to a linked list of associated infixes and each infix in turn, points to a linked list of associated suffixes.</Paragraph>
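The bin-table layout above can be sketched in Python. Dicts stand in for the fixed-size tables and linked lists, and ord() stands in for the 2-byte internal character code; all names and details are illustrative, not the paper's implementation.

```python
def char_code(ch):
    return ord(ch) % 65536          # the 2-byte internal code of a character

def bin_1char(ch):
    # bits 1-15 of the 2-byte code give a 32768-bin address
    return (char_code(ch) // 2) % 32768

def bin_2char(a, b):
    # low-order byte of each character packed into a 16-bit bin address
    return (char_code(a) % 256) * 256 + (char_code(b) % 256)

table_1char = {}   # 1-character words, plus suffixes of 3-character words
table_2char = {}   # 2-character words, plus prefixes/infixes of longer words

def insert_word(word):
    if len(word) == 1:
        table_1char.setdefault(bin_1char(word), []).append(word)
    elif len(word) == 2:
        table_2char.setdefault(bin_2char(word[0], word[1]), []).append(word)
    else:
        # Longer words: the 2-character prefix goes to the 2-character table,
        # and the remainder is recorded as a suffix hanging off that prefix.
        prefix, suffix = word[:2], word[2:]
        bucket = table_2char.setdefault(bin_2char(prefix[0], prefix[1]), [])
        bucket.append((prefix, suffix))
```

Collisions within a bin simply chain onto the same list, mirroring the linked lists in the text.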
    <Paragraph position="5"> Maximum matching segmentation of a sequence of characters &amp;quot;...abcdefghij...&amp;quot; at the character &amp;quot;a&amp;quot; starts with matching &amp;quot;ab&amp;quot; against the 2-character words table. If no match is found, then &amp;quot;a&amp;quot; is assumed a 1-character word and maximum matching moves on to &amp;quot;b&amp;quot;. If a match is found, then &amp;quot;ab&amp;quot; is investigated to see if it can be a prefix. If it cannot, then &amp;quot;ab&amp;quot; is a 2-character word and maximum matching moves on to &amp;quot;c&amp;quot;. If it can, then one examines if &amp;quot;cd&amp;quot; can be an infix associated with &amp;quot;ab&amp;quot;. If the answer is negative, then the possibility of &amp;quot;abcd&amp;quot; being a word is considered. If that fails again, then &amp;quot;c&amp;quot; in the table of 1-character words is examined to see if it can be a suffix. If it can, then &amp;quot;abc&amp;quot; will be examined to see if it can be a word by searching the 1-character suffix linked list pointed at by &amp;quot;ab&amp;quot;. Otherwise, one has to accept that &amp;quot;ab&amp;quot; is a 2-character word and moves on to start matching at &amp;quot;c&amp;quot;. If &amp;quot;cd&amp;quot; can be an infix preceded by &amp;quot;ab&amp;quot;, the linked list pointed at by &amp;quot;cd&amp;quot; as an infix will be searched for the longest possible suffix to combine with &amp;quot;abcd&amp;quot; as its prefix. If no match can be found, then one has to give up &amp;quot;cd&amp;quot; as an infix to &amp;quot;ab&amp;quot;.</Paragraph>
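The lookup order just described can be condensed into Python. The tables and their contents are invented placeholders, entries in PREFIXES are assumed to also be words in their own right, and the infix chains used for words longer than 4 characters are omitted for brevity.

```python
WORDS2 = {"ab", "cd"}            # 2-character words (placeholder contents)
PREFIXES = {"ab": {"c", "cd"}}   # 2-character prefix mapped to its suffixes
SUFFIX1 = {"c"}                  # characters flagged as 1-character suffixes

def match_at(text, i):
    """Return the longest word starting at position i, trying the longest
    candidates first as in maximum matching."""
    pair = text[i:i+2]
    if pair not in WORDS2 and pair not in PREFIXES:
        return text[i]           # fall back to a 1-character word
    suffixes = PREFIXES.get(pair, set())
    nxt2 = text[i+2:i+4]
    if nxt2 in suffixes:         # 4-character word, e.g. "abcd"
        return pair + nxt2
    nxt1 = text[i+2:i+3]
    if nxt1 in suffixes and nxt1 in SUFFIX1:
        return pair + nxt1       # 3-character word, e.g. "abc"
    return pair                  # plain 2-character word
```

A caller advances its position by the length of each returned word, restarting the match at the following character.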
    <Paragraph position="6"> 7 Training of the System Despite the fact that the lexicon acquired from Taiwan has been augmented with words from another lexicon developed in China, when it is applied to segment 1.2 million characters of news passages in blocks of 10,000 characters each, randomly selected over the text corpus, an average word segmentation error rate (μ) of 2.51% was found with a standard deviation (σ) of 0.57%, mostly caused by uncommon words not included in the enriched lexicon. It is then decided that the lexicon should be further enriched with new words and adjusted word binding forces over a number of generations.</Paragraph>
    <Paragraph position="7"> In generation i, n new blocks of text are picked randomly from the corpus and words segmented using the lexicon enriched in the previous generation. This process will stop when μ levels off over several generations. The 100(1 − α)% confidence interval of μ in generation i is ±t_{0.5α,n−1}σ/√n, where σ is the standard deviation of error rates in generation i − 1, and n is the number of blocks to be segmented in generation i. t_{0.5α,n−1} is the upper 0.5α critical point of the Student t distribution with n − 1 degrees of freedom (Devore, 1991). Throughout the experiments below, n is always chosen to be 20 so that the 90% confidence interval (i.e., α = 0.1) of μ is about ±0.23%.</Paragraph>
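The interval arithmetic above can be checked directly. The Student t critical value t_{0.05,19} is approximately 1.729 (hard-coded here rather than computed from a statistics library):

```python
import math

sigma = 0.57       # generation-0 standard deviation, in percent
n = 20             # blocks per generation
t_crit = 1.729     # approx. t critical value for alpha = 0.1, 19 dof

half_width = t_crit * sigma / math.sqrt(n)
print(round(half_width, 2))   # roughly the quoted 0.23% half-width
```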
  </Section>
  <Section position="8" start_page="201" end_page="201" type="metho">
    <SectionTitle>
8 Experimental Results
</SectionTitle>
    <Paragraph position="0"> The lexicon has been updated over six generations after being applied to word segment 1.2 million characters. The vocabulary increases from 85855 words to 87326 words. The segmentation error rates over seven generations of the training process are shown in the table below:  Most of these errors occur in proper nouns not included in the lexicon. They are hard to avoid unless they become popular enough to be added to the lexicon. The CPU time used for segmenting a text of 1,200,000 characters is 5.7 seconds on an</Paragraph>
  </Section>
  <Section position="9" start_page="201" end_page="201" type="metho">
    <SectionTitle>
IBM RISC System/6000 3BT computer.
9 Conclusion
9 Conclusion
</SectionTitle>
    <Paragraph position="0"> Lexical analysis is a basic process of analyzing and understanding a language. The proposed algorithm provides a highly accurate and highly efficient way for word segmentation of Chinese texts.</Paragraph>
    <Paragraph position="1"> Due to cultural differences, the same language used in different geographical regions and different applications can be quite different, causing problems in lexical analysis. However, by introducing new words into and adjusting word binding forces in the lexicon, such difficulties can be greatly mitigated. This word segmentor will be applied to word segment the entire corpus of 63 million characters before N-gram statistics will be collected for post-processing recognizer outputs.</Paragraph>
  </Section>
</Paper>