<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0135">
  <Title>NetEase Automatic Chinese Word Segmentation</Title>
  <Section position="3" start_page="0" end_page="194" type="metho">
    <SectionTitle>
2 Modern Chinese Automatic Segmentation System
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 System Structure
</SectionTitle>
      <Paragraph position="0"> The WS system of NETEASE CO. supports Chinese and English word segmentation, Chinese named entity recognition, Chinese part of speech tagging and phrase conglutination. In ordering to processing mass data, it is designed as an efficient system. The whole system includes some processing steps: pre-processing, number/date/time recognition, unknown words recognition, segmenting, POS tagging and postprocessing, as Fig 1 shows.</Paragraph>
      <Paragraph position="1"> The Prehandler module performs the preprocessing, splits the text into sentences according to the punctuations.</Paragraph>
      <Paragraph position="2"> Number/Data/Time recognition processes the number, date, time string and English words.</Paragraph>
      <Paragraph position="3"> Unknown word recognition includes personal name recognition and place name recognition.</Paragraph>
      <Paragraph position="4"> Segmenter component performs wordsegmenting task, matches all the candidate words and processes ambiguous lexical.</Paragraph>
      <Paragraph position="5"> POSTagger module performs part of speech tagging task and decides the optimal word segmentation using hierarchical hidden Markov model (HHMM) [Zhang, 2003].</Paragraph>
      <Paragraph position="6"> Posthandler retrieves phrases with multigranularities from segmentation result and detects new words automatically etc.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="193" type="sub_section">
      <SectionTitle>
2.2 Ambiguous phrase segmentation
</SectionTitle>
      <Paragraph position="0"> Assume that &amp;quot;AJB&amp;quot; are character strings and that W is a word list. In the field &amp;quot;AJB&amp;quot;, if &amp;quot;AJ&amp;quot;[?]W,  and &amp;quot;JB&amp;quot;[?]W, then &amp;quot;AJB&amp;quot; is called ambiguous phrase of overlap type. For example, in the string &amp;quot;Dang Dai Biao &amp;quot;, both &amp;quot;Dang Dai &amp;quot; and &amp;quot;Dai Biao &amp;quot; are words , so &amp;quot;Dang Dai Biao &amp;quot; is an ambiguous phrase of overlap type; and there is one ambiguous string.</Paragraph>
      <Paragraph position="1"> In the string &amp;quot;AB&amp;quot;, if &amp;quot;AB&amp;quot;[?]W(word), &amp;quot;A&amp;quot;[?] W, and &amp;quot;B&amp;quot;[?]W, then the string &amp;quot;AB&amp;quot; is called ambiguous phrase of combination type. For example, in the string &amp;quot;Ge Ren &amp;quot;, since &amp;quot;Ge Ren &amp;quot;, &amp;quot;Ge &amp;quot; and &amp;quot;Ren &amp;quot;are all words, so the string &amp;quot;Ge Ren &amp;quot; is an ambiguous phrase of combination type.</Paragraph>
      <Paragraph position="2"> We have built an ambiguous phrase lib of overlap and combination type from tagged corpus, which contains 200,000 phrases from 1gram to 4-gram. For example: &amp;quot;Cai /d Neng /v Chuang Zao /v, Chuang Zao /vn Zuo /v Zhun Bei vn&amp;quot; If one ambiguous phrase found in raw text, the potential segmentation result will be found in the lib and submit to next module. If not found, POS tagger module will disambiguate it.</Paragraph>
    </Section>
    <Section position="3" start_page="193" end_page="194" type="sub_section">
      <SectionTitle>
2.3 Chinese Personal Name Recognition
</SectionTitle>
      <Paragraph position="0"> At present we only consider the recognition of normal people name with both a family name and a first name. We got the statistical Character Set of Family Name and First Name data from corpus. And also consider the ability of character of constructing word. Some characters itself cannot be regarded as a word or composes a word with other characters, such as &amp;quot;Deng ,Nie ,Xin &amp;quot;; Some name characters which can compose word with other characters only, e.g. &amp;quot;Liu ,Zhang ,Ying &amp;quot; can construct words &amp;quot;Liu Hai Er ,[?] Zhang Zhi ,Ying Xiong &amp;quot;;Some name characters are also a common words themselves, e.g. &amp;quot;Tang ,Ma &amp;quot;.</Paragraph>
      <Paragraph position="1"> The recognition procedure is as follows: 1) Find the potential Chinese personal names: Family name is the trigger. Whenever a family name is found in a text, its following word is taken as a first name word, or its following two characters as the head character and the tail character of a first name. Then the family name and its following make a potential people name, the probable largest length of which is 4 when it is composed of a double-character family name and a double-character first name.</Paragraph>
      <Paragraph position="2"> 2) Based on the constructing word rules and the protective rules, sift the potential people names for the first time. For example, when raw text is &amp;quot;San Zhang ...,Wu Zhou ...&amp;quot;, then the &amp;quot;Zhang ,Zhou &amp;quot; were not family name. Because the &amp;quot;San ,Wu &amp;quot; is number. 3) Compute the probabilities of the potential name and the threshold values of corresponding family names, then sift the people names again based on the personal name probability function and description rules.</Paragraph>
      <Paragraph position="3"> 4) According to the left-boundary rules and the right-boundary rules which base on title, for Fig 1 Structure and Components of WS  example, &amp;quot;Zong Tong ,Xue Yuan &amp;quot;, and name frequent of context, determine the boundaries of people names.</Paragraph>
      <Paragraph position="4">  5) Negate conflicting potential people names.</Paragraph>
      <Paragraph position="5"> 6) Output the result: The output contains every sentence in the processed text and the start and the end positions and the reliability values of all people names in it.</Paragraph>
    </Section>
    <Section position="4" start_page="194" end_page="194" type="sub_section">
      <SectionTitle>
2.4 Chinese Place Name Recognition
</SectionTitle>
      <Paragraph position="0"> By collecting a large scale of place names, For example, (1) The names of administrative regions superior to county; (2) The names of inhabitation areas; (3) The names of geographic entities, such as mountain, river, lake, sea, island etc.; (4) Other place names, e.g. monument, ruins, bridge and power station etc. building the place name dictionary.</Paragraph>
      <Paragraph position="1"> Collecting words that can symbolize a place, e.g. &amp;quot;Di Qu &amp;quot;, &amp;quot;Cheng Shi &amp;quot;, &amp;quot;Xiang &amp;quot; etc. Base on these knowledge we applied positive deduction mechanism. Its essence is that with reference to certain control strategies, a rule is selected; then examining whether the fact matches the condition of the rule, if it does, the rule will be triggered.</Paragraph>
      <Paragraph position="2"> In addition, Those words that often concurrent with a place name are collected , including: &amp;quot;Zai &amp;quot;, &amp;quot;Wei Yu &amp;quot; etc. And which often concurrent with a people name, such as &amp;quot;Tong Zhi &amp;quot;, &amp;quot;Shuo &amp;quot; and so on, are also considered in NER.</Paragraph>
      <Paragraph position="3"> WS system identifies all potential place names in texts by using place name base and gathers their context information; and through deduction, it utilizes rule set and knowledge base to confirm or negate a potential place name; hereupon, the remainders are recognized place name.</Paragraph>
    </Section>
    <Section position="5" start_page="194" end_page="194" type="sub_section">
      <SectionTitle>
2.5 Multi-granularities of word segmentation
</SectionTitle>
      <Paragraph position="0"> tion Whenever we deploy the segmenter for any application, we need to customize the output of the segmenter according to an application specific standard, which is not always explicitly defined. However, it is often implicitly defined in a given amount of application data (for example, Search engines log, Tagged corpus) from which the specific standard can be partially learned.</Paragraph>
      <Paragraph position="1"> Most variability in word segmentation across different standards comes from those words that are not typically stored in the basic dictionary. To meet the applications of different levels, in our system, the standard adaptation is conducted by a post-processor which performs an ordered list of transformations on the output. For example: When input is &amp;quot;Guo Wu Yuan An Quan Sheng Chan Zhuan Jia Zu &amp;quot;, the output will be:  1. &amp;quot;Guo Wu Yuan /An Quan / Sheng Chan / Zhuan Jia Zu &amp;quot; 2. &amp;quot;Guo Wu Yuan /An Quan Sheng Chan / Zhuan Jia Zu &amp;quot; 3. &amp;quot;Guo Wu Yuan /An Quan Sheng Chan Zhuan Jia Zu &amp;quot;  Result 1 is normal segmentation, also is minimum granularity of word. Result 2 and 3 is bigger granularity. Every application can select appropriate segmentation result according to its purpose.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML