<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0117">
  <Title>Posts and Telecommunications yuandong@bupt.edu.cn</Title>
  <Section position="4" start_page="0" end_page="122" type="metho">
    <SectionTitle>
2 System Description
2.1 MSRA Open track of WS
</SectionTitle>
    <Paragraph position="0"> The system used in open track of WS is based on the system (Li 2005) participated in the second international WS bakeoff. We mainly modify the factoid detection rules and add the GKB (The Grammatical Knowledge-base of Contemporary Chinese) dictionary. The system also has a few postprocessors. The main postprocessors include named entity recognizers and TBL (Transformation-Based Learning) component.</Paragraph>
    <Paragraph position="1">  In our basic system, Chinese words can be categorized into one of the following types: lexicon words, morphological words, factoids, name entities. These types of words were processed in different ways in our system, and were incorporated into a unified statistical framework of the trigram language model. The details about the basic system are reported in (Li 2005).</Paragraph>
    <Paragraph position="2">  The factoid rules used in the basic system were summarized according to the MSRA training data. The Tokenization Guidelines of Chinese Text (V5.0) was provided by MSRA in this bakeoff. We used the Guidelines to rewrite the factoid rules, and the performance had the distinct improvement.</Paragraph>
    <Paragraph position="3">  2.1.3 Named entity identification The named entity recognizer is the one participated in the NER bakeoff, as shown in figure 1. In the section 2.3, we will describe in detail.</Paragraph>
    <Section position="1" start_page="122" end_page="122" type="sub_section">
      <SectionTitle>
2.2 System Used in Closed tracks
</SectionTitle>
      <Paragraph position="0"> The system used in closed tracks of WS is based on maximum entropy approach. The system also has a few postprocessors. The main postprocessors include combining the separated words and TBL component.</Paragraph>
      <Paragraph position="1">  The basic system is similar to (Ng and Low, 2004). We used the Tsujii laboratory maximum entropy package v2.0 (http://www-tsujii.is.s.utokyo.ac.jp/~tsuruoka/maxent/) to train our models. For CityU closed track, the basic features are the same as (Ng and Low, 2004). For MSRA closed track, we used two sets of basic features. The one is similar to (Ng and Low, 2004) and we change the window size of another one from 2 to 3, so we trained two models for MSRA closed track and submitted two results.</Paragraph>
      <Paragraph position="2">  Firstly, we extracted one lexicon from each training data. For MSRA closed track, the postprocessor only combined the words which appeared in the lexicon but were separated in the test result. For CityU closed track, we firstly used the factoid tool provided by the open system of WS to combine the separated factoid words, and then we used the lexicon to combine the separated words, at last the TBL was applied to the test result.</Paragraph>
    </Section>
    <Section position="2" start_page="122" end_page="122" type="sub_section">
      <SectionTitle>
2.3 MSRA Open track of NER
</SectionTitle>
      <Paragraph position="0"> The system used a hybrid algorithm which can combine a class-based statistical model (Gao 2004) with various types of rule-based knowledge very well. All the words were categorized into three types: Lexicon words (LWs), Factoid words (FTs), Named Entity (NEs). Accordingly, three main components were included to identify each kind of named entities: basic word candidates, NE combination and Viterbi search, as shown in Figure 1.</Paragraph>
      <Paragraph position="1"> Figure 1 FTRD NE Recognizer The recognizer was applied to open track of WS and we used it to participate in the MSRA open track of NER. The system also had a TBL postprocessor. null</Paragraph>
    </Section>
    <Section position="3" start_page="122" end_page="122" type="sub_section">
      <SectionTitle>
2.4 TBL
</SectionTitle>
      <Paragraph position="0"> In our system, the open source toolkit fnTBL (http://nlp.cs.jhu.edu/~rflorian/fntbl/index.html) is chosen. Coping with word segmentation task, we utilized a method called &amp;quot;LMR&amp;quot; tagging which was the same as (Nianwen Xue and Libin Shen 2003). Two rule template sets were used in our system. The complicated one had 40 templates, which covered various kinds of words position and tag position occurrence, i.e., considering contextual information of words and tags. For example, rule &amp;quot;pos_0 word_0 word_1 word_2 =&gt; pos&amp;quot; could generate rules containing information about current word, current word's tag, the next word and the word after next. The other rule template neglected tag information, it took only contextual word information into account. For an instance, &amp;quot;word_0 word_1 word_2 =&gt; pos&amp;quot;. The task of WS applied the two rule template sets, and the task of NER only applied the complicated one. In the Section 3, we will compare the two rule template sets.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML