<?xml version="1.0" standalone="yes"?> <Paper uid="I05-3022"> <Title>Chinese Word Segmentation in FTRD Beijing</Title> <Section position="3" start_page="0" end_page="150" type="metho"> <SectionTitle> 2 System Description </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Code identification and conversion </SectionTitle> <Paragraph position="0"> To process both Simplified and Traditional Chinese text from a variety of locales, including Mainland China, Hong Kong and Taiwan, we use UTF-8 as the internal character representation of the system. The ability to handle Chinese text from any Chinese locale transparently greatly simplifies the logic of the segmentation system.</Paragraph> </Section> <Section position="2" start_page="0" end_page="150" type="sub_section"> <SectionTitle> 2.2 N-gram language model </SectionTitle> <Paragraph position="0"> In our system, each Chinese word is categorized as one of the following types: lexicon words, morphological words, factoids, and named entities.</Paragraph> <Paragraph position="1"> These types of words are processed in different ways and are incorporated into a unified statistical framework based on a trigram language model.</Paragraph> <Paragraph position="2"> Each input sentence is first segmented into individual characters. These characters and the character strings they form are then looked up in a lexicon. For efficient search, the lexicon is represented by a trie compressed into a double-array data structure. Given a character string, all of its prefix strings that form lexicon words can be retrieved efficiently by traversing the trie whose root represents its first character.</Paragraph> <Paragraph position="3"> There are twenty-four kinds of factoid words, such as time, date and money. 
All the factoid words are represented as regular expressions and compiled into a compressed DFA with the row-index algorithm.</Paragraph> <Paragraph position="4"> As Wu (2003) discussed, it is the morphologically derived words (MDWs hereafter) that are most controversial and most likely to be treated differently in different standards and different systems. Our system handles six main categories of morphological processes: affixation, directional verb, resultative verb, splitting verb, reduplication and merging. We employ a chart parsing algorithm augmented with a word-lattice structure, which incorporates morphological rules designed specifically for Chinese with a restrictive CFG.</Paragraph> <Paragraph position="5"> 2.2.4 Named entity identification Our NE identification concentrates on three types of NEs, namely personal names (PERs), location names (LOCs) and organization names (ORGs). For Chinese person names, we only consider PN candidates that begin with a family name stored in the family name list, followed by a given name of one or two characters. For transliterations of foreign person names, a PN candidate is generated if it contains only characters stored in a transliterated character list. For location names and organization names, we use only the LN list and ON list to generate candidates.</Paragraph> </Section> <Section position="3" start_page="150" end_page="150" type="sub_section"> <SectionTitle> 2.3 Segmentation standards adaptor </SectionTitle> <Paragraph position="0"> In this bakeoff there are four segmentation standards, each slightly different from ours. Standard adaptation is performed by applying an ordered list of transformations to the output of our segmentation system. The method we use is Transformation-Based Learning, and the transformation templates are lexicalized templates. 
We designed 14 lexicalized templates for our system.</Paragraph> </Section> <Section position="4" start_page="150" end_page="150" type="sub_section"> <SectionTitle> 2.4 Speed </SectionTitle> <Paragraph position="0"> Because we optimized our lexicon and decoding process, segmentation is fast.</Paragraph> <Paragraph position="1"> On a single 2.80 GHz Xeon machine with 1 GB of memory, the system processes about 0.73 megabytes per second.</Paragraph> <Paragraph position="2"> The speed varies with sentence length: given texts of the same size, those containing longer sentences take more time. The number reported here is an average over the time taken to process the test sets of the eight tracks we participated in.</Paragraph> </Section> </Section> </Paper>