<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1725"> <Title>A Unicode based Adaptive Segmentor</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Design Objectives and Components </SectionTitle>
<Paragraph position="0"> With the wide use of Unicode based operating systems such as Windows 2000 and Windows XP, we now see more and more text data written in the Simplified form and in the Traditional form co-existing on the same system. It is also likely that text is written in mixed mode. Because of this reality, the first design objective of this system is the ability to handle the segmentation of Chinese text written in Simplified Chinese, Traditional Chinese, or mixed mode. As an example, we should be able to segment the same sentence written in different forms, as in the example given below: The second design objective is to adopt a modular design approach where different functional parts are implemented as independent modules, each tackling one problem at a time. Using this modular approach, we can isolate problems and fine-tune each module with minimal effect on the other modules in the system.</Paragraph>
<Paragraph position="1"> Special features like adding new rules or a new dictionary can easily be introduced without affecting other modules. Consequently, the system is more flexible and can be easily extended.</Paragraph>
<Paragraph position="2"> The third design objective is to make the segmentor adaptive to different application domains. We consider it of more practical value if the segmentor can be easily trained using some semi-automatic process to work in different domains and to work well for text with different regional variations. We consider it essential that the segmentor has tools to help it obtain region-related information quickly even when annotated corpora are not available. For instance, when it runs on text from Hong Kong, it must be able to recognize personal names such as , if such a name (a quadra-gram) appears in the text often.</Paragraph>
<Paragraph position="3"> The system has two main components: the segmentor and the data manager. The segmentor is the core component of the system. It has a pre-processor, the kernel, and a post-processor. As the system has to maintain a number of tables such as the dictionaries, the family name list, etc., a separate component called the data manager is responsible for handling the maintenance of these data. The pre-processor has separate modules to handle paragraphs, ASCII code, numbers, time, and proper names including personal names, place and organizational names, and foreign names. The kernel supports different segmentation algorithms. It is the application's or the user's choice to invoke the preferred segmentation algorithms, which currently include basic maximum matching and minimum matching in both forward and backward mode. These can also be used to build more complicated algorithms later on. In addition, the system provides segmentation using part-of-speech tagging information to help resolve ambiguity. The post-processor applies morphological rules which cannot be easily applied using a dictionary.</Paragraph>
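<Paragraph> To make the kernel's matching strategies concrete, below is a minimal Python sketch of forward and backward maximum matching, assuming the dictionary is a plain set of words; the function names and the default maximum word length are illustrative assumptions, not the system's actual API.

# Greedy dictionary matching, the basic strategies the kernel supports.
# E.g. forward_maximum_match("ABCD", {"AB", "CD"}) returns ["AB", "CD"]
# (ASCII stand-ins are used here in place of Chinese characters).

def forward_maximum_match(text, dictionary, max_len=5):
    """Scan left to right, taking the longest dictionary word each time."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found.
        for j in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + j]
            if j == 1 or candidate in dictionary:
                # Unmatched single characters become tokens of their own.
                tokens.append(candidate)
                i += j
                break
    return tokens

def backward_maximum_match(text, dictionary, max_len=5):
    """The same idea, scanning from the end of the string."""
    tokens = []
    i = len(text)
    while i > 0:
        for j in range(min(max_len, i), 0, -1):
            candidate = text[i - j:i]
            if j == 1 or candidate in dictionary:
                tokens.insert(0, candidate)
                i -= j
                break
    return tokens

Comparing the forward and backward outputs is one simple way to flag ambiguous spans for the tagging-based disambiguation mentioned above. </Paragraph>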
<Paragraph position="4"> The data manager helps to maintain the knowledge base in the system. It also has an accessory program called the new word extractor which can collect statistical information based on character bi-grams, tri-grams, and quadra-grams to semi-automatically extract words and names so that they can be used by the segmentor to improve performance, especially when switching to a new domain. Another characteristic of this segmentor is that it provides tagging information for segmented text. The tagging information can be omitted if not needed by an application.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Implementation Details </SectionTitle>
<Paragraph position="0"> The basic dictionary of this system was provided by Peking University [4] and we also used the tagging data from [4]. The data structures for our dictionaries are very similar to those discussed in [5]. As our program needs to handle both Simplified and Traditional Chinese characters, Unicode is the only practical solution for dealing with more than one script at the same time.</Paragraph>
<Paragraph position="1"> Even though it is our design objective to support both Simplified and Traditional Chinese, we do not want to keep two separate dictionaries for Simplified and Traditional Chinese. Even if two versions were kept, they would not serve well for text in mixed mode. For example, the Traditional Chinese word for &quot;the day after tomorrow&quot; should be , and for Simplified Chinese, it should be .</Paragraph>
<Paragraph position="2"> However, we sometimes see the word appear in a Traditional Chinese text. We cannot say that it is wrong because the sentence is still semantically correct, especially in a Unicode environment. Therefore the segmentor should be able to segment such words correctly, as in the examples: &quot; &quot;, and in &quot; &quot;. We must also deal with dictionary maintenance related to Chinese variants. For example, the characters are variants, and so are . In order to keep dictionary maintenance simple, our system uses a single dictionary which keeps only the so-called canonical form of a word. In our system, the canonical form of a word is its &quot;simplified form&quot;. We put the word &quot;simplified&quot; in quotes because only certain characters have simplified forms, such as to ; for , there is no simplified form. In the case of variants, we simply choose one of them as the canonical character. The canonical characters are maintained in the traditional-simplified character conversion table as well as in a variant table.</Paragraph>
<Paragraph position="3"> Whenever a new word, item, is added to the dictionary, it must be added using a function CanonicalConversion(), which takes item as an input. During segmentation, the corresponding dictionary lookup function will first convert the token to its canonical form before looking it up in the dictionary.</Paragraph>
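<Paragraph> As an illustration of the canonical-form scheme, here is a minimal Python sketch; only CanonicalConversion itself is named in the text above, while the table contents and the class name are hypothetical stand-ins for the system's actual conversion and variant tables.

# Tiny stand-ins for the traditional-simplified conversion table and the
# variant table described above; the real tables are maintained by the
# data manager.
TRAD_TO_SIMP = {}      # traditional character -> simplified character
VARIANT_TO_CANON = {}  # variant character -> chosen canonical character

def canonical_conversion(item):
    """Map every character of a word to its canonical form."""
    out = []
    for ch in item:
        ch = TRAD_TO_SIMP.get(ch, ch)      # simplify when a mapping exists
        ch = VARIANT_TO_CANON.get(ch, ch)  # then normalize variants
        out.append(ch)
    return "".join(out)

class CanonicalDictionary:
    """A single dictionary that stores only canonical forms."""
    def __init__(self):
        self._words = set()

    def add(self, item):
        # New words are always added through the conversion function.
        self._words.add(canonical_conversion(item))

    def contains(self, token):
        # Lookups convert first, so Simplified, Traditional, and
        # mixed-mode spellings of a word all hit the same entry.
        return canonical_conversion(token) in self._words
</Paragraph>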
<Paragraph position="4"> The personal name recognizers (separate for Chinese names and foreign names) use a maximum-likelihood algorithm that takes into account commonly used Chinese family names, given names, and foreign name characters. It works for Chinese names of up to 5 characters in length. In the following examples you can see that our system successfully recognized the name . This is done by our algorithm, not by putting her name in our dictionary: Organization names and place names are recognized mainly using special-purpose dictionaries. The segmentor uses tagging information to help resolve ambiguity. The disambiguation is mostly based on rules such as p + (n + f) -> p + n + f, which would work to correct . For efficiency reasons, our system uses only about 20 rules. The system is flexible enough for new rules to be added to improve performance.</Paragraph>
<Paragraph position="5"> The new word extractor is an accessory program to extract new words from running text based on statistical data, which can either be grabbed from the internet or collected from other sources. The basic statistical data include bi-gram, tri-gram, and quadra-gram frequencies. In order to further examine whether a bi-gram, say , is indeed a word, we also collect the forward conditional frequency of , and the backward conditional frequency of , . For an i-gram token, we also use the (i+1)-gram statistics to eliminate those i-grams that are only part of an (i+1)-gram word. For instance, if the frequency of the bi-gram is very close to the frequency of the tri-gram , it is less likely that is a word. Of course, whether is a word also depends on the quadra-gram results. Using the statistical results, a set of rules is applied to these i-grams to eliminate entries that are not considered new words. Minimal manual work is required to identify whether the remaining candidates are new words. Before words are added into the dictionary, part-of-speech information is added manually (although this is not strictly necessary) before applying the canonical function.</Paragraph>
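<Paragraph> The following is a rough Python sketch of these statistics, assuming the corpus is one long string of characters; the 0.9 elimination threshold, the function names, and the exact form of the conditional frequencies are illustrative assumptions, not the extractor's actual rules.

from collections import Counter

def ngram_counts(text, n):
    """Frequency of every character n-gram in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def conditional_frequencies(bigram, bi, tri):
    """Frequencies of tri-grams extending the bi-gram to the right
    (forward) and to the left (backward), relative to the bi-gram count."""
    total = bi[bigram] or 1
    forward = {t: c / total for t, c in tri.items() if t.startswith(bigram)}
    backward = {t: c / total for t, c in tri.items() if t.endswith(bigram)}
    return forward, backward

def filter_subgrams(igrams, iplus1grams, threshold=0.9):
    """Drop an i-gram whose count is nearly matched by some (i+1)-gram
    containing it, since it is then likely only part of a longer word."""
    survivors = {}
    for gram, count in igrams.items():
        best = max((c for g, c in iplus1grams.items() if gram in g), default=0)
        if best < threshold * count:
            survivors[gram] = count
    return survivors

The survivors of these filters are the candidates that receive the minimal manual check described above before being added through the canonical function. </Paragraph>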
<Paragraph position="6"> The following table shows examples of bi-grams which were found by the new word extractor using one year of Hong Kong Commercial Daily News data.</Paragraph> </Section> <Section position="6" start_page="0" end_page="3" type="metho"> <SectionTitle> 4 Performance Evaluation </SectionTitle>
<Paragraph position="0"> The evaluation metrics used in [6] were adopted here.</Paragraph>
<Paragraph position="1"> In these metrics, n denotes the number of words identified by the segmentation algorithm, and N is the number of words correctly identified. We participated in the open tests for all four corpora. The results are shown in the following table.</Paragraph>
<Paragraph position="2"> The worst performance in the 4 tests was on the CTB (UPenn) data. From our observation of the testing data, we found that the main problem we have with the CTB data is the difference in word granularity. To confirm our observation, we did an analysis of combining errors and overlapping errors. The results show that the ratios of combining errors among all the error types are 0.8425 (AS), 0.87684 (CTB), 0.82085 (HK), and 0.77102 (PK). The biggest problem we have with the AS data, on the other hand, is due to out-of-vocabulary mistakes. Even though our new word extractor can help us to reduce this problem, we have not trained our system using data from Taiwan. Our best performance was on the PK data because we used a very similar dictionary. The additional training data for HK was one year of the Commercial Daily ( ).</Paragraph>
<Paragraph position="3"> The following table summarizes the execution speed of our program for the 4 different sources: The program initialization needs around 2.25 seconds, mainly to load the dictionaries and other data into memory before segmentation can start. If we count only the segmentation time, the rate of segmentation is on average around 7,500 characters per second for the first three corpora. The processing speed for the Peking U. data seems faster. This may be because the dictionaries we used are closer to the PK system, so it takes less time to work on disambiguation.</Paragraph> </Section> </Paper>