<?xml version="1.0" standalone="yes"?> <Paper uid="W93-0312"> <Title>Example-Based Sense Tagging of Running Chinese Text Xiang Tong Chang-ning Huang</Title> <Section position="3" start_page="102" end_page="102" type="metho"> <SectionTitle> 2. Overview of the Sense-Tagging System </SectionTitle> <Paragraph position="0"> The sense-tagger under discussion is a partial result of some three years of continued effort at Tsinghua University, Beijing, China, to build systems for processing general, unrestricted running Chinese text.</Paragraph> <Paragraph position="1"> The system is implemented in C and currently runs on a Sun workstation at the National AI Laboratory of the University.</Paragraph> </Section> <Section position="4" start_page="102" end_page="105" type="metho"> <SectionTitle> 2.1. Resources </SectionTitle> <Paragraph position="0"> The sense-tagging module uses two MRDs and one MTD. The first MRD, referred to here as MRD-1, is 'tP~'l~'?.7,.i~-i~:-~::gtg' (Fu, 1987). It contains about 6,000 one-syllable words, e.g., '~' (beat) and '~' (drum), and 43,000 compound words and phrases, e.g., ':Ff~' (beating drums). Each word has one or more word senses. For example, ':tq' (beat) has 26 senses and '~fi\[' (drum) has 6. In the sense tags, the capital letter identifies the homograph and the Arabic number identifies the sense under that homograph. The entry for the word ':\]q' (beat) is given as follows:</Paragraph> <Paragraph position="2"> The second MRD, referred to here as MRD-2, is the Chinese thesaurus '~\] ~i~\]i~\]~0~' (Mei, 1983), with about 70,000 entries. It has a three-level categorization system. Level 1 has 12 major categories. At Level 2, the 12 major categories split into 94 subcategories. Level 3, the lowest level, has 1,428 subcategories in total. 
Under the current numbering system, the capital letter indicates the major category, the lower-case letter the subcategory, and the Arabic number the position under the two superordinate categories. For example, 'Bp13' refers to one of the categories that the word '~' (drum) falls into: B is a first-level category, p is a second-level subcategory, and 13 is the number of the subcategory under Bp. A partial list of the numbering of some categories is given as follows: [...] compound words and phrases. Phrases like '{T~rf~' (beating drums) are disambiguated in the MTD, with word sense numbers tagged to both '~J&quot; (beat) and '~J~' (drum), e.g., '~_B02 ~_A01'. The tags follow the numbering system used in MRD-1. For compounds with a component whose meaning is unrelated to that of the resulting compound, the Arabic number in that component's tag is '00' (e.g., ~_A00 ~_A01, tO_A00 t~_A00). Much of the work of constructing the MTD was done by machine, supplemented by hand-coding. The following gives a partial list of the contents of the MTD: The word segmentation module is a much simplified version of a more complicated segmentation program developed at the Laboratory. It scans forward through each sentence for the maximum match of character strings recorded in the MTD. Most known phrases are tagged with the help of the MTD.</Paragraph> <Paragraph position="3"> '{T~' is an example. The operation involved is simple, i.e., 'match to access': when an input segment matches an entry in the MTD, the tagged form of the matched segment replaces the input segment in the sentence. Step 2: Example-based sense tagging of one-syllable words. The system uses an example-based sense-tagging algorithm to disambiguate one-syllable words that are not listed in the system MTD. 
The details of the algorithm are described in Section 3.</Paragraph> <Paragraph position="4"> Step 3: Default sense tagging of one-syllable words left untagged by Step 2. A default sense number is assigned to every one-syllable word left untagged by Step 2. The default sense numbers are determined from frequency-of-occurrence data.</Paragraph> </Section> <Section position="5" start_page="105" end_page="107" type="metho"> <SectionTitle> 3. Example-Based Sense-Tagging </SectionTitle> <Paragraph position="0"> Chinese words combine to form compound words. In 94.7% of cases, the meaning of the resulting compound is related to the meanings of its component words (Zhang, 1986, p. 87). The compound words and phrases in the MTD contain implicit syntactic information that supports example-based reasoning about the senses of Chinese words in context.</Paragraph> <Paragraph position="1"> For example, suppose ':~q&quot; '~' (beat gongs and drums) is in the input text and the sense of ':~\]&quot; (beat) cannot be determined. To disambiguate the word sense of '{\]&quot; (beat), the system looks through the MTD for every compound word and phrase beginning with '~\]&quot; (beat) and decides that the phrase ':~q'_B02 ~_A01' (beat drums) is an appropriate example for reasoning about the word '~I' (beat) as found in 'C/I ~' (beat gongs and drums), since '\]~' (drums) and '~ ~' (gongs and drums) are in the same lowest category 'Bp13' in MRD-2. 
The system then assigns the tag 'B02', which belongs to '~\]&quot; (beat) in '{\]'_B02 ~_A01' (beat drums), to '~\]&quot; (beat) in '{\]&quot; '~' (beat gongs and drums).</Paragraph> <Paragraph position="2"> Formally, let $S_1 S_2 \cdots S_n$ represent the input segments from 1 to $n$, let $W$ represent an untagged segment, and let the immediate context of $W$ be represented by $L_{\mathit{range}} \cdots L_2 L_1 W R_1 R_2 \cdots R_{\mathit{range}}$, where $L$ stands for 'Left', $R$ stands for 'Right', and $\mathit{range}$ equals 5. We then have the following: (a) $S_1 S_2 \cdots S_n$, where each $S_k$ ($k = 1, \ldots, n$) is a word, compound word or phrase; (b) $L_{\mathit{range}} \cdots L_2 L_1 W R_1 R_2 \cdots R_{\mathit{range}}$, where each $L_i$, $R_i$ ($i = 1, \ldots, \mathit{range}$) is a word, compound word or phrase. In the forward reasoning process, assuming that $(W\ R_i)$ is a possible compound word or phrase, for every entry in the MTD beginning with $W$, which has the form (W_tag Item), the system computes the relatedness of the two words or phrases $(W\ R_i)$ and (W_tag Item), where 'Item' may be an annotated word, compound word, phrase, or just a meaningless Chinese character string. The conceptual distance between $R_i$ and Item is computed to determine the relatedness of the two compound words or phrases. Hence,</Paragraph> <Paragraph position="4"> For every pair $(W\ R_i)$ ($i = 1, \ldots, \mathit{range}$) and (W_tag Item) in the MTD, the pair with the greatest non-zero relatedness measure is selected, and the $W$ in (b) above is replaced by the W_tag of the selected pair.</Paragraph> <Paragraph position="5"> The reasoning process works similarly in both directions from $W$, i.e., forward to $R_{\mathit{range}}$ and backward to $L_{\mathit{range}}$. When the process proceeds forward, the system looks for entries beginning with $W$. When it works backward, to the left of $W$, the system looks for annotated entries in the MTD ending with $W$.</Paragraph> <Paragraph position="6"> Examples are given as follows:</Paragraph> <Paragraph position="8"> The word '~i' (new) has six senses. 
The annotated phrase '~_A01 ~_A01' is found in the MTD. The system calculates the conceptual distance between '~i~' and '~', among others. Since '~d~' and '~'~' are found to be in the same lowest subcategory, 'Dd06', the conceptual distance between them is 0. The system then assigns the tag 'A01', which belongs to '~,ti:' as in 'j~-~:_A01 ~_A01', to '~i:' in the above sentence.</Paragraph> <Paragraph position="9"> The word '~' (receive, suffer) has six senses. The annotated phrase '~_A02 ~'_A02' is found in the MTD. The system calculates the conceptual distance between &quot;/~&quot; and '~'~', among others. Since '~&quot; and 'It~'~' are found to be in adjacent lowest subcategories, i.e., 'Hc18' and 'Hc19' respectively, the conceptual distance between them is 1. The system then assigns the tag 'A02', which belongs to '~' as in '~_A02 ~'_A02', to '~' in the above sentence.</Paragraph> <Paragraph position="11"> The word '/~' (right, power) has seven senses. The annotated phrase '~_A01 ~:~_A01' is found in the MTD. The system calculates the conceptual distance between '~' and '~', among others. Since 'g,.~ and 'g2J'~' are found to be in the same lowest subcategory, 'Dj03', the conceptual distance between them is 0. The system then assigns the tag 'A01', which belongs to '~' as in '~.~_A01 ~._A01', to '~' in the above sentence.</Paragraph> <Paragraph position="12"> The word '$fl' (each other) has four senses. The annotated phrase '$~_A01 ~,~_A01' is found in the MTD. The system calculates the conceptual distance between '~' and '~I~', among others. Since '~' and '~' are found to be in the same lowest subcategory, 'Jc01', the conceptual distance between them is 0. The system then assigns the tag 'A01', which belongs to '~l~' as in '#I~_A01 ~,_A01', to '}I~' in the above sentence.</Paragraph> </Section> <Section position="6" start_page="107" end_page="107" type="metho"> <SectionTitle> 4. 
Evaluation </SectionTitle> <Paragraph position="0"> The input Chinese texts that the present system processes are news releases from the official Chinese Xinhua News Agency. No preprocessing of these texts is required.</Paragraph> <Paragraph position="1"> The performance of the present sense-tagger is encouraging. The hit rate of correct sense tagging can run as high as 95%. The lowest hit rate recorded was 70%. The appendix gives a sample text output by our system; the hit rate of correct sense tagging for this sample is 93.79%. Essentially, the hit rate of correct sense tagging is a function of the coverage of the system MTD and MRDs.</Paragraph> </Section> </Paper>