File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/98/m98-1016_metho.xml
Size: 15,691 bytes
Last Modified: 2025-10-06 14:14:51
<?xml version="1.0" standalone="yes"?> <Paper uid="M98-1016"> <Title>LEARNING PROCESS: INFORMATION DISTILLATION OF TRAINING CORPUS Learning Process in General</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> DESCRIPTION OF THE KENT RIDGE DIGITAL LABS SYSTEM USED FOR MUC-7 </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> BASIC OF THE SYSTEM </SectionTitle> <Paragraph position="0"> We aim to build a single, simple framework for text information extraction tasks in which the required information can, to a certain extent, be resolved locally.</Paragraph> <Paragraph position="1"> Our system is statistics-based. As usual, a language model is built from a training corpus.</Paragraph> <Paragraph position="2"> This is the so-called learning process. Much effort has been spent on absorbing domain knowledge into the language model in a systematic and generic way, because the system is designed not for one particular task but for general local information extraction.</Paragraph> <Paragraph position="3"> For the information extraction part (tagging), the system consists of the following modules: • Sentence segmentor and tokenizer. This module accepts a stream of characters as input and transforms it into a sequence of sentences and tokens. The way of tokenizing can vary with the task and domain. For example, most English text is tokenized in the same way, while tokenization in Chinese is itself a research topic.</Paragraph> <Paragraph position="4"> • Text analyzer. This module provides the analysis necessary for the particular task, be it semantic, syntactic, orthographic, etc. The same analyzer is also applied in the learning process.</Paragraph> <Paragraph position="5"> • Hypothesis generator. The possibilities for each word (token) are determined. 
Rules can be captured by letting a word have only one choice, as is the case in the recognition of time, date, money and percentage terms for the Chinese Named Entity (NE) task.</Paragraph> <Paragraph position="6"> These are identified by pattern-matching rules.</Paragraph> <Paragraph position="7"> • Disambiguation module. This is essentially an implementation of the Viterbi algorithm.</Paragraph> <Paragraph position="8"> All the above modules are described in detail in the following sections.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> TEXT INFORMATION EXTRACTION TO TAGGING </SectionTitle> <Paragraph position="0"> First of all, a brief account of how the problem is modeled is in order. Each word in the text is assigned a tag; the information can then be obtained from the tags of all words. For example, for the English NE task,</Paragraph> <Paragraph position="2"> Grouping all adjacent words with tag PERSON gives a person name, grouping those with tag ORG gives an organization name, etc.</Paragraph> <Paragraph position="3"> The problem becomes: for any given sequence of words w = w_1 w_2 ... w_n, find the corresponding sequence of tags t = t_1 t_2 ... t_n.</Paragraph> <Paragraph position="4"> Note that there are different ways of assigning tags. For the above example, tags can also be:</Paragraph> <Paragraph position="6"> This way, extra information such as common surnames, first names, organization endings (Corp., Inc., etc.) and so on can be obtained. We observe that different tag sets for the same task make a difference. We feel that choosing an appropriate tag set is a problem worthy of careful investigation. 
Intuitively, a tag set for a particular task must be sufficient, meaning that the information extracted must be sufficient for the task, and efficient, meaning that it should carry no redundant or irrelevant information.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> LEARNING PROCESS: INFORMATION DISTILLATION OF TRAINING CORPUS </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Learning Process in General </SectionTitle> <Paragraph position="0"> Careful consideration has been given to how domain knowledge can be absorbed into the language model(s) in a generic and systematic way. The basic idea is that as much as possible of the relevant and significant information (for the task) contained in the original corpus should be retained in the back-off corpora, where back-off features are stored, so that correct decisions can be made from the statistics generated from the back-off corpora when they cannot be made from the statistics of the original training corpus.</Paragraph> <Paragraph position="1"> The original training corpus is in the form word/tag, so statistics about words and tags, including local contextual information, can be obtained. Each word in the corpus is given a back-off feature, on the principle that the back-off features of all words should extract from the corpus the most information relevant to the particular task. The information loss is compensated by a gain in generality. A back-off corpus in the form back-off feature/tag is then generated, and statistics can be obtained in the same manner. The original corpus is processed this way a certain number of times. 
Each time, a less descriptive back-off corpus, which gains more in generality, is generated, along with the corresponding statistics.</Paragraph> <Paragraph position="2"> For example, semantic classes can be used as back-off features for all the words in Example 1, which gives a back-off corpus of the following form: seman1/- seman2/- ... semanM-1/PERSON semanM/PERSON ... semanN-3/ORG semanN-2/ORG semanN-1/ORG semanN/- or part of speech can be used as back-off features, which gives</Paragraph> <Paragraph position="4"> The generation of back-off corpora is described by Figure 1. The total number of back-off corpora therein is a controllable parameter.</Paragraph> <Paragraph position="5"> Learning Process for Chinese NE • Training Corpus and Supporting Resources We have a text corpus of about 500,000 words from People's Daily and Xinhua News Agency, all of which was manually checked for both word segmentation and part-of-speech tagging.</Paragraph> <Paragraph position="6"> In addition, we have a lexicon of 89,777 words, in which 5,351 words are labeled as geographic names, 304 as people's names and 183 as organization names. 1,167 words consist of more than 4 characters. The longest word (meaning &quot;Great Britain and North Ireland United Kingdom&quot;) contains 13 characters.</Paragraph> <Paragraph position="7"> About 50,000 different words appear in the 500,000-word corpus.</Paragraph> <Paragraph position="8"> We also have three entity name lists: a people name list (67,616 entries), a location name list (6,451 entries) and an organization name list (6,190 entries). • Observation: Problems and Solutions 1. Intuitively, case information of proper names in the English writing system gives a good indication of the locations and boundaries of entity names. There are successful systems [2] which are built upon this intuition. 
Unfortunately, the uniform character strings of the Chinese writing system carry no such information.</Paragraph> <Paragraph position="9"> One should look for analogous indicative characteristics which may be unique to the Chinese language.</Paragraph> <Paragraph position="10"> 2. The word in Chinese is a vague concept, and there is no clear definition of it. There are boundary ambiguities between words in text even for human readers, and inevitably for machine processing. Tokenization, or word segmentation, is still a problem in Chinese NLP. Word boundary ambiguities exist not only between commonly used words which are not in entity names, but also between commonly used words and entity names.</Paragraph> <Paragraph position="11"> 3. Besides the uniform appearance of characters, proper names in Chinese can consist of commonly used words. As a matter of fact, almost every Chinese character can itself be a commonly used word, including those in entity names such as people's names, location names, etc.</Paragraph> <Paragraph position="12"> Therefore, unlike in English, the problem of Chinese entity recognition should not be isolated from the problem of tokenization, or word segmentation.</Paragraph> <Paragraph position="13"> • Building Language Models One level of back-off features, also called word classes, is obtained in the following way. We extend the idea in the new word detection engine of the integrated model of Chinese word segmentor and part-of-speech tagger [1]. The idea is to extend the scope of one word class of interest among new words, the proper names, to named entities by looking into a broader range of constituents. 
Under this framework, we believe contextual statistics play an important role in deciding word boundaries and predicting the categories of named entities, while local statistics, i.e. information that resides within words or entities, can provide evidence for suggesting the appearance of a named entity and deciding the validity of these entities. We need to make full use of both contextual and local statistics to recognize these named entities; thus a contextual language model and entity models are created.</Paragraph> <Paragraph position="14"> The basic process of building the models is as follows: 1. Change the tag set of the part-of-speech tagger by splitting the tag NOUN into more detailed tags related to the particular task, which include the symbolic notions of person, location, organization, date, time, money and percentage. 2. Replace the tag NOUN in the training corpus with the above extended new tags. Only ambiguous words are manually checked.</Paragraph> <Paragraph position="15"> 3. Build the contextual language model from the training corpus with the new tag set. 4. Build entity models from the entity name lists. Each entity has its own model. Learning Process for English NE • Training Corpus and Supporting Resources The training corpus consists of the Brown corpus and a corpus from the Wall Street Journal, marked up in SGML (for the NE task only). In total, the words amount to 7.2 MB, and the words with SGML markup to 9.5 MB. Supporting resources include the location list, country list, corporation reference list and people's surname list provided by MUC. Only the single-word entries in these lists are actually used.</Paragraph> <Paragraph position="16"> • Observation: Problems and Solutions Case information, or more generally orthographic information, gives good evidence of names, as was observed in [2]. Things get muddled, however, when one looks into it more deeply: e.g. first words of sentences, words which have no all-lower-case normal form (e.g. 
&quot;I&quot;), or words whose case is changed for other reasons such as formatting (e.g. titles), being artifacts, etc. Nevertheless, this is very important information for identifying entity names.</Paragraph> <Paragraph position="17"> Prepositions are also helpful, as are common suffixes and prefixes of the entities, such as Corp., Mr., and so on. In general, all such useful information should somehow be sorted out. Word classes tailored for this particular purpose would be ideal. • Building Language Models There are two levels of back-off features, represented by word classes.</Paragraph> <Paragraph position="18"> For the following words, the two back-off features are the same: month words (January, February, ...), cardinal numbers (one, two, 1-31, ...), ordinal numbers (1st, first, 2nd, second, ...), etc.</Paragraph> <Paragraph position="19"> For the rest of the words, the first-level features are word classes provided by an automatic machine classification of words, while the second-level features include: In total, the number of orthographic features is about 30.</Paragraph> <Paragraph position="20"> To give a sense of what information is extracted from the original training corpus, the two back-off sentences for Example 1 are: From the above statistics, it is interesting to notice that non-first common words which are initial-capitalized have a far greater chance of being an organization than a person (frequencies 7525 vs 195) or a location (frequencies 7525 vs 896). This agrees with general observations. Also interesting is that such words have a higher chance of not being any of the seven entities. This comes as a bit of a surprise. For NLP researchers, though, it may not be a surprise at all. This example also gives a sense of how general observations are represented in a precise way.</Paragraph> <Paragraph position="21"> Further research is to be carried out to justify quantitatively the merits of this learning process. 
Its full potential has yet to be exploited. So far, our experimentation has shown that: 1. Various kinds of text analysis (syntactic, semantic, orthographic, etc.) can be incorporated into the same framework in a precise way, and are used in the information extraction (tagging) stage in the same way; 2. It provides an easy way to absorb human knowledge as well as domain knowledge, and thus customization can be done easily; 3. It gives great flexibility as to how to optimize the system.</Paragraph> <Paragraph position="22"> Points 1 and 2 are reasonably clear from the above discussion. The details of the disambiguation module will make point 3 clear.</Paragraph> <Paragraph position="23"> DETAILS OF THE SYSTEM MODULES 1. Sentence segmentor and tokenizer: initial tokenization by dictionary look-up for Chinese, and in the standard way for English.</Paragraph> <Paragraph position="24"> 2. Text analyzer. What was done for the training corpus in the learning stage is done here. After the analysis, each word possesses a given number of back-off features. 3. Hypothesis generator.</Paragraph> <Paragraph position="25"> • Chinese: based on entities' prefixes, suffixes, trigger words and local context information, guesses are made about the possible boundaries and categories of entities. Time, date, money and percentage are extracted by pattern-matching rules.</Paragraph> <Paragraph position="26"> • English: for each word, first look up all its possibilities in the database. If the word is not found, look up the possibilities of its back-off features. 4. Disambiguation module. Recall that information extraction from a word sequence w becomes finding the corresponding tag sequence t. 
In the paradigm of maximum likelihood estimation, the best tag sequence t is the one such that prob(t|w) = max_{t'} prob(t'|w) (1). The approximations (2) and (3) can be justified by hidden Markov modeling of the generation of word sequences.</Paragraph> <Paragraph position="27"> As always, the Viterbi algorithm is employed to compute the probability (1), given any approximations like (2) and (3). When the sparse data problem is encountered, a back-off and smoothing strategy can be adopted, e.g.</Paragraph> <Paragraph position="28"> where N is the total number of back-off features for the word.</Paragraph> <Paragraph position="29"> Note that no smoothing is employed in the above scheme. From this scheme one can see that there exist various ways of backing off and smoothing. This characteristic, together with the free choice of back-off features, is where the flexibility of the system lies. Remark. In the actual system, the back-off and smoothing schemes are different from the above. The actual schemes are not included because they are more complicated, and no systematic experimentation has yet been done to show that they are better than other options.</Paragraph> </Section> </Section> </Paper>
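<!--
The disambiguation step described above (Viterbi decoding, with a fall-back to back-off features when a word is unseen) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a first-order hidden Markov model with tag-to-tag transition probabilities and word-given-tag emission probabilities, and a plain "use the first observation that was seen in training" back-off without smoothing. The names emission_prob, viterbi, FLOOR and the feature name InitCap, as well as every probability value, are hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical floor for events never seen in training, so logarithms stay finite.
FLOOR = 1e-12

# observation (a word or a back-off feature) mapped to {tag: prob(observation | tag)}
Emission = Dict[str, Dict[str, float]]
# (previous tag, tag) mapped to prob(tag | previous tag)
Transition = Dict[Tuple[str, str], float]


def emission_prob(word: str, features: Sequence[str], tag: str, emit: Emission) -> float:
    """Look the word up first; only if the word itself was never seen,
    try its back-off features in order (most descriptive first)."""
    for obs in (word, *features):
        if obs in emit:
            return emit[obs].get(tag, FLOOR)
    return FLOOR


def viterbi(words: Sequence[str], features: Sequence[Sequence[str]], tags: Sequence[str],
            start: Dict[str, float], trans: Transition, emit: Emission) -> List[str]:
    """Return the most probable tag sequence for the words (log-space Viterbi)."""
    if not words:
        return []
    # best[i][t]: best log-probability of any tag path that ends in tag t at position i
    best = [{t: math.log(start.get(t, FLOOR))
                + math.log(emission_prob(words[0], features[0], t, emit)) for t in tags}]
    back: List[Dict[str, str]] = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + math.log(trans.get((p, t), FLOOR))) for p in tags),
                key=lambda item: item[1],
            )
            best[i][t] = score + math.log(emission_prob(words[i], features[i], t, emit))
            back[i][t] = prev
    # Follow the back-pointers from the best final tag to recover the whole path.
    path = [max(tags, key=lambda t: best[len(words) - 1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[len(path) - 1]])
    return path[::-1]


# Toy model; all numbers are made up. "Acme" is unseen, so its feature "InitCap" is used.
TAGS = ["-", "PERSON", "ORG"]
START = {"-": 0.8, "PERSON": 0.1, "ORG": 0.1}
TRANS = {("-", "-"): 0.8, ("-", "PERSON"): 0.1, ("-", "ORG"): 0.1,
         ("PERSON", "PERSON"): 0.5, ("PERSON", "-"): 0.5,
         ("ORG", "ORG"): 0.6, ("ORG", "-"): 0.4}
EMIT = {"Smith": {"PERSON": 0.05}, "joined": {"-": 0.01}, "Corp.": {"ORG": 0.2},
        "InitCap": {"ORG": 0.3, "PERSON": 0.1, "-": 0.05}}
WORDS = ["Smith", "joined", "Acme", "Corp."]
FEATURES = [["InitCap"], ["lower"], ["InitCap"], ["InitCap"]]

result = viterbi(WORDS, FEATURES, TAGS, START, TRANS, EMIT)
print(result)  # prints ['PERSON', '-', 'ORG', 'ORG']
```

Grouping adjacent identical tags in the result then yields the person name "Smith" and the organization name "Acme Corp.", as in the tagging formulation above. Swapping emission_prob for a smoothed or interpolated estimate is one place where the flexibility mentioned in the text could be exercised.
-->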