<?xml version="1.0" standalone="yes"?>
<Paper uid="J04-1004">
<Title>Xiaotie Deng ++</Title>
<Section position="4" start_page="77" end_page="78" type="intro">
<SectionTitle> 2. Unknown Words </SectionTitle>
<Paragraph position="0"> As defined by Chen and Bai (1998), unknown words are words that are not listed in an ordinary dictionary, and word extraction seeks to identify such words. To give readers an intuitive view of these words, we list the types of unknown words that appear most frequently (Chen and Bai [1998] list 14 different types). Note that, except for numeric-type compounds, which are extracted separately, we extract all the other types of words together.</Paragraph>
<Paragraph position="1"> 1. Proper names. These include acronyms, Chinese names, and words that have been borrowed from other languages: for example, 'Bank of China'; 'Feng Haodi' (a Chinese girl's name); 'Prince Edward'; 'Microsoft'; and 'the United Kingdom of Britain and Northern Ireland'. Recognizing proper names is the first task for Chinese word extraction, because their meanings cannot be obtained by combining smaller words, as is possible for the compound words described next. Therefore, a reasonable way to approach them is to deduce them from Chinese text collections.</Paragraph>
<Paragraph position="3"> Footnote 1: RIDF = observed IDF - predicted IDF, that is, $\mathrm{RIDF} = -\log(df/D) + \log(1 - e^{-tf/D})$, where $tf$, $df$, and $D$ are term frequency, document frequency, and number of documents, respectively. (Computational Linguistics, Volume 30, Number 1.)</Paragraph>
<Paragraph position="4"> 2. Compound words. These are strings with specific meanings that are composed of shorter meaningful words: for example, 'Industry and Commerce Bank of China' is composed of 'China', 'industry and commerce', and 'bank'; and 'foreign businessmen invested company' is composed of 'foreign businessmen', 'invest', and 'company'.
Compound words account for a large proportion of Chinese words because it is very easy to compose a new compound word out of smaller known words. There are about 5,000 commonly used Chinese characters, but the number of compound Chinese words cannot be predicted. We want to extract those compounds that are accepted as words by most people.</Paragraph>
<Paragraph position="5"> 3. Derived words. These are words that contain affix morphemes: for example, 'modernization' and 'computerization', both of which contain the same affix morpheme.</Paragraph>
<Paragraph position="6"> 4. Numeric-type compounds. Examples of numeric-type compounds include '1999'; 'the first session'; 'year 2000'; and '11 streets'. Although these words have specific meanings and are used frequently, most dictionaries do not contain them. They are not very difficult to identify, since there are morphological rules (Mo et al. 1996) for generating them. Such numeric-type compounds have numbers as their main components, with measure characters or words used nearby.</Paragraph>
</Section>
</Paper>
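The description of numeric-type compounds (numbers as the main component, with a measure character nearby) can be illustrated with a small pattern match. This is only a sketch under stated assumptions: it is not the rule set of Mo et al. (1996), the numeral and measure inventories are tiny hypothetical samples, and the test strings are illustrative rather than the paper's own examples.

```python
import re
from typing import List

# Hypothetical, deliberately small inventories chosen for illustration only;
# a real system would use a much fuller list of numerals and measure words.
NUMERALS = "0-9零〇一二三四五六七八九十百千万两"
MEASURES = "年月日届条个次号"

# Optional ordinal prefix, one or more numeral characters, then exactly one
# measure character that immediately follows the number.
NUMERIC_COMPOUND = re.compile(f"第?[{NUMERALS}]+[{MEASURES}]")


def find_numeric_compounds(text: str) -> List[str]:
    """Return candidate numeric-type compounds in left-to-right order."""
    return NUMERIC_COMPOUND.findall(text)
```

Because such candidates follow regular patterns, they can be pulled out in a separate pass before the remaining unknown-word types are extracted together, which matches the separation the section describes.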