<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-2039">
  <Title>Chinese Unknown Word Identification Using Character-based Tagging and Chunking</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Like many other Asian languages (Thai, Japanese, etc.), written Chinese does not delimit words with spaces, so there is no explicit cue to where the word boundaries are. Chinese text therefore usually has to be segmented before further processing.</Paragraph>
    <Paragraph position="1"> Previous research has addressed segmentation; however, the results are not satisfactory when unknown words occur in the text. An unknown word is defined as a word that is not found in the dictionary. As in any other language, a dictionary with a fixed number of entries cannot foresee all the possibilities of derivational morphology. Proper solutions are therefore necessary for the detection of unknown words.</Paragraph>
    <Paragraph position="2"> Traditionally, unknown word detection has relied on rules for guessing the location of unknown words. This approach can ensure high precision, but unfortunately the recall is not satisfactory. The main reason is that new patterns can always be created in Chinese, so the rules can hardly be maintained efficiently by hand. Since the introduction of statistical techniques in NLP, research has been done on Chinese unknown word detection using such techniques, and the results have shown that statistical models could be a better solution. The only resource needed is a large corpus. Fortunately, more and more tagged Chinese corpora have been created for research purposes.</Paragraph>
    <Paragraph position="3"> We propose an &quot;all-purpose&quot; unknown word detection method that extracts person names, organization names and low-frequency words in the corpus. We treat low-frequency words as general unknown words in our experiments. First, we segment the text and assign POS tags to the words using a morphological analyzer. Second, we break the segmented words into characters and assign each character its features. Finally, we use an SVM-based chunker to extract the unknown words.</Paragraph>
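    <Paragraph position="4"> The second step of the pipeline above, breaking segmented words into characters and attaching features, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function name words_to_characters, the S/B/I/E position tags, the POS labels and the sample sentence are hypothetical and are not taken from our system. The morphological analyzer and the SVM-based chunker are not implemented here; the rows produced are only the kind of per-character input a chunker might consume.

```python
# Hypothetical sketch: the function name, the S/B/I/E tag scheme, and the
# sample data are illustrative assumptions, not the paper's actual setup.

def words_to_characters(tagged_words):
    """Break (word, pos) pairs into per-character feature rows.

    Each row is (character, pos_of_source_word, position_tag).
    Assumed position tags: S = single-character word, B = first character,
    I = middle character, E = last character of a multi-character word.
    """
    rows = []
    for word, pos in tagged_words:
        if len(word) == 1:
            rows.append((word, pos, "S"))
            continue
        last = len(word) - 1
        for i, ch in enumerate(word):
            if i == 0:
                tag = "B"
            elif i == last:
                tag = "E"
            else:
                tag = "I"
            rows.append((ch, pos, tag))
    return rows


# Assumed output of a morphological analyzer: (word, POS) pairs.
sample = [("我", "PN"), ("喜欢", "VV"), ("计算机", "NN")]
rows = words_to_characters(sample)
for row in rows:
    print(row)
```

    In this sketch every character inherits the POS tag of the word it came from, so a chunker trained on such rows could in principle learn which character sequences tend to form a single unknown word.</Paragraph>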
  </Section>
</Paper>