File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/03/w03-1713_metho.xml
Size: 6,902 bytes
Last Modified: 2025-10-06 14:08:37
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1713"> <Title>News-Oriented Automatic Chinese Keyword Indexing</Title> <Section position="4" start_page="2" end_page="2" type="metho"> <SectionTitle> (3) Segmentation Filter </SectionTitle> <Paragraph position="0"> We locate the first occurrence of every candidate keyword and extract the sentence that contains it.</Paragraph> <Paragraph position="1"> The sentence is then segmented. From the segmentation result, we can verify whether the character string is meaningful. First, we obtain the segmentation of the character string within the segmented sentence. Suppose the character string</Paragraph> <Paragraph position="3"> will not be regarded as an integral unit. That is, the item is considered meaningless and is filtered out of the set of candidate keywords. Note that we do not segment the text first and then count word frequencies; instead, we count the frequencies of character strings first and apply the segmentation tool afterwards. There are several reasons for this. First, although segmentation techniques are relatively mature, their precision is still not high enough. Second, the same character string is often segmented differently in different sentences, so it is difficult to compute its frequency precisely from segmented text. Third, we only need to segment one sentence per candidate keyword, which saves a great deal of time.</Paragraph> </Section> <Section position="5" start_page="2" end_page="2" type="metho"> <SectionTitle> (4) POS Filter </SectionTitle> <Paragraph position="0"> Because keywords provide a brief summary of a document, they should be words or phrases that represent meaning units, such as nouns and noun phrases. Therefore, a single word whose part of speech is preposition, adverb, adjective, or conjunction is filtered out. Likewise, verb phrases, adjective phrases, and prepositional phrases are excluded from the candidate set. 
As with the segmentation filter, we perform POS tagging only on the sentence in which each candidate keyword first occurs. If a candidate item consists of more than one word, it has a sequence of POS tags, from which we assign a phrase category. The POS tags or phrase categories are the basis for POS filtering.</Paragraph> <Paragraph position="1"> Frequency statistics of character strings alone cannot refine the candidate set well, so we use the relatively mature segmentation and POS tagging techniques to further improve the quality of the candidate keywords. A general lexicon of about 60,000 Chinese words is used for both segmentation and POS tagging.</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.3 Selector Module </SectionTitle> <Paragraph position="0"> After the filtering steps, we obtain a reduced set of candidate keywords. Most character strings in the set are meaningful and reflect the content of the document to some extent. We describe every candidate with several features: frequency, length, position of the first occurrence, part of speech, and whether it is a proper noun or enclosed in a pair of specific punctuation marks, as in Table 1. 
The linguistic tools applied in the filter module also let us assign a value to every feature of every candidate item.</Paragraph> <Paragraph position="1"> feature | meaning of feature
freq | Frequency of an item
len | Length of an item
is_noun | Whether an item is a noun phrase
in_title | Whether the first occurrence of an item is in the title of the document
in_seg1 | Whether the first occurrence of an item is in the first paragraph of the document
is_proper | Whether an item is a proper noun, for example a person name, organization, translation term, place name, or title of a person</Paragraph> <Paragraph position="2"> in_sign | Whether an item is enclosed in a pair of specific punctuation marks such as '&lt;&lt;&gt;&gt;' and '&quot;&quot;'.</Paragraph> <Paragraph position="3"> The candidate set is still too large to select keywords from directly, so we conduct feature calculation to refine it. Every candidate item has a set of feature values, and these values are the basis for evaluating it. We compute a score for every candidate keyword in the feature computation module. The higher the score, the more relevant the candidate is to the document.</Paragraph> <Paragraph position="4"> We compute, for each keyword length, the percentage of manually indexed keywords of that length that are covered by the set of automatically generated candidates. In Figure 2, Length is the length of the keywords and Percentage is the corresponding percentage of keywords of that length found in the set. The higher the percentage, the more likely keywords of this length are to be selected. We therefore take the score of a candidate to be directly proportional to the percentage for its length, which gives the relation between the score and the length of a candidate. Likewise, the score is directly proportional to the candidate's frequency. 
In addition, the score depends on the other features in Table 1. Thus, we obtain formula 1, as follows.</Paragraph> <Paragraph position="6"> Here ck represents a candidate keyword, freq(ck) is the frequency of ck, and len(ck) is its length, that is, the number of Chinese characters it contains. F is the set of all binary features of a candidate keyword in Table 1. Every feature other than freq and len is denoted by f_i. f_i(ck) is a binary function whose value is 0 or 1. If a candidate item ck satisfies the</Paragraph> <Paragraph position="8"> feature, the value is set to 1; otherwise, it is set to 0. w_i</Paragraph> <Paragraph position="10"> is the weight of feature f_i. For the features is_noun, in_title, in_seg1, is_proper and in_sign, we set the weights empirically to 7, 13, 5, 11 and 3, respectively. After each candidate keyword receives a score, we select the highest-ranked candidates as keywords.</Paragraph> <Paragraph position="11"> The keywords obtained so far are all selected from the original text. However, some keywords may express the content of the document without occurring in the text. We have therefore constructed a list of content words with hierarchical relations, as in Figure 3; we call it the content words lexicon. The lexicon contains about 1,200 words that are often used as keywords. With this lexicon available, we can look up the obtained keywords and expand them to a higher level: if a selected keyword has a parent in the lexicon, the parent word is added as a keyword.</Paragraph> </Section> </Section> </Paper>
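The lexicon-based expansion described in the final paragraph of Section 3.3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `PARENT` mapping, its English entries, and the function name `expand_keywords` are hypothetical, and the sketch assumes that each content word has at most one parent and that expansion moves up exactly one level.

```python
# Hypothetical child -> parent mapping standing in for the content words
# lexicon. The entries are invented for illustration; the lexicon in the
# paper holds about 1,200 Chinese words.
PARENT = {
    "football": "sports",
    "basketball": "sports",
    "stock market": "finance",
}


def expand_keywords(selected, parent=PARENT):
    """Return the selected keywords plus the lexicon parent of each one.

    Assumptions: one parent per word and a single level of expansion.
    Input order is preserved and duplicates are dropped.
    """
    expanded = list(dict.fromkeys(selected))  # de-duplicate, keep order
    for keyword in list(expanded):  # iterate over a copy while appending
        parent_word = parent.get(keyword)
        if parent_word is not None and parent_word not in expanded:
            expanded.append(parent_word)
    return expanded


if __name__ == "__main__":
    print(expand_keywords(["football", "league", "basketball"]))
```

Keywords without an entry in the mapping pass through unchanged, so the expansion only ever adds terms; whether parents should themselves be expanded further is left open by the text and is not attempted here.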