<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2251"> <Title>Predicting Part-of-Speech Information about Unknown Words using Statistical Methods</Title> <Section position="3" start_page="0" end_page="1505" type="metho"> <SectionTitle> 2 Creating the Predictor </SectionTitle> <Paragraph position="0"> To build the unknown word predictor, a lexicon was created from the Brown corpus. The entry for a word consists of a list of all tags assigned to that word, and the number of times that tag was assigned to that word in the entire training corpus. For example, the lexicon entry for the word advanced is the following: advanced ((VBN 31) (JJ 12) (VBD 8)) This means that the word advanced appeared a total of 51 times in the corpus: 31 as a past participle (VBN), 12 as an adjective (J J), and 8 as a past tense verb (VBD). We can then use this lexicon to estimate P(wilti).</Paragraph> <Paragraph position="1"> This lexicon is used as a preliminary source to construct the unknown word predictor. This predictor is constructed based on the assumption that new words in a language are created using a well-defined morphological process. We wish to use suffixes and prefixes to predict possible tags for unknown words. For example, a word ending in -ed is likely to be a past tense verb or a past participle. This rough stemming is a preliminary technique, but it avoids the need for hand-crafted morphological information. To create a distribution for each given affix, the tags for all words with that affix are totaled. Affixes up to four characters long, or up to two characters less than the length of the word, whichever is smaller, are considered.</Paragraph> <Paragraph position="2"> Only open-class tags are considered when constructing the distributions. Processing all the words in the lexicon creates a probability distribution for all affixes that appear in the corpus. One problem is that data is available for both prefixes and suffixes--how should both sets of data be used? First, the longest applicable suffix and prefix are chosen for the word. Then, as a baseline system, a simple heuristic method of selecting the distribution with the fewest possible tags was used. Thus, if the prefix has a distribution over three possible tags, and the suffix has a distribution over five possible tags, the distribution from the prefix is used.</Paragraph> </Section> <Section position="4" start_page="1505" end_page="1505" type="metho"> <SectionTitle> 3 Refining the Predictions </SectionTitle> <Paragraph position="0"> There are several techniques that can be used to refine the distributions of possible tags for unknown words. Some of these that are used in our system are listed here.</Paragraph> <Section position="1" start_page="1505" end_page="1505" type="sub_section"> <SectionTitle> 3.1 Entropy Calculations </SectionTitle> <Paragraph position="0"> A method was developed that uses the entropy of the prefix and suffix distributions to determine which is more useful. Entropy, used in some part-of-speech tagging systems (Ratnaparkhi, 1996), is a measure of how much information is necessary to separate data. 
<Section position="4" start_page="1505" end_page="1505" type="metho">
<SectionTitle>
3 Refining the Predictions
</SectionTitle>
<Paragraph position="0"> There are several techniques that can be used to refine the distributions of possible tags for unknown words. Those used in our system are listed here; illustrative code sketches for them follow the paper body.</Paragraph>
<Section position="1" start_page="1505" end_page="1505" type="sub_section">
<SectionTitle>
3.1 Entropy Calculations
</SectionTitle>
<Paragraph position="0"> A method was developed that uses the entropy of the prefix and suffix distributions to determine which is more useful. Entropy, used in some part-of-speech tagging systems (Ratnaparkhi, 1996), is a measure of how much information is necessary to separate data. The entropy of a tag distribution is determined by the following equation:

H_i = -\sum_j (n_{ij} / N_i) \log_2 (n_{ij} / N_i)

where n_{ij} is the number of occurrences of the j-th tag among words with the i-th affix, and N_i is the total number of occurrences of the i-th affix. The distribution with the smaller entropy is used, as this is the distribution that offers the most information.</Paragraph>
</Section>
<Section position="2" start_page="1505" end_page="1505" type="sub_section">
<SectionTitle>
3.2 Open-Class Smoothing
</SectionTitle>
<Paragraph position="0"> In the baseline method, the distributions produced by the predictor are smoothed with the overall distribution of tags. In other words, if p(x) is the distribution for the affix and q(x) is the overall distribution, we form a new distribution p'(x) = λp(x) + (1 − λ)q(x). We use λ = 0.9 in these experiments. We hypothesize that smoothing with the open-class tag distribution, instead of the overall distribution, will offer better results.</Paragraph>
</Section>
<Section position="3" start_page="1505" end_page="1505" type="sub_section">
<SectionTitle>
3.3 Contextual Information
</SectionTitle>
<Paragraph position="0"> Contextual probabilities offer another source of information about the possible tags for an unknown word. The probabilities P(t_i | t_{i-1}) are trained from the 90% training portion of the data and combined with the unknown word's distribution. This use of context would normally be done in the tagger proper, but it is included here for illustrative purposes.</Paragraph>
</Section>
<Section position="4" start_page="1505" end_page="1505" type="sub_section">
<SectionTitle>
3.4 Using Suffixes Only
</SectionTitle>
<Paragraph position="0"> Prefixes seem to offer less information than suffixes. To determine whether calculating distributions based on prefixes is helpful, a predictor that uses only suffix information is also tested.</Paragraph>
</Section>
</Section>
<Section position="5" start_page="1505" end_page="1505" type="metho">
<SectionTitle>
4 The Experiment
</SectionTitle>
<Paragraph position="0"> The experiments were performed using the Brown corpus. A 10-fold cross-validation technique was used to generate the data. The sentences from the corpus were split into ten files, nine of which were used to train the predictor and one of which served as the test set. The lexicon for each test run was created from the training data. All unknown words in the test set (those that did not occur in the training set) were assigned a tag distribution by the predictor, and the results were checked to see whether the correct tag appeared among the n-best tags. The results from all ten test files were combined to rate the overall performance of the experiment.</Paragraph>
</Section>
</Paper>
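The sketches below illustrate the refinements of Section 3 and the experiment of Section 4. All identifiers are assumptions introduced for exposition, not the authors' code. First, Section 3.1's entropy-based choice between the prefix and suffix distributions, assuming tag distributions are collections.Counter objects as in the earlier sketch:

```python
import math
from collections import Counter

def entropy(tag_counts: Counter) -> float:
    """Shannon entropy (in bits) of an affix's tag distribution:
    H_i = -sum_j (n_ij / N_i) * log2(n_ij / N_i)."""
    total = sum(tag_counts.values())
    return -sum((n / total) * math.log2(n / total) for n in tag_counts.values())

def entropy_select(prefix_counts, suffix_counts):
    """Choose whichever distribution has the smaller entropy, i.e. the one
    that concentrates its probability mass on fewer tags."""
    if prefix_counts is None:
        return suffix_counts
    if suffix_counts is None:
        return prefix_counts
    return prefix_counts if entropy(prefix_counts) <= entropy(suffix_counts) else suffix_counts
```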
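Section 3.2's smoothing is a direct linear interpolation; this sketch assumes the distributions have already been normalized to probabilities:

```python
def smooth(affix_dist, background_dist, lam=0.9):
    """Interpolate p'(x) = lam * p(x) + (1 - lam) * q(x).

    Both arguments map tag -> probability. The paper's baseline uses the
    overall tag distribution as q(x); the proposed refinement uses the
    open-class tag distribution instead. lam = 0.9 follows the paper.
    """
    tags = set(affix_dist) | set(background_dist)
    return {t: lam * affix_dist.get(t, 0.0) + (1 - lam) * background_dist.get(t, 0.0)
            for t in tags}
```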
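For Section 3.3, the paper does not spell out how the bigram probabilities are combined with the affix distribution; a multiplicative combination followed by renormalization is one plausible reading, sketched here under that assumption:

```python
def apply_context(affix_dist, transition, prev_tag):
    """Weight the affix-based distribution by bigram context and renormalize:
    score(t) proportional to P(t | prev_tag) * P_affix(t).

    `transition[(prev, t)]` holds P(t_i | t_{i-1}) estimated from the 90%
    training portion. The multiplicative combination is an assumption, not
    a method stated explicitly in the paper.
    """
    scores = {t: transition.get((prev_tag, t), 0.0) * p for t, p in affix_dist.items()}
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()} if total else affix_dist
```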
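Finally, a sketch of Section 4's evaluation loop. The paper splits the corpus into ten files; interleaved folds are used here for brevity, and n_best = 3 is an arbitrary example value, as the paper reports results over several n:

```python
def cross_validate(sentences, build_predictor, n_folds=10, n_best=3):
    """10-fold evaluation of the unknown word predictor.

    `sentences` is a list of [(word, tag), ...] lists; `build_predictor`
    takes training sentences and returns a function word -> ranked tag list.
    Only words unseen in training are scored, and a prediction counts as
    correct when the gold tag appears among the n-best tags.
    """
    correct = total = 0
    folds = [sentences[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        seen = {w for sent in train for w, _ in sent}
        predict = build_predictor(train)
        for sent in folds[i]:
            for word, gold in sent:
                if word in seen:
                    continue  # only unknown words are scored
                total += 1
                if gold in predict(word)[:n_best]:
                    correct += 1
    return correct / total if total else 0.0
```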