<?xml version="1.0" standalone="yes"?>
<Paper uid="H93-1049">
  <Title>HYPOTHESIZING WORD ASSOCIATION FROM UNTAGGED TEXT</Title>
  <Section position="5" start_page="248" end_page="248" type="metho">
    <SectionTitle>
3. OVERVIEW OF THE METHOD
</SectionTitle>
    <Paragraph position="0"> The method consists of three phases: 1) Automatic part of speech tagging of text. First, texts are labeled by our probabilistic part of speech tagger (POST) which has been extended for Japanese morphological processing \[Matsukawa et. al. 1993\]. This is fully automatic; human review is not necessary under the assumption that the tagger has previously been trained on appropriate text \[Meteer et. al. 1991\] 1 2) Finite state pattern matching.</Paragraph>
    <Paragraph position="1"> Second, a finite-state pattern matcher with patterns representing possible grammatical relations, such as verb/argument pairs, nominal compounds, etc. is run over the sample text to suggest word pairs which will be considered candidates for word associations. As a result, we get a word co-occurrence matrix. Again, no human review of the pattern matching is assumed.</Paragraph>
    <Paragraph position="2">  3) Filtering/Generalization of word associations via Chi-square. Third,  given the word co-occurrence matrix, the program starts from an initial pair of word groups (or a submatrix in the matrix), incrementally adding into the submatrix a word which locally gives the highest Chi-square score to the submatrix. Finally, words are removed which give a higher Chi-square score by their removal. By adding and removing words until reaching an appropriate significance level, we get a submatrix as a hypothesis of word associations between the cluster of words represented as rows in the submatrix and the cluster of words represented as columns in the submatrix.</Paragraph>
  </Section>
  <Section position="6" start_page="248" end_page="248" type="metho">
    <SectionTitle>
4. WORD SEGMENTATION AND PART OF SPEECH LABELING
</SectionTitle>
    <Paragraph position="0"> 1 In our experience thus far in three domains and in both Japanese and English, while retraining POST on domain-specific data would reduce the error rate, the effect on overall performance of the system in data extraction from text has been small enough to make retraining unnecessary. The effect of domain-specific lexical entries (e.g., DRAM is a noun in microelectronics) often mitigates the need to retrain.</Paragraph>
    <Paragraph position="1"> Since in Japanese word separators such as spaces are not present, words must be segmented before we assign part of speech to words. To do this, we use JUMAN from Kyoto University to segment Japanese text into words, AMED, an example-based segmentation corrector, and a Hidden Markov Model (POST) \[Matsukawa, et. al. 1993\]. For example, POST processes an input text such as the following: and produces tagged text such as: 2</Paragraph>
    <Paragraph position="3"/>
  </Section>
  <Section position="7" start_page="248" end_page="249" type="metho">
    <SectionTitle>
5. FINITE STATE PATTERN
MATCHING
</SectionTitle>
    <Paragraph position="0"> We use the following finite state patterns for extracting possible Japanese verb/argument word co-occurrences from automatically segmented and tagged Japanese text.</Paragraph>
    <Paragraph position="1"> Completely different patterns would be used for English.</Paragraph>
    <Paragraph position="3"> Here, the first part (CN, PN or SN) represents a noun.</Paragraph>
    <Paragraph position="4"> Since in Japanese the head noun of a noun phrase is always at the right end of the phrase, this part should always match a head noun. The second part (CM or PT) represents a postposition which identifies an argument of a verb. The final pattern element (VB or SN) represents a verb. Sainflection nouns (SN) are nominalized verbs which form a verb phrase with the morpheme &amp;quot;suru.&amp;quot;</Paragraph>
    <Paragraph position="6"> Since argument structure in Japanese is marked by postpositions, i.e., case markers (i.e., &amp;quot;o,&amp;quot; &amp;quot;ga&amp;quot;) and partic?,es (e.g., &amp;quot;ni,&amp;quot; &amp;quot;kara,&amp;quot; . . .), word combinations matched with the patterns will represent associations between a noun filling a particular argument type (e.g., &amp;quot;o&amp;quot;) and a verb. Note that topic markers (TM; i.e., &amp;quot;wa&amp;quot;) and toritate markers (TTM; e.g.&amp;quot;mo&amp;quot;, &amp;quot;sae&amp;quot;, ...) are not included in the pattern since these do not uniquely identify the case of the argument.</Paragraph>
    <Paragraph position="7"> Just as in English, the arguments of a verb in Japanese may be quite distant from the verb; adverbial phrases and scrambling are two cases that may separate a verb from its argument(s). We approximate this in a finite state machine by allowing words to be skipped. In our experiment, up to four words could be skipped. As shown in Figure 1, matching an argument structure varies from distance 0 to 4.</Paragraph>
    <Paragraph position="8"> By limiting the algorithm to a maximum of four word gaps, and by not considering the ambiguous cases of topic markers and taritate markers, we have chosen to limit the cases considered in favor of high accuracy in automatically hypothesizing word associations. \[Brent, 1991\] similarly limited what his algorithm could learn in favor of high accuracy.</Paragraph>
  </Section>
  <Section position="8" start_page="249" end_page="249" type="metho">
    <SectionTitle>
6. FILTERING AND
GENERALIZATION VIA CHI-SQUARE
</SectionTitle>
    <Paragraph position="0"> Word combinations found via the finite state patterns include a noun, postposition, and a verb. A two dimensional matrix (a word co-occurrence matrix) is formed, where the columns are nouns, and the rows are pairs of a verb plus postposifion. The cells of the matrix are the frequency of the noun (column element) co-occurring in the given case with that verb (row element). Starting from a submatrix, the algorithm successively adds to the submatfix the word with the largest Chi-square score among all words outside the submatrix. Words are added until a local maximum is reached. Finally, the appropriateness of the submatrix as a hypothesis of word associations is checked with heuristic criteria based on the sizes of the row and the column of the submatrix.</Paragraph>
    <Paragraph position="1"> Currently, we use the following criteria for appropriateness  of a submatrix: LET 1 : size of row of submatfix m : size of column of submatrix C1, C2, C3 : parameters IF 1 &gt; C1, and m &gt; C1, and 1 &gt; C2 or m/l &lt; C3, and m &gt; C2 or l/m &lt; C3  THEN the submatrix is appropriate.</Paragraph>
    <Paragraph position="2"> For any submatrix found, the co-occurrence observations for the clustered words are removed from the word co-occurrence matrix and treated as a single column of clustered nouns and a single row of clustered verb plus case pairs. Currently, we use the following values for the parameters: C1=2, C2=10, and C3=10.</Paragraph>
    <Paragraph position="3"> Table 1. shows an example of clustering starting from the initial submatrix shown in Figure 2. The words in Figure 2 were manually selected as words meaning &amp;quot;organization.&amp;quot; In Table 1, the first (leftmost) column indicates the word which was added to the submatfix at each step. The second column gives an English gloss of the word. The third column reports fix,Y), the frequency of the co-occurrences between the word and the words that co-occur with it. For example, the first line of the table shows that the word &amp;quot;~/~L&amp;quot; (establish/-acc) co-occurred with the &amp;quot;organization&amp;quot; words 26 times. The rightmost column specifies I(X,Y), the scaled mutual information between the rows and columns of the submatrix. As the clustering proceeds, I(X,Y) gets larger.</Paragraph>
    <Paragraph position="4"> ~_~(company), ;~k:~l\](head quarter), mS(organization), ~(coorporation), iitij:~.J:(both companies), ~(school),</Paragraph>
  </Section>
  <Section position="9" start_page="249" end_page="251" type="metho">
    <SectionTitle>
7. EVALUATION
</SectionTitle>
    <Paragraph position="0"> Using 280,000 words of Japanese source text from the TIPSTER joint ventures domain, we tried several variations of the initial submatrices (word groups) from which the search in step three of the method starts: a) complete bipartite subgraphs, b) pre-classified noun groups and c) significantly frequent word pairs.</Paragraph>
    <Paragraph position="1"> Based on the results of the experiments, we concluded that alternative (b) gives both the most accurate word associations and the highest coverage of word associations. This technique is practical because classification of nouns is generally much simpler than that of verbs. We don't propose any automatic algorithm to accomplish noun classification, but instead note that we were able to manually classify nouns in less than ten categories at about 500 words/hour. That productivity was achieved using our new tool for manual word classification, which is partially inspired by EDR's way of classifying their semantic lexical data \[Matsukawa and Yokota, 1991 \].</Paragraph>
    <Paragraph position="2"> Based on a corpus of 280,000 words in the TIPSTER joint ventures domain, the most frequently occurring Japanese nouns, proper nouns, and verbs were automatically identified. Then, a student classified the frequently occurring nouns into one of the twelve categories in (1) below, and each frequently occurring proper noun into one of the four categories in (2) below, using a menu-based tool, we were able to categorize 3,195 lexical entries in 12 person-hours. 3 These categories were then used as input to the word co-occurrence algorithm.</Paragraph>
    <Paragraph position="3">  two phases; classification into the four categories la, lb, lc and ld, and further classification into the twelve categories. As a result, each word was checked twice. We found that using two phases generally improves both overall productivity and consistency.</Paragraph>
  </Section>
  <Section position="10" start_page="251" end_page="251" type="metho">
    <SectionTitle>
(2) ORGANIZATION, LOCATION, PERSON, OTHER
</SectionTitle>
    <Paragraph position="0"> Using the 280,000 word joint venture corpus, we collected 14,407 word co-occurrences, involving 3,195 nouns and 4,365 verb/argument pairs, by the finite state pattern given in Section 5. 16 submatrices were clustered, grouping 810 observed word co-occurrences and 6,240 unobserved (or hypothesized) word co-occurrences. We evaluated the accuracy of the system by manual review of a random sample of 500 hypothesized word co-occurrences. Of these, 435, or 87% were judged reasonable. This ratio is fine compared with a random sample of 500 arbitrary word co-occurrences between the 3,195 nouns and the 4,365 verb/argument pairs, of which only 153 (44%) were judged reasonable. Table 2 below shows some examples judged reasonable; questionable examples are marked by &amp;quot;?&amp;quot;; unreasonable hypotheses are marked with an asterisk.</Paragraph>
    <Paragraph position="1"> With a small corpus (280,000 words) such as ours, considering small frequency co-occurrences is critical.</Paragraph>
    <Paragraph position="2"> Looking at Table 3 below, if we had to ignore co-occurrences with frequency less than five (as \[Church and Hanks 1990\] did), there would be very little data. With our method, as long as the frequency of co-occurrence of the word being considered with the set is greater than two, the statistic is stable.</Paragraph>
  </Section>
</Paper>