<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2245">
  <Title>Bridging the Gap between Dictionary and Thesaurus</Title>
  <Section position="3" start_page="1487" end_page="1487" type="metho">
    <SectionTitle>
2 Inter-relatedness of the Resources
</SectionTitle>
    <Paragraph position="0"> The three lexical resources used in this work are the 1987 revision of Roget's Thesaurus (ROGET) (Kirkpatrick, 1987), the Longman Dictionary of Contemporary English (LDOCE) (Procter, 1978) and WordNet 1.5 (WN) (Miller et al., 1993). Figure 1 shows how word senses are organised in them. As we have mentioned, instead of directly mapping an LDOCE definition to a ROGET class, we bridge the gap with WN, as indicated by the arrows in the figure. Such a route is made feasible by linking the structures in common among the resources.</Paragraph>
    <Paragraph position="1"> Words are organised in alphabetical order in LDOCE, as in other conventional dictionaries. The senses are listed after each entry, in the form of text definitions. WN groups words into sets of synonyms (&quot;synsets&quot;), with an optional textual gloss. These synsets form the nodes of a taxonomic hierarchy.</Paragraph>
    <Paragraph position="2"> In ROGET, each semantic class comes with a number, under which words are first sorted by part of speech and then grouped into paragraphs according to the idea conveyed.</Paragraph>
    <Paragraph position="3"> Let us refer to Figure 1 and start from word x2 in WN synset X. Since words expressing every aspect of an idea are grouped together in ROGET, we can expect to find in the same ROGET paragraph not only the words in synset X, but also those in the coordinate WN synsets (i.e. M and P, with words m1, m2, p1, p2, etc.) and the superordinate WN synsets (i.e. C and A, with words c1, c2, etc.). In other words, the thesaurus class to which x2 belongs should include roughly $X \cup M \cup P \cup C \cup A$. Meanwhile, the LDOCE definition corresponding to the sense of synset X (denoted by $D_x$) is expected to be similar to the textual gloss of synset X (denoted by $Gl(X)$).</Paragraph>
    <Paragraph position="5"> [Figure 1: ROGET class 120 lists, under N., the words c1, c2, ... (in C); m1, m2, ... (in M); p1, p2, ... (in P); x1, x2, ... (in X), followed by V. and Adj. entries and then class 121. The WN hierarchy has synsets A, B, C, E, F, M, P and X; C, M, P and X are shown with their word sets and glosses, e.g. C {c1, c2, ...}, Gl(C). The LDOCE definition (Dx) is similar to Gl(X) or defined in terms of words in X or C, etc.]</Paragraph>
    <Paragraph position="6"> In addition, given that it is not unusual for dictionary definitions to be phrased with synonyms or superordinate terms, we would also expect to find words from X and C, or even A, in the LDOCE definition. That means we believe $D_x \approx Gl(X)$ and $D_x \cap (X \cup C \cup A) \neq \emptyset$.</Paragraph>
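The two expectations in this section can be written as plain set operations. The following is a minimal sketch, assuming each synset is just a set of word strings; the function names and the toy word labels are hypothetical and only mirror the abstract labels used for Figure 1.

```python
def expected_roget_words(x, coordinates, superordinates):
    """Words we would roughly expect in the ROGET class of a word from synset x:
    the synset itself plus its coordinate and superordinate synsets."""
    return x.union(*coordinates, *superordinates)


def definition_is_anchored(definition_words, x, superordinates):
    """True if the definition shares a word with the synset or its superordinates."""
    return bool(definition_words.intersection(x.union(*superordinates)))


# Toy labels only (hypothetical, not real dictionary or thesaurus entries).
X, M, P = {"x1", "x2"}, {"m1", "m2"}, {"p1", "p2"}
C, A = {"c1", "c2"}, {"a1"}

roget_words = expected_roget_words(X, [M, P], [C, A])
anchored = definition_is_anchored({"a", "kind", "of", "c1"}, X, [C, A])
```

The sketch ignores the similarity between the definition and the gloss, which the section treats as a separate expectation.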
  </Section>
  <Section position="4" start_page="1487" end_page="1487" type="metho">
    <SectionTitle>
3 The Algorithm
</SectionTitle>
    <Paragraph position="0"> The possibility of using statistical methods to assign ROGET category labels to dictionary definitions has been suggested by Yarowsky (1992). Our algorithm offers a systematic way of linking existing resources by defining a mapping chain from LDOCE to ROGET through WN. It is based on shallow processing within the resources themselves, exploiting their inter-relatedness, and does not rely on extensive statistical data. It therefore has the advantage of being immune to changes in sense discrimination over time, since it depends only on the organisation of the resources and not on their individual entries. Given a word with part of speech, W(p), the core steps are as follows. [...] collect the corresponding gloss definitions, Gl(Sn), if any, the hypernym synsets Hyp(Sn), and the coordinate synsets Co(Sn).</Paragraph>
    <Paragraph position="1"> Step 3: Compute a similarity score matrix A for the LDOCE senses and the WN synsets. A similarity score A(i, j) is computed for the ith LDOCE sense and the jth WN synset using a weighted sum of the overlaps between the LDOCE sense and the WN synset, its hypernyms, and its gloss respectively, that is $A(i, j) = a_1|D_i \cap S_j| + a_2|D_i \cap Hyp(S_j)| + a_3|D_i \cap Gl(S_j)|$, where $D_i$ denotes the words of the ith LDOCE definition.</Paragraph>
    <Paragraph position="3"> For our tests, we tried setting $a_1 = 3$, $a_2 = 5$ and $a_3 = 2$ to reveal the relative significance of finding a synonym, a hypernym, and any word in the textual gloss respectively in the dictionary definition.</Paragraph>
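The weighted overlap of Step 3 can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: definitions and glosses are split on whitespace with no stop-word removal or stemming, hypernyms are given as a flat set of words, and the names tokenize and score_a are hypothetical. The toy strings are invented and are not actual LDOCE or WN entries.

```python
def tokenize(text):
    """Naive tokenizer (assumption): lowercase and split on whitespace."""
    return set(text.lower().split())


def score_a(definition, synset, hypernyms, gloss, a1=3, a2=5, a3=2):
    """Weighted sum of word overlaps between one dictionary definition and
    one synset (weight a1), its hypernyms (a2) and its gloss (a3)."""
    d = tokenize(definition)
    synonym_hits = len(d.intersection(synset))
    hypernym_hits = len(d.intersection(hypernyms))
    gloss_hits = len(d.intersection(tokenize(gloss)))
    return a1 * synonym_hits + a2 * hypernym_hits + a3 * gloss_hits


# Invented toy data: one hypernym hit and four gloss hits, no synonym hit.
example = score_a(
    "a financial institution that keeps money",
    synset={"bank", "depository"},
    hypernyms={"institution"},
    gloss="a financial institution that accepts deposits",
)
```

Because the tokenizer keeps function words such as "a" and "that", they count as gloss overlaps here; a practical version would probably filter them, which the text does not specify.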
    <Paragraph position="4"> Step 4: From ROGET, find all paragraphs $P_m\{w_1, w_2, \ldots\}$ such that $W(p) \in P_m$.</Paragraph>
    <Paragraph position="5"> Step 5: Compute a similarity score matrix B for the WN synsets and the ROGET classes. A similarity score B(j, k) is computed for the jth WN synset (taking the synset itself, the hypernyms, and the coordinate terms) and the kth ROGET class, according to the following: $B(j, k) = b_1|S_j \cap P_k| + b_2|Hyp(S_j) \cap P_k| + b_3|Co(S_j) \cap P_k|$. We have set $b_1 = b_2 = b_3 = 1$. Since a ROGET class contains words expressing every aspect of the same idea, it should be equally likely to find synonyms, hypernyms and coordinate terms in common.</Paragraph>
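Step 5 has the same shape as Step 3, with a ROGET paragraph taking the place of the definition. A minimal sketch, assuming the hypernym and coordinate synsets have already been flattened into plain word sets; the function name score_b and the toy labels are hypothetical.

```python
def score_b(synset, hypernyms, coordinates, paragraph, b1=1, b2=1, b3=1):
    """Weighted sum of word overlaps between one ROGET paragraph and a synset
    (weight b1), its hypernyms (b2) and its coordinate terms (b3)."""
    p = set(paragraph)
    return (
        b1 * len(p.intersection(synset))
        + b2 * len(p.intersection(hypernyms))
        + b3 * len(p.intersection(coordinates))
    )


# Toy labels in the style of Figure 1 (not real thesaurus entries).
example = score_b(
    synset={"x1", "x2"},
    hypernyms={"c1", "c2"},
    coordinates={"m1", "p1"},
    paragraph=["c1", "m1", "x1", "x2", "z9"],
)
```

With all three weights equal to 1, the score is simply the number of paragraph words found in any of the three sets, provided the sets do not share words.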
    <Paragraph position="6"> Step 6: For i = 1 to t (i.e. each LDOCE sense), find $\max_j A(i, j)$ from matrix A. Then trace from matrix B the jth row and find $\max_k B(j, k)$.</Paragraph>
    <Paragraph position="7"> The ith LDOCE sense should finally be mapped to the ROGET class to which $P_k$ belongs.</Paragraph>
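The chaining in Step 6 amounts to two successive arg-max lookups. The sketch below assumes A and B are non-empty lists of lists of non-negative scores, breaks ties by taking the lowest index, and leaves a sense unmapped (None) when its best score is zero. The tie-breaking and zero-score handling are assumptions, since the text does not spell them out, and the names argmax and map_senses are hypothetical.

```python
def argmax(values):
    """Index of the largest value; the first such index wins on ties."""
    return max(range(len(values)), key=lambda idx: values[idx])


def map_senses(A, B):
    """For each dictionary sense i, return (j, k): the best synset j in row i
    of A, then the best thesaurus paragraph k in row j of B.
    Returns None for a sense when either best score is zero (assumption)."""
    mapping = []
    for row in A:
        j = argmax(row)
        if row[j] == 0:
            mapping.append(None)
            continue
        k = argmax(B[j])
        if B[j][k] == 0:
            mapping.append(None)
            continue
        mapping.append((j, k))
    return mapping


# Toy matrices: 3 dictionary senses, 2 synsets, 3 thesaurus paragraphs.
A = [[13, 2], [0, 7], [0, 0]]
B = [[1, 4, 0], [3, 0, 0]]
result = map_senses(A, B)
```

Here the first sense goes through synset 0 to paragraph 1, the second through synset 1 to paragraph 0, and the third stays unmapped because its row in A is all zeros.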
    <Paragraph position="8"> We have made an operational assumption about the analysis of definitions. We did not attempt to parse definitions to identify genus terms but simply approximated this by using the weights $a_1$, $a_2$ and $a_3$ in Step 3. Considering that words are often defined in terms of superordinates and slightly less often by synonyms, we assign numerical weights in the order $a_2 &gt; a_1 &gt; a_3$. We are also aware that definitions can take other forms which may involve part-of relations, membership, and so on, though we did not deal with them in this study.</Paragraph>
  </Section>
  <Section position="5" start_page="1487" end_page="1488" type="metho">
    <SectionTitle>
4 Testing and Results
</SectionTitle>
    <Paragraph position="0"> The algorithm was tested on 12 nouns, listed in Table 1 with the number of senses in the various lexical resources.</Paragraph>
    <Paragraph position="1"> The various types of possible mapping errors are summarised in Table 2. Incorrectly Mapped and Unmapped-a are both &quot;misses&quot;, whereas Forced Error and Unmapped-b are both &quot;false alarms&quot;. The performance of the three parts of the mapping is shown in Table 3. The &quot;carry-over error&quot; is only applicable to the last stage, L → R, and it refers to cases where the final answer is wrong as a result of a faulty outcome from the first stage (L → W).</Paragraph>
  </Section>
  <Section position="6" start_page="1488" end_page="1488" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> Overall, the Accurately Mapped figures support our hypothesis that conventional dictionaries and thesauri can be related through WordNet. Looking at the unsuccessful cases, we see that there are relatively more &quot;false alarms&quot; than &quot;misses&quot;, showing that errors mostly arise from the inadequacy of individual resources, where no target exists, rather than from partial failures of the process. Moreover, the number of &quot;misses&quot; can possibly be reduced if more definition patterns are considered.</Paragraph>
    <Paragraph position="1"> Clearly the successful mappings are influenced by the fineness of the sense discrimination in the resources. How finely they are distinguished can be inferred from the similarity score matrices. Reading the matrices row-wise shows how vaguely a certain sense is defined, whereas reading them column-wise reveals how polysemous a word is.</Paragraph>
    <Paragraph position="2"> While the links resulting from the algorithm can be right or wrong, there were some senses of the test words which appeared in one resource but had no counterpart in the others, i.e. they were not attached to any links. In all, 18.9% of the LDOCE senses, 11.1% of the WN synsets and 58.1% of the ROGET classes were among these unattached senses. Though this implies that a single resource is insufficient for any application, it also suggests there is additional information we can use to overcome the inadequacy of individual resources.</Paragraph>
    <Paragraph position="3"> For example, we may take the senses from one resource and complement them with the unattached senses from the other two, thus resulting in a more complete but not redundant sense discrimination.</Paragraph>
  </Section>
</Paper>