<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1014">
  <Title>Semiautomatic labelling of semantic features</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Superficial semantic relationships between words in dictionaries
</SectionTitle>
    <Paragraph position="0"> between words in dictionaries According to Smith and Maxwell, there are three basic methods for defining a lexical entry [Smith and Maxwell., 1980]: * By means of a synonym: a word with the same sense as the lexical entry.</Paragraph>
    <Paragraph position="1"> finish. conclude(sin), terminate(sin) * By means of a classical definition: 'genus + differentia'. The genus is the generic term or</Paragraph>
    <Paragraph position="3"> hyperonym, and the lexical entry a more specific term or hyponym.</Paragraph>
    <Paragraph position="4"> aeroplane. vehicle (genus) that can fly (differentia) * By means of specific relators, that will often determine the semantic relationship between the lexical entry and the core of the definition.</Paragraph>
    <Paragraph position="5"> horsefly. Name given to (relator) certain insects (related term) of the Tabanidae family One method for identifying the semantic relationship that exists between different words is to extract the information from monolingual dictionaries.</Paragraph>
    <Paragraph position="6"> Agirre et al. (2000) applied it for Basque, using the definitions contained in the monolingual dictionary Euskal Hiztegia. We use for our research the information about genus, specific relators and synonymy extracted by them.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Semiautomatic labelling using genus, specific relators and synonymy
</SectionTitle>
    <Paragraph position="0"> specific relators and synonymy In order to label the common nouns that appear in the dictionary, we used the definitions of the 26,461 senses of the 16,380 common nouns defined by means of genus/relators (14,569) or synonyms (11,892).</Paragraph>
    <Paragraph position="1"> The experiment was carried out as follows: firstly, we used the information relative to genus and specific relators to extract the information regarding the [+-animate] feature (3.1).</Paragraph>
    <Paragraph position="2"> Subsequently, we also incorporated the information relative to synonymy (3.2). Finally, we repeated the automatic process iteratively in order to obtain better results (3.3). An example of the whole process is given in section 3.4.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Labelling using information relative to genus and specific relators
</SectionTitle>
      <Paragraph position="0"> genus and specific relators Our strategy consisted of manually labelling the semantic feature for a small number of words that appear most frequently in the dictionary as genus/relators. We used these words to infer the value of this feature for as many other words as possible.</Paragraph>
      <Paragraph position="1"> This inference is possible because in the hyperonymy/hyponymy relationship, that characterises the genus, semantic attributes are inherited. For example, if 'langile' (worker) has the [+animate] feature, all its hyponyms (or in other words, all the words whose hyperonym is 'langile') will have the same [+animate] feature. Certain genus are ambiguous, since they contain senses with opposing semantic features. For example 'buru' (head/boss) has the [-animate] feature when it means 'head' and the [+animate] feature when it means 'boss'. The semantic feature of the sense defined can also be deduced from some specific relators. In this way, the semantic feature of words whose relator is 'nolakotasuna' (quality) would be [-animate], such as in the case of 'aitatasuna' (paternity), for example. There are also certain relators that offer no information, such as 'mota' (type), 'izena' (name), and 'banako' (unit, individual).</Paragraph>
      <Paragraph position="2"> We used four types of labels during the manual operation: [+], [-], [?] and [x]. [?] for ambiguous cases; and [x] for relators that do not offer information regarding this semantic feature. In order to establish the reliability of the automatic labelling process for a particular noun, we considered the number of senses labelled, taking into account the reliability of the labels of the genus (or relator) that provided the information. The result was calculated as follows: Rel_noun = [?] Rel_genus_per_sense / n_senses During manual labelling, we assigned reliability value 1 to all labels, since all the senses of these nouns are taken into account.</Paragraph>
      <Paragraph position="3"> Figure 1 shows the algorithm used. For each common noun defined in the dictionary, we take, one by one, all their senses containing genus or relator, assigning in each case the first label associated to a genus or relator in the hierarchy of hyperonyms. When the sign of all the labels are coincident we use it to label the entry, in other case, we use the label [?]. In all cases, their reliability is calculated.</Paragraph>
      <Paragraph position="4"> When we detect a cycle, the search is interrupted and the sense to be tagged remains unlabelled.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Labelling using synonymy information
</SectionTitle>
      <Paragraph position="0"> Labelling using genus and relators can be expanded by using synonymy. Since the synonymy relationship shares semantic features, we can deduce the semantic label of a sense if we know the label of its synonymes.</Paragraph>
      <Paragraph position="1"> Therefore, the information obtained during the previous phase can now be used to label new nouns. It also serves to increase the reliability of nouns already been labelled thanks to the genus information of some of their senses. If the synonymy information provided corroborates the genus information, the noun's reliability rating increases. If, on the other hand, the new label does not coincide with the previous one, a special label: [?] is assigned to the noun indicating this ambiguity.</Paragraph>
      <Paragraph position="2"> The automatic process using synonymy was implemented in the same way as in the previous process.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Iterative repetition of the automatic process
</SectionTitle>
      <Paragraph position="0"> process Our next idea was to repeat the process; since the information gathered so far using synonymy may also be applied hereditarily through the genus' hyperonymy relationship.</Paragraph>
      <Paragraph position="1"> We therefore repeated the process from the beginning, trying to label all the senses of the nouns that had not been fully labelled during the initial operations, by using the information contained in the senses of the nouns that had been fully labelled (reliability 1).</Paragraph>
      <Paragraph position="2"> As with the initial operation, we first used information about genus and relators, and then, synonymy.</Paragraph>
      <Paragraph position="3"> This process can be repeated any number of times, thereby labelling more and more words while increasing the reliability of the labelling itself. However, repetition of the process also increases the number of words labelled as ambiguous [?], since more senses are labelled during each iteration, thereby increasing the chances of inconsistencies. As we shall see, this iterative process improves the results logarithmically up to a certain number of repetitions, after which it has no further advantageous effects.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Example of semiautomatic labelling for the [+-animate] feature
</SectionTitle>
      <Paragraph position="0"> the [+-animate] feature The 100 words that are most frequently used as genus (g) or relators (r) were labelled manually for the [+-animate] feature, as shown in table 2 (tables 3, 4 and 5 contain the Basque words processed during the explained operation, along with their English translation in italics).</Paragraph>
      <Paragraph position="2"> We shall now trace the implementation of the automatic labelling process for certain nouns.</Paragraph>
      <Paragraph position="3"> Table 3 shows the results of the first labelling process using information about genus and relators. The words printed in bold in the results column are nouns that were labelled during the manual labelling process. We can see how the noun 'babesgarri' (protector) is labelled as [-] thanks to the information provided by the relator of its only sense, which was manually labelled.</Paragraph>
      <Paragraph position="4">  armour) had coincident labels, thereby giving a rating of 0.66 (f=(1+1)/3=0.66). The a' (mother) was labelled as [+], thanks the information about genus and relator of 2 of 3 senses, out of a total of 5 (the remaining two e synonymy information). The reliability was therefore calculated as 0.4 The word 'zinismo' (cynicism) as labelled as [-] thanks to the fact that the enus of its 2 senses were both labelled as such, h one did not have a reliability rating of Table 4 shows some examples of the process using synonym information.</Paragraph>
      <Paragraph position="5"> As we can see, 'iturburu' (spring), which the previous process had not managed to tag, is now labelled as [-] thanks to the synonymy information associated to one of the two senses. The resulting reliability rating is 0.06 (f=0.2/3=0.06). If we look at the term 'ama', which had previously been labelled as [+] on the basis of genus information, we see that the synonyms of the two senses that use synonymy Noun Genus lab. N. sens N. syn Results of the process using synonymy Lab. Relia.  thanks to synonym information. The words 'giltzape' (prison) and 'ikusgune' (viewpoint), which had had one sense labelled on the basis of genus, now have both senses labelled. The reliability rating for 'ikusgune' is calculated as f=(1+0.33)/2=0.66.</Paragraph>
      <Paragraph position="6"> We then repeated the process using first the genus/relator information (table 4) followed by the synonymy information (table 5).</Paragraph>
      <Paragraph position="7"> The aim of this repetition was to label only those words that had not been fully labelled, using the information provided by the terms that had been and that had a reliability rating of 1, such as 'babesgarri', 'gertaera', 'espetxe', 'adiskide', 'filosofia', 'ama', 'gertakuntza', 'lagun', 'jateko' and 'giltzape' (tables 4 and 5). This process succeeded in labelling the senses information. On the other hand, 'ikusgune' (viewpoint), 'jarrera' (attitude) and 'zinismo' (cynicism), did not benefit from this repetition. Following this process, we applied the synonymy information, thus completing the second iteration. The process may be repeated as many times as you wish.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experiments for optimising the efficiency of the method
</SectionTitle>
    <Paragraph position="0"> efficiency of the method We carried out a number of different tests for the [+-animate] semantic feature labelling the 2, 5, 10, 50, 100, 125 and 150 words most frequently used as genus/relators, and repeating the whole process (using both genus and relator and synonymy information) 1, 2 and 3 times.</Paragraph>
    <Paragraph position="1"> The first 5 terms that appear most frequently  nd iteration of automatic labelling using genus and relator information are labelled as [-]. Due to this , the word is now labelled as [?].</Paragraph>
    <Paragraph position="2"> s 'gertakuntza' (event), 'lagun' and 'jateko' (food), which only had one sense, are now labelled of 'armadura' (protector), 'adiskidetzako' (friend) and 'apio' (celery), previously left unlabelled, since their genus 'soineko' (garment), 'lagun' (friend) and 'jateko' (food) had been fully labelled using the synonym as genus/relators are also the most productive during the automatic labelling process. From here on, the rate of increase gradually falls, until only 7 terms are labelled automatically for every noun labelled manually.</Paragraph>
    <Paragraph position="3"> On average, the first 2 nouns each enabled 1840 terms to be labelled, the next 3 enabled 1112 while the next 5 enabled only 250. After the hundredth noun, this average dropped to just 7 new terms labelled automatically for every term labelled manually. These results are illustrated in figure 2.</Paragraph>
    <Paragraph position="4"> For efficiency reasons, we decided that when labelling other semantic features, we will label manually the 100 nouns most frequently used as genus/relators.</Paragraph>
    <Paragraph position="5"> In order to decide the number of iterations required for optimum results, we compared the results obtained after 1 to 10 iterations after manually labelling 100 nouns (Figure 3).</Paragraph>
    <Paragraph position="6"> Although no increase was recorded for the number of nouns with reliability rating 1 (i.e. with all senses labelled) after the 3 rd iteration, the results for other reliability ratings continued to increase up until the 8 th iteration, since as more and more information is gathered, new contradictions are generated and the number of ambiguous labels increases. When the results stabilise, we can affirm that all the available information has been used and the most accurate results possible with this manual labelling operation have been obtained. It is important to check that the process does indeed stabilise, and that it does so after a fairly low number of iterations (in this case, after 8).</Paragraph>
    <Paragraph position="7"> The repetition of the process does not significantly increase execution time. 10 iterations of the automatic labelling process for the [+-animate] feature takes just 11 minutes 33 seconds using the total capacity of the CPU of a Sun Sparc 10 machine with 512 Megabytes of memory running at 360 MHz.</Paragraph>
    <Paragraph position="8"> We can therefore conclude that the method is viable and that, in the automatic process for other semantic features, the necessary iterations should be carried out until the results are totally stabilised.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Accuracy and scope of the labelling process for the [+-animate] feature
</SectionTitle>
    <Paragraph position="0"> process for the [+-animate] feature In order to calculate the accuracy of the automatic labelling process, we took 1% of the labelled words as a sample and checked them manually. The results are shown in table 6.</Paragraph>
    <Paragraph position="1">  Although we initially planned to use only the labels with a reliability rating of 1, after seeing the accuracy of the others, we decided to use all the labels obtained during the process, thereby achieving an overall accuracy rating of 99.2%. We can affirm that the semiautomatic process designed and implemented here is very efficient. The scope for the automatic labelling of the [+-animate] feature (table 7) was 75.14% of all the nouns contained in the dictionary (12,308 of 16,380), having manually labelled 100 nouns and  We also calculated the scope of this labelling in a real context, using the corpus gathered from the newspaper Euskaldunon Egunkaria, which contains 1,267,453 words and 311,901 common nouns, of which 7,219 are different nouns. Table 8 shows the results - a scope of 69.2% with regard to the nouns that appear in the text (47.6% of the total number of different common nouns contained in the corpus). In other words, after carrying out a very minor manual operation, we managed to label two out of every three nouns that appear in the corpus. Similarly, we noted that of the 500 nouns that appear most frequently in the corpus, 348 (69.6%) were labelled.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Generalisation for use with other semantic features
</SectionTitle>
    <Paragraph position="0"> semantic features Given the process's efficiency, it can be generalised for use with other semantic features. To this end, we have adapted its implementation to enable the automatic process to be carried out on the basis of the manual labelling of any semantic feature.</Paragraph>
    <Paragraph position="1"> So far, we have carried out the labelling process for the [+-animate], [+-human] and [+-concrete] semantic features. Table 12 shows the corresponding results.</Paragraph>
    <Paragraph position="2">  We have presented a highly efficient semiautomatic method for labelling the semantic features of common nouns, using the study of genus, relators and synonymy as contained in the Euskal Hiztegia dictionary. The results obtained have been excellent, with an accuracy of over 99% and a scope of 68,2% with regard to all the common nouns contained in a real corpus of over 1 million words, after the manual labelling of only 100 nouns.</Paragraph>
    <Paragraph position="3"> As far as we know, no so method of semantic feature labelling has been described in the literature, although many authors [Pustejovsky, 2000; Sheremetyeva &amp; Nirenburg, 2000] claim the significance of semantic features in general, and [animacy] in particular, for NLP systems. One of the possible applications of these experiments is to enrich the Basque Lexical Database, EDBL, using the semantic information obtained.</Paragraph>
  </Section>
class="xml-element"></Paper>