File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/00/w00-0804_intro.xml
Size: 3,672 bytes
Last Modified: 2025-10-06 14:00:59
<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0804"> <Title>Experiments in Word Domain Disambiguation for Parallel Texts</Title> <Section position="4" start_page="0" end_page="27" type="intro"> <SectionTitle> 2 WordNet and Subject Field Codes </SectionTitle> <Paragraph position="0"> In this work we make use of an augmented WORDNET, whose synsets have been annotated with one or more subject field codes. This resource, discussed in [Magnini and Cavaglià, 2000], currently covers all the noun synsets of WORDNET 1.6 [Miller, 1990], and it is under development for the remaining lexical categories. Subject Field Codes (SFCs) group together words relevant for a specific domain. The best approximation of SFCs is given by the field labels used in dictionaries (e.g. MEDICINE, ARCHITECTURE), even if their use is restricted to word usages belonging to specific terminological domains. In WORDNET, too, SFCs seem to be used occasionally and without a consistent design.</Paragraph> <Paragraph position="1"> Information brought by SFCs is complementary to what is already in WORDNET. First of all, an SFC may include synsets of different syntactic categories: for instance MEDICINE¹ groups together senses from Nouns, such as doctor#1 and hospital#1, and from Verbs, such as operate#7.</Paragraph> <Paragraph position="2"> Second, an SFC may also contain senses from different WORDNET sub-hierarchies (i.e. deriving from different &quot;unique beginners&quot;, or from different &quot;lexicographer files&quot;). 
For example, the SPORT SFC contains senses such as athlete#1, deriving from life_form#1, game_equipment#1 from physical_object#1, sport#1 from act#2, and playing_field#1 from location#1.</Paragraph> <Paragraph position="3"> We have organized about 250 SFCs in a hierarchy, where each level is made up of codes of the same degree of specificity: for example, the second level includes SFCs such as BOTANY, LINGUISTICS, HISTORY, SPORT and RELIGION, while at the third level we can find specializations such as AMERICAN_HISTORY, GRAMMAR, PHONETICS and TENNIS.</Paragraph> <Paragraph position="4"> A problem arises for synsets that do not belong to a specific SFC, but rather can appear in almost all of them. For this reason, a FACTOTUM SFC has been created, which basically includes two types of synsets: * Generic synsets, which are hard to classify in a particular SFC, are generally placed high in the WORDNET hierarchy and are related senses of highly polysemous words. For example: man#1: an adult male person (as opposed to a woman); man#3: the generic use of the word to refer to any human being; date#1: day of the month; date#3: appointment, engagement. [Footnote 1: Throughout the paper subject field codes are indicated with this TYPEFACE while word senses are reported with this typeface#1, with their corresponding numbering in WORDNET 1.6. Moreover, we use subject field code, domain label and semantic field with the same meaning.]</Paragraph> <Paragraph position="5"> * Stop Senses synsets, which appear frequently in different contexts, such as numbers, week days, colors, etc. These synsets usually belong to non-polysemous words and they behave much like stop words, because they do not significantly contribute to the overall meaning of a text.</Paragraph> <Paragraph position="6"> A single domain label may group together more than one word sense, resulting in a reduction of the polysemy. Figure 1 shows an example. 
The word &quot;book&quot; has seven different senses in WORDNET 1.6: three of them are grouped under the PUBLISHING domain, causing the reduction of the polysemy from 7 to 5 senses.</Paragraph> </Section> </Paper>