<?xml version="1.0" standalone="yes"?> <Paper uid="N03-2023"> <Title>Category-Based Pseudowords</Title> <Section position="3" start_page="0" end_page="2" type="intro"> <SectionTitle> 2 MeSH and Medline </SectionTitle> <Paragraph position="0"> In this paper we use the MeSH (Medical Subject Headings) lexical hierarchy, but the approach should be equally applicable to other domains using other thesauri and ontologies. In MeSH, each concept is assigned one or more alphanumeric descriptor codes corresponding to particular positions in the hierarchy. For example, the path to Eye is A (Anatomy), A01 (Body Regions), A01.456 (Head), A01.456.505 (Face), A01.456.505.420 (Eye). Eye is ambiguous according to MeSH and has a second code: A09.371 (A09 represents Sense Organs).</Paragraph> <Paragraph position="1"> In the studies reported here, we truncate each MeSH code at the first period, so that A01.456.505.420 becomes A01. This allows for generalization over different words; e.g., for eye, we discriminate between the senses represented by A01 and A09. This truncation reduces the average number of senses per token from 2.12 to 1.39, and the maximum number of ambiguity classes for a given word to 7; 71.18% of the tokens have a single class and 22.14% have two classes. From a collection of 180,226 abstracts from Medline 2003, we trained on two-thirds of the abstracts (120,150) and tested on the remaining third (60,076).</Paragraph> </Section> </Paper>