File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/w04-0862_intro.xml
Size: 1,247 bytes
Last Modified: 2025-10-06 14:02:35
<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0862"> <Title>The Swarthmore College SENSEVAL3 System</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Features </SectionTitle> <Paragraph position="0"> Each of the supervised algorithms used the same set of features, extracted only from the labeled data provided to us by the task organizers. We used no unlabeled data. From the tagged and lemmatized data we extracted the following features, which were the only features used in our system: * Bag-of-words and bag-of-lemmas * Bigrams and trigrams of words, lemmas, part-of-speech, and case (Basque-only) around the target word * Topic or code (Basque, Catalan and Spanish) To prevent individual features from dominating any individual system, we used up to eight permutations of the above-mentioned features (depending on the language) for each of our classifiers. Catalan and Spanish provided fine-grained part-of-speech tags, which we felt would lead to sparse data problems. To reduce this problem, for some feature sets we coarsened the part-of-speech tags by truncating each tag to its first letter or first two letters.</Paragraph> </Section> </Paper>
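<!--
Editorial sketch, not part of the original paper. The tag coarsening and the n-gram features around the target word described above might be implemented roughly as follows. The function names `coarsen_tag` and `context_ngrams`, the example tag strings, and the choice to collect every n-gram that contains the target position are all assumptions made for illustration; the paper does not specify the exact window or the code used.

```python
def coarsen_tag(tag, keep=1):
    """Truncate a fine-grained POS tag to its first `keep` letters.

    `keep` is 1 or 2 in the scheme described in the text.
    """
    return tag[:keep]


def context_ngrams(tokens, target_index, n):
    """Return every n-gram of `tokens` that contains the target position.

    This window definition is an assumption; the paper only says the
    n-grams are taken "around the target word".
    """
    grams = []
    for start in range(target_index - n + 1, target_index + 1):
        end = start + n
        # Skip windows that would run off either end of the sentence.
        if start >= 0 and end <= len(tokens):
            grams.append(tuple(tokens[start:end]))
    return grams


# Hypothetical fine-grained tags, used here only as sample input.
fine_tags = ["DA0MS0", "NCMS000", "VMIP3S0"]
coarse_tags = [coarsen_tag(t, keep=2) for t in fine_tags]
pos_bigrams = context_ngrams(coarse_tags, target_index=1, n=2)
```

Truncating before building the n-grams keeps the number of distinct part-of-speech n-grams small, which is the sparse-data concern the paragraph raises. The same `context_ngrams` helper could be applied to word or lemma sequences.
-->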