<?xml version="1.0" standalone="yes"?>
<Paper uid="A92-1043">
  <Title>Learning a Scanning Understanding for &quot;Real-world&quot; Library Categorization</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Our goal is to examine hybrid symbolic/connectionist and connectionist approaches for classifying a substantial number of real-world title phrases. These approaches are embedded in the framework of SCAN (Wermter 92), a Symbolic Connectionist Approach for Natural language phrases, which aims at a scanning understanding of natural language rather than an in-depth understanding. For our experiments we took an existing classification from the online catalog of the main library at Dortmund University and, as a first subclassification, selected titles from three classes: &quot;computer science&quot; (CS), &quot;history/politics&quot; (HP), and &quot;materials/geology&quot; (MG).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2 Preprocessing of Title Phrases
2.1 Symbolic Syntactic Condensation
</SectionTitle>
      <Paragraph position="0"> The first approach used syntactic condensation based on a chart parser and a headnoun extractor. The symbolic chart parser built a syntactic structure for a title using a context-free grammar and a syntactic lexicon. Then the headnoun extractor retrieved the sequence of headnouns for building a compound noun. For instance, the compound noun &quot;software access guidelines&quot; was generated from &quot;guidelines on subject access to microcomputer software&quot;. This headnoun extractor was motivated by the close relationship between noun phrases and compound nouns and by the importance of nouns as content words (Finin 80).</Paragraph>
      <Paragraph position="1"> *This research was supported in part by the Federal Secretary for Research and Technology under contract #01IV101AO and by the Computer Science Department of Dortmund University.</Paragraph>
      <Paragraph position="2"> Each noun in a compound noun was represented with 16 binary, manually encoded semantic features, such as measuring-event, changing-event, scientific-field, property, and mechanism. The set of semantic features had been developed as a noun representation for a related scientific-technical domain and had been used for structural disambiguation (Wermter 89). The first approach was limited to a relatively small set of 76 titles, since the 16 features had to be determined manually for each noun and the syntactic category of every word in a title had to be present in the lexicon, which contained 900 entries.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Unrestricted Complete Phrases
</SectionTitle>
      <Paragraph position="0"> In our second approach, we used an automatically acquired significance vector for each word, based on how often the word occurred in each class. Each value v(w, c_i) in a significance vector was the frequency of occurrence of word w in class c_i divided by the total frequency of word w across all classes. These significance vectors were computed for the words of 2493 library titles from the three classes CS, HP, and MG.</Paragraph>
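      <Paragraph position="1"> The significance vector defined above can be illustrated with a minimal Python sketch. The tokenization (lowercasing and whitespace splitting), the function name significance_vectors, and the four toy titles are assumptions made for illustration only; they are not taken from the library corpus or from the original implementation.

```python
from collections import Counter, defaultdict

def significance_vectors(titles, classes):
    """Return {word: [v(w, c) for c in classes]} from (label, title) pairs.

    v(w, c) = frequency of w in class c / total frequency of w in all classes.
    """
    counts = defaultdict(Counter)  # counts[word][label] = frequency
    for label, title in titles:
        # Assumed tokenization: lowercase, split on whitespace.
        for word in title.lower().split():
            counts[word][label] += 1
    vectors = {}
    for word, per_class in counts.items():
        total = sum(per_class.values())
        vectors[word] = [per_class[c] / total for c in classes]
    return vectors

CLASSES = ["CS", "HP", "MG"]

# Invented toy titles, not data from the paper.
toy_titles = [
    ("CS", "software engineering methods"),
    ("CS", "history of software"),
    ("HP", "history of europe"),
    ("MG", "geology of europe"),
]

vectors = significance_vectors(toy_titles, CLASSES)
print(vectors["software"])
print(vectors["history"])
```

In this toy data, a word seen only in CS titles gets all of its weight on the CS component, while a word spread evenly over the classes gets a flat vector. The components of every vector sum to 1 by construction.</Paragraph>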
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Elimination of Insignificant Words
</SectionTitle>
      <Paragraph position="0"> In the third approach we analyzed the most frequent words in the 2493 titles of the second approach. We eliminated words that occurred more than five times in our corpus and that were prepositions, conjunctions, articles, or pronouns. The remaining words were represented with the same significance vectors as in the second approach. Eliminating these frequently occurring, domain-independent words was expected to make classification easier, since the titles then contained fewer insignificant words.</Paragraph>
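      <Paragraph position="1"> The elimination rule above can be sketched in Python as follows. The closed-class word list FUNCTION_WORDS, the function name eliminate_insignificant, and the toy titles are hypothetical; the text does not list the words that were actually removed, so this is only one possible reading of the rule (a word is dropped when it is a function word and occurs more than five times in the corpus).

```python
from collections import Counter

# Hypothetical list of prepositions, conjunctions, articles, and pronouns.
FUNCTION_WORDS = {"of", "the", "and", "in", "on", "to", "a", "an", "for", "its"}

def eliminate_insignificant(titles, function_words=FUNCTION_WORDS, threshold=5):
    """Drop function words that occur more than `threshold` times in the corpus."""
    # Assumed tokenization: lowercase, split on whitespace.
    tokenized = [title.lower().split() for title in titles]
    freq = Counter(word for tokens in tokenized for word in tokens)
    drop = {word for word in function_words if freq[word] > threshold}
    return [[word for word in tokens if word not in drop] for tokens in tokenized]

# Invented toy corpus, not data from the paper.
toy_corpus = ["history of the computer"] * 6 + ["guidelines on software"]
cleaned = eliminate_insignificant(toy_corpus)
print(cleaned[0])
print(cleaned[-1])
```

Both conditions must hold for a word to be removed: a frequent content word such as "history" is kept, and so is a function word that stays at or below the frequency threshold.</Paragraph>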
    </Section>
  </Section>
</Paper>