<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0406">
  <Title>Identifying non-referential it: a machine learning approach incorporating linguistically motivated patterns</Title>
  <Section position="10" start_page="46" end_page="46" type="concl">
    <SectionTitle>
7 Conclusion
</SectionTitle>
    <Paragraph position="0"> The accurate classification of it as referential or non-referential is important for natural language tasks such as reference resolution (Ng and Cardie, 2002).</Paragraph>
    <Paragraph position="1"> Through an examination of the types of constructions containing non-referential it, we were able to develop a set of detailed grammatical patterns associated with non-referential it. In previous rule-based systems, word lists were created for the verbs and adjectives which often occur in these patterns. Such a system can be limited because it is unable to adapt to new texts, but the basic grammatical patterns are still reasonably consistent indicators of non-referential it. Given a POS-tagged corpus, the relevant linguistic patterns can be generalized over part-of-speech tags, reducing the dependence on brittle word lists. A machine learning algorithm is able to adapt to new texts and new words, but it is less able to generalize about the linguistic patterns from a small training set. To be able to use our knowledge of relevant linguistic patterns without having to specify lists of words as indicators of certain types of it, we developed a machine learning system which incorporates the relevant patterns as features alongside part-of-speech and lexical information. Two short lists are still used to help identify weather it and a few idioms. The k-nearest neighbors algorithm from the Tilburg Memory-Based Learner (TiMBL) was used with 25 features and achieved 88% accuracy, 82% precision, and 71% recall for the binary classification of it as referential or non-referential.</Paragraph>
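The idea of generalizing a pattern over part-of-speech tags and feeding the match to a nearest-neighbor classifier can be sketched roughly as follows. This is a minimal illustration, not the system described above: the single pattern (a verb, an optional adverb, an adjective, then "that" or "to"), the Penn Treebank style tags, the three-feature vector, the overlap distance, and the toy sentences are all assumptions made for the example, and the classifier is a plain Python stand-in rather than TiMBL.

```python
from collections import Counter

# Hypothetical tag sets (Penn Treebank style), chosen for illustration only.
VERB_TAGS = {"VBZ", "VBD"}
ADJ_TAGS = {"JJ"}
CLAUSE_TAGS = {"IN", "TO"}  # e.g. "that" (IN) or "to" (TO)


def extraposition_pattern(tagged, i):
    """Return 1 if the token at index i is followed by the tag sequence
    VERB (RB)? JJ (IN|TO), else 0. No word list is consulted."""
    tags = [tag for _, tag in tagged[i + 1:i + 5]]
    if not tags or tags[0] not in VERB_TAGS:
        return 0
    rest = tags[1:]
    if rest and rest[0] == "RB":  # optional adverb, as in "it was very hard to"
        rest = rest[1:]
    if len(rest) >= 2 and rest[0] in ADJ_TAGS and rest[1] in CLAUSE_TAGS:
        return 1
    return 0


def features(tagged, i):
    """Feature tuple: (pattern match, next POS tag, next word lowercased)."""
    has_next = len(tagged) > i + 1
    next_word, next_tag = tagged[i + 1] if has_next else ("", "")
    return (extraposition_pattern(tagged, i), next_tag, next_word.lower())


def overlap_distance(a, b):
    """Count the feature positions whose values differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def knn_classify(train, query, k=3):
    """Majority vote among the k training examples closest to query."""
    nearest = sorted(train, key=lambda ex: overlap_distance(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy, hand-tagged sentences invented for this sketch (not corpus data).
s1 = [("It", "PRP"), ("is", "VBZ"), ("clear", "JJ"), ("that", "IN"), ("he", "PRP"), ("left", "VBD")]
s2 = [("It", "PRP"), ("was", "VBD"), ("very", "RB"), ("hard", "JJ"), ("to", "TO"), ("see", "VB")]
s3 = [("It", "PRP"), ("barked", "VBD"), ("loudly", "RB")]
s4 = [("It", "PRP"), ("is", "VBZ"), ("a", "DT"), ("dog", "NN")]

train = [
    (features(s1, 0), "non-referential"),
    (features(s2, 0), "non-referential"),
    (features(s3, 0), "referential"),
    (features(s4, 0), "referential"),
]

query = [("It", "PRP"), ("is", "VBZ"), ("likely", "JJ"), ("to", "TO"), ("rain", "VB")]
print(knn_classify(train, features(query, 0), k=3))  # prints "non-referential"
```

In this sketch the adjective "likely" never appears in the training sentences, yet the query still matches the pattern because the match depends on tags alone. The lexical feature (the next word) is kept alongside it so that the nearest-neighbor vote can use both kinds of evidence.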
    <Paragraph position="2"> Our classifier outperforms previous systems in both accuracy and precision, but recall is still a problem. Many instances of non-referential it are difficult to identify because typical clues such as complementizers and relative pronouns can be omitted.</Paragraph>
    <Paragraph position="3"> Because of this, subordinate and relative clauses cannot be consistently identified given only a POS-tagged corpus. Improvements could be made in the future by integrating chunking or parsing into the pattern-matching features used in the system. This would help in identifying extrapositional and cleft it.</Paragraph>
    <Paragraph position="4"> Knowledge about context beyond the sentence level will be needed to accurately identify certain types of cleft, weather, and idiomatic constructions.</Paragraph>
  </Section>
</Paper>