<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1043">
  <Title>A language-independent shallow-parser Compiler</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Full syntactic parsers of unrestricted text are costly to develop, costly to run, and often yield errors, owing to the limited robustness of wide-coverage grammars and to attachment ambiguities.</Paragraph>
    <Paragraph position="1"> This has led, as early as 1958 (Joshi &amp; Hopely 97), to the development of shallow-parsers, which aim at identifying, as quickly and accurately as possible, the main constituents (and possibly the syntactic functions) of an input, without dealing with the most difficult problems encountered in &amp;quot;full-parsing&amp;quot;. Hence, shallow-parsers are very practical tools. There are two main techniques used to develop shallow-parsers: 1- probabilistic techniques (e.g. Magerman 94, Ratnaparkhi 97, Daelemans &amp; al. 99); 2- finite-state techniques (e.g. Grefenstette 96). Probabilistic techniques require large amounts of syntactically-annotated training data (we leave aside unsupervised learning techniques here, since to our knowledge they have not proved successful for developing practical shallow-parsers), which makes them unsuitable for languages for which no such data is available (i.e. most languages except English); moreover, they are neither domain-independent nor &amp;quot;style-independent&amp;quot; (e.g. they cannot successfully shallow-parse speech if no annotated data is available for that &amp;quot;style&amp;quot;). Finally, a shallow-parser developed using these techniques will mirror the information contained in its training data. For instance, if one trains such a tool on data where only non-recursive NP chunks are marked, one will not be able to obtain richer information such as chunks of other categories, embeddings, or syntactic functions.</Paragraph>
    <Paragraph position="2"> On the other hand, finite-state techniques rely on the development of a large set of rules (often based on regular expressions) to capture all the ways a constituent can expand. For example, to detect English NPs, one could write rules such as the following:</Paragraph>
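The paper's original example rules are not preserved in this copy. The following Python sketch (our own illustration, with invented tag names, not the authors' rules) shows the flavor of such regular-expression NP rules written over a sequence of space-separated POS tags:

```python
import re

# Illustrative only: finite-state NP rules over space-separated POS tags.
# Each rule must enumerate a full way an NP can be rewritten.
NP_RULES = [
    r"DET (ADJ )*NOUN( NOUN)*",   # e.g. "the big red car factory"
    r"(ADJ )*NOUN( NOUN)*",       # bare noun phrases
    r"PRON",                      # pronouns
]
NP_PATTERN = re.compile("|".join(f"(?:{r})" for r in NP_RULES))

tagged = "DET ADJ NOUN VERB DET NOUN"
print([m.group() for m in NP_PATTERN.finditer(tagged)])
# ['DET ADJ NOUN', 'DET NOUN']
```

Even this toy grammar already needs several alternatives, and every construction left out (coordination, genitives, ...) is a potential coverage hole, which is the point made in the next paragraph.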
    <Paragraph position="4"> But this is time-consuming and difficult, since one needs to foresee all possible rewriting cases; if some rule is forgotten, or if too many POS-tagging errors remain in the input, robustness and/or accuracy will suffer.</Paragraph>
    <Paragraph position="5"> These regular expressions then have to be manipulated, i.e. transformed into automata, which are then determinized and minimized (both costly operations). And even though determinization and minimization must, in theory, be done only once for a given set of rules, it is still costly to port such tools to a new set of rules (e.g. for a new language or a new domain) or to change some existing rules.</Paragraph>
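To give a concrete sense of why determinization can be costly (our illustration, not from the paper): the classic language of strings over {a, b} whose n-th symbol from the end is a has an (n+1)-state NFA, yet its determinized automaton needs 2^n states. A small subset-construction sketch exhibits the blowup:

```python
# Illustrative sketch: subset construction for the (n+1)-state NFA
# recognizing "the n-th symbol from the end is 'a'".
def dfa_size(n):
    def step(subset, sym):
        nxt = set()
        for s in subset:
            if s == 0:
                nxt.add(0)          # state 0 loops on both symbols
                if sym == "a":
                    nxt.add(1)      # guess this 'a' is n-th from the end
            elif s < n:
                nxt.add(s + 1)      # count down the remaining symbols
        return frozenset(nxt)

    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:                      # explore all reachable DFA states
        cur = todo.pop()
        for sym in "ab":
            nxt = step(cur, sym)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen)

print([dfa_size(n) for n in (1, 2, 3, 4)])  # [2, 4, 8, 16]
```

This worst case rarely bites in full for linguistic rule sets, but it illustrates why re-running determinization after every rule change is an expense worth avoiding.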
    <Paragraph position="6"> In this paper, we argue that, in order to accomplish the same task, it is unnecessary to develop full sets of regular expressions: instead of specifying all the ways a constituent can be rewritten, it is sufficient to express how it begins and/or ends (see (Abney 91) for the definition of a chunk).</Paragraph>
    <Paragraph position="7"> This makes it possible to achieve similar results with far fewer rules, and without any need for determinization or minimization, because rules written this way are de facto deterministic. So, in a sense, our approach bears some similarities with the constraint-based formalism, since we resort to &amp;quot;local rules&amp;quot; (Karlsson &amp; al. 95); but we focus on identifying constituent boundaries (not syntactic functions), and we allow any level of embedding thanks to the use of a stack.</Paragraph>
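To make the boundary-based idea concrete, here is a minimal sketch of our own (it is not the authors' compiler, and the tag names are invented): rather than enumerating full NP patterns, we state only which tags may open an NP and which may continue it, closing the chunk at the first tag that does neither. A stack records the open constituent; with several categories, the same stack would permit embedding:

```python
# Hedged illustration of boundary rules (not the paper's actual system).
NP_OPENERS = {"DET", "PRON"}   # tags that may begin an NP
NP_BODY = {"ADJ", "NOUN"}      # tags that may continue an open NP

def chunk(tags):
    out, stack = [], []
    for tag in tags:
        if tag in NP_OPENERS and not stack:
            stack.append("NP")         # open a constituent
            out.append("[NP")
        elif stack and tag not in NP_BODY:
            out.append("]")            # close at the first non-NP tag
            stack.pop()
        out.append(tag)
    if stack:                          # close anything still open at the end
        out.append("]")
    return " ".join(out)

print(chunk("DET ADJ NOUN VERB DET NOUN".split()))
# [NP DET ADJ NOUN ] VERB [NP DET NOUN ]
```

Two begin/end conditions replace the whole family of rewriting rules above, and since each input tag triggers at most one action, the rules are deterministic by construction, with no automaton compilation step.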
    <Paragraph position="8"> In the first part of this paper, we present our tool, a shallow-parser compiler. In the second part, we present output samples as well as several evaluations for French and for English, where the tool has been used to develop both an NP-chunker and a richer shallow-parser. We also explain why our approach is more tolerant of POS-tagging errors. Finally, we discuss some other practical uses of this shallow-parser compiler.</Paragraph>
  </Section>
</Paper>