<?xml version="1.0" standalone="yes"?> <Paper uid="N03-3006"> <Title>A low-complexity, broad-coverage probabilistic Dependency Parser for English</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Many extensions to text-based, data-intensive knowledge management approaches, such as Information Retrieval or Data Mining, focus on integrating the impressive recent advances in language technology. For this, they need fast, robust parsers that deliver linguistic data which is meaningful for the subsequent processing stages. This paper presents such a parsing system. Its output is a hierarchical structure of syntactic relations, functional dependency structures, which are discussed in section 2.</Paragraph> <Paragraph position="1"> The parser differs on the one hand from successful Dependency Grammar implementations (e.g. (Lin, 1998), (Tapanainen and Järvinen, 1997)) by using a statistical base, and on the other hand from state-of-the-art statistical approaches (e.g. (Collins, 1999)) by carefully following an established formal grammar theory, Dependency Grammar (DG). It combines two probabilistic models of language, similar to (Collins, 1999), which are discussed in section 3. Both are supervised and based on Maximum Likelihood Estimation (MLE). The first one is based on the lexical probabilities of the heads of phrases, similar to (Collins and Brooks, 1995). It calculates the probability of finding specific syntactic relations (such as subject, sentential object, etc.) between given lexical heads. Two simple extensions for the interaction between several dependents of the same mother node are also used. The second probability model is a PCFG for the production of the VP. 
Although traditional CFGs are not part of DG, VP PCFG rules can model verb subcategorization frames, an important DG component.</Paragraph> <Paragraph position="2"> The parser has been trained, developed and tested on a large collection of syntactically analyzed sentences, the Penn Treebank (Marcus et al., 1993). It is broad-coverage and robust, and it returns an optimal set of partial structures when it fails to find a complete structure for a sentence. It has been designed to keep complexity as low as possible during parsing, so that it is fast enough to be useful for parsing large amounts of unrestricted text. This has been achieved by observing the following constraints, discussed in section 4: * using a restrictive hand-written linguistic grammar</Paragraph> <Paragraph position="3"> The parsing system uses a divide-and-conquer approach. Low-level linguistic tasks that can be reliably solved by finite-state techniques are handed over to them. These low-level tasks are the recognition of parts of speech by means of tagging, and the recognition of base NPs and verbal groups by means of chunking. The parser then relies on the disambiguation decisions of the tagging and chunking stages and can profit from a reduced search space, at the cost of slightly decreased performance due to tagging and chunking errors.</Paragraph> <Paragraph position="4"> The paper ends with a preliminary evaluation of this work in progress.</Paragraph> </Section> </Paper>