File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/02/j02-1005_abstr.xml
Size: 2,584 bytes
Last Modified: 2025-10-06 13:42:24
<?xml version="1.0" standalone="yes"?> <Paper uid="J02-1005"> <Title>Squibs and Discussions: The DOP Estimation Method Is Biased and Inconsistent</Title> <Section position="2" start_page="0" end_page="0" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> The data-oriented parsing (DOP) approach to statistical natural language analysis has attracted considerable attention recently and has been used to produce statistical language models based on various kinds of linguistic representation, as described in Bod (1998). These models are based on the intuition that statistical generalizations about natural languages should be stated in terms of &quot;chunks&quot; or &quot;fragments&quot; of linguistic representations. Linguistic representations are produced by combining these fragments, but unlike in stochastic models such as Probabilistic Context-Free Grammars, a single linguistic representation may be generated by several different combinations of fragments. These fragments may be large, permitting DOP models to describe nonlocal dependencies. Usually the fragments used in a DOP model are themselves obtained from a training corpus of linguistic representations. For example, in DOP1 or Tree-DOP the fragments are typically all the connected multinode trees that appear as subgraphs of any tree in the training corpus.</Paragraph> <Paragraph position="1"> This note shows that the estimation procedure standardly used to set the parameters, or fragment weights, of a DOP model (see, for example, Bod [1998]) is biased and inconsistent. Inconsistency means that, as sample size increases, the corresponding sequence of probability distributions estimated by this procedure does not converge to the true distribution that generated the training data. 
Consistency is usually regarded as the minimal requirement any estimation method must satisfy (Breiman 1973; Shao 1999), and the inconsistency of the standard DOP estimation method suggests it may be worth looking for other estimation methods. Note that while the bulk of DOP research uses the estimation procedure studied here, some recent research has used other estimators for DOP models (Bonnema, Buying, and Scha 1999; Bod 2000), and it would be interesting to investigate the statistical properties of these estimators as well.</Paragraph> <Paragraph position="2"> Depictions of three different derivations of the same tree representation of &quot;Alex likes pizza&quot;, with arrows indicating the sites of tree fragment substitutions.</Paragraph> </Section></Paper>