File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/03/p03-1055_intro.xml
Size: 3,549 bytes
Last Modified: 2025-10-06 14:01:48
<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1055"> <Title>Deep Syntactic Processing by Combining Shallow Methods</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Annotation of empty elements </SectionTitle> <Paragraph position="0"> Different linguistic theories offer various treatments of non-local head-dependent relations (referred to by several other terms such as extraction, discontinuity, movement or long-distance dependencies).</Paragraph> <Paragraph position="1"> The underlying idea, however, is the same: extraction sites are marked in the syntactic structure and this mark is connected (co-indexed) to the controlling constituent.
Table 1:
Type        Freq.   Example
NP-NP        987    Sam was seen *
WH-NP        438    the woman who you saw *T*
PRO-NP       426    * to sleep is nice
COMP-SBAR    338    Sam said 0 Sasha snores
UNIT         332    $ 25 *U*
WH-S         228    Sam had to go, Sasha said *T*
WH-ADVP      120    Sam told us how he did it *T*
CLAUSE       118    Sam had to go, Sasha said 0
COMP-WHNP     98    the woman 0 we saw *T*
ALL         3310
</Paragraph> <Paragraph position="2"> The experiments reported here rely on a training corpus annotated with non-local dependencies as well as phrase-structure information. We used the Wall Street Journal (WSJ) part of the Penn Treebank (Marcus et al., 1993), where extraction is represented by co-indexing an empty terminal element (henceforth EE) to its antecedent. Without committing ourselves to any syntactic theory, we adopt this representation.</Paragraph> <Paragraph position="3"> Following the annotation guidelines (Bies et al., 1995), we distinguish seven basic types of EEs: controlled NP-traces (NP), PROs (PRO), traces of A'-movement (mostly wh-movement: WH), empty complementizers (COMP), empty units (UNIT), and traces representing pseudo-attachments (shared constituents, discontinuous dependencies, etc.: PSEUDO) and ellipsis (ELLIPSIS).
These labels, however, do not identify the EEs uniquely: for instance, the label WH may represent an extracted NP object as well as an adverb moved out of the verb phrase. In order to facilitate antecedent recovery and to disambiguate the EEs, we also annotate them with their parent nodes. Furthermore, to allow straightforward comparison with previous work (Johnson, 2002), a new label CLAUSE is introduced for COMP-SBAR whenever it is followed by a moved clause WH-S. Table 1 summarizes the most frequent types occurring in the development data, Section 0 of the WSJ corpus, and gives an example for each, following Johnson (2002).</Paragraph> <Paragraph position="4"> For the parsing and antecedent recovery experiments, in the case of WH-traces (WH-*) and controlled NP-traces (NP-NP), we follow the standard technique of marking nodes dominating the empty element up to but not including the parent of the antecedent as defective (missing an argument) with a gap feature (Gazdar et al., 1985; Collins, 1997). Furthermore, to make antecedent co-indexation possible with many types of EEs, we generalize Collins' approach by enriching the annotation of non-terminals with the type of the EE in question (e.g., WH-NP) by using different gap+ features (gap+WH-NP; cf. Figure 1). The original non-terminals augmented with gap+ features serve as new non-terminal labels.</Paragraph> <Paragraph position="5"> In the experiments, Sections 2-21 were used to train the models, Section 0 served as a development set for testing and improving the models, and we report results on the standard test set, Section 23.</Paragraph> </Section> </Paper>
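The gap-feature marking described in Paragraph 4 can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Node` class, the `thread_gap` function, and the bracketed label format `S[gap+WH-NP]` are all hypothetical choices made for this example, since the text only states that non-terminals augmented with gap+ features serve as new labels. The tree is a simplified version of the WH-NP example "the woman who you saw *T*".

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """A phrase-structure node (hypothetical helper, not from the paper)."""
    label: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, label: str) -> "Node":
        """Attach a new child with the given label and return it."""
        child = Node(label, parent=self)
        self.children.append(child)
        return child


def thread_gap(empty: Node, antecedent: Node, ee_type: str) -> List[Node]:
    """Mark every node dominating `empty`, up to but not including the
    parent of `antecedent`, with a gap+TYPE feature.

    The "[gap+TYPE]" suffix is an assumed label format for illustration.
    """
    stop = antecedent.parent
    marked = []
    node = empty.parent
    while node is not None and node is not stop:
        node.label = f"{node.label}[gap+{ee_type}]"
        marked.append(node)
        node = node.parent
    return marked


# Simplified tree for "the woman who you saw *T*"
np_top = Node("NP")
np_top.add("NP")          # "the woman"
sbar = np_top.add("SBAR")
whnp = sbar.add("WHNP")   # antecedent: "who"
s = sbar.add("S")
s.add("NP")               # "you"
vp = s.add("VP")
vp.add("VBD")             # "saw"
obj = vp.add("NP")        # parent node of the empty element
trace = obj.add("*T*")    # the empty element itself

marked = thread_gap(trace, whnp, "WH-NP")
print([n.label for n in marked])
# ['NP[gap+WH-NP]', 'VP[gap+WH-NP]', 'S[gap+WH-NP]']
```

In this sketch SBAR, the parent of the antecedent, is left unmarked, matching the "up to but not including" condition. Carrying the EE type in the feature lets different kinds of gaps be told apart on the same path.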