<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0639">
  <Title>The Integration of Syntactic Parsing and Semantic Role Labeling</Title>
  <Section position="3" start_page="0" end_page="238" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Semantic parsing, the task of identifying and classifying the semantic entities in context and the relations between them, potentially has great impact on downstream applications such as text summarization, question answering, and machine translation. As a result, semantic parsing could be an important intermediate step for natural language comprehension. In this paper, we investigate the task of Semantic Role Labeling (SRL): given a verb in a sentence, the goal is to locate the constituents that are arguments of the verb and assign them appropriate semantic roles, such as Agent, Patient, and Theme.</Paragraph>
    <Paragraph position="1"> Previous SRL systems have explored the effects of using different lexical features and have experimented with different machine learning algorithms (Gildea and Palmer, 2002; Pradhan et al., 2005; Punyakanok et al., 2004). However, these SRL systems generally extract features from sentences processed by a syntactic parser or by other shallow parsing components, such as a chunker and a clause identifier. As a result, the performance of the SRL systems relies heavily on those syntax-analysis tools.</Paragraph>
    <Paragraph position="2"> In order to improve the fundamental performance of an SRL system, we trained parsers with training data containing not only syntactic constituent information but also semantic argument information. The new parsers generate more correct constituents than those trained on purely syntactic information. Because the new parsers generate different constituents from those of a pure syntactic parser, we also explore the possibility of combining the output of several parsers with the help of a voting post-processing component.</Paragraph>
    <Paragraph position="3"> This paper is organized as follows: Section 2 describes the components of our SRL system.</Paragraph>
    <Paragraph position="4"> We elaborate on the importance of training a new parser and outline our approach in Section 3 and Section 4.</Paragraph>
    <Section position="1" start_page="0" end_page="237" type="sub_section">
      <SectionTitle>
2.1 Parsing
</SectionTitle>
      <Paragraph position="0"> Previous SRL systems usually use a pure syntactic parser, such as those of Charniak (2000) and Collins (1999), to retrieve possible constituents. Once the boundary of a constituent is defined, there is no way to change it in later phases. Therefore the quality of the syntactic parser has a major impact on the final performance of an SRL system, and the percentage of correct constituents generated by the syntactic parser also defines the recall upper bound of an SRL system. To attack this problem, in addition to Charniak's parser (Charniak, 2000), our system combines two parsers that are trained on both syntactic constituent information and semantic argument information (see Section 3).</Paragraph>
    </Section>
    <Section position="2" start_page="237" end_page="237" type="sub_section">
      <SectionTitle>
2.2 Pruning
</SectionTitle>
      <Paragraph position="0"> Given a parse tree, a pruning component filters out the constituents that are unlikely to be semantic arguments in order to facilitate the training of the Argument Identification component. Our system uses the heuristic rules introduced by Xue and Palmer (2004). The heuristics first locate the verb and then extract all the sister nodes along the verb spine of the parse tree. We expand the coverage by also extracting all the immediate children of S, ADVP, PP, and NP nodes. This stage generally prunes about 80% of the constituents given by a parser. For our newly trained parsers, we also extract constituents that have a secondary constituent label indicating that the constituent in question is an argument.</Paragraph>
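      <Paragraph> The pruning pass described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in our system: the Node class, the prune_candidates function, and the toy sentence are hypothetical, and the rules of Xue and Palmer (2004) are reproduced only as far as the text states them.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Labels whose immediate children are also kept (the coverage expansion in the text).
EXPAND_LABELS = {"S", "ADVP", "PP", "NP"}


@dataclass(eq=False)
class Node:
    """Hypothetical minimal parse-tree node (an assumption for this sketch)."""
    label: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)
    word: Optional[str] = None

    def add(self, *kids: "Node") -> "Node":
        """Attach children and set their parent pointers."""
        for kid in kids:
            kid.parent = self
            self.children.append(kid)
        return self


def prune_candidates(verb: Node) -> List[Node]:
    """Collect the sisters of every node on the spine from the verb to the root."""
    candidates: List[Node] = []
    current = verb
    while current.parent is not None:
        for sister in current.parent.children:
            if sister is current:
                continue
            candidates.append(sister)
            # Coverage expansion: also keep the immediate children.
            if sister.label in EXPAND_LABELS:
                candidates.extend(sister.children)
        current = current.parent
    return candidates


# Toy tree for "The dog chased the cat in the park".
subj = Node("NP").add(Node("DT", word="The"), Node("NN", word="dog"))
obj = Node("NP").add(Node("DT", word="the"), Node("NN", word="cat"))
pp_np = Node("NP").add(Node("DT", word="the"), Node("NN", word="park"))
pp = Node("PP").add(Node("IN", word="in"), pp_np)
verb = Node("VBD", word="chased")
vp = Node("VP").add(verb, obj, pp)
root = Node("S").add(subj, vp)

kept = prune_candidates(verb)
print([node.label for node in kept])
```

      In this toy tree the subject NP, the object NP, the PP, and the NP inside the PP are all retained as candidates, while nodes deeper inside the PP are dropped.</Paragraph>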
    </Section>
    <Section position="3" start_page="237" end_page="237" type="sub_section">
      <SectionTitle>
2.3 Argument Identification and Classification
</SectionTitle>
      <Paragraph position="0"> Our Argument Identification component is a binary maximum-entropy classifier that determines whether a constituent is an argument or not. If a constituent is tagged as an argument, the Argument Classification component, which is a multi-class maximum-entropy classifier, assigns it a semantic role. The implementation of both the Argument Identification and Classification components makes use of the Mallet package.</Paragraph>
      <Paragraph position="1"> The lexical features we use to train these two components are taken from (Xue and Palmer, 2004).</Paragraph>
      <Paragraph position="2"> We trained the Argument Identification component with the following single features: the path from the constituent to the verb, the head word of the constituent and its POS tag, and the distance between the verb and the constituent. We also used the following feature combinations: the verb and the phrasal type of the constituent, and the verb and the head word of the constituent. If the parent node of the constituent is a PP node, then we also include the head word of the PP node and the feature combination of the verb and the head word of the PP node.</Paragraph>
      <Paragraph position="3"> In addition to the features listed above, the Argument Classification component also uses the following features: the verb, the first and the last content word of the constituent, the phrasal type of the left sibling and the parent node, voice (passive or active), position of the constituent relative to the verb, the subcategorization frame, and the syntactic frame, which describes the sequential pattern of the noun phrases and the verb in the sentence.</Paragraph>
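      <Paragraph> As an illustration of the path feature listed above, the following sketch walks from a constituent up to the lowest common ancestor and then down to the verb. It is a minimal example under stated assumptions: the Node class, the path_feature function, and the use of "^" for an upward step and "!" for a downward step are hypothetical choices and are not taken from our implementation.

```python
from typing import List, Optional


class Node:
    """Hypothetical minimal tree node (an assumption for this sketch)."""

    def __init__(self, label: str, children: Optional[List["Node"]] = None):
        self.label = label
        self.parent: Optional["Node"] = None
        self.children = children or []
        for child in self.children:
            child.parent = self


def ancestors(node: Node) -> List[Node]:
    """Return the chain from the node itself up to the root."""
    chain = [node]
    while chain[-1].parent is not None:
        chain.append(chain[-1].parent)
    return chain


def path_feature(constituent: Node, verb: Node) -> str:
    """Labels from the constituent up to the common ancestor, then down to the verb."""
    up = ancestors(constituent)
    down = ancestors(verb)
    # Strip the shared upper part, remembering the lowest common ancestor.
    common = None
    while up and down and up[-1] is down[-1]:
        common = up.pop()
        down.pop()
    if common is None:
        raise ValueError("nodes are not in the same tree")
    up_part = "^".join(node.label for node in up + [common])
    down_part = "".join("!" + node.label for node in reversed(down))
    return up_part + down_part


# Toy tree: (S (NP) (VP (VBD) (NP)))
subject = Node("NP")
verb = Node("VBD")
obj = Node("NP")
vp = Node("VP", [verb, obj])
root = Node("S", [subject, vp])

print(path_feature(subject, verb))  # NP^S!VP!VBD
print(path_feature(obj, verb))      # NP^VP!VBD
```

      The two example paths differ only in the ancestors they pass through, which is the information the classifier can use to separate subject-like from object-like constituents.</Paragraph>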
    </Section>
    <Section position="4" start_page="237" end_page="237" type="sub_section">
      <SectionTitle>
2.4 Post Processing
</SectionTitle>
      <Paragraph position="0"> The post-processing component merges adjacent discontinuous arguments and marks the R-arguments based on the content word and phrase type of the argument. It also filters out arguments according to the following constraints: 1. There are no overlapping arguments.</Paragraph>
      <Paragraph position="1"> 2. There are no repeating core arguments.</Paragraph>
      <Paragraph position="2"> In order to combine the different systems, we also include a voting scheme. The algorithm is straightforward: given N participating systems, we first pick the arguments with N votes, then those with N-1 votes, and so on down to those with 1 vote. Ties are broken by the confidence level that the system assigns to the argument. Whenever we pick an argument, we check whether it conflicts with previously selected arguments under the constraints described above.</Paragraph>
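      <Paragraph> The voting scheme can be sketched as a greedy selection. The code below is a minimal illustration under stated assumptions rather than our actual component: arguments are represented as hypothetical (start, end, role) triples over token positions, the core role set ARG0 to ARG5 is assumed, and a tie is broken by the highest confidence that any system assigned to the argument.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Arg = Tuple[int, int, str]  # (first token, last token inclusive, role label)
CORE_ROLES = {"ARG0", "ARG1", "ARG2", "ARG3", "ARG4", "ARG5"}  # assumed core set


def overlaps(a: Arg, b: Arg) -> bool:
    """True when two inclusive token spans share at least one token."""
    return a[1] >= b[0] and b[1] >= a[0]


def vote(system_outputs: List[Dict[Arg, float]]) -> List[Arg]:
    """Pick arguments by vote count, then confidence, subject to the constraints."""
    votes: Dict[Arg, int] = defaultdict(int)
    confidence: Dict[Arg, float] = defaultdict(float)
    for output in system_outputs:
        for arg, score in output.items():
            votes[arg] += 1
            confidence[arg] = max(confidence[arg], score)

    # Most votes first; ties are broken by the higher confidence.
    ranked = sorted(votes, key=lambda a: (votes[a], confidence[a]), reverse=True)

    selected: List[Arg] = []
    for arg in ranked:
        # Constraint 1: no overlapping arguments.
        clash = any(overlaps(arg, kept) for kept in selected)
        # Constraint 2: no repeated core arguments.
        repeat = arg[2] in CORE_ROLES and any(arg[2] == kept[2] for kept in selected)
        if not clash and not repeat:
            selected.append(arg)
    return sorted(selected)


# Three hypothetical systems labeling the same verb.
system_a = {(0, 1, "ARG0"): 0.9, (3, 4, "ARG1"): 0.8}
system_b = {(0, 1, "ARG0"): 0.7, (3, 6, "ARG1"): 0.6}
system_c = {(0, 1, "ARG0"): 0.8, (3, 4, "ARG1"): 0.5, (5, 6, "ARGM-LOC"): 0.4}

result = vote([system_a, system_b, system_c])
print(result)
```

      Here the ARG1 span proposed by two systems wins over the longer ARG1 span proposed by one system, which is then rejected because it overlaps an already selected argument.</Paragraph>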
    </Section>
    <Section position="5" start_page="237" end_page="238" type="sub_section">
      <SectionTitle>
3 Training a Parser with Semantic Argument Information
</SectionTitle>
      <Paragraph position="0"> A good start is always important, especially for a successful SRL system. Instead of passively accepting candidate constituents from the upstream syntactic parser, an SRL system needs to interact with the parser in order to obtain improved performance.</Paragraph>
      <Paragraph position="1"> This motivated our first attempt, which was to integrate syntactic parsing and semantic parsing into a single step, in the hope that we could then discard the SRL pipeline. The idea is to augment the Penn Treebank (Marcus et al., 1994) constituent labels with the semantic role labels from PropBank (Palmer et al., 2005) and thereby generate a rich training corpus. For example, if an NP is also an argument ARG0 of a verb in the given sentence, we change the constituent label NP into NP-ARG0. A parser trained on this new corpus should then be able to serve as an SRL system while predicting a parse.</Paragraph>
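      <Paragraph> The label augmentation can be illustrated with a small sketch. It rests on stated assumptions and is not the procedure we actually ran: trees are hypothetical nested (label, children) tuples with words as plain strings, role annotations are given as a mapping from half-open token spans to role labels, and preterminal (POS) nodes are never relabeled.

```python
from typing import Dict, List, Tuple, Union

Tree = Union[str, Tuple[str, List["Tree"]]]  # a word, or (label, children)
Span = Tuple[int, int]  # half-open token span: (start, end)


def augment(tree: Tree, roles: Dict[Span, str], start: int = 0) -> Tuple[Tree, int]:
    """Return (relabeled tree, end position) for the subtree starting at `start`."""
    if isinstance(tree, str):
        return tree, start + 1  # a word covers exactly one token
    label, children = tree
    new_children: List[Tree] = []
    position = start
    for child in children:
        new_child, position = augment(child, roles, position)
        new_children.append(new_child)
    # Assumption: skip POS-level nodes so that only phrases receive a role suffix.
    is_preterminal = all(isinstance(child, str) for child in children)
    role = roles.get((start, position))
    if role is not None and not is_preterminal:
        label = f"{label}-{role}"
    return (label, new_children), position


# Toy tree for "The dog chased the cat" with hypothetical role spans.
tree = ("S", [
    ("NP", [("DT", ["The"]), ("NN", ["dog"])]),
    ("VP", [
        ("VBD", ["chased"]),
        ("NP", [("DT", ["the"]), ("NN", ["cat"])]),
    ]),
])
roles = {(0, 2): "ARG0", (3, 5): "ARG1"}

augmented, length = augment(tree, roles)
print(augmented)
```

      In the output the subject NP becomes NP-ARG0 and the object NP becomes NP-ARG1, while all other labels are left unchanged.</Paragraph>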
      <Paragraph position="2"> However, this ideal approach is not feasible.</Paragraph>
      <Paragraph position="3"> Because there are many different semantic role labels, and the same constituent can be a different argument of different verbs in the same sentence, the number of constituent labels would quickly grow out of control and make parser training computationally infeasible. Moreover, anchor verb information has not yet been added to the constituent labels, and data sparseness would be a further problem. As a compromise, we decided to integrate only Argument Identification with syntactic parsing. We generated the training corpus by simply marking the constituents that are also semantic arguments.</Paragraph>
    </Section>
  </Section>
</Paper>