<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2413"> <Title>Semantic Role Labelling With Chunk Sequences</Title>

<Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Chunk Sequences as Instances </SectionTitle>

<Paragraph position="0"> All studies of semantic role labelling we are aware of have used constituents as instances for classification.</Paragraph>

<Paragraph position="1"> However, constituents are not available in the shallow syntactic information provided by this task. Two other levels of granularity are available in the data: words and chunks. In a pretest, we found that words are too fine-grained: learners find it very difficult to identify argument boundaries at the word level. Chunks, too, are problematic, since one third of the arguments span more than one chunk, and for one tenth of the arguments the boundaries do not coincide with any chunk boundaries.</Paragraph>

<Paragraph position="2"> We decided to use chunk sequences as instances for classification. They can describe multi-chunk and part-chunk arguments, and by approximating constituents, they allow the use of linguistically informed features. In the sentence in Figure 1, Britain's manufacturing industry forms a sequence of type NP_NP. To make sequences more distinctive, we conflate whole clauses embedded deeper than the target to S: For the target transforming, we characterise the sequence for &quot;to boost exports&quot; as S rather than VP_NP. An argument boundary inside a chunk is indicated by the part of speech of the last included word: For &quot;boost&quot; the sequence is VP(NN).</Paragraph>

<Paragraph position="3"> To determine &quot;good&quot; sequences, we collected argument realisations from the training corpus, generalising them by simple heuristics (e.g. removing anything enclosed in brackets). The generalised argument sequences exhibit a Zipfian distribution (see Fig. 2). NP is by far the most frequent sequence, followed by S. An example of a very infrequent argument chunk sequence is NP_PP_NP_PP_NP_VP_PP_NP_NP (in words: a bonus in the form of charitable donations made from an employer 's treasury).</Paragraph>

<Paragraph position="4"> The chunk sequence approach also allows us to consider the divider chunk sequences that separate arguments and targets. For example, A0s are usually divided from the target by the empty divider, while A2 arguments are separated from it by e.g. a typical A1 sequence. Generalised divider chunk sequences separating actual arguments and targets in the training set show a Zipfian distribution similar to the chunk sequences (see Fig. 2).</Paragraph>

[Figure 1: Britain 's manufacturing industry is transforming itself to boost exports, with POS tags NNP POS VBG NN VBZ VBG PRP TO NN NNS and chunks [NP] [NP] [VP] [NP] [VP] [NP], the final infinitival clause marked [S]]

[Figure 2: Generalised argument chunk sequences and dividers in the training data]

<Paragraph position="5"> As instances to be classified, we consider all sequences whose generalised sequence and divider each appear at least 10 times for an argument in the training corpus, and whose generalised sequence and divider appear together at least 5 times. The first cutoff reduces the number of sequences from 1089 to 87, and the number of dividers from 999 to 120, giving us 581,813 sequences as training data (about twice as many as words), of which 45,707 are actual argument labels. The additional filter for sequence/divider pairs reduces the training data to 354,916 sequences, of which 43,622 are actual arguments. We pay for the filtering by retaining only 87.49% of arguments on the training set (83.32% on the development set).</Paragraph> </Section>
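As a concrete illustration of the representation just described, the following Python sketch builds generalised type strings for candidate spans and applies the two frequency cutoffs. The function names and data layout are our own illustrative assumptions (the paper does not publish code), and we assume that clauses embedded deeper than the target have already been conflated to S.

```python
def sequence_type(chunks, last_pos=None):
    """Type string for a candidate span, e.g. 'NP_NP' or 'VP(NN)'.

    chunks   -- list of chunk labels covered by the span (embedded clauses
                assumed pre-conflated to 'S')
    last_pos -- POS tag of the last included word if the span ends inside
                a chunk (part-chunk argument), else None
    """
    type_str = "_".join(chunks)
    if last_pos is not None:
        type_str += "(%s)" % last_pos
    return type_str

def admissible(seq_type, div_type, seq_counts, div_counts, pair_counts):
    """The two frequency cutoffs described above: sequence and divider each
    seen >= 10 times as an argument, and the pair seen >= 5 times together."""
    return (seq_counts.get(seq_type, 0) >= 10
            and div_counts.get(div_type, 0) >= 10
            and pair_counts.get((seq_type, div_type), 0) >= 5)

# Examples from Figure 1:
print(sequence_type(["NP", "NP"]))           # Britain 's | manufacturing industry -> NP_NP
print(sequence_type(["VP"], last_pos="NN"))  # boundary inside a chunk -> VP(NN)
```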
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Classification </SectionTitle> <Paragraph position="0"/>

<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Maximum Entropy Modelling </SectionTitle>

<Paragraph position="0"> We use a log-linear model as classifier, which defines the probability of a class $a$ given a feature vector $\vec{x}$ as

$$p(a \mid \vec{x}) = \frac{1}{Z} \exp\Big( \sum_i \lambda_i f_i(a, \vec{x}) \Big)$$

where $Z$ is a normalisation constant and the feature weights $\lambda_i$ are estimated subject to the maximum entropy constraint, which ensures that the least committal optimal model is learnt. We used the estimate software for estimation, which implements the LMVM algorithm (Malouf, 2002) and was kindly provided by Rob Malouf.</Paragraph>

<Paragraph position="1"> We chose a maximum entropy approach because it can integrate many different sources of information without assuming independence of features. Also, models with minimal commitment are good predictors of future data. Maxent models have found wide application in NLP in recent years; for semantic role labelling (on FrameNet data) see (Fleischman et al., 2003).</Paragraph> </Section>
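To make the model concrete, here is a minimal sketch of how such a log-linear classifier assigns probabilities. The function names and the sparse feature representation are our own assumptions, and weight estimation (the LMVM algorithm implemented by the estimate software) is deliberately not shown.

```python
import math

def loglinear_prob(weights, feature_fn, classes, a, x):
    """p(a | x) = exp(sum_i lambda_i * f_i(a, x)) / Z(x)."""
    def score(c):
        # sum_i lambda_i * f_i(c, x) over the active features only
        return sum(weights.get(i, 0.0) * v for i, v in feature_fn(c, x).items())
    z = sum(math.exp(score(c)) for c in classes)  # normalisation constant Z(x)
    return math.exp(score(a)) / z

# Toy usage: one indicator feature that fires when an NP sequence is a LABEL.
weights = {"label_and_np": 1.2}
feats = lambda c, x: {"label_and_np": 1.0} if (c == "LABEL" and x == "NP") else {}
print(loglinear_prob(weights, feats, ["LABEL", "NOLABEL"], "LABEL", "NP"))
```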
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Classification Procedure </SectionTitle>

<Paragraph position="0"> The most straightforward procedure would be to have the classifier assign all argument classes plus NOLABEL to sequences. However, this proved ineffective due to the prevalence of NOLABEL: Since this class makes up more than 80% of the training sequences, the classifier concentrates on assigning NOLABEL well.</Paragraph>

<Paragraph position="1"> Therefore, we divide the task of automatic semantic role assignment into two classification subtasks: argument identification and argument labelling. Argument identification is a binary decision for all sequences between LABEL (semantic argument) and NOLABEL (no semantic argument), which allows us to pool the frequencies of all argument labels. Argument labelling then assigns proper semantic roles only to those sequences that were recognised as LABELs in the first step.</Paragraph> </Section>

<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Features </SectionTitle>

<Paragraph position="0"> We experimented with four types of features: shallow (mostly co-occurrence and distance statistics), higher-level (linguistically informed), divider and em (results of the EM-clustering).</Paragraph>

<Paragraph position="1"> Shallow Features. Our shallow features comprise statistics on the current sequence and its position as well as on the target: the sequence itself, the target lemma, the length of the current sequence in chunks, its absolute frequency, its position (before or after the target, as first or last sequence in the sentence), its distance to the target in question and its embedding depth in comparison with the target (with regard to clause embedding). We also count how often we have seen the current sequence as an argument for the current target lemma, and as which argument. Other features describe the context of the sequence: whether it is embedded in an admissible sequence or embeds one, and a two-chunk history. We also list the arguments for which the sequence is the best candidate, judging by its frequency.</Paragraph>

<Paragraph position="2"> Higher-Level Features. Our higher-level features comprise a heuristically determined superchunk label which is an abstraction of the chunk sequence (one of NP, VP, PP, S, ADVP, ADJP, and the rest class THING), the preposition of the sequence (if it either starts with or is directly preceded by a preposition), and the lemma and part of speech of the heuristically determined head of the sequence. We also check if the sequence in question is an NP (by its superchunk) directly before or after the target, if the sequence contains prepositions in unusual positions, if it consists of the word n't or not, and if the target lemma is passive.</Paragraph>

<Paragraph position="3"> Divider Features. These are shallow and higher-level features related to the divider sequences: the divider itself, its superchunk, and a judgement of whether, based on the divider, the sequence is an argument. A similar feature makes this judgement based on the combination of divider and sequence.</Paragraph>

<Paragraph position="4"> Features based on EM-Based Clustering. We use EM-based clustering to measure the fit between a target verb, an argument position of the verb, and the head lemma (or head named entity) of a sequence.</Paragraph>

<Paragraph position="5"> EM-based clustering, originally introduced for the induction of a semantically annotated lexicon (Rooth et al., 1999), regards classes as hidden variables in the context of maximum likelihood estimation from incomplete data via the expectation maximisation algorithm.</Paragraph>

<Paragraph position="6"> In our application, we aim at deriving a probability distribution $p(y)$ on verb-argument pairs $y = (y_1, y_2)$ from the training data. Using the key idea that $y$ is conditioned on an unobserved class $c \in C$, we define the probability of a pair as

$$p(y) = \sum_{c \in C} p(c, y) = \sum_{c \in C} p(c)\, p(y \mid c) = \sum_{c \in C} p(c)\, p(y_1 \mid c)\, p(y_2 \mid c)$$

The last step is warranted by the assumption that $y_1$ and $y_2$ are independent and are only conditioned on $c$. This assumption makes clustering feasible in the first place. We use the EM algorithm to maximise the incomplete data log-likelihood $L = \sum_y \tilde{p}(y) \log p(y)$ as a function of the probability distribution $p$ for a given empirical probability distribution $\tilde{p}$.</Paragraph>

<Paragraph position="7"> In two additional features, we replace the head word with the sequence and the divider characterisation respectively, using EM clustering to measure the fit between target verb, argument position, and sequence (or divider).</Paragraph> </Section> </Section>
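The latent-class model above can be trained with a few lines of EM. The following Python sketch implements the E- and M-steps for $p(y) = \sum_c p(c)\, p(y_1 \mid c)\, p(y_2 \mid c)$; the data format, random initialisation, and fixed iteration count are our illustrative assumptions (Rooth et al. (1999) describe the original procedure).

```python
import random
from collections import defaultdict

def em_cluster(pairs, num_classes, iterations=50, seed=0):
    """EM for the latent-class model, e.g. pairs like
    [("transform:A1", "itself"), ("boost:A1", "export"), ...]."""
    rng = random.Random(seed)
    y1_vocab = sorted({y1 for y1, _ in pairs})
    y2_vocab = sorted({y2 for _, y2 in pairs})
    classes = list(range(num_classes))

    def random_dist(keys):
        raw = {k: rng.random() + 1e-3 for k in keys}
        z = sum(raw.values())
        return {k: v / z for k, v in raw.items()}

    # Random initialisation of p(c), p(y1|c), p(y2|c).
    p_c = random_dist(classes)
    p_y1 = {c: random_dist(y1_vocab) for c in classes}
    p_y2 = {c: random_dist(y2_vocab) for c in classes}

    for _ in range(iterations):
        # E-step: posterior class probabilities for every observed pair.
        n_c, n_y1, n_y2 = defaultdict(float), defaultdict(float), defaultdict(float)
        for y1, y2 in pairs:
            joint = {c: p_c[c] * p_y1[c][y1] * p_y2[c][y2] for c in classes}
            z = sum(joint.values())
            for c in classes:
                gamma = joint[c] / z
                n_c[c] += gamma
                n_y1[(c, y1)] += gamma
                n_y2[(c, y2)] += gamma
        # M-step: re-estimate the parameters from the expected counts.
        total = sum(n_c.values())
        p_c = {c: n_c[c] / total for c in classes}
        p_y1 = {c: {y1: n_y1[(c, y1)] / n_c[c] for y1 in y1_vocab} for c in classes}
        p_y2 = {c: {y2: n_y2[(c, y2)] / n_c[c] for y2 in y2_vocab} for c in classes}

    return p_c, p_y1, p_y2
```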
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Finding the Best Chain of Sequences </SectionTitle>

<Paragraph position="0"> Classification only determines probable argument labels for individual chunk sequences. We still have to determine the most probable chain of chunk sequences (such as A0 A1) that covers the whole sentence.</Paragraph>

<Paragraph position="1"> Recall that there are about 1.6 times as many sequences as words, many of which overlap; therefore, exhaustive search is infeasible. Instead, we first run a beam search with a simple probability model to identify the $n$ most probable chains of chunk sequences. Then, we re-rank them to take global considerations into account.</Paragraph>

<Paragraph position="2"> Beam Search. For each sentence, we build an agenda of partial argument chains. We calculate the probability of each chain as $P = \prod_i p(a_i \mid \vec{x}_i)$, thereby assuming independence of sequences. For each sequence, we add the three most probable classes assigned by the argument labelling step. The result of the beam search is the $n$ most probable (according to $P$) chains that cover the whole sentence. We found that increasing the beam width $n$ to more than 20 improved performance only marginally.</Paragraph>

<Paragraph position="3"> Re-ranking. Due to the independence assumption in the beam search, chains that are assigned high probability may still be globally improbable. We therefore multiply each chain's probability $P$ by its empirical probability $\tilde{P}$ in the training data, using $\tilde{P}$ as a prior. However, since these counts are still sparse, we exploit the fact that duplicate argument labels (i.e. discontinuous arguments) are relatively infrequent in the PropBank data by discounting chains with duplicate arguments by a factor $d$, which we optimised empirically.</Paragraph> </Section>
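To make the search procedure concrete, here is a minimal Python sketch of the beam search and re-ranking described above. The candidate representation, the assumption that candidate sequences tile the sentence without gaps, and the placeholder discount value are our own; the empirical prior is passed in as a function rather than modelled.

```python
import heapq

def beam_search(candidates, sent_len, beam_width=20, top_k=3):
    """candidates: {start_pos: [(end_pos, {label: prob}), ...]}.
    Returns complete chains covering positions 0..sent_len."""
    agenda = [(1.0, 0, ())]  # (chain probability P, next position, labels so far)
    complete = []
    while agenda:
        next_agenda = []
        for prob, pos, chain in agenda:
            if pos == sent_len:
                complete.append((prob, chain))
                continue
            for end, label_probs in candidates.get(pos, []):
                # keep only the top_k classes per sequence, as described above
                best = sorted(label_probs.items(), key=lambda kv: -kv[1])[:top_k]
                for label, p in best:
                    next_agenda.append((prob * p, end, chain + (label,)))
        # prune the agenda to the beam_width most probable partial chains
        agenda = heapq.nlargest(beam_width, next_agenda)
    return complete

def rerank(chains, empirical_prior, dup_discount=0.5):
    """Multiply each chain's probability by its empirical prior and discount
    duplicate argument labels; dup_discount is a placeholder, not the
    paper's empirically optimised value."""
    best_score, best_chain = 0.0, None
    for prob, chain in chains:
        score = prob * empirical_prior(chain)
        args = [a for a in chain if a != "NOLABEL"]
        if len(args) != len(set(args)):  # duplicate label = discontinuous argument
            score *= dup_discount
        if score > best_score:
            best_score, best_chain = score, chain
    return best_chain
```

</Paper>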