<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2184">
  <Title>How Verb Subcategorization Frequencies Are Affected By Corpus Choice</Title>
  <Section position="4" start_page="1123" end_page="1124" type="metho">
    <SectionTitle>
3 Experiment 1
</SectionTitle>
    <Paragraph position="0"> The purpose of the first experiment is to analyze the general (non-verb-specific) differences between argument structure frequencies in the data sources. In order to do this, the data for each verb in the corpus was normalized to remove the effects of verb frequency. The average frequency of each subcategorization frame was calculated for each corpus. The average frequencies for each of the data sources were then compared.</Paragraph>
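    <Paragraph> The normalization step described above can be sketched as follows. This is a minimal illustration, assuming that normalizing means converting each verb's raw frame counts into proportions so that every verb contributes equally regardless of its frequency. The function names, frame labels, and counts are hypothetical and are not the authors' code or data.

```python
from collections import defaultdict

def normalize(frame_counts):
    """Turn one verb's raw frame counts into proportions that sum to 1."""
    total = sum(frame_counts.values())
    return {frame: count / total for frame, count in frame_counts.items()}

def average_frame_frequencies(corpus):
    """Average each frame's proportion over all verbs in one corpus.

    corpus maps each verb to a dict of raw counts per frame; a frame
    that a verb never takes contributes a proportion of 0 for that verb.
    """
    sums = defaultdict(float)
    for counts in corpus.values():
        for frame, share in normalize(counts).items():
            sums[frame] += share
    return {frame: s / len(corpus) for frame, s in sums.items()}

# Hypothetical counts for illustration only (not from the paper).
toy_corpus = {
    "ask": {"NP": 30, "quote": 60, "passive": 10},
    "escape": {"zero": 3, "PP": 1},
}
averages = average_frame_frequencies(toy_corpus)
```

    Because each verb is reduced to proportions before averaging, a very frequent verb such as "ask" carries no more weight than a rare one.</Paragraph>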
    <Section position="1" start_page="1123" end_page="1124" type="sub_section">
      <SectionTitle>
3.1 Results
</SectionTitle>
      <Paragraph position="0"> We found that the three corpora consisting of connected discourse (BC, WSJ, SWBD) shared a common set of differences when compared to the CFJCF sentence production data. There were three general categories of differences between the corpora, and all can be related to discourse type.</Paragraph>
      <Paragraph position="1"> These categories are: (1) passive sentences, (2) zero anaphora, and (3) quotations. The CFJCF single-sentence productions had the smallest number of passive sentences. The connected spoken discourse in Switchboard had more passives, followed by the written discourse in the Wall Street Journal and the Brown Corpus. Passive is generally used in English to emphasize the undergoer (to keep the topic in subject position) and/or to de-emphasize the identity of the agent (Thompson 1987). Both of these reasons are affected by the type of discourse. If there is no preceding discourse, then there is no pre-existing topic to keep in subject position. In addition, with no context for the sentence, there is less likely to be a reason to de-emphasize the agent of the sentence.</Paragraph>
      <Paragraph position="2">  The increase in zero anaphora (not overtly mentioning understood arguments) is caused by two factors. Generally, as the amount of surrounding context increases (going from single sentence to connected discourse) the need to overtly express all of the arguments with a verb decreases.</Paragraph>
      <Paragraph position="3"> Verbs that can describe actions (agree, disappear, escape, follow, leave, sing, wait) were typically used with some form of argument in single sentences, such as: &quot;I had a test that day, so I really wanted to escape from school.&quot; (CFJCF data).</Paragraph>
      <Paragraph position="4"> Such verbs were more likely to be used without any arguments in connected discourse, as in: &quot;She escaped, crawled through the usual mine fields, under barbed wire, was shot at, swam a river, and we finally picked her up in Linz.&quot; (Brown Corpus) In this case, the argument of &quot;escaped&quot; (&quot;imprisonment&quot;) was understood from the previous sentence. Verbs of propositional attitude (agree, guess, know, see, understand) are typically used transitively in written corpora and single-sentence production: &quot;I guessed the right answer on the quiz.&quot; (CFJCF).</Paragraph>
      <Paragraph position="5"> In spoken discourse, these verbs are more likely to be used metalinguistically, with the previous  Quotations are usually used in narrative, which is more likely in connected discourse than in an isolated sentence. This difference mainly affects verbs of communication (e.g. answer, ask, call, describe, read, say, write).</Paragraph>
      <Paragraph position="6"> These verbs are used in corpora to discuss details of the contents of communication: &quot;Turning to the reporters, she asked, 'Did you hear her?'&quot; (Brown) In single-sentence production, they are used to describe the (new) act of communication itself: &quot;He asked a lot of questions at school.&quot; (CFJCF) We are currently working on systematically identifying indirect quotes in the corpora and the CFJCF data to analyze in more detail how they fit into this picture.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="1124" end_page="1125" type="metho">
    <SectionTitle>
4 Experiment 2
</SectionTitle>
    <Paragraph position="0"> Our first experiment suggested that discourse factors were the primary cause of subcategorization differences. One way to test this hypothesis is to eliminate discourse factors and see if this removes subcategorization differences.</Paragraph>
    <Paragraph position="1"> We measure the difference between the way a verb is used in two different corpora by counting the number of sentences (per hundred) where a verb in one corpus would have to be used with a different subcategorization in order for the two corpora to yield the same subcategorization frequencies. This same number can also be calculated for the overall subcategorization frequencies of two corpora to show the overall difference between the two corpora.</Paragraph>
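    <Paragraph> One way to read this measure is as half the summed absolute difference between two percentage distributions over frames, since every sentence moved out of one frame must land in another. The sketch below follows that reading; it is an interpretation rather than the authors' exact computation, and the function name and counts are hypothetical.

```python
def frames_per_hundred(counts_a, counts_b):
    """Sentences per hundred that would need a different frame.

    Each argument maps a frame label to a raw count. Both are converted
    to percentages; the result is half the summed absolute difference,
    i.e. the share of sentences that must move to another frame.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    frames = set(counts_a) | set(counts_b)
    gap = 0.0
    for frame in frames:
        pct_a = 100.0 * counts_a.get(frame, 0) / total_a
        pct_b = 100.0 * counts_b.get(frame, 0) / total_b
        gap += abs(pct_a - pct_b)
    return gap / 2.0

# Hypothetical counts for one verb in two corpora (illustration only).
corpus_a = {"NP": 50, "passive": 30, "zero": 20}
corpus_b = {"NP": 70, "passive": 10, "zero": 20}
difference = frames_per_hundred(corpus_a, corpus_b)
```

    With these toy counts, 20 sentences per hundred in the first corpus would have to switch from passive to NP for the two distributions to match. The same function applies unchanged to corpus-wide frame totals.</Paragraph>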
    <Paragraph position="2"> Our procedure for measuring the effect of discourse is as follows (illustrated using passive as an example): [Figure: % Passive, WSJ (adjusted) vs CFJCF] 3. re-measure the difference between the two corpora (WSJ vs CFJCF) 4. amount of improvement = size of discourse effect. This method was applied to the passive, quote, and zero subcat frames, since these are the ones that show discourse-based differences. Before the mapping, WSJ has an overall difference of 17 frames/100 when compared with CFJCF. After the mapping, the overall difference is only 9.6 frames/100. This indicates that 43% of the overall cross-verb differences between these two corpora are caused by discourse effects.</Paragraph>
    <Paragraph position="3"> We use this mapping procedure to measure the size and consistency of the discourse effects. A more sophisticated mapping procedure would be appropriate for other purposes since the verbs with the best matches between corpora are actually made worse by this mapping procedure.</Paragraph>
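    <Paragraph> The size of the discourse effect follows from the two reported differences as a relative reduction. The short sketch below reproduces that arithmetic; the function name is hypothetical, and only the two figures given in the text (17 and 9.6 frames/100) are used.

```python
def discourse_share(before, after):
    """Fraction of the original difference removed by the mapping."""
    return (before - after) / before

# Figures reported in the text: 17 frames/100 before, 9.6 after.
share = discourse_share(17.0, 9.6)
print(f"{100 * share:.1f}% of the difference is removed")
```

    The result is about 43.5%, in line with the 43% reported for WSJ versus CFJCF.</Paragraph>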
  </Section>
  <Section position="6" start_page="1125" end_page="1125" type="metho">
    <SectionTitle>
5 Experiment 3
</SectionTitle>
    <Paragraph position="0"> Argument preference was also affected by verb semantics. To examine this effect, we took two sample ambiguous verbs, &quot;charge&quot; and &quot;pass&quot;. We hand-coded them for semantic senses in each of the corpora we used, as follows: Examples of 'charge' taken from BC.</Paragraph>
    <Paragraph position="1"> accuse: &quot;His petition charged mental cruelty.&quot; attack: &quot;When he charged Mickey was ready.&quot; money: &quot;... 20 per cent ... was all he charged the traders.&quot; Examples of 'pass' taken from BC.</Paragraph>
    <Paragraph position="2"> movement: &quot;Blue Throat's men spotted him ... as he passed.&quot; law: &quot;The President noted that Congress last year passed a law providing grants ...&quot; transfer: &quot;He asked, when she passed him a glass.&quot; test: &quot;Those who T stayed had * to pass tests.&quot; We then asked two questions: 1. Do different verb senses have different argument structure preferences? 2. Do different corpora have different verb sense preferences, and therefore potentially different argument structure preferences? For both verbs examined (pass and charge) there was a significant effect of verb sense on argument structure probabilities (by χ², p &lt; .001 for 'charge' and p &lt; .001 for 'pass'). The following chart shows a sample of this difference. Sample Frames and Senses from WSJ (frames: that, NP, NPPP, passive): charge (accuse): 32, 0, 24, 25. We then analyzed how often each sense was used in each of the corpora and found that there was again a significant difference (by χ², p &lt; .001 for 'charge' and p &lt; .001 for 'pass').</Paragraph>
    <Paragraph position="3"> This analysis shows that it is possible for shifts in the relative frequency of each of a verb's senses to influence the observed subcat frequencies.</Paragraph>
    <Paragraph position="4"> We are currently extending our study to see if verb senses have constant subcategorization frequencies across corpora. This would be useful for word sense disambiguation and for parsing. If the verb sense is known, then a parser could use this information to help look for likely arguments. If the subcategorization is known, then a disambiguator could use this information to find the sense of the verb. These could be used to bootstrap each other, relying on the heuristic that only one sense is used within any discourse (Gale, Church, &amp; Yarowsky 1992).</Paragraph>
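    <Paragraph> The significance tests reported above are chi-square tests on a table of counts. As a minimal sketch, assuming a standard Pearson chi-square test of independence on a sense-by-frame table, the statistic can be computed as below. The function name and the counts are hypothetical and are not the paper's data; the sketch also assumes every row and column has a nonzero total.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom.

    table is a list of rows (one per verb sense), each a list of
    counts (one per subcategorization frame).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof

# Hypothetical sense-by-frame counts (illustration only, not the paper's data).
# Rows: two senses of a verb; columns: three frames.
toy_table = [
    [30, 5, 15],
    [10, 25, 15],
]
statistic, dof = chi_square(toy_table)
# 13.816 is the standard chi-square critical value for dof = 2 at p = .001.
significant = statistic > 13.816
```

    The same function applies to a sense-by-corpus table, which corresponds to the second question: whether corpora differ in how often each sense is used.</Paragraph>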
  </Section>
  <Section position="7" start_page="1125" end_page="1126" type="metho">
    <SectionTitle>
6 Evaluation
</SectionTitle>
    <Paragraph position="0"> We had previously hoped to evaluate the accuracy of our treebank-induced subcategorization probabilities by comparing them with the COMLEX hand-coded probabilities (Macleod and Grishman 1994), but we used a different set of subcategorization frames than COMLEX.</Paragraph>
    <Paragraph position="1"> Instead, we hand-checked a random sample of our data for errors.</Paragraph>
    <Paragraph position="2"> The error rate in our data is between 3% and 7% for all verbs excluding 'say' type verbs such as 'answer', 'ask', 'call', 'read', 'say', and 'write'. The error rate is given as a range due to the subjectivity of some types of errors. The errors can be divided into two classes: errors which are due to mis-parsed sentences in Treebank (see footnote 1), and errors which are due to the inadequacy of our search strings in identifying certain syntactic patterns.</Paragraph>
    <Paragraph position="3"> Error rate by category: [...] misc. mis-parsed sentences, 1%. Errors based on our search strings: missed traces and displaced arguments, 1%; &quot;say&quot; verbs missing quotes, 6%. In trying to estimate the maximum amount of error in our data, we found cases where it was possible to disagree with the parses/tags given in Treebank. Treebank examples given below include prepositional attachment (1), the verb-particle/preposition distinction (2), and the NP/adverbial distinction (3).</Paragraph>
    <Paragraph position="4"> 1. &quot;Sam, I thought you [knew [everything]NP [about Tokyo]PP]&quot; (BC) 2. &quot;...who has since moved [on to other methods]PP?&quot; (BC) 3. &quot;Gross stopped [briefly]NP?, then went on.&quot; (BC) Missed traces and displaced argument errors were a result of the difficulty in writing search strings to find arguments that were located to the left of the verb. This is because arbitrary amounts of structure can intervene, especially in the case of traces.</Paragraph>
    <Paragraph position="5"> Footnote 1: All of our search patterns are based only on the information available in the Treebank 1 coding system, since the Brown Corpus is only available in this scheme. The error rate for corpora available in Treebank 2 form would have been lower had we used all available information.</Paragraph>
    <Paragraph position="6"> Six percent of the data (overall) was improperly classified due to the failure of our search patterns to identify all of the quote-type arguments which occur in 'say' type verbs. The identification of these elements is particularly problematic due to the asyntactic nature of these arguments, ranging from a sound (He said 'Argh!') to complex sentences. The presence or absence of quotation marks was not a completely reliable indicator of these arguments. This type of error affects only a small subset of the total number of verbs. 27% of the examples of these verbs were mis-classified, always by failing to find a quote-type argument of the verb. Using separate search strings for these verbs would greatly improve the accuracy of these searches.</Paragraph>
    <Paragraph position="7"> Our eventual goal is to develop a set of regular expressions that work on flat tagged corpora instead of TreeBank parsed structures, to allow us to gather information from larger corpora than those parsed by the TreeBank project (see Manning 1993 and Gahl 1998).</Paragraph>
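    <Paragraph> To make the idea of regular expressions over flat tagged text concrete, the sketch below matches a verb followed by a simple noun phrase in a word/TAG string. It is a hypothetical illustration only: the pattern, the function name, and the hand-assigned Penn-style tags are assumptions, not the authors' search strings, and a real pattern set would need to handle many more constructions.

```python
import re

# Hypothetical pattern: a verb token followed by a simple noun phrase
# (optional determiner, any adjectives, one or more nouns).
NP_AFTER_VERB = re.compile(
    r"\b(\w+)/VB[DZPGN]?\s+"
    r"((?:\w+/DT\s+)?(?:\w+/JJ\s+)*(?:\w+/NNS?\s*)+)"
)

def find_np_objects(tagged_sentence):
    """Return (verb, noun phrase) pairs found in a word/TAG string."""
    pairs = []
    for match in NP_AFTER_VERB.finditer(tagged_sentence):
        # Strip the tags from the matched noun phrase tokens.
        words = [tok.split("/")[0] for tok in match.group(2).split()]
        pairs.append((match.group(1), " ".join(words)))
    return pairs

# Sentence from the CFJCF example in the text; the tags are hand-assigned here.
tagged = "I/PRP guessed/VBD the/DT right/JJ answer/NN on/IN the/DT quiz/NN ./."
pairs = find_np_objects(tagged)
```

    On this sentence the pattern pairs "guessed" with "the right answer". Arguments displaced to the left of the verb and quote-type arguments would not be caught by a pattern this simple.</Paragraph>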
  </Section>
  <Section position="8" start_page="1126" end_page="1127" type="metho">
    <SectionTitle>
7 Conclusion
</SectionTitle>
    <Paragraph position="0"> We find that there are significant differences between the verb subcategorization frequencies generated through experimental methods and corpus methods, and between the frequencies found in different corpora. We have identified two distinct sources for these differences. Discourse influences are caused by the changes in the ways language is used in different discourse types and are to some extent predictable from the discourse type of the corpus in question. Semantic influences are based on the semantic context of the discourse. These differences may be predictable from the relative frequencies of each of the possible senses of the verbs in the corpus. An extensive analysis of the frame and sense frequencies of different verbs across different corpora is needed to verify this. This work is presently being carried out by us and others (Baker, Fillmore, &amp; Lowe 1998). It is certain, however, that verb sense and discourse type play an important role in the frequencies observed in different experimental and corpus-based sources of verb subcategorization</Paragraph>
  </Section>
</Paper>