<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-1029">
  <Title>Digression: What's Statistical Parsing Good For?</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 The Trigram Baseline
</SectionTitle>
    <Paragraph position="0"> As a baseline, we replicated the three-state HMM method of Beeferman et al. (1998). In this section, we describe that method, which we use as the basis for our extensions.</Paragraph>
    <Paragraph position="1"> The input to comma restoration is a sentence x = x_1 ... x_n of words and punctuation but no commas. We would like to generate a restored string y = y_1 ... y_{n+c}, which is the string x with c commas inserted. The selected y should maximize conformance with a simple trigram model. (Footnote 2: We might expect nonstatistical parsers also not to be useful, but for a different reason: their fragility. Rather than delivering partially correct results, they partially deliver correct results. But that is a different issue.)</Paragraph>
    <Paragraph position="3"/>
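    <Paragraph position="4"> (The selection criterion is elided in this copy of the text. A plausible reconstruction, consistent with the path probability given in the following paragraph, where n and c are as above, is:

$$\hat{y} = \operatorname*{argmax}_{y} \prod_{i=1}^{n+c} p(y_i \mid y_{i-2}\, y_{i-1})$$

i.e., the restored string is the comma placement maximizing trigram probability.)</Paragraph>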
    <Paragraph position="5"> We take the string x to be the observed output of an HMM with three states and transition probabilities dependent on output; the states encode the position of commas in a reconstructed string. Figure 1 depicts the automaton. The start state (1) corresponds to having seen a word with no prior or following comma, state (2) a word with a following comma, and state (3) a word with a prior but no following comma. It is easy to see that a path through the automaton traverses a string y with probability $\prod_{i=1}^{n+c} p(y_i \mid y_{i-2}\, y_{i-1})$. The decoded string y can therefore be computed by Viterbi decoding. This method requires a trigram language model p().</Paragraph>
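A minimal sketch of this kind of decoding in Python, with the search state taken as the last two emitted tokens (which subsumes the three-state automaton under a trigram model). Here `toy_logp` is a made-up, hand-set scoring function standing in for a trained trigram model, and the function names are illustrative, not from the paper:

```python
def restore_commas(words, trigram_logp):
    """Viterbi-style dynamic program over comma placements.

    State = the last two emitted tokens. At each input word we either
    emit the word alone or the word followed by a comma, scoring every
    emitted token with the trigram model log p(token | previous two).
    """
    beam = {("<s>", "<s>"): (0.0, [])}  # state -> (log-prob, output so far)
    for w in words:
        new_beam = {}
        for (u, v), (score, out) in beam.items():
            for emission in ([w], [w, ","]):
                s, ctx, cur = score, (u, v), list(out)
                for tok in emission:
                    s += trigram_logp(tok, *ctx)
                    ctx = (ctx[1], tok)
                    cur.append(tok)
                # Keep only the best-scoring hypothesis per state.
                if ctx not in new_beam or s > new_beam[ctx][0]:
                    new_beam[ctx] = (s, cur)
        beam = new_beam
    return max(beam.values())[1]  # output of the highest-scoring state

def toy_logp(tok, u, v):
    # Hypothetical hand-set scores, not a trained model.
    if tok == ",":
        return -0.5 if v == "however" else -6.0
    if (u, v) == ("however", ","):
        return -0.2  # a comma after "however" makes the next word likelier
    return -1.0

print(restore_commas(["however", "it", "works"], toy_logp))
# ['however', ',', 'it', 'works']
```

Because every comma insertion adds a scored token, a comma is chosen only when it also improves the probability of the following context, as in the toy example.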
    <Paragraph position="6"> We train this language model on sections 02-22 of the Penn Treebank Wall Street Journal data (WSJ), comprising about 36,000 sentences. (The data differ from the Treebank version in minor ways, for instance a small number of missing sentences and some variation in the tags; runs of the experiments below using the Treebank versions of the data yield essentially identical results.) The CMU Statistical Language Modeling Toolkit (Clarkson and Rosenfeld, 1997) was used to generate the model, with Katz smoothing to incorporate lower-order models.</Paragraph>
    <Paragraph position="7"> The model was then tested on the approximately 2300 sentences of WSJ Section 23. Precision of the comma restoration was 71.1% and recall 55.2%. F-measure, calculated as 2PR/(P + R), where P is precision and R recall, is 62.2%. Overall, 96.3% of all comma placement decisions were made correctly, a metric we refer to as token accuracy. Sentence accuracy, the percentage of sentences correctly restored, was 47.0%. (These results are presented as model 1 in Table 1.) This is the baseline against which we evaluate our alternative comma restoration models.</Paragraph>
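The F-measure can be checked directly from the rounded precision and recall quoted above (the tiny discrepancy with the reported 62.2% comes from rounding P and R to one decimal place before combining):

```python
P, R = 0.711, 0.552            # baseline precision and recall from the text
F = 2 * P * R / (P + R)        # harmonic mean of P and R
print(f"{100 * F:.1f}%")       # about 62.1%, i.e. the reported 62.2% up to rounding
```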
    <Paragraph position="8"> Beeferman et al. present an alternative trigram model, which computes the following:</Paragraph>
    <Paragraph position="10"> That is, an additional penalty is assessed for not placing a comma at a given position. By penalizing omission of a comma between two words, the model implicitly rewards commas; we would therefore expect higher recall and correspondingly lower precision. In fact, the method with the omission penalty (model 2 in Table 1) does have higher recall and lower precision, essentially identical F-measure, but lower sentence accuracy. Henceforth, the models described here do not use an omission penalty.</Paragraph>
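The precision/recall trade-off can be illustrated with a schematic decision rule (a hypothetical simplification, not Beeferman et al.'s exact formulation): charging the no-comma hypothesis an omission penalty makes marginal sites flip to insertion, which raises recall at the cost of precision.

```python
def insert_comma(logp_with_comma, logp_without_comma, omission_penalty=0.0):
    """Hypothetical decision rule: the no-comma hypothesis pays an extra
    penalty, so larger penalties favor inserting a comma."""
    return logp_with_comma > logp_without_comma - omission_penalty

# A marginal site: the comma hypothesis scores slightly worse on its own.
print(insert_comma(-2.0, -1.5))                        # False: no comma
print(insert_comma(-2.0, -1.5, omission_penalty=1.0))  # True: comma inserted
```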
  </Section>
</Paper>