<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-4028">
  <Title>Confidence Estimation for Information Extraction</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Field Confidence Estimation
</SectionTitle>
    <Paragraph position="0"> The Viterbi algorithm finds the most likely state sequence matching the observed word sequence. The word that Viterbi matches with a particular FSM state is extracted as belonging to the corresponding database field. We can obtain a numeric score for an entire sequence, and then turn this into a probability for the entire sequence by normalizing. However, to estimate the confidence of an individual field, we desire the probability of a subsequence, marginalizing out the state selection for all other parts of the sequence. A specialization of Forward-Backward, termed Constrained Forward-Backward (CFB), returns exactly this probability.</Paragraph>
    <Paragraph position="1"> Because CRFs are conditional models, Viterbi finds the most likely state sequence given an observation sequence, defined as $s^* = \arg\max_s p_\Lambda(s|o)$. To avoid an exponential-time search over all possible settings of $s$, Viterbi stores the probability of the most likely path at time $t$ that accounts for the first $t$ observations and ends in state $s_i$. Following traditional notation, we define this probability to be $\delta_t(s_i)$, where $\delta_0(s_i)$ is the probability of starting in each state $s_i$, and the recursive formula is:</Paragraph>
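    <Paragraph position="3"> $$\delta_{t+1}(s_i) = \max_{s'} \left[ \delta_t(s') \exp\left( \sum_k \lambda_k f_k(s', s_i, o, t) \right) \right] \qquad (2)$$ terminating in $s^* = \arg\max_{s_i} [\delta_T(s_i)]$.</Paragraph>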
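    <Paragraph> As an illustration of the Viterbi recursion described above, the following Python sketch runs it over a toy lattice. It is a minimal sketch under stated assumptions, not the authors' implementation: the names viterbi, start, and psi are hypothetical, the numbers are made up, and psi[t][j][i] stands in for whatever non-negative score the model assigns to moving from state j at time t to state i at time t + 1.

```python
def viterbi(start, psi):
    """Return (best_path, best_score) for a toy linear-chain lattice.

    start[i]     : score of starting in state i (delta_0 in the text)
    psi[t][j][i] : hypothetical non-negative score for moving from state j
                   at time t to state i at time t + 1
    """
    n = len(start)
    delta = list(start)          # delta_t for the current time step
    backpointers = []            # best predecessor of each state, per step
    for step in psi:
        new_delta, pointers = [], []
        for i in range(n):
            # the max over predecessors avoids enumerating every path
            best_j = max(range(n), key=lambda j: delta[j] * step[j][i])
            pointers.append(best_j)
            new_delta.append(delta[best_j] * step[best_j][i])
        delta = new_delta
        backpointers.append(pointers)
    # termination: pick the best final state, then follow the backpointers
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, delta[last]


# Toy two-state example with made-up numbers.
start = [0.6, 0.4]
psi = [
    [[0.7, 0.3], [0.4, 0.6]],
    [[0.5, 0.5], [0.1, 0.9]],
]
best_path, best_score = viterbi(start, psi)
print(best_path, best_score)
```

    The returned path has one entry per time step, including the start state, and the score is the unnormalized value of that single path.</Paragraph>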
    <Paragraph position="4"> The Forward-Backward algorithm can be viewed as a generalization of the Viterbi algorithm: instead of choosing the optimal state sequence, Forward-Backward evaluates all possible state sequences given the observation sequence. The "forward values" $\alpha_{t+1}(s_i)$ are defined recursively as in Eq. 2, except that the max is replaced by a summation. Thus we have</Paragraph>
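    <Paragraph position="6"> $$\alpha_{t+1}(s_i) = \sum_{s'} \left[ \alpha_t(s') \exp\left( \sum_k \lambda_k f_k(s', s_i, o, t) \right) \right] \qquad (3)$$ terminating in $Z_o = \sum_i \alpha_T(s_i)$ from Eq. 1.</Paragraph>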
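    <Paragraph> To make the relationship between the forward values and the normalizer concrete, the sketch below computes the final forward values on a toy lattice and checks their sum against an explicit sum over every state sequence. The function names, the psi layout (psi[t][j][i] as the score of moving from state j to state i at step t), and the numbers are all illustrative assumptions.

```python
from itertools import product


def forward(start, psi):
    """Return the final forward values alpha_T for a toy lattice."""
    n = len(start)
    alpha = list(start)
    for step in psi:
        # same recursion as Viterbi, with the max replaced by a sum
        alpha = [sum(alpha[j] * step[j][i] for j in range(n)) for i in range(n)]
    return alpha


def brute_force_z(start, psi):
    """Sum the score of every state sequence explicitly (exponential time)."""
    n = len(start)
    total = 0.0
    for path in product(range(n), repeat=len(psi) + 1):
        score = start[path[0]]
        for t, step in enumerate(psi):
            score *= step[path[t]][path[t + 1]]
        total += score
    return total


# Made-up, unnormalized scores for a two-state, two-step lattice.
start = [1.0, 2.0]
psi = [
    [[2.0, 1.0], [0.5, 3.0]],
    [[1.0, 4.0], [2.0, 0.5]],
]
z_o = sum(forward(start, psi))
print(z_o, brute_force_z(start, psi))
```

    The two printed values agree, while the forward pass needs only a number of operations linear in the sequence length.</Paragraph>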
    <Paragraph position="7"> To estimate the probability that a field is extracted correctly, we constrain the Forward-Backward algorithm such that each path conforms to some subpath of constraints $C = \langle s_q \ldots s_r \rangle$ from time step $q$ to $r$. Here, $s_q \in C$ can be either a positive constraint (the sequence must pass through $s_q$) or a negative constraint (the sequence must not pass through $s_q$).</Paragraph>
    <Paragraph position="8"> In the context of information extraction, C corresponds to an extracted field. The positive constraints specify the observation tokens labeled inside the field, and the negative constraints specify the field boundary. For example, if we use state names B-JOBTITLE and I-JOBTITLE to label tokens that begin and continue a JOBTITLE field, and the system labels observation sequence $\langle o_2, \ldots, o_5 \rangle$ as a JOBTITLE field, then $C = \langle s_2 = \text{B-JOBTITLE}, s_3 = \cdots = s_5 = \text{I-JOBTITLE}, s_6 \neq \text{I-JOBTITLE} \rangle$.</Paragraph>
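    <Paragraph> One possible way to represent such a constraint set in code is a mapping from time step to a positive or negative constraint. The sketch below builds that mapping for a field spanning a range of tokens. The function name field_constraints, the "must" and "must_not" tags, and the choice to add a single negative constraint just past the field are assumptions made for illustration.

```python
def field_constraints(first, last, begin_label, inside_label, seq_len):
    """Build constraints for a field covering time steps first..last.

    Positive constraints pin the labels inside the field; one negative
    constraint after the field marks its boundary, if the sequence is
    long enough to have such a position.
    """
    constraints = {first: ("must", begin_label)}
    for t in range(first + 1, last + 1):
        constraints[t] = ("must", inside_label)
    if seq_len > last:
        # the token after the field must not continue the field
        constraints[last + 1] = ("must_not", inside_label)
    return constraints


# The JOBTITLE example from the text: tokens o2..o5 in a sequence of 8.
C = field_constraints(2, 5, "B-JOBTITLE", "I-JOBTITLE", 8)
for t in sorted(C):
    print(t, C[t])
```

    Time steps absent from the mapping are unconstrained.</Paragraph>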
    <Paragraph position="10"> The calculations of the forward values can be made to conform to C by the recursion $$\alpha'_q(s_i) = \sum_{s'} \left[ \alpha'_{q-1}(s') \exp\left( \sum_k \lambda_k f_k(s', s_i, o, q-1) \right) \right] \quad \text{if } s_i \simeq s_q,$$ and $\alpha'_q(s_i) = 0$ otherwise, for all $s_q \in C$, where the operator $s_i \simeq s_q$ means $s_i$ conforms to constraint $s_q$. For time steps not constrained by C, Eq. 3 is used instead.</Paragraph>
    <Paragraph position="1"> If $\alpha'_{t+1}(s_i)$ is the constrained forward value, then $Z'_o = \sum_i \alpha'_T(s_i)$ is the value of the constrained lattice, the set of all paths that conform to C. Our confidence estimate is obtained by normalizing $Z'_o$ using $Z_o$, i.e. $Z'_o \div Z_o$.</Paragraph>
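    <Paragraph> The constrained computation can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: the three-state label set, the transition scores, and the names conforms, lattice_value, and cfb_confidence are hypothetical. Constraints are keyed by time step, where step t is the state reached after the t-th transition.

```python
def conforms(state, constraint):
    """True if a state satisfies a ("must", s) or ("must_not", s) constraint."""
    kind, target = constraint
    if kind == "must":
        return state == target
    return state != target


def lattice_value(start, psi, constraints):
    """Sum the scores of all paths that conform to the constraints."""
    n = len(start)
    alpha = list(start)
    for t, step in enumerate(psi, start=1):
        constraint = constraints.get(t)
        new_alpha = []
        for i in range(n):
            if constraint is not None and not conforms(i, constraint):
                new_alpha.append(0.0)    # violating paths contribute nothing
            else:
                new_alpha.append(sum(alpha[j] * step[j][i] for j in range(n)))
        alpha = new_alpha
    return sum(alpha)


def cfb_confidence(start, psi, constraints):
    """Constrained lattice value divided by the unconstrained one."""
    return lattice_value(start, psi, constraints) / lattice_value(start, psi, {})


# Hypothetical label set and made-up scores; six transitions.
OTHER, BEGIN, INSIDE = 0, 1, 2
start = [0.8, 0.2, 0.0]
step = [
    [0.6, 0.4, 0.0],   # from OTHER
    [0.3, 0.1, 0.6],   # from BEGIN
    [0.5, 0.1, 0.4],   # from INSIDE
]
psi = [step] * 6
C = {2: ("must", BEGIN), 3: ("must", INSIDE), 4: ("must", INSIDE),
     5: ("must", INSIDE), 6: ("must_not", INSIDE)}
confidence = cfb_confidence(start, psi, C)
print(confidence)
```

    With an empty constraint set the two lattice values coincide, so the confidence is exactly 1; every added constraint can only remove paths and lower it.</Paragraph>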
    <Paragraph position="2"> We also implement an alternative method that uses the state probability distributions for each state in the extracted field. Let $\gamma_t(s_i) = p(s_i | o_1, \ldots, o_T)$ be the probability of being in state $s_i$ at time $t$ given the observation sequence. We define the confidence measure GAMMA to be $\prod_{i=u}^{v} \gamma_i(s_i)$, where $u$ and $v$ are the start and end indices of the extracted field.</Paragraph>
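    <Paragraph> The GAMMA measure can be illustrated with a small forward-backward pass that yields per-position state marginals, followed by a product over the field. The sketch below is written under the same toy assumptions as before: state_marginals, gamma_confidence, the psi layout, and the numbers are hypothetical, and labels holds the state index chosen at each time step.

```python
def state_marginals(start, psi):
    """Return marginals[t][i] = p(state i at time t | observations)."""
    n = len(start)
    alphas = [list(start)]
    for step in psi:
        prev = alphas[-1]
        alphas.append([sum(prev[j] * step[j][i] for j in range(n)) for i in range(n)])
    betas = [[1.0] * n]
    for step in reversed(psi):
        nxt = betas[0]
        betas.insert(0, [sum(step[j][i] * nxt[i] for i in range(n)) for j in range(n)])
    z = sum(alphas[-1])
    return [[a * b / z for a, b in zip(alpha, beta)]
            for alpha, beta in zip(alphas, betas)]


def gamma_confidence(marginals, labels, u, v):
    """Multiply the marginals of the chosen labels over positions u..v."""
    confidence = 1.0
    for t in range(u, v + 1):
        confidence *= marginals[t][labels[t]]
    return confidence


# Made-up two-state lattice; the field covers time steps 1 and 2.
start = [1.0, 2.0]
psi = [
    [[2.0, 1.0], [0.5, 3.0]],
    [[1.0, 4.0], [2.0, 0.5]],
]
marginals = state_marginals(start, psi)
gamma = gamma_confidence(marginals, labels=[1, 1, 0], u=1, v=2)
print(gamma)
```

    Unlike the constrained lattice value, this product treats the positions in the field as if they were independent, which is why the two estimates can differ.</Paragraph>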
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Record Confidence Estimation
</SectionTitle>
    <Paragraph position="0"> We can similarly use CFB to estimate the probability that an entire record is labeled correctly. The procedure is the same as in the previous section, except that C now specifies the labels for all fields in the record.</Paragraph>
    <Paragraph position="1"> We also implement three alternative record confidence estimates. FIELDPRODUCT calculates the confidence of each field in the record using CFB, then multiplies these values together to obtain the record confidence. FIELDMIN instead uses the minimum field confidence as the record confidence. VITERBIRATIO uses the ratio of the probabilities of the top two Viterbi paths, capturing how much more likely $s^*$ is than its closest alternative.</Paragraph>
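    <Paragraph> The three alternative record estimates are simple to state in code. The sketch below is an illustration only: the function names and numbers are hypothetical, the field confidences are assumed to have been computed already, and the top two paths are found by brute-force enumeration, which is feasible only for toy lattices (a practical system would need an n-best search instead).

```python
import math
from itertools import product


def field_product(field_confidences):
    """Record confidence as the product of its field confidences."""
    return math.prod(field_confidences)


def field_min(field_confidences):
    """Record confidence as the weakest field confidence."""
    return min(field_confidences)


def viterbi_ratio(start, psi):
    """Score of the best path divided by the score of the runner-up.

    Enumerates every path, so it only suits tiny examples; it also
    assumes the second-best path has a non-zero score.
    """
    n = len(start)
    scores = []
    for path in product(range(n), repeat=len(psi) + 1):
        score = start[path[0]]
        for t, step in enumerate(psi):
            score *= step[path[t]][path[t + 1]]
        scores.append(score)
    best, second = sorted(scores, reverse=True)[:2]
    return best / second


# Made-up field confidences for one record, and a toy lattice.
fields = [0.9, 0.8, 0.5]
start = [0.6, 0.4]
psi = [
    [[0.7, 0.3], [0.4, 0.6]],
    [[0.5, 0.5], [0.1, 0.9]],
]
print(field_product(fields), field_min(fields), viterbi_ratio(start, psi))
```

    Note that the ratio is at least 1 and unbounded above, so it ranks records rather than giving a probability.</Paragraph>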
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Reranking with Maximum Entropy
</SectionTitle>
    <Paragraph position="0"> We also trained two conditional maximum entropy classifiers to classify fields and records as being labeled correctly or incorrectly. The resulting posterior probability of the "correct" label is used as the confidence measure. The approach is inspired by results from Collins (2000), which show that discriminative classifiers can improve the ranking of parses produced by a generative parser.</Paragraph>
    <Paragraph position="1"> After initial experimentation, we found the most informative inputs for the field confidence classifier to be field length, the predicted label of the field, whether or not this field has been extracted elsewhere in this record, and the CFB confidence estimate for this field. For the record confidence classifier, we incorporated the following features: record length, whether or not two fields were tagged with the same label, and the CFB confidence estimate.</Paragraph>
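    <Paragraph> As a rough illustration of how the field features listed above could feed a two-class maximum entropy model, the sketch below maps a field to a feature dictionary and computes the posterior of the "correct" class, using the fact that a binary maximum entropy classifier has the logistic form. The feature names, the weights, and the functions field_features and posterior_correct are hypothetical; the weights are made up, not trained.

```python
import math


def field_features(field_length, predicted_label, seen_elsewhere, cfb_confidence):
    """Map one extracted field to a sparse feature dictionary."""
    return {
        "bias": 1.0,
        "length": float(field_length),
        "label=" + predicted_label: 1.0,
        "seen_elsewhere": 1.0 if seen_elsewhere else 0.0,
        "cfb_confidence": cfb_confidence,
    }


def posterior_correct(features, weights):
    """Posterior of the "correct" class under a binary maxent model."""
    score = sum(weights.get(name, 0.0) * value
                for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))


# Made-up weights, for illustration only.
weights = {"bias": -1.0, "length": -0.1, "label=JOBTITLE": 0.3,
           "seen_elsewhere": -0.5, "cfb_confidence": 3.0}
low = posterior_correct(field_features(4, "JOBTITLE", False, 0.2), weights)
high = posterior_correct(field_features(4, "JOBTITLE", False, 0.9), weights)
print(low, high)
```

    In practice the weights would be estimated from fields whose correctness is known; here they only show how the CFB estimate enters as one feature among several.</Paragraph>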
  </Section>
</Paper>