<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1151">
  <Title>Probabilistic Reasoning for Entity &amp; Relation Recognition</Title>
  <Section position="3" start_page="13" end_page="13" type="metho">
    <SectionTitle>
2 Global Inference of Entities/Relations
</SectionTitle>
    <Paragraph position="0"> The problem at hand is that of producing a coherent labeling of entities and relations in a given sentence. Conceptually, the entities and relations can be viewed, taking into account the mutual dependencies, as the labeled graph in Figure 1, where the nodes represent entities (e.g. phrases) and the links denote the binary relations between the entities. Each entity and relation has several properties - denoted as labels of nodes and edges in the graph. Some of the properties, such as words inside the entities, can be read directly from the input; others, like pos tags of words in the context of the sentence, are easy to acquire via learned classifiers. However, properties like semantic types of phrases (i.e., class labels, such as &amp;quot;people&amp;quot;, &amp;quot;locations&amp;quot;) and relations among them are more difficult to acquire. Identifying the labels of entities and relations is treated here as the target of our learning problem. In particular, we learn these target properties as functions of all other &amp;quot;simple to acquire&amp;quot; properties of the sentence.</Paragraph>
    <Paragraph position="1"> To describe the problem in a formal way, we first define sentences and entities as follows.</Paragraph>
    <Paragraph position="2"> Definition 2.1 (Sentence &amp; Entity) A sentence S is a linked list which consists of words w and entities E. An entity can be a single word or a set of consecutive words with a predefined boundary. Entities in a sentence are labeled as E1;E2;C/C/C/ according to their order, and they take values that range over a set of entity types CE.</Paragraph>
    <Paragraph position="3"> Notice that determining the entity boundaries is also a difficult problem - the segmentation (or phrase detection) problem (Abney, 1991; Punyakanok and Roth, 2001). Here we assume it is solved and given to us as input; thus we only concentrate on classification.</Paragraph>
    <Paragraph position="4">  Example 2.1 The sentence in Figure 2 has three entities: E1 = &amp;quot;Dole&amp;quot;, E2 = &amp;quot;Elizabeth&amp;quot;, and E3 = &amp;quot;Salisbury, N.C.&amp;quot; Dole 's wife , Elizabeth , is a native of Salisbury , N.C. E1 E2 E3  A relation is defined by the entities that are involved in it (its arguments). In this paper, we only discuss binary relations.</Paragraph>
    <Paragraph position="5"> Definition 2.2 (Relation) A (binary) relation Rij = (Ei;Ej) represents the relation between Ei and Ej, where Ei is the first argument and Ej is the second. In addition, Rij can range over a set of entity types CR.</Paragraph>
    <Paragraph position="6"> Example 2.2 In the sentence given in Figure 2, there are six relations between the entities: R12 = (&amp;quot;Dole&amp;quot;, &amp;quot;Eliz-</Paragraph>
    <Paragraph position="8"> (&amp;quot;Salisbury, N.C.&amp;quot;, &amp;quot;Elizabeth&amp;quot;) We define the types (i.e. classes) of relations and entities as follows.</Paragraph>
    <Paragraph position="9"> Definition 2.3 (Classes) We denote the set of predefined entity classes and relation classes as CE and CR respectively. CE has one special element other ent, which represents any unlisted entity class. Similarly, CR also has one special element other rel, which means the involved entities are irrelevant or the relation class is undefined. When clear from the context, we use Ei and Rij to refer to the entity and relation, as well as their types (class labels).</Paragraph>
    <Paragraph position="10"> Example 2.3 Suppose CE = f other ent, person, location g and CR = f other rel, born in, spouse of g.</Paragraph>
    <Paragraph position="11"> For the entities in Figure 2, E1 and E2 belong to person and E3 belongs to location. In addition, relation R23 is born in, R12 and R21 are spouse of. Other relations are other rel.</Paragraph>
    <Paragraph position="12"> The class label of a single entity or relation depends not only on its local properties, but also on properties of other entities and relations. The classification task is somewhat difficult since the predictions of entity labels and relation labels are mutually dependent. For instance, the class label of E1 depends on the class label of R12 and the class label of R12 also depends on the class label of E1 and E2. While we can assume that all the data is annotated for training purposes, this cannot be assumed at evaluation time. We may presume that some local properties such as the word, pos, etc. are given, but none of the class labels for entities or relations is.</Paragraph>
    <Paragraph position="13"> To simplify the complexity of the interaction within the graph but still preserve the characteristic of mutual dependency, we abstract this classification problem in the following probabilistic framework. First, the classifiers are trained independently and used to estimate the probabilities of assigning different labels given the observation (that is, the easily classified properties in it). Then, the output of the classifiers is used as a conditional distribution for each entity and relation, given the observation. This information, along with the constraints among the relations and entities, is used to make global inferences for the most probable assignment of types to the entities and relations involved.</Paragraph>
    <Paragraph position="14"> The class labels of entities and relations in a sentence must satisfy some constraints. For example, if E1, the first argument of R12, is a location, then R12 cannot be born in because the first argument of relation born in has to be a person. We define constraints as follows.</Paragraph>
    <Paragraph position="15"> Definition 2.4 (Constraint) A constraint C is a 3-tuple (R;E1;E2), where R 2 CR and E1;E2 2 CE. If the class label of a relation is R, then the legitimate class labels of its two entity arguments are E1 and E2 respectively. null Example 2.4 Some examples of constraints are: (born in, person, location), (spouse of, person, person), and (murder, person, person) The constraints described above could be modeled using a joint probability distribution over the space of values of the relevant entities and relations. In the context of this work, for algorithmic reasons, we model only some of the conditional probabilities. In particular, the probability P(RijjEi;Ej) has the following properties.</Paragraph>
    <Paragraph position="16"> Property 1 The probability of the label of relation Rij given the labels of its arguments Ei and Ej has the following properties.</Paragraph>
    <Paragraph position="17"> + P(Rij = other reljEi = e1;Ej = e2) = 1, if there exists no r, such that (r;e1;e2) is a constraint.</Paragraph>
    <Paragraph position="18"> + P(Rij = rjEi = e1;Ej = e2) = 0, if there exists no constraint c, such that c = (r;e1;e2).</Paragraph>
    <Paragraph position="19"> Note that the conditional probabilities do not need to be specified manually. In fact, they can be easily learned from an annotated training dataset.</Paragraph>
    <Paragraph position="20"> Under this framework, finding the most suitable coherent labels becomes the problem of searching the most probable assignment to all the E and R variables. In other words, the global prediction e1;e2;:::;en;r12;r21;:::;rn(n!1) satisfies the following equation.</Paragraph>
    <Paragraph position="22"> argmaxei;rjkProb(E1;:::;En;R12;R21;:::;Rn(n!1)):</Paragraph>
  </Section>
  <Section position="4" start_page="13" end_page="13" type="metho">
    <SectionTitle>
3 Computational Approach
</SectionTitle>
    <Paragraph position="0"> Each nontrivial property of the entities and relations, such as the class label, depends on a very large number of variables. In order to predict the most suitable coherent labels, we would like to make inferences on several variables. However, when modeling the interaction between the target properties, it is crucial to avoid accounting for dependencies among the huge set of variables on which these properties depend. Incorporating these dependencies into our inference is unnecessary and will make the inference intractable. Instead, we can abstract these dependencies away by learning the probability of each property conditioned upon an observation.</Paragraph>
    <Paragraph position="1"> The number of features on which this learning problem depends could be huge, and they can be of different granularity and based on previous learned predicates (e.g.</Paragraph>
    <Paragraph position="2"> pos), as caricatured using the &amp;quot;network-like&amp;quot; structure in Figure 1. Inference is then made based on the probabilities. This approach is similar to (Punyakanok and Roth, 2001; Lafferty et al., 2001) only that there it is restricted to sequential inference, and done for syntactic structures.</Paragraph>
    <Paragraph position="3"> The following subsections describe the details of these two stages. Section 3.1 explains the feature extraction method and learning algorithm we used. Section 3.2 introduces the idea of using a belief network in search of the best global class labeling and the applied inference algorithm.</Paragraph>
    <Section position="1" start_page="13" end_page="13" type="sub_section">
      <SectionTitle>
3.1 Learning Basic Classifiers
</SectionTitle>
      <Paragraph position="0"> Although the labels of entities and relations from a sentence mutually depend on each other, two basic classifiers for entities and relations are first learned, in which a multi-class classifier for E(or R) is learned as a function of all other &amp;quot;known&amp;quot; properties of the observation. The classifier for entities is a named entity classifier, in which the boundary of an entity is predefined (Collins and Singer, 1999). On the other hand, the relation classifier is given a pair of entities, which denote the two arguments of the target relation. Accurate predictions of these two classifiers seem to rely on complicated syntax analysis and semantics related information of the whole sentence. However, we derive weak classifiers by treating these two learning tasks as shallow text processing problems. This strategy has been successfully applied on several NLP tasks, such as information extraction (Califf and Mooney, 1999; Freitag, 2000; Roth and Yih, 2001) and chunking (i.e. shallow paring) (Munoz et al., 1999).</Paragraph>
      <Paragraph position="1"> It assumes that the class labels can be decided by local properties, such as the information provided by the words around or inside the target. Examples include the spelling of a word, part-of-speech, and semantic related attributes acquired from external resources such as WordNet.</Paragraph>
      <Paragraph position="2"> The propositional learner we use is SNoW (Roth, 1998; Carleson et al., 1999) 1 SNoW is a multi-class classifier that is specifically tailored for large scale learning tasks. The learning architecture makes use of a network of linear functions, in which the targets (entity classes or relation classes, in this case) are represented as linear  functions over a common feature space. Within SNoW, we use here a learning algorithm which is a variation of Winnow (Littlestone, 1988), a feature efficient algorithm that is suitable for learning in NLP-like domains, where the number of potential features is very large, but only a few of them are active in each example, and only a small fraction of them are relevant to the target concept.</Paragraph>
      <Paragraph position="3"> While typically SNoW is used as a classifier, and predicts using a winner-take-all mechanism over the activation value of the target classes, here we rely directly on the raw activation value it outputs, which is the weighted linear sum of the features, to estimate the posteriors.</Paragraph>
      <Paragraph position="4"> It can be verified that the resulting values are monotonic with the confidence in the prediction, therefore is a good source of probability estimation. We use softmax (Bishop, 1995) over the raw activation values as probabilities. Specifically, suppose the number of classes is n, and the raw activation values of class i is acti. The posterior estimation for class i is derived by the following equation.</Paragraph>
      <Paragraph position="6"/>
    </Section>
    <Section position="2" start_page="13" end_page="13" type="sub_section">
      <SectionTitle>
3.2 Bayesian Inference Model
</SectionTitle>
      <Paragraph position="0"> Broadly used in the AI community, belief network is a graphical representation of a probability distribution (Pearl, 1988). It is a directed acyclic graph (DAG), where the nodes are random variables and each node is associated with a conditional probability table which defines the probability given its parents. We construct a belief network that represents the constraints existing among R's and E's. Then, for each sentence, we use the classifiers from section 3.1 to compute the Prob(Ejobservations) and Prob(Rjobservations), and use the belief network to compute the most probable global predictions of the class labels.</Paragraph>
      <Paragraph position="1"> The structure of our belief network, which represents the constraints is a bipartite graph. In particular, the variable E's and R's are the nodes in the network, where the E nodes are in one layer, and the R nodes are in the other. Since the label of a relation is dependent on the entity classes of its arguments, the links in the network connect the entity nodes, and the relation nodes that have these entities as arguments. For instance, node Rij has two incoming links from nodes Ei and Ej. The conditional probabilities P(RijjEi;Ej) encodes the constraints as in Property 1. As an illustration, Figure 3 shows a belief network that consists of 3 entity nodes and 6 relation nodes.</Paragraph>
      <Paragraph position="2"> Finding a most probable class assignment to the entities and relations is equivalent to finding the assignment of all the variables in the belief network that maximizes the joint probability. However, this mostprobable-explanation (MPE) inference problem is intractable (Roth, 1996) if the network contains loops</Paragraph>
      <Paragraph position="4"> work. Therefore, we resort to the following approximation method instead.</Paragraph>
      <Paragraph position="5"> Recently, researchers have achieved great success in solving the problem of decoding messages through a noisy channel with the help of belief networks (Gallager, 1962; MacKay, 1999). The network structure used in their problem is similar to the network used here, namely a loopy bipartite DAG. The inference algorithm they used is Pearl's belief propagation algorithm (Pearl, 1988), which outputs exact posteriors in linear time if the network is singly connected (i.e. without loops) but does not guarantee to converge for loopy networks. However, researchers have empirically demonstrate that by iterating the belief propagation algorithm several times, the outputted values often converge to the right posteriors (Murphy et al., 1999). Due to the existence of loops, we also apply belief propagation algorithm iteratively as our inference procedure.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>