<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3606"> <Title>Practical Markov Logic Containing First-Order Quantifiers with Application to Identity Uncertainty</Title> <Section position="4" start_page="41" end_page="45" type="metho"> <SectionTitle> 3 Markov logic networks </SectionTitle> <Paragraph position="0"> Let F = {Fi} be a set of first-order formulae with corresponding real-valued weights w = {wi}. Given a set of constants C = {ci}, define ni(x) to be the number of true groundings of Fi realized in a setting of the world given by atomic formulae x. A Markov logic network (MLN) (Richardson and Domingos, 2004) defines a joint probability distribution over possible worlds x. In this paper, we will work with discriminative MLNs (Singla and Domingos, 2005), which define the conditional distribution over a set of query atoms y given a set of evidence atoms x.</Paragraph> <Paragraph position="1"> Using the normalizing constant Zx, the conditional distribution is given by</Paragraph> <Paragraph position="2"> P(y \mid x) = \frac{1}{Z_x} \exp\Big( \sum_{i \in F_y} w_i \, n_i(x, y) \Big) \qquad (1) </Paragraph> <Paragraph position="3"> where Fy ⊆ F is the set of clauses for which at least one grounding contains a query atom, and ni(x,y) is the number of true groundings of the ith clause involving evidence atoms x and query atoms y.</Paragraph> <Section position="1" start_page="42" end_page="42" type="sub_section"> <SectionTitle> 3.1 Inference Complexity in Ground Markov Networks </SectionTitle> <Paragraph position="0"> The set of predicates and constants in Markov logic defines the structure of a Markov network, called a ground Markov network. In discriminative Markov logic networks, the resulting network is a conditional Markov network (also known as a conditional random field (Lafferty et al., 2001)).</Paragraph> <Paragraph position="1"> From Equation 1, the formulae Fy specify the structure of the corresponding Markov network as follows: each grounding of a predicate specified in Fy has a corresponding node in the Markov network, and an edge connects two nodes in the network if and only if their corresponding predicates co-occur in a grounding of a formula in Fy. Thus, the complexity of the formulae in Fy determines the complexity of the resulting Markov network, and therefore the complexity of inference. When Fy contains complex first-order quantifiers, the resulting Markov network may contain a prohibitively large number of nodes.</Paragraph> <Paragraph position="2"> For example, let the set of constants C be the set of authors {ai}, papers {pi}, and conferences {ci} from a research publication database. Predicates may include AuthorOf(ai,pj), AdvisorOf(ai,aj), and ProgramCommittee(ai,cj). Each grounding of a predicate corresponds to a random variable in the corresponding Markov network.</Paragraph> <Paragraph position="3"> It is important to notice how query predicates and evidence predicates differ in their impact on inference complexity. Grounded evidence predicates result in observed random variables that can be highly connected without increasing inference complexity. For example, consider the binary evidence predicate HaveSameLastName(ai ...ai+k). This aggregate predicate reflects information about a subset of (k − i + 1) constants. Its value depends on the values of HaveSameLastName(ai,ai+1), HaveSameLastName(ai,ai+2), etc. However, since all of the corresponding variables are observed, inference does not need to ensure their consistency or model their interaction.</Paragraph>
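Equation 1 depends on the data only through the counts ni(x, y) of true groundings of each clause. The following minimal Python sketch (the clause, the toy constants, and helper names such as count_true_groundings are illustrative assumptions, not from the paper) shows one way to count true groundings of a single clause and evaluate the unnormalized numerator of Equation 1.

```python
import itertools
import math

# Toy world: truth values of ground atoms, keyed by (predicate, args).
world = {
    ("AuthorOf", ("a1", "p1")): True,
    ("AuthorOf", ("a2", "p1")): True,
    ("Coauthors", ("a1", "a2")): True,   # query atom
    ("Coauthors", ("a2", "a1")): False,  # query atom
}

authors = ["a1", "a2"]
papers = ["p1"]

def clause_coauthor(x, y, p, atoms):
    """Ground clause: AuthorOf(x,p) AND AuthorOf(y,p) => Coauthors(x,y)."""
    body = atoms.get(("AuthorOf", (x, p)), False) and atoms.get(("AuthorOf", (y, p)), False)
    head = atoms.get(("Coauthors", (x, y)), False)
    return (not body) or head  # material implication

def count_true_groundings(atoms):
    """n_i(x, y): number of groundings of the clause that are true in this world."""
    return sum(
        clause_coauthor(x, y, p, atoms)
        for x, y in itertools.permutations(authors, 2)
        for p in papers
    )

weights = {"coauthor_rule": 1.5}

def unnormalized_score(atoms):
    """exp( sum_i w_i * n_i(x, y) ): the numerator of Equation 1, Z_x omitted."""
    return math.exp(weights["coauthor_rule"] * count_true_groundings(atoms))

print(count_true_groundings(world), unnormalized_score(world))
```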
<Paragraph position="5"> In contrast, complex query predicates can make inference more difficult. Consider the query predicate HaveSameAdvisor(ai ...ai+k). Here, the related predicates HaveSameAdvisor(ai,ai+1), HaveSameAdvisor(ai,ai+2), etc., all correspond to unobserved binary random variables that the model must predict. To ensure their consistency, the resulting Markov network must contain dependency edges between each of these variables, resulting in a densely connected network. Since inference in Markov networks in general scales exponentially with the size of the largest clique, inference in the grounded network quickly becomes intractable.</Paragraph> <Paragraph position="6"> One solution is to limit the expressivity of the predicates. In the previous example, we can decompose the predicate HaveSameAdvisor(ai ...ai+k) into its (k − i + 1)^2 corresponding pairwise predicates, such as HaveSameAdvisor(ai,ai+1). Answering an aggregate query about the advisors of a group of students can then be handled by a conjunction of these pairwise predicates.</Paragraph> <Paragraph position="7"> However, as discussed in Sections 1 and 2, we would like to reason about objects, not just pairs of mentions, because this enables richer evidence predicates. For example, the evidence predicates AtLeastTwoCoauthoredPapers(ai ...ai+k) and NumberOfStudents(ai) can be highly predictive of the query predicate HaveSameAdvisor(ai ...ai+k).</Paragraph> <Paragraph position="8"> Below, we describe a discriminative MLN for identity uncertainty that is able to reason at the object level.</Paragraph> </Section> <Section position="2" start_page="42" end_page="43" type="sub_section"> <SectionTitle> 3.2 Identity uncertainty </SectionTitle> <Paragraph position="0"> Typically, MLNs make a unique names assumption, requiring that different constants refer to distinct objects. In the publications database example, each author constant ai is a string representation of one author mention found in the text of a citation. The unique names assumption holds that each ai refers to a distinct author in the real world. This simplifies the network structure at the risk of weak or fallacious predictions (e.g., AdvisorOf(ai,aj) is erroneous if ai and aj actually refer to the same author). The identity uncertainty problem is the task of removing the unique names assumption by determining which constants refer to the same real-world objects.</Paragraph> <Paragraph position="1"> Richardson and Domingos (2004) address this concern by creating the predicate Equals(ci,cj) between each pair of constants. While this retains the coherence of the model, the restriction to pairwise predicates can be a drawback if there exist informative features over sets of constants. In particular, by capturing only features of pairs of constants, this solution cannot model the compatibility of object attributes, only of constant attributes (Section 2).</Paragraph> <Paragraph position="2"> Instead, we desire a conditional model that allows predicates to be defined over a set of constants.</Paragraph> <Paragraph position="3"> One approach is to introduce constants that represent objects, and connect them to their mentions by predicates such as IsMentionOf(ci,cj). In addition to computational issues, this approach also somewhat problematically requires choosing the number of objects. (See Richardson and Domingos (2004) for a brief discussion.) Instead, we propose instantiating aggregate predicates over sets of constants, such that a setting of these predicates implicitly determines the number of objects. This approach allows us to model attributes over entire objects, rather than only pairs of constants. In the following sections, we describe aggregate predicates in more detail, as well as the approximations necessary to implement them efficiently.</Paragraph> </Section>
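To make concrete how a setting of such aggregate equality predicates implicitly determines the number of objects, here is a minimal sketch (the representation of each true predicate as a tuple of mention identifiers and names such as clusters_from_predicates are illustrative assumptions, not from the paper): merging all mentions that co-occur in a true predicate yields the implied objects, so no prior over the number of objects is required.

```python
def clusters_from_predicates(mentions, true_predicates):
    """Derive the implied objects (clusters) from the aggregate predicates set to true.

    mentions: iterable of mention identifiers (constants).
    true_predicates: iterable of tuples of mentions, one per AreEqual(...) = true.
    """
    parent = {m: m for m in mentions}

    def find(m):                      # union-find with path compression
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for group in true_predicates:     # each true AreEqual merges its arguments
        first = group[0]
        for other in group[1:]:
            union(first, other)

    clusters = {}
    for m in mentions:
        clusters.setdefault(find(m), []).append(m)
    return list(clusters.values())

# Example: two true aggregate predicates over five author mentions.
mentions = ["a1", "a2", "a3", "a4", "a5"]
objects = clusters_from_predicates(mentions, [("a1", "a2"), ("a1", "a2", "a4")])
print(objects)          # [['a1', 'a2', 'a4'], ['a3'], ['a5']]
print(len(objects))     # implied number of objects: 3
```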
<Section position="3" start_page="43" end_page="44" type="sub_section"> <SectionTitle> 3.3 Aggregate predicates </SectionTitle> <Paragraph position="0"> Aggregate predicates are predicates that take as arguments an arbitrary number of constants. For example, the HaveSameAdvisor(ai ...ai+k) predicate in the previous section is an aggregate predicate over k − i + 1 constants.</Paragraph> <Paragraph position="1"> Let IC = {1...N} be the set of indices into the set of constants C, with power set P(IC). For any subset d ∈ P(IC), an aggregate predicate A(d) defines a property over the subset of constants d.</Paragraph> <Paragraph position="2"> Note that aggregate predicates can be translated into first-order formulae. For example, HaveSameAdvisor(ai ...ai+k) can be rewritten as ∀(ax,ay) ∈ {ai ...ai+k} SameAdvisor(ax,ay).</Paragraph> <Paragraph position="3"> By using aggregate predicates we make explicit the fact that we are modeling attributes at the object level.</Paragraph> <Paragraph position="4"> We distinguish between aggregate query predicates, which represent unobserved aggregate variables, and aggregate evidence predicates, which represent observed aggregate variables. Note that using aggregate query predicates can complicate inference, since they represent a collection of fully connected hidden variables. The main point of this paper is that although these aggregate query predicates are specifiable in MLNs, they have not been utilized because of the resulting inference complexity. We show that the gains made possible by these predicates often outweigh the approximations required for inference. As discussed in Section 3.1, for each aggregate query predicate A(d), it is critical that the model predict consistent values for every related subset of d.</Paragraph> <Paragraph position="5"> Enforcing this consistency requires introducing dependency edges between aggregate query predicates that share arguments. In general, this can be a difficult problem. Here, we focus on the special case for identity uncertainty where the main query predicate under consideration is AreEqual(d).</Paragraph> <Paragraph position="6"> The aggregate query predicate AreEqual(d) is true if and only if all constants di ∈ d refer to the same object. Since each subset of constants corresponds to a candidate object, a (consistent) setting of all the AreEqual predicates results in a solution to the object identification problem. The number of objects is chosen based on the optimal grounding of each of these aggregate predicates, and therefore does not require a prior over the number of objects.</Paragraph> <Paragraph position="7"> That is, once all the AreEqual predicates are set, they determine a clustering with a fixed number of objects. The number of objects is not modeled or set directly, but is implied by the result of MAP inference. (However, a posterior over the number of objects could be modeled discriminatively in an MLN (Richardson and Domingos, 2004).)
This formulation also allows us to compute aggregate evidence predicates over objects to help predict the value of each AreEqual predicate. For example, NumberFirstNames(d) returns the number of different first names used to refer to the object referenced by constants d. In this way, we can model aggregate features of an object, capturing the compatibility among its attributes.</Paragraph> <Paragraph position="8"> For a given C, there are |P(IC)| possible groundings of the AreEqual query predicates. Naively implemented, such an approach would require enumerating all subsets of constants, ultimately resulting in an unwieldy network.</Paragraph> <Paragraph position="9"> An equivalent way to state the problem is that using N-ary predicates results in a Markov network with one node for each grounding of the predicate.</Paragraph> <Paragraph position="10"> Since in the general case there is one grounding for each subset of C, the size of the corresponding Markov network will be exponential in |C|. See Figure 1 for an example instantiation of an MLN with three constants (a,b,c) and one AreEqual predicate.</Paragraph> [Figure 1: An MLN with three constants and the aggregate predicate AreEqual, instantiated for all possible subsets of size ≥ 2.] <Paragraph position="11"> In this paper, we provide algorithms to perform approximate inference and parameter estimation by incrementally instantiating these predicates as needed.</Paragraph> </Section> <Section position="4" start_page="44" end_page="45" type="sub_section"> <SectionTitle> 3.4 MAP Inference </SectionTitle> <Paragraph position="0"> Maximum a posteriori (MAP) inference seeks the solution to</Paragraph> <Paragraph position="1"> y^* = \operatorname{argmax}_y P(y \mid x) </Paragraph> <Paragraph position="2"> where y* is the setting of all the query predicates Fy (e.g. AreEqual) with the maximal conditional density.</Paragraph> <Paragraph position="3"> In large, densely connected Markov networks, a common approximate inference technique is loopy belief propagation (i.e. the max-product algorithm applied to a cyclic graph). However, the use of aggregate predicates makes it intractable even to instantiate the entire network, making max-product an inappropriate solution.</Paragraph> <Paragraph position="4"> Instead, we employ an incremental inference technique that grounds aggregate query predicates in an agglomerative fashion based on the model's current MAP estimates. This algorithm can be viewed as a greedy agglomerative search for a local optimum of P(Y|X), and has connections to recent work on correlation clustering (Bansal et al., 2004) and graph partitioning for MAP estimation (Boykov et al., 2001).</Paragraph> <Paragraph position="5"> First, note that finding the MAP estimate does not require computing Zx, since we are only interested in the relative values of each configuration, and Zx is fixed for a given x. Thus, at iteration t, we compute an unnormalized score for yt (the current setting of the query predicates) given the evidence predicates x as follows:</Paragraph> <Paragraph position="6"> S(y^t, x) = \sum_{i \in F^t} w_i \, n_i(x, y^t) </Paragraph> <Paragraph position="7"> where Ft ⊆ Fy is the set of aggregate predicates representing a partial solution to the object identification task for constants C, specified by yt.</Paragraph> <Paragraph position="8"> Algorithm 1: Approximate MAP Inference
1: Given initial predicates F0
2: while ScoreIsIncreased do
3:   F*i = FindMostLikelyPredicate(Ft)
4:   F*i = true
5:   Ft = ExpandPredicates(F*i, Ft)
6: end while
Algorithm 1 outlines a high-level description of the approximate MAP inference algorithm.</Paragraph>
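A minimal Python sketch in the spirit of Algorithm 1, under simplifying assumptions: the score callable stands in for S(yt, x) computed from the MLN weights, candidate predicates are represented as frozensets of mentions, and canopy pruning and look-ahead are omitted. Names such as greedy_map_inference and the toy scoring function are illustrative, not from the paper.

```python
from itertools import combinations

def greedy_map_inference(mentions, score):
    """Greedy agglomerative MAP search over AreEqual groundings.

    mentions: list of constants (e.g. citation or author mention ids).
    score: callable mapping a frozenset of true AreEqual groundings (each a
        frozenset of mentions) to a real number, standing in for S(y_t, x).
    """
    # Initialization: AreEqual restricted to pairs, all assumed false.
    candidates = {frozenset(p) for p in combinations(mentions, 2)}
    true_preds = frozenset()
    current = score(true_preds)

    while True:
        # Line 3: find the candidate whose activation most increases the score.
        best, best_score = None, current
        for cand in candidates:
            s = score(true_preds | {cand})
            if s > best_score:
                best, best_score = cand, s
        if best is None:
            return true_preds             # no improving predicate: local optimum
        true_preds = true_preds | {best}  # line 4: set it to true
        current = best_score
        candidates.discard(best)
        # Line 5: expand candidates with the merged constants plus one more mention.
        candidates |= {best | {m} for m in mentions if m not in best}

# Toy score: reward predicates whose mentions share a normalized name.
def toy_score(true_preds):
    return sum(1.0 if len({m.split("#")[0] for m in pred}) == 1 else -2.0
               for pred in true_preds)

print(greedy_map_inference(["smith#1", "smith#2", "li#1"], toy_score))
```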
<Paragraph> The algorithm first initializes the set of query predicates F0 such that all AreEqual predicates are restricted to pairs of constants, i.e. AreEqual(ci,cj) ∀(i,j). This is equivalent to a Markov network containing one unobserved random variable for each pair of constants, where each variable indicates whether the pair of constants refers to the same object.</Paragraph> <Paragraph position="9"> Initially, each AreEqual predicate is assumed false. In line 3, the procedure FindMostLikelyPredicate iterates through each query predicate in Ft, setting each to true in turn and calculating its impact on the scoring function. The procedure returns the predicate F*i such that setting F*i to true results in the greatest increase in the scoring function S(yt,x).</Paragraph> <Paragraph position="10"> Let (c*i ...c*j) be the set of constants "merged" by setting their AreEqual predicate to true. The ExpandPredicates procedure creates new predicates AreEqual(c*i ...c*j,ck ...cl) corresponding to all the potential predicates created by merging the constants c*i ...c*j with any of the other previously merged constants. For example, after the first iteration, a pair of constants (c*i,c*j) is merged. The set of predicates is then expanded to include AreEqual(c*i,c*j,ck) ∀ck, reflecting all possible additional references to the proposed object referenced by c*i,c*j.</Paragraph> <Paragraph position="11"> The algorithm continues until no predicate can be set to true that increases the score function.</Paragraph> <Paragraph position="12"> In this way, the final setting of Fy is a local maximum of the score function. As in other search algorithms, we can employ look-ahead to reduce the greediness of the search (i.e., consider multiple merges simultaneously), although we do not include look-ahead in the experiments reported here.</Paragraph> <Paragraph position="13"> It is important to note that each expansion of the aggregate query predicates Fy has a corresponding set of aggregate evidence predicates. These evidence predicates characterize the compatibility of the attributes of each hypothesized object.</Paragraph> <Paragraph position="14"> The space required for the above algorithm scales as Ω(|C|^2), since in the initialization step we must ground a predicate for each pair of constants. We use the canopy method of McCallum et al. (2000), which thresholds a "cheap" similarity metric to prune unnecessary comparisons. This pruning can also be applied at subsequent stages of inference to restrict which predicate variables will be introduced.</Paragraph> <Paragraph position="15"> Additionally, we must ensure that predicate settings at time t do not contradict settings at t − 1 (e.g. if Ft(a,b,c) = 1, then Ft+1(a,b) = 1). By greedily setting unobserved nodes to their MAP estimates, the inference algorithm ignores inconsistent settings and removes them from the search space.</Paragraph> </Section>
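The canopy-style pruning mentioned above can be sketched as follows (a simplified, single-threshold variant of the idea; the token-overlap similarity, the threshold, and the example mentions are illustrative assumptions, not the paper's exact configuration): pairs below the loose threshold are never instantiated as AreEqual predicates.

```python
from itertools import combinations

def cheap_similarity(m1, m2):
    """Cheap token-overlap similarity, used only for pruning (illustrative)."""
    t1, t2 = set(m1.lower().split()), set(m2.lower().split())
    return len(t1 & t2) / max(1, len(t1 | t2))

def candidate_pairs(mentions, loose=0.3):
    """Keep only pairs above a loose threshold; all other pairwise AreEqual
    predicates are pruned from the initial grounding."""
    return [
        (a, b)
        for a, b in combinations(mentions, 2)
        if cheap_similarity(a, b) >= loose
    ]

mentions = ["J. Smith", "John Smith", "J Smith", "W. Li"]
print(candidate_pairs(mentions))   # the three Smith pairs survive; Li pairs are pruned
```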
<Section position="5" start_page="45" end_page="45" type="sub_section"> <SectionTitle> 3.5 Parameter estimation </SectionTitle> <Paragraph position="0"> Given a fully labeled training set D of constants annotated with their referent objects, we would like to estimate the value of w that maximizes the likelihood of D. That is, w* = argmaxw Pw(y|x).</Paragraph> <Paragraph position="1"> When the data are few, we can explicitly instantiate all AreEqual(d) predicates, setting their corresponding nodes to the values implied by D. The likelihood is given by Equation 1, where the normalizer is</Paragraph> <Paragraph position="2"> Z_x = \sum_{y'} \exp\Big( \sum_{i \in F_y} w_i \, n_i(x, y') \Big) </Paragraph> <Paragraph position="3"> Although this sum over y' to calculate Zx is exponential in |y|, many inconsistent settings can be pruned as discussed in Section 3.4.</Paragraph> <Paragraph position="4"> In general, however, instantiating the entire set of predicates denoted by y and calculating Zx is intractable. Existing methods for MLN parameter estimation include pseudo-likelihood and voted perceptron (Richardson and Domingos, 2004; Singla and Domingos, 2005). We instead follow the recent success of piecewise training for complex undirected graphical models (Sutton and McCallum, 2005) and make the following two approximations. First, we avoid calculating the global normalizer Zx by calculating local normalizers, which sum only over the two values of each aggregate query predicate grounded in the training data. We therefore maximize the sum of local probabilities for each query predicate given the evidence predicates.</Paragraph> <Paragraph position="5"> This approximation can be viewed as constructing a log-linear binary classifier to predict whether an isolated set of constants refers to the same object.</Paragraph> <Paragraph position="6"> Input features include arbitrary first-order features over the input constants, and the output is a binary variable. The parameters of this classifier correspond to the weights w in the MLN. This simplification results in a convex optimization problem, which we solve by gradient-based optimization with L-BFGS, an approximate second-order method (Liu and Nocedal, 1989). The second approximation addresses the fact that not all query predicates from the training set can be instantiated. We instead sample a subset FS ⊆ Fy and maximize the likelihood of this subset. The sampling is not strictly uniform, but is instead obtained by collecting the predicates created while performing object identification with a weak method (e.g. string comparisons). More explicitly, predicates are sampled from the training data by performing greedy agglomerative clustering on the training mentions, using a scoring function that computes the similarity between two nodes by string edit distance. The goal of this clustering is not to exactly reproduce the training clusters, but to generate correct and incorrect clusters that have characteristics (size, homogeneity) similar to what will be present in the testing data.</Paragraph> </Section> </Section>
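The first approximation, local normalization, amounts to training a binary logistic model over aggregate evidence features of sampled candidate clusters. A minimal sketch under that reading (the feature definitions, the toy training data, and function names are illustrative assumptions; plain gradient ascent stands in for the L-BFGS optimizer used in the paper):

```python
import numpy as np

def features(cluster):
    """Aggregate evidence features of a candidate cluster (illustrative only)."""
    titles = [c["title"].lower() for c in cluster]
    last_names = {c["last"].lower() for c in cluster}
    return np.array([
        1.0,                                    # bias
        1.0 if len(set(titles)) == 1 else 0.0,  # AllTitlesMatch-style feature
        1.0 if len(last_names) == 1 else 0.0,   # all last names identical
        float(len(cluster)),                    # cluster size
    ])

def train_local(clusters, labels, lr=0.1, steps=500):
    """Maximize the sum of local log-probabilities of each sampled AreEqual
    predicate: a logistic model whose weights play the role of w."""
    X = np.stack([features(c) for c in clusters])
    y = np.array(labels, dtype=float)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # local P(AreEqual = true | evidence)
        w += lr * X.T @ (y - p)            # gradient of the local log-likelihood
    return w

# Tiny hand-made training sample: one coreferent cluster, one non-coreferent.
pos = [{"title": "markov logic", "last": "smith"},
       {"title": "markov logic", "last": "smith"}]
neg = [{"title": "markov logic", "last": "smith"},
       {"title": "parsing", "last": "li"}]
print(train_local([pos, neg], [1, 0]))
```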
<Section position="5" start_page="45" end_page="46" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> We perform experiments on two object identification tasks: citation matching and author disambiguation.</Paragraph> <Paragraph position="1"> Citation matching is the task of determining whether two research paper citation strings refer to the same paper. We use the Citeseer corpus (Lawrence et al., 1999), containing approximately 1500 citations, 900 of which are unique. The citations are manually labeled with cluster identifiers, and the strings are segmented into fields such as author, title, etc. The citation data is split into four disjoint categories by topic, and the results presented are obtained by training on three categories and testing on the fourth.</Paragraph> <Paragraph position="2"> Using first-order logic, we create a number of aggregate predicates such as AllTitlesMatch, AllAuthorsMatch, AllJournalsMatch, etc., as well as their existential counterparts, ThereExistsTitleMatch, etc. We also include count predicates, which indicate the number of these matches in a set of constants.</Paragraph> <Paragraph position="3"> Additionally, we add edit distance predicates, which compute approximate matches between title fields, etc., for each pair of citations in a set of citations. Aggregate features are built from these, such as "there exists a pair of citations in this cluster whose titles are less than 30% similar" and "the minimum edit distance between titles in a cluster is greater than 50%." We evaluate using pairwise precision, recall, and F1, which measure the system's ability to predict whether each pair of constants refers to the same object or not. Table 1 shows the advantage of our proposed model (Objects) over a model that only considers pairwise predicates of the same features (Pairs). Note that Pairs is a strong baseline that performs collective inference of citation matching decisions, but is restricted to use only IsEqual(ci,cj) predicates over pairs of citations. Thus, the performance difference is due to the ability to model first-order features of the data.</Paragraph> [Table 1: Citation matching task, where Objects is an MLN using aggregate predicates and Pairs is an MLN using only pairwise predicates. Objects outperforms Pairs on three of the four testing sets.] [Table 2: Author disambiguation task. Objects outperforms Pairs on two of the three testing sets.]
             Objects                Pairs
             pr     re     f1       pr     re      f1
  miller d   73.9   29.3   41.9     44.6   100.0   61.7
  li w       39.4   47.9   43.2     22.1   100.0   36.2
  smith b    61.2   70.1   65.4     14.5   100.0   25.4
<Paragraph position="4"> Author disambiguation is the task of deciding whether two strings refer to the same author. To increase the task complexity, we collect citations from the Web containing different authors with matching last names and first initials. Thus, simply performing a string match on the author's name would be insufficient in many cases. We searched for three common last name / first initial combinations (Miller, D; Li, W; Smith, B). From this set, we collected 400 citations referring to 56 unique authors. For these experiments, we train on two subsets and test on the third.</Paragraph> <Paragraph position="5"> We generate aggregate predicates similar to those used for citation matching. Additionally, we include features indicating the overlap of title tokens and indicating whether there exists a pair of authors in the cluster with different middle names. This last feature exemplifies the sort of reasoning enabled by aggregate predicates: consider, for example, a pairwise predicate indicating whether two authors have the same middle name. Very often, middle name information is unavailable, so the name "Miller, A." may have high similarity to both "Miller, A. B." and "Miller, A. C.". However, it is unlikely that the same person has two different middle names, and our model learns a weight for this aggregate feature. Table 2 demonstrates the advantage of this method.</Paragraph> <Paragraph position="6"> Overall, Objects achieves F1 scores superior to Pairs on 5 of the 7 datasets. These results indicate the potential advantages of using complex first-order quantifiers in MLNs.</Paragraph>
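As an illustration of the object-level reasoning described above, here is a minimal sketch of a middle-name compatibility feature over a candidate cluster (the name-parsing heuristics and function names are illustrative assumptions, not the paper's implementation): missing middle initials are treated as compatible with anything, but two distinct observed middle initials flag the cluster.

```python
import re

def middle_initials(name):
    """Extract middle initials from a name like 'Miller, A. B.' (illustrative)."""
    parts = name.split(",", 1)
    given = parts[1] if len(parts) > 1 else ""
    initials = re.findall(r"[A-Z]", given)
    return initials[1:]          # everything after the first given-name initial

def has_conflicting_middle_names(cluster):
    """Aggregate feature: does any pair in the cluster disagree on middle initials?"""
    observed = [tuple(mi) for mi in map(middle_initials, cluster) if mi]
    return len(set(observed)) > 1

print(has_conflicting_middle_names(["Miller, A.", "Miller, A. B."]))     # False
print(has_conflicting_middle_names(["Miller, A. B.", "Miller, A. C."]))  # True
```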
<Paragraph position="7"> The cases in which Pairs outperforms Objects are likely due to the greediness of the approximate inference used in Objects. Increasing the robustness of inference is a topic of future research.</Paragraph> </Section> </Paper>