<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1015">
  <Title>Word Alignment via Quadratic Assignment</Title>
  <Section position="3" start_page="112" end_page="115" type="intro">
    <SectionTitle>
2 Models
</SectionTitle>
    <Paragraph position="0"> We begin with a quick summary of the maximum weight bipartite matching model in (Taskar et al., 2005). More precisely, nodes V = Vs ∪ Vt correspond to words in the &quot;source&quot; (Vs) and &quot;target&quot; (Vt) sentences, and edges E = {jk : j ∈ Vs, k ∈ Vt} correspond to alignments between word pairs.1 The edge weights sjk represent the degree to which word j in one sentence can be translated using the word k in the other sentence. The predicted alignment is chosen by maximizing the sum of edge scores. A matching is represented using a set of binary variables yjk that are set to 1 if word j is assigned to word k in the other sentence, and 0 otherwise. The score of an assignment is the sum of edge scores: s(y) = ∑jk sjkyjk. For simplicity, let us begin by assuming that each word aligns to one or zero words in the other sentence; we revisit the issue of fertility in the next section. The maximum weight bipartite matching problem, arg maxy∈Y s(y), can be solved using combinatorial algorithms for min-cost max-flow, expressed in a linear programming (LP) formulation as follows:</Paragraph>
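The matching objective arg max_y s(y) can be made concrete with a small brute-force sketch. This is illustrative only: the function name and the tiny score matrix are our own assumptions, and a real system would use min-cost max-flow or an LP solver as described above.

```python
from itertools import permutations

def best_matching(S):
    """Brute-force arg max of s(y) = sum_jk S[j][k] * y_jk over matchings
    in which every word aligns to at most one word (fertility 0 or 1)."""
    m, n = len(S), len(S[0])
    best_score, best_edges = 0.0, []  # the empty matching scores 0
    # Pad the target side with None so source words may stay unaligned.
    targets = list(range(n)) + [None] * m
    for perm in permutations(targets, m):
        edges = [(j, k) for j, k in enumerate(perm) if k is not None]
        score = sum(S[j][k] for j, k in edges)
        if score > best_score:
            best_score, best_edges = score, edges
    return best_score, sorted(best_edges)
```

With S = [[2, -1], [1, 3]] the maximizer keeps the diagonal edges for a score of 5; with all-negative scores it returns the empty matching, since every word is allowed fertility 0.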
    <Paragraph position="2"> where the continuous variables zjk are a relaxation of the corresponding binary-valued variables yjk. This LP is guaranteed to have integral (and hence optimal) solutions for any scoring function s(y) (Schrijver, 2003). Note that although the above LP can be used to compute alignments, combinatorial algorithms are generally more efficient. [Figure 2 caption fragment: (a) the guess of the baseline M model; (b) the guess of the M+F fertility-augmented model.] For</Paragraph>
    <Paragraph position="3"> example, in Figure 1(a), we show a standard construction for an equivalent min-cost flow problem. However, we build on this LP to develop our extensions to this model below. Representing the prediction problem as an LP or an integer LP provides a precise (and concise) way of specifying the model and allows us to use the large-margin framework of Taskar (2004) for parameter estimation described in Section 3.</Paragraph>
    <Paragraph position="4"> For a sentence pair x, we denote position pairs by xjk and their scores as sjk. We let sjk = w⊤f(xjk) for some user-provided feature mapping f and abbreviate w⊤f(x,y) = ∑jk yjkw⊤f(xjk). We can include in the feature vector the identity of the two words, their relative positions in their respective sentences, their part-of-speech tags, their string similarity (for detecting cognates), and so on.</Paragraph>
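A concrete (hypothetical) instance of this scoring scheme: the feature map f and weight vector w below are toy choices of ours, not the paper's learned parameters, but they show how sjk = w⊤f(xjk) yields a score matrix over all word pairs.

```python
def edge_scores(src, tgt, w, f):
    """s_jk = w . f(x_jk): dot product of a weight vector with edge features."""
    return [[sum(wi * fi for wi, fi in zip(w, f(s, t))) for t in tgt]
            for s in src]

# Toy features: exact string match and a shared three-letter prefix
# (a crude stand-in for the string-similarity / cognate feature).
def f(s, t):
    return [1.0 if s == t else 0.0,
            1.0 if s[:3] == t[:3] else 0.0]

w = [2.0, 1.0]
S = edge_scores(["nation", "state"], ["national", "state"], w, f)
```

Here "nation"/"national" share a prefix (score 1.0) while "state"/"state" match exactly and share a prefix (score 3.0).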
    <Section position="1" start_page="113" end_page="114" type="sub_section">
      <SectionTitle>
2.1 Fertility
</SectionTitle>
      <Paragraph position="0"> An important limitation of the model in Eq. (1) is that in each sentence, a word can align to at most one word in the translation. Although it is common for words to have gold fertility zero or one, it is certainly not always true. Consider, for example, the bitext fragment shown in Figure 2(a), where backbone is aligned to the phrase épine dorsale. In this figure, outlines are gold alignments: square for sure alignments, round for possibles; filled squares are target alignments (for details on gold alignments, see Section 4). When considering only the sure alignments on the standard Hansards dataset, 7 percent of the word occurrences have fertility 2, and 1 percent have fertility 3 and above; when considering the possible alignments, high fertility is much more common: 31 percent of the words have fertility 3 and above.</Paragraph>
      <Paragraph position="1"> One simple fix to the original matching model is to increase the right-hand sides of the constraints in Eq. (1) from 1 to D, where D is the maximum allowed fertility. However, this change results in an undesirable bimodal behavior, where maximum weight solutions tend to assign every word fertility 0 or every word fertility D, depending on whether most scores sjk are negative or positive. For example, if scores tend to be positive, most words will collect as many alignments as they are permitted. What the model is missing is a means of encouraging the common case of low fertility (0 or 1) while still allowing higher fertility when it is licensed. This can be achieved by introducing a penalty for higher fertility and letting that penalty vary with features of the word in question (such as its frequency or identity).</Paragraph>
      <Paragraph position="2"> In order to model such a penalty, we introduce indicator variables z^d_{j·} (and z^d_{·k}) with the intended meaning: node j has fertility of at least d (and node k has fertility of at least d). In the following LP, we introduce a penalty of ∑_{2≤d≤D} s^d_{j·} z^d_{j·} for the fertility of node j, where each term s^d_{j·} ≥ 0 is the penalty increment for increasing the fertility from d − 1 to</Paragraph>
      <Paragraph position="4"> z^d_{j·}, ∀j ∈ Vs.</Paragraph>
      <Paragraph position="5"> We can show that this LP always has integral solutions by a reduction to a min-cost flow problem. The construction is shown in Figure 1(b). To ensure that the new variables have the intended semantics, we need to make sure that s^d_{j·} ≤ s^{d′}_{j·} if d ≤ d′, so that the lower-cost z^d_{j·} is used before the higher-cost z^{d′}_{j·} to increase fertility. This restriction implies that the penalty must be monotonic and convex as a function of the fertility. [Figure 1 caption fragment: (a) min-cost flow graph with source and sink; all edge capacities are 1, edges between round nodes (j, k) have cost −sjk, and edges from source and to sink have cost 0. (b) Expanded min-cost flow graph with new edges from source and to sink that allow fertility of up to 3; the capacities of the new edges are 1, and the costs are 0 for solid edges from source and to sink, s^2_{j·}, s^2_{·k} for dashed edges, and s^3_{j·}, s^3_{·k} for dotted edges. (c) Three types of pairs of edges included in the QAP model, where the nodes on both sides correspond to consecutive words.]</Paragraph>
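The monotone-and-convex requirement on the increments can be sketched as follows. The helper function and the numeric increment values are illustrative assumptions of ours, not values from the paper.

```python
def fertility_penalty(degree, increments):
    """Total penalty for a node of the given fertility, where increments[d]
    is the cost s^d of raising fertility from d-1 to d (d >= 2).
    Fertility 0 or 1 incurs no penalty."""
    costs = [increments[d] for d in sorted(increments)]
    # The LP semantics require nonnegative, nondecreasing increments,
    # i.e. a penalty that is monotonic and convex in the fertility.
    assert all(c >= 0 for c in costs)
    assert costs == sorted(costs)
    return sum(increments[d] for d in increments if d <= degree)

penalty = fertility_penalty(3, {2: 0.5, 3: 1.2})  # pays both increments
```

A word of fertility 1 pays nothing, while a word of fertility 3 pays both increments, so high fertility is discouraged but not forbidden.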
      <Paragraph position="6"> To anticipate the results that we report in Section 4, adding fertility to the basic matching model makes the target alignment of the backbone example feasible and, in this case, the model correctly labels this fragment as shown in Figure 2(b).</Paragraph>
    </Section>
    <Section position="2" start_page="114" end_page="115" type="sub_section">
      <SectionTitle>
2.2 First-order interactions
</SectionTitle>
      <Paragraph position="0"> An even more significant limitation of the model in Eq. (1) is that the edges interact only indirectly through the competition induced by the constraints. Generative alignment models like the HMM model (Vogel et al., 1996) and IBM models 4 and above (Brown et al., 1990; Och and Ney, 2003) directly model correlations between alignments of consecutive words (at least on one side). For example, Figure 3 shows a bitext fragment whose gold alignment is strictly monotonic. This monotonicity is quite common: 46% of the words in the hand-aligned data diagonally follow a previous alignment in this way. We can model the common local alignment configurations by adding bonuses for pairs of edges. For example, strictly monotonic alignments can be encouraged by boosting the scores of edge pairs of the form &lt;(j, k), (j + 1, k + 1)&gt;. Another trend, common in English-French translation (7% on the hand-aligned data), is the local inversion of nouns and adjectives, which typically involves a pair of edges &lt;(j, k + 1), (j + 1, k)&gt;. Finally, a word in one language is often translated as a phrase (a consecutive sequence of words) in the other language. This pattern involves pairs of edges with the same origin on one side: &lt;(j, k), (j, k+1)&gt; or &lt;(j, k), (j+1, k)&gt;. All three of these edge pair patterns are shown in Figure 1(c). Note that the set of such edge pairs Q = {jklm : |j − l| ≤ 1, |k − m| ≤ 1} is of linear size in the number of edges.</Paragraph>
      <Paragraph position="1"> Formally, we add to the model variables zjklm which indicate whether both edges jk and lm are in the alignment. We also add a corresponding score sjklm, which we assume to be non-negative, since the correlations we described are positive. (Negative scores can also be used, but the resulting formulation we present below would be slightly different.) To enforce the semantics zjklm = zjkzlm, we use a pair of constraints: zjklm ≤ zjk and zjklm ≤ zlm.</Paragraph>
      <Paragraph position="2"> Since sjklm is positive, at the optimum, zjklm = min(zjk, zlm). If in addition zjk and zlm are integral (0 or 1), then zjklm = zjkzlm. Hence, solving the following LP as an integer linear program will find the optimal quadratic assignment for our model:</Paragraph>
      <Paragraph position="4"> zjklm ≤ zjk, zjklm ≤ zlm, ∀jklm ∈ Q.</Paragraph>
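The quadratic objective can be sketched directly. The scorer below is a toy of ours: it evaluates s(y) plus the pair bonuses for a given alignment rather than solving the integer LP, but it makes the role of the set Q concrete.

```python
def qap_score(edges, S, pair_bonus):
    """Unary score sum_jk s_jk y_jk plus nonnegative bonuses s_jklm for
    pairs of chosen edges (jk, lm) with |j - l| <= 1 and |k - m| <= 1."""
    score = sum(S[j][k] for j, k in edges)
    for a in edges:
        for b in edges:
            # Count each unordered pair once, restricted to the set Q.
            if a < b and abs(a[0] - b[0]) <= 1 and abs(a[1] - b[1]) <= 1:
                score += pair_bonus.get((a, b), 0.0)
    return score

# A strictly monotonic pair <(0,0), (1,1)> picks up its bonus.
s = qap_score([(0, 0), (1, 1)], [[1.0, 0.0], [0.0, 1.0]],
              {((0, 0), (1, 1)): 0.5})
```

The bonus rewards the diagonal configuration, so the monotonic alignment scores 2.5 rather than 2.0.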
      <Paragraph position="5"> Note that we can also combine this extension with the fertility extension described above.</Paragraph>
      <Paragraph position="6"> To once again anticipate the results presented in Section 4, the baseline model of Taskar et al. (2005) makes the prediction given in Figure 3(a) because the two missing alignments are atypical translations of common words. With the addition of edge pair features, the overall monotonicity pushes the alignment to that of Figure 3(b).</Paragraph>
    </Section>
  </Section>
</Paper>