<?xml version="1.0" standalone="yes"?> <Paper uid="P93-1003"> <Title>AN ALGORITHM FOR FINDING NOUN PHRASE CORRESPONDENCES IN BILINGUAL CORPORA</Title> <Section position="4" start_page="94304" end_page="94304" type="metho"> <SectionTitle> COMPONENTS </SectionTitle> <Paragraph position="0"> Figure 1 illustrates how the corpus is analyzed.</Paragraph> <Paragraph position="1"> The words in sentences are first tagged with their corresponding part-of-speech categories. Each tagger contains a hidden Markov model (HMM), which is trained using samples of raw text from the Hansards for each language. The taggers are robust and operate with a low error rate [Kupiec, 1992]. Simple noun phrases (excluding pronouns and digits) are then extracted from the sentences by finite-state recognizers that are specified by regular expressions defined in terms of part-of-speech categories. Simple noun phrases are identified because they are most reliably recognized; it is also assumed that they can be identified unambiguously. The only embedding that is allowed is by prepositional phrases involving &quot;of&quot; in English and &quot;de&quot; in French, as noun phrases involving them can be identified with relatively low error (revisions to this restriction are considered later).</Paragraph> <Paragraph position="2"> Noun phrases are placed in an index to associate a unique identifier with each one.</Paragraph> <Paragraph position="3"> A noun phrase is defined by its word sequence, excluding any leading determiners. Singular and plural forms of common nouns are thus distinct and assigned different positions in the index. For each sentence corresponding to an alignment, the index positions of all noun phrases in the sentence are recorded in a separate data structure, providing a compact representation of the corpus.</Paragraph> <Paragraph position="4"> So far it has been assumed (for the sake of simplicity) that there is always a one-to-one mapping between English and French sentences.
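As a concrete illustration of the extraction and indexing steps described above, the following sketch matches simple noun phrases over part-of-speech tags and assigns each distinct phrase (minus any leading determiner) a unique index position. The tag names (DET, ADJ, NOUN), the matching procedure, and all function names are illustrative assumptions, not the system actually used; the paper's recognizers are finite-state machines compiled from regular expressions over POS categories.

```python
def extract_simple_nps(tagged):
    """Extract simple noun phrases from a list of (word, tag) pairs.

    Illustrative pattern assumed here: an optional determiner, then a
    run of adjectives and nouns that must end in a noun.  The leading
    determiner is dropped from the recorded phrase, so singular and
    plural noun forms remain distinct phrases."""
    phrases = []
    i, n = 0, len(tagged)
    while n > i:
        j = i + 1 if tagged[i][1] == "DET" else i   # skip a leading determiner
        start = j
        while n > j and tagged[j][1] in ("ADJ", "NOUN"):
            j += 1
        while j > start and tagged[j - 1][1] != "NOUN":
            j -= 1   # a phrase must end in a noun; trim trailing adjectives
        if j > start:
            phrases.append([word for word, tag in tagged[start:j]])
            i = j
        else:
            i += 1
    return phrases


class NPIndex:
    """Associates a unique integer position with each distinct phrase."""

    def __init__(self):
        self.position = {}   # phrase (as tuple of words) to index position

    def add(self, phrase):
        key = tuple(phrase)
        return self.position.setdefault(key, len(self.position))
```

Because the index is keyed on the full word sequence, "tall dog" and "tall dogs" receive different positions, exactly as the text requires.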
In practice, if an alignment program produces blocks of several sentences in one or both languages, this can be accommodated by treating the block instead as a single larger &quot;compound sentence&quot; in which noun phrases have more possible correspondences.</Paragraph> </Section> <Section position="5" start_page="94304" end_page="94304" type="metho"> <SectionTitle> THE MAPPING ALGORITHM </SectionTitle> <Paragraph position="0"> Some terminology is necessary to describe the algorithm concisely. Let there be L total alignments in the corpus; then Ei is the English sentence for alignment i. Let the function φ(Ei) be the number of noun phrases identified in the sentence. If there are k of them, k = φ(Ei), and they can be referenced by j = 1...k. Considering the j'th noun phrase in sentence Ei, the function μ(Ei, j) produces an identifier for the phrase, which is the position of the phrase in the English index. If this phrase is at position s, then μ(Ei, j) = s.</Paragraph> <Paragraph position="1"> In turn, the French sentence Fi will contain φ(Fi) noun phrases and, given the p'th one, its position in the French index will be given by μ(Fi, p). It will also be assumed that there are a total of VE and VF phrases in the English and French indexes respectively. Finally, the indicator function I(·) has the value unity if its argument is true, and zero otherwise.</Paragraph> <Paragraph position="2"> Assuming these definitions, the algorithm is stated in Figure 2.</Paragraph> <Paragraph position="4"> The equations assume a directionality: finding French &quot;target&quot; correspondences for English &quot;source&quot; phrases. The algorithm is reversible, by swapping E with F.</Paragraph> <Paragraph position="5"> The model for correspondence is that a source noun phrase in Ei is responsible for producing the various different target noun phrases in Fi with correspondingly different probabilities.
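The compact per-alignment representation and the two helper functions defined above can be pictured with a small sketch. Here phi stands for the noun-phrase count function and mu for the index-position function described in the text; the toy corpus data and all names are illustrative assumptions, not the paper's implementation.

```python
# Compact corpus representation (illustrative): each alignment i stores
# only the index positions of the noun phrases found in its English and
# French sentences, as pairs of position lists.
corpus = [
    # (English NP index positions, French NP index positions)
    ([0, 1], [0, 2]),
    ([1, 2], [1, 2, 3]),
]


def phi(sentence_positions):
    """Number of noun phrases identified in the sentence."""
    return len(sentence_positions)


def mu(sentence_positions, j):
    """Index position of the j'th noun phrase in the sentence (1-based j)."""
    return sentence_positions[j - 1]
```

With this layout, iterating over all (source, target) noun-phrase pairs of an alignment is a double loop over two short lists rather than a pass over the sentence text.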
Two quantities are calculated: Cr(s, t) and Pr(s, t). Computation proceeds by evaluating Equation (1), then Equation (2), and then iteratively applying Equations (3) and (2), with r increasing at each successive iteration. The argument s refers to the English noun phrase npE(s) having position s in the English index, and the argument t refers to the French noun phrase npF(t) at position t in the French index. Equation (1) assumes that each English noun phrase in Ei is initially equally likely to correspond to each French noun phrase in Fi. All correspondences are thus equally weighted, reflecting a state of ignorance. Weights are summed over the corpus, so noun phrases that co-occur in several sentences will have larger sums. The weights C0(s, t) can be interpreted as the mean number of times that npF(t) corresponds to npE(s) given the corpus and the initial assumption of equiprobable correspondences.</Paragraph> <Paragraph position="6"> These weights can be used to form a new estimate of the probability that npF(t) corresponds to npE(s), by considering the mean number of times npF(t) corresponds to npE(s) as a fraction of the total mean number of correspondences for npE(s), as in Equation (2). The procedure is then iterated using Equations (3) and (2) to obtain successively refined, convergent estimates of the probability that npF(t) corresponds to npE(s).</Paragraph> <Paragraph position="8"> The probability of correspondences can be used as a method of ranking them (occurrence counts can be taken into account as an indication of the reliability of a correspondence). Although Figure 2 defines the coefficients simply, the algorithm is not implemented literally from it. The algorithm employs a compact representation of the correspondences for efficient operation. An arbitrarily large corpus can be accommodated by segmenting it appropriately.
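Since Equations (1)-(3) themselves are not reproduced in this extract, the sketch below reconstructs the iteration from the surrounding description: weights are initialized uniformly within each aligned sentence pair, normalized into probabilities per source phrase, and re-accumulated over the corpus. The within-sentence normalization and all names are assumptions, not the paper's code.

```python
from collections import defaultdict


def normalize(C):
    """Probabilities from weights: P(s, t) is C(s, t) as a fraction of
    the total weight for source phrase s (the role of Equation (2))."""
    totals = defaultdict(float)
    for (s, t), w in C.items():
        totals[s] += w
    return {(s, t): w / totals[s] for (s, t), w in C.items()}


def em_correspondences(corpus, iterations=5):
    """corpus: list of (English NP positions, French NP positions) pairs,
    one per alignment.  Returns probabilities P[(s, t)]."""
    # Initialization (the role of Equation (1)): each English noun phrase
    # in Ei is equally likely to correspond to each French noun phrase
    # in Fi, so each pair gets weight 1 / phi(Fi), summed over the corpus.
    C = defaultdict(float)
    for eng, fre in corpus:
        for s in eng:
            for t in fre:
                C[(s, t)] += 1.0 / len(fre)
    for _ in range(iterations):
        P = normalize(C)
        # Re-estimation (the role of Equation (3)): accumulate mean
        # correspondence counts, giving each (s, t) pair its share of
        # probability within its aligned sentence pair.
        C = defaultdict(float)
        for eng, fre in corpus:
            for s in eng:
                z = sum(P[(s, t)] for t in fre)
                for t in fre:
                    C[(s, t)] += P[(s, t)] / z
    return normalize(C)
```

On a toy corpus where one alignment pins npE(0) to npF(0), the iteration sharpens that pair's probability toward one while genuinely ambiguous pairs stay evenly split, which is the convergent refinement the text describes.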
The algorithm described here is an instance of a general approach to statistical estimation represented by the EM algorithm [Dempster et al., 1977]. In contrast to reservations that have been expressed [Gale and Church, 1991a] about using the EM algorithm to provide word correspondences, there have been no indications that prohibitive amounts of memory might be required, or that the approach lacks robustness. Unlike the other methods that have been mentioned, the approach has the capability to accommodate more context to improve performance.</Paragraph> </Section> </Paper>