<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0617"> <Title>Morphology Induction From Term Clusters</Title> <Section position="5" start_page="128" end_page="189" type="metho"> <SectionTitle> 3 Method </SectionTitle> <Paragraph position="0"> In our experiments and the discussion that follows, stems are substrings of words to which affixes attach; affixes are substring classes denoted by Perl-style regular expressions (e.g., e?d$ or ^re). A transform is an affix substitution which entails a change of clusters. (While we have not experimented with other clustering approaches, we assume that the accuracy of the derived morphological information is not very sensitive to the particular methodology.)</Paragraph> <Paragraph position="1"> We depict the affix part of the transform using a Perl-style s/// operator. For example, the transform s/ed$/ing/ corresponds to the operation of replacing the suffix ed with ing.</Paragraph> <Section position="1" start_page="129" end_page="129" type="sub_section"> <SectionTitle> 3.1 Overview </SectionTitle> <Paragraph position="0"> The process of moving from term clusters to a transform automaton capable of analyzing novel forms consists of four stages: 1. Acquire candidate transforms. By searching for transforms that align a large number of terms in a given pair of clusters, we quickly identify affixation patterns that are likely to have syntactic significance.</Paragraph> <Paragraph position="1"> 2. Weight stems and transforms. The output of Step 1 is a set of transforms, some overlapping, others dubious. This step weights them according to their utility across the vocabulary, using an algorithm similar to Hubs and Authorities (Kleinberg, 1998).</Paragraph> <Paragraph position="2"> 3. Cull transforms. We segment the words in the vocabulary, using the transform weights to choose among alternative segmentations. 
Following this segmentation step, we discard any transform that failed to participate in at least one segmentation.</Paragraph> <Paragraph position="3"> 4. Construct an automaton. From the remaining transforms we construct an automaton whose nodes correspond to clusters and whose edges correspond to transforms. The resulting data structure can be used to construct morphological parses.</Paragraph> <Paragraph position="4"> The remainder of this section describes each of these steps in detail.</Paragraph> </Section> <Section position="2" start_page="129" end_page="130" type="sub_section"> <SectionTitle> 3.2 Acquiring Transforms </SectionTitle> <Paragraph position="0"> Once we are in possession of a sufficiently large number of term clusters, the acquisition of candidate transforms is conceptually simple. For each pair of clusters, we count the number of times each possible transform is in evidence, then discard those transforms occurring fewer than some small number of times.</Paragraph> <Paragraph position="1"> For each pair of clusters, we search for suffix or prefix pairs which, when stripped from matching members in the respective clusters, lead to as large a cluster intersection as possible.</Paragraph> <Paragraph position="2"> Table 2 (sample transforms and matching stems from the Wall Street Journal after the acquisition step): s/ful$/less/ (pain, harm, use, ...); s/^/over/ (charged, paid, hauled, ...); s/cked$/wing/ (kno, sho, che, ...); s/nd$/ts/ (le, se, fi, ...); s/s$/ed/ (recall, assert, add, ...); s/ts$/ted/ (asser, insis, predic, ...); s/es$/ing/ (argu, declar, acknowledg, ...); s/s$/ing/ (recall, assert, add, ...).</Paragraph> <Paragraph position="10"> 
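As an illustration, the cluster-pair search just described might be sketched as follows in Python (a minimal reconstruction, not the authors' implementation; it handles suffix transforms only, whereas the acquisition step also finds prefix transforms such as s/^/over/, and all function and variable names are our own):

```python
from collections import defaultdict

def candidate_transforms(cluster_a, cluster_b, min_stems=3):
    """Collect suffix-substitution transforms aligning words across two clusters.

    For every cross-cluster word pair, strip the longest common prefix and
    treat the two remainders as a candidate suffix pair; the shared prefix is
    the candidate stem.  Transforms backed by at least `min_stems` distinct
    stems survive (the paper's threshold of three).
    """
    stems = defaultdict(set)
    for w1 in cluster_a:
        for w2 in cluster_b:
            if w1 == w2:
                continue
            # longest common prefix = candidate stem
            i = 0
            while i < min(len(w1), len(w2)) and w1[i] == w2[i]:
                i += 1
            if i == 0:
                continue  # no shared stem material
            stems[(w1[i:], w2[i:])].add(w1[:i])
    return {t: s for t, s in stems.items() if len(s) >= min_stems}

cluster1 = ["walked", "talked", "jumped", "baked"]
cluster2 = ["walking", "talking", "jumping", "baking"]
found = candidate_transforms(cluster1, cluster2)
```

With these toy clusters, the only transform with a large enough intersection is ("ed", "ing"), i.e. s/ed$/ing/, with stems walk, talk, jump, and bak.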
For example, if walked and talked are in Cluster 1, and walking and talking are in Cluster 2, then walk and talk are in the intersection, given the transform s/ed$/ing/. In our experiments, we retain any cluster-to-cluster transform producing an intersection having at least three members.</Paragraph> <Paragraph position="11"> Table 2 lists some transforms derived from the WSJ as part of this process, along with a few of the stems they match. These were chosen for the sake of illustration; this list does not necessarily reflect the quality or distribution of the output. (For example, transforms based on the pattern s/$/s/ easily form the largest block.) A frequent problem is illustrated by the transforms s/s$/ed/ and s/ts$/ted/. Often, we observe alternative segmentations for the same words and must decide which to prefer. We resolve most of these questions using a simple heuristic. If one transform subsumes another--if the vocabulary terms it covers are a strict superset of those covered by the other transform--then we discard the subsumed transform. In the table, all members of the transform s/ts$/ted/ are also members of s/s$/ed/, so we drop s/ts$/ted/ from the set.</Paragraph> <Paragraph position="12"> The last two lines of the table represent an obvious opportunity to generalize. In cases like this, where two transforms are from the same cluster pair and involve source or destination affixes that differ in a single letter, the other affixes being equal, we introduce a new transform in which the elided letter is optional (in this example, the transform s/e?s$/ing/). 
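The subsumption heuristic can be sketched as follows (an illustrative Python rendering under our own naming; a transform is represented by its affix pair and matched stems, and coverage is compared on the vocabulary terms each transform accounts for):

```python
def covered_words(affixes, stems):
    """Vocabulary terms a transform accounts for: each stem plus each affix."""
    src, dst = affixes
    return {s + src for s in stems} | {s + dst for s in stems}

def drop_subsumed(transforms):
    """Discard any transform whose covered terms are a strict subset of
    another transform's covered terms (the subsumption heuristic)."""
    keep = {}
    for t, stems in transforms.items():
        mine = covered_words(t, stems)
        subsumed = any(
            mine < covered_words(u, other)  # strict subset check
            for u, other in transforms.items() if u != t
        )
        if not subsumed:
            keep[t] = stems
    return keep

transforms = {
    ("s", "ed"): {"recall", "assert", "add"},
    ("ts", "ted"): {"asser"},
}
kept = drop_subsumed(transforms)
```

Here s/ts$/ted/ covers only asserts/asserted, both of which are also covered by s/s$/ed/ (via the stem assert), so the narrower transform is dropped.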
The next step seeks to resolve this uncertainty about which transforms are valid.</Paragraph> </Section> <Section position="3" start_page="130" end_page="130" type="sub_section"> <SectionTitle> 3.3 Weighting Stems and Transforms </SectionTitle> <Paragraph position="0"> The observation that morphologically significant affixes are more likely to be frequent than arbitrary word endings is central to MDL-based systems. Of course, the same can be said about word stems: a string is more likely to be a stem if it is observed with a variety of affixes (or transforms). Moreover, our certainty that it is a valid stem increases with our confidence that the affixes we find attached to it are valid.</Paragraph> <Paragraph position="1"> This suggests that candidate affixes and stems can &quot;nominate&quot; each other in a way analogous to &quot;hubs&quot; and &quot;authorities&quot; on the Web (Kleinberg, 1998). In this step, we exploit this insight in order to weight the &quot;stem-ness&quot; and &quot;affix-ness&quot; of candidate strings. Our algorithm is closely based on the Hubs and Authorities Algorithm. We say that a stem and transform are &quot;linked&quot; if we have observed the stem to participate in the transform. Beginning with a uniform distribution over stems, we zero the weights associated with transforms, then propagate the stem weights to the transforms. For each stem s and transform t such that s and t are linked, the weight of s is added to the weight of t. Next, the stem weights are zeroed, and the transform weights propagated to the stems in the same way. This procedure is iterated a few times or until convergence (five times in these experiments).</Paragraph> </Section> <Section position="4" start_page="130" end_page="131" type="sub_section"> <SectionTitle> 3.4 Culling Transforms </SectionTitle> <Paragraph position="0"> The output of this procedure is a weighting of candidate stems, on the one hand, and transforms, on the other. 
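The mutual-reinforcement weighting of Section 3.3 can be sketched as follows (a Python illustration; the per-iteration normalization is our addition for numerical stability and is not specified in the text):

```python
def weight_stems_and_transforms(links, n_iters=5):
    """Iterative mutual reinforcement between stems and transforms, in the
    style of Hubs and Authorities (Kleinberg, 1998).

    `links` is a list of observed (stem, transform) pairs.
    """
    stems = {s for s, _ in links}
    transforms = {t for _, t in links}
    # begin with a uniform distribution over stems
    stem_w = {s: 1.0 / len(stems) for s in stems}
    trans_w = {t: 0.0 for t in transforms}
    for _ in range(n_iters):
        # zero the transform weights, then propagate stem weights to transforms
        trans_w = {t: 0.0 for t in transforms}
        for s, t in links:
            trans_w[t] += stem_w[s]
        # zero the stem weights, then propagate transform weights back
        stem_w = {s: 0.0 for s in stems}
        for s, t in links:
            stem_w[s] += trans_w[t]
        # normalize so weights stay bounded across iterations (our assumption)
        z = sum(stem_w.values())
        stem_w = {s: w / z for s, w in stem_w.items()}
    return stem_w, trans_w

links = [("walk", ("ed", "ing")), ("talk", ("ed", "ing")),
         ("walk", ("", "s")), ("kno", ("cked", "wing"))]
stem_w, trans_w = weight_stems_and_transforms(links)
```

In this toy example the transform ("ed", "ing"), linked to two stems, ends up outweighing the dubious ("cked", "wing"), and walk outweighs kno.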
Table 3 shows the three highest-weighted and three lowest-weighted transforms from an experiment involving the 10,000 most frequent words in the WSJ.</Paragraph> <Paragraph position="1"> Although these weights have no obvious linguistic interpretation, we can nevertheless use them to filter the transform set further. In general, however, there is no single threshold that removes all dubious transforms. It does appear to hold, though, that correct transforms (e.g., s/$/s/) outweigh competing incorrect transforms (e.g., s/w$/ws/). This observation motivates our culling procedure: We apply the transforms to the vocabulary in a competitive segmentation procedure, allowing highly weighted transforms to &quot;out-vote&quot; alternative transforms with lower weights. At the completion of this pass through the vocabulary, we retain only those transforms that contribute to at least one successful segmentation. Table 4 lists the segmentation procedure. In this pseudocode, w is a word, t a transform, and s a stem. The operation s + t produces the set of (two) words generated by applying the affixes of t to s; the operation w - t (the stemming operation) removes the longest matching affix of t from w. Given a word w, we first find the set of transforms associated with w, grouping them by the pair of words to which they correspond (Lines 4-8). For example, given the word &quot;created&quot; and the transforms s/ed$/ing/, s/ted$/ting/, and s/s$/d/, the first two transforms will be grouped together (both pair &quot;created&quot; with &quot;creating&quot;), while s/s$/d/ will be part of a different group.</Paragraph> <Paragraph position="4"> Once we have grouped associated transforms, we use them to stem w, accumulating evidence for different stemmings in a score table S. In Line 15, we then discard all but the highest scoring stemming. 
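A condensed Python sketch of the competitive segmentation idea (not the Table 4 pseudocode itself: the word-pair grouping of Lines 4-8 is collapsed, only suffix transforms are handled, and all names are ours):

```python
from collections import defaultdict

def strip_affix(word, affixes):
    """Remove the longest affix of the transform that matches `word`
    (suffix transforms only in this sketch)."""
    src, dst = affixes
    for affix in sorted((src, dst), key=len, reverse=True):
        if affix and word.endswith(affix):
            return word[: len(word) - len(affix)]
    return None

def segment(word, trans_w):
    """Competitive segmentation: alternative stemmings of `word` vote with
    their transform weights, and the best-scoring stemming wins.
    Returns (stem, contributing transforms), or None if nothing matches."""
    scores = defaultdict(float)
    contributors = defaultdict(list)
    for t, w in trans_w.items():
        stem = strip_affix(word, t)
        if stem:
            scores[stem] += w
            contributors[stem].append(t)
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best, contributors[best]

trans_w = {("ed", "ing"): 0.8, ("ted", "ting"): 0.1}
stem, culled = segment("created", trans_w)
```

Here the highly weighted s/ed$/ing/ out-votes s/ted$/ting/, so the stemming creat+ed wins over crea+ted, and only the winning transform would enter the culled set.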
The score of this stemming is then added to its &quot;global&quot; score in Line 16.</Paragraph> <Paragraph position="5"> The purpose of this procedure is the suppression of spurious segmentations in Line 15. Although this pseudocode returns only the highest weighted segmentation, it is usually the case that all candidate segmentations stored in S are valid, i.e., that several or all breakpoints of a product of multiple affixation are correctly found. A byproduct of this procedure is what we require for the final step in our pipeline: in addition to accumulating stemming scores, we record the transforms that contributed to them. We refer to this set of transforms as the culled set.</Paragraph> </Section> <Section position="5" start_page="131" end_page="189" type="sub_section"> <SectionTitle> 3.5 Constructing an Automaton </SectionTitle> <Paragraph position="0"> Given the culled set of transforms, creation of a parser is straightforward. In the last two steps we have considered a transform to be a pair of affixes (f1, f2). Recall that for each such transform there are one or more cluster-specific transforms of the form (c1, f1, c2, f2), in which the source and destination affixes are associated with particular clusters.</Paragraph> <Paragraph position="1"> We now convert this set of specific transforms into an automaton in which clusters form the nodes and arcs are affixation operations. For every cluster-specific transform (c1, f1, c2, f2), we draw an arc from c1 to c2, labeling it with the general transform (f1, f2), and draw the inverse arc from c2 to c1.</Paragraph> <Paragraph position="4"> We can now use this automaton for a kind of unsupervised morphological analysis. Given a word, we construct an analysis by finding paths through the automaton to known (or possibly unknown) stem words. 
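The construction and its use in analysis might look as follows (an illustrative Python sketch; a plain breadth-first search stands in for the heuristically ranked path generation described next, and the cluster IDs and known stems are invented for the example):

```python
from collections import defaultdict, deque

def build_automaton(specific):
    """Nodes are clusters; each cluster-specific transform (c1, f1, c2, f2)
    yields an arc c1 -> c2 labeled with the general transform (f1, f2),
    plus the inverse arc c2 -> c1 labeled (f2, f1)."""
    arcs = defaultdict(list)
    for c1, f1, c2, f2 in specific:
        arcs[c1].append((c2, (f1, f2)))
        arcs[c2].append((c1, (f2, f1)))
    return arcs

def apply_transform(word, transform):
    """Replace the source affix with the destination affix (suffixes only)."""
    src, dst = transform
    if word.endswith(src):
        return word[: len(word) - len(src)] + dst
    return None

def analyze(word, cluster, arcs, known_stems):
    """Breadth-first search for a path from `word` to a known stem,
    returning the stem and the sequence of transforms applied."""
    queue = deque([(word, cluster, [])])
    seen = {(word, cluster)}
    while queue:
        w, c, trace = queue.popleft()
        if w in known_stems:
            return w, trace
        for nxt, t in arcs[c]:
            new = apply_transform(w, t)
            if new and (new, nxt) not in seen:
                seen.add((new, nxt))
                queue.append((new, nxt, trace + [t]))
    return None

# toy automaton: cluster 1 (-ed forms) <-> cluster 2 (-ing forms) <-> cluster 3 (stems)
specific = [(1, "ed", 2, "ing"), (2, "ing", 3, "")]
arcs = build_automaton(specific)
result = analyze("walked", 1, arcs, {"walk"})
```

Given "walked" in cluster 1, the search follows s/ed$/ing/ to "walking" and then s/ing$// to the known stem "walk", yielding a two-step analysis trace.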
Each step replaces one (possibly empty) affix with another one, resulting in a new word form.</Paragraph> <Paragraph position="5"> In general, many such paths are possible. Most of these are redundant, generated by following given affixation arcs to alternative clusters (there are typically several plural noun clusters, for example) or collapsing compound affixations into a single operation.</Paragraph> <Paragraph position="6"> In our experiments, we generate all possible paths under the constraints that an operation lead to a known longer wordform, that the result be a possible stem of the given word, and that the operation not constitute a loop in the search.3 We then sort the analysis traces heuristically and return the top one as our analysis. In comparing two traces, we use the following criteria, in order: (1) prefer the trace with the shorter starting stem; (2) prefer the trace involving fewer character edits (the number of edits is summed across each trace, and the trace with the smaller sum is preferred); (3) prefer the trace having more correct cluster assignments of intermediate wordforms; (4) prefer the longer trace.</Paragraph> <Paragraph position="8"> Note that it is not always clear how to perform an affixation. Consider the transform s/ing$/e?d/, for example. In practice, however, this is not a source of difficulty. We attempt both possible expansions (with or without the &quot;e&quot;). If either produces a known wordform which is found in the destination cluster, we discard the other one. If neither resulting wordform can be found in the destination cluster, both are added to the frontier in our search.</Paragraph> </Section> </Section> </Paper>