<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-2002">
  <Title>Factored Language Models and Generalized Parallel Backoff</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Factored Language Models
</SectionTitle>
    <Paragraph position="0"> In a factored language model, a word is viewed as a vector of K factors, so that w_t ≡ {f_t^1, f_t^2, ..., f_t^K}. Factors can be anything, including morphological classes, stems, roots, and other such features in highly inflected languages (e.g., Arabic, German, Finnish), or data-driven word classes or semantic features useful for sparsely inflected languages (e.g., English). Clearly, a two-factor FLM generalizes standard class-based language models, where one factor is the word class and the other is the word itself. An FLM is a model over factors, i.e., p(f_t^{1:K} | f_{t-n:t-1}^{1:K}), that can be factored as a product of probabilities of the form p(f | f1, f2, ..., fN).</Paragraph>
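The factored view above can be made concrete with a minimal sketch (our own toy example, not the paper's code): each word is a tuple of factors, and a maximum-likelihood conditional over a chosen child factor and parent factors is estimated from counts. The factor choices and the helper `p` are illustrative assumptions.

```python
from collections import Counter

# Each word is a tuple of factors, e.g. (surface form, stem, class).
sentence = [("banks", "bank", "NOUN"), ("lend", "lend", "VERB"),
            ("money", "money", "NOUN")]

bigram, context = Counter(), Counter()
for prev, cur in zip(sentence, sentence[1:]):
    # Condition the current word's class (one child factor) on the
    # previous word's stem and class (two heterogeneous parent factors).
    parents = (prev[1], prev[2])
    bigram[(cur[2], parents)] += 1
    context[parents] += 1

def p(child_class, parents):
    """Maximum-likelihood p(f | f1, f2) over factored counts."""
    return bigram[(child_class, parents)] / context[parents] if context[parents] else 0.0
```

With this tiny corpus, `p("VERB", ("bank", "NOUN"))` is 1.0, since the only word following a "bank"-stemmed noun is a verb.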
    <Paragraph position="1"> Our task is twofold: 1) find an appropriate set of factors, and 2) induce an appropriate statistical model over those factors (i.e., the structure learning problem in graphical models (Bilmes, 2003; Friedman and Koller, 2001)).</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Generalized Parallel Backoff
</SectionTitle>
    <Paragraph position="0"> An individual FLM probability model can be seen as a directed graphical model over a set of N+1 random variables, with child variable F and N parent variables F1 through FN (if factors are words, then F = W_t and F_i = W_{t-i}). Two features make an FLM distinct from a standard language model: 1) the variables {F, F1, ..., FN} can be heterogeneous (e.g., words, word clusters, morphological classes, etc.); and 2) there is no obvious natural (e.g., temporal) backoff order as in standard word-based language models. With word-only models, backoff proceeds by dropping first the oldest word, then the next oldest, and so on until only the unigram remains. In p(f | f1, f2, ..., fN), however, many of the parent variables might be the same age. Even if the variables have differing seniorities, it is not necessarily best to drop the oldest variable first.</Paragraph>
    <Paragraph position="1"> [Figure 1 caption:] ...single-step backoff paths, where exactly one variable is dropped per backoff step. The SRILM-FLM extensions, however, also support multi-level backoff.</Paragraph>
    <Paragraph position="2"> We introduce the notion of a backoff graph (Figure 1) to depict this issue: it shows the various backoff paths from the all-parents case (top graph node) to the unigram (bottom graph node). Many possible backoff paths could be taken. For example, when all variables are words, the path A→B→E→H corresponds to a trigram with the standard oldest-first backoff order. The path A→D→G→H is a reverse-time backoff model. This can be seen as a generalization of lattice-based language modeling (Dupont and Rosenfeld, 1997) where factors consist of words and hierarchically derived word classes. In our GPB procedure, either a single distinct path is chosen for each gram or multiple parallel paths are used simultaneously. In either case, the set of backoff path(s) chosen is determined dynamically (at &quot;run-time&quot;) based on the current values of the variables. For example, a path might consist of nodes A→(BCD)→(EF)→G, where node A backs off in parallel to the three nodes B, C, and D; node B backs off to nodes (EF); C backs off to (E); and D backs off to (F).</Paragraph>
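The backoff graph can be sketched programmatically (our own illustration, not part of SRILM-FLM): enumerate every single-step backoff path from the all-parents node down to the unigram, dropping exactly one parent variable per step.

```python
def backoff_paths(parents):
    """Yield each backoff path as a list of parent sets,
    dropping exactly one parent variable per step."""
    if not parents:
        yield [frozenset()]          # the unigram node
        return
    for p in sorted(parents):
        for rest in backoff_paths(parents - {p}):
            yield [frozenset(parents)] + rest

paths = list(backoff_paths(frozenset({"F1", "F2", "F3"})))
```

For three parents there are 3! = 6 such paths, each of length four (all-parents node, two intermediate nodes, unigram), matching the shape of the backoff graph described above.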
    <Paragraph position="3"> This can be seen as a generalization of the standard backoff equation. In the two-parents case, this becomes:

      pGBO(f | f1, f2) = dN(f,f1,f2) pML(f | f1, f2)   if N(f, f1, f2) > τ
                       = α(f1, f2) g(f, f1, f2)        otherwise

    </Paragraph>
    <Paragraph position="5"> where dN(f,f1,f2) is a standard discount (determining the smoothing method), pML is the maximum-likelihood distribution, α(f1, f2) are backoff weights, and g(f, f1, f2) is an arbitrary non-negative backoff function of its three factor arguments. Standard backoff occurs with g(f, f1, f2) = pBO(f | f1), but the GPB procedures can be obtained by using different g-functions. For example, g(f, f1, f2) = pBO(f | f2) corresponds to a different backoff path, and parallel backoff is obtained by using an appropriate g (see below). As long as g is non-negative, the backoff weights are defined as follows:

      α(f1, f2) = (1 - Σ_{f : N(f,f1,f2) > τ} dN(f,f1,f2) pML(f | f1, f2)) / (Σ_{f : N(f,f1,f2) ≤ τ} g(f, f1, f2))

    </Paragraph>
    <Paragraph position="7"> This equation is non-standard only in the denominator, where one may no longer sum over only the factors f with counts greater than τ. This is because g is not necessarily a distribution (i.e., it does not sum to unity). Backoff-weight computation can therefore be more expensive for certain g-functions, but this appears not to be prohibitive, as demonstrated in the next few sections.</Paragraph>
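The two-parent GPB recipe can be sketched numerically. The counts, the threshold `tau`, the constant discount `d`, and the choice g(f, f1, f2) = pBO(f | f1) below are all illustrative assumptions; the final check confirms that the backoff weight α renormalizes the model so it sums to one.

```python
from collections import Counter

counts3 = Counter({("a", "x", "y"): 4, ("b", "x", "y"): 1})   # N(f, f1, f2)
counts2 = Counter({("a", "x"): 5, ("b", "x"): 2, ("c", "x"): 1})  # N(f, f1)
vocab = {"a", "b", "c"}
tau, d = 1, 0.8                      # count threshold and constant discount

def p_bo1(f, f1):
    """Lower-order model pBO(f | f1), here plain maximum likelihood."""
    return counts2[(f, f1)] / sum(counts2[(v, f1)] for v in vocab)

def g(f, f1, f2):
    """One possible g-function: back off to the first parent only."""
    return p_bo1(f, f1)

def alpha(f1, f2):
    """Backoff weight: leftover mass divided by the g-mass of low-count f."""
    total = sum(counts3[(v, f1, f2)] for v in vocab)
    kept = sum(d * counts3[(v, f1, f2)] / total
               for v in vocab if counts3[(v, f1, f2)] > tau)
    rest = sum(g(v, f1, f2) for v in vocab if counts3[(v, f1, f2)] <= tau)
    return (1.0 - kept) / rest

def p_gbo(f, f1, f2):
    n = counts3[(f, f1, f2)]
    if n > tau:
        total = sum(counts3[(v, f1, f2)] for v in vocab)
        return d * n / total         # discounted maximum likelihood
    return alpha(f1, f2) * g(f, f1, f2)

total_mass = sum(p_gbo(v, "x", "y") for v in vocab)
```

Because g here is itself a distribution, the denominator of α could have been computed by complement; for a general non-negative g it must be summed explicitly over the low-count factors, which is exactly the extra cost the paragraph above mentions.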
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 SRILM-FLM extensions
</SectionTitle>
    <Paragraph position="0"> During the recent 2002 JHU workshop (Kirchhoff et al., 2003), significant extensions were made to the SRI language modeling toolkit (Stolcke, 2002) to support arbitrary FLMs and GPB procedures. This uses a graphical-model-like specification language, and many different backoff functions (19 in total) were implemented.</Paragraph>
    <Paragraph position="1"> Other features include: 1) all SRILM smoothing methods at every node in a backoff graph; 2) graph-level skipping; and 3) up to 32 possible parents (e.g., a 33-gram). Two of the backoff functions are (in the three-parents case):</Paragraph>
    <Paragraph position="3"> (call this g2), where N(·) is the count function. Implemented backoff functions include maximum/minimum (normalized) counts or backoff probabilities, products, sums, minima, maxima, (weighted) averages, and geometric means.</Paragraph>
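A "max counts" g-function of the kind listed above can be sketched as follows (a reconstruction under our own names and numbers, not SRILM-FLM code): among the candidate two-parent backoff nodes of a three-parent model, pick the pair whose count is largest and back off to it.

```python
from collections import Counter
from itertools import combinations

# Illustrative counts N(f1, f2) for each surviving parent pair.
pair_counts = Counter({("f1", "f2"): 3, ("f1", "f3"): 9, ("f2", "f3"): 4})

def best_backoff_pair(parents):
    """g1-style choice: keep the parent pair with the largest count;
    the lower-order model over that pair would then be used."""
    return max(combinations(sorted(parents), 2),
               key=lambda pair: pair_counts[pair])

chosen = best_backoff_pair({"f1", "f2", "f3"})
```

A g2-style variant would instead rank pairs by normalized counts, i.e., divide each pair's count by its context count before taking the maximum; only the `key` function changes.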
  </Section>
</Paper>