<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1015">
  <Title>Incremental Parsing with the Perceptron Algorithm</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The General Framework
</SectionTitle>
    <Paragraph position="0"> In this section we describe a general framework, linear models for NLP, that could be applied to a diverse range of tasks, including parsing and tagging. We then describe a particular method for parameter estimation, which is a generalization of the perceptron algorithm. Finally, we give an abstract description of an incremental parser, and describe how it can be used with the perceptron algorithm.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Linear Models for NLP
</SectionTitle>
      <Paragraph position="0"> We follow the framework outlined in Collins (2002; 2004). The task is to learn a mapping from inputs $x \in \mathcal{X}$ to outputs $y \in \mathcal{Y}$. For example, $\mathcal{X}$ might be a set of sentences, with $\mathcal{Y}$ being a set of possible parse trees. We assume:</Paragraph>
      <Paragraph position="1"> Training examples $(x_i, y_i)$ for $i = 1 \ldots n$.</Paragraph>
      <Paragraph position="2"> A function $\mathbf{GEN}$ which enumerates a set of candidates $\mathbf{GEN}(x)$ for an input $x$.</Paragraph>
      <Paragraph position="3"> A representation $\Phi$ mapping each $(x, y) \in \mathcal{X} \times \mathcal{Y}$ to a feature vector $\Phi(x, y) \in \mathbb{R}^d$.</Paragraph>
      <Paragraph position="4"> A parameter vector $\bar{\alpha} \in \mathbb{R}^d$.</Paragraph>
      <Paragraph position="5"> The components $\mathbf{GEN}$, $\Phi$ and $\bar{\alpha}$ define a mapping from an input $x$ to an output $F(x)$ through $$F(x) = \arg\max_{y \in \mathbf{GEN}(x)} \Phi(x, y) \cdot \bar{\alpha} \qquad (1)$$</Paragraph>
      <Paragraph position="7"> where $\Phi(x, y) \cdot \bar{\alpha}$ is the inner product $\sum_s \alpha_s \Phi_s(x, y)$.</Paragraph>
      <Paragraph position="8"> The learning task is to set the parameter values $\bar{\alpha}$ using the training examples as evidence. The decoding algorithm is a method for searching for the arg max in Eq. 1. This framework is general enough to encompass several tasks in NLP. In this paper we are interested in parsing, where $(x_i, y_i)$, $\mathbf{GEN}$, and $\Phi$ can be defined as follows: Each training example $(x_i, y_i)$ is a pair where $x_i$ is a sentence, and $y_i$ is the gold-standard parse for that sentence.</Paragraph>
      <Paragraph position="9"> Given an input sentence $x$, $\mathbf{GEN}(x)$ is a set of possible parses for that sentence. For example, $\mathbf{GEN}(x)$ could be defined as the set of possible parses for $x$ under some context-free grammar, perhaps a context-free grammar induced from the training examples.</Paragraph>
      <Paragraph position="10"> The representation $\Phi(x, y)$ could track arbitrary features of parse trees. As one example, suppose that there are $m$ rules in a context-free grammar (CFG) that defines $\mathbf{GEN}(x)$. Then we could define the $i$'th component of the representation, $\Phi_i(x, y)$, to be the number of times the $i$'th context-free rule appears in the parse tree $(x, y)$. This is implicitly the representation used in probabilistic or weighted CFGs.</Paragraph>
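      <Paragraph> As a minimal sketch of the rule-count representation and of the arg max in Eq. 1, the following Python fragment scores an explicitly enumerated candidate set. The tuple encoding of trees, the two toy trees, the weights, and the names rule_counts, score and decode are all assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter

def rule_counts(tree):
    """Count CFG rules in a tree written as (label, child, ...); leaves are strings."""
    counts = Counter()
    if isinstance(tree, str):
        return counts
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        counts.update(rule_counts(child))
    return counts

def score(features, alpha):
    """Inner product of a sparse feature vector with sparse weights."""
    return sum(value * alpha.get(name, 0.0) for name, value in features.items())

def decode(candidates, alpha):
    """F(x): the highest-scoring member of an explicitly enumerated GEN(x)."""
    return max(candidates, key=lambda tree: score(rule_counts(tree), alpha))

# Two hypothetical candidate parses for the same two-word input.
t1 = ("S", ("NP", "Trash"), ("VP", "can"))
t2 = ("S", ("NP", "Trash", "can"))
alpha = {("S", ("NP", "VP")): 1.0, ("S", ("NP",)): -1.0}
best = decode([t1, t2], alpha)
```

      Enumerating the candidates explicitly is only workable for tiny examples; it is shown here purely to make the roles of GEN, the representation and the parameters concrete.</Paragraph>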
      <Paragraph position="11"> Note that the difficulty of finding the arg max in Eq. 1 is dependent on the interaction of $\mathbf{GEN}$ and $\Phi$. In many cases $\mathbf{GEN}(x)$ could grow exponentially with the size of $x$, making brute force enumeration of the members of $\mathbf{GEN}(x)$ intractable. For example, a context-free grammar could easily produce an exponentially growing number of analyses with sentence length. For some representations, such as the &quot;rule-based&quot; representation described above, the arg max in the set enumerated by the CFG can be found efficiently, using dynamic programming algorithms, without having to explicitly enumerate all members of $\mathbf{GEN}(x)$. However, in many cases we may be interested in representations which do not allow efficient dynamic programming solutions. One way around this problem is to adopt a two-pass approach, where $\mathbf{GEN}(x)$ is the top $N$ analyses under some initial model, as in the reranking approach of Collins (2000). In the current paper we explore alternatives to reranking approaches, namely heuristic methods for finding the arg max, specifically incremental beam-search strategies related to the parsers of Roark (2001a) and Ratnaparkhi (1999).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 The Perceptron Algorithm for Parameter Estimation
</SectionTitle>
      <Paragraph position="0"> We now consider the problem of setting the parameters, $\bar{\alpha}$, given training examples $(x_i, y_i)$. We will briefly review the perceptron algorithm and its convergence properties; see Collins (2002) for a full description. The algorithm and theorems are based on the approach to classification problems described in Freund and Schapire (1999).</Paragraph>
      <Paragraph position="1"> Figure 1 shows the algorithm. Note that the most complex step of the method is finding $z_i = \arg\max_{z \in \mathbf{GEN}(x_i)} \Phi(x_i, z) \cdot \bar{\alpha}$, and this is precisely the decoding problem. Thus the training algorithm is in principle a simple part of the parser: any system will need a decoding method, and once the decoding algorithm is implemented the training algorithm is relatively straightforward. We will now give a first theorem regarding the convergence of this algorithm. First, we need the following definition: Definition 1 Let $\overline{\mathbf{GEN}}(x_i) = \mathbf{GEN}(x_i) - \{y_i\}$. In other words $\overline{\mathbf{GEN}}(x_i)$ is the set of incorrect candidates for an example $x_i$. We will say that a training sequence $(x_i, y_i)$ for $i = 1 \ldots n$ is separable with margin $\delta &gt; 0$ if there exists some vector $U$ with $\|U\| = 1$ such that $$\forall i, \forall z \in \overline{\mathbf{GEN}}(x_i), \quad U \cdot \Phi(x_i, y_i) - U \cdot \Phi(x_i, z) \geq \delta \qquad (2)$$ ($\|U\|$ is the 2-norm of $U$, i.e., $\|U\| = \sqrt{\sum_s U_s^2}$.) Next, define $N_e$ to be the number of times an error is made by the algorithm in figure 1, that is, the number of times that $z_i \neq y_i$ for some $(t, i)$ pair. We can then state the following theorem (see (Collins, 2002) for a proof): Theorem 1 For any training sequence $(x_i, y_i)$ that is separable with margin $\delta$, for any value of $T$, then for the perceptron algorithm in figure 1 $$N_e \leq \frac{R^2}{\delta^2}$$ where $R$ is a constant such that $\forall i, \forall z \in \overline{\mathbf{GEN}}(x_i)$, $\|\Phi(x_i, y_i) - \Phi(x_i, z)\| \leq R$.</Paragraph>
      <Paragraph position="2"> This theorem implies that if there is a parameter vector $U$ which makes zero errors on the training set, then after a finite number of iterations the training algorithm will converge to parameter values with zero training error. A crucial point is that the number of mistakes is independent of the number of candidates for each example (i.e. the size of $\mathbf{GEN}(x_i)$ for each $i$), depending only on the separation of the training data, where separation is defined above. This is important because in many NLP problems $\mathbf{GEN}(x)$ can be exponential in the size of the inputs. All of the convergence and generalization results in Collins (2002) depend on notions of separability rather than the size of $\mathbf{GEN}$.</Paragraph>
      <Paragraph position="4"> Figure 1. Inputs: Training examples $(x_i, y_i)$. Algorithm: Initialization: Set $\bar{\alpha} = 0$. For $t = 1 \ldots T$, $i = 1 \ldots n$: Calculate $z_i = \arg\max_{z \in \mathbf{GEN}(x_i)} \Phi(x_i, z) \cdot \bar{\alpha}$. If $z_i \neq y_i$ then $\bar{\alpha} = \bar{\alpha} + \Phi(x_i, y_i) - \Phi(x_i, z_i)$. Output: Parameters $\bar{\alpha}$.</Paragraph>
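      <Paragraph> The training loop of figure 1 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the authors' implementation: candidates are enumerated explicitly by a gen function, features are sparse Counter objects, and the toy one-word tagging task is hypothetical.

```python
from collections import Counter

def dot(features, alpha):
    """Inner product of two sparse vectors stored as dict-like objects."""
    return sum(v * alpha.get(k, 0.0) for k, v in features.items())

def train_perceptron(examples, gen, phi, T):
    """examples: list of (x, y); gen(x): candidate list; phi(x, y): Counter."""
    alpha = Counter()
    errors = 0                                   # N_e in Theorem 1
    for _ in range(T):
        for x, y in examples:
            # Decoding step: highest-scoring candidate under current weights.
            z = max(gen(x), key=lambda c: dot(phi(x, c), alpha))
            if z != y:
                errors += 1
                alpha.update(phi(x, y))          # alpha += Phi(x, y)
                alpha.subtract(phi(x, z))        # alpha -= Phi(x, z)
    return alpha, errors

# Toy task (hypothetical): one word in, one tag out.
examples = [("dog", "N"), ("run", "V")]
gen = lambda x: ["N", "V"]
phi = lambda x, y: Counter({(x, y): 1})
alpha, errors = train_perceptron(examples, gen, phi, T=3)
```

      On this separable toy data the loop makes a single mistake in the first pass and none afterwards, which is the behaviour Theorem 1 bounds.</Paragraph>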
      <Paragraph position="5"> Two questions come to mind. First, are there guarantees for the algorithm if the training data is not separable? Second, performance on a training sample is all very well, but what does this guarantee about how well the algorithm generalizes to newly drawn test examples? Freund and Schapire (1999) discuss how the theory for classification problems can be extended to deal with both of these questions; Collins (2002) describes how these results apply to NLP problems.</Paragraph>
      <Paragraph position="6"> As a final note, following Collins (2002), we used the averaged parameters from the training algorithm in decoding test examples in our experiments. Say $\bar{\alpha}_i^t$ is the parameter vector after the $i$'th example is processed on the $t$'th pass through the data in the algorithm in figure 1. Then the averaged parameters $\bar{\alpha}_{AVG}$ are defined as $\bar{\alpha}_{AVG} = \sum_{i,t} \bar{\alpha}_i^t / NT$, where $N$ is the number of training examples (written $n$ above). Freund and Schapire (1999) originally proposed the averaged parameter method; it was shown to give substantial improvements in accuracy for tagging tasks in Collins (2002).</Paragraph>
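      <Paragraph> A direct, unoptimized reading of the averaging formula is sketched below in Python. It assumes that a copy of the parameter vector has been stored after every example on every pass; the function name average_parameters and the sample snapshots are hypothetical, and practical implementations usually avoid storing every snapshot.

```python
from collections import Counter

def average_parameters(snapshots):
    """snapshots: one Counter per (pass, example) step, N*T vectors in total."""
    total = Counter()
    for alpha in snapshots:
        total.update(alpha)          # element-wise sum, negative values included
    steps = len(snapshots)
    return {name: value / steps for name, value in total.items()}

# Four hypothetical snapshots, e.g. N = 2 examples and T = 2 passes.
snapshots = [
    Counter({"a": 1}),
    Counter({"a": 1, "b": -2}),
    Counter({"a": 4, "b": -2}),
    Counter(),
]
averaged = average_parameters(snapshots)
```
      </Paragraph>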
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 An Abstract Description of Incremental Parsing
</SectionTitle>
      <Paragraph position="0"> This section gives a description of the basic incremental parsing approach. The input to the parser is a sentence $x$ with length $n$. A hypothesis is a triple $\langle x, t, i \rangle$ such that $x$ is the sentence being parsed, $t$ is a partial or full analysis of that sentence, and $i$ is an integer specifying the number of words of the sentence which have been processed. Each full parse for a sentence will have the form $\langle x, t, n \rangle$. The initial state is $\langle x, \emptyset, 0 \rangle$ where $\emptyset$ is a &quot;null&quot; or empty analysis.</Paragraph>
      <Paragraph position="1"> We assume an &quot;advance&quot; function $\mathbf{ADV}$ which takes a hypothesis triple as input, and returns a set of new hypotheses as output. The advance function will absorb another word in the sentence: this means that if the input to $\mathbf{ADV}$ is $\langle x, t, i \rangle$, then each member of $\mathbf{ADV}(\langle x, t, i \rangle)$ will have the form $\langle x, t', i+1 \rangle$. Each new analysis $t'$ will be formed by somehow incorporating the $(i+1)$'th word into the previous analysis $t$.</Paragraph>
      <Paragraph position="2"> With these definitions in place, we can iteratively define the full set of partial analyses $\mathcal{H}_i$ for the first $i$ words of the sentence as $\mathcal{H}_0(x) = \{\langle x, \emptyset, 0 \rangle\}$, and $\mathcal{H}_i(x) = \bigcup_{h' \in \mathcal{H}_{i-1}(x)} \mathbf{ADV}(h')$ for $i = 1 \ldots n$. The full set of parses for a sentence $x$ is then $\mathbf{GEN}(x) = \mathcal{H}_n(x)$ where $n$ is the length of $x$.</Paragraph>
      <Paragraph position="3"> Under this definition $\mathbf{GEN}(x)$ can include a huge number of parses, and searching for the highest scoring parse, $\arg\max_{h \in \mathcal{H}_n(x)} \Phi(h) \cdot \bar{\alpha}$, will be intractable.</Paragraph>
      <Paragraph position="4"> For this reason we introduce one additional function, $\mathbf{FILTER}(\mathcal{H})$, which takes a set of hypotheses $\mathcal{H}$, and returns a much smaller set of &quot;filtered&quot; hypotheses. Typically, $\mathbf{FILTER}$ will calculate the score $\Phi(h) \cdot \bar{\alpha}$ for each $h \in \mathcal{H}$, and then eliminate partial analyses which have low scores under this criterion. For example, a simple version of $\mathbf{FILTER}$ would take the top $N$ highest scoring members of $\mathcal{H}$ for some constant $N$. We can then redefine the set of partial analyses as follows (we use $\mathcal{F}_i(x)$ to denote the set of filtered partial analyses for the first $i$ words of the sentence): $$\mathcal{F}_0(x) = \{\langle x, \emptyset, 0 \rangle\}, \qquad \mathcal{F}_i(x) = \mathbf{FILTER}\Big(\bigcup_{h' \in \mathcal{F}_{i-1}(x)} \mathbf{ADV}(h')\Big) \text{ for } i = 1 \ldots n$$</Paragraph>
      <Paragraph position="6"> The parsing algorithm returns $\arg\max_{h \in \mathcal{F}_n} \Phi(h) \cdot \bar{\alpha}$.</Paragraph>
      <Paragraph position="7"> Note that this is a heuristic, in that there is no guarantee that this procedure will find the highest scoring parse, $\arg\max_{h \in \mathcal{H}_n} \Phi(h) \cdot \bar{\alpha}$. Search errors, where $\arg\max_{h \in \mathcal{F}_n} \Phi(h) \cdot \bar{\alpha} \neq \arg\max_{h \in \mathcal{H}_n} \Phi(h) \cdot \bar{\alpha}$, will create errors in decoding test sentences, and also errors in implementing the perceptron training algorithm in Figure 1. In this paper we give empirical results that suggest that $\mathbf{FILTER}$ can be chosen in such a way as to give efficient parsing performance together with high parsing accuracy.</Paragraph>
      <Paragraph position="8"> The exact implementation of the parser will depend on the definition of partial analyses, of $\mathbf{ADV}$ and $\mathbf{FILTER}$, and of the representation $\Phi$. The next section describes our instantiation of these choices.</Paragraph>
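      <Paragraph> The abstract procedure of this section (advance every surviving hypothesis by one word, filter, repeat, then return the best survivor) can be sketched in Python as follows. The sketch uses the simple top-N version of FILTER mentioned above; the tag-sequence analyses, the WEIGHTS table and all function names are hypothetical stand-ins chosen so that a narrow beam produces a search error while a wider beam does not.

```python
def beam_parse(x, adv, filter_fn, score):
    """Return the best full hypothesis found; hypotheses are (x, t, i) triples."""
    frontier = [(x, (), 0)]                       # F_0(x): the empty analysis
    for _ in range(len(x)):
        expanded = [h2 for h in frontier for h2 in adv(h)]
        frontier = filter_fn(expanded, score)     # F_i(x) = FILTER(union of ADV(h'))
    return max(frontier, key=score)

def top_n_filter(n):
    """A simple FILTER: keep the n highest-scoring hypotheses."""
    def filter_fn(hypotheses, score):
        return sorted(hypotheses, key=score, reverse=True)[:n]
    return filter_fn

# Toy model (hypothetical): an analysis is a tuple of tags, one per word.
WEIGHTS = {("a", "N"): 1.0, ("a", "V"): 0.9, ("V", "V"): 5.0}

def score(h):
    x, tags, _ = h
    emit = sum(WEIGHTS.get(pair, 0.0) for pair in zip(x, tags))
    trans = sum(WEIGHTS.get(pair, 0.0) for pair in zip(tags, tags[1:]))
    return emit + trans

adv = lambda h: [(h[0], h[1] + (tag,), h[2] + 1) for tag in ("N", "V")]
narrow = beam_parse(("a", "b"), adv, top_n_filter(1), score)
wide = beam_parse(("a", "b"), adv, top_n_filter(4), score)
```

      With a beam of one, the locally better first tag is kept and the globally best analysis is lost; with a beam of four nothing is pruned and the arg max over the full set is recovered.</Paragraph>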
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="88" type="metho">
    <SectionTitle>
3 A full description of the parsing approach
</SectionTitle>
    <Paragraph position="0"> The parser is an incremental beam-search parser very similar to the sort described in Roark (2001a; 2004), with some changes in the search strategy to accommodate the perceptron feature weights. We first describe the parsing algorithm, and then move on to the baseline feature set for the perceptron model.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Parser control
</SectionTitle>
      <Paragraph position="0"> The input to the parser is a string $w_0^n$, a grammar $G$, a mapping $\Phi$ from derivations to feature vectors, and a parameter vector $\bar{\alpha}$. The grammar $G = (V, T, S^\dagger, \bar{S}, C, B)$ consists of a set of non-terminal symbols $V$, a set of terminal symbols $T$, a start symbol $S^\dagger \in V$, an end-of-constituent symbol $\bar{S} \in V$, a set of &quot;allowable chains&quot; $C$, and a set of &quot;allowable triples&quot; $B$. $\bar{S}$ is a special empty non-terminal that marks the end of a constituent. Each chain is a sequence of non-terminals followed by a terminal symbol, for example $\langle S^\dagger \to S \to NP \to NN \to \text{Trash} \rangle$. Each &quot;allowable triple&quot; is a tuple $\langle X, Y, Z \rangle$ where $X, Y, Z \in V$. The triples specify which non-terminals $Z$ are allowed to follow a non-terminal $Y$ under a parent $X$. For example, the triple $\langle S, NP, VP \rangle$ specifies that a VP can follow an NP under an S. The triple $\langle NP, NN, \bar{S} \rangle$ would specify that the $\bar{S}$ symbol can follow an NN under an NP, i.e., that the symbol NN is allowed to be the final child of a rule with parent NP.</Paragraph>
      <Paragraph position="2"> [Figure 2; only part of the caption survives: &quot;... lines represent potential attachments&quot;.] The initial state of the parser is the input string alone, $w_0^n$. In absorbing the first word, we add all chains of the form $S^\dagger \ldots \to w_0$. For example, in figure 2 the chain $\langle S^\dagger \to S \to NP \to NN \to \text{Trash} \rangle$ is used to construct an analysis for the first word alone. Other chains which start with $S^\dagger$ and end with Trash would give competing analyses for the first word of the string.</Paragraph>
      <Paragraph position="3"> Figure 2 shows an example of how the next word in a sentence can be incorporated into a partial analysis for the previous words. For any partial analysis there will be a set of potential attachment sites: in the example, the attachment sites are under the NP or the S. There will also be a set of possible chains terminating in the next word - there are three in the example. Each chain could potentially be attached at each attachment site, giving 6 ways of incorporating the next word in the example.</Paragraph>
      <Paragraph position="4"> For illustration, assume that the set $B$ is $\{\langle S, NP, VP \rangle, \langle NP, NN, NN \rangle, \langle NP, NN, \bar{S} \rangle\}$. Then some of the 6 possible attachments may be disallowed because they create triples that are not in the set $B$. For example, in figure 2 attaching either of the VP chains under the NP is disallowed because the triple $\langle NP, NN, VP \rangle$ is not in $B$. Similarly, attaching the NN chain under the S will be disallowed if the triple $\langle S, NP, NN \rangle$ is not in $B$. In contrast, adjoining $\langle NN \to \text{can} \rangle$ under the NP creates a single triple, $\langle NP, NN, NN \rangle$, which is allowed. Adjoining either of the VP chains under the S creates two triples, $\langle S, NP, VP \rangle$ and $\langle NP, NN, \bar{S} \rangle$, which are both in the set $B$.</Paragraph>
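      <Paragraph> The triple check in this example can be sketched in Python as below. The sketch assumes a simplified encoding in which the open constituents of the partial analysis are listed deepest first as (parent, last child) pairs; the internal structure of the two VP chains is hypothetical (the text only says that there is one NN chain and two VP chains), and the string S-bar stands in for the end-of-constituent symbol.

```python
END = "S-bar"  # stands in for the end-of-constituent symbol

B = {("S", "NP", "VP"), ("NP", "NN", "NN"), ("NP", "NN", END)}

# Open constituents of the analysis for "Trash", deepest first:
# (parent label, label of its last child so far).
SITES = [("NP", "NN"), ("S", "NP")]

# Hypothetical chains for the next word, written top-down and ending in the word.
CHAINS = [("NN", "can"), ("VP", "MD", "can"), ("VP", "VP", "MD", "can")]

def triples_created(sites, site_index, chain):
    """Triples formed by attaching `chain` under sites[site_index]."""
    # Every deeper open constituent is closed off with the END symbol.
    closed = [(parent, last, END) for parent, last in sites[:site_index]]
    parent, last = sites[site_index]
    return closed + [(parent, last, chain[0])]

def allowed(sites, site_index, chain, triples):
    return all(t in triples for t in triples_created(sites, site_index, chain))

legal = [(SITES[k][0], chain)
         for k in range(len(SITES))
         for chain in CHAINS
         if allowed(SITES, k, chain, B)]
```

      Under these assumptions three of the six candidate attachments survive: the NN chain under the NP and both VP chains under the S, matching the cases discussed above.</Paragraph>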
      <Paragraph position="5"> Note that the &quot;allowable chains&quot; in our grammar are what Costa et al. (2001) call &quot;connection paths&quot; from the partial parse to the next word. It can be shown that the method is equivalent to parsing with a transformed context-free grammar (a first-order &quot;Markov&quot; grammar); for brevity we omit the details here.</Paragraph>
      <Paragraph position="6"> [Table 1; only part of the caption survives: &quot;... sections of the Wall St. Journal Treebank, and out-of-vocabulary (OOV) rate on the held-out corpus.&quot;]</Paragraph>
      <Paragraph position="7"> In this way, given a set of candidates $\mathcal{F}_i(x)$ for the first $i$ words of the string, we can generate a set of candidates for the first $i + 1$ words, $\bigcup_{h' \in \mathcal{F}_i(x)} \mathbf{ADV}(h')$, where the $\mathbf{ADV}$ function uses the grammar as described above. We then calculate $\Phi(h) \cdot \bar{\alpha}$ for all of these partial hypotheses, and rank the set from best to worst. A $\mathbf{FILTER}$ function is then applied to this ranked set to give $\mathcal{F}_{i+1}$. Let $h_k$ be the $k$th ranked hypothesis in $\mathcal{H}_{i+1}(x)$. Then $h_k \in \mathcal{F}_{i+1}$ if and only if $\Phi(h_k) \cdot \bar{\alpha} \geq \theta_k$. In our case, we parameterize the calculation of $\theta_k$ with $\gamma$ as follows:</Paragraph>
      <Paragraph position="9"> The problem with using left-child chains is limiting them in number. With a left-recursive grammar, of course, the set of all possible left-child chains is infinite. We use two techniques to reduce the number of left-child chains: first, we remove some (but not all) of the recursion from the grammar through a tree transform; next, we limit the left-child chains consisting of more than two non-terminal categories to those actually observed in the training data more than once. Left-child chains of length less than or equal to two are all those observed in training data. As a practical matter, the set of left-child chains for a terminal x is taken to be the union of the sets of left-child chains for all pre-terminal part-of-speech (POS) tags T for x.</Paragraph>
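      <Paragraph> The frequency-based restriction on left-child chains, and the union over POS tags, can be sketched in Python as follows. The sketch assumes a chain is stored as a tuple of non-terminal labels paired with the POS tag it leads to, and that chain length is the number of labels in that tuple; the function names and the toy observations are hypothetical.

```python
from collections import Counter

def allowable_chains(observed, min_count_long=2):
    """observed: iterable of (pos_tag, chain) pairs seen in training trees.

    Chains with at most two non-terminals are kept if seen at all; longer
    chains must be seen at least `min_count_long` times (more than once).
    """
    counts = Counter(observed)
    kept = {}
    for (tag, chain), n in counts.items():
        if len(chain) > 2 and min_count_long > n:
            continue
        kept.setdefault(tag, set()).add(chain)
    return kept

def chains_for_word(tags_for_word, kept):
    """Union of the chain sets of every POS tag the word can take."""
    result = set()
    for tag in tags_for_word:
        result |= kept.get(tag, set())
    return result

observed = [
    ("NN", ("S", "NP")),
    ("NN", ("S", "NP", "NP")),        # long chain seen once: dropped
    ("MD", ("VP",)),
    ("MD", ("S", "VP", "VP")),        # long chain seen twice: kept
    ("MD", ("S", "VP", "VP")),
]
kept = allowable_chains(observed)
word_chains = chains_for_word(["NN", "MD"], kept)
```
      </Paragraph>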
      <Paragraph position="10"> Before inducing the left-child chains and allowable triples from the treebank, the trees are transformed with a selective left-corner transformation (Johnson and Roark, 2000) that has been flattened as presented in Roark (2001b). This transform is only applied to left-recursive productions, i.e. productions of the form $A \to A\,\gamma$.</Paragraph>
      <Paragraph position="11"> The transformed trees look as in figure 3. The transform has the benefit of dramatically reducing the number of left-child chains, without unduly disrupting the immediate dominance relationships that provide features for the model. The parse trees that are returned by the parser are then de-transformed to the original form of the grammar for evaluation.</Paragraph>
      <Paragraph position="12"> Table 1 presents the number of left-child chains of length greater than 2 in sections 2-21 and 24 of the Penn Wall St. Journal Treebank, both with and without the flattened selective left-corner transformation (FSLC), for gold-standard part-of-speech (POS) tags and automatically tagged POS tags. When the FSLC has been applied and the set is restricted to those occurring more than once in the training corpus, we can reduce the total number of left-child chains of length greater than 2 by half, while holding the number of words in the held-out corpus with an unobserved left-child chain (the out-of-vocabulary rate, OOV) to just one in every thousand words.</Paragraph>
      <Paragraph position="14"> [Figure 3; only part of the caption survives: &quot;... representation; and (c) a flat structure that is unambiguously equivalent to (b)&quot;.]</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Features
</SectionTitle>
      <Paragraph position="0"> For this paper, we wanted to compare the results of a perceptron model with a generative model for a comparable feature set. Unlike in Roark (2001a; 2004), there is no look-ahead statistic, so we modified the feature set from those papers to explicitly include the lexical item and POS tag of the next word. Otherwise the features are basically the same as in those papers. We then built a generative model with this feature set and the same tree transform, for use with the beam-search parser from Roark (2004) to compare against our baseline perceptron model.</Paragraph>
      <Paragraph position="1"> To concisely present the baseline feature set, let us establish a notation. Features will fire whenever a new node is built in the tree. The features are labels from the left-context, i.e. the already built part of the tree. All of the labels that we will include in our feature sets are $i$ levels above the current node in the tree, and $j$ nodes to the left, which we will denote $L_{ij}$. Hence, $L_{00}$ is the node label itself; $L_{10}$ is the label of the parent of the current node; $L_{01}$ is the label of the sibling of the node, immediately to its left; $L_{11}$ is the label of the sibling of the parent node, etc. We also include: the lexical head of the current constituent (CL); the c-commanding lexical head (CC) and its POS (CCP); and the look-ahead word (LK) and its POS (LKP). All of these features are discussed at more length in the citations above. Table 2 presents the baseline feature set.</Paragraph>
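      <Paragraph> The left-context notation can be made concrete with a small Python sketch. The Node class, the function left_context_label and the example tree are hypothetical; the function simply walks a given number of levels up and then a given number of siblings to the left, returning None when no such node exists.

```python
class Node:
    """A tree node that registers itself as the rightmost child of its parent."""
    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

def left_context_label(node, i, j):
    """Label i levels above `node` and j siblings to the left, or None."""
    target = node
    for _ in range(i):
        if target.parent is None:
            return None
        target = target.parent
    if j == 0:
        return target.label
    if target.parent is None:
        return None
    siblings = target.parent.children
    index = siblings.index(target) - j
    return siblings[index].label if index >= 0 else None

# Partial tree (hypothetical): S dominates NP and VP; VP dominates MD and VB.
s = Node("S")
np_ = Node("NP", s)
vp = Node("VP", s)
md = Node("MD", vp)
vb = Node("VB", vp)          # the node currently being built

labels = {(i, j): left_context_label(vb, i, j) for i in range(3) for j in range(2)}
```
      </Paragraph>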
      <Paragraph position="2"> In addition to the baseline feature set, we will also present results using features that would be more difficult to embed in a generative model. We included some punctuation-oriented features, which included (i) a Boolean feature indicating whether the final punctuation is a question mark or not; (ii) the POS label of the word after the current look-ahead, if the current look-ahead is punctuation or a coordinating conjunction; and (iii) a Boolean feature indicating whether the look-ahead is punctuation or not, that fires when the category immediately to the left of the current position is immediately preceded by punctuation.</Paragraph>
      <Paragraph position="3"> 4 Refinements to the Training Algorithm. This section describes two modifications to the &quot;basic&quot; training algorithm in figure 1.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Making Repeated Use of Hypotheses
</SectionTitle>
      <Paragraph position="0"> Figure 4 shows a modified algorithm for parameter estimation. The input to the function is a gold standard parse, together with a set of candidates F generated by the incremental parser. There are two steps. First, the model is updated as usual with the current example, which is then added to a cache of examples. Second, the method repeatedly iterates over the cache, updating the model at each cached example if the gold standard parse is not the best scoring parse from among the stored candidates for that example. In our experiments, the cache was restricted to contain the parses from up to N previously processed sentences, where N was set to be the size of the training set.</Paragraph>
      <Paragraph position="1"> The motivation for these changes is primarily efficiency. One way to think about the algorithms in this paper is as methods for finding parameter values that satisfy a set of linear constraints, one constraint for each incorrect parse in training data. The incremental parser is a method for dynamically generating constraints (i.e. incorrect parses) which are violated, or close to being violated, under the current parameter settings. The basic algorithm in Figure 1 is extremely wasteful with the generated constraints, in that it only looks at one constraint on each sentence (the arg max), and it ignores constraints implied by previously parsed sentences. This is inefficient because the generation of constraints (i.e., parsing an input sentence) is computationally quite demanding.</Paragraph>
      <Paragraph position="3"> Figure 4. Input: A gold-standard parse $g$ for sentence $k$ of $N$. A set of candidate parses $\mathcal{F}$. Current parameters $\bar{\alpha}$. A Cache of triples $\langle g_j, \mathcal{F}_j, c_j \rangle$ for $j = 1 \ldots N$ where each $g_j$ is a previously generated gold standard parse, $\mathcal{F}_j$ is a previously generated set of candidate parses, and $c_j$ is a counter of the number of times that $\bar{\alpha}$ has been updated due to this particular triple. Parameters $T_1$ and $T_2$ controlling the number of iterations below. In our experiments, $T_1 = 5$ and $T_2 = 50$. Initialize the Cache to include, for $j = 1 \ldots N$, $\langle g_j, \emptyset, T_2 \rangle$. Step 1: Calculate $z = \arg\max_{t \in \mathcal{F}} \Phi(t) \cdot \bar{\alpha}$. If $z \neq g$ then $\bar{\alpha} = \bar{\alpha} + \Phi(g) - \Phi(z)$. Set the $k$th triple in the Cache to $\langle g, \mathcal{F}, 0 \rangle$. Step 2: For $t = 1 \ldots T_1$, $j = 1 \ldots N$: if $c_j &lt; T_2$ then calculate $z = \arg\max_{t \in \mathcal{F}_j} \Phi(t) \cdot \bar{\alpha}$; if $z \neq g_j$ then $\bar{\alpha} = \bar{\alpha} + \Phi(g_j) - \Phi(z)$ and $c_j = c_j + 1$.</Paragraph>
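      <Paragraph> A compact Python sketch of the two-step estimation method of figure 4 is given below, assuming sparse Counter feature vectors and a cache held as a list of mutable triples. The function names, the one-feature-per-parse representation and the toy call are hypothetical; the toy call uses a gold parse that is absent from its candidate set, so that the role of the counter limit T2 is visible.

```python
from collections import Counter

def dot(features, alpha):
    return sum(v * alpha.get(k, 0.0) for k, v in features.items())

def update_once(gold, candidates, phi, alpha):
    """One perceptron step; returns True if alpha was changed."""
    z = max(candidates, key=lambda t: dot(phi(t), alpha))
    if z == gold:
        return False
    alpha.update(phi(gold))          # alpha += Phi(g)
    alpha.subtract(phi(z))           # alpha -= Phi(z)
    return True

def estimate_with_cache(k, gold, candidates, phi, alpha, cache, T1=5, T2=50):
    """cache: list of [g_j, F_j, c_j]; unseen entries start as [g_j, [], T2]."""
    # Step 1: usual update on the current sentence, then cache it with c = 0.
    update_once(gold, candidates, phi, alpha)
    cache[k] = [gold, candidates, 0]
    # Step 2: revisit cached sentences whose update budget is not exhausted.
    for _ in range(T1):
        for entry in cache:
            g_j, F_j, c_j = entry
            if T2 > c_j and update_once(g_j, F_j, phi, alpha):
                entry[2] += 1

phi = lambda t: Counter({t: 1})      # hypothetical: one indicator per parse
alpha = Counter()
cache = [[None, [], 3]]              # N = 1, initialised with c = T2
estimate_with_cache(0, "g", ["h"], phi, alpha, cache, T1=5, T2=3)
```

      Because the gold parse never appears among the candidates in this toy call, every visit triggers an update until the counter reaches T2, after which the entry is skipped.</Paragraph>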
      <Paragraph position="4"> More formally, it can be shown that the algorithm in figure 4 also has the upper bound in theorem 1 on the number of parameter updates performed. If the cost of steps 1 and 2 of the method is negligible compared to the cost of parsing a sentence, then the refined algorithm will certainly converge no more slowly than the basic algorithm, and may well converge more quickly.</Paragraph>
      <Paragraph position="5"> As a final note, we used the parameters T1 and T2 to limit the number of passes over examples, the aim being to prevent repeated updates based on outlier examples which are not separable.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="88" type="sub_section">
      <SectionTitle>
4.2 Early Update During Training
</SectionTitle>
      <Paragraph position="0"> As before, define $y_i$ to be the gold standard parse for the $i$'th sentence, and also define $y_i^j$ to be the partial analysis under the gold-standard parse for the first $j$ words of the $i$'th sentence. Then if $y_i^j \notin \mathcal{F}_j(x_i)$ a search error has been made, and there is no possibility of the gold standard parse $y_i$ being in the final set of parses, $\mathcal{F}_n(x_i)$. We call the following modification to the parsing algorithm during training &quot;early update&quot;: if $y_i^j \notin \mathcal{F}_j(x_i)$, exit the parsing process, pass $y_i^j$, $\mathcal{F}_j(x_i)$ to the parameter estimation method, and move on to the next string in the training set. Intuitively, the motivation behind this is clear. It makes sense to make a correction to the parameter values at the point that a search error has been made, rather than allowing the parser to continue to the end of the sentence.</Paragraph>
      <Paragraph position="1"> This is likely to lead to less noisy input to the parameter estimation algorithm; and early update will also improve efficiency, as at the early stages of training the parser will frequently give up after a small proportion of each sentence is processed. It is more difficult to justify from a formal point of view; we leave this to future work.</Paragraph>
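      <Paragraph> Early update amounts to one extra check inside the beam-search loop, as the following Python sketch illustrates. It assumes hypotheses are (x, t, i) triples, that the gold partial analyses are available as a list indexed by the number of words processed, and that analyses can be compared for equality; the function names and the toy tag-sequence model are hypothetical.

```python
def parse_for_training(x, gold_prefixes, adv, filter_fn, score):
    """gold_prefixes[j] is the gold partial analysis for the first j words.

    Returns (gold_analysis, candidates, j): the pair to hand to the parameter
    estimation method, and the number of words processed when parsing stopped.
    """
    frontier = [(x, gold_prefixes[0], 0)]
    for j in range(1, len(x) + 1):
        expanded = [h2 for h in frontier for h2 in adv(h)]
        frontier = filter_fn(expanded, score)
        if all(t != gold_prefixes[j] for (_, t, _) in frontier):
            # Search error: the gold prefix fell off the beam, so stop here.
            return gold_prefixes[j], frontier, j
    return gold_prefixes[len(x)], frontier, len(x)

# Toy model (hypothetical): an analysis is a tuple of tags, one per word.
WEIGHTS = {("a", "N"): 1.0}
score = lambda h: sum(WEIGHTS.get(pair, 0.0) for pair in zip(h[0], h[1]))
adv = lambda h: [(h[0], h[1] + (tag,), h[2] + 1) for tag in ("N", "V")]
top_n = lambda n: (lambda hyps, sc: sorted(hyps, key=sc, reverse=True)[:n])

x = ("a", "b", "c")
gold = [(), ("V",), ("V", "V"), ("V", "V", "V")]
_, _, stopped_narrow = parse_for_training(x, gold, adv, top_n(1), score)
_, _, stopped_wide = parse_for_training(x, gold, adv, top_n(8), score)
```

      With a beam of one the gold prefix is pruned after the first word and parsing stops there; with a beam wide enough to keep every hypothesis the parser reaches the end of the sentence.</Paragraph>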
      <Paragraph position="2"> Figure 5 shows the convergence of the training algorithm with neither of the two refinements presented; with just early update; and with both. Early update makes an enormous difference in the quality of the resulting model; repeated use of examples gives a small improvement, mainly in recall.</Paragraph>
      <Paragraph position="3"> [Figure 5: Performance on development data (section 24) after each pass over the training data, with and without repeated use of examples and early update. Horizontal axis: number of passes over training data. Vertical axis: F-measure parsing accuracy. Curves: no early update, no repeated use of examples; early update, no repeated use of examples; early update, repeated use of examples.]</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="88" end_page="88" type="metho">
    <SectionTitle>
5 Empirical results
</SectionTitle>
    <Paragraph position="0"> The parsing models were trained and tested on data from the Penn Wall St. Journal Treebank: sections 2-21 were used as training data; section 24 was held-out development data; and section 23 was for evaluation. After each pass over the training data, the averaged perceptron model was scored on the development data, and the best performing model was used for test evaluation. For this paper, we used POS tags that were provided either by the Treebank itself (gold standard tags) or by the perceptron POS tagger presented in Collins (2002) (see footnote 3). The former gives us an upper bound on the improvement that we might expect if we integrated the POS tagging with the parsing.</Paragraph>
    <Paragraph position="1"> Footnote 3: For trials when the generative or perceptron parser was given POS tagger output, the models were trained on POS tagged sections 2-21, which in both cases helped performance slightly.</Paragraph>
    <Paragraph position="2"> Table 3 shows results on section 23, when either gold-standard or POS-tagger tags are provided to the parser. With the base features, the generative model outperforms the perceptron parser by between a half and one point, but with the additional punctuation features, the perceptron model matches the generative model performance. Of course, using the generative model and using the perceptron algorithm are not necessarily mutually exclusive. Another training scenario would be to include the generative model score as another feature, with some weight in the linear model learned by the perceptron algorithm.</Paragraph>
    <Paragraph position="3"> This sort of scenario was used in Roark et al. (2004) for training an n-gram language model using the perceptron algorithm. We follow that paper in fixing the weight of the generative model, rather than learning the weight along with the weights of the other perceptron features. The value of the weight was empirically optimized on the held-out set by performing trials with several values. Our optimal value was 10.</Paragraph>
    <Paragraph position="4"> In order to train this model, we had to provide generative model scores for strings in the training set. Of course, to be similar to the testing conditions, we cannot use the standard generative model trained on every sentence, since then the generative score would be from a model that had already seen that string in the training data. To control for this, we built ten generative models, each trained on 90 percent of the training data, and used each of the ten to score the remaining 10 percent that was not seen in that training set. For the held-out and testing conditions, we used the generative model trained on all of sections 2-21.</Paragraph>
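      <Paragraph> The ten-way scheme for obtaining generative scores on the training set can be sketched in Python as follows. The paper does not say how the data were partitioned, so the round-robin split used here is an assumption, as are the function name jackknife_scores and the stand-in train and score callables.

```python
def jackknife_scores(sentences, train, score, folds=10):
    """Score each sentence with a model trained on the other folds only.

    train(list_of_sentences) returns a model; score(model, sentence) returns
    a number. Fold membership is assigned round-robin by position.
    """
    scores = [None] * len(sentences)
    for f in range(folds):
        held_out = [i for i in range(len(sentences)) if i % folds == f]
        training = [s for i, s in enumerate(sentences) if i % folds != f]
        model = train(training)
        for i in held_out:
            scores[i] = score(model, sentences[i])
    return scores

# Stand-in "model" (hypothetical): remembers exactly which sentences it saw.
train = lambda data: set(data)
score = lambda model, s: 1.0 if s in model else 0.0

sentences = ["sentence %d" % i for i in range(20)]
held_out_scores = jackknife_scores(sentences, train, score)
seen_scores = [score(train(sentences), s) for s in sentences]
```

      The stand-in model makes the point of the procedure visible: every jackknifed score comes from a model that never saw the sentence, whereas a single model trained on everything would score sentences it has already memorized.</Paragraph>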
    <Paragraph position="5"> In table 4 we present the results of including the generative model score along with the other perceptron features, just for the run with POS-tagger tags. The generative model score (negative log probability) effectively provides a much better initial starting point for the perceptron algorithm. The resulting F-measure on section 23 is 2.1 percent higher than either the generative model or perceptron-trained model used in isolation.</Paragraph>
  </Section>
</Paper>