<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1657">
  <Title>Markov Chains and Author Unmasking: An Investigation</Title>
  <Section position="5" start_page="483" end_page="484" type="metho">
    <SectionTitle>
3 Sequence Kernel Based Approaches
</SectionTitle>
    <Paragraph position="0"> Kernel based techniques, such as SVMs, allow the comparison of, and discrimination between, vectorial as well as non-vectorial objects. In a binary SVM, the opinion on whether object X belongs to class -1 or +1 is given by:</Paragraph>
    <Paragraph position="2"> where k(XA,XB) is a symmetric kernel function which reflects the degree of similarity between [Footnote 3: Personal correspondence with the authors of (Chen and Goodman, 1999).]</Paragraph>
    <Paragraph position="3"> objects XA and XB, while S = (s_j)_{j=1}^{|S|} is a set of support objects with corresponding class labels (y_j ∈ {-1,+1})_{j=1}^{|S|} and weights L = (l_j)_{j=1}^{|S|}. The kernel function, the bias b, as well as the sets S and L, define a hyperplane which separates the +1 and -1 classes.</Paragraph>
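The decision rule above can be sketched directly in code. The toy kernel, support objects, labels, weights and bias below are illustrative assumptions, not values from the paper; only the structure of the rule (sign of a weighted sum of kernel evaluations plus a bias) follows the text.

```python
# Sketch of the kernel-based decision rule: the class of object X is given by
# sign( sum_j l_j * y_j * k(X, s_j) + b ) over a set of support objects s_j.
# All concrete values here are illustrative, not from the paper.

def decide(x, support, labels, weights, bias, kernel):
    """Return +1 or -1 for object x, given support objects and a kernel."""
    score = sum(l * y * kernel(x, s)
                for s, y, l in zip(support, labels, weights)) + bias
    return 1 if score >= 0 else -1

# Toy symmetric kernel on strings: number of shared characters (illustrative).
def toy_kernel(a, b):
    return sum(min(a.count(c), b.count(c)) for c in set(a))

support = ["aab", "bcc"]
labels  = [+1, -1]
weights = [1.0, 1.0]
print(decide("aaa", support, labels, weights, bias=0.0, kernel=toy_kernel))
```

Note that the kernel must be symmetric for the hyperplane interpretation to hold; any of the sequence kernels discussed below can be plugged in for `kernel`.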
    <Paragraph position="4"> Given a training dataset, quadratic programming based optimisation is used to maximise the separation margin4 (Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004).</Paragraph>
    <Paragraph position="5"> Recently, kernels for measuring the similarity of texts based on sequences of characters and words have been proposed (Cancedda et al., 2003; Leslie et al., 2004; Vishwanathan and Smola, 2003). One kernel belonging to this family is:</Paragraph>
    <Paragraph position="7"> where Q* represents all possible sequences, in XA and XB, of the symbols in Q. In turn, Q is a set of possible symbols, which can be characters, e.g. Q = { 'a', 'b', 'c', ... }, or words, e.g. Q = { 'kangaroo', 'koala', 'platypus', ... }.</Paragraph>
    <Paragraph position="8"> Furthermore, C(q|X) is the number of occurrences of sequence q in X, and wq is the weight for sequence q. If the sequences are restricted to have only one item, Eqn. (4) for the case of words is in effect a bag-of-words kernel (Cancedda et al., 2003; Shawe-Taylor and Cristianini, 2004).</Paragraph>
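The kernel in Eqn. (4) can be sketched as below. The sketch restricts itself to contiguous character sequences up to a maximum length (the cited works also cover non-contiguous sequences); the function names and the toy inputs are illustrative assumptions.

```python
# Sketch of the sequence kernel of Eqn. (4):
#   k(XA, XB) = sum over sequences q of  w_q * C(q|XA) * C(q|XB)
# restricted here to contiguous character sequences of length 1..max_len.
from collections import Counter

def seq_counts(text, max_len):
    """C(q|X): count every contiguous sequence of length 1..max_len in text."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def seq_kernel(xa, xb, weight, max_len):
    """weight(length) supplies w_q; only sequences in both texts contribute."""
    ca, cb = seq_counts(xa, max_len), seq_counts(xb, max_len)
    return sum(weight(len(q)) * ca[q] * cb[q] for q in ca if q in cb)

# With one-symbol sequences and unit weights this reduces to a bag-of-symbols
# kernel (over word tokens: the bag-of-words kernel mentioned in the text).
print(seq_kernel("abab", "abba", weight=lambda n: 1.0, max_len=2))
```

Tokenising into words instead of characters and setting `max_len=1` gives the bag-of-words special case described above.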
    <Paragraph position="9"> In this work we have utilised weights that were dependent only on the length of each sequence, i.e. w_q = w_|q|. By default w_|q| = 0, modified by one of the following functions:
specific length: w_|q| = 1, if |q| = t
bounded range: w_|q| = 1, if |q| ∈ [1,t]
bounded linear decay: w_|q| = 1 + (1 - |q|)/t, if |q| ∈ [1,t]
bounded linear growth: w_|q| = |q| / t, if |q| ∈ [1,t]
where t indicates a user-defined maximum sequence length.</Paragraph>
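The four weight functions can be written out as follows. Note one hedge: the bounded linear decay formula is garbled in the extracted text, and is reconstructed here as w_|q| = 1 + (1 - |q|)/t (decaying from 1 at |q| = 1 down to 1/t at |q| = t), so treat that line as a best-effort reading.

```python
# The four length-dependent weight functions, with w_|q| = 0 outside the
# stated ranges; t is the user-defined maximum sequence length.
# The decay formula is a reconstruction of garbled source text.

def specific_length(n, t):
    return 1.0 if n == t else 0.0

def bounded_range(n, t):
    return 1.0 if 1 <= n <= t else 0.0

def bounded_linear_decay(n, t):
    # decays linearly from 1 at n = 1 down to 1/t at n = t
    return 1.0 + (1.0 - n) / t if 1 <= n <= t else 0.0

def bounded_linear_growth(n, t):
    # grows linearly from 1/t at n = 1 up to 1 at n = t
    return n / t if 1 <= n <= t else 0.0

for f in (specific_length, bounded_range,
          bounded_linear_decay, bounded_linear_growth):
    print(f.__name__, [round(f(n, 4), 2) for n in range(1, 6)])
```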
    <Paragraph position="10"> To allow comparison of texts with different lengths, a normalised version (Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004) of the kernel can be used:</Paragraph>
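The normalised kernel referred to here is the standard cosine-style normalisation, k_norm(XA, XB) = k(XA, XB) / sqrt(k(XA, XA) k(XB, XB)); a minimal sketch, with an assumed bag-of-characters base kernel for illustration:

```python
# Normalised kernel: k_norm(XA, XB) = k(XA, XB) / sqrt(k(XA, XA) * k(XB, XB)).
# This removes the dependence on text length; bag_kernel is an illustrative
# base kernel (a simple bag-of-characters dot product), not the paper's.
from collections import Counter
import math

def bag_kernel(a, b):
    ca, cb = Counter(a), Counter(b)
    return sum(ca[ch] * cb[ch] for ch in ca)

def normalised_kernel(xa, xb, kernel=bag_kernel):
    denom = math.sqrt(kernel(xa, xa) * kernel(xb, xb))
    return kernel(xa, xb) / denom if denom > 0 else 0.0

# Two texts that differ only in length get a normalised similarity of 1.
print(normalised_kernel("aa", "aaaa"))
```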
    <Paragraph position="12"> It has been suggested that SVM discrimination based on character sequence kernels in effect utilises a noisy version of stemming (Cancedda et al., 2003). As such, word sequence kernels could be more effective than character sequence kernels, since proper word stems, instead of full words, can be explicitly used. However, it must be noted that Eqn. (4) implicitly maps texts to a feature space which has one dimension for each of the possible sequences comprised of the symbols from Q (Cancedda et al., 2003). When using words, the number of unique symbols (i.e. |Q|) can be much greater than when using characters (e.g. 10,000 vs 100); furthermore, for a given text the number of words is always smaller than the number of characters. For a given sequence length, these observations indicate that for word sequence kernels the implicit feature space representation can have considerably higher dimensionality and be sparser than for character sequence kernels, which could lead to poorer generalisation of the resulting classifier.</Paragraph>
  </Section>
  <Section position="6" start_page="484" end_page="487" type="metho">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="484" end_page="484" type="sub_section">
      <SectionTitle>
4.1 "Columnists" Dataset
</SectionTitle>
      <Paragraph position="0"> We have compiled a dataset that is comprised of texts from 50 newspaper journalists, with a minimum of 10,000 words per journalist. Journalists were selected based on their coverage of several topics; any journalist who covered only one specific area (e.g. sports or economics) was not included in the dataset. Apart from removing all advertising material and standardising the representation by converting any unicode characters to their closest ASCII counterparts, no further editing was performed. The dataset is available for use by other researchers by contacting the authors.</Paragraph>
    </Section>
    <Section position="2" start_page="484" end_page="485" type="sub_section">
      <SectionTitle>
4.2 Setup
</SectionTitle>
      <Paragraph position="0"> The experiments followed a verification setup, where a given text material was classified as either having been written by a hypothesised author or as not written by that author (i.e. a two class discrimination task). This is distinct from a closed set identification setup, where a text is assigned as belonging to one author out of a pool of authors.</Paragraph>
      <Paragraph position="1"> The presentation of an impostor text (a text known not to be written by the hypothesised author) will be referred to as an impostor claim, while the presentation of a true text (a text known to be written by the hypothesised author) will be referred to as a true claim.</Paragraph>
      <Paragraph position="2"> [Table 1 caption fragment: ... the number of characters and number of words. For comparison purposes, this paper has about 5900 words.]</Paragraph>
      <Paragraph position="3"> Table 1: No. characters: 1750, 3500, 7000, 14000, 28000; No. words: 312, 625, 1250, 2500, 5000.</Paragraph>
      <Paragraph position="4"> For a given text, one of the following two classification errors can occur: (i) a false positive, where an impostor text is incorrectly classified as a true text; (ii) a false negative, where a true text is incorrectly classified as an impostor text. The errors are measured in terms of the false positive rate (FPR) and the false negative rate (FNR). Following the approach often used within the biometrics field, the decision threshold was then adjusted so that the FPR is equal to the FNR, giving Equal Error Rate (EER) performance (Ortega-Garcia et al., 2004; Sanderson et al., 2006).</Paragraph>
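The EER criterion described here can be sketched as a threshold sweep: the threshold is moved until the FPR and FNR coincide (or come as close as the discrete score sets allow). The function name and toy scores are illustrative assumptions.

```python
# Sketch of Equal Error Rate (EER) estimation: sweep the decision threshold
# over the observed scores and report the point where the false positive rate
# (impostor accepted) is closest to the false negative rate (true rejected).
def equal_error_rate(true_scores, impostor_scores):
    """Higher score = stronger support for a true claim. Returns approx. EER."""
    thresholds = sorted(set(true_scores) | set(impostor_scores))
    best = (1.0, 1.0, 1.0)  # (|FPR - FNR|, FPR, FNR)
    for t in thresholds:
        fpr = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        fnr = sum(s < t for s in true_scores) / len(true_scores)
        best = min(best, (abs(fpr - fnr), fpr, fnr))
    return (best[1] + best[2]) / 2  # average FPR and FNR at the crossover

print(equal_error_rate([0.9, 0.8, 0.4], [0.5, 0.2, 0.1]))
```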
      <Paragraph position="5"> The authors in the database were randomly assigned into two disjoint sections: (i) 10 background authors; (ii) 40 evaluation authors. For the case of Markov chain approaches, texts from the background authors were used to construct the generic author model, while for kernel based approaches they were used to represent the negative class. In both cases, text materials, each comprising approx. 28,000 characters, were used, formed by randomly choosing a sufficient number of sentences from the pooled texts. Table 1 shows the correspondence between the number of characters and words, using the average word length of 5.6 characters including a trailing whitespace (found on the whole dataset).</Paragraph>
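The correspondence in Table 1 is simple arithmetic on the reported average word length; truncating the division reproduces the table's values (1750 characters ≈ 312 words, 28,000 characters ≈ 5000 words):

```python
# Table 1's character/word correspondence follows from the average word length
# of 5.6 characters (including a trailing whitespace) found on the dataset.
AVG_WORD_LEN = 5.6

def chars_to_words(n_chars):
    # truncating the division reproduces the table entries (e.g. 312, not 313)
    return int(n_chars / AVG_WORD_LEN)

for n in (1750, 3500, 7000, 14000, 28000):
    print(n, "characters ~", chars_to_words(n), "words")
```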
      <Paragraph position="6"> For each author in the evaluation section, their material was randomly split5 into two continuous parts: training and testing. The split occurred without breaking sentences. The training material was used to construct the author model, while the test material was used to simulate a true claim as well as impostor claims against all other authors' models. Note that if material from the evaluation section was used for constructing the generic author model, the system would have prior knowledge about the writing style of the authors used for the impostor claims.</Paragraph>
      <Paragraph position="7"> For each configuration of an approach (where, for example, the configuration is the order of the Markov chains), the above procedure was repeated ten times, with the randomised assignments and splitting being done each time. The final results were then obtained in terms of the mean and the corresponding standard deviation of the ten EERs (the standard deviations are shown as error bars in the result figures). Based on preliminary experiments, stemming was used for word based approaches (Manning and Schütze, 1999).</Paragraph>
    </Section>
    <Section position="3" start_page="485" end_page="487" type="sub_section">
      <SectionTitle>
4.3 Experiments and Discussion
</SectionTitle>
      <Paragraph position="0"> In the first experiment we studied the effects of varying the order for character and word Markov chain approaches, while the amount of training material was fixed at approx. 28,000 characters and the test material (for evaluation authors) was decreased from approx. 28,000 to 1,750 characters. Results are presented in Fig. 1.</Paragraph>
      <Paragraph position="1"> The results show that 2nd order chains of characters generally obtain the best performance.</Paragraph>
      <Paragraph position="2"> However, the difference in performance between  as statistically insignificant due to the large overlap of the error bars. The best performing word chain approach had an order of zero, with higher orders (not shown) having virtually the same performance as the 0th order. Its performance is largely similar to the 2nd order character chain approach, with the latter obtaining a somewhat lower error rate at 28,000 characters.</Paragraph>
      <Paragraph position="3"> The second experiment was similar to the first, with the difference being that the amount of training material and test material was decreased from approx. 28,000 to 1,750 characters. The main change between the results of this experiment (shown in Fig. 2) and the previous experiment's results is the faster degradation in performance as the number of characters is decreased. We comment on this effect later.</Paragraph>
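The Markov chain approach itself is specified earlier in the paper, outside this section; a heavily hedged sketch of the general idea, consistent with the setup above (an author model scored against a generic background model built from the background authors), might look as follows. The add-one smoothing and the log-likelihood-ratio score are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of an n-th order character Markov chain verification score:
# log-likelihood of the text under the claimed author's model minus the
# log-likelihood under a generic background model. Add-one smoothing and the
# exact scoring are assumptions; the paper's formulation is defined elsewhere.
from collections import Counter
import math

def train(text, order):
    """Count (context, next-char) occurrences for an order-n chain."""
    ctx_counts, pair_counts = Counter(), Counter()
    for i in range(order, len(text)):
        ctx, ch = text[i - order:i], text[i]
        ctx_counts[ctx] += 1
        pair_counts[(ctx, ch)] += 1
    return ctx_counts, pair_counts

def log_lik(text, model, order, alphabet_size):
    ctx_counts, pair_counts = model
    total = 0.0
    for i in range(order, len(text)):
        ctx, ch = text[i - order:i], text[i]
        # add-one smoothing over the alphabet (illustrative choice)
        p = (pair_counts[(ctx, ch)] + 1) / (ctx_counts[ctx] + alphabet_size)
        total += math.log(p)
    return total

def verification_score(text, author_model, background_model, order, alphabet_size):
    """Positive scores favour the claimed author over the background."""
    return (log_lik(text, author_model, order, alphabet_size)
            - log_lik(text, background_model, order, alphabet_size))
```

Thresholding this score (as in Section 4.2) then yields the accept/reject decision for a claim.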
      <Paragraph position="4"> In the third experiment we utilised SVMs with character sequence kernels and studied the effects of chunk size. As SVMs employ support objects in the definition of the discriminant function (see Section 3), the training material was split into varying size chunks, ranging from approximately 62 to 4000 characters. Each of the chunks can become a support chunk. Naturally, the smaller the chunk size, the larger the number of chunks. As the split was done without breaking sentences, the effective chunk size tended to be somewhat larger.</Paragraph>
      <Paragraph position="5"> If there are fewer words available than a given chunk size, then all of the remaining words are used for forming a chunk. Based on preliminary experiments, the bounded range weight function with t=3 was used. The results, presented in Fig. 3, indicate that the optimum chunk size is approximately 500 characters for the three cases. Furthermore, the optimum chunk size appears to be independent of the number of available chunks for training.</Paragraph>
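The chunking described above (split at sentence boundaries, so effective chunks run somewhat over the target, with leftover material forming a final smaller chunk) can be sketched as below; the function name and toy sentences are illustrative assumptions.

```python
# Sketch of splitting training material into chunks of roughly a target size
# without breaking sentences: sentences are appended until the target length
# is reached, so effective chunks tend to be somewhat larger than the target.
def chunk_sentences(sentences, target_chars):
    chunks, current = [], ""
    for s in sentences:
        current += s
        if len(current) >= target_chars:
            chunks.append(current)
            current = ""
    if current:  # leftover material still forms a (smaller) final chunk
        chunks.append(current)
    return chunks

sents = ["One sentence here. ", "Another one. ", "A third sentence. ", "Last. "]
print([len(c) for c in chunk_sentences(sents, 30)])
```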
      <Paragraph position="6"> In the fourth experiment we studied the effects of various weight functions and sequence lengths for the character sequence kernel. The amount of training and test material was fixed at approx. 28,000 characters. Based on the results from the previous experiment, chunk size was set at 500. Results for specific length (Fig. 4) suggest that most of the reliable discriminatory information is contained in sequences of length 2. [Fig. 3 caption fragment: ... kernel approach for varying chunk sizes. Bounded range weight function with t=3 was used.]</Paragraph>
      <Paragraph position="7"> [Fig. 4 caption fragment: ... kernel approach for various weight functions. The size of training and test materials was fixed at approx. 28,000 characters. Chunk size of 500 characters was used. Error bars were omitted for clarity.] The error rates for the bounded range and bounded linear decay functions are quite similar, with both reaching minima for sequences of length 4; most of the improvement occurs when the sequences reach a length of 3. This indicates that while sequences with a specific length of 3 and 4 are less reliable than sequences with a specific length of 2, they contain (partly) complementary information which is useful when combined with information from shorter lengths. Emphasising longer lengths of 5 and 6 (via the bounded linear growth function) causes a minor, but noticeable, performance degradation. We conjecture that the degradation is caused by the sparsity of relatively long sequences, which affects the generalisation of the classifier.</Paragraph>
      <Paragraph position="8"> The fifth experiment was devoted to an evaluation of the effects of chunk size for the word sequence approach. [Fig. 5 caption fragment: ... kernel approach for varying chunk sizes. Specific length weight function with t=1 was used.]</Paragraph>
      <Paragraph position="9"> To keep the results comparable with the character sequence approach (third experiment), the training material was split into varying size chunks, ranging from approximately 62 to 8000 characters. Based on the results from the first experiment, the specific length weight function with t=1 was used6 (resulting in a bag-of-words kernel).</Paragraph>
      <Paragraph position="10"> The amount of training and test material was equal and three cases were evaluated: 28,000, 14,000 and 7,000 characters. Results, shown in Fig. 5, suggest that the optimum chunk size is approximately 4000 characters for the three cases. As mentioned in Section 3, for the word based approach the implicit feature space representation can have considerably higher dimensionality and be sparser than for the character based approach. Consequently, longer texts would be required to adequately populate the feature space. This is reflected by the optimum chunk size for the word based approach, which is roughly an order of magnitude larger than the optimum chunk size for the character based approach.</Paragraph>
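The dimensionality argument can be made concrete with the figures quoted in Section 3 (|Q| of roughly 100 characters vs roughly 10,000 words): the implicit feature space for sequences of length n has |Q|^n dimensions, so the word-based space is larger by a factor of (10,000/100)^n.

```python
# Implicit feature-space size for sequences of length n over |Q| symbols is
# |Q|**n. With the rough figures from the text (100 characters vs 10,000
# words), the word-based space is 100**n times larger, and a given text also
# contains fewer words than characters, so it populates that space sparsely.
def feature_space_size(alphabet_size, seq_len):
    return alphabet_size ** seq_len

for n in (1, 2, 3):
    chars = feature_space_size(100, n)
    words = feature_space_size(10_000, n)
    print(f"len={n}: chars={chars:,} words={words:,} ratio={words // chars:,}")
```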
      <Paragraph position="11"> In the sixth experiment we compared the performance of character sequence kernels (using the bounded range function with t=4) and several configurations of the word sequence kernels. The amount of training material was fixed at approx. 28,000 characters and the test material was decreased from approx. 28,000 to 1,750 characters. Based on the results of previous experiments, chunk size was set to 500 for the character based approach and to 4000 for the word based</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="487" end_page="487" type="metho">
    <SectionTitle>
[Figure 6 axes: NUMBER OF CHARACTERS, EER]
</SectionTitle>
    <Paragraph position="0"> [Fig. 6 legend and caption fragment: char: len=4 (bounded range); word: len=1 (specific length); word: len=2 (specific length); word: len=2 (bounded linear decay). ... sequence kernel approaches using fixed size training material (approx. 28,000 characters) and varying size test material.]</Paragraph>
    <Paragraph position="1"> approach. Fig. 6 shows that word sequences with a specific length of 2 lead to considerably worse performance than sequences of length 1 (i.e. individual words). Furthermore, the best performing combination of lengths (i.e. via the bounded linear decay function7) does not provide better performance than using individual words. The character sequence kernels consistently achieve a lower error rate than the best performing word sequence kernel. This suggests that the sparse feature space representation, described in Section 3, is becoming an issue.</Paragraph>
    <Paragraph position="2"> The final experiment was similar to the sixth, with the difference being that the amount of training material and test material was decreased from approx. 28,000 to 1,750 characters. As observed for the Markov chain approaches, the main change between the results of this experiment (shown in Fig. 7) and the previous experiment's results is the faster degradation in performance as the number of characters is decreased. Along with the results from experiments 1 and 2, this indicates that the amount of training material has considerably more influence on discrimination performance than the amount of test material.</Paragraph>
    <Paragraph position="3"> In Fig. 8 it can be observed that the best performing Markov chain based approach (characters, 2nd order) obtains comparable performance to the character sequence kernel based approach (using the bounded range function with t=4).</Paragraph>
    <Paragraph position="4"> 7Other combinations of lengths were also evaluated, though the results are not shown here.</Paragraph>
  </Section>
  <Section position="8" start_page="487" end_page="487" type="metho">
    <SectionTitle>
[Figure 7 axes: NUMBER OF CHARACTERS, EER]
</SectionTitle>
    <Paragraph position="0"> [Fig. 7 legend: char: len=4 (bounded range); word: len=1 (specific length); word: len=2 (specific length); word: len=2 (bounded linear decay). Fig. 8 caption fragment: ... kernel approach with the best Markov chain approach for two cases: (A) varying size of training and test material, (B) fixed size training material (approx. 28,000 characters) and varying size test material.]</Paragraph>
  </Section>
  <Section position="9" start_page="487" end_page="489" type="metho">
    <SectionTitle>
5 Author Unmasking On Short Texts
</SectionTitle>
    <Paragraph position="0"> Koppel &amp; Schler (2004) proposed an alternative method for author verification. Rather than treating the verification problem directly as a two-class discrimination task (as done in Section 4), an "author unmasking" curve is first built. A vector representing the "essential features" of the curve is then classified in a traditional SVM setting. The unmasking procedure is reminiscent of the recursive feature elimination procedure first proposed in the context of gene selection for cancer classification (Guyon et al., 2002).</Paragraph>
    <Paragraph position="1"> Instead of having an author specific model (as in the Markov chain approach) or an author specific SVM, a reference text is used. [Fig. 9 caption fragment: ... Husband using Wilde's Woman of No Importance as well as the works of other authors as reference texts.]</Paragraph>
    <Paragraph position="2"> The text to be classified as well as the reference text are divided into chunks; the features representing each chunk are the counts of pre-selected words. Each point in the author unmasking curve is the cross-validation accuracy of discriminating between the two sets of chunks (using a linear SVM). At each iteration, several of the most discriminative features are removed from further consideration.</Paragraph>
    <Paragraph position="3"> The underlying hypothesis is that if the two given texts have been written by the same author, the differences between them will be reflected in a relatively small number of features. Koppel &amp; Schler (2004) observed that for texts authored by the same person, the extent of the cross-validation accuracy degradation is much larger than for texts written by different authors. Encouraging classification results were obtained for long texts (books available from Project Gutenberg8).</Paragraph>
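The unmasking loop can be sketched as below. One deliberate substitution for self-containedness: a leave-one-out nearest-centroid rule stands in for the cross-validated linear SVM used by Koppel &amp; Schler, and feature "discriminativeness" is approximated by the gap between class centroids rather than SVM weights; all names and the toy chunk counts are assumptions.

```python
# Hedged sketch of the unmasking procedure: repeatedly measure how well the
# two chunk sets can be separated, then eliminate the most discriminative
# features. A nearest-centroid classifier with leave-one-out validation
# replaces the linear SVM with cross-validation used in the original work.
def centroid(vectors, feats):
    return [sum(v[f] for v in vectors) / len(vectors) for f in feats]

def loo_accuracy(chunks_a, chunks_b, feats):
    """Leave-one-out accuracy of a nearest-centroid rule on active features."""
    data = [(v, 0) for v in chunks_a] + [(v, 1) for v in chunks_b]
    correct = 0
    for i, (v, label) in enumerate(data):
        rest = [d for j, d in enumerate(data) if j != i]
        cents = [centroid([x for x, l in rest if l == c], feats) for c in (0, 1)]
        dists = [sum((v[f] - cent[k]) ** 2 for k, f in enumerate(feats))
                 for cent in cents]
        correct += dists[label] <= dists[1 - label]
    return correct / len(data)

def unmasking_curve(chunks_a, chunks_b, n_iter=3, drop_per_iter=1):
    feats = list(range(len(chunks_a[0])))
    curve = []
    for _ in range(n_iter):
        curve.append(loo_accuracy(chunks_a, chunks_b, feats))
        # drop the most discriminative features: largest centroid gap
        ca, cb = centroid(chunks_a, feats), centroid(chunks_b, feats)
        ranked = sorted(range(len(feats)), key=lambda k: -abs(ca[k] - cb[k]))
        drop = {feats[k] for k in ranked[:drop_per_iter]}
        feats = [f for f in feats if f not in drop]
    return curve

a_chunks = [[5, 1, 0], [6, 1, 1]]  # per-chunk counts of 3 hypothetical words
b_chunks = [[1, 5, 0], [0, 6, 1]]
print(unmasking_curve(a_chunks, b_chunks))
```

In this toy run the accuracy stays high until the discriminative features are exhausted, then collapses, which is the shape the degradation-based hypothesis above predicts.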
    <Paragraph position="4"> In this section we first confirm the unmasking effect for long texts and then show that for shorter texts (i.e. approx. 5000 words), the effect is considerably less distinctive.</Paragraph>
      <Paragraph position="5"> For the first experiment we followed the setup in (Koppel and Schler, 2004), i.e. the same books, chunks with a size of approximately 500 words, 10 fold cross-validation, removing 6 features at each iteration, and using 250 words with the highest average frequency in both texts as the set of pre-selected words. Fig. 9 shows curves for unmasking Oscar Wilde's An Ideal Husband using Wilde's Woman of No Importance (same-author curve) as well as the works of other authors as reference texts (different-author curves). [Table 2 caption fragment: ... character sequence kernel approach (t = 4, bounded range) and character Markov chain approach (2nd order).]</Paragraph>
      <Paragraph position="6"> Table 2: Approach / mean EER / std. dev.: Author unmasking 30.88 / 4.32; Character sequence kernel 8.08 / 2.08; Character Markov chain 8.14 / 1.79.</Paragraph>
      <Paragraph position="7"> As can be observed, the unmasking effect is most pronounced for Wilde's text. Furthermore, this figure has a close resemblance to Fig. 2 in (Koppel and Schler, 2004).</Paragraph>
      <Paragraph position="8"> In the second experiment we used text materials from the Columnists dataset. Each author's text material was divided into two sections of approximately 5000 words, with one of the sections randomly selected to be the reference material, leaving the other as the test material. Based on preliminary experiments, the number of pre-selected words was set to 100 (with the highest average frequency in both texts) and the size of the chunks was set to 200 words. The remainder of the unmasking procedure setup was the same as for the first experiment. The setup for verification trials was similar to the setup in Section 4.2, with the difference being that the background authors were used to generate same-author and different-author curves for training the secondary SVM. In all cases features from each curve were extracted, as done in (Koppel and Schler, 2004), prior to further processing.</Paragraph>
    <Paragraph position="9"> Table 2 provides a comparison between the performance of the unmasking approach with that of the character sequence kernel and character  Markov chain based approaches, as evaluated in Section 4. Fig. 10 shows representative curves resulting from unmasking of the test material from author A, using A's as well as other authors' reference materials. Generally, the unmasking effect for the same-author curves is considerably less pronounced and in some cases it is non-existent.</Paragraph>
    <Paragraph position="10"> More dangerously, different-author curves often have close similarities to same-author curves. The results and the above observations hence suggest that the unmasking method is less useful when dealing with relatively short texts.</Paragraph>
  </Section>
</Paper>