<?xml version="1.0" standalone="yes"?> <Paper uid="N06-2035"> <Title>Weblog Classification for Fast Splog Filtering: A URL Language Model Segmentation Approach</Title> <Section position="3" start_page="137" end_page="137" type="metho"> <SectionTitle> 2 Engineering of splogs </SectionTitle> <Paragraph position="0"> Sploggers frequently indicate the semantic content of their weblogs using descriptive phrases, often noun groups (non-recursive noun phrases) like adult-video-mpegs. Splogs come in different varieties: commercial products (especially electronics), vacations, mortgages, and adult-related content.</Paragraph> <Paragraph position="1"> Users do not want to see splogs in their search results, and marketing intelligence applications suffer when their data contains splogs. Existing approaches to splog filtering employ statistical classifiers (e.g., SVMs) trained on the tokens in a URL after tokenization on punctuation (Kolari et al., 2006).</Paragraph> <Paragraph position="2"> To avoid being identified as a splog by such systems, one creative technique sploggers use is to glue words together into longer tokens for which no statistical information is available (e.g., businessopportunitymoneyworkathome is unlikely to appear in the training data, while business, opportunity, money, work, at, and home are likely to have been seen in training). Another approach to dealing with splogs is maintaining a list of splog websites (SURBL, 2006). 
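As a rough illustration of why glued tokens defeat token-based URL classifiers, the sketch below (hypothetical code, not from the paper or from Kolari et al.) tokenizes a URL on punctuation: a hyphenated splog URL yields informative tokens, while a glued token survives as a single out-of-vocabulary unit.

```python
import re

def tokenize_url(url: str) -> list[str]:
    """Split a URL into tokens on punctuation (non-alphanumeric characters)."""
    return [t for t in re.split(r"[^a-zA-Z0-9]+", url.lower()) if t]

# A hyphenated splog URL tokenizes into informative units...
print(tokenize_url("www.adult-video-mpegs.com"))
# → ['www', 'adult', 'video', 'mpegs', 'com']

# ...but a glued token stays as one unit a token-level classifier
# will never have seen in training.
print(tokenize_url("www.businessopportunitymoneyworkathome.com"))
# → ['www', 'businessopportunitymoneyworkathome', 'com']
```

This is what motivates the further segmentation step described in the next section.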
Such a blacklist-based approach is now less effective because blog hosts provide tools that allow the automatic creation of large numbers of splogs.</Paragraph> </Section> <Section position="4" start_page="137" end_page="138" type="metho"> <SectionTitle> 3 Splog filtering </SectionTitle> <Paragraph position="0"> The weblog classifier uses a segmenter that splits the URL into tokens; the token sequence is then used for supervised learning and classification.</Paragraph> <Section position="1" start_page="137" end_page="138" type="sub_section"> <SectionTitle> 3.1 URL segmentation </SectionTitle> <Paragraph position="0"> The segmenter first tokenizes the URLs on punctuation symbols. The resulting URL tokens are then examined for further possible segmentation. The segmenter uses a sliding window of n (e.g., 6) characters. Going from left to right in a greedy fashion, the segmenter decides whether to split after the third character of the current window. Figure 1 illustrates the processing of www.dietthatworks.com when considering the token dietthatworks. The character '&amp;' indicates that the left and right tri-grams are kept together, while '*' indicates a point where the segmenter decides a break should occur. The segmentation decisions are based on counts collected during training. For example, during the segmentation of dietthatworks, in the case of i e t * t h a we essentially consider how many times we have seen in the training data the 6-gram 'iettha' versus 'iet tha'. Certain characters (e.g., digits) are generalized both during training and segmentation.</Paragraph> </Section> <Section position="2" start_page="138" end_page="138" type="sub_section"> <SectionTitle> 3.2 Classification </SectionTitle> <Paragraph position="0"> For weblog classification, a simple Naïve Bayes classifier is used. 
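The greedy sliding-window segmentation of Section 3.1 can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the window size n = 6 follows the text, but the toy count table is a hard-coded assumption standing in for the character n-gram statistics the paper induces from 20,000 weblog pages, and the split rule (split when the "spaced" 6-gram was seen more often than the contiguous one) is an assumed reading of the description.

```python
def segment(token: str, counts: dict[str, int], n: int = 6) -> list[str]:
    """Greedy left-to-right segmentation with a sliding window of n characters.

    At each candidate boundary we compare how often training data contains
    the contiguous n-gram (e.g., 'iettha') versus the same characters with
    a space after the first n//2 characters (e.g., 'iet tha').
    """
    half = n // 2
    out, start = [], 0
    for i in range(half, len(token) - half + 1):
        left, right = token[i - half:i], token[i:i + half]
        if counts.get(left + " " + right, 0) > counts.get(left + right, 0):
            out.append(token[start:i])  # break after position i-1
            start = i
    out.append(token[start:])
    return out

# Toy counts (hypothetical): pretend 'iet tha' and 'hat wor' were
# frequently seen split in the training pages.
toy_counts = {"iet tha": 12, "iettha": 1, "hat wor": 9, "hatwor": 1}
print(segment("dietthatworks", toy_counts))
# → ['diet', 'that', 'works']
```

With real corpus counts, the same greedy pass recovers the word boundaries that sploggers removed by gluing.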
Given a token sequence T = t1 ... tn, we label T with the class c in {spam, good} that maximizes P(c|T) = P(c)P(t1 ... tn|c)/P(T), which is proportional to P(c)P(t1|c) x ... x P(tn|c).</Paragraph> <Paragraph position="2"> In the last step we made the conditional independence assumption. For calculating P(ti|c) we use Laplace (add-one) smoothing (Jurafsky &amp; Martin, 2000).</Paragraph> <Paragraph position="4"> We have also explored classification via simple voting techniques.</Paragraph> <Paragraph position="6"> Because we are interested in having control over the precision/recall of the classifier, we introduce a score used to decide whether to label a URL as unknown.</Paragraph> <Paragraph position="7"> If score(T) exceeds a certain threshold t, we label T as spam or good according to the greater of P(spam|T) and P(good|T). To control the precision of the classifier we can tune t. For instance, setting t = 0.75 yields 93.3% precision, which implies a recall of 50.9%. An alternative, commonly used technique for computing a score is the log-likelihood ratio.</Paragraph> </Section> </Section> <Section position="5" start_page="138" end_page="139" type="metho"> <SectionTitle> 4 Experiments and results </SectionTitle> <Paragraph position="0"> First we discuss the segmenter. 10,000 spam and 10,000 good weblog URLs and their corresponding HTML pages were used for the experiments. The 20,000 weblog HTML pages are used to induce the segmenter. The first experiment aimed to determine how common segmentation beyond punctuation is. The segmenter was run on the training URLs; the number of URLs that are additionally segmented beyond the segmentation on punctuation is reported in Table 1.</Paragraph> <Paragraph position="1"> The multiple segmentations need not all occur on the same token of the URL after the initial segmentation on punctuation.</Paragraph> <Paragraph position="2"> The segmenter was then evaluated on a separate test set of 1,000 URLs for which ground-truth segmentations were annotated. The results are in Table 2. 
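A toy sketch of the Laplace-smoothed Naive Bayes decision with an abstention threshold t, in the spirit of Section 3.2. This is not the paper's implementation: the count tables and vocabulary size are invented, and interpreting score(T) as the larger class posterior is an assumption (the paper's exact score definition is not shown in this extract).

```python
import math

def posteriors(tokens, token_counts, class_counts, vocab_size):
    """P(c|T) for c in {spam, good}: Naive Bayes with add-one smoothing."""
    log_p = {}
    total = sum(class_counts.values())
    for c, n_c in class_counts.items():
        lp = math.log(n_c / total)  # log prior P(c)
        denom = sum(token_counts[c].values()) + vocab_size
        for tok in tokens:
            # Laplace (add-one) smoothed P(tok|c)
            lp += math.log((token_counts[c].get(tok, 0) + 1) / denom)
        log_p[c] = lp
    # normalize log scores into posteriors
    m = max(log_p.values())
    z = sum(math.exp(v - m) for v in log_p.values())
    return {c: math.exp(v - m) / z for c, v in log_p.items()}

def classify(tokens, token_counts, class_counts, vocab_size, t=0.75):
    """Label with the higher-posterior class if it exceeds t, else abstain."""
    p = posteriors(tokens, token_counts, class_counts, vocab_size)
    best = max(p, key=p.get)
    return best if p[best] > t else "unknown"

# Hypothetical training counts: token frequencies in spam vs. good URLs.
token_counts = {
    "spam": {"mortgage": 30, "cash": 20, "blog": 5},
    "good": {"blog": 40, "diary": 25, "cash": 2},
}
class_counts = {"spam": 100, "good": 100}

print(classify(["mortgage", "cash"], token_counts, class_counts, 1000))
# → spam  (posterior well above the 0.75 threshold)
print(classify(["cash", "blog"], token_counts, class_counts, 1000))
# → unknown  (near-even posteriors, so the classifier abstains)
```

Raising t trades recall for precision, which is the knob the paper tunes to reach 93.3% precision at 50.9% recall.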
The evaluation is only on segmentation events and does not include tokenization decisions around punctuation.</Paragraph> <Paragraph position="3"> Figure 2 shows long tokens which are correctly split. The weblog classifier was then run on the test set. The results are shown in Table 3.</Paragraph> <Paragraph position="4"> Figure 2 (correctly split long tokens): cash * for * your * house; unlimitted * pet * supllies; jim * and * body * fat; weight * loss * product * info; kick * the * boy * and * run; bringing * back * the * past; food * for * your * speakers. Table 3 (classifier results): accuracy 78%; precision (spam) 82%; recall (spam) 71%; F-measure (spam) 76%; precision (good) 74%; recall (good) 84%; F-measure (good) 79%. The performance of humans on this task was also evaluated. Eight individuals performed splog identification looking only at the unsegmented URLs. The results for the human annotators are given in Table 4. The average accuracy of the humans (76%) is similar to that of the system (78%).</Paragraph> <Paragraph position="5"> Table 4 (results for the human annotators; mean and standard deviation): accuracy 76% (6.71); precision (spam) 83% (7.57); recall (spam) 65% (6.35); F-measure (spam) 73% (7.57); precision (good) 71% (6.35); recall (good) 87% (6.39); F-measure (good) 78% (6.08). From an information retrieval perspective, if only 50.9% of the URLs are retrieved (labelled as either spam or good, with the rest labelled as unknown), then 93.3% of the spam/good decisions are correct. This is relevant for cases where a URL-based splog filter is the first stage of a cascade followed by, for example, a content-based filter.</Paragraph> </Section> </Paper>
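As a quick sanity check on Table 3, the reported F-measures follow from precision and recall via the standard harmonic mean F = 2PR/(P+R):

```python
def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (values in percent)."""
    return 2 * p * r / (p + r)

# Table 3 system values: spam P=82, R=71; good P=74, R=84.
print(round(f_measure(82, 71)))  # → 76, matching the spam F-measure
print(round(f_measure(74, 84)))  # → 79, matching the good F-measure
```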