<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-2035">
  <Title>Weblog Classification for Fast Splog Filtering: A URL Language Model Segmentation Approach</Title>
  <Section position="1" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> This paper shows that, for statistical weblog classification aimed at splog filtering and based on n-grams of URL tokens, segmenting URLs beyond the standard punctuation boundaries is helpful. Many splog URLs contain phrases whose words are glued together in order to evade filtering techniques that rely on punctuation segmentation and unigrams. A technique that segments such long tokens into the words forming the phrase is proposed and evaluated. The resulting tokens are used as features for a weblog classifier whose accuracy is similar to that of humans (78% vs. 76%) and which reaches 93.3% precision in identifying splogs at a recall of 50.9%.</Paragraph>
  </Section>
</Paper>