<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0207"> <Title>Using Syntactic Information to Identify Plagiarism</Title> <Section position="4" start_page="37" end_page="37" type="relat"> <SectionTitle> 2 Related Work </SectionTitle> <Paragraph position="0"> We define expression as the linguistic choices of authors in presenting a particular content (Uzuner, 2005; Uzuner and Katz, 2005). Linguistic similarity between works has been studied in the text classification literature for identifying the style of an author. However, it is important to differentiate expression from style. Style refers to the linguistic elements that, independently of content, persist over the works of an author and has been widely studied in authorship attribution. Expression involves the linguistic elements that relate to how an author phrases particular content and can be used to identify potential copyright infringement or plagiarism. Similarities in the expression of similar content in two different works signal potential copying. We hypothesize that syntax plays a role in capturing expression of content. Our approach to recognizing paraphrased works is based on phrase structure of sentences in general, and structure of verb phrases in particular.</Paragraph> <Paragraph position="1"> Most approaches to similarity detection use computationally cheap but linguistically less informed features (Peng and Hengartner, 2002; Sichel, 1974; Williams, 1975) such as keywords, function words, word lengths, and sentence lengths; approaches that include deeper linguistic information, such as syntactic information, usually incur significant computational costs (Uzuner et al., 2004). Our approach identifies useful linguistic information without incurring the computational cost of full text parsing; it uses context-free grammars to perform high-level syntactic analysis of part-of-speech tagged text (Brill, 1992). 
It turns out that such a level of analysis is sufficient to capture syntactic information related to creative aspects of writing; this in turn helps improve recognition of paraphrased documents. The results presented here show that extraction of useful linguistic information for text classification purposes does not have to be computationally prohibitive, and that despite the tradeoff between the accuracy of features and computational efficiency, we can extract linguistically informed features without full parsing.</Paragraph> </Section> </Paper>