<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0207"> <Title>Using Syntactic Information to Identify Plagiarism</Title> <Section position="3" start_page="0" end_page="37" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> To plagiarize is "to steal and pass off (the ideas or words of another) as one's own; [to] use (another's production) without crediting the source; [or] to commit literary theft [by] presenting as new and original an idea or product derived from an existing source." Plagiarism is frequently encountered in academic settings. According to turnitin.com, a 2001 survey of 4500 high school students revealed that "15% [of students] had submitted a paper obtained in large part from a term paper mill or web-site." An increased rate of plagiarism hurts the quality of education that students receive; facilitating recognition of plagiarism can help teachers control this damage.</Paragraph> <Paragraph position="1"> In recent years, many commercial and academic products have been developed to facilitate recognition of plagiarism. Most of these approaches identify verbatim plagiarism and can fail when works are paraphrased. To recognize plagiarism in paraphrased works, we need to capture similarities that go beyond keywords and verbatim overlaps. Two works that exhibit similarity both in their conceptual content (as indicated by keywords) and in their expression of this content should be considered more similar than two works that are similar only in content. 
In this context, content refers to the story or the information; expression refers to the linguistic choices authors make in presenting the content, i.e., creative elements of writing, such as whether authors tend toward passive or active voice, whether they prefer complex sentences with embedded clauses or simple sentences with independent clauses, as well as combinations of such choices.</Paragraph> <Paragraph position="2"> Linguistic information can be a source of power for measuring similarity between works based on their expression of content. In this paper, we use linguistic information related to the creative aspects of writing to improve recognition of paraphrased documents as a first step towards plagiarism detection.</Paragraph> <Paragraph position="3"> To identify a set of features that relate to the linguistic choices of authors, we rely on different syntactic expressions of the same content. After identifying the relevant features (which we call syntactic elements of expression), we rely on patterns in the use of these features to recognize paraphrases of works.</Paragraph> <Paragraph position="4"> In the absence of real-life plagiarism data, we use a corpus of parallel translations of novels as a surrogate for plagiarism data. Translations of titles, i.e., original works, into English by different people provide us with books that are paraphrases of the same content. We use these paraphrases to automatically identify: 1. Titles even when they are paraphrased, and 2. Pairs of book chapters that are paraphrases of each other.</Paragraph> <Paragraph position="5"> Our first experiment shows that syntactic elements of expression outperform all baselines in recognizing titles even when they are paraphrased, providing a way of recognizing copies of works based on the similarities in their expression of content. 
Our second experiment shows that similarity measurements based on the combination of tfidf-weighted keywords and syntactic elements of expression outperform the weighted keywords alone in recognizing pairs of book chapters that are paraphrases of each other.</Paragraph> </Section> </Paper>