Abstract- The copying of programming assignments by students, especially at the undergraduate and postgraduate levels, is a common practice. Efficient mechanisms for detecting plagiarised code are therefore needed. Text-based plagiarism detection techniques do not work well on source code. In this paper we analyse a code-based plagiarism detection technique which is employed by various plagiarism detection tools such as JPlag, MOSS and CodeMatch.
The word plagiarism is derived from the Latin word plagiare, which means to kidnap or to abduct. In academia or industry, plagiarism refers to the act of copying material without acknowledging the original source[1]. Plagiarism is considered an ethical offence and may incur serious disciplinary action, such as a sharp reduction in marks or, in severe cases, expulsion from the university. Student plagiarism primarily falls into two categories: text-based plagiarism and code-based plagiarism. Instances of text-based plagiarism include word-for-word copying, paraphrasing, plagiarism of secondary sources, plagiarism of ideas, and blunt plagiarism or authorship plagiarism. Plagiarism is considered code-based when a student copies or modifies a program that is to be submitted for a programming assignment. Code-based plagiarism includes verbatim copying, changing comments, changing white space and formatting, renaming identifiers, reordering code blocks, changing the order of operators or operands in expressions, changing data types, adding redundant statements or variables, and replacing control structures with equivalent structures[2].
Text-based plagiarism detection techniques do not work well when the input is a program. Experiments have suggested that text-based systems ignore coding syntax, an indispensable part of any programming construct, which is a serious drawback. To overcome this problem, code-based plagiarism detection techniques were developed. These techniques can be classified into two categories: attribute-oriented plagiarism detection and structure-oriented plagiarism detection.
Attribute oriented plagiarism detection systems measure properties of assignment submissions[3]. The following attributes are considered:
Based on these attributes, the degree of similarity between two programs can be estimated.
Structure-oriented plagiarism detection systems deliberately ignore easily modifiable programming elements such as comments, additional white space and variable names. This makes them less susceptible to the addition of redundant information than attribute-oriented plagiarism detection systems. A student who is aware that this kind of system is deployed at his or her institution would rather complete the assignment independently than undertake a tedious and time-consuming modification task.
Steven Burrows, in his paper "Efficient and Effective Plagiarism Detection for Large Code Repositories"[3], provided an algorithm for code-based plagiarism detection. The algorithm comprises the following steps:
Let us consider a simple C program:

    #include <stdio.h>
    int main( ) {
        int var;
        for (var=0; var<5; var++)
        {
            printf("%d\n", var);
        }
        return 0;
    }

Figure 1.0: A simple C program.
| Programming Construct | Token |
|---|---|
| int | S |
| main | N |
| for | R |
| return | g |
| ( | A |
| ) | B |
| { | j |
| } | l |
| = | K |
| < | J |
| + | D |
| , | E |
| ALPHANUM | N |
| STRING | 5 |

Table 1.0: Token list for program in Figure 1.0.
Here ALPHANUM refers to any function name, variable name or variable value. STRING refers to one or more characters enclosed in double quotes.
The corresponding token stream for the program in Figure 1.0 is given as
SNABjSNRANKNNJNNDDBjNA5ENBlgNl
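The mapping in Table 1.0 can be sketched in a few lines of Python. This is only an illustration, not the tokenizer used in [3]: the regular expression, the `tokenize` function and the decisions to skip preprocessor lines and any symbol without an entry in Table 1.0 (such as the semicolon) are assumptions chosen so that the sample program reproduces the token stream above.

```python
import re

# Token codes copied from Table 1.0; "main" and ALPHANUM share the code "N".
TOKEN_CODES = {
    "int": "S", "main": "N", "for": "R", "return": "g",
    "(": "A", ")": "B", "{": "j", "}": "l",
    "=": "K", "<": "J", "+": "D", ",": "E",
}
ALPHANUM, STRING = "N", "5"

# A lexeme is a string literal, an identifier, a number, or one other symbol.
LEXEME = re.compile(r'"(?:\\.|[^"\\])*"|[A-Za-z_]\w*|\d+|\S')


def tokenize(source):
    """Return the token stream of a C source text as one string."""
    stream = []
    for line in source.splitlines():
        if line.lstrip().startswith("#"):
            continue  # assumption: preprocessor lines are not tokenized
        for lexeme in LEXEME.findall(line):
            if lexeme.startswith('"'):
                stream.append(STRING)
            elif lexeme in TOKEN_CODES:
                stream.append(TOKEN_CODES[lexeme])
            elif lexeme[0].isalnum() or lexeme[0] == "_":
                stream.append(ALPHANUM)
            # any other symbol (e.g. ";") has no entry in Table 1.0: skipped
    return "".join(stream)


PROGRAM = r'''
#include <stdio.h>
int main( ) {
    int var;
    for (var=0; var<5; var++)
    {
        printf("%d\n", var);
    }
    return 0;
}
'''

print(tokenize(PROGRAM))  # SNABjSNRANKNNJNNDDBjNA5ENBlgNl
```

Because every identifier and literal value collapses to the same code, renaming `var` or changing the loop bound leaves the stream unchanged.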
Now the above token stream is converted to an N-gram representation. In our case the value of N is chosen as 4. The corresponding 4-grams of the above token stream are shown below:
SNAB NABj ABjS BjSN jSNR SNRA NRAN RANK ANKN NKNN KNNJ NNJN NJNN JNND NNDD NDDB DDBj DBjN BjNA jNA5 NA5E A5EN 5ENB ENBl NBlg BlgN lgNl
These 4-grams are generated using the sliding window technique, which moves a “window” of size N across the token stream from left to right, one token at a time.
The use of N-grams is an appropriate method of performing structural plagiarism detection because a local change to the source code affects only a few neighbouring N-grams. The modified version of the program retains a large percentage of unchanged N-grams, so plagiarism in it remains easy to detect.
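The sliding window and its robustness to small edits can be illustrated with a short Python sketch. The function name `ngrams` and the modified stream are hypothetical: the modification stands for one redundant declaration such as `int x;` (tokens `SN`) inserted before the loop.

```python
def ngrams(stream, n=4):
    """Slide a window of n tokens across the stream, one token at a time."""
    return [stream[i:i + n] for i in range(len(stream) - n + 1)]


original = "SNABjSNRANKNNJNNDDBjNA5ENBlgNl"
# Hypothetical disguise: a redundant declaration ("SN") added before the loop.
modified = "SNABjSNSNRANKNNJNNDDBjNA5ENBlgNl"

grams = ngrams(original)
shared = set(grams) & set(ngrams(modified))

print(len(grams), len(shared))  # 27 26
```

Only the 4-gram that spans the insertion point (`jSNR`) is lost; the other 26 of the 27 original 4-grams still occur in the modified program.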
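The second step is to create an inverted index of these N-grams. An inverted index consists of a lexicon and, for each term in the lexicon, an inverted list. Each inverted list records the number of documents containing the term, followed by one (document number, frequency) pair per document. An example is shown below: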
| Lexicon | Inverted List |
|---|---|
| Apple | 1: 25,3 |
| Orange | 1: 26,2 |
| Banana | 1: 22,5 |
| Mango | 3: 31,1 33,3 15,2 |
| Grapes | 2: 24,6 26,1 |

Table 2.0: Inverted Index
Referring to the inverted list for Mango above, we can conclude that Mango occurs in three documents in the collection: once in document no. 31, thrice in document no. 33 and twice in document no. 15. Similarly, we can represent the 4-grams of the program in Figure 1.0 with the help of an inverted index. The inverted index entries for five such 4-grams are shown below in Table 3.0.
| Lexicon | Inverted List |
|---|---|
| 5ENB | 2: 1,1 2,2 |
| A5EN | 2: 1,1 2,2 |
| ABjS | 2: 1,1 2,1 |
| ANKN | 2: 1,1 2,1 |
| BgNl | 1: 2,1 |
| … | … |

Table 3.0: Inverted Index
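A minimal Python sketch of such an index is given below. The names `build_index` and `format_postings` and the two tiny documents are hypothetical; the sketch keeps the whole index in memory, which is an assumption made for brevity and says nothing about how the index in [3] is stored.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Map each N-gram to a list of (document number, frequency) pairs."""
    index = defaultdict(list)
    for doc_id, grams in docs.items():
        for gram, freq in Counter(grams).items():
            index[gram].append((doc_id, freq))
    return dict(index)


def format_postings(postings):
    """Render an inverted list in the notation of Tables 2.0 and 3.0."""
    pairs = " ".join(f"{doc_id},{freq}" for doc_id, freq in postings)
    return f"{len(postings)}: {pairs}"


# Hypothetical collection: document number -> list of 4-grams.
docs = {
    1: ["SNAB", "NABj", "ABjS"],
    2: ["SNAB", "NABj", "SNAB"],
}

index = build_index(docs)
print(format_postings(index["SNAB"]))  # 2: 1,1 2,2
print(format_postings(index["ABjS"]))  # 1: 1,1
```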
The next step is to query the index. Each query is the N-gram representation of a program. A token stream of t tokens yields (t − n + 1) N-grams, where n is the length of the N-gram; for example, the 30-token stream above yields 30 − 4 + 1 = 27 4-grams. Each query returns the ten indexed programs most similar to the query program, ordered from most similar to least similar. If the query program is itself one of the indexed programs, we would expect it to produce the highest score. We assign a similarity score of 100% to this exact or top match[3]. All other programs are given a similarity score relative to the top score.
In Burrows' experiment, one program file (0020.c) was queried against an index of 296 programs; Table 4.0 presents the top ten results. The file scored against itself generates the highest raw score and therefore a relative score of 100.00%. This self-match is ignored as a result, but its raw score is used to compute the relative similarity score of all other results. We can also see that the program 0103.c is very similar to program 0020.c, with a score of 93.34%.
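The querying step can be sketched as follows. The similarity measure is not specified in the text above, so this sketch substitutes a simple stand-in (the number of N-gram occurrences shared by the query and each indexed program); it is an assumption for illustration only and is not the measure used in [3]. The function name `query` and the small index are also hypothetical.

```python
from collections import Counter, defaultdict


def query(index, query_grams, top_k=10):
    """Return up to top_k (document, raw score, relative score %) triples."""
    raw = defaultdict(float)
    for gram, q_freq in Counter(query_grams).items():
        for doc_id, d_freq in index.get(gram, []):
            # Stand-in measure: count the occurrences both programs share.
            raw[doc_id] += min(q_freq, d_freq)
    ranked = sorted(raw.items(), key=lambda item: item[1], reverse=True)
    ranked = ranked[:top_k]
    if not ranked:
        return []
    top_score = ranked[0][1]  # the top match defines 100%
    return [(doc_id, score, 100.0 * score / top_score)
            for doc_id, score in ranked]


# Hypothetical index: N-gram -> list of (document number, frequency).
index = {
    "SNAB": [(1, 1), (2, 1)],
    "NABj": [(1, 1), (2, 1)],
    "ABjS": [(1, 1)],
}

results = query(index, ["SNAB", "NABj", "ABjS"])
for doc_id, raw_score, relative in results:
    print(doc_id, raw_score, f"{relative:.2f}%")
```

Here document 1 contains all three query 4-grams and becomes the 100% match, while document 2 shares two of them and is scored relative to document 1.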
| Rank | Query File | Index File | Raw Score | Similarity Score |
|---|---|---|---|---|
| 1 | 0020.c | 0020.c | 369.45 | 100% |
| 2 | 0020.c | 0103.c | 344.85 | 93.34% |
| 3 | 0020.c | 0092.c | 189.38 | 51.26% |
| 4 | 0020.c | 0151.c | 185.05 | 50.09% |
| 5 | 0020.c | 0267.c | 167.82 | 45.43% |
| 6 | 0020.c | 0150.c | 164.67 | 44.57% |
| 7 | 0020.c | 0137.c | 158.67 | 42.93% |
| 8 | 0020.c | 0139.c | 154.31 | 41.76% |
| 9 | 0020.c | 0269.c | 129.17 | 34.96% |
| 10 | 0020.c | 0241.c | 126.87 | 34.33% |

Table 4.0: Results of the program 0020.c compared to an index of 296 programs.
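The relative similarity scores in Table 4.0 follow from dividing each raw score by the top raw score. The short Python check below uses the first three rows of the table; the variable names are illustrative, and the raw scores are the rounded values printed in the table.

```python
# Raw scores of the first three rows of Table 4.0.
raw_scores = {"0020.c": 369.45, "0103.c": 344.85, "0092.c": 189.38}

top_score = max(raw_scores.values())
relative = {name: round(100 * score / top_score, 2)
            for name, score in raw_scores.items()}

for name, percent in relative.items():
    print(name, f"{percent:.2f}%")
# 0020.c 100.00%
# 0103.c 93.34%
# 0092.c 51.26%
```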
The salient features of the plagiarism detection tools JPlag, MOSS and CodeMatch are compared below:
| | JPlag | MOSS | CodeMatch |
|---|---|---|---|
| Year of origin | 1996 | 1994 | 2003 |
| Inventor | Guido Malpohl | Alex Aiken | Bob Zeidman |
| Availability | Free | Free | Free (up to 1 MB) |
| Algorithm used | Greedy String Tiling | Winnowing algorithm | Statement / comment / instruction / identifier matching |
| Languages supported | C, C++, C#, Java, Scheme and natural text | 26 languages | 26 languages |
| Results displayed | HTML histogram | HTML basic report | HTML pair code matching |
In this paper we examined a structure-oriented, code-based plagiarism detection technique known as Scalable Plagiarism Detection, together with its constituent steps: tokenization, indexing and querying the index. We also compared the salient features of the code-based plagiarism detection tools JPlag, MOSS and CodeMatch.
References