Duplicate detection is a data compression technique for identifying duplicate copies of repeating data. Today, duplicate detection techniques need to process ever larger datasets in ever shorter time while maintaining the quality of those datasets. We present adaptive and progressive approaches that significantly increase the efficiency of finding duplicates. In this paper, adaptive and progressive approaches and different algorithms are used to detect and calculate the percentage of duplication in source code. Duplication is a major concern in academics and can be a problem in every course. Duplication occurs when someone copies or presents another's work as their own. Students duplicate work in different areas: homework assignments, essays, projects, coding, etc. In this paper we focus on programming languages and detect the percentage of duplication in programming assignments.
Keywords—Data Cleaning, Stop Word Elimination, Stemming, Code Clone, Duplicate Detection