
Clone Code Detection Tools and Algorithms

For my own research, I have collected a list of publicly available tools for "clone detection", that is, the detection of identical or similar code fragments in given source programs.

| Name | Supported languages | Approach | License | Usage |
|---|---|---|---|---|
| Duploc | C, C++, Java | Line-by-line string matching | GPL, written in Smalltalk | |
| PHP:Duploc | PHP | ??? | GPL, written in PHP | |
| pmd (cpd) | Java | Karp-Rabin string matching algorithm | BSD-style | Use the Ant task shown on the link (to avoid enumerating file names) |
| Simian | Java, C#, C, C++, COBOL, Ruby, JSP, ASP, HTML, XML, Visual Basic, or any text files | ??? | Free for non-commercial and open source use, written in .NET 1.1 or Java 1.4, source code unavailable | java -jar .../simian-2.2.4.jar -recurse *.java |
| SimScan | Java | Works on the parsed source tree (using the ANTLR parser) | Free for non-commercial and open source use, written in Java, source code unavailable | |
| dupwatch (dupwatch.jar, dupwatch.tgz) | Java | Finds duplicated code via "metric fingerprints" | ??? | |
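
To give a feel for the "line-by-line string matching" and "Karp-Rabin" approaches named in the table, here is a minimal Python sketch. It is not the code of Duploc or cpd; the function names, the whitespace-only normalization, and the default window of 3 lines are all my own assumptions for illustration. It hashes every run of `window` consecutive non-blank lines with a Karp-Rabin style rolling hash, then confirms candidate matches by comparing the actual text.

```python
import zlib
from collections import defaultdict

BASE = 257
MOD = (1 << 61) - 1  # large prime modulus for the rolling hash


def normalize(line):
    """Drop all whitespace so formatting differences do not hide a match."""
    return "".join(line.split())


def window_hashes(codes, window):
    """Yield (start_index, hash) for each run of `window` codes (rolling hash)."""
    if len(codes) < window:
        return
    high = pow(BASE, window - 1, MOD)  # weight of the oldest item in a window
    h = 0
    for c in codes[:window]:
        h = (h * BASE + c) % MOD
    yield 0, h
    for i in range(window, len(codes)):
        # Remove the oldest code, shift, and append the newest one.
        h = ((h - codes[i - window] * high) * BASE + codes[i]) % MOD
        yield i - window + 1, h


def find_clones(source, window=3):
    """Return sorted groups of 1-based start lines of identical line runs."""
    # Keep (original line number, normalized text) for non-blank lines only.
    lines = [(no, normalize(text))
             for no, text in enumerate(source.splitlines(), start=1)
             if text.strip()]
    codes = [zlib.crc32(text.encode("utf-8")) for _, text in lines]

    candidates = defaultdict(list)
    for start, h in window_hashes(codes, window):
        candidates[h].append(start)

    clones = []
    for starts in candidates.values():
        # Equal hashes are only candidates; compare the text to rule out collisions.
        exact = defaultdict(list)
        for s in starts:
            key = tuple(text for _, text in lines[s:s + window])
            exact[key].append(lines[s][0])
        clones.extend(group for group in exact.values() if len(group) > 1)
    return sorted(clones)
```

Calling `find_clones(text)` on a file's contents returns a list of groups, each group holding the line numbers where the same run of lines starts. Real tools typically go further, for example by tokenizing the source or merging overlapping runs into maximal clones.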
Here are some links to tools that are not publicly available:
  • Code clone related tools summarized by Osaka University, the developer of CloneWarrior/CCFinder/Gemini (currently available only upon request)
  • JPlag: a tool for detecting software plagiarism. Accounts are available on request
  • Moss: a tool for detecting cheating in university programming classes. A free internet service available only to instructors of programming courses
  • CloneDR (Clone Doctor): a well-known but commercial tool. Their paper says it is based on ASTs, ignoring identifier names
  • Dup and Pdiff by Brenda S. Baker at Bell Labs. Only the papers are available
  • Dotplot: a similarity pattern visualization tool
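
The CloneDR entry above mentions comparing ASTs while ignoring identifier names. The idea can be illustrated with Python's standard `ast` module; this is only a rough sketch of the general principle, not CloneDR's algorithm, and the class and function names (`_Anonymize`, `shape`, `same_shape`) are hypothetical. Every variable, parameter, and function name is replaced by a placeholder, so two fragments that differ only in naming produce the same tree.

```python
import ast


class _Anonymize(ast.NodeTransformer):
    """Replace variable, parameter, and function names with a placeholder."""

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

    def visit_arg(self, node):
        node.arg = "_"
        self.generic_visit(node)
        return node

    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)
        return node


def shape(code):
    """Return a textual form of the AST with identifier names erased."""
    tree = _Anonymize().visit(ast.parse(code))
    # ast.dump omits line and column attributes by default,
    # so the result does not depend on layout.
    return ast.dump(tree)


def same_shape(code_a, code_b):
    """True if the two fragments have the same AST once names are ignored."""
    return shape(code_a) == shape(code_b)
```

For example, `same_shape("def add(a, b):\n    return a + b", "def plus(x, y):\n    return x + y")` is true, while changing `+` to `-` in one of the fragments makes it false. Attribute names and literal values are deliberately left untouched in this sketch.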