如何计算两份代码的相似度?
正好之前的毕设中阅读了相关的文献 这里依照我肤浅的理解也来回答一下
计算两份代码的相似度是有很多实际的应用背景的,比如在代码仓库中(Software Repository)中,大量的代码来自于相互的复制粘贴,如果能够合理地检测到相似的代码,将相似的代码只保存一份,那么可以减少大量的存储冗余,同时也可以更好地保证数据的一致性。
相关的研究有不少,在论文[*]中,提到了五种主要类型的相似性比较,分别是:
- Textual comparison,
- Token comparison,
- Metric comparison,
- Comparison of abstract syntax trees (AST),
- Comparison of program dependency graphs (PDG) ,
- 其他一些类型的比较方法。
对于这五种相似度对比方法及其他的一些方法,文章是这么阐述的:
Textual comparison. The approach of Ducasse et al. compares whole lines to each other textually [4]. To increase performance, lines are partitioned using a hash function for strings. Only lines in the same partition are compared. The result is visualized as a dot plot, where each dot indicates a pair of cloned lines. Clones may be found as certain patterns in those dot plots visually. Consecutive lines can be summarized to larger cloned sequences automatically as uninterrupted diagonals or displaced diagonals in the dot plot. Johnson [13] uses the efficient string matching by Karp and Rabin [14] based on fingerprints. Token comparison. Baker’s technique is also a line- based comparison. Instead of a string comparison, the token sequences of lines are compared efficiently through a suffix tree. First, each token sequence for a whole line is summarized by a so-called functor that abstracts from concrete values of identifiers and literals [1]. The functor characterizes this token sequence uniquely. Assigning functors can be viewed as a perfect hash function. Concrete values of identifiers and literals are captured as parameters o this functor. An encoding of these parameters abstracts from their concrete values but not from their order so that code fragments may be detected that differ only in systematic renaming of parameters. Two lines are clones if they match in their functors and parameter encoding. The functors and their parameters are summarized in a suffix tree, a trie that represents all suffixes of the program in a compact fashion. A suffix tree can be built in time and space linear to the input length [7], [15]. Every branch in the suffix tree represents program suffixes with common beginnings, hence, cloned sequences. Kamiya et al. increase recall for superficially different yet equivalent sequences by normalizing the token sequences [9]. Because syntax is not taken into account, the found clones may overlap different syntactic units, which cannot be replaced through functional abstraction. In either a preprocessing [16], [17] or a postprocessing [18] step, clones that completely fall in syntactic blocks can be found if block delimiters are known. Metric comparison. Merlo et al. gather different metrics for code fragments and compare these metric vectors instead of comparing code directly [2], [3], [12], [19]. An allowable distance (for instance, euclidean distance) for these metric vectors can be used as a hint for similar code. Specific metric-based techniques were also proposed for clones in Web sites [20], [21]. Comparison of abstract syntax trees (AST). Baxter et al. partition subtrees of the abstract syntax tree of a program based on a hash function and then compare subtrees in the same partition through tree matching (allowing for some divergences) [8]. A similar approach was proposed earlier by Yang [22] using dynamic programming to find differ- ences between two versions of the same file. Comparison of program dependency graphs (PDG). Control and data flow dependencies of a function may be represented by a program dependency graph; clones may be identified as isomorphic subgraphs [10], [11]; because this problem is NP-hard, Krinke uses approximative solutions. **Other techniques.**Marcus and Maletic use latent semantic indexing (an information retrieval technique) to identify fragments in which similar names occur [23]. Leitao [24] combines syntactic and semantic techniques through a combination of specialized comparison functions that com- pare various aspects (similar call subgraphs, commutative operators, user-defined equivalences, and transformations into canonical syntactic forms). Each comparison function yields an evidence that is summarized in an evidence-factor model yielding a clone likelihood. Wahler et al. [25] and Li et al. [26] cast the search for similar fragments as a data mining problem. Statement sequences are summarized to item sets. An adapted data mining algorithm searches for frequent item sets.
对于上述段落中具体引用的文章,这里不贴出来了。
这篇文章的Abstract如下:
Abstract—Many techniques for detecting duplicated source code (software clones) have been proposed in the past. However, it is not yet clear how these techniques compare in terms of recall and precision as well as space and time requirements. This paper presents an experiment that evaluates six clone detectors based on eight large C and Java programs (altogether almost 850 KLOC). Their clone candidates were evaluated by one of the authors as an independent third party. The selected techniques cover the whole spectrum of the state-of-the-art in clone detection. The techniques work on text, lexical and syntactic information, software metrics, and program dependency graphs.
文章里面提供了大量的相似性计算公式和实验对比,如果感兴趣,可以深入阅读此文章。
Reference:
[*]. Bellon, S., Koschke, R., Antoniol, G., Krinke, J., & Merlo, E. (2007). Comparison and evaluation of clone detection tools. Software Engineering, IEEE Transactions on, 33(9), 577–591. IEEE.
