如何计算两份代码的相似度？

构建与工程机器学习与数据科学11 分钟

陈然

2013年8月2日修改于 2016年12月21日

知乎/回答

和 AI 讨论这个回答

ChatGPT

Claude

正好之前的毕设中阅读了相关的文献这里依照我肤浅的理解也来回答一下

计算两份代码的相似度是有很多实际的应用背景的，比如在代码仓库中（Software Repository）中，大量的代码来自于相互的复制粘贴，如果能够合理地检测到相似的代码，将相似的代码只保存一份，那么可以减少大量的存储冗余，同时也可以更好地保证数据的一致性。

相关的研究有不少，在论文[*]中，提到了五种主要类型的相似性比较，分别是：

Textual comparison,
Token comparison,
Metric comparison,
Comparison of abstract syntax trees (AST),
Comparison of program dependency graphs (PDG) ,
其他一些类型的比较方法。

对于这五种相似度对比方法及其他的一些方法，文章是这么阐述的：

Textual comparison. The approach of Ducasse et al. compares whole lines to each other textually [4]. To increase performance, lines are partitioned using a hash function for strings. Only lines in the same partition are compared. The result is visualized as a dot plot, where each dot indicates a pair of cloned lines. Clones may be found as certain patterns in those dot plots visually. Consecutive lines can be summarized to larger cloned sequences automatically as uninterrupted diagonals or displaced diagonals in the dot plot. Johnson [13] uses the efficient string matching by Karp and Rabin [14] based on fingerprints. Token comparison. Baker’s technique is also a line- based comparison. Instead of a string comparison, the token sequences of lines are compared efficiently through a suffix tree. First, each token sequence for a whole line is summarized by a so-called functor that abstracts from concrete values of identifiers and literals [1]. The functor characterizes this token sequence uniquely. Assigning functors can be viewed as a perfect hash function. Concrete values of identifiers and literals are captured as parameters o this functor. An encoding of these parameters abstracts from their concrete values but not from their order so that code fragments may be detected that differ only in systematic renaming of parameters. Two lines are clones if they match in their functors and parameter encoding. The functors and their parameters are summarized in a suffix tree, a trie that represents all suffixes of the program in a compact fashion. A suffix tree can be built in time and space linear to the input length [7], [15]. Every branch in the suffix tree represents program suffixes with common beginnings, hence, cloned sequences. Kamiya et al. increase recall for superficially different yet equivalent sequences by normalizing the token sequences [9]. Because syntax is not taken into account, the found clones may overlap different syntactic units, which cannot be replaced through functional abstraction. In either a preprocessing [16], [17] or a postprocessing [18] step, clones that completely fall in syntactic blocks can be found if block delimiters are known. Metric comparison. Merlo et al. gather different metrics for code fragments and compare these metric vectors instead of comparing code directly [2], [3], [12], [19]. An allowable distance (for instance, euclidean distance) for these metric vectors can be used as a hint for similar code. Specific metric-based techniques were also proposed for clones in Web sites [20], [21]. Comparison of abstract syntax trees (AST). Baxter et al. partition subtrees of the abstract syntax tree of a program based on a hash function and then compare subtrees in the same partition through tree matching (allowing for some divergences) [8]. A similar approach was proposed earlier by Yang [22] using dynamic programming to find differ- ences between two versions of the same file. Comparison of program dependency graphs (PDG). Control and data flow dependencies of a function may be represented by a program dependency graph; clones may be identified as isomorphic subgraphs [10], [11]; because this problem is NP-hard, Krinke uses approximative solutions. Other techniques.Marcus and Maletic use latent semantic indexing (an information retrieval technique) to identify fragments in which similar names occur [23]. Leitao [24] combines syntactic and semantic techniques through a combination of specialized comparison functions that com- pare various aspects (similar call subgraphs, commutative operators, user-defined equivalences, and transformations into canonical syntactic forms). Each comparison function yields an evidence that is summarized in an evidence-factor model yielding a clone likelihood. Wahler et al. [25] and Li et al. [26] cast the search for similar fragments as a data mining problem. Statement sequences are summarized to item sets. An adapted data mining algorithm searches for frequent item sets.

对于上述段落中具体引用的文章，这里不贴出来了。

这篇文章的Abstract如下：

Abstract—Many techniques for detecting duplicated source code (software clones) have been proposed in the past. However, it is not yet clear how these techniques compare in terms of recall and precision as well as space and time requirements. This paper presents an experiment that evaluates six clone detectors based on eight large C and Java programs (altogether almost 850 KLOC). Their clone candidates were evaluated by one of the authors as an independent third party. The selected techniques cover the whole spectrum of the state-of-the-art in clone detection. The techniques work on text, lexical and syntactic information, software metrics, and program dependency graphs.

文章里面提供了大量的相似性计算公式和实验对比，如果感兴趣，可以深入阅读此文章。

Reference:

[*]. Bellon, S., Koschke, R., Antoniol, G., Krinke, J., & Merlo, E. (2007). Comparison and evaluation of clone detection tools. Software Engineering, IEEE Transactions on, 33(9), 577–591. IEEE.

继续阅读

全部内容

知乎/回答2025年3月12日6 分钟

当前AI大模型哪家最好用？各家主要擅长哪些领域？普通人使用，推荐哪个？

作为一个日常工作中有超过 50% 时间都在与各种大模型对话的人，想跟大家分享一下我在使用各个主流模型时的一些体验和感受。 Google Gemini 系列优点： 1. 极其便宜：无论是免费额度还是付费使用，价格都非常有优势。 2. Context Window 超级大：1-2M 的上下文长度，基本能覆盖绝大部分任务场...

知乎/回答2024年12月9日2 分钟

编程的难点在于哪里？是逻辑、数学、算法，还是模块、框架、接口的掌握？

编程最大的难点其实都不在这些技术层面上，而是在于如何把一个复杂的现实问题转化为计算机可以理解和执行的形式。我之前中提到过一个观点："部署、运维等工具的逐渐标准化使得重复造轮子的需求越来越弱，越来越多的工程师核心任务并不是研发，而是整合（Integration）。"这背后的深意是：现在的编程工作，难点已经从"如何实现"...

知乎/回答2015年12月20日1 分钟

Docker 的应用场景在哪里？

Docker for Data Science at Trulia 和一点感悟 - Hello陈然! - 知乎专栏我们在Trulia Data Science Team 里面推行了Docker。一方面是搭建API做SOA，另一方面也在帮助每一个Data Scientist 都可以直接从最开始算法分析、开发一直做到部署...