SHERLOCK is an AI-written code detector. Given a piece of input code string $\mathbf s$, labelled with either human
or AI
, for example:
Human written:
#include <algorithm> #include <iostream> #include <vector> using namespace std; int main() { vector<int> line; int pro; for (int i = 0; i < 10; ++i) { cin >> pro; line.push_back(pro); } sort(line.begin(), line.end()); for (int i = 9; i > 6; -i) { cout << line[i] << endl; } return 0; }
GPT written:
// Problem Name: Cheater's Nim // Input: int N; int a_1; int a_2; ... int a_N; // Output: int cheater_stones; // Logic: int main() { cin >> N; cin >> a_1; cin >> a_2; ... cin >> a_N; // Initialize variables int cheater_stones = 0; int remaining_stones = 0; // Loop through all piles for (int i = 1; i <= N; i++) { // If there are no stones left in the pile, skip it if (a_i == 0) continue; // Compute the maximum number of stones the cheater can take from this pile int max_stones = min(a_i, remaining_stones); // Update the number of remaining stones remaining_stones -= max_stones; // Update the number of stones the cheater has eaten cheater_stones += max_stones; // Print the number of stones the cheater has eaten so far cout << cheater_stones << endl; // If there are no more stones left, break if (remaining_stones == 0) break; } // If the cheater has eaten all the stones, print -1 if (cheater_stones == 0) cout << -1 << endl; return 0; }
SHERLOCK maps a given string $\mathbf s$ into a predicted label $\mathcal S (\mathbf s) \to {0, 1}$. Naturally, we would want a detector as astute as Sherlock Holmes so that he can identify, for a given code input, the exactly true state of its original author, pinned down lines to lines. For example, sometimes a piece of code can be co-authored by both GPTs and assembled by the programmer himself. But in this research, we narrow down to study a binary differentiation task to detect AI-written code, out of educational content. That is to say, we focus on beginner-level, task-specific coding problems and the answer, like the preceeding two examples.
The conjecture backbone of the technique is proposed in 2023 by Mitchell et al. [1], that AI-generated text exhibits local optimality in the likelihood curvature of its source model, meaning it tends to have lower probability or higher loss after perturbations. Human-written text, on the other hand, does not exhibit this property - perturbation and original text’s probability log-likelihood are more stable.
Btw, perturbation means, randomly masking some words (tokens) and then use another GPT model to fill in the blanks. Like this:
We re-designed a tokenize-and-perturb method tailored for codes, then replicated Mitchell’s paper’s result and found that for codes, similar rules apply
Now then, everything is ready and we can implement SHERLOCK. For a piece of code $\mathbf s_0$, it is perturbed $N$ times to obtain ${\hat {\mathbf {s_i}}}_{i = 1}^N$, and fed into a “grading” language model to obtain its loss function probability log-likelihood $p(\hat {\mathbf s_i})$ - i.e. the probability of such sentence in view of the grading language model. After this manipulation, we obtain $$ \mathbf p(\mathbf s) = \begin{bmatrix} p(\mathbf s_0)\cr p(\mathbf {\hat s_1})\cr …\cr p(\mathbf {\hat s_N}) \end{bmatrix} $$ This probability vector is then used for subsequent ML binary classifier training.
We built a small dataset ourselves - cleaning up a Kaggle dataset of codeforces competition to obtain the problems and the solutions submitted. The solutions are in c++ and labelled as human-written since the date of the dataset is prior-GPT (2019). The problems are then used to prompt 70-b llama to obtain AI-labeled codes. Eventually, our dataset consists of $M = 216$ samples for both labels respectively. And here’s the result:
Finally, the code is here, and, the main paper that we referred to:
[1] Eric Mitchell et al. “DetectGPT: Zero-Shot Machine-Generated Text Detection Using Probability Curvature”. In: Proceedings of the 40th International Conference on Machine Learning. ICML’23. 2023.