Multi-χ: Identifying Multiple Authors from Source Code Files

Mohammed Abuhamad, Tamer Abuhmed, DaeHun Nyang, David Mohaisen
2020 Proceedings on Privacy Enhancing Technologies  
AbstractMost authorship identification schemes assume that code samples are written by a single author. However, real software projects are typically the result of a team effort, making it essential to consider a finegrained multi-author identification in a single code sample, which we address with Multi-χ. Multi-χ leverages a deep learning-based approach for multi-author identification in source code, is lightweight, uses a compact representation for efficiency, and does not require any code
more » ... rsing, syntax tree extraction, nor feature selection. In Multi-χ, code samples are divided into small segments, which are then represented as a sequence of n-dimensional term representations. The sequence is fed into an RNN-based verification model to assist a segment integration process which integrates positively verified segments, i.e., integrates segments that have a high probability of being written by one author. Finally, the resulting segments from the integration process are represented using word2vec or TF-IDF and fed into the identification model. We evaluate Multi-χ with several Github projects (Caffe, Facebook's Folly, Tensor-Flow, etc.) and show remarkable accuracy. For example, Multi-χ achieves an authorship example-based accuracy (A-EBA) of 86.41% and per-segment authorship identification of 93.18% for identifying 562 programmers. We examine the performance against multiple dimensions and design choices, and demonstrate its effectiveness.
doi:10.2478/popets-2020-0044 fatcat:baodtyoke5f4dcqsoaqubaydqq