Feasibility of deception in code attribution

University of New Brunswick
Code authorship attribution is the process of identifying the probable author of a given piece of code based on unique characteristics that reflect the author's programming style. Inspired by social studies on the attribution of literary works, researchers have spent the past two decades examining the effectiveness of code attribution in the computer software domain, including computer security. Authorship attribution techniques have found broad application in code plagiarism detection, biometric research, forensics, and malware analysis. Studies show that analysis of software might effectively unveil the digital identity of a programmer, reflected through variables and structures, the programming language, the development tools employed, their settings and, more importantly, how these tools are used and for what purpose. Authorship attribution has been a productive area of research under the assumption that the author of an unknown program has been honest in their writing style and has not tried to modify it. In this thesis, we investigate the feasibility of deceiving source code attribution techniques. We begin by exploring how data characteristics and feature selection influence both the accuracy and performance of attribution methods. Within this context, it is necessary to understand whether the results obtained by previous studies depend on the data source, quality, and context or on the type of features used. This gives us the opportunity to examine the process of code authorship attribution more closely and to understand its potential weaknesses. To evaluate current code attribution systems, we present an adversarial model defined by the adversary's goals, knowledge, and capabilities, and we categorize the possible variations within each of these groups. Modeling the role of attackers figures prominently in strengthening cybersecurity defenses.
We believe that a solid understanding of the possible attacks can help in the research and deployment of reliable code authorship attribution systems. We present an author imitation attack that deceives current authorship attribution systems by imitating the coding style of a targeted developer. We investigate the attack's feasibility on open-source software repositories. To subvert an author imitation attack and to help protect developers' privacy, we introduce an author obfuscation method and novel coding style transformations. The idea of author obfuscation is to allow authors to preserve the readability of their source code while removing identifying stylistic features that can be leveraged for code attribution. Code obfuscation, common in software development, typically aims to disguise the appearance of the code, making it difficult to understand and reverse engineer. In contrast, the proposed author obfuscation hides the original author's style while leaving the source code visible, readable, and understandable. In summary, this thesis presents original research that not only advances knowledge in the field of code authorship attribution but also contributes to the overall safety of our digital world by providing author obfuscation methods that protect developers' privacy.