Computational criminology: at-scale quantitative analysis of the evolution of cybercrime forums
Abstract
Cybercrime forums and marketplaces are used by members to share hacking techniques, hold general community-building discussions, and trade hacking tools. While there is a large corpus of literature studying these platforms, ranging from cross-forum ecosystem comparisons to smaller qualitative analyses of specific crime types within a single forum, there has been little research studying them over time. Using the CrimeBB dataset from the Cambridge Cybercrime Centre, the first contribution of the thesis explores the evolution of a large cybercrime forum, from growth to gradual decline from peak activity, with research questions grounded in the digital drift framework from criminological theory. The analysis finds a trend towards financially driven cybercrime over time, both among individual users and across the forum as a whole. The second contribution presents a method for detecting trending terms, using a lightweight natural language processing approach to handle queries efficiently given the size of the dataset. Evaluation against manual annotations showed that it detected more relevant salient terms than TF-IDF. Finally, the third contribution applies signalling theory to analyse the use of argot (jargon and slang) on the forum, finding a negative correlation between argot and reputation use, and using clustering to show a decrease in argot over time. This contribution includes a lightweight argot detection pipeline using word embeddings, aligned with manual annotations. Overall, the combination of approaches, including criminological theory driving research directions, natural language processing to analyse forum text data, machine learning for classification, and data science techniques, provides a unique interdisciplinary perspective within the field of cybercrime community research, both drawing insights into these communities and contributing novel tools for measuring large, noisy text data.