I’ve been invited to be the keynote speaker at the First Mexican Workshop on Plagiarism Detection and Authorship Analysis. The program for the event is here:
First Mexican Workshop on Plagiarism Detection and Authorship Analysis
Details of my talk are below:
Demystifying the power of character n-grams in attribution tasks
A common feature in traditional machine learning approaches to authorship attribution is the bag of character n-grams. These tiny units of text have been found time and again to be powerful discriminants of writing style. But only recently have researchers tried to disentangle the reasons behind this fact. During the first part of my talk I will present our own attempts at demystifying the power behind character n-grams for this task. I will also present a new formulation for the use of these features that leads to lower-dimensional representations and higher prediction accuracies.
The second part of my talk will focus on the use of character n-grams as pivot features to perform domain adaptation. The goal here is to solve the domain mismatch in attribution problems where the training data used to build the models comes from a different domain than the test data. I will present our proposed modifications to the Structural Correspondence Learning algorithm, a well-known domain adaptation approach, that allow us to distinguish authors more accurately in a cross-domain setting with limited training data.
Both contributions highlighted in this talk are empirically evaluated in attribution tasks. However, they are applicable to any task where style is an important aspect, including genre classification and author profiling.
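
For readers unfamiliar with the feature mentioned in the abstract, here is a minimal sketch of a plain bag of character n-grams in Python. The function names and the choice of n = 3 are illustrative assumptions, and this is the standard baseline representation, not the new formulation presented in the talk.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def bag_of_char_ngrams(text, n=3):
    """Count how often each character n-gram occurs, ignoring position."""
    return Counter(char_ngrams(text, n))


# Spaces are kept, so n-grams can span word boundaries (e.g. "e t").
bag = bag_of_char_ngrams("the theme", n=3)
for ngram, count in sorted(bag.items()):
    print(repr(ngram), count)
```

In an attribution setting, each document would typically be mapped to such a bag, and the counts would serve as the feature vector for a classifier.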