Author Profiling using stylometry
Lead Research Organisation: University of Birmingham
Abstract
This project aims to advance the science of profiling through language so that it can be carried out quickly and effectively as soon as a written sample is retrieved. We will focus on the task of profiling an author: the goal is to extract stylistic signals (patterns specific to a linguistic community) from the writing of that community's members, and to test whether these signals remain detectable when those members write in a context outside that community. The underlying theory is that we all take part in several linguistic communities; our assumption is that each one leaves a mark on our writing and speaking style, and that some of these marks may be detectable using stylometry, the quantitative study of writing style.
To achieve this, we will first review the stylometric profiling tasks that have already been carried out, and how successful their methodologies were, so that we can provide a nuanced summary of the tools profilers already have at their disposal and how to use them. This review will also help us understand the needs of the profiling community and draw up a list of priorities that will translate into the experiments we carry out.
Each profiling task we undertake will most likely require a new corpus with its own curation needs, as we must minimize confounding variables. If properly maintained and updated, the corpora we create can also allow other profilers to carry out their work with a corpus that is known (through cross-validation and our experimentation) to work for a particular profiling task.
To mitigate the risks of each profiling task, we will gather the corpora incrementally, with regular checks for success and accuracy that will allow us to report consistently on the project's progress and decide which tasks are feasible.
Organisations
People | ORCID iD |
---|---|
Jack Grieve (Primary Supervisor) | |
Alejandro Jawerbaum (Student) | |
Studentship Projects
Project Reference | Relationship | Related To | Start | End | Student Name |
---|---|---|---|---|---|
ES/P000711/1 | | | 01/10/2017 | 30/09/2027 | |
2881667 | Studentship | ES/P000711/1 | 01/10/2023 | 30/09/2027 | Alejandro Jawerbaum |