The analysis of names from the 2011 Census of Population

Lead Research Organisation: University College London
Department Name: Geography

Abstract

Previous research conducted at UCL has demonstrated that a name very often provides an open and accessible statement of the cultural, ethnic and linguistic characteristics of its bearer (e.g. Mateos et al 2011). Additional light may be shed upon these characteristics by parental choice of forename (given name), while changing fashions often render forenames a valid indicator of age and other geographic and social characteristics. This information has been used to develop working classifications of names, which have been successfully used to augment incomplete data records for audit purposes - for example in gauging the success of NHS preventive care initiatives across different ethnic groups. However, these classifications have been developed using incomplete address registers (such as the public version of the Electoral Roll) and telephone directories.

There are a number of shortcomings to the data sources hitherto used in this kind of research that limit the usefulness of the resulting classifications when applied to new datasets:
1. The data sources underlying the classifications provide incomplete and probably biased representations of the population-at-large. For example, public electoral registers do not include (young and immigrant) non-voters or (privacy sensitive) 'opt out' individuals, and public telephone directories provide less than universal coverage and few given names.
2. Commercial classifications of the age profiles of given names are typically restricted to the 16+ age cohorts, and supplementation with ONS birth name data (e.g. www.ons.gov.uk/ons/rel/vsob1/baby-names--england-and-wales/2012/stb-baby-names-2012.html) is error prone because young children may move abroad and immigrants may bring young children with them. Thus these sources do not allow an inclusive snapshot of the population resident in the UK at any specific moment in time.
3. Whilst 'crowd sourced' validation is possible (e.g. www.onomap.org), there is no comprehensive means of comparing predicted and objective (e.g. age) or self-assigned (e.g. ethnicity) characteristics.
4. Little attention has been given to refining the classification of 'hard to reach' groups, such as Caribbeans, whose ethnicity can likely only be ascertained through subtle associations between forename-surname pairings.
5. The clustering procedure has been largely aspatial, in significant part because of unevenness of geographical coverage and the absence of highly granular location information.

This research will address these shortcomings through use of the best available secondary dataset for developing an enriched classification and conducting sensitivity analysis to refine and improve its universal application across the UK. Individual level Census data will be used in order to develop a classification of given and family names into cultural, ethnic and linguistic groups, by extending the methodology of Mateos et al (2011). Crucially, and for the first time, the results of this classification will be compared to individual and household data on Ethnicity, National Identity, Country of Birth, First and Second Language Spoken and Nationality. This will make it possible to investigate the causes of apparent errors in the classification, and to identify the small geographic areas in which they are concentrated. Through an iterative procedure, secure online facilities will be used to improve the classification in the light of these results.
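
The comparison described above, between name-based predictions and self-assigned Census categories, can be illustrated with a minimal sketch. It is not the project's method: the record layout, area codes and group labels are hypothetical placeholders, and the error measure (a simple disagreement rate per small area) is an assumption chosen for brevity.

```python
from collections import Counter, defaultdict

# Hypothetical records: (small_area_code, predicted_group, self_assigned_group).
# All codes and labels are placeholders, not real Census categories.
records = [
    ("E001", "A", "A"),
    ("E001", "A", "B"),
    ("E001", "B", "B"),
    ("E002", "A", "A"),
    ("E002", "B", "B"),
]


def confusion_counts(rows):
    """Count each (predicted, self_assigned) pair across all records."""
    return Counter((pred, actual) for _, pred, actual in rows)


def error_rate_by_area(rows):
    """Share of records per area where prediction and self-assignment differ."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for area, pred, actual in rows:
        totals[area] += 1
        if pred != actual:
            errors[area] += 1
    return {area: errors[area] / totals[area] for area in totals}


confusion = confusion_counts(records)
rates = error_rate_by_area(records)
```

Off-diagonal cells of `confusion` show which groups are confused with which, while `rates` points to the small areas where apparent errors are concentrated.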

Comparison of household and individual classification results with self assignments in terms of Census measures of identity will make it possible to make the classification tool sensitive to indicators of cultural assimilation, whether through inter-marriage or duration of residence, at scales from the local to the national.

The 'surname regions' will also be used to add regional context to the 2011 ONS Output Area Classification.

Planned Impact

The UK is becoming an increasingly diverse society in ethnic terms, and the evidence of the 2011 Census is that ethnic minorities are becoming increasingly dispersed throughout the UK. For the time being, the 2011 Census of Population provides a very valuable snapshot of the size and detailed geography of ethnic minority populations, but it is one that will not be refreshed at this level of granularity for nearly 10 years, and only then if there is a 2021 Census. If the next 10 years see similarly rapid population changes to those of 2001-11 (e.g. of Romanian immigrants), ONS will find the classifier of considerable use in coding up administrative data sources, such as NHS records, in order to provide detailed analysis of the local and regional effects of demographic change. This is a primary motivation for the ONS contribution to this research.

A great strength of the classification will be that it can be run 'on site' at any location where a dataset containing personal names is stored. The software will be written so as to provide aggregate as well as individual estimates of ethnicity, and by suppressing the latter option it is perfectly possible to allow only aggregate results that do not create disclosure risks. As such, it will prove useful in a range of settings. For example, crime statistics and police interventions may be coded by ethnicity; higher education institutions will be able to monitor the effectiveness of their constituent divisions in implementing widening participation initiatives; and NHS trusts will be able to profile users of services such as breast screening, A&E or GP referrals. Indeed, any organisation that has legitimate access to individual names records will be able to undertake audits of the extent to which their activities draw upon local labour markets or serve local (or broader) communities.
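
One way to picture an aggregate-only mode is the sketch below. It is an illustration under stated assumptions rather than the software itself: the lookup table `NAME_PROBS`, its placeholder names and probabilities, and the `min_names` threshold are all hypothetical.

```python
# Hypothetical lookup: name -> probability of belonging to each group.
# Names, groups and values are invented for illustration only.
NAME_PROBS = {
    "alpha": {"group_a": 0.9, "group_b": 0.1},
    "beta": {"group_a": 0.1, "group_b": 0.9},
}
UNKNOWN = {"unclassified": 1.0}


def aggregate_estimate(names, min_names=3):
    """Return estimated group proportions for a list of names.

    Only the aggregate is returned; per-name predictions never leave the
    function. Lists shorter than min_names are refused (an assumed rule).
    """
    if len(names) < min_names:
        raise ValueError("too few names for an aggregate estimate")
    totals = {}
    for name in names:
        probs = NAME_PROBS.get(name.strip().lower(), UNKNOWN)
        for group, p in probs.items():
            totals[group] = totals.get(group, 0.0) + p
    return {group: total / len(names) for group, total in totals.items()}


result = aggregate_estimate(["Alpha", "Beta", "Alpha", "Gamma"])
```

Summing probabilities rather than assigning each name to a single group keeps the estimate at the level of the whole list, which is the property the paragraph above relies on.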

Although calibrated on UK data, the names classification is likely to be of wider geographical use, for example by researchers and government bodies with access to administrative datasets elsewhere in the EU.

The geodemographics industry will benefit from the research. Value added reseller CACI already uses the UCL names classification in its ACORN system, and so the classification produced in this research is likely to improve the discriminatory power of future versions of the classification. The software tool will be of value, therefore, in updating geodemographic classifications over time. In a similar way, the retail industry will benefit by becoming better able to differentiate between its customers and to better tailor their offerings to them.

In the much longer term, the research will have positive impacts upon amateur genealogy. 2011 Census data will only be released in 2111, and will be error prone with regard to names because of typographic errors for online Census submissions and OCR errors on forms completed by hand. The results of the data cleaning that will be a necessary part of this research will be retained alongside Census records by the ONS, to the tangible benefit of future generations.

Publications

 
Description The research has produced the promised 'Ethnicity Estimator' software as a deliverable, although following ethical review at the Office for National Statistics the software release is only available for making predictions on aggregate data, and not individual level data. The software is available, free of charge, to approved users of the ESRC Consumer Data Research Centre following successful application at ee.cdrc.ac.uk. This research has also led to a very popular public website that enables users to identify the geographic origins of their names, as well as predictions of where pairs of individuals likely first met.
Exploitation Route A free-to-use tool is available to approved researchers through the CDRC websites (ee.cdrc.ac.uk). This enables researchers to classify lists of names according to Census of Population ethnicity categories. The tool is of interest to migration researchers and policy analysts concerned with migration and regional development, and to date (March 2019) eight licences have been granted to researchers working in overseas universities. The main website associated with the research (http://named.publicprofiler.org) was used by well over a million unique users during the period of research funding.
Sectors Agriculture, Food and Drink,Communities and Social Services/Policy,Creative Economy,Digital/Communication/Information Technologies (including Software),Education,Financial Services, and Management Consultancy,Healthcare,Leisure Activities, including Sports, Recreation and Tourism,Government, Democracy and Justice,Culture, Heritage, Museums and Collections,Retail,Security and Diplomacy,Other

URL http://ee.cdrc.ac.uk
 
Description This research has developed a classification of given and family names using the Office for National Statistics (ONS) VML link to 2011 Census of Population data for England and Wales. A software tool has been developed to enable users to code up lists of personal names with predictions of ethnicity. This tool has been made available to approved researchers through the ee.cdrc.ac.uk web portal, and as of March 2019 eight licences have been granted to researchers working in overseas universities. The software only became available in 2018, much later than expected, because of delays arising from ethical review at the Office for National Statistics. The value and impact of this software are very wide-ranging. Future users might be able to answer questions such as:
- how does participation in breast screening programmes vary between ethnic minorities?
- how does use of accident and emergency facilities vary between different groups of recent migrants, who bring different experiences and user behaviours from their home countries?
- which refugees came from French speaking Africa?
- which mail order book customers would like catalogues of foreign language books?
- how do the stage routines of different US comedians reflect their ethnic and cultural roots?
- what types and brands of cosmetics might different customers use?
These and other questions have been addressed using software that we have developed in related research and based on consumer databases, yet this is in some important respects unsatisfactory in that such data sources are subject to various sources of bias. This grant application was predicated on the belief that the best classifications should use the best data, and that use of the 2011 Census would allow a range of statistical and self-assignment issues to be addressed.
The project has been a very successful collaboration between UCL (and latterly the Consumer Data Research Centre, CDRC) and ONS, although extracting results in a timely fashion from the VML has entailed a steep learning curve. The research findings have also been used by ONS and CDRC to evaluate the use of new technologies and procedures to fashion the best public sector data into software products that are efficient, effective and safe to use. We are exploring use cases including healthcare, marketing, migration analysis and social capital formation, and elsewhere we have demonstrated the value of these applications using names classification tools based upon consumer registers. Concerns from the ONS ethical review committee led to rethinking how the tool may be disseminated. The ethical review committee was of the view that individual level predictions should not be made by licensed users. ONS therefore agreed to amend the software so that only data for aggregations of individuals are estimated. It was necessary to undertake penetration testing, in order to minimise the risk that the tool might be obtained by unauthorised users.
First Year Of Impact 2015
Sector Communities and Social Services/Policy,Creative Economy,Digital/Communication/Information Technologies (including Software),Education,Healthcare,Leisure Activities, including Sports, Recreation and Tourism,Government, Democracy and Justice,Retail,Transport
Impact Types Cultural,Societal,Economic,Policy & public services

 
Description Gave written evidence for the Public Administration and Constitutional Affairs Committee's Governance of Statistics inquiry
Geographic Reach National 
Policy Influence Type Participation in an advisory committee
Impact I was asked to give written evidence to the Public Administration and Constitutional Affairs Committee's Governance of Statistics inquiry. I answered the following three questions based upon the CDRC's relationship with the Office for National Statistics: 1) To what extent could new information improve public service delivery, contribute to the richness of existing official statistics or produce new statistics? 2) If you have had any dealings with UKSA / ONS or government departments, how would you characterise their interest in new data sources, either as producers of data or users? 3) Do you have any comments on the governance of UKSA / ONS and its culture?
URL http://data.parliament.uk/writtenevidence/committeeevidence.svc/evidencedocument/public-administrati...
 
Description Meeting with Japanese Government CIO and National Strategy Office of ICT, to demonstrate our data visualisations
Geographic Reach National 
Policy Influence Type Gave evidence to a government review
 
Description CACI Ltd 
Organisation CACI International Inc
Department CACI Ltd
Country United Kingdom 
Sector Private 
PI Contribution Data acquisition for CDRC service. Also involved in the Masters Research Dissertation Programme (formerly known as the Retail Research Masters Research initiative), which gives masters level students the opportunity to conduct their dissertation with the co-supervision of a consumer organisation. The Consumer Data Research Centre facilitates the initiative. The student is offered a stimulating project to work on with co-supervision from a major UK business. They receive industry advice on how data are used in real world problem solving, and experience of addressing problems that matter. A prize is awarded to the best three dissertations (funded by the CDRC). Research findings are showcased to an audience of leading retailers at the annual conference of the Data Analysts User Group (formerly known as Demographics User Group), and the work is published and made freely available on the CDRC website.
Collaborator Contribution To provide the Consumer Register datasets for 2014-2019 for the purposes of social science research. Member of CDRC Advisory Board. Also partner in the Masters Research Dissertation Programme, where representatives from major UK consumer companies, or businesses which handle consumer data, co-supervise the Masters dissertations. This presents a great opportunity for the organisation to have a Masters student help it make progress with research into major current issues, and with projects that it has considered but lacked the resources to carry out. The student works with data (the organisation's customer data, open data or academic sector data) and can maximise the value obtained from this. The partner provides the student with a £500 bursary on completion of a successful dissertation.
Impact Member of CDRC Advisory Board. CDRC Data Licence Agreement for Consumer Registers data available to approved CDRC Users through Secure tier of CDRC service. Agreement not extended beyond Feb 2019. Data will no longer be available through CDRC Service. Data used in CDRC Linked Consumer Registers https://data.cdrc.ac.uk/product/linked-consumer-registers controlled product, Population Churn https://data.cdrc.ac.uk/product/cdrc-population-churn-index and Modelled Ethnicity Proportions https://data.cdrc.ac.uk/dataset/modelled-ethnicity-proportions safeguarded products. Data used by CDRC PhD student Guy Lansley for thesis 'Big Data: Geodemographics and Representation'. CDRC case study: Lansley, G., Li, W., Longley, P.A. 'Representing Population Dynamics from Administrative and Consumer Registers' https://www.cdrc.ac.uk/wp-content/uploads/2017/03/Areas-and-Activities.pdf. CDRC Masters Research Dissertation Programme: Yiqao Huang (2015) Shopping centre's turnover estimation using microsimulation: an exploratory research in Inverness; Eirini Milaiou (2016) Topic extraction and document classification on textual survey data with unsupervised modelling techniques.
Start Year 2015
 
Description Data Licence Agreement: Datatalk Ltd 
Organisation Datatalk Statistical Solutions Ltd
Country United Kingdom 
Sector Private 
PI Contribution Data acquisition for CDRC service.
Collaborator Contribution Provision of historic electoral roll (1998-2002) data to CDRC service for the purposes of social science research.
Impact Historic electoral roll data products available through the CDRC Safeguarded and Secure service.
Start Year 2017
 
Title named Website 
Description Creates map of surname relative distribution KDEs for the UK, with KDE-multiplied combined version for pairs of surnames (e.g. couples). 
Type Of Technology Webtool/Application 
Year Produced 2016 
Impact Wide coverage on media online portals (Daily Mail, Telegraph, Independent etc), resulting in over 1 million visitors, 3 million searches and nearly 100,000 feedback votes.
URL http://named.publicprofiler.org/
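
The 'KDE-multiplied' combination described in this entry can be sketched roughly as follows. This is an assumption-laden illustration, not the website's code: the grid, the point locations, the Gaussian kernel and the bandwidth are all hypothetical, and the sketch works with plain density surfaces rather than densities relative to the general population.

```python
import math


def kde_surface(points, grid, bandwidth):
    """Gaussian kernel density at each grid cell, normalised to sum to 1."""
    densities = []
    for gx, gy in grid:
        total = 0.0
        for px, py in points:
            dist_sq = (gx - px) ** 2 + (gy - py) ** 2
            total += math.exp(-dist_sq / (2 * bandwidth ** 2))
        densities.append(total)
    norm = sum(densities)
    return [d / norm for d in densities]


def combine(surface_a, surface_b):
    """Multiply two surfaces cell by cell and renormalise the product."""
    product = [a * b for a, b in zip(surface_a, surface_b)]
    norm = sum(product)
    return [p / norm for p in product]


# Hypothetical 5 x 5 grid and bearer locations for two surnames.
grid = [(x, y) for x in range(5) for y in range(5)]
surname_a_points = [(1, 1), (1, 2), (2, 1)]
surname_b_points = [(3, 3), (2, 2), (3, 2)]

surface_a = kde_surface(surname_a_points, grid, bandwidth=1.0)
surface_b = kde_surface(surname_b_points, grid, bandwidth=1.0)
combined = combine(surface_a, surface_b)

# Grid cell where the combined surface is highest.
peak = grid[combined.index(max(combined))]
```

The product is high only where both surnames are relatively dense, which is why a multiplied surface can highlight areas shared by a pair of surnames.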
 
Title named.publicprofiler.org 
Description Highly innovative publicly available website that allows individuals to identify the origins of their surnames along with those of their partners/friends. The tool allows users to view predictions of where they and their partners/friends first met. The tool was widely featured in the regional, national and international press and has attracted hundreds of thousands of unique users. 
Type Of Technology Webtool/Application 
Year Produced 2015 
Impact The website has been featured extensively in the regional, national, international and online press and has attracted hundreds of thousands of unique users. It remains a popular tool whenever blogs or web forums pick it up in activities and events.
URL http://named.publicprofiler.org
 
Description Creating a New Open Geodemographic Classification of the UK Using households and 2011 Census Data 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Joint presentation with Chris Gale and Alex Singleton (University of Liverpool) at 46th Annual Conference of Regional Science Association International - British and Irish Section, Harrogate, Yorkshire, August 22nd to 24th 2017. To discuss issues arising from the publication of 'The Routledge Handbook of Census Resources, Methods and Applications', ed. by John Stillwell. The debate will inform the development, format, and outputs of the 2021 UK Census of Population.
Year(s) Of Engagement Activity 2017
URL http://www.rsai-bis.org/
 
Description Invited opening keynote speaker, European Commission DG REGIO Workshop: 'How can Regional Policy Benefit from Big Data?'
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Here I spoke about the merits of Big Data to provide new insights for regional policy; I also referenced some of the key work of the Consumer Data Research Centre and some of its outputs since the grant started.
Year(s) Of Engagement Activity 2016
 
Description Invited opening keynote speaker: Smart Geospatial Expo, Seoul 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact I delivered the keynote, where I discussed the work of the Consumer Data Research Centre and fielded questions from attendees, leading to a lengthy discussion and debate. The CDRC reported heightened enquiries after my presentation.
Year(s) Of Engagement Activity 2016
 
Description Invited panellist and presenter at the Association of American Geographers Annual Conference (AAG 2016) 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact I was invited to the Association of American Geographers Annual Conference in San Francisco to debate and discuss various issues pertaining to Geography and the Consumer Data Research Centre, sparking questions and discussions about the CDRC. We witnessed an increase in enquiries post my panel session.
Year(s) Of Engagement Activity 2016
 
Description Media: 'What's in a surname? A new view of the United States based on the distribution of common last names.' National Geographic, Feb 2011, p20-21
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Publication in a magazine with a very large international circulation.

An increase in interest in the related website worldnames.publicprofiler.org, which attracts nearly 1 million unique users per annum.
Year(s) Of Engagement Activity 2011
URL http://ngm.nationalgeographic.com/2011/02/geography/usa-surnames-interactive
 
Description Media: Citylabs.com 'Mapping the Hotspots of Britain's surnames' 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Media (as a channel to the public)
Results and Impact Citylabs.com are an American-based media house looking to expand into the UK. This is the second time they have referenced one of my projects as part of their aim to expand. As a result of their feature the site was accessed by large numbers of users.
Year(s) Of Engagement Activity 2016
 
Description Media: Daily Mail 'Where is YOUR surname from?' 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Media (as a channel to the public)
Results and Impact A popular national newspaper featured the Named site in its online edition. The feature article was located on the homepage, very near to the main headline story. As a result we witnessed a huge increase of traffic to the site and also received a high volume of requests about names/surnames from members of the general public.
Year(s) Of Engagement Activity 2016
 
Description Media: Mirror Online 'Where does your surname originate from' 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Media (as a channel to the public)
Results and Impact One of the UK's most popular national newspapers featured the Named site online. As a result the site reported an increased number of hits.
Year(s) Of Engagement Activity 2016
 
Description Media: Stuff.co.nz 'Where does your surname come from' 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Media (as a channel to the public)
Results and Impact New Zealand's popular online news portal featured the Named site; as a result I received significant follow up emails voicing interest in the project from academics, the media and the general public.
Year(s) Of Engagement Activity 2016
 
Description Media: Sydney Morning Herald 'Where does your name come from? Website offers clues to origin of name' 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Media (as a channel to the public)
Results and Impact One of Australia's prominent media houses ran a feature article on the Named site, with references to another of my names projects. As a result I received a large volume of emails from international members of the public, interested in learning about their own surname.
Year(s) Of Engagement Activity 2016
 
Description Media: Telegraph Online 'Where does your surname come from - this simple search can tell you' 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Media (as a channel to the public)
Results and Impact Telegraph Online featured the Named site; as a result we reported a huge surge in hits to the website as well as requests for further media appearances.
Year(s) Of Engagement Activity 2016
 
Description Media: Twickenham warms to Billy Twelvetrees - a name we'll remember. Daily Telegraph, 5th February 2013, p 21 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Media (as a channel to the public)
Results and Impact Stimulation of further public interest in the results of the research

Increased interest in the associated website worldnames.publicprofiler.org, that continues to attract c. 1,500 unique users every day and nearly 1 million unique users per annum (there are very high peaks in usage following media features on the work).
Year(s) Of Engagement Activity 2013
URL http://www.telegraph.co.uk/sport/rugbyunion/international/england/9847653/Twickenham-warms-to-Billy-...
 
Description Media: Wales Online - New map says our Sportsstars' surnames are more common in England than in Wales 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Media (as a channel to the public)
Results and Impact Online coverage of the Named site - Wales Online is an online news resource targeted at the Wales population but has a far wider readership. This was one of the first online media references to the Named site and we subsequently received a large volume of hits to the site.
Year(s) Of Engagement Activity 2016
 
Description Media: Kent & Sussex Courier 'Map reveals where your surname is most popular'
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Media (as a channel to the public)
Results and Impact Local/regional coverage of the Named site by Kent's popular local online paper. As a result there was increased traffic to the website.
Year(s) Of Engagement Activity 2016
 
Description Media: Metro Online 'This map will show you where your surname is most popular in the UK'
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Media (as a channel to the public)
Results and Impact Online coverage of the Named site - Metro Online is an online equivalent of the free newspaper distributed across London. This also has an excellent online readership across London. As a result of this feature we received significant media/research interest in our site.
Year(s) Of Engagement Activity 2016