The analysis of names from the 2011 Census of Population

Lead Research Organisation: University College London
Department Name: Geography

Abstract

Previous research conducted at UCL has demonstrated that a name very often provides an open and accessible statement of the cultural, ethnic and linguistic characteristics of its bearer (e.g. Mateos et al 2011). Additional light may be shed upon these characteristics by parental choice of forename (given name), while changing fashions often render forenames a valid indicator of age and of other geographic and social characteristics. This information has been used to develop working classifications of names, which have been used successfully to augment incomplete data records for audit purposes - for example in gauging the success of NHS preventive care initiatives across different ethnic groups. However, these classifications have been developed using incomplete address registers (such as the public version of the Electoral Roll) and telephone directories.

There are a number of shortcomings to the data sources hitherto used in this kind of research that limit the usefulness of the resulting classifications when applied to new datasets:
1. The data sources underlying the classifications provide incomplete and probably biased representations of the population-at-large. For example, public electoral registers do not include (young and immigrant) non-voters or (privacy sensitive) 'opt out' individuals, and public telephone directories provide less than universal coverage and few given names.
2. Commercial classifications of the age profiles of given names are typically restricted to the 16+ age cohorts, and supplementation with ONS birth name data (e.g. www.ons.gov.uk/ons/rel/vsob1/baby-names--england-and-wales/2012/stb-baby-names-2012.html) is error prone because young children may move abroad and immigrants may bring young children with them. Thus these sources do not allow an inclusive snapshot of the population resident in the UK at any specific moment in time.
3. Whilst 'crowd sourced' validation is possible (e.g. www.onomap.org), there is no comprehensive means of comparing predicted and objective (e.g. age) or self-assigned (e.g. ethnicity) characteristics.
4. Little effort has been devoted to refining the classification of 'hard to reach' groups, such as Caribbeans, whose ethnicity can likely be ascertained only through subtle associations between forename-surname pairings.
5. The clustering procedure has been largely aspatial, in significant part because of unevenness of geographical coverage and the absence of highly granular location information.

This research will address these shortcomings through use of the best available secondary dataset for developing an enriched classification and conducting sensitivity analysis to refine and improve its universal application across the UK. Individual level Census data will be used in order to develop a classification of given and family names into cultural, ethnic and linguistic groups, by extending the methodology of Mateos et al (2011). Crucially, and for the first time, the results of this classification will be compared to individual and household data on Ethnicity, National Identity, Country of Birth, First and Second Language Spoken and Nationality. This will make it possible to investigate the causes of apparent errors in the classification, and to identify the small geographic areas in which they are concentrated. Through an iterative procedure, secure online facilities will be used to improve the classification in the light of these results.
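The comparison step described above can be illustrated with a minimal sketch. It assumes each record carries a small-area code, a name-based predicted group and a self-assigned group; the records, group labels and function names are hypothetical and are not drawn from Census data or from the project's own software.

```python
from collections import Counter

# Hypothetical records: (small_area_code, predicted_group, self_assigned_group).
# All values are invented for illustration; they are not Census data.
records = [
    ("A01", "Irish", "Irish"),
    ("A01", "British", "British"),
    ("A01", "British", "Caribbean"),
    ("B02", "Polish", "Polish"),
    ("B02", "British", "Caribbean"),
    ("B02", "British", "Caribbean"),
]

def confusion(rows):
    """Count (self_assigned, predicted) pairs across all records."""
    return Counter((actual, predicted) for _, predicted, actual in rows)

def recall_by_group(rows):
    """Share of each self-assigned group that the classifier labels correctly."""
    totals, hits = Counter(), Counter()
    for _, predicted, actual in rows:
        totals[actual] += 1
        if predicted == actual:
            hits[actual] += 1
    return {group: hits[group] / totals[group] for group in totals}

def error_rate_by_area(rows):
    """Share of mismatched records in each small area."""
    totals, errors = Counter(), Counter()
    for area, predicted, actual in rows:
        totals[area] += 1
        if predicted != actual:
            errors[area] += 1
    return {area: errors[area] / totals[area] for area in totals}
```

In this toy data the group with the lowest recall and the area with the highest error rate point to where a classification would need refinement, which is the kind of diagnosis the iterative procedure is intended to support.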

Comparison of household and individual classification results with self-assignments on Census measures of identity will allow the classification tool to be made sensitive to indicators of cultural assimilation, whether through inter-marriage or duration of residence, at scales from the local to the national.

The 'surname regions' will also be used to add regional context to the 2011 ONS Output Area Classification.

Planned Impact

The UK is becoming an increasingly diverse society in ethnic terms, and the evidence of the 2011 Census is that ethnic minorities are becoming increasingly dispersed throughout the UK. For the time being, the 2011 Census of Population provides a very valuable snapshot of the size and detailed geography of ethnic minority populations, but it is one that will not be refreshed at this level of granularity for nearly 10 years, and only then if there is a 2021 Census. If the next 10 years see similarly rapid population changes to those of 2001-11 (e.g. of Romanian immigrants), ONS will find the classifier of considerable use in coding up administrative data sources, such as NHS records, in order to provide detailed analysis of the local and regional effects of demographic change. This is a primary motivation for the ONS contribution to this research.

A great strength of the classification will be that it can be run 'on site' at any location where a dataset containing personal names is stored. The software will be written so as to provide aggregate as well as individual estimates of ethnicity, and by suppressing the latter option it is perfectly possible to allow only aggregate results that do not create disclosure risks. As such, it will prove useful in a range of settings. For example, crime statistics and police interventions may be coded by ethnicity; higher education institutions will be able to monitor the effectiveness of their constituent divisions in implementing widening participation initiatives; and NHS trusts will be able to profile users of services such as breast screening, A&E or GP referrals. Indeed, any organisation that has legitimate access to individual names records will be able to undertake audits of the extent to which their activities draw upon local labour markets or serve local (or broader) communities.
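One way an aggregate-only mode might work is sketched below. The lookup table, the function names and the minimum cell count of 10 are all assumptions made for illustration; the source does not specify a suppression threshold or the tool's interface, and a real classifier would use forename-surname pairings rather than surnames alone.

```python
from collections import Counter

# Hypothetical surname lookup; a real classifier would also use forenames.
LOOKUP = {"jones": "Welsh", "kowalski": "Polish"}

def predict_group(surname, lookup=LOOKUP):
    """Return the predicted group for one surname, or 'Unclassified'."""
    return lookup.get(surname.strip().lower(), "Unclassified")

def aggregate_estimates(surnames, min_count=10):
    """Return counts per predicted group, never per person.

    Counts below min_count are replaced by None. The threshold is an
    illustrative assumption, not a rule taken from the source.
    """
    counts = Counter(predict_group(s) for s in surnames)
    return {group: (n if n >= min_count else None) for group, n in counts.items()}

# Invented input list: 12 records of one surname and 3 of another.
sample = ["Jones"] * 12 + ["Kowalski"] * 3
summary = aggregate_estimates(sample)
```

Here `summary` reports the larger group's count and suppresses the smaller one, so no individual-level prediction leaves the function.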

Although calibrated on UK data, the names classification is likely to be of wider geographical use, for example by researchers and government bodies with access to administrative datasets elsewhere in the EU.

The geodemographics industry will benefit from the research. Value added reseller CACI already uses the UCL names classification in its ACORN system, and so the classification produced in this research is likely to improve the discriminatory power of future versions of the classification. The software tool will be of value, therefore, in updating geodemographic classifications over time. In a similar way, the retail industry will benefit by becoming better able to differentiate between its customers and to better tailor their offerings to them.

In the much longer term, the research will have positive impacts upon amateur genealogy. 2011 Census data will only be released in 2111, and will be error prone with regard to names because of typographic errors for online Census submissions and OCR errors on forms completed by hand. The results of the data cleaning that will be a necessary part of this research will be retained alongside Census records by the ONS, to the tangible benefit of future generations.

Publications

Harris R (2017) More bark than bytes? Reflections on 21+ years of geocomputation in Environment and Planning B: Urban Analytics and City Science

Kandt J (2016) Regional surnames and genetic structure in Great Britain. in Transactions (Institute of British Geographers : 1965)

Lansley G (2016) The geography of Twitter topics in London in Computers, Environment and Urban Systems

Lansley G (2016) Deriving age and gender from forenames for consumer analytics in Journal of Retailing and Consumer Services

Longley P (2015) Geo-temporal Twitter demographics in International Journal of Geographical Information Science

 
Description This research has developed an innovative predictive model of ethnicity based upon forename and surname associations. The work has used the Office for National Statistics (ONS) VML facility, which enables the predictive success of the model to be assessed. Additionally the research has led to a very popular public website that enables users to identify the geographic origins of their names, as well as predictions of where pairs of individuals likely first met.
Exploitation Route A free-to-use tool will be made available through ONS and/or CDRC websites to allow users to classify lists of names according to Census of Population ethnicity categories. The tool will be of interest to migration researchers and policy analysts concerned with migration and regional development. The main website associated with the research was used by well over a million unique users during the research.
Sectors Agriculture, Food and Drink; Communities and Social Services/Policy; Creative Economy; Digital/Communication/Information Technologies (including Software); Education; Financial Services, and Management Consultancy; Healthcare; Leisure Activities, including Sports, Recreation and Tourism; Government, Democracy and Justice; Culture, Heritage, Museums and Collections; Retail; Security and Diplomacy; Other

URL http://named.publicprofiler.org
 
Description This research has developed a classification of given and family names using the Office for National Statistics (ONS) VML link to 2011 Census of Population data for England and Wales. A software tool has been developed to enable users to code up lists of personal names with predictions of ethnicity. The stated intention remains to make the tool available to the widest possible range of users. The value and impact of this software is potentially very wide-ranging. Users might be able to answer questions such as:
- how does participation in breast screening programmes vary between ethnic minorities?
- how does use of accident and emergency facilities vary between different groups of recent migrants, which bring different experiences and user behaviours from their home countries?
- which refugees came from French speaking Africa?
- which mail order book customers would like catalogues of foreign language books?
- how do the stage routines of different US comedians reflect their ethnic and cultural roots?
- what types and brands of cosmetics might different customers use?
These and other questions have been answered using software that we have developed based on consumer databases, yet this is in some important respects unsatisfactory in that such data sources are subject to various sources of bias. This grant application was predicated on the belief that the best classifications should use the best data, and that use of the 2011 Census would allow a range of statistical and self-assignment issues to be addressed. The project has been a very successful collaboration between UCL (and latterly the Consumer Data Research Centre, CDRC) and ONS, although extracting results in a timely fashion from the VML has entailed a steep learning curve. Additional hurdles still need to be overcome before the tool is made available through the CDRC under terms and conditions agreed between ONS and UCL.
Most of these hurdles have arisen because of concerns flagged during ONS' new and comprehensive ethical review procedures, and all parties have wished to address and accommodate concerns fully. At the time of writing, ONS is redrafting a Memorandum of Understanding and an agreed communications plan to market the names classification tool. ONS IT staff also still need to finalise security arrangements around the licensing of the tool. Some further ethical review may be required. At the time of writing, final sign off of the tool is expected in March or April 2018. To date, the research findings have principally been used by ONS and CDRC to evaluate the use of new technologies and procedures to fashion the best public sector data into software products that are efficient, effective and safe to use. Use cases include wide areas of healthcare policy, marketing, migration analysis and social capital formation, and we have demonstrated the value of these applications using names classification tools based upon consumer registers. We envisage that the use of 2011 Census data to similar ends will be still more valuable in economic and social applications. Finalising dissemination of the software tool has been much slower than either party anticipated because of new ethical procedures. Concerns from the ONS ethical review committee have led to rethinking how the tool may be disseminated. The ethical review committee was of the view that individual level predictions should not be made by licensed users. ONS has therefore agreed to amend the software so that only data for aggregations of individuals are estimated. It has also been necessary to undertake penetration testing, in order to ensure that the tool is not obtained and used by unauthorised users.
First Year Of Impact 2018
Sector Communities and Social Services/Policy; Creative Economy; Digital/Communication/Information Technologies (including Software); Education; Healthcare; Leisure Activities, including Sports, Recreation and Tourism; Government, Democracy and Justice; Retail; Transport
Impact Types Cultural,Societal,Economic,Policy & public services

 
Description Meeting with Japanese Government CIO and National Strategy Office of ICT, to demonstrate our data visualisations
Geographic Reach National 
Policy Influence Type Gave evidence to a government review
 
Description Data Licence Agreement: Datatalk Ltd 
Organisation Datatalk Statistical Solutions Ltd
Country United Kingdom 
Sector Private 
PI Contribution Data acquisition for CDRC service.
Collaborator Contribution Provision of historic electoral roll (1998-2002) data to CDRC service for the purposes of social science research.
Impact Historic electoral roll data products available through the CDRC Safeguarded and Secure service.
Start Year 2017
 
Title named Website 
Description Creates maps of the relative distribution of surnames across the UK as kernel density estimates (KDEs), with a combined version, obtained by multiplying the KDEs, for pairs of surnames (e.g. couples).
Type Of Technology Webtool/Application 
Year Produced 2016 
Impact Wide coverage on media online portals (Daily Mail, Telegraph, Independent etc.), resulting in over 1 million visitors, 3 million searches carried out and nearly 100,000 feedback votes.
URL http://named.publicprofiler.org/
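
The KDE multiplication mentioned in the description of the named website can be sketched as follows. This is an illustrative sketch only, not the website's implementation: the grid, the point locations, the Gaussian kernel, the bandwidth and the final rescaling are all assumptions chosen to keep the example small.

```python
import math

def gaussian_kde_grid(points, grid, bandwidth):
    """Evaluate a 2D Gaussian kernel density estimate at each grid location."""
    norm = 1.0 / (2 * math.pi * bandwidth ** 2 * len(points))
    surface = []
    for gx, gy in grid:
        total = 0.0
        for px, py in points:
            d2 = (gx - px) ** 2 + (gy - py) ** 2
            total += math.exp(-d2 / (2 * bandwidth ** 2))
        surface.append(norm * total)
    return surface

def combine(surface_a, surface_b):
    """Multiply two surfaces cell by cell, then rescale so the cells sum to 1."""
    product = [a * b for a, b in zip(surface_a, surface_b)]
    total = sum(product)
    return [p / total for p in product] if total > 0 else product

# Invented bearer locations for two hypothetical surnames on a 5 x 5 grid.
grid = [(x, y) for x in range(5) for y in range(5)]
surname_a = [(0, 0), (1, 1), (2, 2)]
surname_b = [(4, 4), (3, 3), (2, 2)]

kde_a = gaussian_kde_grid(surname_a, grid, bandwidth=1.0)
kde_b = gaussian_kde_grid(surname_b, grid, bandwidth=1.0)
combined = combine(kde_a, kde_b)

# Grid location where the combined surface is highest.
peak = grid[max(range(len(grid)), key=combined.__getitem__)]
```

Multiplying the two surfaces emphasises locations where both surnames are relatively concentrated, which is why the combined map can suggest where a pair of individuals might plausibly have met.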
 
Description Media: 'What's in a surname? A new view of the United States based on the distribution of common last names.' National Geographic, Feb 2011, p20-21
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Publication in a magazine with a very large international circulation.

An increase in interest in the related website worldnames.publicprofiler.org, which attracts nearly 1 million unique users per annum.
Year(s) Of Engagement Activity 2011
URL http://ngm.nationalgeographic.com/2011/02/geography/usa-surnames-interactive
 
Description Media: Twickenham warms to Billy Twelvetrees - a name we'll remember. Daily Telegraph, 5th February 2013, p 21 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Media (as a channel to the public)
Results and Impact Stimulation of further public interest in the results of the research

Increased interest in the associated website worldnames.publicprofiler.org, that continues to attract c. 1,500 unique users every day and nearly 1 million unique users per annum (there are very high peaks in usage following media features on the work).
Year(s) Of Engagement Activity 2013
URL http://www.telegraph.co.uk/sport/rugbyunion/international/england/9847653/Twickenham-warms-to-Billy-...