A Profiler for Crime, Criminal Justice and Social Harm

Lead Research Organisation: University of Salford
Department Name: Nursing, Midwifery, Soc Work & Sciences


While government has been the custodian of statistical information about society, particularly about crime, criminal justice and social harm, an open society depends upon the wider accessibility of data to support its deliberations. Since the public at large cannot be expected to collate and analyse such data by themselves, those acting on its behalf have an important role to play in contributing to informed public understanding and debate. Nevertheless, although the 'data explosion' is generating large amounts of relevant information about social conditions, it is becoming increasingly difficult for the charitable or 'third sector' of civil society to grasp and make use of this to help society as a whole come to the best decisions about what to do about troubling social issues such as crime and justice.

The objective of this project is to prototype the development of a 'Profiler' that can enable civil society opinion formers to better understand factors affecting crime, criminal justice and social harm. In so doing, it will provide civil society representatives with valuable insights into how social, economic and demographic factors shape the distribution of social harm in society. It will seek to improve overall strategic policy-making in crime and criminal justice by helping the charitable sector to debate with government, and to inform parliamentarians, the media, the voluntary sector, faith communities, minority groups, and the general public itself, thereby expanding the evidence base available for responsible and representative policy-making for society as a whole.

The project aims to develop a 'Profiler' that harnesses the tools of the information-based 'Big Data' revolution that is sweeping through the commercial and media world. The purpose of the Profiler is to enable its users to assemble and collate data on social and economic conditions and relate them to statistics on crime and criminal justice. The term Big Data encapsulates both the large-scale, complex and fast-moving nature of information in contemporary society and the new and powerful computational tools that are being developed to analyse it. Delivering on the promise of Big Data means that the Profiler must communicate information that is understandable, timely, and meets users' needs and requirements, enabling both producers and consumers of information to reach outcomes together, thereby creating new and useful insights that would otherwise not be apparent because of the scale and complexity of the information.

Hitherto, this complex research work has had to be carried out by academic researchers. Although findings, alongside government statistics, are stored by the UK Data Service, they have not been easily useable by charities and voluntary organisations or accessible to the general public. This project is a unique partnership between criminologists, data scientists, and the foremost educational charity in the criminal justice field to overcome these obstacles. By bringing advanced analytical tools and expert criminological knowledge together in a powerful, user-friendly computer system, it is hoped to make this specialist knowledge more widely accessible and thereby to improve the quality of decision-making in criminal justice policy and practice for the benefit of society as a whole.

Planned Impact

The beneficiaries of this project, and the ways in which they will benefit, are as follows.

1. The Centre for Crime and Justice Studies. The Centre is a charity with a long history of educational work, bringing the results of applied research to policy makers, opinion formers and practitioners. Presenting and explaining complex criminal justice datasets to non-specialist audiences is key to the Centre's mission. This project will greatly enhance the Centre's ability to do this. The Centre will apply the Profiler through a number of ongoing communications outputs, including its database of contacts (numbering some 10,000 individual contacts), its website, regular events and publications. It will also supplement and enhance its existing annual publication - UK Justice Policy Review - the only regular assessment of criminal justice policy developments in the UK aimed at a non-specialist audience.

2. A range of third sector organisations delivering criminal justice services, as well as those engaged in public education and campaigning work promoting evidence-based policy making. Criminal justice service delivery does not operate in a vacuum. The social arrangements and demographics of an area influence the efficacy and impact of criminal justice services. Those engaged in service delivery will benefit by being able to cost, target and personalise criminal justice services more effectively. Those involved in campaigning and public education work will be able to draw on a rich source of data that places criminal justice information in a wider social context.

3. Charitable trusts that support and fund criminal justice service delivery, education and campaigning. The Profiler will offer an invaluable source of independent and accessible data to help inform trusts that fund criminal justice policy development and service delivery about the areas of policy and intervention sites that are likely to deliver the greatest impact.

4. Local policy makers and commissioners, such as local authorities and Police and Crime Commissioners. Reliable, accurate and accessible local data plays an important role in the planning, commissioning, targeting and delivery of local services in ways that offer value for money and maximise impact.

5. National policy makers, such as Members of Parliament and civil servants working in the Home Office and Ministry of Justice. The Profiler will be of great use to MPs seeking to understand the interaction of criminal justice and social factors in their constituencies, as well as placing these interactions in a wider, national context. For civil servants the Profiler will connect up related but discrete datasets, assisting in the development of evidence-based policy options.

6. Inspection and audit bodies, such as the National Audit Office and HM Inspectorate of Constabulary. The National Audit Office will find the Profiler a useful additional tool for informing their value for money audits. With the adoption of an annual cycle of inspections of all police forces in England and Wales, HM Inspectorate of Constabulary will find the Profiler to be a useful tool to assess police performance against key criminal justice and social data.

7. Journalists and opinion formers in the national and local media. The Profiler will provide valuable data and analysis for a broad range of crime, criminal justice and social harm stories.

8. Local communities and the wider general public. Much current crime data aimed at the general public give raw figures but offer little or no context for understanding the data. The data can also be relatively opaque or arcane for the non-specialist. The Profiler will provide a deep and rich context for local communities and the public to understand crime data, presenting them in ways that are accessible and engaging.


Description The project was to prototype the development of a Profiler for Crime, Criminal Justice and Social Harm, that will give proof to the concept that Big Data Informatics (BDI) can enable civil society opinion formers such as the Centre for Crime and Justice Studies (CCJS) to better understand factors affecting crime, criminal justice and social harm.

The key findings of this project can be summarised as follows:

1. The development of the foundations of a new domain that could be labelled as "Computational Criminology" as part of the advancement of theory and method in criminological research. The theoretical concepts are based on the development of the Justice Matrix conceptual model.

The Justice Matrix can be defined as:
• A conceptual map of the 'universe' of crime and society that encompasses the criminal justice system
• A revolutionary new informatics tool (The Profiler) that harnesses 'Big Data' analytics and visualization technologies.

The application of the Justice Matrix will inform the work of CCJS especially the UK Criminal Justice Policy Review.

There are two fundamental social processes that shape criminal justice:
• Victimization: the experience of citizens of harm perpetrated against their persons, community and property, both by other citizens, corporate bodies and the State itself.
• Criminalization: the process whereby the State responds to crime, including the protection of citizens from victimization, and the application of justice in the distribution of the criminal sanction against those held responsible in law for victimization.
Nevertheless, the available statistical data (mostly generated by the State as a by-product of the work of its agencies and institutions) confounds these two processes in its measurement schema. Additionally, the data measure single attributes and activities rather than express the criminal justice system as a holistic albeit complex systemic entity.
The aims of the Justice Matrix are to:
• Measure the latency in the system of relationship amongst its component parts
• Model the synergy amongst its component parts

The Justice Matrix combines data from four different domains:

a) Domain 1: Social, economic and demographic data: the socio-economic and demographic characteristics of the whole population, mainly via the census. It will provide the foundations of the Justice Matrix, providing key social, economic and demographic data on the UK population as a whole, derived primarily from the Census, serving as the social context of criminal justice.

b) Domain 2: Criminal Justice administrative data: the cases the justice system deals with through its various services. It is selective in terms of its data and cases through the processes of selection it pursues. It over-penalises the few that are criminalised. This domain comprises a variety of data tracking the process of criminalisation, including police recorded crime data, crime victimization survey data, and prosecution, conviction and sentencing data.

c) Domain 3: Ideological and political data: data on public attitudes towards crime and criminal justice, and political priorities and policy agendas, will be a key variable explaining the process of criminalisation (i.e. the State's response). The ideological and cultural attitudes, understandings and assumptions, shared by the population in different ways and to different degrees. The main data are survey ones. This could also include political and legislative activity on crime-related matters (e.g. via extension of UK Policy Agendas project).

d) Domain 4: Criminal justice institutions: this domain will include data on the institutions of the state, such as the number of institutions and their physical infrastructure (how many prisons, courts etc), levels of staffing (how many police etc), and budgetary expenditure.

There are two ways of defining a social statistic:
• Nominalist: starts with an (arbitrary) administrative definition of a population with particular attributes (e.g. x number of people using this service/in receipt of benefits etc). It starts with the users of those services so ignores those that do not fit these categories.
• Realist: assumes that there are real phenomena, regardless of administrative boundaries (e.g. people with a problematic drug problem, not 'drug offenders'). It is the job of statistics to quantify this.
So in terms of the statistics in the Matrix:
• Most criminal justice statistics are nominalist.
• Some of the data - notably the census - are realist.
Most analysis of these statistics has been accomplished within the scope of the General Linear Model but this does not capture the complexity, latency and synergy amongst the components of the criminal justice system.
In separating them into the four domains - criminal justice, ideology, socio-economic and institutions/coercion - we have to resolve the tensions of measurement and analysis.
• The purpose of the Justice Matrix is to enable an understanding of dynamic processes and relationships, rather than to prove a pre-determined assumption (e.g. that poverty causes criminalisation). It will be an aid to the development and testing of given hypotheses about crime and society that will inform policy and public understanding.
• The scope of the Matrix has to be vast in order to serve its purpose. The four domains map out the universe.

2. The detailed analysis of the Crime Survey for England and Wales (CSEW) has led to the conclusion that, in its current form, it is not suitable for automation, and certainly not for data mining and big data analysis tools, without extensive pre-processing and reorganisation of the data. It is suitable for general linear regression analysis. We could propose a revised form if needed (currently not produced as part of this project).

Survey data is known for its complexity, especially its high dimensionality. The Crime Survey for England and Wales (CSEW), formerly known as the British Crime Survey, is a survey conducted by the Office for National Statistics (ONS) in the UK every year. It is carried out face to face, interviewing a number of households in England and Wales. The purpose of this survey is for the government to understand the true level of crime. The survey is able to find out about crimes that do not get reported to, or recorded by, the police. It was previously shown that only 4 in 10 crimes are actually reported to the police. Therefore, the CSEW provides a better reflection of the extent of crime than police recorded figures. It offers a useful benchmark for police recorded crime. Furthermore, the CSEW provides valuable survey data for criminologists. The results of the survey play an important role in informing government policies. Until now, the surveys have been analysed using traditional statistical methods and models, and the wealth of information they contain has not been fully exploited. We believe that this is due to the complexity, variability and heterogeneity of the data.
Indeed, current use of data mining in criminology focuses on finding crime patterns and trends. This new area is often referred to as computational criminology and draws from Computer Science, Applied Mathematics and Criminology disciplines. It is primarily about explaining and learning about the mechanisms underlying criminal activities.
In this project, and for the first time, we attempt to understand the data collected from the CSEW and investigate the application of data mining techniques to it as part of the crime profiler project (a proof of concept), which is intended as the basis for data-driven, informed policies that can be used by local authorities and police forces.

Fourteen years of CSEW data, from 2001 to 2014, were collected. Prior to 2009, the CSEW included the following datasets:
- General victim form (16 years old +);
- General non-victim form (16 years old +).
From 2009, children aged 10-15 are also included in the interview. Therefore, two more datasets were included.
- Victim form for 10-15 years old;
- Non-victim form for 10-15 years old.
In the first phase of this research, the non-victim forms (16 years old +) from the CSEW were chosen. The non-victim form includes not only the victims' data but also the non-victims' data.
The class features are 18 different types of crime. The following offences are covered: violence (though murder cannot be included), robbery, theft (personal, burglary, vehicle, bicycle, other household) and criminal damage. It is a binary classification problem. There is a mixture of discrete features and continuous features in each dataset.

Each dataset has a large number of features: on average, about 2,400 per year. A large number of features means high-dimensional data, which can contain a high degree of irrelevant and redundant information. This is a common problem in data mining and can lead to:
- The production of complex models, which are hard for researchers and users to interpret;
- Increased run-time complexity and degraded performance of learning algorithms;
- The problem of over-fitting.

Feature selection is an essential process when facing high dimensional data. Without the availability of domain experts, feature selection algorithms such as wrappers, filters and embedded methods are normally adopted. In this project, a group of experienced criminologists were members of the project team, and they selected a subset of the more relevant and important features. These were selected from the following areas of the CSEW:
- Household Grid;
- Demographic Data;
- The Index of Multiple Deprivation;
- Calibration weight.
The feature selection process reduced the number of features to about 150, which are used as the basis for building the models. Furthermore, depending on the type of crime, some of the values were missing from the class feature. For example, if the household did not own a car during the survey year, the value for car related crime would be marked as missing. For each type of crime, the records with a missing class value have been removed.
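The two preparation steps described above (keeping the expert-selected features and dropping records whose class value is missing) can be sketched as follows. This is a minimal illustration in plain Python, not the CSDM implementation: the field names, the sample records and the use of `None` to mark a missing value are all assumptions made for the example.

```python
# Hypothetical records: field names and the None-as-missing convention are
# illustrative assumptions, not actual CSEW variable names or codes.
records = [
    {"owns_car": True, "tenure": "owner", "region": "NW", "car_damage": False},
    {"owns_car": False, "tenure": "renter", "region": "NW", "car_damage": None},
    {"owns_car": True, "tenure": "renter", "region": "SE", "car_damage": True},
]

# Stand-in for the roughly 150 features chosen by the criminologists.
SELECTED_FEATURES = ["owns_car", "tenure"]


def prepare(rows, features, class_feature):
    """Keep only the selected features and drop rows with a missing class value."""
    prepared = []
    for row in rows:
        if row.get(class_feature) is None:
            continue  # class value missing for this crime type (e.g. no car owned)
        kept = {name: row[name] for name in features}
        kept[class_feature] = row[class_feature]
        prepared.append(kept)
    return prepared


car_damage_data = prepare(records, SELECTED_FEATURES, "car_damage")
```

Because the missing values depend on the crime type, this step would be repeated once per class feature, giving a separate prepared dataset for each type of crime.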

There are many methods used to tackle the problem of imbalanced data. The main principle is to adjust the class distribution of the datasets. They can be classified into two main techniques: over-sampling the minority class and under-sampling the majority class. Due to the nature of the CSEW data, criminologists are interested in non-victim data as much as victim data. As such, we have adopted the over-sampling method.
One simple over-sampling method is to create copies of randomly selected instances from the minority class. The problem with this method is that it can cause over-fitting: the produced models can be too specific and not general enough. The alternative is to create synthetic samples from the minority class.
The most popular synthetic sampling technique is called SMOTE (Synthetic Minority Over-sampling Technique). It creates synthetic samples from the minority class instead of creating copies. The algorithm selects two or more similar instances using a distance measure. It then builds a synthetic instance one feature at a time, moving each feature value by a random fraction of the difference between the instance and a neighbouring instance. By adopting SMOTE, we balanced the ratio of the binary classes to roughly 1:1.
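The interpolation step of SMOTE can be illustrated with a short sketch. The project itself used WEKA for data balancing, so the code below is only a simplified stand-in for continuous features: the function name `smote`, its parameters and the sample points are assumptions made for the example.

```python
import math
import random


def smote(minority, n_synthetic, k=2, seed=0):
    """Create n_synthetic points by interpolating between minority instances
    and their k nearest minority neighbours (continuous features only)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # each feature moves a random fraction of the way towards the neighbour
        point = [b + rng.random() * (n - b) for b, n in zip(base, neighbour)]
        synthetic.append(point)
    return synthetic


# Illustrative data: 3 minority (victim) points against 6 majority points.
minority = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0]]
majority = [[5.0, 5.0], [5.5, 4.5], [6.0, 5.0], [5.0, 6.0], [6.5, 5.5], [6.0, 6.0]]

# Generate enough synthetic minority points to reach a 1:1 class ratio.
synthetic = smote(minority, len(majority) - len(minority))
```

Every synthetic point lies between two existing minority instances, so the minority class is filled in rather than duplicated, which is what reduces the over-fitting risk of plain copying.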

3. The use of unsupervised models such as association rules was shown to be ineffective with the crime survey data. Supervised models yielded better results and were more intuitive to the end users.

In this project, the class feature (crime type) is pre-defined. Therefore it is a typical supervised learning problem. Furthermore, it is a binary classification problem (Victim vs Non-victim). The users of this research are criminologists who want to gain a better understanding of CSEW data. It is primarily about explanation: criminologists try to learn how the underlying mechanisms work. Therefore, a descriptive modelling method is preferred and the Decision Tree learning method is chosen.
Decision tree learning is one of the most commonly used data mining techniques. A decision tree is a flow-chart-like structure. The topmost node in a tree is the root node. Each internal node represents a test on an attribute. Each branch denotes the outcome of a test. Each leaf node is a class label.

The produced decision tree models can be simply interpreted as a set of rules in a form of 'IF A and B and C Then D' where A, B and C are the conditions of the rule, D is the result of the rule. The rules help the criminologists to analyse the characteristics of people that are more likely to be the victims of the crime as well as the characteristics of people that are less likely to be the victims. In particular, C4.5 is used in this research to produce the descriptive models. C4.5 uses normalized information gain to choose the attributes one after the other to form the decision tree. C4.5 handles both numeric features and discrete features. It can also handle missing values.
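The attribute selection criterion mentioned above, normalized information gain (often called the gain ratio), can be sketched for discrete features as follows. This is an illustrative calculation rather than the WEKA implementation used in the project: the feature names and sample rows are assumptions, and the handling of numeric features and missing values is omitted.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def gain_ratio(rows, feature, class_feature):
    """Information gain of splitting on `feature`, normalized by the split information."""
    labels = [row[class_feature] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[class_feature])
    weights = [len(group) / len(rows) for group in groups.values()]
    remainder = sum(w * entropy(g) for w, g in zip(weights, groups.values()))
    info_gain = entropy(labels) - remainder
    split_info = -sum(w * math.log2(w) for w in weights)
    return info_gain / split_info if split_info > 0 else 0.0


# Illustrative rows: "insured" separates the classes perfectly, "tenure" not at all.
rows = [
    {"insured": True, "tenure": "owner", "house_damage": False},
    {"insured": True, "tenure": "renter", "house_damage": False},
    {"insured": False, "tenure": "owner", "house_damage": True},
    {"insured": False, "tenure": "renter", "house_damage": True},
]

best = max(["insured", "tenure"], key=lambda f: gain_ratio(rows, f, "house_damage"))
```

The feature with the highest ratio becomes the test at the current node, and the procedure is repeated on each resulting subset of rows.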
A large decision tree may suffer from over-fitting. Pruning is the technique used to address this: it reduces the size of a learned tree, ideally without reducing predictive accuracy. There are many pruning methods. In this research, reduced error pruning is adopted. It has the advantage of simplicity and speed.
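The idea behind reduced error pruning can be sketched as below: working bottom-up, each subtree is replaced by a leaf carrying its majority class whenever that does not lower accuracy on held-out validation data. The dictionary-based tree layout, the field names and the validation rows are assumptions made for this example and do not reflect how WEKA represents trees.

```python
def predict(tree, row):
    """Follow the tests from the root until a leaf label is reached."""
    while "label" not in tree:
        child = tree["children"].get(row[tree["feature"]])
        if child is None:
            return tree["majority"]  # unseen value: fall back to the majority class
        tree = child
    return tree["label"]


def accuracy(tree, rows, class_feature):
    return sum(predict(tree, r) == r[class_feature] for r in rows) / len(rows)


def prune(node, root, validation, class_feature):
    """Replace a subtree with its majority-class leaf unless validation accuracy drops."""
    if "label" in node:
        return
    for child in node["children"].values():
        prune(child, root, validation, class_feature)
    before = accuracy(root, validation, class_feature)
    saved = dict(node)
    node.clear()
    node["label"] = saved["majority"]
    if accuracy(root, validation, class_feature) < before:
        node.clear()
        node.update(saved)  # pruning hurt accuracy, so restore the subtree


# Illustrative tree: the "tenure" test under insured=False is redundant.
tree = {
    "feature": "insured",
    "majority": False,
    "children": {
        True: {"label": False},
        False: {
            "feature": "tenure",
            "majority": True,
            "children": {"owner": {"label": True}, "renter": {"label": True}},
        },
    },
}
validation = [
    {"insured": True, "tenure": "owner", "house_damage": False},
    {"insured": False, "tenure": "renter", "house_damage": True},
    {"insured": False, "tenure": "owner", "house_damage": True},
]

prune(tree, tree, validation, "house_damage")
```

In this example the redundant inner test is collapsed into a single leaf, while the root test is kept because removing it would lower validation accuracy.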

A purposely-designed data mining tool CSDM (Crime Survey Data Miner) is developed. The main functions of the developed tool include data preprocessing and data mining.
CSDM is developed in Python. Pandas is used for data manipulation and analysis. WEKA is used for data balancing and data mining, and python-weka-wrapper is used to let WEKA communicate with the Python code.

In this project, one decision tree is produced for each type of crime for each year. Overall, we have produced 252 decision trees (18 crime types over 14 years), and each decision tree can be easily interpreted as a set of rules. The rules vary in size from a single condition to multiple conditions. Here are two examples of typical rules:
- Single condition:
IF The house has home insurance = True
THEN House Damage = False.

- Multiple conditions:
IF Attends government employment training scheme = True, AND
Religion = No religion, AND
General health condition = Good, AND
Marital Status = Single, AND
Own a bike = Yes, AND
Interviewer is HRP = Yes, AND
Number of bikes owned = 2, AND
Household's position on the Crime and Disorder Index = 20% of most deprived LSOAs
THEN Car Damage = True.
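Rules of this kind can be held as simple data and checked against a respondent's record, as sketched below. The field names and the sample record are hypothetical and chosen only to mirror the single-condition example; they are not CSEW variable names.

```python
def matches(rule, record):
    """A rule fires when every one of its conditions holds for the record."""
    return all(record.get(name) == value for name, value in rule["conditions"].items())


def apply_rules(rules, record):
    """Return the (class feature, predicted value) outcome of every rule that fires."""
    return [rule["outcome"] for rule in rules if matches(rule, record)]


# Hypothetical encodings of a single-condition and a multi-condition rule.
rules = [
    {"conditions": {"home_insurance": True}, "outcome": ("house_damage", False)},
    {
        "conditions": {"marital_status": "Single", "owns_bike": True, "bikes_owned": 2},
        "outcome": ("car_damage", True),
    },
]

record = {"home_insurance": True, "marital_status": "Married", "owns_bike": True}
outcomes = apply_rules(rules, record)
```

Here only the first rule fires, because the record does not satisfy every condition of the second.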

4. Data visualisation became an unavoidable and important part of the project, as it was clear from the first experiments that the end users (CCJS) found the amount and complexity of the data produced difficult and overwhelming. The first prototype enabled visualisation of binary connected data sets (i.e. each data set comprised a set of attributes, each with a single weight value, and a set of arcs defining binary relationships that link one attribute to another with a force value). The data was imported from a text file exported from the Pajek graph solving system. This was a desktop, 2D visualisation tool. It went some way towards giving the end users a better understanding of the models, but it remained complex.

The core application was developed in C/C++ using the OpenSceneGraph (OSG) library for graphical rendering and the Qt toolkit for GUI framework. Physical simulation was enacted as a bespoke library that extended previous research and enacted specific modifications for the data sets used. This was delivered as a desktop application with a traditional mouse/keyboard interface capable of running on typical office based PCs and laptops.

In this model a single decision tree was imported. Each route through the decision tree (to a true or false screener question outcome) was defined by a set of binary spring connections between the attributes (demographic characteristics) within the route, with the spring force defined by the support value for the route outcome. The physical simulation system then resolved this network of spring forces by considering each connection as a basic spring/damper component applying Hooke's law for calculation of spring force of all connections on an attribute to determine resultant force.
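The force calculation described above can be sketched in a few lines. The actual system was a bespoke C/C++ library, so this Python version is only an illustration of the principle: the function names, the 2D layout, the rest length and the damping term are assumptions, and the stiffness stands in for the support value of a route.

```python
import math


def spring_damper_force(pos, vel, other_pos, other_vel, stiffness, rest_length, damping):
    """Force on the node at `pos` from one spring/damper connection to `other_pos`."""
    dx, dy = other_pos[0] - pos[0], other_pos[1] - pos[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        return (0.0, 0.0)  # coincident nodes: the direction is undefined
    ux, uy = dx / dist, dy / dist
    # Hooke's law: force proportional to the extension beyond the rest length
    extension = dist - rest_length
    # the damper opposes relative motion along the spring axis
    rel_speed = (other_vel[0] - vel[0]) * ux + (other_vel[1] - vel[1]) * uy
    magnitude = stiffness * extension + damping * rel_speed
    return (magnitude * ux, magnitude * uy)


def resultant_force(pos, vel, connections, rest_length=1.0, damping=0.5):
    """Sum the forces of all connections acting on one attribute node."""
    fx = fy = 0.0
    for other_pos, other_vel, stiffness in connections:
        f = spring_damper_force(pos, vel, other_pos, other_vel,
                                stiffness, rest_length, damping)
        fx += f[0]
        fy += f[1]
    return (fx, fy)


# Two connections on a node at the origin, everything at rest.
connections = [((2.0, 0.0), (0.0, 0.0), 1.0), ((0.0, -3.0), (0.0, 0.0), 2.0)]
force = resultant_force((0.0, 0.0), (0.0, 0.0), connections)
```

The stronger second connection dominates the resultant, so the node would be pulled mainly towards that neighbour at the next simulation step.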

This mechanism allowed the DT representation to self-sort. In the initial prototype this was enacted as an unconstrained, undirected graph with no fixed points. Users were free to interact with the graph and both drag and 'pin' attributes (nodes) in place to explore the relationships between the connected attribute set within the DT description. This enabled exploration of the relationships. However, as this was applied it became apparent that, at this stage, the most interesting visualisation was presented when the endpoints of the decision tree were put into tension, i.e. fixed at opposing positions. This allowed the demographic attributes (age, employment status, housing type, etc) to sort into clusters between the DT outcomes and, through their proximity to either endpoint, show the relative strength of contribution to the outcome.
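The effect of fixing the two outcome endpoints can be shown with a one-dimensional sketch in which the endpoints are pinned at 0 and 1 and the attributes are left free to settle between them. The node names, spring strengths, zero rest length and step size are all assumptions made for this illustration.

```python
def relax(positions, pinned, springs, step=0.1, iterations=200):
    """Iteratively move free nodes along the net spring force; pinned nodes stay put."""
    positions = dict(positions)
    for _ in range(iterations):
        forces = {name: 0.0 for name in positions}
        for a, b, strength in springs:
            pull = strength * (positions[b] - positions[a])  # zero rest length
            forces[a] += pull
            forces[b] -= pull
        for name in positions:
            if name not in pinned:
                positions[name] += step * forces[name]
    return positions


# Outcome endpoints fixed at opposite ends; attributes start in the middle.
start = {"outcome_false": 0.0, "outcome_true": 1.0,
         "age_16_24": 0.5, "owner_occupier": 0.5}
springs = [
    ("age_16_24", "outcome_true", 3.0),
    ("age_16_24", "outcome_false", 1.0),
    ("owner_occupier", "outcome_false", 4.0),
    ("owner_occupier", "outcome_true", 1.0),
]

layout = relax(start, {"outcome_false", "outcome_true"}, springs)
```

Each attribute comes to rest nearer the endpoint it is more strongly tied to, which is how proximity in the layout can be read as strength of contribution to an outcome.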

Overall this was a successful first prototype. It demonstrated both that force based simulation of the decision tree attribute relationships was an effective representational paradigm, supporting user interaction and facilitating exploration, and that the interpretation of the visual representation was consistent with more traditional analysis of the data set. In effect this was a secondary data mining process that mined the DT through force based simulation to resolve a single coherent view of the attributes' contribution to the question outcomes.

As the prototype developed 3 key requirements emerged from the data mining experts.

a) Representation of multiple decision trees within a single coherent visual model that enabled exploration and investigation of the combined (and selectable) set of questions represented.
b) Standard problem domain layouts as the starting point for graph resolution that enabled both objective focused (question outcome) and profile focused (demographic modelling) views of the mined model.
c) Bias reduction through automated clustering to ensure that interpretation of the spatial relationships within the data visualisation presentation was consistent for each user.

Implementation of the full connected network model required the most significant conceptual change within the visualisation system. Defining each decision tree as a distinct network within a common attribute set required the development of a new data layer within the model. Initially all attributes had been considered as unique individual members within a single model. However, each separate tree imparted a different relationship model that was independent of the others. Therefore a layer construction was added. Any attribute included in any tree was added to the global set of nodes, but each relationship was associated with the decision tree that defined it. This enabled:

a) Unique colour highlighting of the relationship line within the visual presentation, with all relationships for a single decision tree having a common colour
b) Enable/Disable of individual decision trees within the physics solving model. A disabled tree removed all of its relationships from the physics solving process, and reference counting of the active relationships imparting force on each node enabled the positional update of a node to be disabled when it was not part of the collective attribute set of the active decision tree(s). In this case the nodes would become hidden from view when they were inactive. This allowed the graph of the connected node sets to re-organise to the constraints provided by only the active decision trees within the model, thereby allowing the users to articulate 'questions' in the form of selected active decision trees.
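The layer construction and reference counting described above can be sketched with a small data structure. The original was implemented in C/C++, so the class below is only an illustration: the class name, method names, tree identifiers and attribute names are all hypothetical.

```python
from collections import Counter


class LayeredGraph:
    """Global node set with relationships grouped by the decision tree defining them."""

    def __init__(self):
        self.relationships = {}  # tree id -> list of (node_a, node_b, strength)
        self.enabled = set()     # ids of trees currently taking part in the physics

    def add_tree(self, tree_id, relationships):
        self.relationships[tree_id] = list(relationships)
        self.enabled.add(tree_id)

    def set_enabled(self, tree_id, enabled):
        if enabled:
            self.enabled.add(tree_id)
        else:
            self.enabled.discard(tree_id)

    def active_relationships(self):
        """Relationships of enabled trees only; these are what the solver evaluates."""
        return [rel for tree_id in sorted(self.enabled)
                for rel in self.relationships[tree_id]]

    def reference_counts(self):
        """Number of active relationships imparting force on each node."""
        counts = Counter()
        for a, b, _ in self.active_relationships():
            counts[a] += 1
            counts[b] += 1
        return counts

    def visible_nodes(self):
        """Nodes with no active relationship are neither updated nor shown."""
        return {name for name, count in self.reference_counts().items() if count > 0}


graph = LayeredGraph()
graph.add_tree("burglary", [("age", "tenure", 0.8)])
graph.add_tree("vehicle_theft", [("age", "owns_car", 0.6)])
graph.set_enabled("vehicle_theft", False)
visible = graph.visible_nodes()
```

Disabling a tree leaves the shared node in place but hides any node that only that tree referred to, so the remaining graph re-organises around the active trees alone.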

A further refinement to the physical simulation model was also made. This was partially to de-clutter the visual form, which now comprised many nodes (100s) expressed as spheres and a massively increased set of relationships (10000s) expressed as drawn coloured lines. In this the previous binary relationship was adapted to a multi-body form in which the attributes forming a single route through to a decision tree endpoint operated within a single spring model, including a phantom centre point positioned by averaging the attribute node positions for the connected set at each iteration of the physical simulator. This reduced the computational complexity, through a reduction of the number of relationships that needed to be evaluated, thereby improving simulation performance and enabling better user interaction, and decluttered the scene with a simplified rendering model. In the case of the latter, each decision tree was now represented by a collection of attribute nets, with each denoting a different tree route to a question outcome. Multiple tree representations overlaid additional constraint relationships, sharing nodes where they were common.
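The multi-body form can be illustrated as follows: rather than one spring per pair of attributes on a route, each attribute is tied to a phantom centre recomputed as the average position of the route's nodes. The function names, the zero rest length and the sample coordinates are assumptions made for this sketch.

```python
def centroid(points):
    """Average position of a set of points (the phantom centre)."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def route_forces(positions, route, stiffness):
    """Force on each attribute of a route from its spring to the phantom centre."""
    centre = centroid([positions[name] for name in route])
    return {
        name: tuple(stiffness * (c - x) for c, x in zip(centre, positions[name]))
        for name in route
    }


def spring_counts(n_nodes):
    """Springs evaluated per route: pairwise form versus phantom-centre form."""
    return n_nodes * (n_nodes - 1) // 2, n_nodes


positions = {"age": (0.0, 0.0), "tenure": (3.0, 0.0), "owns_bike": (0.0, 3.0)}
forces = route_forces(positions, ["age", "tenure", "owns_bike"], stiffness=2.0)
pairwise, multibody = spring_counts(8)
```

For a route with eight conditions this evaluates 8 springs instead of 28, and the saving grows with route length, which is consistent with the performance and de-cluttering gains reported above.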

5. Immersive data visualisation was then used to map the desktop prototype application to a fully immersive virtual environment system (octave). This enabled exploration of visual representations within an egocentric display system that supported improved modalities for recognition of data relationships and clustering. The data scientists, criminologists and end users involved in the project found this approach more accessible, enabling better knowledge understanding and discovery within the systematic representation of the data set. A short video of this visualisation has been uploaded to YouTube and can be viewed at: https://www.youtube.com/watch?v=WQKyPJB1Wv4&feature=youtu.be

6. Towards the end of the project, as both users and developers had developed a better understanding of the needs and of what the data science technology can offer, we attempted to aggregate the decision trees obtained over many years and used time series analysis to get a better understanding of the evolution of the different variables over time. Some very interesting results, although at an early stage, were obtained, and the end users could see the power of such tools not only for understanding the historical development of specific crimes over time but also for predicting future development. A natural development of this work is to map crime variations to various sociological, economic and policy changes.
Exploitation Route The data analysis and visualisation tools will be shared with CCJS that will use them in their studies and further development of the project.
Sectors Communities and Social Services/Policy,Government, Democracy and Justice

URL https://www.youtube.com/watch?v=WQKyPJB1Wv4&feature=youtu.be
Description In the Case for Support we said that the aim of this project was to prototype the development of a Profiler for Crime, Criminal Justice and Social Harm, that will give proof to the concept that Big Data Informatics (BDI) can enable civil society opinion formers such as the Centre for Crime and Justice Studies (CCJS) to better understand factors affecting crime, criminal justice and social harm. In so doing, it was intended to provide civil society with a better understanding of how social, economic and demographic factors shape the distributions of social harm in society. This would improve strategic policy-making in crime and criminal justice by up-skilling the civil society sector, thereby expanding the knowledge-based production of policy. In pursuing this aim, it was hoped that the project would help in the development of Computational Criminology as part of the advancement of theory and method in criminological research. Undoubtedly, these aims were highly ambitious. The biggest obstacle to realising them was conceptual: it turned out that in order to provide a proof of concept, in the shape of an automated analytical and visual 'machine' (i.e. The Profiler), considerable effort was needed to work out what 'the concept' was. At the end of 12 months' work (reduced from the 18 months we had originally anticipated), we have achieved the theoretical, methodological and technical advances needed to prove our concept. Nevertheless, the steps we have taken to get this far mean that we have had less time to achieve our aim of transferring the technology, particularly making the Profiler user-friendly. In the first place, we needed to define what 'Big Data' meant in this context. Much of the work into Big Data thus far has focussed on individual records (data), often collected as part of an organisation's operations (e.g. sales data).
Consequently, the defining features of Big Data have come to be seen as consisting of Volume, Velocity and Variety, typically embodied in 'long and thin' data matrices to which the techniques of data-mining (the so-called 'algorithms') can be applied to process the data into meaningful patterns (i.e. statistics). In contrast, the focus of our efforts was the official statistics relevant to the relationship between Crime and Society. This data tends to be 'short and fat': a smaller number of cases relative to a much larger number of 'variables' (attributes, indicators). Governments not only use official statistics for policy-making but also intend them to inform and to be used by the public. Public interest also sees official statistics as a means of holding government to account. Nevertheless, official statistics are already pre-processed data. When we came to apply data-mining techniques, it became apparent that much of this statistical pre-processing had limited the scope and flexibility of data-mining to come up with interesting and revealing new patterns and, having reduced the data in length and width, had made the power of our analytic techniques largely redundant. Aside from measurement and coverage issues affecting the official statistics (which, of course, we could do little about), the statistics are also set up and produced primarily to make them suitable for 'general linear regression analysis', whether geographically or in time-series (the amount of pre-processing being necessary in order to meet the limiting assumptions of the General Linear Model (GLM)). In contrast to the hypothetico-deductive epistemology underlying this approach, Big Data Analytics is a technology that is inductive, aimed at discovery, and hypothesis- or policy-generating rather than hypothesis-testing or policy-confirming. Yet the statistics are not set up to facilitate Big Data Analytics.
Finally, crime and justice statistics are by and large collected haphazardly, reflecting more the characteristics of the agencies that produce them than the phenomena they purport to represent. Again, this limits the capacity of Big Data Analytics. To give an example: in order to apply our techniques to advantage to national level data, we would have needed a common data matrix consisting, at a minimum, of the attributes Year x Gender x Age, and usefully Ethnicity, in order to allow CCJS to investigate the well-known inequalities of the criminal justice system, what is called the 'intersectionality' of justice. In practice, it proved impossible to collate even this minimal data matrix, even across the available criminal justice data-set. Consequently, the advantages of time-series data-mining could not be utilised. In sum, we conclude, as ever, that the construction of data is determined by its intended use or, indeed, by its intended neglect (as is the case currently with Police Recorded Crime data). While we have not changed our view that Big Data Analytics would revolutionise the analysis of crime and justice issues, in order to fully realise its power and capacity we conclude that what is first required is the development of a comprehensive ontology of the statistical data, with a view to encouraging data suppliers to move towards a common, preferably disaggregated framework for outputting data suitable for Big Data analytical techniques. Of course, this illustrates the inherent circularity of statistics: the data is only as good as the means used to produce it. Nevertheless, if the Office for National Statistics and/or the UK Data Archive are serious in their ambition to make official statistics available for Big Data analysis, in this field of policy, at least, they will need to do a very large amount of pre-processing.
The price of not moving in this direction is that criminal justice policy-making falls increasingly behind the curve of what might be achieved in evidence-based policy-making, incapacitating the good governance of crime in society. Nevertheless, the advances we have made in this project are the exceptions that prove this rule. The overriding issue with crime and justice statistics is not so much size in terms of length but size in terms of width, that is, the high-dimensionality of the data (the many attributes of a 'data unit', whether person, place or region), and the resulting complexity of the associations and potentially causal networks (as well as spurious associations) within the set of attributes. We have made advances in grasping this fundamental issue in two directions: conceptual and empirical. First, we are developing a conceptual framework (incidentally inspired by the work of the sociologist Michael Mann) called the Justice Matrix. We see this as a high-dimensional, holistic conceptual space covering the terrain of Crime and Society, consisting of four Domains into which data can be organised:
• Social: social and demographic characteristics of the population
• Crime: information on the incidence and nature of crime and harm phenomena
• Ideology: information and records of political and administrative (policy) intentions and aims
• Resources: information on the institutions of criminal justice and the resources they deploy
The Justice Matrix will provide us with a conceptual framework both to organise data from various sources, and to test relationships and associations within and between domains. It proved useful in giving the data-mining a pre-processing 'map' of the data in order to orientate and circumscribe analysis.
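As an illustration only, the idea of using the four Domains as a 'map' that circumscribes which associations are examined can be sketched in a few lines of Python. This is a minimal sketch, not the project's implementation: the variable names, the `JUSTICE_MATRIX` dictionary and the `candidate_pairs` helper are all hypothetical and were invented for this example.

```python
from itertools import combinations

# Hypothetical variable names grouped under the four Domains named in the text.
# None of these names come from the project's data; they are placeholders.
JUSTICE_MATRIX = {
    "Social": ["age_group", "household_income"],
    "Crime": ["burglary_victim"],
    "Ideology": ["policy_priority"],
    "Resources": ["police_officers_per_1000"],
}


def domain_of(variable):
    """Return the Domain that a variable has been filed under."""
    for domain, variables in JUSTICE_MATRIX.items():
        if variable in variables:
            return domain
    raise KeyError(variable)


def candidate_pairs(domains):
    """Yield (var_a, var_b, kind) for variable pairs drawn only from the given Domains.

    kind is "within" when both variables share a Domain, otherwise "between".
    """
    selected = [v for d in domains for v in JUSTICE_MATRIX[d]]
    for a, b in combinations(selected, 2):
        kind = "within" if domain_of(a) == domain_of(b) else "between"
        yield a, b, kind


# Restrict the search to the Social and Crime Domains only.
pairs = list(candidate_pairs(["Social", "Crime"]))
for a, b, kind in pairs:
    print(f"{kind:8s} {a} ~ {b}")
```

Restricting the candidate pairs by Domain in this way is one simple reading of how a conceptual map could limit the number of associations that have to be evaluated.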
In the initial phase of the analysis and mining of the crime survey data, unsupervised learning techniques such as association rule mining seemed the most appropriate and promising for satisfying the end users' requirements. However, because the complexity of relationships within the data is overwhelming, we could not proceed with 'unsupervised' analysis without generating a large number of potentially trivial associations, which would be onerous to evaluate one-by-one. Thus the Justice Matrix not only provides an initial 'supervision' of the analysis but will also become the theoretical framework for selecting and organising the results. The advantage is that it will provide theoretical criteria, alongside purely internal, data-driven criteria, for selecting and investigating patterns - that is, providing a means whereby theory can get back into Big Data. In time, as the work continues, the Justice Matrix is likely to develop both as a 'data warehouse' and as a platform for analytical development, including forecasting and simulation of policy options. This was an important aim of the project at the outset and, while we have not moved greatly towards it technically, we now feel confident that we have the appropriate architecture in place to make substantive progress. Second, in order to explore the potential of data-mining techniques to model the complexity of Crime and Society, we have used Classification Tree Analysis (also called Decision Trees, or Partition Trees) to model the correlates of crime victimization from the Crime Survey for England and Wales (CSEW), between 2002 and 2013-14. To our knowledge, this is one of the first attempts to use data-mining techniques on social survey data. For each of these sweeps of the CSEW, we have taken the range of crimes and assessed their associations with around 150 features taken from individual respondents' demographic profiles and the household in which they reside.
Since the sample sizes range from 33,000 to 46,000 cases per sweep, the resulting data matrix (cases x variables in GLM terminology), corresponding to the Social and Crime Domains of the Justice Matrix, is large, meriting the application of Big Data Analytics to bring parsimony to complexity (a kind of informatic Occam's Razor). As described below, this has resulted in a large number of Classification Trees. These now become the second phase of our analytical work, which will extend beyond this Project. The analytical strategies to be pursued will be to analyse patterns amongst:
• All Crime Trees in a Year
• Individual Crime Trees across Years
• All Crime Trees across All Years
These strategies will themselves generate complex data that needs to be simplified in order to extract relevant patterns and useful information. The Visualiser (described below) will serve as an essential heuristic device for detecting patterns amongst data, and for suggesting hypotheses, to inform more detailed work. Integrated into the Justice Matrix, the analysis of these Trees will form the core of a programme of future work that may, in time, realise the ambitions of the Justice Matrix to form a solid empirical basis for the analysis of Crime and Society. While the ultimate goal is still quite a way off, the work the team has managed to achieve over the course of this Project has laid the foundations, at least, for a profound change, perhaps even a paradigm shift, in how we can analyse and, hopefully, change the way we think about Crime and Society.
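For readers unfamiliar with Classification Tree Analysis, the general technique referred to above can be illustrated with a very small, self-contained Python sketch. It is offered under clear assumptions: it is not the project's code, the six records and the `tenure` and `age` features are invented toy data rather than CSEW data, and the Gini impurity criterion is simply one common choice for choosing splits.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, features):
    """Return the (feature, value) test that most reduces impurity, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in features:
        for value in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] == value]
            right = [y for r, y in zip(rows, labels) if r[f] != value]
            if not left or not right:
                continue  # this test does not partition the node
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score - 1e-12:
                best, best_score = (f, value), score
    return best


def grow(rows, labels, features, depth=0, max_depth=3):
    """Recursively partition the cases; leaves hold the majority class."""
    split = best_split(rows, labels, features) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, value = split
    yes = [(r, y) for r, y in zip(rows, labels) if r[f] == value]
    no = [(r, y) for r, y in zip(rows, labels) if r[f] != value]
    return {
        "test": split,
        "yes": grow([r for r, _ in yes], [y for _, y in yes], features, depth + 1, max_depth),
        "no": grow([r for r, _ in no], [y for _, y in no], features, depth + 1, max_depth),
    }


def predict(tree, row):
    """Follow the tests from the root until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        f, value = tree["test"]
        tree = tree["yes"] if row[f] == value else tree["no"]
    return tree


# Invented toy records: each row is one respondent, the label is "victim or not".
rows = [
    {"tenure": "renter", "age": "young"},
    {"tenure": "renter", "age": "old"},
    {"tenure": "owner", "age": "young"},
    {"tenure": "owner", "age": "old"},
    {"tenure": "owner", "age": "old"},
    {"tenure": "renter", "age": "young"},
]
victim = [True, True, True, False, False, True]

tree = grow(rows, victim, ["tenure", "age"])
print(tree)
```

In a real analysis one tree of this kind would be grown per crime type and per survey sweep, which is what produces the large collections of Trees whose patterns are to be compared across crimes and years.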
First Year Of Impact 2016
Sector Communities and Social Services/Policy, Government, Democracy and Justice
Impact Types Societal

Description Consultation with UK Statistics Authority about New Generation Crime Statistics
Geographic Reach National 
Policy Influence Type Contribution to a national consultation/review
Description Visiting Fellowship (Dr Tim Hope) 
Organisation Centre for Crime and Justice Studies
Country United Kingdom 
Sector Private 
PI Contribution The Centre for Crime and Justice Studies, London, is a partner in our grant under the ESRC Civil Society Data Partnership programme. The Visiting Fellowship (for 3 years) will allow for the further development and dissemination of our research, and provide a partnership from which to seek further funding, especially from the charitable sector.
Collaborator Contribution CCJS will provide office space and research facilities, an email address, and support Dr Hope's travel expenses.
Impact Dr Hope is assisting CCJS in setting up a group of Visiting Fellows, and planning a programme of collaboration and advice. He is also assisting in discussions about the re-launch of 'Criminal Justice Matters'. Dr Hope writes regularly for the CCJS blog. He also leads CCJS in advising the UK Statistics Authority on 'new generation crime statistics'.
Start Year 2015
Description Seminar presentation, University of Liverpool 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Postgraduate students
Results and Impact Presentation to academic and postgraduate members of the International Crime Research Unit, Department of Law and Criminology, University of Liverpool
Year(s) Of Engagement Activity 2016