Big Data Analysis Techniques Applied to the NA62 Experiment at CERN

Lead Research Organisation: Lancaster University
Department Name: Physics

Abstract

The NA62 experiment at CERN aims to measure precisely the branching ratio, of order 10^-10, of the decay of a positive kaon into a positive pion and a neutrino-antineutrino pair (Kpnn). In 2016 the experiment collected enough kaon decays to observe Kpnn; in 2017 NA62 took a factor of ten more data, and comparable statistics are expected in 2018. The analysis of the 2016 data was carried out mostly with a cut-based technique applied to raw-level quantities. This is a proven method for observing Kpnn, but it does not provide enough signal acceptance for a precise branching-ratio measurement and it does not scale easily to larger data samples. The present project aims to develop an efficient data reduction scheme for NA62 and to apply multivariate techniques to the Kpnn analysis.
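
To make the starting point concrete, the following is a minimal Python/pandas sketch of what a cut-based selection on raw-level quantities looks like; the column names and thresholds are purely illustrative and are not the NA62 selection.

    import pandas as pd

    def select_kpnn_candidates(events: pd.DataFrame) -> pd.DataFrame:
        """Apply rectangular cuts and return the surviving candidates.

        Column names and cut values are illustrative placeholders only.
        """
        mask = (
            events["track_momentum"].between(15.0, 35.0)   # GeV/c, illustrative
            & events["m2_miss"].between(0.0, 0.01)          # GeV^2/c^4, illustrative
            & ~events["has_extra_photon"]                   # photon-veto flag
        )
        return events[mask]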
Already in 2017, NA62 produced over a petabyte of raw data, which are being processed using a traditional high-energy-physics (HEP) analysis model. This model employs several stages of data processing: calibration, reconstruction, data quality assessment and filtering. Because of computing-throughput considerations, physics analysis can only start on the filtered datasets, with access to data quality and calibration information being crucial. From the data science perspective there are several noteworthy features of this approach. The first is that the intermediate reconstructed dataset, expected to reach of order 10 petabytes by the end of 2018, is of little direct use to the physicist. A data science approach would instead annotate the input raw data to produce a calibrated raw dataset that could then be queried directly at analysis level. Secondly, the reconstructed dataset is three times the size of the input raw dataset: this is normal for the commissioning phase of an experiment, but it is not sustainable in the exploitation phase.
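
The staged model described above can be summarised schematically as follows; this is a conceptual sketch only, with placeholder stage implementations, and not the NA62 software chain.

    # Conceptual sketch of the traditional staged HEP processing model;
    # the stage bodies are placeholders, not NA62 code.
    def calibrate(raw_event, calib_db):
        """Attach calibration constants to the raw event."""
        return {**raw_event, "calib": calib_db.get(raw_event["run"], {})}

    def reconstruct(calibrated_event):
        """Build higher-level objects (tracks, clusters) from calibrated hits."""
        tracks = list(calibrated_event.get("hits", []))   # placeholder "tracking"
        return {**calibrated_event, "tracks": tracks}

    def passes_quality(reco_event, bad_runs):
        """Drop events from runs flagged by the data-quality assessment."""
        return reco_event["run"] not in bad_runs

    def keep_for_analysis(reco_event):
        """Keep only events matching a loose physics pre-selection."""
        return len(reco_event["tracks"]) >= 1

    def process(raw_events, calib_db, bad_runs):
        for raw in raw_events:
            reco = reconstruct(calibrate(raw, calib_db))
            if passes_quality(reco, bad_runs) and keep_for_analysis(reco):
                yield reco   # only this filtered output reaches the analyst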
The proposed data science project will address the current limitations of the NA62 analysis model and reduce the time needed to produce physics results. In the first stage of the project, the reconstructed data size and I/O performance will be studied with a view to reducing the size and improving the I/O throughput on NA62 computing resources. Both the reconstructed and the filtered datasets will need to be studied to ensure the best performance for end-user analysis and for bulk data-processing workflows. In the second stage, the analysis model itself will be studied with a view to leveraging data science approaches to reduce the time to physics insight. Several approaches are worth investigating, including data homogenisation (reducing the complexity of analysis code) and analysis homogenisation (reducing the complexity of analysis workflows). Novel approaches also include Spark-style analysis, which requires a dedicated analysis facility to provide the infrastructure and could be particularly interesting when considering a more unified approach to supporting future HEP computing demands.
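
As an illustration of what a Spark-style analysis query could look like, the sketch below assumes the reduced dataset has been exported to a columnar format such as Parquet; the file path and column names are hypothetical.

    # Minimal sketch of a "Spark-style" analysis query on a reduced dataset;
    # the path and column names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("na62-kpnn-skim").getOrCreate()

    events = spark.read.parquet("/eos/na62/reco_2017.parquet")   # hypothetical path

    candidates = (
        events
        .filter((F.col("track_momentum") > 15.0) & (F.col("track_momentum") < 35.0))
        .filter(F.col("n_extra_clusters") == 0)
    )

    candidates.groupBy("run").count().show()   # per-run candidate yields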
As a final test of the effectiveness of the new analysis model, the project aims to apply machine learning techniques to the Kpnn analysis with the goal of increasing the signal acceptance. The development of these techniques will take advantage of the data reduction to create and optimise training, validation and testing samples efficiently, which are the core of any successful machine learning application to data analysis. The impact of these techniques on particle identification, photon rejection and tracking will be studied, and several algorithms will be investigated, using HEP-specific packages such as TMVA but also exploring solutions from outside HEP, such as the scikit-learn and Keras packages.
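
A minimal sketch of the sample-preparation and training workflow is shown below, using scikit-learn on placeholder (randomly generated) inputs in place of the reduced NA62 samples; TMVA or Keras could be substituted for the classifier without changing the structure.

    # Minimal sketch: train/validation/test splitting and a multivariate
    # classifier on placeholder data (random numbers stand in for reduced
    # NA62 samples).
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 8))        # 8 placeholder input features
    y = rng.integers(0, 2, size=10_000)     # placeholder signal/background labels

    # Split into train+validation and test, then carve out the validation set.
    X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev, test_size=0.25, random_state=0)

    clf = GradientBoostingClassifier().fit(X_train, y_train)
    print("validation AUC:", roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1]))
    print("test AUC:      ", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))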

Publications


Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
ST/P006795/1                                   01/10/2017  30/09/2024
2039270            Studentship   ST/P006795/1  01/12/2017  31/05/2022  Joseph Carmignani
 
Description I have implemented a machine learning algorithm based on a neural network (NN) that learns from the data a mapping between inputs and outputs and uses its predictions to classify events; a minimal sketch of such a classifier is given after the list of objectives below.
The objectives achieved are the following:
- The NN algorithm works well: it can learn the relevant physics features and shows higher performance even when trained on raw data (low-level variables).
- NNs are flexible: the architecture can be adjusted to suit our data.
- Feature engineering is essential to take full advantage of the NN's potential (in parameter space).
- The NN predictions were stable across runs and outperformed a classical discriminant based on a logarithmic likelihood.
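
A minimal, self-contained sketch of this kind of comparison is given below: a small dense network trained on synthetic "low-level" variables, scored by ROC AUC against a log-likelihood-ratio discriminant built from per-variable Gaussian fits. The data, architecture and variable count are illustrative only and do not reproduce the analysis described above.

    # Illustrative sketch only (not the analysis NN): a small dense network on
    # synthetic "low-level" variables, compared via ROC AUC with a
    # log-likelihood-ratio discriminant built from per-class Gaussian fits.
    import numpy as np
    from scipy.stats import norm
    from sklearn.metrics import roc_auc_score
    from tensorflow import keras

    rng = np.random.default_rng(1)
    n = 20_000
    y = rng.integers(0, 2, size=n)
    # Two classes with shifted means in 6 placeholder low-level variables.
    X = rng.normal(loc=y[:, None] * 0.5, scale=1.0, size=(n, 6))
    X_train, X_test = X[: n // 2], X[n // 2:]
    y_train, y_test = y[: n // 2], y[n // 2:]

    # Neural-network classifier.
    model = keras.Sequential([
        keras.layers.Input(shape=(6,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.fit(X_train, y_train, epochs=5, batch_size=256, verbose=0)
    nn_score = model.predict(X_test, verbose=0).ravel()

    # Classical log-likelihood-ratio discriminant from per-class Gaussian fits.
    llr = np.zeros(len(X_test))
    for j in range(X_train.shape[1]):
        mu_s, sig_s = X_train[y_train == 1, j].mean(), X_train[y_train == 1, j].std()
        mu_b, sig_b = X_train[y_train == 0, j].mean(), X_train[y_train == 0, j].std()
        llr += norm.logpdf(X_test[:, j], mu_s, sig_s) - norm.logpdf(X_test[:, j], mu_b, sig_b)

    print("NN  AUC:", roc_auc_score(y_test, nn_score))
    print("LLR AUC:", roc_auc_score(y_test, llr))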
Exploitation Route Future work will use the NN for background rejection, in addition to:
- Work on a Monte Carlo (MC) sample to allow comparison with data.
- Development of a more general algorithm to handle fake tracks built from pile-up and kaon hits in the NA62 experiment at CERN.
- Study of intensity effects on the build-up of accidental beam activity in the sub-detectors.
- Implementation of the NN and other MVA methods in the main analysis.
- Investigation of other areas of the analysis that could benefit from a multivariate analysis (e.g. tracking in the STRAW spectrometer, calorimetry and the definition of the signal region).
Sectors Digital/Communication/Information Technologies (including Software), Education, Other