Newton STFC-NARIT: Using astronomy surveys to train Thai researchers in Big Data analysis

Lead Research Organisation: University of Sheffield
Department Name: Physics and Astronomy

Abstract

The most effective way of reducing levels of poverty in developing countries is through their continued economic growth, which leads to increased levels of income per person. To remain competitive, however, a developing economy needs access to a workforce with increasingly sophisticated skills. For Thailand today, this means skills that enable innovation, allowing it to compete against other developing and developed economies. With more and more sectors collecting data on their customers, production lines, distribution networks, stock prices, etc., one of the most crucially needed of these "high-level" skills is the ability to handle large amounts of digital data. However, at present, Thai students lack ready access to very large datasets, which prevents their training in this area. Our project addresses this problem by combining the resources and experience of UK astronomers in working with very large datasets with the skills of Thai data scientists in digital data analysis. Through this collaboration, the Thai students will gain access to vast amounts of digital data in the form of astronomical surveys. Working under the supervision of the Thai and UK partners, the students will develop software that will automatically analyse this data and, in doing so, gain important experience in handling vast quantities of digital data. Furthermore, the software resulting from the students' research will form an important component of the pipeline used to process data from a new telescope being developed by UK astronomers, and will thus directly benefit UK science. On completion of the project, the skills acquired by the Thai students will be readily transferable to a diverse range of economic sectors such as information technology, medicine, finance, security, etc., thereby helping the further economic development of Thailand.

Planned Impact

The immediate beneficiaries of this research will be the Thai graduate students who will receive training in high-level data handling skills while conducting their research project. Following the project, these students will be able to apply these skills within a diverse range of growth sectors, such as finance, medicine, logistics, information technology, etc. In doing so, they will benefit the wider Thai economy by helping it grow and compete internationally through innovation.

Secondary beneficiaries will be the scientists and astronomers who are involved in GOTO and NARIT. By developing software to automatically analyse large amounts of digital data, the outcome of the students' research may play an important role in extracting information from GOTO and NARIT's facilities. In the latter case, this could be in the form of automatically analysing data taken by the network of Thai Robotic Telescopes distributed across Thailand and in Chile. This has the prospect of making the data more accessible to students, schoolchildren and the general public, thereby encouraging their interest in astronomy.

Publications

 
Description Early on in the project, we learned that the "real" sources that we want our algorithms to identify are massively outnumbered by the "false" sources that are also detected by traditional methods. This presents a major problem for training machine-learning algorithms, as the algorithm gets very good at identifying the false sources (since there are so many of them) but very poor at identifying the rare real sources. Our students had to develop ways either to expand (upsample) the real sources, so that there were more of them in the data, or to downsample the false sources. We found that downsampling was much more effective and enabled us to train our algorithms far more successfully.
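
As an illustration of the downsampling idea described above, the following is a minimal sketch only, and not the project's actual algorithm. It assumes a feature table in which label 1 marks a "real" source and label 0 a "false" one; the function name `downsample_majority`, the `ratio` parameter and the synthetic data are all hypothetical, introduced purely for this example.

```python
import numpy as np

def downsample_majority(X, y, ratio=1.0, seed=0):
    """Randomly discard "false" sources (label 0) until at most
    ratio * (number of "real" sources, label 1) of them remain."""
    rng = np.random.default_rng(seed)
    real_idx = np.flatnonzero(y == 1)
    false_idx = np.flatnonzero(y == 0)
    # Never ask for more false sources than actually exist.
    n_keep = min(len(false_idx), int(round(ratio * len(real_idx))))
    kept_false = rng.choice(false_idx, size=n_keep, replace=False)
    # Keep every real source plus the sampled false ones, in original order.
    keep = np.sort(np.concatenate([real_idx, kept_false]))
    return X[keep], y[keep]

# Synthetic, made-up data: 10,000 false detections and 20 real ones.
rng = np.random.default_rng(42)
n_false, n_real = 10_000, 20
X = rng.normal(size=(n_false + n_real, 3))
y = np.concatenate([np.zeros(n_false, dtype=int), np.ones(n_real, dtype=int)])

X_bal, y_bal = downsample_majority(X, y, ratio=1.0)
print(np.bincount(y_bal))  # counts of false and real sources after balancing
```

With `ratio=1.0` the balanced set contains equal numbers of real and false sources; a larger ratio keeps more of the false sources at the cost of a less balanced training set.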

Another student involved in the project has been investigating how best to store the data from our telescopes for easy and fast retrieval and analysis. Unfortunately, our initial attempts at doing this proved too slow to ingest the data, so we are investigating newer techniques, which are proving to be far more rapid at ingesting data.
Exploitation Route In our second Newton Award, we are further developing our Machine Learning algorithms. We are exploiting the downsampling techniques developed during the first Newton project, but now we are training more advanced Machine Learning algorithms to work directly on the images, as opposed to features measured from the images (such as shape, brightness, etc.).
Sectors Aerospace, Defence and Marine; Agriculture, Food and Drink; Financial Services, and Management Consultancy

 
Description The primary aim of this Newton-funded project is to provide training to students to develop storage solutions and analytics for large amounts of digital data (Big Data). The students involved in the project have all received such training, with three of the four students now either using their skills within their employment or pursuing further study (at PhD level). It should also be noted that this project has led to three other funded projects: one further Newton Award and two GCRF awards. Within the Newton Award we are further developing our algorithms to store and analyse large amounts of digital data, and training three more students in the process. The GCRF projects are funding collaborative research with Thai businesses from a wide range of different sectors, including the aviation, retail, tourism, and tech sectors. By developing software to help these businesses process and analyse their data, we are providing real economic benefit to Thai industries.
First Year Of Impact 2018
Sector Aerospace, Defence and Marine; Digital/Communication/Information Technologies (including Software); Education; Leisure Activities, including Sports, Recreation and Tourism; Pharmaceuticals and Medical Biotechnology
Impact Types Societal, Economic

 
Description Capacity Building in Software and Hardware Infrastructures and Data Handling through Astronomy
Amount £154,832 (GBP)
Organisation Science and Technology Facilities Council (STFC)
Sector Public
Country United Kingdom
Start 04/2018 
End 03/2020
 
Description Global Challenges Research Fund Foundation Award 2017
Amount £73,977 (GBP)
Funding ID ST/R002614/1 
Organisation Science and Technology Facilities Council (STFC)
Sector Public
Country United Kingdom
Start 02/2018 
End 03/2019
 
Title Machine-learning-based methods to handle extremely unbalanced datasets 
Description The project required us to identify a small number of true positives amongst a large number of false positives. The ratio of false to true positives is extremely high, much higher than is typically encountered by machine learning algorithms. As such, off-the-shelf algorithms were unable to cope with this data, so we had to develop our own, which is working well. 
Type Of Material Computer model/algorithm 
Year Produced 2018 
Provided To Others? Yes  
Impact It has enabled us to identify the small number of true positives in our data. It is too early to say what wider implications this will have on other research. 
 
Description Big Data Workshop held at Chiang Mai University for the benefit of staff and students.
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Undergraduate students
Results and Impact Roughly 60 staff and students attended a workshop in which astronomers and data scientists delivered presentations describing the levels of data output from astronomy and potential synergies with data science.
Year(s) Of Engagement Activity 2017