Big Data Challenges in Graph Databases

Lead Research Organisation: Newcastle University
Department Name: School of Computer Science

Abstract

A distributed graph database partitions graph data across multiple machines in a cluster. Partitioning a graph is non-trivial, and much research has been dedicated to finding the optimal solution. Regardless of the chosen partitioning, some edges will be distributed between servers; that is, the nodes an edge connects will reside on separate machines. Maintaining the correctness of distributed edges while records are being concurrently modified is a challenge.

Several contemporary distributed graph databases (e.g. JanusGraph) use existing NoSQL databases (e.g. Apache Cassandra) as a storage backend, adapting them with an API in order to handle a graph data model. This approach is attractive because the underlying store offers high scalability. However, such a store offers only weak isolation guarantees, which has serious ramifications for the integrity of graph database systems. It has recently been established that weak isolation can corrupt distributed edges, and that this can then lead to significant, irreversible database corruption within intervals that are worryingly short compared to the lifetime of a database.
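As an illustration of how a distributed edge can be corrupted, the following is a minimal single-process sketch. It assumes each edge is stored as two reciprocal records, one at each endpoint's partition, and that two writers interleave their per-partition writes with no isolation between them. The names Partition, one_sided_edges and interleaved_delete_and_add are hypothetical and do not reflect the storage format of JanusGraph or Apache Cassandra.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """One server's share of the graph: node id -> set of neighbour ids."""
    name: str
    adjacency: dict = field(default_factory=dict)

    def add_half_edge(self, node, neighbour):
        self.adjacency.setdefault(node, set()).add(neighbour)

    def remove_half_edge(self, node, neighbour):
        self.adjacency.get(node, set()).discard(neighbour)


def one_sided_edges(p1, p2):
    """Return (node, neighbour) records whose reciprocal record is missing."""
    def records(p):
        return {(n, m) for n, nbrs in p.adjacency.items() for m in nbrs}

    all_records = records(p1) | records(p2)
    return sorted((a, b) for (a, b) in all_records if (b, a) not in all_records)


def interleaved_delete_and_add():
    """Replay one unlucky interleaving of two unisolated writers (assumed schedule)."""
    p1, p2 = Partition("server-1"), Partition("server-2")
    # Initial state: edge a-b, with node "a" on server-1 and node "b" on server-2.
    p1.add_half_edge("a", "b")
    p2.add_half_edge("b", "a")

    # Writer W1 deletes the edge; writer W2 re-adds it. Each needs two writes.
    p1.add_half_edge("a", "b")      # W2, step 1 (no visible change)
    p1.remove_half_edge("a", "b")   # W1, step 1
    p2.remove_half_edge("b", "a")   # W1, step 2
    p2.add_half_edge("b", "a")      # W2, step 2
    return p1, p2
```

Calling one_sided_edges on the two partitions returned by interleaved_delete_and_add reports the record at node "b" that no longer has a counterpart at node "a". Neither writer's intent (edge present, or edge absent) is reflected consistently at both endpoints.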

Maintaining the correctness of edge information at nodes residing on separate partitions requires the introduction of ACID transactions into the graph database. A straightforward approach may use existing solutions: two-phase commit for atomicity and durability, pessimistic two-phase locking for concurrency control, and Paxos for data replication. However, this architecture would severely harm the performance of graph database workloads.
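To make the straightforward approach concrete, here is a minimal in-memory sketch of two-phase commit applied to a single distributed edge insertion, with a crude per-node lock standing in for pessimistic locking. It is an illustration under stated assumptions, not the protocols this research aims to develop: the names Participant and add_distributed_edge are hypothetical, the two endpoints are assumed to live on different partitions, and logging for durability, failure recovery and replication are all omitted.

```python
class Participant:
    """A partition taking part in two-phase commit (hypothetical sketch)."""

    def __init__(self, name):
        self.name = name
        self.adjacency = {}   # node id -> set of neighbour ids
        self.locked = set()   # nodes locked by an in-flight transaction
        self.staged = None    # (node, neighbour) awaiting commit

    def prepare(self, node, neighbour):
        """Phase 1: vote yes only if the node is not already locked."""
        if node in self.locked:
            return False
        self.locked.add(node)
        self.staged = (node, neighbour)
        return True

    def commit(self):
        """Phase 2: apply the staged write, then release the lock."""
        node, neighbour = self.staged
        self.adjacency.setdefault(node, set()).add(neighbour)
        self._release()

    def abort(self):
        """Phase 2: discard the staged write and release the lock."""
        self._release()

    def _release(self):
        if self.staged is not None:
            self.locked.discard(self.staged[0])
            self.staged = None


def add_distributed_edge(src_part, src, dst_part, dst):
    """Coordinator: write both reciprocal records or neither."""
    writes = [(src_part, src, dst), (dst_part, dst, src)]
    prepared = []
    for part, node, neighbour in writes:
        if part.prepare(node, neighbour):
            prepared.append(part)
        else:
            # One "no" vote aborts every participant that already voted yes.
            for p in prepared:
                p.abort()
            return False
    for p in prepared:
        p.commit()
    return True
```

Even in this toy form the cost is visible: every edge that crosses a partition boundary needs two rounds of messages, and each endpoint stays locked for the duration of both rounds.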

The focus of this research is to develop a suite of efficient distributed transaction protocols tailored for graph database workloads, which maintain the correctness of distributed edges and offer suitable transactional throughput and latency. Application-specific use cases that can tolerate weaker isolation levels, and can therefore achieve better performance, will also be explored, and any conceptual underpinnings that permit such safe relaxations will be derived.

Planned Impact

The CDT will have impact in a range of areas:


Industrial and Public Sector Impact

The Centre's main impact will be made through its graduates: it will develop highly skilled researchers with the theoretical and practical skills to transform existing organisations, and create successful new companies.

We have already obtained commitment from 30 partner organisations both large and small, regional, national and international, who wish to work closely with the CDT (as evidenced by the letters of support). Impact on them will come through students working on projects specified by partners, students being placed with partners during their PhD, and ultimately through students moving into positions of influence in organisations when they graduate.
The norm for all software developed in the CDT will be to release it as open source so that it can be exploited by industry. In our experience this can attract companies and be a catalyst for productive collaboration: code from our previous projects has been widely used internationally.


Economic Impact

The global cloud computing market is expected to grow from $38 billion in 2010 to $121 billion in 2015 (M&M, 2013). Working productively with partners will maximise the chances of economic impact, which will come through organisations using their newfound skills, expertise and tools to realise their potential to transform themselves.

UK industry faces a huge skills gap in this area. Demand for big data staff has risen exponentially (912%) over the past five years from 400 advertised vacancies in 2007 to almost 4,000 in 2012 (e-skills UK, Jan 2013). Over the next five years analysts forecast a 92% rise in the demand for big data skills with around 132K new jobs being created in the UK (e-skills UK, Jan 2013). The CDT will provide expert practitioners to fill this gap.

Newcastle City Council is setting up the £2M cloud business engagement facility, which will be co-located with the CDT, because it believes that it can transform the local economy by up-skilling existing workers. This investment brings funding for CPD, cloud events and other outreach activities that will disseminate the knowledge developed in the CDT.


Societal Impact

We will build on the knowledge and pathways created in the Social Inclusion through the Digital Economy Hub (SiDE: 2009-15), which is tackling big data challenges across a range of areas of societal importance, e.g. healthcare and mobility for older people. We will build on our existing, long-term relationships with SiDE partners; maintain our links with organisations that represent disadvantaged groups; and work directly with users through the 3000-person User Pool created by the SiDE project.

The CDT also has a strong set of investigators tackling key healthcare challenges through the use of cloud computing in medicine, biology and neuroscience. These subjects are now under a deluge of data, and increasingly researchers (including those in the pool of potential supervisors for this CDT) are using cloud computing to extract knowledge from it.

An annual public engagement open-day will disseminate the CDT's work to a diverse audience.


Academic Impact

Academic impact will come from the graduating students (some of whom will stay in academia), ideas (through publications), the publication of open source software and our delivery of training courses to other CDTs and researchers.

The placing of CDT students at our overseas partner universities, Berkeley and PUCRS, Brazil (please see letters of support), will provide a way for our students' research to have direct international impact.
