eHive-RPC: A Remote Procedure Call Public Interface for eHive

Lead Research Organisation: European Bioinformatics Institute
Department Name: Ensembl Group

Abstract

Ensembl developed 'eHive' as a production system that manages and optimizes the running of tasks (called 'jobs') on a compute cluster that may have thousands of Central Processing Units (CPUs). A CPU is the hardware within a computer that carries out the instructions of a computer program by performing the basic arithmetic, logic, control and input/output operations specified by those instructions. Each computer has one or more CPUs.

Some compute clusters comprise many thousands of CPUs distributed amongst many computers. With so many computers and CPUs, it is important that jobs are sent to these CPUs in a fair and efficient manner, especially when many users are competing for the same resources. Clusters usually rely on a central queuing system that holds a list of all the jobs that need to be run and can give individual computers in the cluster explicit instructions about which job to execute. This type of queuing system works well if the jobs each take an hour or more to complete. However, when jobs complete faster than they can be scheduled, e.g. when a job executes in minutes or less, a processing bottleneck is created. The usual way to remove the bottleneck is to implement another system on top of the scheduler that 'batches' similar jobs together to make operations more efficient.

eHive's novel solution to the issue of job queuing is to move away from this central job scheduling: eHive is a 'distributed' processing system based on 'autonomous agents' with the behavioural structure of honeybees, hence the name 'eHive'. eHive maintains the ability to monitor and track jobs via a central 'blackboard'. Workers are efficiently created on a compute cluster, known as a meadow, with no specific task assigned to them. Once running, each worker contacts the blackboard, finds the most suitable kind of job, specializes so that it can claim work of that kind, and runs multiple jobs of this type in a row. Workers are able to re-specialize to claim other types of job once they exhaust their original designation. Each worker regularly updates its status on the blackboard so that other workers can optimize the overall job distribution.
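The worker lifecycle described above can be summarized in a minimal conceptual sketch (Python is used purely for illustration; the Blackboard class, job types and claim heuristic are assumptions, not the real eHive schema or API):

    # Conceptual sketch only: a shared blackboard of job queues and a worker
    # that repeatedly specializes, claims and runs jobs until none remain.
    from collections import deque

    class Blackboard:
        def __init__(self, jobs_by_type):
            self.jobs_by_type = {k: deque(v) for k, v in jobs_by_type.items()}
            self.worker_status = {}

        def most_suitable_type(self):
            # Toy heuristic: pick the job type with the most outstanding jobs.
            pending = {k: len(v) for k, v in self.jobs_by_type.items() if v}
            return max(pending, key=pending.get) if pending else None

        def claim(self, job_type):
            queue = self.jobs_by_type[job_type]
            return queue.popleft() if queue else None

    def run_worker(worker_id, blackboard):
        # Specialize, drain jobs of that type, then re-specialize until done.
        while (job_type := blackboard.most_suitable_type()) is not None:
            blackboard.worker_status[worker_id] = f"specialized:{job_type}"
            while (job := blackboard.claim(job_type)) is not None:
                print(f"worker {worker_id} runs {job_type} job {job}")
        blackboard.worker_status[worker_id] = "finished"

    run_worker("w1", Blackboard({"align": [1, 2, 3], "summarize": [4]}))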

The benefits of eHive are (a) a reduction in the overhead of processing individual jobs, (b) an increase in the maximum number of tasks that can be running at any one time, (c) an increase in tolerance to faults in the compute cluster, and (d) support for complex processes running in parallel.

Although eHive was originally designed for the purposes of Ensembl, its functionality is applicable to all data types that have large compute requirements. In this project we aim to expand the possibilities of eHive further by developing a Remote Procedure Call (RPC) system for eHive. This will allow jobs to run on remote clusters as well as local clusters, thereby extending the use of eHive to multiple compute clusters and cloud computing services. This will enable wider use of eHive within data-intensive fields in the life sciences and beyond.

Technical Summary

eHive-RPC aims to provide a system for making efficient Remote Procedure Calls (RPC) for concurrent pipelines. eHive is a distributed processing system built by the Ensembl Project and used to control single pipelines on large-scale compute farms.

RPC allows inter-process communication, meaning that local code can cause the execution of code in another address space. There is no limit on how long the remote task may take, which makes blocking while waiting for the response inefficient. Instead, developers use asynchronous solutions that allow the local work to be queued. Some solutions immediately hand back a token that identifies a unit of work and require the client to poll periodically to ask whether the work is finished. Polling is usually considered inefficient too, as clients only learn of completion at the next poll and so cannot proceed the moment the remote call has finished.
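As an illustration of the token-and-poll pattern described above, here is a hedged sketch (the endpoints, JSON fields and the use of the requests library are hypothetical assumptions, not part of eHive or any specific RPC framework):

    # Sketch of token-based polling: submit work, then repeatedly ask whether
    # it has finished. The client only notices completion at the next poll.
    import time
    import requests  # third-party HTTP client (pip install requests)

    def submit_and_poll(server, payload, interval=30):
        # The server immediately returns a token identifying the unit of work.
        token = requests.post(f"{server}/jobs", json=payload).json()["token"]
        while True:
            status = requests.get(f"{server}/jobs/{token}").json()["status"]
            if status in ("DONE", "FAILED"):
                return status
            time.sleep(interval)  # lag of up to `interval` seconds per check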

Our system attempts to solve this problem by extending the eHive workflow system. eHive is essentially event-driven, and is able to control concurrent work units using semaphores. In this context, a blocking work unit can represent a remote call, and another, blocked, work unit can be local. The latter can be unblocked using an inter-pipeline semaphore system when the RPC call has completed. The remote server will communicate task completion, including appropriate error reporting.
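To make the semaphore idea concrete, a minimal sketch of the principle follows (the class and function names are illustrative only, not eHive internals):

    # A counting semaphore: the blocked local work unit is released the moment
    # the last blocking (remote) work unit reports completion, with no polling.
    class Semaphore:
        def __init__(self, count, on_release):
            self.count = count            # number of blocking work units
            self.on_release = on_release  # local job to run once unblocked

        def decrement(self):
            self.count -= 1
            if self.count == 0:
                self.on_release()         # event-driven release

    def local_followup():
        print("local job unblocked: the remote call has completed")

    sem = Semaphore(count=1, on_release=local_followup)
    # When the remote server reports completion (e.g. via an HTTP callback),
    # the semaphore is decremented and the blocked local job starts at once.
    sem.decrement()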

eHive-RPC also requires efficient two-way data transfer between local and remote servers. Here we aim to extend an existing eHive mechanism called accumulators and move towards the transparent transfer of data between caller and remote server.

Our implementation will be integrated into the existing eHive project as both a client and a server, making any eHive pipeline an RPC target. We expect the system to be generic and usable by any client or workflow engine, e.g. Galaxy or Taverna. We also aim to disseminate eHive knowledge by hosting two training courses alongside extensive documentation of the protocols and message formats developed in this project.

Planned Impact

The last decade has seen the advancement of laboratory techniques that enable research data to be produced more cheaply and quickly, e.g. 'Next-Generation' DNA sequencing. In addition, new laboratory techniques now make it possible to probe new areas of biology, such as gene regulation through epigenetic mechanisms. These rapid improvements in methodology impact many different disciplines in the life sciences, from basic research to applied areas such as plant and animal breeding.

The primary beneficiaries of this proposed development for eHive will be Bioinformaticians in academia and industry, both in the UK and beyond, including those supporting research by analyzing data and those producing and maintaining archives and data resources for the research community. Bioinformaticians are actively developing new algorithms and more efficient means of handling and processing these data, so that they can be interpreted quickly and accurately. Processing these data requires large-scale compute, and managing them on compute clusters becomes an increasing challenge.

World-leading pharmaceutical companies, bioinformatics service companies, and animal breeding companies have in-house Bioinformaticians to produce customized analysis of private data. These companies therefore have the expertise to use software such as eHive. Evidence that EMBL-EBI supports these areas includes our long-standing Industry Programme and the more recently announced Centre for Therapeutic Target Validation (CTTV).

Suppliers of open source and commercial 'omics tools (e.g. Taverna and Galaxy) will also benefit from access to compute farms and software that our eHive development will provide.

Enabling research in these areas impacts socio-economic outcomes, contributing both to basic research that promotes understanding and to the wider public through improved health care for humans and animals and increased productivity in agriculture.

How will these users benefit?
In this grant we propose to undertake some major enhancements to eHive. These enhancements will set the stage for the use of eHive in a wider context than is currently possible, and we believe that this will make it a more appealing and more accessible tool for Bioinformaticians and Bioscientists who wish to run large-scale compute.

Enabling eHive to communicate across more than one compute cluster will bring about novel use-cases for eHive:

- For time-critical work, being able to redirect certain job types to a second cluster (e.g. a compute cloud) will allow the required workload to be completed within the limited time frame. Running a pipeline within only one compute cluster can be a disadvantage when that cluster is being heavily used by other users, or when the capacity of the cluster is small.

- Being able to run a pipeline that spans more than one compute cluster will enable the user to run sections of their pipeline on the most appropriate cluster for that type of job. Some compute clusters are optimized for a particular type of job, and may work well for one type of pipeline but not another: compute clusters may be optimized for the number of jobs running simultaneously, the number and size of files stored on disk (several small files versus few large files), multi-threaded jobs, etc.

Support for eHive is growing and we now have users in both academia (the Roslin Institute, EMBL-EBI, and Gramene at Cold Spring Harbor, USA) and industry (Eagle Genomics). Having a common job scheduling tool promotes efficiency, as one group can focus on developing the tool and all groups benefit from the improvements.

We support outreach activities, and there is demand for training workshops (Eagle Genomics, pers. comm. with Kathryn Beal). Our mailing list is open, and developers from any background are free to ask questions, offer help, and share opinions on the use of eHive.

Publications

Aken BL (2017) Ensembl 2017. Nucleic Acids Research
Cunningham F (2019) Ensembl 2019. Nucleic Acids Research
Cunningham F (2022) Ensembl 2022. Nucleic Acids Research
Howe KL (2021) Ensembl 2021. Nucleic Acids Research
Yates A (2016) Ensembl 2016. Nucleic Acids Research
Yates AD (2020) Ensembl 2020. Nucleic Acids Research
Zerbino DR (2018) Ensembl 2018. Nucleic Acids Research

 
Description During the course of this award we have repurposed a number of existing eHive concepts into a remote procedure calling framework and produced an efficient and scalable solution. We have expanded our outreach and documentation of the eHive system to increase its usefulness, both to researchers and to those who run large-scale processing workflows in research and industry.

One of the biggest barriers to using remote procedures efficiently is the time it takes to tell a calling process that a remote task has completed. Remote procedure frameworks normally do this with a generated unique identifier and periodic checking of the status associated with that identifier. Periodic checking of an identifier is a slow process and introduces a lag between one process finishing and a new one starting. eHive no longer relies on this, as its key control framework, semaphores, has been expanded to work in this setting.

Semaphores are counters that are decremented as work completes and that start dependent tasks when they reach zero. The semaphores are sent back to the caller, meaning dependent jobs can start the moment a remote task finishes; this provides the best of both previously mentioned solutions. In addition, eHive can now refer to local and remote data structures through the use of URLs (Uniform Resource Locators) similar to those used on the web. Our Application Programming Interfaces (APIs) hide this complexity, meaning pipelines do not need to be significantly adapted to work with remote calls. We have implemented this handoff procedure using a simple HTTP API and have demonstrated internally that the system works.
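A hedged sketch of this handoff, as it might look over a simple HTTP interface (the callback URL, endpoint and JSON fields below are hypothetical illustrations, not the actual eHive HTTP API):

    # Sketch of the push-based handoff: when the remote pipeline finishes the
    # work for a request, it posts one message to a callback URL registered by
    # the caller. That message decrements the caller's semaphore (so dependent
    # local jobs start immediately) and carries the accumulated output with it.
    import requests  # third-party HTTP client (pip install requests)

    def on_remote_task_finished(result, callback_url):
        requests.post(callback_url, json={
            "event": "semaphore_decrement",
            "status": "DONE",
            "output": result,  # data travels back with the control event
        })

    # Example: a caller might have registered a URL such as
    # http://caller.example.org/hive/semaphore/42 when creating the remote jobs.
    # on_remote_task_finished({"gc_content": 0.41},
    #                         "http://caller.example.org/hive/semaphore/42")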

We have also reached out to two major communities to promote the usage of eHive and of the products of this research. The first comprises communities that already use eHive pipelines or work in the same domain as us. The second is the Perl programming community, to target those who do not work in research and expand the user base of eHive. Since completion of the grant we have built on this work to expand the number of languages understood by eHive to include Python and Java. Both are popular bioinformatics languages and help to encourage wider adoption of the eHive tool.
Exploitation Route All documentation and training materials have been made publicly available. The developments made on this project will continue to be supported and enhanced by the Ensembl team for the foreseeable future. This extends to both feature development and bug fixes. Through our outreach to industry, via the Perl conference and workshop circuit, we aim to attract a new audience of eHive users and will continue this advertising process outside of the bioinformatics community. In addition we will continue advertising, training and outreach within bioinformatics as this is where the majority of our user community resides. Our work on expanding eHive to other programming languages makes the platform much more appealing as we reduce the dependencies on Perl.
Sectors Digital/Communication/Information Technologies (including Software), Pharmaceuticals and Medical Biotechnology

 
Title Task 1: Efficient control flow transfer between pipelines 
Description We have enabled remote procedure calling (RPC) in the eHive system and have demonstrated internally remote job execution over an HTTP interface. Input data is sent to the remote server during job creation and sent back when the job finishes execution. The system now implements semaphore control across the whole framework and uses these semaphores extensively to both halt and restart pipelines. Semaphores are handled locally but transmit their state to a remote calling system once certain conditions have been met, i.e. all jobs have been run. During this period the pipeline that created the remote task does not need to pause and can continue with local processing. Throughout these developments the internal eHive Application Programming Interfaces have been designed to hide this complexity from the pipeline designer. 
Type Of Technology Software 
Year Produced 2017 
Open Source License? Yes  
Impact The development shows that RPC over an HTTP interface is possible within the eHive infrastructure. Jobs may be created locally in one eHive pipeline and executed elsewhere. Both developments enable the transfer of a pipeline's control flow to a remote system for further execution, and do so in a manner that reduces network traffic between local and remote resources. This ensures remote pipeline execution is achievable, usable and scalable within a closed environment, e.g. a master eHive pipeline executing tasks in one or many other pipelines. Providing this system as an external service is not possible due to security constraints and issues concerning unintended denial-of-service attacks through excessive use by a single user. 
 
Title Task 2: Efficient data transfer between eHive pipelines 
Description Data transfer between remote pipeline jobs is essential to the smooth execution of tasks in the eHive RPC framework. We have extended our current framework to support write operations between pipeline databases. In addition we attach data output from jobs to semaphores and use these to send results to downstream jobs. This ensures a single transaction of work is sent back to the calling pipeline with all data required for further downstream processing. 
Type Of Technology Software 
Year Produced 2017 
Open Source License? Yes  
Impact The software is capable of robust data transfer between a local and remote pipeline ensuring pipeline stability. This will aid in future developments around the RPC framework. 
 
Title Task 3: Efficient Transfer of Events Back to the Calling Pipeline 
Description We have completed our analysis of how to pass events back to a calling pipeline within the eHive infrastructure. These advances will be made available in the upcoming 2.5 release of the software. 
Type Of Technology Software 
Year Produced 2017 
Open Source License? Yes  
Impact We have discovered that communicating in-depth reports of remote pipeline errors (e.g. a remote task running out of memory) back to the calling pipeline is counterintuitive. We now believe that it is the remote pipeline's responsibility to communicate failure back to a calling pipeline only when tasks cannot be completed. User errors, e.g. those caused by incorrectly formatted data, must also be inspected by the remote calling pipeline. This increases the cost of building pipelines to be used in a remote mode, as they must be capable of handling multiple error conditions in order to keep the calling pipeline's logic simple. The callback system developed is capable of transmitting remote job status back to the calling pipeline and does so on completion of the remote task. Transmitting errors is still problematic and has been simplified to reporting whether the remote task completed or failed. All parts are implemented using our internal semaphore system. 
 
Title Task 4: Outreach 
Description We have redeveloped our existing documentation into a new online manual using ReadTheDocs (a popular online documentation host and content generator), taking the opportunity to review and restructure the existing material. Our tutorial has been significantly improved. Specifically, we have built a number of bioinformatics-oriented example pipelines, such as a GC-content pipeline and a k-mer counting pipeline. Both are common bioinformatics workflows and can be used to explain high-level eHive concepts through pipelines that are predictable and can be inspected while running. 
Type Of Technology Software 
Year Produced 2017 
Open Source License? Yes  
Impact Our documentation is clearer to understand thanks to its better structure and its deployment onto a common, well-understood platform. ReadTheDocs provides multiple output formats (including PDF and ePub) and holds versioned manual releases. ReadTheDocs is also indexed by widely used search engines such as Google and Bing, meaning developers can find these materials far more easily than before. The addition of bioinformatics example workflows will help new users to understand how eHive pipelines are formed and how to build their own pipelines. We believe these improvements will help to drive future adoption of eHive. 
URL http://ensembl-hive.readthedocs.io/en/master/
 
Description Ensembl workshop in Taiwan 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Ensembl organised a comprehensive workshop in Taiwan. Part of the content was on eHive, presented by Brandon Walts in Hsinchu and Taipei.
Year(s) Of Engagement Activity 2017
URL http://training.ensembl.org/events/2017/2017-02-15-ehivetaiwan
 
Description Participation in an activity, workshop or similar - YAPC::EU Perl Conference 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Andrew Yates presented the eHive software toolkit as a lightning talk at YAPC::EU, a Europe-wide conference for Perl programmers. The talk was well received and triggered extensive discussion with other programmers. This was part of a much larger engagement at YAPC::EU concerning bioinformatics.
Year(s) Of Engagement Activity 2016
URL http://act.yapc.eu/ye2016/talk/6929
 
Description Perl workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Brandon Walts gave a talk at a Perl workshop at the University of Westminster on 'Perl meets big data and high performance computing with the eHive framework'.
Year(s) Of Engagement Activity 2016
URL http://act.yapc.eu/lpw2016/
 
Description Poster presentation at XLDB 2017 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Brandon Walts presented the eHive platform at the XLDB (Extremely Large Databases) conference. Attendees expressed interest in the eHive processing infrastructure during the poster session.
Year(s) Of Engagement Activity 2017
 
Description eHive training workshop at the Roslin Institute 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Leo Gordon and Brandon Walts travelled to the Roslin Institute to share their expertise on eHive, and the developments that had been made during the grant. They trained Roslin staff so that they are able to deploy eHive at their institute.
Year(s) Of Engagement Activity 2017