String Constraint Solving with Real-World Regular Expressions

Lead Research Organisation: Royal Holloway, University of London
Department Name: Computer Science

Abstract

Strings in programming languages are sequences of characters that represent any kind of text. They are a fundamental aspect of information representation: user names, passwords, or indeed any kind of text are handled as strings.

The manipulation of strings, however, can also lead to subtle programming errors, that can have consequences for program correctness as well as information security. For example, a malicious user may enter computer code as their username. If the system is not sufficiently secure, this code can allow the user to hack the system. The Open Web Application Security Project lists this kind of attack among the top 10 application security risks.

Despite this kind of attack being well-known, it has proved surprisingly difficult to avoid due to the complex nature of computer programs. Ideally, programming mistakes will be caught during testing. However, manual testing is a tedious and time-consuming process which requires the developer to imagine every possible user input. Automatic test-case generation can take this burden away from the developer and allow more complete testing to be done more efficiently. However, this relies on the tools used for test-case generation to be able to accurately reason about how software will run.

This project will focus on how software deals with strings. Typically "regular expressions" are used for this purpose. However, current research takes an idealised view of regular expressions that omits many important features of the regular expressions provided by modern programming languages. We will address this shortcoming both in the theory of computer science and in practice. In particular, we will create a test-case generation tool-chain that will provide better test-case generation for software dealing with strings. These tools will be tested on real-world industrial code provided by Prodo.ai.

Planned Impact

We describe several pathways to impact of the proposed research

Availability of Tools

The project will create new tools for string constraint solving, as well as augment and enhance tools for symbolic execution and test-case generation. These tools will be made open source and easily available via code-sharing sites such as GitHub, GitLab, &c.. This enables impact of the tools by both making them easily available to potential users, as well as by enabling other developers to extend the tools for their purposes. Although this is a standard form of dissemination, it remains a powerful one that should always be considered.

Symbolic Execution Tools

Symbolic execution is a successful area of research. The tools developed have seen industrial impact. For example, KLEE, which is developed at Imperial College London, has been used by Microsoft, NASA, Fujistu, and IBM, among others. Since our work will benefit symbolic execution tools by providing important string support, our work will be of interest to these companies. We have identified an industrial partner who will provide us with benchmarks on which our symbolic execution tools can be tested. Since test-case generation is part of the solutions provided by our industrial partner, the work has the potential for direct impact.

People Pipeline

The differences in the notion of a regular expression as studied in academia and as used by software developers in practice are numerous. The support of regular expressions in the academic sense by current generation of string constraint solvers is a sure barrier to their wide-spread use. The tools will not meet the needs of their potential users, and the gap will be too large for an average user to bridge. Indeed, bridging this gap from a user perspective has been a driver of the research surrounding the ExpoSE symbolic execution tool.

Similarly, symbolic execution and test-case generation tools will suffer from this discrepancy in definitions. Thus, a company wishing to deploy automatic test-case generation will quickly discover that the inputs returned do not induce the desired behaviours. Our work will provide important backend support for overcoming this problem, improving the usability of symbolic execution and test-case generation tools. This, in turn, will allow greater take-up by industry, improving efficiency of the software development process.

Moreover, we will providing a solver for a problem that is easily understood by software developers that has capabilities incorporating real-world regular expressions, which are highly familiar to programmers and IT professionals alike. This means that potential users will be able to imagine how the tool may be used to address their specific problems. We can then engage with those ideas and find new areas of impact for the work.

The benefits of improved string constraint solving technologies and improved symbolic execution or test-case generation will not be immediately visible to society at large. Instead, the benefits will trickle down, both in the form of improved efficiency of companies who make use of the tools, and in reduction in programming errors. These programming errors may be simple annoyances such as an application crashing or not behaving as expected, or may be more serious, such as cross-site scripting exploits leading to the leak of private user data.

Publications

10 25 50