String Constraint Solving with Real-World Regular Expressions

Lead Research Organisation: Royal Holloway University of London
Department Name: Computer Science

Abstract

Strings in programming languages are sequences of characters that represent any kind of text. They are a fundamental aspect of information representation: user names, passwords, or indeed any kind of text are handled as strings.

The manipulation of strings, however, can also lead to subtle programming errors, that can have consequences for program correctness as well as information security. For example, a malicious user may enter computer code as their username. If the system is not sufficiently secure, this code can allow the user to hack the system. The Open Web Application Security Project lists this kind of attack among the top 10 application security risks.

Despite this kind of attack being well-known, it has proved surprisingly difficult to avoid due to the complex nature of computer programs. Ideally, programming mistakes will be caught during testing. However, manual testing is a tedious and time-consuming process which requires the developer to imagine every possible user input. Automatic test-case generation can take this burden away from the developer and allow more complete testing to be done more efficiently. However, this relies on the tools used for test-case generation to be able to accurately reason about how software will run.

This project will focus on how software deals with strings. Typically "regular expressions" are used for this purpose. However, current research takes an idealised view of regular expressions that omits many important features of the regular expressions provided by modern programming languages. We will address this shortcoming both in the theory of computer science and in practice. In particular, we will create a test-case generation tool-chain that will provide better test-case generation for software dealing with strings. These tools will be tested on real-world industrial code provided by Prodo.ai.

Planned Impact

We describe several pathways to impact of the proposed research

Availability of Tools

The project will create new tools for string constraint solving, as well as augment and enhance tools for symbolic execution and test-case generation. These tools will be made open source and easily available via code-sharing sites such as GitHub, GitLab, &c.. This enables impact of the tools by both making them easily available to potential users, as well as by enabling other developers to extend the tools for their purposes. Although this is a standard form of dissemination, it remains a powerful one that should always be considered.

Symbolic Execution Tools

Symbolic execution is a successful area of research. The tools developed have seen industrial impact. For example, KLEE, which is developed at Imperial College London, has been used by Microsoft, NASA, Fujistu, and IBM, among others. Since our work will benefit symbolic execution tools by providing important string support, our work will be of interest to these companies. We have identified an industrial partner who will provide us with benchmarks on which our symbolic execution tools can be tested. Since test-case generation is part of the solutions provided by our industrial partner, the work has the potential for direct impact.

People Pipeline

The differences in the notion of a regular expression as studied in academia and as used by software developers in practice are numerous. The support of regular expressions in the academic sense by current generation of string constraint solvers is a sure barrier to their wide-spread use. The tools will not meet the needs of their potential users, and the gap will be too large for an average user to bridge. Indeed, bridging this gap from a user perspective has been a driver of the research surrounding the ExpoSE symbolic execution tool.

Similarly, symbolic execution and test-case generation tools will suffer from this discrepancy in definitions. Thus, a company wishing to deploy automatic test-case generation will quickly discover that the inputs returned do not induce the desired behaviours. Our work will provide important backend support for overcoming this problem, improving the usability of symbolic execution and test-case generation tools. This, in turn, will allow greater take-up by industry, improving efficiency of the software development process.

Moreover, we will providing a solver for a problem that is easily understood by software developers that has capabilities incorporating real-world regular expressions, which are highly familiar to programmers and IT professionals alike. This means that potential users will be able to imagine how the tool may be used to address their specific problems. We can then engage with those ideas and find new areas of impact for the work.

The benefits of improved string constraint solving technologies and improved symbolic execution or test-case generation will not be immediately visible to society at large. Instead, the benefits will trickle down, both in the form of improved efficiency of companies who make use of the tools, and in reduction in programming errors. These programming errors may be simple annoyances such as an application crashing or not behaving as expected, or may be more serious, such as cross-site scripting exploits leading to the leak of private user data.
 
Description The project aims to improve the ability of software engineers to detect and diagnose errors in their products. In particular, by providing back-end analysis support to explore possible program inputs and behaviours. A key area of current interest is the handling of text data, which is not well supported by the current state of the art.

Our first work has looked at the combination of text and numerical data (e.g. checking the length of a document). Typically this combination of text and numbers makes the analysis problem much harder. We have initially shown that most combinations found in a popular benchmark suite (90%) can effectively be removed from the problem. 10% is still a significant number. Our second contribution provides a new method for tackling these combinations.

Next, we have developed new methods for the analysis of string constraints that use standard regular expression features such as references. These have not been considered by previous tools. This can be used to further extend the applicability of symbolic execution to "real-world" programs that frequently use regular expressions in a non-trivial way.
Exploitation Route These outcomes can be used to improve the coverage of software testing tools, helping developers identify, diagnose, and repair errors more quickly.
Sectors Aerospace, Defence and Marine,Digital/Communication/Information Technologies (including Software)

URL https://github.com/uuverifiers/ostrich
 
Description The findings have extended the scope of string constraint solvers and further allow the analysis of real-world programs. A next step will be to test these techniques on real-world problems in collaboration with external developers. We have entered the OSTRICH solver in SMT-COMP 2022 and achieved the "best complementary" solver in some categories. This shows we are developing new methods that have impacts over the well-established solvers.
Sector Aerospace, Defence and Marine,Digital/Communication/Information Technologies (including Software)
Impact Types Cultural

 
Title OSTRICH (MonDec) 
Description An extension of our string constraint solver OSTRICH to determine the monadic decomposability of string constraints combined with integer constraints. 
Type Of Technology Webtool/Application 
Year Produced 2020 
Open Source License? Yes  
Impact This tool has allowed us to gather data on the necessity of combined integer and string constraint reasoning. In particular, we found that 90% of string constraints in the Kaluza benchmark suite did not contained integer constraints that could not be handled without dedicated integer solvers. The remaining 10% require integer reasoning. This is a useful datapoint in determining the direction of future research into string constraint solving. 
URL https://github.com/uuverifiers/ostrich/tree/modec
 
Title OSTRICH+ 
Description A string constraint solver for analysing string constraints including advanced regular expression features such as references. 
Type Of Technology Webtool/Application 
Year Produced 2021 
Open Source License? Yes  
Impact The tool is newly released, but we hope it may drive the development of symbolic execution engines that can better identify vulnerabilities in programs such as web applications. 
URL https://github.com/uuverifiers/ostrich