
Detecting and Removing Web Application Vulnerabilities with Static Analysis and Data Mining
Introduction
SINCE its appearance in the early 1990s, the Web has evolved from a platform for accessing text and other media into a framework for running complex web applications. These applications appear in many forms, from small home-made sites to large-scale commercial services (e.g., Google Docs, Twitter, Facebook). However, web applications have been plagued with security problems. For example, a recent report indicates an increase in web attacks of around 33% in 2012. Arguably, one reason for the insecurity of web applications is that many programmers lack appropriate knowledge about secure coding, so they leave applications with flaws. However, the mechanisms for web application security fall into two extremes. On one hand, there are techniques that put the programmer aside, e.g., web application firewalls and other runtime protections. On the other hand, there are techniques that discover vulnerabilities but put the burden of removing them on the programmer, e.g., black-box testing and static analysis. This paper explores an approach for automatically protecting web applications while keeping the programmer in the loop. The approach consists of analyzing the web application source code in search of input validation vulnerabilities and inserting fixes in the same code to correct these flaws. The programmer is kept in the loop by being allowed to understand where the vulnerabilities were found and how they were corrected.
This contributes directly to the security of web applications by removing vulnerabilities, and indirectly by letting programmers learn from their mistakes. This last aspect is enabled by inserting fixes that follow common security coding practices, so programmers can learn these practices by seeing the vulnerabilities and how they were removed. We explore a novel combination of methods to detect this type of vulnerability: static analysis and data mining. Static analysis is an effective mechanism for finding vulnerabilities in source code, but it tends to report many false positives (non-vulnerabilities) due to its undecidability [18]. This problem is particularly difficult with languages such as PHP that are weakly typed and not formally specified [7]. Therefore, we complement a form of static analysis, taint analysis, with data mining to predict the existence of false positives. This solution combines two apparently opposite approaches: humans coding the knowledge about vulnerabilities (for taint analysis) versus automatically obtaining that knowledge (with supervised machine learning supporting data mining). Specifically, we introduce the novel idea of using data mining to assess whether the vulnerabilities detected are false positives.
To do this assessment, we measure attributes of the code that we observed to be associated with the presence of false positives, and use a combination of the three top-ranking classifiers to flag every vulnerability as a false positive or not. We explore the use of several classifiers: ID3, C4.5/J48, Random Forest, Random Tree, K-NN, Naive Bayes, Bayes Net, MLP, SVM, and Logistic Regression. Moreover, for every vulnerability classified as a false positive, we use an induction rule classifier to show which attributes are associated with it. We explore the JRip, PART, Prism, and Ridor induction rule classifiers for this goal. Classifiers are configured automatically using machine learning on labeled vulnerability data. Ensuring that the code correction is done correctly requires verifying both that the vulnerabilities are removed and that the fixes do not alter the correct behavior of the application. We propose using program mutation and regression testing to confirm, respectively, that the fixes do what they are programmed to do (block malicious inputs) and that the application keeps working as expected (with benign inputs).
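The combination of the three top-ranking classifiers can be pictured as a simple majority vote over per-candidate predictions. The sketch below is a minimal illustration, not the paper's implementation: the attribute names and the rule-like classifier stand-ins are hypothetical, whereas the paper uses trained models (e.g., Logistic Regression, Random Tree, SVM) over its own feature set.

```python
# Minimal sketch: flag a detected vulnerability as a false positive when a
# majority of three classifiers says so. The classifiers here are trivial
# rule-like stand-ins for trained models.

def majority_vote(classifiers, attributes):
    """Return True (predicted false positive) if most classifiers agree."""
    votes = sum(1 for clf in classifiers if clf(attributes))
    return votes > len(classifiers) / 2

# Hypothetical code attributes extracted for one candidate vulnerability,
# e.g., whether string manipulation or validation touches the tainted data.
candidate = {"string_manipulation": True, "validation": True, "substring": False}

# Three illustrative stand-ins for the top-ranked trained classifiers.
clf_a = lambda a: a["validation"]
clf_b = lambda a: a["string_manipulation"] and a["validation"]
clf_c = lambda a: a["substring"]

print(majority_vote([clf_a, clf_b, clf_c], candidate))  # True: flagged as false positive
```

Two of the three stand-ins vote "false positive", so the candidate is flagged and would then be passed to the induction rule classifier for justification.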
Notice that we do not claim that our approach is able to detect or correct every vulnerability, only the input validation vulnerabilities it is programmed to deal with. The paper also describes the design of the Web Application Protection (WAP) tool that implements our approach [1]. WAP analyzes and removes input validation vulnerabilities from code written in PHP 5, which according to a recent report is used by more than 77% of web applications [16]. WAP covers a considerable number of classes of vulnerabilities: SQL injection (SQLI), cross-site scripting (XSS), remote file inclusion, local file inclusion, directory traversal/path traversal, source code disclosure, PHP code injection, and OS command injection. The first two continue to occupy the highest positions of the OWASP Top 10 of 2013 [39], whereas the rest are also known to be high risk, especially in PHP. Currently, WAP assumes that the backend database is MySQL, DB2, or PostgreSQL. The tool might be extended with more flaws and databases, but this set is enough to demonstrate the concept. Designing and implementing WAP was a challenging task.
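As a concrete illustration of the first vulnerability class, SQLI arises when user input is concatenated into a query string, letting a payload change the query's structure. The sketch below uses Python's standard sqlite3 module as a stand-in for the PHP/MySQL setting the paper targets; the table and payload are illustrative.

```python
import sqlite3

# Toy database with a single user row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, pw TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")

user_input = "' OR '1'='1"  # classic SQLI payload

# Vulnerable: string concatenation lets the payload rewrite the WHERE clause,
# so the query matches every row even though the supplied name is bogus.
vulnerable = "SELECT * FROM users WHERE name = '%s'" % user_input
print(len(conn.execute(vulnerable).fetchall()))  # 1 (the row leaks)

# Fixed: a parameterized query treats the payload as plain data.
safe = conn.execute("SELECT * FROM users WHERE name = ?", (user_input,))
print(len(safe.fetchall()))  # 0
```

Sanitization functions (escaping the input) or, as here, prepared statements are the standard corrections for this class of flaw.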
The tool does taint analysis of PHP programs, a form of data flow analysis. To achieve a first reduction in the number of false positives, the tool performs global, interprocedural, and context-sensitive analysis, which means that data flows are followed even when they enter new functions and other modules (other files). This involves managing several data structures, but also dealing with global variables (which in PHP can appear anywhere in the code, simply by preceding the name with global or through the $GLOBALS array) and resolving module names (which can even contain paths taken from environment variables). Handling object orientation with the associated inheritance and polymorphism was also a considerable challenge. We evaluated the tool experimentally by running it with both simple synthetic code and with 45 open-source PHP web applications available on the internet, adding up to more than 6,700 files and 1,380,000 lines of code. Our results suggest that the tool is capable of finding and correcting the vulnerabilities from the classes it was programmed to handle. The main contributions of the paper are: (1) an approach for improving the security of web applications by combining detection and automatic correction of vulnerabilities; (2) a combination of taint analysis and data mining techniques to identify vulnerabilities with few false positives; (3) a tool that implements the approach for web applications written in PHP with several database management systems; (4) a study of the configuration of the data mining component and an experimental evaluation of the tool with a considerable number of open-source PHP applications.
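The taint-tracking step itself can be reduced to a small illustration: mark data from entry points as tainted, propagate taint through assignments, clear it at sanitizers, and report when it reaches a sensitive sink. The sketch below runs over a toy three-address list rather than a real PHP AST, and is intraprocedural only; the real analysis is global, interprocedural, and context-sensitive as described above. All names are illustrative.

```python
# Toy taint analysis over a list of (lhs, op, rhs) instructions.
# A finding is a line where tainted data reaches a sensitive sink
# without passing through a sanitization function.

SOURCES = {"$_GET", "$_POST"}                 # PHP entry points
SANITIZERS = {"mysql_real_escape_string"}     # functions that untaint data
SINKS = {"mysql_query", "echo"}               # sensitive sinks

def analyze(program):
    tainted, findings = set(), []
    for lineno, (lhs, op, rhs) in enumerate(program, 1):
        if op == "assign":
            if rhs in SOURCES or rhs in tainted:
                tainted.add(lhs)              # taint propagates to lhs
            else:
                tainted.discard(lhs)          # overwritten with clean data
        elif op == "call":
            if lhs in SANITIZERS:
                tainted.discard(rhs)          # sanitizer output is trusted
            elif lhs in SINKS and rhs in tainted:
                findings.append(lineno)       # tainted data reaches a sink
    return findings

prog = [
    ("$u", "assign", "$_GET"),        # $u = $_GET['user'];
    ("$q", "assign", "$u"),           # $q = "SELECT ... $u ...";
    ("mysql_query", "call", "$q"),    # mysql_query($q);  <- vulnerable
]
print(analyze(prog))  # [3]
```

Inserting a `mysql_real_escape_string` call on `$u` before line 2 would clear the taint and make the finding disappear, which is exactly the kind of fix the tool inserts.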

Related Work
There is a large corpus of related work, so we summarize only the main areas by discussing representative papers, leaving many others unreferenced for lack of space.
Detecting vulnerabilities with static analysis. Static analysis tools automate the auditing of code, whether source, binary, or intermediate. In this paper we use the term static analysis in a narrow sense, to designate static analysis of source code to detect vulnerabilities. The most interesting static analysis tools perform semantic analysis based on the abstract syntax tree (AST) of a program. Data flow analysis tools follow the data paths inside a program to detect security problems. The most commonly used data flow analysis technique for security analysis is taint analysis, which marks data that enters the program as tainted and detects whether it reaches sensitive functions. Taint analysis tools like CQUAL and Splint (both for C code) use two qualifiers to annotate source code: the untainted qualifier indicates either that a function/parameter returns trustworthy data (e.g., a sanitization function) or that a parameter of a function requires trustworthy data (e.g., mysql_query); the tainted qualifier means that a function or parameter returns non-trustworthy data (e.g., functions that read user input). Pixy [17] uses taint analysis for verifying PHP code, but extends it with alias analysis, which takes into account the existence of aliases, i.e., of two or more variable names that denote the same variable. SaferPHP uses taint analysis to detect certain semantic vulnerabilities in PHP code: denial of service due to infinite loops and unauthorized operations in databases [33]. WAP also does taint analysis and alias analysis to detect vulnerabilities, but goes further by also correcting the code. Furthermore, Pixy does only module-level analysis, whereas WAP does global analysis (i.e., the analysis is not limited to a module/file, but can involve several).
Vulnerabilities and data mining. Data mining has been used to predict the presence of software defects. These works were based on code attributes such as numbers of lines of code, code complexity metrics, and object-oriented features. Some papers went one step further in the direction of our work by using similar metrics to predict the existence of vulnerabilities in source code. They used attributes such as past vulnerabilities and function calls, or code complexity and developer activities [32]. Unlike our work, these works did not aim to detect bugs and identify their location, but to assess the quality of the software in terms of prevalence of defects/vulnerabilities. Shar and Tan presented PhpMinerI and PhpMinerII, two tools that use data mining to assess the presence of vulnerabilities in PHP programs [29], [30]. These tools extract a set of attributes from program slices, then apply data mining algorithms to those attributes. The data mining process is not actually done by the tools themselves, but by the WEKA tool [40]. More recently, the authors evolved this idea to also use traces of program execution [31]. Their approach is an evolution of the previous works that aimed to assess the prevalence of vulnerabilities, but with higher accuracy. WAP is quite different because it has to identify the location of vulnerabilities in the source code so that it can correct them with fixes. Moreover, WAP does not use data mining to identify vulnerabilities but to predict whether vulnerabilities found by taint analysis are real or, on the contrary, false positives.
Correcting vulnerabilities. We propose using the output of static analysis to remove vulnerabilities automatically. We are aware of a few works that use approximately the same idea of first doing static analysis and then applying some kind of protection, but mostly for the specific case of SQL injection and without attempting to insert fixes in a way that can be replicated by a programmer. AMNESIA does static analysis to discover all SQL queries – vulnerable or not – and at runtime checks whether the call being made satisfies the format defined by the programmer [11]. Buehrer et al. do something similar by comparing at runtime the parse tree of the SQL statement before and after the inclusion of user input [6]. WebSSARI also does static analysis and inserts runtime guards, but no details are available about what the guards are or how they are inserted [15]. Merlo et al. present a tool that does static analysis of source code, performs dynamic analysis to build syntactic models of legitimate SQL queries, and generates code to protect queries from input that attempts SQLI [20]. saferXSS does static analysis to find XSS vulnerabilities, then removes them using functions provided by OWASP's ESAPI (http://www.owasp.org/index.php/ESAPI) to wrap user inputs [28]. None of these works uses data mining or machine learning.
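The fix-insertion idea common to our approach can be pictured as a source rewrite that wraps the offending input in a sanitization call chosen for the vulnerability class. The sketch below applies a regex to a single PHP-like line; this is a simplification for illustration only (the function name and the use of a textual rewrite are ours), since a real tool would edit the code via its AST.

```python
import re

def insert_fix(line, tainted_var, sanitizer="mysql_real_escape_string"):
    """Wrap every occurrence of tainted_var in line with sanitizer(...).
    Illustrative only: real tools rewrite the AST, not raw text."""
    return re.sub(re.escape(tainted_var), f"{sanitizer}({tainted_var})", line)

vulnerable = '$q = "SELECT * FROM t WHERE name = \'" . $u . "\'";'
print(insert_fix(vulnerable, "$u"))
# $q = "SELECT * FROM t WHERE name = '" . mysql_real_escape_string($u) . "'";
```

Because the fix is an explicit, readable call at the point of the flaw, a programmer inspecting the corrected code can see both where the vulnerability was and the coding practice that removes it.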
Conclusion
The paper presents an approach for finding and correcting vulnerabilities in web applications, and a tool that implements the approach for PHP programs and input validation vulnerabilities. The approach and the tool search for vulnerabilities using a combination of two techniques: static source code analysis and data mining. Data mining is used to identify false positives using the top 3 machine learning classifiers, and to justify their presence using an induction rule classifier. All classifiers were selected after a thorough comparison of several alternatives. It is important to note that this combination of detection techniques cannot provide entirely correct results. The static analysis problem is undecidable, and resorting to data mining cannot circumvent this undecidability, only provide probabilistic results. The tool corrects the code by inserting fixes, i.e., sanitization and validation functions. Testing is used to verify whether the fixes actually remove the vulnerabilities and do not compromise the (correct) behavior of the applications. The tool was evaluated with synthetic code with vulnerabilities inserted on purpose and with a considerable number of open-source PHP applications. It was also compared with two source code analysis tools, Pixy and PhpMinerII. This evaluation suggests that the tool can detect and correct the vulnerabilities of the classes it is programmed to handle. It was able to find 388 vulnerabilities in 1.4 million lines of code. Its accuracy and precision were approximately 5% better than PhpMinerII's and 45% better than Pixy's.