
Detecting and Removing Web Application Vulnerabilities with Static Analysis and Data Mining
Introduction
SINCE its appearance in the early 1990s, the Web has evolved from a platform for accessing text and other media into a framework for running complex web applications. These applications appear in many forms, from small home-made sites to large-scale commercial services (e.g., Google Docs, Twitter, Facebook). However, web applications have been plagued with security problems. For example, a recent report indicates an increase in web attacks of around 33% in 2012. Arguably, one reason for the insecurity of web applications is that many programmers lack appropriate knowledge about secure coding, so they leave applications with flaws. However, the mechanisms for web application security fall into two extremes. On one hand, there are techniques that put the programmer aside, e.g., web application firewalls and other runtime protections. On the other hand, there are techniques that discover vulnerabilities but put the burden of removing them on the programmer, e.g., black-box testing and static analysis. This paper explores an approach for automatically protecting web applications while keeping the programmer in the loop. The approach consists of analyzing the web application source code in search of input validation vulnerabilities and inserting fixes in the same code to correct these flaws. The programmer is kept in the loop by being allowed to understand where the vulnerabilities were found and how they were corrected.
This contributes directly to the security of web applications by removing vulnerabilities, and indirectly by letting programmers learn from their mistakes. This last aspect is enabled by inserting fixes that follow common security coding practices, so programmers can learn these practices by seeing the vulnerabilities and how they were removed. We explore a novel combination of methods to detect this type of vulnerability: static analysis and data mining. Static analysis is an effective mechanism for finding vulnerabilities in source code, but it tends to report many false positives (non-vulnerabilities) due to its undecidability [18]. This problem is particularly difficult with languages such as PHP that are weakly typed and not formally specified [7]. Therefore, we complement a form of static analysis, taint analysis, with data mining to predict the existence of false positives. This solution combines two apparently opposite approaches: humans coding the knowledge about vulnerabilities (for taint analysis) versus automatically obtaining that knowledge (with supervised machine learning supporting data mining). Specifically, we introduce the novel idea of using data mining to assess whether the vulnerabilities detected are false positives.
To do this assessment, we measure attributes of the code that we observed to be associated with the presence of false positives, and use a combination of the three top-ranking classifiers to flag every vulnerability as a false positive or not. We explore the use of several classifiers: ID3, C4.5/J48, Random Forest, Random Tree, K-NN, Naive Bayes, Bayes Net, MLP, SVM, and Logistic Regression. Moreover, for every vulnerability classified as a false positive, we use an induction rule classifier to show which attributes are associated with it. We explore the JRip, PART, Prism, and Ridor induction rule classifiers for this goal. Classifiers are configured automatically using machine learning on labeled vulnerability data. Ensuring that the code correction is done correctly requires verifying both that the vulnerabilities are removed and that the fixes do not alter the correct behavior of the application. We propose using program mutation and regression testing to confirm, respectively, that the fixes do what they are programmed to do (block malicious inputs) and that the application keeps working as expected (with benign inputs).
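The combination of the three top-ranking classifiers can be pictured as a simple majority vote over per-candidate predictions. The sketch below is a minimal illustration, not the paper's implementation: the attribute names and the rule-like classifier stand-ins are hypothetical, whereas the paper uses trained models (e.g., Logistic Regression, Random Tree, SVM) over its own feature set.

```python
# Minimal sketch: flag a detected vulnerability as a false positive when a
# majority of three classifiers says so. The classifiers here are trivial
# rule-like stand-ins for trained models.

def majority_vote(classifiers, attributes):
    """Return True (predicted false positive) if most classifiers agree."""
    votes = sum(1 for clf in classifiers if clf(attributes))
    return votes > len(classifiers) / 2

# Hypothetical code attributes extracted for one candidate vulnerability,
# e.g., whether string manipulation or validation touches the tainted data.
candidate = {"string_manipulation": True, "validation": True, "substring": False}

# Three illustrative stand-ins for the top-ranked trained classifiers.
clf_a = lambda a: a["validation"]
clf_b = lambda a: a["string_manipulation"] and a["validation"]
clf_c = lambda a: a["substring"]

print(majority_vote([clf_a, clf_b, clf_c], candidate))  # True: flagged as false positive
```

Two of the three stand-ins vote "false positive", so the candidate is flagged and would then be passed to the induction rule classifier for justification.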
Notice that we do not claim that our approach is able to detect or correct every vulnerability, only the input validation vulnerabilities it is programmed to deal with. The paper also describes the design of the Web Application Protection (WAP) tool that implements our approach [1]. WAP analyzes and removes input validation vulnerabilities from code written in PHP 5, which according to a recent report is used by more than 77% of web applications [16]. WAP covers a considerable number of classes of vulnerabilities: SQL injection (SQLI), cross-site scripting (XSS), remote file inclusion, local file inclusion, directory traversal/path traversal, source code disclosure, PHP code injection, and OS command injection. The first two continue to occupy the highest positions of the OWASP Top 10 of 2013 [39], whereas the rest are also known to be high risk, especially in PHP. Currently, WAP assumes that the backend database is MySQL, DB2, or PostgreSQL. The tool might be extended with more flaws and databases, but this set is enough to demonstrate the concept. Designing and implementing WAP was a challenging task.
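As a concrete illustration of the first vulnerability class, SQLI arises when user input is concatenated into a query string, letting a payload change the query's structure. The sketch below uses Python's standard sqlite3 module as a stand-in for the PHP/MySQL setting the paper targets; the table and payload are illustrative.

```python
import sqlite3

# Toy database with a single user row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, pw TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")

user_input = "' OR '1'='1"  # classic SQLI payload

# Vulnerable: string concatenation lets the payload rewrite the WHERE clause,
# so the query matches every row even though the supplied name is bogus.
vulnerable = "SELECT * FROM users WHERE name = '%s'" % user_input
print(len(conn.execute(vulnerable).fetchall()))  # 1 (the row leaks)

# Fixed: a parameterized query treats the payload as plain data.
safe = conn.execute("SELECT * FROM users WHERE name = ?", (user_input,))
print(len(safe.fetchall()))  # 0
```

Sanitization functions (escaping the input) or, as here, prepared statements are the standard corrections for this class of flaw.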
The tool does taint analysis of PHP programs, a form of data flow analysis. To achieve a first reduction in the number of false positives, the tool performs global, interprocedural, and context-sensitive analysis, which means that data flows are followed even when they enter new functions and other modules (other files). This involves managing several data structures, but also dealing with global variables (which in PHP can appear anywhere in the code, simply by preceding the name with global or through the $GLOBALS array) and resolving module names (which can even contain paths taken from environment variables). Handling object orientation with the associated inheritance and polymorphism was also a considerable challenge. We evaluated the tool experimentally by running it with both simple synthetic code and with 45 open-source PHP web applications available on the internet, adding up to more than 6,700 files and 1,380,000 lines of code. Our results suggest that the tool is capable of finding and correcting the vulnerabilities from the classes it was programmed to handle. The main contributions of the paper are: (1) an approach for improving the security of web applications by combining detection and automatic correction of vulnerabilities; (2) a combination of taint analysis and data mining techniques to identify vulnerabilities with few false positives; (3) a tool that implements the approach for web applications written in PHP with several database management systems; (4) a study of the configuration of the data mining component and an experimental evaluation of the tool with a considerable number of open-source PHP applications.
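The taint-tracking step itself can be reduced to a small illustration: mark data from entry points as tainted, propagate taint through assignments, clear it at sanitizers, and report when it reaches a sensitive sink. The sketch below runs over a toy three-address list rather than a real PHP AST, and is intraprocedural only; the real analysis is global, interprocedural, and context-sensitive as described above. All names are illustrative.

```python
# Toy taint analysis over a list of (lhs, op, rhs) instructions.
# A finding is a line where tainted data reaches a sensitive sink
# without passing through a sanitization function.

SOURCES = {"$_GET", "$_POST"}                 # PHP entry points
SANITIZERS = {"mysql_real_escape_string"}     # functions that untaint data
SINKS = {"mysql_query", "echo"}               # sensitive sinks

def analyze(program):
    tainted, findings = set(), []
    for lineno, (lhs, op, rhs) in enumerate(program, 1):
        if op == "assign":
            if rhs in SOURCES or rhs in tainted:
                tainted.add(lhs)              # taint propagates to lhs
            else:
                tainted.discard(lhs)          # overwritten with clean data
        elif op == "call":
            if lhs in SANITIZERS:
                tainted.discard(rhs)          # sanitizer output is trusted
            elif lhs in SINKS and rhs in tainted:
                findings.append(lineno)       # tainted data reaches a sink
    return findings

prog = [
    ("$u", "assign", "$_GET"),        # $u = $_GET['user'];
    ("$q", "assign", "$u"),           # $q = "SELECT ... $u ...";
    ("mysql_query", "call", "$q"),    # mysql_query($q);  <- vulnerable
]
print(analyze(prog))  # [3]
```

Inserting a `mysql_real_escape_string` call on `$u` before line 2 would clear the taint and make the finding disappear, which is exactly the kind of fix the tool inserts.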

Related Work
There is a large corpus of related work, so we summarize only the main areas by discussing representative papers, leaving many others unreferenced for lack of space.
Detecting vulnerabilities with static analysis. Static analysis tools automate the auditing of code, whether source, binary, or intermediate. In this paper we use the term static analysis in a narrow sense, to designate static analysis of source code to detect vulnerabilities. The most interesting static analysis tools perform semantic analysis based on the abstract syntax tree (AST) of a program. Data flow analysis tools follow the data paths inside a program to detect security problems. The most commonly used data flow analysis technique for security analysis is taint analysis, which marks data that enters the program as tainted and detects whether it reaches sensitive functions. Taint analysis tools like CQUAL and Splint (both for C code) use two qualifiers to annotate source code: the untainted qualifier indicates either that a function/parameter returns trustworthy data (e.g., a sanitization function) or that a parameter of a function requires trustworthy data (e.g., mysql_query); the tainted qualifier means that a function or parameter returns non-trustworthy data (e.g., functions that read user input). Pixy [17] uses taint analysis for verifying PHP code, but extends it with alias analysis, which takes into account the existence of aliases, i.e., of two or more variable names that denote the same variable. SaferPHP uses taint analysis to detect certain semantic vulnerabilities in PHP code: denial of service due to infinite loops and unauthorized operations in databases [33]. WAP also does taint analysis and alias analysis to detect vulnerabilities, but goes further by also correcting the code. Furthermore, Pixy does only module-level analysis, whereas WAP does global analysis (i.e., the analysis is not limited to a module/file, but can involve several).
Vulnerabilities and data mining. Data mining has been used to predict the presence of software defects. These works were based on code attributes such as numbers of lines of code, code complexity metrics, and object-oriented features. Some papers went one step further in the direction of our work by using similar metrics to predict the existence of vulnerabilities in source code. They used attributes such as past vulnerabilities and function calls, or code complexity and developer activities [32]. Unlike our work, these works did not aim to detect bugs and identify their location, but to assess the quality of the software in terms of prevalence of defects/vulnerabilities. Shar and Tan presented PhpMinerI and PhpMinerII, two tools that use data mining to assess the presence of vulnerabilities in PHP programs [29], [30]. These tools extract a set of attributes from program slices, then apply data mining algorithms to those attributes. The data mining process is not actually done by the tools themselves, but by the WEKA tool [40]. More recently, the authors evolved this idea to also use traces of program execution [31]. Their approach is an evolution of the previous works that aimed to assess the prevalence of vulnerabilities, but with higher accuracy. WAP is quite different because it has to identify the location of vulnerabilities in the source code so that it can correct them with fixes. Moreover, WAP does not use data mining to identify vulnerabilities but to predict whether vulnerabilities found by taint analysis are real or, on the contrary, false positives.
Correcting vulnerabilities. We propose using the output of static analysis to remove vulnerabilities automatically. We are aware of a few works that use approximately the same idea of first doing static analysis and then applying some kind of protection, but mostly for the specific case of SQL injection and without attempting to insert fixes in a way that can be replicated by a programmer. AMNESIA does static analysis to discover all SQL queries – vulnerable or not – and at runtime checks whether the call being made satisfies the format defined by the programmer [11]. Buehrer et al. do something similar by comparing at runtime the parse tree of the SQL statement before and after the inclusion of user input [6]. WebSSARI also does static analysis and inserts runtime guards, but no details are available about what the guards are or how they are inserted [15]. Merlo et al. present a tool that does static analysis of source code, performs dynamic analysis to build syntactic models of legitimate SQL queries, and generates code to protect queries from input that attempts SQLI [20]. saferXSS does static analysis to find XSS vulnerabilities, then removes them using functions provided by OWASP's ESAPI (http://www.owasp.org/index.php/ESAPI) to wrap user inputs [28]. None of these works uses data mining or machine learning.
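The fix-insertion idea common to our approach can be pictured as a source rewrite that wraps the offending input in a sanitization call chosen for the vulnerability class. The sketch below applies a regex to a single PHP-like line; this is a simplification for illustration only (the function name and the use of a textual rewrite are ours), since a real tool would edit the code via its AST.

```python
import re

def insert_fix(line, tainted_var, sanitizer="mysql_real_escape_string"):
    """Wrap every occurrence of tainted_var in line with sanitizer(...).
    Illustrative only: real tools rewrite the AST, not raw text."""
    return re.sub(re.escape(tainted_var), f"{sanitizer}({tainted_var})", line)

vulnerable = '$q = "SELECT * FROM t WHERE name = \'" . $u . "\'";'
print(insert_fix(vulnerable, "$u"))
# $q = "SELECT * FROM t WHERE name = '" . mysql_real_escape_string($u) . "'";
```

Because the fix is an explicit, readable call at the point of the flaw, a programmer inspecting the corrected code can see both where the vulnerability was and the coding practice that removes it.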
Conclusion
The paper presents an approach for finding and correcting vulnerabilities in web applications, and a tool that implements the approach for PHP programs and input validation vulnerabilities. The approach and the tool search for vulnerabilities using a combination of two techniques: static source code analysis and data mining. Data mining is used to identify false positives using the top 3 machine learning classifiers, and to justify their presence using an induction rule classifier. All classifiers were selected after a thorough comparison of several alternatives. It is important to note that this combination of detection techniques cannot provide entirely correct results. The static analysis problem is undecidable, and resorting to data mining cannot circumvent this undecidability, only provide probabilistic results. The tool corrects the code by inserting fixes, i.e., sanitization and validation functions. Testing is used to verify whether the fixes actually remove the vulnerabilities and do not compromise the (correct) behavior of the applications. The tool was evaluated with synthetic code with vulnerabilities inserted on purpose and with a considerable number of open-source PHP applications. It was also compared with two source code analysis tools, Pixy and PhpMinerII. This evaluation suggests that the tool can detect and correct the vulnerabilities of the classes it is programmed to handle. It was able to find 388 vulnerabilities in 1.4 million lines of code. Its accuracy and precision were approximately 5% better than PhpMinerII's and 45% better than Pixy's.