
Fast Detection of Transformed Data Leaks
Abstract
Introduction
REPORTS show that the number of leaked sensitive data records has grown 10 times in the last 4 years, reaching a record high of 1.1 billion in 2014 [3]. A significant portion of data leak incidents are due to human errors, for example, a lost or stolen laptop containing unencrypted sensitive files, or sensitive data transmitted without end-to-end encryption such as PGP. A recent Kaspersky Lab survey shows that accidental leaks by staff are the leading cause of internal data leaks in corporations [4]. The data-leak risks posed by accidents exceed the risks posed by vulnerable software.

In order to minimize the exposure of sensitive data and documents, an organization needs to prevent cleartext sensitive data from appearing in storage or communication. A screening tool can be deployed to scan computer file systems and server storage, and to inspect outbound network traffic. The tool searches for occurrences of plaintext sensitive data in the content of files or network traffic, and alerts users and administrators of the identified data exposure vulnerabilities. For example, an organization's mail server can inspect the content of outbound email messages, searching for sensitive data appearing in unencrypted messages.

Data leak detection differs from anti-virus (AV) scanning (e.g., scanning file systems for malware signatures) and from network intrusion detection systems (NIDS) (e.g., scanning traffic payloads for malicious patterns) [5]. AV and NIDS typically employ automata-based string matching (e.g., Aho-Corasick [6], Boyer-Moore [7]), which matches static or regular patterns. However, data leak detection imposes new security requirements and algorithmic challenges:
• Data Transformation: The exposed data in the content may be unpredictably transformed or modified by users or applications, and it may no longer be identical to the original sensitive data, e.g., insertions of metadata or formatting tags, substitutions of characters, and data truncation (partial data leak). Thus, the detection algorithm needs to recognize different kinds of sensitive data variations.
• Scalability: The heavy workload of data leak screening is due to two reasons. a) Long Sensitive Data Patterns: The sensitive data (e.g., customer information, documents, source code) can be of arbitrary length (e.g., megabytes). b) Large Amount of Content: The detection needs to rapidly screen content (e.g., gigabytes to terabytes).
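To make the transformation requirement concrete, the following sketch (in Python, with hypothetical example values of our own choosing) shows a leaked record that no longer matches its original byte-for-byte, yet whose character n-grams still overlap heavily with the sensitive data:

```python
# Hypothetical illustration: a transformed leak defeats exact string
# matching, while n-gram overlap still reveals the exposure.

def ngrams(s: str, n: int = 3) -> set:
    """Extract the set of character n-grams from a string."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

sensitive = "4111111111111111"            # original sensitive record
leaked = "card: 4111-1111-1111-1111"      # leaked copy with inserted dashes

exact_hit = sensitive in leaked           # False: exact match misses the leak

overlap = len(ngrams(sensitive) & ngrams(leaked)) / len(ngrams(sensitive))
# overlap is 1.0 here: every sensitive trigram survives the transformation
```

A detector that only looks for the exact sensitive string would report nothing for this message, while the local n-gram features remain fully recoverable.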
Traffic scanning is more time sensitive than storage scanning, because the leak needs to be discovered before the message is transmitted. Directly applying automata-based string matching to data leak detection is inappropriate and inefficient, because automata are not designed to support unpredictable and arbitrary pattern variations. In data leak detection scenarios, the transformation of leaked data (described as a regular expression) is unknown to the detection method. Creating comprehensive automata models covering all possible variations of a pattern is infeasible: it leads to O(2^n) space complexity (for deterministic finite automata) or O(2^n) time complexity (for nondeterministic finite automata), where n is the number of automaton states. Therefore, automata approaches cannot be used for detecting long and transformed data leaks.

Existing data leak detection approaches are based on set intersection. Set intersection is performed on two sets of n-grams, one from the content and one from the sensitive data.
The set intersection gives the number of sensitive n-grams appearing in the content. The method has been used to detect similar documents on the web [10], shared malicious traffic patterns [11], malware [12], as well as email spam [13]. The advantage of n-grams is the extraction of local features of a string, enabling the comparison to tolerate discrepancies. Some advanced versions of the set intersection method utilize Bloom filters, e.g., [14], which trade accuracy for space and speed. Shu and Yao extended the standard use of n-grams and introduced data-leak detection as a service. They proposed the first solution for detecting accidental data leaks with semi-honest providers [15].

However, set intersection is orderless, i.e., the ordering of shared n-grams is not analyzed. Thus, set-based detection generates undesirable false alerts, especially when n is set to a small value to tolerate data transformation. In addition, set intersection cannot effectively characterize the scenario where partial data is leaked, which results in false negatives. Therefore, none of the existing techniques is adequate for detecting transformed data leaks.

Our solution to the detection of transformed data leaks is a sequence alignment algorithm, executed on the sampled sensitive data sequence and the sampled content being inspected. The alignment produces scores indicating the amount of sensitive data contained in the content.
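A minimal sketch of the set-intersection baseline (the example strings are ours, not from the paper) illustrates the orderless weakness: reordering the content barely lowers the score, even though the content is no longer an in-order copy of the sensitive sequence.

```python
def ngram_set(s: str, n: int = 3) -> set:
    """Set of character n-grams of s (local features, order discarded)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def intersection_score(sensitive: str, content: str, n: int = 3) -> float:
    """Fraction of sensitive n-grams appearing anywhere in the content."""
    s = ngram_set(sensitive, n)
    return len(s & ngram_set(content, n)) / len(s)

sensitive = "the quick brown fox"
reordered = "brown fox the quick"   # same pieces, different order

score = intersection_score(sensitive, reordered)
# score stays above 0.8 despite the reordering: the shared n-grams
# are counted without regard to their positions.
```

This is exactly the kind of coincidental match that inflates false alerts when n is small.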
Our alignment-based solution measures the order of n-grams. It also handles arbitrary variations of patterns without an explicit specification of all possible variation patterns. Experiments show that our alignment method substantially outperforms the set intersection method in terms of detection accuracy in a multitude of transformed data leak scenarios.

We solve the scalability issue by sampling both the sensitive data and the content sequences before aligning them. We enable this procedure by providing a pair of algorithms: a comparable sampling algorithm and a sampling-oblivious alignment algorithm. The comparable sampling algorithm yields constant samples of a sequence regardless of where the sampling starts and ends. The sampling-oblivious alignment algorithm infers the similarity between the original unsampled sequences with sophisticated traceback techniques through dynamic programming. The algorithm infers the lost information (i.e., sampled-out elements) based on the matching results of their neighboring elements. Evaluation results show that our design boosts performance, yet incurs only a very small number of mismatches.

Existing network traffic sampling techniques, e.g., [16], only sample the content. Our problem differs from existing sampling problems in that both the sensitive data and the content sequences are sampled. The alignment is performed on the sampled sequences.
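The paper's sampling-oblivious alignment is more involved than can be shown here; as a rough, order-sensitive contrast to set intersection, a textbook Smith-Waterman local alignment (our simplification, not the authors' algorithm, with example strings of our own) already distinguishes an in-order transformed copy from reordered content:

```python
def local_align(a: str, b: str, match=2, mismatch=-1, gap=-1) -> int:
    """Smith-Waterman local alignment score via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never goes below 0: alignments restart locally.
            dp[i][j] = max(0, diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
            best = max(best, dp[i][j])
    return best

sensitive = "the quick brown fox"
transformed = "log: the qu1ck brown fox!"   # substitution plus added context
reordered = "brown fox the quick"           # same n-grams, wrong order

# The in-order (though modified) copy scores much higher than the
# reordered content, unlike the orderless set-intersection score.
```

The score rewards long in-order runs and penalizes insertions and substitutions mildly, which is why modified but in-order leaks still score well.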
Therefore, the samples of similar sequences should be similar so that they can be aligned. We define a comparable sampling property, under which the similarity of two sequences is preserved. For example, if x is a substring of y, then the sampled sequence of x should be a substring of the sampled sequence of y. None of the existing sampling solutions satisfies this comparable sampling requirement. Deterministic sampling, e.g., [17], does not imply comparable sampling, either. The key to our comparable sampling is to consider the local context of a sequence while selecting items. Sample items are selected deterministically within a sliding window, so the same items are selected in spite of different starting/ending points of the sampling procedure.

Both of our algorithms are designed to be efficiently parallelized. We parallelize our prototype on a multicore CPU and a GPU, and demonstrate the strong scalability of our design and the high performance of our prototypes. Our GPU-accelerated implementation achieves nearly 50 times speedup over the CPU version. Our prototype reaches 400 Mbps analysis throughput. This performance can support the rapid security scanning of storage and communication required by a sizable organization.
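A minimal illustration of the sliding-window idea (a winnowing-style sketch under our own assumptions, not the paper's exact algorithm): within each window, the item with the smallest deterministic rank is kept, so a shared substring yields the same interior selections regardless of its surrounding context.

```python
import hashlib

def _rank(item: str) -> int:
    """Deterministic pseudo-random rank of an item."""
    return int(hashlib.md5(item.encode()).hexdigest(), 16)

def comparable_sample(seq, w: int = 4):
    """For every window of w consecutive items, keep the minimum-rank item.
    Identical local context yields identical selections, so the sample of a
    substring appears (as a subsequence) in the sample of the whole."""
    picked = set()
    for i in range(len(seq) - w + 1):
        j = min(range(i, i + w), key=lambda k: _rank(seq[k]))
        picked.add(j)
    return [seq[p] for p in sorted(picked)]

x = list("abcdefghijklmnop")
y = list("XYZ") + x + list("QRS")   # x embedded in a larger sequence

sx, sy = comparable_sample(x), comparable_sample(y)
# Every window of x also occurs, unchanged, as a window of y, so every
# item sampled from x is also sampled from y, in the same order.
```

Because the selection depends only on the window's contents, shifting the start of the sampling procedure does not change which interior items are picked.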

Related Work
Existing commercial data leak detection/prevention solutions include Symantec DLP [14], IdentityFinder [18], GlobalVelocity [19], and GoCloudDLP [20]. GlobalVelocity uses FPGAs to accelerate its system. All of these solutions are likely based on n-gram set intersection. IdentityFinder searches file systems for short patterns of numbers that may be sensitive (e.g., 16-digit numbers that might be credit card numbers). It does not provide any in-depth similarity tests. Symantec DLP is based on n-grams and Bloom filters. The advantage of the Bloom filter is space saving. However, as explained in the introduction, Bloom filter membership testing is based on unordered n-grams, which generates coincidental matches and false alarms. A Bloom filter configured with a small number of hash functions has collisions, which introduce additional unwanted false positives.

Network intrusion detection systems (NIDS) such as Snort [21] and Bro [22] use regular expressions to perform string matching in deep packet inspection [23]. A nondeterministic finite automaton (NFA) with backtracking requires O(2^n) time and O(n) space, where n is the number of automaton states. A deterministic finite automaton (DFA) has a time complexity of O(n) and a space complexity of O(2^n) when used with quantification. Quantification is for expressing optional characters and multiple occurrences in a pattern. DFA space complexity can be reduced by grouping similar patterns into one automaton, reducing the number of edges.
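As a concrete sketch of the Bloom-filter trade-off (a toy implementation with parameter choices of our own, not Symantec DLP's): membership tests never miss inserted n-grams, but a small bit array and few hash functions leave room for coincidental positives.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter over an m-bit array with k salted hash functions."""

    def __init__(self, m: int = 64, k: int = 2):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item: str):
        for salt in range(self.k):
            digest = hashlib.md5(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item: str):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item: str) -> bool:
        # All k bits set -> "probably inserted"; collisions are possible.
        return all(self.bits >> p & 1 for p in self._positions(item))

sensitive = "the quick brown fox"
trigrams = {sensitive[i:i + 3] for i in range(len(sensitive) - 2)}

bf = BloomFilter()
for g in trigrams:
    bf.add(g)

# No false negatives: every inserted trigram is reported as present.
# With m = 64 bits and k = 2 hashes, unrelated queries can still hit
# set bits by chance, which is the extra false-positive source above.
```

The space saving comes precisely from this lossy encoding: only m bits are stored regardless of how long the inserted n-grams are.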
These improvements provide a constant-factor speedup. However, existing string matching approaches based on DFA or NFA cannot automatically match arbitrary and unpredictable pattern variations. Modified data leak instances cannot be matched or captured if the variation is not manually specified as a matching pattern. Enumerating all potential variation patterns takes exponential time and space with respect to the length of the pattern, and is therefore impractical. In comparison, our sequence alignment solution covers all possible pattern variations in long sensitive data without explicitly specifying them. Another drawback of automata is that they yield binary results. In comparison, alignment provides precise matching scores and allows customized weight functions. Our alignment gives more accurate detection than approximate string matching. Alignment algorithms have been widely used in computational biology applications, and features such as privacy-preserving sequence matching have been studied.
In the security literature, NetDialign, based on the well-known Dialign algorithms, was proposed for network privacy [30]. It performs differential testing among multiple traffic flows. Kreibich and Crowcroft presented an alignment algorithm for traffic intrusion detection systems such as Bro [31]. It is a variant of Jacobson-Vo alignment that calculates the longest common subsequence with the minimum number of gaps. Researchers in [32] reported the use of dynamic programming for computing the similarity of network behaviors and presented a technique to handle behavioral sequences with differing sampling rates. Masquerade attacks in the context of user command sequences can be detected with semi-global sequence alignment techniques. Our data leak detection differs from the above network privacy and IDS problems, and it has new requirements, as we have explained in the introduction. Our alignment performs the complex inferences needed for aligning sampled sequences, and our solution is also different from fast non-sample alignment in bioinformatics, e.g., BLAST [35].

Another approach to the detection of sensitive data leaks is to track the data/metadata movement. Several tools have been developed for securing sensitive information on mobile platforms [36]–[38]. Nadkarni and Enck described an approach to control the sharing of sensitive files among mobile applications [37]. File descriptors (not the content) are stored, tracked, and managed. The access control on files is enforced through policies. Yang et al. presented a method for detecting the transmission of sensitive data that is not intended by smartphone users, via symbolic execution analysis [38]. Hoyle et al. described a visualization method for informing mobile users of information exposure [36]. The information exposure may be caused by improper settings or configuration of access policies. The visualization follows an avatar apparel approach.
Croft and Caesar expanded data tracking from a single host to a network and used shadow packets to distinguish normal traffic from leaks [39].
The security goals and requirements in all these studies are very different from ours, leading to different techniques. iLeak is a system for preventing inadvertent information leaks on a personal computer [40]. It takes advantage of the keyword searching utility present in many modern operating systems. iLeak monitors the file access activities of processes and searches for system call inputs that involve sensitive data. Unlike our general data leak detection approach, iLeak is designed to secure personal data on a single machine, and its detection capability is restricted by the underlying keyword searching utility, which is not designed to detect either transformed data leaks or partial data leaks.

Bertino and Ghinita addressed the issue of data leaks in databases from the perspective of anomaly detection. Normal user behaviors are monitored and modeled in the DBMS, and anomalous activities are identified with respect to potential data leak activities. Bertino also discussed watermarking and provenance techniques used in data leak prevention and forensics, which are investigated in detail by Papadimitriou and Garcia-Molina.

Privacy is a well-known issue in the cloud. Lin and Squicciarini proposed a generic data protection framework to enforce data security in the cloud. A three-tier data protection framework was proposed by Squicciarini et al. to deal with data leaks caused by indexing in the cloud. Privacy-preserving data leak detection was proposed and further developed, in which data leak detection operations are outsourced to a semi-honest third party. That solution is a specialized set intersection method and is therefore different from this paper.
Conclusion
We presented a content inspection technique for detecting leaks of sensitive information in the content of files or network traffic. Our detection approach is based on aligning two sampled sequences for similarity comparison. Our experimental results suggest that our alignment method is effective in detecting multiple common data leak scenarios. The parallel versions of our prototype provide substantial speedup and indicate the high scalability of our design. For future work, we plan to explore data-movement tracking approaches for data leak prevention on a host.