
Fast Detection of Transformed Data Leaks
Abstract
Introduction
REPORTS show that the number of leaked sensitive data records has grown 10 times in the last 4 years, reaching a record high of 1.1 billion in 2014 [3]. A significant portion of data leak incidents are due to human errors, for example, a lost or stolen laptop containing unencrypted sensitive files, or sensitive data transmitted without end-to-end encryption such as PGP. A recent Kaspersky Lab survey shows that accidental leaks by staff are the leading cause of internal data leaks in corporations [4]. The data-leak risks posed by accidents exceed the risks posed by vulnerable software.

In order to minimize the exposure of sensitive data and documents, an organization needs to prevent cleartext sensitive data from appearing in storage or communication. A screening tool can be deployed to scan computer file systems and server storage, and to inspect outbound network traffic. The tool searches for occurrences of plaintext sensitive data in the content of files or network traffic, and alerts users and administrators of the identified data exposure vulnerabilities. For example, an organization's mail server can inspect the content of outbound email messages, searching for sensitive data appearing in unencrypted messages.

Data leak detection differs from anti-virus (AV) scanning (e.g., scanning file systems for malware signatures) and from network intrusion detection systems (NIDS) (e.g., scanning traffic payloads for malicious patterns) [5]. AV and NIDS typically employ automata-based string matching (e.g., Aho-Corasick [6], Boyer-Moore [7]), which matches static or regular patterns. However, data leak detection imposes new security requirements and algorithmic challenges:
• Data Transformation: The exposed data in the content may be unpredictably transformed or modified by users or applications, and it may no longer be identical to the original sensitive data, e.g., insertions of metadata or formatting tags, substitutions of characters, and data truncation (partial data leak). Thus, the detection algorithm needs to recognize different kinds of sensitive data variations.
• Scalability: The heavy workload of data leak screening is due to two reasons. a) Long Sensitive Data Patterns: The sensitive data (e.g., customer information, documents, source code) can be of arbitrary length (e.g., megabytes). b) Large Amount of Content: The detection needs to rapidly screen content (e.g., gigabytes to terabytes).
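To make the transformation requirement concrete, the following sketch (in Python, with hypothetical example values of our own choosing) shows a leaked record that no longer matches its original byte-for-byte, yet whose character n-grams still overlap heavily with the sensitive data:

```python
# Hypothetical illustration: a transformed leak defeats exact string
# matching, while n-gram overlap still reveals the exposure.

def ngrams(s: str, n: int = 3) -> set:
    """Extract the set of character n-grams from a string."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

sensitive = "4111111111111111"            # original sensitive record
leaked = "card: 4111-1111-1111-1111"      # leaked copy with inserted dashes

exact_hit = sensitive in leaked           # False: exact match misses the leak

overlap = len(ngrams(sensitive) & ngrams(leaked)) / len(ngrams(sensitive))
# overlap is 1.0 here: every sensitive trigram survives the transformation
```

A detector that only looks for the exact sensitive string would report nothing for this message, while the local n-gram features remain fully recoverable.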
Traffic scanning is more time sensitive than storage scanning, because the leak needs to be discovered before the message is transmitted. Directly applying automata-based string matching to data leak detection is inappropriate and inefficient, because automata are not designed to support unpredictable and arbitrary pattern variations. In data leak detection scenarios, the transformation of leaked data (described as a regular expression) is unknown to the detection method. Creating comprehensive automata models covering all possible variations of a pattern is infeasible: it leads to O(2^n) space complexity (for deterministic finite automata) or O(2^n) time complexity (for nondeterministic finite automata), where n is the number of automaton states. Therefore, automata approaches cannot be used for detecting long and transformed data leaks.

Existing data leak detection approaches are based on set intersection. Set intersection is performed on two sets of n-grams, one from the content and one from the sensitive data.
The set intersection gives the number of sensitive n-grams appearing in the content. The method has been used to detect similar documents on the web [10], shared malicious traffic patterns [11], malware [12], as well as email spam [13]. The advantage of n-grams is the extraction of local features of a string, enabling the comparison to tolerate discrepancies. Some advanced versions of the set intersection method utilize Bloom filters, e.g., [14], which trade accuracy for space and speed. Shu and Yao extended the standard use of n-grams and introduced data-leak detection as a service. They proposed the first solution for detecting accidental data leaks with semi-honest providers [15].

However, set intersection is orderless, i.e., the ordering of shared n-grams is not analyzed. Thus, set-based detection generates undesirable false alerts, especially when n is set to a small value to tolerate data transformation. In addition, set intersection cannot effectively characterize the scenario where partial data is leaked, which results in false negatives. Therefore, none of the existing techniques is adequate for detecting transformed data leaks.

Our solution to the detection of transformed data leaks is a sequence alignment algorithm, executed on the sampled sensitive data sequence and the sampled content being inspected. The alignment produces scores indicating the amount of sensitive data contained in the content.
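A minimal sketch of the set-intersection baseline (the example strings are ours, not from the paper) illustrates the orderless weakness: reordering the content barely lowers the score, even though the content is no longer an in-order copy of the sensitive sequence.

```python
def ngram_set(s: str, n: int = 3) -> set:
    """Set of character n-grams of s (local features, order discarded)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def intersection_score(sensitive: str, content: str, n: int = 3) -> float:
    """Fraction of sensitive n-grams appearing anywhere in the content."""
    s = ngram_set(sensitive, n)
    return len(s & ngram_set(content, n)) / len(s)

sensitive = "the quick brown fox"
reordered = "brown fox the quick"   # same pieces, different order

score = intersection_score(sensitive, reordered)
# score stays above 0.8 despite the reordering: the shared n-grams
# are counted without regard to their positions.
```

This is exactly the kind of coincidental match that inflates false alerts when n is small.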
Our alignment-based solution measures the order of n-grams. It also handles arbitrary variations of patterns without an explicit specification of all possible variation patterns. Experiments show that our alignment method substantially outperforms the set intersection method in terms of detection accuracy in a multitude of transformed data leak scenarios.

We solve the scalability issue by sampling both the sensitive data and the content sequences before aligning them. We enable this procedure by providing a pair of algorithms: a comparable sampling algorithm and a sampling-oblivious alignment algorithm. The comparable sampling algorithm yields constant samples of a sequence regardless of where the sampling starts and ends. The sampling-oblivious alignment algorithm infers the similarity between the original unsampled sequences with sophisticated traceback techniques through dynamic programming. The algorithm infers the lost information (i.e., sampled-out elements) based on the matching results of their neighboring elements. Evaluation results show that our design boosts performance, yet incurs only a very small number of mismatches.

Existing network traffic sampling techniques, e.g., [16], only sample the content. Our problem differs from existing sampling problems in that both the sensitive data and the content sequences are sampled. The alignment is performed on the sampled sequences.
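The paper's sampling-oblivious alignment is more involved than can be shown here; as a rough, order-sensitive contrast to set intersection, a textbook Smith-Waterman local alignment (our simplification, not the authors' algorithm, with example strings of our own) already distinguishes an in-order transformed copy from reordered content:

```python
def local_align(a: str, b: str, match=2, mismatch=-1, gap=-1) -> int:
    """Smith-Waterman local alignment score via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never goes below 0: alignments restart locally.
            dp[i][j] = max(0, diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
            best = max(best, dp[i][j])
    return best

sensitive = "the quick brown fox"
transformed = "log: the qu1ck brown fox!"   # substitution plus added context
reordered = "brown fox the quick"           # same n-grams, wrong order

# The in-order (though modified) copy scores much higher than the
# reordered content, unlike the orderless set-intersection score.
```

The score rewards long in-order runs and penalizes insertions and substitutions mildly, which is why modified but in-order leaks still score well.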
Therefore, the samples of similar sequences should be similar so that they can be aligned. We define a comparable sampling property, under which the similarity of two sequences is preserved. For example, if x is a substring of y, then the sampled sequence of x should be a substring of the sampled sequence of y. None of the existing sampling solutions satisfies this comparable sampling requirement. Deterministic sampling, e.g., [17], does not imply comparable sampling, either. The key to our comparable sampling is to consider the local context of a sequence while selecting items. Sample items are selected deterministically within a sliding window, so the same items are selected in spite of different starting/ending points of the sampling procedure.

Both of our algorithms are designed to be efficiently parallelized. We parallelize our prototype on a multicore CPU and a GPU, and demonstrate the strong scalability of our design and the high performance of our prototypes. Our GPU-accelerated implementation achieves nearly 50 times speedup over the CPU version. Our prototype reaches 400 Mbps analysis throughput. This performance can support the rapid security scanning of storage and communication required by a sizable organization.
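A minimal illustration of the sliding-window idea (a winnowing-style sketch under our own assumptions, not the paper's exact algorithm): within each window, the item with the smallest deterministic rank is kept, so a shared substring yields the same interior selections regardless of its surrounding context.

```python
import hashlib

def _rank(item: str) -> int:
    """Deterministic pseudo-random rank of an item."""
    return int(hashlib.md5(item.encode()).hexdigest(), 16)

def comparable_sample(seq, w: int = 4):
    """For every window of w consecutive items, keep the minimum-rank item.
    Identical local context yields identical selections, so the sample of a
    substring appears (as a subsequence) in the sample of the whole."""
    picked = set()
    for i in range(len(seq) - w + 1):
        j = min(range(i, i + w), key=lambda k: _rank(seq[k]))
        picked.add(j)
    return [seq[p] for p in sorted(picked)]

x = list("abcdefghijklmnop")
y = list("XYZ") + x + list("QRS")   # x embedded in a larger sequence

sx, sy = comparable_sample(x), comparable_sample(y)
# Every window of x also occurs, unchanged, as a window of y, so every
# item sampled from x is also sampled from y, in the same order.
```

Because the selection depends only on the window's contents, shifting the start of the sampling procedure does not change which interior items are picked.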

Related Work
Existing commercial data leak detection/prevention solutions include Symantec DLP [14], IdentityFinder [18], GlobalVelocity [19], and GoCloudDLP [20]. GlobalVelocity uses FPGAs to accelerate its system. All of these solutions are likely based on n-gram set intersection. IdentityFinder searches file systems for short patterns of numbers that may be sensitive (e.g., 16-digit numbers that might be credit card numbers). It does not provide any in-depth similarity tests. Symantec DLP is based on n-grams and Bloom filters. The advantage of the Bloom filter is space saving. However, as explained in the introduction, Bloom filter membership testing is based on unordered n-grams, which generates coincidental matches and false alarms. A Bloom filter configured with a small number of hash functions has collisions, which introduce additional unwanted false positives.

Network intrusion detection systems (NIDS) such as Snort [21] and Bro [22] use regular expressions to perform string matching in deep packet inspection [23]. A nondeterministic finite automaton (NFA) with backtracking requires O(2^n) time and O(n) space, where n is the number of automaton states. A deterministic finite automaton (DFA) has a time complexity of O(n) and a space complexity of O(2^n) when used with quantification. Quantification is for expressing optional characters and multiple occurrences in a pattern. DFA space complexity can be reduced by grouping similar patterns into one automaton, reducing the number of edges.
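As a concrete sketch of the Bloom-filter trade-off (a toy implementation with parameter choices of our own, not Symantec DLP's): membership tests never miss inserted n-grams, but a small bit array and few hash functions leave room for coincidental positives.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter over an m-bit array with k salted hash functions."""

    def __init__(self, m: int = 64, k: int = 2):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item: str):
        for salt in range(self.k):
            digest = hashlib.md5(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item: str):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item: str) -> bool:
        # All k bits set -> "probably inserted"; collisions are possible.
        return all(self.bits >> p & 1 for p in self._positions(item))

sensitive = "the quick brown fox"
trigrams = {sensitive[i:i + 3] for i in range(len(sensitive) - 2)}

bf = BloomFilter()
for g in trigrams:
    bf.add(g)

# No false negatives: every inserted trigram is reported as present.
# With m = 64 bits and k = 2 hashes, unrelated queries can still hit
# set bits by chance, which is the extra false-positive source above.
```

The space saving comes precisely from this lossy encoding: only m bits are stored regardless of how long the inserted n-grams are.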
These improvements provide a constant-factor speedup. However, existing string matching approaches based on DFA or NFA cannot automatically match arbitrary and unpredictable pattern variations. Modified data leak instances cannot be matched or captured if the variation is not manually specified as a matching pattern. Enumerating all potential variation patterns takes exponential time and space with respect to the length of the pattern, and is therefore impractical. In comparison, our sequence alignment solution covers all possible pattern variations in long sensitive data without explicitly specifying them. Another drawback of automata is that they yield binary results. In comparison, alignment provides precise matching scores and allows customized weight functions. Our alignment gives more accurate detection than approximate string matching. Alignment algorithms have been widely used in computational biology applications, and features such as privacy-preserving sequence matching have been studied.
In the security literature, NetDialign, based on the well-known Dialign algorithms, was proposed for network privacy [30]. It performs differential testing among multiple traffic flows. Kreibich and Crowcroft presented an alignment algorithm for traffic intrusion detection systems such as Bro [31]. It is a variant of Jacobson-Vo alignment that calculates the longest common subsequence with the minimum number of gaps. Researchers in [32] reported the use of dynamic programming for computing the similarity of network behaviors and presented a technique to handle behavioral sequences with differing sampling rates. Masquerade attacks in the context of user command sequences can be detected with semi-global sequence alignment techniques. Our data leak detection differs from the above network privacy and IDS problems, and it has new requirements, as we have explained in the introduction. Our alignment performs the complex inferences needed for aligning sampled sequences, and our solution is also different from fast non-sample alignment in bioinformatics, e.g., BLAST [35].

Another approach to the detection of sensitive data leaks is to track the data/metadata movement. Several tools have been developed for securing sensitive information on mobile platforms [36]–[38]. Nadkarni and Enck described an approach to control the sharing of sensitive files among mobile applications [37]. File descriptors (not the content) are stored, tracked, and managed. The access control on files is enforced through policies. Yang et al. presented a method for detecting the transmission of sensitive data that is not intended by smartphone users, via symbolic execution analysis [38]. Hoyle et al. described a visualization method for informing mobile users of information exposure [36]. The information exposure may be caused by improper settings or configuration of access policies. The visualization follows an avatar apparel approach.
Croft and Caesar expanded data tracking from a single host to a network and used shadow packets to distinguish normal traffic from leaks [39].
The security goals and requirements in all these studies are very different from ours, leading to different techniques. iLeak is a system for preventing inadvertent information leaks on a personal computer [40]. It takes advantage of the keyword searching utility present in many modern operating systems. iLeak monitors the file access activities of processes and searches for system call inputs that involve sensitive data. Unlike our general data leak detection approach, iLeak is designed to secure personal data on a single machine, and its detection capability is restricted by the underlying keyword searching utility, which is not designed to detect either transformed data leaks or partial data leaks.

Bertino and Ghinita addressed the issue of data leaks in databases from the perspective of anomaly detection. Normal user behaviors are monitored and modeled in the DBMS, and anomalous activities are identified with respect to potential data leak activities. Bertino also discussed watermarking and provenance techniques used in data leak prevention and forensics, which are investigated in detail by Papadimitriou and Garcia-Molina.

Privacy is a well-known issue in the cloud. Lin and Squicciarini proposed a generic data protection framework to enforce data security in the cloud. A three-tier data protection framework was proposed by Squicciarini et al. to deal with data leaks caused by indexing in the cloud. Privacy-preserving data leak detection was proposed and further developed, in which data leak detection operations are outsourced to a semi-honest third party. That solution is a specialized set intersection method and is therefore different from this paper.
Conclusion
We presented a content inspection technique for detecting leaks of sensitive information in the content of files or network traffic. Our detection approach is based on aligning two sampled sequences for similarity comparison. Our experimental results suggest that our alignment method is effective in detecting multiple common data leak scenarios. The parallel versions of our prototype provide substantial speedup and indicate the high scalability of our design. For future work, we plan to explore data-movement tracking approaches for data leak prevention on a host.