
Privacy-Preserving Outsourced Association Rule Mining on Vertically Partitioned Databases
Abstract
Introduction
In recent years, there has been considerable interest in the so-called data mining-as-service paradigm for enabling organizations with limited computational resources and/or data mining expertise to outsource their data mining needs to a third party service provider . As an example, the operational transactional data from various outlets of Safeway, a grocery chain operating in the US and Canada, can be shipped to a third party which provides mining service for Safeway. The Safeway management need not employ an in-house team of data mining experts. Besides, they can cut down their local data management requirements because periodically data is shipped to the service provider who is in charge of maintaining it and conducting mining on it in response to requests from Safeway’s business analysts. It is generally expected that the paradigm of “mining and management of data as service” will grow with the advent and popularity of cloud computing . In the above example, Safeway, the client, is a data owner and the service provider is referred to as the server.
One of the main issues with this paradigm is that the server has access to valuable data of the owner and may learn sensitive information from it. E.g., by looking at the transactions, the server (or an intruder who gains access to the server) can learn which items are co-purchased, and in turn, the mined patterns. However, both the transactions and the mined patterns are the property of Safeway and should remain safe from the server . This problem of protecting important private information of organizations/companies is referred to as ”corporate privacy” [8]. Unlike personal privacy, which only considers the protection of the personal information recorded about individuals, corporate privacy requires that both the individual items and the patterns of the collection of data items are regarded as corporate assets and thus must be protected .
In this position paper, we study the problem of outsourcing the association rule mining task within a corporate privacy-preserving framework. Two recent papers address the same problem as we do. While a detailed comparison appears in Sec. 1.1, the main difference with our work is a formal analysis of privacy (compared to which only offers empirical evidence of security) and precise reconstruction of frequent patterns with exact support (compared to ). We propose an encryption scheme which enables privacy guarantees, and show some preliminary results obtained applying this model over large-scale, real-life transaction databases.
The architecture behind our model is illustrated in Figure 1. The client/owner encrypts its data using an encrypt/decrypt module, which can be essentially treated as a “black box” from its perspective. This module is responsible for transforming the input data into an encrypted database. The server conducts data mining and sends the (encrypted) patterns to the owner. Our encryption scheme has the property that the returned supports are not true supports. The encrypt/decrypt (ED) module recovers the true identity of the returned patterns as well their true supports.
Classic frequent itemset mining and association rule mining algorithms, such as Apriori [6], Eclat [7] and FP-growth [8], were designed for a centralized database setting where the raw data is stored in the central site for mining. Privacy concerns were not considered in this setting. Vaidya and Clifton [9] and Kantarcioglu and Clifton [10] are the first to identify and address privacy issues in horizontally / vertically partitioned databases. Due to an increased understanding of the importance of data privacy (e.g. in the aftermath of the revelations by Edward Snowden, a former NSA contractor), a number of privacy-preserving mining solutions have been proposed in recent times. In their settings, there are multiple data owners wishing to learn association rules or frequent itemsets from their joint data. However, the data owners are not willing to send their raw data to a central site due to privacy concerns. If each data owner has one or more rows (i.e. transactions) in the joint database, we say that the database is horizontally partitioned.
If each data owner has one or more columns in the joint database, the database is considered vertically partitioned. This paper focuses on vertically partitioned databases, and as explained in [11], such databases are useful for market basket analysis. For example, different businesses, such as a fashion designer and a luxury watch designer, sell different products to the same community. These businesses collaborate to mine customer buying patterns from the joint database. A transaction of the database contains the products that a customer had bought from one or more of the participating businesses, and attributes such as the customer credit card numberand date of purchaseare used as TIDs. Therefore,each of the businesses (i.e. data owners) will own some transaction partitions in the joint database. However, these businesses may not wish to disclose such data, which include trade secrets (e.g. there may be other competing businesses sharing the same joint database) and customer privacy (e.g. due to regulations in existing privacy regime). Therefore, a privacypreserving mining solution must be applied. Other use cases can also be found in areas such as automotive safety [9] and national security [12].
In this paper, we propose a cloud-aided privacy-preserving frequent itemset mining solution for vertically partitioned databases, which is then used to build a privacy-preserving association rule mining solution. Both solutions are designed for applications where data owners have a high level of privacy requirement. The solutions are also suitable for data owners looking to outsource data storage – i.e. data owners can outsource their encrypted data and mining task to a semi-trusted (i.e. curious but honest) cloud in a privacypreserving manner. To the best of our knowledge, this is the first work on outsourced association rule mining and frequent itemset mining for vertically partitioned databases. The key underlying techniques in our solutions are an efficient homomorphic encryption scheme and a secure outsourced comparison scheme. The contributions of this paper are three-fold:
• This paper proposes privacy-preserving mining solutions for high privacy requirements. As shown in Figure 1, the proposed solutions are uniquely located in the design space. Compared with most solutions, our solutions achieve a higher privacy level, as most existing solutions require either the sharing / exposure of raw data or the disclosure of the exact supports to data owners. Such requirements result in the leakage of sensitive information of the raw data [11]. Our solutions are designed to avoid such complications. We note that one of the frequent itemset mining solutions in [13] can achieve the same privacy level as our proposed solutions. However, an association rule mining solution cannot be built based on the frequent itemset mining solution in [13]. In contrast, we present solutions for both frequent itemset mining and association rule mining. Moreover, as shown in Section VII, our frequent itemset mining solution is 3 to 5 orders faster. Our solution is significantly more efficient due to our customized homomorphic encryption scheme. The introduction of a semi-trusted third party

(i.e. the cloud) also allows us to securely compute supports and compare supports with a threshold Ts more efficiently – for related work and comparative summary, respectively.
• This paper proposes an efficient homomorphicencryption scheme and a secure outsourced comparison scheme. To avoid the disclosure of supports/confidences, we design an efficient homomorphic encryption scheme to facilitate secure outsourced computation of supports/ confidences, as well as a secure outsourced comparison scheme for comparing supports/confidences with thresholds. The proposed (symmetric homomorphic) encryption scheme is tailored for the proposed comparison. The scheme only requires modular additions and multiplications, and is more efficient than the homomorphic encryption schemes used in other association rule mining and frequent itemset mining solutions. For example, encryption computing in our scheme is three orders of magnitudefaster than respectively.To the best of our knowledge, the proposed secure comparison scheme is the first scheme based on symmetric homomorphic encryption. The proposed schemes are designed for the data mining solutions outlined in this paper, but they can potentially be adopted in a wide range of secure computation applications.
• This paper proposes a ciphertext tag approach for canceling out fictitious data’s effect on mining result. To “hide” the data owner’s raw data from the cloud, we adapt the concept outlined in [16] by encrypting items with a substitution cipher, and adding fictitious transactions as a mitigation against frequency analysis attacks on the substitution cipher. To allow secure and accurate computation of supports, we design a ciphertext tag approach to cancel out fictitious transactions in a privacy-preserving manner. Although our approach is designed for the data mining solutions outlined in this paper, it has potential applications in other secure computation contexts, such as secure data aggregation.
Related Work
In this section, for the purpose of clarifying similarities and differences with our problem setting, we outline the work on privacy-preserving data publishing and mining, which can be classified into the following five categories.
Privacy-preserving data publishing (PPDP): The idea is that data is published by an owner for the common benefit of allowing analysts to mine patterns from it. Data is published with appropriate suppression, generalization, distortion, and/or decomposition such that individual privacy is not compromised and yet the published data is useful for mining . Clearly, this approach can protect personal privacy, but not corporate privacy, i.e., privacy of the assets consisting of the transaction database and the patterns that can be mined from it.
Privacy-preserving data mining (PPDM): The main model here is that private data is collected from a number of sources by a collector for the purpose of consolidating the data and conducting mining. The collector is not trusted with protecting the privacy, so data is subjected to a random perturbation as it is collected. Techniques have been developed for perturbing the data so as to preserve privacy while ensuring the mined patterns or other analytical properties are sufficiently close to the patterns mined from original data. This body of work was pioneered by and has been followed up by several papers since . Again, this approach is not suited for corporate privacy, in that some analytical properties are disclosed.
Secure multiparty mining over distributed datasets (SMPM): Data on which mining is to be performed is partitioned, horizontally or vertically, and distributed among several parties. The partitioned data cannot be shared and must remain private but the results of mining on the “union” of the data are shared among the participants, by means of multiparty secure protocols . They do not consider third parties. This approach partially implements corporate privacy, as local databases are kept private, but it is too weak for our outsourcing problem, as the resulting patterns are disclosed to multiple parties.
Privacy-preserving pattern publishing (PPPP): The central question is how to publish results of mining such as frequent patterns without revealing any sensitive information about the underlying data . Once again, the resulting patterns are disclosed. The particular problem attacked in this paper is outsourcing of pattern mining within a corporate privacy-preserving framework . A key distinction between this problem and the above mentioned PPDM problems is that, in our setting, not only the underlying data but also the mined results are not intended for sharing and must remain private. The work that is most related to ours is [23]. Similar to our work, first, they utilize a one-to-n item mapping together with non-deterministic addition of cipher items to protect the identification of individual items. Second, they assume that the adversary may possess some prior knowledge of frequency of the itemsets, which can be used to decipher the encrypted items. While our attack model focuses on single items with the assumption that the attacker knows the exact frequency of every single item. The major issue left open by is a formal protection result: their security analysis is entirely conducted empirically on various synthetic datasets. The notion of (h,k,p)-coherence employed by Xu et al. [25] achieves an effect similar to our approach of encrypting the database such that items fall into equivalence classes of size ≥k. However, this anonymization method is meant for data publishing purposes, and it is not applicable to privacy-preserving outsourcing.
Conclusion
In this Privacy-Preserving Outsourced Association Rule Mining on Vertically Partitioned Databases paper, we proposed a privacy-preserving outsourced frequent itemset mining solution for vertically partitioned databases. This allows the data owners to outsource mining task ontheir jointdata in a privacy-preservingmanner.privacypreserving manner. Based on this solution, we built a privacypreserving outsourced association rule partitioned databases. Our solutions protect data owner’s raw data from other data owners and the cloud. Our solutions also ensure the privacy of the mining results from the cloud. Compared with most existingsolutions, oursolutions leak less informationabout the data owners’ raw data. Our evaluation has also demonstrated that our solutions are very efficient; therefore, our solutions are suitable to be used by data owners wishing to outsource their databases to the cloud but require a high level of privacy without compromising on performance.