Trust-But-Verify: Verifying Result Correctness of Outsourced Frequent Itemset Mining in Data-Mining-As-A-Service Paradigm

Abstract

Cloud computing is popularizing the computing paradigm in which data is outsourced to a third-party service provider (server) for data mining. Outsourcing, however, raises a serious security issue: how can a client of weak computational power verify that the server returned correct mining results? In this paper, we focus on the specific task of frequent itemset mining. We consider a server that is potentially untrusted and tries to escape verification by using its prior knowledge of the outsourced data.

We propose efficient probabilistic and deterministic verification approaches to check whether the server has returned correct and complete frequent itemsets. Our probabilistic approach catches incorrect results with high probability, while our deterministic approach measures result correctness with 100 percent certainty. We also design efficient verification methods for the cases in which the data and the mining setup are updated. We demonstrate the effectiveness and efficiency of our methods with an extensive set of empirical results on real datasets.

INTRODUCTION

Data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD) is an interdisciplinary subfield of computer science: the computational process of discovering patterns in large data sets using methods at the intersection of artificial intelligence, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.

Pattern mining is a data mining method that involves finding existing patterns in data. In this context, "patterns" often means association rules. The original motivation for searching for association rules came from the desire to analyze supermarket transaction data, that is, to examine customer behavior in terms of the purchased products. For example, the association rule "beer ⇒ potato chips (80%)" states that four out of five customers who bought beer also bought potato chips.

In this paper, we propose the RobFrugal method. We investigate encryption schemes that can resist such privacy vulnerabilities, and explore how to improve the RobFrugal algorithm to minimize the number of spurious patterns. The scheme is based on 1–1 substitution ciphers for items and on adding fake transactions so that each cipher item shares the same frequency as at least k−1 others. It makes use of a compact synopsis of the fake transactions, from which the true support of patterns mined by the server can be efficiently recovered. We also propose a strategy for incremental maintenance of the synopsis against updates consisting of appends and dropping of old transaction batches. Previous research has shown that frequent itemset mining can be computationally intensive, due to the huge search space that is exponential in the data size as well as the possibly explosive number of frequent itemsets; therefore, for clients with limited computational resources, outsourcing the mining task is appealing.

Data mining uses information from past data to analyze the outcome of a particular problem or situation that may arise. Data mining analyzes data stored in data warehouses, and that data may come from all parts of the business, from production to management. Managers also use data mining to decide upon marketing strategies for their products, and can use the data to compare and contrast their position with competitors. Data mining turns its data into real-time analysis that can be used to increase sales, promote a new product, or drop a product that adds no value to the company.

An authenticated web crawler is a trusted program that computes a specially crafted signature over the web contents it visits. This signature enables

  • The verification of common Internet queries on web pages, such as conjunctive keyword searches—this guarantees that the output of a conjunctive keyword search is correct and complete;
  • The verification of the content returned by such Internet queries—this guarantees that web data is authentic and has not been maliciously altered since the computation of the signature by the crawler. In this solution, the search engine returns a cryptographic proof of the query result; a minimal signing sketch is given below.
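
As a flavor of how such a signature over crawled content could be computed and verified, here is a minimal sketch using the standard java.security API; the RSA key pair, the sample page content, and the class name are illustrative assumptions, not the actual construction used by the authenticated crawler.

    import java.security.*;

    // Illustrative only: sign crawled content and verify it later, as a
    // stand-in for the crawler's specially crafted signature.
    public class CrawlerSignSketch {
        public static void main(String[] args) throws Exception {
            KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
            gen.initialize(2048);
            KeyPair pair = gen.generateKeyPair();

            byte[] page = "<html>example page</html>".getBytes("UTF-8");

            // Crawler side: compute the signature over the visited content.
            Signature signer = Signature.getInstance("SHA256withRSA");
            signer.initSign(pair.getPrivate());
            signer.update(page);
            byte[] sig = signer.sign();

            // Client side: check that the returned content is authentic.
            Signature verifier = Signature.getInstance("SHA256withRSA");
            verifier.initVerify(pair.getPublic());
            verifier.update(page);
            System.out.println("authentic: " + verifier.verify(sig));
        }
    }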

METHODOLOGY

Mining specifications can be done using association rule mining. Association rule mining is a very popular data mining technique that finds relationships among the different entities of records (for example, specification records). It has received a great deal of attention in the field of knowledge discovery and data mining. The original formulation of the association rule mining problem was later improved to obtain the Apriori algorithm. The Apriori algorithm employs the downward closure property: if an itemset is not frequent, any superset of it cannot be frequent either. The Apriori algorithm performs a breadth-first search of the search space. The algorithm has been shown to gain a significant performance improvement over Trace Miner and FP-Trace Miner. The proposed system also gives an efficient method to mine specifications from program execution traces. Traces deviating from common trace population rules are removed. The resultant filtered traces are then separated into multiple clusters. By clustering similar traces together, it is expected that the learner is able to learn better and that over-generalization of a subset of traces is not propagated to other clusters. These clusters of filtered traces are then input to a specification miner. This confirms the usefulness of the proposed method in discovering software specifications in iterative pattern form. Besides mining software behavioral patterns, it is believed that the proposed mining technique can potentially be applied to other knowledge discovery domains.
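
To make the downward closure property concrete, the following is a minimal Apriori sketch; the toy transactions and the minimum support threshold of 2 are illustrative assumptions.

    import java.util.*;

    // Minimal Apriori sketch: level-wise candidate generation with
    // downward-closure pruning. The database and minSupport are toy values.
    public class AprioriSketch {
        public static void main(String[] args) {
            List<Set<String>> db = new ArrayList<>();
            db.add(new HashSet<>(Arrays.asList("beer", "chips", "milk")));
            db.add(new HashSet<>(Arrays.asList("beer", "chips")));
            db.add(new HashSet<>(Arrays.asList("milk", "bread")));
            db.add(new HashSet<>(Arrays.asList("beer", "chips", "bread")));
            int minSupport = 2;

            // Level-1 candidates: every distinct item.
            Set<Set<String>> candidates = new HashSet<>();
            for (Set<String> t : db)
                for (String item : t)
                    candidates.add(new HashSet<>(Arrays.asList(item)));

            Map<Set<String>, Integer> frequent = new LinkedHashMap<>();
            while (!candidates.isEmpty()) {
                // Count the support of each candidate over the database.
                List<Set<String>> survivors = new ArrayList<>();
                for (Set<String> cand : candidates) {
                    int support = 0;
                    for (Set<String> t : db) if (t.containsAll(cand)) support++;
                    if (support >= minSupport) {
                        frequent.put(cand, support);
                        survivors.add(cand);
                    }
                }
                // Join surviving k-itemsets into (k+1)-candidates. Downward
                // closure guarantees any itemset with an infrequent subset is
                // itself infrequent, so joining only frequent sets prunes the
                // search space level by level (breadth-first).
                Set<Set<String>> next = new HashSet<>();
                for (int i = 0; i < survivors.size(); i++)
                    for (int j = i + 1; j < survivors.size(); j++) {
                        Set<String> union = new HashSet<>(survivors.get(i));
                        union.addAll(survivors.get(j));
                        if (union.size() == survivors.get(i).size() + 1)
                            next.add(union);
                    }
                candidates = next;
            }
            for (Map.Entry<Set<String>, Integer> e : frequent.entrySet())
                System.out.println(e.getKey() + " support=" + e.getValue());
        }
    }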

TRANSACTION ENTRY

This module contains the frequent item details involved in each transaction. In the grid view control, all the records are displayed, and from there records can be modified and new values updated. In addition, if an item's support count is higher than the minimum support count, it is highlighted.

FREQUENT ITEM TRACES

This module keeps only the nodes with sufficiently high support count; the other nodes are removed from each transaction. In addition, transaction entries are ordered based on support count. These details are stored in the 'Ordered' table and viewed using the grid view control.
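
A minimal sketch of this filter-and-order step; the toy database and support threshold are illustrative assumptions.

    import java.util.*;

    // Sketch: drop items below the minimum support count from each
    // transaction, then order the remaining items by descending support,
    // as done before populating the 'Ordered' table.
    public class OrderTransactions {
        public static void main(String[] args) {
            List<List<String>> db = Arrays.asList(
                Arrays.asList("beer", "chips", "milk"),
                Arrays.asList("beer", "chips"),
                Arrays.asList("milk", "bread"));
            int minSupport = 2;

            // Count the support of each item over all transactions.
            Map<String, Integer> support = new HashMap<>();
            for (List<String> t : db)
                for (String item : t) support.merge(item, 1, Integer::sum);

            for (List<String> t : db) {
                List<String> ordered = new ArrayList<>();
                for (String item : t)
                    if (support.get(item) >= minSupport) ordered.add(item);
                // Descending support order.
                ordered.sort((a, b) -> support.get(b) - support.get(a));
                System.out.println(t + " -> " + ordered);
            }
        }
    }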

PROBABILISTIC APPROACH

This module applies the probabilistic approach to catch, with high probability, mining results that violate the predefined correctness/completeness requirement. The key idea is to construct a set of (in)frequent itemsets from real items and to use these (in)frequent itemsets as evidence to check the integrity of the server's mining results.
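
A minimal sketch of the evidence check, under the assumption that the client holds some itemsets it knows to be genuinely frequent (EF) and some it knows to be genuinely infrequent (EI); the evidence sets and names here are hypothetical.

    import java.util.*;

    // Evidence check: a correct and complete server answer must contain
    // every evidence frequent itemset and none of the evidence infrequent
    // itemsets. A cheating server that misses or invents itemsets is
    // caught whenever it touches the planted evidence.
    public class EvidenceCheck {
        static boolean verify(Set<Set<String>> serverResult,
                              Set<Set<String>> evidenceFrequent,
                              Set<Set<String>> evidenceInfrequent) {
            // Completeness: every evidence frequent itemset must appear.
            if (!serverResult.containsAll(evidenceFrequent)) return false;
            // Correctness: no evidence infrequent itemset may appear.
            for (Set<String> e : evidenceInfrequent)
                if (serverResult.contains(e)) return false;
            return true;
        }

        public static void main(String[] args) {
            Set<Set<String>> result = new HashSet<>(Arrays.asList(
                new HashSet<>(Arrays.asList("beer")),
                new HashSet<>(Arrays.asList("beer", "chips"))));
            Set<Set<String>> ef = Collections.singleton(
                new HashSet<>(Arrays.asList("beer", "chips")));
            Set<Set<String>> ei = Collections.singleton(
                new HashSet<>(Arrays.asList("milk", "bread")));
            System.out.println("passes check: " + verify(result, ef, ei));
        }
    }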

DETERMINISTIC APPROACH

The deterministic approach catches any incorrect or incomplete frequent itemset mining answer with 100 percent probability. The key idea of our deterministic solution is to require the server to construct cryptographic proofs of the mining results. Both the correctness and the completeness of the mining results are measured against the proofs with 100 percent certainty.
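
The exact proof construction is not reproduced here; as an illustration of the hash-based flavor of such proofs, the following sketch commits to a result list with a Merkle root computed via java.security.MessageDigest, so that any dropped or altered itemset changes the root. The sample result list is an assumption.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.*;

    // Merkle-root commitment sketch: the server sends the root along with
    // its mining result; the client recomputes the root over what it
    // received and compares.
    public class MerkleCommit {
        static byte[] sha256(byte[] data) throws Exception {
            return MessageDigest.getInstance("SHA-256").digest(data);
        }

        static byte[] merkleRoot(List<String> leaves) throws Exception {
            List<byte[]> level = new ArrayList<>();
            for (String leaf : leaves)
                level.add(sha256(leaf.getBytes(StandardCharsets.UTF_8)));
            // Repeatedly hash pairs until a single root remains; an odd
            // node at the end of a level is paired with itself.
            while (level.size() > 1) {
                List<byte[]> next = new ArrayList<>();
                for (int i = 0; i < level.size(); i += 2) {
                    byte[] left = level.get(i);
                    byte[] right = (i + 1 < level.size()) ? level.get(i + 1) : left;
                    byte[] both = new byte[left.length + right.length];
                    System.arraycopy(left, 0, both, 0, left.length);
                    System.arraycopy(right, 0, both, left.length, right.length);
                    next.add(sha256(both));
                }
                level = next;
            }
            return level.get(0);
        }

        public static void main(String[] args) throws Exception {
            List<String> minedResult = Arrays.asList("beer", "beer,chips");
            byte[] serverRoot = merkleRoot(minedResult); // sent with result
            byte[] clientRoot = merkleRoot(minedResult); // recomputed by client
            System.out.println("match: " + Arrays.equals(serverRoot, clientRoot));
        }
    }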

ANONYMITY

Attribute Creation: In this module, the attribute id, name, data type, and suppress type (No Suppress, Semi Suppress, Full Suppress) are added to the database table. During Value Generalization Hierarchy creation, the attribute id is selected using the combo box control.

Value Generalization Hierarchy: In this module, during the form load, the attribute ids are fetched from the 'Attributes' table. When an attribute id is selected, its name is displayed; the original value in the data set and the 'After VGH' value are keyed in. A data grid view control is also provided to list all the records. Any record value can be modified and updated in the database using the 'Update' button.

DATA SETS

Data Set Creation: Based on the attributes given and their data types, a data set with 'n' (the attribute count) columns is dynamically created. The record values are keyed in using the data grid view control.

Show Original Data Set: Based on the values given, the record values from the table 'DataSetValues' are displayed using the data grid view control.

Show Suppressed Data: The record values from the table 'DataSetValues' are replaced with '*' marks for the suppressed columns and are displayed using the data grid view control.

Show Generalized Data: The record values from the table 'DataSetValues' are replaced using the 'Value Generalization Hierarchy' table and are displayed using the data grid view control. During the replacement, any single value or range of values is substituted and displayed.
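
A minimal sketch of the two display transformations described above: suppression masks a column with '*', and generalization replaces a value through the Value Generalization Hierarchy (VGH). The column roles, VGH entries, and sample record are illustrative assumptions.

    import java.util.*;

    // Apply suppression and VGH-based generalization to one record.
    public class AnonymizeRecord {
        public static void main(String[] args) {
            String[] columns = {"name", "age", "zip"};
            String[] record  = {"Alice", "34", "60625"};
            Set<String> suppressed = Collections.singleton("name");
            // VGH: original value -> generalized value.
            Map<String, String> vgh = new HashMap<>();
            vgh.put("34", "30-39");
            vgh.put("60625", "606**");

            String[] out = new String[record.length];
            for (int i = 0; i < record.length; i++) {
                if (suppressed.contains(columns[i])) out[i] = "*";
                else out[i] = vgh.getOrDefault(record[i], record[i]);
            }
            System.out.println(Arrays.toString(out)); // [*, 30-39, 606**]
        }
    }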

VIEW

Show Attributes: In this module, the attribute id, name, data type, and suppress type (No Suppress, Semi Suppress, Full Suppress) are viewed from the database table 'Attributes'.

Show Data Set Values: Based on the values given, the record values from the table 'DataSetValues' are displayed using the data grid view control.

Show Value Generalization Hierarchy: In this module, the attribute id, name, original value, and VGH values are displayed. A data grid view control is provided to view all the records.

Show Suppressed Data: The record values from the table 'DataSetValues' are replaced with '*' marks for the suppressed columns and are displayed using the data grid view control.

Show Witness Set: After the VGH values are applied, the records may contain duplicate values in all columns. Those duplicates are eliminated and the witness set is prepared. These records are displayed here.

HOMOMORPHIC PROTOCOL

Suppression-Based Method: In this module, the suppression-based method is used. In Step 1, the DB owner sends the encoded values of the non-suppressed columns, EA(cΔI), to the data provider, and the data provider supplies its tuple values for the non-suppressed columns. In Step 2, the data provider encodes its own tuple values and also the DB owner's encoded values, producing EB(EA(cΔI)). In Step 3, the DB owner decrypts EB(EA(cΔI)), removing its own encryption layer. In this way, both the DB owner and the data provider process the data without disturbing privacy.
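
Step 3 only works if the two parties' encryption layers commute. As an illustrative stand-in (not necessarily the scheme intended here), the following sketch uses SRA-style commutative encryption, E_k(x) = x^k mod p, with toy parameters.

    import java.math.BigInteger;
    import java.security.SecureRandom;

    // SRA-style commutative encryption sketch: since x^(ab) = x^(ba)
    // (mod p), either party can strip its own layer regardless of the
    // order in which the layers were applied, which is what Steps 1-3
    // rely on. The prime and keys below are toy values.
    public class CommutativeEnc {
        public static void main(String[] args) {
            SecureRandom rnd = new SecureRandom();
            BigInteger p = BigInteger.probablePrime(512, rnd);
            BigInteger pMinus1 = p.subtract(BigInteger.ONE);

            // Each key must be invertible modulo p-1 so decryption exists.
            BigInteger a = randomKey(pMinus1, rnd);   // DB owner's key
            BigInteger b = randomKey(pMinus1, rnd);   // data provider's key

            BigInteger x = new BigInteger("123456789"); // an encoded value

            BigInteger ea  = x.modPow(a, p);          // Step 1: EA(x)
            BigInteger eba = ea.modPow(b, p);         // Step 2: EB(EA(x))
            // Step 3: the DB owner removes its layer with a^{-1} mod (p-1),
            // leaving EB(x) without ever learning the provider's key.
            BigInteger eb = eba.modPow(a.modInverse(pMinus1), p);
            System.out.println(eb.equals(x.modPow(b, p))); // true: commutes
        }

        static BigInteger randomKey(BigInteger modulus, SecureRandom rnd) {
            BigInteger k;
            do {
                k = new BigInteger(modulus.bitLength() - 1, rnd);
            } while (k.signum() <= 0 || !k.gcd(modulus).equals(BigInteger.ONE));
            return k;
        }
    }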

System Configuration:

H/W System Configuration:-

Processor          : Pentium IV

Speed               : 1 GHz

RAM                  : 512 MB (min)

Hard Disk          : 20 GB

Keyboard           : Standard Keyboard

Mouse               : Two or Three Button Mouse

Monitor             : LCD/LED Monitor

S/W System Configuration:-

Operating System               : Windows XP/7

Programming Language       : Java/J2EE

Software Version                 : JDK 1.7 or above

Database                            : MySQL

SYSTEM IMPLEMENTATION

The system output is mainly based on the privacy-preserving method, evaluated using the RobFrugal algorithm. The evaluation of the data is done by calculating the support value of each item and arranging the items in descending order. The data is then split into groups to hide it from the third-party server. The data is encrypted and stored on the server side. Multiple companies have access to the server, so the data could otherwise be disclosed; for that reason, the client encrypts its data and stores it on the server in a different format. Based on the mining queries, the server conducts mining and sends the encrypted patterns to the client. Finally, the client decrypts the encrypted patterns and recovers the true support of the original transactions.
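
A minimal sketch of this RobFrugal-style pipeline, under the assumptions of a toy transaction database and a grouping parameter k = 2; the synopsis here is simply a per-cipher-item count of fake occurrences, a simplification of the compact synopsis described earlier.

    import java.util.*;

    // Sketch: compute item supports, sort descending, group items in runs
    // of k, substitute each item with a cipher name, and record in the
    // synopsis how many fake occurrences pad each item up to its group's
    // maximum support. True supports are recovered by subtracting the
    // synopsis counts from what the server observes on the padded data.
    public class RobFrugalSketch {
        public static void main(String[] args) {
            List<List<String>> db = Arrays.asList(
                Arrays.asList("beer", "chips", "milk"),
                Arrays.asList("beer", "chips"),
                Arrays.asList("beer"),
                Arrays.asList("milk", "bread"));
            int k = 2;

            // 1. Supports, then items in descending support order.
            Map<String, Integer> support = new HashMap<>();
            for (List<String> t : db)
                for (String item : t) support.merge(item, 1, Integer::sum);
            List<String> items = new ArrayList<>(support.keySet());
            items.sort((a, b) -> support.get(b) - support.get(a));

            // 2. 1-1 substitution cipher for item names.
            Map<String, String> cipher = new HashMap<>();
            for (int i = 0; i < items.size(); i++)
                cipher.put(items.get(i), "E" + i);

            // 3. Group items k at a time; pad each to the group maximum.
            Map<String, Integer> synopsis = new HashMap<>();
            for (int g = 0; g < items.size(); g += k) {
                int groupMax = support.get(items.get(g));
                for (int i = g; i < Math.min(g + k, items.size()); i++) {
                    String it = items.get(i);
                    synopsis.put(cipher.get(it), groupMax - support.get(it));
                }
            }
            System.out.println("synopsis (fake counts): " + synopsis);

            // 4. Recovery: the server counts true + fake occurrences on the
            //    padded data; subtracting the synopsis yields true support.
            String mined = cipher.get("milk");
            int serverCount = support.get("milk") + synopsis.get(mined);
            System.out.println("recovered support of milk: "
                + (serverCount - synopsis.get(mined)));
        }
    }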

CONCLUSION AND FURTHER WORK

In the Trust-but-Verify: Verifying Result Correctness of Outsourced Frequent Itemset Mining in Data-Mining-As-a-Service Paradigm project, a new algorithm, TM-Trace Miner, is presented using the vertical database representation. The trace ids of each trace set are transformed and compressed into continuous transaction interval lists in a different space using a transaction tree, and frequent trace sets are found by intersecting transaction intervals along a lexicographic tree in depth-first order. Through this project, the TM-Trace Miner algorithm has been shown to gain a significant performance improvement over Trace Miner and FP-Trace Miner.

This project also gives an efficient method to mine specifications from program execution traces. Traces deviating from common trace population rules are removed. The resultant filtered traces are then separated into multiple clusters. By clustering similar traces together, it is expected that the learner is able to learn better and that over-generalization of a subset of traces is not propagated to other clusters. These clusters of filtered traces are then input to a specification miner. This confirms the usefulness of the proposed method in discovering software specifications in iterative pattern form. Besides mining software behavioral patterns, it is believed that the proposed mining technique can potentially be applied to other knowledge discovery domains.