Top-K Dominating Queries on Incomplete Data

Abstract

The top-k dominating (TKD) query returns the k objects that dominate the maximum number of objects in a given dataset. It combines the advantages of skyline and top-k queries, and plays an important role in many decision support applications. Incomplete data exists in a wide spectrum of real datasets, due to device failure, privacy preservation, data loss, and so on. In this paper, for the first time, we carry out a systematic study of TKD queries on incomplete data, that is, data in which some dimensional values are missing.

We formalize this problem, and propose a suite of efficient algorithms for answering TKD queries over incomplete data. Our methods employ some novel techniques, such as upper bound score pruning, bitmap pruning, and partial score pruning, to boost query efficiency. Extensive experimental evaluation using both real and synthetic datasets demonstrates the effectiveness of our developed pruning heuristics and the performance of our presented algorithms.

Introduction

Data mining is a powerful method for discovering knowledge within large amounts of data. It is the process of discovering meaningful new relationships, patterns, and trends by sifting through large amounts of data stored in a corpus, using pattern recognition technologies as well as statistical and mathematical techniques. Data are any facts, numbers, or sequences of characters that can be processed by a computer. Today, organizations handle large and growing amounts of data in different structures and different databases. Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both.

It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among many fields in large relational databases. Given a set S of d-dimensional objects, a top-k dominating (TKD) query ranks each object o by the number of objects in S that o dominates, and returns the k objects that dominate the maximum number of objects. The TKD query identifies the most significant objects, and is a powerful decision-making tool for ranking objects in real-life applications. In this paper, we consider an incomplete dataset, in which some objects have missing attribute values in some dimensions, and study the problem of TKD query processing over incomplete data. A TKD query on incomplete data returns the k objects that dominate the maximum number of objects in a given incomplete dataset.
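To make the definition concrete, the following is a minimal Java sketch of a naive baseline. It is not one of the algorithms proposed in this paper. It rests on several assumptions that the text above does not state: a missing value is encoded as `Double.NaN`, smaller values are better, and an object dominates another if it is no worse on every dimension observed in both objects and strictly better on at least one of them. The class and method names (`NaiveTkd`, `dominates`, `score`, `topK`) are hypothetical, and the code assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Naive TKD baseline on incomplete data; Double.NaN marks a missing value (assumption). */
final class NaiveTkd {

    /** True if a dominates b, comparing only dimensions observed in both (smaller is better). */
    static boolean dominates(double[] a, double[] b) {
        boolean strictlyBetter = false;
        for (int i = 0; i < a.length; i++) {
            if (Double.isNaN(a[i]) || Double.isNaN(b[i])) {
                continue; // skip dimensions missing in either object
            }
            if (a[i] > b[i]) {
                return false; // a is worse on a commonly observed dimension
            }
            if (a[i] < b[i]) {
                strictlyBetter = true;
            }
        }
        return strictlyBetter;
    }

    /** Score of object idx: the number of other objects it dominates. */
    static int score(double[][] data, int idx) {
        int count = 0;
        for (int j = 0; j < data.length; j++) {
            if (j != idx && dominates(data[idx], data[j])) {
                count++;
            }
        }
        return count;
    }

    /** Indices of the k highest-scoring objects (ties broken by lower index). */
    static List<Integer> topK(double[][] data, int k) {
        int[] scores = new int[data.length];
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            scores[i] = score(data, i);
            ids.add(i);
        }
        ids.sort(Comparator.comparingInt((Integer i) -> -scores[i])
                .thenComparingInt(i -> i));
        return new ArrayList<>(ids.subList(0, Math.min(k, ids.size())));
    }
}
```

This baseline compares every pair of objects, so its cost grows quadratically with the dataset size, which is the cost that pruning techniques aim to avoid.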

In brief, the key contributions of this paper are summarized as follows.

  • We formalize the problem of TKD query in the context of incomplete data. To our knowledge, there is no prior work on this problem.
  • We propose efficient algorithms for processing TKD queries on incomplete data, using several novel heuristics.
  • We present an adaptive binning strategy with an efficient method for choosing the appropriate number of bins to minimize the space of bitmap index for IBIG.
  • We conduct extensive experiments using both real and synthetic datasets to demonstrate the effectiveness of our developed pruning heuristics and the performance of our proposed algorithms.

    System Configuration:

    H/W System Configuration:

    Processor              : Pentium IV
    Speed                  : 1 GHz
    RAM                    : 512 MB (min)
    Hard Disk              : 20 GB
    Keyboard               : Standard Keyboard
    Mouse                  : Two or Three Button Mouse
    Monitor                : LCD/LED Monitor

    S/W System Configuration:

    Operating System       : Windows XP/7
    Programming Language   : Java/J2EE
    Software Version       : JDK 1.7 or above
    Database               : MySQL

PROPOSED ALGORITHM

Our proposed UBB algorithm limits the size of the candidate set by using the upper bound score pruning technique for the TKD query on incomplete data. However, the upper bound score may be rather loose, so we have to derive the real scores for many objects (even the whole dataset) via exhaustive pairwise comparisons, which degrades search performance significantly. Thus, an efficient score computation method is needed. As a solution, we introduce a newly proposed bitmap index on incomplete data and propose the bitmap index guided (BIG) algorithm to solve the TKD query on incomplete data. Combined with the MaxScore technique, BIG enables a novel bitmap pruning using the bitmap index, and employs fast bit-wise operations for more efficient score computation. Moreover, we also develop an improved version of BIG (denoted as IBIG) to minimize the bitmap storage cost via bitmap compression techniques and an adaptive binning strategy. As we know, the traditional bitmap index is based on complete data, and it supports dominance relationship checking via bit-wise operations.
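The idea of upper bound score pruning can be illustrated with a short Java sketch. It is a sketch under stated assumptions, not the authors' UBB algorithm: the text does not define the upper bound, so the code uses one simple valid bound, namely the minimum, over the observed dimensions of an object, of the number of other objects whose value on that dimension is missing or not better. Missing values are encoded as `Double.NaN`, smaller values are assumed to be better, and all names (`UpperBoundTkd`, `upperBound`, `topK`) are hypothetical. Java 8 or later is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch of top-k search with upper bound score pruning; Double.NaN marks a missing value. */
final class UpperBoundTkd {

    /** Dominance over commonly observed dimensions only (smaller is better). */
    static boolean dominates(double[] a, double[] b) {
        boolean strictlyBetter = false;
        for (int i = 0; i < a.length; i++) {
            if (Double.isNaN(a[i]) || Double.isNaN(b[i])) continue;
            if (a[i] > b[i]) return false;
            if (a[i] < b[i]) strictlyBetter = true;
        }
        return strictlyBetter;
    }

    /** Exact score via exhaustive pairwise comparison. */
    static int score(double[][] data, int idx) {
        int s = 0;
        for (int j = 0; j < data.length; j++) {
            if (j != idx && dominates(data[idx], data[j])) s++;
        }
        return s;
    }

    /** Any dominated object must be missing or not better on every observed dimension. */
    static int upperBound(double[][] data, int idx) {
        int bound = data.length - 1;
        for (int i = 0; i < data[idx].length; i++) {
            if (Double.isNaN(data[idx][i])) continue;
            int notBetter = 0;
            for (int j = 0; j < data.length; j++) {
                if (j == idx) continue;
                double v = data[j][i];
                if (Double.isNaN(v) || v >= data[idx][i]) notBetter++;
            }
            bound = Math.min(bound, notBetter);
        }
        return bound;
    }

    /** Visits objects by descending upper bound and stops once no object can improve the result. */
    static List<Integer> topK(double[][] data, int k) {
        List<Integer> result = new ArrayList<>();
        if (k <= 0) return result;
        int n = data.length;
        int[] ub = new int[n];
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            ub[i] = upperBound(data, i);
            order.add(i);
        }
        order.sort((a, b) -> Integer.compare(ub[b], ub[a]));

        // Min-heap of {score, id}: the head is the weakest of the current best k.
        PriorityQueue<int[]> best =
                new PriorityQueue<int[]>((x, y) -> Integer.compare(x[0], y[0]));
        for (int id : order) {
            if (best.size() == k && ub[id] <= best.peek()[0]) {
                break; // remaining upper bounds cannot exceed the k-th best score
            }
            best.add(new int[] {score(data, id), id});
            if (best.size() > k) best.poll();
        }
        List<int[]> entries = new ArrayList<>(best);
        entries.sort((x, y) -> x[0] != y[0]
                ? Integer.compare(y[0], x[0]) : Integer.compare(x[1], y[1]));
        for (int[] e : entries) result.add(e[1]);
        return result;
    }
}
```

When the bounds are loose, the early exit triggers late and most objects still need an exact score, which is the weakness noted above.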

However, the traditional bitmap index is not applicable to our problem, which is based on incomplete data. Hence, a new bitmap index must be designed to handle missing data. Also, the dominance relationship of the TKD query on incomplete data cannot be determined based only on bit operations. Thus, an efficient algorithm based on a bitmap index that supports missing data is also desired. Specifically, our new bitmap index is built as follows. First, an object o is represented by a bit string in the bitmap index, where each dimension of o is represented by a sub-string of bits.
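As a rough illustration of how bit-wise operations can drive score computation when values are missing, consider the Java sketch below. It is not the bitmap index or the BIG algorithm of this paper, whose exact layout is not given here. Instead, it assumes a simplified index that stores, for every dimension and every distinct observed value, two bit vectors over the objects: those whose value is missing or not better, and those whose value is observed and strictly worse. Missing values are encoded as `Double.NaN`, smaller values are assumed to be better, and the name `BitmapScoreSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified bitmap-style index for scoring on incomplete data (illustrative only). */
final class BitmapScoreSketch {
    private final double[][] data;
    // Per dimension: value -> objects whose value is missing or >= that value.
    private final List<Map<Double, BitSet>> notBetter = new ArrayList<>();
    // Per dimension: value -> objects whose value is observed and > that value.
    private final List<Map<Double, BitSet>> worse = new ArrayList<>();

    BitmapScoreSketch(double[][] data) {
        this.data = data;
        int dims = data.length == 0 ? 0 : data[0].length;
        for (int i = 0; i < dims; i++) {
            Map<Double, BitSet> nb = new HashMap<>();
            Map<Double, BitSet> w = new HashMap<>();
            for (double[] o : data) {
                double v = o[i];
                if (Double.isNaN(v) || nb.containsKey(v)) continue;
                BitSet nbBits = new BitSet(data.length);
                BitSet wBits = new BitSet(data.length);
                for (int j = 0; j < data.length; j++) {
                    double u = data[j][i];
                    if (Double.isNaN(u) || u >= v) nbBits.set(j);
                    if (u > v) wBits.set(j); // false when u is NaN
                }
                nb.put(v, nbBits);
                w.put(v, wBits);
            }
            notBetter.add(nb);
            worse.add(w);
        }
    }

    /** Number of objects dominated by object idx, computed with AND/OR on bit vectors. */
    int score(int idx) {
        BitSet candidates = new BitSet(data.length);
        candidates.set(0, data.length);
        BitSet strictlyWorse = new BitSet(data.length);
        for (int i = 0; i < data[idx].length; i++) {
            double v = data[idx][i];
            if (Double.isNaN(v)) continue; // unobserved dimensions impose no constraint
            candidates.and(notBetter.get(i).get(v)); // never better on a shared dimension
            strictlyWorse.or(worse.get(i).get(v));   // strictly worse on at least one
        }
        candidates.and(strictlyWorse);
        return candidates.cardinality();
    }
}
```

In this sketch the intersection yields the objects that are never better than the query object on a shared dimension, and the final AND keeps only those that are strictly worse somewhere, so the object itself and objects sharing no observed dimension are excluded. Storing one pair of bit vectors per distinct value is space-hungry, which motivates compression and binning.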

Top-k Dominating Queries. Papadias et al. first introduce the top-k dominating (TKD) query as a variation of skyline queries, and they present a skyline-based algorithm for processing TKD queries on a traditional complete dataset indexed by an R-tree. To boost efficiency, Yiu and Mamoulis propose two approaches based on the aR-tree to tackle the TKD query. More recently, some new variants of TKD queries have been studied, including the subspace dominating query, the continuous top-k dominating query, the metric-based top-k dominating query, the top-k dominating query on massive data, etc. In addition, the probabilistic top-k dominating (PTKD) query has also been explored. Specifically, Lian and Chen investigate the PTKD query on uncertain data, which returns the k uncertain objects that are expected to dynamically dominate the largest number of uncertain objects in both the full space and subspaces. Zhang et al. consider the threshold-based PTKD query in full spaces. Zhan et al. adopt the parameterized ranking semantics to formally define the TKD query on multi-dimensional uncertain objects.

Existing System

Data mining is the process of extracting knowledge from large amounts of data stored in databases. It has attracted attention in the information industry in recent years due to the availability of huge amounts of data in electronic form and the need to turn such data into useful information and knowledge for broad applications, including business management and decision support. Data mining has many benefits when used in a specific industry, but it also has disadvantages. It benefits businesses, society, governments, and individuals: it predicts future trends and customer purchase habits, and it helps in decision making and market basket analysis. Its drawbacks are privacy and security concerns, high cost at the implementation stage, and possible misuse of information, all of which become serious problems if they are not addressed and resolved properly. Xuemin Lin extended the well-known skyline analysis to uncertain data, and developed two algorithms to tackle the problem of computing probabilistic skylines on uncertain data, evaluated using real and synthetic datasets.

W. T. Balke addressed skyline retrieval for distributed Web information services, which are prime examples of applications that benefit from this work. They presented a first algorithm that retrieves the skyline over distributed data sources with basic middleware access techniques, and showed that it features optimal complexity in terms of object accesses. To overcome the deterioration for higher numbers of lists, they also proposed an efficient sampling technique to estimate the size of a skyline by assessing the degree of data correlation. M. E. Khalifa et al. note that skyline queries aim to prune the search space of large numbers of multi-dimensional data items to a small set of interesting items by eliminating items that are dominated by others. Existing skyline algorithms assume that all dimensions are available for all the data items.

Their work goes beyond this restrictive assumption and addresses the more practical case involving incomplete data items. X. Miao proposed an efficient method for processing probabilistic skyline queries on uncertain data streams. As data volume continuously grows, its quality may not be as high as usual: the data can be defective, imprecise, or inaccurate because of the data acquisition process. The skyline query is widely used for data analysis and to derive results that meet more than one specific condition simultaneously. M. Kontaki: the continuous top-k dominating query (cTKDQ) method has some limitations. To overcome them, a new indexing structure known as the close dominance graph (CDG) is introduced to support and maintain the relationships between dynamic data records.

However, CDG takes more time to search for results. In that work, a dictionary-based compression algorithm is introduced, which is efficient in answering cTKDQ with minimal time and memory. Papadias et al. first introduce the top-k dominating query as a variation of skyline queries, and they present a skyline-based algorithm for processing TKD queries on a traditional complete dataset indexed by an R-tree. Yiu and Mamoulis propose two approaches based on the aR-tree to tackle the TKD query. More recently, some new variants of TKD queries have been studied, including the subspace dominating query, the continuous top-k dominating query, the metric-based top-k dominating query, and the top-k dominating query on massive data. Gao et al. propose an efficient kISB algorithm for processing k-skyband queries over incomplete data. Lofi et al. present an approach to compute the skyline using crowd-enabled databases, with the challenge of dealing with missing information in datasets.

Proposed System

Our proposed UBB algorithm limits the size of the candidate set by utilizing the upper bound score pruning technique for the TKD query on incomplete data. However, the upper bound score may be rather loose, so we have to derive the real scores for many objects (even the whole dataset) via exhaustive pairwise comparisons, which degrades search performance significantly. Thus, an efficient score computation method is needed. As a solution, we introduce a newly proposed bitmap index on incomplete data and propose the bitmap index guided (BIG) algorithm to solve the TKD query on incomplete data. We then propose the improved BIG algorithm (termed IBIG) to efficiently address the storage issue by using a bitmap compression technique and a binning strategy. Specifically, the compression techniques are applied to the "vertical" bitsets, while the binning strategy compresses the bitmap index on the "horizontal" bitsets, i.e., the bit-string of every object in the dataset. We consider two efficient and popular compression techniques, Word Aligned Hybrid (WAH) and Compressed 'n' Composable Integer Set (CONCISE), to compress the bitmap index vertically. In this paper, we choose CONCISE instead of WAH, because CONCISE has been shown to have a better compression ratio than WAH, while its computational complexity is comparable to that of WAH. We also demonstrate via an empirical evaluation that CONCISE does perform better than WAH. The advantage of IBIG is that it consumes less storage.
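To give a feel for why the vertical bitsets compress well, the Java sketch below applies plain run-length encoding to a bit vector. This is deliberately not WAH or CONCISE, which are word-aligned schemes with more elaborate encodings. It only illustrates the shared underlying idea that long runs of identical bits can be stored as counts. The class and method names (`RunLengthSketch`, `encode`, `decode`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Plain run-length encoding of a bit vector (illustration only, not WAH or CONCISE). */
final class RunLengthSketch {

    /** Encodes bits[0..length) as alternating run lengths, starting with a run of 0s (possibly empty). */
    static List<Integer> encode(BitSet bits, int length) {
        List<Integer> runs = new ArrayList<>();
        boolean current = false; // value of the run being measured
        int pos = 0;
        while (pos < length) {
            int next = current ? bits.nextClearBit(pos) : bits.nextSetBit(pos);
            if (next < 0 || next > length) {
                next = length; // the run extends to the end of the vector
            }
            runs.add(next - pos);
            pos = next;
            current = !current;
        }
        return runs;
    }

    /** Rebuilds the bit vector from alternating run lengths. */
    static BitSet decode(List<Integer> runs) {
        BitSet bits = new BitSet();
        boolean current = false;
        int pos = 0;
        for (int run : runs) {
            if (current) {
                bits.set(pos, pos + run); // a run of 1s
            }
            pos += run;
            current = !current;
        }
        return bits;
    }
}
```

A 16-bit vector consisting of ten 1s followed by six 0s is stored here as the three counts 0, 10, 6. Vectors with many short runs gain little from such a scheme, which is one reason the choice of compression technique is evaluated empirically.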

FUTURE WORK

This paper surveys different works related to top-k dominating queries on incomplete data. Top-k queries return the top elements of a dataset and are very useful in various real-time applications. Generally, a skyline-based approach is used in such cases. More techniques need to be developed to find the top elements of an incomplete dataset. This paper is not a complete reference, but is intended to help students who are interested in researching this topic by giving a brief overview of it.

Conclusion

In this paper, we study the problem of the TKD query on incomplete data, where some dimensional values are missing in the dataset. To address it efficiently, we first propose the ESB and UBB algorithms, which utilize novel heuristics to prune the search space. To further reduce the cost of score computation, we present the BIG algorithm, which employs upper bound score pruning, bitmap pruning, and fast bitwise operations based on the bitmap index to improve score computation and boost query performance accordingly. To trade efficiency for space, we propose the IBIG algorithm, which applies the bitmap compression technique and the binning strategy to BIG, and we develop a method to choose the appropriate number of bins. Experimental results on both real and synthetic datasets confirm the effectiveness and efficiency of our algorithms.