Efficient Algorithms for Mining Top-K High Utility Itemsets

0
277
Efficient Algorithms for Mining Top-K High Utility Itemsets

Efficient Algorithms for Mining Top-K High Utility Itemsets

Abstract

Efficient Algorithms for Mining Top-K High Utility Itemsets management report in data mining.High utility itemsets (HUIs) mining is an emerging topic in data mining, which refers to discovering all itemsets having a utility meeting a user-specified minimum utility threshold min_util. However, setting min_util appropriately is a difficult problem for users. Generally speaking, finding an appropriate minimum utility threshold by trial and error is a tedious process for users. If min_util is set too low, too many HUIs will be generated, which may cause the mining process to be very inefficient.

On the other hand, if min_util is set too high, it is likely that no HUIs will be found. In this paper, we address the above issues by proposing a new framework for top-k high utility itemset mining, where k is the desired number of HUIs to be mined. Two types of efficient algorithms named TKU (mining Top-K Utility itemsets) and TKO (mining Top-K utility itemsets in One phase) are proposed for mining such itemsets without the need to set min_util. We provide a structural comparison of the two algorithms with discussions on their advantages and limitations. Empirical evaluations on both real and synthetic datasets show that the performance of the proposed algorithms is close to that of the optimal case of state-of-the-art utility mining algorithms.

INTRODUCTION

Data mining (occasionally called knowledge discovery) is the process of analyzing data from different angles and summarizing it into useful data. Data mining is a tool for analyzing data. It allows users to analyze data from different levels or angles, arrange it, and the relationships among the data are found. Data mining is the process of finding patterns among sufficient of fields in large relational databases.. Several studies have been done to HUI mining , it is very difficult for users to choose a minimum utility threshold. According to the value of threshold, the output size can be very small or very large. The choice of the threshold greatly influence the performance of the algorithms. If the threshold is too low, too many HUIs will be generated and it is very difficult for the users to understand the results. A large number of HUIs also causes the mining algorithms to become inefficient.

If the algorithms generate more HUIs , It uses more resources also. If the threshold is set too high, no HUI will be generated. To find value for the min_util threshold, users need to try different values by guessing and re-executing the algorithms. This process is very time consuming. To limit the output size and to control the itemsets with the highest utilities without setting the thresholds, a better solution is to change the task of mining HUIs as mining top-k high utility itemsets. Here the users specify k. Here k is the number of desired itemsets, instead of specifying the minimum utility threshold. Setting k is more easier than setting the threshold because k is the number of itemsets that the users want to find whereas selecting the threshold depends on database characteristics, which are unknown to users. Parameter k is used instead of the min_util threshold,it is very useful for many applications. This concept is used to analyze customer purchase behavior.

Top k HUI mining is used to find ,what are the top-k sets of products that contribute the highest profit to the company and how to efficiently found these itemsets without setting the min-util threshold. Top-k HUI mining is essential to many applications,It is not an easy task for developing efficient algorithms for mining such patterns .Two algorithms named TKU and TKO are proposed for mining the complete set of top k HUIs in databases without the need to specify the min_util threshold. The TKU algorithm uses a tree-based structure named UP-Tree, it is used to maintain the information of transactions and utilities of itemsets. TKU inherits useful properties from theTWU model and it consists of two phases. In phase I, potential top-k high utility itemsets are generated. In phase II, top-k HUIs are identified from the set of PKHUIS generated in phase I. The next algorithm is TKO, it uses a list-based structure named utility-list to store the utility information of itemsets in the database. It uses vertical data representation techniques to find top-k HUIs in one phase.

RELATED WORK

  1. High Utility Itemset Mining High Utility Itemset Mining is a popular concept and many algorithms have been proposed for HUI mining such as two-phase, IHUP, IIDS, UP-Growth, and HUI-Miner. These algorithms can be generally classified in two types: Two-phase and one-phase algorithms. The characteristics of two-phase algorithm is that it consist of two phases. In the first phase, they create a set of candidates that are potential high utility itemsets. In the second phase, they calculate the precise utility of each candidate found in the first phase to identify high utility itemsets. Two-phase, IHUP, IIDS, and UP-Growth are two-phase based algorithms.
  2. Top-k Pattern Mining Many studies have been proposed to mine diverse kinds of top-k patterns, such as top-k frequent itemsets, top-k frequent closed itemsets, top-k association rules, top-k sequential rules. The choice of data structures and look for strategy affect the effectiveness of a top-k mining algorithm in terms of both memory and execution time. Apriori is a famous algorithm used in data mining The Apriori algorithm is based on the concept that if a subset H appears N times, any other subset H’ that contains H will appear N times or less. So, if H doesn’t pass the minimum support threshold, neither does H’. There is no need to calculate H’, it is discarded apriori. Now here going to show an example of this algorithm. Let’s suppose here a client john with transactions [ [pen, pencil, book], [pen, book, bag], [pen, bag], [pen, pencil, book] , and a minimum support threshold m of 50 percentage (2 transactions).

First step: Count the singletons apply threshold The singletons for john are: pen: 4, pencil: 2, book: 3, bag: 2 All of the single items appear L or more times, so none of them are discarded. Second step: Generate pairs, count them and apply threshold. The pairs created were: pen, pencil, pen, book, pen, bag, pencil, book, pencil, bag, book, bag. Now we proceed to count them and applying the threshold. pen, pencil: 2 pen, book: 3 pen, bag: 2 pencil, book: 2 pencil, bag: 0 book, bag: 1 pencil, bag and book, bag have not passed the threshold, so they are discarded and any other subcombination both of them can generate. The left over pairs are put in a temporary set. Ass = pen, pencil, pen, book, pen, bag, pencil, book Step M: will generate triplets, quadruplets, etc., add up them, apply threshold and remove containing itemsets.

We generate triplets from our pairs. Triplets = pen, pencil, book, pen, pencil, bag, pen, book, bag, pencil, book, bag. Now we count them: pen, pencil, book: 2 pen, pencil, bag: 0 pen, book, bag: 1 pencil, book, bag: 0 Only pen, pencil, book has passed the threshold, so now we proceed to add it to Ass, but first, we have to remove the subsets that pen, pencil, book contains. Before adding our left over triplet Ass is looked like this: pen pencil, pen, book, pen, bag, pencil, book. When we add the triplet and remove the subsets that are inside it pen, pencil, pen, book and pencil, book are the ones that should go. Ass now look like pen, pencil, book, pen, bag, and this is the final result. If we had more than 1 triplet after apply the threshold, we proceed to generating the quadruplets, counting them, applying the threshold, adding up them to Ass and removing the subsets that each quadruplet contains

High Utility Itemset

Mining Many studies have been proposed for mining HUIs, including Two-Phase, IHUP, IIDS  and UP-Growth . TwoPhase and IHUP utilize transaction-weighted downward closure property to find high utility itemsets. They consist of two phases. In phase I, they find all HTWUIs from the database. In phase II, high utility itemsets are identified from the set of HTWUIs by scanning the original database. Although these methods capture the complete set of HUIs, they may generate too many candidates in phase I, i.e. HTWUIs, which degrades the performance of phase II and the overall performance (in terms of time and space).

To reduce the number of candidates in phase I, various methods have been proposed. Recently, Tseng et al. proposed UP-Growth with four effective strategies DGU, DGN, DLU and DLN, for mining HUIs. Experiments showed that the number of candidates generated by UP-Growth in phase I can be order of magnitudes smaller than that of HTWUIs. To the best of our knowledge, UP-Growth is the state-of-the-art method for mining high utility itemsets. Although many studies addressed the topic of mining high utility itemset from transaction databases, few of them showed the flexibility of mining top-k high utility itemsets. Although the concept of top-k high utility itemset mining was first introduced in [3], the definition of high utility itemset is different and our work.

Top-k Frequent Itemset

Mining In frequent pattern mining, several top-k pattern mining algorithms have been proposed. Most of them follow a same general process for finding top-k patterns, although they also have several differences. We describe this general process below and then highlight the challenges for top-k high utility itemset mining. The general process for mining top-k patterns from a database is the following. Initially, a top-k pattern mining algorithm sets minimum support threshold minsup to 0 to ensure that all the topk patterns will be found. Then, the algorithm starts searching for patterns by using a search strategy. As soon as a pattern is found, it is added to a list of patterns L ordered by the support of patterns. The list L is used to maintain the top-k patterns found until now. Once k patterns are found, the value of minsup is raised to the support of the least interesting pattern in L. Raising minsup is used to prune the search space when searching for more patterns.

Thereafter, each time a pattern is found that meets the minimum support threshold, the pattern is inserted into L, the patterns in L not respecting the threshold anymore are removed from L, and the threshold is raised to the support of the least frequent patterns in L. The algorithm continues searching for more patterns until no pattern is found by the search strategy. What distinguish each top-k pattern mining algorithm are the data structures and search strategies to discover patterns. Top-k pattern mining algorithm needs to use appropriate data structure and search strategies to be efficient in both memory and execution 80 time. Besides, the efficiency of a top-k algorithm depends largely on how fast it can raise the minimum interestingness criterion (minsup) to prune the search space. To raise the threshold quickly, it is desirable that a top-k pattern mining algorithm uses a search strategy that will find the most interesting patterns as early as possible. Although several efficient top-k pattern mining algorithms have been designed based on this idea, it is not possible to simply adapt this idea to HUI mining.

System Configuration:

H/W System Configuration:-

Processor          : Pentium IV

Speed               : 1 Ghz

RAM                  : 512 MB (min)

Hard Disk          : 20GB

Keyboard           : Standard Keyboard

Mouse               : Two or Three Button Mouse

Monitor             : LCD/LED Monitor

S/W System Configuration:-

Operating System               : Windows XP/7

Programming Language       : Java/J2EE

Software Version                 : JDK 1.7 or above

Database                            : MYSQL

PROPOSED ALGORITHM

The TKU Algorithm.

Here, proposed an algorithm named TKU for finding top-k HUIs without specify min_util. The Baseline approach of TKU is an seize of UP-Growth, a tree-based algorithm for mining HUIs. TKU uses the UP-Tree structure of UP-Growth to manage the information of transactions and top-k HUIs. TKU is worked in three steps.

  • Constructing the UP-Tree,
  • generating potential top-k high utility itemsets from the UP-Tree,
  • identifying top-k HUIs from the set of PKHUIs.

UP-Tree structure.:

Each node N of a UP-Tree has five entries: N.name is the item name of N; N.count is the support count of N; N.nu is the node utility of N; N.parent indicates the parent node of N; N.hlink is a node link which may point to a node having the same item name as N.name. The Header table is a structure employed to facilitate the traversal of the UP-Tree. A header table entry contains an item name, an estimated utility value, and a link. The link points to the first node in the UP-Tree having the same item name as the entry. The nodes whose item names are the same can be traversed efficiently by following the links in header table and the node links in the UP-Tree.

Construction of UP-Tree:

A UP-Tree can be constructed by scanning the original database twice. In the first scan, the transaction utility of each transaction and TWU of the each item are completed. Subsequently, items are inserted into the header table in descending order of their TW use. During the second database scan, transactions are reorganized and then inserted into the UP-Tree. Initially, the tree is created with a Root. When a transaction is retrieved, items in the transaction are sorted in descending order of TWU. A transaction after the above reorganization is called reorganized transaction and its transaction utility is called reorganized transaction utility. After inserting all the reorganized transactions, the construction of the UP-Tree is completed.

Conclusion

Efficient Algorithms for Mining Top-K High Utility Itemsets management report in data mining.Here studied the problem of top-k high utility itemsets mining, where k is the number of high utility itemsets to be mined. Two efficient algorithms TKU (mining Top-K Utility itemsets) and TKO (mining Top-K utility itemsets in One phase) are proposed here for mining such itemsets without using minimum utility thresholds concept.

TKU is the first two-phase algorithm for mining top-k high utility itemsets. On the other hand, TKO is the first one-phase algorithm developed for top-k HUI mining. Empirical evaluations on different types of real and synthetic datasets which show the proposed algorithms have good scalability on large datasets and the performance of the proposed algorithms is close to the optimal case of two-phase and one-phase utility mining algorithms. Although here proposed a new framework for top-k HUI mining.