**Booster in High Dimensional Data Classification**

## Abstract

Classification problems in high dimensional data with a small number of observations are becoming more common, especially in microarray data. During the last two decades, many efficient classification models and feature selection (FS) algorithms have been proposed to achieve higher prediction accuracy. However, the result of an FS algorithm based on prediction accuracy will be unstable under variations in the training set, especially in high dimensional data.

This paper proposes a new evaluation measure, the Q-statistic, that incorporates the stability of the selected feature subset in addition to the prediction accuracy. We then propose the Booster of an FS algorithm, which boosts the value of the Q-statistic of the algorithm applied. Empirical studies based on synthetic data and 14 microarray data sets show that Booster boosts not only the value of the Q-statistic but also the prediction accuracy of the algorithm applied, unless the data set is intrinsically difficult to predict with the given algorithm.

## INTRODUCTION

High dimensional data is becoming more common in many practical applications such as data mining, machine learning, and microarray gene expression data analysis. Typical publicly available microarray data has tens of thousands of features with a small sample size, and the number of features considered in microarray data analysis is growing. Recently, with the increasing amount of digital text on Internet web pages, text clustering (TC) has become a demanding technique used to cluster a massive number of documents into a set of clusters. It is used in text mining, pattern recognition, and other areas. The Vector Space Model (VSM) is a common model used in text mining to represent document components.

Hence, each document is represented as a vector of term weights, and each term weight is represented as one dimension of the space. Usually, text documents contain both informative and uninformative features, where uninformative features are irrelevant, redundant, or uniformly distributed. Unsupervised feature selection (FS) is an important task used to find a new subset of informative features to improve the TC algorithm. Methods used in statistical variable selection problems, such as forward selection, backward elimination, and their combination, can be used for FS problems[3]. Most of the successful FS algorithms for high dimensional problems have used forward selection but have not considered backward elimination, since it is impractical to implement backward elimination with a huge number of features.

**A NEW PROPOSAL FOR FEATURE SELECTION**

This paper proposes the Q-statistic to gauge the performance of an FS algorithm with a classifier. It is a hybrid measure of the prediction accuracy of the classifier and the stability of the selected features. The paper then proposes Booster for selecting a feature subset with a given FS algorithm. The basic idea of Booster is to obtain several data sets from the original data set by resampling on the sample space. The FS algorithm is then applied to these resampled data sets to obtain different feature subsets. The union of these selected subsets is the feature subset obtained by the Booster of the FS algorithm.

In related work, experiments were conducted using spam email, and the authors found that the proposed genetic algorithm for FS improved performance on text. The FS technique is a kind of optimization problem that is used to obtain a new subset of features. The cat swarm optimization (CSO) algorithm has been proposed to solve optimization problems. However, CSO is limited by long execution times. The authors modified it to improve the FS technique in text classification. Experimental results showed that the proposed modified CSO outperforms the traditional version and achieves more accurate results in the FS technique.

**BOOSTER**

Booster is simply a union of feature subsets obtained by a resampling technique. The resampling is done on the sample space. The three FS algorithms considered in this paper are minimal-redundancy-maximal-relevance (mRMR), Fast Correlation-Based Filter (FCBF), and the Fast clustering-bAsed feature Selection algorithm (FAST). All three methods work on discretized data. For mRMR, the selection size m is fixed to 50 after extensive experimentation. A smaller size gives lower accuracies and lower values of the Q-statistic, while a larger selection size, say 100, gives little improvement over 50. The reason for choosing these three methods is that FAST is the most recent one we found in the literature and the other two are well known for their efficiency. FCBF and mRMR explicitly include code to remove redundant features. Although FAST does not explicitly include code for removing redundant features, they should be eliminated implicitly since the algorithm is based on a minimum spanning tree. Our extensive experiments support that these three FS algorithms are at least as efficient as other algorithms, including CFS.

## EXISTING SYSTEM

Methods used in statistical variable selection problems, such as forward selection, backward elimination, and their combination, can be used for FS problems. Most of the successful FS algorithms for high dimensional problems have used forward selection but have not considered backward elimination, since it is impractical to implement backward elimination with a huge number of features. A serious intrinsic problem with forward selection, however, is that a flip in the decision on the initial feature may lead to a completely different feature subset. As a result, the stability of the selected feature set will be very low even though the selection may yield very high accuracy. This is known as the stability problem in FS. Research in this area is relatively new, and devising an efficient method to obtain a more stable feature subset with high accuracy remains challenging.
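To make the stability problem concrete, the sketch below scores how much the feature subsets selected on different training sets agree with each other. It uses the average pairwise Jaccard similarity, which is an assumption chosen for illustration and is not the Q-statistic defined in the paper; the function names and the gene labels are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature subsets (1.0 means identical)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def average_stability(subsets):
    """Mean pairwise Jaccard similarity over all selected subsets.

    Assumption: this is one common way to quantify stability,
    not the measure used by the paper.
    """
    pairs = list(combinations(subsets, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical subsets selected on three variations of a training set.
stable_runs = [{"g1", "g2", "g3"}, {"g1", "g2", "g3"}, {"g1", "g2", "g4"}]
unstable_runs = [{"g1", "g2", "g3"}, {"g4", "g5", "g6"}, {"g7", "g2", "g9"}]

print(average_stability(stable_runs))
print(average_stability(unstable_runs))
```

A selector whose first choice flips between runs behaves like `unstable_runs`: each run may be accurate on its own, yet the subsets barely overlap, so the score is low.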

**Disadvantages**

- Several studies based on re-sampling technique have been done to generate different data sets for classification problem, and some of the studies utilize re-sampling on the feature space.

## PROPOSED SYSTEM

This paper proposes the Q-statistic to evaluate the performance of an FS algorithm with a classifier. It is a hybrid measure of the prediction accuracy of the classifier and the stability of the selected features. The paper then proposes Booster for selecting a feature subset with a given FS algorithm. The basic idea of Booster is to obtain several data sets from the original data set by re-sampling on the sample space. The FS algorithm is then applied to each of these re-sampled data sets to obtain different feature subsets. The union of these selected subsets is the feature subset obtained by the Booster of the FS algorithm. Empirical studies show that the Booster of an algorithm boosts not only the value of the Q-statistic but also the prediction accuracy of the classifier applied.
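The resample, select, and union steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: bootstrap sampling with replacement stands in for the re-sampling scheme, which the text does not specify, and `top_k_by_mean_gap` is a hypothetical stand-in for a real FS algorithm such as mRMR, FCBF, or FAST.

```python
import random


def top_k_by_mean_gap(X, y, k):
    """Hypothetical stand-in FS algorithm: keep the k features with the
    largest absolute gap between the class-1 and class-0 means."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        if not pos or not neg:  # a resample may contain only one class
            scores.append(0.0)
            continue
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def booster(X, y, fs_algorithm, k, n_resamples=5, seed=0):
    """Union of the feature subsets selected on each resampled data set.

    Assumption: rows are resampled with replacement (bootstrap).
    """
    rng = random.Random(seed)
    n = len(X)
    selected = set()
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample the samples, not the features
        X_resampled = [X[i] for i in idx]
        y_resampled = [y[i] for i in idx]
        selected |= fs_algorithm(X_resampled, y_resampled, k)
    return selected


# Synthetic data: feature 0 separates the classes, features 1..9 are noise.
data_rng = random.Random(42)
y = [i % 2 for i in range(20)]
X = [[10.0 * label + data_rng.random()] + [data_rng.random() for _ in range(9)]
     for label in y]

features = booster(X, y, top_k_by_mean_gap, k=2)
print(sorted(features))
```

Because the result is a union, the boosted subset is at least as large as a single selection and at most `k * n_resamples` features; features that are selected on every resample, such as feature 0 here, are always retained.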

**Advantages**

- The prediction accuracy of classification without consideration on the stability of the selected feature subset.
- The MI estimation with numerical data involves density estimation of high dimensional data.

## System Configuration:

**H/W System Configuration:-**

Processor : Pentium IV

Speed : 1 GHz

RAM : 512 MB (min)

Hard Disk : 20GB

Keyboard : Standard Keyboard

Mouse : Two or Three Button Mouse

Monitor : LCD/LED Monitor

**S/W System Configuration:-**

Operating System : Windows XP/7

Programming Language : Java/J2EE

Software Version : JDK 1.7 or above

Database : MySQL

## EFFICIENCY OF BOOSTER

There are two concepts in Booster to reflect the two domains. The first is the shape, Booster's equivalent of a traditional array: a finite set of elements of a certain data type, accessible through indices. Unlike arrays, shapes need not be rectangular, although for convenience we will, for the moment, assume that they are. From the algorithm designer's point of view, shapes serve as the basic placeholders for the algorithm's data: input, output, and intermediate values are stored within shapes. As we will see later on, this does not necessarily mean that they are represented in memory that way, but the algorithm designer is allowed to think so.

This section presents the effect of s-Booster on accuracy and Q-statistic against those of the original s. The classifier used here is NB.

**BOOSTER BOOSTS ACCURACY**

Boosting is a technique for generating and combining multiple classifiers to improve predictive accuracy. It is a type of machine learning meta-algorithm for reducing bias in supervised learning and can be viewed as the minimization of a convex loss function over a convex set of functions. At issue is whether a set of weak learners can create a single strong learner. A weak learner is defined as a classifier that is only slightly correlated with the true classification, and a strong learner is a classifier that is arbitrarily well correlated with the true classification. Learning algorithms that turn a set of weak learners into a single strong learner are known as boosting algorithms.
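As an illustration of how weak learners are combined into a stronger one, the sketch below implements AdaBoost with one-dimensional threshold stumps as the weak learners. AdaBoost is chosen here as a representative boosting algorithm and is an assumption; the text does not name a specific one. The function names and the toy data are hypothetical.

```python
import math


def stump_predict(x, threshold, polarity):
    """Weak learner: predicts `polarity` above the threshold, `-polarity` below."""
    return polarity if x > threshold else -polarity


def fit_adaboost(xs, ys, n_rounds=3):
    """Fit AdaBoost on 1-D inputs with labels in {-1, +1}."""
    n = len(xs)
    weights = [1.0 / n] * n
    sorted_x = sorted(set(xs))
    # Candidate thresholds: one below all points, then midpoints between neighbours.
    thresholds = [sorted_x[0] - 1.0] + [(a + b) / 2 for a, b in zip(sorted_x, sorted_x[1:])]
    ensemble = []
    for _ in range(n_rounds):
        # Pick the stump with the lowest weighted training error.
        best = None
        for t in thresholds:
            for polarity in (1, -1):
                err = sum(w for x, y, w in zip(xs, ys, weights)
                          if stump_predict(x, t, polarity) != y)
                if best is None or err < best[0]:
                    best = (err, t, polarity)
        err, t, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, polarity))
        # Increase the weight of misclassified points, decrease the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1


def training_errors(ensemble, xs, ys):
    return sum(1 for x, y in zip(xs, ys) if predict(ensemble, x) != y)


# Toy data that no single threshold stump can classify perfectly.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

single = fit_adaboost(xs, ys, n_rounds=1)
boosted = fit_adaboost(xs, ys, n_rounds=3)
print(training_errors(single, xs, ys), training_errors(boosted, xs, ys))
```

Each stump alone misclassifies part of this data set, while the weighted vote of three stumps separates it. This is the sense in which weak learners are turned into a stronger one.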

**BOOSTER BOOSTS Q-STATISTIC**

The Q static search algorithm generates random harmony memory solutions and then improves the harmony memory to obtain an optimal solution, that is, an optimal subset of informative features. Each musician (unique term) is a dimension of the search space. The solutions are evaluated by the fitness function, which is used to obtain the optimal harmony (the global optimal solution). The fitness function is the evaluation criterion the harmony search (HS) algorithm uses to evaluate solutions. At each iteration, the fitness function is calculated for each HS solution. Finally, the solution with the highest fitness value is the optimal solution. We used the mean absolute difference as the fitness function in the HS algorithm for the FS technique, using the weight scheme as the objective function for each position.

**EXPERIMENT DESCRIPTION**

For the tests we selected fifteen data sets, including Arrhythmia, Cylinder-band, Hypothyroid, Kr-vs-Kp, Letter, Mushroom, Nursery, OptiDigits, Pageblock, Segment, Sick, Spambase, and Waveform5000. Each of these data sets has its own properties, such as the domain of the data set, the kind of attributes it contains, and the tree size after training. We tested each data set with four different classification tree algorithms: J48, REPTree, RandomTree, and Logistic Model Trees. For each algorithm, both the percentage split and cross-validation test options were used. With percentage split, the data set is divided into a training part and a test part: 66% of the instances in the data set are used for the training set and the remaining part for the test set. Cross-validation is used especially when the amount of data is limited. Instead of reserving a single part for testing, cross-validation partitions the data into folds and uses each fold in turn as the test set.
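The two test options can be sketched as index-splitting helpers. This is a minimal illustration, not the tooling used in the experiments: the 66% training fraction comes from the text, while the shuffling, the seed, the default of 10 folds, and the function names are assumptions.

```python
import random


def percentage_split(n_instances, train_fraction=0.66, seed=0):
    """Shuffle the instance indices and cut them into train and test parts."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_instances * train_fraction))
    return indices[:cut], indices[cut:]


def k_fold_indices(n_instances, k=10, seed=0):
    """Yield (train, test) index lists; every instance is tested exactly once.

    Assumption: k=10 is a common default; the text does not state k.
    """
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


train_idx, test_idx = percentage_split(100)
print(len(train_idx), len(test_idx))

for fold_number, (train, test) in enumerate(k_fold_indices(25, k=5)):
    print(fold_number, len(train), len(test))
```

With percentage split, only the held-out part is ever used for testing. With cross-validation, every instance contributes to both training and testing across folds, which is why it suits small data sets.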

**SIMULATION RESULTS**

This section shows the difference between accurate and inaccurate boosting. Early stopping cannot save a boosting algorithm: it is possible that the global optimum analyzed in the preceding section can be reached after the first iteration. Since depends only on the inner product between and the normalized example vectors, it follows that rotating the set S around the origin by any fixed angle induces a corresponding rotation of the function and in particular of its minima. Note that we have used here the fact that every example point in S lies within the unit disc; this ensures that for any rotation of S each weak hypothesis xi will always give outputs in as required. Consequently a suitable rotation of to will result in the corresponding rotated function having a global minimum at a vector which lies on one of the two coordinates.

## Conclusion and further work

This paper proposed a measure, the Q-statistic, that evaluates the performance of an FS algorithm. The Q-statistic accounts for both the stability of the selected feature subset and the prediction accuracy. The paper also proposed Booster to boost the performance of an existing FS algorithm. Experiments with synthetic data and 14 microarray data sets have shown that the suggested Booster improves the prediction accuracy and the Q-statistic of three well-known FS algorithms: FAST, FCBF, and mRMR. We have also noted that the classification methods applied to Booster do not have much impact on prediction accuracy and Q-statistic.

In particular, the performance of mRMR-Booster was shown to be outstanding in the improvements of both prediction accuracy and Q-statistic. It was observed that if an FS algorithm is efficient but cannot obtain high accuracy or a high Q-statistic for some specific data, the Booster of the FS algorithm will boost the performance. However, if the FS algorithm itself is not efficient, Booster may not be able to obtain high performance. The performance of Booster depends on the performance of the FS algorithm applied. If Booster does not provide high performance, there are two possibilities: the data set is intrinsically difficult to predict, or the FS algorithm applied is not efficient for the specific data set. Hence, Booster can also be used as a criterion to evaluate the performance of an FS algorithm or to evaluate the difficulty of a data set for classification.