kNNVWC An Efficient k-Nearest Neighbors Approach Based on Various-Widths Clustering


Abstract

The k-nearest neighbor (k-NN) approach has been extensively used as a powerful non-parametric technique in many scientific and engineering applications. However, it incurs a large computational cost, and reducing this cost has become an active research field. In this work, a novel k-NN approach based on various-widths clustering, named kNNVWC, is presented for efficiently finding the k-NNs of a query object in a given data set.

kNNVWC clusters the data with various widths: the data set is first clustered with a global width, and each produced cluster that meets predefined criteria is recursively clustered with its own local width, suited to its distribution. This reduces the clustering time and balances the number of produced clusters and their respective sizes. Maximum efficiency is achieved by using the triangle inequality to prune unlikely clusters. Experimental results demonstrate that kNNVWC performs well in finding k-NNs for query objects compared to a number of k-NN search algorithms, especially for data sets with high dimensionality, varied distributions and large size.

INTRODUCTION

The k-Nearest Neighbors algorithm (k-NN for short) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms.
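As a concrete illustration of the algorithm described above, the following is a minimal brute-force k-NN classifier in Java (the implementation language listed later in this report), including the 1/d distance weighting discussed in the next paragraph. Class and method names are illustrative, not part of the original work.

    import java.util.*;

    /** Minimal brute-force k-NN classifier with 1/d distance-weighted voting (illustrative sketch). */
    public class WeightedKnn {

        static double euclidean(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                double diff = a[i] - b[i];
                sum += diff * diff;
            }
            return Math.sqrt(sum);
        }

        /** Predicts the label of the query from its k nearest training objects. */
        static String classify(double[][] train, String[] labels, double[] query, int k) {
            // Sort the indices of the training objects by their distance to the query.
            Integer[] idx = new Integer[train.length];
            for (int i = 0; i < idx.length; i++) idx[i] = i;
            Arrays.sort(idx, Comparator.comparingDouble(i -> euclidean(train[i], query)));

            // Each of the k nearest neighbors votes with weight 1/d, so closer neighbors count more.
            Map<String, Double> votes = new HashMap<>();
            for (int i = 0; i < k && i < idx.length; i++) {
                double d = euclidean(train[idx[i]], query);
                votes.merge(labels[idx[i]], 1.0 / (d + 1e-12), Double::sum);  // small epsilon avoids division by zero
            }
            return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
        }
    }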

Both for classification and regression, it can be useful to weight the contributions of the neighbors, so that nearer neighbors contribute more to the average than more distant ones. For example, a common weighting scheme gives each neighbor a weight of 1/d, where d is its distance to the query. Fixed-width clustering creates a set of clusters of fixed radius (width) w, where the width w is a parameter specified by the user. First, a data vector is taken and used as the centroid (center) of the first cluster with radius w. Then, for each subsequent data vector, the Euclidean distance between the centroids of the current clusters and the data vector is computed. If the distance from the data vector to the closest cluster center is less than the radius w, the data vector is added to that cluster and the centroid of that cluster is adjusted to the mean of the data vectors it contains.

If the distance to the closest cluster center is more than the radius w, then a new cluster is formed with that data vector as its centroid. This procedure produces a set of disjoint, fixed-width (radius w) clusters in the feature space. There is a huge number of clustering algorithms and also numerous possibilities for evaluating a clustering against a gold standard; the choice of a suitable clustering algorithm and of a suitable evaluation measure depends on the clustering objects and the clustering task.

SYSTEM ARCHITECTURE / SYSTEM OVERVIEW

The input to the system is a high-dimensional data set, which is first normalized; for example, if one column holds height and another weight, the data must be normalized to a common scale so that the columns are comparable. The next step is to produce the clusters using the various-width algorithm, which performs three operations: data partitioning, fixed-width clustering and merging. The partitioning process partitions the data set into a number of clusters using a large width, to address the difficulty of clustering in a high-dimensional data space. In the fixed-width clustering step, the radius of each cluster and its nearest distance from the query object are computed.
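The normalization step mentioned above can be sketched, under the assumption that simple min-max scaling is used, as follows; class and method names are illustrative. Each column (for example height in centimetres and weight in kilograms) is rescaled into the common range [0, 1] so that no single unit dominates the distance computations.

    import java.util.Arrays;

    /** Min-max normalization sketch: rescales every column of the data set into [0, 1]. */
    class MinMaxNormalizer {
        static double[][] normalize(double[][] data) {
            int rows = data.length, cols = data[0].length;
            double[] min = new double[cols], max = new double[cols];
            Arrays.fill(min, Double.POSITIVE_INFINITY);
            Arrays.fill(max, Double.NEGATIVE_INFINITY);

            // Find the minimum and maximum of each column (e.g., height, weight).
            for (double[] row : data) {
                for (int j = 0; j < cols; j++) {
                    min[j] = Math.min(min[j], row[j]);
                    max[j] = Math.max(max[j], row[j]);
                }
            }

            // Rescale every value so that all columns share the same unit-free range.
            double[][] norm = new double[rows][cols];
            for (int i = 0; i < rows; i++) {
                for (int j = 0; j < cols; j++) {
                    double range = max[j] - min[j];
                    norm[i][j] = range == 0 ? 0.0 : (data[i][j] - min[j]) / range;
                }
            }
            return norm;
        }
    }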

In the merging process, after partitioning, the distance between the centroids of two clusters is checked against their respective radii: if the distance between the centroids plus the radius of one cluster is less than the radius of the other, the first cluster is entirely contained in the second, and the two are merged. The next step is the parallel k-NN search for the query object, in which the query is compared with the objects of the candidate clusters. These comparisons are executed in parallel, which makes them fast. The parallel k-NN process runs on a GPU: memory is allocated on the CUDA device, the data are transferred to it, and the numbers of threads and blocks to launch are decided. CUDA is a parallel computing platform and programming model that increases computing performance by using the graphics processing unit (GPU). Using the GPU reduces the search time because the threads are launched in parallel and solve the problem simultaneously, which reduces the comparison workload compared to matching the query against every cluster object serially.

LITERATURE SURVEY

The k-nearest neighbor approach requires scanning the whole data set and finding the k-NNs by calculating the distances from a query object to all n objects in the set, which results in a high computational cost. Consequently, research has focused on pre-processing the data set so that the k-NNs can be found by accessing only part of it. These techniques are classified into two categories: tree-based and flat indexes. The well-known tree-based indexes, the kd-tree, ball tree, M-tree, cover tree and vp-tree, are explained below:

Friedman et al. introduced the kd-tree, in which the data set is recursively split into two parts to form a balanced binary tree, in order to deal with high-dimensional data.

Fukunaga and Narendra proposed a metric-tree algorithm called the ball tree. The ball tree computes the centroid of the data set and uses this centroid to partition the data set into two subsets. The number of objects assigned to either node cannot be constrained, which may result in a highly unbalanced tree structure. The leaf nodes of an M-tree store all data points, whereas the internal nodes store so-called routing objects. All objects that are within a given distance of a routing object are stored in the sub-tree of that routing object, called its covering tree.

Beygelzimer et al. proposed the cover tree, which is a hierarchy of levels where the top level contains the root point and the bottom level contains every point in the metric space. The cover tree finds a nearest neighbor in logarithmic time. In the vp-tree, because partitioning is based on distances from the vantage points, the objects in the leaf nodes overlap with those of other nodes, especially in high-dimensional data. Tree-based indexes lose their effectiveness when the overlap between nodes is high, which results in an unbalanced tree structure.

Several flat indexes are also used, e.g., k-means and fixed-width clustering. k-means assigns sparse objects to the closest cluster, so that clusters overlap heavily with one another. Fixed-width clustering can result in poor clustering because some clusters contain a large number of objects, which requires a large amount of computational time during the search.

Almalawi et al. proposed a flat index that uses different widths for different clusters, where the width of each cluster is learned during construction; this approach is known as various-width clustering. In this technique, the data set is partitioned into a number of clusters whose sizes are constrained by a user-defined threshold. Three operations are performed to generate the clusters: cluster-width learning, partitioning and merging; they are executed sequentially in a loop until the stopping criteria are met. After these operations, a k-NN search is performed to obtain the result. Computing the clusters for each data set and finding the nearest distance from each query object to the cluster centroids results in a high computational and pre-processing cost. In the existing approach, the k-NN search, which computes the distance from the query object to the centroid of each cluster in order to obtain the nearest distance, requires a large amount of time. Motivated by this, we propose a parallel k-NN search for the query object to improve performance; the parallel algorithm is used to reduce the pre-processing cost.
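The cluster pruning mentioned in the abstract can be made concrete. By the triangle inequality, any object x in a cluster with centroid c and radius r satisfies d(q, x) >= d(q, c) - r, so the whole cluster can be skipped whenever this lower bound already exceeds the distance to the current k-th nearest neighbor. A minimal sketch of this test (an illustration of the idea, not the authors' code):

    /**
     * Triangle-inequality pruning test: no object of the cluster can be closer to the query
     * than (distance to the centroid - cluster radius), so the cluster is skipped when this
     * lower bound is already worse than the current k-th nearest distance.
     */
    static boolean canPruneCluster(double distQueryToCentroid, double clusterRadius, double kthNearestDist) {
        double lowerBound = distQueryToCentroid - clusterRadius;
        return lowerBound > kthNearestDist;
    }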

System Configuration:

H/W System Configuration:-

Processor          : Pentium IV

Speed               : 1 GHz

RAM                  : 512 MB (min)

Hard Disk          : 20 GB

Keyboard           : Standard Keyboard

Mouse               : Two or Three Button Mouse

Monitor             : LCD/LED Monitor

S/W System Configuration:-

Operating System               : Windows XP/7

Programming Language       : Java/J2EE

Software Version                 : JDK 1.7 or above

Database                            : MySQL

SYSTEM ANALYSIS

Parallel k-nearest neighbor searching using various-width clustering consists of three algorithms that are performed to find the exact k-nearest neighbors of a query object in a high-dimensional data set. The three algorithms are fixed-width clustering, various-width clustering and the parallel k-NN search for the query object.

1) Fixed-Width Clustering: The fixed-width clustering (FWC) algorithm partitions a data set into a number of clusters with a fixed radius (width) ω. Let U = {j1, j2, …, jn} be the set of objects and C = {C1, C2, …, Cm} be the clusters produced from U by FWC. Initially, the set of clusters, their centroids and the width are empty. The first object j1 is used to create the first cluster C1, with j1 set as the cluster's centroid. Each subsequent object ji is taken from U and assigned to the closest cluster Ci if at least one cluster exists and the distance between ji and the centroid of Ci is at most ω, where ω is the fixed cluster width; otherwise, a new cluster Ci+1 is created with ji set as its centroid.
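A minimal sketch of this FWC procedure, assuming Euclidean distance and using illustrative class names (not the authors' implementation):

    import java.util.*;

    /** Fixed-width clustering sketch: a single pass over the objects with a fixed radius omega. */
    class FixedWidthClustering {

        static class Cluster {
            double[] centroid;
            List<double[]> members = new ArrayList<>();
            Cluster(double[] first) { centroid = first.clone(); members.add(first); }
        }

        static List<Cluster> cluster(double[][] objects, double omega) {
            List<Cluster> clusters = new ArrayList<>();
            for (double[] obj : objects) {
                // Find the closest existing cluster centroid.
                Cluster closest = null;
                double best = Double.POSITIVE_INFINITY;
                for (Cluster c : clusters) {
                    double d = euclidean(obj, c.centroid);
                    if (d < best) { best = d; closest = c; }
                }
                if (closest != null && best <= omega) {
                    closest.members.add(obj);        // assign to the closest cluster ...
                    updateCentroid(closest);         // ... and move its centroid to the members' mean
                } else {
                    clusters.add(new Cluster(obj));  // otherwise start a new cluster with obj as centroid
                }
            }
            return clusters;
        }

        static void updateCentroid(Cluster c) {
            double[] mean = new double[c.centroid.length];
            for (double[] m : c.members)
                for (int j = 0; j < mean.length; j++) mean[j] += m[j];
            for (int j = 0; j < mean.length; j++) mean[j] /= c.members.size();
            c.centroid = mean;
        }

        static double euclidean(double[] a, double[] b) {
            double s = 0.0;
            for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
            return Math.sqrt(s);
        }
    }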

2) Various-Width Clustering: Various-width clustering is used to partition the data set into a number of clusters whose sizes vary and are constrained by a user-defined threshold. The various-width algorithm consists of three processes: global-width calculation, partitioning and merging.

  • Global-width calculation: The whole data set is initially considered as one cluster, and the width that will be used to partition this large cluster into a number of clusters is computed. Equation 1 of the original paper is used to calculate the global width.
  • Partitioning process: This process partitions a data set into a number of clusters using a large width, to address the issue of clustering sparsely distributed objects in an n-dimensional space.
  • Merging process: The partitioning of a large cluster can lead to the creation of a cluster that is totally contained in another cluster. Merging such clusters decreases the number of clusters produced and thereby speeds up the k-NN search, because fewer distance computations between a query object and the cluster centroids are needed when fewer clusters exist. A sketch of this containment check follows the list.
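A small sketch of the containment check that drives the merging process, assuming Euclidean distance between centroids (names are illustrative): cluster A lies entirely inside cluster B when the distance between their centroids plus the radius of A does not exceed the radius of B, in which case A can be merged into B.

    /**
     * Containment test for the merging process: cluster A is fully contained in cluster B
     * when d(centroidA, centroidB) + radiusA <= radiusB, so A can be merged into B.
     */
    static boolean isContained(double[] centroidA, double radiusA,
                               double[] centroidB, double radiusB) {
        double d = 0.0;
        for (int i = 0; i < centroidA.length; i++) {
            double diff = centroidA[i] - centroidB[i];
            d += diff * diff;
        }
        return Math.sqrt(d) + radiusA <= radiusB;
    }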

3) Parallel k-NN Search: The parallel k-NN search for the query object across the clusters is performed on a GPU using the CUDA parallel platform. The objects nearest to the query object are found in parallel by multiple threads.
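The report performs this step on the GPU with CUDA, launching roughly one thread per distance computation. As an illustrative analogue in Java (the implementation language listed in the system configuration), the same idea can be sketched on the CPU with parallel streams; class and method names are assumptions, not the authors' code.

    import java.util.*;
    import java.util.stream.IntStream;

    /** CPU-parallel analogue of the GPU k-NN step: one unit of work per candidate distance. */
    class ParallelKnnSearch {

        /** Returns the indices of the k candidate objects closest to the query. */
        static int[] kNearest(double[][] candidates, double[] query, int k) {
            // Compute all query-to-candidate distances in parallel (mirrors one CUDA thread per distance).
            double[] dist = new double[candidates.length];
            IntStream.range(0, candidates.length)
                     .parallel()
                     .forEach(i -> dist[i] = euclidean(candidates[i], query));

            // Select the k smallest distances (a simple sort here; a GPU version would use a parallel selection).
            return IntStream.range(0, candidates.length)
                            .boxed()
                            .sorted(Comparator.comparingDouble(i -> dist[i]))
                            .limit(k)
                            .mapToInt(Integer::intValue)
                            .toArray();
        }

        static double euclidean(double[] a, double[] b) {
            double s = 0.0;
            for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
            return Math.sqrt(s);
        }
    }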

All data sets are freely available online. Most of them were obtained from the UCI Machine Learning Repository, and they are widely used for the evaluation of data mining techniques. We use them only to evaluate the performance of the proposed approach to searching k-nearest neighbours across various domains and dimensionalities. The characteristics of these data sets are briefly described below:

  • covertype: This data set contains 581,012 instances described by 12 measures expanded into 54 data columns (10 quantitative variables, 4 binary wilderness areas and 40 binary soil type variables).
  • spambase: This data set contains two classes: spam and non-spam e-mails. It consists of 4,601 objects (e-mails), each described by 57 continuous features denoting word frequencies.
  • waveform: It consists of 5,000 objects having 40 continuous features, some of which are noise.
  • shuttle: This data set consists of 58,000 objects, each represented by nine numerical features.
  • slices: This data set consists of 53,500 computed tomography (CT) images. Each CT image is described by two histograms in polar space, from which 386 features are extracted.
  • german: This data set contains 1,000 instances and 20 attributes. It classifies people, described by a set of attributes, as good or bad credit risks.
  • SimSCADA: This data set consists of 12,000 objects each being described by 113 features. Each feature represents one sensor or actuator reading in the water network systems, which is simulated using the proposed SCADA testbed SCADAVT.
  • arrhythmia: This data set consists of 452 objects, each represented by 279 parameters (features) of ECG measurements. It is used to classify a patient into one of the 16 classes of cardiac arrhythmia.
  • gasSensors: This data set contains 13,910 measurements coming from 16 chemical sensors; eight features are extracted from each sensor, resulting in a 128-dimensional feature vector.
  • kddcup99: This data set comes from the 1998 DARPA Intrusion Detection Evaluation data.

FURTHER WORK

The efficient k-nearest neighbors approach based on various-widths clustering can be extended to handle difficult data sets for classification; the features can then be compared easily, and various-widths clustering is expected to provide effective results. As further work identified in this literature survey, a k-NN approach based on Partitioning Around Medoids clustering (KNNPAM) could be used to efficiently find the k-NNs of a query object from a given data set, minimizing the overlap among clusters, and the centers of the clusters could be grouped into a tree-like index to prune more clusters effectively.

CONCLUSION

This paper presented a parallel k-nearest neighbor search approach based on the various-width clustering algorithm. The proposed algorithm reduces both the preprocessing and the search time because it relies on parallel computing: various-width clustering produces the clusters, and the parallel k-NN search on the GPU provides an effective technique for finding the k-nearest neighbors. As a result, the running time of the k-NN algorithm is reduced significantly compared to the CPU, since the GPU offers far more parallelism than a serial search program.