Nearest Keyword Set Search in Multi-Dimensional Datasets

Nearest keyword Set Search in Multi-Dimensional Datasets

Nearest Keyword Set Search in Multi-Dimensional Datasets


Nearest Keyword Set Search in Multi-Dimensional Datasets management report in data mining.A spatial database manages multidimensional objects provides fast access to those objects based on different selection criteria. The importance of spatial databases is reflected by the convenience of modeling entities of reality in a geometric manner. For example, locations of restaurants, hotels, hospitals and so on are often represented as points in a map, while larger extents such as parks, lakes, and landscapes often as a combination of rectangles. Many functionalities of a spatial database are useful in various ways in specific contexts. For instance, in a geography information system, range search can be deployed to find all restaurants in a certain area, while nearest neighbor retrieval can discover the restaurant closest to a given address.

Conventional spatial queries, such as range search and nearest neighbor retrieval, involve only conditions on objects‟ geometric properties. Today, many modern applications call for novel forms of queries that aim to find objects satisfying both a spatial predicate, and a predicate on their associated texts. For example, instead of considering all the restaurants, a nearest neighbor query would instead ask for the restaurant that is the closest among those whose menus contain “steak, spaghetti, brandy” all at the same time. Currently, the best solution to such queries is based on the IR2-tree, which, as shown in this paper, has a few deficiencies that seriously impact its efficiency. Motivated by this, this work develop a new access method called the spatial inverted index that extends the conventional inverted index to cope with multidimensional data and comes with algorithms that can answer nearest neighbor queries with keywords in real time.


Objects (e.g., images, chemical compounds, documents, or experts in collaborative networks) are often characterized by a collection of relevant features, and are commonly represented as points in a multi- dimensional feature space. For example, images are represented using color feature vectors, and usually have descriptive text information (e.g., tags or keywords) associated with them. In this paper, here consider multi-dimensional datasets where each data point has a set of keywords. The presence of keywords in feature space allows for the development of new tools to query and explore these multidimensional datasets. Nearest Keyword set study on content rich different types of data sets. The NKS analysis is an arrangement of catchphrases in vision of theme. Also, the arrangement of the question consolidates „‟K‟‟ type of catchphrases as a group and concentrates each and every set which possess data based bunches along with structures in which bunches of multi-dimensional section is created. Each point is labeled with an arrangement of clusters.

An increasing number of uses need the productive execution of nearest neighbor (NN) queries thankful by the properties of the spatial objects. Because of the importance of keyword hunt, especially on the Internet, a considerable lot of these applications permit the client to give a rundown of keywords that the spatial objects (from this time forward alluded to just as objects) ought to contain, in their portrayal or other quality. For instance, online business index permit clients to point to an address and an arrangement of keywords, and return organizations whose portrayal contains these keywords, requested by their separation to the predetermined address area. As another case, land sites permit clients to look for properties with particular keywords in their depiction and rank them as per their separation from a predefined area. We call such queries spatial keyword queries.

A spatial keyword query comprises of a query zone and an arrangement of keywords. The response is a derelict of objects ranked by blend of their division to the query range and the substance of their content depiction to the query keywords. A basic yet well-known variation, which is utilized as a part of our running case, is the separation first spatial keyword query, where objects are ranked by separation and keywords are connected as a conjunctive channel to dispose of objects that don’t contain them. Which is our running illustration, shows a dataset of imaginary inns with their spatial directions and an arrangement of distinct traits (name, courtesies)? A case of a spatial keyword query is “determine the nearest accommodation to point that enclose keywords web and pool”. The top consequence of this query is the inn protest. NKS queries are useful for many applications, such as photo-sharing in social networks, graph pattern search, geolocation search in GIS systems, and so on.

Multi-Dimensional Data Sets

One this page you will find some “real world” multi-dimensional data sets. For right now there are two Tiger data sets, extracted from the US Bureau of Census TIGER database by some unknown person (if you know the person please send me email so I can reference appropriately), and a few CFD data sets. This work was partially supported by NSF grant number 9610270. Only the small data sets are given in ascii format, the rest in binary. Included is a simple (and not very elegant) c program to convert from the binary format to an ascii format. There is just enough documentation at the top to show how to use it. CFD Data Sets If you use any of these CFD data sets, please reference this web page and the creator of the data: 2D Point Data These are vertex data sets from various Computation Fluid Dynamics models. The data sets are for a 2-dimensional problem. A system of equations is used to model the air flows over and around aero-space vehicles.

The data sets are for a cross section of a Boeing 737 wing with flaps out in landing configuration at MACH 0.2. The data space consists of a collection of points (nodes) of varying density. Nodes are dense in areas of great change in the solution of the equations and sparse in areas of little change. The location of the points in the data set is HIGHLY skewed. The format of the data sets is (xminyminxmaxymax) which delimit the lower left hand corner and upper right hand corner of the rectangle. Size these data sets are point data, xmin = xmax and ymin = ymax. The points range from (-20,- 20) to (20,20), but I have normalized the data sets to the unit square to facilitate experimental studies that include other data sets. For the 5088 Node Set the original ascii format (not normalized), ascii normalized version, and binary normalized version are all included. For the larger data sets only the normalized binary version is included. If for some reason you need the non-normalized versions contact me. The binary is normal IEEE binary, four floats per point. The three data sets differ mostly in the number of points.

To get a good idea what the data looks like download and preview the postscript picture below for the 5088 Vertex Data Set. The complete normalized shows the whole data set, the blow up is just the region around the centroid. 2D Triangle Data These data sets are a Delaunay triangulation (with a few flips) of the point data sets above. See the postscript picture included below. These data sets are originally in a space efficient format. Consider first the 9759 Triangle Set below. The original ascii file shows the format. Each line contains the triangle number followed by the point numbers of the three corners of the triangle. To get the coordinates of these triangle vertices, just look up the point in the 5088 original ascii point data set (above). Each triangle of the data set is then bounded by a rectangle as shown in the Nonnormalized rectangle-bounded set below. Next, the set of rectangles is normalized to the unit square. Finally the normalized rectangle set is converted to binary. For the larger data sets only the normalized binary data sets are included. Tiger Data Sets These are line segment data contains the road maps of Long Beach and Montgomery Counties. If you use these, please reference as: “Extracted from the US Bureau of Census TIGER database”.

System Configuration:

H/W System Configuration:-

Processor          : Pentium IV

Speed               : 1 Ghz

RAM                  : 512 MB (min)

Hard Disk          : 20GB

Keyboard           : Standard Keyboard

Mouse               : Two or Three Button Mouse

Monitor             : LCD/LED Monitor

S/W System Configuration:-

Operating System               : Windows XP/7

Programming Language       : Java/J2EE

Software Version                 : JDK 1.7 or above

Database                            : MYSQL


Using the combination of R-tree and inverted index, the location specific keyword queries were answered on web and GIS system. Ranking to the objects were done in existing system i.e. IR2 tree to rank objects based on the combination of their distances to the query location and the relevance of text description to the query keywords. Flickr uses keyword density and exact matching for keyword parsing technique. We can search one or two images related to the keyword not more than that. Cao et al and long et al proposed algorithm to retrieve a group of spatial web objects such that the group keywords cover the query keywords and the objects have lowest inter-object distances. Other related queries include aggregate nearest keyword search in spatial databases,top-k preferential query, top-k sites in spatial data base, and optimal location queries.


These techniques do not provide concrete guidelines on how to enable efficient processing for the type of queries where query coordinates are missing. In multi-dimensional spaces, it is difficult for users to provide meaningful coordinates, and our work deals with another type of queries where users can only provide keywords as input. Without query coordinates, it is difficult to adapt existing techniques to our problem.Note that a simple reduction that treats the coordinates of each data point as possible query coordinates suffers poor scalability.


In our proposed system the real data set is collected from photo sharing websites. In which we collect images from descriptive tags from Flickr and the images are transformed into grayscale and associate each data point, with a set of keyword that are derived from tags. We can collect number of datasets, suppose we collect five datasets (R1, R2, R3, R4, R5) with up to million data points, we can create multiple dataset to investigate performance. The query co-ordinates play a fundamental role in every step of algorithm to prune search space. Our work deals with providing keyword as an input. . We propose a novel multiscale index for exact and approximate NKS query processing. We develop efficient search algorithms that work with the multi-scale indexes for fast query processing. Distance browsing is easy with R-trees. In fact, the best-first algorithm is exactly designed to output data points in ascending order of their distances. In order to run the application efficiently the user must have following characteristics.


Better time and space efficiency. A novel multiscale index for exact and approximate NKS query processing. It’s an efficient search algorithms that work with the multi- scale indexes for fast query processing. We conduct extensive experimental studies to demonstrate the performance of the proposed techniques.

USER Module:

User provides the input keyword as an image.

SYSTEM Module:

  1. The system module retrieves all images from the database, and then it analyzes keywords.
  2. The positive point relation is undertaken by the system.
  3. It analyzes image keyword relation between points.
  4. It filters the image based on the relations.
  5. Applying nearest neighbor method retrieved images.
  6. Displays nearest image as an output.


  • Distance browsing is easy with R-trees. In fact, the best first algorithm is exactly designed to output data points in ascending order of their distances.
  • It is straight forward to extend our compression scheme to any dimensional space.


  • Fail to provide real time answers on difficult inputs.
  • The real nearest neighbor lies quite far away from the query point, while all the closer neighbors are missing at least one of the query keywords.


NKS queries are useful for many applications, such as

  • Photo-sharing in social networks,
  • Graph pattern search,
  • Geo-location search in GIS systems and so on.


In Nearest Keyword Set Search in Multi-Dimensional Datasets management report in data mining paper, we proposed solutions for the problem of topk nearest keyword set search in multi-dimensional datasets. We developed an exact (ProMiSH-E) and an approximate (ProMiSH-A) method. We designed a novel index based on random projections and hashing. Index is used to find subset of points containing the true results. We also proposed an efficient solution to query results from a subset of data points. Our empirical results show that ProMiSH is faster than state-of-theart tree-based technique, having performance improvements of multiple orders of magnitude.

These performance gains are further emphasized as dataset size and dimension increase, as well as for large query sizes. ProMiSH-A has the fastest query time. We empirically observed a linear scalability of ProMiSH with the dataset size, the dataset dimension, the query size, and the result size. We also observed that ProMiSH yield practical query times on large datasets of high dimensions for queries of large sizes. In the future, we plan to explore other scoring schemes for ranking the result sets. In one scheme, we may assign weights to the keywords of a point by using techniques like tf-idf. Then, each group of points can be scored based both on the distance between the points and weights of the keywords. Further, the criteria of a result containing all the keywords can be relaxed to generate results having only a subset of the query keywords.