Data Mining

Top-Down XML Keyword Query Processing

January 19, 2019

1497

Top-Down XML Keyword Query Processing

Abstract

Top-Down XML Keyword Query Processing management report in data mining.Efficiently answering XML keyword queries has attracted much research effort in the last decade. The key factors resulting in the inefficiency of existing methods are the common-ancestor-repetition (CAR) and visiting-useless-nodes (VUN) problems. To address the CAR problem, we propose a generic top-down processing strategy to answer a given keyword query w.r.t. LCA/SLCA/ELCA semantics. By “top-down”, we mean that we visit all common ancestor (CA) nodes in a depth-first, left-to-right order; by “generic”, we mean that our method is independent of the query semantics.

To address the VUN problem, we propose to use child nodes, rather than descendant nodes to test the satisfiability of a node v w.r.t. the given semantics. We propose two algorithms that are based on either traditional inverted lists or our newly proposed LLists to improve the overall performance. We further propose several algorithms that are based on hash search to simplify the operation of finding CA nodes from all involved LLists. The experimental results verify the benefits of our methods according to various evaluation metrics.

INTRODUCTION

XML queries are used to extract data and manipulate data from XML documents. By specifying the predicate the related attributes or elements are selected using XML queries. An attribute, a value or an element tag can be represented by node in tree based structure. Edges are used to represent hierarchical relationship like parent-child and ancestor descendant among XML. Conditions are represented by edges and nodes. Conditions must be satisfied in XPath expression as a response of a query. XML query processing is dependent on traditional topdown tree traversals on the XML document which is highly inefficient as a large collection of documents is produced. To solve searching problem and to store XML data various indexing techniques are used. For reduction of memory overload of the processing queries indexing approach is introduced which is use of disk based indexing approach. In this approach a specific data structure is used for storing the index values.

In the XML query processing the common issues which lead to redundancy and inconsistency are CAR and VUN problem.

CAR problem: In graph theory the lowest common ancestor of two nodes v and w in a tree T is the lowest i.e. deepest node that has both v and w as descendants, where each node to be descendant to itself. While multiple operations results in all common ancestors on the path from root to visiting nodes to be repeatedly visited, which is called as common ancestor repetition (CAR).

VUN problem: Given a keyword query Q and XML document D, let v be the set of nodes D that contains one keyword query in their sub trees then we can classify them into following categories:

Common ancestors (CAs)
Useless nodes (UNs)
Auxiliary nodes (AUs).

By considering these problems we demonstrated query semantics with the generic processing strategy for solving above problems more effectively and efficiently as: In order to solve CAR problem a generic top down approach for XML keyword query processing is demonstrated. To address VUN problem, child nodes are used instead of descendants to test lowest common ancestors (LCA), smallest lowest common ancestors (SLCA), exclusive lowest common ancestor (ELCA) nodes as well as labeling scheme independent inverted list (LList) algorithm is demonstrated. Performance is improved by means of using hash index. We introduced distributed query processing, in which the index structure provides basic means to reconstruct and locate distributed XML fragments.

XML KEYWORD AND DATA

In order to manage XML documents Lukas Kircher et al. proposed a technique known as structural bulk updates that is used with X Query update facility to support Pre/Dist/Size encoding. Updating XML is challenging work as structural order of documents had to be observed. They introduced a method known as avoiding redundant distance adjustment which avoids repeated and redundant distance. They demonstrated how the cost of maintaining the document order as well as this technique reduces processing time also large bulk updates are feasible. The basic technique for XML keyword search is LCA. Manoj Agarwal et al. presented a novel system Generic Keyword Search (GKS) which is able to find highly relevant XML schema elements and keywords, deeper analytics insights known as DI in the XML data.

They proposed XML node categorization model to expose XML elements. They introduced ranking methodology for data keywords discovery. They also presented GKS system used for real datasets. This technique over the real data sets shows highly relevant responses to keyword queries efficiently. Users navigated XML data seamlessly and which was highly relevant data. The problem of effective keyword search over XML documents had been studied by Guo liang Li et al. They introduced Valuable Lowest Common Ancestor (VLCA) to answer keyword queries over XML documents efficiently and effectively. They also proposed Compact VLCA (CVLCA) for optimizing strategy for speeding up the computation and for answering keyword queries accurately. The experimental result showed that the proposed methods achieve high quality results on both real and synthetic datasets.

They designed and analyzed efficiency of algorithm in two cases;

when relevant indices had been constructed and XML data had been preprocessed
when XML data had not been preprocessed. XML keyword queries can be efficiently determined as a part of query evaluation.

Information retrieval can be done. To address the SLCA computation problem in XML data Ba Quan Truong et al. proposed a property called optionality resilience which specified behaviors of an XKS for queries with missing elements. The experimental results showed quality of search, execution time, scalability, number of missing elements, number of keywords and heuristics for algorithm selection. It also showed that MESSIAH not only produced high quality result but also faster computation speed.

INDEXING ALGORITHMS

The indexing technique’s performance is studied based on characteristics and requirements. For effective, efficient and accurate retrieval of documents, existing techniques for indexing became inefficient with the tremendous and rapid growth of index size as well as seek time regarding optimized index scheme. Data is rapidly increasing in terms of structured and nonstructured. To handle such large amount of data efficiently indexing techniques are designed. Indexing is developed in order to tolerate high cost and precise search. Basically indexing techniques are categorized based in three methods:

Non Artificial Intelligent (NAI),
Artificial Intelligent (AI) And
Collaborative Artificial Intelligent (CAI).

Non artificial or traditional indexing techniques are tree based indexing, bitmap indexing, graph query processing, hashing, B tree, R tree which uses classifiers for indexing. Artificial indexing techniques are more accurate method in constructing hybrid indexing mechanism. The techniques used under AI are fuzzy decision tree (FDT) and machine learning which produces efficient result. Standard disk based structure and algorithms which includes B+ trees, heap files, disk based prefix trees, inverted indexes, binary large object (BLOB) files, m way posting list intersection, LRU buffer manager and external sorting used to build storage engine. Wook-Shin Han et al. implemented graph indexing techniques and demonstrated various datasets and workload to show various unique features such as performance analysis. They showed that tool supported for importing dataset, selecting index algorithm, building index structure, specifying query workload, executing query and navigating through the results. Indexing techniques are used to speed up the data retrieval.

System Configuration:

H/W System Configuration:-

Processor : Pentium IV

Speed : 1 Ghz

RAM : 512 MB (min)

Hard Disk : 20GB

Keyboard : Standard Keyboard

Mouse : Two or Three Button Mouse

Monitor : LCD/LED Monitor

S/W System Configuration:-

Operating System : Windows XP/7

Programming Language : Java/J2EE

Software Version : JDK 1.7 or above

Database : MYSQL

QUERY PROCESSING

According to the various functions of the query, classifications are made such as moving object, range based query, location based query, trajectory queries, future data detection query. According to user requirements the query processing varies. Query processing is nothing but a collection of data which are arranged for speed and ease of search for retrieval of data. Indexing and query processing are interrelated. Cheng YR et al. proposed filtering and verification framework to improve search efficiency. They defined a r-clique keyword query result for knowledge base environment and derived tightest upper bound. They designed an index which facilitates pruning algorithm and sampling algorithm. The demonstrated result showed that proposed definition satisfied the user requirements compared with r-clique definition and algorithm were efficient.

Querying XML document effectively and efficiently is a challenging issue. Mikael Fernandus Simalango studied query processing issues and proposed solutions for querying XML databases. They reviewed evolving path for XML query languages also provided different approaches for XML query processing. Several challenges for the realization of scalability of XML database management system still exist. For query processing, List intersection is a central operation which is utilized excessively on text and databases. Query processing is method of retrieving the inverted indices which corresponds to query keywords and intersecting them for identification of relevant documents. Sudipto Guha et al. presented algorithm to compute intersection of an arbitrary number of sorted as well as unsorted lists which shows superior performance to achieve good speed up, effectiveness and load balance. They studied list intersection algorithm to reduce overhead of cache hierarchy and to take benefit of parallelism.

Result evaluation is based on real and synthetic data which validates efficiency of proposed algorithm. Vishwakarma Singh et al. studied queries that ask for satisfying given set of keywords of the tightest groups of points. They proposed a novel method known as Pro MiSH (Projection and Multiscale Hashing) which is used to achieve high scalability and speedup using random projection and hash based index structure. They presented an algorithm for finding top k tightest clusters in subset which retrieve the points from disk using B+ tree for exploration of final set of result. The results on real as well as synthetic data showed that ProMiSH had up to 60times of speed up over tree based techniques . In order to improve scalability Evandrino G. Barros et al. introduced PMK Stream (Parallel MK Stream) which evaluated multiple keyword queries for multiple parsing stacks.

The experimental results showed that PMK Stream is efficient for supporting keyword based search over XML data. Prefixbased numbering (PBN) was proposed by Curtis E. Dyreson et al. which is a popular method for numbering nodes in the hierarchy. They presented a strategy to virtually transform the data without renumbering and instantiating. The result was concise, support efficient querying, updating was efficient and practical. DipaliPal et al. proposed a way for indexing for large database which includes small and medium size graphs. For query processing, a query graph was mapped into signature which was used to search results. The experimental results are carried on both real as well as synthetic dataset and this approach provided a scalable, effective and efficient disk based solution for large and medium dataset. Donald Kossmann presented a technique for query processing which is used for information systems and distributed database. He proposed an architecture known as textbook which uses various techniques for parallelism. He also discussed distributed systems such as middle ware, client server and heterogeneous database system used for query processing.

With the help of experimental results Daniela Florescu et al. showed how XML query language could be elaborated to support keyword search. By combining structured query processing and keyword search was useful for both knowledge structure and XML data. Query performance calculated on the basis of three types such as structured, partially structured and unstructured. Indexing process is done by using inverted files and relational database. The result showed XML query processing performed efficiently and non-structure query executed faster than structured query .

Conclusion and further work

Top-Down XML Keyword Query Processing management report in data mining.Considering that the key factors resulting in the inefficiency for existing XML keyword search algorithms are the CAR and VUN problems, we proposed a generic topdown processing strategy that visits all CA nodes only once, thus avoids the CAR problem. We proved that the satisfiability of a node v w.r.t. the given semantics can be determined by v’s child nodes, based on which our methods avoid the VUN problem. Another salient feature is that our approach is independent of query semantics.

We proposed two efficient algorithms that are based on either traditional inverted lists or our newly proposed LLists to improve the overall performance. Further, we proposed three hash search-based methods to reduce the time complexity. The experimental results demonstrate the performance advantages of our proposed methods over existing ones. One of our future work is studying disk-based index to facilitate XML keyword query processing when the size of indexes becomes too large to be completely loaded into memory.

Top-Down XML Keyword Query Processing

Top-Down XML Keyword Query Processing

Abstract

INTRODUCTION

System Configuration:

H/W System Configuration:-

S/W System Configuration:-

Conclusion and further work

MOST POPULAR

Your Personal Nutritionist Using Fatsecret Api

Partial Replacement of Cement using Red Mud in SCC

Privacy-Preserving Data Encryption Strategy for Big Data in Mobile Cloud Computing

Expansion of Blackberry in Retail Market through Retail Auditing, Marketing, Research...

HOT NEWS

Privacy and Integrity Preserving Top-k Query Processing for Two-Tiered Sensor Network

Air Pollution

Fine Grained Two Factor Access Control for Web Based Cloud Computing...

Multi-Party Secret Key Agreement Over State-Dependent Wireless Broadcast Channels