
Efficient R-Tree Based Indexing Scheme for Server-Centric Cloud Storage System
Abstract
Cloud storage systems pose new challenges to the community in supporting efficient concurrent querying tasks for various data-intensive applications, where indexes always hold essential positions. Manual indexing is time-consuming and requires many hours of manual effort to index a repository that grows day by day. In this paper, RT-HCN (R-Tree Hierarchical Irregular Compound Networks) is proposed, an indexing scheme that integrates an R-tree based indexing structure with an HCN-based routing protocol.
RT-HCN organizes storage and compute nodes into an HCN overlay, one of the newly proposed server-centric data center topologies. Based on the properties of HCN, a specific index mapping technique is designed to maintain layered global indexes, together with corresponding query processing algorithms to support efficient query tasks. The idea of RT-HCN is also extended to another server-centric data center topology. The results validate the query efficiency of RT-HCN, especially the speedup of point queries, indicating its potential applicability in future data centers.
Introduction
Cloud storage systems have continuously drawn attention from both academia and industry in recent years. From classical systems for general data services, such as Google’s GFS, Amazon’s Dynamo, and Facebook’s Cassandra, to newly designed systems with specialities, such as Haystack, Megastore, and Spanner, various distributed storage systems have been constructed to satisfy the increasing demand of online data-intensive applications that require massive scalability, efficient manageability, reliable availability, and low latency in the storage layer. Correspondingly, many works have proposed new indexing schemes and data management systems to support large-scale data analytical jobs and highly concurrent OLTP queries. In a cloud DB system, datasets are partitioned among distributed servers, and users may hop among multiple servers to process their query requests.
To provide an efficient indexing scheme, a common feature of the above literature is that the indices are split into two categories, global index and local index, and portions of the global indices are then distributed to each server according to an overlay network architecture. Following the direction of the indices, users route queries among servers based on the underlying routing protocols of the network connecting the servers. However, all these designs are constructed on P2P networks, while cloud systems nowadays are usually built on an infrastructure called a data center, which consists of a large number of servers interconnected by a specific Data Center Network (DCN). For instance, Cisco applies the Fat-Tree topology as its DCN architecture for provably efficient communication.
Unlike a P2P network, a DCN is more structured, with low equipment cost, high network capacity, and support for incremental expansion. Such infrastructures naturally bring new challenges for researchers designing efficient indexing schemes to support query processing for various applications. In this paper, we propose RT-HCN, a distributed indexing scheme for multi-dimensional query processing in Hierarchical Irregular Compound Networks (HCN), a recently designed DCN structure using dual-port servers. HCN has many attractive features, including low diameter, high bisection width, and a large number of node-disjoint paths for pairwise traffic, and it supports low-overhead and robust routing mechanisms. Additionally, in many online data-intensive services users tend to query data with more than one key; e.g., in the YouTube video system users may want to find videos via both video IDs and size ranges.
Therefore, designing an indexing scheme for multi-dimensional query processing in HCN is useful for real-world cloud applications and has both theoretical and practical significance. To search the data efficiently, an R-tree based multi-dimensional index is used in our system. RT-HCN integrates the HCN-based routing protocol with R-Tree based indexing technology. Similar to previous works, RT-HCN is a two-layer indexing scheme with a global index layer and a local index layer. Since datasets are distributed among different servers, an R-Tree-like indexing structure is used on each server to index its locally stored data. Next, RT-HCN distributes portions of these local indices across servers as their global indices. To avoid a single-master bottleneck, each server maintains only a partial global index for its potential index range. Based on the characteristics of HCN, we design an index publishing rule to guarantee an “onto” mapping from the global index to locally stored data. We also propose the corresponding query processing algorithms to achieve query efficiency and load balancing for each node in the network.
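The two-layer idea described above can be illustrated with a minimal Python sketch. It is not the RT-HCN mapping rule itself: the class name `TwoLayerIndex`, the server IDs, and the sample (video ID, size) points are all hypothetical, each server publishes only one bounding rectangle, and the global layer is kept in a single list for brevity, whereas RT-HCN spreads partial global indexes across servers.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # e.g. (video_id, size_mb); illustrative only


@dataclass(frozen=True)
class Rect:
    """Axis-aligned bounding rectangle (an MBR in R-tree terms)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other: "Rect") -> bool:
        return (self.xmin <= other.xmax and other.xmin <= self.xmax
                and self.ymin <= other.ymax and other.ymin <= self.ymax)

    def contains(self, p: Point) -> bool:
        return self.xmin <= p[0] <= self.xmax and self.ymin <= p[1] <= self.ymax


def bounding_rect(points: List[Point]) -> Rect:
    """Smallest rectangle covering all points (the list must be non-empty)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return Rect(min(xs), min(ys), max(xs), max(ys))


class TwoLayerIndex:
    """Hypothetical sketch: a global layer of published MBRs over local data."""

    def __init__(self, data_by_server: Dict[str, List[Point]]):
        self.local = data_by_server
        # Global layer: one published MBR per non-empty server.
        self.global_entries = [
            (bounding_rect(pts), sid)
            for sid, pts in data_by_server.items() if pts
        ]

    def candidate_servers(self, query: Rect) -> List[str]:
        """Servers whose published MBR overlaps the query rectangle."""
        return sorted(sid for mbr, sid in self.global_entries
                      if mbr.intersects(query))

    def range_query(self, query: Rect) -> List[Point]:
        """Consult the global layer first, then scan only candidate servers."""
        hits: List[Point] = []
        for sid in self.candidate_servers(query):
            hits.extend(p for p in self.local[sid] if query.contains(p))
        return sorted(hits)


index = TwoLayerIndex({
    "s1": [(1, 10), (2, 40)],
    "s2": [(50, 5), (60, 300)],
    "s3": [(100, 20), (120, 25)],
})
query = Rect(0, 0, 55, 50)
print(index.candidate_servers(query))  # ['s1', 's2']
print(index.range_query(query))        # [(1, 10), (2, 40), (50, 5)]
```

In this toy run the server `s3` is never contacted, because its published rectangle does not overlap the query; that pruning is the purpose of the global layer.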
Finally, we prove theoretically that RT-HCN is both query-efficient and space-efficient: servers do not maintain redundant indices, while a large number of users can concurrently process queries with low routing cost. We compare our design with RT-CAN, a similar design for traditional P2P networks. Experiments validate the efficiency of our proposed scheme and indicate its potential for implementation in data centers. The contributions of this paper are threefold:
- to the best of our knowledge, we are the first to propose a distributed multi-dimensional indexing scheme for a specific DCN structure to improve query efficiency and system QoS;
- noticing and taking advantage of the topology of HCN, we present a specialized mapping technique to improve global index allocation in the network, resulting in query efficiency and load balancing for the cloud system; and
- we theoretically prove the efficiency of RT-HCN, and compare our model numerically with RT-CAN, an indexing scheme for P2P networks. Simulation results show that our scheme uses less index space per node while providing faster query processing with higher bandwidth.
Related Works
Nowadays, there are many distributed storage systems that help manage big data for cloud applications. Among them are excellent commercial implementations such as Amazon’s key-value store Dynamo, Google’s Google File System (GFS), and BigTable, which aim at dealing with large scales of data. Meanwhile, open source systems such as HDFS, HBase and HyperTable also provide a good platform for research use. Cassandra is a non-relational database that combines features of BigTable and Dynamo. Other systems, such as Ceph, are designed to provide high performance in object retrieval.
We want to build a second-level overview index, and our work follows a framework proposed in earlier work. That framework offers the idea of building a two-level index in a cloud system for data retrieval on top of a physical layer. Moreover, an efficient and extensible framework for indexing in P2P-based cloud systems has been put forward. However, more specialized topologies have since been designed to meet the requirements of today’s cloud systems, which is why we apply the two-level index design to a specific data center network and discuss the resulting improvement. Because the topology of a DCN is known, we can bound the processing time by calculating the number of physical hops needed for a given query, whereas in a P2P network only the logical hops of the overlay network can be estimated. A data center network (DCN) is the network infrastructure inside a data center, which connects a large number of servers via high-speed links and switches.
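To make the point about physical hops concrete, the following Python sketch counts the minimum number of links between two devices with a breadth-first search over a known topology. The function name `physical_hops` and the toy graph (two switches, four servers, two of them dual-port) are assumptions made for illustration; the graph is not an actual HCN construction, and HCN's own routing protocol is not reproduced here.

```python
from collections import deque
from typing import Dict, List


def physical_hops(topology: Dict[str, List[str]], src: str, dst: str) -> int:
    """Return the minimum number of links from src to dst, or -1 if unreachable."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in topology.get(node, []):
            if nxt == dst:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return -1


# Hypothetical toy topology: servers b and c each use a second port
# to link the two switch groups directly.
toy = {
    "sw0": ["a", "b"],
    "a": ["sw0"],
    "b": ["sw0", "c"],
    "c": ["b", "sw1"],
    "sw1": ["c", "d"],
    "d": ["sw1"],
}
print(physical_hops(toy, "a", "b"))  # 2
print(physical_hops(toy, "a", "d"))  # 5
```

Because every link in the graph is a physical one, the returned count is a hop bound of the kind that cannot be obtained from a P2P overlay, where only logical hops are visible.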
Compared to traditional cloud systems, which are usually based on P2P networks, carefully designed DCN topologies meet these requirements with low cost, high scalability, low configuration overhead, robustness and energy savings. DCN structures can be roughly divided into two categories: switch-centric, such as VL2 and Fat-Tree, and server-centric, such as BCube, DCell, FiConn, MDCube and uFix. The server-centric designs usually have more advantages than the switch-centric ones. HCN [22], the topology chosen in our system, is a server-centric topology. It is a well-designed network for data centers and offers a high degree of regularity, scalability, and symmetry. Unlike with traditional P2P networks, we have to be aware of the physical topology when discussing DCNs, which is why we need to improve the mapping technique for distributing the global index to fit a given network.
System Configuration:
H/W System Configuration:-
Processor : Pentium IV
Speed : 1 GHz
RAM : 512 MB (min)
Hard Disk : 20GB
Keyboard : Standard Keyboard
Mouse : Two or Three Button Mouse
Monitor : LCD/LED Monitor
S/W System Configuration:-
Operating System : Windows XP/7
Programming Language : Java/J2EE
Software Version : JDK 1.7 or above
Database : MySQL
EXISTING AND PROPOSED SYSTEMS
Existing System
Previous work was built on P2P networks with a global index, such as BATON and CAN. P2P networks give better performance guarantees for links at the logical level than at the physical level, because the underlying topology is effectively unknown and nodes may be spread widely with unbounded physical hop distance, which causes performance to fluctuate. Hence, carrying a P2P indexing scheme over to DCN topologies is not a sensible choice. Such infrastructures bring new challenges in devising an efficient indexing scheme to support efficient query processing for a range of applications.
Disadvantages
- Performance is unstable.
- Query processing is not efficient.
Proposed System
A distributed multi-dimensional indexing scheme is designed for server-centric data center networks. Hierarchical Irregular Compound Networks (HCN) is used as a running example, being one of the mainstream server-centric topologies. It is a simple and elegant topology whose regularity and inherent scalability ease future expansion and index construction. The proposed scheme, named RT-HCN, is a two-layer indexing design. Since the datasets are distributed among different servers, an R-Tree-like indexing structure can be used to index the multi-dimensional data stored locally on each server.
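As a rough illustration of how an R-Tree-like local index answers a range query, the sketch below hand-builds a small two-level tree and descends only into nodes whose minimum bounding rectangle (MBR) overlaps the query. The names `Node`, `make_leaf`, `make_inner` and `range_search` are hypothetical, and the sketch omits the insertion and node-splitting logic of a real R-tree.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def overlaps(a: Box, b: Box) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def inside(p: Point, box: Box) -> bool:
    return box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3]


@dataclass
class Node:
    mbr: Box
    children: List["Node"] = field(default_factory=list)  # inner nodes only
    points: List[Point] = field(default_factory=list)     # leaf nodes only


def make_leaf(points: List[Point]) -> Node:
    """Leaf node whose MBR tightly covers its points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return Node(mbr=(min(xs), min(ys), max(xs), max(ys)), points=list(points))


def make_inner(children: List[Node]) -> Node:
    """Inner node whose MBR is the union of its children's MBRs."""
    return Node(
        mbr=(min(c.mbr[0] for c in children), min(c.mbr[1] for c in children),
             max(c.mbr[2] for c in children), max(c.mbr[3] for c in children)),
        children=list(children),
    )


def range_search(node: Node, query: Box) -> List[Point]:
    """Collect points inside query, skipping subtrees whose MBR misses it."""
    if not overlaps(node.mbr, query):
        return []
    if not node.children:
        return [p for p in node.points if inside(p, query)]
    found: List[Point] = []
    for child in node.children:
        found.extend(range_search(child, query))
    return found


root = make_inner([
    make_leaf([(1, 1), (2, 3)]),
    make_leaf([(8, 8), (9, 7)]),
    make_leaf([(4, 9), (5, 6)]),
])
print(root.mbr)                          # (1, 1, 9, 9)
print(range_search(root, (0, 0, 5, 5)))  # [(1, 1), (2, 3)]
```

Only the first leaf is scanned for this query; the other two are rejected by their MBRs alone, which is where the efficiency of the structure comes from.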
SYSTEM ARCHITECTURE
The architecture shows the execution flow of the project: the user enters a query, which is passed to the server; the server creates a node, processes the query, and returns the result to the user.
Advantages
- A large number of users can process queries concurrently, and the routing cost incurred is low.
- This is possible because RT-HCN is both query-efficient and space-efficient, so servers do not need to maintain redundant indices.
Module Description
- User query
The user submits a query to the server, and a large number of users can issue queries continuously with minimal routing cost. User queries are specified over the location and the pattern of objects and are then sent to the server.
- Meta Server
In the implementation, the meta server is the server responsible for locating each object. The objects in each region are registered with unique information such as a port and an ID number.
- R-tree index
Because of their simplicity and efficiency, R-trees are regarded as a leading class of indexes for spatial query processing, since they are built for indexing multi-dimensional data. With this structure, items can be retrieved quickly by traversing the tree, because the spatial objects and their relationships are stored and organized in the tree structure.
- Index publishing
The nodes to be published are selected, in a reasonable manner, starting from the second level of the R-tree down to the last level.
- Query Results
Once the above processing is done, the result is sent from the server to the client, where the user can decrypt it and look up the information. The query results returned by the server are k-nearest-neighbor (KNN) results, which represent the most relevant information.
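For readers unfamiliar with KNN results, the following Python sketch returns the k points closest to a query point by Euclidean distance. It is a brute-force scan written only to show what a KNN answer contains; the function name `knn` and the sample points are hypothetical, and an index-backed version would instead prune subtrees by bounding-box distance.

```python
import heapq
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def knn(points: Sequence[Point], query: Point, k: int) -> List[Point]:
    """Return the k points nearest to query, closest first (brute force)."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))


points = [(0, 0), (1, 1), (5, 5), (2, 2)]
print(knn(points, (1.2, 1.2), 2))  # [(1, 1), (2, 2)]
```

`math.dist` requires Python 3.8 or later.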
Conclusion
In this paper, we proposed an indexing scheme named RT-HCN for multi-dimensional query processing in data centers, which are the infrastructures for building cloud storage systems and are interconnected using a specific data center network (DCN). RT-HCN is a two-layer indexing scheme that integrates the HCN-based routing protocol with R-Tree based indexing technology, with portions of the index distributed across every server. Based on the characteristics of HCN, we design a special index publishing rule and query processing algorithms to guarantee efficient data management for the whole network. We prove theoretically that RT-HCN is both query-efficient and space-efficient: each server maintains only a constrained number of indices, while a large number of users can concurrently process queries with low routing cost. We compare our design with RT-CAN, a similar design for traditional P2P networks. Experiments validate the efficiency of our proposed scheme and indicate its potential for implementation in data centers.







