Crowd Sourcing for Top-K Query Processing Over Uncertain Data


Abstract

Querying uncertain data has become a prominent application due to the proliferation of user-generated content from social media and of data streams from sensors. When data ambiguity cannot be reduced algorithmically, crowdsourcing proves a viable approach, which consists in posting tasks to humans and harnessing their judgment to improve confidence about data values or relationships. This paper tackles the problem of processing top-K queries over uncertain data with the help of crowdsourcing, so as to converge quickly to the real ordering of the relevant results. Several offline and online approaches for addressing questions to a crowd are defined and contrasted on both synthetic and real data sets, with the aim of minimizing the crowd interactions necessary to find the real ordering of the result set.

INTRODUCTION

In recent years, crowdsourcing techniques have attracted a lot of attention due to their effectiveness in real-life applications. They tackle tasks such as image tagging, decision making, and natural language processing, which are hard for computers but relatively easy for human workers. Successful crowdsourcing examples for question answering include Quora, Yahoo! Answers, and Stack Overflow, where users submit questions and get answers from the crowd. Using crowdsourcing techniques, such tasks can be solved well by human workers. Despite the success of crowdsourcing techniques, crowd-selection remains challenging. Earlier approaches usually focus on the problem of crowd-selection for simple tasks, where the selection procedure is based on the trustworthiness of workers.

However, in many cases, task-driven crowd-selection is a better solution. Consider a question answering task t, “What are the advantages of a B+ Tree over a B Tree?”. The existing crowd-selection procedure may not work in this case: trustworthy workers may not be skilled in the area of computer science and may therefore be unable to answer this question. In this paper, we formulate a new problem of task-driven crowd-selection. We focus on addressing the following three challenging issues.

Crowd Modeling.

We extend crowd modeling from single-dimensional trustworthiness to multi-dimensional skills for task-driven crowd-selection. Thus, we model the strengths and weaknesses of workers on latent categories of tasks. However, the existing latent skill models based on the Multinomial distribution are difficult to apply to our problem. Because of the Multinomial distribution, each worker's skill values over the latent categories are normalized to sum to one, so the skills of workers on specific latent categories are not comparable across workers. For example, suppose the latent skills of worker w_i are (CS 0.9, Math 0.1) and those of worker w_j are (CS 0.8, Math 0.2). Given the CS-based task above, the existing models select w_i since its skill value on CS is higher. However, it might be that w_j is actually better at CS and simply solves more Math-based tasks, which lowers w_j's normalized CS value. Thus, the skill values of those models cannot capture the strengths and weaknesses of workers on specific latent categories. We propose a novel crowd model to tackle this problem.
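The normalization problem above can be made concrete with a small sketch. All numbers are hypothetical; the point is only that a normalized (Multinomial-style) profile hides absolute ability:

```python
# With Multinomial-style skills, each worker's latent-category values sum
# to 1, so they reflect only the *relative* emphasis of a worker's activity,
# not absolute ability. All numbers below are made up for illustration.
normalized = {
    "w_i": {"CS": 0.9, "Math": 0.1},   # sums to 1
    "w_j": {"CS": 0.8, "Math": 0.2},   # sums to 1
}

# Selecting on the normalized value picks w_i for a CS task...
best = max(normalized, key=lambda w: normalized[w]["CS"])
assert best == "w_i"

# ...but the same normalized profiles can arise from very different
# absolute skills. Suppose (hypothetically) these are the true abilities:
absolute = {
    "w_i": {"CS": 0.45, "Math": 0.05},  # weaker overall
    "w_j": {"CS": 0.80, "Math": 0.20},  # stronger at CS in absolute terms
}
truly_best = max(absolute, key=lambda w: absolute[w]["CS"])
assert truly_best == "w_j"
```

This is why the proposed crowd model keeps unnormalized, per-category skill values that remain comparable across workers.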

Latent Skill Inference.

The probabilistic inference of the latent skills of workers is based on past resolved tasks. Existing inference approaches rely either on the content of tasks or on answer consistency with other workers. However, we argue that the skills of workers are not necessarily related to the number of tasks they have resolved or to the difference between their answers and those of others. Instead, we use the feedback score of resolved tasks to capture the latent skills of workers, arguing that the feedback score is a better quality measure for the job done by workers. Feedback scores are widely used in crowdsourcing platforms, such as the approval rate in Amazon Mechanical Turk and CrowdFlower, and the thumbs-up in question answering systems like Quora, Yahoo! Answers, and Stack Overflow. We propose a novel latent skill inference method based on resolved tasks with feedback scores.

Incremental Crowd-Selection.

Crowdsourced tasks arrive in large numbers and at high speed, so inferring the latent categories of crowdsourced tasks in batch is time-consuming. In this work, we first devise a Bayesian model that builds a latent category space and infers the skills of workers on that space from past resolved tasks. Next, we propose an incremental latent category inference algorithm that projects newly arriving tasks into the existing latent category space. We can then select the workers who are skilled in these categories to solve the tasks. In summary, this paper proposes a Bayesian model for task-driven crowd-selection.
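A minimal sketch of this incremental step, under simplifying assumptions (the per-word category weights and worker skills below are invented stand-ins for what batch inference would have produced; the real model is Bayesian and far richer):

```python
# Hypothetical sketch of incremental crowd-selection: a new task is projected
# into an existing latent category space via per-word category weights
# (assumed to come from past batch inference), and workers are ranked by how
# well their unnormalized skills match the task's category mixture.
word_topics = {                      # word -> weight per latent category
    "btree": [0.8, 0.1, 0.1],
    "index": [0.7, 0.2, 0.1],
    "integral": [0.1, 0.8, 0.1],
    "sql": [0.6, 0.1, 0.3],
}
skills = {                           # worker -> unnormalized skill per category
    "w_i": [0.45, 0.05, 0.10],
    "w_j": [0.80, 0.20, 0.05],
}

def project(task_words):
    """Fold a new task into the latent space: average its words' weights."""
    known = [word_topics[w] for w in task_words if w in word_topics]
    k = len(next(iter(word_topics.values())))
    return [sum(v[c] for v in known) / len(known) for c in range(k)]

def select_workers(task_words, n=1):
    theta = project(task_words)      # the task's latent category mixture
    score = {w: sum(s * t for s, t in zip(sk, theta))
             for w, sk in skills.items()}
    return sorted(score, key=score.get, reverse=True)[:n]

print(select_workers(["btree", "index", "sql"]))  # -> ['w_j']
```

Because projection touches only the new task's words, selection runs in real time without re-running batch inference over all past tasks.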

Our contributions are summarized as follows:

  • We propose the problem of task-driven crowd-selection for crowdsourced tasks.
  • We propose a novel crowd model that captures skills on a latent category space and enables task-driven crowd-selection. The model also makes the skills of workers on specific latent task categories comparable.
  • We make use of the feedback scores of resolved tasks for latent skill inference, treating the feedback scores as the quality measure of the job done by the workers.
  • We devise a variational algorithm that transforms the complex latent skill inference problem into a high-dimensional optimization problem, which can be solved efficiently.
  • We develop an incremental crowd-selection algorithm that chooses the right workers for incoming crowdsourced tasks in real time.
  • We evaluate our algorithm on data collected from three well-known crowdsourcing applications: Quora, Yahoo! Answers, and Stack Overflow. For all datasets tested, we show that the quality of the workers selected by our crowd model is superior to that of other existing worker models, including TSPM [8] and DRM.

    System Configuration:

    H/W System Configuration:

    Processor            : Pentium IV
    Speed                : 1 GHz
    RAM                  : 512 MB (min)
    Hard Disk            : 20 GB
    Keyboard             : Standard keyboard
    Mouse                : Two- or three-button mouse
    Monitor              : LCD/LED monitor

    S/W System Configuration:

    Operating System     : Windows XP/7
    Programming Language : Java/J2EE
    Software Version     : JDK 1.7 or above
    Database             : MySQL

 

The goal of this paper is to define and compare task selection policies for uncertainty reduction via crowdsourcing, with emphasis on the case of top-K queries. Given a data set with uncertain values, our objective is to pose to a crowd the set of questions that, within an allowed budget, minimizes the expected residual uncertainty of the result, possibly leading to a unique ordering of the top K results. The main contributions of the paper are as follows:

  1. We formalize a framework for uncertain top-K query processing, adapt to it existing techniques for computing the possible orderings, and introduce a procedure for removing unsuitable orderings, given new knowledge on the relative order of the objects.
  2. We define and contrast several measures of uncertainty, either agnostic (Entropy) or dependent on the structure of the orderings.
  3. We formulate the problem of Uncertainty Resolution (UR) in the context of top-K query processing over uncertain data with crowd support. The UR problem amounts to identifying the shortest sequence of questions that, when submitted to the crowd, ensures the convergence to a unique, or at least more determinate, sorted result set. We show that no deterministic algorithm can find the optimal solution for an arbitrary UR problem.
  4. We introduce two families of heuristics for question selection: offline, where all questions are selected prior to interacting with the crowd, and online, where crowd answers and question selection can intermix. For the offline case we define a relaxed, probabilistic version of optimality, and exhibit an algorithm that attains it as well as suboptimal but faster algorithms. We also generalize the algorithms to the case of answers collected from noisy workers.
  5. We propose an algorithm that avoids the materialization of the entire space of possible orderings to achieve even faster results.
  6. We conduct an extensive experimental evaluation of several algorithms on both synthetic and real datasets, and with a real crowd, in order to assess their performance and scalability.
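The core framework of contributions 1-3 can be illustrated with a deliberately simplified sketch (not the paper's actual algorithms): objects carry uncertain scores given as intervals, the possible orderings are enumerated, residual uncertainty is measured as entropy over those orderings, and a crowd answer on one pair prunes the orderings that contradict it. The interval values below are invented for illustration:

```python
from itertools import permutations
from math import log2

# Each object's uncertain score is an interval (low, high); an ordering is
# "possible" if every object ranked higher can score above every object
# ranked lower (upper bound of the former exceeds lower bound of the latter).
scores = {"a": (0.6, 0.9), "b": (0.5, 0.8), "c": (0.1, 0.2)}

def is_possible(p):
    return all(scores[p[i]][1] > scores[p[j]][0]
               for i in range(len(p)) for j in range(i + 1, len(p)))

def possible_orderings():
    return [p for p in permutations(scores) if is_possible(p)]

def entropy(orderings):
    # uncertainty measured as entropy of a uniform distribution over orderings
    return log2(len(orderings)) if orderings else 0.0

orderings = possible_orderings()
print(len(orderings), entropy(orderings))   # 2 orderings, entropy 1.0

# Crowd answers "a ranks above b": drop the orderings that contradict it.
orderings = [p for p in orderings if p.index("a") < p.index("b")]
print(len(orderings), entropy(orderings))   # 1 ordering, entropy 0.0
```

Here a single well-chosen question fully resolves the uncertainty; the question selection policies studied in the paper aim to find such high-impact questions within a budget, and contribution 5 avoids materializing the ordering space at all.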

RELATED WORK

Many works in the crowdsourcing area have studied how to exploit a crowd to obtain reliable results in uncertain scenarios. In one line of work, binary questions are used to label nodes in a directed acyclic graph, showing that an accurate question selection improves upon a random one. Other works aim to reduce the time and budget spent on labeling objects in a set by means of an appropriate question selection. Another approach proposes an online question selection strategy for finding the next most convenient question so as to identify the highest ranked object in a set. A query language where questions are asked to both humans and algorithms has also been described; there, humans are assumed to always answer correctly, and thus each question is asked only once. None of these works applies to a top-K setting, so they cannot be directly compared to our work.

Uncertainty in top-K queries.

Uncertainty representation. The problem of ranking tuples in the presence of uncertainty has been addressed in several works, including uncertain top-K queries on probabilistic databases, where the quality score of an uncertain top-K query on a probabilistic (i.e., uncertain) database is computed. Moreover, the authors address the problem of cleaning uncertainty to improve the quality of the query answer by collecting data from the real world multiple times (under budget constraints), so as to confirm or refute what is stated in the database.

Crowdsourcing via other task types.

In one work, the authors assume that the ordering of a set of objects is known and use a crowdsourcing-based algorithm to estimate their score values. In another, crowdsourcing is used to build a tree where the root represents an initial status, leaves represent a fixed objective, and each path represents a sequence of actions to be performed so as to meet the objective. Workers are provided with a subpath and are required to suggest the next action in the sequence. The goal of the proposed algorithm is to retrieve the best K paths from the tree.

Uncertainty in schema and object matching.

Schema matching.

Uncertainty in schema matching has been tackled by posing questions to workers. Uncertainty is measured via entropy, and two algorithms (online and offline) are proposed to select the questions that reduce uncertainty the most. A similar approach has been proposed in the context of web table schema matching, although only an online scenario is considered in that case. We have shown that, in top-K contexts, the results obtained by measuring uncertainty via entropy are largely outperformed by other criteria (e.g., UMPO). Noisy workers have also been used to validate schema matchings, with emphasis on the design of questions so as to maximize their informativeness and reduce the noise in validations. Yet, that work does not present any question selection strategy, which we have shown to be a useful means of obtaining good results even with a noisy crowd and simple boolean questions.

Object matching.

There are several noteworthy works on object matching, where the objective is to identify all pairs of matching objects between two collections. One work proposes a mixed online and offline approach, where the selected sequence of questions is annotated partially by machines and partially by users, and minimizes the number of questions answered by humans. This work was recently extended by the same authors, who propose two alternative algorithms for entity resolution. In another approach, the B most promising questions are asked to workers so as to enhance entity resolution.

Workers' accuracy estimation. Several works in the state of the art use majority voting as a tool for aggregating multiple noisy answers and computing trusted labels.

In other cases, workers are pre-filtered via qualification tests, so that low-quality workers cannot access the submitted tasks. Experts may also be used to validate uncertain answers. Other works in crowd-related research propose ways to estimate workers' accuracy: it may be computed from the number of disagreements with other workers' answers (i.e., the larger the number of disagreements, the larger the error probability), or by modeling the behavior of high-quality workers versus spammers. In some models, the error probability of the worker is assumed to be known, and a worker's answer is considered less relevant as the error probability grows. Finally, one approach mixes test questions to filter out spammers, majority voting to improve the accuracy of single workers, and estimation of the error probability based on task difficulty.
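The two aggregation ideas above (plain majority voting, and weighting answers by a known error probability) can be sketched as follows. The workers and error probabilities are hypothetical; the log-odds weight is the standard choice for independent workers with known error rates:

```python
from collections import defaultdict
from math import log

def majority_vote(answers):
    """answers: list of (worker, label) pairs. Returns the most frequent label."""
    counts = defaultdict(int)
    for _, label in answers:
        counts[label] += 1
    return max(counts, key=counts.get)

def weighted_vote(answers, error_prob):
    """Weight each worker by log((1-e)/e): reliable workers (small e) count
    more, and a worker with e = 0.5 (coin flip) contributes nothing."""
    counts = defaultdict(float)
    for worker, label in answers:
        e = error_prob[worker]
        counts[label] += log((1 - e) / e)
    return max(counts, key=counts.get)

answers = [("w1", "yes"), ("w2", "yes"), ("w3", "no")]
error_prob = {"w1": 0.4, "w2": 0.4, "w3": 0.05}  # hypothetical reliabilities

print(majority_vote(answers))               # -> yes (2 votes vs 1)
print(weighted_vote(answers, error_prob))   # -> no (w3 is far more reliable)
```

The example shows why the error-probability-aware variants discussed above can overturn a raw majority: one highly reliable worker outweighs two near-random ones.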

Conclusions and future work

In this paper we have introduced Uncertainty Reduction (UR), the problem of identifying the minimal set of questions to be submitted to a crowd in order to reduce the uncertainty in the ordering of top-K query results. Since UR does not admit deterministic optimal algorithms, we have introduced two families of heuristics (offline and online, plus a hybrid thereof) capable of reducing the expected residual uncertainty of the result set.

The proposed algorithms have been evaluated experimentally on both synthetic and real data sets, against baselines that select questions either randomly or by focusing on tuples with an ambiguous order. The experiments show that the offline and online best-first search algorithms achieve the best performance, but are computationally impractical. These trends are further validated on the real datasets. Future work will focus on generalizing the UR problem and heuristics to other kinds of uncertain data and queries, for example skill-based expert search, where queries are desired skills, results contain lists of people sorted by topical expertise, and skills can be endorsed by community peers.