Data Lineage in Malicious Environments

Abstract

Intentional or unintentional leakage of confidential data is undoubtedly one of the most severe security threats that organizations face in the digital era. The threat now extends to our personal lives: a plethora of personal information is available to social networks and smartphone providers and is indirectly transferred to untrustworthy third-party and fourth-party applications. In this work, we present LIME, a generic data lineage framework for data flow across multiple entities that take two characteristic, principal roles (i.e., owner and consumer). We define the exact security guarantees required by such a data lineage mechanism toward identification of a guilty entity, and identify the simplifying non-repudiation and honesty assumptions. We then develop and analyze a novel accountable data transfer protocol between two entities within a malicious environment by building upon oblivious transfer, robust watermarking, and signature primitives. Finally, we perform an experimental evaluation to demonstrate the practicality of our protocol and apply our framework to the important data leakage scenarios of data outsourcing and social networks. In general, we consider LIME, our lineage framework for data transfer, to be a key step towards achieving accountability by design.
 

Introduction

In the digital era, information leakage through unintentional exposures, or intentional sabotage by disgruntled employees and malicious external entities, presents one of the most serious threats to organizations. According to an interesting chronology of data breaches maintained by the Privacy Rights Clearinghouse (PRC), in the United States alone, 868,045,823 records have been breached from 4,355 data breaches made public since 2005 [1]. It is not hard to believe that this is just the tip of the iceberg, as most cases of information leakage go unreported due to fear of loss of customer confidence or regulatory penalties: it costs companies on average $214 per compromised record [2]. Large amounts of digital data can be copied at almost no cost and can be spread through the Internet in a very short time. Additionally, the risk of getting caught for data leakage is very low, as there are currently almost no accountability mechanisms. For these reasons, the problem of data leakage has reached a new dimension.

Data leakage is not only a concern for companies; it also affects individuals. The rise of social networks and smartphones has made the situation worse. In these environments, individuals disclose their personal information to various service providers, commonly known as third party applications, in return for possibly free services. In the absence of proper regulations and accountability mechanisms, many of these applications share individuals' identifying information with dozens of advertising and Internet tracking companies.

Even with access control mechanisms, where access to sensitive data is limited, a malicious authorized user can publish sensitive data as soon as he receives it. Primitives like encryption offer protection only as long as the information of interest is encrypted; once the recipient decrypts a message, nothing can prevent him from publishing the decrypted content. Thus it seems impossible to prevent data leakage proactively. Privacy, consumer rights, and advocacy organizations such as PRC [3] and EPIC [4] try to address the problem of information leakage through policies and awareness. However, as seen in the following scenarios, the effectiveness of policies is questionable as long as it is not possible to provably associate the guilty parties with the leakages.

Scenario 1: Social networking. It was reported that third party applications of the widely used online social network (OSN) Facebook leak sensitive private information about the users or even their friends to advertising companies [5]. In this case, it was possible to determine that several applications were leaking data by analyzing their behaviour, and so these applications could be disabled by Facebook. However, it is not possible to hold a particular application responsible for leakages that have already happened, as many different applications had access to the private data.

Scenario 2: Outsourcing. Up to 108,000 Florida state employees were informed that their personal information had been compromised due to improper outsourcing [6]. The outsourcing company that was handed the sensitive data hired a further subcontractor, which in turn hired another subcontractor in India. Although the offshore subcontractor is suspected, it is not possible to provably associate any of the three companies with the leakage, as each of them had access to the data and could have leaked it.

We find that the above and other data leakage scenarios can be attributed to an absence of accountability mechanisms during data transfers: leakers either do not focus on protection, or they intentionally expose confidential data without any concern, as they are convinced that the leaked data cannot be linked to them. In other words, when entities know that they can be held accountable for leakage of some information, they will demonstrate a better commitment towards its required protection. In some cases, identification of the leaker is made possible by forensic techniques, but these are usually expensive and do not always generate the desired results. Therefore, we point out the need for a general accountability mechanism in data transfers. This accountability can be directly associated with provably detecting the transmission history of data across multiple entities, starting from its origin. This is known as data provenance, data lineage or source tracing. The data provenance methodology, in the form of robust watermarking techniques [7] or adding fake data [8], has already been suggested in the literature and employed by some industries. However, most efforts have been ad hoc in nature and there is no formal model available. Additionally, most of these approaches only allow identification of the leaker in a non-provable manner, which is not sufficient in many cases.


Our Contributions

In this paper, we formalize the problem of provably associating the guilty party with a leakage, and work on data lineage methodologies to solve the problem of information leakage in various leakage scenarios.

As our first contribution, we define LIME, a generic data lineage framework for data flow across multiple entities in a malicious environment. We observe that entities in data flows assume one of two roles: owner or consumer. We introduce an additional role in the form of an auditor, whose task is to determine a guilty party for any data leak, and define the exact properties for communication between these roles. In the process, we identify an optional non-repudiation assumption made between two owners, and an optional trust (honesty) assumption made by the auditor about the owners. The key advantage of our model is that it enforces accountability by design; i.e., it drives the system designer to consider possible data leakages and the corresponding accountability constraints at the design stage. This helps to overcome the existing situation where most lineage mechanisms are applied only after a leakage has happened.

As our second contribution, we present an accountable data transfer protocol to verifiably transfer data between two entities. To deal with the untrusted sender and untrusted receiver scenario associated with data transfer between two consumers, our protocol employs a combination of the robust watermarking, oblivious transfer, and signature primitives. We implement our protocol as a C++ library: we use the pairing-based cryptography (PBC) library [9] to build the underlying oblivious transfer and signature primitives; we choose images as a representative document type and use the Cox algorithm for robust image watermarking [10]. We thoroughly analyze the storage, communication and computation overheads of our protocol, and our performance analysis demonstrates its practicality. To simulate longer data transfer chains, we also perform experiments with multiple iterations of our implementation and find it to be robust. Finally, we demonstrate usage of the protocol in real-life data transfer scenarios such as online social networks and outsourcing.
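To make the watermarking component concrete, the sketch below shows Cox-style multiplicative spread-spectrum embedding and correlation-based detection on a plain vector standing in for an image's perceptually significant coefficients. It is a minimal illustration only: the coefficient values, the strength parameter alpha, the watermark length, and the seed-derived fingerprint are illustrative assumptions, not parameters fixed by our protocol.

```cpp
// cox_watermark.cpp -- minimal sketch of Cox-style spread-spectrum watermarking.
// Assumption: we operate on a vector of "perceptually significant" coefficients
// (e.g., large-magnitude DCT coefficients of an image); alpha, the watermark
// length and the seed-based fingerprint are illustrative choices.
#include <cmath>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

// Derive a pseudo-random Gaussian watermark (the recipient's fingerprint) from a seed.
std::vector<double> make_watermark(std::size_t n, std::uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::normal_distribution<double> gauss(0.0, 1.0);
    std::vector<double> w(n);
    for (auto& x : w) x = gauss(rng);
    return w;
}

// Embed: v'_i = v_i * (1 + alpha * w_i)  (multiplicative Cox embedding).
std::vector<double> embed(const std::vector<double>& coeffs,
                          const std::vector<double>& w, double alpha = 0.1) {
    std::vector<double> out(coeffs.size());
    for (std::size_t i = 0; i < coeffs.size(); ++i)
        out[i] = coeffs[i] * (1.0 + alpha * w[i]);
    return out;
}

// Detect: correlate the extracted signal with a candidate watermark
// (Cox's similarity measure sim(X, X*) = X*.X / sqrt(X*.X*)).
double similarity(const std::vector<double>& extracted,
                  const std::vector<double>& w) {
    double dot = 0.0, norm = 0.0;
    for (std::size_t i = 0; i < extracted.size(); ++i) {
        dot += extracted[i] * w[i];
        norm += extracted[i] * extracted[i];
    }
    return dot / std::sqrt(norm);  // large value => watermark present
}

int main() {
    // Toy "document": 1000 coefficients of a host signal.
    std::vector<double> coeffs(1000, 100.0);
    auto w = make_watermark(coeffs.size(), /*recipient fingerprint seed=*/42);
    auto marked = embed(coeffs, w);

    // Extract the embedded signal as the difference from the original.
    std::vector<double> extracted(coeffs.size());
    for (std::size_t i = 0; i < coeffs.size(); ++i)
        extracted[i] = marked[i] - coeffs[i];

    std::cout << "similarity with correct fingerprint: " << similarity(extracted, w) << "\n";
    std::cout << "similarity with wrong fingerprint:   "
              << similarity(extracted, make_watermark(coeffs.size(), 7)) << "\n";
}
```

A high similarity value for the correct fingerprint, and a near-zero value for any other, is what allows a leaked copy to be traced back to a particular recipient.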

Model

As LIME is a general model and should be applicable to all cases, we abstract the data type and call every data item a document. There are three different roles that can be assigned to the involved parties in LIME: data owner, data consumer and auditor. The data owner is responsible for the management of documents, and the consumer receives documents and can carry out some task using them. The auditor is not involved in the transfer of documents; he is only invoked when a leakage occurs and then performs all steps that are necessary to identify the leaker. All of the mentioned roles can have multiple instantiations when our model is applied to a concrete setting. We refer to a concrete instantiation of our model as a scenario.

In typical scenarios the owner transfers documents to consumers. However, it is also possible that consumers pass on documents to other consumers or that owners exchange documents with each other. In the outsourcing scenario [6], the employees and their employer are owners, while the outsourcing companies are untrusted consumers.

In the following we show the relations between the different entities and introduce optional trust assumptions. We use these trust assumptions only because we find them realistic in real-world scenarios and because they allow for more efficient data transfer in our framework. At the end of this section we explain how our framework can be applied without any trust assumptions. When documents are transferred from one owner to another, we can assume that the transfer is governed by a non-repudiation assumption. This means that the sending owner trusts the receiving owner to take responsibility if he should leak the document. As we consider consumers to be untrusted participants in our model, a transfer involving a consumer cannot be based on a non-repudiation assumption. Therefore, whenever a document is transferred to a consumer, the sender embeds information that uniquely identifies the recipient. We call this fingerprinting. If the consumer leaks this document, it is possible to identify him with the help of the embedded information.
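As a purely illustrative example of how the roles and the per-transfer information could be represented in an implementation, consider the following sketch; the record layout and field names are assumptions made for this illustration and are not mandated by the model.

```cpp
// lineage_model.cpp -- illustrative-only representation of the LIME roles and
// of the information recorded per document transfer. Field names and the
// record layout are assumptions made for this sketch, not prescribed by LIME.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

enum class Role { Owner, Consumer, Auditor };

struct Party {
    std::string id;
    Role role;
};

// One hop in a document's lineage. Owner-to-owner transfers rely on the
// non-repudiation assumption; whenever the recipient is a consumer, the
// sender embeds a fingerprint and keeps enough information to let the
// auditor later match a leaked copy to this recipient.
struct TransferRecord {
    std::string document_id;
    std::string sender_id;
    std::string recipient_id;
    bool fingerprinted;               // true iff the recipient is a consumer
    std::uint64_t fingerprint_seed;   // identifies the embedded watermark
};

int main() {
    // Outsourcing scenario: an owner passes a document to an (untrusted) consumer.
    std::vector<Party> parties = {{"employer", Role::Owner},
                                  {"outsourcing firm", Role::Consumer},
                                  {"auditor", Role::Auditor}};
    TransferRecord hop{"doc-001", "employer", "outsourcing firm", true, 42};
    std::cout << parties.size() << " parties; transfer: " << hop.sender_id
              << " -> " << hop.recipient_id
              << (hop.fingerprinted ? " (fingerprinted)" : "") << "\n";
}
```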

Threat Model and Design Goals

Although we try to address the problem of data leakage, LIME cannot guarantee that data leakage does not occur in the first place; once a consumer has received a document, nothing can prevent him from publishing it. We offer a method to provably identify the guilty party once a leakage has been detected. By introducing this reactive accountability, we expect that leakage will occur less often, since the identification of the guilty party will in most cases lead to negative consequences. As our only goal is to identify guilty parties, the attacks we are concerned about are those that prevent the auditor from provably identifying the guilty party. Therefore, we consider attackers in our model to be consumers that take every possible step to publish a document without being held accountable for their actions. As the owner does not trust the consumer, he uses fingerprinting every time he passes a document to a consumer. However, we assume that the consumer tries to remove this identifying information in order to be able to publish the document safely.

As already mentioned, consumers might transfer a document to another consumer, so we also have to consider the case of an untrusted sender. This is problematic because a sending consumer who embeds an identifier and sends the marked version to the receiving consumer could keep a copy of this version, publish it, and so frame the receiving consumer. Another possibility to frame other consumers is to apply fingerprinting to a document without even performing a transfer and publish the resulting document. A different problem that arises with the possibility of false accusation is denial: if false accusation is possible, then every guilty receiving consumer can claim that he is innocent and was framed by the sending consumer.

The crucial phase in our model is the transfer of a document involving untrusted entities, so we clearly define which properties we require our protocol to fulfill. We call the two parties sender and recipient. We expect a transfer protocol to fulfill the following properties and only tolerate failures with negligible probability.

1) Correctness: When both parties follow the protocol steps correctly and only publish their version of the document, the guilty party can be found.

2) No framing: The sender cannot frame recipients for the sender’s leakages.

3) No denial: If the recipient leaks a document, he can be provably associated with it.

We also require our model to be collusion resistant, i.e., it should be able to tolerate a small number of colluding attackers [11]. We also assume that the communication links between parties are reliable. A sketch of the auditor's identification step is given after this paragraph.

Non-goals. We do not aim at proactively stopping data leakage; we only provide means to provably identify the guilty party in case a leak should occur, so that further steps can be taken. We also do not aim for integrity, as at any point an entity can decide to exchange the document to be sent with another one. However, in our settings, the sender wants the receiver to have the correct document, as he expects the recipient to perform a task using the document so that he eventually obtains a meaningful result. Our approach does not account for derived data (derived data can, for example, be generated by applying aggregate functions or other statistical operations), as much of the original information can be lost during the creation of derived data. Nevertheless, we show in Section 7.1 how LIME can operate on composed data. We think of composed data as a form of data created from multiple single documents, such that the original documents can be completely extracted (e.g., a concatenation of documents). We do not consider fairness issues in our accountable transfer protocol; more precisely, we do not consider scenarios in which a sender starts to run the transfer protocol but aborts before the recipient has received the document, or in which a recipient, despite receiving the document, falsely claims that he did not receive it. In real-world scenarios, we find fairness not to be an issue, as senders and recipients expect some utility from the transfer and are worried about their reputation and corporate liabilities.
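The following sketch illustrates, under the same simplifying assumptions as the earlier watermarking example, how an auditor could match a leaked copy against the fingerprints recorded for past transfers. The record layout, the correlation detector and the detection threshold are illustrative choices; the actual protocol additionally relies on oblivious transfer and signatures to achieve the no-framing and no-denial properties.

```cpp
// auditor_check.cpp -- illustrative sketch of an auditor matching a leaked copy
// against recorded fingerprints. Record layout, threshold and the correlation
// detector are simplifying assumptions, not the full accountable protocol.
#include <cmath>
#include <cstdint>
#include <iostream>
#include <random>
#include <string>
#include <vector>

// Re-derive the watermark that was embedded for a given recipient.
std::vector<double> watermark_for(std::uint64_t seed, std::size_t n) {
    std::mt19937_64 rng(seed);
    std::normal_distribution<double> gauss(0.0, 1.0);
    std::vector<double> w(n);
    for (auto& x : w) x = gauss(rng);
    return w;
}

struct TransferRecord {
    std::string recipient_id;
    std::uint64_t fingerprint_seed;
};

// Normalised correlation between the leaked signal and a candidate watermark.
double correlation(const std::vector<double>& leaked, const std::vector<double>& w) {
    double dot = 0.0, nl = 0.0, nw = 0.0;
    for (std::size_t i = 0; i < leaked.size(); ++i) {
        dot += leaked[i] * w[i];
        nl  += leaked[i] * leaked[i];
        nw  += w[i] * w[i];
    }
    return dot / (std::sqrt(nl) * std::sqrt(nw));
}

// Return the recipient whose fingerprint correlates best with the leaked copy,
// or an empty string if no correlation exceeds the (illustrative) threshold.
std::string identify_leaker(const std::vector<double>& leaked_signal,
                            const std::vector<TransferRecord>& records,
                            double threshold = 0.5) {
    std::string best;
    double best_corr = threshold;
    for (const auto& r : records) {
        double c = correlation(leaked_signal,
                               watermark_for(r.fingerprint_seed, leaked_signal.size()));
        if (c > best_corr) { best_corr = c; best = r.recipient_id; }
    }
    return best;
}

int main() {
    // Toy setup: the leaked signal is (for simplicity) exactly consumer B's watermark.
    std::vector<TransferRecord> records = {{"consumer A", 7}, {"consumer B", 42}};
    auto leaked = watermark_for(42, 1000);
    std::cout << "suspected leaker: " << identify_leaker(leaked, records) << "\n";
}
```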

Conclusion

We present LIME, a model for accountable data transfer across multiple entities. We define the participating parties and their interrelationships, and give a concrete instantiation of a data transfer protocol using a novel combination of oblivious transfer, robust watermarking and digital signatures. Although LIME does not actively prevent data leakage, it introduces reactive accountability. Thus, it will deter malicious parties from leaking private documents and will encourage honest (but careless) parties to provide the required protection for sensitive data. LIME is flexible, as we differentiate between trusted senders (usually owners) and untrusted senders (usually consumers). In the case of a trusted sender, a very simple protocol with little overhead is possible. An untrusted sender requires a more complicated protocol, but the results are not based on trust assumptions and therefore should be able to convince a neutral entity (e.g., a judge). Our work also motivates further research on data leakage detection techniques for various document types and scenarios. For example, it will be an interesting future research direction to design a verifiable lineage protocol for derived data.