Abstract

The Internet of Things (IoT) is an emerging global Internet-based information architecture used to facilitate the exchange of goods and services. IoT-related applications aim to bring technology to people anytime and anywhere, with any device. However, the use of IoT raises privacy concerns because data are collected automatically from the network devices and objects embedded with IoT technologies. In current applications, the data collector is a dominant player who enforces a secure protocol that cannot be verified by the data owners. In view of this, some respondents might refuse to contribute their personal data or might submit inaccurate data. In this paper, we study a self-awareness data collection protocol to raise the confidence of the respondents when they submit their personal data to the data collector. Our self-awareness protocol requires each respondent to help others in order to preserve his own privacy. The communication (between respondents and the data collector) and the collaboration (among respondents) in our solution are performed automatically.

1. Introduction

The Internet of Things (IoT) is an emerging global Internet-based information architecture used to facilitate the exchange of goods and services. The concept of IoT is to allow living objects (humans or animals), devices (sensors), or objects with embedded technologies to automatically transfer data over communication networks (wired or wireless) without human-to-human or human-to-computer interaction. IoT aims to utilize and extend the benefits of the Internet such as its always-on, data sharing, and remote access capabilities [1].

IoT enables data collection in every aspect of our life. Data collected from a smart metering application allow the utility provider to analyze and improve its services. These data can also help users to be aware of their energy consumption and of possible energy saving strategies. In an underwater environment, such sensing is particularly important because information can be detected, gathered, and relayed by the sensors [2].

Let us consider the following scenario. A practitioner (data collector) would like to collect medical data from his patients (respondents) with implanted medical devices. Since medical data are highly sensitive information, respondents must be aware of the data to be collected. There are two main paradigms to protect the patient’s privacy in this scenario. The first paradigm relies on the respondent’s trust in the data collector while the second paradigm depends on the respondent’s anonymity. If the respondents do not have confidence in the data collector, they may refuse to submit data or provide inaccurate data to the agency. If the submitted data from the respondents are not genuine, we can predict that the data collector will face the data utility problem because the analyzed results based on the collected data will not be accurate. In the second paradigm, we should prevent the reidentification problem. For instance, if the collected data are used for research purposes, the data collector should not be able to link any of the collected data to the real identity of any patient.

1.1. Challenges of IoT

Wireless sensor networks have created a significant impact throughout society [3]. Advances in wireless communication technology (e.g., efficient resource management [4] and performance improvement [5] in wireless networks) enable the development and implementation of IoT applications. IoT-related applications include traffic congestion detection and waste management in smart cities, remote diagnostics in patient surveillance systems (e.g., ubiquitous healthcare [6, 7]), and storage condition monitoring in supply chain control.

Along with the potential benefits offered, the usage of IoT also raises privacy concerns for the data owners. In particular, real-time data collection and data analysis in IoT applications may compromise the privacy of the data owner. In practice, new data arrive continuously and up-to-date data should be used for analysis. The data collected at different times allow a malicious provider to learn extra knowledge by cross-examining the data within a targeted timeframe. Therefore, a secure and privacy-aware protocol should be implemented in IoT when data are collected automatically. Some new security and privacy challenges can be found in [8].

The development of radio frequency identification (RFID) technologies and the advances of network communication technologies motivate the forming of IoT [9]. Physical objects called u-things, which are embedded in or connected to communication networks, sensors, and computers, are commonly found in our daily life [10]. In the context of IoT, u-things should be able to act automatically (e.g., autodetection and data transfer) and adaptively. The construction of smart u-things involves the following seven challenges [11, 12]: (i) surrounding situations (context), (ii) users' needs, (iii) things' relations, (iv) common knowledge, (v) self-awareness, (vi) looped decisions, and (vii) ubiquitous safety (UbiSafe).

The ultimate goal of any ubiquitous intelligence is to make the u-things behave trustworthily in both other-aware and self-aware manners to some degrees and circumstances [13]. Therefore, it is important to design a self-awareness protocol to help data owners to protect their privacy.

In this paper, we will focus on the self-awareness challenge. In particular, we design a self-awareness protocol to increase the confidence of the data owner when the smart u-things automatically submit their data to the data collector.

1.2. Problem Statement

There are two challenges we aim to address in this work. Firstly, we want to protect the identity of each data owner from the data collector before and after the data collection process. Secondly, and more importantly, we want to guarantee the usefulness of the collected data by increasing the confidence of data owner.

The first challenge can be solved by using anonymity technologies such as onion routing (Tor) [14], anonymous proxy servers [15], and mix networks [16, 17]. These technologies are still under active investigation, and their focus is mainly on network traffic analysis, anonymous communication channels, and private information retrieval. Since our aim in this paper is not to design any specific anonymity technology, we refer readers to [15, 18] for the usage of these technologies.

The second challenge requires each respondent to help others in order to preserve his own privacy. This idea is motivated by the coprivacy concept in [19, 20]. Coprivacy (or cooperative privacy) considers that the best option for a party to protect his own privacy is to help another party achieve hers. The formal definition of coprivacy and its generalizations can be found in [19].

1.3. Our Contributions

In this paper, we propose a self-awareness protocol to facilitate the data collection in IoT-related applications. Instead of placing full trust in the utility provider (data collector), we allow each data owner (respondent) to learn the protection level provided by the data collector before the data submission process. We summarize our contributions as follows. (i) We propose a privacy-preserving approach to enable the respondents to learn about the anonymity protection level they will receive from the data collector before the data submission. (ii) Our notion of self-awareness protection can be used to increase the confidence of respondents in the data collection process. Hence, respondents will feel comfortable submitting their genuine data while the data collector can ensure the usefulness of the collected data.

1.4. Organization

The rest of this paper is organized as follows. The background and related work for this research are presented in Section 2. We describe the technical preliminaries of our solution in Section 3. We present our solution in Section 4, followed by the analysis of correctness, privacy, and efficiency and a discussion in Section 5. Our conclusion is in Section 6.

2. Background and Related Work

2.1. Privacy Paradigm in IoT

In 1973, the United States Department of Health, Education, and Welfare proposed the Fair Information Practice Principles (FIPPs) as a guideline to assure fair practice and adequate data privacy protection. In particular, the guideline aims to protect consumer rights such as how online entities should collect and use personal data [21]. The five principles of FIPPs are as follows [22]. (1) There must be no personal data record-keeping systems whose very existence is secret. (2) There must be a way for a person to find out what information about the person is in a record and how it is used. (3) There must be a way for a person to prevent information about the person that was obtained for one purpose from being used or made available for other purposes without the person's consent. (4) There must be a way for a person to correct or amend a record of identifiable information about the person. (5) Any organization creating, maintaining, using, or disseminating records of identifiable personal data must assure the reliability of the data for their intended use and must take precautions to prevent misuses of the data.

Based on the above principles, we now analyze the privacy protection in current IoT. Since data are collected automatically, it is hard for the data owners to ensure that their privacy can be protected. In most cases, utility providers will design a series of mechanisms to guarantee the privacy protection of the collected data. However, we found that data owners are generally not able to verify those mechanisms offered by the provider. Therefore, a self-awareness protocol should be available for the automatic data collection process.

2.2. Anonymous Data Collection

In general, online data collection is a process that involves collaboration between a trusted party (data collector) and a number of data owners (respondents). Due to privacy concerns, respondents might refuse to contribute their personal data or might submit inaccurate data to the data collector. Therefore, the data collector needs to ensure the privacy of the submitted data through a series of secure mechanisms. However, the protection level provided by the data collector is hard for the respondents to verify.

Often, data collected from the respondents will be used for research or data analysis. The release of the collected data causes a privacy issue in data publishing, in particular when it involves the republication of the same data in a given period [23]. There are two settings that can be observed when the data are released to the data recipient. If the data recipient is a third party, data must be released in an anonymous form without compromising the privacy of the respondents. Let us consider a scenario where a hospital (data collector) wishes to publish patients' records to a research institute (data recipient) for data analysis. In common practice, all explicit personal identity information (PII) such as name and social security number will be removed from the original dataset before it is released to the data recipient. However, removing PII alone does not preserve privacy.

Data anonymization is an interesting solution to protect the privacy of the respondents in this setting. Sweeney proposed the k-anonymity model to address the linking attack [24]. The concept of k-anonymity [25] is that each released record is indistinguishable from at least k - 1 other records. However, k-anonymity was found vulnerable to background knowledge attacks by Machanavajjhala et al. [26].

In the literature, techniques such as (α, k)-anonymity [27, 28], ℓ-diversity [26], and t-closeness [29] have been proposed to enhance the k-anonymity model. We note that these techniques assume that k-anonymity has been achieved in the first place before additional techniques are applied to enhance the anonymity protection of the released data. For instance, the (α, k)-anonymity model assumes that all the released data adhere to k-anonymity. In addition, it requires that the frequency of each sensitive value in any quasi-identifier group is at most α after the anonymization [27]. In the ℓ-diversity model, the sensitive attribute in each quasi-identifier group of the k-anonymous table is well represented by ℓ values, so that the frequency of each sensitive value within a group is at most 1/ℓ. A survey of recent attacks and privacy models in data publishing can be found in [30].
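
To make the frequency-based reading of ℓ-diversity above concrete, the following Java sketch (our own illustration, not code from [26]) groups records by their quasi-identifier columns and rejects any group in which one sensitive value accounts for more than 1/ℓ of the group; the record layout and method names are assumptions made for this example.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LDiversityCheck {
    // records: each row is an array of attribute values;
    // qidCols: indices of the quasi-identifier columns; sensitiveCol: index of the sensitive attribute.
    static boolean satisfiesLDiversity(List<String[]> records, int[] qidCols,
                                       int sensitiveCol, int l) {
        // Group the sensitive values by their quasi-identifier combination.
        Map<String, List<String>> groups = new HashMap<>();
        for (String[] row : records) {
            StringBuilder key = new StringBuilder();
            for (int c : qidCols) key.append(row[c]).append('|');
            groups.computeIfAbsent(key.toString(), x -> new ArrayList<>()).add(row[sensitiveCol]);
        }
        // Each sensitive value may cover at most 1/l of its group.
        for (List<String> sensitiveValues : groups.values()) {
            Map<String, Integer> counts = new HashMap<>();
            for (String v : sensitiveValues) counts.merge(v, 1, Integer::sum);
            for (int count : counts.values()) {
                if (count * l > sensitiveValues.size()) return false;
            }
        }
        return true;
    }
}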

In this paper, we consider the second setting, where the data analysis is performed by the data collector. This scenario is more complex to deal with because the data collector has full access to all raw data from the respondents. Therefore, we need to design a protocol to increase the confidence of the respondents before they submit their records to the data collector. In other words, respondents are aware of the protection level they will receive from the data collector after the data submission.

Various self-oriented privacy protections have been proposed in the literature. Self-enforcing privacy (SEP) for e-polling was proposed in [31]. The idea of SEP is to enforce the pollster to protect the respondents’ privacy by allowing the respondents to trace their data after the submission. If the pollster releases the poll results, the respondents can indict the pollster by using the evidence they obtained during the data collection process. A fair indictment scheme for SEP can be found in [32].

The most related research to our work in this paper is the respondent-defined privacy protection (RDPP) for anonymous data collection proposed in [33]. The basic idea of RDPP is to allow the respondents to specify the level of protection they require before providing any data to the data collector. For instance, a number of respondents (minimum threshold) must satisfy the constraint chosen by the respondent before he agrees to submit the data. In their protocol, respondents are aware of the minimum level of privacy protection they will receive before submitting their dataset to the data collector. Instead of relying on the data collector to guarantee the privacy protection, the respondents are free to define their preferred protection level.

In this paper, we do not consider indictment for our protocol because the data analysis is done by the data collector. Instead of allowing the respondents to freely define their own privacy requirements, we assume that respondents are willing to submit their data if the protection level offered by the data collector can be verified by them.

3. Technical Preliminaries

3.1. Homomorphic Encryption Scheme

We use a homomorphic encryption scheme (i.e., Paillier [34]) as our primary cryptographic tool. Let E_pk(m) denote the encryption of a message m with the public key pk. Given two ciphertexts E_pk(m1) and E_pk(m2), there exists an efficient algorithm to compute E_pk(m1 + m2) = E_pk(m1) · E_pk(m2). This additive property can be performed without the decryption key.
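
As an illustration of this property, the following minimal Java sketch (a textbook Paillier variant with g = n + 1, written for this exposition and not production-ready) shows that multiplying two ciphertexts modulo n^2 decrypts to the sum of the plaintexts; the key size and class names are our assumptions.

import java.math.BigInteger;
import java.security.SecureRandom;

public class PaillierSketch {
    static final SecureRandom RND = new SecureRandom();
    BigInteger n, nsq, g, lambda, mu;        // public key (n, g); private key (lambda, mu)

    PaillierSketch(int bits) {
        BigInteger p = BigInteger.probablePrime(bits / 2, RND);
        BigInteger q = BigInteger.probablePrime(bits / 2, RND);
        n = p.multiply(q);
        nsq = n.multiply(n);
        g = n.add(BigInteger.ONE);            // common choice g = n + 1
        lambda = p.subtract(BigInteger.ONE).multiply(q.subtract(BigInteger.ONE));
        mu = lambda.modInverse(n);            // valid for g = n + 1
    }

    BigInteger encrypt(BigInteger m) {        // E(m) = g^m * r^n mod n^2
        BigInteger r = new BigInteger(n.bitLength() - 1, RND).add(BigInteger.ONE);
        return g.modPow(m, nsq).multiply(r.modPow(n, nsq)).mod(nsq);
    }

    BigInteger decrypt(BigInteger c) {        // D(c) = L(c^lambda mod n^2) * mu mod n
        BigInteger u = c.modPow(lambda, nsq).subtract(BigInteger.ONE).divide(n);
        return u.multiply(mu).mod(n);
    }

    static BigInteger addCiphertexts(BigInteger c1, BigInteger c2, BigInteger nsq) {
        return c1.multiply(c2).mod(nsq);      // E(m1) * E(m2) mod n^2 encrypts m1 + m2
    }

    public static void main(String[] args) {
        PaillierSketch paillier = new PaillierSketch(512);
        BigInteger c1 = paillier.encrypt(BigInteger.valueOf(3));
        BigInteger c2 = paillier.encrypt(BigInteger.valueOf(4));
        BigInteger sum = paillier.decrypt(addCiphertexts(c1, c2, paillier.nsq));
        System.out.println("Decrypted sum: " + sum);   // prints 7
    }
}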

3.2. Definitions

Let us assume that there are n respondents R = {R_1, R_2, ..., R_n} and a data collector C. Each respondent R_i has a local database D_i with his own records. We denote D as the dataset collected by the data collector. The dataset D consists of a quasi-identifier and a sensitive attribute. Note that the quasi-identifier attributes can be either categorical or continuous data while the sensitive attribute is categorical data from its domain (e.g., disease).

A quasi-identifier is a minimal set of attributes in D that can be joined with external information to uniquely distinguish individual records [24].

Definition 1 (quasi-identifier). A quasi-identifier QID is a minimal set of attributes that can uniquely distinguish tuples in D (see the QID of Table 1 and its generalized form for an example).

Definition 2 (k-anonymity). D is said to satisfy k-anonymity with respect to QID if and only if each combination of values of the attributes in QID appears at least k times in D.
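
A direct way to read Definition 2 is as a counting condition over the quasi-identifier groups. The Java sketch below (our own illustration, not from the cited works) checks it for a small table; the attribute names and the record representation are assumptions.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KAnonymityCheck {
    // Each record is a map from attribute name to value; qid lists the QID attributes.
    static boolean satisfiesKAnonymity(List<Map<String, String>> records,
                                       List<String> qid, int k) {
        Map<String, Integer> groupSizes = new HashMap<>();
        for (Map<String, String> record : records) {
            StringBuilder key = new StringBuilder();
            for (String attr : qid) {
                key.append(record.get(attr)).append('|');   // QID value combination
            }
            groupSizes.merge(key.toString(), 1, Integer::sum);
        }
        // D satisfies k-anonymity iff no QID group has fewer than k records.
        return groupSizes.values().stream().allMatch(size -> size >= k);
    }

    public static void main(String[] args) {
        List<Map<String, String>> records = List.of(
            Map.of("age", "3*", "zip", "478**", "disease", "flu"),
            Map.of("age", "3*", "zip", "478**", "disease", "cancer"),
            Map.of("age", "3*", "zip", "478**", "disease", "flu"));
        System.out.println(satisfiesKAnonymity(records, List.of("age", "zip"), 2)); // true
    }
}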

Definition 3 (self-awareness privacy). Each respondent is said to achieve self-awareness privacy if he learns the protection level (e.g., k-anonymity) provided by the data collector. At the end of the protocol execution, each respondent remains anonymous to others and the data collector is not able to identify any of the respondents with probability more than 0.5.

3.3. Components

Our self-awareness data collection protocol consists of the following three components. (i) Data collector: an authorized party who wants to collect data from a group of respondents via a wired or wireless network. (ii) Respondent: a participant in the data collection process who is also a candidate to submit his/her record to the data collector. (iii) The onion router (Tor): an anonymous network used to conceal the respondent's identity such that the data collector cannot monitor the activity flows of any respondent. We show the interactions among the components of our solution in Figure 1. We assume that the respondents and the data collector are equipped with ubiquitous sensors to detect, communicate, and execute the protocol.

3.4. Adversary Model

We assume that both the data collector and the respondents are semihonest players (also known as honest-but-curious). Semihonest players follow the protocol faithfully but may try to discover extra information during the protocol execution.

In our protocol design, the data collector must follow the protocol faithfully in order to ensure that all respondents are willing to participate in the data collection process. For the same reason, all respondents should be semihonest in order to ensure that the privacy protection level offered by the data collector can be achieved.

3.5. Notations Used

The notations used hereafter in this paper are summarized in Notations section.

4. Self-Awareness Data Collection Protocol

4.1. Protocol Idea

The basic idea of our protocol is to allow the respondents to know the protection level they will receive from the data collector before the data submission process [35]. In our design, the data collector releases a set of quasi-identifiers QID for the dataset D and defines a protection level it wants to provide to the respondents (e.g., a threshold k). Note that a larger k will make the respondents feel more comfortable to submit their records. We also require the respondents to collaborate to find the number of records among them that match the quasi-identifiers determined by the data collector. We assume that the communication between the data collector and the respondents is via an anonymous network such as Tor [14]. Note that the communication (between respondents and the data collector) and the collaboration (among respondents) in our solution are run automatically. We show the overview of our proposed solution in Figure 1.

In the following sections, we will describe our self-awareness data collection protocol in detail.

4.2. Our Protocol

In order to participate in the data collection process, all players can precompute some information to be used during the protocol execution. For example, each respondent R_i can generate a cryptographic key pair (pk_i, sk_i), where pk_i is the public key and sk_i is the corresponding private key. Next, the respondent encrypts his personal identifiable information (PII), such as name or social security number, by using pk_i. The encrypted PII will be used as the public identity ID_i of the respondent R_i. This public identity is important for other respondents to identify the owner of a given public key. Each respondent then submits his public identity ID_i and encryption key pk_i to the data collector via the Tor network. Let us assume that there are n respondents who participate in the data collection process; hence, the data collector will receive n submissions from the respondents.
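
The sketch below illustrates how a respondent could derive the public identity ID_i from its PII during this precomputation step; hashing the PII before encryption and the helper names are our assumptions, and the encryptor argument stands in for E_{pk_i}(·).

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.function.Function;

public class PublicIdentity {
    // Builds ID_i = E_{pk_i}(hash(PII)); the encryptor is the respondent's own encryption routine.
    static BigInteger buildIdentity(String pii, Function<BigInteger, BigInteger> encryptor)
            throws NoSuchAlgorithmException {
        // Hash the PII so it fits into the plaintext space before encryption.
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(pii.getBytes(StandardCharsets.UTF_8));
        BigInteger piiAsNumber = new BigInteger(1, digest);
        return encryptor.apply(piiAsNumber);
    }
}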

Before the data collection begins, the data collector is required to define a set of quasi-identifiers denoted as QID = {qid_1, qid_2, ..., qid_m} for the dataset to be collected and to determine the protection level (e.g., the value of k) for the respondents.

To initiate the protocol, the data collector first randomly assigns one of the received public keys to each qid_j ∈ QID. If n < m, the same public key can be assigned to more than one quasi-identifier. Otherwise, the data collector selects m of the n public keys for the assignment. For simplicity, we will assume that the number of quasi-identifiers and the number of public keys are equal (i.e., n = m), so that pk_j denotes the key assigned to qid_j. Next, the data collector publishes the assignment and the threshold to a shared location (e.g., a webpage):

{(qid_1, pk_1), (qid_2, pk_2), ..., (qid_m, pk_m)}, k. (1)

Based on the information in (1), each respondent R_i retrieves QID to examine whether his records in D_i match any of the quasi-identifiers qid_j ∈ QID. At this phase, each respondent maintains a scores list {s_{i,1}, s_{i,2}, ..., s_{i,m}}. We denote s_{i,j} as the score determined by respondent R_i for qid_j. The respondent increases s_{i,j} by 1 whenever a record in D_i matches the quasi-identifier qid_j. Upon completion, the respondent encrypts each s_{i,j} by using the public key pk_j assigned to the quasi-identifier qid_j. The encrypted scores list computed by each respondent can be represented as E_i = {E_{pk_1}(s_{i,1}), E_{pk_2}(s_{i,2}), ..., E_{pk_m}(s_{i,m})}. Then, all the respondents send E_i to the data collector and to a shared location. Note that this location can be a separate space that is not shared with the data collector.
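
A possible respondent-side realization of this step is sketched below: for each quasi-identifier, the respondent counts the matching records in his local database and encrypts the count under the public key assigned to that quasi-identifier. The predicate-based matching and the per-quasi-identifier encryptors (standing in for E_{pk_j}(·)) are illustrative assumptions.

import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

public class RespondentScores {
    // qidMatchers.get(j) tests whether a record matches qid_j;
    // encryptors.get(j) is E_{pk_j}(.), the Paillier encryption under pk_j.
    static List<BigInteger> encryptedScores(List<String[]> localRecords,
                                            List<Predicate<String[]>> qidMatchers,
                                            List<Function<BigInteger, BigInteger>> encryptors) {
        List<BigInteger> encrypted = new ArrayList<>();
        for (int j = 0; j < qidMatchers.size(); j++) {
            long score = 0;
            for (String[] record : localRecords) {
                if (qidMatchers.get(j).test(record)) score++;            // s_{i,j} += 1
            }
            encrypted.add(encryptors.get(j).apply(BigInteger.valueOf(score)));  // E_{pk_j}(s_{i,j})
        }
        return encrypted;    // the list E_i, sent anonymously to the collector and the shared location
    }
}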

Upon receiving E_i from all the respondents, the data collector performs the following tasks. (1) Aggregate the scores determined by all respondents for each qid_j. The data collector performs this computation in an encrypted form by using the additive property of the Paillier cryptosystem. The output of the aggregation can be represented as c_j = E_{pk_j}(s_{1,j}) · E_{pk_j}(s_{2,j}) · ... · E_{pk_j}(s_{n,j}) = E_{pk_j}(s_{1,j} + s_{2,j} + ... + s_{n,j}). (2) Publish an outcome table. The data collector publishes the scores for each qid_j in an outcome table as shown in Table 2. In Table 2, each row represents the encrypted scores received from one respondent while each column shows the encrypted scores for one quasi-identifier qid_j. Note that all the data in column j are encrypted by using the same public key pk_j. Therefore, only the respondent who has been assigned pk_j can decrypt the column to learn the number of matched records for qid_j. After the data collector releases the outcome table, the respondents need to verify that the data released are genuine. For instance, each respondent verifies that the encrypted scores list E_i he submitted to the data collector appears as one of the rows in Table 2. If the respondent fails to verify the data, he then issues a decision message d_i with a random value.
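
The aggregation in task (1) reduces to multiplying, column by column, the ciphertexts received from all respondents modulo the corresponding n_j^2, which by the Paillier additive property yields an encryption of the summed scores. The following collector-side sketch (our own, with illustrative variable names) shows this step.

import java.math.BigInteger;
import java.util.List;

public class CollectorAggregate {
    // encryptedLists.get(i) is the list E_i from respondent i (one ciphertext per qid_j);
    // nSquared.get(j) is n_j^2 for the public key pk_j assigned to qid_j.
    static BigInteger[] aggregate(List<List<BigInteger>> encryptedLists,
                                  List<BigInteger> nSquared) {
        int m = nSquared.size();
        BigInteger[] aggregated = new BigInteger[m];
        for (int j = 0; j < m; j++) {
            BigInteger product = BigInteger.ONE;
            for (List<BigInteger> Ei : encryptedLists) {
                product = product.multiply(Ei.get(j)).mod(nSquared.get(j));  // homomorphic addition
            }
            aggregated[j] = product;    // c_j = E_{pk_j}(s_{1,j} + ... + s_{n,j})
        }
        return aggregated;
    }
}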

Let us assume that all the respondents successfully verify the data in Table 2. Next, each respondent R_i retrieves the column assigned to his public key (based on his public identity ID_i) and decrypts all encrypted data in it by using the private key sk_i. After the decryption, the respondent must ensure that the aggregated score computed by the data collector is correct. The respondent can verify this by computing the sum of the decrypted scores and then comparing it with the decrypted result of c_i. Lastly, each respondent compares the satisfaction score t_i with the threshold k determined by the data collector. If the number of matched records is at least the threshold value (i.e., t_i ≥ k), we assume that the respondent will submit his records to the data collector. Otherwise, the respondent will abort the data collection process.
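
The respondent-side checks in this step can be written compactly as below: decrypt the column, recompute the sum, compare it with the decrypted aggregate, and only then compare with the threshold k. The decryptor argument stands in for D_{sk_i}(·), and all names are assumptions; how a failed check maps to the decision message is one possible reading.

import java.math.BigInteger;
import java.util.List;
import java.util.function.Function;

public class RespondentVerify {
    static int decideSubmission(List<BigInteger> encryptedColumn,   // E_{pk_i}(s_{1,i}), ..., E_{pk_i}(s_{n,i})
                                BigInteger aggregated,              // c_i published by the collector
                                Function<BigInteger, BigInteger> decryptor,
                                BigInteger k) {
        BigInteger sum = BigInteger.ZERO;
        for (BigInteger c : encryptedColumn) {
            sum = sum.add(decryptor.apply(c));                      // recompute t_i locally
        }
        BigInteger ti = decryptor.apply(aggregated);
        if (!sum.equals(ti)) return 0;        // aggregate does not match: treat as failed verification
        return ti.compareTo(k) >= 0 ? 1 : 0;  // d_i = 1 only if t_i >= k
    }
}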

At the final phase, each respondent sends a decision message d_i to the shared location. If the decision message d_i is set to 1, this indicates that t_i ≥ k. Therefore, the respondents should submit their records to the data collector. Otherwise, if d_i is set to 0, the respondents should not reveal any record to the data collector.

We summarize our self-awareness data collection protocol in Algorithm 1.

Self-Awareness Data Collection Protocol
Phase 1: Public Key and Public Identity Submissions
The data collector C broadcasts a submission request to the respondents. Each R_i
generates a cryptographic key pair (pk_i, sk_i) and a public identity ID_i by encrypting
its personal identifiable information (PII). Note that the respondents can pre-
compute the cryptographic key pair and the encrypted PII in an offline mode. Next, each R_i
sends (ID_i, pk_i) to C via the Tor network.
Phase 2: Satisfaction Scores Computation
The data collector generates QID, decides a threshold k, and assigns a public
key pk_j to each qid_j. Next, it broadcasts this information to all respondents. Each R_i
examines whether his records in D_i satisfy QID. For each satisfied case, R_i increases
the corresponding score by 1. We denote s_{i,j} as the score determined by R_i for qid_j.
Next, each R_i encrypts s_{i,j} by using the public key pk_j to produce
E_{pk_j}(s_{i,j}). Each R_i then anonymously sends E_i = {E_{pk_1}(s_{i,1}), ..., E_{pk_m}(s_{i,m})} to C and a
shared location.
Phase 3: Scores List Verification
The data collector computes c_j and publishes an outcome table. Each R_i examines
whether the published scores list is the same as the original list he sent to C. If the list has
been modified, the respondent will not participate in the next phase.
Phase 4: Satisfaction Score Checking
Each R_i retrieves and decrypts E_{pk_i}(s_{1,i}), ..., E_{pk_i}(s_{n,i}). Next, it computes
t_i = s_{1,i} + s_{2,i} + ... + s_{n,i} as the satisfaction score for qid_i. If the satisfaction score is at
least k occurrences (i.e., t_i ≥ k), the R_i sends d_i = 1 to the shared location. Otherwise,
d_i = 0 will be sent.
Phase 5: Data Submission
The respondents submit their records to C with the confidence that their privacy
protection is achieved at the k-anonymity level.

5. Analysis and Discussion

5.1. Analysis of Correctness

In this paper, we assume that both the data collector and the respondents are semihonest players. The semihonest model is realistic in our solution. If both parties follow the protocol faithfully, each respondent can ensure that he will achieve the protection level offered by the data collector (e.g., k-anonymity). At the same time, the data collector can guarantee that the datasets collected are useful for analysis.

During the protocol execution, all respondents are required to verify that the encrypted scores released by the data collector are genuine and that the aggregated score for each qid_j computed by the data collector is correct. The first verification is to ensure that the data collector has received all data computed by the respondents correctly, while the second verification is useful for the respondents to detect a malicious data collector.

In our protocol design, the data collector needs to define a protection level (e.g., the value of k) before the data collection begins. The data collector can define the same protection level for all qid_j ∈ QID or define a different anonymity level for each qid_j. In the latter case, the respondents can perform the same steps to verify each value of k.

5.2. Analysis of Privacy

The privacy analysis of our protocol depends on how much information has been revealed during the protocol execution. In general, our solution should protect the privacy of the respondents. This leads to the following two requirements: the data collector should not be able to infer any sensitive information of the respondents from the data collected and the respondents are aware of the data they submit and the protection level they will receive from the data collector.

In our protocol design, we utilize Tor network to prevent direct communication between the data collector and the respondents. This approach will not allow the data collector to track the identity of any respondent. Also, we assume that each respondent has no knowledge about the profile of other respondents, but the number of respondents in the protocol is known publicly.

The unique identity ID_i of each respondent will not leak the profile of any respondent because it is in an encrypted form. The data collector is not able to decrypt ID_i in the absence of the private keys of the respondents. Further, our protocol ensures that no party (including the data collector) can learn the encrypted scores in the outcome table before the decryption. Note that only the respondent who holds the corresponding private key can perform the decryption.

To prevent possible collusions between the data collector and other respondents, we assume that all data transmissions are performed via an anonymous communication channel (e.g., Tor network). This can ensure that the profile of each respondent remains anonymous from others.

The shared location (e.g., a web page or web folder) used in our protocol allows the respondents to learn the decisions made by others and to detect a malicious data collector. Each respondent notifies others about the verification result by using a decision message d_i. Since the decision message only reveals the public identity of the respondent, we can assume that the profile of the respondents remains hidden from others.

5.3. Analysis of Efficiency

The complexity of our protocol is dominated by the cryptographic operations (encryption and decryption) performed by the respondents. We implemented our protocol in Java and ran it on a single computer with a 2 GHz CPU and 2 GB of RAM. The performance evaluation is shown in Figure 2. Each respondent performs the same amount of cryptographic operations in our experiment.

5.4. Discussion

In this paper, we assume that the number of public keys (or the number of respondents) and the number of quasi-identifiers are equal (i.e., n = m). However, our protocol works correctly for the unequal cases. The owner of the public key pk_j only performs the decryption and computes t_j at the end of the protocol execution. A respondent may not be involved in the final phase if his public key is not selected by the data collector (for cases when n > m). Otherwise, a respondent needs to repeat the final phase several times if his public key is assigned to more than one quasi-identifier (when n < m).

6. Conclusion and Future Work

In this paper, we presented a self-awareness protocol for IoT data collection. Since the release of raw data to the data collector carries a high risk of compromising the privacy of the respondents, we aim to increase the confidence of the respondents before they submit their records to the data collector. Our self-awareness protocol allows each respondent to help others in order to preserve his own privacy. At the same time, the final collected data should adhere to the protection level promised by the data collector before the data collection begins. Also, our solution can be extended to support an indictment scheme (when the data are released to a third party) because the respondents have evidence (e.g., the value of t_j) to indict a malicious data collector.

Notations

R_i: Respondent i
n: Number of respondents
D: Dataset collected by the data collector
D_i: Local database of respondent R_i
k: Anonymity protection level
QID: Quasi-identifier set determined by the data collector
m: Size of the quasi-identifier set
qid_j: jth quasi-identifier in QID
ID_i: Public identity of the respondent R_i
s_{i,j}: Score determined by the respondent R_i for qid_j
t_j: Satisfaction score of qid_j
pk_i: Public key of respondent R_i
sk_i: Private key of respondent R_i
E_{pk_i}(·): Encryption operation by using pk_i
D_{sk_i}(·): Decryption operation by using sk_i
d_i: Decision message from respondent R_i.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.