chapter six

6 Privacy-preserving synthetic data generation

This chapter covers

Synthetic data generation
Generating synthetic data for anonymization
Using differential privacy mechanisms to generate privacy-preserving synthetic data
Designing a privacy-preserving synthetic data generation scheme for machine learning tasks

So far we’ve looked into the concepts of differential privacy (DP), in both its centralized form and its local form (LDP), and their applications in developing privacy-preserving query-processing and machine learning (ML) algorithms. As you saw, the idea of DP is to add carefully calibrated noise to query results, without substantially distorting their statistical properties, so that the results protect the privacy of individuals while remaining useful for the application.
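As a brief refresher, the noise-adding idea can be sketched with the Laplace mechanism applied to a counting query. This is a minimal illustration, not a production implementation: the `ages` list, the helper names `laplace_noise` and `dp_count`, and the choice of epsilon are all assumptions made for this example.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential draws with rate 1/scale
    # follows a Laplace distribution centered at 0 with the given scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_count(values, predicate, epsilon, rng):
    # A counting query changes by at most 1 when one record is added or
    # removed, so its sensitivity is 1 and the noise scale is 1 / epsilon.
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)
ages = [23, 35, 41, 29, 52, 67, 38, 45]  # hypothetical data
noisy_count = dp_count(ages, lambda a: a >= 40, epsilon=1.0, rng=rng)
print(f"true count = 4, noisy count = {noisy_count:.2f}")
```

A smaller epsilon gives a larger noise scale and therefore stronger privacy at the cost of a less accurate answer.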

But sometimes data users may request the original data so they can work with it locally and directly, perhaps to develop new queries and analysis procedures. Privacy-preserving data-sharing methods can be used for such purposes. This chapter will look into synthetic data generation, a promising solution for data sharing, which produces synthetic yet representative data that can be shared among multiple parties safely and securely. The idea of synthetic data generation is to artificially generate data that has a distribution and properties similar to those of the original data. Because the records are artificially produced rather than taken from real individuals, the privacy concerns are greatly reduced, particularly when the generation process itself is privacy-preserving.
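To make the idea concrete, here is a minimal sketch that fits a very simple model to a small dataset and samples new records from it. The column names (`age`, `city`), the data values, and the function name `generate_synthetic` are hypothetical. The sketch treats the columns as independent, so it ignores correlations, and on its own it provides no formal privacy guarantee.

```python
import random
import statistics
from collections import Counter

def generate_synthetic(records, n_samples, rng):
    # Numeric column: summarize with a mean and standard deviation.
    ages = [r["age"] for r in records]
    mu, sigma = statistics.mean(ages), statistics.stdev(ages)
    # Categorical column: summarize with empirical frequencies.
    city_counts = Counter(r["city"] for r in records)
    labels, weights = zip(*city_counts.items())
    # Sample each column independently from its fitted summary.
    return [
        {
            "age": round(rng.gauss(mu, sigma)),
            "city": rng.choices(labels, weights=weights)[0],
        }
        for _ in range(n_samples)
    ]

original = [  # hypothetical records
    {"age": 23, "city": "Tampa"},
    {"age": 35, "city": "Miami"},
    {"age": 41, "city": "Tampa"},
    {"age": 29, "city": "Orlando"},
    {"age": 52, "city": "Tampa"},
    {"age": 67, "city": "Miami"},
]
rng = random.Random(0)
synthetic = generate_synthetic(original, n_samples=1000, rng=rng)
print(synthetic[:3])
```

None of the sampled records is copied from the original data, yet the average age and the city frequencies of the synthetic set stay close to those of the original.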

6.1 Overview of synthetic data generation

6.1.1 What is synthetic data? Why is it important?

6.1.2 Application aspects of using synthetic data for privacy preservation

6.1.3 Generating synthetic data

6.2 Assuring privacy via data anonymization

6.2.1 Private information sharing vs. privacy concerns

6.2.2 Using k-anonymity against re-identification attacks

6.2.3 Anonymization beyond k-anonymity

6.3 DP for privacy-preserving synthetic data generation

6.3.1 DP synthetic histogram representation generation

6.3.2 DP synthetic tabular data generation

6.3.3 DP synthetic multi-marginal data generation

6.4.1 Using hierarchical clustering and micro-aggregation