% Antoine : v0.1, still working on it

The deployment of Machine Learning and AI is becoming widespread across domains such as automotive, medicine, IoT, multimedia, and cybersecurity, to name a few.
This race is fueled by strong hopes for innovation and new services, such as autonomous cars, personalized medicine, a better understanding of life, and advanced personalized services. %targeted advertising....
This generalization of AI requires large quantities of data to train the underlying learning models.
Since humans are at the heart of many of these innovative services, these models are fed with a great deal of personal information, which is continuously and massively collected and resold.

This omnipresence of personal information in training data opens up a new attack surface that is still poorly understood.
Numerous attacks have appeared in recent years, demonstrating risks such as the reconstruction of training data, the inference of sensitive attributes of individuals, and the inference of membership in the training data.
These new privacy risks run up against the regulations governing the use of personal data and are addressed in new regulations on AI (e.g., the EU AI Act).

To reduce these privacy risks and better comply with regulations, the generation of synthetic data has been widely adopted.
This technique relies on a generative model trained to artificially generate data with the same statistical properties as the training data.
Hence, instead of sharing raw (or anonymized) personal data, one can share synthetic data points that are not linked to any individual and are therefore subject to fewer restrictions.
This new El Dorado of synthetic data, allowing data to be sold or shared outside the scope of the GDPR, is attracting a lot of interest, and many startups and services have emerged to meet the demand.
This new economy also feeds the growing demand for training data, as learning models are increasingly trained with synthetic data.
%For example, projections of the use of synthetic data for model training show that in 2030, more than X\% of models will be trained with this data.
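
As a minimal illustration of this pipeline (and not of the specific generators studied in this paper), the following Python sketch fits a simple generative model on real records and then samples artificial records from it; the choice of a Gaussian mixture and all variable names are illustrative assumptions.
\begin{verbatim}
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Placeholder for a table of real records (1000 rows, 4 numeric columns).
real_data = rng.normal(size=(1000, 4))

# Fit a generative model that captures the statistics of the real data...
generator = GaussianMixture(n_components=5, random_state=0)
generator.fit(real_data)

# ...then sample synthetic records from it. These records are not tied
# to any individual row of real_data, but follow similar statistics.
synthetic_data, _ = generator.sample(n_samples=1000)
\end{verbatim}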

Although training on synthetic data reduces the risk, synthetic data still carries a risk of leaking information about the raw data it was generated from, notably through membership inference and attribute inference.
Membership inference refers to the possibility of inferring whether or not a data record belongs to the training data.
Attribute inference refers to leveraging a trained model to infer a sensitive attribute of an individual, such as race or gender.
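
To make the first notion concrete, a classical simplified membership inference strategy (shown here only as an illustration, not as the attack evaluated in this paper) thresholds the per-example loss of the target model: training members tend to be fitted better and hence have lower loss than non-members. The sketch below assumes a fitted scikit-learn-style classifier exposing \texttt{predict\_proba}; the function name and the threshold are illustrative.
\begin{verbatim}
import numpy as np

def loss_threshold_membership(model, X, y, threshold):
    """Guess that a record was a training member if its cross-entropy
    loss under the model is below a threshold. X holds the records and
    y their integer class labels."""
    probs = model.predict_proba(X)                       # (n, n_classes)
    per_example_loss = -np.log(probs[np.arange(len(y)), y] + 1e-12)
    return per_example_loss < threshold                  # True = "member"
\end{verbatim}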

However, no study has focused on how these risks propagate when synthetic data is used instead of real data for model training.
This paper investigates this risk propagation.
More specifically, we study how membership and attribute inference are impacted by synthetic data.
We study the following research question through an experimental approach:
How does using synthetic data instead of real data affect users' privacy in the context of training machine learning models?

The paper is organized as follows:
We begin by introducing key notions around machine learning and synthetic data privacy.
Then we present our experimental methodology, followed by our results and their interpretation.
Finally, we give an overview of how this work fits into the current literature.