Diffstat (limited to 'synthetic/introduction.tex')
-rw-r--r-- | synthetic/introduction.tex | 34 |
1 file changed, 34 insertions, 0 deletions
diff --git a/synthetic/introduction.tex b/synthetic/introduction.tex
new file mode 100644
index 0000000..c4b827a
--- /dev/null
+++ b/synthetic/introduction.tex
@@ -0,0 +1,34 @@
+% Antoine : v0.1, still working on it
+
+The deployment of Machine Learning and AI is becoming widespread across domains ranging from cars and medicine to IoT, multimedia, and cybersecurity.
+This race is fueled by strong hopes for innovation and new services, such as autonomous cars, personalized medicine, a better understanding of life, and advanced personalized services. %targeted advertising....
+This generalization of AI requires large quantities of data to train the underlying learning models.
+Many of these innovative services are centered on humans; these learning models are therefore fed with a great deal of personal information, which is continuously and massively collected and resold.
+
+This omnipresence of personal information in training data opens up a new attack surface that is still poorly understood.
+A large number of attacks have appeared in recent years, demonstrating risks such as the reconstruction of training data, the inference of sensitive attributes of individuals, and the inference of membership in training data.
+These new privacy risks run up against regulations governing the use of personal data and are addressed in new regulations on AI (e.g., the AI Act).
+
+To reduce these privacy risks and better comply with regulations, the generation of synthetic data has been widely adopted.
+This technique relies on a generative model trained to artificially generate data with the same statistical properties as the training data.
+%Thus, instead of sharing raw (or anonymized) personal data, synthetic data are no longer associated with an individual and can be shared more easily.
+Hence, instead of sharing raw (or anonymized) personal data, one shares synthetic data points that are not linked to any individual and can therefore be shared with fewer restrictions.
+This new El Dorado of synthetic data, enabling the sale or sharing of data outside the scope of the GDPR, is attracting a lot of interest, and many startups and services have emerged to meet the demand.
+This new economy also fuels the need for data to feed learning models through training with synthetic data.
+%For example, projections of the use of synthetic data for model training show that in 2030, more than X\% of models will be trained with this data.
+
+Although training on synthetic data reduces the risk, synthetic data still carries a risk of leaking information about the raw data it was generated from, notably through membership inference and attribute inference.
+Membership inference refers to the possibility of inferring whether or not a data record belongs to the training data.
+Attribute inference refers to how a trained model can be leveraged to infer a sensitive attribute such as race or gender.
+
+However, no study has focused on how these risks propagate when synthetic data is used instead of real data for model training.
+This paper investigates this risk propagation.
+More specifically, we study how membership and attribute inference are impacted by synthetic data.
+We study the following research question through an experimental approach:
+How does using synthetic data instead of real data affect users' privacy in the context of training machine learning models?
+
+The paper is organized as follows:
+We begin by introducing key notions around machine learning and synthetic data privacy.
+Then we present our experimental methodology, followed by our results and their interpretation.
+Finally, we give an overview of how this work fits into the current literature.
+
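The membership inference risk mentioned in the introduction can be illustrated with a minimal confidence-threshold attack sketch. This is not the paper's experimental setup: the dataset, model, and threshold below are illustrative assumptions, and scikit-learn is assumed to be available. The intuition is that an overfit model assigns higher confidence to records it was trained on ("members") than to unseen records, so a simple threshold on confidence already distinguishes the two.

```python
# Hedged sketch of a confidence-threshold membership inference attack.
# All names (dataset, model, threshold) are illustrative, not the paper's setup.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification task: first half = training set ("members"),
# second half = held-out records ("non-members").
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, y_train = X[:1000], y[:1000]
X_out, y_out = X[1000:], y[1000:]

# A fully grown random forest tends to memorize its training data,
# which widens the member/non-member confidence gap.
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

def true_label_confidence(model, X, y):
    # Confidence the model assigns to each record's true label.
    proba = model.predict_proba(X)
    return proba[np.arange(len(y)), y]

conf_in = true_label_confidence(model, X_train, y_train)
conf_out = true_label_confidence(model, X_out, y_out)

# Attack: guess "member" whenever confidence exceeds a threshold.
threshold = 0.9
guesses = np.concatenate([conf_in, conf_out]) > threshold
truth = np.concatenate([np.ones(1000), np.zeros(1000)]).astype(bool)
attack_acc = (guesses == truth).mean()
print(f"membership attack accuracy: {attack_acc:.2f}")
# Accuracy above 0.5 (random guessing) indicates membership leakage.
```

Real attacks (e.g., shadow-model approaches) are more sophisticated, but this threshold variant is enough to show why membership of a record in the training data is itself sensitive information, and why the paper asks whether training on synthetic data attenuates this leakage.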