author | Jan Aalmoes <jan.aalmoes@inria.fr> | 2024-09-30 13:58:27 +0200 |
---|---|---|
committer | Jan Aalmoes <jan.aalmoes@inria.fr> | 2024-09-30 13:58:27 +0200 |
commit | 644fa7c290ac801f15180dd8a9c425c3b757adf5 | |
tree | dd59a438fe2d69f170784476468f2120a458f608 | |
parent | cca6686ee7e6689d3fd229741742b177e194bc6a |
introduction synth
-rw-r--r-- | synthetic/introduction.tex | 39 |
1 files changed, 8 insertions, 31 deletions
diff --git a/synthetic/introduction.tex b/synthetic/introduction.tex
index c4b827a..ccf400e 100644
--- a/synthetic/introduction.tex
+++ b/synthetic/introduction.tex
@@ -1,34 +1,11 @@
-% Antoine : v0.1, still working on it
+As in the previous chapter, the privacy of synthetic data is often considered from the following point of view: what can we learn about the real data from the synthetic data?
+In this respect, differential privacy provides very strong protection, stronger than other notions of privacy such as statistical disclosure limitation~\cite{abowd2008protective}.
+Methods therefore exist to enforce differential privacy in GANs~\cite{jordon2018pate} and in autoencoders~\cite{abay2019privacy}.
 
-The deployment of Machine Learning and AI is becoming widespread in all areas ranging from cars, medicine, IoT, multimedia, cybersecurity to name a few.
-This race is fueled by a strong hope for innovation and new services such as autonomous car, personalized medicine, better understanding of life, advanced and personalized service for instance. %targeted advertising....
-This generalization of AI requires being fed with a large quantity of data in order to train the underlying learning models.
-At the heart of a large amount of innovative service is the human, therefore, these learning models are fed by a lot of personal information which is continuously and massively collected and resold.
+This chapter is preliminary work on the links between synthetic data and AIA.
+We first study MIA using synthetic data.
+We then examine how using synthetic data during training affects the success of AIA.
 
-This omnipresence of personal information in training data opens up a new attack surface that is still poorly understood.
-A large number of attacks have appeared in recent years which showed a risk of reconstruction of training data, inference of sensitive attributes on individuals, inference of membership on training data for instance.
-These new privacy risks come up against regulations governing the use of personal data and are framed in new regulations on AI (e.g., AI Act).
-
-In order to reduce these risks on privacy and better comply with regulations, the generation of synthetic data has been largely adopted.
-This technique relies on a generative model which is learned to artificially generate data which has the same statistical properties as the training data.
-%Thus, instead of sharing raw (or anonymized) personal data, synthetic data are no longer tied to an individual and can be shared more easily.
-Hence, instead of sharing raw personal data (or anonymized), synthetic data points are not linked to any individual and so can be shared with less restrictions.
-This new El Dorado of synthetic data for the sale or sharing of data outside the GDPR is attracting a lot of interest and many startups or services have emerged to meet the demand.
-This new economy also fuels the need for data to feed learning models through training with synthetic data.
-%For example, projections of the use of synthetic data for model training show that in 2030, more than X\% of models will be trained with this data.
-
-Although training from synthetic data reduces the risk, synthetic data also carries a risk of leaking information compared to the raw data used such as membership inference and attribute inference.
-Membership inference refers to the possibility of inferring whether or not a data record belongs to the training data.
-Attribute inference refers to how a trained model can be leveraged to infer a sensitive attribute such as the race or the gender.
-
-However, no study has focused on the propagation of the risks of using synthetic data instead of real data for model training.
-This paper investigates this risk propagation.
-More specifically, we study how membership and attribute inference are impacted by synthetic data.
-We study the following research question through an experimental approach:
-How does using synthetic data instead of real data affect users' privacy in the context of training machine learning models?
-
-The paper is organized as follows:
-We begin by introducing key notions around machine learning and synthetic data privacy.
-Then we present our experimental methodology followed by our result with our interpretation.
-Finally we show an overview of how this work integrates with the current literature.
+In summary, we provide initial answers to the following question:
+What is the impact on privacy of using synthetic data, instead of real data, when training models?
 