author     Jan Aalmoes <jan.aalmoes@inria.fr>  2024-09-30 13:58:27 +0200
committer  Jan Aalmoes <jan.aalmoes@inria.fr>  2024-09-30 13:58:27 +0200
commit     644fa7c290ac801f15180dd8a9c425c3b757adf5 (patch)
tree       dd59a438fe2d69f170784476468f2120a458f608
parent     cca6686ee7e6689d3fd229741742b177e194bc6a (diff)
introduction synth
-rw-r--r--  synthetic/introduction.tex | 39
1 file changed, 8 insertions(+), 31 deletions(-)
diff --git a/synthetic/introduction.tex b/synthetic/introduction.tex
index c4b827a..ccf400e 100644
--- a/synthetic/introduction.tex
+++ b/synthetic/introduction.tex
@@ -1,34 +1,11 @@
-% Antoine : v0.1, still working on it
+As in the previous chapter, the privacy of synthetic data is often considered from the following point of view: from the synthetic data, what can we learn about the real data?
+For this, differential privacy provides very strong protection, stronger than other privacy notions such as statistical disclosure limitation~\cite{abowd2008protective}.
+There are thus methods to enforce differential privacy in GANs~\cite{jordon2018pate} and in autoencoders~\cite{abay2019privacy}.
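+As a brief reminder of the guarantee these methods target, a randomized mechanism $\mathcal{M}$ is $(\varepsilon, \delta)$-differentially private if, for any two datasets $D$ and $D'$ differing in a single record and any set of outputs $S$,
+\begin{equation*}
+    \Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta,
+\end{equation*}
+so the synthetic data reveals little about the presence of any individual record in the real data.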
-The deployment of Machine Learning and AI is becoming widespread in all areas, ranging from cars, medicine, IoT, and multimedia to cybersecurity, to name a few.
-This race is fueled by a strong hope for innovation and new services such as autonomous cars, personalized medicine, a better understanding of life, and advanced personalized services. %targeted advertising....
-This generalization of AI requires being fed with a large quantity of data in order to train the underlying learning models.
-At the heart of many innovative services is the human; these learning models are therefore fed with a lot of personal information which is continuously and massively collected and resold.
+This chapter is the beginning of a line of work on the links between synthetic data and AIA.
+We first study MIA using synthetic data.
+We then look at the impact that using synthetic data during training has on the success of the AIA.
-This omnipresence of personal information in training data opens up a new attack surface that is still poorly understood.
-A large number of attacks have appeared in recent years, showing risks of reconstructing training data, inferring sensitive attributes of individuals, or inferring membership in the training data, for instance.
-These new privacy risks come up against regulations governing the use of personal data and are framed in new regulations on AI (e.g., the AI Act).
-
-In order to reduce these privacy risks and better comply with regulations, the generation of synthetic data has been largely adopted.
-This technique relies on a generative model which is trained to artificially generate data with the same statistical properties as the training data.
-%Ainsi, au lieu de partager les données personnelles brutes (ou anonymisé), les données synthetiques ne sont plus associées à un individu et peuvent être partagées plus facilement.
-Hence, unlike raw (or anonymized) personal data, synthetic data points are not linked to any individual and so can be shared with fewer restrictions.
-This new El Dorado of synthetic data for the sale or sharing of data outside the GDPR is attracting a lot of interest, and many startups and services have emerged to meet the demand.
-This new economy also fuels the need for data to feed learning models through training with synthetic data.
-%For example, projections of the use of synthetic data for model training show that in 2030, more than X\% of models will be trained with this data.
-
-Although training from synthetic data reduces the risk, synthetic data still carries a risk of leaking information about the raw data it was derived from, such as through membership inference and attribute inference.
-Membership inference refers to the possibility of inferring whether or not a data record belongs to the training data.
-Attribute inference refers to how a trained model can be leveraged to infer a sensitive attribute such as race or gender.
-
-However, no study has focused on the propagation of the risks of using synthetic data instead of real data for model training.
-This paper investigates this risk propagation.
-More specifically, we study how membership and attribute inference are impacted by synthetic data.
-We study the following research question through an experimental approach:
-How does using synthetic data instead of real data affect users' privacy in the context of training machine learning models?
-
-The paper is organized as follows:
-We begin by introducing key notions around machine learning and synthetic data privacy.
-Then we present our experimental methodology, followed by our results and their interpretation.
-Finally, we give an overview of how this work fits within the current literature.
+To sum up, we provide initial elements of an answer to the following question:
+What is the impact on privacy of using synthetic data, instead of real data, when training models?