path: root/synthetic/introduction.tex
author    Jan Aalmoes <jan.aalmoes@inria.fr>    2024-09-11 00:10:50 +0200
committer Jan Aalmoes <jan.aalmoes@inria.fr>    2024-09-11 00:10:50 +0200
commit    bf5b05a84e877391fddd1b0a0b752f71ec05e901 (patch)
tree      149609eeff1d475cd60f398f0e4bfd786c5d281c /synthetic/introduction.tex
parent    03556b31409ac5e8b81283d3a6481691c11846d7 (diff)
Proof: exists f not cca is equivalent to exists f BA not randomguess
Diffstat (limited to 'synthetic/introduction.tex')
-rw-r--r--    synthetic/introduction.tex    34
1 file changed, 34 insertions, 0 deletions
diff --git a/synthetic/introduction.tex b/synthetic/introduction.tex
new file mode 100644
index 0000000..c4b827a
--- /dev/null
+++ b/synthetic/introduction.tex
@@ -0,0 +1,34 @@
+% Antoine : v0.1, still working on it
+
+The deployment of Machine Learning and AI is becoming widespread in areas ranging from cars, medicine, and IoT to multimedia and cybersecurity, to name a few.
+This race is fueled by a strong hope for innovation and new services such as autonomous cars, personalized medicine, a better understanding of life, and advanced personalized services. %targeted advertising....
+This generalization of AI requires large quantities of data to train the underlying learning models.
+The human is at the heart of many of these innovative services; therefore, these learning models are fed with a lot of personal information which is continuously and massively collected and resold.
+
+This omnipresence of personal information in training data opens up a new attack surface that is still poorly understood.
+Numerous attacks have appeared in recent years, demonstrating risks such as the reconstruction of training data, the inference of sensitive attributes of individuals, and the inference of membership in the training data.
+These new privacy risks run up against regulations governing the use of personal data and are addressed by new regulations on AI (e.g., the AI Act).
+
+To reduce these privacy risks and better comply with regulations, the generation of synthetic data has been widely adopted.
+This technique relies on a generative model trained to artificially generate data with the same statistical properties as the training data.
+Hence, instead of sharing raw (or anonymized) personal data, one shares synthetic data points that are not linked to any individual and can therefore be shared with fewer restrictions.
+This new El Dorado of synthetic data for selling or sharing data outside the scope of the GDPR is attracting considerable interest, and many startups and services have emerged to meet the demand.
+This new economy also fuels the need for data to feed learning models through training on synthetic data.
+%For example, projections of the use of synthetic data for model training show that in 2030, more than X\% of models will be trained with this data.
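+To make this concrete, the sketch below illustrates the principle with a deliberately simple generative model: a Gaussian mixture standing in for more elaborate generators such as GANs or copula-based models. The data and parameters are illustrative assumptions, not part of this work.
+\begin{verbatim}
+# Minimal sketch: fit a simple generative model on real records,
+# then sample artificial records that mimic their statistics.
+import numpy as np
+from sklearn.mixture import GaussianMixture
+
+rng = np.random.default_rng(0)
+real_data = rng.normal(size=(1000, 4))      # placeholder for real records
+
+generator = GaussianMixture(n_components=5, random_state=0)
+generator.fit(real_data)                    # learn the data distribution
+
+synthetic_data, _ = generator.sample(1000)  # draw synthetic records
+# synthetic_data has similar statistics to real_data, but no row
+# corresponds to an actual individual record.
+\end{verbatim}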
+
+Although training on synthetic data reduces the risk, synthetic data still carries a risk of leaking information about the raw data it was generated from, notably through membership inference and attribute inference.
+Membership inference refers to the possibility of inferring whether or not a data record belongs to the training data.
+Attribute inference refers to how a trained model can be leveraged to infer a sensitive attribute such as race or gender.
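+As a minimal illustration of membership inference, the following sketch implements a simple confidence-threshold attack in the spirit of loss-based attacks: records on which the model is unusually confident are guessed to be training members. The model, data, and threshold here are illustrative assumptions, not the attacks studied in this paper.
+\begin{verbatim}
+import numpy as np
+from sklearn.linear_model import LogisticRegression
+
+rng = np.random.default_rng(0)
+train_X = rng.normal(size=(200, 4))          # placeholder training records
+train_y = (train_X[:, 0] > 0).astype(int)    # placeholder labels
+model = LogisticRegression().fit(train_X, train_y)
+
+def membership_guess(model, records, threshold=0.9):
+    # Guess "member" when top-class confidence exceeds the threshold,
+    # exploiting models' tendency to be more confident on training data.
+    confidences = model.predict_proba(records).max(axis=1)
+    return confidences > threshold
+
+print(membership_guess(model, train_X).mean())  # fraction flagged as members
+\end{verbatim}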
+
+However, no study has focused on how these risks propagate when synthetic data is used instead of real data for model training.
+This paper investigates this risk propagation.
+More specifically, we study how membership and attribute inference are impacted by synthetic data.
+We study the following research question through an experimental approach:
+How does using synthetic data instead of real data affect users' privacy in the context of training machine learning models?
+
+The paper is organized as follows:
+We begin by introducing key notions of machine learning and synthetic data privacy.
+Then we present our experimental methodology, followed by our results and their interpretation.
+Finally, we give an overview of how this work fits within the current literature.
+