Diffstat (limited to 'synthetic/related.tex')
-rw-r--r--  synthetic/related.tex  45
1 file changed, 9 insertions, 36 deletions
diff --git a/synthetic/related.tex b/synthetic/related.tex
index 207bdf4..e93edd3 100644
--- a/synthetic/related.tex
+++ b/synthetic/related.tex
@@ -1,38 +1,11 @@
-The literature on the privacy of synthetic data focuses on a different yet related problem.
-In our work, the synthetic data is not released to the public; it is used as an intermediary between the real data and the target model.
-In contrast, the literature uses synthetic data as a way to release a dataset to third parties.
-The goal of this endeavour is to circumvent legislation on personal data~\cite{bellovin2019privacy}.
-Previous work shows that releasing synthetic data instead of the real data protects against neither re-identification nor attribute linkage~\cite{stadler2020synthetic}.
-
-Bellovin et al.~\cite{bellovin2019privacy} discuss the legal aspects of sharing synthetic data instead of the real data.
-They conclude that a court will not allow the disclosure of synthetic data, because numerous examples show that inferring private attributes of the real data is possible.
-They suggest that using differential privacy may lead to legislation allowing synthetic data release.
-For instance, Ping et al.~\cite{ping2017datasynthesizer} use the GreedyBayes algorithm for tabular data, into which they introduce differential privacy.
-
-%This conclusion transfers to our work because we have shown that using synthetic data to train a model does not fully protect against privacy attacks.
-%Datasynthesizer: privacy preserving synthetic datasets~\cite{ping2017datasynthesizer}.
-%Towards improving privacy of synthetic datasets~\cite{kuppa2021towards}.
-%User-Driven Synthetic Dataset Generation with Quantifiable Differential Privacy~\cite{tai2023user}.
-
-
-%Stadler et al.~\cite{stadler2020synthetic} focus on releasing a generated synthetic dataset to third parties instead of the real dataset.
-%This contrasts with our work, where we consider that the generated synthetic dataset is not released but is used to train a machine learning model.
-%They study two privacy risks: re-identification via linkage and attribute disclosure.
-%Re-identification via linkage is somewhat similar to a membership inference attack, as this kind of attack aims at inferring whether a data record has been used to generate the synthetic dataset.
-%Attribute disclosure is closer to attribute inference in the sense that an attacker aims to infer sensitive attributes of a user's data.
-%The main difference between Stadler et al. and our work is that we place a trained machine learning model between the synthetic dataset and the attacker, and the attacker has only black-box access to this model.
-%In our setup, the synthetic dataset is not directly accessible to the attacker.
-%The sensitive information contained in the real dataset is filtered twice: by the generation process and then by the training of the target model.
-%In Stadler et al., the sensitive information is filtered only by the generation process.
-%
-%Stadler et al. show that using synthetic data protects users' privacy against neither linkage nor attribute disclosure.
-%Our conclusion is that using a synthetic dataset to train a machine learning model does not protect users' privacy against adversaries with black-box access to this model.
-%Hence Stadler et al. and our work are aligned in showing that synthetic datasets are not a guaranteed protection for users' personal data.
-
-Jordon et al.~\cite{jordon2021hide} state that generative approaches can be used to hide membership status.
-Their contribution is a data anonymisation challenge with two tracks.
-The first track has to produce an algorithm that generates synthetic data hiding membership status.
-The second track produces an algorithm (i.e. an attack) that infers membership status using synthetic data generated by the algorithms of the first track.
-Unfortunately, their results remain inconclusive, because the participants of the first track submitted their work too close to the deadline, which did not leave the attackers enough time to develop tailored attacks.
+The literature on the privacy of synthetic data focuses on a related problem.
+In our study, the synthetic data is not made public; it is used as an intermediary between the real data and the target model.
+In contrast, in the literature the synthetic data is meant to be distributed to third parties.
+One purpose of doing so can be to circumvent legislation on personal data~\cite{bellovin2019privacy}.
+Previous work has shown that releasing synthetic data instead of the real data protects against neither re-identification attacks nor attacks linking the synthetic data to the real data (\textit{linkage})~\cite{stadler2020synthetic}.
+Bellovin et al.~\cite{bellovin2019privacy} study the legal aspects of sharing synthetic data created from real data.
+They conclude that a court will not authorise such sharing, given the many cases and studies proving that information about the real data can be learned from the synthetic data.
+They also suggest that using differential privacy may make such sharing legal, but in the absence of case law nothing is certain.
+With this in mind, works such as that of Ping et al.~\cite{ping2017datasynthesizer} seek to enforce differential privacy during the creation of synthetic data.
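
To make the differential-privacy idea referenced above concrete, here is a minimal Python sketch of the Laplace mechanism applied to a one-way marginal, the kind of statistic that tools in the DataSynthesizer family perturb. This is not Ping et al.'s implementation (DataSynthesizer noises the conditional distributions of a Bayesian network learned by GreedyBayes and splits the privacy budget across them); the function name dp_marginal, the epsilon value, and the toy counts are illustrative assumptions.

    import numpy as np

    def dp_marginal(counts, epsilon):
        # Laplace mechanism: a histogram (counting) query has sensitivity 1,
        # so adding Laplace(1/epsilon) noise to each count yields an
        # epsilon-differentially-private release of the marginal.
        noisy = counts + np.random.laplace(scale=1.0 / epsilon, size=counts.shape)
        noisy = np.clip(noisy, 0.0, None)   # counts cannot be negative
        return noisy / noisy.sum()          # renormalise into a distribution

    # Toy example: one categorical attribute with hypothetical value counts.
    counts = np.array([120.0, 45.0, 30.0, 5.0])
    probs = dp_marginal(counts, epsilon=0.5)   # epsilon chosen arbitrarily

    # Sample a synthetic column from the noisy marginal instead of the real data.
    synthetic_column = np.random.choice(len(counts), size=200, p=probs)

Only the noisy marginal ever depends on the real counts, so the sampled synthetic column inherits the differential-privacy guarantee; a full generator would apply the same perturbation to every released statistic and account for the total privacy budget.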