Diffstat (limited to 'synthetic/bck/background.tex')
-rw-r--r-- | synthetic/bck/background.tex | 64 |
1 file changed, 64 insertions, 0 deletions
diff --git a/synthetic/bck/background.tex b/synthetic/bck/background.tex
new file mode 100644
index 0000000..e8a0c30
--- /dev/null
+++ b/synthetic/bck/background.tex
@@ -0,0 +1,64 @@
+In this section, we introduce the key notions and previous works on which we base our study.
+
+\subsection{Machine learning and classification}
+In classification tasks, a machine learning model is a function that maps the features of a data record to its label.
+This function has an architecture, which describes the structure of its internal computation, as well as parameters.
+For instance, with one-dimensional data, the affine model is $f(x) = ax+b$, where $x$ is the feature and $a$ and $b$ are the parameters.
+In general, the range of $f$ is $\mathbb{R}$, and we call $f(x)$ the soft label or the logit of $x$.
+Because classification problems require discrete values, we apply a threshold to the soft label, below which the predicted label is 0 and above which it is 1.
+
+Training a machine learning model means using an optimization algorithm to find parameters that minimize a loss function $l$.
+In the previous example, the optimization problem is $\min_{(a,b)\in\mathbb{R}^2} l(f(x),y)$, where $y$ is the ground truth: the label of $x$ in the dataset.
+
+\subsection{Synthetic data}
+A generator is a function that takes a real dataset as input and outputs a synthetic dataset.
+This definition is general enough that the identity function is a generator, even though synthetic datasets are supposed to differ from real-world datasets.
+We refer to the output of the identity generator as real data, and to the output of any other generator as synthetic data.
+
+In addition to the identity generator, we use Generative Adversarial Networks (GANs)~\cite{gan}.
+The goal of a GAN is to generate realistic samples given a distribution of multivariate data.
+To do so, a GAN leverages two neural networks: a generator and a discriminator.
+The domain of the generator (its input space) has a low dimension with respect to its codomain (its output space), which has the same dimension as the data we want to generate.
+For instance, with 64 by 64 images, the codomain is the set of matrices with 64 rows and 64 columns.
+To generate a new sample, we evaluate the generator on a sample drawn from a multivariate standard normal distribution whose dimension is that of the generator's domain.
+This output is the new synthetic data point.
+
+The discriminator is used only when training the GAN, with the goal of ensuring that the generator produces realistic data.
+To do so, the discriminator is a neural network with a classification goal: infer whether a sample is synthetic or real.
+Hence, in the training procedure, the discriminator and the generator are in competition: the generator's goal is to fool the discriminator into classifying synthetic data as real data.
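+
+Schematically, in the notation of~\cite{gan} (with $p_{\text{data}}$ the data distribution and $p_z$ the standard normal prior over the generator's domain), this competition is the minimax game
+\begin{equation*}
+\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right],
+\end{equation*}
+where the discriminator $D$ maximizes this objective while the generator $G$ minimizes it.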
+
+
+\subsection{Membership inference attack}
+This attack infers the membership status: whether a data record has been used in the training of a machine learning model (member, $m$) or not (non-member, $\bar{m}$).
+%Shadow models
+In practice, this attack is carried out by leveraging shadow models: models that imitate the behavior of the target~\cite{shokri2017membership}.
+This technique allows an attacker to construct a dataset of logits and ground truth labels, annotated with the membership status.
+
+%Yeom et al.
+Overfitting is one of the major historical difficulties of machine learning~\cite{hawkins2004problem}.
+The generalization error is the difference between the average loss of non-members and the average loss of members.
+The larger this error, the more the model overfits.
+Yeom et al. show that overfitting is the major factor enabling membership inference attacks~\cite{yeom}.
+They build an attack that assumes the attacker has access to a dataset of losses labeled with the membership status.
+This allows them to build a model that infers the membership status from the losses of data records.
+
+%DP
+Differential privacy is a probabilistic definition that bounds the success of membership inference attacks.
+In practice, these guarantees are achieved through gradient clipping and additive noise in the training algorithm~\cite{abadi2016deep}.
+
+\subsection{Attribute inference attack}
+Model predictions, and especially soft labels, can depend on a sensitive attribute such as race or sex.
+For instance, the prediction of recidivism in predictive justice depends on the race of the defendant~\cite{EO}.
+An attribute inference attack (AIA) leverages such biases in model predictions to infer the sensitive attributes of data records~\cite{song2020overlearning}.
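+
+Schematically, in our notation (with $s$ the sensitive attribute and $\mathcal{D}_{\text{aux}}$ an auxiliary dataset in which $s$ is known), the attacker trains a model $g$ mapping soft labels to the sensitive attribute by solving
+\begin{equation*}
+\min_g \; \mathbb{E}_{(x,s) \sim \mathcal{D}_{\text{aux}}}\left[l\left(g(f(x)), s\right)\right],
+\end{equation*}
+and then predicts $\hat{s} = g(f(x))$ for any data record whose soft label it observes.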