In this section we present the experimental approach we take to answer the research questions stated in the introduction (Section~\ref{sec:question}).
We begin with an overview of the datasets and of the generator functions that we study.
Each generator outputs another dataset that we use to train the target model; we then explain the types of classifier models taken into consideration in our study.
We next describe the attribute inference attack and the membership inference attack.
In Section~\ref{sec:ovr} we discuss a specificity of our methodology: how we control the level of overfitting of the generators and of the target models.
Finally, we show a graphical representation of the overall process, from the real dataset to the experimental results, in Figure~\ref{fig:split}.

\subsection{Datasets}
We study two types of datasets: a tabular one and an image one.
This allows us to experiment with various types of generators (cf. Section~\ref{sec:gen}) and target models (cf. Section~\ref{sec:target}).
\subsubsection{US census (Adult)}
The US census is a snapshot of the US adult population taken every ten years by the US government\footnote{www.census.gov}.
It produces a database where each row is an individual and each column is an attribute describing that individual.
In the rest of the paper we refer to this dataset as Adult.

The classification task is to predict whether or not the person is employed.

The sensitive attribute we study for this dataset is race.
Race in the US census is encoded with nine classes; we transform it into a binary attribute, encoding Black with a one (1) and all other classes with a zero (0).
Hence, in an attribute inference attack setup, the goal is to infer whether or not an individual is Black.
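For illustration, this binarization amounts to a one-line transformation; the sketch below assumes the raw codes live in a pandas column named \texttt{RAC1P} (the census race code, where 2 denotes Black or African American), which is our assumption rather than a detail fixed by the paper.
\begin{verbatim}
import pandas as pd

# Hypothetical excerpt of the census data; in practice the
# column comes from the Adult dataset described above.
df = pd.DataFrame({"RAC1P": [1, 2, 6, 2, 9]})

# Encode Black (code 2) as 1 and every other class as 0.
df["race_binary"] = (df["RAC1P"] == 2).astype(int)
\end{verbatim}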

In this paper, we do not directly download the US census but instead use an instantiation of it: Retiring adult.
Retiring adult~\cite{ding2021retiring} is a packaging of the US census made to interface frictionlessly with common Python tools such as pandas, numpy or scikit-learn.
It allows us to select states, years, classification and sensitive attributes.
We restrict ourselves to the Alabama census of 2018 for practical reasons: mainly storage and computing time limitations.
This subset contains the records of 47,777 individuals.
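As a sketch, loading such a subset through the \texttt{folktables} package that accompanies Retiring adult could look as follows; the exact class and function names may vary between versions.
\begin{verbatim}
from folktables import ACSDataSource, ACSEmployment

# 2018 one-year ACS person records, restricted to Alabama.
source = ACSDataSource(survey_year="2018",
                       horizon="1-Year", survey="person")
acs_data = source.get_data(states=["AL"], download=True)

# Features, employment label and group (race) attribute.
features, label, group = ACSEmployment.df_to_numpy(acs_data)
\end{verbatim}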

\subsubsection{CelebA}
This image dataset is composed of 202,599 pictures of faces~\cite{zhifei2017cvpr}.
We use an instantiation of CelebA provided by PyTorch.

The classification task is to predict whether or not an individual has blond hair.

The sensitive attribute is whether the individual in the picture is male or not.
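A minimal sketch of accessing these two attributes through torchvision; the attribute indices below (9 for \texttt{Blond\_Hair}, 20 for \texttt{Male}) follow the standard CelebA attribute ordering and should be checked against \texttt{dataset.attr\_names}.
\begin{verbatim}
from torchvision import datasets, transforms

dataset = datasets.CelebA(root="data", split="train",
                          target_type="attr", download=True,
                          transform=transforms.ToTensor())

image, attributes = dataset[0]
blond_hair = attributes[9]   # classification label
male = attributes[20]        # sensitive attribute
\end{verbatim}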

\subsection{Generator training}
\label{sec:gen}
In our work, we study two GAN variants:
DCGAN~\cite{dcgan} and CTGAN~\cite{ctgan}.
The former makes use of deep convolutional neural networks, which makes it well suited to generating images.
The latter is tailored toward tabular data, where a distinction is made between categorical and quantitative attributes.

\subsubsection{CTGAN}
CTGAN~\cite{ctgan} is a conditional GAN tailored for heterogeneous tabular data.
It is designed to take both quantitative and qualitative attributes into account.
For instance, the job attribute is qualitative whereas height is quantitative.
In our experiments we use the implementation of CTGAN provided by the Synthetic Data Vault (SDV)\footnote{sdv.dev}.

We use CTGAN as the generator for Adult.
We use SDV's automatic metadata generation for tabular datasets.
This metadata is necessary for CTGAN to perform at its best because it indicates the type of each attribute.
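A minimal sketch of this step, assuming a recent SDV release (the module layout has changed across SDV versions); the file path is hypothetical.
\begin{verbatim}
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

data = pd.read_csv("adult_alabama_2018.csv")  # hypothetical path

# Automatically detect attribute types (categorical vs. numerical).
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data)

synthesizer = CTGANSynthesizer(metadata)
synthesizer.fit(data)
synthetic_data = synthesizer.sample(num_rows=100000)
\end{verbatim}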

\subsubsection{DCGAN}
DCGAN~\cite{dcgan} is a GAN in which both the generator and the discriminator make use of deep convolutional neural networks.
Convolutional layers are well suited to image tasks and are commonly used~\cite{cnn}.

To train the target model we need a label for each generated image.
Hence we make the DCGAN conditional~\cite{cgan} by adding an embedding layer in both the generator and the discriminator.
In the generation process, once the GAN is trained, this embedding allows us to specify the class label of the image we want to generate.
We ask not only for a realistic image, but for a realistic image of a person with blond hair, hair color being the classification task.
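The sketch below illustrates this conditioning mechanism in PyTorch (a label embedding concatenated to the noise vector before the transposed convolutions); the layer sizes are illustrative and not those of our actual architecture.
\begin{verbatim}
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, z_dim=100, n_classes=2, embed_dim=16):
        super().__init__()
        self.embed = nn.Embedding(n_classes, embed_dim)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim + embed_dim, 128, 4, 1, 0),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.ConvTranspose2d(128, 3, 4, 2, 1),
            nn.Tanh(),
        )

    def forward(self, z, labels):
        # Concatenate the label embedding to the noise vector.
        c = self.embed(labels)
        x = torch.cat([z, c], dim=1).unsqueeze(-1).unsqueeze(-1)
        return self.net(x)

# Generate eight images of class 1 (e.g. blond hair).
g = ConditionalGenerator()
images = g(torch.randn(8, 100), torch.ones(8, dtype=torch.long))
\end{verbatim}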

\subsection{Predictor training}
\label{sec:target}
We use a different type of model for each dataset.
For Adult we use a random forest classifier with a hundred trees.
We use scikit-learn's implementation of random forests.

For CelebA we use PyTorch's implementation of VGG16~\cite{vgg16}.
Instead of training this model from scratch as we do for Adult, we use transfer learning.
This method consists in initializing the neural network with already trained weights.
We use the weights provided by PyTorch.
Before we start training, we replace the last layer of VGG16 with a linear layer of two neurons to adjust for our classification task.
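A minimal sketch of this initialization through torchvision:
\begin{verbatim}
import torch.nn as nn
from torchvision import models

# Load VGG16 with the pretrained weights shipped by torchvision.
model = models.vgg16(weights=models.VGG16_Weights.DEFAULT)

# Replace the final 4096 -> 1000 classifier layer with a
# two-neuron layer for our binary task.
model.classifier[6] = nn.Linear(4096, 2)
\end{verbatim}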

To evaluate a classification task we use the balanced accuracy.
This value is defined by the following expression:
\begin{equation*}
\frac{P(\hat{Y}=0|Y=0) + P(\hat{Y}=1|Y=1)}{2}
\end{equation*}
where $\hat{Y}$ is the prediction of the machine learning classifier and $Y$ is the ground truth: the real label of the data record.
The balanced accuracy is a well-suited metric for our study because neither class is more sensitive than the other and this metric is not affected by class imbalance.
Hence we exclude other metrics such as accuracy, precision and recall.
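This metric is available off the shelf in scikit-learn:
\begin{verbatim}
from sklearn.metrics import balanced_accuracy_score

y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]

# Per-class recalls are 2/3 and 1/1, so the result is 5/6.
print(balanced_accuracy_score(y_true, y_pred))  # 0.8333...
\end{verbatim}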
%\subsubsection{Random forest classifier}
%\subsubsection{Convolutional neural network}

\subsection{Attack training}
\subsubsection{Attribute inference attack (AIA)}
In AIA, the attacker uses the predictions of the target model to infer the value of the sensitive attribute; we describe how the corresponding attack dataset is built in Section~\ref{sec:data}.
\subsubsection{Membership inference attack (MIA)}
To perform MIA we do not use shadow models but rather adopt an approach similar to Yeom et al.~\cite{yeom}.
We consider that the attacker already has a dataset of losses with their corresponding membership status ($m$ or $\bar{m}$).
Hence our methodology gives an upper bound on what is achievable with shadow models.
Because our study focuses on synthetic data, the members are the points used to train the generator and not the points used to train the target model.
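For illustration, a minimal sketch of such a loss-based attack, using the threshold rule of Yeom et al. (predict ``member'' when the loss falls below the mean member loss); the loss values are made up for the example.
\begin{verbatim}
import numpy as np
from sklearn.metrics import balanced_accuracy_score

# Losses assumed to be already in the attacker's hands.
member_losses = np.array([0.05, 0.10, 0.20, 0.02])     # m
nonmember_losses = np.array([0.40, 0.90, 0.15, 1.20])  # m_bar

losses = np.concatenate([member_losses, nonmember_losses])
membership = np.concatenate([np.ones(4), np.zeros(4)])

# Predict "member" when the loss is below the mean member loss.
threshold = member_losses.mean()
predictions = (losses <= threshold).astype(int)

print(balanced_accuracy_score(membership, predictions))
\end{verbatim}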

\subsection{Overfitting control}
\label{sec:ovr}
MIA usually gives low results in regular setups, especially at low false positive rates~\cite{stadler2020synthetic}.
We artificially increase the MIA risk by using the OVR CTRL function,
because our study is not about the MIA risk of a particular dataset and architecture, but rather about the risk of using synthetic data instead of real data.
\begin{figure}
  \centering
  \input{synthetic/figure/tikz/ovre}
  \caption{In this figure we detail the OVR CTRL function.
  This function controls the overfitting of the target model.
  It takes a dataset of size at least $N$ and outputs a dataset of size $M$.
  First, we sample $N$ rows denoted $r_0,\cdots,r_{N-1}$ from the input dataset.
  Second, we repeat the rows $\lfloor\frac{M}{N}\rfloor$ times.
  Finally, we shuffle the repeated rows.}
  \label{fig:ovr}
\end{figure}

Before using the real data to train the generator, we apply the OVR CTRL function to it.
This function controls the overfitting of the generator through sampling, repetition and shuffling.
We describe the internals of this function in detail in Figure~\ref{fig:ovr}.
OVR CTRL duplicates $N$ data points to create a dataset of $M$ points.
When $N$ is smaller than $M$, each data point is seen multiple times at each training epoch.
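A minimal sketch of OVR CTRL under our reading of Figure~\ref{fig:ovr}:
\begin{verbatim}
import numpy as np

def ovr_ctrl(dataset, n, m, seed=0):
    """Sample n rows, repeat them floor(m / n) times, shuffle."""
    rng = np.random.default_rng(seed)
    rows = rng.choice(dataset, size=n, replace=False)
    repeated = np.tile(rows, (m // n, 1))
    return rng.permutation(repeated)

# e.g. 5000 distinct rows repeated to reach 100000 samples:
# training_data = ovr_ctrl(real_data, n=5000, m=100000)
\end{verbatim}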

We demonstrate empirically that the target model overfits more for certain values of $N$.
We observe in Figure~\ref{fig:tune_ovr} that, for Adult,
with 5000 distinct points the utility and the quality score of the synthetic data are high (above 0.7 of balanced accuracy) while the MIA achieves a balanced accuracy of 0.54, which indicates leakage of the membership status.
Hence we chose to use 5000 distinct data points repeated over 100000 samples.

We apply the same methodology to configure OVR CTRL for CelebA:
we use 50000 different images repeated over 100000 samples.

\begin{figure}
  \centering
  \begin{subfigure}{0.45\linewidth}
    \includegraphics[width=\textwidth]{synthetic/figure/method/overfit/quality.pdf}
    \caption{Quality of the synthetic data}
  \end{subfigure}
  \begin{subfigure}{0.45\linewidth}
    \includegraphics[width=\textwidth]{synthetic/figure/method/overfit/utility.pdf}
    \caption{Utility of the synthetic data}
  \end{subfigure}
  \begin{subfigure}{0.45\linewidth}
    \includegraphics[width=\textwidth]{synthetic/figure/method/overfit/mia.pdf}
    \caption{Sensitivity to membership inference attack of the synthetic data}
  \end{subfigure}
  \begin{subfigure}{0.45\linewidth}
    \includegraphics[width=\textwidth]{synthetic/figure/method/overfit/aia.pdf}
    \caption{Sensitivity to attribute inference attack of the synthetic data}
  \end{subfigure}
  \caption{Methodology for finding an amount of repetition that achieves both satisfying utility and a high sensitivity to MIA.
  We use a total number of 100000 points.
  In this experiment the only generator used is CTGAN.
  The results presented are for the Adult dataset but we apply the same process for CelebA using DCGAN.
  }
  \label{fig:tune_ovr}
\end{figure}
\subsection{Data pipeline}
\label{sec:data}
In this section, we describe how the datasets are handled throughout our experimental process.
We also provide a visual representation of this process in Figure~\ref{fig:split}.
We begin with the real data, which we split into a training set ($m$) and an evaluation set ($\bar{m}$).
The training set goes through the OVR CTRL function, which controls the overfitting level of the target model and the generator.
The training set then trains the generator model; if the generator is the identity function, $m$ is also the output of the generator.
We use the output of the generator to train the target model and we evaluate its utility using only $\bar{m}$: unseen data.
The output of the target model on $\bar{m}$ is referred to as ``prediction'' in Figure~\ref{fig:split}.

In addition to using the predictions for evaluation, we build the AIA dataset from them, in accordance with the threat model of this attack.
We then split the AIA dataset to train and evaluate the AIA model.

Finally, we run the MIA, represented in the bottom part of Figure~\ref{fig:split}.
In addition to the predictions, the target model outputs the losses on $m$ and $\bar{m}$, which we use to build the MIA dataset.
Similarly to AIA, we split the MIA dataset to train and evaluate the MIA model.

Each of these splits is repeated five times in a cross-validation setting.
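For the initial split, a sketch of the five-fold loop could look as follows (the names are illustrative):
\begin{verbatim}
import numpy as np
from sklearn.model_selection import KFold

real_data = np.random.rand(1000, 8)  # stand-in for the real data

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for train_index, eval_index in kfold.split(real_data):
    m, m_bar = real_data[train_index], real_data[eval_index]
    # ... apply OVR CTRL to m, train the generator, train the
    # target model on the generated data, evaluate on m_bar ...
\end{verbatim}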
\begin{figure}
  \input{synthetic/figure/tikz/data_split/split}
  \caption{This figure presents the data splits and subsets used to compute results.
  It is a representation of the whole methodology described in this section.
  The reader may start at the top left corner, with the real data.
  The rectangular boxes represent functions, whose inputs are incoming arrows and whose outputs are outgoing arrows.
  In the case of trainable functions such as machine learning models, we indicate that an input is the training data with the label ``training''.
  We use a similar notation for evaluation.
  }
  \label{fig:split}
\end{figure}

\section{Comparisons between synthetic and real data}
In Section~\ref{sec:res}, we compare metrics computed using two generators for each dataset: the identity function, to run the pipeline with real data, and a GAN, to run the pipeline with synthetic data.
In each of the following experiments, when comparing results, every parameter is the same except for the generator used.
This allows us to attribute any significant difference between metrics solely to the usage of synthetic or real data.

We repeat every experiment with cross-validation; hence utility results are computed five times and MIA and AIA results are computed 25 times.
We display the results in the form of boxplots and we decide whether the gap between two boxplots is significant using an analysis of variance (ANOVA).
In this test the null hypothesis is: the results from real and synthetic data have the same mean.
If the p-value of the Fisher test is less than 0.01, we reject the null hypothesis and conclude that using synthetic data instead of real data has an impact.
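With two groups, this test reduces to a one-way ANOVA, available for instance in SciPy; the values below are made up for illustration.
\begin{verbatim}
from scipy.stats import f_oneway

# Hypothetical balanced accuracies from the two pipelines.
real_results = [0.71, 0.70, 0.72, 0.69, 0.71]
synthetic_results = [0.66, 0.64, 0.67, 0.65, 0.66]

f_stat, p_value = f_oneway(real_results, synthetic_results)
if p_value < 0.01:
    print("Reject the null hypothesis: significant difference")
\end{verbatim}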