In this section we analyse the impact of using synthetic data instead of real data on MIA and AIA.
Section~\ref{sec:uti} presents the utility of the target model.
This control factor allows us to verify that every target model has learned some level of information and is not merely guessing the label at random.


\subsection{Utility}
\label{sec:uti}

\begin{figure}
    \centering
    \includegraphics[width=0.45\textwidth]{synthetic/figure/result/adult/utility.pdf}
\caption{Utility of the target model in terms of balanced accuracy evaluated on unseen data.
    The ``Real'' label refers to an identity generator: the data used to train the target model are the real data.
    The ``Synthetic'' label refers to a CGAN generator: the training data are sampled from the distribution learned by the generator, so the target model is never trained on real data.}
    \label{fig:utility}
\end{figure}
Using a synthetic dataset degrades the utility of the target model.
We present the balanced accuracy for both synthetic and real data in Figure~\ref{fig:utility}.

Using synthetic data significantly degrades the utility of the target model by 5\%, with an ANOVA p-value of $1.23\times 10^{-5}$.
However, with a minimum balanced accuracy of 0.68 on synthetic data, we argue that the target model has learned enough information for the MIA and AIA results to be meaningful.
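
As an illustration of how such a significance test can be run, the sketch below applies SciPy's one-way ANOVA to two hypothetical arrays of per-run balanced accuracies (the variable names and values are illustrative, not our measured results):
\begin{verbatim}
from scipy.stats import f_oneway

# Hypothetical per-run balanced accuracies of the target model
# (illustrative values only).
ba_real      = [0.81, 0.80, 0.82, 0.79, 0.81]  # trained on real data
ba_synthetic = [0.75, 0.74, 0.77, 0.73, 0.76]  # trained on CGAN samples

# One-way ANOVA: does the training-data regime affect utility?
f_stat, p_value = f_oneway(ba_real, ba_synthetic)
print(f"F={f_stat:.2f}, p={p_value:.2e}")
\end{verbatim}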

\subsection{Membership inference attack}
\begin{figure}
    \centering
    \includegraphics[width=0.45\textwidth]{synthetic/figure/result/adult/mia.pdf}
\caption{Success of the MIA in terms of balanced accuracy evaluated on the train part of the MIA dataset.}
    \label{fig:mia}
\end{figure}
Figure~\ref{fig:mia} compares the success of the MIA on real and synthetic data.
We observe a degradation of the balanced accuracy of the MIA of 30\% on average.
An ANOVA p-value of $4.54\times 10^{-12}$ indicates that this difference is significant.
In addition, using synthetic data instead of real data results in a drop of balanced accuracy from 0.86 to 0.55.
We conclude that using synthetic data significantly protects the membership status of the majority of data records.

However, this result does not mean that the membership status of every record is protected.
The remaining 5\% above the random-guessing baseline is due to outliers in the dataset, which can still be identified by an attacker~\cite{carlini2022membershipinferenceattacksprinciples}.
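
For reference, balanced accuracy is the average of the true positive rate and the true negative rate,
\[
  \mathrm{BA} = \tfrac{1}{2}\left(\mathrm{TPR} + \mathrm{TNR}\right),
\]
so an attacker that guesses membership at random achieves $\mathrm{TPR} = \mathrm{TNR} = 0.5$ and hence $\mathrm{BA} = 0.5$; the observed 0.55 thus corresponds to a 5-point advantage over this baseline.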

\subsection{Attribute inference attack}
\begin{figure}
    \centering
    \includegraphics[width=0.45\textwidth]{synthetic/figure/result/adult/aia.pdf}
\caption{Success of the AIA in terms of balanced accuracy evaluated on the train part of the AIA dataset.
    The AIA dataset is made of points that have not been seen during the training of the target model.
    The target model does not use the sensitive attribute.}
    \label{fig:aia}
\end{figure}
Using a synthetic dataset has no impact on the success of the attribute inference attack.
Figure~\ref{fig:aia} compares the success of the AIA on real and synthetic data.

With an ANOVA p-value of $8.65\times 10^{-1}$, we observe that using synthetic or real data makes no significant difference to attribute inference.
In addition, with an attack balanced accuracy ranging from 0.52 to 0.54, we observe a slight but genuine risk of attribute leakage.
Hence, we conclude that using synthetic data does not protect users against AIA.
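
To illustrate why balanced accuracy is the appropriate metric for this conclusion, the sketch below (with hypothetical labels, not taken from our experiments) shows that always guessing the majority class of an imbalanced sensitive attribute still scores at the 0.5 baseline, so values of 0.52 to 0.54 reflect genuine leakage rather than class imbalance:
\begin{verbatim}
from sklearn.metrics import balanced_accuracy_score

# Hypothetical ground-truth sensitive attribute (80% zeros, 20% ones).
y_true = [0] * 80 + [1] * 20

# Always guessing the majority class is uninformative:
# balanced accuracy stays at the 0.5 random baseline.
y_majority = [0] * 100
print(balanced_accuracy_score(y_true, y_majority))  # 0.5

# An attack that recovers part of the minority class scores above 0.5.
y_attack = [0] * 72 + [1] * 8 + [1] * 4 + [0] * 16
print(balanced_accuracy_score(y_true, y_attack))  # ~0.55
\end{verbatim}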