\label{sec:bck_aia}
Attacks which violate privacy and confidentiality in ML infer potentially sensitive information from observable information (e.g., model predictions).
This leakage of information is a privacy risk if adv learns something about $traindata$ (or the inputs) that would be impossible to learn without access to $targetmodel$; this is what differentiates a privacy risk from simple statistical inference~\cite{cormode}.
Among the various privacy risks explored in the literature on ML models, attribute inference attacks~\cite{fredrikson2,Mahajan2020DoesLS,yeom,Song2020Overlearning,malekzadeh2021honestbutcurious,MehnazAttInf} infer the value of a sensitive attribute for a specific input to the ML model, given some model observables (e.g., model predictions, parameters, intermediate layerwise outputs) and background information. Based on the attack surface being exploited, aia{s} can be categorized into (a) imputation-based attacks and (b) representation-based attacks.
We first introduce some notation to describe these attacks.
We have a dataset $d:I\rightarrow \mathcal{X}\times\mathcal{S}\times\mathcal{Y}$ whose columns are the features, the sensitive attribute, and the ground truth.
$I$ is a finite set of indices.
To access the features, the sensitive attribute, and the labels from their indices, we define the following functions, respectively:
\begin{itemize}
\item $X:I\rightarrow \mathcal{X},~i\mapsto (d(i))_0$
\item $S:I\rightarrow \mathcal{S},~i\mapsto (d(i))_1$
\item $Y:I\rightarrow \mathcal{Y},~i\mapsto (d(i))_2$
\end{itemize}
Let $(I_0,I_1)$ be a partition of $I$.
$d$ is split into two datasets, $d_0 = d_{{|I_0}}$ and $d_1 = d_{{|I_1}}$, which we call the target dataset and the auxiliary dataset, respectively.
$d_0$ is used to train a machine learning model to infer the ground truth from the features: we call this model the target model $targetmodel$.
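To make this setup concrete, the sketch below instantiates $d$, the partition $(I_0,I_1)$, and $targetmodel$ on synthetic tabular data; the column names, data distribution, and choice of a scikit-learn logistic regression as $targetmodel$ are illustrative assumptions and not the configuration used in our evaluation.
\begin{verbatim}
# Minimal sketch of the setup: a dataset d with columns for the features X,
# the sensitive attribute S and the ground truth Y, partitioned into a
# target dataset d_0 and an auxiliary dataset d_1. All concrete choices
# below (columns, distributions, classifier) are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
d = pd.DataFrame({
    "x1": rng.normal(size=n),           # non-sensitive features X(i)
    "x2": rng.normal(size=n),
    "s": rng.integers(0, 2, size=n),    # sensitive attribute S(i)
})
# Ground truth Y(i), correlated with both the features and S.
d["y"] = (d["x1"] + 0.5 * d["s"] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Partition the index set I into (I_0, I_1): target and auxiliary datasets.
idx = rng.permutation(n)
I0, I1 = idx[: n // 2], idx[n // 2:]
d0, d1 = d.iloc[I0], d.iloc[I1]

# Target model: trained on d_0 to predict Y from the non-sensitive features.
target_model = LogisticRegression().fit(d0[["x1", "x2"]], d0["y"])
\end{verbatim}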
Regarding attribute inference attacks, we differentiate between training-time attacks, which target $d_0$ (the dataset used for training), and inference-time attacks, which target data used as input to an already trained target model.
Our work focuses on the latter (see Figure~\ref{fig:tm2}), but to position our contributions clearly, we present both types of attack in this background section.
\noindent\textbf{\underline{Imputation-based attacks}} assume adv has access to the non-sensitive attributes in addition to the model's predictions and background information (e.g., the marginal prior over the sensitive attribute and the confusion matrix). We review these imputation-based attacks below:
\setlength\tabcolsep{3pt}
\begin{table*}[!htb]
\caption{Comparison of prior work based on: the attack surface exploited (e.g., model predictions ($targetmodel(X(i))$), $X(i)$, $Y(i)$, the distribution over $S(i)$ ($P_S$), and the confusion matrix between true and predicted outputs across all training data records ($C(Y(i),targetmodel(X(i)))$)), whether $S(i)$ is censored, i.e., included in $traindata$ or the inputs, whether the attack accounts for class imbalance in $S(i)$, whether adv is active or passive, and whether the threat model is blackbox or whitebox. All the attacks assume knowledge of the auxiliary data $auxdata$.}
\begin{center}
\footnotesize
\begin{tabular}{ |c|c|c|c|c|c| }
\hline
\textbf{Literature} & \textbf{Attack Vector} & \textbf{$S$ is censored?} & \textbf{Imbalance in $S$?} & \textbf{adv} & \textbf{Threat Model} \\
\hline
\multicolumn{6}{|c|}{\textbf{Imputation-based Attacks}}\\
\hline
\textbf{Fredrikson et al.}~\cite{fredrikson2} & $X(i)$, $Y(i)$, $targetmodel(X(i))$, \textbf{$P_S$}, $C(Y(i),targetmodel(X(i)))$ & $\checkmark$ & $\times$ & Passive & Blackbox\\
\textbf{Yeom et al.}~\cite{yeom} & $X(i)$, $Y(i)$, $targetmodel()$, \textbf{$P_S$} & $\checkmark$ & $\times$ & Passive & Blackbox\\
\textbf{Mehnaz et al.}~\cite{MehnazAttInf} & $X(i)$, $Y(i)$, $targetmodel()$, \textbf{$P_S$}, $C(Y(i),targetmodel(X(i)))$ & $\checkmark$ & $\times$ & Passive & Blackbox\\
\textbf{Jayaraman and Evans}~\cite{jayaraman2022attribute} & $X(i)$, $Y(i)$, $targetmodel()$, \textbf{$P_S$}, $C(Y(i),targetmodel(X(i)))$ & $\times$, $\checkmark$ & $\times$ & Passive & Whitebox\\
\hline
\multicolumn{6}{|c|}{\textbf{Representation-based Attacks}}\\
\hline
\textbf{Song et al.}~\cite{Song2020Overlearning} & $targetmodel(X(i))$ & $\times$ & $\times$ & Passive & Both\\
\textbf{Mahajan et al.}~\cite{Mahajan2020DoesLS} & $targetmodel(X(i))$ & $\checkmark$ & $\times$ & Passive & Blackbox\\
\textbf{Malekzadeh et al.}~\cite{malekzadeh2021honestbutcurious} & $targetmodel(X(i))$ & $\times$ & $\times$ & Active & Blackbox\\
\textbf{Our Work} & $targetmodel(X(i))$ & $\times$, $\checkmark$ & $\checkmark$ & Passive & Blackbox \\
\hline
\end{tabular}
\end{center}
\label{tab:summary}
\end{table*}
\begin{itemize}
\item \textbf{Fredrikson et al.~\cite{fredrikson2}} assume that adv has access to $targetmodel(X(i))$.
The attack requires that $X$ can be written as $X(i) = (\cdots,S(i),\cdots)$; we refer to this case as ``\textit{$S$ is in the input}''.
The attack generates inputs with the different possible values of the sensitive attribute and then chooses the most likely value based on $targetmodel(X(i))$.
\item \textbf{Yeom et al.~\cite{yeom}} assume a distribution $P_S$ over $S$ which is used to estimate the value of $S$ for an arbitrary data record. They propose three attack variants based on assumptions on $P_S$: Attack 1 leverages a membership oracle to determine the value of $S(i)$, while Attacks 2 and 3 assume different types of distributions over $S$.
For this attack to work, $S$ must be in the input and the data points being attacked must belong to the target dataset.
\item \textbf{Mehnaz et al.~\cite{MehnazAttInf}} improve upon Fredrikson et al.~\cite{fredrikson1,fredrikson2} by exploiting $targetmodel\circ X$ and $X$, with $S$ in the input. The attack relies on the intuition that $targetmodel$'s output confidence is higher when the input has the correct value of $S$, since $targetmodel$ encountered the target record with that attribute during training. Their attack generates multiple instances of the input with different values of $S(i)$ (similar to Fredrikson et al.~\cite{fredrikson1,fredrikson2}) and identifies the most likely value of $S$ (a simplified sketch of this strategy is given after this list).
\end{itemize}
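To illustrate the intuition shared by these imputation-based attacks, the sketch below tries every candidate value of the sensitive attribute and keeps the one for which $targetmodel$ is most confident about the known label $Y(i)$. This is a deliberately simplified sketch: the original attacks additionally exploit the marginal prior $P_S$ and the confusion matrix, which are omitted here, and the function name and signature are our own illustrative choices.
\begin{verbatim}
# Hedged sketch of the shared intuition behind Fredrikson et al. and
# Mehnaz et al.: when S is in the input, try each candidate value of S(i)
# and keep the one for which the target model is most confident about the
# known label Y(i). The prior P_S and the confusion matrix used by the
# original attacks are omitted in this simplification.
import numpy as np

def imputation_style_aia(target_model, x_without_s, y_true,
                         s_candidates, s_position):
    """Guess S(i) for one record whose other features and label are known.

    Assumes target_model was trained with S among its input features and
    that its classes are indexed 0..k-1, so column y_true of predict_proba
    is the confidence for the true label.
    """
    best_s, best_conf = None, -np.inf
    for s in s_candidates:
        x_full = np.insert(np.asarray(x_without_s, dtype=float),
                           s_position, s).reshape(1, -1)
        conf = target_model.predict_proba(x_full)[0, y_true]
        if conf > best_conf:
            best_s, best_conf = s, conf
    return best_s
\end{verbatim}
Note that this sketch applies to a $targetmodel$ trained with $S$ among its input features (the ``$S$ is in the input'' case) and to records from the target dataset.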
An appropriate baseline to identify whether such attacks are indeed a privacy risk is data imputation, i.e., training an ML model to infer the value of the missing attribute from the other non-sensitive attributes without using $targetmodel(X(i))$~\cite{jayaraman2022attribute}. Jayaraman and Evans~\cite{jayaraman2022attribute} find that existing blackbox imputation-based attacks~\cite{yeom,fredrikson2,MehnazAttInf} do not perform any better than data imputation. In other words, the perceived privacy risk actually stems from statistical inference and is hence not an actual privacy risk.
To address this, Jayaraman and Evans~\cite{jayaraman2022attribute} propose a whitebox aia which outperforms prior blackbox attacks as well as data imputation in the setting where adv has limited knowledge of the data. However, since the attack is in a whitebox setting, we omit a detailed description of it. All these attacks require that:
\begin{itemize}
\item $S$ is in the input data records, which is not always the case in realistic settings,
\item the records $X(i)$ being attacked belong to the target dataset.
\end{itemize}
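As a point of comparison, the data-imputation baseline can be approximated by a model that predicts $S$ from the non-sensitive attributes of the auxiliary data alone, never querying $targetmodel$. The sketch below is a minimal illustration under the assumption of tabular data and a scikit-learn classifier; the concrete imputation model of~\cite{jayaraman2022attribute} may differ.
\begin{verbatim}
# Hedged sketch of a data-imputation baseline: predict S from the
# non-sensitive attributes of the auxiliary data alone, without querying
# the target model. An attack that does not beat this baseline arguably
# reflects statistical inference rather than leakage from targetmodel.
from sklearn.linear_model import LogisticRegression

def train_imputation_baseline(x_aux, s_aux):
    """Fit an imputation model on the auxiliary (features, S) pairs."""
    return LogisticRegression().fit(x_aux, s_aux)

def impute_sensitive(imputer, x_records):
    """Guess S for new records using the non-sensitive features only."""
    return imputer.predict(x_records)
\end{verbatim}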
\noindent\textbf{\underline{Representation-based attacks}} exploit the fact that intermediate layer outputs or predictions are distinguishable for different values of the sensitive attribute~\cite{Song2020Overlearning,Mahajan2020DoesLS,malekzadeh2021honestbutcurious}. For instance, the distribution of $targetmodel\circ X$ for \textit{males} is different from the output prediction distribution for \textit{females}. We describe the existing attacks of this category below:
\begin{itemize}
\item \textbf{Song et al.~\cite{Song2020Overlearning} / Mahajan et al.~\cite{Mahajan2020DoesLS}} assume that $S$ is not in the input and that adv only observes $targetmodel\circ X$. adv trains an ML attack model $attackmodel$ to map the output predictions $targetmodel(X(i))$ to $S(i)$.
In other words, the statistic $\hat{S}$ used to infer $S$ is of the form $\hat{S} = 1_{[0.5,1]}\circ attackmodel\circ targetmodel\circ X$, where $attackmodel: [0,1]\rightarrow[0,1]$ (a sketch of this attack is given after this list).
\item \textbf{Malekzadeh et al.~\cite{malekzadeh2021honestbutcurious}} consider the setting where adv trains $targetmodel$ with a special loss function to explicitly encode information about $S(i)$ in $targetmodel(X(i))$, which makes it easier to extract the sensitive attribute during inference. In this setting, the model builder is malicious and actively introduces a ``backdoor''.
\end{itemize}
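The sketch below illustrates the representation-based attack described above: an attack model is trained on the auxiliary data to map $targetmodel$'s output predictions to the sensitive attribute, and $\hat{S}$ thresholds its score at $0.5$. The choice of a scikit-learn logistic regression as $attackmodel$ is an illustrative assumption.
\begin{verbatim}
# Hedged sketch of a representation-based AIA in the spirit of Song et al.
# and Mahajan et al.: attackmodel maps targetmodel's output predictions to
# S, and S_hat = 1_{[0.5,1]} o attackmodel o targetmodel o X thresholds
# its score at 0.5. Logistic regression as attackmodel is an assumption.
from sklearn.linear_model import LogisticRegression

def train_attack_model(target_model, x_aux, s_aux):
    """Fit attackmodel on (targetmodel(X(i)), S(i)) pairs from aux data."""
    preds = target_model.predict_proba(x_aux)  # observable: output predictions
    return LogisticRegression().fit(preds, s_aux)

def infer_sensitive(attack_model, target_model, x_records, threshold=0.5):
    """S_hat: threshold attackmodel's score on targetmodel's predictions."""
    preds = target_model.predict_proba(x_records)
    scores = attack_model.predict_proba(preds)[:, 1]
    return (scores >= threshold).astype(int)
\end{verbatim}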
Our work focuses on representation-based aia in a blackbox setting at inference time. We use Song et al.~\cite{Song2020Overlearning} and Mahajan et al.~\cite{Mahajan2020DoesLS} as our baselines.
These attacks do not account for class imbalance in the sensitive attribute, which is commonly present in data from real-world applications and could affect adv's attack success~\cite{classIMb1,classIMb2}.
In our evaluation, we consider an aia using an adaptive threshold which outperforms these baseline attacks (Section~\ref{sec:evalAIA}).
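As an illustration, one way such an adaptive threshold could be chosen is to select, on the auxiliary data, the score threshold that maximizes the balanced accuracy of the sensitive-attribute guess; the sketch below shows this heuristic, which is robust to class imbalance in $S$ but is not necessarily the exact procedure evaluated in Section~\ref{sec:evalAIA}.
\begin{verbatim}
# Hedged sketch of an adaptive threshold: instead of a fixed 0.5 cut-off,
# pick the threshold that maximizes balanced accuracy of the guess for S
# on the auxiliary data, which accounts for class imbalance in S.
# Illustrative heuristic only, not necessarily the procedure used in our
# evaluation.
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def select_adaptive_threshold(scores_aux, s_aux):
    """Pick the score threshold maximizing balanced accuracy on aux data."""
    best_t, best_bacc = 0.5, -np.inf
    for t in np.unique(scores_aux):
        bacc = balanced_accuracy_score(s_aux, (scores_aux >= t).astype(int))
        if bacc > best_bacc:
            best_t, best_bacc = t, bacc
    return best_t
\end{verbatim}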
Malekzadeh et al.~\cite{malekzadeh2021honestbutcurious} have a different threat model in which adv explicitly modifies the training to enhance the leakage of $S$; we do not assume such access to $targetmodel$ in a blackbox setting.
In addition, none of these attacks consider the possibility of inferring the sensitive attribute solely from the hard labels.
We summarize relevant prior work in Table~\ref{tab:summary}.