background/proba.tex


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124


La théorie des probabilités est profondément liée à l'apprentissage automatique.
Les propriétés de modèles comme la confidentialité différentielle, les définitions d'équité, les métriques d'utilité, etc. que nous aborderons en Section~\ref{sec:background-ml} s'écrivent en terme de probabilité.
Ainsi nous présentons les notions de probabilités et de théorie de la mesure que nous allons utiliser.
A la manière de la Section~\ref{sec:background-set}, notre présentation à principalement le but de fixer les objets que nous utiliserons dans les prochaines sections et non pas d'être un cours complet. 
Si le.la lecteur.rice souhaite en apprendre plus sur la théorie de la mesure nous le renvoyons vers les notes de cours de Thierry Gallay de l'université Joseph Fourrier~\cite{mesure}.
Si il.elle souhait explorer plus en avant les probabilités il.elle pourra consulter les notes de cours de Jean-François Le Gall de l'École Normale Supérieur de Paris~\cite{proba}.

Soit $A$ un ensemble.
Nous appelons une tribu que nous notons $\mathcal{A}$ un sous ensemble de $\mathcal{P}(A)$ qui contient $\emptyset$ et $A$, qui est stable par complémentaire et qui est stable par union dénombrable d'éléments de $\mathcal{A}$.
Nous disons alors que $(A,\mathcal{A})$ est un espace mesurable.
Soit maintenant $A\subset\mathcal{P}(A)$, nous appelons $\sigma(A)$ la plus petite tribu pour l'intersection qui contient tous les éléments de $A$.

Nous appelons mesure, une fonction $d$ :$\mathcal{A}$ $\rightarrow$ $[0,+\infty]$ telle que $d(\emptyset) = 0$ et $d\left(\bigcup_{i\in \mathbb{N}} A_i\right) = \sum_{i\in \mathbb{N}}d(A_i)$ pour tout $(A_1, A_2, \cdots) \in \mathcal{A}^\mathbb{N} $ avec $\forall (i,j) A_i\cap A_j = \emptyset$.
Nous disons alors que $(A, \mathcal{A}, d)$ est un espace mesuré.
Pour un espace mesurable $(A,\mathcal{P}(A))$, la mesure de Dirac est la mesure telle que pour $a\in A$ 
\begin{equation*}
    \delta_a : \left\{
        \begin{matrix}
        \mathcal{P}(A)\rightarrow \{0,1\}\\
        B\mapsto\left\{
            \begin{matrix}
                1&\text{si}&a\in B\\
                0&\text{sinon}&
            \end{matrix}
            \right.
        \end{matrix}
        \right.
\end{equation*}

Soit $(A, \mathcal{A}, d)$ et $(B, \mathcal{B}, e)$ deux espaces mesurés.
Nous définissons alors 
\begin{equation*}
    \mathcal{A}\otimes\mathcal{B} = \sigma\left(
    \left\{
        a\times b \mid a\in\mathcal{A}\wedge b\in\mathcal{B}
    \right\}\right)
\end{equation*}
et de plus la mesure produit de $d$ et $e$, que l'on note $d\otimes e$, est l'unique mesure telle que 
\begin{equation*}
    \forall a\in\mathcal{A}\forall b\in\mathcal{B}~d\otimes e(a\times b) = d(a)\cdot e(b)
\end{equation*}
Alors l'espace $(A\times B,\mathcal{A}\otimes\mathcal{B},d\otimes e)$ est un espace mesuré.

Nous appelons fonction mesurable, une fonction de $A$ à $B$ telle que  $\forall b\in\mathcal{B}$~$f^{-1}(b)\in\mathcal{A}$.
Nous notons alors $f:(A, \mathcal{A})\rightarrow (B, \mathcal{B})$ ou $f:(A, \mathcal{A},d)\rightarrow (B, \mathcal{B})$
Nous définissons la mesure image de $f$ par $d$, que nous notons $d_f$, par l'expression suivante : 
\begin{equation}
    d_f:
    \left\{
        \begin{matrix}
            \mathcal{B}\rightarrow [0,+\infty]\\
            b\mapsto d\left(f^{-1}(b)\right)

        \end{matrix}
    \right.
\end{equation}

\begin{definition}{Intégrale}
    Soient $(E,\mathcal{E},\mu)$ un espace mesuré.
    Pour une fonction $f=\sum_{i\in I}\alpha_i 1_{A_i}$, nous dirons étagé, 
    avec $\{A_i\mid i\in I\} \subset \mathcal{E}$ et $\alpha_i\in\mathbb{R}^+$.
    alors $\int_E f d\mu= \sum_{i\in I}\alpha_i \mu(A_i)$.

    Soit $g$ un fonction mesurable de $(E,\mathcal{E},\mu)$ dans $\mathbb{R}^+$, alors
    \begin{equation*}
        \int_{E}gd\mu = sup\left\{\int_E fd\mu\mid f~\text{est étagé}\wedge f\leq g\right\}
    \end{equation*}

    Enfin dans le cas générale de $h$ une fonction mesurable de $(E,\mathcal{E},\mu)$ dans $\mathbb{R}$, alors
    si $\int_E |h|d\mu\in\mathbb{R}$ on pose
    \begin{equation*}
        \int_E hd\mu = \int_E max(h,0)d\mu - \int_E max(-h,0)d\mu
    \end{equation*}
\end{definition}


Dans le cas particulier où $d(A) = 1$, nous appelons $d$ une mesure de probabilité
 $(A,\mathcal{A},d)$ est alors un espace probabilisé et les fonctions mesurables sur cet espace sont appelés variables aléatoires.
Pour un évènement $a\in\mathcal{A}$ tel que $d(a)\neq 0$, la probabilité conditionnelle est 
\begin{equation*}
    d(\square\mid a):\left\{
        \begin{matrix}
            \mathcal{A}\rightarrow [0,1]\\
            b\mapsto \frac{d(a\cap b)}{d(a)}
        \end{matrix}
        \right.
\end{equation*}
La loi de probabilité d'une variable aléatoire $f$ sur $(X,\mathcal{X})$ est la mesure image de $f$ sur $d$.
Nous dirons que deux variables aléatoire $f$ et $g$ sont indépendantes si et seulement si la loi de la variables aléatoire $h:\omega\mapsto (f(\omega),g(\omega))$ est la mesure produit de la loi de $f$ et $g$.

De plus, dans le cas des variables aléatoires, il est courant d'écrire $\{f\in A\}$ pour $f^{-1}(A)$ et $\{f=a\}$ pour $f^{-1}(\{a\})$.

\begin{definition}{Espérance}
    Pour une variable aléatoire $X$, on définit l'espérance de $X$ par la formule suivante.
    \begin{equation*}
        E(X) = \int_{\Omega}X(\omega)dP(\omega)
    \end{equation*}
\end{definition}


%Having introduced probability theory, we explicit the relation with the ML theory described previously.
%Let $I$ a finite set, $\mathcal{X}$, $\mathcal{S}$ and $\mathcal{Y}$ the sets of features, sensitive attribute and label.
%Let $d:I\rightarrow \mathcal{X}\times\mathcal{S}\times\mathcal{Y}$ a dataset.
%Let $\#$ be the measure on $(I,\mathcal{P}(I))$ which maps to every $a$ in $\mathcal{P}(I)$ the number of elements of $a$.
%Let $P:\mathcal{P}(I)\rightarrow [0,1]$, $a\mapsto \frac{\#(a)}{\#(I)}$.
%Then $(I, \mathcal{P}(I), P)$ is a probability space.
%On this space we can define the following random variables:
%\begin{itemize}
%    \item $X:I\rightarrow \mathcal{X},~i\mapsto (d(i))_0$
%    \item $S:I\rightarrow \mathcal{S},~i\mapsto (d(i))_1$
%    \item $Y:I\rightarrow \mathcal{Y},~i\mapsto (d(i))_2$
%\end{itemize}
%MWhere for a vector $u$, $u_j$ refers to the $j$th element of $u$.

%From there we can define various random variables that will be useful in the rest of the paper.
%For instance $\hat{Y}=f\circ X$ is random variable that represents the prediction of a trained machine learning model $f$. 
%We can use it to write the accuracy in a compact way: $P(\hat{Y}=Y)$ by using the well accepted abuse of notations that for a random variable $A$ and an event $a$, 
%$\{A\in a\} = \{i\in\mathcal{P}(I)~|~A(i)\in a\} = A^{-1}(a)$.
%The accuracy is a reliable metric of a trained model's utility when $P(Y=0) = P(Y=1) = \frac{1}{2}$ but not so much when there is unbalance in $Y$. 
%To take into account an eventual unbalanced distribution of the labels, we will consider the balanced accuracy : 
%$\frac{P(\hat{Y}=0|Y=0) + P(\hat{Y}=1|Y=1)}{2}$.
%
%Finally in the context of attribute inference attack at inference time, we define the random variable $\hat{S}=a\circ \hat{Y}$ where here $a$ is a machine learning model trained to infer sensitive attribute from model's output.