AML-Lecture-Notes/L6_Linear_Classification.tex at main · MirkoDeVita98/AML-Lecture-Notes · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
\documentclass[twoside]{article}
\setlength{\oddsidemargin}{0.25 in}
\setlength{\evensidemargin}{-0.25 in}
\setlength{\topmargin}{-0.6 in}
\setlength{\textwidth}{6.5 in}
\setlength{\textheight}{8.5 in}
\setlength{\headsep}{0.75 in}
\setlength{\parindent}{0 in}
\setlength{\parskip}{0.1 in}
\newcommand{\eqdef}{:\mathrel{\mathop=}}
\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}

%
% ADD PACKAGES here:
%

\usepackage{amsmath,amsfonts,graphicx,dsfont,amssymb, cool, cancel, mathtools}
%
% The following commands set up the lecnum (lecture number)
% counter and make various numbering schemes work relative
% to the lecture number.
%
\newcounter{lecnum}
\renewcommand{\thepage}{\thelecnum-\arabic{page}}
\renewcommand{\thesection}{\thelecnum.\arabic{section}}
\renewcommand{\theequation}{\thelecnum.\arabic{equation}}
\renewcommand{\thefigure}{\thelecnum.\arabic{figure}}
\renewcommand{\thetable}{\thelecnum.\arabic{table}}
\newcommand{\indep}{\raisebox{0.05em}{\rotatebox[origin=c]{90}{$\models$}}}

%
% The following macro is used to generate the header.
%
\newcommand{\lecture}[4]{
   \pagestyle{myheadings}
   \thispagestyle{plain}
   \newpage
   \setcounter{lecnum}{#1}
   \setcounter{page}{1}
   \noindent
   \begin{center}
   \framebox{
      \vbox{\vspace{2mm}
    \hbox to 6.28in { {\bf Advanced Machine Learning
	\hfill Fall 2020} }
       \vspace{4mm}
       \hbox to 6.28in { {\Large \hfill Lecture #1: #2  \hfill} }
       \vspace{2mm}
       \hbox to 6.28in { {\it  #3 \hfill  #4} }
      \vspace{2mm}}
   }
   \end{center}
   \markboth{Lecture #1: #2}{Lecture #1: #2}

   {\bf Note}: {\it LaTeX template courtesy of UC Berkeley EECS dept.}

   {\bf Disclaimer}: {\it These notes are adapted from ETH's Advanced Machine Learning Course and "Pattern Recognition and Machine Learning, Chapter 4, Springer".}
   \vspace*{4mm}
}
%
% Convention for citations is authors' initials followed by the year.
% For example, to cite a paper by Leighton and Maggs you would type
% \cite{LM89}, and to cite a paper by Strassen you would type \cite{S69}.
% (To avoid bibliography problems, for now we redefine the \cite command.)
% Also commands that create a suitable format for the reference list.
\renewcommand{\cite}[1]{[#1]}
\def\beginrefs{\begin{list}%
        {[\arabic{equation}]}{\usecounter{equation}
         \setlength{\leftmargin}{2.0truecm}\setlength{\labelsep}{0.4truecm}%
         \setlength{\labelwidth}{1.6truecm}}}
\def\endrefs{\end{list}}
\def\bibentry#1{\item[\hbox{[#1]}]}

%Use this command for a figure; it puts a figure in wherever you want it.
%usage: \fig{NUMBER}{SPACE-IN-INCHES}{CAPTION}
\newcommand{\fig}[3]{
			\vspace{#2}
			\begin{center}
			Figure \thelecnum.#1:~#3
			\end{center}
	}
% Use these for theorems, lemmas, proofs, etc.
\newtheorem{theorem}{Theorem}[lecnum]
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{definition}[theorem]{Definition}
\newenvironment{proof}{{\bf Proof:}}{\hfill\rule{2mm}{2mm}}

% **** IF YOU WANT TO DEFINE ADDITIONAL MACROS FOR YOURSELF, PUT THEM HERE:

\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}

\begin{document}
%FILL IN THE RIGHT INFO.
%\lecture{**LECTURE-NUMBER**}{**DATE**}{**LECTURER**}{**SCRIBE**}
\lecture{6}{Linear methods for classification}{}{}
%\footnotetext{These notes are partially based on those of Nigel Mansell.}

% **** YOUR NOTES GO HERE:

% Some general latex examples and examples making use of the
% macros follow.
%**** IN GENERAL, BE BRIEF. LONG SCRIBE NOTES, NO MATTER HOW WELL WRITTEN,
%**** ARE NEVER READ BY ANYBODY.
The goal in classification is to take an input vector $\boldsymbol{x}$ and to assign it to one of $K$ discrete classes $\mathcal{C}_k$ where $k = 1,...,K$.\\
In the most common scenario, the classes are taken to be disjoint, so that each input is assigned to one and only one class. The input space is thereby divided into \textit{decision regions} whose boundaries are called \textit{decision boundaries} or \textit{decision surfaces}.\\
In this lecture we consider linear models for classification, by which we mean that the decision surfaces are linear functions of the input vector $\boldsymbol{x}$ and hence are defined by $(D - 1)$-dimensional hyperplanes within the $D$-dimensional input space. Data sets whose classes can be separated exactly by linear decision surfaces are said to be \textit{linearly separable}.\medskip

For probabilistic models, in the case of two-class problems,the target variable can be represented as $t \in \{0, 1\}$ such that $t = 1$ represents class $\mathcal{C}_1$ and $t = 0$ represents class $\mathcal{C}_2$.\\
We can interpret the value of $t$ as the probability that the class is $\mathcal{C}_1$, with the values of probability taking only the extreme values of $0$ and $1$.\\
For $K > 2$ classes, it is convenient to use a 1-of-K coding scheme in which $\boldsymbol{t}$ is a vector of length $K$ such that if the class is $\mathcal{C}_j$, then all elements $t_k$ of $\boldsymbol{t}$ are zero except element $t_j$, which takes the value $1$.\medskip

\textbf{Approaches to the classification problem:}
\begin{enumerate}
    \item \textit{discriminant functions} that directly assigns each vector $\boldsymbol{x}$ to a specific class.
    \item Modelling of a conditional probability distribution $p(\mathcal{C}_k| \boldsymbol{x})$ in an inference stage, and then subsequently using this distribution to make optimal decisions. There are two different approaches to determining the conditional probabilities $p(\mathcal{C}_k| \boldsymbol{x})$. One technique is to model them directly, for example by representing them as parametric models and then optimizing the parameters using a training set.\\
    Alternatively, we can adopt a \textit{generative} approach in which we model the class-conditional densities given by $p(\boldsymbol{x}|\mathcal{C}_k)$, together with the prior probabilities $p(\mathcal{C}_k)$ for the classes, and then we compute the required posterior probabilities using Bayes' theorem
    \begin{equation*}
        p(\mathcal{C}_k| \boldsymbol{x}) = \frac{p(\boldsymbol{x}|\mathcal{C}_k)p(\mathcal{C}_k)}{p(\boldsymbol{x})}
    \end{equation*}
\end{enumerate}
For classification problems we wish to predict discrete class labels, or more generally posterior probabilities that lie in the range $(0, 1)$. To achieve this, we transform the linear function of $\boldsymbol{w}$ using a non linear \textit{activation function} $f(\cdot)$ so that
\begin{equation}
    y(\boldsymbol{x}) = f(\boldsymbol{w^\intercal x} + w_0)
\end{equation}
Its inverse is called \textit{link function}. The decision surfaces correspond to $y(\boldsymbol{x}) = \text{constant}$, so that $\boldsymbol{w^\intercal x} + w_0 = \text{constant}$ and hence the decision surfaces are linear functions of $\boldsymbol{x}$, even if the function $f(\cdot)$ is non linear. For this reason, the class of models described by Eq.6.1 are called \textit{generalized linear models}.

\section{Discriminant functions}
A discriminant is a function that takes an input vector $\boldsymbol{x}$ and assigns it to one of $K$ classes, denoted $\mathcal{C}_k$.
\subsection{Two classes}
\textit{Linear discriminants}' decision surfaces are hyperplanes and their simplest representation can be obtained by taking a linear function of the input vector so that
\begin{equation*}
    y(\boldsymbol{x}) = \boldsymbol{w^\intercal x} + w_0
\end{equation*}
where $\boldsymbol{w}$ is called a \textit{weight} vector, and $w_0$ is a bias (not in the statistical sense). The negative of the bias is sometimes called a \textit{threshold}. An input vector $\boldsymbol{x}$ is assigned to class $\mathcal{C}_1$ if $y(\boldsymbol{x}) \geq 0$ and to class $\mathcal{C}_2$ otherwise. The corresponding decision boundary is defined by the realtion $y(\boldsymbol{x}) = 0$, which corresponds to a $(D - 1)$-dimensional hyperplane within the $D$-dimensional input space. Consider two points $\boldsymbol{x}_A$ and $\boldsymbol{x}_B$ both of which lie on the decision surface.\\
Because $y(\boldsymbol{x}_A) = y(\boldsymbol{x}_B) = 0$, we have $\boldsymbol{w}^\intercal(\boldsymbol{x}_A- \boldsymbol{x}_B) = 0$ and hence the vector $\boldsymbol{w}$ is orthogonal to every vector lying within the decision surface, and so $\boldsymbol{w}$ determines the orientation of the decision surface. Similarly, if $\boldsymbol{x}$ is a point on the decision surface, then $y(\boldsymbol{x}) = 0$, and so the normal distance from the origin to the decision surface is given by
\begin{equation*}
    \frac{\boldsymbol{w^\intercal x}}{\norm{\boldsymbol{w}}} = - \frac{w_o}{\boldsymbol{w}}
\end{equation*}
We therefore see that the bias parameter $w_0$ determines the location of the decision surface. Furthermore, we note that the value $y(\boldsymbol{x})$ gives a signed measure of the perpendicular distance $r$ of the point $\boldsymbol{x}$ from the decision surface. To see this consider an arbitrary point $\boldsymbol{x}$ and let $\boldsymbol{x}_\perp$ be its orthogonal projection onto the decision surface so that
\begin{equation*}
    \boldsymbol{x} = \boldsymbol{x}_\perp + r\frac{\boldsymbol{w^\intercal x}}{\norm{\boldsymbol{w}}}
\end{equation*}
Multiplying both sides of this result by $\boldsymbol{w}^\intercal$ and adding $w_0$, and making use of $y(\boldsymbol{x}) = \boldsymbol{w^\intercal x} + w_0$ and $y(\boldsymbol{x}_\perp) = \boldsymbol{w^\intercal x}_\perp + w_0$, we have
\begin{equation*}
    r = \frac{y(\boldsymbol{x})}{\norm{\boldsymbol{w}}}
\end{equation*}
This result is illustrated in Figure 6.1.

\begin{figure}[ht]
    \centering
    \includegraphics[width=0.40\textwidth]{img/discriminant.png}
    \caption{Illustration of the geometry of a linear discriminant function in two dimensions. The decision surface, shown in red, is perpendicular to $\boldsymbol{w}$, and its displacement from the origin is controlled by the bias parameter $w_0$. Also, the signed orthogonal distance of a general point $\boldsymbol{x}$ from the decision surface is given by $y(\boldsymbol{x}) / \norm{\boldsymbol{w}}$.}
\end{figure}
As with the linear regression models, it is sometimes convenient to use a more compact notation in which we introduce an additionally dummy 'input' value $x_0 = 1$ and then define $\Tilde{\boldsymbol{w}} = (w_0, \boldsymbol{w})$ and $\Tilde{\boldsymbol{x}} = (x_0, \boldsymbol{x})$ so that
\begin{equation*}
    y(\boldsymbol{x}) = \Tilde{\boldsymbol{w}}^\intercal\Tilde{\boldsymbol{x}}
\end{equation*}
In this case, the decision surface are $D$-dimensional hyperplanes passing through the origin of the $(D + 1)$-dimensional expanded input space.
\newpage
\subsection{Multiple classes}
Now consider the extension of linear discriminants to $K > 2$ classes. We might be tempted to build a $K$-class discriminant by combining a number of two-class discriminant functions. However, this leads to some serious difficulties.\\
Consider the use of $K -1$ classifiers each of which solves a two-class problem of separating points in a particular class $\mathcal{C}_k$ from points not in that class. This is known as a \textit{one-versus-the-rest} classifier.\medskip

\begin{figure}[h]
    \centering
    \includegraphics[width=0.50\textwidth]{img/multiclass.png}
    \caption{Attempting to construct a $K$ class discriminant from a set of two class discriminants leads to ambiguous regions, shown in green. On the left is an example involving the use of two discriminants designed to distinguish points in class $\mathcal{C}_k$ from points not in class $\mathcal{C}_k$. On the right is an example involving three discriminant functions each of which is used to separate a pair of classes $\mathcal{C}_k$ and $\mathcal{C}_j$.}
\end{figure}
The left hand example in Figure 6.2 shows an example involving three classes where this approach leads to regions of input space that are ambiguously classified.\medskip

An alternative is to introduce $K(K - 1) / 2$ binary discriminant functions, one for every possible pair of classes. This is known as a \textit{one-versus-one} classifier. Each point is then classified according to a majority vote amongst the discriminant functions. However, this too runs into the problem of ambiguous regions, as illustrated in the right-hand diagram of Figure 6.2.
\newpage
We can avoid these difficulties by considering a single $K$-class discriminant comprising $K$ linear functions of the form
\begin{equation*}
    y_k(\boldsymbol{x}) = \boldsymbol{w_k^\intercal x} + w_{k0}
\end{equation*}
and then assigning a point $\boldsymbol{x}$ to class $\mathcal{C}_k$ if $y_k(\boldsymbol{x}) > y_j(\boldsymbol{x})$ for all $j \neq k$. The decision boundary between class $\mathcal{C}_k$ and class $\mathcal{C}_j$ is therefore given by $y_k(\boldsymbol{x}) = y_j(\boldsymbol{x})$ and hence corresponds to a $(D-1)$-dimensional hyperplane defined by
\begin{equation*}
    (\boldsymbol{w}_k - \boldsymbol{w}_j)^\intercal\boldsymbol{x} + (w_{k0} - w_{j0}) = 0
\end{equation*}
This has the same form as the decision boundary for the two-class case and so analogous geometrical properties apply.

\subsection{Fisher’s linear discriminant}
One way to view a linear classification model is in terms of dimensionality reduction. Consider first the case of two classes, and suppose we take the $D$-dimensional input vector $\boldsymbol{x}$ and project it down to one dimension using
\begin{equation}
    y = \boldsymbol{w^\intercal x}
\end{equation}
If we place a threshold on $y$ and classify $y \geq -w_0$ as class $\mathcal{C}_1$, and otherwise class $\mathcal{C}_2$, then we obtain our standard linear classifier.\\
In general, the projection onto one dimension leads to a considerable loss of information, and classes that are well separated in the original $D$-dimensional space may become strongly overlapping in one dimension. However, by adjusting the components of the weight vector $\boldsymbol{w}$, we can select a projection that maximizes the class separation. To begin with, consider a two-class problem in which there are $N_1$ points of $\mathcal{C}_1$ and $N_2$ points of class $\mathcal{C}_2$, so that the mean vectors of the two classes are given by
\begin{equation*}
    \boldsymbol{m}_1 = \frac{1}{N_1}\sum\limits_{n \in \mathcal{C}_1} \boldsymbol{x}_n
    \hspace{20pt}
    \boldsymbol{m}_2 = \frac{1}{N_2}\sum\limits_{n \in \mathcal{C}_2} \boldsymbol{x}_n
\end{equation*}
The simplest measure of the separation of the classes when projected onto $\boldsymbol{w}$, is the separation of the projected class means. This suggests that we might choose $\boldsymbol{w}$ so as to maximize
\begin{equation*}
    m_2 - m_1 = \boldsymbol{w}^\intercal(\boldsymbol{m}_2 - \boldsymbol{m}_1)
\end{equation*}
where
\begin{equation}
    m_k = \boldsymbol{w}^\intercal\boldsymbol{m}_k
\end{equation}
is the mean of the projected data from class $\mathcal{C}_k$. However, this expression can be arbitrarily large simply by increasing the magnitude of $\boldsymbol{w}$. To solve this problem we could constrain $\boldsymbol{w}$ to have unit length, so that $\sum_i w_i^2 = 1$. Using a Lagrange multiplier to perform the constrained maximization, we then find that $\boldsymbol{w} \propto (\boldsymbol{m}_2 - \boldsymbol{m}_1)$. \\
There is still a problem with this approach, however, as illustrated in Figure 6.3. This shows two classes that are well separated in the original two- dimensional space $(x_1, x_2)$ but that have considerable overlap when projected onto the line joining their means. This difficulty arises from the strongly non-diagonal covariances of the class distributions. The idea proposed by Fisher is to maximize a function that will give a large separation between the projected class means while also giving a small variance within each class, thereby minimizing the class overlap.
\newpage
\begin{figure}[h]
    \centering
    \includegraphics[width=0.70\textwidth]{img/fisher.png}
    \caption{The left plot shows samples from two classes (depicted in red and blue) along with the histograms resulting from projection onto the line joining the class means. Note that there is considerable class overlap in the projected space. The right plot shows the corresponding projection based on the Fisher linear discriminant, showing the greatly improved class separation.}
\end{figure}
The projection formula 6.2 transforms the set of labelled data points in $\boldsymbol{x}$ into a labelled set in the one-dimensional space $y$. The within-class variance of the transformed data from $\mathcal{C}_k$ is therefore given by
\begin{equation}
    s_k^2 = \sum\limits_{n \in \mathcal{C}_k}(y_n - m_k)^2
\end{equation}
where $y_n = \boldsymbol{w^\intercal x}_n$. We can define the total within-class variance for the whole data set to be simply $s_1^2 + s_2^2$. The fisher criterion is defined to be the ratio of the between-class variance to the within-class variance and is given by
\begin{equation*}
    J(\boldsymbol{w}) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2}
\end{equation*}
We can make the dependence on $\boldsymbol{w}$ explicit by using 6.2, 6.3 and 6.4 to rewrite the Fisher criterion in the form
\begin{equation}
    J(\boldsymbol{w}) = \frac{\boldsymbol{w}^\intercal\textbf{S}_\text{B}\boldsymbol{w}}{\boldsymbol{w}^\intercal\textbf{S}_\text{W}\boldsymbol{w}}
\end{equation}
where $\textbf{S}_\text{B}$ is the between-class covariance matrix and is given by
\begin{equation}
    \textbf{S}_\text{B} = (\boldsymbol{m}_2 - \boldsymbol{m}_1)(\boldsymbol{m}_2 - \boldsymbol{m}_1)^\intercal
\end{equation}
and $\textbf{S}_\text{W}$ is the total within-class covariance matrix, given by
\begin{equation*}
    \textbf{S}_\text{W} = \sum\limits_{n \in \mathcal{C}_1}(\boldsymbol{x}_n - \boldsymbol{m}_1)(\boldsymbol{x}_n - \boldsymbol{m}_1)^\intercal + \sum\limits_{n \in \mathcal{C}_2}(\boldsymbol{x}_n - \boldsymbol{m}_2)(\boldsymbol{x}_n - \boldsymbol{m}_2)^\intercal
\end{equation*}
Differentiating 6.5 with respect to $\boldsymbol{w}$, we find that $J(\boldsymbol{w})$ is maximized when
\begin{equation}
    (\boldsymbol{w}^\intercal\textbf{S}_\text{B}\boldsymbol{w})\textbf{S}_\text{W}\boldsymbol{w} = (\boldsymbol{w}^\intercal\textbf{S}_\text{W}\boldsymbol{w})\textbf{S}_\text{B}\boldsymbol{w}
\end{equation}
From 6.6, we see that $\textbf{S}_\text{B}\boldsymbol{w}$ is always in the direction of $(\boldsymbol{m}_2 - \boldsymbol{m}_1)$. Furthermore we do not care about the magnitude of $\boldsymbol{w}$, only its direction, and so we can drop the scalar factors $(\boldsymbol{w}^\intercal\textbf{S}_\text{B}\boldsymbol{w})$ and $(\boldsymbol{w}^\intercal\textbf{S}_\text{W}\boldsymbol{w})$. Multiplying both sides of 6.7 by $\textbf{S}_\text{W}^{-1}$ we then obtain
\begin{equation*}
    \boldsymbol{w} \propto \textbf{S}_\text{W}^{-1}(\boldsymbol{m}_2 - \boldsymbol{m}_1)
\end{equation*}
The projected data can be subsequently used to construct a discriminant, by choosing a threshold $y_0$ so that we classify a new point as belonging to $\mathcal{C}_1$ if $y(\boldsymbol{x}) \geq y_0$ and classify it as belonging to $\mathcal{C}_2$ otherwise.\\
We can also model the class-conditional densities $p(y|\mathcal{c}_K)$ using Gaussian distributions and then use MLE to find the parameters of the distributions. Some justification for the Gaussian assumption comes from the central limit theorem by noting that $y = \boldsymbol{w^\intercal x}$ is the sum of a set of random variables.
\newpage
\subsection{The Perceptron algorithm}
Another example of a linear discriminant model is the perceptron that corresponds to a two-class model in which the input vector $\boldsymbol{x}$ is first transformed using a fixed nonlinear transformation to give a feature vector $\boldsymbol{\phi}(\boldsymbol{x})$, and this is then used to construct a generalized linear model of the form
\begin{equation}
    y(\boldsymbol{x}) = f(\boldsymbol{w^\intercal}\boldsymbol{\phi}(\boldsymbol{x}))
\end{equation}
where the non-linear activation function $f(\cdot)$ is given by a step function of the form
\begin{equation*}
f(a) =
\begin{cases}
    +1,\hspace{5pt} a \geq 0\\
    -1,\hspace{5pt} a < 0
\end{cases}
\end{equation*}
The vector $\boldsymbol{\phi}(\boldsymbol{x})$ will typically include bias component $\phi_0(\boldsymbol{x}) = 1$. The target values $t$ are +1 for class $\mathcal{C}_1$ and -1 for class $\mathcal{C}_2$.\\
The algorithm used to determine the parameters $\boldsymbol{w}$ of the perceptron can most easily be motivated by error function minimization. A natural choice of error function would be the total number of misclassified patterns. However, this does not lead to a simple algorithm because the error is a piecewise constant function of $\boldsymbol{w}$, with discontinuities wherever a change in $\boldsymbol{w}$ causes the decision boundary to move across one of the data points. Methods based on changing $\boldsymbol{w}$ using the gradient of the error function cannot then be applied, because the gradient is zero almost everywhere.\medskip

We therefore consider an alternative error function known as the \textit{perceptron criterion}. To derive this, we note that we are seeking a weight vector $\boldsymbol{w}$ such that patterns $\boldsymbol{x}_n$ in class $\mathcal{C}_1$ will have $\boldsymbol{w}^\intercal\boldsymbol{\phi}(\boldsymbol{x}_n) > 0$, whereas patterns $\boldsymbol{x}_n$ in class $\mathcal{C}_2$ have $\boldsymbol{w}^\intercal\boldsymbol{\phi}(\boldsymbol{x}_n) < 0$. Using the $t \in \{-1,+1\}$ target coding scheme it follows that we would like all patterns to satisfy $\boldsymbol{w}^\intercal\boldsymbol{\phi}(\boldsymbol{x}_n)t_n > 0$. The perceptron criterion associates zero error with any pattern that is correctly classified, whereas for a misclassified pattern $\boldsymbol{x}_n$ it tries to minimize the quantity $-\boldsymbol{w}^\intercal\boldsymbol{\phi}(\boldsymbol{x}_n)t_n$. The perceptron criterion is therefore given by
\begin{equation*}
    E_P(\boldsymbol{w}) = - \sum\limits_{n \in \mathcal{M}}\boldsymbol{w}^\intercal
    \boldsymbol{\phi}_nt_n
\end{equation*}
where $\mathcal{M}$ denotes the set of all misclassified patterns. The contribution to the error associated with a particular misclassified pattern is a linear function of $\boldsymbol{w}$ in regions of $\boldsymbol{w}$ space where the pattern is misclassified and zero in regions where it is correctly classified. The total error function is therefore piecewise linear.\\
We now apply the stochastic gradient descent algorithm to this error function. The change in the weight vector $\boldsymbol{w}$ is then given by
\begin{equation}
    \boldsymbol{w}^{\tau + 1} = \boldsymbol{w}^\tau - \eta \nabla E_P(\boldsymbol{w}) = \boldsymbol{w}^\tau + \eta\boldsymbol{\phi}_nt_n
\end{equation}
where $\eta$ is the learning rate parameter and $\tau$ is an integer that indexes the steps of the algorithm. Because the perceptron function $y(\boldsymbol{x}, \boldsymbol{w})$ is unchanged if we multiply $\boldsymbol{w}$ by a constant, we can set the learning rate $\eta$ equal to 1 without loss of generality. Note that, as the weight vector evolves during training, the set of patterns that are misclassified will change.\medskip

The algorithm consists in cycling through the training patterns, and for each pattern $\boldsymbol{x}_n$, we evaluate the perceptron function 6.8. If the pattern is correctly classified, then the weight vector remains unchanged, whereas if it is incorrectly classified, then for class $\mathcal{C}_1$ we add the vector $\boldsymbol{\phi}(\boldsymbol{x}_n)$ onto the current estimate of the weight vector $\boldsymbol{w}$ while for class $\mathcal{C}_2$ we subtract the vector $\boldsymbol{\phi}(\boldsymbol{x}_n)$ from $\boldsymbol{w}$.\\
If we consider the effect of a single update in the perceptron learning algorithm, we see that the contribution to the error from a misclassified pattern will be reduced because from 6.9 we have
\begin{equation*}
    -\boldsymbol{w}^{(\tau + 1)\intercal}\boldsymbol{\phi}_nt_n = -\boldsymbol{w}^{(\tau)\intercal}\boldsymbol{\phi}_nt_n - (\boldsymbol{\phi}_nt_n)^\intercal\boldsymbol{\phi}_nt_n < -\boldsymbol{w}^{(\tau)\intercal}\boldsymbol{\phi}_nt_n
\end{equation*}
where we have set $\eta = 1$ and made us of $\norm{\boldsymbol{\phi}_nt_n}^2 > 0$.\\
Of course, this does not imply that the contribution to the error function from the other misclassified patterns will have been reduced. Furthermore, the change in weight vector may have caused some previously correctly classified patterns to become misclassified. Thus the perceptron learning rule is not guaranteed to reduce the total error function at each stage.\\
However, the perceptron convergence theorem states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps.\\
Aside from difficulties with the learning algorithm, the perceptron does not provide probabilistic outputs, nor does it generalize readily to $K > 2$ classes. The most important limitation, however, arises from the fact that it is based on linear combinations of fixed basis functions.

\section{Probabilistic Generative Models}
We turn next to a probabilistic view of classification and show how models with linear decision boundaries arise from simple assumptions about the distribution of the data.\\
Here we shall adopt a generative approach in which we model the class-conditional densities $p(\boldsymbol{x}|\mathcal{C}_k)$, as well as the class priors $p(\mathcal{C}_k)$, and then use these to compute posterior probabilities $p(\mathcal{C}_k | \boldsymbol{x})$ through Bayes' theorem.\\
In the case of two classes, the posterior probability for class $\mathcal{C}_1$ can be written as
\begin{equation}
\begin{aligned}
    p(\mathcal{C}_1|\boldsymbol{x}) &= \frac{p(\boldsymbol{x}|\mathcal{C}_1)p(\mathcal{C}_1)}{p(\boldsymbol{x}|\mathcal{C}_1)p(\mathcal{C}_1) + p(\boldsymbol{x}|\mathcal{C}_2)p(\mathcal{C}_2)}\\
    &= \frac{1}{1 + \exp{(-a)}} = \sigma(a)
\end{aligned}
\end{equation}
where we have defined
\begin{equation}
    a = \ln{\frac{p(\boldsymbol{x}|\mathcal{C}_1)p(\mathcal{C}_1)}{p(\boldsymbol{x}|\mathcal{C}_2)p(\mathcal{C}_2)}}
\end{equation}
and $\sigma(a)$ is the \textit{logistic sigmoid} function. It satisfies the symmetry property $\sigma(-a) =1 - \sigma(a)$ and its inverse is given by
\begin{equation*}
    a = \ln{(\frac{\sigma}{1 - \sigma})}
\end{equation*}
and is know as the \textit{logit} function. It represents the log of the ratio of probabilities $\ln{[p(\mathcal{C}_1|\boldsymbol{x})/p(\mathcal{C}_2|\boldsymbol{x})]}$ for the two classes, also know as the \textit{log odds}.\medskip

Let us assume that the class-conditional densities are Gaussian and then explore the resulting form for the posterior probabilities. To start with, we shall assume that all classes share the same covariance matrix. Thus the density for class $\mathcal{C}_k$ is given by
\begin{equation*}
    p(\boldsymbol{x}|\mathcal{C}_k) = \frac{1}{(2\pi)^{D/2}}\frac{1}{|\boldsymbol{\Sigma}|^{1/2}}\exp{\{-\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu}_k)^\intercal\boldsymbol{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu}_k)\}}
\end{equation*}
Consider the case of two classes. From 6.10 and 6.11, we have
\begin{equation*}
    p(\mathcal{C}_1|\boldsymbol{x}) = \sigma(\boldsymbol{w ^\intercal x} + w_0)
\end{equation*}
where we have defined
\begin{equation*}
    \boldsymbol{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)
\end{equation*}
\begin{equation*}
    w_0 = -\frac{1}{2}\boldsymbol{\mu}_1^\intercal \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^\intercal \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 + \ln{\frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)}}
\end{equation*}
We see that the quadratic terms in $\boldsymbol{x}$ from the exponents of the Gaussian densities have cancelled (due to the assumption of common covariance matrices) leading to a linear function of $\boldsymbol{x}$ in the argument of the logistic sigmoid. The resulting decision boundaries correspond to surfaces along which the posterior probabilities $p(\mathcal{C}_k|\boldsymbol{x})$ are constant and so will be given by linear functions of $\boldsymbol{x}$, and therefore the decision boundaries are linear in input space. The prior probabilities $p(\mathcal{C}_k)$ enter only through the bias parameter $w_0$ so that changes in the priors have the effect of making parallel shifts of the decision boundary and more generally of the parallel contours of constant posterior probability.\medskip

If we relax the assumption of a shared covariance matrix and allow each class-conditional density $p(\boldsymbol{x}|\mathcal{C}_k)$ to have its own covariance matrix $\boldsymbol{\Sigma}_k$, then the earlier cancellations will no longer occur, and we will obtain quadratic functions of $\boldsymbol{x}$, giving rise to a \textit{quadratic discriminant}
\begin{equation*}
\begin{aligned}
    \ln{\frac{p(\boldsymbol{x}|\mathcal{C}_1)p(\mathcal{C}_1)}{p(\boldsymbol{x}|\mathcal{C}_2)p(\mathcal{C}_2)}} &= \ln{p(\boldsymbol{x}|\mathcal{C}_1)} - \ln{p(\boldsymbol{x}|\mathcal{C}_2)} + \ln{\frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)}}\\
    &= \boldsymbol{x^\intercal W x} + \boldsymbol{w^\intercal x} + w_0
\end{aligned}
\end{equation*}
where the \textit{generalized quadratic discriminant} is $p(\mathcal{C}_k|\boldsymbol{x}) = \sigma(\boldsymbol{x^\intercal W x} + \boldsymbol{w^\intercal x} + w_0)$.
\subsection{MLE Solution}
Once we have specified a parametric functional form for the class-conditional densities $p(\boldsymbol{x}|\mathcal{C}_k)$, we can then determine the values of the parameters, together with the prior class probabilities $p(\mathcal{C}_k)$, using maximum likelihood. This requires a data set comprising observations of $\boldsymbol{X}$ along with their corresponding class labels.\\
In the case of two classes, each having a Gaussian class-conditional density with a shared covariance matrix, and suppose we have a data set $\{\boldsymbol{x}_n, t_n\}$ where $n = 1,..., N$. Here $t_n = 1$ denotes class $\mathcal{C}_1$ and $t_n = 0$ denotes class $\mathcal{C}_2$. We denote the prior class probability $p(\mathcal{C}_1) = \pi$, so that $p(\mathcal{C}_2) = 1 - \pi$. For a data point $\boldsymbol{x}_n$ from class $\mathcal{C}_1$, we have $t_n = 1$ and hence
\begin{equation*}
    p(\boldsymbol{x}_n, \mathcal{C}_1) = p(\mathcal{C}_1)p(\boldsymbol{x}_n| \mathcal{C}_1)=\pi\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_1, \boldsymbol{\Sigma})
\end{equation*}
Similarly for class $\mathcal{C}_2$, we have $t_n = 0$ and hence
\begin{equation*}
     p(\boldsymbol{x}_n, \mathcal{C}_2) = p(\mathcal{C}_2)p(\boldsymbol{x}_n| \mathcal{C}_2)=(1 -\pi)\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_2, \boldsymbol{\Sigma})
\end{equation*}
Thus the likelihood function is given by
\begin{equation*}

\end{equation*}
\end{document}