Columbia University

ECBM E4040 Neural Networks and Deep Learning

Yi-Pei Chan

Assignment 1 - Basic ML Classifiers

In this task, you are going to implement two classifiers and apply them to the CIFAR-10 dataset:

(1) Logistic regression classifier

(2) Softmax classifier.

Load CIFAR-10 data

CIFAR-10 is a widely used dataset that contains 60,000 color images of size 32x32, divided into 10 classes with 6,000 images per class. There are 50,000 training images and 10,000 test images, from which we will create our training, validation, and test sets.

See https://www.cs.toronto.edu/~kriz/cifar.html.

10-class dataset

First, we load the raw CIFAR-10 data and create a 10-class dataset.
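If you want a rough picture of what this loading step does, the sketch below is a minimal example. It assumes the Python-version batch files from the CIFAR-10 website have been extracted into a local directory; the helper names (`load_cifar10_batch`, `load_cifar10`) are illustrative and the course's own data utilities may differ.

```python
import os
import pickle
import numpy as np

def load_cifar10_batch(filename):
    """Load a single CIFAR-10 batch file (Python pickle format)."""
    with open(filename, 'rb') as f:
        batch = pickle.load(f, encoding='latin1')
    # Each row is a 3072-dim vector; reshape to 32x32 RGB images.
    X = batch['data'].reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1).astype('float')
    y = np.array(batch['labels'])
    return X, y

def load_cifar10(root):
    """Load the five training batches plus the test batch."""
    xs, ys = [], []
    for i in range(1, 6):
        X, y = load_cifar10_batch(os.path.join(root, 'data_batch_%d' % i))
        xs.append(X)
        ys.append(y)
    X_train, y_train = np.concatenate(xs), np.concatenate(ys)
    X_test, y_test = load_cifar10_batch(os.path.join(root, 'test_batch'))
    return X_train, y_train, X_test, y_test
```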

2-class dataset

Next, in order to run the logistic regression experiment, we subsample the 10-class dataset down to a 2-class dataset.
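Which two classes are kept is determined by the assignment code; purely as an illustration, the sketch below builds a binary dataset from classes 0 and 1 with a boolean mask and relabels them as 0/1 (`subsample_two_classes` is a hypothetical helper name).

```python
import numpy as np

def subsample_two_classes(X, y, class_a=0, class_b=1):
    """Keep only the examples labeled class_a or class_b,
    and relabel them as 0 and 1 for binary logistic regression."""
    mask = (y == class_a) | (y == class_b)
    X_bin = X[mask]
    y_bin = (y[mask] == class_b).astype(int)  # class_a -> 0, class_b -> 1
    return X_bin, y_bin

# Example: build a two-class dataset from the 10-class training split.
# X_train_bin, y_train_bin = subsample_two_classes(X_train, y_train, 0, 1)
```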

Part 1: Logistic Regression Classifier

In this part, you are going to implement a logistic regression classifier.

Let’s assume a training dataset of images $x_i \in R^D$, each associated with a label $y_i$. Here $i=1 \dots N$ and $y_i \in 1 \dots K$. That is, we have N examples (each with a dimensionality D) and K distinct categories.

We will now define the score function $f: R^D \to R^K$ that maps the raw image pixels to class scores: $$f(x_i; W, b)=W x_i + b$$ where $W$ is of size $K \times D$ and $b$ is of size $K \times 1$.

Here we will use the bias trick to represent the two parameters $W, b$ as one, by extending the vector $x_i$ with one additional dimension that always holds the constant 1 (a default bias dimension). With this extra dimension, the score function simplifies to a single matrix multiply: $$f = f(x_i;W)=W x_i$$
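In code, the bias trick amounts to flattening each image and appending a constant-1 column. The sketch below assumes the `X_train` array from the loading sketch above and uses the row-vector convention (scores computed as `X @ W` with `W` of shape `(D+1, K)`), which is just the transpose of the $f = W x_i$ form used in the notes.

```python
import numpy as np

# Flatten each 32x32x3 image into a 3072-dim row vector.
X_flat = X_train.reshape(X_train.shape[0], -1)               # shape (N, D)

# Append a constant-1 column: the bias b is now folded into W.
X_aug = np.hstack([X_flat, np.ones((X_flat.shape[0], 1))])   # shape (N, D+1)

# All scores at once (row-vector convention, transpose of f = W x_i):
# scores = X_aug @ W, with W of shape (D+1, K).
```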

Brief introduction to logistic regression classifier

A logistic regression classifier solves a binary classification problem ($K=2$). It has only two classes (0, 1) and computes the probability of class 1 as:

$$ P(y=1 | x ; w)=\frac{1}{1+e^{-f}}=\sigma\left(f\right) $$

Since the probabilities of class 1 and 0 sum to one, the probability for class 0 is:

$$ P(y=0 | x ; w)=1-P(y=1 | x ; w) $$

Hence, an example is classified as a positive example ($y = 1$) if $\sigma\left(f\right)>0.5$, or equivalently if the score $f>0$. Training then maximizes the log likelihood of this probability, i.e., it minimizes the negative log likelihood, which we use as the loss. You can convince yourself that this simplifies to:

$$ L_{i}=-\sum_{j} (y_{i j} \log \left(\sigma\left(f_{j}\right)\right)+\left(1-y_{i j}\right) \log \left(1-\sigma\left(f_{j}\right)\right)) $$

where the labels $y_{ij}$ are assumed to be either 1 (positive) or 0 (negative), and $\sigma(\cdot)$ is the sigmoid function. The expression above can look scary but the gradient on $f$ is in fact extremely simple and intuitive:

$$ \frac{\partial L_{i}}{\partial f}=-\sum_{j} \left(y_{i j}-\sigma\left(f_{j}\right)\right) $$

$$ \frac{\partial L_{i}}{\partial W}=-\sum_{j} \left(y_{i j}-\sigma\left(f_{j}\right)\right) x_{i} $$


You have to implement logistic regression in two ways: naive and vectorized. We provide the verification code for you to check if your code works properly.
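As a rough guide to the vectorized version, the sketch below computes the binary cross-entropy loss and its gradient for the whole batch at once. The function name, the `reg` parameter, and the shapes (`W` of shape `(D,)`, `X` of shape `(N, D)` with the bias column already appended) are assumptions for illustration; follow the signatures in the assignment's starter code.

```python
import numpy as np

def sigmoid(f):
    # Logistic function; for very negative f, np.exp may warn about
    # overflow, but the result still saturates correctly to 0.
    return 1.0 / (1.0 + np.exp(-f))

def logistic_loss_vectorized(W, X, y, reg):
    """Binary cross-entropy loss and gradient, averaged over the batch.

    W: weights, shape (D,)   (bias folded in via the bias trick)
    X: data, shape (N, D)
    y: labels in {0, 1}, shape (N,)
    reg: L2 regularization strength.
    """
    N = X.shape[0]
    f = X.dot(W)                         # scores, shape (N,)
    p = sigmoid(f)                       # P(y = 1 | x), shape (N,)
    eps = 1e-12                          # avoid log(0)
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    loss += reg * np.sum(W * W)          # L2 regularization term
    dW = X.T.dot(p - y) / N + 2 * reg * W
    return loss, dW
```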

Verification code:

Part 2: Softmax classifier

The Softmax classifier is a generalization of the logistic regression classifier to multiple classes.

In the Softmax classifier, the mapping $f(x_i;W)=W x_i$ stays unchanged, but we now interpret the obtained scores as unnormalized log probabilities for each class and use a cross-entropy loss of the form: $$L_i= - \log (\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}).$$

The cross-entropy between a “true” distribution $p$ and an estimated distribution $q$ is defined as: $$H(p, q)=- \sum_x p(x) \log q(x).$$

Now, let's rewrite the expression for $L_i$: $$L_i= - \sum_k p_{i,k} \log (\frac{e^{f_k}}{\sum_j e^{f_j}})$$ where $p_i=[0, \dots,1, \dots, 0]$ is a $1 \times K$ vector containing a single 1 at the $y_i$-th position, and $p_{i,k}=p_i[k]$.

Note: Numerical stability. When you write code to compute the Softmax function in practice, the intermediate terms $e^{f_{y_i}}$ and $\sum_j e^{f_j}$ may be very large due to the exponentials. Dividing large numbers can be numerically unstable, so it is important to use the normalization trick. Notice that if we multiply both the top and the bottom of the fraction by a constant $C$ and push $C$ inside the exponent, we get the following (mathematically equivalent) expression: $$\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}=\frac{Ce^{f_{y_i}}}{C\sum_j e^{f_j}}=\frac{e^{f_{y_i}+\log C}}{\sum_j e^{f_j+\log C}}.$$

A common choice for $C$ is to set it to $\log C= -\max_j f_j$.
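The effect of this trick is easy to see numerically. The snippet below (using a hypothetical `stable_softmax` helper) subtracts the row maximum before exponentiating, which is exactly the choice $\log C = -\max_j f_j$:

```python
import numpy as np

def stable_softmax(f):
    """Softmax over the last axis with the max-subtraction trick."""
    shifted = f - np.max(f, axis=-1, keepdims=True)   # log C = -max_j f_j
    exp_f = np.exp(shifted)
    return exp_f / np.sum(exp_f, axis=-1, keepdims=True)

# Without the shift, np.exp overflows; with it, the result is unchanged.
f = np.array([123.0, 456.0, 789.0])
print(stable_softmax(f))   # ~[0., 0., 1.], no overflow warnings
```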

In most cases, you also need to consider a bias term $b$ of length K. However, in this experiment, since a bias dimension has already been added to $X$, you can ignore it.

Softmax derivations (in matrix representation)

$$\nabla_{W_k} L= - \frac{1}{N} \sum_i x_i^T\left(p_{i,k} - P_k\right) + 2 \lambda W_k,$$

where $P_k= \frac{e^{f_k}}{\sum_j e^{f_j}}$.

You have to implement the softmax layer in three ways. For the first two implementations, we provide the verification code for you to check if your implementation is correct.

Do not forget the $L_2$ regularization term in the loss.
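For orientation, a vectorized version consistent with the gradient above might look like the sketch below. It stores `W` as `(D, K)` so that scores are `X.dot(W)` (the transpose of the $K \times D$ convention in the notes); the function name and the `reg` interface are illustrative assumptions rather than the assignment's required signature.

```python
import numpy as np

def softmax_loss_vectorized(W, X, y, reg):
    """Cross-entropy loss and gradient for a linear softmax classifier.

    W: weights, shape (D, K)
    X: data, shape (N, D)     (bias column already appended)
    y: integer labels in [0, K), shape (N,)
    reg: L2 regularization strength.
    """
    N = X.shape[0]
    f = X.dot(W)                                        # scores, (N, K)
    f -= np.max(f, axis=1, keepdims=True)               # stability shift
    exp_f = np.exp(f)
    P = exp_f / np.sum(exp_f, axis=1, keepdims=True)    # probabilities, (N, K)

    # Average cross-entropy over the batch plus the L2 penalty on W.
    loss = -np.mean(np.log(P[np.arange(N), y])) + reg * np.sum(W * W)

    # dL/dW: (P - one_hot(y)) back-propagated through the linear map.
    P[np.arange(N), y] -= 1.0
    dW = X.T.dot(P) / N + 2 * reg * W
    return loss, dW
```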

Verification code:

Part 3: Train your classifiers

Now you can start training your classifiers. We are going to use a gradient descent algorithm for training, which differs from the usual logistic regression training procedure.

In the training section, you are asked to implement the stochastic gradient descent (SGD) optimization method. Pseudocode for SGD is shown below.
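As an informal illustration only (not the assignment's pseudocode), a minimal minibatch SGD loop could look like the sketch below; the hyperparameter values and the `loss_fn(W, X_batch, y_batch, reg)` interface are assumptions.

```python
import numpy as np

def train_sgd(W, X, y, loss_fn, lr=1e-7, reg=2.5e4,
              num_iters=1500, batch_size=200, seed=0):
    """Minibatch SGD: sample a batch, evaluate loss/gradient, step downhill.

    loss_fn(W, X_batch, y_batch, reg) must return (loss, dW),
    e.g. one of the loss functions sketched earlier.
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    loss_history = []
    for it in range(num_iters):
        idx = rng.choice(N, batch_size, replace=True)   # sample a minibatch
        loss, dW = loss_fn(W, X[idx], y[idx], reg)
        W -= lr * dW                                    # gradient step
        loss_history.append(loss)
    return W, loss_history
```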

Train Logistic Regression + Stochastic Gradient Descent (SGD)

Verification code:

Plot Logistic Regression SGD Loss Curve

Train Softmax + SGD

Verification code:

Plot Softmax SGD Loss Curve