# Neural Networks

Making predictions with feedforward neural networks.

## Artificial Neuron

• Neuron pre-activation
• $a(x) = b + \Sigma_{i}\omega_{i}x_{i} = b + w^{T}x$
• Neuron activation
• $h(x) = g(a(x)) = g(b+\Sigma_i\omega_{i}x_{i})$
• where $w$ are the connection weights
• $b$ is the neuron bias
• $g(.)$ is called the activation function The neuron output range determined by $g(.)$, bias $b$ only changes the position of the riff.

## Capacity of Neural Network

Universal approximation theorem: a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units.

This is a good result, but it doesn't mean there is a learning algorithm that can find the necessary parameter values.

## Multilayer Neural Network

A multilayer neural network could have $L$ hidden layers: # Activation Function

## Sigmoid

• Squashes the neuron's pre-activation between 0 and 1
• Always positive
• Bounded
• Strictly increasing
$$g(a) = sigm(a) = \frac{1}{1 + \exp(-a)}$$

## Tanh

• Squashes the neuron's pre-activation between -1 and 1
• Can be positive or negative
• Bounded
• Strictly increasing
$$g(a) = tanh(a) = \frac{\exp(a) - \exp(-a)}{\exp(a) + \exp(-a)} = \frac{\exp(2a) - 1}{\exp{2a} + 1}$$

## Rectified Linear Unit

• Bounded below by 0 (always non-negative)
• Not upper bounded
• Strictly increasing
• Tends to give neurons with sparse activities
$$g(a) = ReLU(a) = \max(0, a)$$

## Softmax

• For multi-class classification:
• We need multiple outputs
• We would like to estimate the conditional probability $p(y=c|x)$
• We use the softmax activation function at the output:
• strictly positive
• sums to one
$$o(a) = softmax(a) = [\frac{\exp(a_1)}{\Sigma_{c}\exp(a_c)} ... \frac{\exp(a_c)}{\Sigma_{c}\exp(a_c)}]^{T}$$

# Flow Graph

Forward propagation can be represented as an acyclic flow graph. It's a nice way of implementing forward propagation in a modular way. Each box could be an object with an fprop method, that computes the value of the box given its parents. Calling the fprop method of each box in the right order yield forward propagation.

# Optimization

## Empirical Risk Minimization

Framework to design learning algorithms:

$${\arg\min}{\theta} \frac{1}{T} \sum{t} l(f(x^{(t)}; \theta), y^{(t)}) + \lambda\Omega(\theta)$$

where $l(f(x^{(t)}; \theta), y^{(t)})$ is a loss function, $\Omega(\theta)$ is a regularizer. Ideally, we would optimize classification error, but it's not smooth. Loss function is a surrogate for what we truly should optimize.

• Algorithm that performs updates after each example
• initialize $\theta$
• for N epochs
• for each training example $(x^{(t)}, y^{(t)})$
• $\Delta = -\triangledown_{\theta}l(f(x^{(t)}; \theta), y^{(t)}) - \lambda\triangledown_{\theta}\Omega(\theta)$
• $\theta \leftarrow \theta + \alpha \Delta$.
• To apply this algorithm to neural network training, we need
• the loss function: $l(f(x^{(t)}; \theta), y^{(t)})$
• a procedure to compute the parameter gradients, $\Delta = -\triangledown_{\theta}l(f(x^{(t)}; \theta), y^{(t)}) - \lambda\triangledown_{\theta}\Omega(\theta)$
• the regularizer, $\Omega(\theta)$
• initialization method for $\theta$

## Loss Function for Classification

To frame as minimization, we minimize the negative log-likelihood (cross-entropy)

$$l(f(x), y) = - \sum_{c} 1_{(y=c)}\log f(x){c} = -\log f(x){y}$$

We take the log to simplify for numerical stability and math simplicity.

## Backpropogation

Partial derivative: $g'(a) = g(a)(1-g(a))$

Partial derivative: $g'(a) = 1-g(a)^2$

Partial derivative: $g'(a) = 1_{a>0}$

### Flow Graph

• Each object also has a bprop method
• it computes the gradient of the loss with respect to each parent
• fprop depends on the fprop of a box's parent, while bprop depends the bprop of a box's children
• By calling bprop in the reverse order, we get backpropagation
• only need to reach the parameters ### Regularization

L2 Regularization:

$$\Omega(\theta) = \sum_{k}\sum_{i}\sum_{j}(W^{(k)}{i, j})^{2} = \sum{k}||W^{(k)}]]^{2}_{F}$$

Gradient: $\triangledown_{W^{(k)}}\Omega(\theta) = 2W^{(k)}$
Only applied on weights, not on biases.Weight decay is used to adjust the effect of the complex model to loss function. If the weight decay is big, the loss function of complex model is big, too.

# Initialization

• For bias
• initialize all to 0
• For weights
• Can't initialize weights to 0 with tanh activation
• all gradients would then to be 0
• Can't initialize all weights to the same value
• all hidden units in a layer will always behave the same
• need to break symmetry
• Recipe: sample $W_{i,j}^{(k)}$ from $U[-b, b]$ where $b=\frac{\sqrt{6}}{\sqrt{H_k+H_{k-1}}}$, $H_k$ is the size of $h^{(k)}(x)$.
• the idea is to sample around 0 but break symmetry
• other values of b could work well

# Model Selection

• To search for the best configuration of the hyper-parameters:
• Grid search
• specify a set of values you want to test for each hyper-parameter
• try all possible configurations of these values
• Random search
• specify a distribution over the values of each hyper-parameters
• sample independently each hyper-parameter to get configuration

## Early Stopping

To select the number of epochs, stop training when validation set error increases

# More on Optimizer

## Mini-batch, Momentum

• Can update based on a mini-batch of example
• the gradient is the average regularized loss for the mini-batch
• can give a more accurate estimate of the risk gradient
• can leverage matrix/matrix operations, which are more efficient
• Can use a exponential average of previous gradients
• can get through plateaus more quickly, by "gaining momentum".

$$\bar{\triangledown}^{(t)}\theta = \triangledown\theta l (f(x^{(t)}), y^{(t)})+\beta\bar{\triangledown}^{(t-1)}_\theta$$

All update with adaptive learning rates (one learning rate per parameter)

learning rates are scaled by the square root of the cumulative sum of squared gradients
$$\gamma^{(t)} = \gamma^{(t-1)} + (\triangledown_\theta l (f(x^{(t)}), y^{(t)})^2$$

$$\bar{\triangledown}^{(t)}\theta = \frac{\triangledown\theta l (f(x^{(t)}), y^{(t)})}{\sqrt{\gamma^{(t)} + \epsilon}}$$

### RMSProp

instead of cumulative sum, use exponential moving average
$$\gamma^{(t)} = \beta\gamma^{(t-1)} + (1-\beta)(\triangledown_\theta l (f(x^{(t)}), y^{(t)})^2$$

$$\bar{\triangledown}^{(t)}\theta = \frac{\triangledown\theta l (f(x^{(t)}), y^{(t)})}{\sqrt{\gamma^{(t)} + \epsilon}}$$

essentially combines RMSProp with momentum

To debug your implementation of fprop/bprop, you can compare with a finite-difference approximation of the gradient

$$\frac{\partial f(x)}{\partial x} \approx \frac{f(x + \epsilon) - f(x-\epsilon)}{2\epsilon}$$

• $f(x)$ would be the loss
• $x$ would be a parameter
• $f(x+\epsilon)$ would be the loss if you add $\epsilon$ to the parameter
• $f(x-\epsilon)$ would be the loss if you subtract $\epsilon$ to the parameter

# Deep Learning

## Training

• Training is hard, optimization is harder.
• Saturated units block gradient propagation
• Overfitting
• exploring a space of complex functions
• deep nets usually have lot of parameters
• unsupervised pre-training
• dropout training
• Underfitting
• better optimization method
• Depending on the problem, or the other situation will tend to dominate

### Unsupervised pre-training

initialize hidden layers using unsupervised learning. Force network to represent latent structure of input distribution, encourage hidden layers to encode that structure.

This is a harder task than supervised learning.

• We will use a greedy, layer-wise procedure
• train one layer at a time, from first to last, with unsupervised criterion
• fix the parameters of previous hidden layers
• precious layers viewed as feature extraction

#### Auto Encoder

Feed-forward neural network trained to reproduce its input at the output layer.

#### Fine-Tuning

• once all layers are pretrained
• train the whole network using supervised learning
• Supervised learning is performed as in a regular feed-forward network
• forward propagation, backpropagation and update
• All parameters are tuned for the supervised task at hand
• representation is adjusted to be more discriminative

### Dropout

Idea: cripple neural network by removing hidden units stochastically. Could use a different dropout probability, but 0.5 usually works well.

## Batch Normalization

• Normalizing the inputs will speed up training
• could normalization also be useful at the level of the hidden layer
• each unit's pre-activation is normalized
• during training, mean and stddev is computed for each minibatch
• backpropagation takes into account the normalization
• at test time, the global mean/stddev is used PPT