Computer Vision
 The structure of images is complex
 invariances
 scale
 translation
 cropping
 dilation
 homogeneity
 Perceptual sensitivity
 color
 edges
 orientations
 invariances
 Extracting semantics is challenging
 occlusion
 deformation
 illumination
 viewpoint
 object pose
Convolutional Network
 Translation invariance
 A convolutional kernel is a spatially localized receptive field whose weights are shared across spatial locations.
Padding: how to handle the image edges; 'valid' applies the kernel only where it fully overlaps the input, while 'same' pads so the output size matches the input
Stride: the number of pixels the kernel shifts between applications
The number of model parameters is independent of image size.
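A minimal sketch of these ideas, assuming PyTorch (the channel counts and image sizes are arbitrary illustrative choices):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)   # (batch, channels, height, width)

valid = nn.Conv2d(3, 16, kernel_size=3, padding=0)              # 'valid' padding
same = nn.Conv2d(3, 16, kernel_size=3, padding=1)               # 'same' padding for a 3x3 kernel
strided = nn.Conv2d(3, 16, kernel_size=3, padding=1, stride=2)  # shifts 2 pixels per step

print(valid(x).shape)    # torch.Size([1, 16, 30, 30])
print(same(x).shape)     # torch.Size([1, 16, 32, 32])
print(strided(x).shape)  # torch.Size([1, 16, 16, 16])

# Parameter count is unchanged at any input resolution:
x_big = torch.randn(1, 3, 256, 256)
print(same(x_big).shape)                          # torch.Size([1, 16, 256, 256])
print(sum(p.numel() for p in same.parameters()))  # 3*3*3*16 + 16 = 448
```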

How many parameters are in a single layer?
 $(\text{filter width} \times \text{filter height}) \times \text{input depth} \times \text{output depth}$
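 For example, a $3 \times 3$ kernel mapping 64 input channels to 128 output channels has $3 \times 3 \times 64 \times 128 = 73{,}728$ weights (plus 128 biases, if used).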

How much computation is in a single layer?
 $(\text{filter width} \times \text{filter height}) \times \text{input depth} \times \text{output depth} \times \frac{\text{input width}}{\text{stride}} \times \frac{\text{input height}}{\text{stride}}$
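A small sketch that evaluates both formulas and cross-checks the parameter count, assuming PyTorch (the layer sizes are arbitrary):

```python
import torch.nn as nn

kw, kh = 3, 3               # filter width and height
cin, cout = 64, 128         # input and output depth
w, h, stride = 224, 224, 1  # input width/height and stride

params = kw * kh * cin * cout                  # weights (biases would add cout)
macs = params * (w // stride) * (h // stride)  # multiply-accumulates, per the formula above

layer = nn.Conv2d(cin, cout, kernel_size=(kh, kw), stride=stride, bias=False)
assert params == sum(p.numel() for p in layer.parameters())
print(f"{params:,} parameters, {macs:,} MACs")  # 73,728 parameters, 3,699,376,128 MACs
```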

Scaling to high-resolution images
 Computational demand grows quadratically with image width and height
 Spatial pooling builds invariance across spatial dimensions (see the sketch after this list)
 Regularization mitigates overfitting (e.g. weight decay, dropout)
 Normalization empirically accelerates training and yields better models
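A minimal pooling sketch, assuming PyTorch (the shapes are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64)                # (batch, channels, height, width)
pool = nn.MaxPool2d(kernel_size=2, stride=2)  # halves each spatial dimension
print(pool(x).shape)                          # torch.Size([1, 16, 32, 32])
# A 3x3 conv after pooling covers twice the original image extent per tap,
# and its compute drops 4x, since MACs scale with width x height.
```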
Convolutional layers account for most of the computation, but fully connected layers hold most of the parameters.
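A sketch illustrating this split, assuming PyTorch (the layer sizes are a made-up toy pairing, not any specific model):

```python
import torch.nn as nn

conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)  # applied to a 56x56 feature map
fc = nn.Linear(128 * 7 * 7, 4096)                    # flattened features -> hidden units

conv_params = sum(p.numel() for p in conv.parameters())
fc_params = sum(p.numel() for p in fc.parameters())
conv_macs = 3 * 3 * 64 * 128 * 56 * 56  # each weight is reused at every spatial location
fc_macs = 128 * 7 * 7 * 4096            # each weight is used exactly once

print(f"conv: {conv_params:,} params, {conv_macs:,} MACs")  # conv: 73,856 params, 231,211,008 MACs
print(f"fc: {fc_params:,} params, {fc_macs:,} MACs")        # fc: 25,694,208 params, 25,690,112 MACs
```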

Trends in network architecture
 Normalization methods are an important ingredient for achieving state-of-the-art performance
 Many variations exist, none of which is strictly biological
 Almost all vision models employ some form of normalization throughout the network
 Deeper and larger networks lead to better predictive performance
 Multiscale architectures provide great predictive performance while minimizing computational demand

Normalization styles
 Batch Norm
 Layer Norm
 Instance Norm
 Group Norm
 Calculate the mean $\mu$ and variance $\sigma^2$ within each group of channels and normalize
 $\mu = \frac{1}{|G|}\sum_{c\in G}x_c$
 $\sigma^2=\frac{1}{|G|}\sum_{c\in G}(x_c - \mu)^2$
 We can then form $\bar{x}_c = \frac{x_c - \mu}{\sqrt{\sigma^2 + \epsilon}}$
 Learn a scale and shift $(\gamma, \beta)$ for each layer as parameters, letting the network restore any mean and variance
 $y_c = \gamma \bar{x}_c + \beta$
Normalization stabilizes activations during training
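A minimal sketch of the group-norm computation above, checked against PyTorch's nn.GroupNorm (the group count and tensor shape are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(2, 8, 4, 4)  # (batch, channels, height, width)
num_groups, eps = 4, 1e-5

# Each group's channels (and all spatial positions) share one mean and variance.
g = x.view(2, num_groups, -1)
mu = g.mean(dim=-1, keepdim=True)
var = g.var(dim=-1, unbiased=False, keepdim=True)
x_bar = ((g - mu) / torch.sqrt(var + eps)).view_as(x)

# gamma and beta are learned; they initialize to 1 and 0, so y = x_bar here.
gn = nn.GroupNorm(num_groups, 8, eps=eps)
print(torch.allclose(x_bar, gn(x), atol=1e-6))  # True
```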