Computer Vision

  • The structure of images is complex
    • invariances
      • scale
      • translation
      • cropping
      • dilation
      • homogeneity
    • Perceptual sensitivity
      • color
      • edges
      • orientations
  • Extracting semantics is challenging
    • occlusion
    • deformation
    • illumination
    • viewpoint
    • object pose

Convolutional Network

  • Translation invariance
    • Convolutional kernels are spatially localized receptive fields whose weights are shared across spatial locations, so a translated input produces a correspondingly translated output.

Padding: what to do at the image edges: "valid" (no padding; the kernel stays entirely inside the image) or "same" (pad so the output spatial size matches the input).

Stride: the number of pixels to shift when applying the kernel.
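The two padding modes and the stride determine the output spatial size. A minimal sketch (the helper name `conv_output_size` is hypothetical, not a library API):

```python
import math

def conv_output_size(input_size, kernel_size, stride=1, padding="valid"):
    """Spatial output size of a convolution along one dimension."""
    if padding == "valid":
        # no padding: the kernel must fit entirely inside the input
        return (input_size - kernel_size) // stride + 1
    elif padding == "same":
        # pad so that output size is ceil(input / stride)
        return math.ceil(input_size / stride)
    raise ValueError(f"unknown padding mode: {padding}")

print(conv_output_size(32, 3, stride=1, padding="valid"))  # 30
print(conv_output_size(32, 3, stride=1, padding="same"))   # 32
print(conv_output_size(32, 3, stride=2, padding="same"))   # 16
```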

The number of model parameters is independent of image size.

  • How many parameters are in a single layer

    • $(\text{filter width} \times \text{filter height}) \times (\text{input depth}) \times (\text{output depth})$
  • How much computational cost in a single layer

    • $(\text{filter width} \times \text{filter height}) \times (\text{input depth}) \times (\text{output depth}) \times (\text{input width} \div \text{stride}) \times (\text{input height} \div \text{stride})$
  • Scaling to high resolution images

    • Computational demand grows quadratically with image width and height
      • Spatial pooling builds invariance across spatial dimensions
      • Regularization mitigates overfitting (e.g. weight decay, dropout)
      • Normalization empirically accelerates training and improves model quality

Convolutional layers account for most of the computation, but fully connected layers hold most of the parameters.
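The parameter and cost formulas above can be checked directly. A small sketch (helper names are illustrative; counts are multiply-accumulates, ignoring padding effects):

```python
def conv_params(kh, kw, c_in, c_out, bias=True):
    # one kh x kw x c_in kernel per output channel, plus optional bias
    return kh * kw * c_in * c_out + (c_out if bias else 0)

def conv_flops(kh, kw, c_in, c_out, h_in, w_in, stride=1):
    # multiply-accumulates per output position, times number of positions
    out_h, out_w = h_in // stride, w_in // stride
    return kh * kw * c_in * c_out * out_h * out_w

# 3x3 conv, 64 -> 128 channels, applied to a 224x224 feature map
print(conv_params(3, 3, 64, 128))           # 73856 (with bias)
print(conv_flops(3, 3, 64, 128, 224, 224))  # 3699376128, i.e. ~3.7e9
```

Note that the parameter count never involves the image size, while the cost scales with output width times output height, which is why compute concentrates in the convolutional layers.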

  • Trends in network architecture

    • Normalization methods are an important ingredient for achieving state-of-the-art performance
      • Many variations, none of which is strictly biological
      • Almost all vision models employ some form of normalization throughout a network representation
    • Deeper and larger networks lead to better predictive performance
    • Multi-scale architectures provide strong predictive performance while minimizing computational demand.
  • Normalization styles

    • Batch Norm
    • Layer Norm
    • Instance Norm
    • Group Norm
      • Calculate the mean $\mu$ and variance $\sigma^2$ within each group of channels and normalize
        • $\mu = \frac{1}{|G|}\sum_{i\in G}x_i$
        • $\sigma^2=\frac{1}{|G|}\sum_{i\in G}(x_i - \mu)^2$
        • The normalized activation is $\bar{x}_c = \frac{x_c - \mu}{\sqrt{\sigma^2 + \epsilon}}$
      • Learn a per-channel scale and shift $(\gamma, \beta)$ as parameters
        • $y_c = \gamma \bar{x_c} + \beta$
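The Group Norm equations above can be sketched in a few lines of numpy (shape conventions assumed: `x` is `(N, C, H, W)`, `gamma`/`beta` are per-channel):

```python
import numpy as np

def group_norm(x, num_groups, gamma, beta, eps=1e-5):
    """Group Norm sketch: normalize within groups of channels per sample."""
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mu = g.mean(axis=(2, 3, 4), keepdims=True)   # per-group mean
    var = g.var(axis=(2, 3, 4), keepdims=True)   # per-group variance
    g = (g - mu) / np.sqrt(var + eps)            # normalize
    x_bar = g.reshape(n, c, h, w)
    # learned per-channel scale and shift
    return gamma.reshape(1, c, 1, 1) * x_bar + beta.reshape(1, c, 1, 1)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 4, 4))
y = group_norm(x, num_groups=4, gamma=np.ones(8), beta=np.zeros(8))
# with gamma=1, beta=0, each group is ~zero-mean and ~unit-variance
```

With `num_groups=1` this reduces to Layer Norm over (C, H, W); with `num_groups=C` it reduces to Instance Norm.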

Normalization stabilizes activations during training