Deep Learning Part I Applied Math and Machine Learning Basics 1

We begin with general ideas from applied math. Next, we describe the fundamental goals of machine learning. This elementary framework is the basis for a broad variety of machine learning algorithms, including approaches to machine learning that are not deep.

Linear Algebra

Scalars, Vectors, Matrices and Tensors
  • Scalars: A scalar is just a single number, in contrast to most of the other objects studied in linear algebra, which are usually arrays of multiple numbers.
  • Vectors: A vector is an array of number. The numbers are arranged in order. We can think of vectors as identifying points in space, with each element giving the coordinate along a different axis.
  • Matrices: A matrix is a 2-D array of numbers, so each element is identified by two indices instead of just one.
  • Tensors: In some cases we will need an array with more than two axes. In general case, an array of numbers arranged on a regular grid with a variable number of axes is known as a tensor.
Linear Dependence and Span

To analyze how many solutions the equation has, we can think of the columns of A as specifying different directions we can travel from the origin, and determine how many ways there are of reaching b. In this view, each element of x specifies how far we should travel in each of the directions, with xi specifying how far to move in the direction of column i:

In general, this kind of operation is called a linear combination. Formally, a linear combination of some set of vectors {v1, ..., vn} is given by multiplying each vector vi by a corresponding scalar coefficient and adding the results:

The span of a set of vectors is the set of all points obtainable by linear combination of the original vectors. Determining whether Ax = b has a solution thus amounts to testing whether b is in the span of the columns of A. The particular span is known as the column space of the range of A.

This kind of redundancy is known as linear dependence. A set of vectors is linearly independent if no vector in the set is a linear combination of the other vectors. A square matrix with linearly dependent columns is known as singular.


In machine learning, we usually measure the size of vectors using a function called a norm. Norms are functions mapping vectors to non-negative values. The norm of a vector x measures the distance from the origin to the point x. More rigorously, a norm is any function f that satisfies the following properties:

  • f(x) = 0 => x = 0
  • f(x + y) <= f(x) + f(y)
  • ∀ α∈R, f(αx) = |α|f(x)

The L1 norm is commonly used in machine learning when the difference between zero and nonzero elements is very important. Every time an element of x moves away from 0 by ε, the L1 norm increases by ε.

One other norm that commonly arises in machine learning is the L∞ norm, also known as the max norm. This norm simplifies to the absolute value of the element with the largest magnitude in the vector.

Sometimes we may also wish to measure the size of a matrix. The most common way to do this is with the otherwise obscure Frobenius norm, which is analogous to the L2 norm of a vector.

Special Kinds of Matrices and Vectors

Diagonal matrices consist mostly of zeros and have non-zero entries only along the main diagonal. Diagonal matrices are of interest in part because multiplying by a diagonal matrix is very computationally efficient. Not all diagonal matrices need be square. Non-square diagonal matrices do not have inverses but it is still possible to multiply by them cheaply.

A symmetric matrix is any matrix that is equal to its own transpose.

A unit vector is a vector with unit norm.


One of the most widely used kinds of matrix decomposition is called eigendecomposition, in which we decompose a matrix into a set of eigenvectors and eigenvalues.

An eigenvector of a square matrix A is a non-zero vector v such that multiplication by A alter only the scale of v: Av = λv

The scalar λ is known as the eigenvalue corresponding to this eigenvector.

Suppose that a matix A has n linearly independent eigenvectors, {v1, ..., vn}, with corresponding eigenvalues{λ1, ..., λn}. We may concatenate all of the eigenvectors to form a matrix V with one eigenvector per column: V = [v1, ..., vn]. Likewise, we can concatenate the eigenvalues to form a vector Λ = [λ1, ..., λn]. The eigendecomposition of A is then given by: A = V diag(Λ) V^(-1)

Not every matrix can be decomposed into eigenvalues and eigenvectors. Specifically, every real symmetric matrix can be decomposed into an expression using only real-valued eigenvectors and eigenvalues: A = QΛQ(T), where Q is an orthogonal matrix composed of eigenvectors of A, and Λ is a diagonal matrix.

While any real symmetric matrix A is guaranteed to have an eigendecomposition, the eigendecomposition ma not be unique. If any two or more eigenvectors share the same eigenvalue, then any set of orthogonal vectors lyinng in their span are also eigenvectors with that eigenvalue. The eigendecomposition is unique only if all of the eigenvalues are unique.

The matrix is singular if and only if any of the eigenvalues are zero. The eigendecomposition of a real symmetric matrix can also be used to optimize quadratic expression of the form f(x) =xTAx subject to ||x|| = 1.

A matrix whose eigenvalues are all positive is called positive definite. A matrix whose eigenvalues are all positive or zero-valued is called positive semidefinite. If all eigenvalues are negative, the matrix is negative definite.

Singular Value Decomposition

The singular value decomposition provides another way to factorize a matrix, into singular vectors and singular values. Every real matrix has a singular value decomposition, but the same is not true of the eigenvalue decomposition.

The singular value decomposition is similar, except this time we will write A as a product of three matrices: A = UDV.

Suppose that A is an m×n matrix. Then U is defined to be an m×m matrix, D to be an m×n matrix, and V to be an n×n matrix. The matrices U and V are both defined to be orthogonal matrices. he matrix D is defined to be a diagonal matrix. Note that D is not necessarily square. The element along the diagonal of D are known as the singular values of matrix A.

The moore-Penrose Pseudoinverse

Matrix inversion is not defined for matrices that are not square. The Moore-Penrose pseudoinverse allow us to make some headway in these cases. The pseudoinverse of A is defined as a matrix:

Prictical algorithms for computing the pseudoinverse are not based on this definition, but rather the formula:

The Trace Operation

The race operator gives the sum of all of the diagonal entries of a matrix:

The trace operator is invariant to the transpose operator: Tr(A) = Tr(AT)

The trace of a square matrix composed of many factors is also invariant to moving the last factor into the first position: Tr(ABC) = Tr(BCA) = Tr(CAB)

The Determinant

The determinant is equal to the product of all the eigenvalues of the matrix.

Example: Principle Components Analysis

One simple machine learning algorithm, principal components analysis or PCA can be derived using only knowledge of basic linear algebra.

Suppose we have a collection of m points {x1, ..., xm} in Rn. Suppose we would like to apply lossy compression to these points.

One way we can encode these points is to represent a lower-dimensional version of them. For each point xi∈Rn, we will find a corresponding code vector ci∈Rl. If l is smaller than n, it will take less memor to store the code points than the original data. We will want to find some encoding function that produces the code for an input, f(x) = c, and a decoding function that produces the reconstructed input given its code, x ≈ g(f(x)).

PCA is defined by our choice of the decoding function. To make the decoder very simple, we choose to use matrix multiplication to map the code back into Rn. Let g(c) = Dc, where D∈Rn×l is the matrix defining the decoding. To keep the encoding problem easy, PCA constrains the columns of D to be orthogonal to each other.

To give the problem a unique solution, we constrain all of the collumns of D to have unit norm. The first thing we need to do is figure out how to generate the optimal code point c* for each input point x. One way to do this is to minimize the distance between the input point x and its reconstructions, g(c*). We can measure distance using a norm:

It can be simplified as:

We can now change the function being minimized again, to omit the first term. T omake further progress, we must substitute in the definition of g(c):

We can solve this optimization problem using vector calculus:

This makes the algorithm efficient: we can optimally encode x just using a matrix-vector operation. Using a further matrix multiplication, we can also define the PCA reconstruction operation: r(x) = g(f(x)) = DDTx

Next, we need to choose the encoding matrix D. To do so, we revisit the idea of minimizing the L2 distance between inputs and reconstructions. However since we will use the same matrix D todecode all of the points, we can no longer consider the points in isolation. Instead we must minimize the Frobenius norm of the matrix of errors computed over all dimensions and all points:

To derive the algorithm for finding D*, we will start by considering the case where l = 1. In this case, D is just a single vector, d. Simplifying D into d, the problem reduces to:

At this point, it can be helpful to rewrite the problem in terms of a single design matrix of examples, rather than as a sum over separate example vectors. Let X∈Rm×n be the matrix defined by stacking all of the vectors describing the points, such that we can now rewrite the problem as:

At this point, we re-introduce the constraint:

This optimization problem may be solved using eigendecomposition. Specifically, the optimal d is given by the eigenvector of XTX corresponding to the largest eigenvalue.

In the general case, where l > 1, the matrix D is given by the l eigenvectors corresponding to the largest eigenvalues.