P4DA NumPy Basics 2

Universal Functions: Fast Element-wise Array Functions

A universal function, or ufunc, is a function that performs element-wise operations on data in ndarrays. Many ufuncs are simple element-wise transformations, like sqrt or exp:

import numpy as np

arr = np.arange(10)

np.sqrt(arr)  
#Output: 
#array([ 0.        ,  1.        ,  1.41421356,  1.73205081,  2.        ,
#        2.23606798,  2.44948974,  2.64575131,  2.82842712,  3.        ])

np.exp(arr)  
#Output:
#array([  1.00000000e+00,   2.71828183e+00,   7.38905610e+00,
#         2.00855369e+01,   5.45981500e+01,   1.48413159e+02,
#         4.03428793e+02,   1.09663316e+03,   2.98095799e+03,
#         8.10308393e+03])

These are referred to as unary ufuncs. Others, such as add or maximum, take two arrays (and are therefore called binary ufuncs) and return a single array as the result:

from numpy.random import randn

x = randn(8)  
y = randn(8)

np.maximum(x, y)  
#Output:
#array([ 0.59209796,  0.47299782,  0.25238924, -0.1111911 ,  0.67203337,
#        0.31578221, -0.4526654 ,  0.45740994])
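
Because randn produces different values on every run, the same idea is easier to check with fixed inputs. The following is a minimal sketch using made-up arrays a and b (they are not part of the example above) to show two binary ufuncs side by side:

```python
import numpy as np

# Hypothetical fixed inputs so the results are reproducible
a = np.array([1.0, 5.0, -2.0, 0.5])
b = np.array([3.0, 2.0, -4.0, 0.5])

# Element-wise maximum: each output element is the larger of the pair
largest = np.maximum(a, b)   # array([ 3. ,  5. , -2. ,  0.5])

# Element-wise sum; equivalent to writing a + b
total = np.add(a, b)         # array([ 4.,  7., -6.,  1.])
```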

While not common, a ufunc can return multiple arrays. modf is one example, a vectorized version of the built-in Python divmod: it returns the fractional and integral parts of a floating point array:

arr = randn(7)*5

np.modf(arr)  
#Output:
#(array([-0.71885654, -0.29785083,  0.93198306, -0.79554338, -0.36055619,
#         0.65198947, -0.758033  ]),
# array([-0., -0.,  1., -5., -8.,  3., -2.]))
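
To see exactly how modf splits each value, here is a small sketch with hand-picked inputs. The array vals is an assumption made for illustration; its values are exactly representable in binary so the results are easy to verify:

```python
import numpy as np

# Hypothetical values chosen to be exactly representable in binary
vals = np.array([1.5, -2.25, 3.0])

# modf returns a tuple: (fractional parts, integral parts)
frac, whole = np.modf(vals)
# frac  -> array([ 0.5 , -0.25,  0.  ])
# whole -> array([ 1., -2.,  3.])

# Both parts carry the sign of the input, so they add back to the original
restored = frac + whole
```

Note one difference from the built-in divmod for negative inputs: divmod(-2.25, 1) returns (-3.0, 0.75) because it rounds toward negative infinity, whereas modf keeps the sign on both parts.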

More about ufuncs.

Data Processing Using Arrays

As a simple example, suppose we wished to evaluate the function sqrt(x^2 + y^2) across a regular grid of values. The np.meshgrid function takes two 1D arrays and produces two 2D matrices corresponding to all pairs of (x, y) in the two arrays:

points = np.arange(-5, 5, 0.01) # 1000 equally spaced points  
xs, ys = np.meshgrid(points, points)

ys  
#Output:
#array([[-5.  , -5.  , -5.  , ..., -5.  , -5.  , -5.  ],
#       [-4.99, -4.99, -4.99, ..., -4.99, -4.99, -4.99],
#       [-4.98, -4.98, -4.98, ..., -4.98, -4.98, -4.98],
#       ..., 
#       [ 4.97,  4.97,  4.97, ...,  4.97,  4.97,  4.97],
#       [ 4.98,  4.98,  4.98, ...,  4.98,  4.98,  4.98],
#       [ 4.99,  4.99,  4.99, ...,  4.99,  4.99,  4.99]])
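
With 1000 points per axis the structure of the two grids is hard to see. A minimal sketch with two tiny, made-up input arrays (x and y here are illustrative, not from the example above) makes the layout visible:

```python
import numpy as np

# Hypothetical small inputs: 3 x-coordinates and 2 y-coordinates
x = np.array([0, 1, 2])
y = np.array([10, 20])

gx, gy = np.meshgrid(x, y)
# gx repeats x along each row; gy repeats y down each column
# gx -> [[0, 1, 2],
#        [0, 1, 2]]
# gy -> [[10, 10, 10],
#        [20, 20, 20]]

# Both grids have shape (len(y), len(x)) under the default 'xy' indexing
grid_shape = gx.shape  # (2, 3)
```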

Now evaluating the function is a simple matter of writing the same expression you would write with two points:

import matplotlib.pyplot as plt  
z = np.sqrt(xs ** 2 + ys ** 2)

z  
#Output: 
#array([[ 7.07106781,  7.06400028,  7.05693985, ...,  7.04988652, 7.05693985, 7.06400028],
#      [ 7.06400028,  7.05692568,  7.04985815, ...,  7.04279774, 7.04985815,  7.05692568],
#      [ 7.05693985,  7.04985815,  7.04278354, ...,  7.03571603, 7.04278354,  7.04985815],
#      ..., 
#      [ 7.04988652,  7.04279774,  7.03571603, ...,  7.0286414 , 7.03571603,  7.04279774],
#      [ 7.05693985,  7.04985815,  7.04278354, ...,  7.03571603, 7.04278354,  7.04985815],
#      [ 7.06400028,  7.05692568,  7.04985815, ...,  7.04279774, 7.04985815, 7.05692568]])

plt.imshow(z, cmap = plt.cm.gray)  
plt.colorbar()  
plt.title(r"Image plot of $\sqrt{x^2 + y^2}$ for a grid of values")  

Use the matplotlib function imshow to create an image plot from a 2D array of function values.

Image plot of $\sqrt{x^2 + y^2}$ for a grid of values
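
The distance computation can be checked without plotting. The sketch below uses a made-up 3 x 3 grid (pts is an assumption for illustration) and also compares the result with np.hypot, a binary ufunc that computes the same quantity:

```python
import numpy as np

# Hypothetical small grid instead of the 1000 x 1000 one
pts = np.array([-4.0, 0.0, 3.0])
gx, gy = np.meshgrid(pts, pts)

# Same expression as for z, applied to the small grid
dist = np.sqrt(gx ** 2 + gy ** 2)

# np.hypot computes sqrt(a**2 + b**2) element-wise
dist_hypot = np.hypot(gx, gy)

# Entry [row 0, column 2] pairs x = 3 with y = -4, so the distance is 5
corner = dist[0, 2]
```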

Expressing Conditional Logic as Array Operations

The numpy.where function is a vectorized version of the ternary expression x if condition else y. Suppose we had a boolean array and two arrays of values:

xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])  
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])  
cond = np.array([True, False, True, True, False])

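Suppose we wanted to take a value from xarr whenever the corresponding value in cond is True, and otherwise take the value from yarr. A list comprehension doing this might look like:

result = [(x if c else y) for x, y, c in zip(xarr, yarr, cond)]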

result  
#Output: 
#[1.1000000000000001, 2.2000000000000002, 1.3, 1.3999999999999999, 2.5]

This has multiple problems. First, it will not be very fast for large arrays, because all the work is done in interpreted Python code. Second, it will not work with multidimensional arrays. With np.where you can write this very concisely:

result = np.where(cond, xarr, yarr)

result  
#Output:array([1.1, 2.2, 1.3, 1.4, 2.5])
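
Unlike the list comprehension, np.where handles multidimensional input without any changes. A minimal sketch, with made-up 2 x 2 arrays (x2, y2, and mask are illustrative names):

```python
import numpy as np

# Hypothetical 2 x 2 inputs
x2 = np.array([[1, 2], [3, 4]])
y2 = np.array([[10, 20], [30, 40]])
mask = np.array([[True, False], [False, True]])

# Take from x2 where mask is True, otherwise from y2
picked = np.where(mask, x2, y2)
# picked -> [[ 1, 20],
#            [30,  4]]
```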

The second and third arguments to np.where don't need to be arrays; one or both of them can be scalars. A typical use of where in data analysis is to produce a new array of values based on another array.

Suppose you had a matrix of randomly generated data and you wanted to replace all positive values with 2 and all negative values with -2:

arr = randn(4, 4)  
arr  
#Output:
#array([[-0.29078981, -1.13348685,  0.27383502,  0.23798519],
#       [ 0.95101727, -0.27016914, -0.8045934 , -1.89861787],
#       [ 0.85489754, -0.3724457 , -0.5128126 , -0.4517238 ],
#       [ 2.20839445,  1.38272659,  0.28715723, -1.20740332]])

np.where(arr > 0, 2, -2)  
#Output:
#array([[-2, -2,  2,  2],
#       [ 2, -2, -2, -2],
#       [ 2, -2, -2, -2],
#       [ 2,  2,  2, -2]])
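
A scalar and an array can also be mixed. The sketch below, using a small made-up array named data, replaces only the positive values with 2 and leaves the remaining values untouched:

```python
import numpy as np

data = np.array([[0.5, -1.0], [-0.25, 2.0]])  # hypothetical values

# Replace only the positive values with 2; keep the others unchanged
clipped = np.where(data > 0, 2, data)
# clipped -> [[ 2.  , -1.  ],
#             [-0.25,  2.  ]]
```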

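With some cleverness you can use where to express more complicated logic. For example, given two boolean arrays cond1 and cond2, a nested where expression can assign a different value to each of the four possible combinations: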

np.where(cond1 & cond2, 0,  
        np.where(cond1, 1,
                np.where(cond2, 2, 3)))
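
The nested expression leaves cond1 and cond2 undefined. Here is a runnable sketch with made-up boolean arrays covering all four combinations. It also shows np.select as one possible alternative; that function is not used in the original text and is offered only as an option that may read more clearly when there are many branches:

```python
import numpy as np

# Hypothetical conditions covering every True/False combination
cond1 = np.array([True, True, False, False])
cond2 = np.array([True, False, True, False])

nested = np.where(cond1 & cond2, 0,
                  np.where(cond1, 1,
                           np.where(cond2, 2, 3)))
# nested -> array([0, 1, 2, 3])

# np.select uses the choice belonging to the first condition that matches
selected = np.select([cond1 & cond2, cond1, cond2], [0, 1, 2], default=3)
```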

Mathematical and Statistical Methods

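A set of mathematical functions that compute statistics about an entire array or about the data along an axis are accessible as array methods. Aggregations such as sum and mean can be computed either by calling the array method or by using the top-level NumPy function: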

arr = np.random.randn(5, 4)

arr.mean()  
#Output:-0.25727502644941103
np.mean(arr)  
#Output:-0.25727502644941103
arr.sum()  
#Output:-5.1455005289882205

Functions like mean and sum take an optional axis argument which computes the statistic over the given axis, resulting in an array with one fewer dimension:

arr.mean(axis=1)  
#Output:
#array([-0.44845091,  0.60224045, -0.27011575, -1.04713576, -0.12291316])

arr.sum(axis=0)  
#Output:
#array([-1.27449525, -1.4166109 , -1.30776441, -1.14662997])
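
The axis argument is easier to follow on a small array with known values. In this sketch (m is a made-up 2 x 3 array), axis=0 collapses the rows and axis=1 collapses the columns:

```python
import numpy as np

m = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])  # hypothetical 2 x 3 array

col_sums = m.sum(axis=0)    # one value per column -> array([5., 7., 9.])
row_means = m.mean(axis=1)  # one value per row    -> array([2., 5.])
```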

Other methods like cumsum and cumprod do not aggregate, instead producing an array of the intermediate results:

arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])  
arr.cumsum(0)  
#Output:
#array([[ 0,  1,  2],
#       [ 3,  5,  7],
#       [ 9, 12, 15]])
arr.cumprod(1)  
#Output:
#array([[  0,   0,   0],
#       [  3,  12,  60],
#       [  6,  42, 336]])
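
When no axis is given, cumsum operates on the flattened array. A brief sketch with a made-up 2 x 3 array named grid contrasts the two behaviors:

```python
import numpy as np

grid = np.array([[0, 1, 2], [3, 4, 5]])  # hypothetical 2 x 3 array

# Without an axis, cumsum works on the flattened array
flat_running = grid.cumsum()        # array([ 0,  1,  3,  6, 10, 15])

# With axis=1, the running total restarts on each row
row_running = grid.cumsum(axis=1)   # [[0, 1, 3], [3, 7, 12]]

# The final running total equals the plain sum of all elements
final_total = flat_running[-1]      # 15
```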

Methods for Boolean Arrays

Boolean values are coerced to 1 (True) and 0 (False) in the preceding methods. Thus, sum is often used as a means of counting True values in a boolean array:

arr = randn(100)  
(arr > 0).sum()
#Output:40

There are two additional methods, any and all, that are especially useful for boolean arrays. any tests whether one or more values in an array is True, while all checks whether every value is True:

bools = np.array([False, False, True, False])

bools.any()  
#Output: True
bools.all()  
#Output: False
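
Like sum, both any and all accept an axis argument. The following sketch uses a made-up array named readings to count positive values and to test each row separately:

```python
import numpy as np

readings = np.array([[1.5, -0.5, 2.0],
                     [0.5, 3.0, 4.0]])  # hypothetical values
positive = readings > 0

n_positive = positive.sum()          # 5 of the 6 values are positive
any_per_row = positive.any(axis=1)   # array([ True,  True])
all_per_row = positive.all(axis=1)   # array([False,  True])

# Non-boolean arrays work too: nonzero elements count as True
has_nonzero = np.array([0, 0, 3]).any()  # True
```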

Sorting

NumPy arrays can be sorted in-place using the sort method:

arr = randn(8)  
arr  
#Output:array([ 2.14278004,  0.61605461,  0.42811785,  0.8523577 ,  1.79848808,
#        1.17615761,  0.73719144, -0.38943294])
arr.sort()  
arr  
#Output:
#array([-0.38943294,  0.42811785,  0.61605461,  0.73719144,  0.8523577 ,
#        1.17615761,  1.79848808,  2.14278004])
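
The sort method modifies the array and returns None, while the top-level np.sort function returns a sorted copy. A minimal sketch with made-up values (original and table are illustrative names) shows the difference, along with sorting a 2D array along an axis:

```python
import numpy as np

original = np.array([3, 1, 2])  # hypothetical values

# np.sort returns a sorted copy and leaves its input untouched
sorted_copy = np.sort(original)   # array([1, 2, 3])
still_unsorted = original.tolist()  # [3, 1, 2]

# The sort method sorts in place and returns None
returned = original.sort()        # original is now array([1, 2, 3])

# For multidimensional arrays, pass the axis to sort along
table = np.array([[3, 1, 2], [9, 7, 8]])
table.sort(axis=1)                # each row sorted: [[1, 2, 3], [7, 8, 9]]
```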

Unique and Other Set Logic

NumPy has some basic set operations for one-dimensional ndarrays. Probably the most commonly used one is np.unique, which returns the sorted unique values in an array:

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

np.unique(names)  
#Output:
#array(['Bob', 'Joe', 'Will'], 
#      dtype='|S4')
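
Beyond np.unique, NumPy provides a handful of other set functions for 1D arrays. This sketch uses two made-up integer arrays, left and right, to show a few of them:

```python
import numpy as np

left = np.array([3, 1, 2, 3, 1])   # hypothetical values
right = np.array([2, 3, 5])

# Sorted unique values together with how often each one occurs
values, counts = np.unique(left, return_counts=True)
# values -> array([1, 2, 3]); counts -> array([2, 1, 2])

common = np.intersect1d(left, right)   # array([2, 3])
combined = np.union1d(left, right)     # array([1, 2, 3, 5])
only_left = np.setdiff1d(left, right)  # array([1])

# Boolean mask marking which elements of left also appear in right
membership = np.isin(left, right)      # [True, False, True, True, False]
```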

File Input and Output with Arrays

NumPy is able to save and load data to and from disk either in text or binary format.

Storing Arrays on Disk in Binary Format

np.save and np.load are the two workhorse functions for efficiently saving and loading array data on disk. Arrays are saved by default in an uncompressed raw binary format with file extension .npy. If the file path does not already end in .npy, the extension is appended:

arr = np.arange(10)  
np.save('some_array', arr)

np.load('some_array.npy')  
#Output:array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

You can save multiple arrays in a zip archive using np.savez, passing the arrays as keyword arguments. When loading an .npz file, you get back a dict-like object which loads the individual arrays lazily:

np.savez('array_archive.npz', a = arr, b = arr)  
arch = np.load('array_archive.npz')  
arch['b']  
#Output:array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
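
A round trip through an .npz archive can be sketched as follows. The file name demo_archive.npz and the keyword names first and second are made up for illustration, and a temporary directory is used so no files are left behind:

```python
import os
import tempfile

import numpy as np

first = np.arange(5)
second = np.array([1.5, 2.5])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_archive.npz")  # hypothetical file name
    np.savez(path, first=first, second=second)

    # np.load on an .npz file returns a dict-like object; using it as a
    # context manager closes the underlying file when the block ends
    with np.load(path) as archive:
        keys = sorted(archive.files)        # names given as keyword arguments
        loaded_first = archive["first"]     # each array is read on access
        loaded_second = archive["second"]
```

If disk space matters, np.savez_compressed takes the same arguments and writes a compressed archive instead.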

Saving and Loading Text Files

Use np.loadtxt to read data from a text file. You can specify the delimiter that separates the values, such as a comma for a CSV file:

arr = np.loadtxt('array_ex.txt', delimiter=',')  
arr  
#Output:
#array([[ 0.580052,  0.18673 ,  1.040717,  1.134411],
#       [ 0.194163, -0.636917, -0.938659,  0.124094],
#       [-0.12641 ,  0.268607, -0.695724,  0.047428],
#       [-1.484413,  0.004176, -0.744203,  0.005487],
#       [ 2.302869,  0.200131,  1.670238, -1.88109 ],
#       [-0.19323 ,  1.047233,  0.482803,  0.960334]])

np.savetxt performs the inverse operation: writing an array to a delimited text file. genfromtxt is similar to loadtxt but is geared for structured arrays and missing data handling.
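
The text round trip and the missing-data handling can be sketched together. The file name demo.txt and the sample values are assumptions for illustration; the genfromtxt input is an in-memory string with one empty field:

```python
import io
import os
import tempfile

import numpy as np

table = np.array([[1.5, 2.0], [3.25, 4.0]])  # hypothetical values

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")  # hypothetical file name
    np.savetxt(path, table, delimiter=",")
    restored = np.loadtxt(path, delimiter=",")

# genfromtxt fills an empty field with nan instead of raising an error
text = io.StringIO("1.0,,3.0\n4.0,5.0,6.0")
with_gaps = np.genfromtxt(text, delimiter=",")
# with_gaps -> [[ 1., nan,  3.],
#               [ 4.,  5.,  6.]]
```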