P4DA NumPy Basics 1

NumPy is the fundamental package required for high performance scientific computing and data analysis. Here are some things it provides:

  • ndarray: a fast and space-efficient multidimensional array providing vectorized arithmetic operations and sophisticated broadcasting capabilities.
  • Standard mathematical functions for fast operations on entire arrays of data without having to write loops
  • Tools for reading / writing array data to disk and working with memory-mapped files
  • Linear algebra, random number generation, and Fourier transform capabilities
  • Tools for integrating code written in C, C++, and Fortran

Because NumPy provides an easy-to-use C API, it is very easy to pass data to external libraries written in a low-level language and also for external libraries to return data to Python as NumPy arrays.

Having an understanding of NumPy arrays and array-oriented computing will help you use tools like pandas much more effectively.

For most data analysis applications, the main areas of functionality this article focuses on are:

  • Fast vectorized array operations for data munging and cleaning, subsetting and filtering, transformation, and any other kinds of computations
  • Common array algorithms like sorting, unique, and set operations
  • Efficient descriptive statistics and aggregating/summarizing data
  • Data alignment and relational data manipulations for merging and joining together heterogeneous data sets
  • Expressing conditional logic as array expressions instead of loops with if-elif-else branches
  • Group-wise data manipulations (aggregation, transformation, function application)

Use pandas as your basis for most kinds of data analysis as it provides a rich, high-level interface making most common data tasks very concise and simple.

The NumPy ndarray: A Multidimensional Array Object

One of NumPy's key features is its N-dimensional array object, or ndarray, which is a fast, flexible container for large data sets in Python.

Creating ndarrays

The easiest way to create an array is to use the array function. For example, a list is a good candidate for conversion:

import numpy as np

data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)  
arr1  
#Output:
#array([ 6. ,  7.5,  8. ,  0. ,  1. ])

Nested sequences, like a list of equal-length lists, will be converted into a multidimensional array:

data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]  
arr2 = np.array(data2)  
arr2  
#Output:
#array([[1, 2, 3, 4],
#       [5, 6, 7, 8]])
arr2.ndim  
#Output:
# 2
arr2.shape  
#Output:
# (2, 4)

np.array tries to infer a good data type for the array that it creates. The data type is stored in a special dtype object:

arr2.dtype  
#Output:
#dtype('int32')
arr1.dtype  
#Output:
#dtype('float64')
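
As a small sketch of that inference (the variable names are illustrative, not from the text above): a list of integers gets an integer dtype whose width depends on the platform, a single float in the input upcasts the whole array, and an explicit dtype argument overrides the inference.

```python
import numpy as np

# Integers only: NumPy picks an integer dtype (the exact width is platform-dependent).
ints = np.array([1, 2, 3])

# A single float in the input upcasts the whole array to float64.
mixed = np.array([1, 2.5, 3])

# Passing dtype explicitly overrides the inference.
forced = np.array([1, 2, 3], dtype=np.float64)

print(ints.dtype.kind)   # 'i' (signed integer)
print(mixed.dtype)       # float64
print(forced.dtype)      # float64
```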

zeros and ones create arrays of 0's and 1's with a given length or shape. empty creates an array without initializing its values to any particular value. To create a higher dimensional array with these methods, pass a tuple for the shape:

np.zeros(5)  
#Output:
#array([ 0.,  0.,  0.,  0.,  0.])

np.zeros((2, 3))  
#Output:
#array([[ 0.,  0.,  0.],
#       [ 0.,  0.,  0.]])

np.empty((2, 3, 2))  
#Output:
#array([[[  0.00000000e+000,   6.36598737e-314],
#        [  0.00000000e+000,   1.27319747e-313],
#        [  1.27319747e-313,   1.27319747e-313]],
#
#       [[  1.27319747e-313,   1.27319747e-313],
#        [  0.00000000e+000,   4.44659081e-323],
#        [  2.54639495e-313,   6.42285340e-323]]])

It is not safe to assume that np.empty will return an array of all zeros. In many cases it will return uninitialized garbage values.
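
A minimal sketch of the remaining creation functions mentioned above. np.ones mirrors np.zeros, and np.zeros_like (a NumPy function not discussed in the text) borrows the shape and dtype of an existing array. The variable names are illustrative; the last two lines show one safe pattern for np.empty, which is to fill the array before reading from it.

```python
import numpy as np

ones_2x3 = np.ones((2, 3))        # 2x3 array filled with 1.0
like = np.zeros_like(ones_2x3)    # zeros with the same shape and dtype as ones_2x3

buf = np.empty(4)                 # contents are arbitrary at this point
buf[:] = 0.0                      # initialize explicitly before reading

print(ones_2x3.shape)             # (2, 3)
print(like.dtype)                 # float64
```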

Short list of standard array creation functions:

[Table: standard array creation functions]

Data Types for ndarrays

The dtype is a special object containing the information the ndarray needs to interpret a chunk of memory as a particular type of data:

arr1 = np.array([1, 2, 3], dtype = np.float64)

arr2 = np.array([1, 2, 3], dtype = np.int32)  

In most cases dtypes map directly onto an underlying machine representation, which makes it easy to read and write binary streams of data to disk and also to connect to code written in a low-level language like C or Fortran.
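
To make the "chunk of memory" idea concrete, here is a hedged sketch: an int32 array occupies four bytes per element, and its raw bytes can be read back with the same dtype. The names raw and restored are illustrative only.

```python
import numpy as np

arr = np.array([1, 2, 3], dtype=np.int32)

print(arr.itemsize)   # 4 (bytes per int32 element)
print(arr.nbytes)     # 12 (bytes for the whole array)

raw = arr.tobytes()                              # the array's data as a bytes object
restored = np.frombuffer(raw, dtype=np.int32)    # reinterpret those bytes as int32
```

The same bytes read with a different dtype would yield different values, which is why the dtype has to travel with the data.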

A full listing of NumPy's supported data types:

[Table: NumPy's supported data types]

You can explicitly convert or cast an array from one dtype to another using ndarray's astype method:

arr = np.array([1, 2, 3, 4, 5, 6])

float_arr = arr.astype(np.float64)  

Here the integers were cast to floating point. Casting floating point numbers to an integer dtype truncates the decimal part:

arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])

arr.astype(np.int32)  
#Output: 
#array([ 3, -1, -2,  0, 12, 10])

If you have an array of strings representing numbers, you can use astype to convert them to numeric form:

numeric_string = np.array(['1.25', '-9.6', '42'], dtype = np.str_)

numeric_string.astype(float)  
#Output:
#array([  1.25,  -9.6 ,  42.  ])

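NumPy is smart enough to alias the Python types to the equivalent dtypes, which is why float works in place of np.float64 above. If you have another array whose dtype you want to reuse, pass its dtype attribute to astype: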

int_array = np.arange(10)

calibers = np.array([.22, .270, .357, .380, .44, .50], dtype = np.float64)

int_array.astype(calibers.dtype)  
#Output:
#array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])
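
As a quick check of both conventions, the sketch below compares Python types with their NumPy dtypes and reuses one array's dtype for another. The names source, template, and converted are illustrative only.

```python
import numpy as np

# Python's built-in types are accepted wherever a dtype is expected.
print(np.dtype(float) == np.float64)   # True
print(np.dtype(bool) == np.bool_)      # True

source = np.arange(3)                        # integer array
template = np.array([0.5, 1.5])              # float64 array
converted = source.astype(template.dtype)    # borrow template's dtype
print(converted.dtype)                       # float64
```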

Operations between Arrays and Scalars

Arrays enable you to express batch operations on data without writing any for loops. This is usually called vectorization. Any arithmetic operation between equal-size arrays applies the operation elementwise:

arr = np.array([[1., 2., 3.], [4., 5., 6.]])

arr * arr  
#Output:
#array([[  1.,   4.,   9.],
#       [ 16.,  25.,  36.]])

arr - arr  
#Output:
#array([[ 0.,  0.,  0.],
#       [ 0.,  0.,  0.]])

Arithmetic operations with scalars propagate the scalar value to each element.
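
A short sketch of scalar arithmetic on the same array; the result names are illustrative:

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])

inverse = 1 / arr      # the scalar 1 is divided by every element
roots = arr ** 0.5     # every element is raised to the power 0.5
doubled = arr * 2      # every element is multiplied by 2

print(doubled)
# [[ 2.  4.  6.]
#  [ 8. 10. 12.]]
```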

Basic Indexing and Slicing

NumPy has many ways to select a subset of your data or individual elements. One-dimensional arrays are simple:

arr = np.arange(10)

arr[5]  
#Output:
# 5
arr[5:8]  
#Output:
#array([5, 6, 7])
arr[5:8] = 12
arr
#Output:
#array([0, 1, 2, 3, 4, 12, 12, 12, 8, 9])

An important first distinction from lists is that array slices are views on the original array. This means that the data is not copied, and any modifications to the view will be reflected in the source array:

arr_slice = arr[5:8]  
arr_slice[1] = 12345  
arr  
#Output:
#array([0, 1, 2, 3, 4, 12, 12345, 12, 8, 9])
arr_slice[:] = 64  
arr  
#Output:
#array([0, 1, 2, 3, 4, 64, 64, 64, 8, 9])
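
If you want an independent copy of a slice rather than a view, you need to copy it explicitly. The sketch below uses the ndarray copy method and np.shares_memory to show the difference; the variable names are illustrative.

```python
import numpy as np

arr = np.arange(10)

view = arr[5:8]            # shares memory with arr
copy = arr[5:8].copy()     # independent data

copy[:] = -1               # does not touch arr
print(arr[5:8])            # [5 6 7]

view[:] = 64               # writes through to arr
print(arr[5:8])            # [64 64 64]

print(np.shares_memory(arr, view))   # True
print(np.shares_memory(arr, copy))   # False
```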

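With higher dimensional arrays, you have many more options. In a two-dimensional array, the element at each index is a one-dimensional array rather than a scalar, so individual elements can be accessed recursively. Equivalently, you can pass a comma-separated list of indices to select individual elements.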

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])  
arr2d[2][0]
#Output:
# 7
arr2d[2, 0]
#Output:
# 7
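
The comma-separated form also accepts slices, which gives some of the "many more options" mentioned above. This is a brief sketch on the same arr2d; the names row, block, and column are illustrative.

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

row = arr2d[2]           # a single index selects a whole row: [7 8 9]
block = arr2d[:2, 1:]    # first two rows, columns 1 onward: [[2 3] [5 6]]
column = arr2d[:, 0]     # every row, column 0: [1 4 7]
```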

Boolean Indexing

Suppose each name in the names array corresponds to a row in the data array. Use the randn function in numpy.random to generate some normally distributed random data:

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

data = np.random.randn(7, 4)
data
#Output:
#array([[  1.63483709e-01,  -6.57302271e-02,  -5.60891461e-01,
#         -8.23897454e-02],
#       [  6.93143787e-01,   1.20269557e+00,   7.17040421e-01,
#          1.29311676e+00],
#       [  8.55416348e-01,  -2.15708942e+00,   5.87096843e-01,
#         -1.76492779e-01],
#       ..., 
#       [ -3.55794859e-01,   2.93436921e-01,   6.06956468e-01,
#         -4.16028885e-01],
#       [  1.61066626e-01,   3.25042808e-01,  -1.80249855e+00,
#         -1.04501415e+00],
#       [ -1.12850396e-01,   7.02006826e-01,   7.21326879e-01,
#          3.00182436e-04]])

Suppose we want to select all the rows whose corresponding name is 'Bob'. Comparing names with the string 'Bob' yields a boolean array, which can then be passed as an index:

names == 'Bob'
#Output:
#array([ True, False, False,  True, False, False, False], dtype=bool)

data[names == 'Bob']  
#Output:
#array([[ 0.16348371, -0.06573023, -0.56089146, -0.08238975],
#       [ 0.00413943, -0.06916036,  0.91714914, -0.09036852]])
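
Boolean masks can also be negated, combined, and used for assignment. The sketch below is illustrative rather than part of the original example: it uses a seeded np.random.default_rng generator instead of randn so that it is repeatable, and the names not_bob, mask, and bob_or_will are made up for the demonstration.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
rng = np.random.default_rng(0)        # seeded so the sketch is repeatable
data = rng.standard_normal((7, 4))

not_bob = data[~(names == 'Bob')]     # rows whose name is not 'Bob'

# Combine conditions with | and &; the Python keywords or/and do not work on arrays.
mask = (names == 'Bob') | (names == 'Will')
bob_or_will = data[mask]

data[data < 0] = 0                    # assignment through a boolean mask

print(not_bob.shape)       # (5, 4)
print(bob_or_will.shape)   # (4, 4)
```

Selecting data with a boolean array creates a copy, so not_bob and bob_or_will are unaffected by the later assignment to data.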

Fancy Indexing

Fancy Indexing is a term adopted by NumPy to describe indexing using integer arrays.

arr = np.empty((8, 4))

for i in range(8):  
    arr[i] = i

arr  
#Output:
#array([[ 0.,  0.,  0.,  0.],
#       [ 1.,  1.,  1.,  1.],
#       [ 2.,  2.,  2.,  2.],
#       ..., 
#       [ 5.,  5.,  5.,  5.],
#       [ 6.,  6.,  6.,  6.],
#       [ 7.,  7.,  7.,  7.]])

To select out a subset of the rows in a particular order, you can simply pass a list or ndarray of integers specifying the desired order:

arr[[4, 3, 0]]  
#Output:
#array([[ 4.,  4.,  4.,  4.],
#       [ 3.,  3.,  3.,  3.],
#       [ 0.,  0.,  0.,  0.]])

Passing multiple index arrays does something slightly different; it selects a 1D array of elements corresponding to each tuple of indices:

arr = np.arange(32).reshape((8, 4))  
arr  
#Output:
#array([[ 0,  1,  2,  3],
#       [ 4,  5,  6,  7],
#       [ 8,  9, 10, 11],
#       ..., 
#       [20, 21, 22, 23],
#       [24, 25, 26, 27],
#       [28, 29, 30, 31]])

arr[[1, 5, 7, 2], [0, 3, 1, 2]]  
#Output:
#array([ 4, 23, 29, 10])

The elements (1, 0), (5, 3), (7, 1), and (2, 2) were selected. This differs from what you might expect, which is the rectangular region formed by selecting a subset of the matrix's rows and columns. Here is one way to get that region:

arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]
#Output:
#array([[ 4,  7,  5,  6],
#       [20, 23, 21, 22],
#       [28, 31, 29, 30],
#       [ 8, 11,  9, 10]])
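
NumPy's np.ix_ function is a way to request the rectangular region directly: it converts two 1D integer sequences into an indexer that selects every combination of the given rows and columns. The sketch below contrasts it with pairwise selection; the names rows, cols, pairs, and region are illustrative.

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))

rows = [1, 5, 7, 2]
cols = [0, 3, 1, 2]

pairs = arr[rows, cols]              # one element per (row, col) pair
region = arr[np.ix_(rows, cols)]     # every combination of rows and cols

print(pairs)          # [ 4 23 29 10]
print(region.shape)   # (4, 4)
```

Unlike slicing, fancy indexing always copies the selected data into a new array.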

Transposing Arrays and Swapping Axes

Transposing is a special form of reshaping which similarly returns a view on the underlying data without copying anything.

arr = np.arange(15).reshape((5, 3))  
arr.T  
#Output:
#array([[ 0,  3,  6,  9, 12],
#       [ 1,  4,  7, 10, 13],
#       [ 2,  5,  8, 11, 14]])
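
A common use of the T attribute is in matrix computations, for example computing the inner matrix product of an array with itself using np.dot. The sketch below is illustrative (the name gram is made up) and also confirms that the transpose is a view.

```python
import numpy as np

arr = np.arange(15).reshape((5, 3))

gram = np.dot(arr.T, arr)              # (3, 5) times (5, 3) gives a (3, 3) matrix
print(gram.shape)                      # (3, 3)
print(np.shares_memory(arr, arr.T))    # True: arr.T is a view, not a copy
```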

For higher dimensional arrays, transpose will accept a tuple of axis numbers to permute the axes:

arr = np.arange(16).reshape((2, 2, 4))
arr
#Output:
#array([[[ 0,  1,  2,  3],
#        [ 4,  5,  6,  7]],
#
#       [[ 8,  9, 10, 11],
#        [12, 13, 14, 15]]])

arr.transpose((1, 0, 2))  
#Output:
#array([[[ 0,  1,  2,  3],
#        [ 8,  9, 10, 11]],
#
#       [[ 4,  5,  6,  7],
#        [12, 13, 14, 15]]])

ndarray has the method swapaxes which takes a pair of axis numbers:

arr = np.arange(16).reshape((2, 2, 4))
arr
#Output:
#array([[[ 0,  1,  2,  3],
#        [ 4,  5,  6,  7]],
#
#       [[ 8,  9, 10, 11],
#        [12, 13, 14, 15]]])

arr.swapaxes(1, 2)  
#Output:
#array([[[ 0,  4],
#        [ 1,  5],
#        [ 2,  6],
#        [ 3,  7]],
#
#       [[ 8, 12],
#        [ 9, 13],
#        [10, 14],
#        [11, 15]]])
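
Like transpose, swapaxes returns a view on the data without making a copy. The sketch below is illustrative: it checks that swapping axes 1 and 2 matches transpose((0, 2, 1)) and that writing to the result changes the original array.

```python
import numpy as np

arr = np.arange(16).reshape((2, 2, 4))

swapped = arr.swapaxes(1, 2)
print(swapped.shape)                 # (2, 4, 2)

# Swapping axes 1 and 2 is the same permutation as transpose((0, 2, 1)).
same = np.array_equal(swapped, arr.transpose((0, 2, 1)))
print(same)                          # True

swapped[0, 0, 0] = 100               # writes through to arr
print(arr[0, 0, 0])                  # 100
```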