P4DA Getting Started with pandas 1

pandas is the primary library used here for data analysis. It contains high-level data structures and manipulation tools designed to make data analysis fast and easy in Python. pandas is built on top of NumPy, which makes it easy to use in NumPy-centric applications.

from pandas import Series, DataFrame  
import pandas as pd  

Introduction to pandas Data Structures

To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame.

Series

A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its index. The simplest Series is formed from only an array of data:

obj = Series([4, 7, -5, 3])  
#0    4
#1    7
#2   -5
#3    3
#dtype: int64

A default index consisting of the integers 0 through N-1 is created. You can get the array representation and index object of the Series via its values and index attributes, respectively:

obj.values  
#Output:array([ 4,  7, -5,  3], dtype=int64)
obj.index  
#Output:RangeIndex(start=0, stop=4, step=1)

Often it will be desirable to create a Series with an index identifying each data point:

obj2 = Series([4, 7, -5, 3], index = ['d', 'b', 'a', 'c'])  
#d    4
#b    7
#a   -5
#c    3
#dtype: int64

obj2.index  
#Output:Index([u'd', u'b', u'a', u'c'], dtype='object')

Compared with a regular NumPy array, you can use values in the index when selecting single values or a set of values, for example obj2['a'] or obj2[['c', 'a', 'd']].
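
Label-based selection on the Series defined above can be sketched as follows. This is a minimal illustration; the variable names single and subset are introduced here and are not part of the original notes.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Select a single value by its index label
single = obj2['a']

# Select several values by passing a list of labels;
# the result is a new Series in the requested label order
subset = obj2[['c', 'a', 'd']]

print(single)
print(subset)
```

A single label returns a scalar, while a list of labels returns a Series that keeps the label-to-value pairing.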

NumPy array operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link:

obj2[obj2 > 2]  
obj2 * 2  
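
To make the index-value link concrete, here is a small sketch that applies all three kinds of operation and keeps the results in variables (filtered, doubled, and exped are names chosen for this illustration).

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Boolean filtering: only the entries greater than 2 survive,
# and each keeps its original label (d, b, c)
filtered = obj2[obj2 > 2]

# Scalar multiplication is applied element by element
doubled = obj2 * 2

# NumPy math functions return a Series with the same index
exped = np.exp(obj2)

print(filtered)
print(doubled)
print(exped)
```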

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values.

'b' in obj2  
#Output: True
'e' in obj2  
#Output: False

Should you have data contained in a Python dict, you can create a Series from it by passing the dict:

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}  
obj3 = Series(sdata)  
#Ohio      35000
#Oregon    16000
#Texas     71000
#Utah       5000
#dtype: int64

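When only a dict is passed, the index of the resulting Series holds the dict's keys (in sorted order in older pandas versions, as in the output above; newer versions keep the dict's insertion order). If you also pass an index, values are matched to it by key, and any label with no matching key gets NaN: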

states = ['California', 'Ohio', 'Oregon', 'Texas']  
obj4 = Series(sdata, index = states)  
#California        NaN
#Ohio          35000.0
#Oregon        16000.0
#Texas         71000.0
#dtype: float64
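
As a hedged follow-up to the example above, the missing entry can be located with the isnull method. The variable name missing is an assumption made for this sketch.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']

# Values are matched to the requested labels by key:
# 'California' has no key in sdata, and 'Utah' is not requested
obj4 = pd.Series(sdata, index=states)

# Boolean Series marking which labels ended up without data
missing = obj4.isnull()

print(obj4)
print(missing)
```

Because NaN is a floating-point value, the integer data is stored as float64 once a label has no match.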

DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type. The DataFrame has both a row and column index; it can be thought of as a dict of Series.

import pandas as pd  
from pandas import DataFrame, Series  
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],  
        'year':[2000, 2001, 2002, 2001, 2002], 
        'pop':[1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)  
#   pop   state  year
#0  1.5    Ohio  2000
#1  1.7    Ohio  2001
#2  3.6    Ohio  2002
#3  2.4  Nevada  2001
#4  2.9  Nevada  2002

If you specify a sequence of columns, the DataFrame's columns will be exactly what you pass:

DataFrame(data, columns=['year', 'state', 'pop'])  
#   year   state  pop
#0  2000    Ohio  1.5
#1  2001    Ohio  1.7
#2  2002    Ohio  3.6
#3  2001  Nevada  2.4
#4  2002  Nevada  2.9
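
A related case worth sketching: if the column list names a column that is not a key in the data, that column is created and filled with missing values. The column name debt is hypothetical and used only for illustration.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

# 'debt' is not a key in data, so the column is filled with NaN
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'])

# A single column is retrieved as a Series with dict-like notation
state_col = frame2['state']

print(frame2)
print(state_col)
```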

Rows can also be retrieved by position or name with a couple of methods, such as the ix indexing field (removed in pandas 1.0 in favor of loc and iloc):

frame.ix[2]  
#pop       3.6
#state    Ohio
#year     2002
#Name: 2, dtype: object
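
In pandas versions where ix is no longer available, the same row lookup can be written with loc (by label) and iloc (by position). The sketch below assumes a hypothetical string index ('one' through 'five') so that the two lookups are clearly distinguishable.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

# Hypothetical row labels, chosen only for this illustration
frame = pd.DataFrame(data, index=['one', 'two', 'three', 'four', 'five'])

# Retrieve the third row by its label
by_label = frame.loc['three']

# Retrieve the same row by its integer position
by_position = frame.iloc[2]

print(by_label)
print(by_position)
```

Both lookups return the row as a Series whose index is the DataFrame's column names.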

Columns can be modified by assignment. When assigning lists or arrays to a column, the value's length must match the length of the DataFrame. If you assign a Series, it will instead be conformed exactly to the DataFrame's index, inserting missing values in any holes:

val = Series([1.8, 3.2], index = [1, 3])  
frame['pop'] = val  
#   pop   state  year
#0  NaN    Ohio  2000
#1  1.8    Ohio  2001
#2  NaN    Ohio  2002
#3  3.2  Nevada  2001
#4  NaN  Nevada  2002
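
The difference between the two assignment rules can be sketched side by side. The column names debt and bad are hypothetical, and the exact wording of the error message may vary between pandas versions, so it is only printed here.

```python
import numpy as np
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)

# An array of matching length (5) is assigned position by position
frame['debt'] = np.arange(5.0)

# A list of the wrong length is rejected with a ValueError
error = None
try:
    frame['bad'] = [1, 2, 3]
except ValueError as exc:
    error = str(exc)

# A Series is aligned on the index instead; rows 0, 2 and 4 become NaN
val = pd.Series([-1.2, -1.5], index=[1, 3])
frame['debt'] = val

print(error)
print(frame)
```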

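You can always transpose a DataFrame with its T attribute. Like Series, the values attribute returns the data contained in the DataFrame as a 2D ndarray (shown here for the original data, before the pop column was reassigned):

DataFrame(data).values  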
#array([[1.5, 'Ohio', 2000L],
#       [1.7, 'Ohio', 2001L],
#       [3.6, 'Ohio', 2002L],
#       [2.4, 'Nevada', 2001L],
#       [2.9, 'Nevada', 2002L]], dtype=object)
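
A short sketch of both attributes on the original data follows. The variable names transposed and values are chosen for this illustration.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)

# T swaps rows and columns: the column names become the row index
transposed = frame.T

# values returns a 2D ndarray; because the columns mix strings,
# integers and floats, the array falls back to dtype=object
values = frame.values

print(transposed)
print(values)
```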

For more about DataFrame, see DataFrame