# P4DA Getting Started with pandas 1

`pandas` will be the primary library for data analysis in these notes. It contains high-level data structures and manipulation tools designed to make data analysis fast and easy in Python. pandas is built on top of NumPy and is easy to use in NumPy-centric applications.

```
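from pandas import Series, DataFrame
import pandas as pd
import numpy as np  # later examples call np.exp
# IPython magic; run it only inside IPython/Jupyter
%pylab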
```

### Introduction to pandas Data Structures

To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame.

##### Series

A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its *index*. The simplest Series is formed from only an array of data:

```
obj = Series([4, 7, -5, 3])
obj
#Output:
#0 4
#1 7
#2 -5
#3 3
#dtype: int64
```

A default index consisting of the integers 0 through N-1 is created. You can get the array representation and index object of the Series via its values and index attributes, respectively:

```
obj.values
#Output:array([ 4, 7, -5, 3], dtype=int64)
obj.index
#Output:RangeIndex(start=0, stop=4, step=1)
```

Often it will be desirable to create a Series with an index identifying each data point:

```
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2
#Output:
#d 4
#b 7
#a -5
#c 3
#dtype: int64
```

The index now holds the labels you passed in. Compared with a regular NumPy array, you can use these index values when selecting single values or a set of values:

```
obj2.index
#Output:Index([u'd', u'b', u'a', u'c'], dtype='object')
```
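Selecting by label could look roughly like the sketch below. It rebuilds `obj2` with the same data as above so that it runs on its own, and it uses `pd.Series` rather than the bare `Series` import; the variable names `single` and `subset` are only illustrative.

```python
import pandas as pd

# Same data and labels as obj2 in the text above.
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

single = obj2['a']               # one label returns a scalar value
subset = obj2[['c', 'a', 'd']]   # a list of labels returns a new Series in that order

print(single)
print(subset)

# Assigning through a label updates the value stored under that label.
obj2['d'] = 6
print(obj2['d'])
```

Note that the list of labels controls the order of the result, which is something positional NumPy indexing cannot express by name.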

NumPy array operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link:

```
obj2[obj2 > 2]
obj2 * 2
np.exp(obj2)
```
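One way to check that the index-value link really survives these operations is to inspect the index of each result. The following is a small self-contained sketch; the names `filtered`, `doubled`, and `exped` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

filtered = obj2[obj2 > 2]   # boolean filtering keeps the labels of the surviving values
doubled = obj2 * 2          # scalar multiplication keeps every label
exped = np.exp(obj2)        # a NumPy ufunc returns a Series with the same index

print(filtered.index.tolist())
print(doubled['a'])
print(exped.index.equals(obj2.index))
```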

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values.

```
'b' in obj2
#Output: True
'e' in obj2
#Output: False
```
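The dict analogy goes a little further than membership tests. The sketch below, which assumes the same `obj2` data as above, shows a few dict-like calls on a Series and also shows that `in` checks the labels, not the stored values.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

label_hit = 'b' in obj2          # membership looks at the index labels
value_hit = 7 in obj2            # 7 is a value, not a label, so this is False
in_values = 7 in obj2.values     # test the underlying array to search the values

fallback = obj2.get('e', 0)      # like dict.get: returns the default for a missing label
pairs = list(obj2.items())       # (label, value) pairs in index order
as_dict = obj2.to_dict()         # back to a plain Python dict

print(label_hit, value_hit, in_values)
print(fallback)
print(pairs)
```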

Should you have data contained in a Python dict, you can create a Series from it by passing the dict:

```
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3
#Output:
#Ohio 35000
#Oregon 16000
#Texas 71000
#Utah 5000
#dtype: int64
```

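When only passing a dict, the index in the resulting Series will have the dict's keys in sorted order, as in the output above (newer pandas releases on Python 3.6+ keep the dict's insertion order instead). If you also pass an index, the values are matched to it by label, and any label without a matching key gets NaN: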

```
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index = states)
obj4
#Output:
#California NaN
#Ohio 35000.0
#Oregon 16000.0
#Texas 71000.0
#dtype: float64
```
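The label matching in the example above can be inspected directly. This is a minimal sketch using the same `sdata` and `states`; `isna` is the pandas method for flagging missing values, and the name `missing_mask` is only illustrative.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']

obj4 = pd.Series(sdata, index=states)

missing_mask = obj4.isna()   # True where the label had no matching key in sdata

print(obj4.index.tolist())   # exactly the labels passed via index=
print(missing_mask.tolist())
print('Utah' in obj4)        # 'Utah' was not in states, so it is dropped
print(obj4.dtype)            # NaN forces the integers to be stored as floats
```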

##### DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type. The DataFrame has both a row and column index; it can be thought of as a dict of Series.

```
import pandas as pd
from pandas import DataFrame, Series
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
frame
#Output:
# pop state year
#0 1.5 Ohio 2000
#1 1.7 Ohio 2001
#2 3.6 Ohio 2002
#3 2.4 Nevada 2001
#4 2.9 Nevada 2002
```
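The "dict of Series" view can be made concrete by pulling a column out with dict-style access. The sketch below reuses the same `data`; `state_col` is a name chosen here for illustration.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)

state_col = frame['state']   # dict-style access returns one column as a Series

print(type(state_col).__name__)
print(state_col.name)                        # the column label becomes the Series name
print(state_col.index.equals(frame.index))   # the column shares the frame's row index
```

Bracket access is the safer choice for a column named `pop`, because `frame.pop` already refers to a DataFrame method.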

If you specify a sequence of columns, the DataFrame's columns will be exactly what you pass:

```
DataFrame(data, columns=['year', 'state', 'pop'])
#Output:
# year state pop
#0 2000 Ohio 1.5
#1 2001 Ohio 1.7
#2 2002 Ohio 3.6
#3 2001 Nevada 2.4
#4 2002 Nevada 2.9
```
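"Exactly what you pass" also covers names that are not keys of `data`. In the sketch below, the `debt` column is a hypothetical addition that does not come from the data above; it should appear in the frame filled with missing values.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

# 'debt' is not a key in data, so that column is created with only missing values.
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'])

print(list(frame2.columns))
print(frame2['debt'].isna().all())
```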

Rows can also be retrieved by position or name with a couple of methods. The original *ix* indexing field was deprecated in pandas 0.20 and removed in 1.0, so current code uses *loc* instead:

```
frame.loc[2]
#Output:
#pop 3.6
#state Ohio
#year 2002
#Name: 2, dtype: object
```
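With the default integer index, a row's position and its name are the same number, which hides the difference between the two kinds of lookup. The sketch below assumes a hypothetical string index (`'one'` through `'five'`) to separate them, and uses `loc` for names and `iloc` for positions, the two indexers available in current pandas.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

# Hypothetical row labels, used only to tell names apart from positions.
frame = pd.DataFrame(data, index=['one', 'two', 'three', 'four', 'five'])

by_name = frame.loc['three']   # look the row up by its index label
by_position = frame.iloc[2]    # look the row up by its integer position

print(by_name)
print(by_name.equals(by_position))
```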

Columns can be modified by assignment. When assigning lists or arrays to a column, the value's length must match the length of the DataFrame. If you assign a Series, it will instead be conformed exactly to the DataFrame's index, inserting missing values in any holes:

```
val = Series([1.8, 3.2], index = [1, 3])
frame['pop'] = val
frame
#Output:
# pop state year
#0 NaN Ohio 2000
#1 1.8 Ohio 2001
#2 NaN Ohio 2002
#3 3.2 Nevada 2001
#4 NaN Nevada 2002
```
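The two assignment rules can be contrasted side by side. This is a sketch under the assumption of a new, hypothetical `debt` column; it first tries a list that is too short, then assigns a Series that covers only two of the five rows.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                      'year': [2000, 2001, 2002, 2001, 2002]})

# A list must supply exactly one value per row; two values for five rows is rejected.
try:
    frame['debt'] = [1.0, 2.0]
    length_error = None
except ValueError as exc:
    length_error = exc

# A Series is matched on index labels instead, so rows 0, 2 and 4 become NaN.
frame['debt'] = pd.Series([1.8, 3.2], index=[1, 3])

print(type(length_error).__name__)
print(frame)
```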

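You can always transpose a DataFrame with its *T* attribute. As with Series, the *values* attribute returns the data contained in the DataFrame as a 2D ndarray. The output below shows the frame as originally constructed, before the `pop` column was reassigned: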

```
frame.values
#Output:
#array([[1.5, 'Ohio', 2000L],
# [1.7, 'Ohio', 2001L],
# [3.6, 'Ohio', 2002L],
# [2.4, 'Nevada', 2001L],
# [2.9, 'Nevada', 2002L]], dtype=object)
```
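The dtype of the returned array depends on the columns involved. The following sketch, with illustrative names `mixed` and `numeric`, compares the full frame with a numeric-only selection and also checks the shape of the transpose.

```python
import numpy as np
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)

mixed = frame.values                      # strings and numbers together need dtype=object
numeric = frame[['year', 'pop']].values   # int and float columns share a float64 array

print(mixed.dtype, mixed.shape)
print(numeric.dtype, numeric.shape)
print(frame.T.shape)                      # transposing swaps rows and columns
```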

For more about DataFrame, see DataFrame