P4DA Data Loading, Storage, and File Formats 1

Input and output typically fall into a few main categories: reading text files and other more efficient on-disk formats, loading data from databases, and interacting with network sources like web APIs.

Reading and Writing Data in Text Format

pandas features a number of functions for reading tabular data as a DataFrame object. The Input/Output section of the pandas documentation covers them in detail, though read_csv and read_table are likely the ones you'll use the most.

An overview of the mechanics of the functions that convert text data into a DataFrame:

* Indexing: can treat one or more columns as the returned DataFrame's index.
* Type inference and data conversion: includes user-defined value conversions and custom lists of missing value markers.
* Datetime parsing: includes the ability to combine date and time information spread over multiple columns into a single column in the result.
* Iterating: support for iterating over chunks of very large files.
* Unclean data issues: skipping rows or a footer, comments, or other minor things like numeric data with thousands separated by commas.

Type inference is one of the more important features of these functions; you don't have to specify which columns are numeric, integer, or string. Let's start with a small comma-separated text file:

!cat ex1.csv
#Output:
#a,b,c,d,message
#1,2,3,4,hello
#5,6,7,8,world
#9,10,11,12,foo

Since this is comma-delimited, we can use read_csv to read it into a DataFrame; we could also use read_table, specifying the delimiter:

import pandas as pd  
df = pd.read_csv('ex1.csv')  
df = pd.read_table('ex1.csv', sep=',')  
df  
#Output: 
#   a   b   c   d message
#0  1   2   3   4   hello
#1  5   6   7   8   world
#2  9  10  11  12     foo
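
Since type inference did the work here, a quick look at the dtypes (not part of the original listing, but it follows directly from the data above) confirms what the parser decided:

df.dtypes  
#Output: 
#a           int64
#b           int64
#c           int64
#d           int64
#message    object
#dtype: object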

A file will not always have a header row; you can let pandas assign default column names, or you can specify names yourself. And if you want the message column to be the index of the returned DataFrame, use the index_col argument:

pd.read_csv('ex2.csv', header = None)  
#Output: 
#   0   1   2   3      4
#0  1   2   3   4  hello
#1  5   6   7   8  world
#2  9  10  11  12    foo

pd.read_csv('ex2.csv', names = ['a','b','c','d','message'], index_col = 'message')  
#Output: 
#         a   b   c   d
#message               
#hello    1   2   3   4
#world    5   6   7   8
#foo      9  10  11  12

In the event that you want to form a hierarchical index from multiple columns, just pass a list of column numbers or names:
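
For reference, csv_mindex.csv (reconstructed here from the parsed result below) looks like this:

!cat csv_mindex.csv
#key1,key2,value1,value2
#one,a,1,2
#one,b,3,4
#one,c,5,6
#one,d,7,8
#two,a,9,10
#two,b,11,12
#two,c,13,14
#two,d,15,16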

parsed = pd.read_csv('csv_mindex.csv', index_col = ['key1', 'key2'])

parsed  
#Output: 
#            value1  value2
#key1 key2                
#one  a          1       2
#     b          3       4
#     c          5       6
#     d          7       8
#two  a          9      10
#     b         11      12
#     c         13      14
#     d         15      16

In some cases, a table might not have a fixed delimiter, using whitespace or some other pattern to separate fields. In these cases, you can pass a regular expression as a delimiter for read_table:

list(open('ex3.txt'))  
#Output: 
#['            A         B         C\n',
# 'aaa -0.264438 -1.026059 -0.619500\n',
# 'bbb  0.927272  0.302904 -0.032399\n',
# 'ccc -0.264273 -0.386314 -0.217601\n',
# 'ddd -0.871858 -0.348382  1.100491\n']

In this case fields are separated by a variable amount of whitespace, which can be expressed by the regular expression \s+. Because there is one fewer column name than the number of data columns, read_table infers that the first column should be the DataFrame's index:

result = pd.read_table('ex3.txt', sep = r'\s+')  
result  
#Output:
#            A         B         C
#aaa -0.264438 -1.026059 -0.619500
#bbb  0.927272  0.302904 -0.032399
#ccc -0.264273 -0.386314 -0.217601
#ddd -0.871858 -0.348382  1.100491

You can skip any rows of a file with skiprows:

!cat ex4.csv
## hey!
#a,b,c,d,message
## just wanted to make things more difficult for you
## who reads CSV files with computers, anyway?
#1,2,3,4,hello
#5,6,7,8,world
#9,10,11,12,foo

pd.read_csv('ex4.csv', skiprows = [0, 2, 3])  
#Output: 
#   a   b   c   d message
#0  1   2   3   4   hello
#1  5   6   7   8   world
#2  9  10  11  12     foo
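
Since the unwanted rows here all begin with '#', an alternative (not in the original notes) is read_csv's comment argument, which ignores lines starting with the given character:

pd.read_csv('ex4.csv', comment = '#')  
#Output: 
#   a   b   c   d message
#0  1   2   3   4   hello
#1  5   6   7   8   world
#2  9  10  11  12     foo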

Handling missing values is an important and frequently nuanced part of the file parsing process. By default, pandas uses a set of commonly occurring sentinels, such as NA, -1.#IND, and NULL:

!cat ex5.csv
#something,a,b,c,d,message
#one,1,2,3,4,NA
#two,5,6,,8,world
#three,9,10,11,12,foo

pd.read_csv('ex5.csv')  
#Output:
#  something  a   b     c   d message
#0       one  1   2   3.0   4     NaN
#1       two  5   6   NaN   8   world
#2     three  9  10  11.0  12     foo

Different NA sentinels can be specified for each column in a dict:

sentinels = {'message': ['foo', 'NA'], 'something':['two']}  
pd.read_csv('ex5.csv', na_values = sentinels)  
#Output:
#  something  a   b     c   d message
#0       one  1   2   3.0   4     NaN
#1       NaN  5   6   NaN   8   world
#2     three  9  10  11.0  12     NaN
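
If you instead pass a flat list, those strings are treated as NA markers in every column; for example, additionally treating 'foo' as missing:

pd.read_csv('ex5.csv', na_values = ['foo'])  
#Output:
#  something  a   b     c   d message
#0       one  1   2   3.0   4     NaN
#1       two  5   6   NaN   8   world
#2     three  9  10  11.0  12     NaN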

More about read_csv and read_table.

Reading Text Files in Pieces

When processing very large files, you may want to read only a small piece of a file or iterate through it in smaller chunks.

If you want to read only a small number of rows, specify that with nrows:

pd.read_csv('ex6.csv', nrows = 5)  
#Output:
#        one       two     three      four key
#0  0.467976 -0.038649 -0.295344 -1.824726   L
#1 -0.358893  1.404453  0.704965 -0.200638   B
#2 -0.501840  0.659254 -0.421691 -0.057688   G
#3  0.204886  1.074134  1.388361 -0.982404   R
#4  0.354628 -0.133116  0.283763 -0.837063   Q

To read a file in pieces, specify a chunksize as a number of rows:

chunker = pd.read_csv('ex6.csv', chunksize = 1000)  
chunker  
#Output: <pandas.io.parsers.TextFileReader at 0xc10e748>

The TextFileReader object returned by read_csv allows you to iterate over the parts of the file according to the chunksize. We can iterate over ex6.csv, aggregating the value counts in the key column like so:

tot = pd.Series([], dtype = 'float64')  
for piece in chunker:  
    tot = tot.add(piece['key'].value_counts(), fill_value = 0)

tot = tot.sort_values(ascending = False)  
tot[:10]  
#Output: 
#E    368.0
#X    364.0
#L    346.0
#O    343.0
#Q    340.0
#M    338.0
#J    337.0
#F    335.0
#K    334.0
#H    330.0
#dtype: float64

TextFileReader is also equipped with a get_chunk method which enables you to read pieces of an arbitrary size.
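
As a quick sketch, assuming a fresh reader on ex6.csv:

chunker = pd.read_csv('ex6.csv', chunksize = 1000)  
piece = chunker.get_chunk(10)   # returns the next 10 rows as a DataFrame
len(piece)  
#Output: 10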

Writing Data Out to Text Format

Data can also be exported to delimited format. Using DataFrame's to_csv method, we can write the data out to a comma-separated file:

data = pd.read_csv('ex5.csv')  
data  
#Output: 
#  something  a   b     c   d message
#0       one  1   2   3.0   4     NaN
#1       two  5   6   NaN   8   world
#2     three  9  10  11.0  12     foo
data.to_csv('out.csv')  
!cat out.csv
#,something,a,b,c,d,message
#0,one,1,2,3.0,4,
#1,two,5,6,,8,world
#2,three,9,10,11.0,12,foo

Other delimiters can be used as well; writing to sys.stdout just prints the text result:

import sys  
data.to_csv(sys.stdout, sep = '|')  
#Output:
#|something|a|b|c|d|message
#0|one|1|2|3.0|4|
#1|two|5|6||8|world
#2|three|9|10|11.0|12|foo

Missing values appear as empty strings in the output. With no other options specified, both the row and column labels are written; both can be disabled, and missing values can be given a different representation with na_rep:

data.to_csv(sys.stdout, index = False, header = False, na_rep = 'NULL')  
#Output:
#one,1,2,3.0,4,NULL
#two,5,6,NULL,8,world
#three,9,10,11.0,12,foo

Series also has a to_csv method, and there is a from_csv convenience method for reading one back (deprecated in newer pandas in favor of read_csv).
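
A minimal sketch of writing a Series out (tseries.csv is just an illustrative filename; note that newer pandas versions write a header row by default):

dates = pd.date_range('1/1/2000', periods = 7)  
ts = pd.Series(range(7), index = dates)  
ts.to_csv('tseries.csv')  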

More about to_csv and from_csv.

Manually Working with Delimited Formats

For any file with a single-character delimiter, you can use Python's built-in csv module, passing any open file or file-like object to csv.reader:

import csv  
f = open('ex7.csv')  
reader = csv.reader(f)  
for line in reader:  
    print(line)
#Output:   
#['a', 'b', 'c']
#['1', '2', '3']
#['1', '2', '3', '4']

CSV files come in many different flavors. Defining a new format with a different delimiter, string quoting convention or line terminator is done by defining a simple subclass of csv.Dialect:

class my_dialect(csv.Dialect):  
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL   # required in Python 3; the base class leaves it unset

More about the csv module.