Input and output typically fall into a few main categories: reading text files and other more efficient on-disk formats, loading data from databases, and interacting with network sources like web APIs.
Reading and Writing Data in Text Format
pandas features a number of functions for reading tabular data as a DataFrame object. The Input/Output API covers the full details of file input and output, though read_csv and read_table are likely the ones you'll use the most.
An overview of the mechanics of these functions, which convert text data into a DataFrame:
* Indexing: can treat one or more columns as the returned DataFrame's index, and can get column names from the file, from the user, or not at all.
* Type inference and data conversion: includes user-defined value conversions and custom lists of missing value markers.
* Datetime parsing: includes combining capability, such as combining date and time information spread over multiple columns into a single column in the result.
* Iterating: support for iterating over chunks of very large files.
* Unclean data issues: skipping rows or a footer, comments, or other minor things like numeric data with thousands separated by commas.
Type inference is one of the more important features of these functions: you don't have to specify which columns are numeric, integer, or string. Let's start with a small comma-separated text file:
!cat ex1.csv
#Output:
#a,b,c,d,message
#1,2,3,4,hello
#5,6,7,8,world
#9,10,11,12,foo
Since this is comma-delimited, we can use read_csv to read it into a DataFrame; we could also have used read_table and specified the delimiter:
df = pd.read_csv('ex1.csv')
df = pd.read_table('ex1.csv', sep=',')
df
#Output:
# a b c d message
#0 1 2 3 4 hello
#1 5 6 7 8 world
#2 9 10 11 12 foo
A file will not always have a header row. In that case, you can let pandas assign default column names, or you can specify the names yourself. And if you want the message column to be the index of the returned DataFrame, pass it as the index_col argument:
pd.read_csv('ex2.csv', header = None)
#Output:
# 0 1 2 3 4
#0 1 2 3 4 hello
#1 5 6 7 8 world
#2 9 10 11 12 foo
pd.read_csv('ex2.csv', names = ['a','b','c','d','message'], index_col = 'message')
#Output:
# a b c d
#message
#hello 1 2 3 4
#world 5 6 7 8
#foo 9 10 11 12
In the event that you want to form a hierarchical index from multiple columns, just pass a list of column numbers or names:
parsed = pd.read_csv('csv_mindex.csv', index_col = ['key1', 'key2'])
parsed
#Output:
# value1 value2
#key1 key2
#one a 1 2
# b 3 4
# c 5 6
# d 7 8
#two a 9 10
# b 11 12
# c 13 14
# d 15 16
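The parsed result has a regular MultiIndex that can then be used for selection. A quick sketch, assuming a reasonably recent pandas with .loc available:
parsed.loc['one']
#Output:
# value1 value2
#key2
#a 1 2
#b 3 4
#c 5 6
#d 7 8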
In some cases, a table might not have a fixed delimiter, using whitespace or some other pattern to separate fields. In these cases, you can pass a regular expression as the delimiter for read_table:
list(open('ex3.txt'))
#Output:
#[' A B C\n',
# 'aaa -0.264438 -1.026059 -0.619500\n',
# 'bbb 0.927272 0.302904 -0.032399\n',
# 'ccc -0.264273 -0.386314 -0.217601\n',
# 'ddd -0.871858 -0.348382 1.100491\n']
In this case, fields are separated by a variable amount of whitespace. This can be expressed by the regular expression \s+, so we have:
result = pd.read_table('ex3.txt', sep = r'\s+')  # raw string avoids an invalid-escape warning
result
#Output:
# A B C
#aaa -0.264438 -1.026059 -0.619500
#bbb 0.927272 0.302904 -0.032399
#ccc -0.264273 -0.386314 -0.217601
#ddd -0.871858 -0.348382 1.100491
Because there is one fewer column name here than the number of data columns, read_table infers that the first column should be the DataFrame's index in this special case.
You can skip selected rows of a file with the skiprows argument:
!cat ex4.csv
## hey!
#a,b,c,d,message
## just wanted to make things more difficult for you
## who reads CSV files with computers, anyway?
#1,2,3,4,hello
#5,6,7,8,world
#9,10,11,12,foo
pd.read_csv('ex4.csv', skiprows = [0, 2, 3])
#Output:
# a b c d message
#0 1 2 3 4 hello
#1 5 6 7 8 world
#2 9 10 11 12 foo
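Since the unwanted lines in ex4.csv all begin with '#', the same result can also be had with the comment option, which takes a single character marking the rest of a line to be skipped:
pd.read_csv('ex4.csv', comment = '#')
#Output:
# a b c d message
#0 1 2 3 4 hello
#1 5 6 7 8 world
#2 9 10 11 12 foo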
Handling missing values is an important and frequently nuanced part of the file parsing process. By default, pandas uses a set of commonly occurring sentinels, such as NA, -1.#IND, and NULL:
!cat ex5.csv
#something,a,b,c,d,message
#one,1,2,3,4,NA
#two,5,6,,8,world
#three,9,10,11,12,foo
pd.read_csv('ex5.csv')
#Output:
# something a b c d message
#0 one 1 2 3.0 4 NaN
#1 two 5 6 NaN 8 world
#2 three 9 10 11.0 12 foo
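The na_values option can take a list or set of strings to treat as missing in addition to the defaults; NULL does not occur in this file, so the result here is unchanged:
pd.read_csv('ex5.csv', na_values = ['NULL'])
#Output:
# something a b c d message
#0 one 1 2 3.0 4 NaN
#1 two 5 6 NaN 8 world
#2 three 9 10 11.0 12 foo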
Different NA sentinels can be specified for each column in a dict:
sentinels = {'message': ['foo', 'NA'], 'something':['two']}
pd.read_csv('ex5.csv', na_values = sentinels)
#Output:
# something a b c d message
#0 one 1 2 3.0 4 NaN
#1 NaN 5 6 NaN 8 world
#2 three 9 10 11.0 12 NaN
More about read_csv and read_table can be found in the pandas documentation.
Reading Text Files in Pieces
When processing very large files, you may want to read in only a small piece of a file or iterate through smaller chunks of the file.
If you want to read only a small number of rows, specify that with nrows:
pd.read_csv('ex6.csv', nrows = 5)
#Output:
# one two three four key
#0 0.467976 -0.038649 -0.295344 -1.824726 L
#1 -0.358893 1.404453 0.704965 -0.200638 B
#2 -0.501840 0.659254 -0.421691 -0.057688 G
#3 0.204886 1.074134 1.388361 -0.982404 R
#4 0.354628 -0.133116 0.283763 -0.837063 Q
To read a file in pieces, specify a chunksize as a number of rows:
chunker = pd.read_csv('ex6.csv', chunksize = 1000)
chunker
#Output: <pandas.io.parsers.TextFileReader at 0xc10e748>
The TextFileReader object returned by read_csv allows you to iterate over the parts of the file according to the chunksize. For example, we can iterate over ex6.csv, aggregating the value counts in the 'key' column like so:
tot = pd.Series([], dtype = 'float64')
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value = 0)
tot = tot.sort_values(ascending = False)
tot[:10]
#Output:
#E 368.0
#X 364.0
#L 346.0
#O 343.0
#Q 340.0
#M 338.0
#J 337.0
#F 335.0
#K 334.0
#H 330.0
#dtype: float64
TextFileReader is also equipped with a get_chunk method, which enables you to read pieces of the file of an arbitrary size.
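A minimal sketch, using a fresh reader since the loop above exhausted chunker:
reader = pd.read_csv('ex6.csv', chunksize = 1000)
piece = reader.get_chunk(10)  # read just the next 10 rows
len(piece)
#Output: 10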
Writing Data Out to Text Format
Data can also be exported to a delimited format. Using DataFrame's to_csv method, we can write the data out to a comma-separated file:
data = pd.read_csv('ex5.csv')
data
#Output:
# something a b c d message
#0 one 1 2 3.0 4 NaN
#1 two 5 6 NaN 8 world
#2 three 9 10 11.0 12 foo
data.to_csv('out.csv')
!cat out.csv
#,something,a,b,c,d,message
#0,one,1,2,3.0,4,
#1,two,5,6,,8,world
#2,three,9,10,11.0,12,foo
Other delimiters can be used as well. Writing to sys.stdout just prints the text result:
import sys
data.to_csv(sys.stdout, sep = '|')
#Output:
#|something|a|b|c|d|message
#0|one|1|2|3.0|4|
#1|two|5|6||8|world
#2|three|9|10|11.0|12|foo
Missing values appear as empty strings in the output; they can be denoted by some other sentinel value with na_rep. With no other options specified, both the row and column labels are written; both can be disabled:
data.to_csv(sys.stdout, index = False, header = False, na_rep = 'NULL')
#Output:
#one,1,2,3.0,4,NULL
#two,5,6,NULL,8,world
#three,9,10,11.0,12,foo
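You can also write only a subset of the columns, in an order of your choosing, with the columns argument:
data.to_csv(sys.stdout, index = False, columns = ['a', 'b', 'c'])
#Output:
#a,b,c
#1,2,3.0
#5,6,
#9,10,11.0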
Series also has a to_csv method, and there is a from_csv convenience method (in older pandas versions) that makes reading a Series back in a bit simpler.
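For instance, a short sketch writing a small time series out; this assumes numpy is available, and the exact CSV layout varies a little across pandas versions:
import numpy as np
dates = pd.date_range('1/1/2000', periods = 7)
ts = pd.Series(np.arange(7), index = dates)
ts.to_csv('tseries.csv')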
Manually Working with Delimited Formats
For any file with a single-character delimiter, you can use Python's built-in csv module by passing any open file or file-like object to csv.reader:
import csv
f = open('ex7.csv')
reader = csv.reader(f)
for line in reader:
    print(line)
#Output:
#['a', 'b', 'c']
#['1', '2', '3']
#['1', '2', '3', '4']
CSV files come in many different flavors. Defining a new format with a different delimiter, string quoting convention, or line terminator is done by defining a simple subclass of csv.Dialect:
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL  # needed so the dialect validates
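The dialect can then be passed to csv.reader or csv.writer. A minimal sketch writing a file with it (the filename is just for illustration):
with open('mydata.csv', 'w', newline = '') as f:
    writer = csv.writer(f, dialect = my_dialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))
!cat mydata.csv
#one;two;three
#1;2;3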
More about the csv module can be found in the Python documentation.