P4DA Example1

usa.gov data from bit.ly

Data source: usagov-bitly-data

First, change your current directory to the one containing the data file, using Python's os module:

import os  
os.chdir(u"Your data file directory")

%pwd  #using %pwd to check your current directory
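
Alternatively, you can skip chdir and build an absolute path with os.path.join (a minimal sketch; the directory string is a placeholder, as above):

import os  
data_dir = u"Your data file directory"  # placeholder, replace with the real path
path = os.path.join(data_dir, 'usagov_bitly_data2012-03-16-1331923249.txt')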

If the directory is correct, you can read the file using open:

file = 'usagov_bitly_data2012-03-16-1331923249.txt'  
open(file).readline()  # `readline` reads the first line of the file

The data file is in JSON format, so use the json module to parse it:

import json  
records = [json.loads(line) for line in open(file)]  
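
A small variant, not in the original, closes the file handle explicitly with a with block:

with open(file) as f:
    records = [json.loads(line) for line in f]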

Conveniently, iterating over an open file handle gives you a sequence of its lines. The resulting object, records, is now a list of Python dicts:

records[0]['tz']  
#Output: u'America/New_York'

print records[0]['tz']  
#Output: America/New_York

Counting Time Zones in Pure Python

Extract a list of time zones using a list comprehension:

time_zones = [rec['tz'] for rec in records if 'tz' in rec]  
# Filter on 'tz' in rec because not all records have a 'tz' field

time_zones[:5]

#Output:
# [u'America/New_York',
# u'America/Denver',
# u'America/New_York',
# u'America/Sao_Paulo',
# u'America/New_York']

Define a function to count the occurrences of each time zone:

def get_counts(sequence):  
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

Or you can use the collections module to initialize the counts dict more briefly:

from collections import defaultdict

def get_counts2(sequence):  
    counts = defaultdict(int)  # values initialize to 0 automatically
    for x in sequence:
        counts[x] += 1
    return counts
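
A third equivalent, just another way to write the same loop, uses dict.get with a default value:

def get_counts3(sequence):  
    counts = {}
    for x in sequence:
        counts[x] = counts.get(x, 0) + 1  # 0 the first time x is seen
    return counts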

Test the function to get the distribution of time zones:

counts = get_counts(time_zones)

counts['America/New_York']  
#Output: 1251

If you want the top N time zones, you can define a function to extract them:

def top_counts(count_dict, n=10):  
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()  # sorts the (count, tz) pairs in ascending order
    return value_key_pairs[-n:]  # the last n pairs are the n largest
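
For example, based on the counts shown later in this section, the top 3 come back as ascending (count, tz) pairs:

top_counts(counts, n=3)  
#Output:
#[(400, u'America/Chicago'), (521, u''), (1251, u'America/New_York')]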

Using the collections module's built-in Counter class makes this even easier:

from collections import Counter

counts = Counter(time_zones)

counts.most_common(10)  # the 10 most common time zones and their counts  

Counting Time Zones with pandas

The main pandas data structure is the DataFrame. Creating a DataFrame from the original set of records:

import pandas  
from pandas import DataFrame, Series  
frame = DataFrame(records)

frame  
#Output: 
#       _heartbeat_                                                  a  \
#0              NaN  Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...   
#1              NaN                             GoogleMaps/RochesterNY   
#2              NaN  Mozilla/4.0 (compatible; MSIE 8.0; Windows NT ...
#...

frame['tz'][:3]  
#Output: 
#0    America/New_York
#1      America/Denver
#2    America/New_York
#Name: tz, dtype: object

The Series object returned by frame['tz'] has a method value_counts that gives us what we're looking for:

tz_counts = frame['tz'].value_counts()  
tz_counts[:5]  
#Output:
#America/New_York       1251
#                        521
#America/Chicago         400
#America/Los_Angeles     382
#America/Denver          191
#Name: tz, dtype: int64

Let's make a plot of this data using the plotting library matplotlib. First, fill in a substitute value for the unknown and missing time zone data in the records: the fillna method replaces missing (NA) values, and unknown (empty string) values can be replaced by boolean array indexing:

clean_tz = frame['tz'].fillna('Missing')  
clean_tz[clean_tz == ''] = 'Unknown'  
tz_counts = clean_tz.value_counts()  
tz_counts[:6]  
#Output:
#America/New_York       1251
#Unknown                 521
#America/Chicago         400
#America/Los_Angeles     382
#America/Denver          191
#Missing                 120
#Name: tz, dtype: int64

Making a horizontal bar plot can be accomplished using the plot method on the counts object:

tz_counts[:10].plot(kind='barh', rot=0)  

The plot shows:

[Figure: Top time zones]
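
If you are not running IPython in pylab mode, you may need matplotlib.pyplot explicitly to render or save the figure (a sketch, assuming matplotlib is installed):

import matplotlib.pyplot as plt

tz_counts[:10].plot(kind='barh', rot=0)
plt.tight_layout()            # keep the long time zone labels from being clipped
plt.savefig('tz_counts.png')  # or plt.show() for an interactive window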

The a field contains information about the browser, device, or application used to perform the URL shortening:

frame['a'][1]  
#Output: 
#u'GoogleMaps/RochesterNY'
frame['a'][50]  
#Output:
#u'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2'

For example, we could split off the first token in the string and make another summary of user behavior:

results = Series([x.split()[0] for x in frame.a.dropna()])  
results[:5]  
#Output:
#0               Mozilla/5.0
#1    GoogleMaps/RochesterNY
#2               Mozilla/4.0
#3               Mozilla/5.0
#4               Mozilla/5.0
#dtype: object
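
The same value_counts method summarizes these tokens; the output is omitted here since it depends on the data file:

results.value_counts()[:8]  # the 8 most common user-agent tokens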

Exclude the records where the agent is missing:

cframe = frame[frame.a.notnull()]

import numpy as np

# Classify each agent as 'Windows' or 'Not Windows' in one vectorized step
operating_system = np.where(cframe['a'].str.contains('Windows'), 'Windows', 'Not Windows')

operating_system[:5]  
#Output:
#array(['Windows', 'Not Windows', 'Windows', 'Not Windows', 'Windows'], 
#      dtype='|S11')

Now you can group the data by the time zone column and this new array of operating systems:

by_tz_os = cframe.groupby(['tz', operating_system])  
#reshape into a table using unstack
agg_counts = by_tz_os.size().unstack().fillna(0)  
agg_counts[:5]  
#Output:
#                     Not Windows  Windows
#tz                                       
#                           245.0    276.0
#Africa/Cairo                 0.0      3.0
#Africa/Casablanca            0.0      1.0
#Africa/Ceuta                 0.0      2.0
#Africa/Johannesburg          0.0      1.0
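
If size().unstack() is unfamiliar, a toy example (hypothetical values, not from this dataset) shows the reshape from a two-level grouped Series into a table:

toy = DataFrame({'tz': ['a', 'a', 'b'], 'os': ['W', 'N', 'W']})
toy.groupby(['tz', 'os']).size().unstack().fillna(0)
#Output:
#os     N    W
#tz           
#a    1.0  1.0
#b    0.0  1.0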

Select the top overall time zones. Construct an indirect index array from the row counts in agg_counts:

# argsort returns the positions that would sort the row totals ascending
indexer = agg_counts.sum(1).argsort()  
indexer[:5]  
#Output:
#tz
#                       24
#Africa/Cairo           20
#Africa/Casablanca      21
#Africa/Ceuta           92
#Africa/Johannesburg    87
#dtype: int64
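
If argsort is unfamiliar, here is a tiny illustration with hypothetical values:

s = Series([10, 3, 7])
s.argsort()  # returns 1, 2, 0: the positions that would sort the values
# s.take([1, 2, 0]) gives 3, 7, 10 -- ascending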

Use take to select the rows in that order, then slice off the last 5 rows:

count_subset = agg_counts.take(indexer)[-5:]  
count_subset  
#Output:
#                     Not Windows  Windows
#tz                                       
#America/Denver             132.0     59.0
#America/Los_Angeles        130.0    252.0
#America/Chicago            115.0    285.0
#                           245.0    276.0
#America/New_York           339.0    912.0
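
On newer pandas versions, Series.nlargest can replace the argsort/take combination; note it returns rows in descending order, so the bar plot order would be flipped (an equivalent sketch, not from the original):

top5 = agg_counts.sum(1).nlargest(5)
count_subset_alt = agg_counts.loc[top5.index]  # same five rows, descending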

count_subset.plot(kind='barh', stacked=True)  

The stacked bar plot:

[Figure: Top time zones by Windows and non-Windows users]

The plot doesn't make it easy to see the relative percentage of Windows users in the smaller groups, but the rows can easily be normalized to sum to 1 and then plotted again:

normed_subset = count_subset.div(count_subset.sum(1), axis=0)  
normed_subset.plot(kind='barh', stacked=True)  

[Figure: Normalized proportion of Windows and non-Windows users in the top time zones]
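
As a quick sanity check, every row of the normalized table should now sum to 1:

normed_subset.sum(1)  # each value should be 1.0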