P4DA Data Loading, Storage, and File Formats 2

JSON Data

JSON (short for JavaScript Object Notation) has become one of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more flexible data format than a tabular text form like CSV.

To convert a JSON string to Python form, use json.loads:

obj = """  
{
  "foo": [ 1, null ],
  "baz": {
    "foo": [ true, "bar" ],
    "baz": "qux"
  }
}
"""
result = json.loads(obj)  
result  
#Output: {u'baz': {u'baz': u'qux', u'foo': [True, u'bar']}, u'foo': [1, None]}

json.dumps, on the other hand, converts a Python object back to JSON:

asjson = json.dumps(result)  
asjson  
#Output:
#'{"foo": [1, null], "baz": {"foo": [true, "bar"], "baz": "qux"}}'

Conveniently, you can pass a list of JSON objects to the DataFrame constructor and select a subset of the data fields (the field-selection part is shown in the sketch after this example):

from pandas import DataFrame

siblings = DataFrame(result['foo'])  
siblings  
#Output"
#     0
#0  1.0
#1  NaN
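
The field-selection part is easier to see with a list of JSON objects. Here is a minimal sketch continuing the session above; the records and column name are made up for illustration:

records = json.loads('[{"name": "Wes", "age": 33}, {"name": "Katie", "age": 38}]')  
DataFrame(records, columns=['name'])   # keep only the name field  
#Output:
#    name
#0    Wes
#1  Katie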

See the json module documentation for more details.
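
The module also reads and writes files directly with json.dump and json.load; a quick sketch (the file name is just for illustration):

with open('result.json', 'w') as f:  
    json.dump(result, f)    # write the parsed object back out as JSON

with open('result.json') as f:  
    json.load(f)            # and read it back into Python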

XML and HTML: Web Scraping

Python has many libraries for reading and writing data in the ubiquitous HTML and XML formats. lxml is one that has consistently strong performance in parsing very large files.

To get started, find the URL you want to extract data from, open it with urllib2 and parse the stream with lxml like so:

import pandas as pd  
%pylab
from lxml.html import parse  
from urllib2 import urlopen

parsed = parse(urlopen('https://finance.yahoo.com/quote/AAPL/options?ltr=1'))  
doc = parsed.getroot()  

Using this object, you can extract all HTML tags of a particular type via the document root's findall method along with an XPath expression:

links = doc.findall('.//a')  
links[15:20]  
#Output: 
#[<Element a at 0xc3568b8>,
# <Element a at 0xc356908>,
# <Element a at 0xc356958>,
# <Element a at 0xc3569a8>,
# <Element a at 0xc3569f8>]

To get the URL and link text, use each element's get method (for the href attribute) and text_content method:

lnk = links[25]  
lnk.get('href')  
#Output: '/brklivestream?soc_src=mail&soc_trk=ma'

lnk.text_content()  
#Output: u"Mark your calendarBerkshire Hathaway's 2017 Shareholders MeetingLIVE on Yahoo #Finance \u2022 May 6, starting at 10 a.m. ET"

Thus, getting a list of all URLs in the document is a matter of writing this list comprehension:

urls = [lnk.get('href') for lnk in doc.findall('.//a')]  
urls[-5:]  
#Output:
#['http://info.yahoo.com/relevantads/',
# 'http://info.yahoo.com/legal/us/yahoo/utos/utos-173.html',
# 'https://twitter.com/YahooFinance',
# 'https://facebook.com/yahoofinance',
# 'http://yahoofinance.tumblr.com']
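
The same findall approach works for HTML tables. A sketch of pulling the cell text out of the first table's rows, assuming the page still serves its data in <table> elements:

tables = doc.findall('.//table')  
rows = tables[0].findall('.//tr')

def _unpack(row, kind='td'):  
    # collect the text of each cell; pass kind='th' for header rows
    return [val.text_content() for val in row.findall('.//%s' % kind)]

_unpack(rows[0], kind='th')  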

Parsing XML with lxml.objectify

XML is another common structured data format supporting hierarchical, nested data with metadata.

Using lxml.objectify, we can parse the file and get a reference to the root node of the XML file with getroot:

from lxml import objectify  
path = 'Performance_MNR.xml'  
parsed = objectify.parse(open(path))  
root = parsed.getroot()  

root.INDICATOR returns a generator yielding each <INDICATOR> XML element. For each record, we can populate a dict of tag names to data values, excluding a few tags:

data = []  
skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ', 'DESIRED_CHANGE', 'DECIMAL_PLACE']

for elt in root.INDICATOR:  
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)

perf = DataFrame(data)  
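
lxml.objectify also works on XML fragments, and element attributes come back through get. A small sketch with a made-up tag; StringIO stands in for a real file (Python 2, matching the session above):

from StringIO import StringIO

tag = '<a href="http://www.google.com">Google</a>'  
root = objectify.parse(StringIO(tag)).getroot()

root.get('href')  
#Output: 'http://www.google.com'

root.text  
#Output: 'Google'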

Binary Data Formats

Conveniently, pandas objects all have a save method which writes the data to disk as a pickle. You read the data back into Python with pandas.load, another pickle convenience function:

frame.save('frame_pickle')  
pd.load('frame_pickle')  
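
Note that save and load were convenience wrappers in older pandas and have since been removed; on a recent version, the same round trip is spelled with to_pickle and read_pickle:

frame.to_pickle('frame_pickle')    # write frame to disk as a pickle  
pd.read_pickle('frame_pickle')     # and load it back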

Reading Microsoft Excel Files

pandas also supports reading tabular data stored in Excel files using the ExcelFile class. Internally, ExcelFile uses the xlrd and openpyxl packages, so you may have to install them first.

xls_file = pd.ExcelFile('data.xls')  
table = xls_file.parse('Sheet1')
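
On newer pandas versions, the same read can be done in one step with read_excel:

table = pd.read_excel('data.xls', sheet_name='Sheet1')  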