P4DA Example3

US Baby Names 1880-2010

The United States Social Security Administration has made available data on the frequency of baby names from 1880 through the present.

Data set Download:

US National baby names

There are many things you might want to do with the data set:

  • Visualize the proportion of babies given a particular name over time
  • Determine the relative rank of a name
  • Determine the most popular names in each year or the names with largest increase
  • Analyze trends in names: vowels, consonants, length, overall diversity, changes in spelling, first and last letters
  • Analyze external sources of trends: biblical names, celebrities, demographic changes

As this data set is in a nicely comma-separated form, it can be loaded into a DataFrame with pandas.read_csv:

import pandas as pd  
names2015 = pd.read_csv('yob2015.txt', names = ['name', 'sex', 'births'])  
#     name sex  births
#0    Emma   F   20355
#1  Olivia   F   19553
#2  Sophia   F   17327

These files only contain names with at least five occurrences in each year, so for simplicity's sake we can use the sum of the births column by sex as the total number of births in that year:

names2015.groupby('sex').births.sum()  
#sex
#F    1769325
#M    1898858
#Name: births, dtype: int64

Since the data set is split into files by year, one of the first things to do is to assemble all of the data into a single DataFrame and add a year field. This can be done with pandas.concat:

years = range(1880, 2016)  
pieces = []  
columns = ['name', 'sex', 'births']

for year in years:  
    file = 'yob%d.txt' % year
    frame = pd.read_csv(file, names = columns)
    frame['year'] = year
    pieces.append(frame)

#Concatenate everything into a single DataFrame
names = pd.concat(pieces, ignore_index = True)

#        name sex  births  year
#0       Mary   F    7065  1880
#1       Anna   F    2604  1880
#2       Emma   F    2003  1880
#3  Elizabeth   F    1939  1880
#4     Minnie   F    1746  1880
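
The effect of ignore_index=True can be seen on a pair of tiny made-up frames standing in for two yearly files (the numbers here are invented for illustration):

```python
import pandas as pd

# Hypothetical miniature stand-ins for two yearly files
a = pd.DataFrame({'name': ['Mary', 'John'], 'sex': ['F', 'M'], 'births': [7065, 9655]})
a['year'] = 1880
b = pd.DataFrame({'name': ['Mary', 'John'], 'sex': ['F', 'M'], 'births': [6919, 8769]})
b['year'] = 1881

# ignore_index=True discards each piece's original row labels and
# builds a fresh 0..n-1 index for the combined frame
combined = pd.concat([a, b], ignore_index=True)
```

Without ignore_index=True, the combined frame would repeat the row labels 0 and 1 from each piece.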

With this data in hand, we can aggregate the data at the year and sex level using groupby or pivot_table:

total_births = names.pivot_table('births', index = 'year', columns = 'sex', aggfunc = 'sum')

#sex        F       M
#year
#1880   90992  110490
#1881   91953  100743
#1882  107848  113686
#1883  112318  104627
#1884  129020  114443

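pivot_table here is shorthand for a groupby-sum followed by unstacking the sex level into columns; a sketch on made-up numbers (the figures below are illustrative, not from the real files):

```python
import pandas as pd

# Tiny illustrative stand-in for the names frame
names = pd.DataFrame({
    'sex':    ['F', 'M', 'F', 'M'],
    'year':   [1880, 1880, 1881, 1881],
    'births': [90992, 110490, 91953, 100743],
})

# pivot_table(..., aggfunc='sum') is equivalent to grouping by
# (year, sex), summing births, and unstacking sex into columns
via_pivot = names.pivot_table('births', index='year', columns='sex', aggfunc='sum')
via_groupby = names.groupby(['year', 'sex']).births.sum().unstack('sex')
```
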
Plot the Total births by sex and year:

total_births.plot(title = "Total births by sex and year")  

total births by sex and year

Let's insert a column prop with the fraction of babies given each name relative to the total number of births.

def add_prop(group):  
    #Cast to float so integer division doesn't floor (a Python 2 concern)
    births = group.births.astype(float)
    group['prop'] = births / births.sum()
    return group

names = names.groupby(['year', 'sex']).apply(add_prop)  
#   name sex  births  year      prop
#0  Mary   F    7065  1880  0.077644
#1  Anna   F    2604  1880  0.028618
#2  Emma   F    2003  1880  0.022013

Check that the prop column sums to 1 within each group. Since this is floating-point data, use np.allclose to verify that the group sums are sufficiently close to 1:

import numpy as np  
np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)  

Extract a subset of the data to facilitate further analysis: the top 1000 names for each sex/year combination:

def get_top1000(group):  
    return group.sort_values(by = 'births', ascending = False)[:1000]

grouped = names.groupby(['year', 'sex'])  
top1000 = grouped.apply(get_top1000)  
#Drop the group index, not needed
top1000.reset_index(inplace = True, drop = True)  

Analyzing Naming Trends

First, split the top 1000 names into the boy and girl portions:

boys = top1000[top1000.sex == 'M']  
girls = top1000[top1000.sex == 'F']

#form a pivot table of the total number of births by year and name
total_births = top1000.pivot_table('births', index = 'year', columns = 'name', aggfunc = 'sum')

#get the subset 
subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]  
subset.plot(subplots = True, figsize=(12, 10), grid = False, title = "Number of births per year")  

Number of births per year

Measure the increase in naming diversity

The hypothesis that fewer parents are choosing common names for their children can be explored and confirmed in the data. One measure is the proportion of births represented by the top 1000 most popular names, which I aggregate and plot by year and sex:

table = top1000.pivot_table('prop', index = 'year', columns = 'sex', aggfunc = 'sum')  
table.plot(title = "Sum of table1000.prop by year and sex", yticks = np.linspace(0, 1.2, 13), xticks = range(1880, 2020, 10))  

Sum of table1000.prop by year and sex

Indeed there appears to be increasing name diversity: the total proportion of births captured by the top 1000 names is decreasing over time.

Another interesting metric is the number of distinct names, taken in order of popularity from highest to lowest, that make up the top 50% of births. Let's just consider the boy names from 2010:

df = boys[boys.year == 2010]  

After sorting prop in descending order, we want to know how many of the most popular names it takes to reach 50%. Taking the cumulative sum of prop with cumsum and then calling searchsorted returns the position in the cumulative sum at which 0.5 would need to be inserted to keep it sorted:

prop_cumsum = df.sort_values(by = 'prop', ascending = False).prop.cumsum()  
prop_cumsum.values.searchsorted(0.5)  
#116

Since arrays are zero-indexed, adding 1 to this result gives 117 names.
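
The searchsorted semantics can be checked on a small synthetic Series of proportions (values invented for illustration):

```python
import pandas as pd

# Invented proportions, already sorted in descending order
prop = pd.Series([0.4, 0.3, 0.2, 0.1])
cumsum = prop.cumsum()  # 0.4, 0.7, 0.9, 1.0

# searchsorted returns the insertion position that would keep the
# cumulative sum sorted: the zero-based index of the first name
# whose running total exceeds 0.5
pos = cumsum.values.searchsorted(0.5)
# pos is 1, so pos + 1 = 2 names are needed to pass 50%
```
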

By contrast, in 1900 this number was much smaller:

df = boys[boys.year == 1900]  
in1900 = df.sort_values(by = 'prop', ascending = False).prop.cumsum()  
in1900.values.searchsorted(0.5) + 1  
#25

Now, apply this operation to each year/sex combination: groupby those fields and apply a function returning the count for each group:

def get_quantile_count(group, q = 0.5):  
    group = group.sort_values(by = 'prop', ascending = False)
    return group.prop.cumsum().values.searchsorted(q) + 1

diversity = top1000.groupby(['year', 'sex']).apply(get_quantile_count)  
diversity = diversity.unstack('sex')  
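
The same per-group logic can be verified on a tiny made-up frame with one diverse and one concentrated group (all numbers invented):

```python
import pandas as pd

# Two hypothetical (year, sex) groups: 2010 is more diverse than 1900
demo = pd.DataFrame({
    'year': [2010, 2010, 2010, 1900, 1900, 1900],
    'sex':  ['M'] * 6,
    'prop': [0.3, 0.3, 0.4,   # diverse: two names needed to pass 50%
             0.7, 0.2, 0.1],  # concentrated: the top name alone passes 50%
})

def get_quantile_count(group, q=0.5):
    group = group.sort_values(by='prop', ascending=False)
    return group.prop.cumsum().values.searchsorted(q) + 1

# One count per (year, sex) group
counts = demo.groupby(['year', 'sex']).apply(get_quantile_count)
```
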

The "Last Letter" Revolution

First aggregate all of the births in the full data set by year, sex, and final letter:

#use lambda function
get_last_letter = lambda x: x[-1]  
last_letters = names.name.map(get_last_letter)  
last_letters.name = 'last_letter'

table = names.pivot_table('births', index = last_letters, columns = ['sex', 'year'], aggfunc = 'sum')  

The table gives full detail on the last letters for every year from 1880 to 2015, so I select three representative years spanning the history and print the first few rows:

subtable = table.reindex(columns = [1910, 1960, 2010], level = 'year')  

The output is shown below:

Subtable output

Normalize the table by total births to compute a new table containing proportion of total births for each sex ending in each letter:

subtable.sum()  
#sex  year
#F    1910     396503.0
#     1960    2022121.0
#     2010    1773440.0
#M    1910     194216.0
#     1960    2132756.0
#     2010    1914629.0
#dtype: float64

letter_prop = subtable / subtable.sum().astype(float)  

With the letter proportions now in hand, I can make bar plots for each sex broken down by year:

import matplotlib.pyplot as plt  
fig, axes = plt.subplots(2, 1, figsize = (10, 8))  
letter_prop['M'].plot(kind = 'bar', rot = 0, ax = axes[0], title = 'Male')  
letter_prop['F'].plot(kind = 'bar', rot = 0, ax = axes[1], title = 'Female', legend = False)  

Proportion of boy and girl names ending in each letter

As we can see, boy names ending in 'n' have experienced significant growth since the 1960s. Going back to the full table created above, I again normalize by year and sex and select a subset of letters for the boy names, finally transposing to make each column a time series:

letter_prop = table / table.sum().astype(float)  
dny_ts = letter_prop.loc[['d', 'n', 'y'], 'M'].T  

#last_letter         d         n         y
#1880         0.083057  0.153217  0.075763
#1881         0.083242  0.153212  0.077455
#1882         0.085332  0.149561  0.077538
#1883         0.084051  0.151653  0.079148
#1884         0.086121  0.149926  0.080407


dny_ts.plot()  

Proportion of boys born with names ending in d/n/y over time

Boy names that became girl names (and vice versa)

Going back to the top1000 data set, I compute a list of names occurring in the data set containing 'lesl':

all_names = top1000.name.unique()  
mask = np.array(['lesl' in x.lower() for x in all_names])  
lesley_like = all_names[mask]  
#array(['Leslie', 'Lesley', 'Leslee', 'Lesli', 'Lesly'], dtype=object)

We can filter down to just those names and sum births grouped by name to see the relative frequencies:

filtered = top1000[top1000.name.isin(lesley_like)]  
filtered.groupby('name').births.sum()  
#name
#Leslee       993
#Lesley     35032
#Lesli        929
#Leslie    376105
#Lesly      11431
#Name: births, dtype: int64

table = filtered.pivot_table('births', index = 'year', columns = 'sex', aggfunc = 'sum')  
table = table.div(table.sum(1), axis = 0)  
table.plot(style={'M':'k-', 'F':'k--'})  

Lesley-like name over time
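
The div(..., axis=0) normalization used above can be sketched on a tiny made-up table:

```python
import pandas as pd

# Invented counts of births by year and sex
t = pd.DataFrame({'F': [200, 50], 'M': [300, 150]}, index=[1880, 1881])

# Dividing each row by that row's total turns raw counts into
# within-year proportions, so every row of the result sums to 1
normed = t.div(t.sum(1), axis=0)
```

Here 1880 normalizes to F = 0.4, M = 0.6, which is the form the plot above needs to show the F/M split per year.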