Python Notebooks

Current Population Survey Microdata with Python

Brian Dew, @bd_econ, August 22, 2018

The following is a minimal working example of using python and the struct method to read CPS microdata. It requires two files to be saved locally: the data dictionary, January_2017_Record_Layout.txt, and the large (13MB compressed, 146MB uncompressed) CPS public use microdata file, apr17pub.dat. Both are downloaded from the Census CPS ftp page.

First, the Census data dictionary file identifies where the relevant variables are located in the microdata file. Because the microdata file is fixed-width format with no headers, there is nothing inside of the file that tells what each variable is or where it is stored--hence the need for a separate dictionary.

Once the data structure is identified, the relevant parts of each row of microdata are extracted into a big list. Pandas, which makes further data cleaning and calculations easier, is able to quickly read this list. To confirm that the data are loaded properly, the example successfully benchmarks the results of a local calculation of what share of women age 25-54 are employed in April 2017 against the published BLS estimate.

The code benefits greatly from the excellent four-part series (1, 2, 3, 4) by Tom Augspurger and from very patient guidance with the CPS from John Schmitt.

Grateful for any advice on making the code better!

Requirements:

Python (3.6 used, probably works on 2.7+), pandas, and numpy

Two external files:

1) January_2017_Record_Layout.txt (a data dictionary text file: http://thedataweb.rm.census.gov/pub/cps/basic/201701-/January_2017_Record_Layout.txt); and 2) apr17pub.dat (CAUTION: large public use microdata file [13MB compressed, 146MB uncompressed]: http://thedataweb.rm.census.gov/pub/cps/basic/201501-/apr17pub.zip)

In [1]:
# Import relevant libraries (python 3.6)
import os, re, struct
import pandas as pd
import numpy as np

# check that two required files saved in the cwd
print(os.listdir(os.curdir))
['.ipynb_checkpoints', 'apr17pub.dat', 'cps_example.ipynb', 'January_2017_Record_Layout.txt']

CPS data dictionary

Census posts a file that describes how to read the fixed-width format CPS microdata files. This "data dictionary" file can be machine read. In this example, the January 2017 data dictionary text file is saved locally then used to manually find and type out the "names" of four variables: PRTAGE for age, PESEX for gender, PREMPNOT for employment status, and PWCMPWGT for the sample weight.

In [2]:
# read data dictionary text file 
data_dict = open('January_2017_Record_Layout.txt').read()

# manually list out the IDs for series of interest 
var_names = ['PRTAGE', 'PESEX', 'PREMPNOT', 'PWCMPWGT']

Regular expressions

The data dictionary text file follows a pattern that makes it machine readable. In this case, specific rows of interest follow the pattern of: VariableName [space] VariableLength [space] VariableDescription [space] VariableStart - VariableEnd. The regex pattern that captures the data of interest is identified by p.

The python numbering system starts at zero and does not include the last number from a range. As a result, the start location of a variable, as identified in the data dictionary, is adjusted by -1. Additionally, the length of the variable value, its "width" in the dataset, is stored as a string ending in s, the struct format code for a byte containing a single character. This will be used later in reading the microdata

In [3]:
# regular expression matching series name and data dict pattern
p = f'\n({"|".join(var_names)})\s+(\d+)\s+.*?\t+.*?(\d\d*).*?(\d\d+)'

# dictionary of variable name: [start, end, and length + 's']
d = {s[0]: [int(s[2])-1, int(s[3]), f'{s[1]}s']
     for s in re.findall(p, data_dict)}

print(d)
{'PRTAGE': [121, 123, '2s'], 'PESEX': [128, 130, '2s'], 'PREMPNOT': [392, 394, '2s'], 'PWCMPWGT': [845, 855, '10s']}

Struct method

There are at least three ways to read a fixed-width format file in python: 1) string slicing on each row for each variable; 2) pandas.read_fwf(file, colspecs); or 3) use struct and a format string to identify what part of each CPS row to keep as a variable and what part to ignore. The pandas method is the easiest to use, but the slowest. The struct method is the most difficult to use but the fastest. If you need to read 40 variables for all months going back to 1994, you might want to consider writing out a script that applies the struct method.

Here is a minimal example:

In [4]:
# lists of variable starts, ends, and lengths
start, end, width = zip(*d.values())

# create list of which characters to skip in each row
skip = ([f'{s - e}x' for s, e in zip(start, [0] + list(end[:-1]))])

# create format string by joining skip and variable segments
unpack_fmt = ''.join([j for i in zip(skip, width) for j in i])
print(unpack_fmt)

# struct can interpret row bytes with the format string
unpacker = struct.Struct(unpack_fmt).unpack_from
121x2s5x2s262x2s451x10s

Fixed width format

The string format above, called unpack_fmt, can be read as: ignore the first 121 characters (121x), store the next two (2s), ignore the next five (5x), store the next two (2s), and so on. This format matches the structure of each line of the fwf data. Here's the first row of the raw CPS microdata, as an example:

In [5]:
print(open('apr17pub.dat').readline())
000110116792163 42017 120100-1 1 1-1 115-1-1-1  15049796 1 2 1 7 2 0 205011 2  1 1-1-1-1 36 01 338600001103000   -1-1 1-1420 1 2 1 2-1 243 1-1 9-1 1-1 1 1 1 2 1 2 57 57 57 1 0 0 1 1 1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1 2-150-1-1 50-1-1-1-1 2-1 2-150-1 50-1-1    2 5 5-1 2 3 5 2-1-1-1-1-1 -1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1 -1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1 1-121 1 1 1 6-1-1-1 -1-1-1 1 2-1-1-1-1 1 2 1 6 4      -1-1       4 3 3 1 2 4-1-1 6-138-114-1 1 9-1 3-1 2 1 1 1 0-1-1-1-1  -1  -1  -1  -10-1      -10-1-1      -1      -10-1-1-1-1-1-1-1-1-1 2-1-1 2  15049796  22986106         0  16044411  15280235 0 0 1-1-1-1 0 0 1 0-1 050 0 0 0 0 1 0 0 0-1-1-1 1 0 0-1 1 1 0 1 0 1 1 0 1 1 1 0 1 0 1 1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1 0 0 0-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1 0 1 1 3865 1-1-1-1-1-1-1 1 1 1-1-1-1  1573071277704210  -1  -114-1-1-1-1-1 0-1-1-1-1-15050 1 1 1 2 2 2 2 2 2 2 0 0 0 0 0 0 0-1-1-1-1-1 1 1 1202020                                            A

If we ignore the first 121 characters and then keep the next two, we find 42 which is the age of the person in the first row of the microdata.

Read the raw microdata

First, open the raw CPS microdata file with the readlines method. The raw data is filtered to only include observations with a positive sample weight. The unpacker identified above is applied to each row of microdata, while the extracted variable values are converted to integers from strings.

The first five observations are printed. They are much more readable than the raw data, but still not identified by name.

In [6]:
# open file (read as binary) and read lines into "raw_data"
raw_data = open('apr17pub.dat', 'rb').readlines()

wgt = d['PWCMPWGT']  # Location of sample weight variable

# unpack and store data of interest if sample weight > 0
data = [[*map(int, unpacker(row))] for row in raw_data
        if int(row[wgt[0]:wgt[1]]) > 0]

print(data[:5])
[[42, 1, 1, 15730712], [26, 2, 1, 14582612], [25, 2, 1, 20672047], [42, 2, 4, 15492377], [47, 1, 1, 18155638]]

Create pandas dataframe from CPS data

Pandas is a useful tool for working with the data--here is it used to filter out the observations by age and gender. I also convert the sample weights to the values they are intended to represent, as the data dictionary indicates that they have four implied decimal places. That is, the first person in the sample below represents "1,458.2612" people in the US population.

In [7]:
# Pandas dataframe of women age 25 to 54
df = (pd.DataFrame(data, columns=d.keys())
      .query('PESEX == 2 and 25 <= PRTAGE <= 54')
      .assign(PWCMPWGT = lambda x: x['PWCMPWGT'] / 10000))

print(df.head().to_string(index=False))
PRTAGE  PESEX  PREMPNOT   PWCMPWGT
    26      2         1  1458.2612
    25      2         1  2067.2047
    42      2         4  1549.2377
    49      2         1  1633.0038
    26      2         1  1611.2316

Benchmarking against BLS published data

The published value for April 2017 is 72.3%

In [8]:
# identify employed portion of group as 1 & the rest as 0
empl = np.where(df['PREMPNOT'] == 1, 1, 0)

# take sample weighted average of employed portion of group
epop = np.average(empl, weights=df['PWCMPWGT'])

# print out the result to check against LNU02300062
print(f'April 2017: {epop*100:.1f}%')
April 2017: 72.3%