Current Population Survey Microdata with Python: a quick example

March 10, 2018

Note: IPUMS is likely the quickest interface for retrieving CPS data; the process below can be avoided entirely by using IPUMS.
See Tom Augspurger's blog and GitHub as the definitive resource for working with CPS microdata in Python.

If your research requires reading raw CPS microdata, which are stored in fixed-width format text files covering one month each, you can use Python to do so.

The Census FTP page contains the microdata files and the dictionaries identifying each variable's name, location, value range, and whether it applies to a restricted sample. To follow this example, download the April 2017 compressed data file that matches your operating system and unpack it in the same location as the Python code. Next, download the January 2017 data dictionary text file and save it in the same location.
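If you prefer to script the download, a minimal sketch using only the standard library is below. The URL is an assumption based on the Census site layout, so confirm the exact path on the Census FTP page before relying on it.

```python
import io
import urllib.request
import zipfile

# NOTE: this URL is an assumption based on the Census site layout --
# confirm the exact path on the Census FTP page before relying on it.
CPS_URL = ('https://www2.census.gov/programs-surveys/cps/'
           'datasets/2017/basic/apr17pub.zip')

def unpack(zip_bytes, dest='.'):
    """Extract every member of a zip archive (passed as bytes) into dest."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        zf.extractall(dest)
        return zf.namelist()

def download_cps(url=CPS_URL, dest='.'):
    """Download the compressed monthly file and unpack it next to the code."""
    with urllib.request.urlopen(url) as resp:
        return unpack(resp.read(), dest)
```

If the URL is right, calling download_cps() should leave apr17pub.dat in the working directory, matching the manual download described above.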

Python 3.6


# Import packages 
import pandas as pd  # pandas 0.22
import numpy as np
import re            # regular expressions

Use the January 2017 data dictionary to find variable locations

The example will calculate the employment-to-population ratio for women between the ages of 25 and 54 in April 2017. To do this, we need to find the appropriate data dictionary on the Census FTP site, in this case January_2017_Record_Layout.txt, open it with Python, and read the text inside.

We find that the BLS composite weight is called PWCMPWGT, the age variable is called PRTAGE, the sex variable is called PESEX and women are identified by 2, and the employment status is stored as PREMPNOT.

You may also notice that the dictionary follows a pattern, where variable names and locations are stored on the same line and in the same order. Regular expressions can be used to extract the parts of this pattern that we care about, specifically: the variable name, length, description, and location. I've already identified the pattern p below, but note that the pattern changes over time, and you may need to adjust it to match your specific data dictionary.

The python list dd_sel_var stores the variable names and locations for the four variables of interest.


# Data dictionary 
dd_file = 'January_2017_Record_Layout.txt'
with open(dd_file, 'r', encoding='iso-8859-1') as f:
    dd_full = f.read()

# Series of interest 
series = ['PRTAGE', 'PESEX', 'PREMPNOT', 'PWCMPWGT']

# Regular expression finds rows with variable location details
p = re.compile(r'\n(\w+)\s+(\d+)\s+(.*?)\t+.*?(\d\d*).*?(\d\d+)')

# Keep name, zero-indexed start, and end for series of interest
dd_sel_var = [(i[0], int(i[3])-1, int(i[4])) 
              for i in p.findall(dd_full) if i[0] in series]
print(dd_sel_var)


[('PRTAGE', 121, 123), ('PESEX', 128, 130), ('PREMPNOT', 392, 394), ('PWCMPWGT', 845, 855)]
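To see what the pattern captures, it can be run against a single row. The line below is a simplified, illustrative stand-in for the real dictionary entry, so the spacing may differ from the actual file.

```python
import re

p = re.compile(r'\n(\w+)\s+(\d+)\s+(.*?)\t+.*?(\d\d*).*?(\d\d+)')

# Simplified stand-in for one row of the data dictionary (illustrative only)
sample = '\nPRTAGE          2          PERSONS AGE\t\t(122 - 123)'

match = p.findall(sample)[0]
print(match)    # ('PRTAGE', '2', 'PERSONS AGE', '122', '123')

# Shift the start position to zero-indexing, as in dd_sel_var
entry = (match[0], int(match[3]) - 1, int(match[4]))
print(entry)    # ('PRTAGE', 121, 123)
```

The five captured groups are the variable name, its length, its description, and its one-indexed start and end positions; subtracting one from the start position gives the zero-indexed slice Python expects.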

Read the CPS microdata for April 2017

There are many ways to accomplish this task. One that is simple for small-scale projects, and still executes quickly, uses a Python list comprehension to read each line of the microdata and pull out the parts we want, based on the locations from the data dictionary.

Pandas is used to make the data structure a bit more human-readable and to make filtering the data a bit more intuitive. The column names come from the data dictionary variable ids.


# Convert raw data into a list of tuples
data = [tuple(int(line[i[1]:i[2]]) for i in dd_sel_var) 
        for line in open('apr17pub.dat', 'rb')]

# Convert to pandas dataframe, add variable ids as heading
df = pd.DataFrame(data, columns=[v[0] for v in dd_sel_var])
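An equivalent route, sketched below, is pandas' built-in fixed-width reader, read_fwf. The colspecs pairs are the same zero-indexed (start, end) locations stored in dd_sel_var; the two dummy records are fabricated purely to make the sketch self-contained.

```python
import io
import pandas as pd

# Same zero-indexed (start, end) pairs stored in dd_sel_var
names = ['PRTAGE', 'PESEX', 'PREMPNOT', 'PWCMPWGT']
colspecs = [(121, 123), (128, 130), (392, 394), (845, 855)]

def make_line(age, sex, emp, wgt):
    """Build one fabricated 855-character record (illustrative only)."""
    chars = [' '] * 855
    chars[121:123] = f'{age:2d}'
    chars[128:130] = f'{sex:2d}'
    chars[392:394] = f'{emp:2d}'
    chars[845:855] = f'{wgt:10d}'
    return ''.join(chars)

raw = '\n'.join([make_line(30, 2, 1, 1234567),
                 make_line(40, 1, 4, 2345678)])

# With a real file, pass 'apr17pub.dat' instead of the StringIO buffer
df = pd.read_fwf(io.StringIO(raw), colspecs=colspecs,
                 names=names, header=None)
print(df['PRTAGE'].tolist())   # [30, 40]
```

read_fwf handles the slicing and integer conversion for you; the list-comprehension version above is kept in the main text because it makes the byte locations explicit.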

Benchmarking against BLS published data

The last step, to show that the example has worked, is to compare a sample calculation, the prime-age employment rate of women, to the BLS published version of that calculation. If the benchmark calculation from the microdata is very close to the BLS result, we can feel a bit better about other calculations that we need to do.


# Temporary dataframe with only women age 25 to 54
dft = df[(df['PESEX'] == 2) & (df['PRTAGE'].between(25, 54))]

# Identify employed portion of group as 1.0 & the rest as 0.0
empl = np.where(dft['PREMPNOT'] == 1, 1.0, 0.0)

# Take weighted average of employed portion of group
epop = np.average(empl, weights=dft['PWCMPWGT']) * 100

# Print out the result to check against LNU02300062
print(f'April 2017: {round(epop, 1)}')
April 2017: 72.3

Scaling up this example

The quick example above can be scaled up to work for multiple years worth of monthly data. One example of how I've done this can be found in this jupyter notebook.
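As a rough sketch of the pattern, the reading step can be wrapped in a function and looped over monthly files. The monYYpub.dat names follow the convention used by apr17pub.dat, but confirm the actual filenames on the Census site, and note that variable locations can shift between data dictionaries.

```python
import pandas as pd

def read_month(fname, dd_sel_var):
    """Read one month of CPS microdata given (name, start, end) locations."""
    data = [tuple(int(line[i[1]:i[2]]) for i in dd_sel_var)
            for line in open(fname, 'rb')]
    return pd.DataFrame(data, columns=[v[0] for v in dd_sel_var])

# Filenames assumed to follow the apr17pub.dat convention
months = ['jan17', 'feb17', 'mar17', 'apr17']
# frames = {m: read_month(f'{m}pub.dat', dd_sel_var) for m in months}
# combined = pd.concat(frames, names=['month'])
```

Each month's file would need the data dictionary in effect at the time; reusing dd_sel_var across months only works while the record layout is unchanged.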

About the CPS

The CPS was initially deployed in 1940 to provide a more accurate estimate of the unemployment rate, and it is still the source of the official unemployment rate. The CPS is a monthly survey of around 65,000 households. Each selected household is surveyed up to eight times: interviewers ask basic demographic and employment questions for the first three interview months, then add detailed wage questions in the fourth. The household is then not surveyed for eight months, after which it repeats the four-month cycle of interviews, again with detailed wage questions in the fourth.

The CPS is not a simple random sample, but a multi-stage stratified sample. In the first stage, each state and the District of Columbia is divided into "primary sampling units" (PSUs). In the second stage, a sample of housing units is drawn from the selected PSUs.

There are also months in which each household receives supplemental questions on a topic of interest. The largest such "CPS supplement," conducted each March, is the Annual Social and Economic Supplement. The sample size for this supplement is expanded, and respondents are asked questions about various sources of income and about the quality of their jobs (for example, health insurance benefits). Other supplements cover topics like job tenure or computer and internet use.

The CPS is a joint product of the U.S. Census Bureau and the Bureau of Labor Statistics.

Special thanks to John Schmitt for guidance on the CPS.