I am the Hoben and Patricia Thomas and Thomas and Ann Hettmansperger Early Career Professor of Statistics and a faculty affiliate in political science at Penn State University. My research focuses on methodological and applied problems in the social sciences, including elections, legislative redistricting, racial disparities, and missing data.
I'm a Co-PI of the Algorithm-Assisted Redistricting Methodology (ALARM) Project at Harvard University, which I helped start in 2021. I was previously a Faculty Fellow at the Center for Data Science at New York University. Before that, I received my Ph.D. in statistics from Harvard University, and a bachelor's in mathematics from Grinnell College. As part of my research, I also develop and maintain a number of open-source R packages for redistricting, statistical analysis, and visualization.
Working Papers
Redistricting Reforms Reduce Gerrymandering by Constraining Partisan Actors (2024).
Abstract
Political actors frequently manipulate redistricting plans to gain electoral advantages, a process commonly known as gerrymandering. To address this problem, several states have implemented institutional reforms including the establishment of map-drawing commissions. It is difficult to assess the impact of such reforms because each state structures bundles of complex rules in different ways. We propose to model redistricting processes as a sequential game. The equilibrium solution to the game summarizes multi-step institutional interactions as a single dimensional score. This score measures the leeway political actors have over the partisan lean of the final plan. Using a differences-in-differences design, we demonstrate that reforms reduce partisan bias and increase competitiveness when they constrain partisan actors. We perform a counterfactual policy analysis to estimate the partisan effects of enacting recent institutional reforms nationwide. We find that instituting redistricting commissions generally reduces the current Republican advantage, but Michigan-style reforms would yield a much greater pro-Democratic effect than types of redistricting commissions adopted in Ohio and New York.
Estimating Racial Disparities When Race is Not Observed (2024).
NBER working paper, Under Review.
Software; Poster
Abstract
The estimation of racial disparities in various fields is often hampered by the lack of individual-level racial information. In many cases, the law prohibits the collection of such information to prevent direct racial discrimination. As a result, analysts have frequently adopted Bayesian Improved Surname Geocoding (BISG) and its variants, which combine individual names and addresses with Census data to predict race. Unfortunately, the residuals of BISG are often correlated with the outcomes of interest, generally attenuating estimates of racial disparities. To correct this bias, we propose an alternative identification strategy under the assumption that surname is conditionally independent of the outcome given (unobserved) race, residence location, and other observed characteristics. We introduce a new class of models, Bayesian Instrumental Regression for Disparity Estimation (BIRDiE), that take BISG probabilities as inputs and produce racial disparity estimates by using surnames as an instrumental variable for race. Our estimation method is scalable, making it possible to analyze large-scale administrative data. We also show how to address potential violations of the key identification assumptions. A validation study based on the North Carolina voter file shows that BISG+BIRDiE reduces error by up to 84% when estimating racial differences in party registration. Finally, we apply the proposed methodology to estimate racial differences in who benefits from the home mortgage interest deduction using individual-level tax data from the U.S. Internal Revenue Service. Open-source software is available which implements the proposed methodology.
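As a rough illustration of the BISG step that produces the inputs to BIRDiE, the sketch below applies Bayes' rule to combine a surname-based race distribution with geographic information, assuming surname and location are conditionally independent given race. It is not the BIRDiE model and not the interface of the birdie package. The function name, the race categories, and all numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine surname and geography via Bayes' rule.

    P(race | surname, geo) is proportional to
    P(race | surname) * P(geo | race),
    assuming surname and geography are independent given race.
    """
    unnormalized = {
        race: p_race_given_surname[race] * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    normalizer = sum(unnormalized.values())
    return {race: value / normalizer for race, value in unnormalized.items()}


# Hypothetical inputs (made-up numbers, not Census figures).
p_race_given_surname = {"white": 0.6, "black": 0.3, "other": 0.1}
# Share of each racial group that lives in this person's census block.
p_geo_given_race = {"white": 0.001, "black": 0.004, "other": 0.002}

posterior = bisg_posterior(p_race_given_surname, p_geo_given_race)
```

In this toy example the geographic information shifts probability toward the group that is relatively more concentrated in the block. The abstract's point is that plugging such probabilities directly into a disparity estimate tends to attenuate it, which is the problem BIRDiE is designed to correct.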
Individual and Differential Harm in Redistricting (2022).
Replication code
Abstract
Social scientists have developed dozens of measures for assessing partisan bias in redistricting. But these measures cannot be easily adapted to other groups, including those defined by race, class, or geography. Nor are they applicable to single- or no-party contexts such as local redistricting. To overcome these limitations, we propose a unified framework of *harm* for evaluating the impacts of a districting plan on individual voters and the groups to which they belong. We consider a voter harmed if their chosen candidate is not elected under the current plan, but would be under a different plan. Harm improves on existing measures by both focusing on the choices of individual voters and directly incorporating counterfactual plans. We discuss strategies for estimating harm, and demonstrate the utility of our framework through analyses of partisan gerrymandering in New Jersey, voting rights litigation in Alabama, and racial dynamics of Boston City Council elections.
Projective Averages for Summarizing Redistricting Ensembles (2024).
Replication code
Abstract
A recurring challenge in the application of redistricting simulation algorithms lies in extracting useful summaries and comparisons from a large ensemble of districting plans. Researchers often compute summary statistics for each district in a plan, and then study their distribution across the plans in the ensemble. This approach discards rich geographic information that is inherent in districting plans. We introduce the projective average, an operation that projects a district-level summary statistic back to the underlying geography and then averages this statistic across plans in the ensemble. Compared to traditional district-level summaries, projective averages are a powerful tool for geographically granular, sub-district analysis of districting plans along a variety of dimensions. However, care must be taken to account for variation within redistricting ensembles, to avoid misleading conclusions. We propose and validate a multiple-testing procedure to control the probability of incorrectly identifying outlier plans or regions when using projective averages.
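The operation described above can be sketched in a few lines. This is a minimal illustration under simple assumptions, not the paper's replication code: plans are lists mapping each precinct to a district label, the district-level statistic is a two-party vote share, and the function names and vote counts are hypothetical.

```python
def district_vote_share(assignment, dem_votes, total_votes):
    """District-level statistic: Democratic vote share in each district."""
    dem, total = {}, {}
    for precinct, district in enumerate(assignment):
        dem[district] = dem.get(district, 0) + dem_votes[precinct]
        total[district] = total.get(district, 0) + total_votes[precinct]
    return {d: dem[d] / total[d] for d in dem}


def projective_average(plans, district_stats):
    """Project each plan's district statistic back onto precincts, then
    average the projected values across all plans in the ensemble."""
    n_precincts = len(plans[0])
    totals = [0.0] * n_precincts
    for assignment, stats in zip(plans, district_stats):
        for precinct, district in enumerate(assignment):
            totals[precinct] += stats[district]
    return [t / len(plans) for t in totals]


# Hypothetical toy data: four precincts, two plans with two districts each.
dem_votes = [80, 60, 40, 20]
total_votes = [100, 100, 100, 100]
plans = [
    [0, 0, 1, 1],
    [0, 1, 0, 1],
]
stats = [district_vote_share(p, dem_votes, total_votes) for p in plans]
averages = projective_average(plans, stats)
```

Each entry of `averages` is the vote share of the district that a precinct lands in, averaged over the ensemble. The result is a precinct-level map rather than a district-level distribution.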
Finding Pareto Efficient Redistricting Plans with Short Bursts (2023).
Replication code
Abstract
Redistricting practitioners must balance many competing constraints and criteria when drawing district boundaries. To aid in this process, researchers have developed many methods for optimizing districting plans according to one or more criteria. This research note extends a recently-proposed single-criterion optimization method, short bursts (Cannon et al., 2023), to handle the multi-criterion case, and in doing so approximate the Pareto frontier for any set of constraints. We study the empirical performance of the method in a realistic setting and find it behaves as expected and is not very sensitive to algorithmic parameters. The proposed approach, which is implemented in open-source software, should allow researchers and practitioners to better understand the tradeoffs inherent to the redistricting process.
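To make the notion of a Pareto frontier concrete, the sketch below filters a set of already-scored plans down to those that no other plan dominates. It illustrates only the target of the optimization, not the short-burst search itself. The function names and scores are hypothetical, and higher is assumed to be better on every criterion.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b on every criterion
    and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(scores):
    """Return indices of plans that are not dominated by any other plan."""
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Hypothetical (compactness, competitiveness) scores for five plans.
scores = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.2, 0.9), (0.5, 0.3)]
frontier = pareto_frontier(scores)
```

Plans on the frontier represent the available tradeoffs: improving one criterion from any of them requires giving up something on another.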
Publications
Evaluating Bias and Noise Induced by the U.S. Census Bureau’s Privacy Protection Methods (2024). Science Advances 10:18, eadl2524.
Abstract
The United States Census Bureau faces a difficult trade-off between the accuracy of Census statistics and the protection of individual information. We conduct the first independent evaluation of bias and noise induced by the Bureau's two main disclosure avoidance systems: the TopDown algorithm employed for the 2020 Census and the swapping algorithm implemented for the 1990, 2000, and 2010 Censuses. Our evaluation leverages the recent release of the Noisy Measurement File (NMF) as well as the availability of two independent runs of the TopDown algorithm applied to the 2010 decennial Census. We find that the NMF contains too much noise to be directly useful alone, especially for Hispanic and multiracial populations. TopDown's post-processing dramatically reduces the NMF noise and produces similarly accurate data to swapping in terms of bias and noise. These patterns hold across census geographies with varying population sizes and racial diversity. While the estimated errors for both TopDown and swapping are generally no larger than other sources of Census error, they can be relatively substantial for geographies with small total populations.
Measuring and Modeling Neighborhoods (2024).
American Political Science Review, Online ahead of print.
Survey tool; Replication code; Poster
Abstract
Granular geographic data present new opportunities to understand how neighborhoods are formed, and how they influence politics. At the same time, the inherent subjectivity of neighborhoods creates methodological challenges in measuring and modeling them. We develop an open-source survey instrument that allows respondents to draw their neighborhoods on a map. We also propose a statistical model to analyze how the characteristics of respondents and local areas determine subjective neighborhoods. We conduct two surveys: collecting subjective neighborhoods from voters in Miami, New York City, and Phoenix, and asking New York City residents to draw a community of interest for inclusion in their city council district. Our analysis shows that, holding other factors constant, White respondents include census blocks with more White residents in their neighborhoods. Similarly, Democrats and Republicans are more likely to include co-partisan areas. Furthermore, our model provides more accurate out-of-sample predictions than standard neighborhood measures.
Census Officials Must Constructively Engage with Independent Evaluations (2024). Proceedings of the National Academy of Sciences 121:11, e2321196121. Letter to the editor re: Jarmin et al. (2023).
Making Differential Privacy Work for Census Data Users (2023). Harvard Data Science Review 5:4. With response and rejoinder.
Abstract
The U.S. Census Bureau collects and publishes detailed demographic data about Americans which are heavily used by researchers and policymakers. The Bureau has recently adopted the framework of differential privacy in an effort to improve confidentiality of individual census responses. A key output of this privacy protection system is the Noisy Measurement File (NMF), which is produced by adding random noise to tabulated statistics. The NMF is critical to understanding any biases in the data, and performing valid statistical inference on published census data. Unfortunately, the current release format of the NMF is difficult to access and work with. We describe the process we use to transform the NMF into a usable format, and provide recommendations to the Bureau for how to release future versions of the NMF. These changes are essential for ensuring transparency of privacy measures and reproducibility of scientific research built on census data.
Sequential Monte Carlo for Sampling Balanced and Compact Redistricting Plans (2023).
Annals of Applied Statistics 17:4, 3300-3323.
Software implementation
Covered by The Washington Post and Quanta Magazine.
Abstract
Random sampling of graph partitions under constraints has become a popular tool for evaluating legislative redistricting plans. Analysts detect partisan gerrymandering by comparing a proposed redistricting plan with an ensemble of sampled alternative plans. For successful application, sampling methods must scale to large maps with many districts, incorporate realistic legal constraints, and accurately and efficiently sample from a selected target distribution. Unfortunately, most existing methods struggle in at least one of these areas. We present a new Sequential Monte Carlo (SMC) algorithm that generates a sample of redistricting plans converging to a realistic target distribution. Because it draws many plans in parallel, the SMC algorithm can efficiently explore the relevant space of redistricting plans better than the existing Markov chain Monte Carlo (MCMC) algorithms that generate plans sequentially. Our algorithm can simultaneously incorporate several constraints commonly imposed in real-world redistricting problems, including equal population, compactness, and preservation of administrative boundaries. We validate the accuracy of the proposed algorithm by using a small map where all redistricting plans can be enumerated. We then apply the SMC algorithm to evaluate the partisan implications of several maps submitted by relevant parties in a recent high-profile redistricting case in the state of Pennsylvania. We find that the proposed algorithm converges to the target distribution faster and with fewer samples than a state-of-the-art MCMC algorithm. Open-source software is available for implementing the proposed methodology.
Widespread Partisan Gerrymandering Mostly Cancels Nationally, but Reduces Electoral Competition (2023).
Proceedings of the National Academy of Sciences 120:25, e2217322120.
Replication code
Abstract
Congressional district lines in many U.S. states are drawn by partisan actors, raising concerns about gerrymandering. To isolate the electoral impact of gerrymandering from the effects of other factors including geography and redistricting rules, we compare predicted election outcomes under the enacted plan with those under a large sample of non-partisan, simulated alternative plans for all states. We find that partisan gerrymandering is widespread in the 2020 redistricting cycle, but most of the bias it creates cancels at the national level, giving Republicans two additional seats, on average. In contrast, moderate pro-Republican bias due to geography and redistricting rules remains. Finally, we find that partisan gerrymandering reduces electoral competition and makes the House's partisan composition less responsive to shifts in the national vote.
Researchers Need Better Access to U.S. Census Data (2023). Science 380:6648, 902-903.
Recalibration of Predicted Probabilities Using the “Logit Shift”: Why Does it Work, and When Can it be Expected to Work Well? (2023). Political Analysis 31:4, 651-661.
Abstract
The output of predictive models is routinely recalibrated by reconciling low-level predictions with known quantities defined at higher levels of aggregation. For example, models predicting vote probabilities at the individual level in U.S. elections can be adjusted so that their aggregation matches the observed vote totals in each county, thus producing better calibrated predictions. In this research note, we provide theoretical grounding for one of the most commonly used recalibration strategies, known colloquially as the "logit shift." Typically cast as a heuristic adjustment strategy (whereby a constant correction on the logit scale is found, such that aggregated predictions match target totals), we show that the logit shift offers a fast and accurate approximation to a principled, but computationally impractical adjustment strategy: computing the posterior prediction probabilities, conditional on the observed totals. After deriving analytical bounds on the quality of the approximation, we illustrate its accuracy using Monte Carlo simulations. We also discuss scenarios in which the logit shift is less effective at recalibrating predictions: when the target totals are defined only for highly heterogeneous populations, and when the original predictions correctly capture the mean of true individual probabilities, but fail to capture the shape of their distribution.
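The heuristic described above can be sketched as follows: search for a single constant that, added to every prediction on the logit scale, makes the predictions sum to the known total. This is a minimal illustration, not the paper's code. The function names are hypothetical, and bisection is just one convenient way to find the constant, relying on the fact that the shifted total increases monotonically with the shift.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit_shift(probs, target_total, tol=1e-10, max_iter=200):
    """Find a constant alpha such that sum(expit(logit(p) + alpha))
    matches target_total, and return alpha with the shifted probabilities.

    Assumes every p is strictly between 0 and 1.
    """
    if not 0 < target_total < len(probs):
        raise ValueError("target_total must lie strictly between 0 and len(probs)")
    logits = [logit(p) for p in probs]
    lo, hi = -50.0, 50.0  # search bracket on the logit scale
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        total = sum(expit(z + mid) for z in logits)
        if abs(total - target_total) < tol:
            break
        if total < target_total:
            lo = mid
        else:
            hi = mid
    alpha = (lo + hi) / 2.0
    return alpha, [expit(z + alpha) for z in logits]


# Hypothetical example: four individual vote probabilities in a county
# where the observed total is 3 votes.
probs = [0.2, 0.5, 0.7, 0.9]
alpha, shifted = logit_shift(probs, target_total=3.0)
```

Because the same constant is added to every logit, the ranking of individuals is preserved while the aggregate is forced to match the observed total.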
Comment: the Essential Role of Policy Evaluation for the 2020 Census Disclosure Avoidance System (2023). Harvard Data Science Review, Special Issue 2. Response to boyd and Sarathy (2022).
Abstract
In "Differential Perspectives: Epistemic Disconnects Surrounding the US Census Bureau's Use of Differential Privacy," boyd and Sarathy argue that empirical evaluations of the Census Disclosure Avoidance System (DAS), including our published analysis, failed to recognize how the benchmark data against which the 2020 DAS was evaluated is never a ground truth of population counts. In this commentary, we explain why policy evaluation, which was the main goal of our analysis, is still meaningful without access to a perfect ground truth. We also point out that our evaluation leveraged features specific to the decennial Census and redistricting data, such as block-level population invariance under swapping and voter file racial identification, better approximating a comparison with the ground truth. Lastly, we show that accurate statistical predictions of individual race based on the Bayesian Improved Surname Geocoding, while not a violation of differential privacy, substantially increases the disclosure risk of private information the Census Bureau sought to protect. We conclude by arguing that policy makers must confront a key trade-off between data utility and privacy protection, and an epistemic disconnect alone is insufficient to explain disagreements between policy choices.
Simulated Redistricting Plans for the Analysis and Evaluation of Redistricting in the United States (2022).
Nature: Scientific Data 9:1, 689.
Project website; Replication code; Data
Abstract
A collection of simulated congressional districting plans and underlying code developed by the Algorithm-Assisted Redistricting Methodology (ALARM) Project. The data allow for the evaluation of enacted and other congressional redistricting plans in the United States. While the use of redistricting simulation algorithms has become standard in academic research and court cases, any simulation analysis requires non-trivial efforts to combine multiple data sets, identify state-specific redistricting criteria, implement complex simulation algorithms, and summarize and visualize simulation outputs. We have developed a complete workflow that facilitates this entire process of simulation-based redistricting analysis for the congressional districts of all 50 states. The resulting data include ensembles of simulated 2020 congressional redistricting plans and necessary replication data. We provide the underlying code, which serves as a template for customized analyses. All data and code are free and publicly available.
The Use of Differential Privacy for Census Data and Its Impact on Redistricting: the Case of the 2020 U.S. Census (2021).
Science Advances 7:41, eabk3283.
FAQ; Reaction to the Bureau’s Response; Supplementary information; Replication materials
Originally a Public Comment to the Census Bureau (May 28, 2021).
Covered by The Washington Post, the Associated Press, the San Francisco Chronicle, NC Policy Watch, and others.
Abstract
Census statistics play a key role in public policy decisions and social science research. However, given the risk of revealing individual information, many statistical agencies are considering disclosure control methods based on differential privacy, which add noise to tabulated data. Unlike other applications of differential privacy, however, census statistics must be postprocessed after noise injection to be usable. We study the impact of the U.S. Census Bureau's latest disclosure avoidance system (DAS) on a major application of census statistics, the redrawing of electoral districts. We find that the DAS systematically undercounts the population in mixed-race and mixed-partisan precincts, yielding unpredictable racial and partisan biases. While the DAS leads to a likely violation of the "One Person, One Vote" standard as currently interpreted, it does not prevent accurate predictions of an individual's race and ethnicity. Our findings underscore the difficulty of balancing accuracy and respondent privacy in the Census.
Geodesic Interpolation on Sierpinski Gaskets (2021). Journal of Fractal Geometry 8:2, 117-152.
Abstract
We study the analogue of a convex interpolant of two sets on Sierpiński gaskets and an associated notion of measure transport. The structure of a natural family of interpolating measures is described and an interpolation inequality is established. A key tool is a good description of geodesics on these gaskets, some results on which have previously appeared in the literature.
Software
redist: Simulation Methods for Legislative Redistricting
Enables researchers to sample redistricting plans from a pre-specified
target distribution using Sequential Monte Carlo and Markov Chain Monte Carlo
algorithms. The package allows for the implementation of various constraints
in the redistricting process such as geographic compactness and population
parity requirements. Tools for analysis such as computation of various summary
statistics and plotting functionality are also included. The package implements
the SMC algorithm of McCartan and Imai (2023), the enumeration algorithm of
Fifield, Imai, Kawahara, and Kenny (2020), the Flip MCMC algorithm of Fifield,
Higgins, Imai and Tarr (2020), the Merge-split/Recombination algorithms of
Carter et al. (2019) and DeFord et al. (2021), and the Short-burst optimization
algorithm of Cannon et al. (2020).
redistmetrics: Redistricting Metrics
Reliable and flexible tools for scoring redistricting plans using common
measures and metrics. These functions provide key direct access to tools useful
for non-simulation analyses of redistricting plans, such as for measuring
compactness or partisan fairness. Tools are designed to work with the redist
package seamlessly.
birdie: Bayesian Instrumental Regression for Disparity Estimation
Bayesian models for accurately estimating conditional distributions by race,
using Bayesian Improved Surname Geocoding (BISG) probability estimates of
individual race. Implements the methods described in McCartan, Fisher, Goldin,
Ho and Imai (2024).
easycensus: Quickly Find, Extract, and Marginalize U.S. Census Tables
Extracting desired data using the proper Census variable names can be
time-consuming. This package takes the pain out of that process by providing
functions to quickly locate variables and download labeled tables from the
Census APIs (https://www.census.gov/data/developers/data-sets.html).
PL94171: Tabulate P.L. 94-171 Redistricting Data Summary Files
Tools to process legacy format summary redistricting data files produced by the
United States Census Bureau pursuant to P.L. 94-171. These files are generally
available earlier but are difficult to work with as-is.
adjustr: Stan Model Adjustments and Sensitivity Analyses using Importance Sampling
Functions to help assess the sensitivity of a Bayesian model (fitted using
the rstan package) to the specification of its likelihood and priors. Users
provide a series of alternate sampling specifications, and the package uses
Pareto-smoothed importance sampling to estimate posterior quantities of interest
under each specification.
causaltbl: Tidy Causal Data Frames and Tools
Provides a causal_tbl class for causal inference. A causal_tbl keeps
track of information on the roles of variables like treatment and outcome, and
provides functionality to store models and their fitted values as columns in a
data frame.
conformalbayes: Jackknife(+) Predictive Intervals for Bayesian Models
Provides functions to construct finite-sample calibrated predictive intervals
for Bayesian models, following the approach in Barber et al. (2021). These
intervals are calculated efficiently using importance sampling for the
leave-one-out residuals. By default, the intervals will also reflect the
relative uncertainty in the Bayesian model, using the locally-weighted conformal
methods of Lei et al. (2018).
alarmdata: Download, Merge, and Process Redistricting Data
Utility functions to download and process data produced by the ALARM Project,
including the 2020 redistricting files of Kenny and McCartan (2021) and the 50-State
Redistricting Simulations of McCartan, Kenny, Simko, Garcia, Wang, Wu, Kuriwaki,
and Imai (2022). The package extends the data introduced in McCartan, Kenny,
Simko, Garcia, Wang, Wu, Kuriwaki, and Imai (2022) to also include states with
only a single district.
blockpop: Estimate Census Block Populations for 2020
Uses FCC block-level population estimates from 2010–2019, which are based on
new roads and map data, along with decennial Census and ACS data, to estimate
2020 block populations.
ggredist: Scales, Geometries, and Extensions of ggplot2 for Election Mapping
Provides ggplot2 extensions for political map making. Implements new
geometries for groups of simple feature geometries. Adds palettes and scales for
red to blue color mapping and for discrete maps. Implements tools for easy label
generation and placement, automatic map coloring, and themes.
tinytiger: Lightweight Interface to TIGER/Line Shapefiles
Download geographic shapes from the United States Census Bureau TIGER/Line
Shapefiles. Functions support downloading and reading in geographic boundary
data. All downloads can be set up with a cache to avoid multiple downloads. Data
is available back to 2000 for most geographies.
wacolors: Colorblind-Friendly Palettes from Washington State
Color palettes taken from the landscapes and cities of Washington state. Colors
were extracted from a set of photographs, and then combined to form a set
of continuous and discrete palettes. Continuous palettes were designed to be
perceptually uniform, while discrete palettes were chosen to maximize contrast
at several different levels of overall brightness and saturation. Each palette
has been evaluated to ensure colors are distinguishable by colorblind people.
nbhdmodel: Neighborhood Modeling and Analysis
Functionality for fitting neighborhood models of McCartan, Brown, and Imai.
The core methodology is described in the paper and can be implemented with
any tool that can fit generalized linear mixed models (GLMMs). However, some
of the preprocessing necessary to set up the GLMM can be onerous. In addition
to providing a specialized GLMM routine, this package provides several
preprocessing functions that, while not completely general, should be useful for
others performing these kinds of analyses.
Other Writing
Candy Cane Shortages and the Importance of Variation (December 21, 2021). International Statistical Institute: Statisticians React to the News.
Where Will the Rocket Land? (May 12, 2021). International Statistical Institute: Statisticians React to the News.
Who’s the Most Electable Democrat? It Might be Warren or Buttigieg, Not Biden (October 23, 2019). The Washington Post.
I-405 Express Toll Lanes: Usage, Benefits, and Equity (2019).
Technical report for the Washington State Department of Transportation.
Project website
Summary
Traffic congestion is increasing in cities around the country, and particularly in the Seattle region. Local governments are increasingly experimenting with congestion pricing schemes to manage it. The Washington State Department of Transportation (WSDOT) opened a congestion pricing facility in 2015 on I-405, which runs through the eastern suburbs of Seattle. The facility operates by selling extra space in the high-occupancy vehicle (HOV) lanes to single-occupancy vehicles (SOVs), and dynamically changing the price of entry to manage demand and keep the lanes operating. These combined HOV and tolled SOV lanes are called High Occupancy Tolling (HOT) lanes.
While the HOT lanes have been operating for over three years, there has been little research into their equity impacts. Using data on each trip made on the I-405 HOT lanes in 2018, demographic data on census block groups, and lane speed, volume, and travel time data, we set out to fill this gap. We studied how the express toll lanes are used, the benefits they provide to the region, and how these benefits are distributed among different groups of users.
Contact
325 Thomas Building
461 Pollock Road
University Park, PA 16802