
Working Papers
Estimating Racial Disparities When Race is Not Observed. Software and replication code; Poster.
Abstract
The estimation of racial disparities in health care, financial services, voting, and other contexts is often hampered by the lack of individual-level racial information in administrative records. In many cases, the law prohibits the collection of such information to prevent direct racial discrimination. As a result, many analysts have adopted Bayesian Improved Surname Geocoding (BISG), which combines individual names and addresses with Census data to predict race. Although BISG tends to produce well-calibrated racial predictions, its residuals are often correlated with the outcomes of interest, yielding biased estimates of racial disparities. We propose an alternative identification strategy that corrects this bias. The proposed strategy is applicable whenever an individual's surname is conditionally independent of the outcome given their (unobserved) race, residence location, and other observed characteristics. Leveraging this identification strategy, we introduce a new class of models, Bayesian Instrumental Regression for Disparity Estimation (BIRDiE), that estimate racial disparities by using surnames as a high-dimensional instrumental variable for race. Our estimation method is scalable, making it possible to analyze large-scale administrative data. We also show how to address potential violations of the key identification assumptions. A validation study based on the North Carolina voter file shows that BIRDiE reduces error by up to 84% relative to standard approaches for estimating racial differences in party registration. Open-source software implementing the proposed methodology is available.
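In notation (ours, not necessarily the paper's): let Y be the outcome, S the surname, R the unobserved race, G the residence location, and X other observed characteristics. The identifying assumption is

    Pr(Y | S, R, G, X) = Pr(Y | R, G, X),

that is, Y ⊥ S | (R, G, X): surnames carry information about the outcome only through race, location, and the observed covariates.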
Individual and Differential Harm in Redistricting. Replication code
Abstract
Social scientists have developed dozens of measures for assessing partisan bias in redistricting. But these measures cannot be easily adapted to other groups, including those defined by race, class, or geography. Nor are they applicable to single- or no-party contexts such as local redistricting. To overcome these limitations, we propose a unified framework of harm for evaluating the impacts of a districting plan on individual voters and the groups to which they belong. We consider a voter harmed if their chosen candidate is not elected under the current plan, but would be under a different plan. Harm improves on existing measures by both focusing on the choices of individual voters and directly incorporating counterfactual plans. We discuss strategies for estimating harm, and demonstrate the utility of our framework through analyses of partisan gerrymandering in New Jersey, voting rights litigation in Alabama, and racial dynamics of Boston City Council elections.
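As a minimal sketch of this definition (hypothetical data and column names, not the paper's estimation procedure):

    # A voter is harmed if their chosen candidate loses under the enacted
    # plan but would win under the counterfactual plan.
    voters <- data.frame(
      wins_enacted        = c(TRUE, FALSE, FALSE, TRUE),
      wins_counterfactual = c(TRUE, TRUE,  FALSE, FALSE)
    )
    voters$harmed <- !voters$wins_enacted & voters$wins_counterfactual
    mean(voters$harmed)  # share of voters harmed relative to the counterfactual

Differential harm across groups can then be examined by averaging within the groups of interest, such as racial or partisan groups.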
Measuring and Modeling Neighborhoods. Under Review. Survey tool; Poster.
Abstract
Granular geographic data present new opportunities to understand how neighborhoods are formed and how they influence politics. At the same time, the inherent subjectivity of neighborhoods creates methodological challenges in measuring and modeling them. We develop a survey instrument that allows respondents to draw their neighborhoods on a map. We also propose a statistical model to analyze how the characteristics of respondents and local areas determine subjective neighborhoods. We conduct two surveys: one collecting subjective neighborhoods from voters in Miami, New York City, and Phoenix, and another asking New York City residents to draw a community of interest for inclusion in their city council district. Our analysis shows that, holding other factors constant, White respondents include census blocks with more White residents in their neighborhoods. Similarly, Democrats and Republicans are more likely to include co-partisan areas. In addition, our model provides more accurate out-of-sample predictions than standard neighborhood measures.
Finding Pareto Efficient Redistricting Plans with Short Bursts.
Abstract
Redistricting practitioners must balance many competing constraints and criteria when drawing district boundaries. To aid in this process, researchers have developed many methods for optimizing districting plans according to one or more criteria. This research note extends a recently proposed single-criterion optimization method, short bursts (Cannon et al., 2023), to the multi-criterion case, thereby approximating the Pareto frontier for any set of constraints. We study the empirical performance of the method in a realistic setting and find that it behaves as expected and is relatively insensitive to its algorithmic parameters. The proposed approach, which is implemented in open-source software, should allow researchers and practitioners to better understand the tradeoffs inherent to the redistricting process.
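A rough sketch of the multi-criterion extension, with hypothetical helpers run_burst() (a short MCMC run from a starting plan) and scores() (a vector of criteria to minimize), and placeholders initial_plan and n_bursts, standing in for a real sampler:

    # Maintain an archive of non-dominated plans (the estimated Pareto
    # frontier), seeding each short burst from a plan on the frontier.
    dominates <- function(a, b) all(a <= b) && any(a < b)

    pareto_filter <- function(plans) {
      sc <- lapply(plans, scores)
      keep <- vapply(seq_along(sc), function(i) {
        !any(vapply(seq_along(sc), function(j) {
          j != i && dominates(sc[[j]], sc[[i]])
        }, logical(1)))
      }, logical(1))
      plans[keep]
    }

    frontier <- list(initial_plan)
    for (b in seq_len(n_bursts)) {
      start <- frontier[[sample.int(length(frontier), 1)]]
      frontier <- pareto_filter(c(frontier, run_burst(start)))
    }

The restart rule here (sampling uniformly from the current frontier) is our assumption; the paper may use a different scheme.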
Widespread Partisan Gerrymandering Mostly Cancels in Aggregate, but Reduces Competition and Responsiveness. Under Review.
Abstract
Congressional district lines in many U.S. states are drawn by partisan actors, raising concerns about gerrymandering. To isolate the electoral impact of gerrymandering from the effects of other factors including geography and redistricting rules, we compare predicted election outcomes under the enacted plan with those under a large sample of non-partisan, simulated alternative plans for all states. We find that partisan gerrymandering is widespread in the 2020 redistricting cycle, but most of the bias it creates cancels at the national level, giving Republicans two additional seats, on average. In contrast, moderate pro-Republican bias due to geography and redistricting rules remains. Finally, we find that partisan gerrymandering reduces electoral competition and makes the House's partisan composition less responsive to shifts in the national vote.
Publications
Sequential Monte Carlo for Sampling Balanced and Compact Redistricting Plans.
Annals of Applied Statistics, Forthcoming.
Software implementation.
Covered by The Washington Post.
Abstract
Random sampling of graph partitions under constraints has become a popular tool for evaluating legislative redistricting plans. Analysts detect partisan gerrymandering by comparing a proposed redistricting plan with an ensemble of sampled alternative plans. For successful application, sampling methods must scale to large maps with many districts, incorporate realistic legal constraints, and accurately and efficiently sample from a selected target distribution. Unfortunately, most existing methods struggle in at least one of these areas. We present a new Sequential Monte Carlo (SMC) algorithm that generates a sample of redistricting plans converging to a realistic target distribution. Because it draws many plans in parallel, the SMC algorithm can explore the relevant space of redistricting plans more efficiently than existing Markov chain Monte Carlo (MCMC) algorithms, which generate plans sequentially. Our algorithm can simultaneously incorporate several constraints commonly imposed in real-world redistricting problems, including equal population, compactness, and preservation of administrative boundaries. We validate the accuracy of the proposed algorithm by using a small map where all redistricting plans can be enumerated. We then apply the SMC algorithm to evaluate the partisan implications of several maps submitted by relevant parties in a recent high-profile redistricting case in the state of Pennsylvania. We find that the proposed algorithm converges to the target distribution faster and with fewer samples than a state-of-the-art MCMC algorithm. Open-source software is available for implementing the proposed methodology.
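A minimal usage sketch with the accompanying open-source implementation, the redist R package (pa_shp stands for a hypothetical precinct-level shapefile; argument values are illustrative, and the package documentation is authoritative):

    library(redist)
    # Define the redistricting problem: 18 districts, 0.5% population deviation
    map <- redist_map(pa_shp, ndists = 18, pop_tol = 0.005)
    # Draw an ensemble of plans with the SMC algorithm
    plans <- redist_smc(map, nsims = 5000, compactness = 1)

The enacted or proposed plan can then be compared against summary statistics of the sampled ensemble to detect partisan outliers.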
The use of differential privacy for census data and its impact on redistricting: The case of the 2020 U.S. Census.
2021. Science Advances 7(41), eabk3283.
Originally a Public Comment to the Census Bureau (May 28, 2021).
FAQ;
Reaction to the Bureau’s Response;
Supplementary information;
Replication materials.
Covered by The Washington Post,
the Associated Press,
the San Francisco Chronicle,
NC Policy Watch, and others.
Abstract
Census statistics play a key role in public policy decisions and social science research. However, given the risk of revealing individual information, many statistical agencies are considering disclosure control methods based on differential privacy, which add noise to tabulated data. Unlike in other applications of differential privacy, however, census statistics must be post-processed after noise injection to be usable. We study the impact of the U.S. Census Bureau’s latest disclosure avoidance system (DAS) on a major application of census statistics, the redrawing of electoral districts. We find that the DAS systematically undercounts the population in mixed-race and mixed-partisan precincts, yielding unpredictable racial and partisan biases. While the DAS leads to a likely violation of the “One Person, One Vote” standard as currently interpreted, it does not prevent accurate predictions of an individual’s race and ethnicity. Our findings underscore the difficulty of balancing accuracy and respondent privacy in the Census.
Simulated redistricting plans for the analysis and evaluation of redistricting in the United States. 2022. Nature Scientific Data 9, 689. Project website; Replication code; Data
Abstract
A collection of simulated congressional districting plans and underlying code developed by the Algorithm-Assisted Redistricting Methodology (ALARM) Project. The data allow for the evaluation of enacted and other congressional redistricting plans in the United States. While the use of redistricting simulation algorithms has become standard in academic research and court cases, any simulation analysis requires non-trivial efforts to combine multiple data sets, identify state-specific redistricting criteria, implement complex simulation algorithms, and summarize and visualize simulation outputs. We have developed a complete workflow that facilitates this entire process of simulation-based redistricting analysis for the congressional districts of all 50 states. The resulting data include ensembles of simulated 2020 congressional redistricting plans and necessary replication data. We provide the underlying code, which serves as a template for customized analyses. All data and code are free and publicly available.
Comment: The Essential Role of Policy Evaluation for the 2020 Census Disclosure Avoidance System. 2023. Harvard Data Science Review, Special Issue 2. Response to boyd and Sarathy (2022)
Abstract
In "Differential Perspectives: Epistemic Disconnects Surrounding the US Census Bureau's Use of Differential Privacy," boyd and Sarathy argue that empirical evaluations of the Census Disclosure Avoidance System (DAS), including our published analysis, failed to recognize how the benchmark data against which the 2020 DAS was evaluated is never a ground truth of population counts. In this commentary, we explain why policy evaluation, which was the main goal of our analysis, is still meaningful without access to a perfect ground truth. We also point out that our evaluation leveraged features specific to the decennial Census and redistricting data, such as block-level population invariance under swapping and voter file racial identification, better approximating a comparison with the ground truth. Lastly, we show that accurate statistical predictions of individual race based on the Bayesian Improved Surname Geocoding, while not a violation of differential privacy, substantially increases the disclosure risk of private information the Census Bureau sought to protect. We conclude by arguing that policy makers must confront a key trade-off between data utility and privacy protection, and an epistemic disconnect alone is insufficient to explain disagreements between policy choices.
Recalibration of Predicted Probabilities Using the “Logit Shift”: Why does it work, and when can it be expected to work well? Political Analysis, Forthcoming.
Abstract
The output of predictive models is routinely recalibrated by reconciling low-level predictions with known quantities defined at higher levels of aggregation. For example, models predicting vote probabilities at the individual level in U.S. elections can be adjusted so that their aggregation matches the observed vote totals in each county, thus producing better calibrated predictions. In this research note, we provide theoretical grounding for one of the most commonly used recalibration strategies, known colloquially as the “logit shift.” The logit shift is typically cast as a heuristic adjustment, whereby a constant correction on the logit scale is found such that the aggregated predictions match the target totals. We show that it in fact offers a fast and accurate approximation to a principled but computationally impractical strategy: computing the posterior prediction probabilities conditional on the observed totals. After deriving analytical bounds on the quality of the approximation, we illustrate its accuracy using Monte Carlo simulations. We also discuss scenarios in which the logit shift is less effective at recalibrating predictions: when the target totals are defined only for highly heterogeneous populations, and when the original predictions correctly capture the mean of the true individual probabilities but fail to capture the shape of their distribution.
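A self-contained sketch of the adjustment described above (our own implementation, not the paper's code):

    # Find the constant shift delta on the logit scale such that the adjusted
    # probabilities sum to the known target total, then apply it.
    logit_shift <- function(p, target_total) {
      delta <- uniroot(
        function(d) sum(plogis(qlogis(p) + d)) - target_total,
        interval = c(-20, 20)
      )$root
      plogis(qlogis(p) + delta)
    }

    p <- c(0.2, 0.5, 0.7)        # individual-level vote probabilities
    p_adj <- logit_shift(p, 2)   # recalibrate: observed total is 2 votes
    sum(p_adj)                   # equals 2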
Geodesic Interpolation on Sierpinski Gaskets. 2021. Journal of Fractal Geometry 8(2), 117-152.
Abstract
We study the analogue of a convex interpolant of two sets on Sierpiński gaskets and an associated notion of measure transport. The structure of a natural family of interpolating measures is described and an interpolation inequality is established. A key tool is a good description of geodesics on these gaskets, some results on which have previously appeared in the literature.
Works in Progress
Regression of the Conditional Median.
Algorithm-Assisted Redistricting Methodology (book).
Studying Officeholders’ Perceived Geographic Constituencies.
Software

redist: Simulation Methods for Legislative Redistricting
This R package enables researchers to sample redistricting plans from a pre-specified target distribution using Sequential Monte Carlo and Markov chain Monte Carlo algorithms. The package supports various constraints in the redistricting process, such as geographic compactness and population parity requirements. Tools for analysis, including computation of various summary statistics and plotting functionality, are also included.
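For example, after sampling, the included analysis tools can summarize and plot the ensemble (a small sketch; function names as we recall them, so check the package documentation):

    library(redist)
    plans <- redist_smc(map, nsims = 1000)            # `map` from redist_map()
    redist.plot.plans(plans, draws = 1:4, shp = map)  # draw four sampled plans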

redistmetrics: Redistricting Metrics
Reliable and flexible tools for scoring redistricting plans using common measures and metrics. These functions provide direct access to tools useful for non-simulation analyses of redistricting plans, such as measuring compactness or partisan fairness. They are designed to work seamlessly with the redist package.
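A sketch of typical scoring calls (function names as we recall them; treat the exact signatures as assumptions):

    library(redistmetrics)
    # Score a set of plans on compactness and partisan fairness
    polsby <- comp_polsby(plans, shp)                              # Polsby-Popper
    egap   <- part_egap(plans, shp, rvote = rvote, dvote = dvote)  # efficiency gap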

birdie: Bayesian Instrumental Regression for Disparity Estimation
Bayesian models for accurately estimating conditional distributions by race, using Bayesian Improved Surname Geocoding (BISG) probability estimates of individual race. Implements the methods described in McCartan, Goldin, Ho, and Imai (2023).
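A sketch of the intended workflow (interface details as we recall them; treat the formula syntax as an assumption):

    library(birdie)
    # First compute BISG race probabilities, then fit a BIRDiE model
    r_probs <- bisg(~ nm(last_name) + zip(zip), data = voters)
    fit <- birdie(r_probs, turnout ~ 1, data = voters)
    coef(fit)  # estimated outcome distribution by racial group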

easycensus: Quickly Find, Extract, and Marginalize U.S. Census Tables
Extracting desired data using the proper Census variable names can be time-consuming. This package takes the pain out of that process by providing functions to quickly locate variables and download labeled tables from the Census APIs.
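A sketch of the intended call pattern (function and argument names as we recall them; treat them as assumptions):

    library(easycensus)
    cens_find_dec("race")                      # search decennial tables by keyword
    tbl <- cens_get_dec("P3", geo = "county")  # download a labeled county-level table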

PL94171: Tabulate P.L. 94-171 Redistricting Data Summary Files
Tools to process legacy-format summary redistricting data files produced by the United States Census Bureau pursuant to P.L. 94-171. These files are generally available earlier than the standard Census data products but are difficult to work with as-is.
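A sketch of the basic workflow (as we recall the interface; see the package vignette):

    library(PL94171)
    pl_raw <- pl_read(pl_url("RI", 2020))  # download and read the legacy files
    pl <- pl_select_standard(pl_raw)       # standardized population and VAP columns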

adjustr: Stan Model Adjustments and Sensitivity Analyses using Importance Sampling
Functions to help assess the sensitivity of a Bayesian model (fitted using the rstan package) to the specification of its likelihood and priors. Users provide a series of alternate sampling specifications, and the package uses Pareto-smoothed importance sampling to estimate posterior quantities of interest under each specification.
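A sketch of the workflow for a fitted rstan model `fit` (interface as we recall it; model and parameter names are illustrative):

    library(adjustr)
    # Try Student-t likelihoods with varying degrees of freedom
    spec <- make_spec(y ~ student_t(df, mu, sigma), df = 1:10)
    adjusted <- adjust_weights(spec, fit)  # Pareto-smoothed importance weights
    summarize(adjusted, mean(mu))          # posterior mean of mu under each spec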

conformalbayes: Jackknife(+) Predictive Intervals for Bayesian Models
Provides functions to construct finite-sample calibrated predictive intervals
for Bayesian models, following the approach in
Barber et al. (2021).
These intervals are calculated efficiently using importance sampling for the
leave-one-out residuals. By default, the intervals will also reflect the
relative uncertainty in the Bayesian model, using the locally-weighted
conformal methods of Lei et al. (2018).
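A sketch of the intended usage (function names as we recall them; treat them as assumptions):

    library(conformalbayes)
    library(rstanarm)
    fit <- stan_glm(mpg ~ wt, data = mtcars, refresh = 0)
    fit <- loo_conformal(fit)  # precompute importance-sampled LOO residuals
    # Jackknife(+) 90% predictive intervals on new data
    predictive_interval(fit, newdata = mtcars[1:5, ], prob = 0.90)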

alarmdata: Download, Merge, and Process Redistricting Data
Utility functions to download and process data produced by the ALARM Project,
including 2020 redistricting files
and 50-State Redistricting Simulations.
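A sketch of typical usage (function names as we recall them):

    library(alarmdata)
    map   <- alarm_50state_map("NC")    # geography, demographics, and elections
    plans <- alarm_50state_plans("NC")  # ensemble of simulated plans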

blockpop: Estimate Census Block Populations for 2020
2020 Census data is delayed and affected by differential privacy. This package
uses FCC block-level population estimates from 2010–2019, which are based on
new roads and map data, along with decennial Census and ACS data, to estimate
2020 block populations, both overall and by major race/ethnicity categories
(using iterative proportional fitting).
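For intuition, a generic illustration of iterative proportional fitting (not the package's internal code): rescale a seed table until it matches known row (block) and column (race/ethnicity) totals.

    ipf <- function(seed, row_margin, col_margin, tol = 1e-8) {
      tab <- seed
      repeat {
        tab <- tab * (row_margin / rowSums(tab))              # match row totals
        tab <- sweep(tab, 2, col_margin / colSums(tab), `*`)  # match column totals
        if (max(abs(rowSums(tab) - row_margin)) < tol) break
      }
      tab
    }
    seed <- matrix(1, nrow = 2, ncol = 3)  # 2 blocks x 3 race categories
    ipf(seed, row_margin = c(100, 200), col_margin = c(150, 100, 50))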

ggredist: Scales, Palettes, and Extensions of ggplot2 for Redistricting
Provides ggplot2 extensions for political mapmaking, including new geometries, easy label generation and placement, automatic map coloring, and map scales, palettes, and themes.
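A sketch of intended usage (function names and aesthetics as we recall them; treat them as assumptions):

    library(ggplot2)
    library(ggredist)
    # `shp` is a hypothetical precinct shapefile with district assignments
    ggplot(shp, aes(group = district, fill = dem_share)) +
      geom_district() +       # merge precincts into district shapes
      scale_fill_party_c() +  # continuous red-to-blue party palette
      theme_map()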

tinytiger: Lightweight Interface to TIGER/Line Shapefiles
Functions to download and read in geographic boundary data from the United States Census Bureau's TIGER/Line Shapefiles. All downloads can be cached to avoid repeated requests. Data are available back to 2000 for most geographies.
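A sketch of typical usage (as we recall the interface):

    library(tinytiger)
    library(sf)  # for plotting the returned geometries
    wa <- tt_counties(state = "WA")  # county boundaries for Washington
    plot(wa$geometry)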

wacolors: Colorblind-friendly Palettes from Washington State
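A sketch of presumed usage, assuming the package provides ggplot2 scale functions and named palettes (treat both as assumptions):

    library(ggplot2)
    library(wacolors)
    ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
      geom_point() +
      scale_color_wacolors(palette = "rainier")
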
Other Writing
Candy cane shortages and the importance of variation. December 21, 2021. International Statistical Institute: Statisticians React to the News.
Where will the rocket land? May 12, 2021. International Statistical Institute: Statisticians React to the News.
Who’s the most electable Democrat? It might be Warren or Buttigieg, not Biden. October 23, 2019. The Washington Post.
I-405 Express Toll Lanes: Usage, benefits, and equity. 2019. Technical report for the Washington State Department of Transportation. Project website
Project summary
Congestion is increasing in cities around the country, and particularly in the Seattle region. Local governments are increasingly experimenting with congestion pricing schemes to manage it. The Washington State Department of Transportation (WSDOT) opened a congestion pricing facility in 2015 on I-405, which runs through the eastern suburbs of Seattle. The facility operates by selling extra space in the high-occupancy vehicle (HOV) lanes to single-occupancy vehicles (SOVs), dynamically adjusting the price of entry to manage demand and keep the lanes moving. These combined HOV and tolled SOV lanes are called High Occupancy Tolling (HOT) lanes.
While the HOT lanes have been in operation for over three years, there has been little research into their equity impacts. Using data on every trip taken on the I-405 HOT lanes in 2018, demographic data on census block groups, and lane speed, volume, and travel time data, we addressed this gap. We studied how the express toll lanes are used, the benefits they provide to the region, and how those benefits are distributed among different groups of users.
Contact
Science Center, Ste. 400
1 Oxford St.
Cambridge MA 02138