I work on methodological and applied problems in the social sciences, with particular emphasis on Bayesian and computational methods and spatial or geographic data. Much of my research has developed or applied algorithmic tools to the problem of legislative redistricting; I helped start the Algorithm-Assisted Redistricting Methodology (ALARM) Project at Harvard University in 2021. I also develop and maintain a number of open-source R packages for redistricting, statistical analysis, and visualization.
Estimating Racial Disparities When Race is Not Observed (2023).
The estimation of racial disparities in health care, financial services, voting, and other contexts is often hampered by the lack of individual-level racial information in administrative records. In many cases, the law prohibits the collection of such information to prevent direct racial discrimination. As a result, many analysts have adopted Bayesian Improved Surname Geocoding (BISG), which combines individual names and addresses with the Census data to predict race. Although BISG tends to produce well-calibrated racial predictions, its residuals are often correlated with the outcomes of interest, yielding biased estimates of racial disparities. We propose an alternative identification strategy that corrects this bias. The proposed strategy is applicable whenever one's surname is conditionally independent of the outcome given their (unobserved) race, residence location, and other observed characteristics. Leveraging this identification strategy, we introduce a new class of models, Bayesian Instrumental Regression for Disparity Estimation (BIRDiE), that estimate racial disparities by using surnames as a high-dimensional instrumental variable for race. Our estimation method is scalable, making it possible to analyze large-scale administrative data. We also show how to address potential violations of the key identification assumptions. A validation study based on the North Carolina voter file shows that BIRDiE reduces error by up to 84% in comparison to the standard approaches for estimating racial differences in party registration. Open-source software is available which implements the proposed methodology.
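For readers unfamiliar with BISG, the Bayes-rule update it relies on can be sketched roughly as follows. This illustrates only the standard BISG step, not the BIRDiE model; the function name `bisg_posterior` and all probability values are hypothetical, made-up numbers rather than Census figures.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine surname and location evidence with Bayes' rule.

    Assumes surname and location are conditionally independent given race,
    so P(race | surname, geo) is proportional to
    P(race | surname) * P(geo | race).
    """
    unnormalized = {
        race: p_race_given_surname[race] * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    return {race: value / total for race, value in unnormalized.items()}


# Hypothetical inputs (illustrative numbers only, not Census data).
p_race_given_surname = {"white": 0.60, "black": 0.30, "hispanic": 0.10}
p_geo_given_race = {"white": 0.001, "black": 0.004, "hispanic": 0.002}

posterior = bisg_posterior(p_race_given_surname, p_geo_given_race)
for race, prob in posterior.items():
    print(f"{race}: {prob:.2f}")
```

In this toy example the location evidence shifts probability toward the group that is relatively more common in the hypothetical area, even though the surname alone pointed elsewhere.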
Individual and Differential Harm in Redistricting (2022).
Social scientists have developed dozens of measures for assessing partisan bias in redistricting. But these measures cannot be easily adapted to other groups, including those defined by race, class, or geography. Nor are they applicable to single- or no-party contexts such as local redistricting. To overcome these limitations, we propose a unified framework of *harm* for evaluating the impacts of a districting plan on individual voters and the groups to which they belong. We consider a voter harmed if their chosen candidate is not elected under the current plan, but would be under a different plan. Harm improves on existing measures by both focusing on the choices of individual voters and directly incorporating counterfactual plans. We discuss strategies for estimating harm, and demonstrate the utility of our framework through analyses of partisan gerrymandering in New Jersey, voting rights litigation in Alabama, and racial dynamics of Boston City Council elections.
Evaluating Bias and Noise Induced by the U.S. Census Bureau’s Privacy Protection Methods (2023). Under Review.
The United States Census Bureau faces a difficult trade-off between the accuracy of Census statistics and the protection of individual information. We conduct the first independent evaluation of bias and noise induced by the Bureau's two main disclosure avoidance systems: the TopDown algorithm employed for the 2020 Census and the swapping algorithm implemented for the 1990, 2000, and 2010 Censuses. Our evaluation leverages the recent release of the Noisy Measurement File (NMF) as well as the availability of two independent runs of the TopDown algorithm applied to the 2010 decennial Census. We find that the NMF contains too much noise to be directly useful alone, especially for Hispanic and multiracial populations. TopDown's post-processing dramatically reduces the NMF noise and produces similarly accurate data to swapping in terms of bias and noise. These patterns hold across census geographies with varying population sizes and racial diversity. While the estimated errors for both TopDown and swapping are generally no larger than other sources of Census error, they can be relatively substantial for geographies with small total populations.
Finding Pareto Efficient Redistricting Plans with Short Bursts (2023).
Redistricting practitioners must balance many competing constraints and criteria when drawing district boundaries. To aid in this process, researchers have developed many methods for optimizing districting plans according to one or more criteria. This research note extends a recently-proposed single-criterion optimization method, short bursts (Cannon et al., 2023), to handle the multi-criterion case, and in doing so approximate the Pareto frontier for any set of constraints. We study the empirical performance of the method in a realistic setting and find it behaves as expected and is not very sensitive to algorithmic parameters. The proposed approach, which is implemented in open-source software, should allow researchers and practitioners to better understand the tradeoffs inherent to the redistricting process.
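To make the notion of a Pareto frontier concrete, here is a minimal sketch that filters a fixed set of scored plans down to the non-dominated ones. It does not implement the short-bursts optimizer described above; the function name `pareto_front`, the two criteria, and the scores are hypothetical, and every criterion is assumed to be one where larger is better.

```python
def pareto_front(scores):
    """Return indices of plans that no other plan dominates.

    Each entry of `scores` is a tuple of criteria, all to be maximized.
    Plan a dominates plan b if a is at least as good on every criterion
    and strictly better on at least one.
    """
    def dominates(a, b):
        at_least_as_good = all(x >= y for x, y in zip(a, b))
        strictly_better = any(x > y for x, y in zip(a, b))
        return at_least_as_good and strictly_better

    return [
        i for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Hypothetical (compactness, competitiveness) scores for five plans.
plan_scores = [(0.30, 0.50), (0.40, 0.40), (0.25, 0.45), (0.40, 0.35), (0.10, 0.90)]
front = pareto_front(plan_scores)
print(front)
```

Plans left out of the result are those for which some other plan is at least as good on both criteria, so only the remaining plans represent genuine tradeoffs.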
Measuring and Modeling Neighborhoods (2023).
American Political Science Review, Conditionally Accepted.
Granular geographic data present new opportunities to understand how neighborhoods are formed, and how they influence politics. At the same time, the inherent subjectivity of neighborhoods creates methodological challenges in measuring and modeling them. We develop an open-source survey instrument that allows respondents to draw their neighborhoods on a map. We also propose a statistical model to analyze how the characteristics of respondents and local areas determine subjective neighborhoods. We conduct two surveys: collecting subjective neighborhoods from voters in Miami, New York City, and Phoenix, and asking New York City residents to draw a community of interest for inclusion in their city council district. Our analysis shows that, holding other factors constant, White respondents include census blocks with more White residents in their neighborhoods. Similarly, Democrats and Republicans are more likely to include co-partisan areas. Furthermore, our model provides more accurate out-of-sample predictions than standard neighborhood measures.
Making Differential Privacy Work for Census Data Users (2023). Harvard Data Science Review 5:4.
The U.S. Census Bureau collects and publishes detailed demographic data about Americans which are heavily used by researchers and policymakers. The Bureau has recently adopted the framework of differential privacy in an effort to improve confidentiality of individual census responses. A key output of this privacy protection system is the Noisy Measurement File (NMF), which is produced by adding random noise to tabulated statistics. The NMF is critical to understanding any biases in the data, and performing valid statistical inference on published census data. Unfortunately, the current release format of the NMF is difficult to access and work with. We describe the process we use to transform the NMF into a usable format, and provide recommendations to the Bureau for how to release future versions of the NMF. These changes are essential for ensuring transparency of privacy measures and reproducibility of
Sequential Monte Carlo for Sampling Balanced and Compact Redistricting Plans (2023).
Annals of Applied Statistics 17:4, 3300-3323.
Covered by The Washington Post, Quanta magazine.
Random sampling of graph partitions under constraints has become a popular tool for evaluating legislative redistricting plans. Analysts detect partisan gerrymandering by comparing a proposed redistricting plan with an ensemble of sampled alternative plans. For successful application, sampling methods must scale to large maps with many districts, incorporate realistic legal constraints, and accurately and efficiently sample from a selected target distribution. Unfortunately, most existing methods struggle in at least one of these areas. We present a new Sequential Monte Carlo (SMC) algorithm that generates a sample of redistricting plans converging to a realistic target distribution. Because it draws many plans in parallel, the SMC algorithm can efficiently explore the relevant space of redistricting plans better than the existing Markov chain Monte Carlo (MCMC) algorithms that generate plans sequentially. Our algorithm can simultaneously incorporate several constraints commonly imposed in real-world redistricting problems, including equal population, compactness, and preservation of administrative boundaries. We validate the accuracy of the proposed algorithm by using a small map where all redistricting plans can be enumerated. We then apply the SMC algorithm to evaluate the partisan implications of several maps submitted by relevant parties in a recent high-profile redistricting case in the state of Pennsylvania. We find that the proposed algorithm converges to the target distribution faster and with fewer samples than a state-of-the-art MCMC algorithm. Open-source software is available for implementing the proposed methodology.
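The ensemble comparison mentioned at the start of this abstract can be sketched in a few lines. This is only the final comparison step, assuming the sampled plans have already been generated and summarized; the function name `ensemble_percentile` and the seat counts are hypothetical, and the SMC sampler itself is not shown.

```python
def ensemble_percentile(enacted_value, ensemble_values):
    """Share of sampled plans whose statistic is at or below the enacted plan's."""
    at_or_below = sum(1 for v in ensemble_values if v <= enacted_value)
    return at_or_below / len(ensemble_values)


# Hypothetical expected seat counts for one party under ten sampled plans.
ensemble_seats = [7, 8, 8, 9, 9, 9, 9, 10, 10, 11]
enacted_seats = 7  # hypothetical value for the plan under evaluation

share = ensemble_percentile(enacted_seats, ensemble_seats)
print(f"{share:.0%} of sampled plans yield at most {enacted_seats} seats")
```

An enacted plan that falls in the extreme tail of the ensemble is unusual relative to the sampled alternatives, which is the kind of evidence analysts weigh when assessing a plan.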
Widespread Partisan Gerrymandering Mostly Cancels Nationally, but Reduces Electoral Competition (2023).
Proceedings of the National Academy of Sciences 120:25, e2217322120.
Congressional district lines in many U.S. states are drawn by partisan actors, raising concerns about gerrymandering. To isolate the electoral impact of gerrymandering from the effects of other factors including geography and redistricting rules, we compare predicted election outcomes under the enacted plan with those under a large sample of non-partisan, simulated alternative plans for all states. We find that partisan gerrymandering is widespread in the 2020 redistricting cycle, but most of the bias it creates cancels at the national level, giving Republicans two additional seats, on average. In contrast, moderate pro-Republican bias due to geography and redistricting rules remains. Finally, we find that partisan gerrymandering reduces electoral competition and makes the House's partisan composition less responsive to shifts in the national vote.
Researchers Need Better Access to U.S. Census Data (2023). Science 380:6648, 902-903.
Recalibration of Predicted Probabilities Using the “Logit Shift”: Why Does it Work, and When Can it be Expected to Work Well? (2023). Political Analysis 31:4, 651-661.
The output of predictive models is routinely recalibrated by reconciling low-level predictions with known quantities defined at higher levels of aggregation. For example, models predicting vote probabilities at the individual level in U.S. elections can be adjusted so that their aggregation matches the observed vote totals in each county, thus producing better calibrated predictions. In this research note, we provide theoretical grounding for one of the most commonly used recalibration strategies, known colloquially as the "logit shift." Typically cast as a heuristic adjustment strategy (whereby a constant correction on the logit scale is found, such that aggregated predictions match target totals), we show that the logit shift offers a fast and accurate approximation to a principled, but computationally impractical adjustment strategy: computing the posterior prediction probabilities, conditional on the observed totals. After deriving analytical bounds on the quality of the approximation, we illustrate its accuracy using Monte Carlo simulations. We also discuss scenarios in which the logit shift is less effective at recalibrating predictions: when the target totals are defined only for highly heterogeneous populations, and when the original predictions correctly capture the mean of true individual probabilities, but fail to capture the shape of their distribution.
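The heuristic described above, finding one constant on the logit scale so that the aggregated predictions match a target total, can be sketched with a simple bisection search. The function names and the example probabilities are hypothetical; the sketch assumes every probability lies strictly between 0 and 1 and that the required shift falls within the search bracket of -50 to 50.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit_shift(probs, target_total, tol=1e-10):
    """Find a constant c such that sum(expit(logit(p) + c)) matches target_total.

    The shifted total is increasing in c, so bisection is sufficient.
    """
    if not 0 < target_total < len(probs):
        raise ValueError("target_total must lie strictly between 0 and len(probs)")
    logits = [logit(p) for p in probs]
    lo, hi = -50.0, 50.0  # assumed bracket for the shift
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        total = sum(expit(z + mid) for z in logits)
        if total < target_total:
            lo = mid
        else:
            hi = mid
    shift = (lo + hi) / 2.0
    return shift, [expit(z + shift) for z in logits]


# Hypothetical individual-level predictions and an observed aggregate total.
probs = [0.2, 0.5, 0.7, 0.4]
shift, adjusted = logit_shift(probs, target_total=2.4)
print(round(shift, 3), [round(p, 3) for p in adjusted])
```

Because the same constant is added to every logit, the ordering of individuals by predicted probability is unchanged; only the overall level moves to match the total.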
Comment: the Essential Role of Policy Evaluation for the 2020 Census Disclosure Avoidance System (2023). Harvard Data Science Review, Special Issue 2. Response to boyd and Sarathy (2022).
In "Differential Perspectives: Epistemic Disconnects Surrounding the US Census Bureau's Use of Differential Privacy," boyd and Sarathy argue that empirical evaluations of the Census Disclosure Avoidance System (DAS), including our published analysis, failed to recognize how the benchmark data against which the 2020 DAS was evaluated is never a ground truth of population counts. In this commentary, we explain why policy evaluation, which was the main goal of our analysis, is still meaningful without access to a perfect ground truth. We also point out that our evaluation leveraged features specific to the decennial Census and redistricting data, such as block-level population invariance under swapping and voter file racial identification, better approximating a comparison with the ground truth. Lastly, we show that accurate statistical predictions of individual race based on the Bayesian Improved Surname Geocoding, while not a violation of differential privacy, substantially increase the disclosure risk of private information the Census Bureau sought to protect. We conclude by arguing that policy makers must confront a key trade-off between data utility and privacy protection, and an epistemic disconnect alone is insufficient to explain disagreements between policy choices.
Simulated Redistricting Plans for the Analysis and Evaluation of Redistricting in the United States (2022).
Nature: Scientific Data 9:1, 689.
A collection of simulated congressional districting plans and underlying code developed by the Algorithm-Assisted Redistricting Methodology (ALARM) Project. The data allow for the evaluation of enacted and other congressional redistricting plans in the United States. While the use of redistricting simulation algorithms has become standard in academic research and court cases, any simulation analysis requires non-trivial efforts to combine multiple data sets, identify state-specific redistricting criteria, implement complex simulation algorithms, and summarize and visualize simulation outputs. We have developed a complete workflow that facilitates this entire process of simulation-based redistricting analysis for the congressional districts of all 50 states. The resulting data include ensembles of simulated 2020 congressional redistricting plans and necessary replication data. We provide the underlying code, which serves as a template for customized analyses. All data and code are free and publicly available.
The Use of Differential Privacy for Census Data and Its Impact on Redistricting: the Case of the 2020 U.S. Census (2021).
Science Advances 7:41, eabk3283.
Originally a Public Comment to the Census Bureau (May 28, 2021).
Covered by The Washington Post, the Associated Press, the San Francisco Chronicle, NC Policy Watch, and others.
Census statistics play a key role in public policy decisions and social science research. However, given the risk of revealing individual information, many statistical agencies are considering disclosure control methods based on differential privacy, which add noise to tabulated data. Unlike other applications of differential privacy, however, census statistics must be postprocessed after noise injection to be usable. We study the impact of the U.S. Census Bureau's latest disclosure avoidance system (DAS) on a major application of census statistics, the redrawing of electoral districts. We find that the DAS systematically undercounts the population in mixed-race and mixed-partisan precincts, yielding unpredictable racial and partisan biases. While the DAS leads to a likely violation of the "One Person, One Vote" standard as currently interpreted, it does not prevent accurate predictions of an individual's race and ethnicity. Our findings underscore the difficulty of balancing accuracy and respondent privacy in the Census.
Geodesic Interpolation on Sierpinski Gaskets (2021). Journal of Fractal Geometry 8:2, 117-152.
We study the analogue of a convex interpolant of two sets on Sierpiński gaskets and an associated notion of measure transport. The structure of a natural family of interpolating measures is described and an interpolation inequality is established. A key tool is a good description of geodesics on these gaskets, some results on which have previously appeared in the literature.
redist: Simulation Methods for Legislative Redistricting
Enables researchers to sample redistricting plans from a pre-specified target distribution using Sequential Monte Carlo and Markov Chain Monte Carlo algorithms. The package allows for the implementation of various constraints in the redistricting process such as geographic compactness and population parity requirements. Tools for analysis such as computation of various summary statistics and plotting functionality are also included. The package implements the SMC algorithm of McCartan and Imai (2020), the enumeration algorithm of Fifield, Imai, Kawahara, and Kenny (2020), the Flip MCMC algorithm of Fifield, Higgins, Imai and Tarr (2020), the Merge-split/Recombination algorithms of Carter et al. (2019) and DeFord et al. (2021), and the Short-burst optimization algorithm of Cannon et al. (2020).
redistmetrics: Redistricting Metrics
Reliable and flexible tools for scoring redistricting plans using common measures and metrics. These functions provide key direct access to tools useful for non-simulation analyses of redistricting plans, such as for measuring compactness or partisan fairness. Tools are designed to work with the redist package.
birdie: Bayesian Instrumental Regression for Disparity Estimation
Bayesian models for accurately estimating conditional distributions by race, using Bayesian Improved Surname Geocoding (BISG) probability estimates of individual race. Implements the methods described in McCartan, Goldin, Ho and Imai (2023).
easycensus: Quickly Find, Extract, and Marginalize U.S. Census Tables
Extracting desired data using the proper Census variable names can be time-consuming. This package takes the pain out of that process by providing functions to quickly locate variables and download labeled tables from the Census APIs (https://www.census.gov/data/developers/data-sets.html).
PL94171: Tabulate P.L. 94-171 Redistricting Data Summary Files
Tools to process legacy format summary redistricting data files produced by the United States Census Bureau pursuant to P.L. 94-171. These files are generally available earlier but are difficult to work with as-is.
adjustr: Stan Model Adjustments and Sensitivity Analyses using Importance Sampling
Functions to help assess the sensitivity of a Bayesian model (fitted using the rstan package) to the specification of its likelihood and priors. Users provide a series of alternate sampling specifications, and the package uses Pareto-smoothed importance sampling to estimate posterior quantities of interest under each specification.
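The reweighting idea behind this kind of sensitivity analysis can be sketched as follows. This is plain self-normalized importance sampling rather than the Pareto-smoothed variant the package uses, and it does not call rstan or the package itself; the function names, the stand-in posterior draws, and both prior specifications are hypothetical.

```python
import math
import random


def normal_logpdf(x, mean, sd):
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((x - mean) / sd) ** 2


def reweight(draws, log_orig, log_alt):
    """Self-normalized importance weights for draws from the original posterior.

    Each weight is proportional to the alternate specification's density
    divided by the original specification's density at that draw.
    """
    log_w = [log_alt(d) - log_orig(d) for d in draws]
    max_log_w = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(lw - max_log_w) for lw in log_w]
    total = sum(w)
    return [x / total for x in w]


# Stand-in for posterior draws from a model fitted with a Normal(0, 10) prior.
random.seed(1)
draws = [random.gauss(0.0, 1.0) for _ in range(20000)]

# Alternate specification: swap the prior for Normal(1, 1).
weights = reweight(
    draws,
    log_orig=lambda t: normal_logpdf(t, 0.0, 10.0),
    log_alt=lambda t: normal_logpdf(t, 1.0, 1.0),
)
alt_mean = sum(w * d for w, d in zip(weights, draws))
print(f"posterior mean under the alternate prior: {alt_mean:.3f}")
```

The model is never refitted: the existing draws are reused with new weights, which is what makes it cheap to compare many alternate specifications.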
causaltbl: Tidy Causal Data Frames and Tools
Provides a causal_tbl class for causal inference. A causal_tbl keeps track of information on the roles of variables like treatment and outcome, and provides functionality to store models and their fitted values as columns in a data frame.
conformalbayes: Jackknife(+) Predictive Intervals for Bayesian Models
Provides functions to construct finite-sample calibrated predictive intervals for Bayesian models, following the approach in Barber et al. (2021). These intervals are calculated efficiently using importance sampling for the leave-one-out residuals. By default, the intervals will also reflect the relative uncertainty in the Bayesian model, using the locally-weighted conformal methods of Lei et al. (2018).
alarmdata: Download, Merge, and Process Redistricting Data
Utility functions to download and process data produced by the ALARM Project, including the 2020 redistricting files of Kenny and McCartan (2021) and the 50-State Redistricting Simulations of McCartan, Kenny, Simko, Garcia, Wang, Wu, Kuriwaki, and Imai. The package extends the data introduced in McCartan, Kenny, Simko, Garcia, Wang, Wu, Kuriwaki, and Imai to also include states with only a single district.
blockpop: Estimate Census Block Populations for 2020
Uses FCC block-level population estimates from 2010–2019, which are based on new roads and map data, along with decennial Census and ACS data, to estimate 2020 block populations.
ggredist: Scales, Geometries, and Extensions of ggplot2 for Election Mapping
ggplot2 extensions for political map making. Implements new geometries for groups of simple feature geometries. Adds palettes and scales for red to blue color mapping and for discrete maps. Implements tools for easy label generation and placement, automatic map coloring, and themes.
tinytiger: Lightweight Interface to TIGER/Line Shapefiles
Download geographic shapes from the United States Census Bureau TIGER/Line Shapefiles. Functions support downloading and reading in geographic boundary data. All downloads can be set up with a cache to avoid multiple downloads. Data is available back to 2000 for most geographies.
wacolors: Colorblind-Friendly Palettes from Washington State
Color palettes taken from the landscapes and cities of Washington state. Colors were extracted from a set of photographs, and then combined to form a set of continuous and discrete palettes. Continuous palettes were designed to be perceptually uniform, while discrete palettes were chosen to maximize contrast at several different levels of overall brightness and saturation. Each palette has been evaluated to ensure colors are distinguishable by colorblind people.
nbhdmodel: Neighborhood Modeling and Analysis
Functionality for fitting neighborhood models of McCartan, Brown, and Imai. The core methodology is described in the paper and can be implemented with any tool that can fit generalized linear mixed models (GLMMs). However, some of the preprocessing necessary to set up the GLMM can be onerous. In addition to providing a specialized GLMM routine, this package provides several preprocessing functions that, while not completely general, should be useful for others performing these kinds of analyses.
Candy Cane Shortages and the Importance of Variation (December 21, 2021). International Statistical Institute: Statisticians React to the News.
Where Will the Rocket Land? (May 12, 2021). International Statistical Institute: Statisticians React to the News.
Who’s the Most Electable Democrat? It Might be Warren or Buttigieg, Not Biden (October 23, 2019). The Washington Post.
I-405 Express Toll Lanes: Usage, Benefits, and Equity (2019).
Technical report for the Washington State Department of Transportation.
Congestion is increasing in cities around the country, and particularly in the Seattle region. Local governments are increasingly experimenting with congestion pricing schemes to manage congestion. The Washington State Department of Transportation (WSDOT) opened a congestion pricing facility in 2015 on I-405, which runs through the eastern suburbs of Seattle. The facility operates by selling extra space in the high-occupancy vehicle (HOV) lanes to single-occupancy vehicles (SOVs), and dynamically changing the price of entry to manage demand and keep the lanes operating. These combined HOV and tolled SOV lanes are called High Occupancy Tolling (HOT) lanes.
While the HOT lanes have been operating for over three years, there has been little research into their equity impacts. Using data on each trip made on the I-405 HOT lanes in 2018, demographic data on census block groups, and lane speed, volume, and travel time data, we investigated these impacts. We studied how the express toll lanes are used, the benefits they provide to the region, and how these benefits are distributed among different groups of users.