Fits one of two possible Bayesian Instrumental Regression for Disparity
Estimation (BIRDiE) models to BISG probabilities and covariates. The simplest
Categorical-Dirichlet model (cat_dir()
) is appropriate when there are no
covariates or when all covariates are discrete and fully interacted with
another. The more general Categorical mixed-effects model (cat_mixed()
) is
a supports any number of fixed effects and up to one random intercept. For
continuous outcomes a Normal linear model is available (gaussian()
).
Usage
birdie(
r_probs,
formula,
data = NULL,
weights = NULL,
family = cat_dir(),
model = NULL,
prior = NULL,
prefix = "pr_",
se_boot = 0,
ctrl = birdie.ctrl()
)
Arguments
- r_probs
A data frame or matrix of BISG probabilities, with one row per individual. The output of
bisg()
can be used directly here.- formula
A two-sided formula object describing the model structure. The left-hand side is the outcome variable, which must be discrete. A single random intercept term, denoted with a vertical bar (
"(1 | <term>)"
), is supported on the right-hand side.- data
An optional data frame containing the variables named in
formula
.- weights
An optional numeric vector specifying likelihood weights.
- family
A description of the complete-data model type to fit. Options are:
cat_dir()
: Categorical-Dirichlet model. All covariates must be fully interacted.cat_mixed()
: Categorical mixed-effects model. Up to one random effect is supported.gaussian()
: Linear model.
See the Details section below for more information on the various models.
- model
Deprecated.
- prior
A list with entries specifying the model prior.
For the
cat_dir
model, the only entry isalpha
, which should be a matrix of Dirichlet hyperparameters. The matrix should have one row for every level of the outcome variable and one column for every racial group. The default prior (used whenprior=NULL
) is an empirical Bayes prior equal to 1 plus the weighted-mean estimate of the outcome-race table. A fully noninformative prior with all entries set to \(1+\epsilon\) can be obtained by settingprior=NA
.For the
cat_mixed
model, theprior
list should contain three scalar entries:scale_int
, the standard deviation on the Normal prior for the intercepts (which control the global estimates ofY|R
),scale_beta
, the standard deviation on the Normal prior for the fixed effects, andscale_sigma
, the prior mean of the standard deviation of the random intercepts. These can be a single scalar or a vector with an entry for each racial group.For the
gaussian
model, theprior
list should contain two entries:scale_int
, controlling, the standard deviation on the Normal prior for the intercepts (which control the global estimates ofY|R
), andscale_beta
, controlling the standard deviation on the Normal prior for the fixed effects. These must be a single scalar. Each is expressed in terms of the estimated residual standard deviation (i.e., they are multiplied together to form the "true" prior).
The prior is stored after model fitting in the
$prior
element of the fitted model object.- prefix
If
r_probs
is a data frame, the columns containing racial probabilities will be selected as those with names starting withprefix
. The default will work with the output ofbisg()
.- se_boot
The number of bootstrap replicates to use to compute approximate standard errors for the main model estimates. Only available when
familiy=cat_dir()
orgaussian()
. When there are fewer than 1,000 individuals or 100 or fewer replicates, a Bayesian bootstrap is used instead (i.e., weights are drawn from a \(\text{Dirichlet}(1, 1, ..., 1)\) distribution, which produces more reliable estimates.- ctrl
A list containing control parameters for the EM algorithm and optimization routines. A list in the proper format can be made using
birdie.ctrl()
.
Value
An object of class birdie
, for which many
methods are available. The model estimates may be accessed with
coef.birdie()
, and updated BISG probabilities (conditioning on the
outcome) may be accessed with fitted.birdie()
.
Details
birdie()
uses an expectation-maximization (EM) routine to find the maximum
a posteriori (MAP) estimate for the specified model. Asymptotic
variance-covariance matrices for the MAP estimate are available for the
Categorical-Dirichlet and Normal linear models via bootstrapping (se_boot
).
The Categorical-Dirichlet model is specified as follows: $$
Y_i \mid R_i, X_i, \Theta \sim \text{Categorical}(\theta_{R_iX_i}) \\
\theta_{rx} \sim \text{Dirichlet}(\alpha_r),
$$ where \(Y\) is the outcome variable, \(R\) is race, \(X\) are
covariates (fixed effects), and \(\theta_{rx}\) and \(\alpha_r\) are
vectors with length matching the number of levels of the outcome variable.
There is one vector \(\theta_{rx}\) for every combination of race and
covariates, hence the need for formula
to either have no covariates or a
fully interacted structure.
The Categorical mixed-effects model is specified as follows: $$
Y_i \mid R_i, X_i, \Theta \sim \text{Categorical}(g^{-1}(\mu_{R_iX_i})) \\
\mu_{rxy} = W\beta_{ry} + Zu_{ry} \\
u_{ry} \mid \sigma^2_{ry} \sim \mathcal{N}(0, \sigma^2_{ry}) \\
\beta_{ry} \sim \mathcal{N}(0, s^2_{r\beta}) \\
\sigma_{ry} \sim \text{Gamma}(2, 2/s_{r\sigma}),
$$ where \(\beta_{ry}\) are the fixed effects, \(u_{ry}\) is the random
intercept, and \(g\) is a softmax link function.
Estimates for \(\beta_{ry}\) and \(\sigma_{ry}\) are stored in the
$beta
and $sigma
elements of the fitted model object.
The Normal linear model is specified as follows: $$
Y_i \mid R_i, \vec X_i, \Theta \sim \mathcal{N}(\vec X_i^\top\vec\theta, \sigma^2) \\
\sigma^2 \sim \text{Inv-Gamma}(n_\sigma/2, l_\sigma^2 n_\sigma/2) \\
\beta_{\text{intercept}} \sim \mathcal{N}(0, s^2_\text{int}) \\
\beta_k \sim \mathcal{N}(0, s^2_\beta), \\
$$ where \(\vec\theta\) is a vector of linear model coefficients.
Estimates for \(\theta\) and \(\sigma\) are stored in the
$beta
and $sigma
elements of the fitted model object.
More details on the models and their properties may be found in the paper referenced below.
References
McCartan, C., Goldin, J., Ho, D.E., & Imai, K. (2022). Estimating Racial Disparities when Race is Not Observed. Available at https://arxiv.org/abs/2303.02580.
Examples
data(pseudo_vf)
r_probs = bisg(~ nm(last_name) + zip(zip), data=pseudo_vf)
# Process zip codes to remove missing values
pseudo_vf$zip = proc_zip(pseudo_vf$zip)
birdie(r_probs, turnout ~ 1, data=pseudo_vf)
#> Using weakly informative empirical Bayes prior for Pr(Y | R)
#> This message is displayed once every 8 hours.
#> Categorical-Dirichlet BIRDiE model
#> Formula: turnout ~ 1
#> Data: pseudo_vf
#> Number of obs: 5,000
#> Estimated distribution:
#> white black hisp asian aian other
#> no 0.3 0.358 0.391 0.61 0.768 0.269
#> yes 0.7 0.642 0.609 0.39 0.232 0.731
birdie(r_probs, turnout ~ zip, data=pseudo_vf)
#> Categorical-Dirichlet BIRDiE model
#> Formula: turnout ~ zip
#> Data: pseudo_vf
#> Number of obs: 5,000
#> Estimated distribution:
#> white black hisp asian aian other
#> no 0.302 0.347 0.358 0.559 0.666 0.399
#> yes 0.698 0.653 0.642 0.441 0.334 0.601
fit = birdie(r_probs, turnout ~ (1 | zip), data=pseudo_vf,
family=cat_mixed(), ctrl=birdie.ctrl(abstol=1e-3))
#> Using default prior for Pr(Y | R):
#> → Prior scale on intercepts: 2.0
#> → Prior scale on fixed effects coefficients: 0.2
#> → Prior mean of random effects standard deviation: 0.05
#> This message is displayed once every 8 hours.
summary(fit)
#> Categorical mixed-effects BIRDiE model
#> Formula: turnout ~ (1 | zip)
#> Data: pseudo_vf
#>
#> 7 iterations and 0.82 secs to convergence
#>
#> Number of observations: 5,000
#> Number of groups: 618
#>
#> Entropy decrease from marginal race distribution:
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -0.4858 0.1514 0.4055 0.4225 0.7326 1.0608
#> Entropy decrease from BISG probabilities:
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -0.761242 -0.007368 0.032008 0.051524 0.108660 0.851732
#>
#> Estimated outcome-by-race distribution:
#> white black hisp asian aian other
#> no 0.298 0.358 0.374 0.565 0.848 0.34
#> yes 0.702 0.642 0.626 0.435 0.152 0.66
coef(fit)
#> white black hisp asian aian other
#> no 0.2978312 0.3578046 0.374459 0.5651647 0.8475868 0.3395913
#> yes 0.7021688 0.6421954 0.625541 0.4348353 0.1524132 0.6604087
fitted(fit)
#> # A tibble: 5,000 × 6
#> pr_white pr_black pr_hisp pr_asian pr_aian pr_other
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.963 0.00320 0.0130 0.000473 0.00117 0.0190
#> 2 0.0236 0.953 0.00439 0.000787 0.00126 0.0165
#> 3 0.965 0.00580 0.00546 0.000699 0.0000697 0.0232
#> 4 0.590 0.371 0.00606 0.000528 0.000150 0.0325
#> 5 0.973 0.00164 0.00424 0.00271 0.00394 0.0147
#> 6 0.553 0.283 0.0989 0.00495 0.000271 0.0595
#> 7 0.133 0.758 0.0600 0.00316 0.000214 0.0456
#> 8 0.965 0.00417 0.0313 0 0 0
#> 9 0.748 0.219 0.00832 0.000421 0.0000919 0.0240
#> 10 0.878 0.0932 0.0122 0.000554 0.000158 0.0154
#> # ℹ 4,990 more rows