#### Research

## Published and Forthcoming Papers

Nearest-neighbor matching is a popular nonparametric tool to create balance between treatment and control groups in observational studies. As a preprocessing step before regression, matching reduces the dependence on parametric modeling assumptions. In current empirical practice, however, the matching step is often ignored in the calculation of standard errors and confidence intervals. In this article, we show that ignoring the matching step results in asymptotically valid standard errors if matching is done without replacement and the regression model is correctly specified relative to the population regression function of the outcome variable on the treatment variable and *all* the covariates used for matching. However, standard errors that ignore the matching step are not valid if matching is conducted with replacement or, more crucially, if the second step regression model is misspecified in the sense indicated above. Moreover, correct specification of the regression model is not required for consistent estimation of treatment effects with matched data. We show that two easily implementable alternatives produce approximations to the distribution of the post-matching estimator that are robust to misspecification. A simulation study and an empirical example demonstrate the empirical relevance of our results.
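The two-step procedure discussed in this abstract (match first, then regress on the matched sample) can be sketched as follows. This is a minimal illustration on simulated data, not the paper's implementation: the greedy matching rule, the single covariate, the data-generating process, and all function and variable names are assumptions made for the example. It computes only the point estimate and does not implement either of the robust standard-error alternatives.

```python
import numpy as np

def match_without_replacement(x_treated, x_control):
    """Greedy 1:1 nearest-neighbor matching on a scalar covariate.

    Returns, for each treated unit, the index of its matched control.
    Each control is used at most once (matching without replacement).
    """
    available = np.ones(len(x_control), dtype=bool)
    matches = np.empty(len(x_treated), dtype=int)
    for i, x in enumerate(x_treated):
        dist = np.abs(x_control - x)
        dist[~available] = np.inf      # exclude controls already used
        j = int(np.argmin(dist))
        matches[i] = j
        available[j] = False
    return matches

def ols(y, X):
    """Least-squares coefficients of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Simulated data (assumed for illustration): treated units have higher x.
rng = np.random.default_rng(0)
n_treated, n_control = 200, 1000
x_t = rng.normal(0.5, 1.0, n_treated)
x_c = rng.normal(0.0, 1.0, n_control)
tau = 2.0                               # true treatment effect
y_t = 1.0 + tau + 0.5 * x_t + rng.normal(size=n_treated)
y_c = 1.0 + 0.5 * x_c + rng.normal(size=n_control)

# Step 1: matching. Step 2: regression on the matched sample.
idx = match_without_replacement(x_t, x_c)
y = np.concatenate([y_t, y_c[idx]])
d = np.concatenate([np.ones(n_treated), np.zeros(n_treated)])
x = np.concatenate([x_t, x_c[idx]])
X = np.column_stack([np.ones_like(y), d, x])
tau_hat = ols(y, X)[1]                  # coefficient on the treatment dummy
print(f"post-matching estimate: {tau_hat:.3f}")
```

Here the second-step regression includes the matching covariate and is correctly specified, which is the case in which the abstract says the usual standard errors remain valid under matching without replacement.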

Concerns about the dissemination of spurious results have led to calls for pre-analysis plans (PAPs) to avoid ex-post “p-hacking.” But often the conceptual hypotheses being tested do not imply the level of specificity required for a PAP. In this paper we suggest a framework for PAPs that capitalize on the availability of causal machine-learning (ML) techniques, in which researchers combine specific aspects of the analysis with ML for the flexible estimation of unspecific remainders. A “cheap-lunch” result shows that the inclusion of ML produces limited worst-case costs in power, while offering a substantial upside from systematic specification searches.

The ability to distinguish between people in setting the price of credit is often constrained by legal rules that aim to prevent discrimination. These legal requirements developed with a focus on human decision-making contexts, and so their effectiveness is challenged as pricing increasingly relies on intelligent algorithms that extract information from big data. In this Essay, we bring together existing legal requirements with the structure of machine-learning decision-making in order to identify tensions between old law and new methods and lay the ground for legal solutions. We argue that, while automated pricing rules provide increased transparency, their complexity also limits the application of existing law. Using a simulation exercise based on real-world mortgage data to illustrate our arguments, we note that restricting the characteristics that the algorithm is allowed to use can have a limited effect on disparity and can in fact increase pricing gaps. Furthermore, we argue that there are limits to interpreting the pricing rules set by machine learning that hinder the application of existing discrimination laws. We end by discussing a framework for testing discrimination that evaluates algorithmic pricing rules in a controlled environment. Unlike the human decision-making context, this framework allows for ex ante testing of price rules, facilitating comparisons between lenders.

Machines are increasingly doing “intelligent” things. Face recognition algorithms use a large dataset of photos labeled as having a face or not to estimate a function that predicts the presence y of a face from pixels x. This similarity to econometrics raises questions: How do these new empirical tools fit with what we know? As empirical economists, how can we use them? We present a way of thinking about machine learning that gives it its own place in the econometric toolbox. Machine learning not only provides new tools, it solves a different problem. Specifically, machine learning revolves around the problem of prediction, while many economic applications revolve around parameter estimation. So applying machine learning to economics requires finding relevant tasks. Machine learning algorithms are now technically easy to use: you can download convenient packages in R or Python. This also raises the risk that the algorithms are applied naively or their output is misinterpreted. We hope to make them conceptually easier to use by providing a crisper understanding of how these algorithms work, where they excel, and where they can stumble—and thus where they can be most usefully applied.

## Current Working Papers

Recent years have seen the development and deployment of machine-learning algorithms to estimate personalized treatment-assignment policies from randomized controlled trials. Yet such algorithms for the assignment of treatment typically optimize expected outcomes without taking into account that treatment assignments are frequently subject to hypothesis testing. In this article, we explicitly take significance testing of the effect of treatment-assignment policies into account, and consider assignments that optimize the probability of finding a subset of individuals with a statistically significant positive treatment effect. We provide an efficient implementation using decision trees, and demonstrate its gain over selecting subsets based on positive (estimated) treatment effects. Compared to standard tree-based regression and classification tools, this approach tends to yield substantially higher power in detecting subgroups with positive treatment effects.

Since their introduction in Abadie and Gardeazabal (2003), Synthetic Control (SC) methods have quickly become one of the leading methods for estimating causal effects in observational studies with panel data. Formal discussions often motivate SC methods by the assumption that the potential outcomes were generated by a factor model. Here we study SC methods from a design-based perspective, assuming a model for the selection of the treated unit(s), e.g., random selection as guaranteed in a randomized experiment. We show that SC methods offer benefits even in settings with randomized assignment, and that the design perspective offers new insights into SC methods for observational data. A first insight is that the standard SC estimator is not unbiased under random assignment. We propose a simple modification of the SC estimator that guarantees unbiasedness in this setting and derive its exact, randomization-based, finite sample variance. We also propose an unbiased estimator for this variance. We show in settings with real data that under random assignment this Modified Unbiased Synthetic Control (MUSC) estimator can have a root mean-squared error (RMSE) that is substantially lower than that of the difference-in-means estimator. We show that such an improvement is weakly guaranteed if the treated period is similar to the other periods, for example, if the treated period was randomly selected. The improvement is most likely to be substantial if the number of pre-treatment periods is large relative to the number of control units.
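For readers unfamiliar with the baseline, the standard SC estimator referred to above chooses non-negative control-unit weights that sum to one and best reproduce the treated unit's pre-treatment outcomes. The sketch below illustrates that baseline only; it does not implement the MUSC modification, whose details are not given here. The projected-gradient solver, the simulated data, and all names are assumptions made for the example.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def sc_weights(Y0_pre, y1_pre, n_iter=5000):
    """Simplex-constrained least squares via projected gradient descent.

    Y0_pre: (T0, J) pre-treatment outcomes of the J control units.
    y1_pre: (T0,)   pre-treatment outcomes of the treated unit.
    """
    J = Y0_pre.shape[1]
    w = np.full(J, 1.0 / J)                       # start from equal weights
    step = 1.0 / np.linalg.norm(Y0_pre, 2) ** 2   # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = Y0_pre.T @ (Y0_pre @ w - y1_pre)
        w = project_to_simplex(w - step * grad)
    return w

# Simulated panel (assumed): the treated unit is an exact convex
# combination of two control units before treatment.
rng = np.random.default_rng(0)
T0, J = 30, 5
Y0_pre = rng.normal(size=(T0, J))
w_true = np.array([0.6, 0.4, 0.0, 0.0, 0.0])
y1_pre = Y0_pre @ w_true
Y0_post = rng.normal(size=J)                      # one treated period
tau = 3.0                                         # true effect
y1_post = Y0_post @ w_true + tau

w_hat = sc_weights(Y0_pre, y1_pre)
sc_effect = y1_post - Y0_post @ w_hat             # standard SC estimate
dim_effect = y1_post - Y0_post.mean()             # difference in means
print(np.round(w_hat, 3), round(float(sc_effect), 3), round(float(dim_effect), 3))
```

The difference-in-means line is included only as the comparison benchmark named in the abstract; in this noiseless toy example the SC weights are recovered exactly, which would not hold with real data.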

Econometric analysis typically focuses on the statistical properties of fixed estimators and ignores researcher choices. In this article, I approach the analysis of experimental data as a mechanism-design problem that acknowledges that researchers choose between estimators, sometimes based on the data and often according to their own preferences. Specifically, I focus on covariate adjustments, which can increase the precision of a treatment-effect estimate, but open the door to bias when researchers engage in specification searches. First, I establish that unbiasedness is a requirement on the estimation of the average treatment effect that aligns researchers’ preferences with the minimization of the mean-squared error relative to the truth, and that fixing the bias can yield an optimal restriction in a minimax sense. Second, I provide a constructive characterization of treatment-effect estimators with fixed bias as sample-splitting procedures. Third, I show how these results imply flexible pre-analysis plans that include beneficial specification searches.
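One simple sample-splitting procedure of the kind alluded to above can be sketched as follows: the covariate adjustment for each fold is learned only from the other fold, so the specification search cannot exploit the treatment assignments of the units it is applied to. This is an illustrative sketch under assumed names and simulated data, not the paper's characterization; the linear adjustment and the two-fold split are choices made for the example.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit of y on X with an intercept."""
    Xc = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    return beta

def predict_linear(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta

def split_sample_ate(y, d, X, n_folds=2, seed=0):
    """Covariate-adjusted difference in means with cross-fitted adjustment."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % n_folds
    estimates, sizes = [], []
    for k in range(n_folds):
        hold, train = folds == k, folds != k
        # The adjustment is fitted without looking at the held-out fold.
        beta = fit_linear(X[train], y[train])
        resid = y[hold] - predict_linear(beta, X[hold])
        dk = d[hold]
        estimates.append(resid[dk == 1].mean() - resid[dk == 0].mean())
        sizes.append(hold.sum())
    return float(np.average(estimates, weights=sizes))

# Simulated randomized experiment (assumed for illustration).
rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 3))
d = rng.integers(0, 2, size=n)                    # random assignment
tau = 1.0                                         # true average effect
y = X @ np.array([1.0, -1.0, 0.5]) + tau * d + rng.normal(size=n)

ate_hat = split_sample_ate(y, d, X)
print(f"split-sample adjusted estimate: {ate_hat:.3f}")
```

Any prediction rule could replace `fit_linear` here, including one chosen after a search over specifications on the training fold, since the held-out fold's assignments never enter that choice.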

📎 PDF download · 📎 PDF download of longer 2018 version with additional results

Shrinkage estimation usually reduces variance at the cost of bias. But when we care only about some parameters of a model, I show that we can reduce variance without incurring bias if we have additional information about the distribution of covariates. In a linear regression model with homoscedastic Normal noise, I consider shrinkage estimation of the nuisance parameters associated with control variables. For at least three control variables and exogenous treatment, I establish that the standard least-squares estimator is dominated with respect to squared-error loss in the treatment effect even among unbiased estimators and even when the target parameter is low-dimensional. I construct the dominating estimator by a variant of James–Stein shrinkage in a high-dimensional Normal-means problem. It can be interpreted as an invariant generalized Bayes estimator with an uninformative (improper) Jeffreys prior in the target parameter.
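As background for the Normal-means problem mentioned above, the textbook positive-part James–Stein estimator can be sketched as below. This is the classical estimator, shown only to illustrate the kind of shrinkage involved; it is not the variant constructed in the paper, and the simulation settings and names are assumptions.

```python
import numpy as np

def james_stein(z, sigma2=1.0):
    """Positive-part James-Stein estimator for z ~ N(theta, sigma2 * I_k), k >= 3.

    Shrinks each observation vector (last axis) toward the origin.
    """
    k = z.shape[-1]
    norm2 = np.sum(z**2, axis=-1, keepdims=True)
    shrink = np.maximum(1.0 - (k - 2) * sigma2 / norm2, 0.0)
    return shrink * z

# Monte Carlo comparison of total squared-error risk (assumed setup).
rng = np.random.default_rng(0)
k, reps = 10, 2000
theta = np.full(k, 0.5)
z = theta + rng.normal(size=(reps, k))            # one draw per row

risk_mle = np.mean(np.sum((z - theta) ** 2, axis=1))
risk_js = np.mean(np.sum((james_stein(z) - theta) ** 2, axis=1))
print(f"risk of unshrunk estimator: {risk_mle:.2f}")
print(f"risk of James-Stein:        {risk_js:.2f}")
```

In this classical setting the shrinkage estimator is biased for the mean vector itself; the abstract's point is that shrinking only nuisance parameters can leave the estimator of the target parameter unbiased.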

The two-stage least-squares (2SLS) estimator is known to be biased when its first-stage fit is poor. I show that better first-stage prediction can alleviate this bias. In a two-stage linear regression model with Normal noise, I consider shrinkage in the estimation of the first-stage instrumental variable coefficients. For at least four instrumental variables and a single endogenous regressor, I establish that the standard 2SLS estimator is dominated with respect to bias. The dominating IV estimator applies James–Stein type shrinkage in a first-stage high-dimensional Normal-means problem followed by a control-function approach in the second stage. It preserves invariances of the structural instrumental variable equations.
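The bias referred to in the first sentence can be seen in a small simulation: with many instruments and a weak first stage, 2SLS is pulled toward the (biased) least-squares estimate. The sketch below illustrates only this well-known phenomenon; it does not implement the shrinkage estimator described in the abstract, and the data-generating process, parameter values, and names are assumptions.

```python
import numpy as np

def two_stage_least_squares(y, x, Z):
    """2SLS with one endogenous regressor x and instrument matrix Z (no intercept)."""
    pi_hat, *_ = np.linalg.lstsq(Z, x, rcond=None)   # first stage
    x_hat = Z @ pi_hat                               # fitted regressor
    return float(x_hat @ y / (x_hat @ x))            # second stage

def median_estimate(first_stage_coef, n=200, k=10, beta=0.0,
                    rho=0.8, reps=500, seed=0):
    """Median 2SLS estimate across simulated samples."""
    rng = np.random.default_rng(seed)
    est = np.empty(reps)
    for r in range(reps):
        Z = rng.normal(size=(n, k))
        v = rng.normal(size=n)                       # first-stage error
        u = rho * v + np.sqrt(1 - rho**2) * rng.normal(size=n)  # correlated with v
        x = Z @ np.full(k, first_stage_coef) + v
        y = beta * x + u
        est[r] = two_stage_least_squares(y, x, Z)
    return float(np.median(est))

weak = median_estimate(0.05)     # poor first-stage fit
strong = median_estimate(1.0)    # good first-stage fit
print(f"true beta = 0.0, weak first stage: {weak:.3f}, strong first stage: {strong:.3f}")
```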
