When experiments are carried out, only one randomization actually occurs. However, this is just one of many possible randomizations. One way of testing whether any covariates are related to treatment status is to use randomization inference to generate many of these potential randomizations.

We will be using part of the framing dataset from the mediation package. For the sake of this assignment, let’s assume that only the treatment tone was randomized. We want to test whether any of the pre-treatment covariates are related to tone. These pre-treatment covariates, for the sake of this assignment, are age, educ, gender, and income.

To test balance, first run a linear regression in which the outcome is treatment status and the predictors are the pre-treatment covariates. Then get the F-statistic. The F-statistic tests whether the covariates jointly predict treatment better than the intercept alone would. Use summary(model)$fstatistic to extract the F-statistic vector from the regression object. Note: this is not perfectly correct, because treatment status is binary. It may be better to run a logistic regression and calculate a likelihood ratio test. You can do this too if you want!
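As a minimal sketch of this step, the block below fits the balance regression and pulls out the F-statistic. It assumes the mediation package is installed and that tone is coded 0/1; the object names (balance_model, f_obs, logit_null, logit_full) are illustrative and not part of the assignment.

```r
# Assumes the mediation package is installed: install.packages("mediation")
data("framing", package = "mediation")

# Regress treatment status on the pre-treatment covariates
balance_model <- lm(tone ~ age + educ + gender + income, data = framing)

# fstatistic is a named vector with elements value, numdf, dendf
f_obs <- summary(balance_model)$fstatistic
f_obs["value"]

# Optional alternative: logistic regression with a likelihood ratio test
logit_full <- glm(tone ~ age + educ + gender + income,
                  data = framing, family = binomial)
logit_null <- glm(tone ~ 1, data = framing, family = binomial)
anova(logit_null, logit_full, test = "Chisq")
```

The likelihood ratio test compares the intercept-only model with the full model, which mirrors what the F-test does in the linear case.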

However, you don’t know what the p-value for this F-statistic should be. There were many possible randomizations. With \(N\) units in total, of which \(n\) are assigned to treatment, the number of possible randomizations is \(\frac{N!}{n!(N-n)!}\). In our example, 130 respondents didn’t get tone and 135 did, so \(N = 265\) and \(n = 135\). The number of possible randomizations is very large!
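To get a feel for the size of this number, the binomial coefficient can be evaluated directly with choose(). This is a small sketch using the group sizes stated above; the variable names are illustrative.

```r
# Group sizes from the example
n_control <- 130
n_treated <- 135
n_total <- n_control + n_treated

# Number of ways to choose which 135 of the 265 units are treated
n_assignments <- choose(n_total, n_treated)
n_assignments

# Order of magnitude (number of decimal digits, roughly)
log10(n_assignments)
```

Because enumerating every assignment is infeasible at this scale, the assignment works with a random sample of assignments instead.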

  1. Obtain the F-statistic for the observed random assignment using lm().
  2. Simulate 1000 other assignments to treatment and control, and extract and store the F-statistic for each of these assignments. Don’t forget set.seed(). The functions sample() and replicate() could also help.
  3. Create a histogram of the simulated F-statistics. Mark where the observed F-statistic lies within this distribution and where a statistically significant F-test would lie. Like other distributions, the F-distribution has pf() and qf() functions.
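The three steps above can be sketched as follows. This is one possible approach, not the only valid one: it assumes the mediation package is installed, it permutes the observed tone vector so that each simulated assignment keeps 135 treated units, and it uses the 0.95 quantile of the theoretical F-distribution as the significance cutoff. The helper balance_f(), the seed value, and the other object names are hypothetical choices made for illustration.

```r
# Assumes the mediation package is installed
data("framing", package = "mediation")

# Hypothetical helper: F-statistic from regressing an assignment vector
# on the pre-treatment covariates
balance_f <- function(assignment, data) {
  data$assignment <- assignment
  fit <- lm(assignment ~ age + educ + gender + income, data = data)
  unname(summary(fit)$fstatistic["value"])
}

# Step 1: F-statistic for the observed assignment
f_full <- summary(lm(tone ~ age + educ + gender + income,
                     data = framing))$fstatistic
f_obs <- balance_f(framing$tone, framing)

# Step 2: 1000 simulated assignments; sample() shuffles tone, which
# keeps the number of treated and control units fixed
set.seed(1234)  # arbitrary seed, chosen only for reproducibility
f_sim <- replicate(1000, balance_f(sample(framing$tone), framing))

# Step 3: histogram with the observed value and the 5% critical value
crit <- unname(qf(0.95, df1 = f_full["numdf"], df2 = f_full["dendf"]))
hist(f_sim, breaks = 40,
     xlim = range(c(f_sim, f_obs, crit)),
     main = "Simulated F-statistics under random assignment",
     xlab = "F-statistic")
abline(v = f_obs, col = "red", lwd = 2)
abline(v = crit, col = "blue", lwd = 2, lty = 2)
legend("topright",
       legend = c("Observed F", "5% critical value (qf)"),
       col = c("red", "blue"), lwd = 2, lty = c(1, 2))

# Share of simulated F-statistics at least as large as the observed one
p_ri <- mean(f_sim >= f_obs)
p_ri
```

Here p_ri is a randomization-inference p-value: the proportion of simulated assignments whose F-statistic is at least as large as the observed one. It can be compared with the theoretical p-value from pf(f_obs, f_full["numdf"], f_full["dendf"], lower.tail = FALSE).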