Another important data management trick is to make sure your version of R doesn't have any previous objects in it. Many people just include this code and don't explain it, at this point we can understand what it does. Always put this at the topi of your R script to avoid problems. Let's walk through what it does.
rm(list=ls())
Last time we touched briefly on lists. We will go into more detail here using an example of multiple imputation. Again, remember the income example. While we don't know much about the technique, let's say that we use multiple imputation. This requires multiple versions of the same data set. Let's thing theoretically how R could store this all in one object board:
#the begining of a script rm(list=ls()) if (!require(Amelia)) { install.packages("Amelia") library(Amelia) }
## Warning: package 'Amelia' was built under R version 3.1.2
data(package="Amelia") ??freetrade data(freetrade)
Let's learn a little bit about the freetrade
dataset. Please do this in groups or individually. Let's report back what we learned
if (!require(ggplot2)) { install.packages("ggplot2") library(ggplot2) }
The package ggplot
has become one of the main packages for graphing in R. We will show some quick examples here. If you want to go deeper, you can see the wide variety of materials Hadley Wickham has put out about this. See POLS503 site.
qplot(freetrade$year,freetrade$gdp.pc)
scatterCountry <- ggplot(freetrade, aes(x=year, y=gdp.pc, colour=country) ) scatterCountry + geom_point()
gdpDist <- ggplot(freetrade, aes(x=gdp.pc)) gdpDist + geom_histogram(aes(y=..density..), colour="grey30", fill="green") + geom_density(colour="blue")
Try doing this with some other variables. What happens with the missing data?
Interested in more on ggplot
: go to the POLS503 site
imputedTrade <- amelia(freetrade, m = 5, ts = "year", cs = "country")
## -- Imputation 1 -- ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ## ## -- Imputation 2 -- ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ## ## -- Imputation 3 -- ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ## ## -- Imputation 4 -- ## ## 1 2 3 4 5 6 7 8 9 10 11 12 ## ## -- Imputation 5 -- ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
class(imputedTrade)
## [1] "amelia"
summary(imputedTrade) #why
## ## Amelia output with 5 imputed datasets. ## Return code: 1 ## Message: Normal EM convergence. ## ## Chain Lengths: ## -------------- ## Imputation 1: 15 ## Imputation 2: 16 ## Imputation 3: 19 ## Imputation 4: 12 ## Imputation 5: 16 ## ## Rows after Listwise Deletion: 96 ## Rows after Imputation: 171 ## Patterns of missingness in the data: 8 ## ## Fraction Missing for original variables: ## ----------------------------------------- ## ## Fraction Missing ## year 0.00000000 ## country 0.00000000 ## tariff 0.33918129 ## polity 0.01169591 ## pop 0.00000000 ## gdp.pc 0.00000000 ## intresmi 0.07602339 ## signed 0.01754386 ## fiveop 0.10526316 ## usheg 0.00000000
??summary.amelia
Ok, so the object is an Amelia
object. What should we do now? Luckily most unknown object types work like lists
names(imputedTrade)
## [1] "imputations" "m" "missMatrix" "overvalues" "theta" ## [6] "mu" "covMatrices" "code" "message" "iterHist" ## [11] "arguments" "orig.vars"
for (i in 1:length(imputedTrade)){ print( c( "NAME:",names(imputedTrade)[i], "CLASS(ES):",class(imputedTrade[[i]]) ) ) }
## [1] "NAME:" "imputations" "CLASS(ES):" "mi" "list" ## [1] "NAME:" "m" "CLASS(ES):" "numeric" ## [1] "NAME:" "missMatrix" "CLASS(ES):" "matrix" ## [1] "NAME:" "overvalues" "CLASS(ES):" "NULL" ## [1] "NAME:" "theta" "CLASS(ES):" "array" ## [1] "NAME:" "mu" "CLASS(ES):" "matrix" ## [1] "NAME:" "covMatrices" "CLASS(ES):" "array" ## [1] "NAME:" "code" "CLASS(ES):" "numeric" ## [1] "NAME:" "message" "CLASS(ES):" "character" ## [1] "NAME:" "iterHist" "CLASS(ES):" "list" ## [1] "NAME:" "arguments" "CLASS(ES):" "ameliaArgs" "list" ## [1] "NAME:" "orig.vars" "CLASS(ES):" "character"
Take a minute to look at this object and tell me where the actual data is
Let's get some intuition about how some of the functions in Amelia work, using looping or applying
for (i in 1:length(imputedTrade[["imputations"]])) { print( summary(imputedTrade[["imputations"]][[i]]$tariff) ) }
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -12.52 17.61 27.58 32.52 41.10 107.90 ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -11.11 17.40 27.30 32.48 41.25 103.50 ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -21.18 16.20 26.90 31.14 40.58 100.00 ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -8.393 16.950 27.300 32.540 42.460 100.000 ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -23.54 15.95 27.46 32.23 43.18 100.00
Just from a theoretical perspective what is wrong with these imputations?
It's easier to use the apply function of families to do this and it is much faster. Generally, when you are not changing things in each iteration of the loop you want to apply
!
lapply(imputedTrade[["imputations"]], summary)
What's the difference between this and our loop?
summaryTariffData <- lapply(imputedTrade[["imputations"]], function(x) summary(x$tariff) )
tscsPlot(imputedTrade, cs = "Malaysia", main = "Malaysia (no time settings)", var = "tariff", ylim = c(-10, 60) )
Interested in learning more, go to the Amelia package on CRAN.
Okay, so what's the difference with apply
verus lapply
. Try to use apply on the original freetrade
dataset.