Another important data management trick is to make sure your version of R doesn't have any previous objects in it. Many people just include this code and don't explain it, at this point we can understand what it does. Always put this at the topi of your R script to avoid problems. Let's walk through what it does.

rm(list=ls())

Last time we touched briefly on lists. We will go into more detail here using an example of multiple imputation. Again, remember the income example. While we don't know much about the technique, let's say that we use multiple imputation. This requires multiple versions of the same data set. Let's thing theoretically how R could store this all in one object board:

#the begining of a script
rm(list=ls())
if (!require(Amelia)) {
  install.packages("Amelia")
  library(Amelia)
}
## Loading required package: Amelia
## Warning: package 'Amelia' was built under R version 3.1.2
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.3, built: 2014-11-14)
## ## Copyright (C) 2005-2015 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
data(package="Amelia")
??freetrade
data(freetrade)

Let's learn a little bit about the freetrade dataset. Please do this in groups or individually. Let's report back what we learned

if (!require(ggplot2)) {
  install.packages("ggplot2")
  library(ggplot2)
}
## Loading required package: ggplot2

The package ggplot has become one of the main packages for graphing in R. We will show some quick examples here. If you want to go deeper, you can see the wide variety of materials Hadley Wickham has put out about this. See POLS503 site.

qplot(freetrade$year,freetrade$gdp.pc)
scatterCountry <- ggplot(freetrade,
                         aes(x=year, y=gdp.pc, colour=country)
                         )

scatterCountry + geom_point()
gdpDist <- ggplot(freetrade, aes(x=gdp.pc))
gdpDist +
  geom_histogram(aes(y=..density..), colour="grey30", fill="green") +
  geom_density(colour="blue")

Try doing this with some other variables. What happens with the missing data?

Interested in more on ggplot: go to the POLS503 site

imputedTrade <- amelia(freetrade, m = 5, ts = "year", cs = "country")
## -- Imputation 1 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
## 
## -- Imputation 2 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
## 
## -- Imputation 3 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
## 
## -- Imputation 4 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12
## 
## -- Imputation 5 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
class(imputedTrade)
## [1] "amelia"
summary(imputedTrade) #why
## 
## Amelia output with 5 imputed datasets.
## Return code:  1 
## Message:  Normal EM convergence. 
## 
## Chain Lengths:
## --------------
## Imputation 1:  15
## Imputation 2:  16
## Imputation 3:  19
## Imputation 4:  12
## Imputation 5:  16
## 
## Rows after Listwise Deletion:  96 
## Rows after Imputation:  171 
## Patterns of missingness in the data:  8 
## 
## Fraction Missing for original variables: 
## -----------------------------------------
## 
##          Fraction Missing
## year           0.00000000
## country        0.00000000
## tariff         0.33918129
## polity         0.01169591
## pop            0.00000000
## gdp.pc         0.00000000
## intresmi       0.07602339
## signed         0.01754386
## fiveop         0.10526316
## usheg          0.00000000
??summary.amelia

Ok, so the object is an Amelia object. What should we do now? Luckily most unknown object types work like lists

names(imputedTrade)
##  [1] "imputations" "m"           "missMatrix"  "overvalues"  "theta"      
##  [6] "mu"          "covMatrices" "code"        "message"     "iterHist"   
## [11] "arguments"   "orig.vars"
for (i in 1:length(imputedTrade)){
  print(
    c(
      "NAME:",names(imputedTrade)[i],
      "CLASS(ES):",class(imputedTrade[[i]])
      )
    )
  }
## [1] "NAME:"       "imputations" "CLASS(ES):"  "mi"          "list"       
## [1] "NAME:"      "m"          "CLASS(ES):" "numeric"   
## [1] "NAME:"      "missMatrix" "CLASS(ES):" "matrix"    
## [1] "NAME:"      "overvalues" "CLASS(ES):" "NULL"      
## [1] "NAME:"      "theta"      "CLASS(ES):" "array"     
## [1] "NAME:"      "mu"         "CLASS(ES):" "matrix"    
## [1] "NAME:"       "covMatrices" "CLASS(ES):"  "array"      
## [1] "NAME:"      "code"       "CLASS(ES):" "numeric"   
## [1] "NAME:"      "message"    "CLASS(ES):" "character" 
## [1] "NAME:"      "iterHist"   "CLASS(ES):" "list"      
## [1] "NAME:"      "arguments"  "CLASS(ES):" "ameliaArgs" "list"      
## [1] "NAME:"      "orig.vars"  "CLASS(ES):" "character"

Take a minute to look at this object and tell me where the actual data is

Let's get some intuition about how some of the functions in Amelia work, using looping or applying

for (i in 1:length(imputedTrade[["imputations"]])) {
  print(
    summary(imputedTrade[["imputations"]][[i]]$tariff)
    )
}
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -12.52   17.61   27.58   32.52   41.10  107.90 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -11.11   17.40   27.30   32.48   41.25  103.50 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -21.18   16.20   26.90   31.14   40.58  100.00 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -8.393  16.950  27.300  32.540  42.460 100.000 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -23.54   15.95   27.46   32.23   43.18  100.00

Just from a theoretical perspective what is wrong with these imputations?

It's easier to use the apply function of families to do this and it is much faster. Generally, when you are not changing things in each iteration of the loop you want to apply!

lapply(imputedTrade[["imputations"]], summary)

What's the difference between this and our loop?

summaryTariffData <- lapply(imputedTrade[["imputations"]],
                            function(x) summary(x$tariff)
                            )
tscsPlot(imputedTrade,
         cs = "Malaysia",
         main = "Malaysia (no time settings)",
         var = "tariff",
         ylim = c(-10, 60)
         )

Interested in learning more, go to the Amelia package on CRAN.

Okay, so what's the difference with apply verus lapply. Try to use apply on the original freetrade dataset.