Before we continue our tour of R objects, we need to take a brief interlude into data management. When you work with data a lot, it is important to understand a little more about help and how directory structures work.

Help

There are four main help commands in R. Let's say we want to draw from the multivariate normal distribution. help() is equivalent to one question mark, ?. Let's say we know the name of the command, mvrnorm.

?mvrnorm is not giving us any help because this command lives in a library called MASS, which is not loaded. That's where the two question marks come in: ?? searches the help of every installed library. But you have to have the library installed on your computer: install.packages("MASS").

RStudio makes some of this easier, but you should understand how it works!

??mvrnorm
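To make the difference concrete, here is a small sketch of looking up help with and without naming the library. It assumes MASS is installed (it ships with most R installations) and skips that part otherwise; the object names notLoaded, inMass and draws are made up for this illustration.

```r
# ?mvrnorm is shorthand for help("mvrnorm"), which only looks in attached libraries.
# Assigning the result avoids opening the help page.
notLoaded <- help("mvrnorm")
length(notLoaded)   # expected to be 0 when MASS is not attached

# Assumption: MASS is installed. If it is not, this part is skipped.
if (requireNamespace("MASS", quietly = TRUE)) {
  # Naming the package finds the help page without attaching MASS
  inMass <- help("mvrnorm", package = "MASS")
  # MASS::mvrnorm calls the function without library(MASS)
  draws <- MASS::mvrnorm(n = 5, mu = c(0, 0), Sigma = diag(2))
  dim(draws)        # 5 draws from a 2-dimensional normal
}
```

Typing inMass at the console would open the help page itself.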

But what if you don't know the name of the command? This is often the case. First, we can see if we have the function in an installed library. If we don't, then we can search the CRAN packages to see if the function has been written. In either case, if we don't like the function as written, or don't trust it, we can rewrite it ourselves!

Let's say we read a paper that included a zero-inflated negative binomial regression.

help.search("zero inflated negative binomial")

Talk to your neighbour about a statistical issue you have and try to find a command.

Missing Data

As social scientists, we often do not have complete data sets. R is very sensitive to this fact. While Stata often arbitrarily omits missing data, R forces you to think about your actions.

There are two major types of missing values, what Matloff calls "no such animal" values. These are NA and NULL. What is the difference?

Let's create some fake data that represents wealth. Wealth is often log-normally distributed. Now let's examine what happens when rich people systematically don't answer questions about their wealth, while learning about missing values.
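As a first hint, here is a minimal sketch of how the two behave inside a vector. The names withNA and withNULL are invented for this example.

```r
# NA is a placeholder: the value exists but is unknown, so it takes up a slot
withNA <- c(1, NA, 3)
length(withNA)      # 3

# NULL is nothing at all: it vanishes when combined into a vector
withNULL <- c(1, NULL, 3)
length(withNULL)    # 2

# Each has its own test function
is.na(withNA)       # FALSE TRUE FALSE
is.null(NULL)       # TRUE

# On their own: NA has length 1, NULL has length 0
length(NA)
length(NULL)
```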

set.seed(20150406)
options(scipen=2)
americanIncome <- rlnorm(10000, log(35), log(5))

#examine income
summary(americanIncome)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     0.083    11.780    35.740   128.400   106.800 15150.000
mean(americanIncome)
## [1] 128.3906
americanIncome <- sort(americanIncome)
head(americanIncome, 10)
##  [1] 0.08263699 0.10469058 0.11357847 0.11649199 0.12614043 0.13767362
##  [7] 0.14785704 0.15665439 0.16178657 0.16702796
#generate missing values and insert
missingValues <- sample(9000:10000, 700)
americanIncomeNull <- americanIncomeNA <- americanIncome
americanIncomeNA[missingValues] <- NA
americanIncomeNull[missingValues] <- NULL #why an error
## Error in americanIncomeNull[missingValues] <- NULL: replacement has length zero
americanIncomeNull <- na.omit(americanIncomeNA)
length(americanIncomeNull)
## [1] 9300
#compare
mean(americanIncomeNA, na.rm=TRUE)
## [1] 79.26198
mean(americanIncomeNA) # why is there an NA (not an error or a value)?
## [1] NA
mean(americanIncomeNull, na.rm=FALSE) #why does this work
## [1] 79.26198
mean(americanIncome) - mean(americanIncomeNA, na.rm=TRUE)
## [1] 49.12858
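The same pattern is easier to see on a tiny vector where you can do the arithmetic by hand. This is only a toy sketch with made-up numbers (income, incomeNA and incomeDropped are not part of the data above), but it shows why an NA propagates and how dropping the largest value pulls the mean down.

```r
# Five made-up incomes; the richest person refuses to answer
income <- c(10, 20, 30, 40, 1000)
incomeNA <- income
incomeNA[5] <- NA

# Any arithmetic involving an unknown value is itself unknown
mean(incomeNA)                                 # NA

# na.rm=TRUE drops the NA before averaging
mean(incomeNA, na.rm = TRUE)                   # 25

# How far off are we from the full-data mean?
mean(income) - mean(incomeNA, na.rm = TRUE)    # 195

# Count the missing values, then drop them by hand
sum(is.na(incomeNA))                           # 1
incomeDropped <- incomeNA[!is.na(incomeNA)]
length(incomeDropped)                          # 4
```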

Directories

Before we move on to the workhorse of social scientists, the data.frame, we need to better understand how file directories work in R.

Mac Yosemite Users see here.

getwd() #gives you the current wd
## [1] "/Volumes/Optibay-1TB/Dropbox/intro_R/Topic3"

Often, you are working in a project and you want to start reading files from within the project. Then you need setwd(). To make my files easily runnable on other people's computers, I will usually do something like the following, so that anyone else who works on the file can easily change the directory in one place.

#forward slash
Course508Dir <- "/Volumes/Optibay-1TB/Dropbox/intro_R/"
setwd(Course508Dir)
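Here is a sketch of the same idea that you can run anywhere, because it uses a temporary directory instead of a path on my machine. The directory name intro_R_demo and the file name example.csv are invented for the illustration; nothing is read from or written to that file.

```r
# A throwaway project directory inside R's temporary folder
projectDir <- file.path(tempdir(), "intro_R_demo")
dir.create(projectDir, showWarnings = FALSE)

# setwd() invisibly returns the directory you were in before, so keep it
oldDir <- setwd(projectDir)
getwd()

# file.path() builds paths with forward slashes on every operating system
dataPath <- file.path(projectDir, "Topic3", "example.csv")
basename(dataPath)   # "example.csv"
dirname(dataPath)

# Go back to where we started
setwd(oldDir)
```

Building every path from a single directory variable with file.path() means a collaborator only has to edit one line.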

Data Formats

R can read (almost) anything, but .csv files are often a good non-proprietary choice. However, the delimiter can be many different things, not always commas! R can also read SPSS files with read.spss or Stata files with read.dta, both in the foreign library. There are also ways to interact with SQL and any RDBMS you can think of. Just keep this in mind for later on. R also has its own storage format, .Rdata.

It is almost always worth visually examining your data (NOT in Excel!) to look at the csv structure. There are variations and R can deal with all of them, but have a look at your data first. It will save you lots of time and energy!

Let's use the examples from Andrew Gelman's book.
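As a small sketch of what "have a look first" can mean in practice, the following writes a tiny semicolon-delimited file to a temporary location, peeks at the raw lines, and only then reads it in. The file contents and the names demoFile, demoData, rdataFile and restored are all made up for the illustration.

```r
# Write a small made-up file that uses ";" instead of "," as the delimiter
demoFile <- tempfile(fileext = ".csv")
writeLines(c("state;year;score",
             "ALABAMA;48;-0.101",
             "ALASKA;48;NA"),
           demoFile)

# Look at the raw text before parsing: what is the delimiter? is there a header?
readLines(demoFile, n = 2)

# Now read it with the delimiter we saw
demoData <- read.csv(demoFile, sep = ";", as.is = TRUE)
str(demoData)

# R's own storage format keeps the object exactly as it is
rdataFile <- tempfile(fileext = ".Rdata")
save(demoData, file = rdataFile)
restored <- new.env()
load(rdataFile, envir = restored)
identical(restored$demoData, demoData)
```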

Brief aside, more later. What is a list?

firstList <- list(matrix(), seq(1,10), "hello")
print(firstList)
## [[1]]
##      [,1]
## [1,]   NA
## 
## [[2]]
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## [[3]]
## [1] "hello"
firstList[[2]] #note the double brackets
##  [1]  1  2  3  4  5  6  7  8  9 10
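The double brackets matter because single brackets give you back a smaller list, not the thing inside it. A minimal sketch, using an invented list called demoList:

```r
# A list can hold objects of different types, optionally with names
demoList <- list(numbers = seq(1, 10), greeting = "hello")

# Single brackets return a list containing the element
class(demoList[2])      # "list"

# Double brackets (or $ with a name) return the element itself
class(demoList[[2]])    # "character"
demoList$greeting       # "hello"

# A data.frame is a list of equal-length columns, which is why this aside matters
is.list(data.frame(x = 1:3))   # TRUE
```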
# sep="" means the columns are separated by whitespace, not commas
congIdeo <- read.csv(file.path(Course508Dir, "Gelman_Files/ARM_Data/ideology/ideo2.dat"),
                     header=TRUE,
                     as.is=TRUE,
                     na.strings="NA",
                     sep="")

class(congIdeo)
## [1] "data.frame"
str(congIdeo)
## 'data.frame':	10268 obs. of  22 variables:
##  $ congno  : int  80 80 80 80 80 80 80 80 80 80 ...
##  $ sticpsr : int  41 41 41 41 41 41 41 41 41 42 ...
##  $ cd      : int  1 2 3 4 5 6 7 8 9 1 ...
##  $ stalpha : chr  "ALABAMA" "ALABAMA" "ALABAMA" "ALABAMA" ...
##  $ occup   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ score1  : num  -0.101 -0.127 -0.098 -0.177 -0.256 -0.235 -0.15 -0.3 -0.191 0.022 ...
##  $ score2  : num  0.279 0.364 0.485 0.424 0.281 0.333 0.379 0.32 0.313 0.368 ...
##  $ year    : int  48 48 48 48 48 48 48 48 48 48 ...
##  $ stgj    : int  1 1 1 1 1 1 1 1 1 4 ...
##  $ po      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ dv      : num  NA NA NA 0.85 NA 0.824 NA 0.884 0.871 NA ...
##  $ dvp     : num  NA NA NA 0.881 NA NA 0.727 0.924 0.941 NA ...
##  $ redist  : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ i2      : int  1 1 1 1 1 0 0 0 1 1 ...
##  $ demqa   : int  1 1 1 1 1 1 NA 1 1 1 ...
##  $ i1      : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ demqal  : int  1 1 1 1 1 1 1 1 NA 1 ...
##  $ nstate  : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ dpvote  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ normvote: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ dvfix   : num  NA NA NA 0.822 NA 0.824 NA 0.884 0.843 NA ...
##  $ dvpfix  : num  NA NA NA 0.861 NA NA 0.707 0.904 0.941 NA ...
dim(congIdeo)
## [1] 10268    22
names(congIdeo)
##  [1] "congno"   "sticpsr"  "cd"       "stalpha"  "occup"    "score1"  
##  [7] "score2"   "year"     "stgj"     "po"       "dv"       "dvp"     
## [13] "redist"   "i2"       "demqa"    "i1"       "demqal"   "nstate"  
## [19] "dpvote"   "normvote" "dvfix"    "dvpfix"
summary(congIdeo)
##      congno          sticpsr            cd           stalpha         
##  Min.   : 80.00   Min.   : 1.00   Min.   : 1.000   Length:10268      
##  1st Qu.: 86.00   1st Qu.:21.00   1st Qu.: 3.000   Class :character  
##  Median : 92.00   Median :34.00   Median : 6.000   Mode  :character  
##  Mean   : 91.54   Mean   :35.79   Mean   : 9.425                     
##  3rd Qu.: 98.00   3rd Qu.:49.00   3rd Qu.:13.000                     
##  Max.   :103.00   Max.   :82.00   Max.   :52.000                     
##                                                                      
##      occup            score1             score2              year      
##  Min.   :0.0000   Min.   :-0.99700   Min.   :-0.87100   Min.   :48.00  
##  1st Qu.:0.0000   1st Qu.:-0.30100   1st Qu.:-0.14600   1st Qu.:60.00  
##  Median :0.0000   Median :-0.01300   Median : 0.00000   Median :72.00  
##  Mean   :0.0265   Mean   :-0.02617   Mean   : 0.03043   Mean   :71.08  
##  3rd Qu.:0.0000   3rd Qu.: 0.25600   3rd Qu.: 0.19125   3rd Qu.:84.00  
##  Max.   :1.0000   Max.   : 1.00000   Max.   : 1.00000   Max.   :94.00  
##  NA's   :864                                                           
##       stgj             po               dv              dvp        
##  Min.   : 1.00   Min.   :0.0000   Min.   :0.0730   Min.   :0.0730  
##  1st Qu.:13.00   1st Qu.:0.0000   1st Qu.:0.3990   1st Qu.:0.3990  
##  Median :25.00   Median :0.0000   Median :0.5200   Median :0.5170  
##  Mean   :25.32   Mean   :0.1793   Mean   :0.5280   Mean   :0.5273  
##  3rd Qu.:37.00   3rd Qu.:0.0000   3rd Qu.:0.6550   3rd Qu.:0.6520  
##  Max.   :50.00   Max.   :1.0000   Max.   :0.9697   Max.   :0.9910  
##                                   NA's   :1618     NA's   :1733    
##      redist             i2              demqa               i1         
##  Min.   :0.0000   Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000  
##  1st Qu.:0.0000   1st Qu.:-1.0000   1st Qu.:-1.0000   1st Qu.:-1.0000  
##  Median :0.0000   Median : 1.0000   Median : 0.0000   Median : 1.0000  
##  Mean   :0.2313   Mean   : 0.1607   Mean   : 0.1827   Mean   : 0.1548  
##  3rd Qu.:0.0000   3rd Qu.: 1.0000   3rd Qu.: 1.0000   3rd Qu.: 1.0000  
##  Max.   :1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000  
##  NA's   :4        NA's   :38        NA's   :206       NA's   :37       
##      demqal           nstate              dpvote          normvote     
##  Min.   :-1.0000   Length:10268       Min.   :0.0656   Min.   :0.1339  
##  1st Qu.:-1.0000   Class :character   1st Qu.:0.3970   1st Qu.:0.3922  
##  Median : 0.0000   Mode  :character   Median :0.4736   Median :0.4691  
##  Mean   : 0.1791                      Mean   :0.4889   Mean   :0.4852  
##  3rd Qu.: 1.0000                      3rd Qu.:0.5675   3rd Qu.:0.5615  
##  Max.   : 1.0000                      Max.   :0.9527   Max.   :0.9704  
##  NA's   :227                          NA's   :2624     NA's   :2624    
##      dvfix            dvpfix      
##  Min.   :0.1730   Min.   :0.0900  
##  1st Qu.:0.4410   1st Qu.:0.4400  
##  Median :0.5080   Median :0.5090  
##  Mean   :0.5194   Mean   :0.5198  
##  3rd Qu.:0.5880   3rd Qu.:0.5910  
##  Max.   :0.9550   Max.   :0.9910  
##  NA's   :1648     NA's   :1764

There are many ways to access the same variable. The dollar sign is the general way to access a variable by name, but the best choice depends on your coding situation. Never, despite what anyone tells you, EVER use attach().

head(congIdeo$score1)
## [1] -0.101 -0.127 -0.098 -0.177 -0.256 -0.235
head(congIdeo[, 6])
## [1] -0.101 -0.127 -0.098 -0.177 -0.256 -0.235
head(congIdeo[[6]])
## [1] -0.101 -0.127 -0.098 -0.177 -0.256 -0.235
head(congIdeo[, "score1"])
## [1] -0.101 -0.127 -0.098 -0.177 -0.256 -0.235

A note on drop=TRUE and dimensionality. What's the difference and why does it matter?

head(congIdeo[, 6, drop=TRUE])
## [1] -0.101 -0.127 -0.098 -0.177 -0.256 -0.235
head(congIdeo[, 6, drop=FALSE])
##   score1
## 1 -0.101
## 2 -0.127
## 3 -0.098
## 4 -0.177
## 5 -0.256
## 6 -0.235
rowForm <- head(congIdeo[, 6, drop=TRUE])
columnForm <- head(congIdeo[, 6, drop=FALSE])

class(rowForm)
## [1] "numeric"
class(columnForm)
## [1] "data.frame"
dim(rowForm)
## NULL
dim(columnForm)
## [1] 6 1
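One place where this bites is code that expects a data.frame and quietly receives a vector instead. A toy sketch with a made-up two-column frame (demoFrame, asVector and asFrame are illustrative names):

```r
demoFrame <- data.frame(a = 1:3, b = 4:6)

# Selecting one column drops to a plain vector by default
asVector <- demoFrame[, "a"]
nrow(asVector)    # NULL, because a vector has no rows

# drop=FALSE keeps the data.frame structure
asFrame <- demoFrame[, "a", drop = FALSE]
nrow(asFrame)     # 3

# This matters when the columns are chosen by a variable that may hold one name or many
cols <- "a"
names(demoFrame[, cols, drop = FALSE])   # "a"
```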

What if you want to access multiple variables?
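One possible answer, sketched on a small made-up data.frame so it runs without the course files: pass a vector of names or positions in the column slot. The values in demoFrame are invented; only the column names echo the real data.

```r
demoFrame <- data.frame(stalpha = c("ALABAMA", "ALASKA", "ARIZONA"),
                        year    = c(48, 48, 50),
                        score1  = c(-0.101, 0.2, 0.05),
                        stringsAsFactors = FALSE)

# By name (safer: survives columns being reordered)
byName <- demoFrame[, c("stalpha", "score1")]

# By position
byIndex <- demoFrame[, c(1, 3)]
identical(byName, byIndex)   # TRUE

# Everything except one column
withoutYear <- demoFrame[, names(demoFrame) != "year"]
names(withoutYear)
```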

Slicing data in the presence of NAs. We have to be careful. Let's look at one of our problematic variables. We have the variable redist, which is whether the congressional district was redistricted in that year. Interesting... that there are missing values. We probably want to look at those... But first...

#A trick
sum(TRUE)
## [1] 1
sum(FALSE)
## [1] 0
sum(rep(FALSE, 30))
## [1] 0
#difference with subset, deals with NAs for you
sum(congIdeo$redist > 0)
## [1] NA
sum(congIdeo$redist > 0 & !is.na(congIdeo$redist))
## [1] 2374
#use subset default method
sum(
  subset(congIdeo$redist, congIdeo$redist > 0)
  )
## [1] 2374
#use for class data.frame
#what's the difference?
onlyRedist <- subset(congIdeo,
                     subset=congIdeo$redist > 0
                     )

onlyRedist2 <- subset(congIdeo,
                     subset=congIdeo$redist > 0,
                     select=1:10
                     )
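To see what subset() is doing for you, compare it with plain bracket indexing on a toy data.frame that has one NA. The frame demoFrame and the three counts are invented for this sketch.

```r
demoFrame <- data.frame(id = 1:4, redist = c(1, 0, NA, 1))

# Brackets: the comparison is TRUE FALSE NA TRUE, and the NA produces a row of NAs
withBrackets <- nrow(demoFrame[demoFrame$redist > 0, ])
withBrackets   # 3

# subset() treats NA in the condition as FALSE
withSubset <- nrow(subset(demoFrame, redist > 0))
withSubset     # 2

# which() returns only the positions that are TRUE, so it also skips the NA
withWhich <- nrow(demoFrame[which(demoFrame$redist > 0), ])
withWhich      # 2
```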

So, what if we wanted to look at those NA's?

congIdeo[is.na(congIdeo$redist), c("stalpha", "redist", "year")]
##      stalpha redist year
## 9569 MARYLAN     NA   92
## 9572 MARYLAN     NA   92
## 9594 MICHIGA     NA   92
## 9814 WEST VI     NA   92

Finally, let's check the number of states in the data.

length(
  unique(
    congIdeo$stalpha
    )
  )
## [1] 50