Before we continue our tour of R objects, we need to take a brief interlude into data management. When you work with data a lot, it is important to understand a little more about help and how directory structures work.
There are four main help commands in R. Let's say we want to draw from the multivariate normal distribution. help() is equivalent to one question mark, ?. Let's say we know the name of the command, mvrnorm, and ask for help with ?mvrnorm.
It's not giving us any help because this command lives in a library called MASS, which is not loaded. That's where the two question marks come in: ?? searches every installed library. But you have to have the library installed on your computer; if it is not, run install.packages("MASS").
RStudio makes some of this easier, but you should understand how it works!
??mvrnorm
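As a rough sketch of how the help commands pair up (the name `mass_installed` is ours, for illustration only): the help calls are wrapped in `interactive()` because they open a help page, and the installation check uses `requireNamespace()`, which returns `FALSE` rather than stopping when a library is missing.

```r
# Check whether MASS is installed without attaching it.
mass_installed <- requireNamespace("MASS", quietly = TRUE)

if (interactive()) {
  help("mean")            # same as ?mean; mean() lives in base, so this works
  help.search("mvrnorm")  # same as ??mvrnorm; searches all installed libraries
  if (mass_installed) {
    help("mvrnorm", package = "MASS")  # name the library explicitly
  }
}

if (mass_installed) {
  # Once we know where the function lives, we can call it with the
  # library prefix: five draws from a bivariate standard normal.
  draws <- MASS::mvrnorm(n = 5, mu = c(0, 0), Sigma = diag(2))
  print(dim(draws))
}
```

Calling `library(MASS)` first would also make a plain `?mvrnorm` work, since help() looks in attached libraries by default.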
But what if you don't know the name of the command, which is often the case? First, we can see whether a function in an installed library does the job. If not, we can search the CRAN packages to see whether the function has been written. In either case, if we don't like the function as written, or don't trust it, we can rewrite it ourselves!
Let's say we read a paper that included a zero-inflated negative binomial regression:
help.search("zero inflated negative binomial")
Talk to your neighbour about a statistical issue you have and try to find a command.
As social scientists, we often do not have complete data sets. R is very sensitive to this fact. While Stata often quietly drops observations with missing values, R forces you to think about your actions.
There are two major types of missing values, what Matloff calls "no such animal" values. These are NA and NULL. What is the difference?
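One way to start exploring the question is a small experiment. The sketch below uses two toy vectors (`with_na` and `with_null` are illustrative names, not part of the exercise data) to show that NA holds a place for an unknown value, while NULL is simply nothing at all.

```r
with_na   <- c(1, NA, 3)    # the second value exists but is unknown
with_null <- c(1, NULL, 3)  # NULL disappears when the vector is built

length(with_na)    # 3: NA still occupies a slot
length(with_null)  # 2: NULL left no trace

is.na(with_na)     # FALSE  TRUE FALSE
is.null(NULL)      # TRUE

mean(with_na)      # NA: the average of an unknown value is unknown
mean(with_null)    # 2: only 1 and 3 are in the vector
```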
Let's create some fake data that represents wealth. Wealth is often log-normally distributed. Now let's examine what happens when rich people systematically don't answer questions about their wealth, while learning about missing values.
set.seed(20150406)
options(scipen=2)
americanIncome <- rlnorm(10000, log(35), log(5))
#examine income
summary(americanIncome)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
##     0.083    11.780    35.740   128.400   106.800 15150.000
mean(americanIncome)
## [1] 128.3906
americanIncome <- sort(americanIncome)
head(americanIncome, 10)
## [1] 0.08263699 0.10469058 0.11357847 0.11649199 0.12614043 0.13767362
## [7] 0.14785704 0.15665439 0.16178657 0.16702796
#generate missing values and insert
missingValues <- sample(9000:10000, 700)
americanIncomeNull <- americanIncomeNA <- americanIncome
americanIncomeNA[missingValues] <- NA
americanIncomeNull[missingValues] <- NULL #why an error
## Error in americanIncomeNull[missingValues] <- NULL: replacement has length zero
americanIncomeNull <- na.omit(americanIncomeNA)
length(americanIncomeNull)
## [1] 9300
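The error above can be reproduced on a smaller scale. The following is only a sketch with made-up values (`incomes`, `dropped` and `income_list` are illustrative names): an atomic vector has no way to store NULL, so elements are dropped with negative indices instead, whereas assigning NULL into a list removes the element.

```r
incomes <- c(10, 20, 30)

# Assigning NULL into an atomic vector fails: there is nothing to store.
result <- tryCatch({
  incomes[2] <- NULL
  "no error"
}, error = function(e) conditionMessage(e))
print(result)

# To drop an element from an atomic vector, use a negative index.
dropped <- incomes[-2]
print(dropped)

# In a list, assigning NULL removes the element.
income_list <- list(10, 20, 30)
income_list[2] <- NULL
print(length(income_list))
```

This is why the example falls back on na.omit(): it returns a shorter vector with the NA entries removed.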
#compare
mean(americanIncomeNA, na.rm=TRUE)
## [1] 79.26198
mean(americanIncomeNA) # why is there an NA (not an error or a value)?
## [1] NA
mean(americanIncomeNull, na.rm=FALSE) #why does this work
## [1] 79.26198
mean(americanIncome) - mean(americanIncomeNA, na.rm=TRUE)
## [1] 49.12858
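The same comparison can be checked by hand on a toy vector. This is a minimal sketch with invented numbers (`income`, `income_na`, `complete_mean` and `observed_mean` are illustrative names), in which only the richest respondent declines to answer.

```r
income <- c(10, 20, 30, 40, 1000)
income_na <- income
income_na[5] <- NA   # the richest respondent does not answer

sum(is.na(income_na))    # number of missing values: 1
mean(is.na(income_na))   # share of values missing: 0.2

complete_mean <- mean(income)                   # 220
observed_mean <- mean(income_na, na.rm = TRUE)  # 25
complete_mean - observed_mean                   # 195: size of the bias
```

Because the missing value is not missing at random, dropping it with na.rm=TRUE understates the true mean, just as in the simulated data.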