September, 2015

Today

  • R basics
  • groups
  • exercise 1

R basics

  • do you all have R installed on your computer?
  • can you all read datasets into R?
  • have you finished exercise 1?

Installing packages

  • on its own, R can't do all that much
  • to really make use of R's capabilities, we need packages
  • a package bundles together code, data, documentation, and tests
  • we install packages from two sources
    • the Comprehensive R Archive Network (CRAN)
    • GitHub

Installing packages from CRAN

We can install the readr package, for example, by running

install.packages("readr")

Afterwards, we can access all the functions available in the package by running

library("readr")
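
A package only needs to be installed once, but it must be loaded with library in every new R session. As a minimal sketch, the two helpers below (is_installed and install_if_missing are hypothetical names, not part of R) install a package only when it is missing:

```r
# Hypothetical helper: TRUE if the package can be loaded, FALSE otherwise.
is_installed = function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Hypothetical helper: install the package from CRAN only if it is missing.
install_if_missing = function(pkg) {
  if (!is_installed(pkg)) {
    install.packages(pkg)
  }
  invisible(is_installed(pkg))
}

# The stats package ships with R, so this check should succeed.
is_installed("stats")
```

Typical usage would be install_if_missing("readr") followed by library("readr").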

Installing packages from GitHub

It's slightly more difficult to install from GitHub, since we first need to install and load a package from CRAN: devtools

Installing from GitHub then looks like

# install.packages("devtools") # only needed once
library("devtools")
install_github("hadley/purrr")

The purrr package can now be loaded using the library command

library("purrr")

Getting help

  • if you know the name of the function, type ? followed by the name in the console
?summary
  • if you don't, search the installed help pages using ?? followed by a keyword
  • otherwise, try Google/Stack Overflow or our discussion forum
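
Two more base R tools can help when you only remember part of a function name. This is a small sketch rather than a complete tour of the help system: apropos lists objects whose names match a pattern, and args shows the arguments a function accepts.

```r
# Names on the search path that start with "mean".
found = apropos("^mean")
print(found)

# The argument list of sd(), without opening the full help page.
args(sd)
```

The ?? shorthand calls help.search, so ??regression and help.search("regression") run the same search.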

Fundamentals

R is object-oriented: everything we work with is an object

Examples:

  • character string (e.g. words)
  • number
  • vector
  • matrix
  • data frame
  • list

We can verify the class of an object using the class function
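
For instance, calling class on a few freshly created objects (the values are arbitrary illustrations) might look like this:

```r
class("hello")            # "character"
class(3.14)               # "numeric"
class(1:3)                # "integer"
class(list(1, "a"))       # "list"
class(data.frame(a = 1))  # "data.frame"
```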

We assign objects to object names using "<-" or "="

Example

z = "text"
p = c(1, 3, 5)
q = 2
y = NA
k = FALSE
  • Question: what is the class of the objects given above?

Special values

  • NA: not available, missing (test with is.na)
  • NULL: undefined, empty (test with is.null)
  • TRUE: logical true (test with isTRUE)
  • FALSE: logical false (test with isFALSE; note that !isTRUE(x) is also TRUE when x is NA)
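
A short sketch of how these values behave in practice, using a made-up vector x with one missing element:

```r
x = c(1, NA, 3)

is.na(x)               # FALSE TRUE FALSE
sum(is.na(x))          # 1, the number of missing values
mean(x)                # NA, because one element is missing
mean(x, na.rm = TRUE)  # 2, missing values are dropped first

is.null(NULL)          # TRUE
length(NULL)           # 0
isTRUE(NA)             # FALSE
```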

Symbols

Operator   Meaning
<          less than
>          greater than
==         equal to
<=         less than or equal to
>=         greater than or equal to
!=         not equal to
a | b      a or b
a & b      a and b
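
These operators work element by element on vectors and return logical vectors, which can in turn be used to pick out elements. A small illustration with an arbitrary vector x:

```r
x = c(1, 5, 10)

x > 3          # FALSE TRUE TRUE
x == 5         # FALSE TRUE FALSE
x > 3 & x < 8  # FALSE TRUE FALSE
x < 3 | x > 8  # TRUE FALSE TRUE

# Keep only the elements that are not equal to 5.
x[x != 5]      # 1 10
```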

Functions

Functions operate on objects

R has many useful built-in functions such as summary, mean, table, etc.

x = 1:10
mean(x)
## [1] 5.5
sd(x)
## [1] 3.02765
table(x)
## x
##  1  2  3  4  5  6  7  8  9 10 
##  1  1  1  1  1  1  1  1  1  1

Vectors

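The most basic type of R object is a vector. There is really only one rule about (atomic) vectors in R: all elements of a vector must be of the same type. If we mix types, R silently converts the elements to a common type.

Almost everything in R is a vector: even a single number is a vector of length one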
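
A quick sketch of what the same-class rule means in practice: when types are mixed inside c, R converts the elements to a common type instead of raising an error. The object names are made up for illustration.

```r
mixed = c(1, "a", TRUE)
mixed              # "1" "a" "TRUE"
class(mixed)       # "character"

class(c(1, TRUE))  # "numeric", TRUE is converted to 1

# A single number is a vector of length one.
single = 5
length(single)     # 1
is.vector(single)  # TRUE
```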

We can create vectors using the c (concatenate/combine) function

my_vector = c(1, 3, 5, 10)
another_vector = 1:100
a_third_vector = c("yes", "no", "hello")
my_logical_vector = c(TRUE, FALSE, FALSE, TRUE)

Data frames

  • R stores spreadsheet-like data in a data frame
  • these are really collections of vectors of the same length
  • we read data with functions such as read.csv (base R), read_csv (readr), read_xlsx (readxl), etc.
  • see the post Reading And Working With Data

Data frames basics

You can select variables using $, known as the component selector

You can also select variables/observations using indexing

If you have a data frame df, then

df[1, 1]

selects the first row and the first column of the dataset

df[, 1]

selects the entire first column,

df[2, ]

selects the second row, etc.
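
To try these selections without loading any data, here is a sketch on a tiny made-up data frame (toy and its columns are purely illustrative):

```r
toy = data.frame(
  name = c("Ann", "Bo", "Cy"),
  age = c(21, 25, 30),
  stringsAsFactors = FALSE
)

toy$age                    # 21 25 30, the age column as a vector
toy[1, 1]                  # "Ann", first row and first column
toy[, 1]                   # "Ann" "Bo" "Cy", the entire first column
toy[2, ]                   # the second row, returned as a data frame

# Indexing also accepts logical conditions and column names.
toy[toy$age > 22, "name"]  # "Bo" "Cy"
```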

Working with data frames

Some useful functions for working with data frames:

  • names: returns the column names of the data frame
  • rownames: returns the row names (if any) of the data frame
  • summary: returns summary statistics
  • head: returns the first 6 observations of the data frame by default (use the n argument to change this)
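
These functions can be tried on mtcars, a small dataset that ships with R (it is used here only as a stand-in for your own data):

```r
names(mtcars)[1:3]      # "mpg" "cyl" "disp"
nrow(mtcars)            # 32 observations
head(rownames(mtcars), 2)

first_rows = head(mtcars)     # first 6 rows by default
dim(first_rows)               # 6 11
dim(head(mtcars, n = 10))     # 10 11

summary(mtcars$mpg)     # summary statistics for one column
```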

Your turn

Work with a data frame in groups

I want you to work with a cleaned version of your survey responses in groups.

The data can be read as

# install.packages("readr")
library("readr")

df = read_csv("https://raw.githubusercontent.com/sebastianbarfort/sds/master/data/sds-survey-1.csv")

Questions for discussion

  1. what are the variable names in the dataset?
  2. how many observations do we have?
  3. make a table of observations grouped by study
  4. try to figure out how to rename one of the variables
  5. are there any NAs in the data?
  6. what is the mean age of the respondents?
  7. try to crosstabulate observations by gender and field of study
  8. are there any variables that are coded incorrectly? If so, think about how you would fix them

Answers I

names(df) # Q 1
## [1] "field.of.study"   "credit"           "hours"           
## [4] "computing.skills" "system"           "degree"          
## [7] "age"              "gender"           "time.spent"
nrow(df) # Q2
## [1] 99
table(df$field.of.study) # Q3
## 
##                Economics        Political Science Security Risk Management 
##                       89                        8                        1 
##                Sociology 
##                        1

Answers II

names(df) = c("study", names(df)[-1]) #Q4
summary(df) # Q5
##     study               credit          hours           computing.skills  
##  Length:99          Min.   :0.0000   Length:99          Length:99         
##  Class :character   1st Qu.:1.0000   Class :character   Class :character  
##  Mode  :character   Median :1.0000   Mode  :character   Mode  :character  
##                     Mean   :0.8673                                        
##                     3rd Qu.:1.0000                                        
##                     Max.   :1.0000                                        
##                     NA's   :1                                             
##     system             degree               age           gender         
##  Length:99          Length:99          Min.   :20.00   Length:99         
##  Class :character   Class :character   1st Qu.:23.50   Class :character  
##  Mode  :character   Mode  :character   Median :24.00   Mode  :character  
##                                        Mean   :25.01                     
##                                        3rd Qu.:26.00                     
##                                        Max.   :43.00                     
##                                                                          
##    time.spent    
##  Min.   :  30.0  
##  1st Qu.:  46.5  
##  Median :  66.0  
##  Mean   : 110.5  
##  3rd Qu.:  88.0  
##  Max.   :3158.0  
## 
mean(df$age) # Q6
## [1] 25.0101
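
Note that the summary output above reports NA counts only for numeric columns, so missing values in character columns are easy to overlook. A more direct check for Q5 could look like the following sketch, shown on a small made-up data frame rather than the survey data:

```r
toy = data.frame(
  study = c("Economics", NA, "Sociology"),
  age = c(23, 25, NA),
  stringsAsFactors = FALSE
)

anyNA(toy)                     # TRUE, there is at least one NA
colSums(is.na(toy))            # number of NAs in each column
mean(toy$age)                  # NA, because age contains a missing value
mean(toy$age, na.rm = TRUE)    # 24
```

Applied to the survey data, the same calls would be anyNA(df) and colSums(is.na(df)).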

Answers III

table(df$study, df$gender) # Q7 (the column was renamed to study in Q4)
##                           
##                            Female Male
##   Economics                    36   53
##   Political Science             1    7
##   Security Risk Management      0    1
##   Sociology                     0    1

Some interesting points

Clear majority of econ students (90%)

Only 4% rate their computing skills below average ;)

We have 1 Linux user (54% on Windows)

86% are taking the course for credit

A classic error

One variable we might be interested in is hours: how many hours the student plans to study for the course

However, when we try to analyze this variable we run into problems

mean(df$hours)
## Warning in mean.default(df$hours): argument is not numeric or logical:
## returning NA
## [1] NA

What is the problem?

Error analysis

We can see that R has read the variable as a character vector

class(df$hours)
## [1] "character"

The reason is that some respondents have answered, for example, 3-4 instead of 3.5

head(df$hours)
## [1] NA    "3"   "8"   "3-4" "5"   "6"

We can investigate how many times this happens by using the grep function. Take a minute and discuss with your neighbour how you would use this function to analyze the problem

The grep function

The grep function does so-called pattern matching. That is, it looks for a certain pattern (by default a regular expression) in each element of our vector. It returns the indices of the elements where the pattern appears.

grep("-", df$hours)
## [1] 4
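
To see how grep and its relatives behave, here is a sketch on a made-up vector that copies the first six answers shown above:

```r
hours = c(NA, "3", "8", "3-4", "5", "6")

grep("-", hours)                # 4, the position of the matching element
grep("-", hours, value = TRUE)  # "3-4", the matching element itself

# grepl returns one TRUE/FALSE per element; NA counts as no match.
grepl("-", hours)               # FALSE FALSE FALSE TRUE FALSE FALSE
sum(grepl("-", hours))          # 1, the number of answers given as a range
```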

We can fix the error using the gsub function

df$hours = gsub("3-4", "3.5", df$hours)

Finally, we can convert the column to class numeric (as.integer would truncate 3.5 to 3)

df$hours = as.numeric(df$hours)
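
Replacing "3-4" by hand works when there is a single problematic answer. If several different ranges appeared, one possible generalisation is to replace every range with its midpoint. The helper below is a hypothetical sketch (range_to_midpoint is not part of R or of the course material) and assumes answers are either a single number or two numbers separated by a hyphen:

```r
# Hypothetical helper: turn "3-4" into 3.5 and "5" into 5; NA stays NA.
range_to_midpoint = function(x) {
  parts = strsplit(x, "-", fixed = TRUE)
  sapply(parts, function(p) mean(as.numeric(p)))
}

hours = c(NA, "3", "8", "3-4", "5", "6")
clean_hours = range_to_midpoint(hours)
clean_hours          # NA 3.0 8.0 3.5 5.0 6.0
class(clean_hours)   # "numeric"

# Converting to integer instead would drop the decimal part.
as.integer("3.5")    # 3
```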

Work ambition (do you work enough?)

We can plot the distribution of the hours variable

library("ggplot2")
p = ggplot(df, aes(x = hours))
p + geom_histogram() 

Work ambition by gender

library("ggplot2")
p = ggplot(df, aes(x = hours))
p + geom_histogram() + facet_wrap(~ gender)

Exercise 1

  • first, try to describe the dataset. What simple commands would you run?

  • second, think about the pros and cons of using this kind of data. What should we be worried about when working with it?

  • third, think about interesting questions we could ask of the data

  • fourth, to test your R skills, try to answer the following questions
    • what are the dimensions of the data?
    • what is the class of each column?
    • which state has the highest mean marijuana price?

R skills I

Read the data

library("readr")
my.df = read_csv("https://raw.githubusercontent.com/sebastianbarfort/sds/master/data/marijuana-street-price-clean.csv")
dim(my.df)
## [1] 22899     8
names(my.df)
## [1] "State"  "HighQ"  "HighQN" "MedQ"   "MedQN"  "LowQ"   "LowQN"  "date"
summary(my.df)
##     State               HighQ           HighQN           MedQ      
##  Length:22899       Min.   :202.0   Min.   :   93   Min.   :144.8  
##  Class :character   1st Qu.:303.8   1st Qu.:  597   1st Qu.:215.8  
##  Mode  :character   Median :342.3   Median : 1420   Median :245.8  
##                     Mean   :329.8   Mean   : 2275   Mean   :247.6  
##                     3rd Qu.:356.6   3rd Qu.: 2958   3rd Qu.:274.2  
##                     Max.   :415.7   Max.   :18492   Max.   :379.0  
##                                                                    
##      MedQN            LowQ           LowQN             date           
##  Min.   :  134   Min.   : 63.7   Min.   :  11.0   Min.   :2013-12-27  
##  1st Qu.:  548   1st Qu.:147.1   1st Qu.:  51.0   1st Qu.:2014-04-18  
##  Median : 1320   Median :186.8   Median : 139.0   Median :2014-08-09  
##  Mean   : 2184   Mean   :203.7   Mean   : 202.8   Mean   :2014-08-14  
##  3rd Qu.: 2673   3rd Qu.:221.4   3rd Qu.: 263.0   3rd Qu.:2014-11-29  
##  Max.   :22027   Max.   :734.6   Max.   :1287.0   Max.   :2015-06-11  
##                  NA's   :10557

R skills II

To figure out which state has the highest mean marijuana price, we need to look at the dplyr package.

Take a look at this package and see if you can figure out how to do this (tip: you need the group_by function)

Highest mean price

library("dplyr")
my.df %>%
  group_by(State) %>%
  summarise(
    m.price = mean(HighQ, na.rm = TRUE)
  ) %>%
  arrange(desc(m.price))
## Source: local data frame [51 x 2]
## 
##           State  m.price
## 1  North Dakota 398.6688
## 2  South Dakota 375.8185
## 3       Vermont 374.2504
## 4      Maryland 370.9852
## 5      Virginia 368.1470
## 6          Iowa 367.0958
## 7     Louisiana 366.8325
## 8      Delaware 366.7818
## 9  Pennsylvania 366.1257
## 10     Oklahoma 361.5731
## ..          ...      ...
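
For comparison, the same split-summarise-sort idea can be sketched in base R with tapply. This is an alternative technique, not the dplyr approach used above, and the data frame below is made up for illustration (the state labels and prices are not real data):

```r
toy = data.frame(
  State = c("A", "A", "B", "B", "C"),
  HighQ = c(300, 320, 400, NA, 350),
  stringsAsFactors = FALSE
)

# Mean of HighQ within each State, ignoring missing values.
m.price = tapply(toy$HighQ, toy$State, mean, na.rm = TRUE)

# Sort from highest to lowest mean price.
ranked = sort(m.price, decreasing = TRUE)
ranked            # B 400, C 350, A 310
names(ranked)[1]  # "B", the state with the highest mean price
```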

The plot code

library("readr")
my.df = read_csv("https://raw.githubusercontent.com/sebastianbarfort/sds/master/data/marijuana-street-price-clean.csv")
library("ggplot2")
library("scales")
p = ggplot(my.df, aes(x = date, y = HighQ))
p = p + geom_point(alpha = .05) # add semi-transparent points
p = p + geom_line() # add lines
p = p + geom_smooth(colour = "red") # add a smoothed trend line
p = p + facet_wrap(~ State, scales = "free_y") # one panel per state
p = p + scale_x_date(breaks = pretty_breaks(4)) # roughly four date breaks
p = p + labs(x = NULL, y = "Price ($)", title = "Price of Marijuana")
p # print the plot

The plot