- R basics
- groups
- exercise 1
September, 2015
We can install the `readr` package, for example, by running
install.packages("readr")
Afterwards, we can access all the functions available in the package by running
library("readr")
Installing from GitHub is slightly more involved, since we first need to install a package from CRAN: `devtools`
Installing from GitHub now looks like
library("devtools")
install_github("hadley/purrr")
The `purrr` package can now be loaded using the `library` command
library("purrr")
- `?` followed by the function name in the console, for example `?summary`
- `??` followed by the function name

R is object oriented
Examples:
We can verify the class of an object using the `class` function
We assign objects to object names using `<-` or `=`
z = "text"
p = c(1, 3, 5)
q = 2
y = NA
k = FALSE
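As a quick check of the objects assigned above, `class` can be applied to each of them. This is a minimal sketch that simply reuses the example values from the slide.

```r
# Same example objects as above
z = "text"
p = c(1, 3, 5)
q = 2
y = NA
k = FALSE

class(z)  # "character"
class(p)  # "numeric"
class(q)  # "numeric"
class(y)  # "logical" (a bare NA is logical by default)
class(k)  # "logical"
```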
- `NA`: not available, missing (`is.na`)
- `NULL`: undefined (`is.null`)
- `TRUE`: logical true (`isTRUE`)
- `FALSE`: logical false (`!isTRUE`)

| Operator | Meaning |
|---|---|
| `<` | less than |
| `>` | greater than |
| `==` | equal to |
| `<=` | less than or equal to |
| `>=` | greater than or equal to |
| `!=` | not equal to |
| `a \| b` | a or b |
| `a & b` | a and b |
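The special values and operators above can be tried out directly in the console. The following is a small sketch; the names `a`, `b` and `v` are arbitrary illustration values, not part of the course data.

```r
a = 3
b = 5

lt  = a < b               # TRUE
eq  = a == b              # FALSE
neq = a != b              # TRUE
or  = (a < b) | (a > b)   # TRUE: at least one side is TRUE
and = (a < b) & (a > b)   # FALSE: both sides must be TRUE

# Missing values: test with is.na(), not with ==
v = c(1, NA, 3)
missing = is.na(v)        # FALSE TRUE FALSE
compared = v == NA        # NA NA NA (comparing with NA gives NA)

is.null(NULL)             # TRUE
isTRUE(lt)                # TRUE
```

Comparing with `NA` using `==` never returns `TRUE`, which is why `is.na` exists.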
Functions operate on objects
R has many cool built-in functions such as `summary`, `mean`, `table`, etc.
x = 1:10
mean(x)
## [1] 5.5
sd(x)
## [1] 3.02765
table(x)
## x
##  1  2  3  4  5  6  7  8  9 10
##  1  1  1  1  1  1  1  1  1  1
The most basic type of R object is a vector. There is really only one rule about vectors in R, which is that a vector can only contain objects of the same class.
Everything in R is a vector
We can create vectors using the `c` (concatenate/combine) function
my_vector = c(1, 3, 5, 10)
another_vector = 1:100
a_third_vector = c("yes", "no", "hello")
my_logical_vector = c(TRUE, FALSE, FALSE, TRUE)
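To illustrate the rule that a vector can only hold objects of one class, here is a small sketch of what happens when classes are mixed inside `c`. The names `mixed`, `nums` and `single` are made up for this example.

```r
# Mixing classes: R silently converts everything to one common class
mixed = c(1, "yes", TRUE)
class(mixed)   # "character"
mixed          # "1" "yes" "TRUE"

# Logical values mixed with numbers become numbers (TRUE = 1, FALSE = 0)
nums = c(1, TRUE, FALSE)
class(nums)    # "numeric"

# Even a single value is a vector of length 1
single = 5
length(single) # 1
```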
Functions for reading data: `read.csv`, `read_xlsx`, etc.
You can select variables using `$`, known as the component selector
You can also select variables and observations using indexing. If you have a data frame `df`, then

- `df[1, 1]` selects the element in the first row and first column of the dataset,
- `df[, 1]` selects the entire first column,
- `df[2, ]` selects the second row, etc.
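The selection rules above can be tried on a small data frame. This is a sketch using a made-up data frame `df_toy` (hypothetical names and values), not the survey data.

```r
# A tiny made-up data frame for illustration
df_toy = data.frame(
  name = c("Ann", "Bob", "Cy"),
  age = c(23, 31, 27),
  stringsAsFactors = FALSE
)

ages   = df_toy$age    # component selector: 23 31 27
first  = df_toy[1, 1]  # first row, first column: "Ann"
col1   = df_toy[, 1]   # entire first column: "Ann" "Bob" "Cy"
second = df_toy[2, ]   # second row, returned as a one-row data frame

names(df_toy)          # "name" "age"
nrow(df_toy)           # 3
```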
Some useful functions for working with data frames:
- `names`: returns the column names of the data frame
- `rownames`: returns the row names (if any) of the data frame
- `summary`: returns summary statistics
- `head`: returns the first few observations of the data frame (6 by default)

I want you to work with a cleaned version of your survey responses in groups.
The data can be read as
# install.packages("readr")
library("readr")
df = read_csv("https://raw.githubusercontent.com/sebastianbarfort/sds/master/data/sds-survey-1.csv")
`NA`s in the data?
names(df) # Q1
## [1] "field.of.study"   "credit"           "hours"
## [4] "computing.skills" "system"           "degree"
## [7] "age"              "gender"           "time.spent"
nrow(df) # Q2
## [1] 99
table(df$field.of.study) # Q3
##
##                Economics        Political Science Security Risk Management
##                       89                        8                        1
##                Sociology
##                        1
names(df) = c("study", names(df)[-1]) # Q4
summary(df) # Q5
##     study               credit          hours           computing.skills
##  Length:99          Min.   :0.0000   Length:99          Length:99
##  Class :character   1st Qu.:1.0000   Class :character   Class :character
##  Mode  :character   Median :1.0000   Mode  :character   Mode  :character
##                     Mean   :0.8673
##                     3rd Qu.:1.0000
##                     Max.   :1.0000
##                     NA's   :1
##     system             degree               age           gender
##  Length:99          Length:99          Min.   :20.00   Length:99
##  Class :character   Class :character   1st Qu.:23.50   Class :character
##  Mode  :character   Mode  :character   Median :24.00   Mode  :character
##                                        Mean   :25.01
##                                        3rd Qu.:26.00
##                                        Max.   :43.00
##
##    time.spent
##  Min.   :  30.0
##  1st Qu.:  46.5
##  Median :  66.0
##  Mean   : 110.5
##  3rd Qu.:  88.0
##  Max.   :3158.0
##
mean(df$age) # Q6
## [1] 25.0101
table(df$study, df$gender) # Q7 (the column was renamed to "study" in Q4)
##
##                            Female Male
##   Economics                    36   53
##   Political Science             1    7
##   Security Risk Management      0    1
##   Sociology                     0    1
Clear majority of econ students (90%)
Only 4% rate their computing skills below average ;)
We have 1 Linux user (54% on Windows)
86% are taking the course for credit
An interesting variable is `hours`: how many hours the student plans to study for the course
However, when we want to analyze this variable we run into problems
mean(df$hours)
## Warning in mean.default(df$hours): argument is not numeric or logical:
## returning NA
## [1] NA
What is the problem?
We can see that R has read the variable as a character vector
class(df$hours)
## [1] "character"
The reason is that some respondents have answered, for example, `3-4` instead of `3.5`
head(df$hours)
## [1] NA "3" "8" "3-4" "5" "6"
We can investigate how many times this happens by using the `grep` function. Take a minute and discuss with your classmate how you would use this function to analyze the problem.
The `grep` function
The `grep` function does so-called string matching. That is, it looks for a certain pattern (a regular expression) in our vector and returns the indices of the elements where the pattern appears.
grep("-", df$hours)
## [1] 4
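To see what `grep` returns, here is a small sketch on a toy vector. The vector `hours_toy` is a stand-in that copies the first six values shown by `head(df$hours)` above; `value = TRUE` and `grepl` are standard base R variants.

```r
# Stand-in for the first six values of df$hours
hours_toy = c(NA, "3", "8", "3-4", "5", "6")

idx     = grep("-", hours_toy)                # 4: position of the match
matches = grep("-", hours_toy, value = TRUE)  # "3-4": the matching element itself
flags   = grepl("-", hours_toy)               # one TRUE/FALSE per element

# Number of problematic answers
n_bad = length(idx)                           # 1
```

Using `length` on the result of `grep` answers the question of how many times the problem occurs.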
We can fix the error using the `gsub` function
df$hours = gsub("3-4", "3.5", df$hours)
Finally, we can convert the column to class numeric (converting to integer would truncate `3.5` to `3`)
df$hours = as.numeric(df$hours)
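The choice of conversion function matters for the value we just fixed. The sketch below uses a toy vector `hours_toy` (a stand-in copying the first six values of `df$hours` shown earlier) to compare `as.integer` and `as.numeric` after the `gsub` step.

```r
# Stand-in for the first six values of df$hours
hours_toy = c(NA, "3", "8", "3-4", "5", "6")

# Replace the range answer with its midpoint
fixed = gsub("3-4", "3.5", hours_toy)   # NA "3" "8" "3.5" "5" "6"

as_int = as.integer(fixed)   # NA 3 8 3 5 6       (3.5 is truncated to 3)
as_num = as.numeric(fixed)   # NA 3.0 8.0 3.5 5.0 6.0

# The mean now works, as long as missing values are dropped
m = mean(as_num, na.rm = TRUE)   # 5.1
```

If half hours should be preserved, `as.numeric` is the safer choice; `as.integer` silently drops the decimal part.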
We can plot the distribution of the `hours` variable
library("ggplot2")
p = ggplot(df, aes(x = hours))
p + geom_histogram()
library("ggplot2")
p = ggplot(df, aes(x = hours))
p + geom_histogram() + facet_wrap(~ gender)
- First, try to describe the dataset. What simple commands would you run?
- Second, think about the pros and cons of using this kind of data. What should we be worried about when working with it?
- Third, think about interesting questions to ask of the data.
Read the data
library("readr")
my.df = read_csv("https://raw.githubusercontent.com/sebastianbarfort/sds/master/data/marijuana-street-price-clean.csv")
dim(my.df)
## [1] 22899 8
names(my.df)
## [1] "State" "HighQ" "HighQN" "MedQ" "MedQN" "LowQ" "LowQN" "date"
summary(my.df)
##     State               HighQ           HighQN           MedQ
##  Length:22899       Min.   :202.0   Min.   :   93   Min.   :144.8
##  Class :character   1st Qu.:303.8   1st Qu.:  597   1st Qu.:215.8
##  Mode  :character   Median :342.3   Median : 1420   Median :245.8
##                     Mean   :329.8   Mean   : 2275   Mean   :247.6
##                     3rd Qu.:356.6   3rd Qu.: 2958   3rd Qu.:274.2
##                     Max.   :415.7   Max.   :18492   Max.   :379.0
##
##      MedQN            LowQ           LowQN             date
##  Min.   :  134   Min.   : 63.7   Min.   :  11.0   Min.   :2013-12-27
##  1st Qu.:  548   1st Qu.:147.1   1st Qu.:  51.0   1st Qu.:2014-04-18
##  Median : 1320   Median :186.8   Median : 139.0   Median :2014-08-09
##  Mean   : 2184   Mean   :203.7   Mean   : 202.8   Mean   :2014-08-14
##  3rd Qu.: 2673   3rd Qu.:221.4   3rd Qu.: 263.0   3rd Qu.:2014-11-29
##  Max.   :22027   Max.   :734.6   Max.   :1287.0   Max.   :2015-06-11
##                  NA's   :10557
To figure out which state has the highest mean marijuana price, we need the `dplyr` package.
Take a look at this package and see if you can figure out how to do this (tip: you need the `group_by` function)
library("dplyr")
my.df %>%
  group_by(State) %>%
  summarise(
    m.price = mean(HighQ, na.rm = TRUE)
  ) %>%
  arrange(desc(m.price))
## Source: local data frame [51 x 2]
##
##           State  m.price
## 1  North Dakota 398.6688
## 2  South Dakota 375.8185
## 3       Vermont 374.2504
## 4      Maryland 370.9852
## 5      Virginia 368.1470
## 6          Iowa 367.0958
## 7     Louisiana 366.8325
## 8      Delaware 366.7818
## 9  Pennsylvania 366.1257
## 10     Oklahoma 361.5731
## ..          ...      ...
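The same group, summarise and sort pattern is easier to follow on a handful of rows. The sketch below uses a made-up data frame `toy` with hypothetical states "A" and "B"; only the column names `State` and `HighQ` are borrowed from the real data.

```r
library("dplyr")

# Made-up prices for two hypothetical states, one value missing
toy = data.frame(
  State = c("A", "A", "B", "B"),
  HighQ = c(300, 320, 400, NA),
  stringsAsFactors = FALSE
)

res = toy %>%
  group_by(State) %>%                               # one group per state
  summarise(m.price = mean(HighQ, na.rm = TRUE)) %>% # mean price within each group
  arrange(desc(m.price))                            # highest mean first

res$State    # "B" "A"
res$m.price  # 400 310
```

Without `na.rm = TRUE`, the mean for state "B" would be `NA` because one of its prices is missing.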
library("ggplot2")
library("scales")
p = ggplot(my.df, aes(x = date, y = HighQ))
p = p + geom_point(alpha = .05) # add points
p = p + geom_line() # add lines
p = p + geom_smooth(colour = "red") # add a smoothed trend
p = p + facet_wrap(~ State, scales = "free_y") # one panel per state
p = p + scale_x_date(breaks = pretty_breaks(4))
p = p + labs(x = NULL, y = "Price ($)", title = "Price of Marijuana")
p # print the plot