Groups are coming
Rune is coming
Mikkel created some cool plots using our marijuana data
September, 2015
R has excellent plotting libraries
There is the base plotting function (plot), which can be useful once in a while. Furthermore, there is the lattice library, which is very powerful but has a complicated syntax.
Today we will focus on ggplot2, written by Hadley Wickham.
I'll show you a lot of examples using some data I downloaded from Facebook. Afterwards, you will work with ggplot2 in groups.
ggplot2 - a grammar of graphics in R
ggplot2 is based on the grammar of graphics, the idea that you can build every graph from the same few components: a dataset, a set of geoms (visual marks that represent data points), and a coordinate system.
A statistical graph is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system. Faceting can be used to generate the same plot for different subsets of the dataset. It is the combination of these independent components that make up a graphic.
- Wickham, ggplot2, p. 3.
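As a rough sketch of how these components map onto code, the example below builds one plot layer by layer. It uses R's built-in mtcars data rather than the Facebook data, so the variable names (wt, mpg, cyl, am) are assumptions made purely for illustration.

```r
library("ggplot2")

# Built-in mtcars data stands in for the lecture data (illustration only)
p = ggplot(data = mtcars,                              # dataset
           aes(x = wt, y = mpg, colour = factor(cyl))) + # aesthetic mapping
  geom_point() +                                       # geometric object
  geom_smooth(method = "lm", se = FALSE) +             # statistical transformation
  coord_cartesian() +                                  # coordinate system
  facet_wrap(~ am)                                     # same plot per subset
p
```

Each component is added with +, so any one of them can be swapped out without touching the others.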
We will work with the latest 5,000 Facebook posts from the Danish newspaper Politiken.
The script that downloads the data is on GitHub.
Read the data
library("readr")
df = read_csv("https://raw.githubusercontent.com/sebastianbarfort/sds/master/data/politiken.csv",
              col_types = list(created_time = col_character()))
df$created_time = parse_datetime(df$created_time)
Distribution of number of likes for each post.
library("ggplot2")
p = ggplot(data = df, aes(x = likes_count)) # data & aesthetics
p + geom_histogram() # add geom
Distribution of (logged) number of likes for each post.
p = ggplot(data = df, aes(x = likes_count)) # data & aesthetics
p = p + geom_histogram() # add geom
p + scale_x_log10() # add log scale

p = ggplot(data = df, aes(x = likes_count)) # data & aesthetics
p = p + geom_density() # add geom
p + scale_x_log10() # add log scale

p = ggplot(data = df, aes(x = likes_count)) # data & aesthetics
p = p + geom_area(stat = "bin") # add geom
p + scale_x_log10() # add log scale

p = ggplot(data = df, aes(x = likes_count)) # data & aesthetics
p + geom_histogram(aes(y = ..density..), colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "#FF6666") +
  scale_x_log10()

p = ggplot(data = df, aes(x = likes_count, y = comments_count))
p + geom_point() +
  scale_x_log10() + scale_y_log10() # add log scales

p + geom_point() +
  geom_smooth(na.rm = TRUE, data = df[df$likes_count > 0 & df$comments_count > 0, ]) +
  geom_smooth(na.rm = TRUE, data = df[df$likes_count > 0 & df$comments_count > 0, ],
              method = "lm", colour = "red") +
  scale_x_log10() + scale_y_log10() # add log scales
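The smoothers above are fitted only on rows where both counts are positive. A likely reason (an assumption, since the slides do not say so) is that a count of zero becomes -Inf on a log scale. The sketch below shows the same pattern on simulated data; the data frame toy and its columns are hypothetical.

```r
library("ggplot2")

set.seed(1)
# Simulated counts; some comment counts will be exactly zero
toy = data.frame(likes = rpois(200, 20), comments = rpois(200, 3))

log10(0) # -Inf, which cannot be placed on a log axis

# Keep only rows where both counts are positive before smoothing
positive = toy[toy$likes > 0 & toy$comments > 0, ]

p = ggplot(toy, aes(x = likes, y = comments)) +
  geom_point() +
  geom_smooth(data = positive, method = "lm") + # fit on positive rows only
  scale_x_log10() + scale_y_log10()
```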
Number of likes over time
p = ggplot(df, aes(x = as.Date(created_time), y = likes_count))
p + geom_line()
We can see the section by following the link variable in the dataset
head(df$link, 3)
## [1] "http://politiken.dk/mad/ECE2837880/det-lille-mejeri-er-blevet-stor-industri/"
## [2] "http://politiken.dk/indland/ECE2839489/skoleelever-fodboldboern-og-socialt-udsatte-maa-betale-i-koebenhavn/"
## [3] "http://politiken.dk/sport/ECE2839793/skolepige-skiftede-strutskoert-og-ballet-ud-med-knaldhaarde-tacklinger/"
We can grab the section using a regular expression
library("stringr")
df$section = str_extract(df$link, ".dk/[a-z]*")
df$section = gsub(".dk/", "", df$section)
head(df$section, 5)
## [1] "mad"          "indland"      "sport"        "debat"
## [5] "forbrugogliv"
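In the pattern above the dot is unescaped, so it matches any character, not only a literal dot. A slightly stricter variant, sketched here as an alternative and not as the method used in the lecture, escapes the dot and uses a lookbehind so that the extraction takes one step. The vector links holds the three URLs printed earlier.

```r
library("stringr")

links = c(
  "http://politiken.dk/mad/ECE2837880/det-lille-mejeri-er-blevet-stor-industri/",
  "http://politiken.dk/indland/ECE2839489/skoleelever-fodboldboern-og-socialt-udsatte-maa-betale-i-koebenhavn/",
  "http://politiken.dk/sport/ECE2839793/skolepige-skiftede-strutskoert-og-ballet-ud-med-knaldhaarde-tacklinger/"
)

# (?<=\\.dk/) requires a literal ".dk/" just before the match without including it
sections = str_extract(links, "(?<=\\.dk/)[a-z]+")
sections
## [1] "mad"     "indland" "sport"
```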
Compute the average number of likes by section
library("dplyr")
df.section = df %>%
  filter(!is.na(section)) %>%
  filter(section != "") %>%
  group_by(section) %>%
  summarise(
    likes.pr.post = mean(likes_count, na.rm = TRUE)
  ) %>%
  arrange(-likes.pr.post)

p = ggplot(df.section, aes(x = reorder(section, likes.pr.post), y = likes.pr.post))
p + geom_bar(stat = "identity") + coord_flip()
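To see what the grouping pipeline does, it can help to run it on a tiny data frame where the means are easy to check by hand. The data below are made up for illustration; only the column names follow the lecture data.

```r
library("dplyr")

# Hypothetical toy data: six posts in three sections, one with a missing section
toy = data.frame(
  section     = c("sport", "sport", "debat", "debat", "mad", NA),
  likes_count = c(10, 30, 100, 300, 50, 5)
)

toy.section = toy %>%
  filter(!is.na(section)) %>%                # drop posts without a section
  group_by(section) %>%                      # one group per section
  summarise(likes.pr.post = mean(likes_count, na.rm = TRUE)) %>%
  arrange(desc(likes.pr.post))               # highest average first

toy.section
```

Here debat averages 200 likes, mad 50, and sport 20, so the rows come out in that order.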
Let's create a variable for the weekday of the post
library("lubridate")
df$weekday = wday(df$created_time, label = TRUE)
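A small sketch of how wday behaves on a single timestamp. The timestamp is an arbitrary example (a Monday), not a value from the dataset, and week_start is set explicitly so the result does not depend on session options.

```r
library("lubridate")

# Arbitrary example timestamp: 7 September 2015 was a Monday
posted = ymd_hms("2015-09-07 08:30:00", tz = "UTC")

wday(posted, week_start = 7)   # weeks start on Sunday, so Monday is day 2
wday(posted, week_start = 1)   # weeks start on Monday, so Monday is day 1
wday(posted, label = TRUE)     # ordered factor; the label text depends on locale
```

With label = TRUE the result is an ordered factor, which is why ggplot2 keeps the weekdays in calendar order when the variable is used for colours or facets.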
Let's plot the distribution of likes by weekday
p = ggplot(df, aes(x = likes_count, colour = weekday))
p + geom_density() + scale_x_log10() # print the plot
Relationship between likes and comments by weekday
p = ggplot(df, aes(x = likes_count, y = comments_count))
p + geom_point() +
  scale_x_log10() + scale_y_log10() +
  geom_smooth(na.rm = TRUE, data = df[df$likes_count > 0 & df$comments_count > 0, ]) +
  facet_wrap(~ weekday, scales = "free") # one panel per weekday
Let's compare the indland (domestic) and udland (international) sections of Politiken
df.subset = df %>% filter(section %in% c("indland", "udland"))
Now we can plot the relationship between likes and comments by weekday and section
p = ggplot(df.subset, aes(x = likes_count, y = comments_count))
p + geom_point() +
  scale_x_log10() + scale_y_log10() +
  geom_smooth(na.rm = TRUE,
              data = df.subset[df.subset$likes_count > 0 & df.subset$comments_count > 0, ]) +
  facet_grid(section ~ weekday, scales = "free") # rows: section, columns: weekday
tab = data.frame(table(df$section, df$weekday))
names(tab) = c("section", "weekday", "count")
Tile plot
p = ggplot(tab, aes(x = section, y = weekday))
p + geom_tile(aes(fill = count)) # print the plot
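The step from table to a long data frame is what makes the tile plot possible: every section and weekday combination gets its own row, including combinations with zero posts. A minimal sketch on made-up data (the toy data frame is hypothetical):

```r
library("ggplot2")

# Hypothetical toy data: six posts across three sections and two weekdays
toy = data.frame(
  section = c("sport", "sport", "debat", "mad", "mad", "mad"),
  weekday = c("Mon", "Tue", "Mon", "Mon", "Mon", "Tue")
)

# Cross-tabulate, then convert to one row per (section, weekday) cell
tab = data.frame(table(toy$section, toy$weekday))
names(tab) = c("section", "weekday", "count")
tab

p = ggplot(tab, aes(x = section, y = weekday)) +
  geom_tile(aes(fill = count)) # fill colour encodes the cell count
p
```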
In the rest of the class, I want you to work together to try to reproduce the following plots
To get help, I want you to learn how to use the ggplot2 webpage: http://ggplot2.org/
Number of posts by section
ggplot2 has several options to deal with overplotting. Take a look at the alpha argument to geom_point and try to reproduce the following plot
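As a hint, here is how alpha works on simulated data; it is not the solution for the Facebook data. The data frame toy and the value 0.1 are arbitrary choices for illustration.

```r
library("ggplot2")

set.seed(42)
# 5,000 simulated points that overlap heavily around the origin
toy = data.frame(x = rnorm(5000), y = rnorm(5000))

# alpha = 0.1 makes each point 10% opaque, so dense regions appear darker
p = ggplot(toy, aes(x = x, y = y)) +
  geom_point(alpha = 0.1)
p
```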
Another option is to use hexagonal binning. Take a look at geom_hex and try to reproduce the following plot (you need to install the package hexbin from CRAN)
You can customize titles using the labs function. Use this function and try to reproduce the following plot
You can control the baseline look of the plot using the theme function. Try to figure out what themes are available and reproduce the plot below
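A short sketch of both ideas on the built-in mtcars data (not the Facebook data). The title and axis labels are made up for illustration, and theme_minimal is just one of the complete themes that ship with ggplot2.

```r
library("ggplot2")

p = ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Fuel economy by weight", # plot title
       x = "Weight (1000 lbs)",          # x-axis label
       y = "Miles per gallon") +         # y-axis label
  theme_minimal()                        # swap in a complete built-in theme
p
```

Individual elements can be adjusted further with theme(), for example theme(legend.position = "bottom").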
geom_rug shows the marginal distribution of each of the plotting variables. Use this function to create the following plot
Use what you know about categorical variables to produce the following graph
Finally, I want you to think of a graph that we haven't done yet and try to figure out how to produce it in ggplot2. This can, for example, be
ggplot2