September, 2015

Quick announcements

Groups are coming

Rune is coming

Mikkel created some cool plots using our marijuana data

Data visualizations in R

R has excellent plotting libraries

Base R has the plot function, which is handy for quick checks. There is also the lattice library, which is very powerful but has a complicated syntax.
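To give a feel for the two alternatives, here is a minimal sketch. It uses the built-in mtcars dataset rather than our own data, and the axis labels are just illustrative choices.

```r
library("lattice")

# base graphics: one call, drawn immediately on the current device
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon")

# lattice: formula interface; "| factor(cyl)" draws one panel per cylinder count
lp = xyplot(mpg ~ wt | factor(cyl), data = mtcars)
print(lp)
```

Note that lattice returns a plot object that has to be printed, while base plot draws as a side effect.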

Today we will focus on ggplot2, written by Hadley Wickham

I'll show you a lot of examples using some data I downloaded from Facebook. Afterwards, you will work with ggplot2 in groups

ggplot2 - a grammar of graphics in R

ggplot2 is based on the grammar of graphics, the idea that you can build every graph from the same few components: a dataset, a set of geoms (visual marks that represent data points), and a coordinate system.

A statistical graph is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system. Faceting can be used to generate the same plot for different subsets of the dataset. It is the combination of these independent components that make up a graphic.
- Wickham, ggplot2, p. 3.

Components of a graph

  • data: What you want to visualize, including variables (columns) to be mapped to aesthetic attributes.
  • geom: Geometric objects that are drawn to represent the data: bars, lines, points, etc.
  • stats: Statistical transformations of the data, such as binning or averaging.
  • scales: Map values in the data space to values in an aesthetic space (color, shape, size…)
  • coord: Coordinate system; provides axes and gridlines to make it possible to read the graph.
  • facets: Breaking up the data into subsets, to be displayed independently on a grid.
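The components listed above can be sketched in a single plot. This example is only an illustration: it uses the built-in mtcars dataset (not the Politiken data), and the particular palette and axis limits are arbitrary choices.

```r
library("ggplot2")

# data + aesthetics: columns of mtcars mapped to x, y and colour
p = ggplot(data = mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +                            # geom: points
  stat_smooth(method = "lm", se = FALSE) +  # stat: linear fit per colour group
  scale_colour_brewer(palette = "Dark2") +  # scale: data values -> colours
  coord_cartesian(xlim = c(1, 6)) +         # coord: Cartesian axes with fixed x range
  facet_wrap(~ am)                          # facets: one panel per transmission type
p
```

Each line after the first adds one independent component, so any of them can be swapped or removed without touching the rest.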

Data

We will work with the latest 5,000 Facebook posts from the Danish newspaper Politiken.

The script that downloads the data is on GitHub.

Read the data

library("readr")

df = read_csv("https://raw.githubusercontent.com/sebastianbarfort/sds/master/data/politiken.csv", 
  col_types = list(created_time = col_character()))

df$created_time = parse_datetime(df$created_time)

Link to data

Plotting one (continuous) variable

Distribution of number of likes for each post.

library("ggplot2")
p = ggplot(data = df, aes(x = likes_count)) # data & aesthetics
p + geom_histogram() # add geom

Plotting one (continuous) variable: histogram

Distribution of (logged) number of likes for each post.

p = ggplot(data = df, aes(x = likes_count)) # data & aesthetics
p = p + geom_histogram() # add geom
p + scale_x_log10() # add log scale 
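One thing to keep in mind with log scales: posts with zero likes have no finite logarithm, so ggplot2 drops them (with a warning). A small sketch, using made-up like counts rather than the real data:

```r
library("ggplot2")

# hypothetical like counts, including two posts with zero likes
toy = data.frame(likes_count = c(0, 0, 1, 10, 100, 1000))

log10(toy$likes_count) # zeros become -Inf

p = ggplot(toy, aes(x = likes_count)) +
  geom_histogram(bins = 4) +
  scale_x_log10()

built = layer_data(p) # warns that non-finite values were removed
sum(built$count)      # number of posts that made it into the histogram
```

This is why some of the later plots filter on likes_count > 0 before fitting a smoother.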

Plotting one (continuous) variable: density plot

p = ggplot(data = df, aes(x = likes_count)) # data & aesthetics
p = p + geom_density() # add geom
p + scale_x_log10() # add log scale 

Plotting one (continuous) variable: area plot

p = ggplot(data = df, aes(x = likes_count)) # data & aesthetics
p = p + geom_area(stat = "bin") # add geom
p + scale_x_log10() # add log scale 

Plotting one (continuous) variable: combining geoms

p = ggplot(data = df, aes(x = likes_count)) # data & aesthetics
p + geom_histogram(aes(y = after_stat(density)), colour="black", fill="white") + 
    geom_density(alpha=.2, fill="#FF6666") + scale_x_log10() 

Plotting two (continuous) variables

p = ggplot(data = df, aes(x = likes_count, y = comments_count))
p + geom_point() + scale_x_log10() + scale_y_log10() # add log scales 

Plotting two (continuous) variables: smoothers

p + geom_point() + geom_smooth(na.rm = TRUE, 
  data = df[df$likes_count>0 & df$comments_count>0,]) + 
  geom_smooth(na.rm = TRUE, 
    data = df[df$likes_count>0 & df$comments_count>0,], 
    method = "lm", colour = "red") + 
  scale_x_log10() + scale_y_log10() # add log scales 

Plotting two (continuous) variables

Number of likes over time

p = ggplot(df, aes(x = as.Date(created_time), y = likes_count))
p + geom_line()

Categorical variables

The link variable in the dataset shows which section of the newspaper each post belongs to

head(df$link, 3)
## [1] "http://politiken.dk/mad/ECE2837880/det-lille-mejeri-er-blevet-stor-industri/"                                
## [2] "http://politiken.dk/indland/ECE2839489/skoleelever-fodboldboern-og-socialt-udsatte-maa-betale-i-koebenhavn/" 
## [3] "http://politiken.dk/sport/ECE2839793/skolepige-skiftede-strutskoert-og-ballet-ud-med-knaldhaarde-tacklinger/"

We can grab the section using a regular expression

library("stringr")
df$section = str_extract(df$link, "\\.dk/[a-z]*") # escape the dot so it matches a literal "."
df$section = gsub(".dk/", "", df$section, fixed = TRUE) # drop the ".dk/" prefix
head(df$section, 5)
## [1] "mad"          "indland"      "sport"        "debat"       
## [5] "forbrugogliv"
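As an alternative sketch, the two steps can be combined with a capture group. This uses str_match from stringr and the three example links shown earlier; it is one possible way to do it, not the only one.

```r
library("stringr")

links = c(
  "http://politiken.dk/mad/ECE2837880/det-lille-mejeri-er-blevet-stor-industri/",
  "http://politiken.dk/indland/ECE2839489/skoleelever-fodboldboern-og-socialt-udsatte-maa-betale-i-koebenhavn/",
  "http://politiken.dk/sport/ECE2839793/skolepige-skiftede-strutskoert-og-ballet-ud-med-knaldhaarde-tacklinger/"
)

# "\\.dk/" matches the literal ".dk/"; "([a-z]+)" captures the path segment after it
# str_match returns a matrix: column 1 is the full match, column 2 the captured group
section = str_match(links, "\\.dk/([a-z]+)")[, 2]
section
```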

Discrete X, continuous Y

Compute the average number of likes by section

library("dplyr")
df.section = df %>%
  filter(!is.na(section)) %>%
  filter(section != "") %>%
  group_by(section) %>%
  summarise(
    likes.pr.post = mean(likes_count, na.rm = TRUE)
  ) %>%
  arrange(desc(likes.pr.post))
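To see what the pipeline does step by step, here is the same idea on a tiny made-up data frame (the sections and numbers are invented for illustration):

```r
library("dplyr")

# hypothetical posts: one has no section, one has a missing like count
toy = data.frame(
  section = c("sport", "sport", "mad", "mad", "mad", NA),
  likes_count = c(10, 30, 5, NA, 15, 100)
)

toy.section = toy %>%
  filter(!is.na(section)) %>%   # drop posts without a section
  group_by(section) %>%         # one group per section
  summarise(likes.pr.post = mean(likes_count, na.rm = TRUE)) %>%
  arrange(desc(likes.pr.post))  # most liked section first
toy.section
```

The post without a section is dropped by filter, and na.rm = TRUE makes mean ignore the missing like count instead of returning NA.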

Plot mean likes by section

p = ggplot(df.section, aes(x = reorder(section, likes.pr.post), 
  y = likes.pr.post))
p + geom_bar(stat = "identity") + coord_flip()

Distribution of likes by weekday

Let's create a variable for the weekday of the post

library("lubridate")
df$weekday = wday(df$created_time, label = TRUE)
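A quick sketch of what wday returns, using two made-up timestamps. Without label = TRUE you get a number (1 = Sunday by default); with it you get an ordered factor whose labels depend on your locale.

```r
library("lubridate")

# hypothetical timestamps: a Thursday and a Saturday
created = ymd_hms(c("2015-09-10 08:15:00", "2015-09-12 21:40:00"), tz = "UTC")

day.number = wday(created)              # numeric, 1 = Sunday by default
day.label = wday(created, label = TRUE) # ordered factor, labels follow the locale
day.number
day.label
```

Because the result is an ordered factor, ggplot2 keeps the weekdays in calendar order in legends and facets.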

Let's plot the distribution of likes by weekday

p = ggplot(df, aes(x = likes_count, colour = weekday))
p = p + geom_density() + scale_x_log10()

Result

Two continuous and a categorical variable

Relationship between likes and comments by weekday

p = ggplot(df, aes(x = likes_count, y = comments_count))
p = p + geom_point() + scale_x_log10() + scale_y_log10() + 
  geom_smooth(na.rm = TRUE, 
    data = df[df$likes_count>0 & df$comments_count>0,]) +
  facet_wrap(~ weekday, scales = "free")

Result

Two continuous and two categorical variables

Let's compare the indland and udland parts of Politiken

df.subset = df %>%
  filter(section %in% c("indland", "udland"))

Now we can plot the relationship between likes and comments by weekday and section

p = ggplot(df.subset, aes(x = likes_count, y = comments_count))
p = p + geom_point() + scale_x_log10() + scale_y_log10() + 
  geom_smooth(na.rm = TRUE, 
    data = df.subset[df.subset$likes_count>0 & 
      df.subset$comments_count>0,]) +
  facet_grid(section ~ weekday, scales = "free")

Result

Multiple scales

Two categorical variables: tile plot

tab = data.frame(table(df$section, df$weekday))
names(tab) = c("section", "weekday", "count")
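To make the reshaping step concrete, here is a sketch on a tiny made-up data frame. Converting a two-way table to a data frame gives one row per combination, including combinations that never occur (count 0).

```r
# hypothetical posts with a section and a weekday
toy = data.frame(
  section = c("sport", "sport", "mad", "mad"),
  weekday = c("Mon", "Tue", "Mon", "Mon")
)

# table() cross-tabulates; data.frame() turns it into long format
toy.tab = data.frame(table(toy$section, toy$weekday))
names(toy.tab) = c("section", "weekday", "count")
toy.tab
```

The zero-count rows matter for the tile plot: they give every section and weekday combination a tile.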

Tile plot

p = ggplot(tab, aes(x = section, y = weekday))
p = p + geom_tile(aes(fill = count))

Result

Your turn

Exercises

For the rest of the class, I want you to work together to try to reproduce the following plots

To get help, I want you to learn how to use the ggplot2 webpage: http://ggplot2.org/

Plot 1

Number of posts by section

Plot 2

ggplot2 has several options to deal with overplotting. Take a look at the alpha argument to geom_point and try to reproduce the following plot

Plot 3

Another option is to use hexagonal binning. Take a look at geom_hex and try to reproduce the following plot (you need to install the package hexbin from CRAN)

Plot 4

You can customize titles using the labs function. Use it to try to reproduce the following

Plot 5

You can control the baseline look of the plot using the theme function. Try to figure out what themes are available and reproduce the plot below

Plot 6

geom_rug shows the marginal distribution of each plotting variable. Use this function to create the following plot

Plot 7

Use what you know about categorical variables to produce the following graph

Plot 8

Finally, I want you to think of a graph that we haven't done yet and try to figure out how to produce it in ggplot2. For example, you could

  • plot the number of likes by hour of the day Politiken posted the link
  • plot the distribution of likes depending on the type of link posted
  • is the relationship between likes and shares stronger for some sections than others?
  • if time permits, read up on plotting geographical data using ggplot2