Lecture 3: Data Visualization in R

September, 2015

Quick announcements

Groups are coming

Rune is coming

Mikkel created some cool plots using our marijuana data

Data visualizations in R

R has excellent plotting libraries

Base R has its own plotting function (plot), which is useful once in a while. There is also the lattice library, which is very powerful but has a complicated syntax.

Today we will focus on ggplot2, written by Hadley Wickham

I'll show you a lot of examples using some data I downloaded from Facebook. Afterwards, you will work with ggplot2 in groups

source: Bob Rudis

source: Alex Bresler

source: Kyle Walker

source: Hillary Parker

`ggplot2` - a grammar of graphics in R

ggplot2 is based on the grammar of graphics, the idea that you can build every graph from the same few components: a dataset, a set of geoms (visual marks that represent data points), and a coordinate system.

A statistical graph is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system. Faceting can be used to generate the same plot for different subsets of the dataset. It is the combination of these independent components that make up a graphic.
- Wickham, ggplot2, p. 3.


Components of a graph

data: What you want to visualize, including variables (columns) to be mapped to aesthetic attributes.
geom: Geometric objects that are drawn to represent the data: bars, lines, points, etc.
stats: Statistical transformations of the data, such as binning or averaging.
scales: Map values in the data space to values in an aesthetic space (color, shape, size…)
coord: Coordinate system; provides axes and gridlines to make it possible to read the graph.
facets: Breaking up the data into subsets, to be displayed independently on a grid.
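
As a rough illustration of how these components line up with ggplot2 code, here is a minimal sketch. It uses mpg, an example dataset that ships with ggplot2, rather than the Politiken data, and the choice of variables is only an assumption made for the example.

```r
library("ggplot2")

# mpg is an example dataset bundled with ggplot2 (not the lecture data)
p = ggplot(data = mpg, aes(x = displ, y = hwy)) +  # data and aesthetic mappings
  geom_point() +                                   # geom: one point per car
  stat_smooth(method = "lm", formula = y ~ x) +    # stat: fitted linear trend
  scale_x_log10() +                                # scale: log-transform the x axis
  coord_cartesian() +                              # coord: Cartesian (the default)
  facet_wrap(~ drv)                                # facets: one panel per drive type
p
```

Each line adds one component, and dropping any of the last three still gives a valid plot because ggplot2 falls back on defaults.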

Data

We will work with the latest 5,000 Facebook posts from the Danish newspaper Politiken.

The script that downloads the data is on github.

Read the data

library("readr")

df = read_csv("https://raw.githubusercontent.com/sebastianbarfort/sds/master/data/politiken.csv", 
  col_types = list(created_time = col_character()))

df$created_time = parse_datetime(df$created_time)

Link to data

Plotting one (continuous) variable

Distribution of number of likes for each post.

library("ggplot2")
p = ggplot(data = df, aes(x = likes_count)) # data & aesthetics
p + geom_histogram() # add geom

Plotting one (continuous) variable: histogram

Distribution of (logged) number of likes for each post.

p = ggplot(data = df, aes(x = likes_count)) # data & aesthetics
p = p + geom_histogram() # add geom
p + scale_x_log10() # add log scale
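
One thing to keep in mind with a log scale: posts with zero likes have no finite log10 value, so ggplot2 drops them with a warning. The sketch below uses a small made-up vector of like counts (not the real data) to show the effect, and one possible workaround of adding 1 before taking logs.

```r
library("ggplot2")

# made-up like counts for illustration only
toy = data.frame(likes_count = c(0, 3, 25, 140, 0, 980))

log10(toy$likes_count)      # the zeros become -Inf
sum(toy$likes_count == 0)   # number of posts a log scale would drop

# one option (an assumption, not the lecture's approach): shift by one
p = ggplot(toy, aes(x = likes_count + 1)) +
  geom_histogram(bins = 5) +
  scale_x_log10()
p
```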

Plotting one (continuous) variable: density plot

p = ggplot(data = df, aes(x = likes_count)) # data & aesthetics
p = p + geom_density() # add geom
p + scale_x_log10() # add log scale

Plotting one (continuous) variable: area plot

p = ggplot(data = df, aes(x = likes_count)) # data & aesthetics
p = p + geom_area(stat = "bin") # add geom
p + scale_x_log10() # add log scale

Plotting one (continuous) variable: combining geoms

p = ggplot(data = df, aes(x = likes_count)) # data & aesthetics
p + geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "white") + # after_stat() replaces the older ..density.. notation (ggplot2 >= 3.3.0)
    geom_density(alpha = .2, fill = "#FF6666") + scale_x_log10()

Plotting two (continuous) variables

p = ggplot(data = df, aes(x = likes_count, y = comments_count))
p + geom_point() + scale_x_log10() + scale_y_log10() # add log scales

Plotting two (continuous) variables: smoothers

p + geom_point() + geom_smooth(na.rm = TRUE, 
  data = df[df$likes_count>0 & df$comments_count>0,]) + 
  geom_smooth(na.rm = TRUE, 
    data = df[df$likes_count>0 & df$comments_count>0,], 
    method = "lm", colour = "red") + 
  scale_x_log10() + scale_y_log10() # add log scales
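
Scale transformations are applied before statistics are computed, so the red lm smoother above is fit to the log10 values rather than the raw counts. A minimal sketch on simulated data (the data frame sim and its true slope of 0.8 are assumptions made up for this example) shows the equivalent model fitted by hand:

```r
library("ggplot2")

set.seed(1)
# simulated data: a power-law relationship with multiplicative noise
sim = data.frame(likes = 10^runif(200, min = 0, max = 3))
sim$comments = sim$likes^0.8 * 10^rnorm(200, sd = 0.2)

# the model that geom_smooth(method = "lm") sees after both log scales
fit = lm(log10(comments) ~ log10(likes), data = sim)
coef(fit)   # the slope should be close to the simulated value of 0.8

p = ggplot(sim, aes(x = likes, y = comments)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, colour = "red") +
  scale_x_log10() + scale_y_log10()
p
```

On a log-log plot the slope reads as an elasticity: a 1% increase in likes goes with roughly a slope-percent increase in comments.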

Plotting two (continuous) variables

Number of likes over time

p = ggplot(df, aes(x = as.Date(created_time), y = likes_count))
p + geom_line()
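
With several posts per day, a line through every single post can be hard to read. One alternative is to aggregate to daily totals first. The sketch below does this on simulated posts; the data frame posts and the column names total_likes and n_posts are made up for illustration.

```r
library("dplyr")
library("ggplot2")

set.seed(2)
# simulated posts over 30 days; timestamps and counts are made up
posts = data.frame(
  created_time = as.POSIXct("2015-08-01", tz = "UTC") +
    sort(runif(300, min = 0, max = 30 * 24 * 3600)),
  likes_count = rpois(300, lambda = 50)
)

# one row per day: total likes and number of posts
daily = posts %>%
  mutate(day = as.Date(created_time)) %>%
  group_by(day) %>%
  summarise(total_likes = sum(likes_count), n_posts = n())

p = ggplot(daily, aes(x = day, y = total_likes)) + geom_line()
p
```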

Categorical variables

We can see the section by following the link variable in the dataset

head(df$link, 3)

## [1] "http://politiken.dk/mad/ECE2837880/det-lille-mejeri-er-blevet-stor-industri/"                                
## [2] "http://politiken.dk/indland/ECE2839489/skoleelever-fodboldboern-og-socialt-udsatte-maa-betale-i-koebenhavn/" 
## [3] "http://politiken.dk/sport/ECE2839793/skolepige-skiftede-strutskoert-og-ballet-ud-med-knaldhaarde-tacklinger/"

We can grab the section using a regular expression

library("stringr")
df$section = str_extract(df$link, "\\.dk/[a-z]*") # escape the dot so it matches a literal "."
df$section = gsub("\\.dk/", "", df$section)       # strip the ".dk/" prefix
head(df$section, 5)

## [1] "mad"          "indland"      "sport"        "debat"       
## [5] "forbrugogliv"
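
An unescaped dot in a regular expression matches any character, so it is safer to write it as "\\." when a literal dot is meant. The sketch below applies this to the three links printed earlier and uses a capture group with str_match as an alternative to the extract-then-strip approach. The example.org address is a hypothetical URL made up to show the difference.

```r
library("stringr")

links = c(
  "http://politiken.dk/mad/ECE2837880/det-lille-mejeri-er-blevet-stor-industri/",
  "http://politiken.dk/indland/ECE2839489/skoleelever-fodboldboern-og-socialt-udsatte-maa-betale-i-koebenhavn/",
  "http://politiken.dk/sport/ECE2839793/skolepige-skiftede-strutskoert-og-ballet-ud-med-knaldhaarde-tacklinger/"
)

# column 2 of the str_match result holds the first capture group
section = str_match(links, "\\.dk/([a-z]*)")[, 2]
section

# hypothetical URL: the unescaped pattern matches "adk/", the escaped one does not
str_detect("http://example.org/adk/page", ".dk/")
str_detect("http://example.org/adk/page", "\\.dk/")
```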

Discrete X, continuous Y

Compute the average number of likes by section

library("dplyr")
df.section = df %>%
  filter(!is.na(section)) %>%
  filter(section != "") %>%
  group_by(section) %>%
  summarise(
    likes.pr.post = mean(likes_count, na.rm = TRUE)
  ) %>%
  arrange(desc(likes.pr.post))

Plot mean likes by section

p = ggplot(df.section, aes(x = reorder(section, likes.pr.post), 
  y = likes.pr.post))
p + geom_bar(stat = "identity") + coord_flip()
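
Recent versions of ggplot2 also offer geom_col, which is shorthand for geom_bar(stat = "identity"). A minimal sketch, using a toy summary table whose numbers are made up for illustration:

```r
library("ggplot2")

# toy summary table; the means are invented, not computed from the data
toy = data.frame(
  section = c("mad", "indland", "sport"),
  likes.pr.post = c(120, 340, 80)
)

# reorder() sorts the sections by their mean, smallest first
levels(reorder(toy$section, toy$likes.pr.post))

p = ggplot(toy, aes(x = reorder(section, likes.pr.post), y = likes.pr.post)) +
  geom_col() +
  coord_flip() +
  labs(x = "section", y = "mean likes per post")
p
```

After coord_flip the largest bar ends up at the top, which is usually the easiest ordering to read.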

Distribution of likes by weekday

Let's create a variable for the weekday of the post

library("lubridate")
df$weekday = wday(df$created_time, label = TRUE)
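
To see what wday returns, here is a small sketch on two hand-picked timestamps (1 September 2015 was a Tuesday, 5 September 2015 a Saturday). With label = TRUE the result is an ordered factor, which is why legends and facets come out in weekday order rather than alphabetically.

```r
library("lubridate")

x = as.POSIXct(c("2015-09-01 08:00:00", "2015-09-05 21:30:00"), tz = "UTC")

# numeric weekday; Sunday = 1 unless the lubridate.week.start option is changed
wday(x)

# labelled version: an ordered factor (label text depends on the locale)
w = wday(x, label = TRUE)
is.ordered(w)
levels(w)
```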

Let's plot the distribution of likes by weekday

p = ggplot(df, aes(x = likes_count, colour = weekday))
p = p + geom_density() + scale_x_log10()

Result

Two continuous and a categorical variable

Relationship between likes and comments by weekday

p = ggplot(df, aes(x = likes_count, y = comments_count))
p = p + geom_point() + scale_x_log10() + scale_y_log10() + 
  geom_smooth(na.rm = TRUE, 
    data = df[df$likes_count>0 & df$comments_count>0,]) +
  facet_wrap(~ weekday, scales = "free")

Result

Two continuous and two categorical variables

Let's compare the indland and udland parts of Politiken

df.subset = df %>%
  filter(section %in% c("indland", "udland"))

Now we can plot the relationship between likes and comments by weekday and section

p = ggplot(df.subset, aes(x = likes_count, y = comments_count))
p = p + geom_point() + scale_x_log10() + scale_y_log10() + 
  geom_smooth(na.rm = TRUE, 
  data = df.subset[df.subset$likes_count>0 & 
      df.subset$comments_count>0,]) +
  facet_grid(section~ weekday, scales = "free")
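
The difference between facet_wrap and facet_grid can be sketched on the mpg example dataset that ships with ggplot2 (not the Politiken data; the variables are chosen only for illustration). facet_wrap lays out the panels for one variable in a wrapped ribbon, while facet_grid crosses a row variable with a column variable.

```r
library("ggplot2")

base = ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

p_wrap = base + facet_wrap(~ class)        # one panel per class, wrapped into rows
p_grid = base + facet_grid(drv ~ year)     # rows = drv, columns = year

# scales = "free" lets the axes vary across panels, which shows
# within-panel patterns but makes panels harder to compare
p_free = base + facet_grid(drv ~ year, scales = "free")

p_wrap
p_grid
p_free
```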

Result

Multiple scales

Two categorical variables: tile plot

tab = data.frame(table(df$section, df$weekday))
names(tab) = c("section", "weekday", "count")
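
The same kind of table can also be built with dplyr's count function. The sketch below compares the two on a toy data frame (made up for illustration). The main difference is that table keeps combinations with zero observations, while count only returns the combinations that occur, so the tile plot would show gaps instead of zero-valued tiles.

```r
library("dplyr")

# toy data, invented for the example
toy = data.frame(
  section = c("mad", "sport", "mad", "indland", "sport", "mad"),
  weekday = c("Mon", "Mon", "Tue", "Tue", "Mon", "Mon")
)

# base R: every section x weekday combination, including zeros
tab = data.frame(table(toy$section, toy$weekday))
names(tab) = c("section", "weekday", "count")
tab

# dplyr: only the combinations that actually occur
tab2 = toy %>% count(section, weekday, name = "count")
tab2
```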

Tile plot

p = ggplot(tab, aes(x = section, y = weekday))
p = p + geom_tile(aes(fill = count))

Result

Your turn

Exercises

In the rest of the class, I want you to work together to try to reproduce the following plots

For getting help, I want you to learn how to use the ggplot2 webpage: http://ggplot2.org/

Plot 1

Number of posts by section

Plot 2

ggplot2 has several options to deal with overplotting. Take a look at the alpha argument to geom_point and try to reproduce the following plot

Plot 3

Another option is to use hexagonal binning. Take a look at geom_hex and try to reproduce the following plot (you need to install the package hexbin from CRAN)

Plot 4

You can customize titles using the labs function. Use it to try to reproduce the following plot

Plot 5

You can control the baseline look of the plot using the theme function. Try to figure out what themes are available and reproduce the plot below

Plot 6

geom_rug shows the marginal distribution of each plotted variable. Use this function to create the following plot

Plot 7

Use what you know about categorical variables to produce the following graph

Plot 8

Finally, I want you to think of a graph that we haven't done yet and try to figure out how to produce it in ggplot2. This can, for example, be

plot the number of likes by hour of the day Politiken posted the link
plot the distribution of likes depending on the type of link posted
is the relationship between likes and shares stronger for some sections than others?
if time permits, read up on plotting geographical data using ggplot2
