October, 2015

Today

Turning lists into data frames

Tidy data

Regular expressions

Exam brainstorming

Turning lists into data frames

This is actually not always easy

When every element of the list has the same length, try the ldply function in the plyr package

library("plyr")
my.list = list(x="a", value=123, y = "hej", z = 4)
ldply(my.list)
##     .id  V1
## 1     x   a
## 2 value 123
## 3     y hej
## 4     z   4
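
For a flat named list like this one, a similar result can be sketched in base R. This only illustrates the shape of the output; it is not how plyr implements ldply, and the column names .id and V1 are chosen simply to mimic the output above.

```r
my.list <- list(x = "a", value = 123, y = "hej", z = 4)

# One row per list element: names in one column, values in another.
# unlist() coerces the mixed values to character, because a column holds one type.
flat <- data.frame(
  .id = names(my.list),
  V1 = unlist(my.list),
  row.names = NULL,
  stringsAsFactors = FALSE
)
flat
```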

Lists of lists

Slightly more complicated example

obs1 = list(x="a", value=123)
obs2 = list(x="b", value=27)
obs3 = list(x="c", value=99)
dlist = list(obs1, obs2, obs3)
dlist
## [[1]]
## [[1]]$x
## [1] "a"
## 
## [[1]]$value
## [1] 123
## 
## 
## [[2]]
## [[2]]$x
## [1] "b"
## 
## [[2]]$value
## [1] 27
## 
## 
## [[3]]
## [[3]]$x
## [1] "c"
## 
## [[3]]$value
## [1] 99

Lists of lists

ldply(dlist, data.frame)
##   x value
## 1 a   123
## 2 b    27
## 3 c    99
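
The same idea written out in base R, as a sketch: each inner list becomes a one-row data frame, and the rows are then stacked. The names rows and stacked are made up for this illustration.

```r
obs1 <- list(x = "a", value = 123)
obs2 <- list(x = "b", value = 27)
obs3 <- list(x = "c", value = 99)
dlist <- list(obs1, obs2, obs3)

# Turn each inner list into a one-row data frame ...
rows <- lapply(dlist, function(obs) data.frame(obs, stringsAsFactors = FALSE))

# ... then bind the rows together into a single data frame.
stacked <- do.call(rbind, rows)
stacked
```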

Lists of unequal length

library("readr")
df = read_csv("https://raw.githubusercontent.com/sebastianbarfort/sds/gh-pages/data/finanslov_tidy.csv")
my.list = strsplit(names(df), "t")
my.list
## [[1]]
## [1] "paragraf"
## 
## [[2]]
## [1] "hovedomrode"
## 
## [[3]]
## [1] "ak"  "ivi" "e"  
## 
## [[4]]
## [1] "hovedkon" "o"       
## 
## [[5]]
## [1] "aar"
## 
## [[6]]
## [1] "udgif"

What we can't do

ldply won't work here

ldply(my.list)
## Error in list_to_dataframe(res, attr(.data, "split_labels"), .id, id_as_factor): Results do not have equal lengths

The problem is that the elements of the list have unequal length

Analyzing the problem

Let's count the length of each element in the list

ldply(my.list, length)
##   V1
## 1  1
## 2  1
## 3  3
## 4  2
## 5  1
## 6  1

This is equivalent to

sapply(my.list, length)
## [1] 1 1 3 2 1 1

Pick only first element?

ldply(my.list, function(x) data.frame(x[1]))
##          x.1.
## 1    paragraf
## 2 hovedomrode
## 3          ak
## 4    hovedkon
## 5         aar
## 6       udgif

Pick all elements?

ldply(llply(my.list, rbind), cbind)
##             1    2    3
## 1    paragraf <NA> <NA>
## 2 hovedomrode <NA> <NA>
## 3          ak  ivi    e
## 4    hovedkon    o <NA>
## 5         aar <NA> <NA>
## 6       udgif <NA> <NA>
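
One way to see what is going on: pad every element with NA until all elements are equally long, then stack them. The base R sketch below assumes the list printed earlier, typed in by hand so that it runs without downloading the file; n.max, padded and wide are illustrative names.

```r
# The split column names from the earlier slide, entered manually.
my.list <- list("paragraf", "hovedomrode", c("ak", "ivi", "e"),
                c("hovedkon", "o"), "aar", "udgif")

# Length of the longest element.
n.max <- max(lengths(my.list))

# Pad each element with NA up to that length.
padded <- lapply(my.list, function(x) c(x, rep(NA, n.max - length(x))))

# Stack the equal-length vectors as rows of a data frame.
wide <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
wide
```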

Tidy data

tidyr

Happy families are all alike; every unhappy family is unhappy in its own way

Leo Tolstoy

Goal of tidyr: take your messy data and turn it into a tidy format

tidy data: observations are in the rows, variables are in the columns

Tidy data

library("readr")
library("dplyr")
library("tidyr")

df = read_csv("https://raw.githubusercontent.com/hadley/tidyr/master/vignettes/pew.csv")
head(df, 3)
## Source: local data frame [3 x 11]
## 
##   religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k $75-100k
##      (chr) (int)   (int)   (int)   (int)   (int)   (int)    (int)
## 1 Agnostic    27      34      60      81      76     137      122
## 2  Atheist    12      27      37      52      35      70       73
## 3 Buddhist    27      21      30      34      33      58       62
## Variables not shown: $100-150k (int), >150k (int), Don't know/refused
##   (int)

Question 1: What variables are in this dataset?

Question 2: What does a tidy version of this data look like?

The gather function

Objective: Reshaping wide format to long format

To tidy this data, we need to gather the non-variable columns into a two-column key-value pair

args(gather)
## function (data, key, value, ..., na.rm = FALSE, convert = FALSE) 
## NULL

Arguments:

  • data: data frame
  • key: name of the new column that will hold the gathered column names
  • value: name of the new column that will hold the cell values
  • ...: names of columns to gather (or not gather)
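
To make the arguments concrete, here is a minimal sketch on a made-up table. The data frame wide, its city names and its numbers are all hypothetical.

```r
library("tidyr")

# Hypothetical wide table: one row per city, one column per year.
wide <- data.frame(
  city = c("Aarhus", "Odense"),
  `2014` = c(10, 20),
  `2015` = c(11, 21),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# key = year receives the old column names, value = count receives the cells;
# -city means "gather every column except city".
long <- gather(wide, year, count, -city)
long
```

Each city now appears once per year, so the two-row wide table becomes a four-row long table.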

gather at work

df %>% gather(income, frequency, -religion)
## Source: local data frame [180 x 3]
## 
##                   religion income frequency
##                      (chr) (fctr)     (int)
## 1                 Agnostic  <$10k        27
## 2                  Atheist  <$10k        12
## 3                 Buddhist  <$10k        27
## 4                 Catholic  <$10k       418
## 5       Don’t know/refused  <$10k        15
## 6         Evangelical Prot  <$10k       575
## 7                    Hindu  <$10k         1
## 8  Historically Black Prot  <$10k       228
## 9        Jehovah's Witness  <$10k        20
## 10                  Jewish  <$10k        19
## ..                     ...    ...       ...

Alternatives

This

df %>% gather(income, frequency, 2:11)

returns the same as

df %>% gather(income, frequency, -religion)
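
A small check of that claim, as a sketch on a made-up stand-in for the pew data (pew.small and its counts are hypothetical). In tidyr 1.0.0 and later, gather is superseded by pivot_longer; the last call shows the newer spelling, which returns the same values in a different row order.

```r
library("tidyr")

# Made-up stand-in for the pew data: two religions, two income brackets.
pew.small <- data.frame(
  religion = c("A", "B"),
  `<$10k` = c(1, 2),
  `$10-20k` = c(3, 4),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Select the columns to gather by exclusion or by position.
by.name <- gather(pew.small, income, frequency, -religion)
by.pos <- gather(pew.small, income, frequency, 2:3)

# Newer tidyr interface (requires tidyr >= 1.0.0).
newer <- pivot_longer(pew.small, cols = -religion,
                      names_to = "income", values_to = "frequency")
```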

More complicated example

Billboard data

df = read_csv("https://raw.githubusercontent.com/hadley/tidyr/master/vignettes/billboard.csv")
head(df, 3)
## Source: local data frame [3 x 81]
## 
##    year       artist                   track  time date.entered   wk1
##   (int)        (chr)                   (chr) (chr)       (date) (int)
## 1  2000        2 Pac Baby Don't Cry (Keep...  4:22   2000-02-26    87
## 2  2000      2Ge+her The Hardest Part Of ...  3:15   2000-09-02    91
## 3  2000 3 Doors Down              Kryptonite  3:53   2000-04-08    81
## Variables not shown: wk2 (int), wk3 (int), wk4 (int), wk5 (int), wk6
##   (int), wk7 (int), wk8 (int), wk9 (int), wk10 (int), wk11 (int), wk12
##   (int), wk13 (int), wk14 (int), wk15 (int), wk16 (int), wk17 (int), wk18
##   (int), wk19 (int), wk20 (int), wk21 (int), wk22 (int), wk23 (int), wk24
##   (int), wk25 (int), wk26 (int), wk27 (int), wk28 (int), wk29 (int), wk30
##   (int), wk31 (int), wk32 (int), wk33 (int), wk34 (int), wk35 (int), wk36
##   (int), wk37 (int), wk38 (int), wk39 (int), wk40 (int), wk41 (int), wk42
##   (int), wk43 (int), wk44 (int), wk45 (int), wk46 (int), wk47 (int), wk48
##   (int), wk49 (int), wk50 (int), wk51 (int), wk52 (int), wk53 (int), wk54
##   (int), wk55 (int), wk56 (int), wk57 (int), wk58 (int), wk59 (int), wk60
##   (int), wk61 (int), wk62 (int), wk63 (int), wk64 (int), wk65 (int), wk66
##   (lgl), wk67 (lgl), wk68 (lgl), wk69 (lgl), wk70 (lgl), wk71 (lgl), wk72
##   (lgl), wk73 (lgl), wk74 (lgl), wk75 (lgl), wk76 (lgl)

Question: what are the variables here?

Tidying the Billboard data

To tidy this dataset, we first gather together all the wk columns. The column names give the week and the values are the ranks:

billboard2 = df %>% 
  gather(week, rank, wk1:wk76, na.rm = TRUE)
head(billboard2, 3)
## Source: local data frame [3 x 7]
## 
##    year       artist                   track  time date.entered   week
##   (int)        (chr)                   (chr) (chr)       (date) (fctr)
## 1  2000        2 Pac Baby Don't Cry (Keep...  4:22   2000-02-26    wk1
## 2  2000      2Ge+her The Hardest Part Of ...  3:15   2000-09-02    wk1
## 3  2000 3 Doors Down              Kryptonite  3:53   2000-04-08    wk1
## Variables not shown: rank (int)

Note:

What more would we want to do to the data?

Data cleaning

Let's turn the week into a numeric variable and create a proper date column

billboard3 = billboard2 %>%
  mutate(
    week = extract_numeric(week),
    date = as.Date(date.entered) + 7 * (week - 1)) %>%
  select(-date.entered) %>% 
  arrange(artist, track, week)
head(billboard3, 3)
## Source: local data frame [3 x 7]
## 
##    year artist                   track  time  week  rank       date
##   (int)  (chr)                   (chr) (chr) (dbl) (int)     (date)
## 1  2000  2 Pac Baby Don't Cry (Keep...  4:22     1    87 2000-02-26
## 2  2000  2 Pac Baby Don't Cry (Keep...  4:22     2    82 2000-03-04
## 3  2000  2 Pac Baby Don't Cry (Keep...  4:22     3    72 2000-03-11
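
The two steps inside mutate can be sketched in base R on a few values. The vector week.label is made up for illustration; the entry date is the one shown for the first track. Note that later tidyr versions deprecate extract_numeric in favour of readr::parse_number.

```r
# Week labels and an entry date in the same shape as the Billboard columns.
week.label <- c("wk1", "wk2", "wk3")
date.entered <- as.Date("2000-02-26")

# Strip everything that is not a digit, then convert to a number.
week <- as.numeric(gsub("[^0-9]", "", week.label))

# Week 1 is the entry date itself; every later week adds 7 days.
date <- date.entered + 7 * (week - 1)
data.frame(week, date)
```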

Even more complicated example

After gathering columns, the key column is sometimes a combination of multiple underlying variable names.

df = read_csv("https://raw.githubusercontent.com/hadley/tidyr/master/vignettes/tb.csv")
head(df, 3)
## Source: local data frame [3 x 22]
## 
##    iso2  year   m04  m514  m014 m1524 m2534 m3544 m4554 m5564   m65    mu
##   (chr) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int)
## 1    AD  1989    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
## 2    AD  1990    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
## 3    AD  1991    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
## Variables not shown: f04 (int), f514 (int), f014 (int), f1524 (int), f2534
##   (int), f3544 (int), f4554 (int), f5564 (int), f65 (int), fu (int)

Question: what are the variables here?

Answer

The dataset comes from the World Health Organization and records the counts of confirmed tuberculosis cases by country, year, and demographic group. The demographic groups are broken down by sex (m, f) and age (0-4, 5-14, 0-14, 15-24, 25-34, 35-44, 45-54, 55-64, 65+, unknown).

Gathering the non-variable columns

tb2 = df %>% 
  gather(demo, n, -iso2, -year, na.rm = TRUE)
head(tb2, 3)
## Source: local data frame [3 x 4]
## 
##    iso2  year   demo     n
##   (chr) (int) (fctr) (int)
## 1    AD  2005    m04     0
## 2    AD  2006    m04     0
## 3    AD  2008    m04     0

Is this dataset tidy?

Separating the demo variable

separate makes it easy to split a compound variable into individual variables. You can either pass it a regular expression to split on or a vector of character positions. In this case we want to split after the first character.

tb3 = tb2 %>% 
  separate(demo, c("sex", "age"), 1)
head(tb3, 3)
## Source: local data frame [3 x 5]
## 
##    iso2  year   sex   age     n
##   (chr) (int) (chr) (chr) (int)
## 1    AD  2005     m    04     0
## 2    AD  2006     m    04     0
## 3    AD  2008     m    04     0
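
Both ways of splitting, sketched on toy data. The data frames toy and toy2 are made up; toy2 is a hypothetical underscore-separated version of the same codes, used only to show the regular-expression form.

```r
library("tidyr")

# Position: split after the first character.
toy <- data.frame(demo = c("m04", "f1524", "fu"), stringsAsFactors = FALSE)
by.pos <- separate(toy, demo, c("sex", "age"), sep = 1)

# Regular expression: split on a separator that occurs in the values.
toy2 <- data.frame(demo = c("m_04", "f_1524", "f_u"), stringsAsFactors = FALSE)
by.regex <- separate(toy2, demo, c("sex", "age"), sep = "_")

by.pos
```

Because convert is FALSE by default, age stays a character column and the leading zero in "04" is kept.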

Question: Compare tb3 to the original data frame (df). What are the advantages of having our data stored in a tidy format?

Reshaping from long to wide format

There are times when we are required to turn long formatted data into wide formatted data. The spread function spreads a key-value pair across multiple columns.

args(spread)
## function (data, key, value, fill = NA, convert = FALSE, drop = TRUE) 
## NULL

Arguments:

  • data: data frame
  • key: column whose values become the names of the new columns
  • value: column whose values fill the new columns
  • fill: value to substitute when a combination of the other variables and the key column has no value

spread in action

tb3.wide = tb3 %>% spread(age, n)
tb3.wide
## Source: local data frame [4,885 x 13]
## 
##     iso2  year   sex   014    04  1524  2534  3544  4554   514  5564    65
##    (chr) (int) (chr) (int) (int) (int) (int) (int) (int) (int) (int) (int)
## 1     AD  1996     f     0    NA     1     1     0     0    NA     1     0
## 2     AD  1996     m     0    NA     0     0     4     1    NA     0     0
## 3     AD  1997     f     0    NA     1     2     3     0    NA     0     1
## 4     AD  1997     m     0    NA     0     1     2     2    NA     1     6
## 5     AD  1998     m     0    NA     0     0     1     0    NA     0     0
## 6     AD  1999     f     0    NA     0     0     1     0    NA     0     0
## 7     AD  1999     m     0    NA     0     0     1     1    NA     0     0
## 8     AD  2000     m     0    NA     0     1     0     0    NA     0     0
## 9     AD  2001     m     0    NA    NA    NA     2     1    NA    NA    NA
## 10    AD  2002     f     0    NA     1     0     0     0    NA     0     0
## ..   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
## Variables not shown: u (int)
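
A minimal sketch of what fill does, on a made-up long table (long, wide.na and wide.zero are illustrative names). The combination id = "b", key = "y" is deliberately missing.

```r
library("tidyr")

# Hypothetical long table with one missing combination (b / y).
long <- data.frame(
  id = c("a", "a", "b"),
  key = c("x", "y", "x"),
  n = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Default: the missing cell becomes NA.
wide.na <- spread(long, key, n)

# With fill = 0 the missing cell becomes 0 instead.
wide.zero <- spread(long, key, n, fill = 0)
wide.zero
```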

Regular Expressions

Introduction to regular expressions

A regular expression is a pattern that describes a specific set of strings with a common structure. Regular expressions are heavily used for string matching and replacing in most programming languages, although the exact syntax differs a bit between them.

Regular expressions typically specify characters (or character classes) to seek out, possibly with information about repeats and location within the string. This is accomplished with the help of metacharacters that have specific meaning: $ * + . ? [ ] ^ { } | ( ) \

I won't lie: they can be difficult

Quantifiers

Quantifiers specify how many times the preceding pattern may repeat

  • *: matches at least 0 times
  • +: matches at least 1 time
  • ?: matches at most 1 time
  • {n}: matches exactly n times
  • {n,}: matches at least n times
  • {n,m}: matches between n and m times

Example

strings = c("a", "ab", "acb", "accb", "acccb", "accccb")
strings
## [1] "a"      "ab"     "acb"    "accb"   "acccb"  "accccb"
grep("ac*b", strings, value = TRUE)
## [1] "ab"     "acb"    "accb"   "acccb"  "accccb"
grep("ac+b", strings, value = TRUE)
## [1] "acb"    "accb"   "acccb"  "accccb"
grep("ac?b", strings, value = TRUE)
## [1] "ab"  "acb"

grep("ac{2}b", strings, value = TRUE)
## [1] "accb"
grep("ac{2,}b", strings, value = TRUE)
## [1] "accb"   "acccb"  "accccb"
grep("ac{2,3}b", strings, value = TRUE)
## [1] "accb"  "acccb"

Operators

  • .: matches any single character.
  • [...]: a character list, matches any one of the characters inside the square brackets. We can also use - inside the brackets to specify a range of characters.
  • [^...]: an inverted character list, similar to [...], but matches any character except those inside the square brackets.
  • |: an “or” operator, matches patterns on either side of the |.
  • \: suppresses the special meaning of a metacharacter. Inside an R string the backslash itself must be doubled, e.g. "\\^".
  • ^: matches the start of the string.

Example

strings = c("^ab", "ab", "abc", "abd", "abe", "ab 12")
grep("ab.", strings, value = TRUE)
## [1] "abc"   "abd"   "abe"   "ab 12"
grep("ab[c-e]", strings, value = TRUE)
## [1] "abc" "abd" "abe"
grep("ab[^c]", strings, value = TRUE)
## [1] "abd"   "abe"   "ab 12"

grep("^ab", strings, value = TRUE)
## [1] "ab"    "abc"   "abd"   "abe"   "ab 12"
grep("\\^ab", strings, value = TRUE)
## [1] "^ab"
grep("abc|abd", strings, value = TRUE)
## [1] "abc" "abd"
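
The same escaping rule matters when replacing. A short sketch with gsub (the string x is made up): an unescaped dot matches every character, an escaped dot matches only a literal dot, and fixed = TRUE switches regular expressions off altogether.

```r
x <- "a.b"

# "." is a metacharacter, so every character is replaced.
all.replaced <- gsub(".", "-", x)

# Escaped: only the literal dot is replaced.
dot.escaped <- gsub("\\.", "-", x)

# fixed = TRUE treats the pattern as plain text.
dot.fixed <- gsub(".", "-", x, fixed = TRUE)

c(all.replaced, dot.escaped, dot.fixed)
```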

Character classes

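Character classes let you specify entire classes of characters, such as numbers, letters, etc. The POSIX classes ([:digit:], [:alpha:], ...) must be written inside a bracket expression, e.g. [[:digit:]].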

  • [:digit:] or \d: digits, 0 1 2 3 4 5 6 7 8 9, equivalent to [0-9].
  • \D: non-digits, equivalent to [^0-9].
  • [:lower:]: lower-case letters, equivalent to [a-z].
  • [:upper:]: upper-case letters, equivalent to [A-Z].
  • [:alpha:]: alphabetic characters, equivalent to [[:lower:][:upper:]] or [A-Za-z].
  • [:alnum:]: alphanumeric characters, equivalent to [[:alpha:][:digit:]] or [A-Za-z0-9].
  • \w: word characters, equivalent to [[:alnum:]_] or [A-Za-z0-9_].
  • \W: not word, equivalent to [^A-Za-z0-9_].

  • [:blank:]: blank characters, i.e. space and tab.
  • [:space:]: space characters: tab, newline, vertical tab, form feed, carriage return, space.
  • \s: whitespace characters, equivalent to [[:space:]].
  • \S: not whitespace.
  • [:punct:]: punctuation characters, ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
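
A few of these classes in use, as a sketch on a made-up vector x. Two things to watch in R: POSIX classes go inside an extra pair of brackets, and backslash shorthands need a doubled backslash in the string.

```r
x <- c("abc", "ab 12", "A_1", "hello!")

# POSIX class inside a bracket expression: [[:digit:]], not [:digit:].
has.digit <- grepl("[[:digit:]]", x)

# The backslash shorthand gives the same answer.
has.digit2 <- grepl("\\d", x)

# \s matches any whitespace character, including a tab.
tab.is.space <- grepl("\\s", "a\tb")

# [[:punct:]] covers the underscore as well as the exclamation mark.
no.punct <- gsub("[[:punct:]]", "", x)
no.punct
```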

Exam brainstorming

Where are we at?

You know the basics of plotting and data manipulation in R

By next week, you will know how to gather data from the web as well

That means it's time to start thinking about the exam project

The boring stuff

The exam project has to be handed in no later than 11 December 2015

You will also need to have a project description approved. The deadline for this is 16 November

What do we expect?

For the exam, students are expected to pose an interesting social science question and attempt to answer it using standard academic practices, including original data collection and statistical analysis.

Points for

  • collecting new data
  • good ideas
  • clear presentation
  • using statistical techniques correctly

Where to gather data?

That's really up to you (so you need to think)

I've had fun collecting data from the Danish Superliga, Goodreads, voting patterns in the European Parliament, the U.S. Congress, the New York Times, Twitter, Instagram, Facebook, MoMA, etc.

… but it depends on your interests!

So spend the last part of today's class brainstorming with your group