Turning lists into data frames
Tidy data
Regular expressions
Exam brainstorming
October, 2015
Turning a list into a data frame is not always easy
When the elements of the list are of equal length, try the ldply function in the plyr package
library("plyr")
my.list = list(x="a", value=123, y = "hej", z = 4)
ldply(my.list)
##     .id  V1
## 1     x   a
## 2 value 123
## 3     y hej
## 4     z   4
Slightly more complicated example
obs1 = list(x="a", value=123)
obs2 = list(x="b", value=27)
obs3 = list(x="c", value=99)
dlist = list(obs1, obs2, obs3)
dlist
## [[1]]
## [[1]]$x
## [1] "a"
##
## [[1]]$value
## [1] 123
##
##
## [[2]]
## [[2]]$x
## [1] "b"
##
## [[2]]$value
## [1] 27
##
##
## [[3]]
## [[3]]$x
## [1] "c"
##
## [[3]]$value
## [1] 99
ldply(dlist, data.frame)
##   x value
## 1 a   123
## 2 b    27
## 3 c    99
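For comparison, a minimal base R sketch of the same row-binding step. It assumes every element of the list carries the same names; the names rows and df.base are made up for this illustration.

```r
# Three observations stored as a list of lists, as above
obs1 = list(x = "a", value = 123)
obs2 = list(x = "b", value = 27)
obs3 = list(x = "c", value = 99)
dlist = list(obs1, obs2, obs3)

# Turn each element into a one-row data frame, then stack the rows
rows = lapply(dlist, function(obs) data.frame(obs, stringsAsFactors = FALSE))
df.base = do.call(rbind, rows)
df.base
```

This only works when the elements line up column by column, which is the same restriction ldply has.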
library("readr")
df = read_csv("https://raw.githubusercontent.com/sebastianbarfort/sds/gh-pages/data/finanslov_tidy.csv")
my.list = strsplit(names(df), "t")
my.list
## [[1]]
## [1] "paragraf"
##
## [[2]]
## [1] "hovedomrode"
##
## [[3]]
## [1] "ak"  "ivi" "e"
##
## [[4]]
## [1] "hovedkon" "o"
##
## [[5]]
## [1] "aar"
##
## [[6]]
## [1] "udgif"
ldply won't work here
ldply(my.list)
## Error in list_to_dataframe(res, attr(.data, "split_labels"), .id, id_as_factor): Results do not have equal lengths
The problem is that the elements of the list have unequal length
Let's count the length of each element in the list
ldply(my.list, length)
##   V1
## 1  1
## 2  1
## 3  3
## 4  2
## 5  1
## 6  1
This is equivalent to
sapply(my.list, length)
## [1] 1 1 3 2 1 1
ldply(my.list, function(x) data.frame(x[1]))
##          x.1.
## 1    paragraf
## 2 hovedomrode
## 3          ak
## 4    hovedkon
## 5         aar
## 6       udgif
ldply(llply(my.list, rbind), cbind)
##             1    2    3
## 1    paragraf <NA> <NA>
## 2 hovedomrode <NA> <NA>
## 3          ak  ivi    e
## 4    hovedkon    o <NA>
## 5         aar <NA> <NA>
## 6       udgif <NA> <NA>
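The padding idea behind the last call can also be sketched in base R. This assumes missing positions should become NA; the shortened list and the names n.max, padded and df.padded are made up for this illustration.

```r
# A list whose elements have unequal lengths
my.list = list("paragraf", c("ak", "ivi", "e"), c("hovedkon", "o"))

# Length of the longest element
n.max = max(sapply(my.list, length))

# Indexing past the end of a vector returns NA, so this pads every element
padded = lapply(my.list, function(x) x[seq_len(n.max)])

# Stack the padded vectors as rows of a data frame
df.padded = as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
df.padded
```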
tidyr
Happy families are all alike; every unhappy family is unhappy in its own way
Leo Tolstoy
Goal of tidyr: take your messy data and turn it into a tidy format
tidy data: observations are in the rows, variables are in the columns
library("readr")
library("dplyr")
library("tidyr")
df = read_csv("https://raw.githubusercontent.com/hadley/tidyr/master/vignettes/pew.csv")
head(df, 3)
## Source: local data frame [3 x 11]
##
##   religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k $75-100k
##      (chr) (int)   (int)   (int)   (int)   (int)   (int)    (int)
## 1 Agnostic    27      34      60      81      76     137      122
## 2  Atheist    12      27      37      52      35      70       73
## 3 Buddhist    27      21      30      34      33      58       62
## Variables not shown: $100-150k (int), >150k (int), Don't know/refused
##   (int)
Question 1: What variables are in this dataset?
Question 2: What does a tidy version of this data look like?
The gather function
Objective: reshaping wide format to long format
To tidy this data, we need to gather the non-variable columns into a two-column key-value pair
args(gather)
## function (data, key, value, ..., na.rm = FALSE, convert = FALSE)
## NULL
Arguments:
data: data frame
key: column name representing new variable
value: column name representing variable values
...: names of columns to gather (or not gather)
gather at work
df %>% gather(income, frequency, -religion)
## Source: local data frame [180 x 3]
##
##                   religion income frequency
##                      (chr) (fctr)     (int)
## 1                 Agnostic  <$10k        27
## 2                  Atheist  <$10k        12
## 3                 Buddhist  <$10k        27
## 4                 Catholic  <$10k       418
## 5       Don’t know/refused  <$10k        15
## 6         Evangelical Prot  <$10k       575
## 7                    Hindu  <$10k         1
## 8  Historically Black Prot  <$10k       228
## 9        Jehovah's Witness  <$10k        20
## 10                  Jewish  <$10k        19
## ..                     ...    ...       ...
This
df %>% gather(income, frequency, 2:11)
returns the same as
df %>% gather(income, frequency, -religion)
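A small self-contained sketch of the same equivalence, so it can be run without downloading anything. The table wide and its column names low and high are made up and only mimic the shape of the Pew data.

```r
library("tidyr")

# A small made-up table in the same wide shape as the Pew data
wide = data.frame(
  religion = c("Agnostic", "Atheist"),
  low = c(27, 12),
  high = c(34, 27),
  stringsAsFactors = FALSE
)

# Select the columns to gather by exclusion ...
long1 = gather(wide, income, frequency, -religion)
# ... or by position
long2 = gather(wide, income, frequency, 2:3)

long1
# Compare the two results
all.equal(long1, long2)
```

Selecting by exclusion tends to be the safer habit: it keeps working if the columns are reordered or new income brackets are added.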
Billboard data
df = read_csv("https://raw.githubusercontent.com/hadley/tidyr/master/vignettes/billboard.csv")
head(df, 3)
## Source: local data frame [3 x 81]
##
##    year       artist                   track  time date.entered   wk1
##   (int)        (chr)                   (chr) (chr)       (date) (int)
## 1  2000        2 Pac Baby Don't Cry (Keep...  4:22   2000-02-26    87
## 2  2000      2Ge+her The Hardest Part Of ...  3:15   2000-09-02    91
## 3  2000 3 Doors Down              Kryptonite  3:53   2000-04-08    81
## Variables not shown: wk2 (int), wk3 (int), wk4 (int), wk5 (int), wk6
##   (int), wk7 (int), wk8 (int), wk9 (int), wk10 (int), wk11 (int), wk12
##   (int), wk13 (int), wk14 (int), wk15 (int), wk16 (int), wk17 (int), wk18
##   (int), wk19 (int), wk20 (int), wk21 (int), wk22 (int), wk23 (int), wk24
##   (int), wk25 (int), wk26 (int), wk27 (int), wk28 (int), wk29 (int), wk30
##   (int), wk31 (int), wk32 (int), wk33 (int), wk34 (int), wk35 (int), wk36
##   (int), wk37 (int), wk38 (int), wk39 (int), wk40 (int), wk41 (int), wk42
##   (int), wk43 (int), wk44 (int), wk45 (int), wk46 (int), wk47 (int), wk48
##   (int), wk49 (int), wk50 (int), wk51 (int), wk52 (int), wk53 (int), wk54
##   (int), wk55 (int), wk56 (int), wk57 (int), wk58 (int), wk59 (int), wk60
##   (int), wk61 (int), wk62 (int), wk63 (int), wk64 (int), wk65 (int), wk66
##   (lgl), wk67 (lgl), wk68 (lgl), wk69 (lgl), wk70 (lgl), wk71 (lgl), wk72
##   (lgl), wk73 (lgl), wk74 (lgl), wk75 (lgl), wk76 (lgl)
Question: what are the variables here?
To tidy this dataset, we first gather together all the wk columns. The column names give the week and the values are the ranks:
billboard2 = df %>% gather(week, rank, wk1:wk76, na.rm = TRUE)
head(billboard2, 3)
## Source: local data frame [3 x 7]
##
##    year       artist                   track  time date.entered   week
##   (int)        (chr)                   (chr) (chr)       (date) (fctr)
## 1  2000        2 Pac Baby Don't Cry (Keep...  4:22   2000-02-26    wk1
## 2  2000      2Ge+her The Hardest Part Of ...  3:15   2000-09-02    wk1
## 3  2000 3 Doors Down              Kryptonite  3:53   2000-04-08    wk1
## Variables not shown: rank (int)
Note: na.rm = TRUE drops the rows where rank is missing
What more would we want to do to the data?
Let's turn the week into a numeric variable and create a proper date column
billboard3 = billboard2 %>%
  mutate(
    week = extract_numeric(week),
    date = as.Date(date.entered) + 7 * (week - 1)) %>%
  select(-date.entered) %>%
  arrange(artist, track, week)
head(billboard3, 3)
## Source: local data frame [3 x 7]
##
##    year artist                   track  time  week  rank       date
##   (int)  (chr)                   (chr) (chr) (dbl) (int)     (date)
## 1  2000  2 Pac Baby Don't Cry (Keep...  4:22     1    87 2000-02-26
## 2  2000  2 Pac Baby Don't Cry (Keep...  4:22     2    82 2000-03-04
## 3  2000  2 Pac Baby Don't Cry (Keep...  4:22     3    72 2000-03-11
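The two computations inside mutate can be sketched in base R on a few made-up values. Later tidyr releases mark extract_numeric as deprecated and point to readr::parse_number, so this sketch strips the prefix with sub instead; the names week.num and date are made up for this illustration.

```r
# Week labels and entry dates in the format used by the Billboard data
week = c("wk1", "wk2", "wk3")
date.entered = as.Date(c("2000-02-26", "2000-02-26", "2000-02-26"))

# Strip the "wk" prefix and convert what is left to a number
week.num = as.numeric(sub("^wk", "", week))

# Week 1 is the entry date itself; each later week adds seven days
date = date.entered + 7 * (week.num - 1)
data.frame(week = week.num, date = date)
```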
After gathering columns, the key column is sometimes a combination of multiple underlying variable names.
df = read_csv("https://raw.githubusercontent.com/hadley/tidyr/master/vignettes/tb.csv")
head(df, 3)
## Source: local data frame [3 x 22]
##
##    iso2  year   m04  m514  m014 m1524 m2534 m3544 m4554 m5564   m65    mu
##   (chr) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int)
## 1    AD  1989    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
## 2    AD  1990    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
## 3    AD  1991    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
## Variables not shown: f04 (int), f514 (int), f014 (int), f1524 (int), f2534
##   (int), f3544 (int), f4554 (int), f5564 (int), f65 (int), fu (int)
Question: what are the variables here?
The dataset comes from the World Health Organisation, and records the counts of confirmed tuberculosis cases by country, year, and demographic group. The demographic groups are broken down by sex (m, f) and age (0-4, 5-14, 0-14, 15-24, 25-34, 35-44, 45-54, 55-64, 65+, unknown).
tb2 = df %>% gather(demo, n, -iso2, -year, na.rm = TRUE)
head(tb2, 3)
## Source: local data frame [3 x 4]
##
##    iso2  year   demo     n
##   (chr) (int) (fctr) (int)
## 1    AD  2005    m04     0
## 2    AD  2006    m04     0
## 3    AD  2008    m04     0
Is this dataset tidy?
The demo variable
separate makes it easy to split a compound variable into individual variables. You can either pass it a regular expression to split on or a vector of character positions. In this case we want to split after the first character.
tb3 = tb2 %>% separate(demo, c("sex", "age"), 1)
head(tb3, 3)
## Source: local data frame [3 x 5]
##
##    iso2  year   sex   age     n
##   (chr) (int) (chr) (chr) (int)
## 1    AD  2005     m    04     0
## 2    AD  2006     m    04     0
## 3    AD  2008     m    04     0
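A self-contained sketch of splitting by position, on made-up rows shaped like tb2. The names tb.toy, tb.sep, sex.base and age.base are made up for this illustration; the base R lines at the end show the same split done by hand.

```r
library("tidyr")

# Made-up rows in the same shape as tb2
tb.toy = data.frame(
  iso2 = c("AD", "AD"),
  demo = c("m04", "f1524"),
  n = c(0, 1),
  stringsAsFactors = FALSE
)

# sep = 1 splits after the first character
tb.sep = separate(tb.toy, demo, c("sex", "age"), sep = 1)
tb.sep

# The same split by hand: first character, then everything after it
sex.base = substr(tb.toy$demo, 1, 1)
age.base = substring(tb.toy$demo, 2)
```

Note that age stays a character column, which is why "04" keeps its leading zero.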
Question: Compare tb3 to the original data frame (df). What are the advantages of having our data stored in a tidy format?
There are times when we are required to turn long formatted data into wide formatted data. The spread function spreads a key-value pair across multiple columns.
args(spread)
## function (data, key, value, fill = NA, convert = FALSE, drop = TRUE)
## NULL
data: data frame
key: column values to convert to multiple columns
value: single column values to convert to multiple columns' values
fill: If there isn't a value for every combination of the other variables and the key column, this value will be substituted
spread in action
tb3.wide = tb3 %>% spread(age, n)
tb3.wide
## Source: local data frame [4,885 x 13]
##
##     iso2  year   sex   014    04  1524  2534  3544  4554   514  5564    65
##    (chr) (int) (chr) (int) (int) (int) (int) (int) (int) (int) (int) (int)
## 1     AD  1996     f     0    NA     1     1     0     0    NA     1     0
## 2     AD  1996     m     0    NA     0     0     4     1    NA     0     0
## 3     AD  1997     f     0    NA     1     2     3     0    NA     0     1
## 4     AD  1997     m     0    NA     0     1     2     2    NA     1     6
## 5     AD  1998     m     0    NA     0     0     1     0    NA     0     0
## 6     AD  1999     f     0    NA     0     0     1     0    NA     0     0
## 7     AD  1999     m     0    NA     0     0     1     1    NA     0     0
## 8     AD  2000     m     0    NA     0     1     0     0    NA     0     0
## 9     AD  2001     m     0    NA    NA    NA     2     1    NA    NA    NA
## 10    AD  2002     f     0    NA     1     0     0     0    NA     0     0
## ..   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
## Variables not shown: u (int)
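A small sketch of what the fill argument does, on made-up long data in which one sex and age combination is absent. The data frame long and the results wide and wide.zero are made up for this illustration.

```r
library("tidyr")

# Made-up long data: there is no row for sex "m" and age "2534"
long = data.frame(
  iso2 = c("AD", "AD", "AD"),
  sex = c("f", "f", "m"),
  age = c("1524", "2534", "1524"),
  n = c(1, 2, 0),
  stringsAsFactors = FALSE
)

# Each distinct value of age becomes a column; the missing combination gets NA
wide = spread(long, age, n)
wide

# With fill = 0 the missing combination is filled with zero instead
wide.zero = spread(long, age, n, fill = 0)
wide.zero
```

Whether zero is the right fill depends on the data: here a missing row could just as well mean "not reported", in which case NA is the more honest choice.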
A regular expression is a pattern that describes a specific set of strings with a common structure. Regular expressions are heavily used for string matching and replacing in most programming languages, although the specific syntax may differ a bit.
Regular expressions typically specify characters (or character classes) to seek out, possibly with information about repeats and location within the string. This is accomplished with the help of metacharacters that have specific meaning: $ * + . ? [ ] ^ { } | ( ) \
I won't lie: they can be difficult
Quantifiers specify how many repetitions of the pattern to match
*: matches at least 0 times
+: matches at least 1 time
?: matches at most 1 time
{n}: matches exactly n times
{n,}: matches at least n times
{n,m}: matches between n and m times
strings = c("a", "ab", "acb", "accb", "acccb", "accccb")
strings
## [1] "a" "ab" "acb" "accb" "acccb" "accccb"
grep("ac*b", strings, value = TRUE)
## [1] "ab" "acb" "accb" "acccb" "accccb"
grep("ac+b", strings, value = TRUE)
## [1] "acb" "accb" "acccb" "accccb"
grep("ac?b", strings, value = TRUE)
## [1] "ab" "acb"
grep("ac{2}b", strings, value = TRUE)
## [1] "accb"
grep("ac{2,}b", strings, value = TRUE)
## [1] "accb" "acccb" "accccb"
grep("ac{2,3}b", strings, value = TRUE)
## [1] "accb" "acccb"
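The three single-character quantifiers are shorthand for the brace forms, which the following sketch checks on the same strings. The names star, plus and qmark are made up for this illustration.

```r
strings = c("a", "ab", "acb", "accb", "acccb", "accccb")

# * is shorthand for {0,}, + for {1,} and ? for {0,1}
star = identical(grep("ac*b", strings), grep("ac{0,}b", strings))
plus = identical(grep("ac+b", strings), grep("ac{1,}b", strings))
qmark = identical(grep("ac?b", strings), grep("ac{0,1}b", strings))
c(star = star, plus = plus, qmark = qmark)

# grepl returns one TRUE/FALSE per string instead of the matching strings
in.range = grepl("ac{2,3}b", strings)
in.range
```

The logical vector from grepl is handy for subsetting or for filtering rows of a data frame.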
.: matches any single character, as shown in the first example below.
[...]: a character list, matches any one of the characters inside the square brackets. We can also use - inside the brackets to specify a range of characters.
[^...]: an inverted character list, similar to [...], but matches any character except those inside the square brackets.
|: an "or" operator, matches patterns on either side of the |.
\: suppresses the special meaning of metacharacters in a regular expression.
^: matches the start of the string.
strings = c("^ab", "ab", "abc", "abd", "abe", "ab 12")
grep("ab.", strings, value = TRUE)
## [1] "abc" "abd" "abe" "ab 12"
grep("ab[c-e]", strings, value = TRUE)
## [1] "abc" "abd" "abe"
grep("ab[^c]", strings, value = TRUE)
## [1] "abd" "abe" "ab 12"
grep("^ab", strings, value = TRUE)
## [1] "ab" "abc" "abd" "abe" "ab 12"
grep("\\^ab", strings, value = TRUE)
## [1] "^ab"
grep("abc|abd", strings, value = TRUE)
## [1] "abc" "abd"
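Two variations on the examples above, as a sketch: matching a metacharacter literally with fixed = TRUE instead of escaping it, and grouping alternatives with parentheses. The names escaped, literal and grouped are made up for this illustration.

```r
strings = c("^ab", "ab", "abc", "abd", "abe", "ab 12")

# Escaping the caret makes it a literal character
escaped = grep("\\^ab", strings, value = TRUE)

# fixed = TRUE treats the whole pattern as a literal string
literal = grep("^ab", strings, value = TRUE, fixed = TRUE)

# Both approaches find the same strings
identical(escaped, literal)

# Parentheses group the alternatives, so "ab" only has to be typed once
grouped = grep("ab(c|d)", strings, value = TRUE)
grouped
```

fixed = TRUE is the simpler option when the pattern contains no regular expression features at all.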
Character classes allow you to specify entire classes of characters, such as numbers, letters, etc.
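A quick sketch of character classes at work on made-up strings. The names with.digit and letters.only are made up for this illustration, and the remark about [A-z] assumes ASCII ordering of characters.

```r
strings = c("abc", "ABC", "123", "a1", "_", "a b")

# POSIX classes go inside a bracket expression: [[:digit:]], not [:digit:]
with.digit = grep("[[:digit:]]", strings, value = TRUE)
with.digit

# ^ and $ anchor the start and end, so only all-letter strings match
letters.only = grep("^[[:alpha:]]+$", strings, value = TRUE)
letters.only

# In ASCII, the range A-z also spans [ \ ] ^ _ and `, so [A-Za-z] is the safer
# way to spell "any letter"; compare the two on an underscore
grepl("[A-z]", "_")
grepl("[A-Za-z]", "_")
```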
[:digit:] or \d: digits, 0 1 2 3 4 5 6 7 8 9, equivalent to [0-9].
\D: non-digits, equivalent to [^0-9].
[:lower:]: lower-case letters, equivalent to [a-z].
[:upper:]: upper-case letters, equivalent to [A-Z].
[:alpha:]: alphabetic characters, equivalent to [[:lower:][:upper:]] or [A-Za-z].
[:alnum:]: alphanumeric characters, equivalent to [[:alpha:][:digit:]] or [A-Za-z0-9].
\w: word characters, equivalent to [[:alnum:]_] or [A-Za-z0-9_].
\W: not word, equivalent to [^A-Za-z0-9_].
[:blank:]: blank characters, i.e. space and tab.
[:space:]: space characters: tab, newline, vertical tab, form feed, carriage return, space.
\s: space characters, equivalent to [[:space:]].
\S: not space.
[:punct:]: punctuation characters, ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
Inside a pattern, the [:name:] classes go within a bracket expression, e.g. [[:digit:]].
You know the basics of plotting and data manipulation in R
By next week, you will know how to gather data from the web as well
That means it's time to start thinking about the exam project
The exam project has to be handed in no later than 11 December 2015
You will also need to have a project description approved. The deadline for this is November 16
For the exam students are expected to pose an interesting social science question and attempt to answer it using standard academic practices including original data collection and statistical analysis.
Points for
That's really up to you (so you need to think)
I've had fun collecting data from the Danish Superliga, Goodreads, voting patterns in the European Parliament, the U.S. Congress, the New York Times, Twitter, Instagram, Facebook, MoMA, etc.
… but it depends on your interests!
So spend the last part of today's class brainstorming with your group