Social Data Science

Summer School 2016
Department of Economics
University of Copenhagen
Instructors: David Dreyer Lassen & Sebastian Barfort
Date: August 8 - August 26, 2016
Time: Lectures: 9-12, Exercises: 13-15


The objective of this course is to learn how to gather, analyze and work with modern quantitative social science data. Increasingly, social data (data that capture how people behave and interact with each other) are available online in new, challenging forms and formats. This opens up the possibility of gathering large amounts of interesting data to investigate existing theories and new phenomena, provided that the analyst has sufficient computer literacy and is aware of the promises and pitfalls of working with various types of data.

The aim of this course is fourfold:

  1. We will introduce students to the state-of-the-art social science literature that uses computational methods and social data.
  2. We will present students with an overview of key benefits and challenges of working with different kinds of social data. We will show how various kinds of data (survey, web-based, experimental, administrative, etc.) can be used to answer different questions within the social sciences. Furthermore, we will discuss ethical challenges related to the use of different types of data.
  3. We will introduce students to statistical techniques for prediction and classification, known as statistical learning, and we will discuss how these methods relate to existing empirical tools within economics such as causal inference and regression.
  4. We will present modern data science methods needed for working with computational social science and social data in practice. Being an effective economist and data scientist means spending a large fraction of one's time writing and debugging code. In this part of the course you will learn how to write code to clean, transform, scrape, merge, visualize and analyze social data.

The course will consist of 3 hours of lectures and 2 hours of exercises and problem solving per day. The lectures will focus on the broad topics covered in the course (parts 1-3 listed above). In the exercise classes we will get our hands dirty and present data science methods needed for collecting and analyzing real-world data. In addition to core computational concepts, these classes will focus on the following topics:

  1. Generating new data: We will learn how to collect and scrape data from websites as well as how to work with APIs.
  2. Data manipulation tools: Participants will learn how to import, transform, munge and merge data from various sources.
  3. Visualization tools: We will learn best practices for visualizing data at different steps of a data analysis. Participants will learn how to visualize raw data and how to communicate results from statistical models effectively to broader audiences.
  4. Reproducibility tools: Participants will learn how to use version control and social coding with GitHub, and how to effectively communicate the insights of an analysis using Markdown.
  5. Prediction tools: We will cover key implementations of statistical learning algorithms and participants will learn how to apply and interpret these models in practice.

Two hours of exercises a day is not a large amount of time for learning how to code. We will use some of this time like development meetings: going over assignments, holding detailed code reviews of various forms, and discussing blocking issues and potential solutions.

As academia places increasing emphasis on the skills needed to effectively gather, handle, and analyze data and to present results to a range of audiences, this course will provide you with important tools for future academic study. The skills taught in this course are also widely used in business. R programming skills in particular are highly valued in fields such as finance and information technology. Because this course focuses on general skills for working with social science data, such as gathering and visualization, it is equally relevant for students seeking careers outside academia, where the ability to effectively communicate the results of an analysis is in high demand.

This course assumes no knowledge of any particular software or computer program. We will try to demystify the technological side of things so that students feel comfortable getting started and thinking like data scientists. Still, this will be a technical course, and students should expect to spend a significant amount of time learning these tools.

Because the course builds on a wide range of techniques, we do not have any hard requirements, but students are expected to have an interest in some subset of: statistics, econometrics, linear algebra, and a scripting language (we will use R in this course).

Course work will include writing R code.

Code will be distributed and collected via Git, hosted on GitHub.