November, 2015

Today

  • new later exam date
  • KNN
  • Classification and Regression Tree (CART)
  • unsupervised learning

Classification

When we are trying to predict discrete outcomes, we are effectively doing classification.

We saw last time that the logit model could be used for predictions

Today I want to show you an alternative approach: KNN

K Nearest Neighbors

KNN attempts to estimate the conditional distribution of \(Y\) given \(X\), and then classify a given observation to the class with the highest estimated probability.

Specifically, given a positive integer \(K\) and a test observation \(x_0\), KNN identifies the \(K\) points in the training data that are closest to \(x_0\), represented by \(N_0\).

It then estimates the conditional probability for class \(j\) as the fraction of points in \(N_0\) whose response values equal \(j\):

\[Pr(Y = j|X = x_0) = \frac{1}{K} \sum_{i \in N_0} I(y_i = j)\]

and predicts based on majority vote
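
A minimal sketch of this in R, using the knn function from the class package on the built-in iris data (toy data, not part of the lecture):

library("class")

set.seed(1)
train.idx = sample(nrow(iris), 100)
train.X = scale(iris[train.idx, 1:4])
test.X = scale(iris[-train.idx, 1:4],
               center = attr(train.X, "scaled:center"),
               scale  = attr(train.X, "scaled:scale"))

# knn() finds the K training points closest to each test point and predicts
# the class by majority vote among their labels
pred = knn(train = train.X, test = test.X, cl = iris$Species[train.idx], k = 5)
mean(pred == iris$Species[-train.idx])    # share of test observations classified correctly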

Question: What is the prediction when K = N?

Choosing k

The choice of \(k\), the number of neighbors to be included in the classification at a new point, is important.

A common choice is to take \(k = 1\), but this can give rise to very irregular and jagged regions with high variance in the predictions.

Larger choices of \(k\) lead to smoother regions and less variable classifications, but do not capture local details and can have larger biases.

Cross-validate!
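
For example (a sketch on toy data, not from the slides): leave-one-out cross-validation over a grid of candidate values of \(k\), using knn.cv from the class package.

library("class")

set.seed(1)
X = scale(iris[, 1:4])    # toy feature matrix standing in for real data
y = iris$Species

ks = 1:20
cv.error = sapply(ks, function(k) mean(knn.cv(X, y, k = k) != y))
ks[which.min(cv.error)]   # k with the lowest leave-one-out error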

Curse of dimensionality

Distance is often Euclidean distance

KNN breaks down in high-dimensional space

This is because in high dimensions the \(K\) nearest neighbors are typically far away, so the "neighborhood" becomes very large and is no longer local

This is often called the curse of dimensionality.
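
A small simulation (not from the slides) makes this concrete: with the number of observations held fixed, the average distance to the nearest neighbor grows quickly with the number of dimensions, so the "nearest" points are not actually nearby.

set.seed(1)
nn.dist = function(p, n = 500) {
  X = matrix(runif(n * p), ncol = p)   # n points uniform on the p-dimensional unit cube
  d = as.matrix(dist(X))               # pairwise Euclidean distances
  diag(d) = Inf                        # ignore the distance of a point to itself
  mean(apply(d, 1, min))               # average distance to the nearest neighbor
}
sapply(c(1, 2, 5, 10, 50, 100), nn.dist)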

Classification and Regression Trees (CART)

Decision trees can be applied to both regression and classification problems

They are intuitive, but run the danger of overfitting (what happens if you grow the largest possible decision tree for a given problem?)

Therefore, people usually use extensions such as random forests
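
To make the overfitting danger concrete, here is a toy sketch (built-in iris data, not the lecture's data): removing rpart's stopping rules grows a tree that classifies the training data (almost) perfectly but has simply memorized it.

library("rpart")

# cp = 0 and minsplit = 2 switch off the usual stopping rules, so the tree
# keeps splitting until the leaves are (nearly) pure
big.tree = rpart(Species ~ ., data = iris, method = "class",
                 control = rpart.control(cp = 0, minsplit = 2, minbucket = 1))

# Training error is (close to) zero -- which tells us nothing about how the
# tree will do on new data
mean(predict(big.tree, type = "class") != iris$Species)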

Advantages of tree based methods

Easy to explain

Mimics the mental model we often use to make decisions

Can easily be displayed graphically

Main disadvantage:

Predictive performance: a single tree is generally less accurate than other classification and regression approaches

CART example: Classifying cuisine given ingredients

library("jsonlite")
## 
## Attaching package: 'jsonlite'
## 
## The following object is masked from 'package:utils':
## 
##     View
food = fromJSON("~/git/sds/data/food.json")
head(food)
##      id     cuisine
## 1 10259       greek
## 2 25693 southern_us
## 3 20130    filipino
## 4 22213      indian
## 5 13162      indian
## 6  6602    jamaican
##                                                                                                                                                                                                                                        ingredients
## 1                                                                                                                     romaine lettuce, black olives, grape tomatoes, garlic, pepper, purple onion, seasoning, garbanzo beans, feta cheese crumbles
## 2                                                                                                              plain flour, ground pepper, salt, tomatoes, ground black pepper, thyme, eggs, green tomatoes, yellow corn meal, milk, vegetable oil
## 3                                                                                               eggs, pepper, salt, mayonaise, cooking oil, green chilies, grilled chicken breasts, garlic powder, yellow onion, soy sauce, butter, chicken livers
## 4                                                                                                                                                                                                                water, vegetable oil, wheat, salt
## 5 black pepper, shallots, cornflour, cayenne pepper, onions, garlic paste, milk, butter, salt, lemon juice, water, chili powder, passata, oil, ground cumin, boneless chicken skinless thigh, garam masala, double cream, natural yogurt, bay leaf
## 6                                                                                                  plain flour, sugar, butter, eggs, fresh ginger root, salt, ground cinnamon, milk, vanilla extract, ground ginger, powdered sugar, baking powder

Good thing we had Zoltan here last week…

library("tm")
combi_ingredients = c(Corpus(VectorSource(food$ingredients)), Corpus(VectorSource(food$ingredients)))
combi_ingredients = tm_map(combi_ingredients, stemDocument, language="english")
combi_ingredientsDTM = DocumentTermMatrix(combi_ingredients)
combi_ingredientsDTM = removeSparseTerms(combi_ingredientsDTM, 0.99)
combi_ingredientsDTM = as.data.frame(as.matrix(combi_ingredientsDTM))
combi = combi_ingredientsDTM
combi_ingredientsDTM$cuisine = as.factor(c(food$cuisine, rep("italian", nrow(food))))

Our food data

names(combi_ingredientsDTM)[1:15]
##  [1] "all_purpos" "allspic"    "almond"     "and"        "appl"      
##  [6] "avocado"    "babi"       "bacon"      "bake"       "balsam"    
## [11] "basil"      "bay"        "bean"       "beansprout" "beef"
trainDTM = combi_ingredientsDTM[1:nrow(food), ]     # first copy: training data with true labels
testDTM = combi_ingredientsDTM[-(1:nrow(food)), ]   # second copy: "test" data

Estimate the model

library("rpart")
set.seed(1)
model = rpart(cuisine ~ ., data = trainDTM, method = "class")
cuisine = predict(model, newdata = testDTM, type = "class")
summary(model)
## Call:
## rpart(formula = cuisine ~ ., data = trainDTM, method = "class")
##   n= 39774 
## 
##           CP nsplit rel error    xerror        xstd
## 1 0.07781187      0 1.0000000 1.0000000 0.002484064
## 2 0.03143788      1 0.9221881 0.9221881 0.002737618
## 3 0.01000000      6 0.7362224 0.7310559 0.003074783
## 
## Variable importance
##     tortilla     parmesan          soy       masala        garam 
##           19           13           12            8            7 
##         oliv     cilantro        grate        sesam          oil 
##            7            7            4            4            3 
##         sauc        chees extra_virgin         chip         jack 
##            2            2            2            2            2 
##        salsa     monterey        mirin       chines         wine 
##            2            1            1            1            1 
##      parsley 
##            1 
## 
## Node number 1: 39774 observations,    complexity param=0.07781187
##   predicted class=italian      expected loss=0.8029366  P(node) =1
##     class counts:   467   804  1546  2673   755  2646  1175  3003   667  7838   526  1423   830  6438   821   489  4320   989  1539   825
##    probabilities: 0.012 0.020 0.039 0.067 0.019 0.067 0.030 0.076 0.017 0.197 0.013 0.036 0.021 0.162 0.021 0.012 0.109 0.025 0.039 0.021 
##   left son=2 (37167 obs) right son=3 (2607 obs)
##   Primary splits:
##       tortilla < 0.5 to the left,  improve=1966.590, (0 missing)
##       parmesan < 0.5 to the right, improve=1398.459, (0 missing)
##       soy      < 0.5 to the right, improve=1331.248, (0 missing)
##       chees    < 0.5 to the left,  improve=1126.163, (0 missing)
##       cilantro < 0.5 to the left,  improve=1096.958, (0 missing)
##   Surrogate splits:
##       chip     < 0.5 to the left,  agree=0.940, adj=0.092, (0 split)
##       jack     < 0.5 to the left,  agree=0.940, adj=0.087, (0 split)
##       salsa    < 0.5 to the left,  agree=0.940, adj=0.085, (0 split)
##       monterey < 0.5 to the left,  agree=0.938, adj=0.059, (0 split)
##       jalapeno < 1.5 to the left,  agree=0.935, adj=0.002, (0 split)
## 
## Node number 2: 37167 observations,    complexity param=0.03143788
##   predicted class=italian      expected loss=0.7895714  P(node) =0.9344547
##     class counts:   466   804  1540  2661   753  2645  1170  2995   666  7821   521  1422   820  3936   821   488  4311   981  1524   822
##    probabilities: 0.013 0.022 0.041 0.072 0.020 0.071 0.031 0.081 0.018 0.210 0.014 0.038 0.022 0.106 0.022 0.013 0.116 0.026 0.041 0.022 
##   left son=4 (2893 obs) right son=5 (34274 obs)
##   Primary splits:
##       parmesan < 0.5 to the right, improve=1331.4060, (0 missing)
##       soy      < 0.5 to the right, improve=1295.3210, (0 missing)
##       chees    < 0.5 to the right, improve=1236.8550, (0 missing)
##       oliv     < 0.5 to the right, improve=1006.5540, (0 missing)
##       ginger   < 0.5 to the right, improve= 949.5458, (0 missing)
##   Surrogate splits:
##       grate   < 0.5 to the right, agree=0.945, adj=0.294, (0 split)
##       chees   < 1.5 to the right, agree=0.934, adj=0.157, (0 split)
##       ricotta < 0.5 to the right, agree=0.923, adj=0.011, (0 split)
##       pasta   < 1.5 to the right, agree=0.922, adj=0.001, (0 split)
##       chive   < 1.5 to the right, agree=0.922, adj=0.001, (0 split)
## 
## Node number 3: 2607 observations
##   predicted class=mexican      expected loss=0.04027618  P(node) =0.06554533
##     class counts:     1     0     6    12     2     1     5     8     1    17     5     1    10  2502     0     1     9     8    15     3
##    probabilities: 0.000 0.000 0.002 0.005 0.001 0.000 0.002 0.003 0.000 0.007 0.002 0.000 0.004 0.960 0.000 0.000 0.003 0.003 0.006 0.001 
## 
## Node number 4: 2893 observations
##   predicted class=italian      expected loss=0.1634981  P(node) =0.07273596
##     class counts:    24    11    58     4     2   112    49     4     7  2420     2     8     0    38     3     2   135    11     2     1
##    probabilities: 0.008 0.004 0.020 0.001 0.001 0.039 0.017 0.001 0.002 0.837 0.001 0.003 0.000 0.013 0.001 0.001 0.047 0.004 0.001 0.000 
## 
## Node number 5: 34274 observations,    complexity param=0.03143788
##   predicted class=italian      expected loss=0.842417  P(node) =0.8617187
##     class counts:   442   793  1482  2657   751  2533  1121  2991   659  5401   519  1414   820  3898   818   486  4176   970  1522   821
##    probabilities: 0.013 0.023 0.043 0.078 0.022 0.074 0.033 0.087 0.019 0.158 0.015 0.041 0.024 0.114 0.024 0.014 0.122 0.028 0.044 0.024 
##   left son=10 (4564 obs) right son=11 (29710 obs)
##   Primary splits:
##       soy    < 0.5 to the right, improve=1205.2280, (0 missing)
##       ginger < 0.5 to the right, improve= 835.4422, (0 missing)
##       sauc   < 0.5 to the left,  improve= 826.3489, (0 missing)
##       masala < 0.5 to the right, improve= 820.0662, (0 missing)
##       cumin  < 0.5 to the left,  improve= 815.1154, (0 missing)
##   Surrogate splits:
##       sesam  < 0.5 to the right, agree=0.908, adj=0.305, (0 split)
##       sauc   < 1.5 to the right, agree=0.895, adj=0.209, (0 split)
##       mirin  < 0.5 to the right, agree=0.876, adj=0.070, (0 split)
##       chines < 0.5 to the right, agree=0.876, adj=0.067, (0 split)
##       oil    < 1.5 to the right, agree=0.875, adj=0.059, (0 split)
## 
## Node number 10: 4564 observations
##   predicted class=chinese      expected loss=0.555872  P(node) =0.1147483
##     class counts:     2     5    12  2027   287     7     6    33     7    19    80   755   528    36     4     3    35     4   485   229
##    probabilities: 0.000 0.001 0.003 0.444 0.063 0.002 0.001 0.007 0.002 0.004 0.018 0.165 0.116 0.008 0.001 0.001 0.008 0.001 0.106 0.050 
## 
## Node number 11: 29710 observations,    complexity param=0.03143788
##   predicted class=italian      expected loss=0.8188489  P(node) =0.7469704
##     class counts:   440   788  1470   630   464  2526  1115  2958   652  5382   439   659   292  3862   814   483  4141   966  1037   592
##    probabilities: 0.015 0.027 0.049 0.021 0.016 0.085 0.038 0.100 0.022 0.181 0.015 0.022 0.010 0.130 0.027 0.016 0.139 0.033 0.035 0.020 
##   left son=22 (1000 obs) right son=23 (28710 obs)
##   Primary splits:
##       masala   < 0.5 to the right, improve=810.7822, (0 missing)
##       cumin    < 0.5 to the left,  improve=800.7360, (0 missing)
##       garam    < 0.5 to the right, improve=735.6299, (0 missing)
##       ginger   < 0.5 to the right, improve=726.0664, (0 missing)
##       cilantro < 0.5 to the left,  improve=709.7499, (0 missing)
##   Surrogate splits:
##       garam   < 0.5 to the right, agree=0.997, adj=0.910, (0 split)
##       past    < 2.5 to the right, agree=0.966, adj=0.003, (0 split)
##       spinach < 1.5 to the right, agree=0.966, adj=0.001, (0 split)
## 
## Node number 22: 1000 observations
##   predicted class=indian       expected loss=0.065  P(node) =0.02514205
##     class counts:     1     2     0     1     0     0     0   935     0     1     1    49     0     0     6     1     2     0     1     0
##    probabilities: 0.001 0.002 0.000 0.001 0.000 0.000 0.000 0.935 0.000 0.001 0.001 0.049 0.000 0.000 0.006 0.001 0.002 0.000 0.001 0.000 
## 
## Node number 23: 28710 observations,    complexity param=0.03143788
##   predicted class=italian      expected loss=0.812574  P(node) =0.7218283
##     class counts:   439   786  1470   629   464  2526  1115  2023   652  5381   438   610   292  3862   808   482  4139   966  1036   592
##    probabilities: 0.015 0.027 0.051 0.022 0.016 0.088 0.039 0.070 0.023 0.187 0.015 0.021 0.010 0.135 0.028 0.017 0.144 0.034 0.036 0.021 
##   left son=46 (25195 obs) right son=47 (3515 obs)
##   Primary splits:
##       cilantro < 0.5 to the left,  improve=690.1183, (0 missing)
##       oliv     < 0.5 to the right, improve=653.9726, (0 missing)
##       cumin    < 0.5 to the left,  improve=636.2611, (0 missing)
##       chili    < 0.5 to the left,  improve=614.8058, (0 missing)
##       lime     < 0.5 to the left,  improve=578.3905, (0 missing)
##   Surrogate splits:
##       lime       < 0.5 to the left,  agree=0.884, adj=0.052, (0 split)
##       jalapeno   < 0.5 to the left,  agree=0.883, adj=0.047, (0 split)
##       avocado    < 0.5 to the left,  agree=0.882, adj=0.040, (0 split)
##       serrano    < 0.5 to the left,  agree=0.879, adj=0.015, (0 split)
##       beansprout < 0.5 to the left,  agree=0.878, adj=0.002, (0 split)
## 
## Node number 46: 25195 observations,    complexity param=0.03143788
##   predicted class=italian      expected loss=0.7879738  P(node) =0.633454
##     class counts:   360   782  1448   577   456  2513  1106  1420   648  5342   413   582   287  2349   533   474  4088   889   566   362
##    probabilities: 0.014 0.031 0.057 0.023 0.018 0.100 0.044 0.056 0.026 0.212 0.016 0.023 0.011 0.093 0.021 0.019 0.162 0.035 0.022 0.014 
##   left son=92 (7367 obs) right son=93 (17828 obs)
##   Primary splits:
##       oliv       < 0.5 to the right, improve=710.4396, (0 missing)
##       basil      < 0.5 to the right, improve=428.0619, (0 missing)
##       cumin      < 0.5 to the left,  improve=402.6612, (0 missing)
##       buttermilk < 0.5 to the left,  improve=378.7497, (0 missing)
##       mozzarella < 0.5 to the right, improve=355.6310, (0 missing)
##   Surrogate splits:
##       oil          < 0.5 to the right, agree=0.807, adj=0.339, (0 split)
##       extra_virgin < 0.5 to the right, agree=0.791, adj=0.286, (0 split)
##       wine         < 0.5 to the right, agree=0.732, adj=0.083, (0 split)
##       parsley      < 0.5 to the right, agree=0.732, adj=0.082, (0 split)
##       tomato       < 0.5 to the right, agree=0.728, adj=0.068, (0 split)
## 
## Node number 47: 3515 observations
##   predicted class=mexican      expected loss=0.569559  P(node) =0.08837431
##     class counts:    79     4    22    52     8    13     9   603     4    39    25    28     5  1513   275     8    51    77   470   230
##    probabilities: 0.022 0.001 0.006 0.015 0.002 0.004 0.003 0.172 0.001 0.011 0.007 0.008 0.001 0.430 0.078 0.002 0.015 0.022 0.134 0.065 
## 
## Node number 92: 7367 observations
##   predicted class=italian      expected loss=0.5743179  P(node) =0.1852215
##     class counts:    85    69   348    27    34   738   729   182    57  3136    43    29    14   509   316    58   359   588    32    14
##    probabilities: 0.012 0.009 0.047 0.004 0.005 0.100 0.099 0.025 0.008 0.426 0.006 0.004 0.002 0.069 0.043 0.008 0.049 0.080 0.004 0.002 
## 
## Node number 93: 17828 observations
##   predicted class=southern_us  expected loss=0.7908346  P(node) =0.4482325
##     class counts:   275   713  1100   550   422  1775   377  1238   591  2206   370   553   273  1840   217   416  3729   301   534   348
##    probabilities: 0.015 0.040 0.062 0.031 0.024 0.100 0.021 0.069 0.033 0.124 0.021 0.031 0.015 0.103 0.012 0.023 0.209 0.017 0.030 0.020

Does the tree pass the smell test?

library("rpart.plot")
prp(model)
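
Besides eyeballing the splits, a quick sanity check (not on the slide) is the tree's accuracy on the training data:

# Share of training recipes the tree classifies correctly
mean(predict(model, type = "class") == trainDTM$cuisine)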


Random forests

Random forest algorithms are so-called ensemble models

This means that the model consists of many smaller models

The sub-models for Random Forests are classification and regression trees

Bagging

Breiman (1996) proposed bootstrap aggregating – “bagging” – to reduce the risk of overfitting.

The core idea of bagging is to decrease the variance of the predictions by fitting several models and averaging over their predictions

In order to obtain a variety of models that are not overfit to the available data, each component model is fit only to a bootstrap sample of the data
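
A minimal sketch of bagging by hand, reusing trainDTM and testDTM from the CART example above (the number of bootstrap samples, B = 25, is an arbitrary choice):

library("rpart")

set.seed(1)
B = 25
n = nrow(trainDTM)

# Fit one tree per bootstrap sample of the training data
bag.models = lapply(1:B, function(b) {
  idx = sample(n, n, replace = TRUE)    # bootstrap sample of the rows
  rpart(cuisine ~ ., data = trainDTM[idx, ], method = "class")
})

# Aggregate: collect every tree's predicted class and take a majority vote
bag.votes = sapply(bag.models, function(m)
  as.character(predict(m, newdata = testDTM, type = "class")))
bag.pred = apply(bag.votes, 1, function(v) names(which.max(table(v))))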

Random forest intuition

Random forests extend the logic of bagging to the predictors.

This means that, instead of choosing the split from among all the explanatory variables at each node in each tree, only a random subset of the explanatory variables is considered

If there are some very important variables, they might overshadow the effect of weaker predictors, because the algorithm searches for the split that results in the largest reduction in the loss function.

If at each split only a subset of predictors is available to be chosen, weaker predictors get a chance to be selected more often, which reduces the risk of overlooking such variables
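
As a sketch (not shown in the lecture), a random forest on the cuisine data from the CART example above, assuming the randomForest package is installed; combi holds the ingredient counts without the outcome column:

library("randomForest")

set.seed(1)
rf = randomForest(x = combi[1:nrow(food), ],         # training predictors
                  y = trainDTM$cuisine,              # training labels
                  ntree = 100,                       # number of trees (kept small; this is slow)
                  mtry = floor(sqrt(ncol(combi))))   # predictors sampled at each split
rf.cuisine = predict(rf, newdata = combi[-(1:nrow(food)), ])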

Unsupervised Learning

Supervised vs unsupervised

Supervised

  • You have an outcome Y and some covariates X

Unsupervised

  • You have a bunch of observations X and you want to understand the relationships between them.
  • You are usually trying to understand patterns in X or group the variables in X in some way

Examples of unsupervised learning

Principal Components Analysis

Clustering

Principal Components Analysis

You have a set of multivariate variables \(X_1,...,X_p\)

  • Find a new set of multivariate variables that are uncorrelated and explain as much variance as possible.
  • If you put all the variables together in one matrix, find the best matrix created with fewer variables (lower rank) that explains the original data.

The first goal is statistical and the second goal is data compression.

Example: Building a market index

library("readr")

df = read_csv("https://raw.githubusercontent.com/johnmyleswhite/ML_for_Hackers/master/08-PCA/data/stock_prices.csv")
head(df)
##         Date Stock Close
## 1 2011-05-25   DTE 51.12
## 2 2011-05-24   DTE 51.51
## 3 2011-05-23   DTE 51.47
## 4 2011-05-20   DTE 51.90
## 5 2011-05-19   DTE 51.91
## 6 2011-05-18   DTE 51.68


Market index

Idea: Let's reduce the 25 stocks to one dimension and call that our market index

Dimensionality reduction: shrink a large number of correlated variables into a smaller number

Can be used in many different situations: when we have too many variables for OLS, for unsupervised learning, etc.

library("tidyr")
df.wide = df %>% spread(Stock, Close)
df.wide = df.wide[complete.cases(df.wide), ]
head(df.wide)
## Source: local data frame [6 x 26]
## 
##         Date   ADC   AFL  ARKR  AZPN  CLFD   DDR   DTE  ENDP  FLWS    FR
##       (date) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
## 1 2002-01-02 17.70 23.78  8.15 17.10  3.19 18.80 42.37 11.54 15.77 31.16
## 2 2002-01-03 16.14 23.52  8.15 17.41  3.27 18.55 42.14 11.48 17.40 31.45
## 3 2002-01-04 15.45 23.92  7.79 17.90  3.28 18.46 41.79 11.60 17.11 31.46
## 4 2002-01-07 16.59 23.12  7.79 17.49  3.50 18.30 41.48 11.90 17.38 31.10
## 5 2002-01-08 16.76 25.54  7.35 17.89  4.24 18.57 40.69 12.41 14.62 31.40
## 6 2002-01-09 16.78 26.30  7.40 18.25  4.25 18.41 40.81 12.27 14.27 31.05
## Variables not shown: GMXR (dbl), GPC (dbl), HE (dbl), ISSC (dbl), ISSI
##   (dbl), KSS (dbl), MTSC (dbl), NWN (dbl), ODFL (dbl), PARL (dbl), RELV
##   (dbl), SIGM (dbl), STT (dbl), TRIB (dbl), UTR (dbl)

PCA

library("dplyr")
pca = princomp(select(df.wide, -Date))
summary(pca)
## Importance of components:
##                            Comp.1     Comp.2      Comp.3      Comp.4
## Standard deviation     32.8783239 21.9458151 12.66604184 11.48519188
## Proportion of Variance  0.5023324  0.2238078  0.07455103  0.06129829
## Cumulative Proportion   0.5023324  0.7261402  0.80069122  0.86198951
##                            Comp.5     Comp.6     Comp.7     Comp.8
## Standard deviation     8.52597847 8.21798001 6.15711399 5.14393138
## Proportion of Variance 0.03378005 0.03138354 0.01761677 0.01229595
## Cumulative Proportion  0.89576956 0.92715310 0.94476987 0.95706582
##                            Comp.9    Comp.10     Comp.11     Comp.12
## Standard deviation     4.79763615 4.29050691 3.516427432 2.659538627
## Proportion of Variance 0.01069612 0.00855439 0.005746126 0.003286884
## Cumulative Proportion  0.96776195 0.97631634 0.982062463 0.985349347
##                            Comp.13     Comp.14     Comp.15    Comp.16
## Standard deviation     2.530919035 2.177211855 2.040659474 1.88598714
## Proportion of Variance 0.002976654 0.002202791 0.001935142 0.00165291
## Cumulative Proportion  0.988326001 0.990528791 0.992463934 0.99411684
##                            Comp.17     Comp.18      Comp.19      Comp.20
## Standard deviation     1.725447133 1.638980130 1.4166499216 1.1858865221
## Proportion of Variance 0.001383487 0.001248301 0.0009326032 0.0006535188
## Cumulative Proportion  0.995500331 0.996748632 0.9976812350 0.9983347538
##                             Comp.21      Comp.22      Comp.23      Comp.24
## Standard deviation     1.1219624494 0.9370250272 0.8490574411 0.7360680459
## Proportion of Variance 0.0005849631 0.0004080132 0.0003350009 0.0002517722
## Cumulative Proportion  0.9989197169 0.9993277301 0.9996627310 0.9999145032
##                             Comp.25
## Standard deviation     4.289326e-01
## Proportion of Variance 8.549683e-05
## Cumulative Proportion  1.000000e+00
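
The first component explains about half of the variance. To see how each stock contributes to it, we can inspect its loadings (a quick check, not shown on the slide):

round(pca$loadings[, 1], 2)   # weight of each stock in the first component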

Creating market index

# Scores on the first principal component serve as our market index
market.index = predict(pca)[, 1]
market.index = data.frame(market.index = market.index, Date = df.wide$Date)

Question: How do we validate our index?

One suggestion: we can compare it to Dow Jones

library("lubridate")
dj = read_csv("https://raw.githubusercontent.com/johnmyleswhite/ML_for_Hackers/master/08-PCA/data/DJI.csv")
dj = dj %>% filter(ymd(Date) > ymd('2001-12-31')) %>% 
  filter(ymd(Date) != ymd('2002-02-01')) %>% select(Date, Close)
market.data = inner_join(market.index, dj)
## Joining by: "Date"
head(market.data)
##   market.index       Date    Close
## 1     28.24125 2002-01-02 10073.40
## 2     28.16625 2002-01-03 10172.14
## 3     28.07273 2002-01-04 10259.74
## 4     28.30203 2002-01-07 10197.05
## 5     27.62799 2002-01-08 10150.55
## 6     27.49861 2002-01-09 10094.09
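
The sign of a principal component is arbitrary, and the scaling step below flips it, which suggests the raw index moves opposite to the Dow. A quick check (a sketch, assuming ggplot2 is available; the slide presumably showed a plot here):

library("ggplot2")
ggplot(market.data, aes(x = market.index, y = Close)) +
  geom_point()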


market.data = market.data %>% 
  mutate(
    # The sign of a principal component is arbitrary; flip it so the index
    # moves in the same direction as the Dow, and put both series on a
    # common scale
    market.index = scale(market.index * (-1)),
    Close = scale(Close)
  )
# Long format (one row per date and series) for plotting
market.data = market.data %>% gather(index, value, -Date)
head(market.data)
##         Date        index      value
## 1 2002-01-02 market.index -0.8587810
## 2 2002-01-03 market.index -0.8565003
## 3 2002-01-04 market.index -0.8536565
## 4 2002-01-07 market.index -0.8606292
## 5 2002-01-08 market.index -0.8401325
## 6 2002-01-09 market.index -0.8361982
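
After rescaling, the two series can be compared over time (again a sketch of the kind of plot presumably shown on the slide):

library("ggplot2")
ggplot(market.data, aes(x = Date, y = value, colour = index)) +
  geom_line()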


Clustering

An alternative approach to unsupervised learning is clustering

In the following we will use MDS - Multidimensional Scaling - to study polarization among politicians in the U.S. Congress

MDS

MDS is a set of statistical techniques used to visually depict the similarities and differences among a set of observations, starting from their pairwise distances

The model takes a distance matrix that specifies the (usually Euclidean) distance between every pair of points in our data and returns a set of coordinates for those points that approximately reproduces those distances

It is implemented in base R in the cmdscale function
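
As a minimal illustration (essentially the example from the cmdscale help page, not part of the Congress analysis), the built-in eurodist road distances between European cities can be mapped to two dimensions:

loc = cmdscale(eurodist, k = 2)
x = loc[, 1]
y = -loc[, 2]   # flip the vertical axis so north points up
plot(x, y, type = "n", xlab = "", ylab = "", asp = 1)
text(x, y, labels = rownames(loc), cex = 0.7)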

Ideology in the U.S. Congress

data.dir = "~/git/sds/data"
files = list.files(data.dir, pattern =".dta")

library("foreign")
rollcall.data = lapply(files,
    function(f){read.dta(file.path(data.dir, f), convert.factors = FALSE)})

Cleaning…

# Recode the roll call votes: drop the president (state code 99), then map the
# vote codes to yea (1), nay (-1), and not voting (0)
rollcall.simplified = function(df){
  no.pres = subset(df, state < 99)
  for(i in 10:ncol(no.pres)){
    no.pres[,i] = ifelse(no.pres[,i] > 6, 0, no.pres[,i])                    # codes 7-9: not voting
    no.pres[,i] = ifelse(no.pres[,i] > 0 & no.pres[,i] < 4, 1, no.pres[,i])  # codes 1-3: yea
    no.pres[,i] = ifelse(no.pres[,i] > 1, -1, no.pres[,i])                   # codes 4-6: nay
  }
  return(as.matrix(no.pres[,10:ncol(no.pres)]))
}

rollcall.simple = lapply(rollcall.data, rollcall.simplified)

dim(rollcall.simple[[1]])
## [1] 102 638
head(rollcall.simple[[1]])
##   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 2  1  1  1  1  1  1  1  1  1   1   1   1   1   1   1   1   1   1   1  -1
## 3  1  1  1  1  1  1  1  1  1   1   1   1   1  -1  -1   1   1   1   1   1
## 4  1  1  1  1  1  1  1  1  1   1  -1  -1   1   1   1   1   1   1   1   1
## 5  1  1  1  1  1  1  1  1  1   1  -1  -1   1   1  -1   1   1   1   1   1
## 6  1  1  1  1  1  1  1  1  1   1   1   1   1   1   1   1   1   1   1  -1
## 7  1  1  1  1  1  1  1  1  1   1   1   1   1  -1  -1   1   1   1   1   1
##   V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38
## 2   1  -1  -1   1   1   1  -1   1   1  -1   1   1   1   1   1   1   1  -1
## 3   1  -1  -1   1   1   1  -1   1   1  -1   1   1   1   1   1   1   1  -1
## 4   1  -1  -1   1   1   1   1   1  -1   1   1   1   1   1   1   1   1   1
## 5   1  -1  -1   1   1   1   1   1  -1   1   1   1   1   1   1   1  -1   1
## 6   1  -1  -1   1   1   1   1   1   1  -1   1   1   1   1   1   1   1  -1
## 7   1  -1  -1   1   1   1   1   1  -1   1   1   1   1   1   1  -1  -1   1
##   V39 V40 V41 V42 V43 V44 V45 V46 V47 V48 V49 V50 V51 V52 V53 V54 V55 V56
## 2   1  -1  -1  -1   1   1   1   1   1   1   1   1   1  -1  -1  -1  -1   1
## 3   1  -1  -1  -1   1   1   1   1   1   1   1   1   1  -1  -1  -1  -1   1
## 4  -1   1   1   1   1   1   1  -1  -1   1   1   1   1  -1  -1  -1  -1   1
## 5  -1  -1   1   1   1   1   1  -1  -1   1   1   1   1  -1  -1  -1  -1   1
## 6   1   1   1   1   1   1   1  -1   1   1  -1   1   1  -1  -1  -1  -1   1
## 7  -1  -1   1  -1   1   1   1   1  -1   1   1   1   1  -1  -1  -1  -1   1
##   V57 V58 V59 V60 V61 V62 V63 V64 V65 V66 V67 V68 V69 V70 V71 V72 V73 V74
## 2   1   1   1   1  -1   1   1  -1  -1   1   1   1  -1   1   1   1  -1   1
## 3   1   1   1   1  -1   1  -1   1  -1   1   1   1   1   1   1   1  -1  -1
## 4   1  -1   1   1  -1   1  -1   1  -1  -1   1  -1   1   1   1   1  -1   1
## 5   1  -1   1   1  -1   1  -1   1  -1  -1   1  -1   1   0   0   0   0   0
## 6   1   1   1  -1  -1   1  -1   1  -1  -1   1   1   1   1   1  -1   1  -1
## 7   1   0   1   1  -1   1  -1   1  -1  -1  -1  -1   1   1   1   1  -1   1
##   V75 V76 V77 V78 V79 V80 V81 V82 V83 V84 V85 V86 V87 V88 V89 V90 V91 V92
## 2   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
## 3   1   1  -1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
## 4   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
## 5   0   0   0   1  -1   1   1   1   1   1   1   1   1   1   1   1   1   1
## 6   1  -1  -1   1   1   1   1   1   1   1   1   1  -1   1   1   1   1   1
## 7   1   1  -1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
##   V93 V94 V95 V96 V97 V98 V99 V100 V101 V102 V103 V104 V105 V106 V107 V108
## 2  -1   1   1   1   1   1   1    1   -1    1    1    1    1    1   -1   -1
## 3  -1   1   1   1   1  -1   1   -1   -1   -1    1    1    1   -1    1   -1
## 4   1  -1   1   1   1   1  -1   -1    1   -1   -1    1    1    1    1   -1
## 5   1  -1   1   1   1  -1  -1   -1    1   -1   -1    1    1    1   -1   -1
## 6  -1   1   1   1   1   1   1    1   -1    1    1    1   -1   -1    1    1
## 7   1  -1   1   1   1  -1  -1   -1    1   -1    0    1    1    0    1   -1
##   V109 V110 V111 V112 V113 V114 V115 V116 V117 V118 V119 V120 V121 V122
## 2   -1    1    1    1    1   -1    1   -1   -1    1    1    1    1   -1
## 3    1    1    1   -1    1   -1    1   -1   -1    1   -1    1    1   -1
## 4    1    1    1   -1    1   -1    1    1    1   -1   -1    1   -1   -1
## 5   -1    1    1   -1    1   -1    1   -1    1    1   -1    1    1   -1
## 6    1   -1    1   -1    1    1   -1    1    1    1    1    1    1   -1
## 7    1   -1    1   -1    1    1   -1   -1    1    1   -1    1   -1   -1
##   V123 V124 V125 V126 V127 V128 V129 V130 V131 V132 V133 V134 V135 V136
## 2   -1    1   -1   -1    1    1   -1    1    1    1    1    1    1    1
## 3   -1    1   -1   -1    1    1   -1    1   -1    1    1    1    1    1
## 4   -1    1   -1   -1    1    1   -1    1    1    1    1    1   -1    1
## 5   -1    1   -1   -1    1    1   -1    1   -1    1    1    1    1    1
## 6   -1    1   -1   -1    1    1   -1    1    1    1    1    1    1   -1
## 7   -1    1    1   -1    1    1   -1    1    1    1    1    1    1    1
##   V137 V138 V139 V140 V141 V142 V143 V144 V145 V146 V147 V148 V149 V150
## 2    1    1    1    1    1    1    1    1   -1    1    1    1    1    1
## 3    1    1    1    1    1    1    1    1   -1    1    1    1    1    1
## 4    1   -1    1    1    1    1    1   -1    1   -1   -1    1    1    1
## 5    1    1    1    0    1    1    1   -1    1   -1    1    1    1    1
## 6    1    1    1    1    1    1    1    1    1   -1    1   -1    1   -1
## 7    1   -1    1    1    1    1    1    1    1   -1    1    1    1    1
##   V151 V152 V153 V154 V155 V156 V157 V158 V159 V160 V161 V162 V163 V164
## 2    1    1    1   -1   -1    1    1    1    1   -1    1   -1    1    1
## 3    1    1    1    1   -1    1    1   -1    1    1    1   -1    1    1
## 4    1    1    1   -1   -1   -1    1   -1    0   -1    1   -1    1   -1
## 5    1    1    1    1   -1    1    1   -1   -1   -1    1    1    1   -1
## 6    1   -1    1    1   -1    1    1    1    1    1    1   -1   -1   -1
## 7    1    1    1   -1    1    1    1   -1    1    1    1   -1    1   -1
##   V165 V166 V167 V168 V169 V170 V171 V172 V173 V174 V175 V176 V177 V178
## 2    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 3   -1    1   -1    1    1    1    1    1    1    1    1   -1    1    1
## 4    1   -1   -1    1   -1   -1    1    1    1    1   -1    1   -1    1
## 5    1   -1   -1    1    0    0    0    0    0    1   -1    1    1    1
## 6   -1    1   -1    1    1    1    1    1    1    1    1    1   -1    1
## 7    1   -1   -1    1    1    1    1    1    1    1   -1    1    1    1
##   V179 V180 V181 V182 V183 V184 V185 V186 V187 V188 V189 V190 V191 V192
## 2   -1    1    1    1   -1    1   -1   -1   -1   -1    1    1    1    1
## 3   -1    1    1    1   -1    1   -1   -1   -1   -1    1    1    1   -1
## 4    1    0    1    1    1   -1   -1    1    1    1    1    1    1   -1
## 5    1    1   -1   -1    1   -1    1   -1   -1   -1    1    1    1   -1
## 6   -1    1    1    1   -1    1    1   -1   -1   -1   -1    1    1    1
## 7   -1    1   -1   -1    1   -1   -1   -1   -1   -1    1    1    1   -1
##   V193 V194 V195 V196 V197 V198 V199 V200 V201 V202 V203 V204 V205 V206
## 2    1   -1    1    1    1   -1    1    1    1    1   -1    1    1    1
## 3    1    1    1    1    1   -1    1    1    1   -1   -1    1   -1    1
## 4    1   -1    1    1    1   -1   -1   -1   -1   -1   -1    1    1    1
## 5   -1   -1    1    1    1   -1   -1   -1    1   -1   -1    1    1    1
## 6    1    1    1    1    1   -1    1    1   -1   -1   -1    1    1   -1
## 7   -1   -1    1    1    1   -1   -1   -1   -1   -1   -1    1   -1    1
##   V207 V208 V209 V210 V211 V212 V213 V214 V215 V216 V217 V218 V219 V220
## 2    1    1    1    1    1    1    1   -1   -1   -1    1   -1   -1    1
## 3    1    1    1    1    1    1    1   -1   -1   -1    1   -1   -1    1
## 4    1    1    1    1    1    1    1   -1   -1   -1    1   -1    1    1
## 5    1    1    1    1    1    1    1   -1   -1   -1    1   -1    1    1
## 6    1    1    1    1    1    1   -1    1    1   -1    1    1    1    1
## 7    1    1    1    1    1    1    1   -1   -1   -1    1   -1   -1    1
##   V221 V222 V223 V224 V225 V226 V227 V228 V229 V230 V231 V232 V233 V234
## 2   -1    1    1   -1    1   -1    1   -1    1   -1    1    1    1   -1
## 3   -1    1    1   -1    1   -1    1   -1    1   -1    1    1    1   -1
## 4   -1    1    1   -1    1   -1    1    1    1   -1    1    1    1   -1
## 5   -1   -1    1   -1    1   -1    1    1    1    1   -1    1    1   -1
## 6    1    1    1    1   -1    1    1    1    1   -1    1    1    1    1
## 7    1   -1    1   -1    1   -1    1    1    1    1   -1    1    1   -1
##   V235 V236 V237 V238 V239 V240 V241 V242 V243 V244 V245 V246 V247 V248
## 2   -1   -1   -1   -1    1    1    1   -1   -1   -1   -1   -1    1   -1
## 3   -1   -1   -1   -1   -1    1   -1   -1   -1   -1    1   -1    1   -1
## 4   -1   -1   -1   -1   -1    1    1    1    1    1    1    1   -1   -1
## 5   -1   -1   -1   -1   -1    1    1   -1    1    1    1    1   -1   -1
## 6    1   -1   -1    1   -1    1    1   -1    1    1    1    1   -1   -1
## 7   -1   -1   -1   -1   -1    1    1   -1    1    1    1    1   -1   -1
##   V249 V250 V251 V252 V253 V254 V255 V256 V257 V258 V259 V260 V261 V262
## 2   -1    1    1    1    1   -1   -1   -1   -1   -1   -1   -1   -1   -1
## 3   -1    1    1   -1    1   -1   -1   -1   -1   -1   -1   -1   -1   -1
## 4   -1    1    1    1    1    1   -1    1    1    1    1    1    1    1
## 5   -1    1    1    1    1    1    1    1    1    1   -1    1    1    1
## 6   -1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 7   -1    1    1   -1    1    1    1    1    1    1   -1    1    1    1
##   V263 V264 V265 V266 V267 V268 V269 V270 V271 V272 V273 V274 V275 V276
## 2   -1   -1    1    1    1    1    1    1    1    1    1   -1    1    1
## 3   -1   -1    1    1    1    1    1    1    1   -1    1   -1    1    1
## 4   -1    1    1    1    1    1    1    1    1    1   -1   -1    1    1
## 5   -1    1    1    1    1    1    1    1    1    1   -1   -1    1   -1
## 6   -1    1    1    1    1    1    1    1    1    1    1   -1    1    1
## 7   -1    1    1    1    1    1    1    1    1    1   -1   -1    1   -1
##   V277 V278 V279 V280 V281 V282 V283 V284 V285 V286 V287 V288 V289 V290
## 2    1    1    1   -1    1    1    1   -1   -1    1    1    1   -1    1
## 3    1   -1    1   -1   -1    1    1    1    1    1    1   -1   -1   -1
## 4    1   -1    1    1    1    1    1   -1   -1    1    1    1   -1   -1
## 5    1    1    1   -1    1    1    1   -1    1    1   -1   -1   -1   -1
## 6    1   -1    1   -1    1   -1    1   -1   -1    1    1    1    1    1
## 7    1    1    1   -1   -1    1    0    0    0    0    0    0    1   -1
##   V291 V292 V293 V294 V295 V296 V297 V298 V299 V300 V301 V302 V303 V304
## 2   -1    1    1    1    1    1    1    1    1    1   -1    1   -1    1
## 3   -1    1   -1    1    1    1    1    1    1   -1   -1    1    1   -1
## 4    1    1    1   -1    1   -1    1    1    1    1   -1    1   -1    1
## 5   -1    1    1    1    1   -1    1    1    1   -1   -1    1   -1    1
## 6    1    1    1   -1    1    1    1    1    1   -1    1    1   -1    1
## 7   -1    1   -1    1    1    1    1    1    1   -1    1    1    1   -1
##   V305 V306 V307 V308 V309 V310 V311 V312 V313 V314 V315 V316 V317 V318
## 2   -1   -1   -1   -1    1   -1    1    1    1    1    1    1    1   -1
## 3   -1   -1   -1   -1    1   -1    1    1    1    1    1    1    1   -1
## 4    1    1    1    1    1   -1    1    1   -1    1    1    1    1    1
## 5    1    1    1   -1    1   -1    1    1   -1    1    1    1    1    1
## 6   -1   -1   -1   -1    1   -1    1    1    1    1    1    1    1   -1
## 7   -1   -1   -1   -1    1   -1    1    1   -1    1    1    1    1    1
##   V319 V320 V321 V322 V323 V324 V325 V326 V327 V328 V329 V330 V331 V332
## 2    1   -1   -1    1    1    1    1    1    1   -1   -1    1    1    1
## 3    1   -1   -1    1    1    1    1    1    1    1    1    1    1    1
## 4   -1    1   -1    1    1   -1    1    1    1    1    1    1   -1    1
## 5   -1    1   -1    1    1   -1    1    0    1    1    1    1   -1   -1
## 6    1   -1    1    1    1   -1    1    1    1    1    1    1    1    1
## 7   -1    1   -1    1    1   -1    1    1    1   -1   -1    1   -1   -1
##   V333 V334 V335 V336 V337 V338 V339 V340 V341 V342 V343 V344 V345 V346
## 2    1    1    1    1    1    1    1    1    1    1    1   -1   -1    1
## 3    1    1    1    1    1    1    1    1    1    1   -1    1   -1    1
## 4   -1    1    1    1   -1   -1    1    1    1    1    1    1    1    1
## 5   -1    1    1    1   -1    1   -1   -1    1    1    1   -1   -1    1
## 6    1   -1   -1    1    1    1    1    1    1    1    1    1    1   -1
## 7    1    1    1    1   -1    1   -1    1    1    1    1    1    1    1
##   V347 V348 V349 V350 V351 V352 V353 V354 V355 V356 V357 V358 V359 V360
## 2    1    1    1    1   -1   -1    1    1   -1    1   -1    1    1    1
## 3    1    1    1    1   -1   -1    1    1   -1    1   -1   -1    1    1
## 4    0    0    0    0    0    1    1    1   -1    1    1   -1    1    1
## 5    1    1    1    1    1   -1    1    1   -1    1   -1    1   -1    1
## 6   -1    1   -1    1    1   -1    1    1   -1    1    1   -1    1   -1
## 7   -1    1   -1    1    1    1    1    1   -1    1    1   -1   -1   -1
##   V361 V362 V363 V364 V365 V366 V367 V368 V369 V370 V371 V372 V373 V374
## 2   -1   -1    1    1    1   -1    1    1    1    1    1    1    1    1
## 3    1   -1    1    1    1   -1    1    1    1    1    1   -1    1   -1
## 4    1   -1   -1    1    1   -1    1    1    1    1    1    1    1   -1
## 5   -1   -1   -1   -1    1   -1    1    1    1    1    1    1    1    1
## 6    1   -1    1    1    1   -1    1   -1    1    0    1    1    1   -1
## 7    1   -1   -1    1    1   -1    1    1    1    1   -1    1   -1   -1
##   V375 V376 V377 V378 V379 V380 V381 V382 V383 V384 V385 V386 V387 V388
## 2    1   -1    1    1    1    1   -1    1   -1   -1   -1    1    1    1
## 3    1   -1   -1    1    1    1   -1    1   -1   -1   -1    1    1    1
## 4    1    1    1    1    1    1    1    1    0    1    1   -1    1    1
## 5    1    1    1    1    1    1   -1    1   -1    1    1   -1    1    1
## 6    1    1    0    1    1    1   -1    1   -1   -1   -1    1    1    1
## 7    1    1    0    1    1    1   -1    1    0    1   -1    1    1    1
##   V389 V390 V391 V392 V393 V394 V395 V396 V397 V398 V399 V400 V401 V402
## 2    1    1    1    1   -1    1    1    1    1    1    1    1    1    1
## 3    1    1    1    1   -1    1    1    1    1    1    1    1    1    1
## 4    1    1    1    1    1   -1   -1    1    1    1    1    1    1    1
## 5    1    1    1   -1   -1   -1   -1   -1   -1   -1   -1    1    1   -1
## 6    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 7    1    1    1    0    0    0    0   -1   -1    1   -1    1    1    1
##   V403 V404 V405 V406 V407 V408 V409 V410 V411 V412 V413 V414 V415 V416
## 2    1    1    1    1    1    1    1    1   -1   -1    1    1    1    1
## 3    1   -1    1    1    1    1    1    1   -1   -1    1    1    1    1
## 4    1    1   -1    1    1    1    1    1   -1    1    1    1    1    1
## 5    1    1    1    1    1    1    1    1   -1    1    1    1    1    1
## 6    1   -1    1    1   -1    1    1    1   -1    1    1    1   -1   -1
## 7    1    1   -1    1    1    1    1    1   -1    1    1    1    1    1
##   V417 V418 V419 V420 V421 V422 V423 V424 V425 V426 V427 V428 V429 V430
## 2    1    1    1    1   -1   -1   -1   -1    1    1    1    1    1    1
## 3    1    1    1    1    0   -1    1   -1    1    1    1    1    1    1
## 4    1    1   -1    1    1   -1   -1    1    1   -1   -1    1    1    1
## 5    1    1    1    1    1   -1    1   -1    1   -1   -1    1    1    1
## 6   -1   -1   -1    1    1    0    0    1    1    1    1    1    1    1
## 7    1    1    1    1   -1   -1    1   -1    1   -1   -1    1    1    1
##   V431 V432 V433 V434 V435 V436 V437 V438 V439 V440 V441 V442 V443 V444
## 2    1    1    1    1    1    1    1   -1   -1    1    1    1   -1    1
## 3    1    1    1    1    1    1    1   -1   -1    1    1    1   -1    1
## 4    1    1    1    1   -1    1    1   -1   -1    1   -1   -1    1    1
## 5    1   -1   -1    1   -1   -1    1   -1   -1    1    1    1    1    1
## 6    1    1    1   -1    1    1   -1   -1   -1    1    1   -1   -1    1
## 7    1   -1    1   -1   -1   -1    1   -1   -1    1   -1    1    1    1
##   V445 V446 V447 V448 V449 V450 V451 V452 V453 V454 V455 V456 V457 V458
## 2   -1    1    1    1   -1    1    1    1    1    1   -1   -1    1    1
## 3   -1    1    1    1    1    1    1    1    1    1   -1   -1    1    1
## 4   -1    1    1    1   -1    1   -1    1    1    1   -1    0    0    1
## 5   -1    1    1    1   -1    1   -1    1    1    1   -1   -1    1    1
## 6    1    1   -1   -1    1    1   -1    1   -1    1   -1    0    0    1
## 7   -1    1    1    1   -1    1   -1    1    1    1   -1    1   -1    1
##   V459 V460 V461 V462 V463 V464 V465 V466 V467 V468 V469 V470 V471 V472
## 2    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 3    1    1    1    1    1    1    1    1    1    1    1    1    1    1
## 4    1    1    1   -1    1    1    1    1    1    1    1   -1   -1   -1
## 5    1    1    1   -1   -1    1    1    1   -1    1    1   -1   -1   -1
## 6    1   -1    1    1    1    1    1    1    1    1    1    1   -1    1
## 7    1   -1    1   -1   -1    1   -1    0   -1   -1   -1   -1   -1   -1
##   V473 V474 V475 V476 V477 V478 V479 V480 V481 V482 V483 V484 V485 V486
## 2    1    1   -1   -1    1    1   -1    1    1   -1    1    1    1    1
## 3    1    1   -1   -1    1    1   -1    1    1   -1    1    1    1    1
## 4   -1   -1    1    1   -1    1    1   -1    1   -1   -1   -1    1    1
## 5   -1   -1    1    1    1   -1    1   -1    1   -1   -1   -1    1    1
## 6    1    1   -1   -1   -1    1   -1    1    1   -1    1   -1    1    1
## 7   -1    1    1    1    1   -1    1   -1   -1   -1   -1   -1   -1    1
##   V487 V488 V489 V490 V491 V492 V493 V494 V495 V496 V497 V498 V499 V500
## 2    1   -1    1    1    1    1    1    1    1   -1   -1    1    1   -1
## 3    1   -1    1    1    1   -1    1    1    1   -1   -1    1    1    0
## 4    1    1   -1    1    1    1    1   -1    1   -1   -1    1    1    1
## 5    1    1    1    1    1    1    1   -1   -1    1    1    1    1    1
## 6    1   -1   -1   -1    1    1    1   -1    1   -1   -1    1    1   -1
## 7    1    1    1    1    1    1    1   -1   -1   -1   -1   -1    1    1
##   V501 V502 V503 V504 V505 V506 V507 V508 V509 V510 V511 V512 V513 V514
## 2   -1    1    1   -1    1   -1    1    1   -1    1    1    1   -1   -1
## 3   -1    1    1   -1    1   -1    1    1   -1   -1    1    1   -1   -1
## 4   -1   -1    1   -1    1   -1    1    1   -1    1   -1   -1    1    1
## 5    1   -1   -1    1    1   -1    1    1    1    1   -1   -1    1    1
## 6   -1    1    1   -1    1    1   -1    1   -1   -1   -1    1   -1   -1
## 7    1   -1    1    1    1   -1    1    1    1    1   -1   -1    1    1
##   V515 V516 V517 V518 V519 V520 V521 V522 V523 V524 V525 V526 V527 V528
## 2    1    1    1    1    1   -1   -1   -1    1    1    1    1    1    1
## 3   -1    1    1    1    1   -1   -1   -1    1   -1    1   -1    1    1
## 4   -1   -1    1    1    1   -1   -1    1    1    1    1    1    1    1
## 5    1   -1    1    1    1   -1   -1    1    1   -1    1    1    1    1
## 6    1    1    1    1    1    1    1    1    1    1   -1    1    1   -1
## 7    1    1    1    1    1   -1    1    1    1   -1    1   -1    1   -1
##   V529 V530 V531 V532 V533 V534 V535 V536 V537 V538 V539 V540 V541 V542
## 2    1    1    1   -1    1   -1    1    1    1    1    1   -1    1    1
## 3    1   -1    1   -1    1   -1    1    1    1    1    1   -1   -1    1
## 4    1    1    1    1   -1   -1   -1    1    1    1    1   -1    1    1
## 5    1    1    1    1   -1    0    0    0    0    0    1   -1   -1    1
## 6    1    1   -1   -1    1    1    1    1   -1   -1    1   -1   -1    1
## 7    1    1    1   -1   -1   -1   -1    1    1    1    1   -1   -1    1
##   V543 V544 V545 V546 V547 V548 V549 V550 V551 V552 V553 V554 V555 V556
## 2    1   -1    1   -1   -1    1    1    1    1    1    1    1    1    1
## 3    1   -1    1   -1   -1    1    1    1    1    1    1    1    1    1
## 4    1    1    1   -1    1    1    1    1    1    1    1    1    1    1
## 5    1    1    1   -1    1    1    1    1    1    1    1    1    1    1
## 6    1   -1    1   -1    1    1    1    1    1    1    1    1    1    1
## 7    1    1    1   -1    1    1    1    1    1    1    1    1    1    1
##   V557 V558 V559 V560 V561 V562 V563 V564 V565 V566 V567 V568 V569 V570
## 2    1   -1    1   -1    1    1    1    1   -1    1   -1    1    1    1
## 3    1   -1    1   -1    1    1    1   -1    1    1   -1   -1   -1    1
## 4    1    1    1   -1    1    1    1    1   -1    1   -1   -1   -1    1
## 5    1    1    1   -1    1    1   -1    1    1    1   -1   -1   -1    1
## 6    1   -1    1    1    1    1    1   -1   -1    1   -1    1    1    1
## 7    1    1   -1    1    1    1    1   -1    1    1   -1   -1   -1    1
##   V571 V572 V573 V574 V575 V576 V577 V578 V579 V580 V581 V582 V583 V584
## 2    1    1   -1    1    1    1    1   -1    1    1    1    1    1   -1
## 3    1    1   -1    1    1    1    1   -1   -1   -1    1   -1   -1   -1
## 4    1    1    1   -1    1    1    1    1    1    1    1    1   -1   -1
## 5    1    1    1   -1    1    1    1   -1    1   -1    1   -1   -1   -1
## 6    1    1   -1    1    1    1    1   -1    1    1    1    1    1    1
## 7    1    1   -1   -1    1    1    1   -1    1    1    1    0   -1    1
##   V585 V586 V587 V588 V589 V590 V591 V592 V593 V594 V595 V596 V597 V598
## 2    1    1   -1    1   -1    1   -1   -1    1   -1    1   -1    1    1
## 3    1    1   -1    1    1    1   -1    1    1   -1    1   -1    1    1
## 4    1    0    0    0   -1   -1   -1   -1   -1   -1    1   -1   -1    1
## 5    1    1    1   -1   -1   -1   -1   -1   -1   -1    1   -1   -1   -1
## 6    1    1   -1    1    1   -1    1    1    1   -1    1   -1    1   -1
## 7    1   -1    1   -1   -1    1    1   -1    1   -1    1   -1    1    1
##   V599 V600 V601 V602 V603 V604 V605 V606 V607 V608 V609 V610 V611 V612
## 2    1   -1    1    1   -1   -1    1    1    1    1    1   -1    1   -1
## 3   -1   -1    1    1    1   -1    1    1    1    1    1    1    1   -1
## 4   -1   -1   -1    1    1    1    1    1    1    1   -1    1    1   -1
## 5   -1   -1   -1   -1    1    1    1    1    1    1   -1    1    1   -1
## 6   -1    1    1    1   -1   -1    1    1   -1    1   -1    1    1   -1
## 7    1   -1   -1    1    1   -1   -1    1    1    1   -1    1    1   -1
##   V613 V614 V615 V616 V617 V618 V619 V620 V621 V622 V623 V624 V625 V626
## 2   -1    1    1    1    1    1    1    1    1   -1    1   -1    1    1
## 3   -1   -1   -1    1    1   -1    1   -1    1   -1    1    1    1    1
## 4   -1    1   -1   -1   -1    1   -1    1    1    1    1   -1    1   -1
## 5   -1    1   -1   -1   -1    1    1    1    1   -1    1   -1   -1    1
## 6   -1    1    1    1    1    1   -1    1    1    0    1   -1    1    1
## 7   -1    1    1   -1   -1    1    1   -1    1   -1   -1    1    1   -1
##   V627 V628 V629 V630 V631 V632 V633 V634 V635 V636 V637 V638
## 2    1   -1    1    1    1    1    1    1    1    1    1   -1
## 3    1   -1    1    1    1    1    1    1    1    1    1   -1
## 4    1    1    1    1    1    1   -1   -1    1    1    1    1
## 5    1    1    1    1    1    1   -1   -1    1    1    1   -1
## 6    1   -1    1    1    1    1    1    1    1    1    1   -1
## 7    1    1    1    1    1    1   -1   -1    1    1    1   -1

# x %*% t(x) counts, for each pair of legislators, agreements minus disagreements;
# dist() turns these scores into a distance matrix and cmdscale() reduces it to
# two dimensions (the * -1 only flips the orientation of the axes)
rollcall.dist = lapply(rollcall.simple, function(x) dist(x %*% t(x)))
rollcall.mds = lapply(rollcall.dist,
                       function(d) as.data.frame((cmdscale(d, k = 2)) * -1))

congresses = 101:111

for(i in 1:length(rollcall.mds))
{
  names(rollcall.mds[[i]]) = c("x", "y")
  
  congress = subset(rollcall.data[[i]], state < 99)
  
  # Keep only the last name of each legislator
  congress.names = sapply(as.character(congress$name),
                           function(n) strsplit(n, "[, ]")[[1]][1])
  
  # Attach names, party codes, and the Congress number to the MDS coordinates
  rollcall.mds[[i]] = transform(rollcall.mds[[i]],
                                 name = congress.names,
                                 party = as.factor(congress$party),
                                 congress = congresses[i])
}

head(rollcall.mds[[1]])
##            x        y      name party congress
## 2  -11.44068 293.0001    SHELBY   100      101
## 3  283.82580 132.4369    HEFLIN   100      101
## 4  885.85564 430.3451   STEVENS   200      101
## 5 1714.21327 185.5262 MURKOWSKI   200      101
## 6 -843.58421 220.1038 DECONCINI   100      101
## 7 1594.50998 225.8166    MCCAIN   200      101
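
One way to look at the result (a sketch, assuming ggplot2): plot the two MDS dimensions for the 101st Congress and colour the points by party (in this data, 100 codes Democrats and 200 Republicans).

library("ggplot2")
ggplot(rollcall.mds[[1]], aes(x = x, y = y, colour = party)) +
  geom_point() +
  labs(title = "Roll call MDS, 101st Congress")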