Sunday, May 15, 2011

Rejiggering an R Data Frame

This is a new experience for me.  Mark Allen (author of the Open Source Research blog) tweeted a question about rearranging a data frame in R.  The question being longer than 140 characters, he used Deck.ly to post an extended tweet.  So I learned something new about the Twitterverse.  I don't have a TweetDeck account, though, so my best option to post a response is to put it here.

Mark has a data frame with one row for each response to any of a set of questions, and three columns: respondent ID; question number; response. Here's a chunk of R code to create a small demo data frame along those lines:
# create some data
m <- matrix(c(1,1,11,2,3,23,1,2,12,2,1,21,2,2,22,1,3,13), nrow=6, ncol=3, byrow=TRUE)
d <- data.frame(m)
names(d) <- c("ID","Question","Answer")
print(d)

Created by Pretty R at inside-R.org

The output is:

  ID Question Answer
1  1        1     11
2  2        3     23
3  1        2     12
4  2        1     21
5  2        2     22
6  1        3     13

Here is code to rearrange it:

# sort the data by ID, then by Question
d <- d[do.call(order,d),]
# extract a list of unique IDs and Question numbers
id <- unique(d[,"ID"])
q <- unique(d[,"Question"])
# rearrange the answers into the desired matrix layout
m <- matrix(d[,"Answer"], nrow=length(id), ncol=length(q), byrow=TRUE)
# add the ids and make a new data frame
m <- cbind(id, m)
dd <- data.frame(m)
names(dd) <- c("ID", paste("Q", q, sep=""))
print(dd)

Created by Pretty R at inside-R.org

The output of the last line (the rejiggered data frame) is:

  ID Q1 Q2 Q3
1  1 11 12 13
2  2 21 22 23

4 comments:

  1. That's one way to do it. My recommendation instead would be to use the reshape or reshape2 packages. Here's the solution with reshape:

    > cast(d, ID ~ Question)
    ID 1 2 3
    1 1 11 12 13
    2 2 21 22 23

    ReplyDelete
  2. Thanks a lot for the solution, tweetDeck is pretty useful for tweeting.

    I think your solution might actually work better with large datasets. I am using a relatively large dataset and reshape bloats the memory to three times the size of my dataset. It freezes my computer.

    The only thing is that when there is no value for a cell we get NA in regular for-loop approach or in reshape but I guess here it will break

    ReplyDelete
  3. @Siah: What breaks with missing values? I changed my little example so that one of the responses was NA and the code still worked. If either the respondent ID or the question number in the original data frame is missing, bad things will happen, but I think that's true regardless of the script (it means you have an answer but you're not sure from whom or to what question).

    For really big data sets, I might be tempted to stuff the data into SQLite or MySQL and then query out what I wanted.

    ReplyDelete
  4. @Harlan: Cool! I wasn't aware of the reshape package.

    ReplyDelete

Due to intermittent spamming, comments are being moderated. If this is your first time commenting on the blog, please read the Ground Rules for Comments. In particular, if you want to ask an operations research-related question not relevant to this post, consider asking it on Operations Research Stack Exchange.