Damion Dooley
2009-Aug-19 19:12 UTC
[R] Basic question: Reading in multiple choice question responses to a single column in data frame
I'm using read.delim to successfully read in tab-delimited data, but some columns' values are comma separated, reflecting the fact that the user chose several answers on a multi-select question. I understand that each answer is its own category and so could be represented as a separate column in the data set, but I'd like the option of reading in the data column and converting it to a vector in which every value (comma separated or not) has its own entry, so that the "table(columnData)" function counts correctly.

So some code:

    myData = read.delim(myDataFile, row.names=1,header=TRUE,skip=10);  #works fine
    myColumn = myData[[question]];  #works fine, selects correct question column data

myColumn data is now e.g.:

    1
    0
    2
    0,2
    0
    3
    2
    2,1

with the comma-separated values looking like atomic string values, I guess. But I would like:

    1
    0
    2
    0
    2
    0
    3
    2
    2
    1

I've tried various things, e.g. grep to recognize and expand the comma-separated values, but since vector functions are at work, I can only put one value back into the myColumn data, e.g. a "0,2" entry becomes "0" or "2" if I use

    myColumn=gsub("^([0-9]+),([0-9]+),$",c('\\1'),myColumn,perl=TRUE)  #or replace with c('\\2')

but I can't replace into c('\\1','\\2').

Any elegant (or otherwise) ways to do this?

Much appreciated,

Damion
Frank E Harrell Jr
2009-Aug-19 19:17 UTC
[R] Basic question: Reading in multiple choice question responses to a single column in data frame
You might look at the mChoice function in the Hmisc package for some indirect help.

Frank

--
Frank E Harrell Jr
Professor and Chair, School of Medicine
Department of Biostatistics, Vanderbilt University
Magnus Torfason
2009-Aug-19 19:32 UTC
[R] Basic question: Reading in multiple choice question responses to a single column in data frame
Are you looking for something like this?

    > d = data.frame(a=1:5,b=c("1","2,3","2","3,4","1"))
    > d
      a   b
    1 1   1
    2 2 2,3
    3 3   2
    4 4 3,4
    5 5   1
    > multis = strsplit(d$b,",")
    > counts = sapply(strsplit(d$b,","),length )
    > d2 = data.frame( a=rep(d$a,counts), b=unlist(multis) )
    > d2
      a b
    1 1 1
    2 2 2
    3 2 3
    4 3 2
    5 4 3
    6 4 4
    7 5 1

Best,
Magnus
Damion Dooley
2009-Aug-19 20:07 UTC
[R] Basic question: Reading in multiple choice question responses to a single column in data frame
Magnus,

Looks like that solution should work, and I like the flexibility of your data output, but I get an error, "error in strsplit(d$b,","): non-character argument", at:

    multis = strsplit(d$b,",")

It seems the c() function converts integer-looking items like "1" into integers, and then strsplit fails on them? I was running into this earlier when attempting strsplit directly on column values.

Damion
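A likely cause of the "non-character argument" error, offered as an assumption rather than something confirmed in the thread: in R versions before 4.0, data.frame() and read.delim() converted character columns to factors by default, and strsplit() rejects factors. The sketch below forces that old default with stringsAsFactors = TRUE so the behaviour can be reproduced on any version, then converts the column with as.character() before splitting. The variable names follow Magnus's example.

```r
# Force the pre-4.0 default explicitly so column b becomes a factor.
d <- data.frame(a = 1:5,
                b = c("1", "2,3", "2", "3,4", "1"),
                stringsAsFactors = TRUE)

is.factor(d$b)  # b is a factor here, not a character vector

# strsplit() only accepts character input, so convert the factor first.
multis <- strsplit(as.character(d$b), ",", fixed = TRUE)

# Number of answers given in each row.
counts <- lengths(multis)

# One row per individual answer, with the row id repeated as needed.
d2 <- data.frame(a = rep(d$a, counts), b = unlist(multis))
d2
```

Passing stringsAsFactors = FALSE (or as.is = TRUE in read.delim) when the data is first read should avoid the conversion altogether.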
Damion Dooley
2009-Aug-20 03:06 UTC
[R] Basic question: Reading in multiple choice question responses to a single column in data frame
Slight addendum. Working from your code, I found that one line of code does the conversion:

    myColumn = unlist(strsplit(as.character(myData[[myQuestion]]),","));

But the data frame you set up may prove more useful.

Regards,
Damion
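As a minimal sketch of that one-liner applied to the sample data from the original question: the vector myColumn below is a hand-typed stand-in for myData[[myQuestion]] (an assumption, since the real file is not available), and the trimws() step is an addition to guard against stray spaces after commas.

```r
# Stand-in for myData[[myQuestion]], using the values from the first post.
myColumn <- c("1", "0", "2", "0,2", "0", "3", "2", "2,1")

# Split every entry on commas and flatten the result into one vector.
answers <- unlist(strsplit(as.character(myColumn), ",", fixed = TRUE))

# Remove any whitespace left around the individual answers.
answers <- trimws(answers)

# Each selected answer is now counted once, including multi-select rows.
counts <- table(answers)
counts
```

Because the flattened vector no longer lines up with the rows of the data frame, this form suits simple tallies; the long-format data frame from Magnus's reply keeps the link between each answer and its respondent.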