Ki L. Matlock
2009-Nov-12 23:14 UTC
[R] Transforming a dataframe into a response/predictor matrix
I currently have a data frame whose rows correspond to each student and whose columns are different variables for the student, as shown below:

         Lastname        Firstname CATALOG_NBR       Email StudentID   EMPLID     Start
1       alastname       afirstname        1213  *@uark.edu  10295236        # 12/2/2008
2 anotherlastname anotherfirstname        1213 **@uark.edu        ## 10295236  9/3/2008

  Xattempts Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19
1         1  1  1  0  0  0  0  0  0  0   1   0   0   1   1   0   1   1   0   1
2         1  1  1  1  1  1  0  1  0  0   1   1   0   0   1   0   0   0   0   1

  Q20 Q21 Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30 Q31 Q32 Score Form CRSE_GRADE_OFF
1   0   0   0   0   0   0   0   0   0   1   0   0   0     9    E              D
2   0   0   0   0   0   0   0   0   0   0   1   1   0    13    G              D

Each student took a pre- and post-test, indicated by the date under "Start", column 7 (a date, mm/dd/yyyy, whose mm is 08 or 09 is a pre-test; a date whose mm is 11 or 12 is a post-test). Each test was one of four forms, E, F, G, or H, listed under "Form", column 42. Each test had 32 questions, Q1 to Q32, with a binary 1 indicating that the student answered the question correctly and 0 that the answer was incorrect.

I need a matrix, y, with five columns labeled: response, i, j, r, s. Column 1 holds the response (0 or 1) of the i-th student, on the j-th question (1:32), on the r-th form (E, F, G, H; these could be changed to numeric 1 for E, 2 for F, etc.), on the s-th test (pre or post, indicated by a binary 0 for pre, 1 for post).

The data set is long, approximately 2000 rows. An efficient way to transform this data into the desired matrix would be very helpful.

Thank you.
Ki L. Matlock wrote:
>
> I currently have a data frame whose rows correspond to each student and
> whose columns are different variables for the student, as shown below:
>
> Lastname Firstname CATALOG_NBR Email StudentID EMPLID
> Start
> 1 alastname afirstname 1213 *@uark.edu 10295236 #
> 12/2/2008
> 2 anotherlastname anotherfirstname 1213 **@uark.edu ##
> 10295236 9/3/2008
> Xattempts Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18
> Q19
> 1 1 1 1 0 0 0 0 0 0 0 1 0 0 1 1 0 1 1 0
> 1
> 2 1 1 1 1 1 1 0 1 0 0 1 1 0 0 1 0 0 0 0
> 1
> Q20 Q21 Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30 Q31 Q32 Score Form
> CRSE_GRADE_OFF
> 1 0 0 0 0 0 0 0 0 0 1 0 0 0 9 E
> D
> 2 0 0 0 0 0 0 0 0 0 0 1 1 0 13 G
> D
>

Thanks for providing test data. However, this sort of format is difficult to work with, as email tends to mangle the line wrapping. It took me about 5 minutes and the combined powers of Vim and OpenOffice to reflow and re-export the example into a format that R could ingest, and I probably made a mistake somewhere along the way.

There is nothing wrong with providing data like this, but it probably limits the number of people who are willing to give your problem a try. A good way to share test data frames that contain a lot of rows or columns is to dump them using the dput() function.
This encodes the data in the following format:

    structure(list(Lastname = structure(1:2, .Label = c("alastname",
    "anotherlastname"), class = "factor"), Firstname = structure(1:2,
    .Label = c("afirstname", "anotherfirstname"), class = "factor"),
    CATALOG_NBR = c(1213L, 1213L), Email = structure(1:2,
    .Label = c("*@uark.edu", "**@uark.edu"), class = "factor"),
    StudentID = structure(c(2L, 1L), .Label = c("##", "10295236"),
    class = "factor"), EMPLID = structure(1:2, .Label = c("#", "10295236"),
    class = "factor"), Start = structure(c(14215, 14125), class = "Date"),
    Xattempts = c(1L, 1L), Q1 = c(1L, 1L), Q2 = c(1L, 1L), Q3 = 0:1,
    Q4 = 0:1, Q5 = 0:1, Q6 = c(0L, 0L), Q7 = 0:1, Q8 = c(0L, 0L),
    Q9 = c(0L, 0L), Q10 = c(1L, 1L), Q11 = 0:1, Q12 = c(0L, 0L),
    Q13 = c(1L, 0L), Q14 = c(1L, 1L), Q15 = c(0L, 0L), Q16 = c(1L, 0L),
    Q17 = c(1L, 0L), Q18 = c(0L, 0L), Q19 = c(1L, 1L), Q20 = c(0L, 0L),
    Q21 = c(0L, 0L), Q22 = c(0L, 0L), Q23 = c(0L, 0L), Q24 = c(0L, 0L),
    Q25 = c(0L, 0L), Q26 = c(0L, 0L), Q27 = c(0L, 0L), Q28 = c(0L, 0L),
    Q29 = c(1L, 0L), Q30 = 0:1, Q31 = 0:1, Q32 = c(0L, 0L),
    Score = c(9L, 13L), Form = structure(1:2, .Label = c("E", "G"),
    class = "factor"), CRSE_GRADE_OFF = structure(c(1L, 1L), .Label = "D",
    class = "factor")), .Names = c("Lastname", "Firstname", "CATALOG_NBR",
    "Email", "StudentID", "EMPLID", "Start", "Xattempts", "Q1", "Q2", "Q3",
    "Q4", "Q5", "Q6", "Q7", "Q8", "Q9", "Q10", "Q11", "Q12", "Q13", "Q14",
    "Q15", "Q16", "Q17", "Q18", "Q19", "Q20", "Q21", "Q22", "Q23", "Q24",
    "Q25", "Q26", "Q27", "Q28", "Q29", "Q30", "Q31", "Q32", "Score", "Form",
    "CRSE_GRADE_OFF"), row.names = c(NA, -2L), class = "data.frame")

Not very pretty, but this format is more resistant to email mangling and can generally be copied and pasted straight into an R session, which saves all the monkey business with Vim/OpenOffice/Excel/whatever.

Ki L. Matlock wrote:
>
>
> Each student took a pre- and post- test indicated by the date under
> "Start", column 7.
> (a date, mm/dd/yyyy, whose mm is 08 or 09 is pre-test;
> a date whose mm is 11 or 12 is post-test. This test was one of four
> forms, E, F, G, or H, listed under "Form", column 42. Each test had 32
> questions, Q1 to Q32, with a binary 1 indicating the student answered
> correctly to this question and 0 if incorrectly.
>
> I am needing a matrix, y, with five columns labeled: response, i, j, r, s.
> Column 1 indicates the response (0 or 1) for i-th student, on the j-th
> question (1:32), on the r-th form (E,F,G,H- these could be changed to
> numeric 1 for E, 2 for F, etc.), on the s-th test (pre or post indicated
> by a binary 0 for pre, 1 for post).
>
> The data-set is very lengthy of approximately 2000 rows. An efficient way
> to transform this data into the desired matrix would be very helpful.
> Thank you.
>
>

The melt() function from Hadley Wickham's 'reshape' package can probably take care of this for you. Assuming the data.frame is named "studentData", the following might process your data the way you want it:

    library(reshape)

    # Retrieve the names of all columns holding responses to questions.
    questions <- names(studentData)[grep('^[Q]', names(studentData))]

    # Reshape to one row per student/question combination.
    testBreakdown <- melt(studentData, c('StudentID', 'Form'), questions,
                          variable_name = 'Question')

The first argument after the name of the data set specifies the names of the columns that we wish to use to categorize the data (the id variables). The second argument specifies the names of the columns that contain the data we are interested in (the measured variables).

testBreakdown is now a data.frame containing:

  A column labeled "StudentID": the ID of the student.
  A column labeled "Form": the code of the form they used.
  A column labeled "Question": the name of the question they answered. The default name for this column is "variable", but I overrode it by setting variable_name in the above call to melt().
  A column labeled "value": the result of the student's answer to the given question.
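For readers who do not have the 'reshape' package installed, the long format described above can be approximated in base R. The following is only a sketch, not the melt() implementation: it builds the same four columns by hand with rep() and unlist(). The toy data frame, with its made-up IDs "s01" and "s02" and only three questions, is an assumption for illustration.

```r
# Toy stand-in for studentData (hypothetical IDs and only three questions).
studentData <- data.frame(
  StudentID = c("s01", "s02"),
  Form      = c("E", "G"),
  Q1 = c(1L, 1L), Q2 = c(1L, 1L), Q3 = c(0L, 1L),
  stringsAsFactors = FALSE
)

# Names of the question columns: "Q" followed by digits.
questions <- grep("^Q[0-9]+$", names(studentData), value = TRUE)
n <- nrow(studentData)
k <- length(questions)

# unlist() stacks the question columns one after another (all of Q1, then
# all of Q2, ...), so the id columns are repeated k times as a whole and
# each question name is repeated once per student.
testBreakdown <- data.frame(
  StudentID = rep(studentData$StudentID, times = k),
  Form      = rep(studentData$Form, times = k),
  Question  = rep(questions, each = n),
  value     = unlist(studentData[questions], use.names = FALSE),
  stringsAsFactors = FALSE
)

print(testBreakdown)
```

The result has one row per student/question pair, so a 2000-row data frame with 32 questions would give 64000 rows.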
I was not able to figure out which part of your data.frame contained information concerning the "s-th" test taken by a student; maybe it got lost in translation. Anyway, if the column names and order you gave above are important, then all you need to do is rename and reorder the columns of testBreakdown.

Hope this helps!

-Charlie

-----
Charlie Sharpsteen
Undergraduate
Environmental Resources Engineering
Humboldt State University
--
View this message in context: http://old.nabble.com/Transforming-a-dataframe-into-a-response-predictor-matrix-tp26328345p26328719.html
Sent from the R help mailing list archive at Nabble.com.
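As a follow-up to the renaming and reordering step, here is a hedged base-R sketch of how the five-column matrix y from the original question might be assembled. It relies on the rule stated in the question, that a Start month of 08 or 09 means pre-test and 11 or 12 means post-test. Everything else is an assumption for illustration: the toy data, the made-up IDs "s01" and "s02", numbering students by the first appearance of their StudentID, and Start already being of class Date (a character column in mm/dd/yyyy form would first need as.Date(x, format = "%m/%d/%Y")).

```r
# Toy stand-in for studentData (hypothetical IDs, three questions only).
studentData <- data.frame(
  StudentID = c("s01", "s02"),
  Start     = as.Date(c("2008-12-02", "2008-09-03")),
  Form      = factor(c("E", "G"), levels = c("E", "F", "G", "H")),
  Q1 = c(1L, 1L), Q2 = c(1L, 1L), Q3 = c(0L, 1L),
  stringsAsFactors = FALSE
)

questions <- grep("^Q[0-9]+$", names(studentData), value = TRUE)
n <- nrow(studentData)
k <- length(questions)

# s: 0 for a pre-test (month 8 or 9), 1 for a post-test (month 11 or 12),
# NA for any other month.
month <- as.integer(format(studentData$Start, "%m"))
s <- ifelse(month %in% c(11L, 12L), 1L,
            ifelse(month %in% c(8L, 9L), 0L, NA_integer_))

# i: student index, so that the pre- and post-test rows of the same
# student share one number (assumes StudentID identifies a student).
i <- match(studentData$StudentID, unique(studentData$StudentID))

# r: numeric form code, 1 for E, 2 for F, 3 for G, 4 for H.
r <- as.integer(studentData$Form)

# Stack the question columns and repeat the per-row values to match.
y <- cbind(
  response = unlist(studentData[questions], use.names = FALSE),
  i = rep(i, times = k),
  j = rep(as.integer(sub("^Q", "", questions)), each = n),
  r = rep(r, times = k),
  s = rep(s, times = k)
)

print(y)
```

Because every column is an integer vector, cbind() returns an integer matrix rather than a data frame, which matches the "matrix, y" asked for in the question.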