Gosse, Michelle
2010-Apr-26 19:39 UTC
[R] failing to select a subset of observations based on variable values [Sec: UNCLASSIFIED]
Greetings all.

I'm starting analysis in R on a reasonably sized pre-existing dataset, of 583 variables and 1127 observations. This was an SPSS datafile, which I read in using the read.spss command from the foreign package, and the data was assigned to a data.frame when it was read in. The defaults in read.spss were used, except I set to.data.frame = TRUE.

The data is a survey dataset (each observation/case = one participant), and many of the variables are participants' responses to Likert scale items. These have been coded on a 1 to 7 scale, with "8" used to code "Don't know" responses. The assumption is that the 1-7 responses are at least interval level; however, the response "8" is clearly not. For many analyses, this doesn't matter because I'm only doing chi-square tests. However, for a between-group comparison crosstab I would like to exclude those who gave "8" responses because I am only interested in testing differences for the participants who gave responses measured on the Likert scale proper.

I have encountered problems when I need to exclude the observations from analysis, where they gave an "8" response to either of two questions (Question 1A and Question 1B), which relate to columns 72 and 73 of the dataframe. The chi-square I am trying to do is based on two other variables, the mean of Q1A+Q1B for each participant and a grouping variable, which are contained in columns 8 and 80 of the dataframe, respectively. The reason I am excluding anyone who gave an "8" ("Don't know") response on questions 1A and 1B is that their mean on these two questions cannot be interpreted, as the value "8" is nominal rather than interval/ratio and therefore cannot be used in a mathematical expression.

I've been trying to use an if-or combination, and I can't get it to work. The chi-square test without the attempt to subset using "if" is working fine; I don't understand what I am doing wrong in my attempts to subset.

I have tried to reference the variables like this:

> if ("Q1A"!=8 | "Q1B"!=8)
+ (table(micronutrients[,8,80]))
<group counts snipped>
> chisq.test(table(micronutrients[,8,80]))

The group counts returned from the table statement show me that no observations are being excluded from the analysis. The chisq.test works fine on (table(micronutrients[,8,80])) but, of course, it is being performed on the entire dataset as I have been unsuccessful in subsetting the data.

I tried to see if the column names were objects and I got these errors:

> object("Q1A")
Error: could not find function "object"
> Q1A
Error: object 'Q1A' not found

I'm not sure if this is important.

So I tried to do the if-or using the column number, but that didn't work either:

> if (micronutrients[,72]!=8 | micronutrients[,73]!=8)
+ (table(micronutrients[,8,80]))
<group counts snipped>
Warning message:
In if (micronutrients[, 72] != 8 | micronutrients[, 73] != 8) (table(micronutrients[, :
  the condition has length > 1 and only the first element will be used

I got exactly the same chi-square output as in my previous attempt.

If any of you know SPSS, what I am trying to do in R is equivalent to: temporary. select if not (Q1A=8 or Q1B=8). In SAS, it would be the same as a subsetting if that lasted only for the particular analysis, or a where, e.g. proc tabulate; where Q1A ne 8 or Q1B ne 8;

How can I subset the data? I would prefer not to create another variable to hold the recodes as the dataset is already complex.

I only wish the subsetting condition to hold for the test immediately following the instruction to subset (I need to subset the data in different ways for different question combinations). Because the instruction is complete once the table() command is issued, I am assuming that the if statement only relates to the table() command and therefore only indirectly to the chisq.test() command following (as this is being performed on the subsetted table) - which is exactly what I want.

Cheers
Michelle
David Winsemius
2010-Apr-26 20:20 UTC
[R] failing to select a subset of observations based on variable values [Sec: UNCLASSIFIED]
On Apr 26, 2010, at 3:39 PM, Gosse, Michelle wrote:

> Greetings all.
>
> I'm starting analysis in R on a reasonably sized pre-existing
> dataset, of 583 variables and 1127 observations. This was an SPSS
> datafile, which I read in using the read.spss command using the
> foreign package, and the data was assigned to a data.frame when it
> was read in. The defaults in read.spss were used, except I set
> to.data.frame = TRUE.
>
> The data is a survey dataset (each observation/case = one
> participant), and many of the variables are participants' responses
> to Likert scale items. These have been coded on a 1 to 7 scale, with
> "8" used to code "Don't know" responses. The assumption is that the
> 1-7 responses are at least interval level, however the response "8"
> is clearly not. For many analyses, this doesn't matter because I'm
> only doing chi-square tests. However, for a between-group comparison
> crosstab I would like to exclude those who gave "8" responses
> because I am only interested in testing differences for the
> participants who gave responses measured on the Likert scale proper.
>
> I have encountered problems when I need to exclude the observations
> from analysis, where they gave an "8" response to either of two
> questions (Question 1A and Question 1B), which relate to columns 72
> and 73 of the dataframe. The chi-square I am trying to do is based
> on two other variables (mean of Q1A+Q1B for each participant) and a
> grouping variable, which are contained in columns 8 and 80 of the
> dataframe, respectively. The reason I am excluding anyone who gave
> an "8" ("Don't know") response on questions 1A and 1B is that their
> mean on these two questions cannot be interpreted as the value "8"
> is nominal rather than interval/ratio and therefore cannot be used
> in a mathematical expression.
>
> I've been trying to use an if-or combination, and I can't get it to
> work.

Did you read the help page for if?

?"if"

> The chi-square test without the attempt to subset using "if" is
> working fine, I don't understand what I am doing wrong in my
> attempts to subset.
>
> I have tried to reference the variables like this:
>> if ("Q1A"!=8 | "Q1B"!=8)
> + (table(micronutrients[,8,80]))
> <group counts snipped>
>> chisq.test(table(micronutrients[,8,80]))
>
> The group counts returned from the table statement show me that no
> observations are being excluded from the analysis. The chisq.test
> works fine on (table(micronutrients[,8,80])) but, of course, it is
> being performed on the entire dataset as I have been unsuccessful in
> subsetting the data.
>
> I tried to see if the column names were objects and I got these
> errors:
>> object("Q1A")
> Error: could not find function "object"
>> Q1A
> Error: object 'Q1A' not found
> I'm not sure if this is important.
>
> So I tried to do the if-or using the column number, but that didn't
> work either:
>> if (micronutrients[,72]!=8 | micronutrients[,73]!=8)

Leave behind your SPSS syntactical constructions. SPSS and the SAS data steps have implicit loops that operate sequentially along rows of datasets. R does not work that way. A corresponding operation in R might be:

apply(dataframe1, 1, <function that works on a row of data>)

"if" is a program control mechanism that does not operate on vectors. If it gets a vector, it evaluates the first element and ignores the rest. (You should have gotten a warning and you should have posted the warning.) There is also the ifelse function that works with and returns vectors.

> + ()
> <group counts snipped>
> Warning message:
> In if () (table(micronutrients[, :
> the condition has length > 1 and only the first element will be used
>
> I got exactly the same chi-square output as in my previous attempt.
>
> If any of you know SPSS, what I am trying to do in R is equivalent
> to: temporary. select if not (Q1A=8 or Q1B=8).
> In SAS, it would be
> the same as a subsetting if that lasted only for the particular
> analysis, or a where, e.g. proc tabulate; where Q1A ne 8 or Q1B ne 8;
>
> How can I subset the data? I would prefer not to create another
> variable to hold the recodes as the dataset is already complex.

?subset
?with

with(subset(dfrm, micronutrients[, 72] != 8 | micronutrients[, 73] != 8), table(...) )

Not sure what you intended with ... table(micronutrients[,8,80]) ... , but generally one does not first reference an object with two dimensions and then do so with three.

It is considered good form around these parts to offer at the very least str() on an object about which you are hoping to get specific advice. We cannot read your mind.

> I only wish the subsetting condition to hold for the test
> immediately following the instruction to subset (I need to subset
> the data in different ways for different question combinations).
> Because the instruction is complete once the table() command is
> issued, I am assuming that the if statement only relates to the
> table() command and therefore only indirectly to the chisq.test()
> command following (as this is being performed on the subsetted
> table) - which is exactly what I want.

Do not see any code to which this refers. You can assign the result of the with(subset( ...), ... ) operation to an object on which you can do statistical tests, but table does not automatically do a chi-square test. Maybe you should look at summary.table() or xtabs().

> Cheers
> Michelle

David Winsemius, MD
West Hartford, CT
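
The advice in this reply can be sketched on a small made-up data frame. The column names Q1A and Q1B come from the original post; meanQ1, group, and every value below are assumptions for illustration only, and the columns are assumed to be numeric (read.spss may return factors when value labels are present, in which case the comparison with 8 would need adjusting). Excluding anyone who answered 8 on either question means keeping the rows where both answers differ from 8, so the sketch uses & rather than |.

```r
# Toy stand-in for the survey data; all values are invented for illustration.
micronutrients <- data.frame(
  Q1A   = c(1, 8, 3, 7, 5, 8, 2, 6, 4, 3, 7, 1),
  Q1B   = c(2, 3, 8, 6, 5, 8, 1, 7, 4, 2, 6, 8),
  group = factor(rep(c("A", "B"), times = 6))
)
# Hypothetical counterpart of the precomputed mean held in column 8.
micronutrients$meanQ1 <- (micronutrients$Q1A + micronutrients$Q1B) / 2

# A comparison on a column yields one TRUE/FALSE per row, not a single
# value, which is why it cannot serve as the condition of `if`.
keep <- micronutrients$Q1A != 8 & micronutrients$Q1B != 8
length(keep)

# ifelse() is the vectorised counterpart: it returns one value per row.
recoded <- ifelse(micronutrients$Q1A == 8, NA, micronutrients$Q1A)

# Temporary subset: the filtered rows exist only inside this expression,
# so nothing is added to or removed from micronutrients itself.
tab <- with(subset(micronutrients, Q1A != 8 & Q1B != 8),
            table(meanQ1, group))
tab

# table() only counts; the test is a separate call on the resulting table.
# With toy data this small, chisq.test() warns that the chi-squared
# approximation may be incorrect.
result <- chisq.test(tab)
result
```

Because the subset is never assigned, the next analysis can apply a different condition to the untouched data frame. Note that in R 4.2.0 and later, a condition of length greater than one in if is an error rather than the warning shown in the original post.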