Gosse, Michelle
2010-Apr-26 19:39 UTC
[R] failing to select a subset of observations based on variable values [Sec: UNCLASSIFIED]
Greetings all.
I'm starting analysis in R on a reasonably sized pre-existing dataset, of
583 variables and 1127 observations. This was an SPSS datafile, which I read in
using the read.spss command using the foreign package, and the data was assigned
to a data.frame when it was read in. The defaults in read.spss were used, except
I set to.data.frame = TRUE.
The data is a survey dataset (each observation/case = one participant), and many
of the variables are participants' responses to Likert scale items. These
have been coded on a 1 to 7 scale, with "8" used to code
"Don't know" responses. The assumption is that the 1-7 responses
are at least interval level, however the response "8" is clearly not.
For many analyses, this doesn't matter because I'm only doing chi-square
tests. However, for a between-group comparison crosstab I would like to exclude
those who gave "8" responses because I am only interested in testing
differences for the participants who gave responses measured on the Likert scale
proper.
I have encountered problems when I need to exclude the observations from
analysis, where they gave an "8" response to either of two questions
(Question 1A and Question 1B), which relate to columns 72 and 73 of the
dataframe. The chi-square I am trying to do is based on two other variables
(mean of Q1A+Q1B for each participant) and a grouping variable, which are
contained in columns 8 and 80 of the dataframe, respectively. The reason I am
excluding anyone who gave an "8" ("Don't know") response on
questions 1A and 1B is that their mean on these two questions cannot be
interpreted as the value "8" is nominal rather than interval/ratio and
therefore cannot be used in a mathematical expression.
I've been trying to use an if-or combination, and I can't get it to
work. The chi-square test without the attempt to subset using "if" is
working fine, I don't understand what I am doing wrong in my attempts to
subset.
I have tried to reference the variables like this:
> if ("Q1A"!=8 | "Q1B"!=8)
+ (table(micronutrients[,8,80]))
<group counts snipped>
> chisq.test(table(micronutrients[,8,80]))
The group counts returned from the table statement show me that no observations
are being excluded from the analysis. The chisq.test works fine on
(table(micronutrients[,8,80])) but, of course, it is being performed on the
entire dataset as I have been unsuccessful in subsetting the data.
I tried to see if the column names were objects and I got these
errors:
> object("Q1A")
Error: could not find function "object"
> Q1A
Error: object 'Q1A' not found
I'm not sure if this is important.
So I tried to do the if-or using the column number, but that didn't work
either:
> if (micronutrients[,72]!=8 | micronutrients[,73]!=8)
+ (table(micronutrients[,8,80]))
<group counts snipped>
Warning message:
In if (micronutrients[, 72] != 8 | micronutrients[, 73] != 8)
(table(micronutrients[, :
the condition has length > 1 and only the first element will be used
I got exactly the same chi-square output as in my previous attempt.
If any of you know SPSS, what I am trying to do in R is equivalent to:
temporary. select if not (Q1A=8 or Q1B=8). In SAS, it would be the same as a
subsetting if that lasted only for the particular analysis, or a where, e.g.
proc tabulate; where Q1A ne 8 or Q1B ne 8;
How can I subset the data? I would prefer not to create another variable to hold
the recodes as the dataset is already complex.
I only wish the subsetting condition to hold for the test immediately following
the instruction to subset (I need to subset the data in different ways for
different question combinations). Because the instruction is complete once the
table() command is issued, I am assuming that the if statement only relates to
the table() command and therefore only indirectly to the chisq.test() command
following (as this is being performed on the subsetted table) - which is exactly
what I want.
Cheers
Michelle
David Winsemius
2010-Apr-26 20:20 UTC
[R] failing to select a subset of observations based on variable values [Sec: UNCLASSIFIED]
On Apr 26, 2010, at 3:39 PM, Gosse, Michelle wrote:

> [...]
> I've been trying to use an if-or combination, and I can't get it to
> work.

Did you read the help page for if?

?"if"

> [...]
> So I tried to do the if-or using the column number, but that didn't
> work either:
>> if (micronutrients[,72]!=8 | micronutrients[,73]!=8)

Leave behind your SPSS syntactical constructions. SPSS and the SAS data
steps have implicit loops that operate sequentially along rows of
datasets. R does not work that way. A corresponding operation in R might
be:

apply(dataframe1, 1, <function that works on a row of data>)

"if" is a program control mechanism that does not operate on vectors. If
it gets a vector, it evaluates the first element and ignores the rest.
(You should have gotten a warning and you should have posted the
warning.) There is also the ifelse function that works with and returns
vectors.

> Warning message:
> In if (micronutrients[, 72] != 8 | micronutrients[, 73] != 8)
> (table(micronutrients[, :
> the condition has length > 1 and only the first element will be used
> [...]
> How can I subset the data? I would prefer not to create another
> variable to hold the recodes as the dataset is already complex.

?subset
?with

with(subset(micronutrients, micronutrients[, 72] != 8 | micronutrients[, 73] != 8), table(...) )

Not sure what you intended with ... table(micronutrients[,8,80]) ... ,
but generally one does not first reference an object with two dimensions
and then do so with three. It is considered good form around these parts
to offer at the very least str() on an object about which you are hoping
to get specific advice. We cannot read your mind.

> I only wish the subsetting condition to hold for the test
> immediately following the instruction to subset (I need to subset
> the data in different ways for different question combinations).
> [...]

Do not see any code to which this refers. You can assign the result of
the with(subset( ...), ... ) operation to an object on which you can do
statistical tests, but table does not automatically do a chi-square test.
Maybe you should look at summary.table() or xtabs().

David Winsemius, MD
West Hartford, CT
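
For readers following the thread, here is a minimal sketch of the with(subset(...), table(...)) idea from the reply, run on a small made-up data frame. The column names Q1A, Q1B, Q1mean and group, the cut-off of 4, and all the values are assumptions for illustration only; the real data uses columns 72, 73, 8 and 80. The sketch combines the two tests with & because dropping anyone who answered 8 on either question means keeping rows where neither answer is 8. With |, only rows where both answers are 8 would be dropped.

```r
# Toy stand-in for the survey data. Column names and values are
# hypothetical; in the thread the real columns are 72, 73, 8 and 80.
micronutrients <- data.frame(
  Q1A   = c(1, 8, 3, 7, 2, 8, 5, 4, 6, 2),
  Q1B   = c(2, 3, 8, 6, 1, 8, 5, 7, 6, 2),
  group = c("a", "a", "b", "b", "a", "b", "a", "b", "a", "b")
)
micronutrients$Q1mean <- (micronutrients$Q1A + micronutrients$Q1B) / 2

# A comparison on a column yields one TRUE/FALSE per row, which is why
# it cannot serve as the single condition that if() expects.
keep <- micronutrients$Q1A != 8 & micronutrients$Q1B != 8

# Subset only for this one table; the data frame itself is unchanged.
# Splitting Q1mean at 4 is an arbitrary choice made for this sketch.
tab <- with(subset(micronutrients, Q1A != 8 & Q1B != 8),
            table(high = Q1mean > 4, group = group))

# The test runs on the stored table. suppressWarnings() hides the
# small-sample warning that these toy counts trigger.
result <- suppressWarnings(chisq.test(tab))
```

Because the subset exists only inside the with() call, a different condition can be used for the next question pair without adding recoded variables to the data frame. To pick two columns by position, the usual form is micronutrients[, c(8, 80)].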