Hi all,

Can anyone explain why the following use of the subset() function produces a different outcome than the use of the "[" extractor?

The subset() function as used in

  density(subset(mydf, ht >= 150.0 & wt <= 150.0, select = c(age)))

appears to me from documentation to be equivalent to

  density(mydf[mydf$ht >= 150.0 & mydf$wt <= 150.0, "age"])

(modulo exclusion of NAs) but use of the former yields an error from density.default() (shown below).

Is this a bug in the subset() machinery? Or is it a documentation issue for the subset() function documentation or density() documentation?

I'm seeing issues such as this with newcomers to R who initially seem to prefer using subset() instead of the bracket extractor. At this point these functions are clearly not exchangeable. Should code be patched so that they are, or documentation amended to show when use of subset() is not appropriate?

> ### Bug in subset()?
> set.seed(123)
> mydf <- data.frame(ht = 150 + 10 * rnorm(100),
+                    wt = 150 + 10 * rnorm(100),
+                    age = sample(20:60, size = 100, replace = TRUE)
+                    )
> density(subset(mydf, ht >= 150.0 & wt <= 150.0, select = c(age)))
Error in density.default(subset(mydf, ht >= 150 & wt <= 150, select = c(age))) :
  argument 'x' must be numeric
> density(mydf[mydf$ht >= 150.0 & mydf$wt <= 150.0, "age"])

Call:
        density.default(x = mydf[mydf$ht >= 150 & mydf$wt <= 150, "age"])

Data: mydf[mydf$ht >= 150 & mydf$wt <= 150, "age"] (29 obs.);  Bandwidth 'bw' = 5.816

       x                y
 Min.   : 4.553   Min.   :3.781e-05
 1st Qu.:22.776   1st Qu.:3.108e-03
 Median :41.000   Median :1.775e-02
 Mean   :41.000   Mean   :1.370e-02
 3rd Qu.:59.224   3rd Qu.:2.128e-02
 Max.   :77.447   Max.   :2.665e-02

> sessionInfo()
R version 2.8.0 Patched (2008-11-06 r46845)
powerpc-apple-darwin9.5.0

locale:
C

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

loaded via a namespace (and not attached):
[1] Matrix_0.999375-16 grid_2.8.0         lattice_0.17-15    lme4_0.99875-9
[5] nlme_3.1-89

Steven McKinney

Statistician
Molecular Oncology and Breast Cancer Program
British Columbia Cancer Research Centre

email: smckinney +at+ bccrc +dot+ ca
tel: 604-675-8000 x7561

BCCRC
Molecular Oncology
675 West 10th Ave, Floor 4
Vancouver B.C.
V5Z 1L3
Canada
Steven,

check the class of the objects that you are creating.

Cheers,

Andrew

On Wed, January 21, 2009 10:02 am, Steven McKinney wrote:
> Can anyone explain why the following use of
> the subset() function produces a different
> outcome than the use of the "[" extractor?

Andrew Robinson
Senior Lecturer in Statistics
Department of Mathematics and Statistics
University of Melbourne, VIC 3010 Australia
Tel: +61-3-8344-6410
Fax: +61-3-8344 4599
Email: a.robinson at ms.unimelb.edu.au
Website: http://www.ms.unimelb.edu.au
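Following the hint to check the class, a minimal sketch of that check might look like the code below. It rebuilds mydf as in the original post; the names via_subset and via_bracket are introduced here only for illustration and do not appear in the thread.

```r
# Recreate the data frame from the original post.
set.seed(123)
mydf <- data.frame(ht  = 150 + 10 * rnorm(100),
                   wt  = 150 + 10 * rnorm(100),
                   age = sample(20:60, size = 100, replace = TRUE))

# subset() with 'select' keeps the data frame structure.
via_subset <- subset(mydf, ht >= 150 & wt <= 150, select = c(age))

# "[" with a single column name drops to a plain vector by default.
via_bracket <- mydf[mydf$ht >= 150 & mydf$wt <= 150, "age"]

class(via_subset)        # "data.frame"
class(via_bracket)       # "integer"

# density.default() requires a numeric 'x'; a data frame does not qualify.
is.numeric(via_subset)   # FALSE
is.numeric(via_bracket)  # TRUE
```

str() on each object gives the same information in more detail.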
on 01/20/2009 05:02 PM Steven McKinney wrote:
> Hi all,
>
> Can anyone explain why the following use of
> the subset() function produces a different
> outcome than the use of the "[" extractor?
>
> The subset() function as used in
>
> density(subset(mydf, ht >= 150.0 & wt <= 150.0, select = c(age)))

Here you are asking density() to be run on a data frame, which is what subset() returns, even when you select a single column. Thus, you get an error since density() expects a numeric vector.

No bug in either subset() or the documentation.

You could do this:

  density(subset(mydf, ht >= 150.0 & wt <= 150.0, select = age)[[1]])

> appears to me from documentation to be equivalent to
>
> density(mydf[mydf$ht >= 150.0 & mydf$wt <= 150.0, "age"])

Here you are running density() on a vector, so it works. This is because "[.data.frame" defaults to 'drop = TRUE' when a single column is selected, which means that the returned result is coerced to the lowest possible dimension. Thus, rather than a single data frame column, a vector is returned.

The result from subset() would be equivalent to using 'drop = FALSE'.

HTH,

Marc Schwartz
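The role of 'drop' described above can be sketched as follows. This is an illustration rather than anything posted in the thread: the object names (keep, df_bracket, v1, v2, v3, d) are made up here, and the 'drop = TRUE' call relies on the 'drop' argument of subset() for data frames, which defaults to FALSE.

```r
# Recreate the data frame from the original post.
set.seed(123)
mydf <- data.frame(ht  = 150 + 10 * rnorm(100),
                   wt  = 150 + 10 * rnorm(100),
                   age = sample(20:60, size = 100, replace = TRUE))

keep <- mydf$ht >= 150 & mydf$wt <= 150

# With drop = FALSE, "[" keeps the one-column data frame,
# which is what subset() returns by default.
df_bracket <- mydf[keep, "age", drop = FALSE]
class(df_bracket)   # "data.frame"

# Three ways to obtain a plain vector that density() accepts.
v1 <- subset(mydf, ht >= 150 & wt <= 150, select = age)[[1]]         # extract the column
v2 <- subset(mydf, ht >= 150 & wt <= 150, select = age, drop = TRUE) # ask subset() to drop
v3 <- mydf[keep, "age"]                                              # "[" drops by default

# Any of the three vectors can be passed on.
d <- density(v2)
```

Which form to prefer is a matter of taste; the point is only that the object reaching density() has to be a numeric vector.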
Consider an alternative and realize that it is density() that is complaining about being passed a dataframe rather than subset misbehaving:

> density(subset(mydf, ht >= 150.0 & wt <= 150.0)$age)

Call:
        density.default(x = subset(mydf, ht >= 150 & wt <= 150)$age)

Data: subset(mydf, ht >= 150 & wt <= 150)$age (29 obs.);  Bandwidth 'bw' = 5.816

       x                y
 Min.   : 4.553   Min.   :3.781e-05
 1st Qu.:22.776   1st Qu.:3.108e-03
 Median :41.000   Median :1.775e-02
 Mean   :41.000   Mean   :1.370e-02
 3rd Qu.:59.224   3rd Qu.:2.128e-02
 Max.   :77.447   Max.   :2.665e-02

--
David Winsemius

On Jan 20, 2009, at 6:02 PM, Steven McKinney wrote:
> Can anyone explain why the following use of
> the subset() function produces a different
> outcome than the use of the "[" extractor?
Passing extra arguments to FUN=mean or median in apply seems fine, but when FUN=min warnings are generated? See below.

Any ideas why?

Best regards,
Ryszard

Ryszard Czerminski
AstraZeneca Pharmaceuticals LP

> m
     [,1] [,2]
[1,]    1    2
[2,]    3   NA
[3,]   NA   NA
> apply(m, 1, median, na.rm=T)
[1] 1.5 3.0  NA
> apply(m, 1, mean, na.rm=T)
[1] 1.5 3.0 NaN
> apply(m, 1, min, na.rm=T)
[1]   1   3 Inf
Warning message:
In FUN(newX[, i], ...) : no non-missing arguments to min; returning Inf
on 01/20/2009 05:26 PM Czerminski, Ryszard wrote:
> Passing extra arguments to FUN=mean or median in apply
> seems fine, but when FUN=min warnings are generated?
>
>> apply(m, 1, min, na.rm=T)
> [1]   1   3 Inf
> Warning message:
> In FUN(newX[, i], ...) : no non-missing arguments to min; returning Inf

This is not a problem with min(). The last row becomes an empty set once the 2 NAs are removed by na.rm = TRUE.

> min(NA, NA, na.rm = TRUE)
[1] Inf
Warning message:
In min(NA, NA, na.rm = TRUE) :
  no non-missing arguments to min; returning Inf

You are effectively doing:

> min(numeric(0))
[1] Inf
Warning message:
In min(numeric(0)) : no non-missing arguments to min; returning Inf

See:

  http://wiki.r-project.org/rwiki/doku.php?id=tips:surprises:emptysetfuncs

for more information on empty sets.

HTH,

Marc Schwartz
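To make the empty-set behaviour concrete, here is a small sketch. The matrix m is rebuilt to match the one printed in the question. safe_min() is a hypothetical helper written for this example, not a base R function, and returning NA for an all-NA row is one possible convention rather than the only reasonable one.

```r
# Rebuild the matrix from the question (filled column by column).
m <- matrix(c(1, 3, NA, 2, NA, NA), nrow = 3)

# What the summary functions return for an empty numeric vector.
# mean() and median() give NaN and NA without complaint; min() and max()
# return their identity elements and warn, hence suppressWarnings().
mean(numeric(0))                   # NaN
median(numeric(0))                 # NA
suppressWarnings(min(numeric(0)))  # Inf
suppressWarnings(max(numeric(0)))  # -Inf

# Hypothetical helper (an assumption for this sketch): drop NAs first and
# return NA, without a warning, when nothing is left.
safe_min <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0L) NA_real_ else min(x)
}

row_min <- apply(m, 1, safe_min)
row_min                            # 1 3 NA
```

If Inf is an acceptable result for an all-NA row, wrapping the original call in suppressWarnings() would be the shorter route.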