Bashir Saghir (Aztek Global)
2005-Jul-05 10:04 UTC
[R] by (tapply) and for loop differences
I am getting a difference in results when running some analysis using by and tapply compared to using a for loop. I've tried searching the web but had no luck with the keywords I used.

I've attached a simple example below to illustrate my problem. I get a difference in the mean of yvar, diff and the p-value using tapply & by compared to a for loop. I cannot see what I am doing wrong. Can anyone help?

> # Simulate some data - I'll do 2 simulations...
>
> xvar = rnorm(40, 20, 5)
> yvar = rnorm(40, 22, 2)
> num = factor(rep(1:2, each=20))
> sdat = data.frame(cbind(num, xvar, yvar))
>
> # Define a function to do a simple t test and return some values...
>
> kindtest = function(varx, vary){
+   res = t.test(varx, vary)
+   x.mn = res$estimate[1]
+   y.mn = res$estimate[2]
+   diff = y.mn-x.mn
+   pval = res$p.value
+   cat("Mean xvar =", x.mn, " Mean yvar =", y.mn)
+   cat(" diff =", diff, " p-value=", pval, "\n\n")
+   list(x.mn=x.mn, y.mn=y.mn, diff=diff, pval=pval)
+ }

## Results from by and tapply

> attach(sdat)
> bres = by(xvar, num, kindtest, yvar)
Mean xvar = 19.8904 Mean yvar = 21.97729 diff = 2.086891 p-value= 0.06222805

Mean xvar = 19.88329 Mean yvar = 21.97729 diff = 2.093996 p-value= 0.05245329

> tres = tapply(xvar, num, kindtest, yvar)
Mean xvar = 19.8904 Mean yvar = 21.97729 diff = 2.086891 p-value= 0.06222805

Mean xvar = 19.88329 Mean yvar = 21.97729 diff = 2.093996 p-value= 0.05245329

> detach(sdat,1)

## Results from for

> for(i in 1:2) {
+   subdat= subset(sdat, num==i)
+   kindtest(subdat$xvar, subdat$yvar)
+ }
Mean xvar = 19.8904 Mean yvar = 21.98615 diff = 2.095746 p-value= 0.07319223

Mean xvar = 19.88329 Mean yvar = 21.96843 diff = 2.085141 p-value= 0.05850057

OKAY - I'm going to be brave and show you that I am still on version 1.9.0! I asked the IT/IS department for an upgrade when version 2 was first released! Last I heard my request was in the black hole of documented and undocumented processes to approve software upgrades... So this error may not occur in the latest version...
If so, just let me know which of the above is correct (if any) and I'll just live with it (or run it at home on version 2.1.1). Thanks.

> version
         _
platform i386-pc-mingw32
arch     i386
os       mingw32
system   i386, mingw32
status
major    1
minor    9.0
year     2004
month    04
day      12
language R

Thanks,
Saghir

"Bashir Saghir (Aztek Global)" <Saghir.Bashir at ucb-group.com> writes:

> I am getting a difference in results when running some analysis using by and
> tapply compared to using a for loop. I've tried searching the web but had no
> luck with the keywords I used.
>
> I've attached a simple example below to illustrate my problem. I get a
> difference in the mean of yvar, diff and the p-value using tapply & by
> compared to a for loop. I cannot see what I am doing wrong. Can anyone help?
>
> > # Simulate some data - I'll do 2 simulations...
> >
> > xvar = rnorm(40, 20, 5)
> > yvar = rnorm(40, 22, 2)
> > num = factor(rep(1:2, each=20))
> > sdat = data.frame(cbind(num, xvar, yvar))
> >
> > # Define a function to do a simple t test and return some values...
> >
> > kindtest = function(varx, vary){
> +   res = t.test(varx, vary)
> +   x.mn = res$estimate[1]
> +   y.mn = res$estimate[2]
> +   diff = y.mn-x.mn
> +   pval = res$p.value
> +   cat("Mean xvar =", x.mn, " Mean yvar =", y.mn)
> +   cat(" diff =", diff, " p-value=", pval, "\n\n")
> +   list(x.mn=x.mn, y.mn=y.mn, diff=diff, pval=pval)
> + }
>
> ## Results from by and tapply
>
> > attach(sdat)
> > bres = by(xvar, num, kindtest, yvar)
> Mean xvar = 19.8904 Mean yvar = 21.97729 diff = 2.086891 p-value= 0.06222805
>
> Mean xvar = 19.88329 Mean yvar = 21.97729 diff = 2.093996 p-value= 0.05245329
>
> > tres = tapply(xvar, num, kindtest, yvar)
> Mean xvar = 19.8904 Mean yvar = 21.97729 diff = 2.086891 p-value= 0.06222805
>
> Mean xvar = 19.88329 Mean yvar = 21.97729 diff = 2.093996 p-value= 0.05245329
>
> > detach(sdat,1)
>
> ## Results from for
>
> > for(i in 1:2) {
> +   subdat= subset(sdat, num==i)
> +   kindtest(subdat$xvar, subdat$yvar)
> + }
> Mean xvar = 19.8904 Mean yvar = 21.98615 diff = 2.095746 p-value= 0.07319223
>
> Mean xvar = 19.88329 Mean yvar = 21.96843 diff = 2.085141 p-value= 0.05850057

The fact that the by/tapply approach is giving you the same Mean yvar for both groups should be a dead giveaway...
Stick print(varx) and print(vary) into kindtest, and you'll see the point. You are passing yvar *without* subsetting (and since the t.test isn't paired, it can hardly be expected to complain that x and y differ in length...).

This is probably closer to the mark:

by(sdat, num, with, kindtest(xvar, yvar))

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907
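
To make the diagnosis in the reply above concrete, here is a minimal sketch in R. It only counts how many values of each argument reach the function, so it does not rerun the t test itself. The seed, the helper `lens`, and the result names (`unsplit_y`, `split_both`, `split_by`, `y_means`) are assumptions made for illustration and do not appear in the thread. The fix shown uses an anonymous function on each sub-data-frame, which is one possible variant of the `with` idiom suggested in the reply, not the only way to do it.

```r
# Illustrative data in the same shape as the original post.
# The seed is an assumption so the sketch is repeatable.
set.seed(1)
xvar <- rnorm(40, 20, 5)
yvar <- rnorm(40, 22, 2)
num  <- factor(rep(1:2, each = 20))
sdat <- data.frame(num, xvar, yvar)

# Hypothetical helper: report how many values of each argument arrive.
lens <- function(varx, vary) c(nx = length(varx), ny = length(vary))

# tapply() splits only its first argument by the grouping factor.
# Extra arguments are handed to the function unchanged, so all 40
# values of yvar reach every call while varx has only 20.
unsplit_y <- tapply(sdat$xvar, sdat$num, lens, sdat$yvar)

# Splitting the whole data frame subsets both columns together,
# which is what the for loop with subset() was doing.
split_both <- lapply(split(sdat, sdat$num),
                     function(d) lens(d$xvar, d$yvar))

# The same idea with by() applied to the data frame.
split_by <- by(sdat, sdat$num, function(d) lens(d$xvar, d$yvar))

# Per-group means of yvar, which differ between groups as in the loop.
y_means <- tapply(sdat$yvar, sdat$num, mean)

print(unsplit_y)
print(split_both)
print(y_means)
```

With the grouping done on the data frame, replacing `lens` by `kindtest` should give one t test per group on matching halves of xvar and yvar, in line with the for loop results.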