I'm having a weird problem with length(), in R 1.6.1 under Windows 2000. I have a data frame called byyr, with ten columns, the first of which is named cnd95. summary(byyr) shows that byyr$cnd95 contains the factor level "tr" 66 times. Also, when I enter byyr$cnd95 at the command line, I can count 66 "tr" elements in the resulting vector. However, when I enter

    n95trt <- length(byyr$cnd95[byyr$cnd95=="tr"])
    n95trt

the result is 68! Any ideas why this is happening, and how I can fix the miscount? (That column also contains 69 entries of "c", and (relevantly?) two NA's.)

Thanks for any help.

Dave Parkhurst
It's the users who are misbehaving -- it usually is!

I think you mean [byyr$cnd95 %in% "tr"], which is not the same thing, because R has NA character strings.

> x <- c("a", "a", NA, "b2")
> x == "a"
[1]  TRUE  TRUE    NA FALSE
> x[x == "a"]
[1] "a" "a" NA
> x[x %in% "a"]
[1] "a" "a"

MASS4 page 30 discusses this and similar traps.

On Fri, 14 Mar 2003, David Parkhurst wrote:

> However, when I enter
>
> n95trt <- length(byyr$cnd95[byyr$cnd95=="tr"])
> n95trt
>
> the result is 68! Any ideas why this is happening, and how I can fix the miscount?
> (That column also contains 69 entries of "c", and (relevantly?) two NA's.)

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
David Parkhurst wrote:

> However, when I enter
>
> n95trt <- length(byyr$cnd95[byyr$cnd95=="tr"])
> n95trt
>
> the result is 68! Any ideas why this is happening, and how I can fix the miscount?
> (That column also contains 69 entries of "c", and (relevantly?) two NA's.)

The result you are looking for can be calculated with

    sum(byyr$cnd95 == "tr", na.rm=TRUE)

Look at

    byyr$cnd95 == "tr"

and you'll get TRUE, FALSE, and NAs. Indexing with NAs yields NAs, and hence these are included in the length of the resulting vector.

Uwe Ligges
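The counts reported in the question can be reproduced with a small sketch. The vector below is a hypothetical stand-in for byyr$cnd95, built only from the numbers given in the thread (66 "tr", 69 "c", two NA's); the real data frame is not available here, and the variable names are made up for illustration.

```r
# Hypothetical stand-in for byyr$cnd95: 66 "tr", 69 "c" and 2 NA
cnd95 <- factor(c(rep("tr", 66), rep("c", 69), NA, NA))

# The comparison is a logical vector containing TRUE, FALSE and NA
is_tr <- cnd95 == "tr"
table(is_tr, useNA = "ifany")

# Indexing with NA returns an NA element, so both NAs are counted
n_indexed <- length(cnd95[is_tr])

# Summing the logical vector with na.rm = TRUE counts only the TRUEs
n_summed <- sum(is_tr, na.rm = TRUE)

n_indexed   # 68
n_summed    # 66
```

The difference of two between the results is exactly the number of NA entries in the column.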
Hi Dave,

 | From: "David Parkhurst" <parkhurs at indiana.edu>
 | Date: Fri, 14 Mar 2003 10:35:19 -0500
 |
 | However, when I enter
 |
 | n95trt <- length(byyr$cnd95[byyr$cnd95=="tr"])
 | n95trt
 |
 | the result is 68! Any ideas why this is happening, and how I can fix the miscount?
 | (That column also contains 69 entries of "c", and (relevantly?) two NA's.)

Yes, the NAs are relevant. Try:

> a <- factor(c("a", "a", NA))
> a
[1] a    a    <NA>
Levels: a
> summary(a)
   a NA's
   2    1
> a=="a"
[1] TRUE TRUE   NA
# the comparison has 3 elements, none of them FALSE, so a[a=="a"] has 3 elements too
> sum(a=="a", na.rm=TRUE)
[1] 2

The last call will give you the correct length. Perhaps this helps.

Ott
>-----Original Message-----
>From: r-help-bounces at stat.math.ethz.ch
>[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of David Parkhurst
>Sent: Friday, March 14, 2003 9:35 AM
>To: r-help at stat.math.ethz.ch
>Subject: [R] length() misbehaving?
>
>However, when I enter
>
>n95trt <- length(byyr$cnd95[byyr$cnd95=="tr"])
>n95trt
>
>the result is 68! Any ideas why this is happening, and how I
>can fix the miscount? (That column also contains 69 entries of
>"c", and (relevantly?) two NA's.)

It is expected.

Since NA represents a true unknown, the two NA's in your vector 'may be' a "tr". Thus, the comparison returns NA rather than FALSE for them, and indexing with NA yields an NA element, so length() counts both.

Instead of length(), you might want to use:

    sum(byyr$cnd95[byyr$cnd95 == "tr"], na.rm = TRUE)

which will remove the two NA's.

See ?sum

HTH,

Marc Schwartz
>-----Original Message-----
>From: Marc Schwartz [mailto:mschwartz at medanalytics.com]
>Sent: Friday, March 14, 2003 10:23 AM
>To: 'David Parkhurst'; 'r-help at stat.math.ethz.ch'
>Subject: RE: [R] length() misbehaving?
>
>Instead of length(), you might want to use:
>
>sum(byyr$cnd95[byyr$cnd95 == "tr"], na.rm = TRUE)
>
>which will remove the two NA's.

Correction. I mis-copied the code. It should be:

    sum(byyr$cnd95 == "tr", na.rm = TRUE)

Apologies,

Marc
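To see why the correction matters, here is a minimal sketch on a hypothetical four-element factor (not the poster's data). The mis-copied form passes a factor subset to sum(), which R rejects for factors; the corrected form sums the logical comparison itself. The tryCatch() wrapper is only there so the sketch runs to the end.

```r
# Hypothetical small factor standing in for byyr$cnd95
x <- factor(c("tr", "tr", "c", NA))

# Mis-copied form: x[x == "tr"] is still a factor, and sum() is not
# defined for factors, so this raises an error instead of counting
miscopied <- tryCatch(
  sum(x[x == "tr"], na.rm = TRUE),
  error = function(e) conditionMessage(e)
)
miscopied   # an error message, not a count

# Corrected form: sum the logical vector, dropping the NA comparisons
corrected <- sum(x == "tr", na.rm = TRUE)
corrected   # 2
```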
On Fri, 14 Mar 2003, David Parkhurst wrote:

> However, when I enter
>
> n95trt <- length(byyr$cnd95[byyr$cnd95=="tr"])
> n95trt

In addition to the sum() approach suggested by other people, there's an approach using %in%:

    n95trt <- length(byyr$cnd95[byyr$cnd95 %in% "tr"])

You can check that NA=="tr" returns NA, but NA %in% "tr" returns FALSE.

 -thomas
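The check suggested above, and the agreement between the approaches mentioned in this thread, can be sketched as follows. The factor x is hypothetical illustration data, and the which() variant is an extra option that nobody in the thread proposed; it is included only because which() also discards NA.

```r
# The two comparisons treat NA differently
na_eq <- NA == "tr"      # NA: the result of the comparison is unknown
na_in <- NA %in% "tr"    # FALSE: %in% never returns NA

# Hypothetical data: two "tr", one "c", two NA
x <- factor(c("tr", "c", NA, "tr", NA))

n_eq    <- length(x[x == "tr"])          # counts the NAs as well
n_in    <- length(x[x %in% "tr"])        # %in% approach
n_sum   <- sum(x == "tr", na.rm = TRUE)  # sum() approach
n_which <- length(which(x == "tr"))      # which() keeps only TRUE positions

c(n_eq, n_in, n_sum, n_which)   # 4 2 2 2
```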