Todd A. Johnson
2007-Feb-20 09:48 UTC
[R] Difficulties with dataframe filter using elements from an array created using a for loop or seq()
Hi All-

This seems like such a pathetic problem to be posting about, but I have no idea why this testcase does not work. I have tried this using R 2.4.1, 2.4.0, 2.3.0, and 2.0.0 on several different computers (Mac OS 10.4.8, Windows XP, Linux). Below the signature, you will find my test case R code.

My point in this folly is to take a dataframe of 300,000 rows, create a filter based on two of the rows, and count the number of rows in the filtered and unfiltered dataframe. One column in the filter only has the numbers 0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95, so I thought that I could just iterate in a for loop and get the job done. Just the simple single column filter case is presented here. Obviously, there are only ten numbers, so the "manual" method is easy, but I would like to have a more flexible program. (Plus it worries me if the simple things don't do what I expect... :-) )

From the output, you can see that the loop using the "handmadevector" that creates a filter and counts the elements correctly finds one match for each element in the vector, but the seq() and for loop produced vectors each give a mixture of true and false matches.

Can anyone tell me why the "loopvector" and "seqvector" do not provide the same output as the "handmadevector"?

Thank you for your assistance!

Todd

--
Todd A. Johnson
Research Associate, Laboratory for Medical Informatics
SNP Research Center, RIKEN
1-7-22 Suehiro, Tsurumi-ku, Yokohama
Kanagawa 230-0045, Japan
Cellphone: 090-5309-5867
E-mail: tjohnson at src.riken.jp

Here's the testcase, with the sample code between the lines and the output following:
_____________________________________________________________________

## Set up three different vectors, each with the numbers 0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95
## each of which is used to select records from a dataframe based on equality to a particular column

## The first vector is created by using a for loop
loopvector <- c()
for (i in 0:9){
    loopvector <- c(loopvector, (i*0.10)+0.05);
}

## The second vector is made "by hand"
handmadevector <- c(0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95)

## The third vector is made using seq()
seqvector <- seq(0.05, 0.95, 0.10)

## Are the vectors the same?
all.equal(loopvector, handmadevector)
all.equal(loopvector, seqvector)

print(handmadevector)
print(loopvector)
print(seqvector)

## As a simple testcase, I create a dataframe with two variables, a varA of dummy data, and bBins
## which is the column on which I was trying to filter.
a <- c(0,1,2,0,1,3,4,5,3,5)
b <- c(0.05,0.15,0.25,0.35,0.45,0.55,0.65,0.75,0.85,0.95)
testdf <- data.frame(varA = a, bBins = b)
attach(testdf)

## Loop through each of the vectors, create a filter on the dataframe based on equality with the current iteration,
## and print that number and the count of records in the dataframe that match that number.
for (i in loopvector){
    aqs_filt <- bBins==i;
    print(i);
    print(length(testdf$varA[aqs_filt]));
}

for (i in handmadevector){
    aqs_filt <- bBins==i;
    print(i);
    print(length(testdf$varA[aqs_filt]));
}

for (i in seqvector){
    aqs_filt <- bBins==i;
    print(i);
    print(length(testdf$varA[aqs_filt]));
}
_____________________________________________________________________

Here's the output from R 2.4.1 running on an Apple 12" Powerbook.

> ## Set up three different vectors, each with the numbers 0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95
> ## each of which is used to select records from a dataframe based on equality to a particular column
> ## The first vector is created by using a for loop
> loopvector <- c()
> for (i in 0:9){
+     loopvector <- c(loopvector, (i*0.10)+0.05);
+ }
> ## The second vector is made "by hand"
> handmadevector <- c(0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95)
> ## The third vector is made using seq()
> seqvector <- seq(0.05, 0.95, 0.10)
> ## Are the vectors the same?
> all.equal(loopvector, handmadevector)
[1] TRUE
> all.equal(loopvector, seqvector)
[1] TRUE
>
> print(handmadevector)
[1] 0.05 0.15 0.25 0.35 0.45 0.55 0.65 0.75 0.85 0.95
> print(loopvector)
[1] 0.05 0.15 0.25 0.35 0.45 0.55 0.65 0.75 0.85 0.95
> print(seqvector)
[1] 0.05 0.15 0.25 0.35 0.45 0.55 0.65 0.75 0.85 0.95
> ## As a simple testcase, I create a dataframe with two variables, a varA of dummy data, and bBins
> ## which is the column on which I was trying to filter.
> a <- c(0,1,2,0,1,3,4,5,3,5)
> b <- c(0.05,0.15,0.25,0.35,0.45,0.55,0.65,0.75,0.85,0.95)
> testdf <- data.frame(varA = a, bBins = b)
> attach(testdf)
> ## Loop through each of the vectors, create a filter on the dataframe based on equality with the current iteration,
> ## and print that number and the count of records in the dataframe that match that number.
> for (i in loopvector){
+     aqs_filt <- bBins==i;
+     print(i);
+     print(length(testdf$varA[aqs_filt]));
+ }
[1] 0.05
[1] 1
[1] 0.15
[1] 0
[1] 0.25
[1] 1
[1] 0.35
[1] 0
[1] 0.45
[1] 1
[1] 0.55
[1] 1
[1] 0.65
[1] 0
[1] 0.75
[1] 0
[1] 0.85
[1] 0
[1] 0.95
[1] 0
> for (i in handmadevector){
+     aqs_filt <- bBins==i;
+     print(i);
+     print(length(testdf$varA[aqs_filt]));
+ }
[1] 0.05
[1] 1
[1] 0.15
[1] 1
[1] 0.25
[1] 1
[1] 0.35
[1] 1
[1] 0.45
[1] 1
[1] 0.55
[1] 1
[1] 0.65
[1] 1
[1] 0.75
[1] 1
[1] 0.85
[1] 1
[1] 0.95
[1] 1
> for (i in seqvector){
+     aqs_filt <- bBins==i;
+     print(i);
+     print(length(testdf$varA[aqs_filt]));
+ }
[1] 0.05
[1] 1
[1] 0.15
[1] 0
[1] 0.25
[1] 1
[1] 0.35
[1] 0
[1] 0.45
[1] 1
[1] 0.55
[1] 1
[1] 0.65
[1] 0
[1] 0.75
[1] 0
[1] 0.85
[1] 0
[1] 0.95
[1] 0
>
Prof Brian Ripley
2007-Feb-20 10:07 UTC
[R] Difficulties with dataframe filter using elements from an array created using a for loop or seq()
FAQ Q7.31

On Tue, 20 Feb 2007, Todd A. Johnson wrote:

> Hi All-
>
> This seems like such a pathetic problem to be posting about, but I have no
> idea why this testcase does not work. I have tried this using R 2.4.1,
> 2.4.0, 2.3.0, and 2.0.0 on several different computers (Mac OS 10.4.8,
> Windows XP, Linux). Below the signature, you will find my test case R code.
>
> My point in this folly is to take a dataframe of 300,000 rows, create a
> filter based on two of the rows, and count the number of rows in the
> filtered and unfiltered dataframe. One column in the filter only has the
> numbers 0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95, so I
> thought that I could just iterate in a for loop and get the job done. Just
> the simple single column filter case is presented here. Obviously, there are
> only ten numbers, so the "manual" method is easy, but I would like to have a
> more flexible program. (Plus it worries me if the simple things don't do
> what I expect... :-) )
>
> From the output, you can see that the loop using the "handmadevector" that
> creates a filter and counts the elements, correctly finds one match for each
> element in the vector, but the seq() and for loop produced vectors each give
> a mixture of true and false matches.
>
> Can anyone tell me why the "loopvector" and "seqvector" do not provide the
> same output as the "handmadevector".
>
>
> Thank you for your assistance!
>
> Todd
>
>

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
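For readers who do not have the FAQ to hand: R FAQ 7.31 ("Why doesn't R think these numbers are equal?") covers the fact that most decimal fractions, such as 0.1 and 0.15, cannot be represented exactly in binary floating point. A value computed as 0.05 + 0.10 can therefore differ in its last bits from the literal 0.15, so `==` fails even though `all.equal()` (which compares within a tolerance) reports TRUE and `print()` shows identical digits. The following is a minimal sketch of one possible workaround, not something proposed in the thread itself; the helper `near()` and its tolerance of 1e-8 are illustrative assumptions.

```r
# Rebuild two of the vectors from the test case.
handmadevector <- c(0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95)
seqvector <- seq(0.05, 0.95, 0.10)

# Exact comparison: at least some elements differ in their last bits.
exact <- handmadevector == seqvector
print(exact)

# The discrepancies are tiny, far below what print() displays by default.
maxdiff <- max(abs(handmadevector - seqvector))
print(maxdiff)

# Hypothetical helper (an assumption, not from the thread):
# treat two numbers as equal when they agree within a small tolerance.
near <- function(x, y, tol = 1e-8) abs(x - y) < tol

# Count matching rows per bin using the tolerant comparison.
bBins <- handmadevector
counts <- sapply(seqvector, function(v) sum(near(bBins, v)))
print(counts)
```

With the tolerant comparison, each value in `seqvector` should match exactly one element of `bBins`. Another option, under the same caveats, is to store the bins as integers (for example 5, 15, ..., 95) and compare those exactly, converting to fractions only for display.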