Carl Sutton
2017-Apr-16 00:18 UTC
[R] the difference between "-" and "!" between base and data.table package
Hi

I normally use package data.table but today was doing some base R coding. Had a problem for a bit which I finally resolved. I was attempting to separate a data frame between train and test sets, and in base R was using the "!" to exclude training set indices from the data frame. All I was getting was zero observations. Changed to using "-" and it worked. I recalled that in data.table the "!" function worked, so created this little bit of code.

# Base R Functions
str(mtcars)
train_indices <- sample(nrow(mtcars), round(0.75*nrow(mtcars)))
train <- mtcars[train_indices,]
mode(train_indices); class(train_indices)
test <- mtcars[!train_indices,]  # the "!" function returning 0 observations
test_1 <- mtcars[-train_indices,]
identical(test, test_1)

# Using data.table package
library(data.table)
dt1 <- data.table(mtcars)
train_indices <- sample(nrow(dt1), round(0.75*nrow(dt1)))
train <- dt1[train_indices,]
mode(train_indices); class(train_indices)
test <- dt1[!train_indices,]  # the "!" function
test_1 <- dt1[-train_indices,]
identical(test, test_1)

The documentation appears to me to accept "!" in base, so do I have some kind of ridiculous error or ..??

Carl Sutton
Jeff Newmiller
2017-Apr-16 07:51 UTC
[R] the difference between "-" and "!" between base and data.table package
! is a logical operator... it means "not". When you write

lidx <- seq_along( mtcars[[ 1 ]] ) %in% train_indices

you end up with a vector of logical values for which ! makes sense. Since R supports logical indexing this can be a very convenient way to select one group or the other.

If you give an integer to the ! operator, any non-zero value is treated as TRUE, which can be useful sometimes but not in this case, since all of the train_indices are greater than zero. Look at what !train_indices actually is.

As the Introduction to R document says, integer indexing always starts at 1 instead of zero as in many other languages. This makes it feasible to let negative integers as indexes represent the idea of excluding those positions. Thus

identical( mtcars[ !lidx, ], mtcars[ -train_indices, ] )

The ItoR document is really quite informative to re-read occasionally. For example, look up indexing with a matrix as the index.

-- 
Sent from my phone. Please excuse my brevity.

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
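The points in the reply above can be checked with a short base R sketch. The seed and the names `all_false`, `in_train`, `test_mask` and `test_neg` are illustrative assumptions, not part of the thread.

```r
# Illustrative sketch; set.seed() and the variable names are assumptions.
set.seed(42)
train_indices <- sample(nrow(mtcars), round(0.75 * nrow(mtcars)))

# "!" coerces the integers to logical first: every non-zero index becomes TRUE,
# so the negation is all FALSE and selects no rows.
all_false <- !train_indices
table(all_false)
nrow(mtcars[all_false, ])

# Logical mask: TRUE for the rows that belong to the training set.
in_train <- seq_len(nrow(mtcars)) %in% train_indices
test_mask <- mtcars[!in_train, ]

# Negative integer index: drop the training rows by position.
test_neg <- mtcars[-train_indices, ]

identical(test_mask, test_neg)
```

Both forms keep the held-out rows in their original order, which is why the two test sets can be compared with identical().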
David Winsemius
2017-Apr-16 08:00 UTC
[R] the difference between "-" and "!" between base and data.table package
> On Apr 15, 2017, at 5:18 PM, Carl Sutton via R-help <r-help at r-project.org> wrote:
>
> Hi
>
> I normally use package data.table but today was doing some base R coding. Had a problem for a bit which I finally resolved. I was attempting to separate a data frame between train and test sets, and in base R was using the "!" to exclude training set indices from the data frame. All I was getting was zero observations. Changed to using "-" and it worked. I recalled that in data.table the "!" function worked, so created this little bit of code.
>
> # Base R Functions
> str(mtcars)
> train_indices <- sample(nrow(mtcars), round(0.75*nrow(mtcars)))
> train <- mtcars[train_indices,]
> mode(train_indices); class(train_indices)
> test <- mtcars[!train_indices,] # the "!" function returning 0 observations

The arguments you are supplying:

    > table( !train_indices )

    FALSE
       24

> test_1 <- mtcars[-train_indices,]
> identical(test, test_1)
>
> # Using data.table package
> library(data.table)
> dt1 <- data.table(mtcars)
> train_indices <- sample(nrow(dt1), round(0.75*nrow(dt1)))
> train <- dt1[train_indices,]

The data.table "[" function has very different syntax and evaluation rules than does the data.frame "[" function, but I guess you know that.

> mode(train_indices); class(train_indices)
> test <- dt1[!train_indices,] # the "!" function
> test_1 <- dt1[-train_indices,]
> identical(test, test_1)
> The documentation appears to me to accept "!" in base, so do I have some kind of ridiculous error or ..??

Not sure about "ridiculous" and you have not actually said what it was that _you_ were questioning. If it is the lack of any return from `test <- mtcars[!train_indices,]` then it could be argued that was a ridiculous expectation, at least according to the rules of vector evaluation in row selection that I thought I understood. Giving a vector of FALSE values to `[.data.frame` would not reasonably be expected to return anything.

Giving a vector of only FALSEs to `[.data.table` and actually getting something back does seem kind of unexpected to me, but clearly it didn't seem ridiculous to Matt Dowle. Clearly the recycling rules for `[.data.table` are different than those of `[.data.frame`. Data.tables don't use rownames. The results from:

    > dt1[rep(FALSE,24), ]
    Error in `[.data.table`(dt1, rep(FALSE, 24), ) :
      i evaluates to a logical vector length 24 but there are 32 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.

... are different from those of

    dt1[!train_indices, ]  # get 8 rows.

To me that doesn't make sense. I generally use %in% for row selection. But many people would also find this pair of results "ridiculous":

    > mtcars[ which( train_indices %in% 50:100), ]
     [1] mpg  cyl  disp hp   drat wt   qsec vs   am   gear carb
    <0 rows> (or 0-length row.names)

    > mtcars[ -which( train_indices %in% 50:100), ]  # bad idea to use minus before which()
     [1] mpg  cyl  disp hp   drat wt   qsec vs   am   gear carb
    <0 rows> (or 0-length row.names)

Yes, I know that some people think the `which` is not needed. I'm not one of them.

-- 
David.

> Carl Sutton

David Winsemius
Alameda, CA, USA
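The "minus before which()" result in the reply above comes from negating an empty index: when nothing matches, which() returns integer(0), and -integer(0) is still integer(0). A minimal base R sketch, in which the condition and the names `cond`, `hits`, `keep_which`, `drop_which` and `drop_mask` are assumptions chosen for illustration:

```r
# Illustrative sketch; the condition and all variable names are assumptions.
cond <- mtcars$mpg > 100         # no car matches, so cond is all FALSE

hits <- which(cond)              # integer(0)
keep_which <- mtcars[hits, ]     # zero rows, as expected
drop_which <- mtcars[-hits, ]    # -integer(0) is integer(0): zero rows again

# Negating the logical vector says "everything except the matches",
# and still works when there are no matches at all.
drop_mask <- mtcars[!cond, ]

nrow(keep_which)
nrow(drop_which)
nrow(drop_mask)
```

When at least one element matches, -which(cond) and !cond select the same rows, provided cond contains no NA values; the two only diverge in the empty case shown here.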
Carl Sutton
2017-Apr-16 17:44 UTC
[R] the difference between "-" and "!" between base and data.table package
Hi

Thank you all for your input. But I must apologize. When I was searching the help page I went this far and stopped:

Logic {base}    R Documentation

Logical Operators

Description
These operators act on raw, logical and number-like vectors.

Usage
! x
x & y
x && y
x | y
x || y
xor(x, y)
isTRUE(x)

Arguments
x, y    raw or logical or 'number-like' vectors (i.e., of types double (class numeric), integer and complex), or objects for which methods have been written.

At that point I took a look at train_indices and it was indeed a vector of integers, so directed my inquiry to the list. After reading the answers it was fairly obvious I had missed something, and in the Details section is:

Numeric and complex vectors will be coerced to logical values, with zero being false and all non-zero values being true. Raw vectors are handled without any coercion for !, &, | and xor, with these operators being applied bitwise (so ! is the 1s-complement).

I was truly lax in my search of the documentation.

Again, thank you for your time and expertise. I will try to be more complete in my research in the future.

Carl Sutton
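The coercion rule quoted from the Logic help page can be seen directly, and it also suggests why data.table behaved differently in the original post. The sketch below rests on an assumption about data.table: as far as its documentation describes, `[.data.table` inspects the unevaluated i expression and treats a leading "!" as "all rows except these", so `!train_indices` is never coerced to logical there. The data.table part only runs if the package is installed, and the names `not_int`, `not_dbl`, `not_raw`, `idx`, `not_rows` and `neg_rows` are illustrative.

```r
# Base R: "!" coerces numbers to logical before negating
# (zero is FALSE, any non-zero value is TRUE).
not_int <- !c(0L, 1L, 5L)
not_dbl <- !c(0, 2.5)
not_int
not_dbl

# Raw vectors are not coerced; "!" flips the bits instead.
not_raw <- !as.raw(0x0f)
not_raw

# data.table (assumption: a leading "!" in i means "all rows except").
# Guarded so the sketch still runs without the package.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt1 <- data.table::as.data.table(mtcars)
  idx <- 1:24
  not_rows <- dt1[!idx]
  neg_rows <- dt1[-idx]
  print(nrow(not_rows))
  print(isTRUE(all.equal(not_rows, neg_rows)))
}
```

In base R the expression inside the brackets is evaluated first and only its value reaches `[.data.frame`, which is why the same "!" gives an all-FALSE selector there.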