Karl Schilling
2015-Nov-17 19:14 UTC
[R] Strange result when subsetting a data frame based on a character variable
Dear all,

I have one observation that I do not quite understand. Maybe someone can clarify this issue for me.

I have a data frame which I want to subset based on a grouping variable, say "group". Actually, "group" is a numeric value, but it is stored as a character. I give some code to generate an exemplary data frame below.

Now, if I use

    MySubset <- subset(Data, Data$group == "..")

everything works fine, as expected. Here ".." stands for the value of group given as a character string.

Surprisingly, I also get a correct subset if I simply give the plain numeric value of group (like MySubset <- subset(Data, Data$group == ..)), AS LONG AS this numeric value is less than 100000.

If the numeric value is 100000 or larger, I get an empty subset.

OK, I know how to avoid this situation, but I wonder what the explanation for this (to me) rather strange behavior might be.

Thank you so much for your suggestions.

Karl Schilling

#####
Exemplary code for reproducing the above described problem:

    options(stringsAsFactors = FALSE)

    # set up some data frame
    value <- c(1:6)
    group <- rep(c("20000", "99999", "100000"), each = 2)
    Data <- data.frame(value = value, group = group)
    str(Data)

    # subset data frame based on the value of the variable "group",
    # treating this value once as a character, and once as a number:

    Data20 <- subset(Data, Data$group == "20000")
    str(Data20)
    Data20N <- subset(Data, Data$group == 20000)
    str(Data20N)

    Data99 <- subset(Data, Data$group == "99999")
    str(Data99)
    Data99N <- subset(Data, Data$group == 99999)
    str(Data99N)

    Data100 <- subset(Data, Data$group == "100000")
    str(Data100)
    Data100N <- subset(Data, Data$group == 100000)
    str(Data100N)
Conklin, Mike (GfK)
2015-Nov-17 19:22 UTC
[R] Strange result when subsetting a data frame based on a character variable
R silently converts the numeric value to a character string for the comparison in the subset operation. If we do the conversion explicitly, we see that it does not give "100000" with the default R settings:

    > as.character(100000)
    [1] "1e+05"
    > as.character(99999)
    [1] "99999"

--
W. Michael Conklin
EVP Marketing & Data Sciences
GfK
Duncan Murdoch
2015-Nov-17 19:25 UTC
[R] Strange result when subsetting a data frame based on a character variable
If you are comparing a character value to a numeric value, the numeric value is converted to character using as.character() for the comparison. as.character(100000) or a larger number is likely not "100000"; try it. (With the options I have on my computer, I get "1e+05".)

If you want a numeric comparison, be explicit:

    subset(Data, as.numeric(Data$group) == ..)

Duncan Murdoch
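As a minimal sketch of the explicit numeric comparison suggested above, the following rebuilds the example data frame from the original post and subsets it on 100000. It assumes the group column is a character vector (not a factor); the name Data100N is taken from the original post.

```r
# Example data from the original post; group is stored as character.
Data <- data.frame(
  value = 1:6,
  group = rep(c("20000", "99999", "100000"), each = 2),
  stringsAsFactors = FALSE
)

# Convert the column to numeric so that both sides of == are numbers
# and no number-to-string conversion takes place.
Data100N <- subset(Data, as.numeric(group) == 100000)
nrow(Data100N)  # 2
```

Because the comparison is now number against number, the result no longer depends on how R would print 100000 as a string.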
Thierry Onkelinx
2015-Nov-17 19:30 UTC
[R] Strange result when subsetting a data frame based on a character variable
Dear Karl,

Since you compare a character with a numeric, R silently converts the numeric. And then you're in trouble.

    as.character(99999)   # "99999"
    as.character(100000)  # "1e+05"

Bottom line: use the same type on both sides of the binary operator.

Best regards,

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
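One way to keep the same type on both sides, sketched here under the assumption that the target value arrives as a number: convert it to a string in fixed notation with format(..., scientific = FALSE) before comparing. The variable names target and key are illustrative only and do not come from the thread.

```r
# Example data from the original post; group is stored as character.
Data <- data.frame(
  value = 1:6,
  group = rep(c("20000", "99999", "100000"), each = 2),
  stringsAsFactors = FALSE
)

target <- 100000

# Fixed (non-scientific) notation, so the string matches the stored label.
key <- format(target, scientific = FALSE)
key  # "100000"

# Character compared with character: no implicit conversion is involved.
Data100 <- subset(Data, group == key)
nrow(Data100)  # 2
```

This suits whole-number labels like the ones in the example. For values with a fractional part, format() rounds to 7 significant digits by default, so converting the column to numeric is probably the safer direction.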
Bert Gunter
2015-Nov-17 19:37 UTC
[R] Strange result when subsetting a data frame based on a character variable
    > 2 == "2"
    [1] TRUE

?"==" says:

"If the two arguments are atomic vectors of different types, one is coerced to the type of the other, the (decreasing) order of precedence being character, complex, numeric, integer, logical and raw."

    > as.character(99999)
    [1] "99999"
    > as.character(100000)
    [1] "1e+05"
    > as.character(100000) == "100000"
    [1] FALSE

Cheers,
Bert

Bert Gunter
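The precedence quoted from ?"==" can be checked on a few small cases. This is only an illustrative sketch; the particular comparisons are chosen here and are not from the thread.

```r
# logical vs numeric: TRUE is coerced to the number 1
TRUE == 1        # TRUE

# logical vs character: TRUE is coerced to the string "TRUE", not "1"
TRUE == "TRUE"   # TRUE
TRUE == "1"      # FALSE

# numeric vs character: 1 is coerced to the string "1",
# so a differently written string such as "1.0" does not match
1 == "1"         # TRUE
1 == "1.0"       # FALSE
```

In each mixed comparison the operand lower in the precedence list is converted, and the test then succeeds only if the converted value is exactly equal to the other operand.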
Duncan Murdoch
2015-Nov-17 20:27 UTC
[R] Strange result when subsetting a data frame based on a character variable
On 17/11/2015 2:25 PM, Duncan Murdoch wrote:
> If you want a numeric comparison, be explicit:
>
> subset(Data, as.numeric(Data$group) == ..)

This might be bad advice. If Data$group is a factor (as it tends to be when character data is put in a data frame), this will use the underlying factor code, not the visible one. You need to use

    as.numeric(as.character(Data$group))

to do the conversion you probably want.

Duncan Murdoch
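A small sketch of the factor pitfall described above. The levels are set explicitly here (an assumption made for illustration) so that the internal codes are predictable.

```r
# A factor whose labels look like numbers; levels fixed for a predictable order.
group <- factor(c("20000", "99999", "100000"),
                levels = c("20000", "99999", "100000"))

# Direct conversion returns the internal level codes.
as.numeric(group)                # 1 2 3

# Going through character first returns the visible labels as numbers.
as.numeric(as.character(group))  # 20000 99999 100000
```

Comparing as.numeric(group) with 100000 would therefore never match, whereas the two-step conversion gives the values one would expect.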
peter dalgaard
2015-Nov-17 21:57 UTC
[R] Strange result when subsetting a data frame based on a character variable
On 17 Nov 2015, at 20:37, Bert Gunter wrote:
>> as.character(100000) == "100000"
> [1] FALSE

Also notice that, for similar reasons,

    > 10 > "2"
    [1] FALSE

(At least in most collations. I recently discovered that OSX Finder sorted 2dnorm.R between 02-Probability.toc and 03-Combinatorics-2x2.pdf.)

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
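The same effect shows up when sorting numbers that are stored as strings. A hedged sketch: the character sort order depends on the collation in use, so only the numeric results are annotated. The vector x is made up for illustration.

```r
x <- c("10", "2", "33")

# Sorted as strings; the result depends on the collation in use.
sort(x)

# Sorted as numbers after an explicit conversion.
sort(as.numeric(x))   # 2 10 33

# The comparison from the message above, made numeric on both sides.
10 > as.numeric("2")  # TRUE
```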