Dear list,

I have a data frame (birth) with mixed variables (numeric and alphanumeric). One variable, "t1stvisit", was originally coded as numeric with values 1, 2, and 3. After attaching the data frame, this is what I see when I use str(t1stvisit):

    $ t1stvisit: int 1 1 1 1 1 1 1 1 2 2 ...

This is as expected. I then convert t1stvisit to a factor and, to avoid creating a second copy of this variable independent of the data frame, I use:

    birth$t1stvisit = as.factor(birth$t1stvisit)

If I check that the conversion has worked:

    is.factor(t1stvisit)
    [1] FALSE

Now the only object present in the workspace is the data frame "birth" and, as noted, I have not created any new variables. So why does R still treat t1stvisit as numeric?

    is.factor(t1stvisit)
    [1] FALSE

Yet when I try the following:

    > is.factor(birth$t1stvisit)
    [1] TRUE

So there appear to be two versions of "t1stvisit", the original numeric version and the correct factor version, although ls() only shows "birth" as present in the workspace. If I type:

    > summary(t1stvisit)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
      1.000   1.000   2.000   1.574   2.000   3.000  29.000

I get the numeric version, but if I try

    summary(birth$t1stvisit)
       1    2    3 NA's
     180  169   22   29

I get the factor version.

Frankly, I feel that this behaviour is non-intuitive and potentially problematic. Nor have I seen warnings about this in the various textbooks on R. Can anyone comment on why this should occur?

Many thanks,
Alan Kelly

Dr. Alan Kelly
Department of Public Health & Primary Care
Trinity College Dublin

Alan Kelly wrote:
> Dear list,
> I have a data frame (birth) with mixed variables (numeric and
> alphanumeric). One variable "t1stvisit" was originally coded as numeric
> with values 1,2, and 3. After attaching the data frame, this is what I
> see when I use str(t1stvisit)

actually, str(birth), I suspect, but not important.

>
> $ t1stvisit: int 1 1 1 1 1 1 1 1 2 2 ...
>
> This is as expected.
> I then convert t1stvisit to a factor and to avoid creating a second copy
> of this variable independent of the data frame I use:
> birth$t1stvisit = as.factor(birth$t1stvisit)
> if I check that the conversion has worked:
> is.factor(t1stvisit)
> [1] FALSE
> Now the only object present in the workspace is the data frame "birth"
> and, as noted, I have not created any new variables. So why does R
> still treat t1stvisit as numeric?
> is.factor(t1stvisit)
> [1] FALSE
>
> Yet when I try the following:
> > is.factor(birth$t1stvisit)
> [1] TRUE
> So, there appears to be two versions of "t1stvisit" - the original
> numeric version and the correct factor version although ls() only shows
> "birth" as present in the workspace.
> If I type:
> > summary(t1stvisit)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
>   1.000   1.000   2.000   1.574   2.000   3.000  29.000
> I get the numeric version, but if I try
> summary(birth$t1stvisit)
>    1    2    3 NA's
>  180  169   22   29
> I get the factor version.
>
> Frankly I feel that this behaviour is non-intuitive and potentially
> problematic. Nor have I seen warnings about this in the various text
> books on R.
> Can anyone comment on why this should occur?

I haven't looked at discussions of 'attach()' for a while, since I rarely use it nowadays (I find with() more convenient most of the time), but Chapter 6 in 'An Introduction to R' does discuss it. There are indeed two versions of 'birth'. Your basic problem is which version of 'birth' is being modified. Hint: it's NOT the attached version.

Small example:

    dat <- data.frame(x=1:3)
    attach(dat)
    dat$y <- 4:6
    y
    #Error: object 'y' not found
    dat$y
    #[1] 4 5 6

BTW, you don't need as.factor(); use factor().

-Peter Ehlers

> Many thanks,
> Alan Kelly
>
> Dr. Alan Kelly
> Department of Public Health & Primary Care
> Trinity College Dublin
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
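
Peter's small example can be extended to the factor conversion from the original question. The following is only a sketch: the five-row data frame is a made-up stand-in for the real 'birth' data, and the names attached_is_factor and original_is_factor are introduced here purely for illustration.

```r
# Hypothetical stand-in for the poster's 'birth' data frame
birth <- data.frame(t1stvisit = c(1L, 1L, 2L, 3L, NA))

attach(birth)   # puts a copy of birth's columns on the search path

# Modify the data frame in the workspace, not the attached copy
birth$t1stvisit <- factor(birth$t1stvisit)

# The bare name is resolved through the search path (the attached copy)
attached_is_factor <- is.factor(t1stvisit)

# The $ form reads the modified data frame in the workspace
original_is_factor <- is.factor(birth$t1stvisit)

detach(birth)   # tidy up the search path

print(c(attached = attached_is_factor, original = original_is_factor))
```

Under these assumptions the attached copy is still an integer vector while birth$t1stvisit is a factor, which matches what Alan observed.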

When you attach() something, a copy of it is placed on the search path, and there it stays. It is not a link, reference, or pointer to the original. Changing the original (the version in the data frame), which is what you did, does not change the attached copy. In essence, you did a type conversion on one copy, but afterwards started looking at the other copy.

See also my interjected comments below.

-Don

At 8:54 AM +0000 11/23/09, Alan Kelly wrote:
>Dear list,
>I have a data frame (birth) with mixed variables (numeric and
>alphanumeric). One variable "t1stvisit" was originally coded as
>numeric with values 1,2, and 3. After attaching the data frame, this
>is what I see when I use str(t1stvisit)
>
>$ t1stvisit: int 1 1 1 1 1 1 1 1 2 2 ...
>
>This is as expected.
>I then convert t1stvisit to a factor and to avoid creating a second
>copy of this variable independent of the data frame I use:
>birth$t1stvisit = as.factor(birth$t1stvisit)
>if I check that the conversion has worked:
>is.factor(t1stvisit)
>[1] FALSE
>Now the only object present in the workspace is the data frame
>"birth" and, as noted, I have not created any new variables. So why
>does R still treat t1stvisit as numeric?
>is.factor(t1stvisit)
>[1] FALSE
>
>Yet when I try the following:
>> is.factor(birth$t1stvisit)
>[1] TRUE
>So, there appears to be two versions of "t1stvisit" - the original
>numeric version and the correct factor version although ls() only
>shows "birth" as present in the workspace.

Right. find('t1stvisit') will show you there are two of them, and where on the search path they are located. If you type t1stvisit at the prompt, you always get the first one. The one in the attached data frame is the second one. Use the search() function to show you the different locations where objects can be found.

When you did the attach(), did you get a message like:

    > attach(tmp)
    The following object(s) are masked _by_ .GlobalEnv :

         x

(yours would have referred to your variables, not the "x" in my example).

That message tells you that you have two variables of the same name, stored in two different locations on the search path.

As a general rule, it's just plain confusing to have more than one object of the same name in more than one location. In your situation, I would get rid of the one that's not in the data frame. But even then, if you change it in the data frame you'll still need to detach and re-attach the data frame, so using attach() is probably not the best choice in the long run. Maybe the with() function would meet your needs.

>If I type:
>> summary(t1stvisit)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
>   1.000   1.000   2.000   1.574   2.000   3.000  29.000
>I get the numeric version, but if I try
>summary(birth$t1stvisit)
>    1    2    3 NA's
>  180  169   22   29
>I get the factor version.
>
>Frankly I feel that this behaviour is non-intuitive and potentially
>problematic. Nor have I seen warnings about this in the various text
>books on R.
>Can anyone comment on why this should occur?
>Many thanks,
>Alan Kelly
>
>Dr. Alan Kelly
>Department of Public Health & Primary Care
>Trinity College Dublin

--
--------------------------------------
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA
925-423-1062
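
The two remedies Don mentions (detaching and re-attaching after a change, or using with() instead of attach()) might look like the sketch below. The data frame is again a made-up stand-in for 'birth', and the names stale, fresh and fresh_with are hypothetical, chosen only for this illustration.

```r
# Hypothetical stand-in for the poster's 'birth' data frame
birth <- data.frame(t1stvisit = c(1L, 1L, 2L, 3L, NA))

attach(birth)
birth$t1stvisit <- factor(birth$t1stvisit)

# The attached copy predates the conversion, so it is out of date
stale <- is.factor(t1stvisit)

# Remedy 1: detach and re-attach so the search path holds the new version
detach(birth)
attach(birth)
fresh <- is.factor(t1stvisit)
detach(birth)

# Remedy 2: skip attach(); with() evaluates the expression using the
# data frame as it exists at the moment of the call
fresh_with <- with(birth, is.factor(t1stvisit))

print(c(stale = stale, fresh = fresh, fresh_with = fresh_with))
```

Because with() looks at the data frame each time it is called, there is no second copy to fall out of date, which is presumably why both replies recommend it over attach().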