Hi all,

I hope this isn't a naive newbie question again. Here you can see column 264 of a data frame containing data of the same interview in May and September. Column 264 contains the answers of 49 persons to a question in May.

> fbhint.spss1[,264]
 [1] teils/teils  sehr wichtig <NA>         <NA>         sehr wichtig
 [6] sehr wichtig sehr wichtig sehr wichtig <NA>         <NA>
[11] <NA>         <NA>         wichtig      <NA>         <NA>
[16] sehr wichtig <NA>         <NA>         <NA>         <NA>
[21] <NA>         <NA>         <NA>         wichtig      <NA>
[26] <NA>         <NA>         <NA>         <NA>         <NA>
[31] <NA>         <NA>         <NA>         <NA>         teils/teils
[36] sehr wichtig <NA>         <NA>         <NA>         <NA>
[41] wichtig      <NA>         sehr wichtig <NA>         <NA>
[46] sehr wichtig wichtig      <NA>         <NA>
Levels: sehr wichtig wichtig teils/teils unwichtig ganz unwichtig

Column 566 contains the answers from the same persons to the same question in September.

> fbhint.spss1[,566]
 [1] <NA>         <NA>         <NA>         wichtig      wichtig
 [6] sehr wichtig sehr wichtig wichtig      wichtig      <NA>
[11] <NA>         <NA>         sehr wichtig sehr wichtig sehr wichtig
[16] sehr wichtig <NA>         unwichtig    wichtig      wichtig
[21] <NA>         <NA>         teils/teils  teils/teils  <NA>
[26] unwichtig    <NA>         <NA>         <NA>         <NA>
[31] wichtig      sehr wichtig sehr wichtig <NA>         unwichtig
[36] sehr wichtig <NA>         <NA>         teils/teils  wichtig
[41] wichtig      wichtig      <NA>         <NA>         wichtig
[46] <NA>         sehr wichtig teils/teils  <NA>
Levels: sehr wichtig wichtig teils/teils unwichtig ganz unwichtig

The following works:

> median(fbhint.spss1[,264], na.rm=T)
[1] sehr wichtig
Levels: sehr wichtig wichtig teils/teils unwichtig ganz unwichtig

... but here it doesn't:

> median(fbhint.spss1[,566], na.rm=T)
Error in Summary.factor(..., na.rm = na.rm) :
        "sum" not meaningful for factors

I don't have any ideas why! Can somebody give me a hint?

TIA

Best regards,

Christoph
What do you expect the median of a factor to be? Are you sure you don't want the *mode* (most common value)?

If you want the median numeric code of ordered factors, maybe use `as.numeric` first. Consider what you think should happen if this median is not an integer.

HTH

> -----Original Message-----
> From: Christoph Bier [mailto:christoph.bier at web.de]
> Sent: 30 October 2003 21:35
> To: r-help at stat.math.ethz.ch
> Subject: [R] Weird problem with median on a factor

Simon Fear
Senior Statistician
Syne qua non Ltd
Tel: +44 (0) 1379 644449
Fax: +44 (0) 1379 644445
email: Simon.Fear at synequanon.com
web: http://www.synequanon.com
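A minimal sketch of the "mode" alternative suggested above. The data vector is a made-up stand-in (only the level names are taken from the thread), and `modal_level` is a name chosen here for illustration.

```r
# Hypothetical stand-in data; only the level names come from the thread.
lev <- c("sehr wichtig", "wichtig", "teils/teils", "unwichtig", "ganz unwichtig")
answers <- factor(c("wichtig", NA, "sehr wichtig", "wichtig",
                    "teils/teils", NA, "wichtig"),
                  levels = lev)

# table() ignores NA by default and counts every level, used or not.
counts <- table(answers)

# The mode is the level with the largest count (the first one wins on ties).
modal_level <- names(counts)[which.max(counts)]
print(modal_level)
```

Unlike a median, this needs no ordering of the levels and no arithmetic, so it is meaningful for unordered factors as well.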
Christoph Bier <christoph.bier at web.de> writes:

> The following works:
>
> > median(fbhint.spss1[,264], na.rm=T)
> [1] sehr wichtig
> Levels: sehr wichtig wichtig teils/teils unwichtig ganz unwichtig
>
> ... but here it doesn't:
>
> > median(fbhint.spss1[,566], na.rm=T)
> Error in Summary.factor(..., na.rm = na.rm) :
>         "sum" not meaningful for factors
>
> I don't have any ideas why! Can somebody give me a hint?

Offhand, I'd guess that the "median" is in between two factor levels in one case and not in the other. Really, both cases should give an error: certainly for unordered factors, and the median is not well-defined for ordered factors either.

If you want to interpret your factor as a numeric scale, use as.numeric first.

--
Peter Dalgaard, Dept. of Biostatistics, University of Copenhagen
Blegdamsvej 3, 2200 Cph. N, Denmark
Ph: (+45) 35327918, FAX: (+45) 35327907
(p.dalgaard at biostat.ku.dk)
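To illustrate the "in between two factor levels" case, here is a small sketch with invented data. The names `low_median` and `high_median` are illustrative choices, not functions from base R; they simply pick the lower and upper of the two middle observations.

```r
lev <- c("sehr wichtig", "wichtig", "teils/teils", "unwichtig", "ganz unwichtig")

# Hypothetical ordered factor with an even number of answers.
x <- factor(c("sehr wichtig", "wichtig", "sehr wichtig", "wichtig"),
            levels = lev, ordered = TRUE)

codes <- sort(as.integer(x))   # level codes in order: 1 1 2 2
n <- length(codes)

# The usual median averages the two middle codes and lands between levels.
code_median <- median(codes)   # 1.5, which is not the code of any level

# Two well-defined alternatives for an ordered factor:
low_median  <- levels(x)[codes[floor((n + 1) / 2)]]
high_median <- levels(x)[codes[ceiling((n + 1) / 2)]]

print(code_median)
print(c(low = low_median, high = high_median))
```

Whether the lower or the upper value is the right summary is a judgment call, which is one way to read the remark that the median is not well-defined for ordered factors.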
Simon Fear wrote:

> The last part of my message is what I thought might be the cause -
> maybe your median is not an integer, so what category
> should it be mapped to? What do you get if you slip
> in an `as.numeric` before calculating the median?
> [if still an error, then there is definitely something else
> going wrong to report]

Then everything is ok.

> as.numeric(fbhint.spss1$V15.SP1) -> tmp.data2
> median(tmp.data2, na.rm=T)
[1] 2

Christoph

--
Christoph Bier, Dipl.Oecotroph., Email: bier at wiz.uni-kassel.de
Universitaet Kassel, FG Oekologische Lebensmittelqualitaet und Ernaehrungskultur
Postfach 12 52, 37202 Witzenhausen
Tel.: +49 (0) 55 42 / 98 -17 21, Fax: -17 13
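The result above can be reproduced from the printed September column. In this sketch the counts per level (10, 11, 4, 3, plus 21 missing values) were tallied by hand from that printout; the order of the answers is not preserved, which does not matter for a median. The object name `sept` is an assumption made here.

```r
lev <- c("sehr wichtig", "wichtig", "teils/teils", "unwichtig", "ganz unwichtig")

# Level counts tallied by hand from the printed column 566 (28 answers, 21 NA).
sept <- factor(c(rep(lev[1:4], times = c(10, 11, 4, 3)), rep(NA, 21)),
               levels = lev)

codes <- as.numeric(sept)           # integer level codes, NA kept as NA
med <- median(codes, na.rm = TRUE)
print(med)                          # 2, as in the message above

# Map the code back to a level only when it is a whole number.
if (med == round(med)) {
  med_level <- levels(sept)[med]
  print(med_level)
}
```

Note that `as.numeric()` on a factor returns the position of each level, not a number parsed from the label, so the 2 here means "second level", i.e. "wichtig".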
Final guess as to observed behaviour: in the first case, after removal of NAs, there was an odd number of observations (so that sum was not called within the code for median). In your second call I suspect that even though you got an integer answer, it was found as sum(2,2)/2.

It seems to me the best way to deal with this "bug" would be to make calling median with a factor argument an immediate error. Or just trust users never to attempt such a thing ...

Simon Fear
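The odd/even explanation can be checked with a small sketch. `old_style_median` is a hypothetical reconstruction written for this note, not the source of `median()` from any R release; it only mimics the logic described above (middle element for odd n, `sum()` of the two middle elements for even n). The level counts are tallied by hand from the two printed columns (15 answers in May, 28 in September).

```r
# Hypothetical reconstruction of the described logic, NOT R's actual median().
old_style_median <- function(x) {
  x <- sort(x)                  # sort() drops NAs by default
  n <- length(x)
  half <- (n + 1) %/% 2
  if (n %% 2 == 1) {
    x[half]                     # odd n: middle element, no arithmetic needed
  } else {
    sum(x[half + 0:1]) / 2      # even n: sum() is called and rejects factors
  }
}

lev <- c("sehr wichtig", "wichtig", "teils/teils", "unwichtig", "ganz unwichtig")
may  <- factor(rep(lev[1:3], times = c(9, 4, 2)), levels = lev)       # 15 answers
sept <- factor(rep(lev[1:4], times = c(10, 11, 4, 3)), levels = lev)  # 28 answers

odd_result  <- old_style_median(may)
even_result <- tryCatch(old_style_median(sept),
                        error = function(e) conditionMessage(e))

print(as.character(odd_result))
print(even_result)
```

With 15 observations the function returns a factor value without ever touching `sum()`, while with 28 it fails inside `sum()`, which matches the two outcomes reported in the original question.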
Beating a dead horse...

I am an R beginner trying to understand this factor business. While the entire business of finding the median of a factor may be silly from a practical point of view, this email chain has helped me understand something.

I have looked at the median function, and it tests whether what is passed to it is numeric. If I were building a function and tested for mode numeric, and something told me the argument was numeric, then, like the median function, I would naively assume that I could do arithmetic on it:

> saywhut<-as.factor(c(NA,"1","1","1","1","2","10"))
> mode(saywhut)
[1] "numeric"

It appears to me that when the median function tests for numeric, the test doesn't have the desired result with an object of class factor (and maybe other classes?), as was shown by the example. I have a suspicion that something of class factor has at least two pieces: one is the levels, which can possibly be character or something else, and the other is the ordering of the levels, which is of storage.mode integer. Is it this ordering that determines the mode of the factor?? But if the mode of a factor is truly numeric, why doesn't the median function use the numeric piece for finding the median (like it did with odd n - not that anyone would ever really want the median of a factor :) )??

I think that Simon Fear hit on the right idea, because the definition of median that is used for an even number of observations takes the sum of the ordered middle two observations.
It is the sum (called by the median function) that chokes on a factor.

> sum(saywhut,na.rm=T)
Error in Summary.factor(..., na.rm = na.rm) :
        "sum" not meaningful for factors

It appears that whoever built the sum function built in a test for factor (Simon Fear's first suggestion for median). On the other hand:

> sd(saywhut,na.rm=T)
[1] 3.614784

(Simon Fear's second suggestion for median.) By the way, mean treats a factor in a different way:

> mean(saywhut)
[1] NA
Warning message:
argument is not numeric or logical: returning NA in: mean.default(saywhut)

There is an R-FAQ that tells one how to convert a factor to 'numeric', but if I had tested for something being numeric to begin with, I never would have guessed that I needed to convert it to numeric. I think what this conversion is really doing is getting rid of the machinery associated with the class factor:

> #from the R-FAQ
> test<-as.numeric(as.character(saywhut))
> mode(test)
[1] "numeric"
> median(test,na.rm=T)
[1] 1

and by the way:

> not.a.factor<-c(NA,"1","1","2","10")
> mode(not.a.factor)
[1] "character"
> median(not.a.factor,na.rm=T)
Error in median(not.a.factor, na.rm = T) : need numeric data

<Simon Fear: It seems to me the best way to deal with this "bug" would be to make calling median with a factor argument be an immediate error.>

Do you think that all base functions (sum, sd, mean, median, ...) should deal with this in a consistent way? (This might be much more work.) Another thing that would make things consistent would be to take the stop-work behavior out of sum :)

I don't think there is any real problem in the current behavior of factor as long as the interaction between functions and classes produces this stop-work behavior - preferably with a warning - and not unexpected side effects. I am curious if there are other classes of mode numeric which median-mean-sum-sd-etc might choke on.
<tongue-in-cheek on>
Of course, R would produce a median for factors by using the "correct" definition of a median of samples, i.e., one that agrees with the definition of median on a CDF, even though this concept gives most people apoplexy.
<off>

Thanks
Bob

Usual disclaimers....

-----Original Message-----
From: Simon Fear [mailto:Simon.Fear@synequanon.com]
Sent: Friday, October 31, 2003 6:18 AM
To: Christoph Bier
Cc: r-help@stat.math.ethz.ch
Subject: RE: [R] Weird problem with median on a factor
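On the factor-to-numeric conversion discussed in this message, a short sketch of what the two conversions return for the same `saywhut` object. The variable names `codes`, `values` and `values2` are chosen here for illustration.

```r
saywhut <- as.factor(c(NA, "1", "1", "1", "1", "2", "10"))

# Levels are the sorted unique strings, so "10" comes before "2".
print(levels(saywhut))

# as.numeric() on the factor returns level codes, not the labels as numbers.
codes <- as.numeric(saywhut)

# Going through character parses the labels themselves.
values <- as.numeric(as.character(saywhut))

# An equivalent form: convert the levels once, then index by the codes.
values2 <- as.numeric(levels(saywhut))[as.integer(saywhut)]

print(codes)
print(values)
print(median(values, na.rm = TRUE))
```

The level codes are the "numeric piece" asked about above: they record which level each element has, and they only coincide with the label values by accident (here "2" has code 3 and "10" has code 2).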
Dave Cacela wrote:

> Christoph,
>
> I concur with the other respondents who questioned why someone would wish to
> calculate the median of a factor. However, with regard to your actual
> question, I suspect that median() is giving different answers because the
> two vectors are not both factors, i.e., that one of them is a character. Did
> you test that?

Yes, I did, and I found the same as Tony Plate:

> is.factor(fbhint.spss1$V15.SPS) # = column 264
[1] TRUE
> mode(fbhint.spss1$V15.SPS)
[1] "numeric"
> is.factor(fbhint.spss1$V15.SP1) # = column 566
[1] TRUE
> mode(fbhint.spss1$V15.SP1)
[1] "numeric"

> Using S, I have seen quirks in this regard that relate to import procedure
> and the value of the first element in the vector. In your case, the first
> elements differ in that one is NA while the other is "teils/teils".

It also occurs if the first element is the same. For example "wichtig" in columns 263 and 565.

Regards,

Christoph
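The check done by hand above for two columns can be run over a whole data frame. This is a sketch on a made-up two-column data frame (`df`, `may` and `sept` are hypothetical names, not the poster's data), with one column deliberately left as character to show what a mismatch would look like.

```r
# Hypothetical data frame: one factor column, one plain character column.
df <- data.frame(may  = factor(c("wichtig", NA, "sehr wichtig")),
                 sept = c("wichtig", "sehr wichtig", NA),
                 stringsAsFactors = FALSE)

# First class of every column, and a direct factor test per column.
col_classes   <- vapply(df, function(col) class(col)[1], character(1))
col_is_factor <- vapply(df, is.factor, logical(1))

print(col_classes)
print(col_is_factor)
```

In the thread both columns turned out to be factors, so a type mismatch of this kind was ruled out as the cause.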
Continuing to beat the greasy spot in the road where the dead horse used to be....

1) I know that the people building R are working on bigger and better things than this silly question, and I appreciate the existence of this complicated package that was dropped in my lap for free.

2) Tony Plate succinctly pointed out one of the underlying 'problems' (possibly in my understanding):

> #this is a perfectly reasonable r object
> some.weird.object<-factor(c("a","b","c"))
> #this is an internal r function acting on an object
> typeof(some.weird.object)
[1] "integer"
> #this is a primitive r function acting on an object
> is.numeric(some.weird.object)
[1] FALSE
>

Do these functions behave in a design-consistent manner?? Can a single R object simultaneously be of type integer and NOT numeric?? If this is intentional, can someone explain why? I don't think this has anything to do with taking the median of a factor (median calls mode, which calls typeof). It just requires a sufficiently complex object, such as a factor, before we start seeing this behavior. I wasn't clever enough to come up with examples of non-factor objects that produced this behavior, so I am curious whether this problem is internal to factor or to the functions themselves.

Thanks
Bob

-----Original Message-----
From: Duncan Murdoch [mailto:dmurdoch@pair.com]
Sent: Sunday, November 02, 2003 9:40 AM
To: Peter Dalgaard
Cc: r-help@stat.math.ethz.ch
Subject: Re: [R] Weird problem with median on a factor

On 02 Nov 2003 12:50:37 +0100, you wrote:

>(Arguably, sorting an unordered factor ought to be Verboten as well,
>though!)

No, arbitrarily assigning an ordering and using that to sort is a useful thing in many situations, e.g. searching.

Duncan Murdoch
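On the typeof/is.numeric question raised in this message, the following sketch puts the different views of an object side by side. `report` is a helper defined here for illustration, and the Date object is offered as one possible non-factor example of the same pattern, on the assumption that this is the kind of case being asked about.

```r
f <- factor(c("a", "b", "c"))
d <- as.Date("2003-11-02")

# Hypothetical helper: collect the storage-level and class-level views.
report <- function(x) {
  c(class      = class(x)[1],
    typeof     = typeof(x),
    mode       = mode(x),
    is.numeric = as.character(is.numeric(x)))
}

print(rbind(factor = report(f), Date = report(d), integer = report(1:3)))

# Stripping the class shows the two pieces of a factor:
# integer codes plus a "levels" attribute.
print(unclass(f))
```

`typeof()` and `mode()` describe how the values are stored, whereas `is.numeric()` answers whether arithmetic on the object is meaningful, so a classed object can be stored as integer or double and still report FALSE.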