Hi all,

I am very confused with class. I am looking at some weather data which I want to use as explanatory variables in an lm. R has treated these variables as factors (i.e. with different levels), whereas I want them treated as discretely measured continuous variables. So I need to reassign the class of these variables, right?

Indeed, doing

    class(southwest$pressure)

(pressure being air pressure), I get

    #> factor

Now what class should I use to reassign them so that my model fitting process goes as I want it to? I have obviously done something wrong. I did

    southwest$pressure <- as(southwest$pressure,"numeric")

numeric seeming like a reasonable class to assign to this variable. However, some summary stats now give

    mean(southwest$pressure)
    #> 341
    max(southwest$pressure)
    #> 761

which is clearly nonsense, as my maximum value is around 1040. Something similar has happened to maxtemp (maximum temperature), which I also reassigned from a factor to class numeric, and which now apparently has a maximum value of 147!

Clearly it must be the reassignment of class that has caused these problems, as summary stats on the data before I reassigned the classes were fine. What is wrong with the class numeric? Reading the numeric help page didn't reveal anything to me. Can someone suggest the correct class?

Many thanks for any help.

Robin Williams
Met Office summer intern - Health Forecasting
robin.williams@metoffice.gov.uk
Williams, Robin <robin.williams <at> metoffice.gov.uk> writes:

> Hi all,
> I am very confused with class.
> I am looking at some weather data which I want to use as explanatory
> variables in an lm. R has treated these variables as factors (i.e. with
> different levels), whereas I want them treated as discretely measured
> continuous variables.

The short answer to your problem can be found in the documentation for factor, particularly in the "Warning" section.

    ?factor

Mark
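For readers who want to see what that Warning section describes: as.numeric applied to a factor returns the internal integer codes rather than the original values, and the help page recommends converting through the levels instead. A minimal sketch follows, assuming a made-up factor of pressure readings (the values are illustrative and not taken from the poster's file):

```r
# Hypothetical pressure readings stored as a factor (illustrative values only)
pressure <- factor(c("1028", "1023", "1015", "1001", "989"))

# Coercing the factor directly returns the internal integer codes,
# i.e. each value's position among the sorted levels
codes <- as.numeric(pressure)

# Converting the levels and indexing by the factor recovers the values;
# this is the idiom recommended in the Warning section of ?factor
values <- as.numeric(levels(pressure))[pressure]

# An equivalent, slightly less efficient form
values2 <- as.numeric(as.character(pressure))

print(codes)
print(values)
```

The codes can never exceed the number of distinct levels, which is consistent with the implausibly small maxima reported in the original question.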
This seems to be FAQ Q7.10.

On Thu, 21 Aug 2008, Williams, Robin wrote:

> [...]
> southwest$pressure <- as(southwest$pressure,"numeric")
> numeric seeming like a reasonable class to assign to this variable.
> However, doing some summary stats like
> mean(southwest$pressure)
> #> 341,
> max(southwest$pressure)
> #> 761,
> which is clearly nonsense, as my maximum value is around 1040.
> [...]

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
Hi Robin,

You haven't said where you're getting the data from. But if the answer is that you're using read.table, read.csv or similar to read the data into R, then I advise you to go back to that stage and get it right from the outset. It's very, very common to see people who are relatively new to R splattering their code with calls to as.numeric, just because they haven't read the data in properly in the first place. It's also common in those who aren't new to R...

So e.g. if you are using read.table, then use the colClasses argument to specify the classes of your columns, and use str() on the result until you're happy with the data frame produced.

It's not entirely clear why you would have ended up with factors if your data are numeric. That often happens when people mix characters with numbers. Perhaps you have mixed the header row up with the data?

Anyway, what you are seeing are the integer encodings of the factors. E.g.

    > f <- factor(11:20)
    > str(f)
     Factor w/ 10 levels "11","12","13",..: 1 2 3 4 5 6 7 8 9 10
    > as.numeric(f)
     [1]  1  2  3  4  5  6  7  8  9 10

But don't mess with them. Just make sure that things which shouldn't be factors never become factors.

Dan

On Thu, Aug 21, 2008 at 03:40:58PM +0100, Williams, Robin wrote:

> Hi all,
> I am very confused with class.
> I am looking at some weather data which I want to use as explanatory
> variables in an lm. R has treated these variables as factors (i.e. with
> different levels), whereas I want them treated as discretely measured
> continuous variables.
> [...]

-- 
http://www.stats.ox.ac.uk/~davison
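Dan's remark that factors tend to appear when text is mixed with numbers can be checked directly by listing the entries of a column that do not parse as numbers. The following is only a sketch: the column name and values are invented for illustration. In R 4.0.0 and later, read.csv defaults to stringsAsFactors = FALSE, so such a column would arrive as character rather than factor; the check works in both cases.

```r
# Hypothetical column in which one cell holds text instead of a number;
# the values are invented for illustration
maxtemp <- factor(c("18.8", "21.8", "missing", "14.9"))

# Work on the labels, not on the factor's integer codes
raw <- as.character(maxtemp)

# Entries that cannot be parsed as numbers become NA (with a warning,
# suppressed here because finding those entries is the whole point)
parsed <- suppressWarnings(as.numeric(raw))

# The distinct offending entries
bad <- unique(raw[is.na(parsed) & !is.na(raw)])
print(bad)
```

Whatever turns up in bad is a candidate for the na.strings argument when the file is read again.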
On Thu, Aug 21, 2008 at 04:20:57PM +0100, Williams, Robin wrote:

> Hi Dan,
> Thanks for the reply, yes, I am using read.csv on the attached file.

OK, so how about using the colClasses argument. Your problem is that some malfunctioning software has inserted the value "#VALUE!" into some of your supposedly numeric cells. So deal with that with the na.strings argument. Like I said, when reading in data, it's worth spending a minute looking at the documentation for read.table/read.csv rather than spending an hour messing about with the results of not doing so.

    > Southwest <- read.csv("southwest.csv",
    +                       colClasses=c("character", rep("numeric",10), "character"),
    +                       na.strings="#VALUE!")
    > str(Southwest)
    'data.frame':   1530 obs. of  12 variables:
     $ date      : chr  "5/1/1997" "5/2/1997" "5/3/1997" "5/4/1997" ...
     $ maxtemp   : num  18.8 21.8 16.6 14.9 14.2 9.3 9.9 12.7 12.8 13.2 ...
     $ mintemp   : num  7.7 9.8 11 12.2 11.3 4.5 2.1 5.7 6.7 7.3 ...
     $ pressure  : num  1028 1023 1015 1001 989 ...
     $ humid     : num  59 44 83 80 87 57 64 83 70 69 ...
     $ wind      : num  8.4 11.1 8.2 17.4 13.8 16.2 11.1 14.9 12.7 16.6 ...
     $ rain      : num  0 0 6 1 3.3 2.6 4.3 6 3.2 1.6 ...
     $ index     : num  1 2 3 4 5 6 7 8 9 10 ...
     $ admissions: num  5.00 4.72 5.16 3.67 3.62 ...
     $ detrended : num  4.79 4.47 5.30 3.91 3.51 ...
     $ detrended2: num  4.79 4.47 5.30 3.91 3.51 ...
     $ d.o.w.    : chr  "Thu" "Fri" "Sat" "Sun" ...

NB you could coerce those dates to a date class rather than character but I'll leave that up to you.

str() is your friend.

Dan

> However, when I do
> Southwest <- data.frame(read.csv("southwest.csv"))

read.csv returns a data frame; no need to wrap it in data.frame()

> names(Southwest)
> the output is the column headings (i.e. the variables), and looking at
> the data I only get the numbers, I assume the column headings haven't
> become confused with the data.
> I.e. if I just do
> Southwest$pressure
> The output is correct, i.e. the values contained in the pressure column.
>
> Apologies for my repeated question, but I'm somewhat confused on this
> one and my lack of experience with R isn't helping matters. I don't even
> understand why R is interpreting these figures as factors in the first
> place, doesn't this imply that any similar data would be interpreted as
> factors?
> Thanks for any further help.
> Robin Williams

-- 
http://www.stats.ox.ac.uk/~davison
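The colClasses and na.strings approach from Dan's reply can be tried without the original file by reading a few lines of inline text. This is a sketch under stated assumptions: the rows below are invented and only mimic four of the twelve columns shown in the str() output, and the date format is assumed to be month/day/year, which fits the listing (consecutive dates differ in the second field, and 1 May 1997 was a Thursday).

```r
# Small inline stand-in for southwest.csv; the rows are invented and
# only mimic four of the columns listed in the thread
csv_text <- "date,maxtemp,pressure,d.o.w.
5/1/1997,18.8,1028,Thu
5/2/1997,#VALUE!,1023,Fri
5/3/1997,16.6,#VALUE!,Sat"

# Declare each column's class up front and treat the spreadsheet error
# marker as missing, so numeric columns never turn into factors
weather <- read.csv(text = csv_text,
                    colClasses = c("character", "numeric", "numeric", "character"),
                    na.strings = "#VALUE!")
str(weather)

# Optional: coerce the character dates to class Date
# (month/day/year order is an assumption, see the note above)
weather$date <- as.Date(weather$date, format = "%m/%d/%Y")

# Summary statistics now behave as expected once NAs are excluded
print(max(weather$pressure, na.rm = TRUE))
```

Declaring the classes at read time means a stray non-numeric cell either becomes NA (if listed in na.strings) or causes an immediate error, rather than silently changing the column's type.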