Roman, Sally (MRC)
2014-Mar-07 17:16 UTC
[R] posting for R-help on the predict function and categorical variables in R
Hi - I would like to post to the R-help mailing list. Here is my post:

This is more of a general question about how the predict function treats categorical variables and how to interpret the output from predict.

I have a zeroinfl model to predict the number of animals encountered:

    b9 <- zeroinfl(Count ~ as.factor(Area) + as.factor(Season) | 1, dist = "negbin", data = total)

where Count is the number of animals and the explanatory variables are Area and Season, both coded as factors in the model. Area has 3 levels and Season has 4 levels. Coding the two as factors allows R to create dummy variables for each variable for use in the model.

When I use the predict function to predict the number of animals for a larger data set, I want to make sure I'm understanding what is happening.

My newdata for predict is:

    newdata <- as.data.frame(Season, Area)

Both variables are coded as factors and the data frame is in a long format. There are records for each combination of Season and Area that correspond to trips taken over the course of 8 years. There are 113,804 rows of data in the newdata data frame.

    str(newdata)
    'data.frame': 113804 obs. of 2 variables:
     $ Season: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
     $ Area  : Factor w/ 3 levels "625","631","Bay": 3 3 3 3 3 3 3 3 3 3 ...

Example:

       Season Area
    1       1  Bay
    2       1  625
    3       1  631
    4       2  Bay
    5       2  625
    6       2  631
    7       3  Bay
    8       3  625
    9       3  631
    10      4  Bay
    11      4  625
    12      4  631

1. Do I need to create dummy variables for all levels of the two variables for input into predict, or does the predict function act like zeroinfl, where this is done automatically by R if the variables are coded as factors?

2. Predict returns values of 0.0461 - 0.6015. If I am trying to predict the number of animals, how do I interpret this? Since no predicted values are greater than 1 and I need whole numbers, I rounded the predicted data so that any value less than 0.5 was equal to 0 and any value greater than 0.5 was equal to 1. Does this seem correct?
Thanks for any help.

Sally Roman
Fisheries Management Specialist
Virginia Marine Resources Commission
2600 Washington Avenue, 3rd Floor
Newport News, VA 23607
Phone: 757-247-2243
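To make question 1 concrete, here is a minimal sketch in base R. It is an illustration under stated assumptions, not the poster's analysis: a Poisson glm stands in for zeroinfl (which comes from an add-on package), and the data are simulated, so the numbers mean nothing. It shows how R's formula machinery expands factor columns in newdata into dummy variables at prediction time, and that predictions on the response scale are expected counts, which are generally not whole numbers. Whether predict for a zeroinfl fit behaves the same way should be checked against that package's documentation.

```r
# Simulated stand-in data: column names mirror the post, values are made up.
set.seed(1)
total <- data.frame(
  Area   = factor(sample(c("625", "631", "Bay"), 300, replace = TRUE)),
  Season = factor(sample(1:4, 300, replace = TRUE))
)
total$Count <- rpois(300, lambda = 0.3)

# Assumption: a Poisson glm is used only as a base-R stand-in for zeroinfl.
fit <- glm(Count ~ Area + Season, family = poisson, data = total)

# One row per Season/Area combination, as factors with the same levels as the fit.
newdata <- expand.grid(Season = levels(total$Season),
                       Area   = levels(total$Area))

# The dummy columns R derives from the factors:
# intercept + 2 Area contrasts + 3 Season contrasts = 6 columns.
X <- model.matrix(~ Area + Season, data = newdata)
colnames(X)

# predict() performs the same expansion internally; no manual dummies needed.
pred <- predict(fit, newdata = newdata, type = "response")

# The same values by hand: inverse log link applied to X %*% coefficients.
by_hand <- exp(drop(X %*% coef(fit)))
all.equal(unname(pred), unname(by_hand))

# Response-scale predictions are expected (mean) counts per row.
range(pred)
```

In this sketch the factor columns in newdata are passed as they are; the model matrix is rebuilt from the stored formula and factor levels.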
Greg Snow
2014-Mar-08 19:52 UTC
[R] posting for R-help on the predict function and categorical variables in R
How predict works depends on the method written for that type of object. The zeroinfl function is not in any of the standard packages, so it must come from an add-on package, but you did not tell us which. Since it is from a package other than the main ones, its predict method may work like the regular predict functions, or it may work completely differently.

Have you tried reading the help pages for zeroinfl and the predict method for it? Have you worked through any vignettes for that package?

-- 
Gregory (Greg) L. Snow Ph.D.
538280 at gmail.com
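The point about predict depending on the object's class can be sketched with base R alone. This is only an illustration using the built-in cars data and two standard model classes; it makes no claim about how the zeroinfl predict method behaves. It shows that predict is a generic, that the method is chosen from the class of the fitted object, and that even standard methods differ in their defaults (predict for a glm returns the link scale unless type = "response" is requested).

```r
# Two fits on the built-in 'cars' data, used purely as examples.
fit_lm  <- lm(dist ~ speed, data = cars)
fit_glm <- glm(dist ~ speed, family = poisson, data = cars)

# The class of the object decides which predict method is called.
class(fit_lm)
class(fit_glm)

# Look up the method that predict() dispatches to for a glm object.
predict_glm_method <- getS3method("predict", "glm")
is.function(predict_glm_method)

new_speeds <- data.frame(speed = c(10, 15, 20))

# predict.lm returns values on the response scale.
p_lm <- predict(fit_lm, newdata = new_speeds)

# predict.glm defaults to the link (here: log) scale ...
p_link <- predict(fit_glm, newdata = new_speeds)
# ... and needs type = "response" to return expected counts.
p_resp <- predict(fit_glm, newdata = new_speeds, type = "response")

# For the log link the two are related by exp().
all.equal(exp(p_link), p_resp)
```

For a model class from an add-on package, the corresponding help page is normally reached the same way, for example ?predict.zeroinfl once the package providing zeroinfl is loaded (the pscl package is one that provides a function of that name; that the poster used it is an assumption).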