Hi all -

I'm trying to find a way to create dummy variables from factors in a regression. I have been using biglm along the lines of

    ff <- log(Price) ~ factor(Colour):factor(Store) +
          factor(DummyVar):factor(Colour):factor(Store)

    lm1 <- biglm(ff, data=my.dataset)

but because there are lots of colours (>100) and lots of stores (>250), I run into memory problems. Now, not every store sells every colour, so it should be possible to create the matrix of factor variables myself and greatly reduce the size of the problem. It seems that lm / biglm use all combinations of factor levels when given factor(Colour):factor(Store), so a hand-built matrix containing only the combinations that actually occur should be considerably smaller.

If I have a data frame

    my.dataset <- data.frame(Price=1:12, Colour=c('red','blue','green'),
        Store=c('a', 'b', 'c', 'a', 'c', 'd', 'e', 'e', 'e', 'e', 'b', 'e'),
        DummyVar = sort(rep(c(0,1),6)) )

I want to create a data frame with the dummy vars that looks like

    red:a red:e blue:b blue:c blue:e green:c green:d green:e
        1     0      0      0      0       0       0       0
        0     0      1      0      0       0       0       0
        0     0      0      0      0       1       0       0
        1     0      0      0      0       0       0       0
        0     0      0      1      0       0       0       0
        0     0      0      0      0       0       1       0
        0     1      0      0      0       0       0       0
        0     0      0      0      1       0       0       0
        0     0      0      0      0       0       0       1
        0     1      0      0      0       0       0       0
        0     0      1      0      0       0       0       0
        0     0      0      0      0       0       0       1

Any ideas would be appreciated.

--
Tim Calkins
0406 753 997
On Wed, 5 Dec 2007, Tim Calkins wrote:

> I'm trying to find a way to create dummy variables from factors in a
> regression.
> [...]
> any ideas would be appreciated.

Use

    mat <- model.matrix( ~ClrStr-1,
        transform( my.dataset,
            ClrStr = factor( paste(Colour,Store,sep=":") ) ) )

then pretty up the colnames() and re-order columns if order matters.

----

However, if DummyVar is a categorical variable, you could just compute means on the appropriate subsets by maintaining a table of sums and totals. Then in a second pass through the data get the residual sums of squares.

If the data are already in a database, it might make sense to do these operations there and import the results to R for further massaging.
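The sums-and-totals idea above can be sketched roughly as follows. This is only an illustration, not code from the thread: it assumes the model is saturated in the observed Colour/Store/DummyVar cells, so that the fitted values are simply the cell means. The names y, cell, sums, counts, means and rss are made up for the example. On data too large for memory, each pass would presumably run over chunks, accumulating the same quantities.

```r
my.dataset <- data.frame(
  Price    = 1:12,
  Colour   = c('red', 'blue', 'green'),
  Store    = c('a', 'b', 'c', 'a', 'c', 'd', 'e', 'e', 'e', 'e', 'b', 'e'),
  DummyVar = sort(rep(c(0, 1), 6))
)

y <- log(my.dataset$Price)

# One level per Colour/Store/DummyVar cell; drop = TRUE keeps only
# the cells that actually occur in the data.
cell <- interaction(my.dataset$Colour, my.dataset$Store,
                    my.dataset$DummyVar, drop = TRUE)

# Pass 1: table of sums and counts per cell, which gives the cell means.
sums   <- tapply(y, cell, sum)
counts <- tapply(y, cell, length)
means  <- sums / counts

# Pass 2: residual sum of squares around the cell means.
rss <- sum((y - means[as.character(cell)])^2)
```

On the toy data, rss can be compared with deviance(lm(y ~ cell)) as a sanity check.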
HTH,

Chuck

> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
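For the "pretty up the colnames() and re-order columns" step of the model.matrix() suggestion, one possible sketch is below. It is an assumption about what was intended, not code from the thread: the factor levels are set up front so the columns come out in the order of the original post (colours in order of first appearance, stores alphabetical within colour), and the helper names pairs, lev and ClrStr are illustrative.

```r
my.dataset <- data.frame(
  Price    = 1:12,
  Colour   = c('red', 'blue', 'green'),
  Store    = c('a', 'b', 'c', 'a', 'c', 'd', 'e', 'e', 'e', 'e', 'b', 'e'),
  DummyVar = sort(rep(c(0, 1), 6))
)

# Colour/Store pairs that actually occur, ordered by first appearance
# of the colour and then by store.
pairs <- unique(my.dataset[, c("Colour", "Store")])
pairs <- pairs[order(match(pairs$Colour, unique(my.dataset$Colour)),
                     pairs$Store), ]
lev <- paste(pairs$Colour, pairs$Store, sep = ":")

# Combined factor with exactly those levels, in that order.
ClrStr <- factor(paste(my.dataset$Colour, my.dataset$Store, sep = ":"),
                 levels = lev)

# One indicator column per observed pair; "- 1" drops the intercept so
# no level is absorbed into a baseline.
mat <- model.matrix(~ ClrStr - 1)

# model.matrix() prefixes each column with the variable name; replace
# the names with the bare level labels.
colnames(mat) <- levels(ClrStr)
```

Because the levels are built only from pairs present in the data, the matrix has one column per observed combination rather than one per Colour-by-Store product.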
Try this also:

    table(cbind.data.frame(Price=my.dataset$Price,
        Colour=paste(my.dataset$Colour, my.dataset$Store, sep=":")))

On 05/12/2007, Tim Calkins <tim.calkins at gmail.com> wrote:

> I'm trying to find a way to create dummy variables from factors in a
> regression.
> [...]
> any ideas would be appreciated.

--
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O
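A note on the table() suggestion: it yields one row per observation only because Price happens to be unique in the toy data. A small variation, offered as a sketch rather than as anyone's recommended method, cross-tabulates against the row number instead. The names obs, ClrStr, tab and ind are illustrative.

```r
my.dataset <- data.frame(
  Price    = 1:12,
  Colour   = c('red', 'blue', 'green'),
  Store    = c('a', 'b', 'c', 'a', 'c', 'd', 'e', 'e', 'e', 'e', 'b', 'e'),
  DummyVar = sort(rep(c(0, 1), 6))
)

# Cross-tabulate row number against the combined Colour:Store label.
# Every observation falls in exactly one column, so each row of the
# table holds a single 1.
tab <- table(obs    = seq_len(nrow(my.dataset)),
             ClrStr = paste(my.dataset$Colour, my.dataset$Store, sep = ":"))

# Strip the "table" class to get a plain integer matrix of indicators.
ind <- unclass(tab)
```

The columns come out in alphabetical order of the labels, so they would still need re-ordering if the layout in the original post matters.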