Hi,
This might be because factor levels are arbitrary unless they are ordinal,
and even then the quantitative relationships between the levels are unclear.
With R's default treatment contrasts each observed level gets its own dummy
coefficient, so the model has no coefficient for, and hence no way to
predict, levels it has never seen.
Does it make sense to treat 'No_databases' as numeric instead of a
factor variable?
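For example, something along these lines might work (a rough, untested
sketch; it assumes your original data frame is the 'throughput' passed to
glm(), and that No_databases and No_middlewares really are counts stored as
factors):

## convert the numeric-looking factors back to numbers before fitting
throughput$No_databases   <- as.numeric(as.character(throughput$No_databases))
throughput$No_middlewares <- as.numeric(as.character(throughput$No_middlewares))

## refit with the counts as numeric predictors
throughput.fit <- glm(log(Throughput) ~ No_databases * No_middlewares +
                          Partitioning + Queue_size,
                      data = throughput)

## keep the counts in the prediction grid numeric (no as.factor), and give
## the remaining factor the levels seen in the training data (this assumes
## Queue_size is a factor in 'throughput')
experiments <- expand.grid(No_databases   = seq(1000, 100, by = -200),
                           Partitioning   = c("sharding", "replication"),
                           No_middlewares = seq(500, 100, by = -100),
                           Queue_size     = factor(100,
                                  levels = levels(throughput$Queue_size)))

throughput.pred <- predict(throughput.fit, experiments, type = "response")

Whether a linear extrapolation on the log scale is trustworthy that far
outside the measured range is of course a separate question.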
Weidong
On Mon, Dec 26, 2011 at 6:29 AM, Giovanni Azua <bravegag at gmail.com>
wrote:
> Hello,
>
> I have tried reading the documentation and googling for the answer but
> reviewing the online matches I end up more confused than before.
>
> My problem is apparently simple. I fit a glm model (2^k experiment), and
> then I would like to predict the response variable (Throughput) for unseen
> factor levels.
>
> When I try to predict I get the following error:
>> throughput.pred <- predict(throughput.fit,experiments,type="response")
> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev =
>   object$xlevels) :
>   factor 'No_databases' has new level(s) 200, 400, 600, 800, 1000
>
> Of course these are new factor levels; that is exactly what I am trying to
> achieve, i.e. extrapolate the values of Throughput.
>
> Can anyone please advise? Below I include all details.
>
> Thanks in advance,
> Best regards,
> Giovanni
>
>> # define the extreme (factors and levels)
>> experiments <- expand.grid(No_databases   = seq(1000,100,by=-200),
> +                            Partitioning   = c("sharding", "replication"),
> +                            No_middlewares = seq(500,100,by=-100),
> +                            Queue_size     = c(100))
>> experiments$No_databases <- as.factor(experiments$No_databases)
>> experiments$Partitioning <- as.factor(experiments$Partitioning)
>> experiments$No_middlewares <- as.factor(experiments$No_middlewares)
>> experiments$Queue_size <- as.factor(experiments$Queue_size)
>> str(experiments)
> 'data.frame':   50 obs. of  4 variables:
>  $ No_databases  : Factor w/ 5 levels "200","400","600",..: 5 4 3 2 1 5 4 3 2 1 ...
>  $ Partitioning  : Factor w/ 2 levels "sharding","replication": 1 1 1 1 1 2 2 2 2 2 ...
>  $ No_middlewares: Factor w/ 5 levels "100","200","300",..: 5 5 5 5 5 5 5 5 5 5 ...
>  $ Queue_size    : Factor w/ 1 level "100": 1 1 1 1 1 1 1 1 1 1 ...
>  - attr(*, "out.attrs")=List of 2
>   ..$ dim     : Named int  5 2 5 1
>   .. ..- attr(*, "names")= chr  "No_databases" "Partitioning" "No_middlewares" "Queue_size"
>   ..$ dimnames:List of 4
>   .. ..$ No_databases  : chr  "No_databases=1000" "No_databases= 800" "No_databases= 600" "No_databases= 400" ...
>   .. ..$ Partitioning  : chr  "Partitioning=sharding" "Partitioning=replication"
>   .. ..$ No_middlewares: chr  "No_middlewares=500" "No_middlewares=400" "No_middlewares=300" "No_middlewares=200" ...
>   .. ..$ Queue_size    : chr "Queue_size=100"
>> head(experiments)
>   No_databases Partitioning No_middlewares Queue_size
> 1         1000     sharding            500        100
> 2          800     sharding            500        100
> 3          600     sharding            500        100
> 4          400     sharding            500        100
> 5          200     sharding            500        100
> 6         1000  replication            500        100
>> # or
>> throughput.fit <- glm(log(Throughput)~(No_databases*No_middlewares)+Partitioning+Queue_size,
> +                       data=throughput)
>> summary(throughput.fit)
>
> Call:
> glm(formula = log(Throughput) ~ (No_databases * No_middlewares) +
>     Partitioning + Queue_size, data = throughput)
>
> Deviance Residuals:
>     Min       1Q   Median       3Q      Max
> -2.5966  -0.6612  -0.1944   0.5548   3.2136
>
> Coefficients:
>                               Estimate Std. Error t value Pr(>|t|)
> (Intercept)                    5.74701    0.09127  62.970  < 2e-16 ***
> No_databases4                  0.43309    0.10985   3.943 8.66e-05 ***
> No_middlewares2               -1.99374    0.11035 -18.067  < 2e-16 ***
> No_middlewares4               -1.23004    0.10969 -11.214  < 2e-16 ***
> Partitioningreplication        0.33291    0.06181   5.386 9.15e-08 ***
> Queue_size100                  0.15850    0.06181   2.564   0.0105 *
> No_databases4:No_middlewares2  2.71525    0.15262  17.791  < 2e-16 ***
> No_databases4:No_middlewares4  1.94191    0.15226  12.754  < 2e-16 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> (Dispersion parameter for gaussian family taken to be 0.8921778)
>
>     Null deviance: 2175.58  on 936  degrees of freedom
> Residual deviance:  828.83  on 929  degrees of freedom
> AIC: 2562.2
>
> Number of Fisher Scoring iterations: 2
>
>> throughput.pred <- predict(throughput.fit,experiments,type="response")
> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev =
>   object$xlevels) :
>   factor 'No_databases' has new level(s) 200, 400, 600, 800, 1000
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.