thr3ads.net - R help - [R] How can you find the optimal number of values to randomly sample to optimize random forest classification without trial and error? [Dec 2017]


Jack Arnestad

2017-Dec-02 18:43 UTC

[R] How can you find the optimal number of values to randomly sample to optimize random forest classification without trial and error?

I have data set up like the following:

control1 <- sample(1:75, 3947398, replace=TRUE)
control2 <- sample(1:75, 28793, replace=TRUE)
control3 <- sample(1:100, 392733, replace=TRUE)
control4 <- sample(1:75, 858383, replace=TRUE)
patient1 <- sample(1:100, 28048, replace=TRUE)
patient2 <- sample(1:50, 80400, replace=TRUE)
patient3 <- sample(1:100, 48239, replace=TRUE)
control <- list(control1, control2, control3, control4)
patient <- list(patient1, patient2, patient3)

To classify these samples as either control or patient, I want to make
frequency distributions of the presence of each of the 100 variables being
considered. To do this, I randomly sample "s" values from each sample and
generate a frequency vector of length 100. This is how I would do it:

s <- 5000  # number of values to draw from each sample (starting value)
control_s <- list()
patient_s <- list()
for (i in 1:length(control))
        control_s[[i]] <- sample(control[[i]], s)
for (i in 1:length(patient))
        patient_s[[i]] <- sample(patient[[i]], s)

Once I do this, I generate the frequency vector of length 100 as follows:

controlfreq <- list()
for (i in 1:length(control_s)){
controlfreq[[i]] <-
    as.data.frame(prop.table(table(factor(
        control_s[[i]], levels = 1:100
    ))))[,2]
}
patientfreq <- list()
for (i in 1:length(patient_s)){
patientfreq[[i]] <-
    as.data.frame(prop.table(table(factor(
        patient_s[[i]], levels = 1:100
    ))))[,2]
}
controlfreq <- t(as.data.frame(controlfreq))
controltrainingset <- transform(controlfreq, status = "control")
patientfreq <- t(as.data.frame(patientfreq))
patienttrainingset <- transform(patientfreq, status = "patient")

dataset <- rbind(controltrainingset, patienttrainingset)
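
For reference, the two steps above (draw "s" values, then turn them into a length-100 frequency vector) can be folded into a single function. The following is only a minimal sketch in base R: the name freq_vector and its n_levels argument are hypothetical and do not appear in the code above, and it assumes every value is an integer between 1 and n_levels.

```r
# Hypothetical helper (not part of the original post): draw s values from one
# sample without replacement and return the relative frequency of 1..n_levels.
freq_vector <- function(x, s, n_levels = 100) {
  # sample() without replacement cannot draw more values than x contains
  stopifnot(s <= length(x))
  drawn <- sample(x, s)
  # tabulate() counts how often each integer 1..n_levels occurs in drawn
  tabulate(drawn, nbins = n_levels) / s
}

set.seed(1)
toy <- sample(1:75, 1000, replace = TRUE)  # small stand-in for one sample
fv <- freq_vector(toy, s = 200)
length(fv)  # 100
```

Applying such a function to every element of "control" and "patient" and stacking the results row-wise gives the same layout as "dataset".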

The resulting data frame, "dataset", is what I pass to the classification
algorithm. My goal in this post is to figure out how to identify the optimal
"s" value so that the highest ROC is achieved. I am using "rf" from the caret
package to do classification.

library(caret)
fitControl <- trainControl(method = "LOOCV", classProbs = TRUE,
                           savePredictions = TRUE)
model <- train(status ~ ., data = dataset, method = "rf",
               trControl = fitControl)

How can I automate it to start "s" at 5000, change it to another value, and,
based on the change in ROC, keep changing "s" to work towards the best
possible "s" value?
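
One possible shape for such automation is a plain grid search over candidate "s" values. This is a swapped-in, simple technique, not an adaptive optimizer. The sketch below is hedged accordingly: search_s, score_fn, reps and toy_score are all hypothetical names. In this setting score_fn would have to rebuild the data frame for a given "s", refit the model, and return the resampled ROC (with caret that would typically mean requesting summaryFunction = twoClassSummary and metric = "ROC"). Here a toy score stands in so the code runs on its own. Because every draw is random, each score is noisy, so the sketch averages over a few repetitions.

```r
# Sketch only: score_fn is a placeholder for "rebuild the dataset with this s,
# refit the model, and return its ROC". It is an assumption, not caret API.
search_s <- function(candidates, score_fn, reps = 3) {
  mean_score <- vapply(candidates, function(s) {
    # average over reps evaluations to dampen sampling noise
    mean(replicate(reps, score_fn(s)))
  }, numeric(1))
  results <- data.frame(s = candidates, score = mean_score)
  list(results = results, best_s = candidates[which.max(mean_score)])
}

# Sample sizes taken from the data set-up at the top of the post. The largest
# usable s is the smallest of them, since sampling is without replacement.
sizes <- c(3947398, 28793, 392733, 858383, 28048, 80400, 48239)
candidates <- unique(round(exp(seq(log(500), log(min(sizes)),
                                   length.out = 8))))

# Toy stand-in so the sketch is self-contained; it is NOT a real ROC.
toy_score <- function(s) 1 - abs(log(s) - log(5000)) / 10

out <- search_s(candidates, toy_score, reps = 1)
out$results
out$best_s
```

The candidates are spaced evenly on a log scale between 500 and the smallest sample size (28048, from patient1), so the grid covers both small and large "s" with only a few model fits.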

Thanks!

