If I understand your situation correctly, you may be able to use the
"strata" and "sampsize" arguments of randomForest() to draw bootstrap
samples that resemble the original data distribution. "strata" names the
(factor) variable to stratify on, and "sampsize" gives the number of
cases to draw from each stratum.
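
As a rough sketch of what I mean (not tested on your data): the data
frame below is a made-up stand-in for dataset2, the column names
comorb1..comorb14 are hypothetical, and the sampsize counts are just one
choice that gives 11% "Yes" in each tree's sample. Note that each entry
of sampsize has to stay at or below the number of rows in its stratum.

```r
library(randomForest)

set.seed(1)

# Hypothetical stand-in for dataset2: 1500 unique records, 14 Yes/No
# comorbidity factors, and an adoption outcome with roughly 38% "Yes".
n <- 1500
x <- as.data.frame(lapply(1:14, function(i) {
  factor(sample(c("No", "Yes"), n, replace = TRUE), levels = c("No", "Yes"))
}))
names(x) <- paste0("comorb", 1:14)
y <- factor(sample(c("No", "Yes"), n, replace = TRUE, prob = c(0.62, 0.38)),
            levels = c("No", "Yes"))

# Per-tree sample stratified on the outcome: 55 of 500 cases (11%) "Yes".
# These counts are an illustrative assumption, not a recommendation.
sampsize <- c(No = 445, Yes = 55)

# randomForest() rejects a sampsize entry larger than its stratum.
stopifnot(all(sampsize <= table(y)[names(sampsize)]))

fit <- randomForest(x, y, strata = y, sampsize = sampsize, ntree = 200)
print(fit)
```

This only controls the Yes/No mix in each tree's sample. Within a class,
rows are still drawn with equal probability, so the frequency counts
attached to the unique records are not used here.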
Best,
Andy
From: Raghu Naik
> Folks,
>
> I have a query around weighting in Random Forest (RF). I know that
> several earlier emails in this group have raised this issue, but I did
> not find an answer to my query.
>
> I am working on a dataset (dataset1) that consists of 4 million
> records that can be reduced to a dataset (dataset2) of approximately
> 1500 unique records with frequency counts that add up to the 4 million
> records above. Because of size issues, I cannot work with dataset1 in
> R, so I am working with dataset2.
>
> Each record holds 14 comorbidity (Yes/No) variables and whether or not
> the patient chose a particular drug; I am using RF to understand the
> comorbidity drivers of the drug adoption (Yes/No) classification.
>
> At the full dataset level (dataset1), the drug adoption incidence is
> ~11%. At the reduced dataset level (dataset2), the drug adoption
> incidence increases to ~38%.
>
> My question is: if I am using the reduced dataset (dataset2), how
> should I inform RF that the adoption incidence at the full dataset
> level was 11%? Should that be used as a classwt prior with
> classwt=c(Yes=.11, No=.89)? My understanding is that RF does not allow
> case weighting. Or can this be handled with the sampsize argument
> through oversampling? What proportions should one use for this (e.g.,
> sampsize=c(Yes=100, No=100))?
>
> I would appreciate any feedback or pointers to any earlier thread that
> I may have overlooked.
>
> Regards,
>
> Raghu