thr3ads.net - R help - [R] How to subset my data and at the same time keep the balance? [Nov 2012]


Eddie Smith

2012-Nov-19 17:16 UTC

[R] How to subset my data and at the same time keep the balance?

Hi guys,

I have 1000 rows of a dataset. In my analysis, I need 70% of the data,
run my analysis and then use the remaining 30% to test my model.

Could anybody kindly help me on this?

Cheers

Rui Barradas

2012-Nov-19 17:25 UTC


Hello,

See the following example.

x <- matrix(rnorm(2000), ncol = 2)

idx <- sample(nrow(x), 0.7*nrow(x))
x2 <- x[idx, ]
nrow(x2)  # 700

x3 <- x[-idx, ]
nrow(x3)  # 300

Hope this helps,

Rui Barradas
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Sarah Goslee

2012-Nov-19 17:26 UTC


I'm not sure what you mean by "balance", but you can use sample() to
randomly order the values 1:1000, then use the first 700 as row
indices for the first set, and the last 300 as the test set.
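
A minimal sketch of the shuffle-then-slice idea described above. The seed value and the stand-in matrix `x` are assumptions made only so the example runs on its own; they are not from the original post.

```r
set.seed(42)                          # hypothetical seed, only for reproducibility
n <- 1000
x <- matrix(rnorm(2 * n), ncol = 2)   # stand-in data with 1000 rows

perm  <- sample(n)                    # random permutation of 1:n
train <- x[perm[1:700], ]             # first 700 shuffled indices
test  <- x[perm[701:n], ]             # last 300 shuffled indices

nrow(train)
nrow(test)
```

Because every index appears exactly once in `perm`, the two sets cannot share a row.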

Sarah

--
Sarah Goslee
http://www.functionaldiversity.org

arun

2012-Nov-19 17:31 UTC


Hi,
Maybe this helps:
dat1 <- read.table(text="
   V1 V2
1   5 10
2   6  3
3   8  4
4   9 20
5  15 30
6  25 40
7   2  4
8   3  1
9   1  5
10  8 10
",header=TRUE)
dat2 <- dat1[sample(NROW(dat1), NROW(dat1)*(1-0.3)), ]  # 70% of data
dat2$newcol <- TRUE    # flag the sampled rows
dat1$newcol1 <- TRUE
dat4 <- merge(dat1, dat2, by=c("V1","V2"), all=TRUE)
dat5 <- dat4[is.na(dat4$newcol), ][, 1:2]  # remaining 30%
dat5
# Example output (rows vary between runs because sample() is random):
#  V1 V2
#2  2  4
#4  5 10
#8  9 20
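
The merge above matches rows by value, so if `dat1` contained duplicate rows, an unsampled copy could be matched to its sampled twin and dropped from the remainder. A hedged alternative is to keep the sampled row indices and take their complement. The seed and the `idx` and `n_train` names below are illustrative assumptions, not part of the original reply.

```r
set.seed(1)  # hypothetical seed, only for reproducibility

# Same ten rows as above, built directly as a data frame.
dat1 <- data.frame(V1 = c(5, 6, 8, 9, 15, 25, 2, 3, 1, 8),
                   V2 = c(10, 3, 4, 20, 30, 40, 4, 1, 5, 10))

n_train <- round(0.7 * nrow(dat1))                   # number of rows in the 70% part
idx     <- sample(nrow(dat1), n_train)               # sampled row indices
dat2    <- dat1[idx, ]                               # 70% of data
dat5    <- dat1[setdiff(seq_len(nrow(dat1)), idx), ] # remaining 30%, by index

nrow(dat2)
nrow(dat5)
```

`setdiff()` on indices is used instead of `dat1[-idx, ]` here because negative indexing returns zero rows when `idx` happens to be empty.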
A.K.




Eddie Smith

2012-Nov-19 19:07 UTC


Thanks a lot! I got some ideas from all the replies and here is the final one.

newdata

select <- sample(nrow(newdata), nrow(newdata) * .7)
data70 <- newdata[select,]  # select
write.csv(data70, "data70.csv", row.names=FALSE)

data30 <- newdata[-select,]  # testing
write.csv(data30, "data30.csv", row.names=FALSE)
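
If "balance" in the subject line meant keeping the same class proportions in both parts, one possible sketch is to sample 70% within each class. This is only an assumption about the intent: the `class` column, its 800/200 split, the stand-in `newdata`, and the seed are all hypothetical.

```r
set.seed(123)  # hypothetical seed, only for reproducibility

# Stand-in for newdata with a hypothetical grouping column.
newdata <- data.frame(
  value = rnorm(1000),
  class = rep(c("a", "b"), times = c(800, 200))
)

# Row indices grouped by class, then 70% drawn inside each group.
rows_by_class <- split(seq_len(nrow(newdata)), newdata$class)
select <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), round(0.7 * length(rows)))]
}), use.names = FALSE)

data70 <- newdata[select, ]   # 70%, same class mix as newdata
data30 <- newdata[-select, ]  # remaining 30%

table(data70$class)
table(data30$class)
```

Sampling positions with `sample.int()` and then indexing `rows` avoids the surprise that `sample(rows, ...)` treats a single number as the range `1:rows`.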

Cheers

Brian Feeny

2012-Nov-20 02:23 UTC


Just curious: once you have a model that works well, does it make sense to then
tune it against 100% of the dataset (with known outcomes) so you can apply it
to the data you wish to predict, or is that a bad approach?

I have done what is explained in this thread many times: taken a sample, learned
against it, and then tested on the remainder. But this uses data for which we
know the predicted variable and can compare to validate. So after you're done,
should you re-tune with the entire training set?

As for which method, I am mostly using SVM.
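
For readers following along, the sample / learn / test workflow described above might look like the sketch below. It uses `lm()` on synthetic data instead of an SVM so that it runs in base R without extra packages; the data, the seed, and the choice of RMSE as the error measure are all assumptions for illustration.

```r
set.seed(2012)  # hypothetical seed, only for reproducibility

n   <- 1000
dat <- data.frame(x = rnorm(n))
dat$y <- 2 * dat$x + rnorm(n)          # synthetic outcome, illustration only

idx   <- sample(n, round(0.7 * n))     # 70% of the row indices
train <- dat[idx, ]
test  <- dat[-idx, ]

fit  <- lm(y ~ x, data = train)        # learn on the 70%
pred <- predict(fit, newdata = test)   # predict the held-out 30%

rmse <- sqrt(mean((test$y - pred)^2))  # error on rows the model never saw
rmse
```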

Brian


Jeff Newmiller

2012-Nov-20 05:24 UTC


No.
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.

