Emmanuel Levy
2008-Aug-12 23:35 UTC
[R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?
Dear All,

I have a large data frame (2700000 lines and 14 columns), and I would like to
extract the information in a particular way illustrated below:

Given a data frame "df":

> col1=sample(c(0,1),10, rep=T)
> names = factor(c(rep("A",5),rep("B",5)))
> df = data.frame(names,col1)
> df
   names col1
1      A    1
2      A    0
3      A    1
4      A    0
5      A    1
6      B    0
7      B    0
8      B    1
9      B    0
10     B    0

I would like to transform it into the form:

> index = c("A","B")
> col1[[1]]=df$col1[which(df$name=="A")]
> col1[[2]]=df$col1[which(df$name=="B")]

My problem is that the command: *** which(df$name=="A") ***
takes about 1 second because df is so big.

I was thinking that a "level" could maybe be accessed instantly, but I am not
sure how to do it.

I would be very grateful for any advice that would allow me to speed this up.

Best wishes,

Emmanuel
Peter Cowan
2008-Aug-13 02:31 UTC
[R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?
Emmanuel,

On Tue, Aug 12, 2008 at 4:35 PM, Emmanuel Levy <emmanuel.levy at gmail.com> wrote:
> Dear All,
>
> I have a large data frame ( 2700000 lines and 14 columns), and I would like to
> extract the information in a particular way illustrated below:
>
> Given a data frame "df":
>
>> col1=sample(c(0,1),10, rep=T)
>> names = factor(c(rep("A",5),rep("B",5)))
>> df = data.frame(names,col1)
>> df
>    names col1
> 1      A    1
> 2      A    0
> 3      A    1
> 4      A    0
> 5      A    1
> 6      B    0
> 7      B    0
> 8      B    1
> 9      B    0
> 10     B    0
>
> I would like to transform it into the form:
>
>> index = c("A","B")
>> col1[[1]]=df$col1[which(df$name=="A")]
>> col1[[2]]=df$col1[which(df$name=="B")]

I'm not sure I fully understand your problem; your example would not run for
me. You could get a small speedup by omitting which(), since you can subset
with the logical vector directly.

> n <- 2700000
> foo <- data.frame(
+   one = sample(c(0,1), n, rep = T),
+   two = factor(c(rep("A", n/2 ),rep("B", n/2 )))
+ )
> system.time(out <- which(foo$two=="A"))
   user  system elapsed
  0.566   0.146   0.761
> system.time(out <- foo$two=="A")
   user  system elapsed
  0.429   0.075   0.588

You might also find a use for unstack(), though I didn't see a speedup.

> system.time(out <- unstack(foo))
   user  system elapsed
  1.068   0.697   2.004

HTH

Peter

> My problem is that the command: *** which(df$name=="A") ***
> takes about 1 second because df is so big.
>
> I was thinking that a "level" could maybe be accessed instantly but I am not
> sure about how to do it.
>
> I would be very grateful for any advice that would allow me to speed this up.
>
> Best wishes,
>
> Emmanuel
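
On the idea that a "level" could be looked up directly: one possible approach, not proposed in the thread itself, is base R's split(), which groups a vector by the levels of a factor in a single call. The sketch below rebuilds the small example from the question (the set.seed() call and the by_name variable are assumptions added for illustration). Whether this beats repeated which() calls on 2.7 million rows has not been measured here and would need to be checked with system.time().

```r
# Small stand-in for the large data frame in the thread (same column names).
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
df <- data.frame(
  names = factor(c(rep("A", 5), rep("B", 5))),
  col1  = sample(c(0, 1), 10, replace = TRUE)
)

# split() groups col1 by the levels of the factor in one call and
# returns a named list with one element per level.
by_name <- split(df$col1, df$names)

# Each group is now a list lookup by level name rather than a new
# comparison over the whole column.
by_name[["A"]]
by_name[["B"]]

# The groups hold the same values as the per-level subsetting in the thread.
identical(by_name[["A"]], df$col1[df$names == "A"])
```

The grouping cost is paid once up front, so this is only likely to help when several levels are extracted from the same data frame.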