> On Jul 4, 2015, at 3:09 AM, Alex Kim <dumboisverydumb at gmail.com> wrote:
>
> Hi guys,
>
> Suppose I have an extremely large data frame with 2 columns and 0.5 million
> rows. For example, the last 6 rows may look like this:
> .
> ..
> ...
> 89 100
> 93 120
> 95 125
> 101 NA
> 115 NA
> 123 NA
> 124 NA
>
> I would like to manipulate this data frame to output a data frame that
> looks like:
>
> 100 89, 93, 95
> 120 101, 115
> 125 123, 124
>
> What would be the absolute quickest way to do this, given that there are
> many rows? Currently I have this:
>
> # m is the large two column data frame
> end <- na.omit(m[,'V2']);
> out <- data.frame(End=end,
>   Start=unname(sapply(
>     split(m[,'V1'],findInterval(m[,'V1'],end))[as.character(0:c(length(end)-1))],
>     paste,collapse='.')))
>
This might be a little faster. It skips some of the steps in your version:
dput(m)
structure(list(V1 = c(89, 93, 95, 101, 115, 123, 124),
               V2 = c(100, 120, 125, NA, NA, NA, NA)),
          .Names = c("V1", "V2"), row.names = c(NA, -7L),
          class = "data.frame")

end <- na.omit(m[,'V2'])
# findInterval() needs its second argument sorted in increasing order,
# so this will only work if 'end' is sorted
data.frame(End = end,
           Start = sapply( split( m$V1,
                                  findInterval(m$V1, c(-Inf, end))),
                           paste, collapse="," ) )
  End    Start
1 100 89,93,95
2 120  101,115
3 125  123,124
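
One caveat with the split() approach: if some interval contains no V1 values, split() returns fewer groups than there are End values and data.frame() will complain about differing numbers of rows. The sketch below is one possible way around that, assuming you want an empty string for an empty interval. The names grp and out are only illustrative, and any V1 value at or above the largest End is silently dropped here, which may or may not be what you want.

```r
# Toy data from this thread
m <- data.frame(V1 = c(89, 93, 95, 101, 115, 123, 124),
                V2 = c(100, 120, 125, NA, NA, NA, NA))

end <- m$V2[!is.na(m$V2)]       # plain numeric vector of End values
stopifnot(!is.unsorted(end))    # findInterval() needs sorted breakpoints

# Interval index for each V1 value; 1 means "below end[1]"
grp <- findInterval(m$V1, c(-Inf, end))

# Explicit factor levels keep empty intervals as (empty) groups.
# Assumption: values >= max(end) become NA here and are dropped by split().
grp <- factor(grp, levels = seq_along(end))

out <- data.frame(End = end,
                  Start = vapply(split(m$V1, grp), paste, character(1),
                                 collapse = ","),
                  row.names = NULL)
out
```

vapply() is used instead of sapply() only so that the result is guaranteed to be a character vector with one element per End value. I have not timed this on 0.5 million rows.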
> However this is taking a little bit too long.
>
> Thank you for your help!
>
> [[alternative HTML version deleted]]
This is a plain-text mailing list and posting triplicate questions is poor form.
Do read the posting guide.
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
David Winsemius, MD
Alameda, CA, USA