Dustin
2011-Jan-30 18:49 UTC
[R] Extract subsets of different and unknown lengths from huge dataset
Dear prospective reader,

I apologize for posting my problem, but I have no idea how to go on with processing this huge (over 70 MB) dataset. Thank you in advance for any help or comment! I do appreciate it!

My text file contains 1 column of interest (numbers/values only). The overall issue is to extract 'events', the starting points of which are defined by at least 24 preceding values being equal to 0. Then, if the 25th value is greater than 0, this is the start of an event of unknown length (unknown number of values). The end of an event is again defined by at least 24 values being equal to 0. I want to subset the single events for the purpose of examining the maximum value within each event.

I tried:

> xx1 <- read.table(pipe("cut -f2 corrected_data.txt"),header=T)
> nrow(xx1)
[1] 2500000
> start1 <- data.frame(start=rep("NA",length.out=nrow(xx1)))
> stop1 <- data.frame(stop=rep("NA",length.out=nrow(xx1)))
> max.xx1 <- data.frame(max.xx=rep("NA",length.out=nrow(xx1)))
> XXframe <- data.frame(XX=xx1, start=start1, stop=stop1, max.xx=max.xx1)
> attach(XXframe)
> for(i in 1:(nrow(XX)-25)){
+ start[i+24] <- ifelse(XX[i:(i+23)]==0 && XX[i+24]>0, "start", "NA")
+ }

But this doesn't work, and every time I try it again after changing the 'start' and the 'NA' within 'ifelse', e.g. into integers, a different error appears (after hours). This is only to set starts and stops; for the original issue I would further try to number the starts and then maybe subset the single events using subset(). Do you think this could work, or does anyone know a way to number the events? This would help me a lot!

Thanks again,
Dustin

--
View this message in context: http://r.789695.n4.nabble.com/Extract-subsets-of-different-and-unknown-lengths-from-huge-dataset-tp3247511p3247511.html
Sent from the R help mailing list archive at Nabble.com.
Phil Spector
2011-Jan-30 20:17 UTC
[R] Extract subsets of different and unknown lengths from huge dataset
A reproducible example would be nice, but if I understand you, you want to find the index of values which are preceded by at least 24 zeroes. The rle (run length encoding) function is very handy for problems like these. Suppose the vector of interest is called "vec". To create a vector called "start" whose value is "NA" except for those positions immediately after at least 24 zeroes, you could try something like this:

    start = rep("NA",length(vec))
    rls = rle(vec==0)
    # position just after each run of at least 24 zeroes
    ind = cumsum(rls$lengths)[rls$lengths >= 24 & rls$values == TRUE] + 1
    # a qualifying run of zeroes at the very end would point past the end of vec
    ind = ind[ind <= length(vec)]
    start[ind] = 'start'

To number the starts, you could use something like

    num = rep(0,length(vec))
    num[start == 'start'] = seq_len(sum(start == 'start'))

- Phil Spector
  Statistical Computing Facility
  Department of Statistics
  UC Berkeley
  spector at stat.berkeley.edu
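
As a follow-up to the rle idea, here is a minimal sketch of how the same run-length encoding could be carried through to the original goal, the maximum within each event. It is not code from the thread: the function name event_maxima and the argument min_gap are made up for illustration. It assumes that zero runs shorter than 24 belong to the surrounding event, that data before the first separating run of zeroes is not an event, and that a final event without a closing run of zeroes is still reported.

```r
# Hypothetical helper (not from the thread): one row per event with its
# start index, end index and maximum value.
event_maxima <- function(vec, min_gap = 24) {
  vec <- as.numeric(vec)

  # Runs of zeroes / non-zeroes; only zero runs of at least min_gap separate events.
  zero_runs <- rle(vec == 0)
  is_gap <- zero_runs$values & zero_runs$lengths >= min_gap

  # One flag per observation: TRUE inside a separating run of zeroes.
  gap <- rep(is_gap, zero_runs$lengths)

  # Alternating gap / non-gap segments with their start and end positions.
  segs <- rle(gap)
  seg_end <- cumsum(segs$lengths)
  seg_start <- seg_end - segs$lengths + 1

  # An event is a non-gap segment that follows a gap, i.e. any non-gap
  # segment except a leading one (assumption stated above).
  keep <- !segs$values & seq_along(segs$values) > 1
  starts <- seg_start[keep]
  ends <- seg_end[keep]

  data.frame(
    event = seq_along(starts),
    start = starts,
    end   = ends,
    max   = vapply(seq_along(starts),
                   function(i) max(vec[starts[i]:ends[i]]),
                   numeric(1))
  )
}

# Small made-up vector: two events separated by runs of at least 24 zeroes;
# the two zeroes at positions 33-34 stay inside the first event.
vec <- c(rep(0, 30), 3, 7, 0, 0, 5, rep(0, 24), 2, 9, 4, rep(0, 40))
res <- event_maxima(vec)
print(res)
```

Only the final vapply loops, and it loops over events rather than over all 2.5 million values, so this should avoid the hours-long run of an element-by-element for loop. If events with no closing run of zeroes should be discarded instead, the keep condition would need an extra check that the segment is not the last one.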