I've got two data frames, as shown below (NR means Number of Record):

> record.lenths
  NR length
   1    100
   2    130
   3    150
   4    148
   5    100
   6     83
   7     60

> valida.records
  NR factor
   1      3
   2      4
   4      8
   7      9

And I intend to obtain the following skip-table:

> skip.table
  NR skip factor
   1    0      3
   2    0      4
   4  150      8
   7  183      9

The column 'skip' is the space needed to skip invalid records.

For example, the 3rd element of skip.table has a skip of '150', intended to skip the invalid record No.3 in record.lengths.

For example, the 4th element of skip.table has a skip of '183', intended to skip the invalid records No.5 and No.6, which together are 100+83.

It's apparently intended for reading huge data files, and looks like simple math, but I admit I couldn't find an R-ish way of doing it.

Thanks in advance, and thanks also for pointing out whether I was on the right track to start with.
try this:

> record.length <- read.table(text = "  NR length
+    1    100
+    2    130
+    3    150
+    4    148
+    5    100
+    6     83
+    7     60", header = TRUE)
> valida.records <- read.table(text = "  NR factor
+    1      3
+    2      4
+    4      8
+    7      9", header = TRUE)
> x <- merge(record.length, valida.records, by = "NR", all.x = TRUE)
> x$seq <- cumsum(!is.na(x$factor))
>
> # need to add 1 to lines with NA to associate with next group
> x$seq[is.na(x$factor)] <- x$seq[is.na(x$factor)] + 1
>
> # split by 'seq', output last record and sum of preceding records
> do.call(rbind
+     , lapply(split(x, x$seq), function(.sk){
+         if (nrow(.sk) > 1) .sk$skip <- sum(.sk$length[1:(nrow(.sk) - 1L)])
+         else .sk$skip <- 0
+         .sk[nrow(.sk), ]  # return last row of the group (the valid record)
+     })
+ )
  NR length factor seq skip
1  1    100      3   1    0
2  2    130      4   2    0
3  4    148      8   3  150
4  7     60      9   4  183
>

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

On Thu, Sep 12, 2013 at 1:17 PM, Zhang Weiwu <zhangweiwu at realss.com> wrote:
> I've got two data frames, as shown below:
> (NR means Number of Record)
> [...]
> > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
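To see why the cumsum() step in the reply above works, it can help to look at the grouping column on its own. The following is only a minimal sketch on the same sample data: the names `invalid`, `grp` and `skip` are illustrative, and tapply() is swapped in for the split()/lapply() step, so this is not Jim's code as posted.

```r
record.lenths <- data.frame(NR = 1:7,
                            length = c(100, 130, 150, 148, 100, 83, 60))
valida.records <- data.frame(NR = c(1, 2, 4, 7), factor = c(3, 4, 8, 9))

# Left join: invalid records end up with NA in 'factor'.
x <- merge(record.lenths, valida.records, by = "NR", all.x = TRUE)
invalid <- is.na(x$factor)

# Each valid record closes a group; an invalid record is pushed into the
# group of the next valid record by adding 1 to its running count.
grp <- cumsum(!invalid) + invalid

# Within each group, add up the lengths of the invalid records only.
skip <- as.vector(tapply(ifelse(invalid, x$length, 0), grp, sum))

print(grp)
print(skip)
```

This assumes the last record in the file is a valid one; trailing invalid records would form an extra group with no valid record to attach to.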
n.record <- length(record.lenths$NR)
index <- record.lenths$NR %in% valida.records$NR
tmp <- 1:n.record
ind <- tmp[index]
st <- 1
skip <- rep(0, length(ind))
for (i in 1:length(ind)) {
  if (st < ind[i]) {
    skip[i] <- sum(record.lenths$length[st:(ind[i] - 1)])
  }
  st <- ind[i] + 1
}

2013/9/12 Zhang Weiwu <zhangweiwu@realss.com>
> I've got two data frames, as shown below:
> (NR means Number of Record)
> [...]
Hi,

Maybe this helps:

record.length <- read.table(text = "NR length
1 100
2 130
3 150
4 148
5 100
6 83
7 60", sep = "", header = TRUE)

valida.records <- read.table(text = "NR factor
1 3
2 4
4 8
7 9", sep = "", header = TRUE)

indx <- diff(valida.records$NR) - 1
skip.table <- within(valida.records, {
  skip <- with(record.length,
               tapply(length, c(-1, rep(indx, indx + 1)),
                      function(x) sum(x[-length(x)])))
})[, c(1, 3, 2)]

skip.table
#  NR skip factor
#1  1    0      3
#2  2    0      4
#3  4  150      8
#4  7  183      9

A.K.

----- Original Message -----
From: Zhang Weiwu <zhangweiwu at realss.com>
To: r-help at r-project.org
Cc:
Sent: Thursday, September 12, 2013 1:17 PM
Subject: [R] on how to make a skip-table

I've got two data frames, as shown below:
(NR means Number of Record)
[...]
It is a nice surprise to wake up to three answers, all producing correct results. Many thanks to all of you. Jim Holtman solved it with amazing clarity, Gang Peng used a traditional C-like pointer style, and Arun wrote awesomely tight code thanks to diff().

I am embarrassed to see my misspellings inherited in the answers ('lenths' should be 'lengths' and 'valida' should be 'valid'). This experience should teach me not to code at midnight again.

For anyone wishing to test these methods, I have compiled them all into one R script file, pasted at the end of this email.

Jim Holtman asked me to elaborate on the problem: it is a common problem in reading sparse variable-length record data files. Records are stored in the file one after another. The length of each record is known in advance, but a lot of the records are invalid and should be skipped to make efficient use of memory.

Ideally the datafile-reading routine should receive a skip-table. Before reading each wanted/valid record, it seeks forward by the distance given in the skip-table. The problem is how to obtain such a skip table. What we have at hand to produce the skip table is a set of two data frames: a record.lengths data frame giving each record's length, and a valid.records data frame saying which records are significant and should be read.
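The reading routine described above could look roughly like the following. This is a hedged sketch only: the function name `read_valid_records` and its arguments are hypothetical, the demo file is synthetic (record k is filled with the byte value k), and it assumes the lengths are in bytes. Note that R's own documentation for seek() cautions against relying on it on Windows.

```r
# Hypothetical reader: 'skip' and 'len' are parallel vectors with one entry
# per valid record (bytes to jump over, then bytes to read).
read_valid_records <- function(path, skip, len) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  lapply(seq_along(len), function(i) {
    # Jump over the invalid bytes that precede this record.
    if (skip[i] > 0) seek(con, where = skip[i], origin = "current")
    readBin(con, what = "raw", n = len[i])
  })
}

# Synthetic demo file: seven records, record k consists of bytes equal to k.
lens <- c(100, 130, 150, 148, 100, 83, 60)
path <- tempfile()
writeBin(as.raw(rep(1:7, times = lens)), path)

# Skip values and lengths taken from the sample skip-table (records 1, 2, 4, 7).
recs <- read_valid_records(path,
                           skip = c(0, 0, 150, 183),
                           len  = lens[c(1, 2, 4, 7)])

# First byte of each record read; it identifies which record was reached.
print(sapply(recs, function(r) as.integer(r[1])))
```

Only the valid records are ever held in memory, which is the point of carrying a skip column rather than reading everything and filtering afterwards.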
--
###### input data:

record.lengths <- read.table(text = "  NR length
   1    100
   2    130
   3    150
   4    148
   5    100
   6     83
   7     60", header = TRUE)

valid.records <- read.table(text = "  NR factor
   1      3
   2      4
   4      8
   7      9", header = TRUE)

####### Jim Holtman's method:

x <- merge(record.lengths, valid.records, by = "NR", all.x = TRUE)
x$seq <- cumsum(!is.na(x$factor))
# need to add 1 to lines with NA to associate with next group
x$seq[is.na(x$factor)] <- x$seq[is.na(x$factor)] + 1
# split by 'seq', output last record and sum of preceding records
skip.table <- do.call(rbind
    , lapply(split(x, x$seq), function(.sk){
        if (nrow(.sk) > 1) .sk$skip <- sum(.sk$length[1:(nrow(.sk) - 1L)])
        else .sk$skip <- 0
        .sk[nrow(.sk), ]  # return last row of the group (the valid record)
    })
)
print(skip.table)

####### Gang Peng's method:

n.record <- length(record.lengths$NR)
index <- record.lengths$NR %in% valid.records$NR
tmp <- 1:n.record
ind <- tmp[index]
st <- 1
skip <- rep(0, length(ind))
for (i in 1:length(ind)) {
  if (st < ind[i]) {
    skip[i] <- sum(record.lengths$length[st:(ind[i] - 1)])
  }
  st <- ind[i] + 1
}
print(cbind(valid.records, skip))

####### Arun's method:

indx <- diff(valid.records$NR) - 1
skip.table <- within(valid.records, {
  skip <- with(record.lengths,
               tapply(length, c(-1, rep(indx, indx + 1)),
                      function(x) sum(x[-length(x)])))
})[, c(1, 3, 2)]
print(skip.table)
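
For comparison, the same table can also be derived from running byte offsets, without any grouping step. This is a sketch and not one of the posted answers: `make_skip_table` and its local names are hypothetical, and it assumes record.lengths is in file order and valid.records is sorted by NR. One possible advantage is that it does not depend on the gap sizes between valid records being distinct, which the diff()-based grouping appears to rely on.

```r
record.lengths <- data.frame(NR = 1:7,
                             length = c(100, 130, 150, 148, 100, 83, 60))
valid.records <- data.frame(NR = c(1, 2, 4, 7), factor = c(3, 4, 8, 9))

# Hypothetical helper: skip = start of this valid record minus the end of
# the previous valid record (or minus 0 for the first one).
make_skip_table <- function(record.lengths, valid.records) {
  ends   <- cumsum(record.lengths$length)   # offset just past each record
  starts <- ends - record.lengths$length    # offset where each record begins
  i <- match(valid.records$NR, record.lengths$NR)
  prev.end <- c(0, head(ends[i], -1))
  data.frame(NR     = valid.records$NR,
             skip   = starts[i] - prev.end,
             factor = valid.records$factor)
}

skip.table <- make_skip_table(record.lengths, valid.records)
print(skip.table)

# A case with two equal-sized gaps: only records 1, 3 and 5 are valid.
equal.gaps <- make_skip_table(record.lengths,
                              data.frame(NR = c(1, 3, 5), factor = c(3, 4, 8)))
print(equal.gaps$skip)
```

On the sample data this gives the skip column 0, 0, 150, 183; in the second case record 2 (130) and record 4 (148) are skipped before records 3 and 5 respectively.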