thr3ads.net - R help - [R] on how to make a skip-table [Sep 2013]


Zhang Weiwu

2013-Sep-12 17:17 UTC

[R] on how to make a skip-table

I've got two data frames, as shown below:
(NR means Number of Record)
> record.lenths
        NR     length
         1       100
         2       130
         3       150
         4       148
         5       100
         6        83
         7        60
> valida.records
        NR     factor
         1       3
         2       4
         4       8
         7       9

And I intend to obtain the following skip-table:
> skip.table
        NR     skip   factor
         1       0       3
         2       0       4
         4       150     8
         7       183     9


The column 'skip' is the space needed to skip invalid records.

For example, the 3rd element of skip.table has a skip of '150', intended to
skip the invalid record No.3 in record.lengths.

For example, the 4th element of skip.table has a skip of '183', intended to
skip the invalid records No.5 and No.6, which together are 100+83.

This is evidently meant for reading huge data files and looks like
simple math, but I admit I couldn't find an R-ish way of doing it.

Thanks in advance, and thanks also for pointing out whether I was on the
right track to start with.

jim holtman

2013-Sep-12 20:21 UTC


try this:

> record.length <- read.table(text = "    NR     length
+         1       100
+         2       130
+         3       150
+         4       148
+         5       100
+         6        83
+         7        60", header = TRUE)
> valida.records <- read.table(text = "  NR     factor
+         1       3
+         2       4
+         4       8
+         7       9", header = TRUE)
> x <- merge(record.length, valida.records, by = "NR", all.x = TRUE)
> x$seq <- cumsum(!is.na(x$factor))
>
> # need to add 1 to lines with NA to associate with next group
> x$seq[is.na(x$factor)] <- x$seq[is.na(x$factor)] + 1
>
> # split by 'seq', output last record and sum of preceding records
> do.call(rbind
+     , lapply(split(x, x$seq), function(.sk){
+         if (nrow(.sk) > 1) .sk$skip <- sum(.sk$length[1:(nrow(.sk) - 1L)])
+         else .sk$skip <- 0
+         .sk[nrow(.sk), ] # return last row of the group (the valid record)
+         })
+     )
  NR length factor seq skip
1  1    100      3   1    0
2  2    130      4   2    0
3  4    148      8   3  150
4  7     60      9   4  183
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.



Gang Peng

2013-Sep-12 20:31 UTC


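# positions of the valid records within record.lenths
n.record <- length(record.lenths$NR)
index <- record.lenths$NR %in% valida.records$NR
tmp <- 1:n.record
ind <- tmp[index]
st <- 1   # first record not yet accounted for
skip <- rep(0,length(ind))
for(i in seq_along(ind)){
  if(st<ind[i]){
    # sum the lengths of the invalid records from st to ind[i]-1
    skip[i]<-sum(record.lenths$length[st:(ind[i]-1)])
  }
  st <- ind[i]+1
}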



arun

2013-Sep-13 00:10 UTC


Hi,
Maybe this helps:

record.length <- read.table(text = "NR    length
        1      100
        2      130
        3      150
        4      148
        5      100
        6       83
        7       60", sep="",header = TRUE)
valida.records <- read.table(text = "NR    factor
        1      3
        2      4
        4      8
        7      9", sep="", header = TRUE)
indx<-diff(valida.records$NR)-1
skip.table<- within(valida.records, {skip<-
with(record.length,tapply(length,c(-1,rep(indx,indx+1)),function(x)
sum(x[-length(x)])))})[,c(1,3,2)]
skip.table
#  NR skip factor
#1  1    0      3
#2  2    0      4
#3  4  150      8
#4  7  183      9
A.K.






Zhang Weiwu

2013-Sep-13 04:18 UTC


It is a nice surprise to wake up to three answers, all producing
correct results. Many thanks to all of you.

Jim Holtman solved it with amazing clarity, Gang Peng used a traditional
C-like pointer style, and Arun wrote awesomely tight code thanks to diff().

I am embarrassed to see my misspellings inherited in the answers ('lenths'
should be 'lengths' and 'valida' should be 'valid'). This experience
teaches me not to code at midnight again.

For anyone wishing to test these methods, I have compiled them all into one 
R script file, pasted at the end of this email.

Jim Holtman asked me to elaborate on the problem:

     It is a common problem when reading sparse, variable-length record data
     files.  Records are stored in the file one after another. The length of
     each record is known in advance, but many of the records are invalid
     and should be skipped to make efficient use of memory.

     Ideally the datafile-reading routine should receive a skip-table. Before
     reading each wanted/valid record, it seeks forward by the distance
     given in the skip-table. The problem is how to obtain such a skip table.

     What we have at hand to produce the skip table is a set of two data
     frames: a record.lengths data frame giving each record's length, and a
     valid.records data frame listing which records are significant and should
     be read.
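
As a rough illustration of the reading routine described above (this is a
sketch, not code from the thread), the loop below seeks forward by 'skip'
bytes before reading each valid record. The demo file, the 'lens' vector and
the extra 'length' column are assumptions made up for the example; the file
is written so that every byte of record k has the value k, which makes the
result easy to check.

```r
# Hypothetical demo file: seven records, each byte equal to its record number.
lens <- c(100, 130, 150, 148, 100, 83, 60)
path <- tempfile()
writeBin(as.raw(rep(1:7, times = lens)), path)

# Skip table as in the thread, plus the length of each valid record
# (assumed to be looked up from the record lengths).
skip.table <- data.frame(NR = c(1, 2, 4, 7), skip = c(0, 0, 150, 183))
skip.table$length <- lens[skip.table$NR]

con <- file(path, open = "rb")
records <- vector("list", nrow(skip.table))
for (i in seq_len(nrow(skip.table))) {
  # Jump over the invalid records that precede this valid one.
  if (skip.table$skip[i] > 0) {
    seek(con, where = skip.table$skip[i], origin = "current")
  }
  # Read exactly one valid record as raw bytes.
  records[[i]] <- readBin(con, what = "raw", n = skip.table$length[i])
}
close(con)

# Record number found in each block that was read.
ids <- sapply(records, function(r) unique(as.integer(r)))
print(ids)
```

Note that the R documentation for seek() discourages its use on Windows, so
on that platform reading and discarding the skipped bytes may be the safer
choice.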

--

###### input data:

record.lengths <- read.table(text = "    NR     length
          1       100
          2       130
          3       150
          4       148
          5       100
          6        83
          7        60", header = TRUE)

valid.records <- read.table(text = "  NR     factor
          1       3
          2       4
          4       8
          7       9", header = TRUE)

####### Jim Holtman's method:

x <- merge(record.lengths, valid.records, by = "NR", all.x = TRUE)
x$seq <- cumsum(!is.na(x$factor))

# need to add 1 to lines with NA to associate with next group
x$seq[is.na(x$factor)] <- x$seq[is.na(x$factor)] + 1

# split by 'seq', output last record and sum of preceding records
skip.table <- do.call(rbind
      , lapply(split(x, x$seq), function(.sk){
          if (nrow(.sk) > 1) .sk$skip <- sum(.sk$length[1:(nrow(.sk) - 1L)])
          else .sk$skip <- 0
          .sk[nrow(.sk), ] # return last row of the group (the valid record)
          })
      )

print(skip.table)


####### Gang Peng's method:

n.record <- length(record.lengths$NR)
index    <- record.lengths$NR %in% valid.records$NR
tmp <- 1:n.record
ind <- tmp[index]
st  <- 1
skip <- rep(0,length(ind))
for (i in 1:length(ind)) {
  if (st < ind[i]) {
    skip[i] <- sum(record.lengths$length[st:(ind[i]-1)])
  }
  st <- ind[i]+1
}
print(cbind(valid.records,skip))

####### Arun's method:
indx <- diff(valid.records$NR) - 1
skip.table <- within(valid.records, {
  skip <- with(record.lengths,
               tapply(length, c(-1, rep(indx, indx + 1)),
                      function(x) sum(x[-length(x)])))
})[, c(1, 3, 2)]
print(skip.table)
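
####### A vectorized alternative (not from the thread):

The same table can probably also be built without a loop or grouping, by
working with cumulative byte offsets. This is only a sketch: it assumes
record.lengths is in file order, valid.records is sorted by NR, and every
valid NR occurs in record.lengths. The names 'end', 'start', 'pos' and
'prev.end' are made up for the example. Because it never groups by gap size,
it should also cope with two gaps of equal size, a case in which the
tapply() grouping above appears to merge groups.

```r
record.lengths <- data.frame(NR = 1:7,
                             length = c(100, 130, 150, 148, 100, 83, 60))
valid.records  <- data.frame(NR = c(1, 2, 4, 7), factor = c(3, 4, 8, 9))

# Offset at which each record ends and starts, in file order.
end   <- cumsum(record.lengths$length)
start <- end - record.lengths$length

# Position of each valid record within record.lengths.
pos <- match(valid.records$NR, record.lengths$NR)

# End offset of the previous valid record (0 before the first one).
prev.end <- c(0, end[pos[-length(pos)]])

# The skip is the gap between the previous valid record's end
# and this valid record's start.
skip.table <- data.frame(NR     = valid.records$NR,
                         skip   = start[pos] - prev.end,
                         factor = valid.records$factor)
print(skip.table)
```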
