Torsten Schindler
2005-Aug-09 10:22 UTC
[R] How to pre-filter large amounts of data effectively
Hi,

I'm an R newbie and want to accelerate the following pre-filtering step of a data set with more than 115,000 rows:

#-----------------
# Function to filter out constant data columns
filter.const <- function(X, vectors=c('column', 'row'), tol=0){
  realdata <- c()
  filteredX <- matrix()
  if( vectors[1] == 'row' ){
    for( row in (1:nrow(X)) ){
      if( length(which(X[row,] != median(X[row,]))) > tol ){
        realdata[length(realdata)+1] <- row
      }
    }
    filteredX <- X[realdata,]
  } else if( vectors[1] == 'column' ){
    for( col in (1:ncol(X)) ){
      if( length(which(X[,col] != median(X[,col]))) > tol ){
        realdata[length(realdata)+1] <- col
      }
    }
    filteredX <- X[,realdata]
  }
  return(list(x=filteredX, ix=realdata))
}

#-----------------
# Filter out all all-constant columns in my training data set
#
# Read training data set with class information in the first column
training <- read.csv('training_data.txt')
dim(training)  # => 49 rows and 525 columns

# Prepare column names by stripping the underscore and the number at the end
colnames(training) <- sub('_\\d+$', '', colnames(training), perl=TRUE)

# Filter out the all-constant columns; exclude column 1, the class column called myclass
training.filter <- filter.const(training[,-1])

# The filtered data frame is
training.filtered <- cbind(myclass=training[,1], training.filter$x)
dim(training.filtered)  # => 49 rows and 250 columns

# Save the filtered training set for later use in classification
filtered.data <- 'training_set_filtered.Rdata'
save(training.filtered, file=filtered.data)

#-----------------
# THE FOLLOWING FILTERING STEP TAKES 3 HOURS ON MY PowerBook
# AND CONSUMES ABOUT 600 MB OF MEMORY.
#
# I WOULD BE HAPPY ABOUT ANY HINT ON HOW TO IMPROVE THIS.

# Pre-filter the big data set (more than 115,000 rows and 524 columns) for later class predictions.
# The big data set contains the same column names as the training set, but in a different order.
input.file <- 'big_data_set.txt'
filtered.file <- 'big_data_set_filtered.txt'

# Read the header together with the first row
prediction.set <- read.csv(input.file, header=TRUE, skip=0, nrow=1)

# Prepare column names by stripping the underscore and the number at the end
colnames(prediction.set) <- sub('_\\d+$', '', colnames(prediction.set), perl=TRUE)
prediction.set.header <- colnames(prediction.set)

# Get the descriptor columns of the training data set without the Activity_Class column
training.filtered.property.colnames <- colnames(training.filtered)[-1]

# Keep only the columns that are non-constant in the training set
prediction.set.filtered <- prediction.set[training.filtered.property.colnames]
dim(prediction.set.filtered)  # => 1 row and 249 columns

# Write the header and the first filtered row
write.csv(prediction.set.filtered, file=filtered.file, append=FALSE,
          col.names=training.filtered.property.colnames)

blocksize <- 1000
for (lineid in (0:120)*blocksize) {
  cat('lineid: ', lineid, '\n')

  # Read a block of data
  # We have to add a dummy colname "x" in col.names when the header is not read!
  prediction.set <- try(read.csv(input.file, header=FALSE,
                                 col.names=c('x', prediction.set.header),
                                 row.names=1,
                                 skip=lineid+2, nrow=blocksize))
  if (class(prediction.set) == "try-error") break

  # Keep only the non-constant training-set columns in this block
  prediction.set.filtered <- prediction.set[training.filtered.property.colnames]

  # Append the data
  # (I know this function is slow, but I couldn't figure out how to do it faster so far.)
  write.table(prediction.set.filtered, file=filtered.file,
              append=TRUE, col.names=FALSE, sep=",")
}

#-------------
# Now read in the filtered data set and save it for later use in classification
prediction.set.filtered <- read.csv(filtered.file, header=TRUE, row.names=1)
filtered.data <- 'prediction_set_filtered.Rdata'
save(prediction.set.filtered, file=filtered.data)

I would be very happy about any hints on how to improve the code above!

Best regards,

Torsten
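P.S. In case it helps to see what filter.const() is expected to return, here is a toy example (made-up values, not my real data):

# Toy data frame: column "b" is constant and should be dropped
toy <- data.frame(a=c(1, 2, 3, 4), b=c(5, 5, 5, 5), c=c(0, 1, 0, 1))
res <- filter.const(toy)
res$ix      # => 1 3, the indices of the non-constant columns
dim(res$x)  # => 4 rows and 2 columns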
Torsten Schindler
2005-Aug-09 10:53 UTC
[R] How to pre-filter large amounts of data effectively
You are right, but unfortunately this is not the limiting step or bottleneck in the code below. The filter.const() function is only used to get the non-constant columns of the training data set, which is small (49 rows and 525 columns); it is applied only once, to the training set, and takes about 2 seconds on my PowerBook. After filtering the training data set, just the list of column names is used to filter the huge prediction.set. I think the really time- and memory-consuming part is the for-loop below, but I don't know how to improve that part.

Anyway, thanks for the hint!

Best,
Torsten

On Aug 9, 2005, at 12:37 PM, Patrick Burns wrote:

> Building up an object like you do with 'realdata' is very
> wasteful (S Poetry says why). I think you want something
> along the lines of:
>
> if(vectors[1] == 'column') {
>   realdata <- apply(X, 2, function(x) diff(range(x))) > tol
>   filteredX <- X[, realdata]
> } else {
>   realdata <- apply(X, 1, function(x) diff(range(x))) > tol
>   filteredX <- X[realdata, ]
> }
>
> Patrick Burns
> patrick at burns-stat.com
> +44 (0)20 8525 0696
> http://www.burns-stat.com
> (home of S Poetry and "A Guide for the Unwilling S User")
>
> [Torsten's original message quoted in full -- snipped, see above]
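For reference, folding Patrick's suggestion back into filter.const would shrink the function to roughly this (an untested sketch, assuming all descriptor columns are numeric; same arguments and return value as my original version, with which() used so that ix stays a vector of indices):

filter.const <- function(X, vectors=c('column', 'row'), tol=0){
  # Keep only the rows/columns whose range exceeds tol, i.e. the non-constant ones
  if( vectors[1] == 'column' ){
    realdata <- which(apply(X, 2, function(x) diff(range(x))) > tol)
    filteredX <- X[, realdata]
  } else {
    realdata <- which(apply(X, 1, function(x) diff(range(x))) > tol)
    filteredX <- X[realdata, ]
  }
  list(x=filteredX, ix=realdata)
}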
Adaikalavan Ramasamy
2005-Aug-09 11:10 UTC
[R] How to pre-filter large amounts of data effectively
I do not fully comprehend the code below. But when I want to check whether all the elements in a row/column are the same, I usually check whether the variance or range is (nearly) zero:

v.row <- apply( mat, 1, var )
v.col <- apply( mat, 2, var )

tol <- 0
good.row <- which( v.row > tol )
good.col <- which( v.col > tol )

Regards, Adai

On Tue, 2005-08-09 at 12:22 +0200, Torsten Schindler wrote:

> [original message quoted in full -- snipped, see above]
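For example, on a toy matrix (made-up values) this picks out the non-constant columns:

mat <- cbind(a = c(1, 2, 3), b = c(5, 5, 5), c = c(0, 1, 0))
v.col <- apply(mat, 2, var)
good.col <- which(v.col > 0)            # => columns 1 and 3 ("a" and "c")
mat.filtered <- mat[, good.col, drop = FALSE]

The same indices could then be used to subset the data frame of descriptors.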