Torsten Schindler
2005-Aug-09 10:22 UTC
[R] How to pre-filter large amounts of data effectively
Hi,

I'm an R newbie and want to accelerate the following pre-filtering step of a data set with more than 115,000 rows:

#-----------------
# Function to filter out constant data columns
filter.const <- function(X, vectors=c('column', 'row'), tol=0){
  realdata <- c()
  filteredX <- matrix()
  if( vectors[1] == 'row' ){
    for( row in (1:nrow(X)) ){
      if( length(which(X[row,] != median(X[row,]))) > tol ){
        realdata[length(realdata)+1] <- row
      }
    }
    filteredX <- X[realdata,]
  } else if( vectors[1] == 'column' ){
    for( col in (1:ncol(X)) ){
      if( length(which(X[,col] != median(X[,col]))) > tol ){
        realdata[length(realdata)+1] <- col
      }
    }
    filteredX <- X[,realdata]
  }
  return(list(x=filteredX, ix=realdata))
}

#-----------------
# Filter out all all-constant columns in my training data set
#
# Read training data set with class information in the first column
training <- read.csv('training_data.txt')
dim(training)  # => 49 rows and 525 columns

# Prepare column names by stripping the underscore and the number at the end
colnames(training) <- sub('_\\d+$', '', colnames(training), perl=TRUE)

# Filter out the all-constant columns; exclude column 1, the class column called myclass
training.filter <- filter.const(training[,-1])

# The filtered data frame is
training.filtered <- cbind(myclass=training[,1], training.filter$x)
dim(training.filtered)  # => 49 rows and 250 columns

# Save the filtered training set for later use in classification
filtered.data <- 'training_set_filtered.Rdata'
save(training.filtered, file=filtered.data)

#-----------------
# THE FOLLOWING FILTERING STEP TAKES 3 HOURS ON MY PowerBook
# AND CONSUMES ABOUT 600 MB OF MEMORY.
#
# I WOULD BE HAPPY ABOUT ANY HINT ON HOW TO IMPROVE THIS.

# Pre-filter the big data set (more than 115,000 rows and 524 columns) for later class predictions.
# The big data set contains the same column names as the training set, but in a different order.
input.file <- 'big_data_set.txt'
filtered.file <- 'big_data_set_filtered.txt'

# Read the header together with the first row
prediction.set <- read.csv(input.file, header=TRUE, skip=0, nrow=1)

# Prepare column names by stripping the underscore and the number at the end
colnames(prediction.set) <- sub('_\\d+$', '', colnames(prediction.set), perl=TRUE)
prediction.set.header <- colnames(prediction.set)

# Get the descriptor columns of the training data set without the Activity_Class column
training.filtered.property.colnames <- colnames(training.filtered)[-1]

# Keep only the columns that are non-constant in the training set
prediction.set.filtered <- prediction.set[training.filtered.property.colnames]
dim(prediction.set.filtered)  # => 1 row and 249 columns

# Write the header and the first filtered row
write.csv(prediction.set.filtered, file=filtered.file, append=FALSE,
          col.names=training.filtered.property.colnames)

blocksize <- 1000
for (lineid in (0:120)*blocksize) {
  cat('lineid: ', lineid, '\n')

  # Read a block of data
  # We have to add a dummy colname "x" in col.names when the header is not read!
  prediction.set <- try(read.csv(input.file, header=FALSE,
                                 col.names=c('x', prediction.set.header),
                                 row.names=1,
                                 skip=lineid+2, nrow=blocksize))
  if (class(prediction.set) == "try-error") break

  # Keep only the non-constant training-set columns in this block
  prediction.set.filtered <- prediction.set[training.filtered.property.colnames]

  # Append the data
  # (I know this function is slow, but I couldn't figure out how to do it faster so far.)
  write.table(prediction.set.filtered, file=filtered.file,
              append=TRUE, col.names=FALSE, sep=",")
}

#-------------
# Now read in the filtered data set and save it for later use in classification
prediction.set.filtered <- read.csv(filtered.file, header=TRUE, row.names=1)
filtered.data <- 'prediction_set_filtered.Rdata'
save(prediction.set.filtered, file=filtered.data)

I would be very happy about any hints on how to improve the code above!

Best regards,

Torsten
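P.S. In case it helps to see what filter.const() is expected to return, here is a toy example (made-up values, not my real data):

# Toy data frame: column "b" is constant and should be dropped
toy <- data.frame(a=c(1, 2, 3, 4), b=c(5, 5, 5, 5), c=c(0, 1, 0, 1))
res <- filter.const(toy)
res$ix      # => 1 3, the indices of the non-constant columns
dim(res$x)  # => 4 rows and 2 columns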
Torsten Schindler
2005-Aug-09 10:53 UTC
[R] How to pre-filter large amounts of data effectively
You are right, but unfortunately this is not the limiting step or bottleneck in the code below. The filter.const() function is only used to get the non-constant columns of the training data set, which is small (49 rows and 525 columns); it is applied only once, to the training set, and takes about 2 seconds on my PowerBook. After filtering the training data set, just the list of column names is used to filter the huge prediction.set. I think the really time- and memory-consuming part is the for-loop below, but I don't know how to improve that part.

Anyway, thanks for the hint!

Best,
Torsten

On Aug 9, 2005, at 12:37 PM, Patrick Burns wrote:

> Building up an object like you do with 'realdata' is very
> wasteful (S Poetry says why). I think you want something
> along the lines of:
>
> if(vectors[1] == 'column') {
>   realdata <- apply(X, 2, function(x) diff(range(x))) > tol
>   filteredX <- X[, realdata]
> } else {
>   realdata <- apply(X, 1, function(x) diff(range(x))) > tol
>   filteredX <- X[realdata, ]
> }
>
> Patrick Burns
> patrick at burns-stat.com
> +44 (0)20 8525 0696
> http://www.burns-stat.com
> (home of S Poetry and "A Guide for the Unwilling S User")
>
> [Torsten's original message quoted in full -- snipped, see above]
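For reference, folding Patrick's suggestion back into filter.const would shrink the function to roughly this (an untested sketch, assuming all descriptor columns are numeric; same arguments and return value as my original version, with which() used so that ix stays a vector of indices):

filter.const <- function(X, vectors=c('column', 'row'), tol=0){
  # Keep only the rows/columns whose range exceeds tol, i.e. the non-constant ones
  if( vectors[1] == 'column' ){
    realdata <- which(apply(X, 2, function(x) diff(range(x))) > tol)
    filteredX <- X[, realdata]
  } else {
    realdata <- which(apply(X, 1, function(x) diff(range(x))) > tol)
    filteredX <- X[realdata, ]
  }
  list(x=filteredX, ix=realdata)
}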
Adaikalavan Ramasamy
2005-Aug-09 11:10 UTC
[R] How to pre-filter large amounts of data effectively
I do not fully comprehend the code below. But when I want to check whether all the elements in a row/column are the same, I usually check whether the variance or range is (nearly) zero:

v.row <- apply( mat, 1, var )
v.col <- apply( mat, 2, var )

tol <- 0
good.row <- which( v.row > tol )
good.col <- which( v.col > tol )

Regards, Adai

On Tue, 2005-08-09 at 12:22 +0200, Torsten Schindler wrote:

> [original message quoted in full -- snipped, see above]
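For example, on a toy matrix (made-up values) this picks out the non-constant columns:

mat <- cbind(a = c(1, 2, 3), b = c(5, 5, 5), c = c(0, 1, 0))
v.col <- apply(mat, 2, var)
good.col <- which(v.col > 0)            # => columns 1 and 3 ("a" and "c")
mat.filtered <- mat[, good.col, drop = FALSE]

The same indices could then be used to subset the data frame of descriptors.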