Hello list,

I'm regularly in the position where I have to do a lot of data
manipulation, in order to get the data I have into a format R is happy
with.  This manipulation would generally be in one of two forms:
- getting data from e.g. text log files into a tabular format
- extracting sensible sample data from a very large data set (i.e. too
  large for R to handle)

In general, I use Perl or Python to do the task; I'm curious as to
what others use when they hit the same problem.

Regards

Dave Mitchell
On 19-Nov-04 David Mitchell wrote:
> Hello list,
>
> I'm regularly in the position where I have to do a lot of data
> manipulation, in order to get the data I have into a format R
> is happy with. This manipulation would generally be in one of
> two forms:
> - getting data from e.g. text log files into a tabular format
> - extracting sensible sample data from a very large data set
>   (i.e. too large for R to handle)
>
> In general, I use Perl or Python to do the task; I'm curious
> as to what others use when they hit the same problem.

I generally use 'awk' with help from 'sed' when needed. This is
on the same lines as your choice though lighter-weight and less
powerful (but I've never had a case that needed more).

Since the sort of task you describe is basically on a line-by-line
basis (and what's meant by a "line" can be pretty flexible in 'awk'),
this sort of thing can be done straightforwardly; but greater
flexibility is also possible. E.g. it is easy to extract a line
from the input, or apply a certain transformation to fields in
a line, if & only if it has already been preceded by a line
satisfying a certain condition, and so on.

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
Date: 19-Nov-04                                       Time: 08:56:47
------------------------------ XFMail ------------------------------
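For readers who would rather stay inside R, the conditional extraction Ted describes can be sketched there as well. This is an R rendering of the idea, not Ted's awk code, and it takes one reading of "preceded by": the line immediately before must match. The helper name `lines_after_match` and the sample log lines are made up for illustration.

```r
# Hypothetical helper: keep each line that directly follows a line matching
# `trigger`. `lines` is a character vector, e.g. the result of readLines().
lines_after_match <- function(lines, trigger) {
  hit <- grepl(trigger, lines)
  # A line qualifies when the line before it matched; the first line never does.
  follows <- c(FALSE, hit[-length(hit)])
  lines[follows]
}

# Made-up log lines, for illustration only
log_lines <- c("START job=1", "value 10", "noise", "START job=2", "value 20")
kept <- lines_after_match(log_lines, "^START")
print(kept)
```

A looser condition ("any earlier line matched") could be had by replacing the shifted vector with a running flag such as `cumsum(hit) > 0`.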
Hello David,
I had the same problem with log files containing many fields separated by the
"|" character.
My task was to extract parts of some fields with regular expressions and then
compact the result by normalizing it (using the R functions factor and table).
To reduce the data size, I first split the logfile into "subfiles", each
containing only one field from the original data.
That way I could process one field after the other instead of loading the
complete log file.
under Linux:
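	cutfile <- function(index, afile, tmpdir, wd) {
	  # index:  vector of field numbers to keep
	  # afile:  logfile (read through gzip -dc, so it must be gzip-compressed)
	  # tmpdir: directory that receives one output file per field
	  # wd:     directory containing afile
	  setwd(wd)
	  # Collapse the field numbers into one space-separated list so that a
	  # single shell loop is built instead of one command per field.
	  system(paste('for n in ', paste(index, collapse = ' '), '; \n',
	               'do sudo gzip -dc ', afile, ' | cut -f$n -d"|" > ',
	               tmpdir, '/', afile, '.$n \n',
	               'done;', sep = ''))
	  return(1)
	}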
Example: cutfile(c(1,5,8),'mylog',outputdir,sourcedir)
=> files mylog.1, mylog.5, mylog.8
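The per-field step that follows the split (regular expression, then factor and table) might look roughly like the sketch below. It is an illustration under assumptions, not the code used on the real logs: the helper name `extract_factor`, the sample values, and the pattern are all invented, and in practice the vector would come from something like readLines() on one of the single-field files.

```r
# Made-up values standing in for the contents of one single-field file
field <- c("GET /index.html 200", "GET /img/a.png 404",
           "POST /login 200", "GET /index.html 200")

# Hypothetical helper: pull the first regex match out of each value
# (NA where nothing matches) and store the result compactly as a factor.
extract_factor <- function(x, pattern) {
  m <- regexpr(pattern, x)
  hit <- !is.na(m) & m > 0
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)   # regmatches() returns matched elements only
  factor(out)
}

status <- extract_factor(field, "[0-9]{3}$")
counts <- table(status)
print(counts)
```

Storing the extracted piece as a factor keeps one integer code per row plus a short table of levels, which is where the saving in size comes from when a field has few distinct values.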
HTH,
Marc Mamin
-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch]On Behalf Of David Mitchell
Sent: Friday, November 19, 2004 4:54 AM
To: r-help at stat.math.ethz.ch
Subject: [R] Tools for data preparation?
______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
My choices are (in the order of my preference):

- use connections and readLines()/strsplit()/etc. in R to process the
  file a chunk at a time
- use cut/paste/grep/etc., perhaps within pipe() in R
- use awk, perhaps within pipe() in R
- Python is my last resort, as I'm not familiar with it

The first preference is to do it all in R, mostly for the reason that I can
keep track of what was done all in one place (the R script or function).

Andy

> From: David Mitchell
>
> Hello list,
>
> I'm regularly in the position where I have to do a lot of data
> manipulation, in order to get the data I have into a format R is happy
> with.  This manipulation would generally be in one of two forms:
> - getting data from e.g. text log files into a tabular format
> - extracting sensible sample data from a very large data set (i.e. too
>   large for R to handle)
>
> In general, I use Perl or Python to do the task; I'm curious as to
> what others use when they hit the same problem.
>
> Regards
>
> Dave Mitchell
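As a rough illustration of the first preference above (a connection read with readLines() a chunk at a time, then strsplit()), here is a minimal sketch. The function name `read_fields_chunked`, its arguments, and the "|" separator are assumptions made for the example; it is not Andy's own code.

```r
# Hypothetical helper: read `con` in chunks of `chunk_size` lines, split each
# line on `sep`, and keep only the fields numbered in `keep`.
read_fields_chunked <- function(con, keep, sep = "|", chunk_size = 1000) {
  if (!isOpen(con)) {
    open(con, "r")
    on.exit(close(con))
  }
  pieces <- list()
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0) break          # end of input
    fields <- strsplit(chunk, sep, fixed = TRUE)
    # One row per line; fields missing from a short line become NA
    rows <- do.call(rbind, lapply(fields, function(f) f[keep]))
    pieces[[length(pieces) + 1]] <- rows
  }
  as.data.frame(do.call(rbind, pieces), stringsAsFactors = FALSE)
}

# Made-up input; a textConnection stands in for a real file here
con <- textConnection(c("a|1|x", "b|2|y", "c|3|z"))
d <- read_fields_chunked(con, keep = c(1, 3), chunk_size = 2)
close(con)
print(d)
```

Only the kept fields of each chunk stay in memory, so the full file is never loaded at once. The same function should accept any connection, for instance file("mylog") or a pipe() wrapped around a shell command, and a filtering or sampling step could be applied to `rows` inside the loop before they are stored.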