Dear all,

My question is concerning the line
"This is adequate for small files, but for anything more complicated we
recommend using the facilities of a language like perl to pre-process
the file."

in the import/export manual.

I have a large fixed-width file that I would like to preprocess in Perl or
awk. The problem is that I do not know where to start. Does anyone have a
simple example of how to turn a fixed-width file into a csv or tab-delimited
file with either of these tools? I guess I am looking for something like a
"perl for dummies" or "awk for dummies" that does this. Any pointers to
websites will be greatly appreciated.

Thank you

Jean Eid
Some time ago, Doug Bates wrote a useful paper called "Data
manipulation in perl." It is a very concise introduction and
introduces the unpack function, which is one way to deal with fixed
format data. Just google for

    "data manipulation in perl" bates

and you should be able to find a copy.

Jean Eid wrote:
> I have a large fixed-width file that I would like to preprocess in Perl or
> awk. The problem is that I do not know where to start. [...]
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

--
Kevin E. Thorpe
Biostatistician/Trialist, Knowledge Translation Program
Assistant Professor, Department of Public Health Sciences
Faculty of Medicine, University of Toronto
email: kevin.thorpe at utoronto.ca Tel: 416.946.8081 Fax: 416.971.2462
Thank you, that is exactly what I was looking for.

Just a minor suggestion for the Import/Export manual: a reference to the
paper right underneath the line below would be helpful for people like me
who have never used perl and would like to take the suggestion to
preprocess the data.

Jean

On Tue, 16 Aug 2005, Kevin E. Thorpe wrote:
> Some time ago, Doug Bates wrote a useful paper called "Data
> manipulation in perl." [...]
>
> Jean Eid wrote:
> > My question is concerning the line
> > "This is adequate for small files, but for anything more complicated we
> > recommend using the facilities of a language like perl to pre-process
> > the file."
> >
> > in the import/export manual. [...]
On 8/16/05, Jean Eid <jeaneid at chass.utoronto.ca> wrote:
> I have a large fixed-width file that I would like to preprocess in Perl or
> awk. The problem is that I do not know where to start. [...]

Try to do it in R first. I have found that I rarely need to go to
an outside language to massage my data.

    # fixed-width fields of 10 and 5
    Lines <- readLines("mydata.dat")
    data.frame(field1 = as.numeric(substring(Lines, 1, 10)),
               field2 = as.numeric(substring(Lines, 11, 15)))

If you do find that you have speed or memory problems that
require that you go outside of R to preprocess your data
then the gawk version of awk has a FIELDWIDTHS variable that
makes handling fixed fields very easy. The gawk program below
assumes two fields of widths 10 and 5, respectively, which
is set in the first line. The second line is then executed for
each input line: the dummy assignment $1 = $1 forces gawk to
split the line into fields and rebuild it (it does neither until
a field is touched), and print then writes out the rebuilt line,
which by default has a single space between successive fields:

    BEGIN { FIELDWIDTHS = "10 5" }
    { $1 = $1; print }

In R, do the following assuming the above two lines are in
split.awk:

    read.table(pipe("gawk -f split.awk mydata.dat"))

or else run gawk outside of R then read in the output file
created:

    gawk -f split.awk mydata.dat > mydata2.dat

For more information, google for

    FIELDWIDTHS gawk

for that portion of the manual on FIELDWIDTHS -- it includes
an example and, of course, the whole manual is there too. The
book by Kernighan et al is also good.

I have used both awk and perl and I think it's unlikely you
would need perl, given that you have R at your disposal for
the hard parts and awk is easier to learn, better designed
and more focused on this sort of task.
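For an awk that lacks FIELDWIDTHS (it is a gawk extension), roughly the same split can be sketched with substr(), which is available in any POSIX awk. This is a swapped-in technique, not the gawk program above. The widths of 10 and 5 follow the example in this message, while the file names, the sample records, and the choice to trim blanks and write commas are assumptions. It also assumes that no field contains a comma or a quote.

```shell
#!/bin/sh
# Sketch only: portable awk variant for two fields of widths 10 and 5.
set -eu

workdir=$(mktemp -d)
infile="$workdir/mydata.dat"    # hypothetical input file
outfile="$workdir/mydata.csv"   # hypothetical output file

# Two made-up, right-justified fixed-width records, 15 characters each.
printf '%s\n' '     12.50   42' '      3.25    7' > "$infile"

awk '{
    f1 = substr($0, 1, 10)     # characters 1-10
    f2 = substr($0, 11, 5)     # characters 11-15
    gsub(/^ +| +$/, "", f1)    # trim the padding blanks
    gsub(/^ +| +$/, "", f2)
    print f1 "," f2            # write one csv line
}' "$infile" > "$outfile"

cat "$outfile"
```

The resulting file has no header line, so in R it could be read with read.csv(..., header = FALSE).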
Thank you Gabor,

Jean

On Tue, 16 Aug 2005, Gabor Grothendieck wrote:
> Try to do it in R first. I have found that I rarely need to go to
> an outside language to massage my data. [...]
> My question is concerning the line
> "This is adequate for small files, but for anything more
> complicated we
> recommend using the facilities of a language like perl to
> pre-process the file."

An alternative to Perl is to use the big data library of S-PLUS 7
Enterprise, which would allow you to read in the entire fixed-format file
and pre-process it using S commands. You could then export the processed
data to a file from S-PLUS and import into R.

If your university has S-PLUS, S-PLUS 7 Enterprise should be available
(all academic institutions were upgraded to S-PLUS 7 Enterprise, which has
the big data library). You can read more information about the big data
library at:

http://www.insightful.com/insightful_doclib/document.asp?id=167

# David Smith

--
David M Smith <dsmith at insightful.com>
Senior Product Manager, Insightful Corp, Seattle WA
Tel: +1 (206) 802 2360
Fax: +1 (206) 283 6310

New S-PLUS 7! Create advanced statistical applications with large data sets.
www.insightful.com/splus

> -----Original Message-----
> From: Jean Eid [mailto:jeaneid at chass.utoronto.ca]
> Sent: Tuesday, August 16, 2005 5:39 AM
> To: r-help at stat.math.ethz.ch
> Subject: [R] preprocessing data