Hi,

I need to parse a data file (output of a measuring device) of the
following format:

BEGIN RECORD [first record data] RECORD [second
record data] RECORD
[third record data]
END

Line breaks can (and do ;-() occur anywhere. White space behaves very
much like TeX, eg it is not important whether there are one or more
spaces or linebreaks as long as there is one of them. It is a text
file, not binary.

I need to extract the record data I marked with []'s, eg a vector such
as c("[first record data]", "[second]", ...) would be nice as a
result.

What functions should I use for this?

Thanks,

Tamas

--
Tamás K. Papp
E-mail: tpapp at axelero.hu
Please try to send only (latin-2) plain text, not HTML or other garbage.
Hi,

I don't think there is any built-in function to do that...
Your friend is readLines and some "manual" post-processing.

Here is what I did (not sure it is the best...):

    tmptxt = readLines("g:/record.txt")
    tmptxt = paste(tmptxt, collapse=" ")      # All as a single string
    tmptxt = strsplit(tmptxt, "RECORD")[[1]]
    tmptxt = tmptxt[-c(1, length(tmptxt))]
    num = as.numeric(tmptxt)

which you could transform into a function:

    readRecords = function(file){
      tmptxt = readLines(file)
      tmptxt = paste(tmptxt, collapse=" ")    # All as a single string
      tmptxt = strsplit(tmptxt, "RECORD")[[1]]
      tmptxt = tmptxt[-c(1, length(tmptxt))]
      num = as.numeric(tmptxt)
      return(num)
    }

Eric

At 11:00 27/04/2004, Tamas Papp wrote:
>What functions should I use for this?

Eric Lecoutre
UCL / Institut de Statistique
Voie du Roman Pays, 20
1348 Louvain-la-Neuve
Belgium
tel: (+32)(0)10473050
lecoutre at stat.ucl.ac.be
http://www.stat.ucl.ac.be/ISpersonnel/lecoutre

If the statistics are boring, then you've got the wrong numbers. -Edward Tufte
On 27-Apr-04 Tamas Papp wrote:
> What functions should I use for this?

I don't know whether there is any R function capable of handling
a format as anarchic as this one, but if you are willing to do the
job outside R (i.e. produce a derived data file which is cleanly
structured which can then be read by R) then it looks like an awk job
(some might say perl job). You can use sed to strip cruft.

For example:

  cat temp
  BEGIN RECORD [first record data] RECORD [second
  record data] RECORD
  [third record data]
  END

  cat temp | sed 's/BEGIN//' | sed 's/END//' | tr '\n' ' ' |
    awk 'BEGIN{RS="RECORD"}{print $0}'
  [first record data]
  [second record data]
  [third record data]

Does this help?

Ted.

E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Date: 27-Apr-04 Time: 11:10:55
Hello Tamas,

here is a starting point for your file parsing problem:

@
<<*>>
x1<-scan("t",what="")
print(x1)
x2<-paste(x1,collapse=" ")
print(x2)
x3<-strsplit(x2, "RECORD")[[1]]
print(x3)
@
output-start
[1] "BEGIN" "RECORD" "[first" "record" "data]" "RECORD" "[second"
[8] "record" "data]" "RECORD" "[third" "record" "data]" "END"
[1] "BEGIN RECORD [first record data] RECORD [second record data] RECORD [third record data] END"
[1] "BEGIN " " [first record data] "
[3] " [second record data] " " [third record data] END"
output-end

Peter Wolf
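One way to carry the scan() starting point through to the requested vector is to group the tokens by the RECORD marker that precedes them. The following is only a sketch: parse_records is a hypothetical helper name, and it assumes that BEGIN, END and RECORD never occur as tokens inside the record data. scan(text = ...) is used so the example runs without a file.

```r
# Hypothetical helper: group whitespace-separated tokens into records.
parse_records <- function(tokens) {
  # Drop the BEGIN/END markers (assumed never to appear inside record data)
  tokens <- tokens[!(tokens %in% c("BEGIN", "END"))]
  is_start <- tokens == "RECORD"
  group <- cumsum(is_start)        # record index for every token
  keep <- !is_start & group > 0    # drop the RECORD markers themselves
  # Re-join the tokens of each record with single spaces
  pieces <- split(tokens[keep], group[keep])
  unname(vapply(pieces, paste, character(1), collapse = " "))
}

# Sample from the original post, with line breaks in awkward places
sample_text <- "BEGIN RECORD [first record data] RECORD [second\nrecord data] RECORD\n[third record data]\nEND"
# quote = "" stops scan() from treating quote characters specially
tokens <- scan(text = sample_text, what = "", quote = "", quiet = TRUE)
records <- parse_records(tokens)
print(records)
```

Because scan() already splits on any run of white space, the TeX-like whitespace rule is handled for free. A record with no data at all (RECORD directly followed by RECORD) would be dropped silently by this sketch.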
Oops, I realize my code removes the last value...
Better to remove BEGIN and END from the text directly:

    readRecords = function(file){
      tmptxt = readLines(file)
      tmptxt = paste(tmptxt, collapse=" ")          # All as a single string
      tmptxt = substr(tmptxt, 6, nchar(tmptxt)-3)   # remove BEGIN and END
      tmptxt = strsplit(tmptxt, "RECORD")[[1]]
      tmptxt = tmptxt[-1]       # drop the piece before the first RECORD
      num = as.numeric(tmptxt)  # assumes each record is a single number
      return(num)
    }

Note that the substr call assumes the file starts exactly with BEGIN and ends exactly with END, with no surrounding white space.

Eric
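Since the original question asked for a character vector rather than numbers, a variant without as.numeric may be closer to what is wanted. The sketch below is an illustration only: read_records is a hypothetical name, and it assumes the string RECORD never appears inside the record data. It strips BEGIN and END with regular expressions instead of fixed positions, so leading or trailing white space does no harm.

```r
# Hypothetical helper (not from the thread): return records as strings.
read_records <- function(con) {
  txt <- paste(readLines(con), collapse = " ")
  txt <- gsub("[[:space:]]+", " ", txt)  # collapse whitespace runs to one space
  txt <- sub("^ ?BEGIN ", "", txt)       # strip the leading BEGIN marker
  txt <- sub(" END ?$", "", txt)         # strip the trailing END marker
  parts <- strsplit(txt, "RECORD", fixed = TRUE)[[1]]
  parts <- trimws(parts)
  parts[nzchar(parts)]                   # drop the empty piece before the first RECORD
}

# Demo on the sample data via a text connection, so no file is needed
sample_lines <- c("BEGIN RECORD [first record data] RECORD [second",
                  "record data]   RECORD",
                  "[third record data]",
                  "END")
con <- textConnection(sample_lines)
records <- read_records(con)
close(con)
print(records)
```

readLines also accepts a file name, so read_records("record.txt") should work the same way on a real file (the file name here is just a placeholder).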
Tamas Papp wrote:
> I need to parse a data file (output of a measuring device) of the
> following format:
>
> BEGIN RECORD [first record data] RECORD [second
> record data] RECORD
> [third record data]
> END

Is it just the one 'BEGIN/END' pair per file? Or are there several?
What's the format of the [first record data] entries? Numbers, strings?
Are there literally square brackets in there?

> I need to extract the record data I marked with []'s, eg a vector such
> as c("[first record data]", "[second]", ...) would be nice as a
> result.
>
> What functions should I use for this?

I'd consider writing a Perl script that converted this into an XML
file, then you could probably use the RXML package to read it, and it
would be in a format readable by any XML-reading thing, or at least in
a more easily convertible form.

But that might be a bit heavyweight, and the Ted Harding approach of
sed, tr, and awk is always appealing, assuming you have a Unix box or a
Unix box-of-tricks on Windows (cygwin).

Baz
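If the answer to the first question turns out to be "several", the same string-splitting idea can be applied once per block. This is a sketch under that assumption only; the thread does not say whether such files exist. split_blocks is a hypothetical name, and the code assumes BEGIN, END and RECORD never occur inside the record data.

```r
# Hypothetical helper: one character vector of records per BEGIN ... END block.
split_blocks <- function(txt) {
  txt <- gsub("[[:space:]]+", " ", paste(txt, collapse = " "))
  # Non-greedy match, so each BEGIN pairs with the nearest following END
  blocks <- regmatches(txt, gregexpr("BEGIN .*? END", txt, perl = TRUE))[[1]]
  lapply(blocks, function(b) {
    body <- sub("^BEGIN ", "", sub(" END$", "", b))
    parts <- trimws(strsplit(body, "RECORD", fixed = TRUE)[[1]])
    parts[nzchar(parts)]   # drop the empty piece before the first RECORD
  })
}

# Made-up input with two blocks, for illustration only
txt <- "BEGIN RECORD a b RECORD c END\nBEGIN RECORD d\ne END"
res <- split_blocks(txt)
str(res)
```

The result is a list, so each block stays separate; unlist(res) would flatten it into a single vector if the block boundaries do not matter.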