Hi,
I need to parse a data file (output of a measuring device) of the
following format:
BEGIN RECORD [first record data] RECORD [second
record data] RECORD
[third record data]
END
Line breaks can (and do ;-() occur anywhere. White space behaves very
much like TeX, i.e. it is not important whether there are one or more
spaces or linebreaks as long as there is one of them. It is a text
file, not binary.
I need to extract the record data I marked with []'s, e.g. a vector such
as c("[first record data]", "[second]", ...) would be nice as a
result.
What functions should I use for this?
Thanks,
Tamas
--
Tamás K. Papp
E-mail: tpapp at axelero.hu
Please try to send only (latin-2) plain text, not HTML or other garbage.
Hi,
I don't think there is any built-in function to do that...
Your friend is readLines and some "manual" post-processing.
Here is what I did (not sure it is the best...)
tmptxt = readLines("g:/record.txt")
tmptxt = paste(tmptxt,collapse=" ") # All as a single string
tmptxt = strsplit(tmptxt,"RECORD")[[1]]
tmptxt = tmptxt[-c(1,length(tmptxt))]
num = as.numeric(tmptxt)
which you could transform into a function
readRecords = function(file){
  tmptxt = readLines(file)
  tmptxt = paste(tmptxt,collapse=" ") # All as a single string
  tmptxt = strsplit(tmptxt,"RECORD")[[1]]
  tmptxt = tmptxt[-c(1,length(tmptxt))] # drop the first and last pieces
  num = as.numeric(tmptxt)
  return(num)
}
Eric
At 11:00 27/04/2004, Tamas Papp wrote:
> [...]
Eric Lecoutre
UCL / Institut de Statistique
Voie du Roman Pays, 20
1348 Louvain-la-Neuve
Belgium
tel: (+32)(0)10473050
lecoutre at stat.ucl.ac.be
http://www.stat.ucl.ac.be/ISpersonnel/lecoutre
If the statistics are boring, then you've got the wrong numbers. -Edward
Tufte
On 27-Apr-04 Tamas Papp wrote:
> I need to parse a data file (output of a measuring device) of the
> following format:
>
> BEGIN RECORD [first record data] RECORD [second
> record data] RECORD
> [third record data]
> END
>
> Line breaks can (and do ;-() occur anywhere. White space behaves very
> much like TeX, eg it is not important whether there are one or more
> spaces or linebreaks as long as there is one of them. It is a text
> file, not binary.
>
> I need to extract the record data I marked with []'s, eg a vector such
> as c("[first record data]", "[second]", ...) would be nice as a
> result.
>
> What functions should I use for this?

I don't know whether there is any R function capable of handling a
format as anarchic as this one, but if you are willing to do the job
outside R (i.e. produce a derived data file which is cleanly structured
which can then be read by R) then it looks like an awk job (some might
say perl job). You can use sed to strip cruft. For example:

cat temp
BEGIN RECORD [first record data] RECORD [second
record data] RECORD
[third record data]
END

cat temp | sed 's/BEGIN//' | sed 's/END//' | tr '\n' ' ' | awk 'BEGIN{RS="RECORD"}{print $0}'

[first record data]
[second record data]
[third record data]

Does this help?
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 167 1972
Date: 27-Apr-04                                       Time: 11:10:55
------------------------------ XFMail ------------------------------
Hello Tamas,
here is a starting point for your file parsing problem:
@
<<*>>=
x1<-scan("t",what="")
print(x1)
x2<-paste(x1,collapse=" ")
print(x2)
x3<-strsplit(x2, "RECORD")[[1]]
print(x3)
@
output-start
 [1] "BEGIN"   "RECORD"  "[first"  "record"  "data]"   "RECORD"  "[second"
 [8] "record"  "data]"   "RECORD"  "[third"  "record"  "data]"   "END"
[1] "BEGIN RECORD [first record data] RECORD [second record data] RECORD [third record data] END"
[1] "BEGIN " " [first record data] "
[3] " [second record data] " " [third record data] END"
output-end
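One possible way to finish the job from the scan() tokens, offered only as a sketch: number the tokens by the RECORD marker that precedes them and paste each group back together. The helper name group_tokens() is made up for illustration, and the tokens are typed in by hand here rather than read from the file "t".

```r
# Sketch only: group_tokens() is a hypothetical helper, not from the post.
# It assumes the word RECORD never occurs inside the record data itself.
group_tokens <- function(tokens) {
  # Drop the framing keywords only where they are expected.
  if (length(tokens) > 0 && tokens[1] == "BEGIN") tokens <- tokens[-1]
  n <- length(tokens)
  if (n > 0 && tokens[n] == "END") tokens <- tokens[-n]

  is_marker <- tokens == "RECORD"
  record_id <- cumsum(is_marker)   # 0 before the first RECORD, then 1, 2, ...
  ids <- factor(record_id[!is_marker], levels = seq_len(sum(is_marker)))
  groups <- split(tokens[!is_marker], ids)   # one group per RECORD marker
  vapply(groups, paste, character(1), collapse = " ", USE.NAMES = FALSE)
}

# Tokens as scan("t", what = "") would return them for the sample file.
x1 <- c("BEGIN", "RECORD", "[first", "record", "data]",
        "RECORD", "[second", "record", "data]",
        "RECORD", "[third", "record", "data]", "END")
records <- group_tokens(x1)
print(records)
```

Because scan() already splits on any run of blanks or line breaks, the pieces come back with single spaces and no leading or trailing blanks. An empty record (two RECORD markers in a row) comes back as "".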
Peter Wolf
Oops,
I realize my code also removes the last value...
Better to remove BEGIN and END from the text directly:
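readRecords = function(file){
  tmptxt = readLines(file)
  tmptxt = paste(tmptxt,collapse=" ") # All as a single string
  tmptxt = substr(tmptxt, 6, nchar(tmptxt)-3) # remove BEGIN and END (assumes no leading/trailing blanks)
  tmptxt = strsplit(tmptxt,"RECORD")[[1]]
  tmptxt = tmptxt[-1] # drop the blank piece before the first RECORD
  num = as.numeric(tmptxt) # NA for any record that is not a single number
  return(num)
}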
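If the records are text rather than numbers (the original question asks for a character vector), a variant along the following lines might be closer to what is wanted. It is only a sketch: parse_records() and the sample lines are illustrative, and it assumes the word RECORD never appears inside the record data. It strips BEGIN and END by pattern instead of by fixed character positions, so stray blanks at either end of the file should not matter.

```r
# Sketch only: parse_records() is a hypothetical name, not from the post.
parse_records <- function(lines) {
  txt <- paste(lines, collapse = " ")               # line breaks carry no meaning
  txt <- sub("^[[:space:]]*BEGIN", "", txt)         # strip the leading BEGIN
  txt <- sub("END[[:space:]]*$", "", txt)           # strip the trailing END
  parts <- strsplit(txt, "RECORD", fixed = TRUE)[[1]]
  parts <- gsub("[[:space:]]+", " ", trimws(parts)) # collapse runs of blanks
  parts[nzchar(parts)]                              # drop the empty piece before the first RECORD
}

# Sample input mirroring the format in the question.
sample_lines <- c("BEGIN RECORD [first record data] RECORD [second",
                  "record data] RECORD",
                  "[third record data]",
                  "END")
records <- parse_records(sample_lines)
print(records)
```

With a real file the call would be parse_records(readLines(file)). Note that dropping empty pieces also drops any genuinely empty record.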
Eric
Tamas Papp wrote:
> I need to parse a data file (output of a measuring device) of the
> following format:
>
> BEGIN RECORD [first record data] RECORD [second
> record data] RECORD
> [third record data]
> END

Is it just the one 'BEGIN/END' pair per file? Or are there several?
What's the format of the [first record data] entries? Numbers, strings?
Are there literally square brackets in there?

> I need to extract the record data I marked with []'s, eg a vector such
> as c("[first record data]", "[second]", ...) would be nice as a
> result.
>
> What functions should I use for this?

I'd consider writing a Perl script that converted this into an XML
file, then you could probably use the XML package to read it, and it
would be in a format readable by any XML-reading thing, or at least in
a more easily convertible form.
But that might be a bit heavyweight, and the Ted Harding approach of
sed, tr, and awk is always appealing, assuming you have a Unix box or a
Unix box-of-tricks on Windows (cygwin).

Baz
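If a file does turn out to hold several BEGIN/END pairs (the thread does not say), something like the following sketch could pull the blocks apart inside R before splitting each one on RECORD. The helper split_blocks() and the sample input are invented for illustration, and the approach assumes that BEGIN, END and RECORD never occur inside the record data.

```r
# Sketch only: split_blocks() is a hypothetical helper, not from the thread.
split_blocks <- function(lines) {
  txt <- paste(lines, collapse = " ")
  # Non-greedy match: each BEGIN up to the nearest following END.
  blocks <- regmatches(txt, gregexpr("BEGIN.*?END", txt, perl = TRUE))[[1]]
  lapply(blocks, function(block) {
    body <- sub("^BEGIN", "", sub("END$", "", block))
    parts <- trimws(strsplit(body, "RECORD", fixed = TRUE)[[1]])
    parts[nzchar(parts)]   # drop the empty piece before the first RECORD
  })
}

# Made-up input with two BEGIN/END blocks and arbitrary line breaks.
two_blocks <- c("BEGIN RECORD a b RECORD c", "END BEGIN RECORD", "d e END")
result <- split_blocks(two_blocks)
str(result)
```

The result is a list with one character vector of records per block, so a file with a single BEGIN/END pair simply gives a list of length one.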