thr3ads.net - R help - [R] parsing a data file [Apr 2004]


Tamas Papp

2004-Apr-27 09:00 UTC

[R] parsing a data file

Hi,

I need to parse a data file (output of a measuring device) of the
following format:

BEGIN RECORD [first record data] RECORD [second
record data] RECORD
[third record data]
END

Line breaks can (and do ;-() occur anywhere.  White space behaves very
much like TeX, eg it is not important whether there are one or more
spaces or linebreaks as long as there is one of them.  It is a text
file, not binary.

I need to extract the record data I marked with []'s, eg a vector such
as c("[first record data]", "[second]", ...) would be nice as a
result.

What functions should I use for this?

Thanks,

Tamas


-- 
Tamás K. Papp
E-mail: tpapp at axelero.hu
Please try to send only (latin-2) plain text, not HTML or other garbage.

Eric Lecoutre

2004-Apr-27 10:09 UTC


[R] parsing a data file

Hi,

I don't think there is any built-in function to do that...
Your friend is readLines and some "manual" post-processing.
Here is what I did (not sure it is the best...)

tmptxt = readLines("g:/record.txt")
tmptxt = paste(tmptxt,collapse=" ") # All as a single string
tmptxt = strsplit(tmptxt,"RECORD")[[1]]
tmptxt = tmptxt[-c(1,length(tmptxt))]
num = as.numeric(tmptxt)

which you could transform into a function

readRecords = function(file){
        tmptxt = readLines(file)
        tmptxt = paste(tmptxt,collapse=" ") # All as a single string
        tmptxt = strsplit(tmptxt,"RECORD")[[1]]
        tmptxt = tmptxt[-c(1,length(tmptxt))]
        num = as.numeric(tmptxt)
        return(num)
}


Eric

Eric Lecoutre
UCL /  Institut de Statistique
Voie du Roman Pays, 20
1348 Louvain-la-Neuve
Belgium

tel: (+32)(0)10473050
lecoutre at stat.ucl.ac.be
http://www.stat.ucl.ac.be/ISpersonnel/lecoutre

If the statistics are boring, then you've got the wrong numbers. -Edward 
Tufte

(Ted Harding)

2004-Apr-27 10:10 UTC


[R] parsing a data file

On 27-Apr-04 Tamas Papp wrote:
> I need to parse a data file (output of a measuring device) of the
> following format:
> 
> BEGIN RECORD [first record data] RECORD [second
> record data] RECORD
> [third record data]
> END
> 
> Line breaks can (and do ;-() occur anywhere.  White space behaves very
> much like TeX, eg it is not important whether there are one or more
> spaces or linebreaks as long as there is one of them.  It is a text
> file, not binary.
> 
> I need to extract the record data I marked with []'s, eg a vector such
> as c("[first record data]", "[second]", ...) would be nice as a
> result.
> 
> What functions should I use for this?
I don't know whether there is any R function capable of handling
a format as anarchic as this one, but if you are willing to do the
job outside R (i.e. produce a derived data file which is cleanly
structured which can then be read by R) then it looks like an awk
job (some might say perl job). You can use sed to strip cruft.

For example:

cat temp
BEGIN RECORD [first record data] RECORD [second
record data] RECORD
[third record data]
END

cat temp | sed 's/BEGIN//' | sed 's/END//' | tr '\n' ' ' |
    awk 'BEGIN{RS="RECORD"}{print $0}'

 [first record data] 
 [second record data] 
 [third record data]

Does this help?
Ted.
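
For anyone who would rather stay inside R, the pipeline above can be approximated step by step. This is only a rough sketch: the sample lines are typed in directly instead of being read from `temp`, and the trimming of whitespace at the end is an addition that the shell version does not do.

```r
# Same sample text as in `temp`, supplied inline so the sketch is self-contained.
lines <- c("BEGIN RECORD [first record data] RECORD [second",
           "record data] RECORD",
           "[third record data]",
           "END")

txt <- paste(lines, collapse = " ")            # roughly: tr '\n' ' '
txt <- sub("BEGIN", "", txt, fixed = TRUE)     # roughly: sed 's/BEGIN//'
txt <- sub("END", "", txt, fixed = TRUE)       # roughly: sed 's/END//'

# Roughly: awk with RS="RECORD" (the match is case sensitive, so the
# lowercase word "record" inside the data is left alone).
parts <- strsplit(txt, "RECORD", fixed = TRUE)[[1]]

# Extra step, not in the shell pipeline: trim blanks and drop empty pieces.
parts <- trimws(parts)
records <- parts[nzchar(parts)]
print(records)
```

With a real file, `lines <- readLines("temp")` would take the place of the inline vector.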


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 167 1972
Date: 27-Apr-04                                       Time: 11:10:55
------------------------------ XFMail ------------------------------

Peter Wolf

2004-Apr-27 10:14 UTC


[R] parsing a data file

Hello Tamas,
here is a starting point for your file parsing problem:

@
<<*>>=
x1<-scan("t",what="")
print(x1)
x2<-paste(x1,collapse=" ")
print(x2)
x3<-strsplit(x2, "RECORD")[[1]]
print(x3)

@
output-start
 [1] "BEGIN"   "RECORD"  "[first"  "record"  "data]"   "RECORD"  "[second"
 [8] "record"  "data]"   "RECORD"  "[third"  "record"  "data]"   "END"

[1] "BEGIN RECORD [first record data] RECORD [second record data] RECORD [third record data] END"

[1] "BEGIN "                   " [first record data] "  
[3] " [second record data] "   " [third record data] END"
output-end
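
One possible way to carry this starting point through to the requested vector is to work on the tokens returned by scan() rather than on the pasted string. The sketch below is only an illustration: it assumes the file opens with BEGIN and closes with END as whole tokens, and it feeds the sample text to scan() through its `text` argument so that no file named "t" is needed.

```r
# Tokens as scan("t", what = "") would return them; the text is given inline.
x1 <- scan(text = "BEGIN RECORD [first record data] RECORD [second
record data] RECORD
[third record data]
END", what = "", quiet = TRUE)

n <- length(x1)
stopifnot(x1[1] == "BEGIN", x1[n] == "END")   # assumption about the layout
body <- x1[-c(1, n)]                          # drop the BEGIN and END tokens

# Each RECORD token starts a new group; number the groups with cumsum().
id   <- cumsum(body == "RECORD")
keep <- body != "RECORD"

# Paste the tokens of each group back together with single spaces.
groups  <- split(body[keep], id[keep])
records <- unname(vapply(groups, paste, character(1), collapse = " "))
print(records)
```

Because RECORD is compared as a whole token, a data word that merely contains those letters would not split a record. An empty record (two RECORD tokens in a row) is silently skipped in this sketch.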

Peter Wolf


Eric Lecoutre

2004-Apr-27 10:15 UTC


[R] parsing a data file

Oops,

I realize that my code removes the last value...
Better to remove BEGIN and END from the text directly:


readRecords = function(file){
        tmptxt = readLines(file)
        tmptxt = paste(tmptxt,collapse=" ") # All as a single string
        tmptxt = substr(tmptxt, 6, nchar(tmptxt)-3) # remove BEGIN and END
        tmptxt = strsplit(tmptxt,"RECORD")[[1]]
        num = as.numeric(tmptxt)
        return(num)
}
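
The fixed offsets in substr() rely on BEGIN being the very first characters and END the very last, and the split leaves a blank piece in front of the first record. A whitespace-tolerant variant might look like the sketch below. The name readRecordsText is made up for illustration, and returning character strings rather than numbers is an assumption, since the thread does not say what the record data contain.

```r
# Hypothetical variant of readRecords(); the name and the character return
# type are assumptions, not part of the original post.
readRecordsText <- function(file) {
  txt <- paste(readLines(file, warn = FALSE), collapse = " ")
  # Strip BEGIN and END together with any surrounding whitespace.
  txt <- sub("^\\s*BEGIN\\s+", "", txt)
  txt <- sub("\\s+END\\s*$", "", txt)
  # Split on RECORD, swallowing the whitespace around it.
  parts <- strsplit(txt, "\\s*RECORD\\s*")[[1]]
  # Collapse runs of whitespace left over from line breaks inside a record.
  parts <- gsub("\\s+", " ", parts)
  parts[nzchar(parts)]   # drop the empty piece before the first RECORD
}

# Try it on a temporary file holding the sample from the question.
tmp <- tempfile()
writeLines(c("BEGIN RECORD [first record data] RECORD [second",
             "record data]   RECORD",
             "[third record data]",
             "END",
             ""), tmp)
records <- readRecordsText(tmp)
unlink(tmp)
print(records)
```

If the records do turn out to be single numbers, as.numeric(records) can be applied to the result afterwards.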

Eric

Barry Rowlingson

2004-Apr-27 10:35 UTC


[R] parsing a data file

Tamas Papp wrote:
> I need to parse a data file (output of a measuring device) of the
> following format:
> 
> BEGIN RECORD [first record data] RECORD [second
> record data] RECORD
> [third record data]
> END
  Is it just the one 'BEGIN/END' pair per file? Or are there several? 
What's the format of the [first record data] entries? Numbers, strings? 
Are there literally square brackets in there?
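
If it turns out that a file can hold several BEGIN/END pairs, one possible way to handle it in R is to pull out each block first and then split every block on RECORD. The sketch below uses made-up sample text (a1, a2, b1, ...) and assumes that the words BEGIN and END never occur inside the record data; the helper name parseBlock is hypothetical.

```r
# Made-up sample with two BEGIN/END blocks and a line break in the middle.
txt <- "BEGIN RECORD a1 RECORD a2 END BEGIN RECORD b1\nRECORD b2 RECORD b3 END"
txt <- gsub("\\s+", " ", txt)   # treat any run of whitespace as one space

# One match per BEGIN ... END block; the non-greedy .*? keeps blocks separate.
blocks <- regmatches(txt, gregexpr("BEGIN.*?END", txt, perl = TRUE))[[1]]

# Hypothetical helper: strip the delimiters of one block, split on RECORD.
parseBlock <- function(b) {
  inner <- sub("^BEGIN\\s*", "", sub("\\s*END$", "", b))
  parts <- trimws(strsplit(inner, "RECORD", fixed = TRUE)[[1]])
  parts[nzchar(parts)]
}

result <- lapply(blocks, parseBlock)   # one character vector per block
print(result)
```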
> I need to extract the record data I marked with []'s, eg a vector such
> as c("[first record data]", "[second]", ...) would be nice as a
> result.
> 
> What functions should I use for this?
  I'd consider writing a Perl script that converted this into an XML 
file, then you could probably use the RXML package to read it, and it 
would be in a format readable by any XML-reading thing, or at least in a 
more easily convertible form. But that might be a bit heavyweight, and 
the Ted Harding approach of sed, tr, and awk is always appealing, 
assuming you have a Unix box or a Unix box-of-tricks on Windows (cygwin).


Baz
