Emmanuel Charpentier
2007-Oct-11 20:42 UTC
[R] A read.table mystery (data for Framemaker Mac)
Dear list,
I have to read some clinical data a file coming from Filemaker on
Macintosh (Ugh ! But it could be worse and come from Excel...).
Exporting via Excel is out of question since the file has 467 columns
and 121 lines (+ headers), which is out of reach of Excel. So I received
an "mer"" files, which is what Filemaker exports as a text file.
It seems to be a semicolon delimited file, with all data item enclosed
in double quotes, and header items not delimited.
So, first attempt is :
In bash :
# Convert from apparent Latin-1 to UTF-8
charpent at yod:/media/EMTEC/MARS-Point2007$ cat export.mer | iconv -f
latin1 -t utf8 > export-utf8.mer
# Check size (line count)
charpent at yod:/media/EMTEC/MARS-Point2007$ wc export-utf8.mer 122
8632 311252 export-utf8.mer
In R :
foo<-read.table("export-utf8.mer", header=TRUE, sep=";",
dec=",",
check.names=TRUE, strip.white=TRUE, encoding="UTF-8")
/* Looong wait ! Almost 1 minute on a 3,2 GHz dual core... */
Warning message:
In read.table("export-utf8.mer", header = TRUE, sep = ";",
dec = ",", :
readTableHeader a trouv? une ligne finale incompl?te dans
'export-utf8.mer'> dim(foo)
[1] 0 468
Ouch ! One column too much, and no lines...
Trying to print foo seems to start R in an endless loop. According to
emacs (ESS is my favorite interface to R), R tries to print more than 1
million lines (empty) before I give up and kill emacs.
Second attempt : separate headers from data :
In bash :
# Same conversion, discarding the first line
charpent at yod:/media/EMTEC/MARS-Point2007$ cat export.mer | iconv -f
latin1 -t utf8 | tail -n +2 > data.utf8
# check size
charpent at yod:/media/EMTEC/MARS-Point2007$ wc data.utf8 121 7581
301942 data.utf8
# Convert header line, and stack headers by replacing the separator (;)
# by a newline
charpent at yod:/media/EMTEC/MARS-Point2007$ cat export.mer | iconv -f
latin1 -t utf8 | head -n 1 | sed -e "s/;/\n/g" > headers.utf8
# Check size
charpent at yod:/media/EMTEC/MARS-Point2007$ wc headers.utf8
467 1517 9310 headers.utf8
In R :> bar<-read.table("data.utf8", header=FALSE, sep=";",
dec=",",
strip.white=TRUE, encoding="UTF-8")> dim(bar)
[1] 121 467
/* Fine ... */> hdrs<-read.table("headers.utf8", header=FALSE,
sep=";",
strip.white=TRUE, encoding="UTF-8")> dim(hdrs)
[1] 467 1
/* Also fine. Let's use these names : */> names(bar)<-make.names(hdrs[,1])
This is quite fast, and the resulting data frame *seems* to be fine and
dandy.
I have found a pragmatic solution, which (more or less) solves my
problem. But I'm frustrated at understanding the root of the problem.
Of course, I also tried to read the header line alone as a semicolon
separated file of one line : it fails with a warning about an
unterminated final line and returns a data frame with 467 columns and 0
line.
And, by the way (let's try to be Ripley'd for something else than
forgetting to read the FAQ) :> R.Version()
$platform
[1] "i486-pc-linux-gnu"
$arch
[1] "i486"
$os
[1] "linux-gnu"
$system
[1] "i486, linux-gnu"
$status
[1] ""
$major
[1] "2"
$minor
[1] "6.0"
$year
[1] "2007"
$month
[1] "10"
$day
[1] "03"
$`svn rev`
[1] "43063"
(This is the version packaged for Ubuntu grabbed from the relevant R
repository).
The R FAQ was unhelpful in this special case. The import/export guide
suggests to fiddle with delimiters and quoting, which I did, to no avail.
Searching for read.table in the searchable archives of R help gives
about 900 r?f?rences for 2007. An honest effort to read the headers and
seemingly relevant messages did not shed any light.
My best clue so far is that the data items are enclosed in double
quotes, while items of the header line are not (just separated by
semicolons) ; I do not know how to coax read.table to read *that*.
I would appreciate any hint as to the source of the problem and a
possible solution (I try to find an as automated as possible way,
because I think I will have to read successive versions of this data
file many times...).
Sincerely yours,
Emmanuel Charpentier
Gabor Grothendieck
2007-Oct-11 21:01 UTC
[R] A read.table mystery (data for Framemaker Mac)
On 10/11/07, Emmanuel Charpentier <charpent at bacbuc.dyndns.org> wrote:> I have to read some clinical data a file coming from Filemaker on > Macintosh (Ugh ! But it could be worse and come from Excel...). > Exporting via Excel is out of question since the file has 467 columns > and 121 lines (+ headers), which is out of reach of Excel. So I received > an "mer"" files, which is what Filemaker exports as a text file.The limits for Excel are 1 million rows and 16,000 columns (based on Excel 2007 which is what I have). Read the last line on every message to r-help,
Maybe Matching Threads
- Bug#123084: r-base: R documentatio search engine unusable with Mozilla 6 (OK with netscape 4.7x) (PR#1202)
- Bug#123084: r-base: R documentatio search engine unusable (PR#1203)
- Pb with Installshield (Dragon Naturaly Speaking) : not enough room on system drive
- RubyODBC bus error under Rails
- Hint for Wine + Type1 + FrameMaker / Update for the Font HOWTO