Emmanuel Charpentier
2007-Oct-11 20:42 UTC
[R] A read.table mystery (data for Framemaker Mac)
Dear list, I have to read some clinical data a file coming from Filemaker on Macintosh (Ugh ! But it could be worse and come from Excel...). Exporting via Excel is out of question since the file has 467 columns and 121 lines (+ headers), which is out of reach of Excel. So I received an "mer"" files, which is what Filemaker exports as a text file. It seems to be a semicolon delimited file, with all data item enclosed in double quotes, and header items not delimited. So, first attempt is : In bash : # Convert from apparent Latin-1 to UTF-8 charpent at yod:/media/EMTEC/MARS-Point2007$ cat export.mer | iconv -f latin1 -t utf8 > export-utf8.mer # Check size (line count) charpent at yod:/media/EMTEC/MARS-Point2007$ wc export-utf8.mer 122 8632 311252 export-utf8.mer In R : foo<-read.table("export-utf8.mer", header=TRUE, sep=";", dec=",", check.names=TRUE, strip.white=TRUE, encoding="UTF-8") /* Looong wait ! Almost 1 minute on a 3,2 GHz dual core... */ Warning message: In read.table("export-utf8.mer", header = TRUE, sep = ";", dec = ",", : readTableHeader a trouv? une ligne finale incompl?te dans 'export-utf8.mer'> dim(foo)[1] 0 468 Ouch ! One column too much, and no lines... Trying to print foo seems to start R in an endless loop. According to emacs (ESS is my favorite interface to R), R tries to print more than 1 million lines (empty) before I give up and kill emacs. Second attempt : separate headers from data : In bash : # Same conversion, discarding the first line charpent at yod:/media/EMTEC/MARS-Point2007$ cat export.mer | iconv -f latin1 -t utf8 | tail -n +2 > data.utf8 # check size charpent at yod:/media/EMTEC/MARS-Point2007$ wc data.utf8 121 7581 301942 data.utf8 # Convert header line, and stack headers by replacing the separator (;) # by a newline charpent at yod:/media/EMTEC/MARS-Point2007$ cat export.mer | iconv -f latin1 -t utf8 | head -n 1 | sed -e "s/;/\n/g" > headers.utf8 # Check size charpent at yod:/media/EMTEC/MARS-Point2007$ wc headers.utf8 467 1517 9310 headers.utf8 In R :> bar<-read.table("data.utf8", header=FALSE, sep=";", dec=",",strip.white=TRUE, encoding="UTF-8")> dim(bar)[1] 121 467 /* Fine ... */> hdrs<-read.table("headers.utf8", header=FALSE, sep=";",strip.white=TRUE, encoding="UTF-8")> dim(hdrs)[1] 467 1 /* Also fine. Let's use these names : */> names(bar)<-make.names(hdrs[,1])This is quite fast, and the resulting data frame *seems* to be fine and dandy. I have found a pragmatic solution, which (more or less) solves my problem. But I'm frustrated at understanding the root of the problem. Of course, I also tried to read the header line alone as a semicolon separated file of one line : it fails with a warning about an unterminated final line and returns a data frame with 467 columns and 0 line. And, by the way (let's try to be Ripley'd for something else than forgetting to read the FAQ) :> R.Version()$platform [1] "i486-pc-linux-gnu" $arch [1] "i486" $os [1] "linux-gnu" $system [1] "i486, linux-gnu" $status [1] "" $major [1] "2" $minor [1] "6.0" $year [1] "2007" $month [1] "10" $day [1] "03" $`svn rev` [1] "43063" (This is the version packaged for Ubuntu grabbed from the relevant R repository). The R FAQ was unhelpful in this special case. The import/export guide suggests to fiddle with delimiters and quoting, which I did, to no avail. Searching for read.table in the searchable archives of R help gives about 900 r?f?rences for 2007. An honest effort to read the headers and seemingly relevant messages did not shed any light. My best clue so far is that the data items are enclosed in double quotes, while items of the header line are not (just separated by semicolons) ; I do not know how to coax read.table to read *that*. I would appreciate any hint as to the source of the problem and a possible solution (I try to find an as automated as possible way, because I think I will have to read successive versions of this data file many times...). Sincerely yours, Emmanuel Charpentier
Gabor Grothendieck
2007-Oct-11 21:01 UTC
[R] A read.table mystery (data for Framemaker Mac)
On 10/11/07, Emmanuel Charpentier <charpent at bacbuc.dyndns.org> wrote:> I have to read some clinical data a file coming from Filemaker on > Macintosh (Ugh ! But it could be worse and come from Excel...). > Exporting via Excel is out of question since the file has 467 columns > and 121 lines (+ headers), which is out of reach of Excel. So I received > an "mer"" files, which is what Filemaker exports as a text file.The limits for Excel are 1 million rows and 16,000 columns (based on Excel 2007 which is what I have). Read the last line on every message to r-help,
Maybe Matching Threads
- Bug#123084: r-base: R documentatio search engine unusable with Mozilla 6 (OK with netscape 4.7x) (PR#1202)
- Bug#123084: r-base: R documentatio search engine unusable (PR#1203)
- Pb with Installshield (Dragon Naturaly Speaking) : not enough room on system drive
- RubyODBC bus error under Rails
- Hint for Wine + Type1 + FrameMaker / Update for the Font HOWTO