thr3ads.net - R help - [R] troubles reading a text file [Dec 2012]


Igor.Drobyshev2 at uqat.ca

2012-Dec-15 22:23 UTC

[R] troubles reading a text file

Dear R experts,

For quite some time I have been trying to solve the mystery of reading a seemingly
trouble-free text file. The data are a temperature reconstruction arranged as a
huge grid, preceded by seven "header lines" (which you see better if the
file is opened in Firefox or Chrome).

This is the data (gridded temperature reconstruction)
ftp://ftp.ncdc.noaa.gov/pub/data/paleo/historical/europe/casty2007/temp-mon.txt

And this is original data description:
ftp://ftp.ncdc.noaa.gov/pub/data/paleo/historical/europe/casty2007/readme-casty2007.txt
Basically, it says "space-delimited ASCII format" there ...

I tried this:

Temperature <- read.table(FileName, skip = 7, header = TRUE,
                          na.strings = "NA", sep = "")

But ..

> Temperature <- read.table(FileName, skip = 7, header = FALSE, sep = "")
Error in read.table(FileName, skip = 7, header = FALSE, sep = "") :
  empty beginning of file





Trying read.csv gives this:

Error: cannot allocate vector of size 370.5 Mb

I attempted to handle this by opening and resaving the file in another program,
but even though I can still see the first lines of the file in the import dialog,
the full reading of the file always ends up with an error, possibly because of
the huge number of columns ..



I believe the problem is with some special encoding but I cannot figure out how
to get around it.



Could some of you give me any hint on that?



many thanks in advance

Igor

Igor Drobyshev
Dendrochronological laboratory at Station de Recheche FERLD, director
Chaire industrielle CRSNG-UQAT-UQAM en aménagement forestier durable
Université du Québec en Abitibi-Témiscamingue
445 boul . de l'Université
Rouyn-Noranda, QC
Canada J9X5E4
http://www.dendro.uqat.ca/


Jeffrey Dick

2012-Dec-16 04:30 UTC

Hi Igor,

It appears that the encoding is UTF-16.
> readLines("temp-mon.txt")
 [1] "??" ""   ""   ""   ""   ""   ""   ""   ""   ""   ""   ""   ""
[14] ""   ""   ""   ""   ""   ""   ""

A search for "??" leads to the Wikipedia page
http://en.wikipedia.org/wiki/Byte_order_mark, specifically the UTF-16
section.
> options(encoding="UTF-16")
> system.time(Temperature <- read.table("temp-mon.txt", skip = 7,
+   header = TRUE, na.strings="NA", sep=""))
   user  system elapsed
 28.556   0.112  28.712
> ncol(Temperature)
[1] 18001
> Temperature[, 1:10]
  YYYYMM X79.75N.49.75W X79.75N.49.25W X79.75N.48.75W X79.75N.48.25W X79.75N.47.75W X79.75N.47.25W
1 176512         -32.61         -32.92         -33.34         -33.65         -34.09         -34.21
2 176601         -31.89         -31.96         -32.26         -32.48         -32.71         -33.03
  X79.75N.46.75W X79.75N.46.25W X79.75N.45.75W
1         -34.65         -34.98         -35.43
2         -33.29         -33.41         -33.76

Here you can see that I have downloaded just the first 1 MB of the
file, so it only has two lines after the header, but 28 seconds to
read it... I'm not sure how long it would take to read.table on the
whole ~600 MB file.
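
The UTF-16 diagnosis can also be checked without a browser, by looking at the first bytes of the file for a byte order mark. The following is only a minimal sketch: detect_bom() is a hypothetical helper (it is not part of R or of this thread), and the small temporary file is built on the spot so that the example runs on its own. It ignores UTF-32, whose little-endian mark also starts with FF FE.

```r
# Hypothetical helper (an assumption, not from the thread):
# guess the encoding of a file from its byte order mark (BOM).
detect_bom <- function(path) {
  b <- readBin(path, what = "raw", n = 3L)
  if (length(b) >= 3L && identical(b[1:3], as.raw(c(0xEF, 0xBB, 0xBF)))) return("UTF-8")
  if (length(b) >= 2L && identical(b[1:2], as.raw(c(0xFF, 0xFE)))) return("UTF-16LE")
  if (length(b) >= 2L && identical(b[1:2], as.raw(c(0xFE, 0xFF)))) return("UTF-16BE")
  NA_character_  # no BOM recognised
}

# Build a tiny UTF-16LE file with a BOM so the example is self-contained.
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xFF, 0xFE)), con)  # the BOM itself
writeBin(iconv("YYYYMM\t79.75N/49.75W\n", from = "UTF-8", to = "UTF-16LE",
               toRaw = TRUE)[[1]], con)
close(con)

bom_guess <- detect_bom(tmp)
print(bom_guess)
```

On the real data one would call detect_bom("temp-mon.txt") on the downloaded file; a result of NA would only mean that no mark was found, not that the file is plain ASCII.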

scan() might be faster:
(and this does not require setting options(encoding="UTF-16"))
> system.time(Temperature <- scan("temp-mon.txt",
+   fileEncoding="UTF-16", skip=8))
Read 36002 items
   user  system elapsed
  0.104   0.000   0.104
> Temperature <- matrix(Temperature, ncol=18001, byrow=TRUE)
> Temperature.colnames <- scan("temp-mon.txt", character(),
+   fileEncoding="UTF-16", skip=7, nmax=18001)
Read 18001 items
> colnames(Temperature) <- Temperature.colnames
> Temperature[, 1:10]
     YYYYMM 79.75N/49.75W 79.75N/49.25W 79.75N/48.75W 79.75N/48.25W 79.75N/47.75W 79.75N/47.25W
[1,] 176512        -32.61        -32.92        -33.34        -33.65        -34.09        -34.21
[2,] 176601        -31.89        -31.96        -32.26        -32.48        -32.71        -33.03
     79.75N/46.75W 79.75N/46.25W 79.75N/45.75W
[1,]        -34.65        -34.98        -35.43
[2,]        -33.29        -33.41        -33.76

(note the different colnames, similar to using check.names=FALSE in
read.table, and the result is a matrix, not a data frame as returned
by read.table)
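
The scan() approach above can be tried end to end on a toy file. This is a sketch under assumptions: the file content, the two-line preamble and the three columns are made up for illustration, standing in for the 7 preamble lines and 18001 columns of the real file.

```r
# Write a small UTF-16LE file with a BOM: 2 preamble lines, 1 header line, 2 data rows.
txt <- paste0(
  "NAME \"demo\"\n",
  "\n",
  "YYYYMM\t79.75N/49.75W\t79.75N/49.25W\n",
  "176512\t-32.61\t-32.92\n",
  "176601\t-31.89\t-31.96\n"
)
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xFF, 0xFE)), con)
writeBin(iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], con)
close(con)

n_skip <- 2L  # preamble lines before the header (7 in the thread's file)
n_col  <- 3L  # number of columns (18001 in the thread's file)

# Header line: read exactly n_col character fields.
col_names <- scan(tmp, what = character(), fileEncoding = "UTF-16",
                  skip = n_skip, nmax = n_col, quiet = TRUE)

# Data: skip the header too and read everything as numbers.
values <- scan(tmp, fileEncoding = "UTF-16", skip = n_skip + 1L, quiet = TRUE)

# scan() returns one long vector in row order, so fill the matrix by row.
temp_mat <- matrix(values, ncol = n_col, byrow = TRUE,
                   dimnames = list(NULL, col_names))
print(temp_mat)
```

Filling by row matters here: without byrow=TRUE the values of one month would be spread down a column instead of along a row.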

HTH,
Jeff

David Winsemius

2012-Dec-16 04:45 UTC

On Dec 15, 2012, at 2:23 PM, <Igor.Drobyshev2 at uqat.ca> wrote:
> Dear R experts,
>
> For quite some time I have been trying to solve the mystery of reading a
> seemingly trouble-free text file. The data are a temperature reconstruction
> arranged as a huge grid, preceded by seven "header lines" (which you
> see better if the file is opened in Firefox or Chrome).
>
> This is the data (gridded temperature reconstruction)
> ftp://ftp.ncdc.noaa.gov/pub/data/paleo/historical/europe/casty2007/temp-mon.txt
>
> And this is original data description:
> ftp://ftp.ncdc.noaa.gov/pub/data/paleo/historical/europe/casty2007/readme-casty2007.txt
> Basically, it says "space-delimited ASCII format" there ...
>
> I tried this:
> Temperature <- read.table(FileName, skip = 7, header = TRUE,
>                           na.strings = "NA", sep = "")
>
> But ..
>
>> Temperature <- read.table(FileName, skip = 7, header = FALSE, sep = "")
> Error in read.table(FileName, skip = 7, header = FALSE, sep = "") :
>   empty beginning of file
>
After inspecting a small fragment (8 MB, downloaded with an ftp client) with
both Firefox and TextEdit.app and seeing that they reported this to be UTF-16
encoded, I saved it from TextEdit as UTF-8 and then could view it with R
readLines. These are the first 7 lines and the beginning of the eighth:
> readLines("~/Downloads/temp-mon2.txt", n=10)
 [1] "NAME \"Monthly European Temperatures 1766-2000 [T=2m, Celsius]\""
 [2] "LONGITUDES\t180\t50.00W\t40.00E\t"
 [3] "LATITUDES\t100\t80.00N\t30.00N\t"
 [4] "NODATA_STRING\tNA"
 [5] "NUMBER_OF_ROWS\t2820"
 [6] "NUMBER_OF_COLUMNS\t18001\t"
 [7] ""
 [8] "YYYYMM\t79.75N/49.75W\t79.75N/49.25W\t79.75N/48.75W\t79.75N/48.25W\t79.75N/47.75W\t79.75N/47.25W\t79.75N/46.75W\t79.75N/46.25W\t79.75N/45.75W\t79.75N/45.25W\t79.75N/44.75W\t79.75N/44.25W\t79.7

As you can readily see, it is a tab-separated file. I was able to get partial
success (reading the first three lines, anyway) with:
> inp <- read.table("~/Downloads/temp-mon.txt", nrows=3, skip=7, header=TRUE,
+   fill=TRUE, fileEncoding="UTF-16")
> inp[1 , 1:10]
  YYYYMM X79.75N.49.75W X79.75N.49.25W X79.75N.48.75W X79.75N.48.25W X79.75N.47.75W X79.75N.47.25W
1 176512         -32.61         -32.92         -33.34         -33.65         -34.09         -34.21
  X79.75N.46.75W X79.75N.46.25W X79.75N.45.75W
1         -34.65         -34.98         -35.43
> inp[ , 1:10]
  YYYYMM X79.75N.49.75W X79.75N.49.25W X79.75N.48.75W X79.75N.48.25W X79.75N.47.75W X79.75N.47.25W
1 176512         -32.61         -32.92         -33.34         -33.65         -34.09         -34.21
2 176601         -31.89         -31.96         -32.26         -32.48         -32.71         -33.03
3 176602         -34.31         -34.40         -34.60         -34.79         -35.01         -35.13
  X79.75N.46.75W X79.75N.46.25W X79.75N.45.75W
1         -34.65         -34.98         -35.43
2         -33.29         -33.41         -33.76
3         -35.46         -35.57         -35.91
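
Since the file is tab-separated and all columns are numeric, read.table can be given more hints. The sketch below is an illustration on a made-up toy file, not a tested recipe for the full data set. Declaring sep, colClasses and nrows spares read.table from guessing them, and check.names=FALSE keeps the original column labels. Whether the real data lines end in a trailing tab (as some of its header lines do) was not checked here; if they do, sep="\t" would produce an extra empty column.

```r
# Toy stand-in for the real file: 2 preamble lines, a header line, 2 data rows, UTF-16LE with BOM.
txt <- paste0(
  "NAME \"demo\"\n",
  "\n",
  "YYYYMM\t79.75N/49.75W\t79.75N/49.25W\n",
  "176512\t-32.61\t-32.92\n",
  "176601\t-31.89\tNA\n"
)
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xFF, 0xFE)), con)
writeBin(iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], con)
close(con)

inp <- read.table(tmp,
                  skip = 2,                 # 7 for the thread's file
                  header = TRUE,
                  sep = "\t",               # tab-separated, not whitespace
                  fileEncoding = "UTF-16",
                  na.strings = "NA",
                  colClasses = "numeric",   # recycled over all columns
                  nrows = 2,                # 2820 per the real file's header
                  check.names = FALSE,      # keep labels such as 79.75N/49.75W
                  comment.char = "")
str(inp)
```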
> 
> Trying read.csv gives this:
> 
> 
> Error: cannot allocate vector of size 370.5 Mb
That, on the other hand, suggests you have inadequate machine resources for this
job. Perhaps you should be thinking of using other tools than R for this project
... or buying more RAM. You should probably have 32 GB for a job this
size.

> I attempted to handle this by opening and resaving the file in another
> program, but even though I can still see the first lines of the file in the
> import dialog, the full reading of the file always ends up with an error,
> possibly because of the huge number of columns ..
>
> I believe the problem is with some special encoding but I cannot figure out
> how to get around it.

Partially correct but perhaps your problems are multifactorial. 
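
For the memory side of the problem, a back-of-the-envelope figure can be worked out from the row and column counts given in the file's own header. This is only rough arithmetic for the finished numeric values at 8 bytes each; it says nothing about the working memory read.table needs while parsing, which is larger.

```r
n_rows <- 2820L    # NUMBER_OF_ROWS from the file header
n_cols <- 18001L   # NUMBER_OF_COLUMNS from the file header
bytes_per_double <- 8

# Convert to double first so the product cannot overflow integer arithmetic.
matrix_bytes <- as.numeric(n_rows) * n_cols * bytes_per_double
matrix_mib   <- matrix_bytes / 1024^2

cat(sprintf("%.0f bytes, about %.1f MiB for the numeric values alone\n",
            matrix_bytes, matrix_mib))
```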

I was able to get this to "work" from that website:
> inp <- read.table(file=url("ftp://ftp.ncdc.noaa.gov/pub/data/paleo/historical/europe/casty2007/temp-mon.txt",
+     encoding="UTF-16"), nrows=3, skip=7, header=TRUE, fill=TRUE,
+   fileEncoding="UTF-16")
> str(inp[ , 1:10])
'data.frame':	3 obs. of  10 variables:
 $ YYYYMM        : int  176512 176601 176602
 $ X79.75N.49.75W: num  -32.6 -31.9 -34.3
 $ X79.75N.49.25W: num  -32.9 -32 -34.4
 $ X79.75N.48.75W: num  -33.3 -32.3 -34.6
 $ X79.75N.48.25W: num  -33.6 -32.5 -34.8
 $ X79.75N.47.75W: num  -34.1 -32.7 -35
 $ X79.75N.47.25W: num  -34.2 -33 -35.1
 $ X79.75N.46.75W: num  -34.6 -33.3 -35.5
 $ X79.75N.46.25W: num  -35 -33.4 -35.6
 $ X79.75N.45.75W: num  -35.4 -33.8 -35.9
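
The re-save as UTF-8 that was done by hand in TextEdit earlier in this message could probably also be done from within R, a chunk of lines at a time, so that the whole file never has to sit in memory. A sketch only: reencode_to_utf8() and its arguments are made-up names, and the input here is a toy file rather than the real one. It assumes the content is plain ASCII once decoded, which is what the data description claims.

```r
# Hypothetical helper (an assumption, not from the thread):
# copy a text file to UTF-8, reading a limited number of lines at a time.
reencode_to_utf8 <- function(src, dest, from = "UTF-16", chunk_lines = 100L) {
  con_in <- file(src, open = "rt", encoding = from)
  on.exit(close(con_in), add = TRUE)
  con_out <- file(dest, open = "wt", encoding = "UTF-8")
  on.exit(close(con_out), add = TRUE)
  repeat {
    lines <- readLines(con_in, n = chunk_lines)
    if (length(lines) == 0L) break   # end of input reached
    writeLines(lines, con_out)
  }
  invisible(dest)
}

# Toy UTF-16LE input with a BOM, so the example runs on its own.
src <- tempfile(fileext = ".txt")
con <- file(src, open = "wb")
writeBin(as.raw(c(0xFF, 0xFE)), con)
writeBin(iconv("NAME \"demo\"\nYYYYMM\t79.75N/49.75W\n176512\t-32.61\n",
               from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], con)
close(con)

dest <- tempfile(fileext = ".txt")
reencode_to_utf8(src, dest, chunk_lines = 2L)
out_lines <- readLines(dest)
print(out_lines[3])
```

After such a conversion the file can be read with readLines, scan or read.table without any encoding arguments.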

-- 

David Winsemius
Alameda, CA, USA
