Dear R experts,

For quite some time I have been trying to solve the mystery of reading a seemingly trouble-free text file. The data are a temperature reconstruction arranged as a huge grid, preceded by seven "header lines" (which are easier to see if the file is opened in Firefox or Chrome).

This is the data (gridded temperature reconstruction):
ftp://ftp.ncdc.noaa.gov/pub/data/paleo/historical/europe/casty2007/temp-mon.txt

And this is the original data description:
ftp://ftp.ncdc.noaa.gov/pub/data/paleo/historical/europe/casty2007/readme-casty2007.txt

Basically, it says "space-delimited ASCII format" there.

I tried this:

Temperature<-read.table(FileName,skip = 7, header = TRUE, na.strings="NA",sep="")

But:

> Temperature <- read.table(FileName, skip = 7, header = FALSE, sep="")
Error in read.table(FileName, skip = 7, header = FALSE, sep = "") :
  empty beginning of file

Trying read.csv gives this:

Error: cannot allocate vector of size 370.5 Mb

I attempted to handle this by opening and resaving the file in other software, but even though I can still see the first lines of the file in the import dialog, reading the full file always ends in an error, possibly because of the huge number of columns.

I believe the problem is with some special encoding, but I cannot figure out how to get around it.

Could any of you give me a hint on that?

Many thanks in advance,
Igor

Igor Drobyshev
Dendrochronological laboratory at Station de Recherche FERLD, director
Chaire industrielle CRSNG-UQAT-UQAM en aménagement forestier durable
Université du Québec en Abitibi-Témiscamingue
445 boul. de l'Université
Rouyn-Noranda, QC
Canada J9X5E4
http://www.dendro.uqat.ca/

Hi Igor,

It appears that the encoding is UTF-16.

> readLines("temp-mon.txt")
 [1] "??" "" "" "" "" "" "" "" "" "" "" "" ""
[14] "" "" "" "" "" "" ""

A search for "??" leads to the Wikipedia page http://en.wikipedia.org/wiki/Byte_order_mark, specifically the UTF-16 section.

> options(encoding="UTF-16")
> system.time(Temperature<-read.table("temp-mon.txt",skip = 7, header = TRUE, na.strings="NA",sep=""))
   user  system elapsed
 28.556   0.112  28.712
> ncol(Temperature)
[1] 18001
> Temperature[, 1:10]
  YYYYMM X79.75N.49.75W X79.75N.49.25W X79.75N.48.75W X79.75N.48.25W X79.75N.47.75W X79.75N.47.25W
1 176512         -32.61         -32.92         -33.34         -33.65         -34.09         -34.21
2 176601         -31.89         -31.96         -32.26         -32.48         -32.71         -33.03
  X79.75N.46.75W X79.75N.46.25W X79.75N.45.75W
1         -34.65         -34.98         -35.43
2         -33.29         -33.41         -33.76

Here you can see that I downloaded just the first 1 MB of the file, so it has only two lines after the header, yet it took 28 seconds to read. I'm not sure how long read.table would take on the whole ~600 MB file.

scan() might be faster (and it does not require setting options(encoding="UTF-16")):

> system.time(Temperature <- scan("temp-mon.txt", fileEncoding="UTF-16", skip=8))
Read 36002 items
   user  system elapsed
  0.104   0.000   0.104
> Temperature <- matrix(Temperature, ncol=18001, byrow=TRUE)
> Temperature.colnames <- scan("temp-mon.txt", character(), fileEncoding="UTF-16", skip=7, nmax=18001)
Read 18001 items
> colnames(Temperature) <- Temperature.colnames
> Temperature[, 1:10]
     YYYYMM 79.75N/49.75W 79.75N/49.25W 79.75N/48.75W 79.75N/48.25W 79.75N/47.75W 79.75N/47.25W
[1,] 176512        -32.61        -32.92        -33.34        -33.65        -34.09        -34.21
[2,] 176601        -31.89        -31.96        -32.26        -32.48        -32.71        -33.03
     79.75N/46.75W 79.75N/46.25W 79.75N/45.75W
[1,]        -34.65        -34.98        -35.43
[2,]        -33.29        -33.41        -33.76

(Note the different colnames, similar to using check.names=FALSE in read.table, and that the result is a matrix, not a data frame as returned by read.table.)

HTH,
Jeff

On Sun, Dec 16, 2012 at 6:23 AM, <Igor.Drobyshev2 at uqat.ca> wrote:
> [...]

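For anyone who wants to try the scan() route without downloading the full file, here is a minimal sketch. It assumes the layout shown in this thread (six metadata lines, one blank line, a header row, then numeric rows) and a byte order mark at the start of the file. The tiny demo file, its shortened contents, and the names demo_file, col_names and temperature are made up for illustration.

```r
# Build a small UTF-16 stand-in for temp-mon.txt (hypothetical contents:
# only three columns and two data rows, values taken from the thread).
lines <- c(
  'NAME "Monthly European Temperatures 1766-2000 [T=2m, Celsius]"',
  "LONGITUDES\t180\t50.00W\t40.00E\t",
  "LATITUDES\t100\t80.00N\t30.00N\t",
  "NODATA_STRING\tNA",
  "NUMBER_OF_ROWS\t2820",
  "NUMBER_OF_COLUMNS\t18001\t",
  "",
  "YYYYMM\t79.75N/49.75W\t79.75N/49.25W",
  "176512\t-32.61\t-32.92",
  "176601\t-31.89\t-31.96"
)
txt <- paste0(paste(lines, collapse = "\n"), "\n")
demo_file <- tempfile(fileext = ".txt")
bom <- as.raw(c(0xff, 0xfe))  # little-endian byte order mark
body <- iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(c(bom, body), demo_file)

# Peek at the first two bytes: FF FE or FE FF indicates a UTF-16 BOM.
first_bytes <- readBin(demo_file, what = "raw", n = 2)
has_utf16_bom <- identical(first_bytes, as.raw(c(0xff, 0xfe))) ||
  identical(first_bytes, as.raw(c(0xfe, 0xff)))

# Header row is line 8 (skip 7); numeric data start on line 9 (skip 8).
col_names <- scan(demo_file, what = character(), fileEncoding = "UTF-16",
                  skip = 7, nlines = 1, quiet = TRUE)
values <- scan(demo_file, fileEncoding = "UTF-16", skip = 8, quiet = TRUE)

# Reshape row by row into a numeric matrix with the original column names.
temperature <- matrix(values, ncol = length(col_names), byrow = TRUE,
                      dimnames = list(NULL, col_names))
print(temperature)
```

Taking the column count from the header row, rather than hard-coding 18001, keeps the reshape step correct if only part of the file has been downloaded.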
On Dec 15, 2012, at 2:23 PM, <Igor.Drobyshev2 at uqat.ca> wrote:

> [...]
> I tried this:
> Temperature<-read.table(FileName,skip = 7, header = TRUE, na.strings="NA",sep="")
>
> But ..
>
>> Temperature <- read.table(FileName, skip = 7, header = FALSE, sep="")
> Error in read.table(FileName, skip = 7, header = FALSE, sep = "") :
> empty beginning of file
>

After inspecting a small fragment (8 MB, downloaded with an FTP client) with both Firefox and TextEdit.app, and seeing that both reported it to be UTF-16 encoded, I saved it from TextEdit as UTF-8 and could then view it with readLines in R. These are the first seven lines and the beginning of the eighth:

> readLines("~/Downloads/temp-mon2.txt", n=10)
[1] "NAME \"Monthly European Temperatures 1766-2000 [T=2m, Celsius]\""
[2] "LONGITUDES\t180\t50.00W\t40.00E\t"
[3] "LATITUDES\t100\t80.00N\t30.00N\t"
[4] "NODATA_STRING\tNA"
[5] "NUMBER_OF_ROWS\t2820"
[6] "NUMBER_OF_COLUMNS\t18001\t"
[7] ""
[8] "YYYYMM\t79.75N/49.75W\t79.75N/49.25W\t79.75N/48.75W\t79.75N/48.25W\t79.75N/47.75W\t79.75N/47.25W\t79.75N/46.75W\t79.75N/46.25W\t79.75N/45.75W\t79.75N/45.25W\t79.75N/44.75W\t79.75N/44.25W\t79.7

As you can readily see, it is a tab-separated file.

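The re-save as UTF-8 described above can probably also be done from within R, without a text editor. What follows is only a sketch under assumptions: it presumes the file starts with a byte order mark so that "UTF-16" is decoded correctly, and the demo file and the names src and dst are hypothetical.

```r
# Hypothetical three-line UTF-16 input file with a little-endian BOM.
txt <- "NODATA_STRING\tNA\nYYYYMM\t79.75N/49.75W\n176512\t-32.61\n"
src <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xff, 0xfe)),
           iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]),
         src)

# Stream the file through two connections: decode UTF-16 on the way in,
# encode UTF-8 on the way out, a block of lines at a time.
dst <- tempfile(fileext = ".txt")
inp <- file(src, open = "r", encoding = "UTF-16")
out <- file(dst, open = "w", encoding = "UTF-8")
repeat {
  chunk <- readLines(inp, n = 1000)
  if (length(chunk) == 0) break  # end of input reached
  writeLines(chunk, out)
}
close(inp)
close(out)

# The converted copy can now be read with the default settings.
converted <- readLines(dst)
print(converted)
```

Reading in blocks rather than all at once keeps memory use modest, which may matter for a file of several hundred megabytes.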
I was able to get partial success (reading the first three lines, anyway) with:

> inp <- read.table("~/Downloads/temp-mon.txt", nrow=3, skip =7, header=TRUE, fill=TRUE, fileEncoding ="UTF-16")
> inp[1 , 1:10]
  YYYYMM X79.75N.49.75W X79.75N.49.25W X79.75N.48.75W X79.75N.48.25W X79.75N.47.75W X79.75N.47.25W X79.75N.46.75W X79.75N.46.25W X79.75N.45.75W
1 176512         -32.61         -32.92         -33.34         -33.65         -34.09         -34.21         -34.65         -34.98         -35.43
> inp[ , 1:10]
  YYYYMM X79.75N.49.75W X79.75N.49.25W X79.75N.48.75W X79.75N.48.25W X79.75N.47.75W X79.75N.47.25W X79.75N.46.75W X79.75N.46.25W X79.75N.45.75W
1 176512         -32.61         -32.92         -33.34         -33.65         -34.09         -34.21         -34.65         -34.98         -35.43
2 176601         -31.89         -31.96         -32.26         -32.48         -32.71         -33.03         -33.29         -33.41         -33.76
3 176602         -34.31         -34.40         -34.60         -34.79         -35.01         -35.13         -35.46         -35.57         -35.91

> Trying read.csv gives this:
>
> Error: cannot allocate vector of size 370.5 Mb

That, on the other hand, suggests you have inadequate machine resources for this job. Perhaps you should think about using tools other than R for this project ... or buying more RAM. You should probably have 32 GB for a job this size.

> I attempted to handle this by opening and resaving the file in another software, but even if I can still see the first lines of the file in the import dialog, the full reading of the file always ends up with an error, possibly because of the huge number of columns ..
>
> I believe the problem is with some special encoding but I cannot figure out how to go around it.

Partially correct, but perhaps your problems are multifactorial. I was able to get this to "work" from that website:

> inp <- read.table(file=url("ftp://ftp.ncdc.noaa.gov/pub/data/paleo/historical/europe/casty2007/temp-mon.txt", encoding="UTF-16"), nrow=3 , skip =7, header=TRUE, fill=TRUE, fileEncoding ="UTF-16")
> str(inp[ , 1:10])
'data.frame':   3 obs. of  10 variables:
 $ YYYYMM        : int  176512 176601 176602
 $ X79.75N.49.75W: num  -32.6 -31.9 -34.3
 $ X79.75N.49.25W: num  -32.9 -32 -34.4
 $ X79.75N.48.75W: num  -33.3 -32.3 -34.6
 $ X79.75N.48.25W: num  -33.6 -32.5 -34.8
 $ X79.75N.47.75W: num  -34.1 -32.7 -35
 $ X79.75N.47.25W: num  -34.2 -33 -35.1
 $ X79.75N.46.75W: num  -34.6 -33.3 -35.5
 $ X79.75N.46.25W: num  -35 -33.4 -35.6
 $ X79.75N.45.75W: num  -35.4 -33.8 -35.9

-- 
David Winsemius
Alameda, CA, USA
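A small self-contained variant of the read.table call above may help when experimenting, together with a rough size estimate for the full grid. This is a sketch, not a tested recipe for the real file: the demo file is invented, check.names = FALSE and colClasses = "numeric" are additions not used in the thread, and the estimate relies only on the NUMBER_OF_ROWS and NUMBER_OF_COLUMNS values from the file header.

```r
# Invented UTF-16 demo file with the same shape as the real one:
# six metadata lines, a blank line, a header row, then data rows.
lines <- c(
  'NAME "demo"', "LONGITUDES\t2", "LATITUDES\t1", "NODATA_STRING\tNA",
  "NUMBER_OF_ROWS\t3", "NUMBER_OF_COLUMNS\t3", "",
  "YYYYMM\t79.75N/49.75W\t79.75N/49.25W",
  "176512\t-32.61\t-32.92",
  "176601\t-31.89\t-31.96",
  "176602\t-34.31\t-34.40"
)
txt <- paste0(paste(lines, collapse = "\n"), "\n")
demo_file <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xff, 0xfe)),
           iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]),
         demo_file)

# Read only the first two data rows. Declaring every column numeric up
# front spares read.table from guessing types; check.names = FALSE keeps
# the original "79.75N/49.75W" style column names.
inp <- read.table(demo_file, skip = 7, header = TRUE, nrows = 2,
                  fileEncoding = "UTF-16", check.names = FALSE,
                  colClasses = "numeric")
str(inp)

# Size of the full grid if held as doubles (8 bytes per value),
# using the row and column counts stated in the file header.
full_grid_mib <- 2820 * 18001 * 8 / 1024^2
print(full_grid_mib)
```

The estimate comes out at roughly 387 MiB for the finished numeric table alone; intermediate copies made while parsing can push peak memory use well above that figure.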