I'm trying to read some mainframe data encoded as EBCDIC into R, and am at a loss. I'd like to avoid using an external program to convert the files, since I'm operating in a corporate environment.

You can find the example files at the link below, with both ASCII and EBCDIC versions. Note that there are no linebreaks in the EBCDIC versions of the files -- instead, I'd be specifying the width of each record manually. R has the IBM500 encoding available in my environment, which should be the correct one for these files.

However, when I run the following commands, R seems to fail entirely. It loads a single record with garbage characters, regardless of the encoding I specify.

    layout <- read.fwf("EBCDIC_LAYOUT", widths = c(80), fileEncoding='ibm500')
    data <- read.fwf("EBCDIC_ZIPCODE", widths = c(32), fileEncoding='ibm500')

Where might I go from here?

Related -- some of the files I expect to use will be fairly large (1 GB or so). Preferably, I'd like a solution that scales reasonably well. (I tried packages like LaF, but they don't have the option to select an encoding.)

Thank you very much!

Example files --
https://drive.google.com/open?id=0ByvX1v-WqaaASTdwV2ZYS0pBV00&authuser=0
On Thu, Feb 5, 2015 at 2:08 PM, Brian Trautman <btrautman84 at gmail.com> wrote:
> However, when I run the following commands, R seems to fail entirely. It
> loads a single record with garbage characters, regardless of the encoding I
> specified.
>
> layout <- read.fwf("EBCDIC_LAYOUT", widths = c(80), fileEncoding='ibm500')
> data <- read.fwf("EBCDIC_ZIPCODE", widths = c(32), fileEncoding='ibm500')

I gave this a short try. What killed me (see below) is that your file EBCDIC_ZIPCODE has embedded NUL characters, \0. My transcript:

    > file<-file("EBCDIC_ZIPCODE",encoding="IBM500", raw=TRUE);
    > data=read.fwf(file,widths=c(32));
    Warning messages:
    1: In readLines(file, n = thisblock) :
      line 1 appears to contain an embedded nul
    2: In readLines(file, n = thisblock) :
      incomplete final line found on 'EBCDIC_ZIPCODE'
    > View(data)

I don't know how to get past the embedded NUL. I'm a UNIX user, so my thought (not applicable with your restriction of "pure R") would be to use "tr" to convert the \0 to spaces, then use the above.
--
John McKown
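The warnings in the transcript above come from readLines(), which read.fwf() uses internally and which expects newline-terminated text. One way to sidestep that in pure R is to read the file as raw bytes, cut it into fixed-width records, and only then convert the encoding. The following is a minimal sketch, not a tested solution for these particular files. The function name `read_ebcdic_fixed` and its arguments are hypothetical; it assumes the platform's iconv supports "IBM500" (as the original post says), that each byte decodes to exactly one character, and that every NUL byte may safely be treated as a blank. Note that a space in EBCDIC is byte 0x40, not 0x20.

```r
# Sketch: read fixed-width EBCDIC records that have no line breaks.
# All names here are illustrative, not part of any package.
read_ebcdic_fixed <- function(path, record_width, encoding = "IBM500",
                              records_per_chunk = 100000L) {
  con <- file(path, "rb")            # binary mode: no line or encoding handling
  on.exit(close(con), add = TRUE)
  chunks <- list()
  repeat {
    # Read a whole number of records per chunk to keep memory use bounded.
    bytes <- readBin(con, what = "raw", n = record_width * records_per_chunk)
    if (length(bytes) == 0L) break
    # Assumption: NUL bytes are filler. Replace them with the EBCDIC space
    # (0x40) before decoding, because R strings cannot hold embedded NULs.
    bytes[bytes == as.raw(0)] <- as.raw(0x40)
    # iconv() accepts a list of raw vectors and returns one string per element.
    text <- iconv(list(bytes), from = encoding, to = "UTF-8")
    # Cut the decoded text into records; a trailing partial record, if any,
    # comes back shorter than record_width.
    starts <- seq.int(1L, length(bytes), by = record_width)
    chunks[[length(chunks) + 1L]] <-
      substring(text, starts, starts + record_width - 1L)
  }
  unlist(chunks)
}
```

With the files from this thread, a call might look like `records <- read_ebcdic_fixed("EBCDIC_ZIPCODE", record_width = 32)`. The resulting character vector could then be split into fields with substring(), or handed to read.fwf() through textConnection(records). If the NUL bytes are actually part of binary or packed-decimal fields rather than filler, blanking them would destroy data, so the record layout should be checked first.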
First off, thank you very much for taking a look at this. I didn't know "raw=TRUE" would be necessary here.

Unfortunately, I'm stuck with the embedded nulls in the source data at this point. If worst comes to worst, does R have a way to do something like --

1. Read the entire file in as raw binary.
2. Replace all embedded nulls with spaces.
3. Output the revised file (as binary) somewhere else.

? I imagine it'd take a big performance penalty, but at least then I could proceed with importing the revised file.

Thanks again!

On Thu, Feb 5, 2015 at 2:06 PM, John McKown <john.archie.mckown at gmail.com> wrote:
> I don't know how to get past the embedded NULL. I'm a UNIX user, so my
> thought (not applicable with your restriction of "pure R"), would be to use
> "tr" to convert the \0 to spaces, then use the above.
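The three steps listed above (read as raw binary, replace nulls, write the result elsewhere) can be sketched in base R with readBin() and writeBin(). This is only an illustration under stated assumptions: the function name `replace_nuls` and the chunk size are made up for the example, and it assumes the NUL bytes are filler rather than part of binary fields. Because the file is still EBCDIC at this stage, the replacement byte is the EBCDIC space 0x40; writing the ASCII space 0x20 would decode to a control character under IBM500.

```r
# Sketch: copy a file byte for byte, replacing NUL (0x00) with EBCDIC space.
# The function name and default chunk size are illustrative assumptions.
replace_nuls <- function(infile, outfile, chunk_bytes = 32L * 1024L * 1024L) {
  ebcdic_space <- as.raw(0x40)
  inp <- file(infile, "rb")
  on.exit(close(inp), add = TRUE)
  out <- file(outfile, "wb")
  on.exit(close(out), add = TRUE)
  repeat {
    # Work in chunks so a 1 GB file never has to sit in memory all at once.
    bytes <- readBin(inp, what = "raw", n = chunk_bytes)
    if (length(bytes) == 0L) break
    bytes[bytes == as.raw(0)] <- ebcdic_space
    writeBin(bytes, out)
  }
  invisible(outfile)
}
```

The replacement is a vectorised comparison on a raw vector, so the cost should be dominated by disk I/O rather than by R itself. The cleaned file still has no line breaks, so it would still need to be cut into fixed-width records before read.fwf() can make sense of it.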