thr3ads.net - R help - [R] Fixed Width EBCDIC Files in R [Feb 2015]

Brian Trautman

2015-Feb-05 20:08 UTC

[R] Fixed Width EBCDIC Files in R

I'm trying to read some mainframe data encoded as EBCDIC into R, and am at
a loss. I'd like to avoid using an external program to convert the files,
since I'm operating in a corporate environment.

You can find the example files at the link below, with both ASCII and
EBCDIC versions. Note that there are no linebreaks in the EBCDIC versions
of the file -- instead, I'd be specifying the width of each line manually.
R has the IBM500 encoding available in my environment, which should be the
correct one for these files.

However, when I run the following commands, R seems to fail entirely.  It
loads a single record with garbage characters, regardless of the encoding I
specified.


layout <- read.fwf("EBCDIC_LAYOUT", widths = c(80), fileEncoding = 'ibm500')

data   <- read.fwf("EBCDIC_ZIPCODE", widths = c(32), fileEncoding = 'ibm500')


Where might I go from here?

Related -- some of the files I expect to use will be fairly large (1 GB or
so). Preferably, I'd like a solution that scales reasonably well. (I tried
packages like LaF, but they don't have the option to select encoding.)

Thank you very much!


Example files --
https://drive.google.com/open?id=0ByvX1v-WqaaASTdwV2ZYS0pBV00&authuser=0


John McKown

2015-Feb-05 22:06 UTC


I gave this a short try. What killed me (see below) is that your file
EBCDIC_ZIPCODE has embedded NULL characters, \0. My transcript:

> file <- file("EBCDIC_ZIPCODE", encoding="IBM500", raw=TRUE);
> data = read.fwf(file, widths=c(32));
Warning messages:
1: In readLines(file, n = thisblock) :
  line 1 appears to contain an embedded nul
2: In readLines(file, n = thisblock) :
  incomplete final line found on 'EBCDIC_ZIPCODE'
> View(data)

I don't know how to get past the embedded NULL. I'm a UNIX user, so my
thought (not applicable with your restriction of "pure R") would be to use
"tr" to convert the \0 to spaces, then use the above.
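
For readers who need to stay in pure R, here is a minimal sketch of one possible approach, not a tested solution for these particular files. The warnings above come from readLines(), which read.fwf() uses internally. Since the file has no line breaks, the sketch works on raw bytes instead: it reads the file with readBin(), swaps each 0x00 byte for 0x40 (the space character in EBCDIC; 0x20 is the ASCII space and would not decode to a blank), cuts the bytes into fixed-length records, and decodes them with iconv(). The function name read_ebcdic_records is made up for illustration. The sketch assumes the local iconv supports "IBM500", that every record has the same length, and that the NULs are padding rather than part of binary fields.

```r
# Hypothetical helper (not from the thread): decode a fixed-width EBCDIC
# file that has no line breaks into one character string per record.
# Assumes iconv() on this platform knows `encoding` (see iconvlist()).
read_ebcdic_records <- function(path, record_width, encoding = "IBM500") {
  # Read the whole file as raw bytes; nothing is interpreted as text yet.
  bytes <- readBin(path, what = "raw", n = file.size(path))

  # Replace embedded NUL bytes with the EBCDIC space (0x40).
  bytes[bytes == as.raw(0x00)] <- as.raw(0x40)

  # Drop any trailing partial record, then lay the bytes out so that
  # each matrix column holds exactly one record.
  n_records <- length(bytes) %/% record_width
  m <- matrix(bytes[seq_len(n_records * record_width)], nrow = record_width)

  # iconv() accepts a list of raw vectors and returns a character vector.
  record_list <- lapply(seq_len(ncol(m)), function(j) m[, j])
  iconv(record_list, from = encoding, to = "UTF-8")
}
```

The result could then be passed on as ordinary text, for example read.fwf(textConnection(records), widths = ...), with the field widths taken from the layout file. This version holds the whole file in memory, so a 1 GB input would probably need to be processed in blocks of records instead.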


-- 
He's about as useful as a wax frying pan.

10 to the 12th power microphones = 1 Megaphone

Maranatha! <><
John McKown


Brian Trautman

2015-Feb-05 22:45 UTC


First off, thank you very much for taking a look at this.  I didn't know
"raw=TRUE" would be necessary here.

Unfortunately, I'm stuck with the embedded nulls in the source data at this
point.  If worst comes to worst, does R have a way to do something like --

1.  Read the entire file in as raw binary.
2.  Replace all embedded nulls with spaces.
3.  Output the revised file (as binary) somewhere else.

?

I imagine it'd take a big performance penalty, but at least then I could
proceed with importing the revised file.
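
The three steps above can be sketched in base R roughly as follows. This is an illustration under stated assumptions rather than a confirmed fix: the function name replace_nul_bytes, the chunk size, and the default replacement byte are all made up for the example. The default replacement is 0x40, the EBCDIC space, on the assumption that the output file stays in EBCDIC and that the NULs are padding, not part of binary fields. The file is copied chunk by chunk so that a file of 1 GB or so never has to fit in memory at once.

```r
# Hypothetical helper (not from the thread): copy `src` to `dest`,
# turning every 0x00 byte into `replacement`, one chunk at a time.
replace_nul_bytes <- function(src, dest,
                              replacement = as.raw(0x40),  # EBCDIC space
                              chunk_size = 16 * 1024^2) {  # bytes per read
  input <- file(src, open = "rb")
  on.exit(close(input), add = TRUE)
  output <- file(dest, open = "wb")
  on.exit(close(output), add = TRUE)

  repeat {
    chunk <- readBin(input, what = "raw", n = chunk_size)  # step 1
    if (length(chunk) == 0L) break                         # end of file
    chunk[chunk == as.raw(0x00)] <- replacement            # step 2
    writeBin(chunk, output)                                # step 3
  }
  invisible(dest)
}
```

A call might look like replace_nul_bytes("EBCDIC_ZIPCODE", "EBCDIC_ZIPCODE_CLEAN"), where the second file name is only an example. The rewritten file still has no line breaks, so it would still need to be split into records by width before or during import.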

Thanks again!

