We have been fortunate to hang onto one of our summer interns for part-time work on weekends during the current school year. One of the intern's jobs is to load documents and data which are then processed. The documents are .txt, .docx, and .pdf files. The data files are raw sensor outputs, usually captured using ADCs, mostly with eight-bit precision. All files are loaded or moved from one machine to another with sftp.

The intern noticed right away that the documents transfer perfectly from our PPC and SPARC machines to our Intel/CentOS platforms. The raw data files, not so much. There is always an endian (thanks, Gulliver) issue, which we assume is due to the bytes of data being packed into 32-bit words somewhere on the big-endian systems. It is not entirely clear why the document files do not have this issue. If there is a known principle behind these observations, we would very much appreciate any information that can be shared.
On Mon, 30 Oct 2017 17:07:31 +0000 (UTC) Chris Olson <chris_e_olson at yahoo.com> wrote:

> The intern noticed right away that the documents transfer perfectly
> from our PPC and SPARC machines to our Intel/CentOS platforms. The
> raw data files, not so much. There is always an endian (thanks,
> Gulliver) issue [...]

Transferring a file will not change anything; it will be bit-wise identical. However, the data in the file may be stored in little- or big-endian byte order, and a file format may or may not carry metadata indicating which. That is, some files will read differently on different architectures and some will be immune (because they use more sophisticated abstractions). So it is not surprising that your raw files have problems.

If you want to prove this to yourself, simply md5sum/sha1sum/etc. the files on both sides and compare the checksums.

/Peter K
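To make that point concrete, here is a minimal C sketch (an illustration added for this write-up, not part of the thread): the same four bytes on disk decode to two different 32-bit values depending on which byte order the reader assumes, even though the file is bit-for-bit identical after the sftp transfer.

#include <stdio.h>
#include <stdint.h>

/* The same four bytes, exactly as they would sit in a file that
 * survived an sftp transfer unchanged. */
static const uint8_t bytes[4] = {0x00, 0x00, 0x01, 0x02};

int main(void)
{
    /* Interpret the bytes as one 32-bit word under each convention. */
    uint32_t as_big    = ((uint32_t)bytes[0] << 24) | ((uint32_t)bytes[1] << 16) |
                         ((uint32_t)bytes[2] << 8)  |  (uint32_t)bytes[3];
    uint32_t as_little = ((uint32_t)bytes[3] << 24) | ((uint32_t)bytes[2] << 16) |
                         ((uint32_t)bytes[1] << 8)  |  (uint32_t)bytes[0];

    printf("big-endian reading:    %u\n", (unsigned)as_big);     /* 258 */
    printf("little-endian reading: %u\n", (unsigned)as_little);  /* 33619968 */
    return 0;
}

Because the byte order is spelled out with explicit shifts rather than taken from the host, this prints the same two lines whether it is compiled on a PPC, SPARC, or Intel box.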
On 30 October 2017 at 13:07, Chris Olson <chris_e_olson at yahoo.com> wrote:

> The data files are raw sensor outputs, usually captured using ADCs,
> mostly with eight-bit precision. [...] It is not entirely clear why
> the document files do not have this issue.

Text files that are ASCII are 7- or 8-bit, one byte per character, so they don't tend to have endian problems on architectures of 8 bits and wider. [I expect a 4-bit architecture would have problems.] Multi-byte Unicode encodings can have endianness problems, but usually only when a writer ignores the standard and assumes that writing data works the same way it did with ASCII (mainly because few people ever dealt with 4-bit computers).

docx and pdf are written to a fixed-endian specification, so even if a file is built/written on a big-endian system, the data itself is formatted as the specification requires.

Raw data files are usually endian-dependent if they are 'raw' memory dumps or similar. Some 'data' formats that are mostly raw still work across systems because they are written to a standard: both the little-endian and the big-endian side expect the data to be written in 'big' or 'little' endian order and read it back in as such.

-- 
Stephen J Smoogen.
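A short sketch of the distinction Stephen is drawing (again an added illustration; the output file name is made up): writing single bytes produces the same file on any host, while dumping a packed 32-bit word with fwrite() produces a file whose layout depends on the host's byte order.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Eight-bit samples: one byte each, so the file contents are the
     * same no matter which machine writes them. */
    uint8_t samples[4] = {10, 20, 30, 40};

    /* The same samples packed into a 32-bit word: fwrite() dumps the
     * word in the host's native byte order, so a big-endian writer and
     * a little-endian writer produce files with the bytes reversed. */
    uint32_t packed = ((uint32_t)samples[0] << 24) |
                      ((uint32_t)samples[1] << 16) |
                      ((uint32_t)samples[2] << 8)  |
                       (uint32_t)samples[3];

    FILE *fp = fopen("raw.bin", "wb");     /* placeholder file name */
    if (!fp)
        return 1;
    fwrite(samples, sizeof samples[0], 4, fp);  /* portable: plain byte stream */
    fwrite(&packed, sizeof packed, 1, fp);      /* not portable: host byte order */
    fclose(fp);
    return 0;
}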
On 10/30/2017 10:07 AM, Chris Olson wrote:

> All files are loaded or moved from one machine to another with sftp.
>
> The intern noticed right away that the documents transfer perfectly
> from our PPC and SPARC machines to our Intel/CentOS platforms. The
> raw data files, not so much. There is always an endian (thanks,
> Gulliver) issue, which we assume is due to the bytes of data being
> packed into 32-bit words somewhere on the big-endian systems.

It's unlikely that copying the files is causing the problem you observe. As Peter suggested, you can use "md5sum" on the source and destination hosts to demonstrate that the files are not being modified in transmission.

However, endianness can be a problem if the applications you use naively save data to a file in their native byte order and also read it back in native byte order. In situations like that, a big-endian system will save data that the same application will fail to read when it is run on a little-endian system.

If this is an application that you've developed in-house, you should be using htonl() to convert your 32-bit values to network byte order before writing them to the data file, and ntohl() to convert the 32-bit values you read from data files back to the native host byte order.
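A minimal sketch of the htonl()/ntohl() approach Gordon describes, assuming an in-house C program; the file name "sensor.dat" is just a placeholder for this example. Both functions are declared in <arpa/inet.h>.

#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>   /* htonl(), ntohl() */

/* Write one 32-bit sample in network (big-endian) byte order. */
static int write_sample(FILE *fp, uint32_t sample)
{
    uint32_t wire = htonl(sample);              /* host -> network order */
    return fwrite(&wire, sizeof wire, 1, fp) == 1 ? 0 : -1;
}

/* Read one 32-bit sample back into host byte order. */
static int read_sample(FILE *fp, uint32_t *sample)
{
    uint32_t wire;
    if (fread(&wire, sizeof wire, 1, fp) != 1)
        return -1;
    *sample = ntohl(wire);                      /* network -> host order */
    return 0;
}

int main(void)
{
    FILE *fp = fopen("sensor.dat", "wb");       /* placeholder file name */
    if (!fp)
        return 1;
    write_sample(fp, 258);
    fclose(fp);

    fp = fopen("sensor.dat", "rb");
    if (!fp)
        return 1;
    uint32_t value;
    if (read_sample(fp, &value) == 0)
        printf("read back: %u\n", (unsigned)value);  /* 258 on any host */
    fclose(fp);
    return 0;
}

Because the bytes on disk are always in network order, a file written on the PPC or SPARC side reads back correctly on the Intel/CentOS side, and vice versa.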
On Oct 31, 2017, at 12:47 PM, Gordon Messmer <gordon.messmer at gmail.com> wrote:

> If this is an application that you've developed in-house, you should
> be using htonl() to convert your 32-bit values to network byte order

...or its superset, XDR [1]

...or use a text format (XML, JSON, YAML, SQL, CSV...)

...or use a binary serialization of same (BSON, CBOR, Binary XML...)

...or use FlatBuffers [2]

...or use ASN.1 [3]

...or, or, or. This problem is *solved*. The only difficult part is choosing which of the many available solutions to use.

[1]: https://en.wikipedia.org/wiki/External_Data_Representation
[2]: https://en.wikipedia.org/wiki/FlatBuffers
[3]: https://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One
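For the XDR option, a hedged sketch (not from the thread) using the classic Sun RPC routines from <rpc/xdr.h>; on newer distributions these live in libtirpc and may need its include path and -ltirpc. The file name is again a placeholder.

#include <stdio.h>
#include <rpc/xdr.h>   /* xdrstdio_create(), xdr_u_int(); may require libtirpc */

int main(void)
{
    XDR xdrs;
    unsigned int sample = 258;

    /* Encode: XDR always writes big-endian, regardless of the host. */
    FILE *fp = fopen("sensor.xdr", "wb");   /* placeholder file name */
    if (!fp)
        return 1;
    xdrstdio_create(&xdrs, fp, XDR_ENCODE);
    if (!xdr_u_int(&xdrs, &sample))
        return 1;
    xdr_destroy(&xdrs);
    fclose(fp);

    /* Decode: the same filter routine reads the value back on any host. */
    unsigned int readback = 0;
    fp = fopen("sensor.xdr", "rb");
    if (!fp)
        return 1;
    xdrstdio_create(&xdrs, fp, XDR_DECODE);
    if (!xdr_u_int(&xdrs, &readback))
        return 1;
    xdr_destroy(&xdrs);
    fclose(fp);

    printf("read back: %u\n", readback);    /* 258 on any host */
    return 0;
}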