Mark Millard
2017-Mar-15 04:33 UTC
arm64 fork/swap data corruptions: A ~110 line C program demonstrating an example (Pine64+ 2GB context) [Corrected subject: arm64!]
A single Byte access to a 4K Byte aligned region between the fork and
wait/sleep/swap-out prevents that specific 4K Byte region from having
the (bad) zeros. Sounds like a page sized unit of behavior to me.

Details follow.

On 2017-Mar-14, at 3:28 PM, Mark Millard <markmi at dsl-only.net> wrote:

> [test_check() between the fork and the wait/sleep prevents the
> failure from occurring. Even a small access to the memory at
> that stage prevents the failure. Details follow.]
>
> On 2017-Mar-14, at 11:07 AM, Mark Millard <markmi at dsl-only.net> wrote:
>
>> [This is just a correction to the subject-line text to say arm64
>> instead of amd64.]
>>
>> On 2017-Mar-14, at 12:58 AM, Mark Millard <markmi at dsl-only.net> wrote:
>>
>> [Another correction I'm afraid --about alternative program variations
>> this time.]
>>
>> On 2017-Mar-13, at 11:52 PM, Mark Millard <markmi at dsl-only.net> wrote:
>>
>>> I'm still at a loss about how to figure out what stages are messed
>>> up. (Memory coherency? Some memory not swapped out? Bad data swapped
>>> out? Wrong data swapped in?)
>>>
>>> But at least I've found a much smaller/simpler example that
>>> demonstrates a problem in my Pine64+ 2GB context.
>>>
>>> The Pine64+ 2GB is the only amd64 context that I have access to.
>>
>> Someday I'll learn to type arm64 the first time instead of amd64.
>>
>>> The following program fails its check for data
>>> having its expected byte pattern in dynamically
>>> allocated memory after a fork/swap-out/swap-in
>>> sequence.
>>>
>>> I'll note that the program sleeps for 60s after
>>> forking to give time to do something else to
>>> cause the parent and child processes to swap
>>> out (RES=0 as seen in top).
>>
>> The following about the extra test_check() was
>> wrong.
>>
>>> Note the source code line:
>>>
>>> // test_check(); // Adding this line prevents failure.
>>>
>>> It seems that accessing the region contents before forking
>>> and swapping avoids the problem. But there is a problem
>>> if the region was only written-to before the fork/swap.
>
> There is a place that if a test_check call is put then the
> problem does not happen at any stage: I tried putting a
> call between the fork and the later wait/sleep code:

I changed the byte sequence patterns to avoid zero values since the bad
values are zeros:

static value_type value(size_t v)
    { return (value_type)((v&0xFEu)|0x1u); }
    // value now avoids the zero value since the failures
    // are zeros.

With that I can then test accurately which bytes have bad values and
which do not. I also changed to:

void partial_test_check(void) {
    if (value(0u)!=gbl_region.array[0])    raise(SIGABRT);
    if (value(0u)!=(*dyn_region).array[0]) raise(SIGABRT);
}

since previously [0] had a zero value and so I'd used [1].

On this basis I'm now using the below. See the comments tied to the
partial_test_check() calls. (A sketch of the supporting definitions that
these fragments assume appears at the end of this message.)

extern void test_setup(void);         // Sets up the memory byte patterns.
extern void test_check(void);         // Tests the memory byte patterns.
extern void partial_test_check(void); // Tests just [0] of each region
                                      // (gbl_region and dyn_region).

int main(void) {
    test_setup();
    test_check(); // Before fork() [passes]

    pid_t pid = fork();
    int wait_status = 0;

    // After fork; before wait/sleep/swap-out.

    if (0==pid) partial_test_check(); // Even the above is sufficient by
                                      // itself to prevent failure for
                                      // region_size 1u through
                                      // 4u*1024u!
                                      // But 4u*1024u+1u and above fail
                                      // with this access to memory.
                                      // The failing test is of
                                      // (*dyn_region).array[4096u].
                                      // This test never fails here.

    if (0<pid) partial_test_check();  // This never prevents
                                      // later failures (and
                                      // never fails here).

    if (0<pid) { wait(&wait_status); }

    if (-1!=wait_status && 0<=pid) {
        if (0==pid) {
            sleep(60);

            // During this manually force this process to
            // swap out. I use something like:
            // stress -m 1 --vm-bytes 1800M
            // in another shell and ^C'ing it after top
            // shows the swapped status desired. 1800M
            // just happened to work on the Pine64+ 2GB
            // that I was using. I watch with top -PCwaopid .
        }

        test_check(); // After wait/sleep [fails for small-enough region_sizes]
    }
}

> This suggests to me that the small access is forcing one or more things to
> be initialized for memory access that fork is not establishing of itself.
> It appears that if established correctly then the swap-out/swap-in
> sequence would work okay without needing the manual access to the memory.
>
>
> So far via this test I've not seen any evidence of problems with the global
> region but only the dynamically allocated region.
>
> However, the symptoms that started this investigation in a much more
> complicated context had an area of global memory from a .so that ended
> up being zero.
>
> I think that things should be fixed for this simpler context first and
> that further investigation of the sh/su related failures should wait to
> see what things are like after this test case works.

===
Mark Millard
markmi at dsl-only.net
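For reference, the supporting definitions that the fragments above assume
look roughly like the following minimal sketch. The actual ~110 line
program may differ in details (how region_size is chosen, how dyn_region
is allocated), so treat this as illustrative rather than exact:

/* Sketch of the declarations referenced by value(), partial_test_check()
 * and main() above; combined with that main() it should compile as-is. */

#include <signal.h>   // raise, SIGABRT
#include <stdlib.h>   // malloc
#include <stddef.h>   // size_t
#include <unistd.h>   // fork, sleep (used by main() above)
#include <sys/wait.h> // wait       (used by main() above)

typedef unsigned char value_type;

// 4u*1024u+1u and larger is the range that still fails when the child
// does the partial_test_check() shown above; 1u..4u*1024u does not.
#define region_size (4u*1024u+1u)

typedef struct { value_type array[region_size]; } region;

static region gbl_region;            // global/static region
static region *volatile dyn_region;  // dynamically allocated region

static value_type value(size_t v)    // same definition as above
    { return (value_type)((v&0xFEu)|0x1u); }

void test_setup(void) {              // write the byte patterns
    dyn_region = malloc(sizeof(region));
    if (!dyn_region) raise(SIGABRT);
    for (size_t i = 0u; i < region_size; i++) {
        gbl_region.array[i]    = value(i);
        (*dyn_region).array[i] = value(i);
    }
}

void test_check(void) {              // verify every byte of both regions
    for (size_t i = 0u; i < region_size; i++) {
        if (value(i) != gbl_region.array[i])    raise(SIGABRT);
        if (value(i) != (*dyn_region).array[i]) raise(SIGABRT);
    }
}

void partial_test_check(void) {      // same definition as above
    if (value(0u) != gbl_region.array[0])    raise(SIGABRT);
    if (value(0u) != (*dyn_region).array[0]) raise(SIGABRT);
}

region_size is the only thing that needs to vary between the failing and
non-failing cases described in the comments above.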
Mark Millard
2017-Mar-19 00:53 UTC
arm64 fork/swap data corruptions: A ~110 line C program demonstrating an example (Pine64+ 2GB context) [Corrected subject: arm64!]
A new, significant discovery follows. . .

While checking out use of procstat -v I ran into the following common
property for the 3 programs that I looked at:

A) My small test program that fails for a dynamically allocated space.

B) sh reporting Failed assertion: "tsd_booted".

C) su reporting Failed assertion: "tsd_booted".

Here are example addresses from the area of incorrectly zeroed memory
(A then B then C):

(lldb) print dyn_region
(region *volatile) $0 = 0x0000000040616000

(lldb) print &__je_tsd_booted
(bool *) $0 = 0x0000000040618520

(lldb) print &__je_tsd_booted
(bool *) $0 = 0x0000000040618520

The first is from dynamic allocation ending up in the area. The other
two are from libc.so.7 globals/statics ending up in the general area.

It looks like something is trashing a specific memory area for some
reason, rather independently of what the program specifics are.

Other notes:

At least for my small program showing failure, being explicit about the
combined conditions for failure for my test program. . .

Both tcache enabled and allocations fitting in SMALL_MAXCLASS are
required in order to make the program fail. Note:

(lldb) print __je_tcache_maxclass
(size_t) $0 = 32768

which is larger than SMALL_MAXCLASS. I've not observed failures for
sizes above SMALL_MAXCLASS but not exceeding __je_tcache_maxclass. Thus
tcache use by itself does not seem sufficient for my program to get
corruption of its dynamically allocated memory: the small allocation
size also matters. (A sketch of how these two conditions can be checked
at run time appears at the end of this message.)

Be warned that I cannot eliminate the possibility that the trashing
changed what region of memory it trashed for larger allocations or when
tcache is disabled.

===
Mark Millard
markmi at dsl-only.net
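For checking the two conditions (tcache enabled, allocation small enough)
at run time rather than via lldb, a minimal sketch is below. It assumes
FreeBSD's jemalloc mallctl() interface from <malloc_np.h> and the
"opt.tcache", "arenas.nbins", and "arenas.bin.<i>.size" control names;
it is a separate illustration, not part of the failing test program:

#include <malloc_np.h> // FreeBSD: mallctl() for the jemalloc in libc
#include <stdbool.h>
#include <stdio.h>

int main(void) {
    // Is the thread cache (tcache) enabled for this run?
    bool tcache_enabled = false;
    size_t len = sizeof(tcache_enabled);
    if (0 == mallctl("opt.tcache", &tcache_enabled, &len, NULL, 0))
        printf("opt.tcache: %s\n", tcache_enabled ? "true" : "false");

    // Size of the largest "small" bin, for comparison with the
    // region_size (allocation size) used by the test program.
    unsigned nbins = 0;
    len = sizeof(nbins);
    if (0 == mallctl("arenas.nbins", &nbins, &len, NULL, 0) && 0 < nbins) {
        char name[64];
        size_t bin_size = 0;
        snprintf(name, sizeof(name), "arenas.bin.%u.size", nbins - 1u);
        len = sizeof(bin_size);
        if (0 == mallctl(name, &bin_size, &len, NULL, 0))
            printf("largest small size class: %zu bytes\n", bin_size);
    }
    return 0;
}

The last small bin's size reported this way should correspond to the
SMALL_MAXCLASS figure, and __je_tcache_maxclass (32768 above) is
expected to be larger.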