thr3ads.net - Lustre discuss - [Lustre-discuss] Lustre 1.8.0.50 + Xen + kernel 2.6.22.17 [Apr 2009]

Lukas Hejtmanek

2009-Apr-20 18:23 UTC

[Lustre-discuss] Lustre 1.8.0.50 + Xen + kernel 2.6.22.17

Hello,

I'm using Lustre 1.8.0.50 on the server and the clients. The server runs a
native 2.6.22.19 kernel and hosts 1 MDS and 3 OSDs (all on one machine).

The client runs as a DomU under Xen with the SUSE 2.6.22.17 kernel. The client
is patchless.

I have an application that reads TIFF images stored on the Lustre fs. libtiff
uses mmap to read the TIFFs. Unfortunately, I get bus errors (SIGBUS) from
libtiff.

The core looks like this:
#1  0x00002b7d5ff825a2 in DumpModeDecode (tif=0x58cdd0, buf=0xf7f7f7f5f5f5f6f6 <Address 0xf7f7f7f5f5f5f6f6 out of bounds>, cc=76800, s=2016)
    at tif_dumpmode.c:85
(gdb) up
#2  0x00002b7d5ff9f745 in TIFFReadEncodedStrip (tif=0x58cdd0, strip=53, buf=0x5923a0, size=76800) at tif_read.c:160

The tif and buf arguments are passed to DumpModeDecode directly from
TIFFReadEncodedStrip, and cc and size are the same value. While tif and
cc/size are preserved, buf is obviously corrupted. I tried
-fstack-protector-all, but no check fires.

It does not happen if I copy the TIFFs to local disk or if I disable mmap
usage.

There is no chance that the TIFFs are modified in parallel in background. My
application is the only application and the only client accessing the lustre
fs.

Is this a known problem? Is it related only to Xen, or to some bug in the
2.6.22 kernel?

-- 
Lukáš Hejtmánek

Andreas Dilger

2009-Apr-20 20:42 UTC

On Apr 20, 2009  20:23 +0200, Lukas Hejtmanek wrote:
> I'm using 1.8.0.50 lustre server + clients. The server is running native
> 2.6.22.19 kernel, it has 1 mds and 3 OSDs (all at one server).

FYI, 1.8.0.50 appears to be the b1_8 development branch.  While we are
happy to have bug reports on this, you shouldn't expect it to be a
heavily tested release.

> I have an application that reads TIFF images stored on the lustre fs. The
> libtiff uses mmap on tiffs to read. Unfortunately, I got Bus Errors (SIGBUS)
> from libtiff.
> 
> The core looks like this:
> #1  0x00002b7d5ff825a2 in DumpModeDecode (tif=0x58cdd0, buf=0xf7f7f7f5f5f5f6f6
> <Address 0xf7f7f7f5f5f5f6f6 out of bounds>, cc=76800, s=2016)
>     at tif_dumpmode.c:85

This looks like the buffer being passed to the kernel is not aligned
properly.   The message is fairly clear:

	buf=0xf7f7f7f5f5f5f6f6
	<Address 0xf7f7f7f5f5f5f6f6 out of bounds>

For mmap IO this should be aligned to a PAGE_SIZE boundary, normally
a multiple of 4096 bytes.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
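
The page-alignment rule mentioned above can be illustrated with a minimal
sketch. It assumes a power-of-two page size passed in by the caller (on a live
system this would typically come from sysconf(_SC_PAGESIZE)); the helper names
is_page_aligned and page_floor are hypothetical and are not part of Lustre or
libtiff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if addr is a multiple of page_size.
 * page_size must be a non-zero power of two, e.g. 4096. */
bool is_page_aligned(uint64_t addr, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (addr & (page_size - 1)) == 0;
}

/* Rounds a file offset down to the start of the page that contains it,
 * which is the form the offset argument of mmap() has to take. */
uint64_t page_floor(uint64_t offset, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return offset & ~(page_size - 1);
}
```

With a 4096-byte page, is_page_aligned(0xf7f7f7f5f5f5f6f6, 4096) is false. The
alignment requirement itself applies to the offset argument of mmap() and to
the addresses that mmap() returns.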

Lukas Hejtmanek

2009-Apr-20 22:04 UTC

On Mon, Apr 20, 2009 at 02:42:40PM -0600, Andreas Dilger wrote:
> > The core looks like this:
> > #1  0x00002b7d5ff825a2 in DumpModeDecode (tif=0x58cdd0, buf=0xf7f7f7f5f5f5f6f6
> > <Address 0xf7f7f7f5f5f5f6f6 out of bounds>, cc=76800, s=2016)
> >     at tif_dumpmode.c:85
> 
> This looks like the buffer being passed to the kernel is not aligned
> properly.   The message is fairly clear:
> 
> 	buf=0xf7f7f7f5f5f5f6f6
> 	<Address 0xf7f7f7f5f5f5f6f6 out of bounds>
> 
> For mmap IO this should be aligned to a PAGE_SIZE boundary, normally
> a multiple of 4096 bytes.

That is a misunderstanding.

The buffer had the valid value buf=0x5923a0, but something asynchronously
modified it to buf=0xf7f7f7f5f5f5f6f6.

So, as far as I can see, some memory corruption happens when mmap is used with
Lustre.

-- 
Lukáš Hejtmánek

Oleg Drokin

2009-May-11 14:59 UTC

Hello!

On Apr 20, 2009, at 6:04 PM, Lukas Hejtmanek wrote:
> On Mon, Apr 20, 2009 at 02:42:40PM -0600, Andreas Dilger wrote:
>>> The core looks like this:
>>> #1  0x00002b7d5ff825a2 in DumpModeDecode (tif=0x58cdd0, buf=0xf7f7f7f5f5f5f6f6
>>> <Address 0xf7f7f7f5f5f5f6f6 out of bounds>, cc=76800, s=2016)
>>>    at tif_dumpmode.c:85
>> This looks like the buffer being passed to the kernel is not aligned
>> properly.   The message is fairly clear:
>> 	buf=0xf7f7f7f5f5f5f6f6
>> 	<Address 0xf7f7f7f5f5f5f6f6 out of bounds>
>> For mmap IO this should be aligned to a PAGE_SIZE boundary, normally
>> a multiple of 4096 bytes.
> the buffer had the valid value buf=0x5923a0 but it was asynchronously modified
> by someone to the value buf=0xf7f7f7f5f5f5f6f6.

Actually sometimes gdb prints nonsense values like this in output for some
reason that is beyond me, even though the buffer is valid.

> so, what I see, there happens some memory corruption if using mmap with lustre.

Do you happen to have a simple reproducer that we can run and reproduce the
issue? Please create a bug in our bugzilla with the reproducer attached.

Thanks for the report.

Bye,
     Oleg
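
A possible starting point for the kind of reproducer requested above is a
consistency check that reads the same file once through mmap() and once
through read() and compares the two views. This is only a sketch under stated
assumptions: compare_mmap_with_read is a hypothetical helper, it does not go
through libtiff, and it is not known to trigger the failure described in this
thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Compares the contents of `path` as seen through mmap() with the contents
 * returned by read().
 * Returns 0 if both views match, 1 on mismatch, and -1 on any error
 * (including an empty file, which cannot be mapped). */
int compare_mmap_with_read(const char *path)
{
    assert(path != NULL);

    int result = -1;
    size_t done = 0;
    struct stat st;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }
    size_t len = (size_t)st.st_size;

    unsigned char *buf = malloc(len);
    unsigned char *map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (buf == NULL || map == MAP_FAILED)
        goto out;

    /* read() may return short counts, so loop until the file is consumed. */
    while (done < len) {
        ssize_t n = read(fd, buf + done, len - done);
        if (n <= 0)
            goto out;
        done += (size_t)n;
    }

    /* Touching the mapping here faults the pages in; a failed page-in
     * would show up as SIGBUS rather than as a return value. */
    result = memcmp(map, buf, len) == 0 ? 0 : 1;

out:
    if (map != MAP_FAILED)
        munmap(map, len);
    free(buf);
    close(fd);
    return result;
}
```

Running such a check in a loop over the TIFF files on the Lustre mount, from
the Xen DomU client, would show whether the mmap view itself is ever wrong or
whether the problem only appears inside the application.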

Lukas Hejtmanek

2009-May-14 09:31 UTC

On Mon, May 11, 2009 at 10:59:00AM -0400, Oleg Drokin wrote:
> Do you happen to have a simple reproducer that we can run and reproduce
> the issue?
> Please create a bug in our bugzilla with the reproducer attached.

Unfortunately, I do not have a simple test case; it happens randomly in a
distributed environment. And because it is a production environment, I had to
revert to the stable 1.6.x version. I thought that 1.8.x was also stable.

-- 
Lukáš Hejtmánek
