Lukas Hejtmanek
2009-Apr-20 18:23 UTC
[Lustre-discuss] Lustre 1.8.0.50 + Xen + kernel 2.6.22.17
Hello,

I'm using Lustre 1.8.0.50 on both the server and the clients. The server runs a native 2.6.22.19 kernel and has 1 MDS and 3 OSDs (all on one server). The client runs as a DomU under Xen with the SUSE 2.6.22.17 kernel. The client is patchless.

I have an application that reads TIFF images stored on the Lustre fs. libtiff uses mmap on the TIFFs to read them. Unfortunately, I get bus errors (SIGBUS) from libtiff.

The core looks like this:

#1 0x00002b7d5ff825a2 in DumpModeDecode (tif=0x58cdd0, buf=0xf7f7f7f5f5f5f6f6 <Address 0xf7f7f7f5f5f5f6f6 out of bounds>, cc=76800, s=2016) at tif_dumpmode.c:85
(gdb) up
#2 0x00002b7d5ff9f745 in TIFFReadEncodedStrip (tif=0x58cdd0, strip=53, buf=0x5923a0, size=76800) at tif_read.c:160

tif and buf are passed to DumpModeDecode directly from TIFFReadEncodedStrip, and cc and size are the same value. While the tif and cc/size values are preserved, buf is obviously corrupted.

I tried building with -fstack-protector-all, but no check fires.

It does not happen if I copy the TIFFs to local disk, or if I disable mmap usage. There is no chance that the TIFFs are being modified in parallel in the background: my application is the only application and the only client accessing the Lustre fs.

Is this a known problem? Is it related only to Xen, or to some bug in the 2.6.22 kernel?

--
Lukáš Hejtmánek
Andreas Dilger
2009-Apr-20 20:42 UTC
[Lustre-discuss] Lustre 1.8.0.50 + Xen + kernel 2.6.22.17
On Apr 20, 2009 20:23 +0200, Lukas Hejtmanek wrote:
> I'm using 1.8.0.50 lustre server + clients. The server is running native
> 2.6.22.19 kernel, it has 1 mds and 3 OSDs (all at one server).

FYI, 1.8.0.50 appears to be the b1_8 development branch. While we are happy to have bug reports on this, you shouldn't expect it to be a heavily tested release.

> I have an application that reads TIFF images stored on the lustre fs. The
> libtiff uses mmap on tiffs to read. Unfortunately, I got Bus Errors (SIGBUS)
> from libtiff.
>
> The core looks like this:
> #1 0x00002b7d5ff825a2 in DumpModeDecode (tif=0x58cdd0, buf=0xf7f7f7f5f5f5f6f6
> <Address 0xf7f7f7f5f5f5f6f6 out of bounds>, cc=76800, s=2016)
> at tif_dumpmode.c:85

This looks like the buffer being passed to the kernel is not aligned properly. The message is fairly clear:

buf=0xf7f7f7f5f5f5f6f6
<Address 0xf7f7f7f5f5f5f6f6 out of bounds>

For mmap IO this should be aligned to a PAGE_SIZE boundary, normally a multiple of 4096 bytes.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
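Editor's illustration, not part of the original message: the page-alignment rule mentioned above applies to the file offset at which a mapping starts. The sketch below is a minimal Python example of that rule only, assuming a regular file; the helper names `page_aligned_window` and `read_via_mmap` are hypothetical and do not come from libtiff or Lustre. It rounds an arbitrary offset down to the mapping granularity and then skips the extra leading bytes.

```python
import mmap


def page_aligned_window(offset, length, granularity=mmap.ALLOCATIONGRANULARITY):
    """Return (map_offset, map_length, delta) for an arbitrary byte range.

    map_offset is `offset` rounded down to a multiple of `granularity`,
    delta is the number of leading bytes to skip inside the mapping, and
    map_length covers delta plus the requested length.
    """
    map_offset = (offset // granularity) * granularity
    delta = offset - map_offset
    return map_offset, length + delta, delta


def read_via_mmap(path, offset, length):
    """Read `length` bytes at `offset` through a read-only, aligned mapping.

    The requested range must lie inside the file; otherwise mmap raises.
    """
    map_offset, map_length, delta = page_aligned_window(offset, length)
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), map_length,
                       access=mmap.ACCESS_READ, offset=map_offset) as m:
            return m[delta:delta + length]


# With a 4096-byte granularity, offset 5000 maps from 4096 and skips 904 bytes.
print(page_aligned_window(5000, 100, 4096))  # (4096, 1004, 904)
```

The alignment concerns the mapping itself; an ordinary destination buffer such as the buf=0x5923a0 seen in frame #2 is not subject to it.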
Lukas Hejtmanek
2009-Apr-20 22:04 UTC
[Lustre-discuss] Lustre 1.8.0.50 + Xen + kernel 2.6.22.17
On Mon, Apr 20, 2009 at 02:42:40PM -0600, Andreas Dilger wrote:
> > The core looks like this:
> > #1 0x00002b7d5ff825a2 in DumpModeDecode (tif=0x58cdd0, buf=0xf7f7f7f5f5f5f6f6
> > <Address 0xf7f7f7f5f5f5f6f6 out of bounds>, cc=76800, s=2016)
> > at tif_dumpmode.c:85
>
> This looks like the buffer being passed to the kernel is not aligned
> properly. The message is fairly clear:
>
> buf=0xf7f7f7f5f5f5f6f6
> <Address 0xf7f7f7f5f5f5f6f6 out of bounds>
>
> For mmap IO this should be aligned to a PAGE_SIZE boundary, normally
> a multiple of 4096 bytes.

That is a misunderstanding. The buffer had the valid value buf=0x5923a0, but it was asynchronously modified by someone to the value buf=0xf7f7f7f5f5f5f6f6.

So what I see is that some memory corruption happens when using mmap with Lustre.

--
Lukáš Hejtmánek
Oleg Drokin
2009-May-11 14:59 UTC
[Lustre-discuss] Lustre 1.8.0.50 + Xen + kernel 2.6.22.17
Hello!

On Apr 20, 2009, at 6:04 PM, Lukas Hejtmanek wrote:
> On Mon, Apr 20, 2009 at 02:42:40PM -0600, Andreas Dilger wrote:
>>> The core looks like this:
>>> #1 0x00002b7d5ff825a2 in DumpModeDecode (tif=0x58cdd0,
>>> buf=0xf7f7f7f5f5f5f6f6
>>> <Address 0xf7f7f7f5f5f5f6f6 out of bounds>, cc=76800, s=2016)
>>> at tif_dumpmode.c:85
>> This looks like the buffer being passed to the kernel is not aligned
>> properly. The message is fairly clear:
>> buf=0xf7f7f7f5f5f5f6f6
>> <Address 0xf7f7f7f5f5f5f6f6 out of bounds>
>> For mmap IO this should be aligned to a PAGE_SIZE boundary, normally
>> a multiple of 4096 bytes.
> the buffer had the valid value buf=0x5923a0 but it was
> asynchronously modified
> by someone to the value buf=0xf7f7f7f5f5f5f6f6.

Actually sometimes gdb prints nonsense values like this in output for some reason that is beyond me, even though the buffer is valid.

> so, what I see, there happens some memory corruption if using mmap
> with lustre.

Do you happen to have a simple reproducer that we can run and reproduce the issue? Please create a bug in our bugzilla with the reproducer attached.

Thanks for the report.

Bye,
Oleg
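Editor's illustration, not part of the original message: one possible starting point for the kind of reproducer requested above is a script that reads the same files both through read() and through an mmap mapping and compares checksums. This is a sketch under assumptions, not a confirmed reproducer: the function names are hypothetical, and because both paths usually share the page cache it may report nothing even on an affected client. The report above also concerned a corrupted pointer in process memory, which a checksum comparison would not catch directly; a SIGBUS during the mmap pass would simply kill the script.

```python
import hashlib
import mmap
import os

CHUNK = 1 << 20  # hash the data in 1 MiB pieces


def digest_via_read(path):
    """SHA-256 of the file contents obtained with ordinary read() calls."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            h.update(block)
    return h.hexdigest()


def digest_via_mmap(path):
    """SHA-256 of the file contents obtained through a read-only mapping."""
    h = hashlib.sha256()
    if os.path.getsize(path) == 0:
        return h.hexdigest()  # an empty file cannot be mapped
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            for start in range(0, len(m), CHUNK):
                h.update(m[start:start + CHUNK])
    return h.hexdigest()


def find_mismatches(paths, rounds=3):
    """Return (round, path) pairs where the two read paths disagree."""
    bad = []
    for round_no in range(rounds):
        for path in paths:
            if digest_via_read(path) != digest_via_mmap(path):
                bad.append((round_no, path))
    return bad
```

It could be pointed at the TIFF directory on the Lustre mount, for example with find_mismatches(glob.glob("/mnt/lustre/images/*.tif")), where the path is a made-up placeholder.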
Lukas Hejtmanek
2009-May-14 09:31 UTC
[Lustre-discuss] Lustre 1.8.0.50 + Xen + kernel 2.6.22.17
On Mon, May 11, 2009 at 10:59:00AM -0400, Oleg Drokin wrote:
> Do you happen to have a simple reproducer that we can run and reproduce
> the issue?
> Please create a bug in our bugzilla with the reproducer attached.

Unfortunately, I do not have a simple test case; it happens randomly in a distributed environment. And because it is a production environment, I had to revert to the stable 1.6.x version. I thought that 1.8.x was also stable.

--
Lukáš Hejtmánek