Okay, this is my first time messing with lustre, and I've hit my first real problem that hasn't been discussed (at least I haven't found it searching bugzilla and the discuss/devel mailing lists). So I'd like to post my results.

I'm using the latest xen kernel on ia64 (2.6.16.33) and attempting to patch lustre into that. I've been using lustre 1.4.7.3 for a while with the sles10 series of patches. The sles10 series is close, but the vfs_intent patch doesn't apply quite right: it is missing a define in fs.h. Once that define is added, everything works fine.

===
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 872042f..95487ec 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -265,6 +265,7 @@ typedef void (dio_iodone_t)(struct kiocb
 #define ATTR_KILL_SUID 2048
 #define ATTR_KILL_SGID 4096
 #define ATTR_FILE 8192
+#define ATTR_NO_BLOCK 32768 /* Return EAGAIN and don't block on long truncates */
 
 /*
  * This is the Inode Attributes structure, used for notify_change(). It
===

Also, there were tons of "#if <insert define here>" tests on configure script defines that should be done as #ifdef, not #if. I know a lot of these were fixed, but there are more...

Also, on __WORDSIZE vs. BITS_PER_LONG: it'd be really nice if everything used one or the other, not both. I got lots of "__WORDSIZE not defined" warnings from lustre_user.h, which happens to be included by tons of other files. So every compiled source comes up with this warning, and that makes me nervous.

And finally, the d_instantiate_unique check in configure came up wrong for me. 2.6.16.33's d_instantiate_unique is the same as in 2.6.15.7; 2.6.17.14 is the one that has the fix. So the check in lustre/llite/namei.c really should test that the version is strictly less than 2.6.17, not 2.6.16, and the configure check should also be fixed so that it works. I'd give you a patch to do that, but I haven't looked into how the check is done.
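To make the two preprocessor points above concrete, here is a minimal sketch that builds outside the kernel. It is not taken from the lustre tree: SKETCH_KERNEL_VERSION, sketch_needs_old_d_instantiate_unique, HAVE_SKETCH_FEATURE and SKETCH_FEATURE_ENABLED are hypothetical names made up for illustration, and the actual check in lustre/llite/namei.c may be structured differently.

```c
/* Same encoding as the kernel's KERNEL_VERSION() macro in <linux/version.h>,
 * redefined under a hypothetical name so this sketch builds in userspace. */
#define SKETCH_KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Hypothetical helper: nonzero when the given kernel predates 2.6.17 and
 * therefore still has the older d_instantiate_unique behaviour.
 * The bound is 2.6.17, so every 2.6.16.x kernel takes the old path. */
static int sketch_needs_old_d_instantiate_unique(int major, int minor, int rev)
{
    return SKETCH_KERNEL_VERSION(major, minor, rev) <
           SKETCH_KERNEL_VERSION(2, 6, 17);
}

/* HAVE_SKETCH_FEATURE stands in for a configure-generated define (assumed
 * name). Such a macro is either defined or absent, so test it with #ifdef.
 * Writing "#if HAVE_SKETCH_FEATURE" treats an absent macro as 0 and
 * triggers a warning when gcc is run with -Wundef. */
#ifdef HAVE_SKETCH_FEATURE
#define SKETCH_FEATURE_ENABLED 1
#else
#define SKETCH_FEATURE_ENABLED 0
#endif
```

Inside the kernel tree the same comparison would be written as LINUX_VERSION_CODE < KERNEL_VERSION(2,6,17). As far as I can tell, the stable suffix (the .33 in 2.6.16.33) is not part of LINUX_VERSION_CODE, so 2.6.16.33 is passed here as (2, 6, 16).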
After I fixed all the warnings and all the patches applied successfully, I was able to start lustre and it worked fine. A distributed filesystem seems rather ridiculous on one box, but it worked okay until I tried to put data into it.

I downloaded a 250M iso off the web, pulled down its md5sum, and verified that the download was complete. Then I copied the file into the lustre filesystem and ran md5sum on the copy there. It came up wrong. I ran it a second time and it came up different from the first and still wrong. I didn't notice anything from lustre in dmesg while this was happening.

Then, when I tried to bring down the filesystem on the client, the mds crashed and this error came out in dmesg:

[57940.076201] LustreError: 3891:0:(socklnd.c:1287:ksocknal_close_conn_locked()) ASSERTION(peer->ksnp_error == 0) failed
[57940.076235] LustreError: 3891:0:(tracefile.c:419:libcfs_assertion_failed()) LBUG
[57940.076246] Lustre: 3891:0:(linux-debug.c:156:libcfs_debug_dumpstack()) showing stack for process 3891
[57940.076255] socknal_sd00 R running task 0 3891 1 3892 3767 (L-TLB)
[57940.076284]
[57940.076286] Call Trace:
[57940.076350] scheduling while atomic: socknal_sd00/0x00000100/3891
[57940.076384]
[57940.076386] Call Trace:
[57940.076410] [<a00000010001b490>] show_stack+0x50/0xa0
[57940.076413] sp=e00000000afb7b60 bsp=e00000000afb1290
[57940.076433] [<a00000010001b510>] dump_stack+0x30/0x60
[57940.076436] sp=e00000000afb7d30 bsp=e00000000afb1278
[57940.076464] [<a0000001004e0160>] schedule+0xa0/0x1400
[57940.076466] sp=e00000000afb7d30 bsp=e00000000afb11a8
[57940.076542] [<a0000002002868b0>] libcfs_debug_dumplog+0x2b0/0x320 [libcfs]
[57940.076545] sp=e00000000afb7d30 bsp=e00000000afb1190
[57940.076583] [<a00000020027b670>] lbug_with_loc+0x70/0xe0 [libcfs]
[57940.076586] sp=e00000000afb7d60 bsp=e00000000afb1160
[57940.076622] [<a00000020028c160>] libcfs_assertion_failed+0xa0/0xc0 [libcfs]
[57940.076625] sp=e00000000afb7d60 bsp=e00000000afb1128
[57940.076672] [<a0000002003b51a0>] ksocknal_close_conn_locked+0x80/0x5a0 [ksocklnd]
[57940.076674] sp=e00000000afb7d60 bsp=e00000000afb10c0
[57940.076704] [<a0000002003bc6d0>] ksocknal_close_peer_conns_locked+0x90/0xe0 [ksocklnd]
[57940.076707] sp=e00000000afb7d60 bsp=e00000000afb1080
[57940.076736] [<a0000002003bc790>] ksocknal_close_conn_and_siblings+0x70/0xc0 [ksocklnd]
[57940.076739] sp=e00000000afb7d60 bsp=e00000000afb1040
[57940.076771] [<a0000002003c9e60>] ksocknal_process_receive+0x640/0xae0 [ksocklnd]
[57940.076774] sp=e00000000afb7d60 bsp=e00000000afb1000
[57940.076802] [<a0000002003cacc0>] ksocknal_scheduler+0x4a0/0xe80 [ksocklnd]
[57940.076805] sp=e00000000afb7da0 bsp=e00000000afb0f88
[57940.076824] [<a00000010001d790>] kernel_thread_helper+0x30/0x60
[57940.076826] sp=e00000000afb7e30 bsp=e00000000afb0f60
[57940.076846] [<a0000001000110c0>] start_kernel_thread+0x20/0x40
[57940.076849] sp=e00000000afb7e30 bsp=e00000000afb0f60
[57949.451304] BUG: soft lockup detected on CPU#0!
[57949.451340] Modules linked in: fsfilt_ldiskfs ldiskfs mds lov osc mdc ptlrpc obdclass lvfs ksocklnd lnet libcfs ipv6 autofs4 sunrpc af_packet dm_snapshot dm_zero dm_mirror ext3 mbcache jbd dm_mod ide_disk cmd64x ide_core mptsas mptspi mptfc scsi_transport_fc mptscsih sym53c8xx mptbase sd_mod
[57949.451466]
[57949.451468] Pid: 3896, CPU 0, comm: socknal_reaper
[57949.451480] psr : 00000210081a6010 ifs : 8000000000000004 ip : [<a0000001004e57c1>] Not tainted
[57949.451497] ip is at _write_lock_bh+0x41/0xa0
[57949.451505] unat: 0000000000000000 pfs : 800000000000050e rsc : 000000000000000b
[57949.451513] rnat: 0000000000000000 bsps: 0000000000000000 pr : 0000000000006681
[57949.451521] ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
[57949.451529] csd : 0000000000000000 ssd : 0000000000000000
[57949.451539] b0 : a0000002003bb980 b6 : a000000100348ea0 b7 : a000000100068e30
[57949.451549] f6 : 1003e0000000000000028 f7 : 1003e28f5c28f5c28f5c3
[57949.451557] f8 : 1003e00000000000000fa f9 : 1003e0000000032000000
[57949.451567] f10 : 1003e000000003b9aca00 f11 : 1003ed6bf94d5e57a42bd
[57949.451577] r1 : a00000010092b5a0 r2 : 0000000080000000 r3 : a00000010073cbf0
[57949.451587] r8 : 0000000000000000 r9 : fffffffffff00001 r10 : fffffffffff04c18
[57949.451598] r11 : fffffffffff00000 r12 : e0000000093efdf0 r13 : e0000000093e8000
[57949.451608] r14 : e0000000093e8f10 r15 : 0000000000000100 r16 : e0000000093e8f10
[57949.451617] r17 : 0000000000000000 r18 : 0000000000000001 r19 : 000000003fffff00
[57949.451626] r20 : 0000000000000100 r21 : 0000000000000000 r22 : ffffffffffff0048
[57949.451634] r23 : e0000000093e8f10 r24 : 0000000000000000 r25 : 0000000000000001
[57949.451644] r26 : a0000002003e0ec8 r27 : 0000000000000000 r28 : a0000002003e0e90
[57949.451654] r29 : 0000000080000000 r30 : 0000000000000000 r31 : e00000000d433700
[57949.451668]
[57949.451670] Call Trace:
[57949.451688] [<a00000010001b490>] show_stack+0x50/0xa0
[57949.451691] sp=e0000000093efa40 bsp=e0000000093e9288
[57949.451722] [<a00000010001bd60>] show_regs+0x820/0x840
[57949.451724] sp=e0000000093efc10 bsp=e0000000093e9240
[57949.451751] [<a0000001000d5e90>] softlockup_tick+0x150/0x180
[57949.451753] sp=e0000000093efc10 bsp=e0000000093e9210
[57949.451776] [<a0000001000a2f30>] do_timer+0x990/0x9c0
[57949.451778] sp=e0000000093efc20 bsp=e0000000093e91c0
[57949.451802] [<a0000001000411e0>] timer_interrupt+0x200/0x3c0
[57949.451804] sp=e0000000093efc20 bsp=e0000000093e9180
[57949.451824] [<a0000001000d65f0>] handle_IRQ_event+0x170/0x240
[57949.451827] sp=e0000000093efc20 bsp=e0000000093e9140
[57949.451846] [<a0000001000d6980>] __do_IRQ+0x2c0/0x3e0
[57949.451849] sp=e0000000093efc20 bsp=e0000000093e90f0
[57949.451875] [<a000000100341de0>] evtchn_do_upcall+0x160/0x240
[57949.451877] sp=e0000000093efc20 bsp=e0000000093e9068
[57949.451899] [<a000000100068ce0>] xen_event_callback+0x3a0/0x3e0
[57949.451902] sp=e0000000093efc20 bsp=e0000000093e9068

Any help on this would be appreciated. I'm trying to build 1.4.8 now to see if any of the old 1.4.7.3 patches I had still need to be applied.

Thanks,
David Brown