Greetings,

We have 4 OSS nodes and 2 MDS nodes configured in HA pairs, running 2.6.18-53.1.14.el5_lustre.1.6.5smp, and using the o2ib network transport. We had multiple failovers recently (possibly due to hardware problems, but no root cause yet) and managed to get things back again to what I _thought_ was a normal state.

However, in the system log we are seeing many "server_bulk_callback" error messages at the rate of ~6 per second. Interestingly, they only come from one HA pair of OSS nodes:

  Sep 24 23:03:14 lfs-oss-0-3 kernel: LustreError: 20694:0:(events.c:361:server_bulk_callback()) event type 4, status -103, desc ffff81019fce6000
  Sep 24 23:03:14 lfs-oss-0-3 kernel: LustreError: 20694:0:(events.c:361:server_bulk_callback()) event type 2, status -103, desc ffff81019fce6000
  Sep 24 23:03:16 lfs-oss-0-2 kernel: LustreError: 27257:0:(events.c:361:server_bulk_callback()) event type 4, status -103, desc ffff8101b52b8000
  Sep 24 23:03:16 lfs-oss-0-2 kernel: LustreError: 27257:0:(events.c:361:server_bulk_callback()) event type 2, status -103, desc ffff8101b52b8000

Can anyone direct me to documentation to decipher these messages? What does "server_bulk_callback" do, and does "status -103" indicate a severe problem for event types 2 and 4?

Thanks very much for your guidance,
Nathan
On Sep 24, 2008 17:22 -0600, Nathan Dauchy wrote:
> We have 4 OSS nodes and 2 MDS nodes configured in HA pairs, running
> 2.6.18-53.1.14.el5_lustre.1.6.5smp, and using the o2ib network
> transport. We had multiple failovers recently (possibly due to hardware
> problems, but no root cause yet) and managed to get things back again to
> what I _thought_ was a normal state.
>
> However, in the system log we are seeing many "server_bulk_callback"
> error messages at the rate of ~6 per second. Interestingly, they only
> come from one HA pair of OSS nodes:
>
> Sep 24 23:03:14 lfs-oss-0-3 kernel: LustreError:
> 20694:0:(events.c:361:server_bulk_callback()) event type 4, status -103,
> desc ffff81019fce6000
> Sep 24 23:03:14 lfs-oss-0-3 kernel: LustreError:
> 20694:0:(events.c:361:server_bulk_callback()) event type 2, status -103,
> desc ffff81019fce6000
> Sep 24 23:03:16 lfs-oss-0-2 kernel: LustreError:
> 27257:0:(events.c:361:server_bulk_callback()) event type 4, status -103,
> desc ffff8101b52b8000
> Sep 24 23:03:16 lfs-oss-0-2 kernel: LustreError:
> 27257:0:(events.c:361:server_bulk_callback()) event type 2, status -103,
> desc ffff8101b52b8000
>
> Can anyone direct me to documentation to decipher these messages?
> What does "server_bulk_callback" do, and does "status -103" indicate a
> severe problem for event types 2 and 4?

All Lustre error numbers are from /usr/include/asm/errno.h. In this case, -103 = -ECONNABORTED. My guess would be some kind of networking issue being hit by LNET, because that isn't an error used by the Lustre filesystem itself.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
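For readers who want to look up a status code without opening errno.h, the following is a minimal sketch using Python's standard `errno` and `os` modules. The helper name `describe_status` is hypothetical and not part of Lustre. Errno numbering is platform-specific, so the lookup only reflects the kernel's meaning when run on a Linux host of the same architecture as the servers; on Linux, 103 is ECONNABORTED, as stated above.

```python
import errno
import os


def describe_status(status):
    """Map a negative kernel-style status (e.g. -103) to (symbolic name, message).

    Hypothetical helper for illustration; uses the errno table of the host
    it runs on, which may differ from the server's on non-Linux systems.
    """
    code = abs(status)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    name, message = describe_status(-103)
    print(f"-103 -> {name}: {message}")
```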
On Wed, Sep 24, 2008 at 05:22:55PM -0600, Nathan Dauchy wrote:
> Can anyone direct me to documentation to decipher these messages?
> What does "server_bulk_callback" do, and does "status -103" indicate a
> severe problem for event types 2 and 4?

server_bulk_callback signals the completion of bulk data movement initiated by servers; type 2 and 4 are for bulk read and write operations (from the perspective of the server) respectively. The -103 error most likely came from Lustre networking stacks. You'd be able to get more details by enabling low-level network error console logging:

  echo +neterror > /proc/sys/lnet/printk

Thanks,
Isaac
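To see whether the errors really are confined to one HA pair and how they split across event types, the syslog lines can be tallied by host, event type, and status. This is a rough sketch, not a Lustre tool: the regular expression is an assumption based only on the four log lines quoted in this thread, and the function name `summarize` is hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted in this thread (an assumption;
# other Lustre versions may format the message differently).
LINE_RE = re.compile(
    r"(?P<host>\S+) kernel: LustreError: .*?server_bulk_callback\(\)\) "
    r"event type (?P<type>\d+), status (?P<status>-?\d+)"
)


def summarize(lines):
    """Count server_bulk_callback errors per (host, event type, status)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            key = (match["host"], int(match["type"]), int(match["status"]))
            counts[key] += 1
    return counts


SAMPLE = [
    "Sep 24 23:03:14 lfs-oss-0-3 kernel: LustreError: "
    "20694:0:(events.c:361:server_bulk_callback()) event type 4, "
    "status -103, desc ffff81019fce6000",
    "Sep 24 23:03:16 lfs-oss-0-2 kernel: LustreError: "
    "27257:0:(events.c:361:server_bulk_callback()) event type 2, "
    "status -103, desc ffff8101b52b8000",
]

if __name__ == "__main__":
    for (host, event_type, status), n in sorted(summarize(SAMPLE).items()):
        print(f"{host} type={event_type} status={status} count={n}")
```

In practice the `lines` argument would be an open handle on the system log rather than the two sample lines shown here.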