eeb@clusterfs.com
2006-Dec-18 15:10 UTC
[Lustre-devel] [Bug 11234] dump portals traces on kptllnd timeout
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11234 What |Removed |Added ---------------------------------------------------------------------------- Attachment #8934 is|0 |1 obsolete| | Created an attachment (id=9164) Please don''t reply to lustre-devel. Instead, comment in Bugzilla by using the following link: --> (https://bugzilla.lustre.org/attachment.cgi?id=9164&action=view) updated patch I updated the patch. Major diffs are as follows... 1. Used existing libcfs module parameter string support 2. Added #ifdef CRAY_XT3 for ptl trace dump support 3. Reworked ptllnd_ptltrace.c to simplify code 4. Fixed bug in file offsets passed to ptl_proc_read 5. Changed CERRORs which could only occur if there were bugs in ptl_proc_read() to LASSERTs
nic@cray.com
2006-Dec-18 15:38 UTC
[Lustre-devel] [Bug 11234] dump portals traces on kptllnd timeout
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11234 A few questions: - Why add ifdef CRAY_XT3 in ptllnd code, given that the ptllnd is only for Cray ? + LCONSOLE_ERROR("Timing out %s: please check Portals\n", + libcfs_id2str(peer->peer_id)); + Why LCONSOLE_ERROR? Also -- why not put the elapsed time spent waiting here ? I think having this value right up front & readable goes a long way to helping poor sysadmins figure out what is going on. - Removed following comment: - /* The CFS_MMSPACE needs to be used when doing system calls from - * kernelspace. We are treating cfs_filp_write as one - */ Why ? There is no documentation anywhere when to use CFS_MMSPACE. - - if ((start < tmpbuf || start > (tmpbuf + count))) { - CERROR("ptl_proc_read set bogus start: %p\n", - start); - break; - } else if (len > count) { - CERROR("Apparent overflow: %d > %d\n", - len, count); - break; - } + LASSERT (start >= tmpbuf && start + len <= tmpbuf + PAGE_SIZE); + Why does this need to be an LASSERT ? Is it overkill to require this to LBUG & hang a thread -- locking up the whole ptllnd ? If I read code below this right, this LASSERT will lock up this thread, keeping the ptltrace_mutex held and the calling ptllnd thread will never continue. - Why go to the following code? + libcfs_daemonize("ptltracedump"); + + /* serialise with other instances of me */ + mutex_down(&ptltrace_mutex); The original implementation followed the libcfs_debug_dumplog() code exactly. - Why no more reparent_to_init() ? - Why change to blocking the parent thread while dumping this file ?