eeb@clusterfs.com
2006-Dec-18  15:10 UTC
[Lustre-devel] [Bug 11234] dump portals traces on kptllnd timeout
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11234
           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #8934 is|0                           |1
           obsolete|                            |
Created an attachment (id=9164)
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
 --> (https://bugzilla.lustre.org/attachment.cgi?id=9164&action=view)
updated patch
I updated the patch.  Major diffs are as follows...
1. Used existing libcfs module parameter string support
2. Added #ifdef CRAY_XT3 for ptl trace dump support
3. Reworked ptllnd_ptltrace.c to simplify code
4. Fixed bug in file offsets passed to ptl_proc_read
5. Changed CERRORs which could only occur if there were bugs in ptl_proc_read()
to LASSERTs
nic@cray.com
2006-Dec-18  15:38 UTC
[Lustre-devel] [Bug 11234] dump portals traces on kptllnd timeout
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11234
A few questions:
- Why add ifdef CRAY_XT3 in ptllnd code, given that the ptllnd is only for Cray
?
+                LCONSOLE_ERROR("Timing out %s: please check
Portals\n",
+                               libcfs_id2str(peer->peer_id));
+
Why LCONSOLE_ERROR?
Also -- why not put the elapsed time spent waiting here ? I think having this
value right up front & readable goes a long way to helping poor sysadmins
figure
out what is going on.
- Removed following comment:
-        /* The CFS_MMSPACE needs to be used when doing system calls from
-         * kernelspace. We are treating cfs_filp_write as one
-         */
Why ? There is no documentation anywhere when to use CFS_MMSPACE.
-                 
-                if ((start < tmpbuf || start > (tmpbuf + count))) {
-                        CERROR("ptl_proc_read set bogus start: %p\n",
-                               start);
-                        break;  
-                } else if (len > count) {
-                        CERROR("Apparent overflow: %d > %d\n",
-                               len, count);
-                        break;  
 
-                }
+                LASSERT (start >= tmpbuf && start + len <= tmpbuf
+ PAGE_SIZE);
+
Why does this need to be an LASSERT ? Is it overkill to require this to LBUG
&
hang a thread -- locking up the whole ptllnd ? If I read code below this right,
this LASSERT will lock up this thread, keeping the ptltrace_mutex held and the
calling ptllnd thread will never continue.
- Why go to the following code?
+        libcfs_daemonize("ptltracedump");
+
+        /* serialise with other instances of me */
+        mutex_down(&ptltrace_mutex);
The original implementation followed the libcfs_debug_dumplog() code exactly.
- Why no more reparent_to_init() ? 
- Why change to blocking the parent thread while dumping this file ?