I had a rather strange problem last week with one of our 8-core servers. The users complained that performance was "slow", so I checked the basics: processes in top, vmstat for memory and context switching, I/O stats for internal disk I/O, netstat for any network issues, and network throughput by copying a large (1 GB) file across the network. It turned out I had an NMI-related issue on the processor. I figured this out by checking /var/log/messages, but it was a real mystery to me at first.

My question: is there a way to test or benchmark a system and all of its processors to make sure I don't miss this type of error again? I am not necessarily looking for monitoring tools, but more for techniques such as running a while loop on all processors/cores to make sure they all take a constant time.

TIA
Mag Gam <magawake at ...> writes:

> ...
> It turned out I had an NMI-related issue on the processor. I figured
> this out by checking /var/log/messages, but it was a real mystery
> to me at first. My question: is there a way to test or benchmark a
> system and all of its processors to make sure I don't miss this type
> of error again? I am not necessarily looking for monitoring tools, but
> more for techniques such as running a while loop on all processors/cores
> to make sure they all take a constant time.
> ...

Regarding NMI (if you want to help debug it):
http://www.kernel.org/doc/Documentation/nmi_watchdog.txt

Regarding CPU monitoring:
- top (type 1 to show all CPUs)
- htop http://htop.sourceforge.net/index.php?page=main
- xosview http://xosview.sourceforge.net/

Some of the above tools can be used to capture output at set intervals. Most are available as standard tools (or from a supplementary repo).

Regarding controlled, per-CPU task monitoring:
man taskset
(part of the util-linux or util-linux-ng package)

JB
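
For what it's worth, the "timed loop on every core" idea from the original question could be sketched roughly as below. This is only an illustration, not something from the posts above: it assumes Linux and Python 3.3 or later, and uses os.sched_setaffinity to pin the process to one CPU at a time (the same effect as running it under taskset -c). The function names, the loop size and the 1.5x tolerance are all made up for the example. A slow core in the output is just a hint to go and read /var/log/messages; it would not by itself prove (or rule out) an NMI problem, and timings on a busy machine will be noisy.

```python
import os
import time


def time_busy_loop(iterations=200_000):
    """Return the wall-clock seconds taken by a fixed amount of integer work."""
    start = time.perf_counter()
    total = 0
    for i in range(iterations):
        total += i * i
    return time.perf_counter() - start


def benchmark_each_cpu(iterations=200_000, repeats=3):
    """Pin this process to each allowed CPU in turn and time the same loop.

    Returns a dict mapping CPU number to the best (lowest) time observed.
    Linux-only: relies on os.sched_getaffinity / os.sched_setaffinity.
    """
    original = os.sched_getaffinity(0)  # CPUs this process may run on
    results = {}
    try:
        for cpu in sorted(original):
            os.sched_setaffinity(0, {cpu})  # like `taskset -c <cpu>`
            results[cpu] = min(
                time_busy_loop(iterations) for _ in range(repeats)
            )
    finally:
        os.sched_setaffinity(0, original)  # always restore the original mask
    return results


def find_outliers(results, tolerance=1.5):
    """Return CPUs whose time exceeds `tolerance` times the median time.

    The tolerance is an arbitrary, hypothetical threshold for illustration.
    """
    times = sorted(results.values())
    median = times[len(times) // 2]  # upper middle value for even counts
    return [cpu for cpu, t in results.items() if t > tolerance * median]


if __name__ == "__main__":
    results = benchmark_each_cpu()
    for cpu, seconds in results.items():
        print(f"cpu {cpu}: {seconds:.4f} s")
    print("possible outliers:", find_outliers(results))
```

Taking the best of a few repeats per core is meant to reduce the effect of other processes briefly stealing the CPU; running it from cron and keeping the output would give a baseline to compare against later.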