Hi All,

I've got a DomU that sometimes goes mad. I can't ssh to it, and usually can't even get a console. The one time I did manage to get a console I got a load of dumps about being out of memory and swap, but couldn't run any commands to find out which process had gone mad :(

From Dom0 I can see the DomU at 100% CPU, and I can only stop it with a destroy. What can I do/check to find out why this happens? Sometimes it'll be fine for weeks on end, other times it'll go wrong almost every day. The server's average load is very low, around 0.1. I assume there is a process that goes wild for whatever reason, but I have no idea where to start tracking it down :(

I'm running the latest CentOS. Any help much appreciated.

Lyle

_______________________________________________
Xen-users mailing list
Xen-users@lists.xensource.com
http://lists.xensource.com/xen-users

Lyle wrote:
> I've got a DomU that sometimes goes mad. I can't ssh or usually even
> console to it. The time I did manage to console I got a load of dumps
> about being out of memory and swap, but couldn't run any commands to
> find out which process had gone mad :(

Lyle,

What services does this DomU run? In other words, is it a mail server, web server, RADIUS, etc.? What can you tell us about the DomU that would help us help you?

-- 
Steven G. Spencer, Network Administrator
KSC Corporate - The Kelly Supply Family of Companies
Office 308-382-8764 Ext. 231
Mobile 308-380-7957

Hi!

> I've got a DomU that sometimes goes mad. I can't ssh or usually even
> console to it. The time I did manage to console I got a load of dumps
> about being out of memory and swap, but couldn't run any commands to
> find out which process had gone mad :(

You could monitor your services for memory consumption. Something like

    ps -e -orss=,args= | sort -b -k1,1n

or

    ps -auxf | sort -nr -k 4

maybe with

    ps -auxf | sort -nr -k 4 | head -10

shows memory consumption sorted by process. Or you might rather want to use a monitoring tool like sar, nagios, or whatever to find out which process causes this...

-- Adi

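A possible variant of the pipelines above, as a rough sketch only. It sticks to UNIX-style ps options (each preceded by a dash), because some procps versions print a "bad syntax" warning when a dash is combined with BSD-style letters such as aux. The function name rank_by_rss and the three-column input layout are assumptions made for illustration; they are not from the thread.

```shell
#!/bin/sh
# rank_by_rss N: read "RSS PID COMMAND" lines on stdin and print the N lines
# with the largest RSS (resident memory, in KiB), biggest first.
# The function name and input layout are hypothetical, chosen for this sketch.
rank_by_rss() {
    # sort -rn compares the leading number of each line, largest first
    sort -rn | head -n "${1:-10}"
}

# Example invocation (commented out so that sourcing this file stays silent):
#   ps -e -o rss= -o pid= -o comm= | rank_by_rss 10
```

Keeping the ranking separate from the ps call means the same function can be pointed at a saved snapshot later, for example one of the files written by a periodic logger.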
On 26/07/2010 14:36, Steve Spencer wrote:
> What services does this DomU run? In other words is it a mail server,
> web server, radius, etc? What can you tell us about the DomU that would
> be of help to us helping you?

Here is an abridged ps aux; I cut out what look like duplicates. Is there a way of setting some kind of process logging to trigger once the CPU % goes over 90%?

USER       PID %CPU %MEM    VSZ    RSS TTY STAT START TIME COMMAND
root         1  0.0  0.1  10348    600 ?   Ss   06:14 0:00 init [3]
root         2  0.0  0.0      0      0 ?   S<   06:14 0:00 [migration/0]
root         3  0.0  0.0      0      0 ?   SN   06:14 0:00 [ksoftirqd/0]
root         4  0.0  0.0      0      0 ?   S<   06:14 0:00 [watchdog/0]
root         5  0.0  0.0      0      0 ?   S<   06:14 0:00 [events/0]
root         6  0.0  0.0      0      0 ?   S<   06:14 0:00 [khelper]
root         7  0.0  0.0      0      0 ?   S<   06:14 0:00 [kthread]
root         9  0.0  0.0      0      0 ?   S<   06:14 0:00 [xenwatch]
root        10  0.0  0.0      0      0 ?   S<   06:14 0:00 [xenbus]
root        14  0.0  0.0      0      0 ?   S<   06:14 0:00 [migration/1]
root        15  0.0  0.0      0      0 ?   SN   06:14 0:00 [ksoftirqd/1]
root        16  0.0  0.0      0      0 ?   S<   06:14 0:00 [watchdog/1]
root        17  0.0  0.0      0      0 ?   S<   06:14 0:00 [events/1]
root        20  0.0  0.0      0      0 ?   S<   06:14 0:00 [kblockd/0]
root        21  0.0  0.0      0      0 ?   S<   06:14 0:00 [kblockd/1]
root        22  0.0  0.0      0      0 ?   S<   06:14 0:00 [cqueue/0]
root        23  0.0  0.0      0      0 ?   S<   06:14 0:00 [cqueue/1]
root        27  0.0  0.0      0      0 ?   S<   06:14 0:00 [khubd]
root        29  0.0  0.0      0      0 ?   S<   06:14 0:00 [kseriod]
root        94  0.0  0.0      0      0 ?   S    06:14 0:00 [khungtaskd]
root        95  0.0  0.0      0      0 ?   S    06:14 0:00 [pdflush]
root        96  0.0  0.0      0      0 ?   S    06:14 0:00 [pdflush]
root        97  0.0  0.0      0      0 ?   S<   06:14 0:01 [kswapd0]
root        98  0.0  0.0      0      0 ?   S<   06:14 0:00 [aio/0]
root        99  0.0  0.0      0      0 ?   S<   06:14 0:00 [aio/1]
root       229  0.0  0.0      0      0 ?   S<   06:14 0:00 [kpsmoused]
root       254  0.0  0.0      0      0 ?   S<   06:14 0:00 [kstriped]
root       267  0.0  0.0      0      0 ?   S<   06:14 0:00 [ksnapd]
root       282  0.0  0.0      0      0 ?   S<   06:14 0:00 [kjournald]
root       304  0.0  0.0      0      0 ?   S<   06:14 0:00 [kauditd]
root       332  0.0  0.0  12604    348 ?   S<s  06:14 0:00 /sbin/udevd -d
root       664  0.0  0.0      0      0 ?   S<   06:14 0:00 [kmpathd/0]
root       665  0.0  0.0      0      0 ?   S<   06:14 0:00 [kmpathd/1]
root       666  0.0  0.0      0      0 ?   S<   06:14 0:00 [kmpath_handle]
root       688  0.0  0.0      0      0 ?   S<   06:14 0:00 [kjournald]
root      1067  0.0  0.1  27348    696 ?   S<sl 06:15 0:00 auditd
root      1069  0.0  0.1  81800    760 ?   S<sl 06:15 0:00 /sbin/audispd
root      1089  0.0  0.1   5908    532 ?   Ss   06:15 0:00 syslogd -m 0
root      1092  0.0  0.0   3804    324 ?   Ss   06:15 0:00 klogd -x
root      1101  0.0  0.0  10760    316 ?   Ss   06:15 0:00 irqbalance
named     1138  0.0  1.2 166536   6728 ?   Ssl  06:15 0:01 /usr/sbin/named
rpc       1171  0.0  0.0   8052    408 ?   Ss   06:15 0:00 portmap
root      1215  0.0  0.0      0      0 ?   S<   06:15 0:00 [rpciod/0]
root      1216  0.0  0.0      0      0 ?   S<   06:15 0:00 [rpciod/1]
rpcuser   1223  0.0  0.1  10160    564 ?   Ss   06:15 0:00 rpc.statd
root      1245  0.0  0.0  55180    236 ?   Ss   06:15 0:00 rpc.idmapd
dbus      1258  0.0  0.1  21356    852 ?   Ss   06:15 0:00 dbus-daemon --s
root      1266  0.0  0.0  10432    376 ?   Ss   06:15 0:00 /usr/sbin/hcid
root      1272  0.0  0.0   5936    392 ?   Ss   06:15 0:00 /usr/sbin/sdpd
root      1294  0.0  0.0      0      0 ?   S<   06:15 0:00 [krfcommd]
root      1329  0.0  0.0  21040    524 ?   Ssl  06:15 0:00 pcscd
root      1347  0.0  0.0   8516    364 ?   Ss   06:15 0:00 /usr/bin/hidd -
root      1380  0.0  0.1  54396    836 ?   Ssl  06:15 0:00 automount
root      1399  0.0  0.1  63516    532 ?   Ss   06:15 0:00 /usr/sbin/sshd
root      1407  0.0  0.1 134096    952 ?   Ss   06:15 0:00 cupsd
root      1419  0.0  0.1  21644    540 ?   Ss   06:15 0:00 xinetd -stayali
root      1430  0.0  0.0  44268    188 ?   Ss   06:15 0:00 /usr/sbin/vsftp
root      1462  0.0  0.1  65980    996 ?   S    06:15 0:00 /bin/sh /usr/bi
mysql     1509  0.0  0.8 191260   4308 ?   Sl   06:15 0:00 /usr/libexec/my
postgres  1589  0.0  0.2 120740   1344 ?   S    06:15 0:00 /usr/bin/postma
root      1600  0.0  0.0   6060    500 ?   Ss   06:15 0:00 /usr/sbin/dovec
root      1608  0.0  0.2  62500   1300 ?   S    06:15 0:00 dovecot-auth
dovecot   1612  0.0  0.2  33892   1300 ?   S    06:15 0:00 imap-login
postgres  1615  0.0  0.0 109920    176 ?   S    06:15 0:00 postgres: logge
nobody    1622  0.0 31.0 212288 163000 ?   Ssl  06:15 0:06 clamd.virtualmi
postgrey  1632  0.0  1.0 111480   5380 ?   Ss   06:15 0:00 /usr/sbin/postg
root      1684  0.0  0.3  54144   1828 ?   Ss   06:15 0:00 /usr/libexec/po
postfix   1691  0.0  0.3  55160   1932 ?   S    06:15 0:00 qmgr -l -t fifo
root      1701  0.0  0.0   6452    256 ?   Ss   06:15 0:00 gpm -m /dev/inp
postfix   1733  0.0  0.3  54204   1868 ?   S    06:15 0:00 tlsmgr -l -t un
root      1743  0.0  0.6 319152   3244 ?   Ss   06:15 0:00 /usr/sbin/httpd
apache    1746  0.0  0.0 249564    444 ?   S    06:15 0:00 /usr/sbin/httpd
root      1752  0.0  0.1  74860    724 ?   Ss   06:15 0:00 crond
root      1763  0.0  0.0  49764    420 ?   Ss   06:15 0:00 squid -D
squid     1765  0.0  0.5  52236   3128 ?   S    06:15 0:00 (squid) -D
squid     1767  0.0  0.0   3644    184 ?   Ss   06:15 0:00 (unlinkd)
apache    1779  0.0  0.0 319064    424 ?   S    06:15 0:00 /usr/sbin/fcgi-
sympa     1780  0.0  4.3 258180  22672 ?   S    06:15 0:01 /usr/bin/perl -
xfs       1796  0.0  0.1  20260    568 ?   Ss   06:15 0:00 xfs -droppriv -
root      1811  0.0  0.0  18732    352 ?   Ss   06:15 0:00 /usr/sbin/atd
root      1819  0.0  0.0  46740    304 ?   Ss   06:15 0:00 /usr/sbin/sasla
sympa     1833  0.0  4.1 230640  21652 ?   S    06:15 0:01 /usr/bin/perl -
avahi     1840  0.0  0.1  24172   1032 ?   Ss   06:15 0:00 avahi-daemon: r
68        1849  0.0  0.1  30428    976 ?   Ss   06:15 0:00 hald
root      1850  0.0  0.1  21692    532 ?   S    06:15 0:00 hald-runner
mailman   1866  0.0  0.1 149556    692 ?   Ss   06:15 0:00 /usr/bin/python
root      1898  0.0  0.7 257084   3736 ?   SN   06:15 0:00 /usr/bin/python
root      1900  0.0  0.1  12916    852 ?   SN   06:15 0:00 /usr/libexec/ga
root      1927  0.0  0.0      0      0 ?   Z    06:16 0:00 [sen] <defunct>
root      2046  0.0  0.3 125808   1740 ?   Ss   06:16 0:00 /usr/libexec/we
root      2056  0.0  0.0  18416    240 ?   S    06:16 0:00 /usr/sbin/smart
root      2069  0.0  0.1  52108    892 ?   Ss   06:16 0:00 login -- root
postfix  13030  0.0  0.5  54348   2636 ?   S    08:12 0:00 trivial-rewrite
postfix  17743  0.0  0.4  54208   2240 ?   S    09:14 0:00 pickup -l -t fi
postfix  18474  0.0  0.4  54204   2288 ?   S    09:24 0:00 anvil -l -t uni
postfix  19504  0.0  0.5  54428   2660 ?   S    09:35 0:00 local -t unix
dovecot  19540  0.0  0.3  33884   1632 ?   S    09:35 0:00 pop3-login
postfix  19543  0.0  0.8  72672   4492 ?   S    09:35 0:00 smtpd -n smtp -
postfix  19660  0.0  0.5  54468   2684 ?   S    09:40 0:00 cleanup -z -t u
postfix  19947  0.0  0.8  72672   4484 ?   S    09:41 0:00 smtpd -n smtp -

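On the question of triggering a process log once CPU goes over 90%: one way this could be done, sketched under assumptions. The function names cpu_busy_pct and watch_cpu, the default log path /var/log/cpu-spike.log and the 10 second interval are all hypothetical. Overall CPU use is derived from two samples of the aggregate "cpu" line in /proc/stat (Linux only); when it crosses the threshold, a ps snapshot is appended to the log.

```shell
#!/bin/sh
# cpu_busy_pct "PREV" "CUR": integer percentage of non-idle CPU time between
# two samples of the aggregate "cpu" line from /proc/stat.
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i   # user .. steal
            idle = $5 + $6                          # idle + iowait
        }
        NR == 1 { t0 = total; i0 = idle; next }
        {
            dt = total - t0
            if (dt <= 0) { print 0; exit }
            printf "%d\n", 100 * (dt - (idle - i0)) / dt
        }'
}

# watch_cpu [THRESHOLD] [INTERVAL] [LOGFILE]: loop forever and append a ps
# snapshot to LOGFILE whenever overall CPU use reaches THRESHOLD percent.
# All three defaults are assumptions for this sketch.
watch_cpu() {
    threshold="${1:-90}"
    interval="${2:-10}"
    log="${3:-/var/log/cpu-spike.log}"
    prev=$(grep '^cpu ' /proc/stat)
    while sleep "$interval"; do
        cur=$(grep '^cpu ' /proc/stat)
        pct=$(cpu_busy_pct "$prev" "$cur")
        if [ "$pct" -ge "$threshold" ]; then
            {
                date
                # pcpu is CPU time averaged over each process lifetime, so
                # this is a hint about the culprit, not an instant reading
                ps -e -o pcpu= -o rss= -o pid= -o args= | sort -rn | head -n 15
            } >> "$log"
        fi
        prev=$cur
    done
}

# Example invocation (not run here because it never returns):
#   watch_cpu 90 10 /var/log/cpu-spike.log &
```

If the guest is already out of memory when the threshold trips, starting ps may itself fail, so an unconditional periodic snapshot is probably the more reliable way to catch the lead-up.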
On 26/07/2010 14:44, Adi Kriegisch wrote:
> You could monitor your services for memory consumption?
> Something like
> ps -e -orss=,args= | sort -b -k1,1n
> or
> ps -auxf | sort -nr -k 4
> maybe with
> ps -auxf | sort -nr -k 4 | head -10
> shows sorted memory consumption by process or you might rather want to use
> a monitoring tool like sar, nagios, or whatever to find out which process
> causes this...

These are useful, thanks, although ps doesn't use - (just to be awkward, everything else does).

I looked at nagios a few years ago. It looked great, but like I'd have to take a week out to set it up. Is there anything lightweight I could make? I guess I could write a Perl daemon that runs that ps command every 10 seconds or so and logs the output to a file... Seems like the sort of thing that should already be available though... Anyone else had an issue like this?

Lyle

Lyle wrote:
> I looked at nagios a few years ago, it looked great, but like I'd have
> to take a week out to set it up. Is there anything lightweight I could
> make? I guess I could write a Perl daemon that runs that ps command
> every 10 seconds or something and logs the output to a file... Seems
> like the sort of thing that should already be available though...
> Anyone else had an issue like this?

I used psmon a few years ago for something similar. Perhaps it would work for you as well. It looks as though this is a mail server (postfix), so it could be something there that is causing your problem.

Here's the link to psmon if you want to give that a try:

http://www.psmon.com/

-- 
Steven G. Spencer, Network Administrator
KSC Corporate - The Kelly Supply Family of Companies
Office 308-382-8764 Ext. 231
Mobile 308-380-7957

Hi!

> > You could monitor your services for memory consumption?
[SNIP]
> These are useful thanks, although ps doesn't use - (just to be awkward,
> everything else does).

man ps:
[SNIP]
  1 UNIX options, which may be grouped and must be preceded by a dash.
  2 BSD options, which may be grouped and must not be used with a dash.
  3 GNU long options, which are preceded by two dashes.
[SNAP]

> I looked at nagios a few years ago, it looked great, but like I'd have
> to take a week out to set it up. If there anything lightweight I could
> make? I guess I could write a Perl daemon that runs that ps command
> every 10 seconds or something and logs the output to a file... Seems
> like the sort of thing that should all ready be available though...

My suggestion was not about setting up nagios if you're not already using it. You could start using sar[1] or just write a plain shell script doing the monitoring for you:

    while /bin/true; do
        WHATEVER_PS_COMMAND_YOU_LIKE_BEST > \
            /var/log/mymemlog/$(date +%Y-%m-%d_-_%H.%M.%S)
        sleep 10
    done

...and you'll get memstats every 10 seconds, saved in timestamped log files for further analysis.

Another option would be to check your already existing log files for "oomkiller" messages. They could give hints about the processes eating up all your memory.

Furthermore, this is a general issue with Linux servers running out of memory; it is not a Xen issue. You might want to have a look at sites like serverfault[2], or you might want to do it the other way around and limit memory for the applications and users. Just have a look at /etc/security/limits.conf for example. Then sit down and wait for the first service dying... ;-)

Another option could be to add more swap space (as this is usually cheaper than RAM). That way your problem might "disappear". On the other hand, you should plan your (virtual) machines with expected memory consumption in mind, so that swap space is not used at all (or only in an emergency, to keep the oomkiller from kicking in).

-- Adi

[1] http://pagesperso-orange.fr/sebastien.godard/
[2] http://www.serverfault.com

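The log check suggested above could look roughly like this. It is a sketch under assumptions: the function name oom_victims is hypothetical, the log path in the usage comment is a guess at the usual CentOS syslog location, and the exact wording of OOM killer messages varies between kernel versions, so the pattern may need adjusting.

```shell
#!/bin/sh
# oom_victims: read kernel log text on stdin and count how often each process
# name appears in a "Killed process PID (name)" line written by the OOM killer.
# Output is "COUNT NAME" per line, most frequently killed first.
oom_victims() {
    sed -n 's/.*Killed process [0-9][0-9]* (\([^)]*\)).*/\1/p' |
        sort | uniq -c | sort -rn
}

# Example invocation (the log path is an assumption):
#   oom_victims < /var/log/messages
```

The process that gets killed is not always the one that leaked the memory, so the tally is a starting point to compare against the periodic ps snapshots rather than a verdict.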
On 26/07/2010 16:17, Steve Spencer wrote:
> I used psmon a few years ago for something similar. Perhaps it would
> work for you as well. It looks as though this is a mail server
> (postfix), so it could be something there that is causing your problem.
>
> Here's the link to psmon if you want to give that a try:
>
> http://www.psmon.com/

Thanks for the link, I'll take a look :)

Lyle

On 26/07/2010 16:24, Adi Kriegisch wrote:
>> These are useful thanks, although ps doesn't use - (just to be awkward,
>> everything else does).
>
> man ps:
> [SNIP]
> 1 UNIX options, which may be grouped and must be preceded by a dash.
> 2 BSD options, which may be grouped and must not be used with a dash.
> 3 GNU long options, which are preceded by two dashes.
> [SNAP]

It's just that I was getting the error message

    Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.5/FAQ

Taking the - off seemed to cure it.

> My suggestion was not about setting up nagios if you're not already using
> it. You could start using sar[1] or just write a plain shell script doing
> the monitoring for you:
> while /bin/true; do
> WHATEVER_PS_COMMAND_YOU_LIKE_BEST > \
> /var/log/mymemlog/$(date +%Y-%m-%d_-_%H.%M.%S)
> sleep 10
> done

Beautiful, thank you :)

> ...and you'll get memstats every 10 seconds saved in a log file for further
> analysis.
>
> Another option would be to check your already existing log files for
> "oomkiller" messages. They could give hints on the processes eating up all
> your memory.

Will do.

> Further this is a general issue with Linux servers running out of memory
> and is not related to Xen or a Xen issue. You might as well want to have a
> look at sites serverfault[2] or you might want to do it the other way
> around and limit memory for the available applications and users. Just have
> a look at /etc/security/limits.conf for example. Then sit down and wait for
> the first service dying... ;-)

I wasn't sure if there was something common in Xen that I needed to set up to stop this.

> Another option could be to add more swap space (as this is usually cheaper
> than ram). That way your problem might "disappear". On the other hand you
> should plan your (virtual) machines with expected memory consumption in
> mind so that using swap space will not happen at all (or just in case of
> emergency preventing the oomkiller to snap in).

I don't want to throw more memory at it, I'd rather figure out what's going wrong and why.

Thanks for the detailed response :)

Lyle