George Shuklin
2011-Dec-09 19:49 UTC
bug in xenstored? No notification to subscription on @introduceDomain
Good day.

I think I have hit a strange bug in xenstored.

I have been using XCP for a long time, and all that time we have had an odd bug that we could not debug properly, because it occurred in a production environment and only very rarely. Now we have caught it in a testing environment and done some research.

We have a Python application running in dom0 that waits for domains to appear. This is implemented via a subscription (watch) on the @introduceDomain xenstore key. Under some conditions we stop receiving notifications on that subscription. If we run a second instance of the application, it receives the notifications; if we restart the application, it receives them too.

I am unable to pinpoint the exact conditions, but the problem:
a) Happens occasionally but consistently (about once a month, on at least one host in a farm of 50 hosts)
b) Is not related to xenstored uptime
c) Is not related to load on Xen or dom0
d) Is not related to the number of domains
e) Occurs at least on XCP 0.5, 1.0 and 1.1 (I don't know how to get the version from xenstored)

Last time it happened on two hosts in the lab at the same time (with a single guest domain and no high load), and I did some experiments, so I can state the above with confidence.

The relevant pieces of the Python code we run:

from xen.lowlevel.xs import xs
conn = xs()
conn.watch("@introduceDomain", "+")
conn.watch("@releaseDomain", "-")
conn.read_watch()
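A minimal sketch of how the snippet above might be turned into an event loop. It assumes, as in the xen.lowlevel.xs bindings, that read_watch() blocks and returns a (path, token) pair. The names watch_domain_events and FakeXs are hypothetical and exist only so the loop can be dry-run without a xenstore daemon; this is not the application's actual code.

```python
from collections import deque


def watch_domain_events(conn, on_event, max_events=None):
    """Register both special watches and dispatch each event by its token."""
    conn.watch("@introduceDomain", "+")
    conn.watch("@releaseDomain", "-")
    seen = 0
    while max_events is None or seen < max_events:
        path, token = conn.read_watch()  # assumed to block until a watch fires
        on_event(token, path)
        seen += 1
    return seen


class FakeXs:
    """Hypothetical stand-in for a xenstore handle, for dry runs only."""

    def __init__(self, canned_events):
        self.watches = []
        self._events = deque(canned_events)

    def watch(self, path, token):
        self.watches.append((path, token))

    def read_watch(self):
        return self._events.popleft()


# Dry run: replay one introduce and one release event.
fake = FakeXs([("@introduceDomain", "+"), ("@releaseDomain", "-")])
log = []
handled = watch_domain_events(
    fake, lambda token, path: log.append((token, path)), max_events=2
)
```

With a real handle, conn would be the object returned by the bindings and max_events would be left at None so the loop runs until the process exits.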
Ian Campbell
2011-Dec-12 11:31 UTC
Re: [Xen-devel] bug in xenstored? No notification to subscription on @introduceDomain
On Fri, 2011-12-09 at 19:49 +0000, George Shuklin wrote:
> Good day.
>
> I think I met some strange bug in xenstored.

If you are using XCP then this will be using oxenstored. I've CC'd xen-api@ since that is the correct place for XCP discussions.

It's also plausibly a bug in the C client library or the python bindings to that library (or indeed your application).

> I using XCP for long time and all that time we have some funny bug we
> was not able to debug enough due product environment and very low chance
> to appear, now we was able to catch it in testing environment and done
> some research.
>
> We have python application running in dom0 and waiting domain
> appearance. This implemented this via subscription to @introduceDomain
> xenstore key. Under some conditions we stops to receive notification on
> subscription. If we ran application as second instance it will receive
> that notification, if we restart application it will receive too.

You lose both @introduce and @release notifications or just @introduce?

Does the app do any other XS stuff, e.g. other watches or read/write? Do these stop working also?

oxenstored (at least in XCP) logs to /var/log/xenstore-access.log -- do you see any activity in there? There is also /var/log/xenstored.log

Does strace show the daemon writing (or trying to write) to the socket associated with this client? What about on the client side? (nb: libxenstore uses a thread to handle watches so be sure to use the appropriate options to strace.)

Identifying the fd associated with the connection on either end might be tricky, /proc/<pid>/fd and/or netstat might help narrow it down.

The app being python presumably makes it hard to attach gdb to and get anything sensible, likewise the daemon being ocaml. If anyone has any hints on attaching a debugger to an existing process of these types then that might be useful.

Other than that I'm afraid I really don't have any idea what might be going wrong, or indeed what other next steps can be taken to diagnose the issue :-(

Ian.

> I unable to pinpoint exact condition for this, but this
> a) Happens occasionally but consistently (about once a month in farm of
> 50 hosts at least at one host)
> b) Not related to xenstored uptime
> c) Not related to load on xen or dom0
> d) Not related to amount of domains
> e) Occur at least at XCP 0.5, 1.0 and 1.1 (I don't know how to get
> version from xenstored)
>
> Last time I got that on two hosts in lab at same time (with single guest
> domain without any high load) and done some experiments - so I can say
> exactly I wrote above.
>
> The pieces from python code we ran:
>
> from xen.lowlevel.xs import xs
> conn = xs()
> conn.watch("@introduceDomain", "+")
> conn.watch("@releaseDomain", "-")
> conn.read_watch()
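The suggestion in the message above to narrow down the connection's file descriptor via /proc/<pid>/fd could be sketched roughly as follows. This is only a Linux-specific illustration; socket_fds is a hypothetical helper, not part of any Xen tool. It lists the socket descriptors of a process together with their inode numbers, which can then be compared with the Inode column of /proc/net/unix.

```python
import os
import socket


def socket_fds(pid):
    """Return {fd: inode} for every socket open in process `pid` (Linux /proc)."""
    result = {}
    fd_dir = "/proc/%d/fd" % pid
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the descriptor was closed while we were scanning
        # Socket descriptors are symlinks of the form "socket:[<inode>]".
        if target.startswith("socket:[") and target.endswith("]"):
            result[int(name)] = int(target[len("socket:["):-1])
    return result


# Demo on our own process: open a socket pair and find both ends again.
left, right = socket.socketpair()
found = socket_fds(os.getpid())
```

For the daemon or the client application, the pid would be passed in instead of os.getpid(); reading another user's /proc/<pid>/fd normally requires root.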
Ian Campbell
2011-Dec-12 13:46 UTC
Re: [Xen-devel] bug in xenstored? No notification to subscription on @introduceDomain
Please don't top post and don't drop people/lists from the CC. I have reinstated xen-devel and refrained from trimming the quotes as heavily as I normally would.

Counter to my own advice I have also dropped xen-hosts-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org because last time I got a bounce in Russian to the effect that the group does not exist (according to google translate).

On Mon, 2011-12-12 at 12:10 +0000, George Shuklin wrote:
> Thanks for reply.
>
> The problem is we tried at least two different libraries - xs (+python
> xen.lowlevel.xs) and our own library (pyxs), created from scratches on
> pure python - both shows exactly same behavior. We losing same time
> @introduce and @release, but only for new domains. Older domains (which
> starts before error appear) during shutdown/migration sends @release
> normally.
>
> I done strace, nothing is sending by xenstored to application socket when
> 'new' domains appears and disappears (I'm not sure 100% due not very
> good strace skills).
>
> Application performs write/read operations to/from xenstore (and do many
> subscriptions, but only after @introduce) and older subscription works fine.
>
> PS We got other strange bug with memory leak in xenstored (happens only
> with big amount of transactions, and ONLY with socket) - but this case
> is still under research, so I decide not to post this (but may be it
> related somehow?).

Are the two events correlated? i.e. is the oxenstored process huge when these failures occur? Inability to allocate memory could explain some of your symptoms although I'd expect it to be more fatal more quickly and obviously than what you describe or to have wider impact.

> Sorry for question - how I can gather debug information for oxenstored?

What sort of debug information are you after?

There are various logging options which you could turn up to 11 in /etc/xensource/xenstored.conf but I do not have a complete list of what they are, similarly for command line options -- perhaps someone on xen-api@ could chime in? Otherwise looking in the source might be the best way to find out what they are; xenstore.ml, parse_args.ml and logging.ml would be good places to start. (If having done so you feel motivated to write a patch to add docs/man/oxenstored.1.pod we would be much obliged...)

Ian.
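Whether the missing notifications coincide with a very large oxenstored process could be checked by sampling the daemon's resident set size when a failure is observed. A minimal Linux-only sketch; rss_kib is a hypothetical helper, and the daemon's pid has to be looked up separately (the demo simply inspects the current process).

```python
import os


def rss_kib(pid):
    """Return the resident set size of `pid` in KiB, or None if not reported."""
    with open("/proc/%d/status" % pid) as status:
        for line in status:
            # The line looks like "VmRSS:      1234 kB".
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return None  # e.g. kernel threads have no VmRSS line


# Demo: sample our own process instead of the daemon.
own_rss = rss_kib(os.getpid())
```

Logging this value periodically next to the time of each missed notification would show whether the two problems are correlated.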
Ian Campbell
2011-Dec-12 13:55 UTC
Re: [Xen-devel] bug in xenstored? No notification to subscription on @introduceDomain
On Mon, 2011-12-12 at 11:31 +0000, Ian Campbell wrote:
> Does the app do any other XS stuff, e.g. other watches or read/write? Do
> these stop working also?

One other question -- does your app use threading anywhere apart from the one it gets from libxenstore?

Ian.
George Shuklin
2011-Dec-12 15:34 UTC
Re: [Xen-devel] bug in xenstored? No notification to subscription on @introduceDomain
On 12.12.2011 17:46, Ian Campbell wrote:
> Please don't top post and don't drop people/lists from the CC. I have
> reinstated xen-devel and refrained from trimming the quotes as heavily
> as I normally would.
>
> Counter to my own advice I have also dropped xen-hosts-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
> because last time I got a bounce in Russian to the effect that the group
> does not exist (according to google translate).
>
> On Mon, 2011-12-12 at 12:10 +0000, George Shuklin wrote:
>> Thanks for reply.
>>
>> The problem is we tried at least two different libraries - xs (+python
>> xen.lowlevel.xs) and our own library (pyxs), created from scratches on
>> pure python - both shows exactly same behavior. We losing same time
>> @introduce and @release, but only for new domains. Older domains (which
>> starts before error appear) during shutdown/migration sends @release
>> normally.
>>
>> I done strace, nothing is sending by xenstored to application socket when
>> 'new' domains appears and disappears (I'm not sure 100% due not very
>> good strace skills).
>>
>> Application performs write/read operations to/from xenstore (and do many
>> subscriptions, but only after @introduce) and older subscription works fine.
>>
>> PS We got other strange bug with memory leak in xenstored (happens only
>> with big amount of transactions, and ONLY with socket) - but this case
>> is still under research, so I decide not to post this (but may be it
>> related somehow?).
> Are the two events correlated? i.e. is the oxenstored process huge when
> these failures occur? Inability to allocate memory could explain some of
> your symptoms although I'd expect it to be more fatal more quickly and
> obviously than what you describe or to have wider impact.

No. The memory leak occurs only when transactions are used together with subscriptions, but the 'no notification' problem continued after we stopped using transactions (which cured the memory leak completely), so I think this is a separate issue, though I'm not sure.

I still can't pin down the conditions under which the @introduce notifications go missing, sorry (I got one more occurrence this morning in the test pool).

>> Sorry for question - how I can gather debug information for oxenstored?
> What sort of debug information are you after?
>
> There are various logging options which you could turn up to 11
> in /etc/xensource/xenstored.conf but I do not have a complete list of
> what they are, similarly for command line options -- perhaps someone on
> xen-api@ could chime in? Otherwise looking in the source might be the
> best way to find out what they are, try xenstore.ml, parse_args.ml
> logging.ml would be good places to start. (if having done so you feel
> motivated to write a patch to add docs/man/oxenstored.1.pod we would be
> much obliged...)
>

OK, thanks, I'll dig into the sources to enable them all.

We use xenstore heavily for dynamic memory regulation (about five operations per domain per second).
George Shuklin
2011-Dec-12 15:36 UTC
Re: [Xen-devel] bug in xenstored? No notification to subscription on @introduceDomain
On 12.12.2011 17:55, Ian Campbell wrote:
> On Mon, 2011-12-12 at 11:31 +0000, Ian Campbell wrote:
>> Does the app do any other XS stuff, e.g. other watches or read/write? Do
>> these stop working also?
> One other question -- does your app use threading anywhere apart from
> the one it gets from libxenstore?
>

Yes, it does! We use a multithreaded model (that is why we wrote an alternative library to access xenstore: to get proper multithreaded subscriptions). But this problem was already happening before we went multithreaded, with a single-threaded application.
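One common way to combine watches with a multithreaded application is to let a single dedicated thread call read_watch() and hand the events to the other threads through a queue. The sketch below illustrates that pattern only; it is not how pyxs or the application in this thread is implemented. start_watch_reader and FakeXs are hypothetical names, and read_watch() is assumed to block and return a (path, token) pair.

```python
import queue
import threading


def start_watch_reader(conn, events):
    """Read watches in one dedicated thread and pass them on via a queue."""

    def reader():
        try:
            while True:
                events.put(conn.read_watch())  # (path, token) pairs
        except Exception as exc:
            # Forward the failure so consumers notice that the reader stopped
            # instead of silently waiting for events that never arrive.
            events.put(exc)

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()
    return thread


class FakeXs:
    """Hypothetical stand-in that replays canned events, then fails."""

    def __init__(self, canned_events):
        self._canned = list(canned_events)

    def read_watch(self):
        if not self._canned:
            raise EOFError("no more canned events")
        return self._canned.pop(0)


# Dry run: two events, then the simulated connection failure.
events = queue.Queue()
reader_thread = start_watch_reader(
    FakeXs([("@introduceDomain", "+"), ("@releaseDomain", "-")]), events
)
reader_thread.join(timeout=5)
received = [events.get(timeout=5) for _ in range(3)]
```

Putting the exception on the queue is a deliberate choice here: a consumer that only ever sees silence cannot tell a dead reader thread from a quiet host, which is exactly the kind of symptom discussed in this thread.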