Can anyone who's having this problem please send details? I'm trying to reproduce it -- I've got 5 clients concurrently retrieving 200 10k files made of random binary, and I can't get any corruption or memory growth at all.

Is everyone experiencing the problem using Mongrel? Webrick? What versions of ruby? Are only big files affected? Small files?

I'm going to spend some more time fixing my client-side hack that just fails if md5s don't match, but this is a serious-enough problem that I want to fix the server-side too.

--
Love is the triumph of imagination over intelligence. -- H. L. Mencken
---------------------------------------------------------------------
Luke Kanies | http://reductivelabs.com | http://madstop.com
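For anyone who wants to try a similar reproduction, the test data described above (200 files of 10k random binary, checked by md5) could be built roughly like this. It is only a sketch: the function names, the file naming scheme and the manifest layout are made up for illustration and are not part of Puppet.

```python
import hashlib
import os
from pathlib import Path


def make_corpus(root: Path, count: int = 200, size: int = 10 * 1024) -> dict:
    """Write `count` files of `size` random bytes; return {filename: md5 hex}."""
    root.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for i in range(count):
        data = os.urandom(size)
        name = f"file_{i:03d}.bin"  # hypothetical naming scheme
        (root / name).write_bytes(data)
        manifest[name] = hashlib.md5(data).hexdigest()
    return manifest


def find_mismatches(root: Path, manifest: dict) -> list:
    """Return the names whose on-disk copy no longer matches the manifest."""
    bad = []
    for name, expected in manifest.items():
        path = root / name
        # A path that became a directory (or vanished) counts as a mismatch.
        if not path.is_file():
            bad.append(name)
        elif hashlib.md5(path.read_bytes()).hexdigest() != expected:
            bad.append(name)
    return bad
```

Run `make_corpus` once on the server side and keep the returned manifest; after the clients have pulled the files, `find_mismatches` against each client's copy lists anything that was corrupted or replaced.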
Luke Kanies <luke@madstop.com> writes:

> Can anyone who's having this problem please send details? I'm trying to
> reproduce it -- I've got 5 clients concurrently retrieving 200 10k files
> made of random binary, and I can't get any corruption or memory growth
> at all.
>
> Is everyone experiencing the problem using Mongrel? Webrick? What
> versions of ruby? Are only big files affected? Small files?
>
> I'm going to spend some more time fixing my client-side hack that just
> fails if md5s don't match, but this is a serious-enough problem that I
> want to fix the server-side too.

Here are as many details as I can come up with off-hand:

- Server is 0.23.2-3 (Debian).
- Clients are 0.24.1-1 (Debian and Red Hat) -- the problem does not occur
  with 0.23.2-3. It does occur with both Debian and Red Hat clients.
- Server is using Mongrel.
- Shortly before this problem happens, the load goes crazy on the
  puppetmaster and clients start failing to download resources, or hit
  the wide range of nil classes and comparisons with nil that we always
  get when the puppetmaster doesn't respond.
- Small files are affected, namely configuration files of all kinds.
- We don't serve large files through Puppet, so I'm not sure if they're
  affected or not.
- The most common symptom is that the file is replaced with the checksum
  as described in the bug.
- A less common but still frequent problem is that a configuration file is
  replaced with a directory (so you get, for example, an /etc/crontab
  directory instead of an /etc/crontab file).
- The directory problem also affects 0.23.2-3 clients, but the client
  rejects what the server says with an error message about not being able
  to use a directory as a resource. 0.24.1-1 clients happily replace the
  file with a directory.
- The version of Ruby on both the server and the clients is 1.8.6.36-3 on
  Debian. I'm not sure what it is on Red Hat clients. Something older.
- The puppetmaster runs for a while without any trouble, and then this
  suddenly happens. We *think* it's related to puppetmaster growing until
  it cuts into swap, but we're not at all sure.

We're currently downgrading all of our clients to 0.23.2 to avoid this problem since it's caused several production outages.

--
Russ Allbery (rra@stanford.edu) <http://www.eyrie.org/~eagle/>
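The two symptoms listed above (a file replaced with a checksum, and a file replaced with a directory) are easy to scan for on a client. The sketch below is one way to do that. It assumes a corrupted file contains nothing but a checksum: 32 hex digits, optionally preceded by a "{md5}"-style tag. That format is a guess, so adjust the pattern to whatever the corrupted files on your hosts actually contain.

```python
import re
from pathlib import Path

# Assumption: a corrupted file holds only a checksum, i.e. an optional
# "{type}" tag followed by 32 hex digits, with optional whitespace.
CHECKSUM_ONLY = re.compile(rb"\s*(\{\w+\})?[0-9a-fA-F]{32}\s*")


def classify(path: Path) -> str:
    """Return 'directory', 'missing', 'checksum' or 'ok' for a managed path."""
    if path.is_dir():
        return "directory"  # e.g. /etc/crontab turned into a directory
    if not path.is_file():
        return "missing"
    with path.open("rb") as handle:
        head = handle.read(128)  # a bare checksum is far shorter than this
    if CHECKSUM_ONLY.fullmatch(head):
        return "checksum"
    return "ok"
```

Calling `classify(Path("/etc/crontab"))` and so on for each file served through Puppet gives a quick list of hosts and files that need repair.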
Russ Allbery <rra@stanford.edu> writes:

> - Clients are 0.24.1-1 (Debian and Red Hat) -- the problem does not occur
>   with 0.23.2-3. It does occur with both Debian and Red Hat clients.

Oh, one other note on this. We find that with both 0.23.2 and 0.24.1 clients the client puppetd generally dies when this happens. We think that the reason why we're not seeing this with 0.23.2 may be that 0.23.2 clients die more quickly and therefore die before they can act on bad data, whereas 0.24.1 clients keep reconnecting and persist and then act on the bad data (and then finally die anyway).

--
Russ Allbery (rra@stanford.edu) <http://www.eyrie.org/~eagle/>
On Feb 21, 2008, at 11:15 PM, Russ Allbery wrote:

> Here are as many details as I can come up with off-hand:
> [...]

Thank you for the detail.

Is anyone having this problem with webrick?

I've just committed a client-side checksum validation fix, but that's only a band-aid, really, although it should hopefully get back to "fail rather than do evil".

Given the seriousness of this problem, I'm looking at pushing some of the fileserving work I was planning on saving for the REST transition; if it makes things cleaner and thus less prone to failure, it makes sense to do the work now.

Is anyone who's having the problem willing to run a host out of the current 0.24.x HEAD in git, to see if the problem is caught?

Russ, do you have any hope we might be able to find the source of these problems on the server? They seem to be the real problem, but I can't reproduce them so I can't diagnose them.

--
Dawkins's Law of Adversarial Debate:
When two incompatible beliefs are advocated with equal intensity, the truth does not lie half way between them.
---------------------------------------------------------------------
Luke Kanies | http://reductivelabs.com | http://madstop.com
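To make the "fail rather than do evil" idea concrete: the client compares the md5 of what it received with the checksum it was told to expect, and refuses to touch the target file on a mismatch. The sketch below shows that idea only. It is not the code that was committed to Puppet, and `write_verified` and `ChecksumMismatch` are names invented for this example.

```python
import hashlib
import os
import tempfile


class ChecksumMismatch(Exception):
    """Raised when received content does not match the expected md5."""


def write_verified(dest: str, content: bytes, expected_md5: str) -> None:
    """Replace `dest` with `content` only if its md5 matches `expected_md5`."""
    actual = hashlib.md5(content).hexdigest()
    if actual != expected_md5.lower():
        # Fail before touching the existing file.
        raise ChecksumMismatch(f"{dest}: expected {expected_md5}, got {actual}")
    # Write to a temporary file in the same directory and rename it into
    # place, so a failure mid-write never leaves a partial file behind.
    # Note: mkstemp creates the file with mode 0600; a real client would
    # also have to restore the intended owner and mode.
    directory = os.path.dirname(os.path.abspath(dest))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(content)
        os.replace(tmp_path, dest)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With this shape, a server that returns a checksum string or other bad data in place of the file body causes an error on the client, and the existing file stays as it was.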
Luke Kanies <luke@madstop.com> writes:

> Russ, do you have any hope we might be able to find the source of
> these problems on the server? They seem to be the real problem, but I
> can't reproduce them so I can't diagnose them.

I don't know -- I've never seen them outside of running a full production load on the servers. It has taken around three days for the problem to recur. We've consistently seen the problem all along, but it's only with the 0.24.1 clients that it caused file corruption rather than just an extremely slow puppetmaster that was mostly unusable until restarted. We're currently restarting it nightly to work around this problem, so we expect not to see it in production right now.

This is with about 240 nodes and thousands of files provided through the file server, with probably a good hundred pulled down by every node, all checking every half-hour.

--
Russ Allbery (rra@stanford.edu) <http://www.eyrie.org/~eagle/>
Russ Allbery <rra@stanford.edu> writes:

> This is with about 240 nodes and thousands of files provided through the
> file server, with probably a good hundred pulled down by every node, all
> checking every half-hour.

Oh, and we're running ten instances of puppetmaster on the master server. We tried increasing it to 20 to see if that was what was causing the problem, but that apparently didn't make any difference. When this problem happens, the whole service struggles; I'm not sure that every puppetmaster daemon is necessarily having problems, but clients are definitely not being successful in pulling manifests.

We see the following types of errors from puppetmaster all the time, but when this problem happens, we seem to see more of them:

  Feb 20 04:33:06 henson puppetmasterd[8748]: Denying authenticated client r7-app1-uat.stanford.edu(171.67.41.90) access to puppetmaster.getconfig
  Feb 20 04:33:15 henson puppetmasterd[8728]: undefined method `<' for nil:NilClass
  Feb 20 04:33:16 henson puppetmasterd[9167]: comparison of Fixnum with nil failed
  Feb 20 04:34:13 henson puppetmasterd[9220]: Denying unauthenticated client shadow.stanford.edu(171.64.7.21) access to fileserver.list

(Both of those clients do have signed certificates and normally have no trouble retrieving configurations.)

--
Russ Allbery (rra@stanford.edu) <http://www.eyrie.org/~eagle/>
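Since these errors show up all the time and only become more frequent when the problem hits, counting them per minute may help pin down when an episode starts. The sketch below tallies the three kinds of message quoted above. The regular expressions are written against those four sample lines only, and the labels are arbitrary, so other syslog formats or Puppet versions may need different patterns.

```python
import re
from collections import Counter

# Syslog prefix as it appears in the sample lines: "Feb 20 04:33:06 host puppetmasterd[pid]: msg"
LINE = re.compile(r"^(\w{3}\s+\d+ \d{2}:\d{2}):\d{2} \S+ puppetmasterd\[\d+\]: (.*)$")

# Labels are arbitrary; patterns cover only the messages quoted in this thread.
PATTERNS = [
    ("denied", re.compile(r"^Denying (un)?authenticated client ")),
    ("nil_method", re.compile(r"^undefined method .* for nil:NilClass")),
    ("nil_comparison", re.compile(r"^comparison of \w+ with nil failed")),
]

SAMPLE = [
    "Feb 20 04:33:06 henson puppetmasterd[8748]: Denying authenticated client r7-app1-uat.stanford.edu(171.67.41.90) access to puppetmaster.getconfig",
    "Feb 20 04:33:15 henson puppetmasterd[8728]: undefined method `<' for nil:NilClass",
    "Feb 20 04:33:16 henson puppetmasterd[9167]: comparison of Fixnum with nil failed",
    "Feb 20 04:34:13 henson puppetmasterd[9220]: Denying unauthenticated client shadow.stanford.edu(171.64.7.21) access to fileserver.list",
]


def tally(lines):
    """Count matching puppetmasterd errors, keyed by (minute, label)."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue
        minute, message = match.groups()
        for label, pattern in PATTERNS:
            if pattern.search(message):
                counts[(minute, label)] += 1
                break
    return counts
```

Feeding the syslog file through `tally` and sorting the result by minute shows whether the error rate climbs before clients start receiving bad data.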
> Is anyone having this problem with webrick?

Yes. I'm having the problem with webrick.

All machines are Debian etch. All are running ruby 1.8.5 (2006-08-25) [i486-linux]. All are running puppet 0.24.1-2 from the Debian testing repository. The puppetmaster is also 0.24.1-2 from the testing repository.

I have the exact same symptoms as Russ. The load on the puppetmaster host goes crazy, then clients start to die with corruption. Most of the time, for me, the corrupted files become directories.

--
Evan Borgstrom <evan@fatbox.ca>
FatBox Inc.
720 King St West, Suite 126
Toronto, Ontario, M5V 3S5
t:416.833.3763 | f:888.829.5963
msn: evan@fatbox.ca | aim: evan@fatbox.ca
On Feb 22, 2008, at 12:02 AM, Russ Allbery wrote:

> We see the following types of errors from puppetmaster all the time,
> but when this problem happens, we seem to see more of them:
> [...]

Would you be able to run the master in --trace mode with the output going to a file, to hopefully get a stack trace of the problem?

--
It's not that I'm afraid to die. I just don't want to be there when it happens. -- Woody Allen
---------------------------------------------------------------------
Luke Kanies | http://reductivelabs.com | http://madstop.com
I just had this happen with (I think) webrick. I am using whatever default puppetmasterd runs.

The file that was corrupted was /etc/resolv.conf. Damn.

So I'm running 0.24.1 on both the server and the clients.

amd64 debian etch 2.6.18-5
ruby 1.8.5

A whole lot happens on the box that's running puppetmasterd, so the idea that this happens under load seems pretty plausible.

-Joel

On Thu, 21 Feb 2008, Luke Kanies wrote:

> [...]
> Is anyone having this problem with webrick?
> [...]
Luke Kanies <luke@madstop.com> writes:

> Would you be able to run the master in --trace mode with the output
> going to a file, to hopefully get a stack trace of the problem?

I'll see if someone in my group can get that set up.

--
Russ Allbery (rra@stanford.edu) <http://www.eyrie.org/~eagle/>
On Thu, Feb 21, 2008 at 8:15 PM, Russ Allbery <rra@stanford.edu> wrote:

> - A less common but still frequent problem is that a configuration file is
>   replaced with a directory (so you get, for example, an /etc/crontab
>   directory instead of an /etc/crontab file).

We've seen this since moving the server to 0.24.1, but it seems to have been fixed by adding ensure => file to those resource definitions.

I'll see if I can reproduce it today with --trace on one of our test servers.

--
Nigel Kersten
Systems Administrator
MacOps
On Feb 21, 2008, at 11:20 PM, Russ Allbery wrote:

> Oh, one other note on this. We find that with both 0.23.2 and 0.24.1
> clients the client puppetd generally dies when this happens. We think
> that the reason why we're not seeing this with 0.23.2 may be that 0.23.2
> clients die more quickly and therefore die before they can act on bad
> data, whereas 0.24.1 clients keep reconnecting and persist and then act
> on the bad data (and then finally die anyway).

Okay. I'll set up a test at home that hammers a server for a few days and see what I can get.

Thanks.

--
I used to get high on life but lately I've built up a resistance.
---------------------------------------------------------------------
Luke Kanies | http://reductivelabs.com | http://madstop.com
On Feb 22, 2008, at 11:15 AM, Nigel Kersten wrote:

> We've seen this since moving the server to 0.24.1, but it seems to
> have been fixed by adding ensure => file to those resource
> definitions.

This is... interesting. So specifying 'ensure' fixes the problem?

> I'll see if I can reproduce it today with --trace on one of our test
> servers.

That'd be great.

--
Risk! Risk anything! Care no more for the opinions of others, for those voices. Do the hardest thing on earth for you. Act for yourself. Face the truth. -- Katherine Mansfield
---------------------------------------------------------------------
Luke Kanies | http://reductivelabs.com | http://madstop.com
Hi,

I'm also facing this problem sometimes. Here are the details of my puppet architecture:

- 2 instances of mongrel puppetmasterd (0.24.1-1 debian package)
- 1 nginx
- clients (0.24.1-1 debian package), about 10 clients with a runinterval of 15 minutes

Since I started having this corruption problem, I have limited the execution of puppet to working hours, in order to quickly fix incidents caused by this issue. And as I have reduced the working time of puppet, I have fewer incidents. If I remember correctly, after a restart of all puppetmasterd instances I don't have any problem for a while.

By the way, I don't know if it's useful information, but I'm using puppet over very bad connections (I mean that sometimes I have a very slow connection, ~ 5 - 30KB/s, with very high latency, ~ 200 - 1500 ms).

Kevin STEVENARD
System & network Administrator
LinkInTime ... Get Mobile )))
www.linkintime.com
Mobile: (00967) 712 000 838
Office: (00967) 1 427 377
Fax : (00967) 1 428 851
LinkInTime Ltd. Iran Street Haddah - Sana'a - P.O.Box. 16871, YEMEN

----- Original Message -----
From: "Luke Kanies" <luke@madstop.com>
To: "Puppet User Discussion" <puppet-users@madstop.com>
Sent: Friday, February 22, 2008 6:12:48 AM GMT +03:00 Kuwait / Riyadh
Subject: [Puppet-users] File corruption while serving

[...]

_______________________________________________
Puppet-users mailing list
Puppet-users@madstop.com
https://mail.madstop.com/mailman/listinfo/puppet-users