thr3ads.net - Puppet users - File corruption while serving [Feb 2008]

Luke Kanies

2008-Feb-22 03:12 UTC

File corruption while serving

Can anyone who's having this problem please send details?  I'm trying
to reproduce it -- I've got 5 clients concurrently retrieving 200 10k
files made of random binary, and I can't get any corruption or memory
growth at all.
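
For anyone who wants to try a similar reproduction, here is a minimal Python sketch of the test described above: 200 files of 10k random bytes, fetched concurrently by 5 clients, with every fetch checked against a recorded MD5. It is only an illustration, not the harness actually used. In particular, `fetch` is a hypothetical stand-in that reads from local disk; a real test would replace it with a retrieval from the puppetmaster's file server.

```python
import hashlib
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def make_corpus(directory, count=200, size=10 * 1024):
    """Write `count` files of random bytes; return {path: md5 hex digest}."""
    digests = {}
    for i in range(count):
        path = os.path.join(directory, f"file{i:03d}.bin")
        data = os.urandom(size)
        with open(path, "wb") as handle:
            handle.write(data)
        digests[path] = hashlib.md5(data).hexdigest()
    return digests


def fetch(path):
    """Hypothetical stand-in for a client retrieving a file from the server."""
    with open(path, "rb") as handle:
        return handle.read()


def find_corruption(digests, clients=5):
    """Let `clients` workers each fetch every file; return mismatching paths."""
    def one_client(_):
        return [path for path, md5 in digests.items()
                if hashlib.md5(fetch(path)).hexdigest() != md5]

    with ThreadPoolExecutor(max_workers=clients) as pool:
        results = list(pool.map(one_client, range(clients)))
    return sorted({path for bad in results for path in bad})


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        corpus = make_corpus(tmp)
        print("corrupted files:", find_corruption(corpus))
```

Running the loop repeatedly over several days, rather than once, is probably closer to the conditions under which the problem has been reported.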

Is everyone experiencing the problem using Mongrel?  Webrick?  What  
versions of ruby?  Are only big files affected?  Small files?

I'm going to spend some more time fixing my client-side hack that just
fails if md5s don't match, but this is a serious-enough problem that I
want to fix the server-side too.

-- 
Love is the triumph of imagination over intelligence.
     -- H. L. Mencken
---------------------------------------------------------------------
Luke Kanies | http://reductivelabs.com | http://madstop.com

Russ Allbery

2008-Feb-22 04:15 UTC

Re: File corruption while serving

Luke Kanies <luke@madstop.com> writes:

> Can anyone who's having this problem please send details?  I'm trying to
> reproduce it -- I've got 5 clients concurrently retrieving 200 10k files
> made of random binary, and I can't get any corruption or memory growth
> at all.
>
> Is everyone experiencing the problem using Mongrel?  Webrick?  What
> versions of ruby?  Are only big files affected?  Small files?
>
> I'm going to spend some more time fixing my client-side hack that just
> fails if md5s don't match, but this is a serious-enough problem that I
> want to fix the server-side too.

Here are as many details as I can come up with off-hand:

- Server is 0.23.2-3 (Debian).
- Clients are 0.24.1-1 (Debian and Red Hat) -- the problem does not occur
  with 0.23.2-3.  It does occur with both Debian and Red Hat clients.
- Server is using Mongrel.
- Shortly before this problem happens, the load goes crazy on the
  puppetmaster and clients start failing to be able to download resources
  or get the wide range of nil classes and comparisons with nil that we
  always get when the puppetmaster doesn't respond.
- Small files are affected, namely configuration files of all kinds.
- We don't serve large files through Puppet, so I'm not sure if they're
  affected or not.
- The most common symptom is that the file is replaced with the checksum
  as described in the bug.
- A less common but still frequent problem is that a configuration file is
  replaced with a directory (so you get, for example, an /etc/crontab
  directory instead of an /etc/crontab file).
- The directory problem also affects 0.23.2-3 clients, but the client
  rejects what the server says with an error message about not being able
  to use a directory as a resource.  0.24.1-1 clients happily replace the
  file with a directory.
- The version of Ruby on both the server and the clients is 1.8.6.36-3 on
  Debian.  I'm not sure what it is on Red Hat clients.  Something older.
- The puppetmaster runs for a while without any trouble, and then this
  suddenly happens.  We *think* it's related to puppetmaster growing until
  it cuts into swap, but we're not at all sure.
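
The two corruption symptoms in the list above (a file replaced by its checksum, or by a directory) can be checked for on a client with a short script. The following Python sketch is only an illustration. It assumes the stray checksum is a 32-character MD5 hex digest, optionally prefixed with `{md5}`; that format is an assumption, not something stated in this thread, and `classify` is a hypothetical helper name.

```python
import os
import re

# Assumption: a checksum written in place of the real content is an MD5 hex
# digest, optionally prefixed with "{md5}", and nothing else.
CHECKSUM_ONLY = re.compile(rb"\A(\{md5\})?[0-9a-f]{32}\s*\Z")


def classify(path):
    """Return 'directory', 'missing', 'checksum', or 'ok' for a managed file."""
    if os.path.isdir(path):
        return "directory"
    if not os.path.isfile(path):
        return "missing"
    with open(path, "rb") as handle:
        head = handle.read(64)
    return "checksum" if CHECKSUM_ONLY.match(head) else "ok"


if __name__ == "__main__":
    for managed in ("/etc/crontab", "/etc/resolv.conf"):
        print(managed, classify(managed))
```

The paths in the example run are just the files mentioned in this thread; a real check would walk whatever list of files the manifests manage.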

We're currently downgrading all of our clients to 0.23.2 to avoid this
problem since it's caused several production outages.

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>

Russ Allbery

2008-Feb-22 04:20 UTC

Re: File corruption while serving

Russ Allbery <rra@stanford.edu> writes:

> - Clients are 0.24.1-1 (Debian and Red Hat) -- the problem does not occur
>   with 0.23.2-3.  It does occur with both Debian and Red Hat clients.

Oh, one other note on this.  We find that with both 0.23.2 and 0.24.1
clients the client puppetd generally dies when this happens.  We think
that the reason why we're not seeing this with 0.23.2 may be that 0.23.2
clients die more quickly and therefore die before they can act on bad
data, whereas 0.24.1 clients keep reconnecting and persist and then act on
the bad data (and then finally die anyway).

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>

Luke Kanies

2008-Feb-22 04:48 UTC

Re: File corruption while serving

On Feb 21, 2008, at 11:15 PM, Russ Allbery wrote:

> Here are as many details as I can come up with off-hand:
>
> - Server is 0.23.2-3 (Debian).
> - Clients are 0.24.1-1 (Debian and Red Hat) -- the problem does not occur
>   with 0.23.2-3.  It does occur with both Debian and Red Hat clients.
> - Server is using Mongrel.
> - Shortly before this problem happens, the load goes crazy on the
>   puppetmaster and clients start failing to be able to download resources
>   or get the wide range of nil classes and comparisons with nil that we
>   always get when the puppetmaster doesn't respond.
> - Small files are affected, namely configuration files of all kinds.
> - We don't serve large files through Puppet, so I'm not sure if they're
>   affected or not.
> - The most common symptom is that the file is replaced with the checksum
>   as described in the bug.
> - A less common but still frequent problem is that a configuration file is
>   replaced with a directory (so you get, for example, an /etc/crontab
>   directory instead of an /etc/crontab file).
> - The directory problem also affects 0.23.2-3 clients, but the client
>   rejects what the server says with an error message about not being able
>   to use a directory as a resource.  0.24.1-1 clients happily replace the
>   file with a directory.
> - The version of Ruby on both the server and the clients is 1.8.6.36-3 on
>   Debian.  I'm not sure what it is on Red Hat clients.  Something older.
> - The puppetmaster runs for a while without any trouble, and then this
>   suddenly happens.  We *think* it's related to puppetmaster growing until
>   it cuts into swap, but we're not at all sure.

Thank you for the detail.

Is anyone having this problem with webrick?

I've just committed a client-side checksum validation fix, but that's
only a band-aid, really, although it should hopefully get back to
"fail rather than do evil".
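
To make "fail rather than do evil" concrete, here is a minimal Python sketch of the general idea: verify the fetched content against the advertised MD5 before touching the target file, and refuse to write on a mismatch. This is not the committed Puppet fix (which is Ruby and is not shown in this thread); `install_verified` and `ChecksumMismatch` are hypothetical names used only for illustration.

```python
import hashlib
import os
import tempfile


class ChecksumMismatch(Exception):
    """Raised when fetched content does not match the advertised checksum."""


def install_verified(dest, content, expected_md5):
    """Write `content` to `dest` only if its MD5 matches `expected_md5`."""
    actual = hashlib.md5(content).hexdigest()
    if actual != expected_md5:
        # Fail loudly and leave the existing file untouched.
        raise ChecksumMismatch(f"{dest}: expected {expected_md5}, got {actual}")
    # Write to a temporary file in the same directory, then rename it over
    # the target, so a failed write never leaves a half-written file behind.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(dest) or ".")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(content)
        os.replace(tmp_path, dest)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A check like this only stops a client from acting on bad data; it does nothing about whatever makes the server send bad data in the first place.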

Given the seriousness of this problem, I'm looking at pushing some of
the fileserving work I was planning on saving for the REST transition;
if it makes things cleaner and thus less prone to failure, it makes
sense to do the work now.

Is anyone who's having the problem willing to run a host out of the
current 0.24.x HEAD in git, to see if the problem is caught?

Russ, do you have any hope we might be able to find the source of
these problems on the server?  They seem to be the real problem, but I
can't reproduce them so I can't diagnose them.

-- 
Dawkins's Law of Adversarial Debate:
     When two incompatible beliefs are advocated with equal intensity,
     the truth does not lie half way between them.
---------------------------------------------------------------------
Luke Kanies | http://reductivelabs.com | http://madstop.com

Russ Allbery

2008-Feb-22 04:55 UTC

Re: File corruption while serving

Luke Kanies <luke@madstop.com> writes:

> Russ, do you have any hope we might be able to find the source of
> these problems on the server?  They seem to be the real problem, but I
> can't reproduce them so I can't diagnose them.

I don't know -- I've never seen them outside of running a full production
load on the servers.  It has taken around three days for the problem to
recur.  We've consistently seen the problem all along, but it's only with
the 0.24.1 clients that it caused file corruption rather than just an
extremely slow puppetmaster that was mostly unusable until restarted.

We're currently restarting it nightly to work around this problem, so we
expect not to see it in production right now.

This is with about 240 nodes and thousands of files provided through the
file server, with probably a good hundred pulled down by every node, all
checking every half-hour.

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>

Russ Allbery

2008-Feb-22 05:02 UTC

Re: File corruption while serving

Russ Allbery <rra@stanford.edu> writes:

> This is with about 240 nodes and thousands of files provided through the
> file server, with probably a good hundred pulled down by every node, all
> checking every half-hour.

Oh, and we're running ten instances of puppetmaster on the master server.
We tried increasing it to 20 to see if that was what was causing the
problem, but that didn't apparently make any difference.  When this
problem happens, the whole service struggles; I'm not sure that every
puppetmaster daemon is necessarily having problems, but clients are
definitely not being successful in pulling manifests.

We see the following types of errors from puppetmaster all the time, but
when this problem happens, we seem to see more of them:

Feb 20 04:33:06 henson puppetmasterd[8748]: Denying authenticated client r7-app1-uat.stanford.edu(171.67.41.90) access to puppetmaster.getconfig
Feb 20 04:33:15 henson puppetmasterd[8728]: undefined method `<' for nil:NilClass
Feb 20 04:33:16 henson puppetmasterd[9167]: comparison of Fixnum with nil failed
Feb 20 04:34:13 henson puppetmasterd[9220]: Denying unauthenticated client shadow.stanford.edu(171.64.7.21) access to fileserver.list

(Both of those clients do have signed certificates and normally have no
trouble retrieving configurations.)
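
Since these errors appear all the time and only become more frequent when the problem hits, counting them per category may be a useful early-warning signal. The following Python sketch tallies the four message types quoted above. It assumes each syslog entry sits on a single line in the format shown; the category labels and the `tally` function are hypothetical names chosen for illustration.

```python
import re
from collections import Counter

# Assumption: syslog lines look like the samples quoted in this message.
LINE = re.compile(r"puppetmasterd\[(?P<pid>\d+)\]: (?P<message>.*)")

# Hypothetical category labels, one per message type seen in the samples.
PATTERNS = [
    ("denied_authenticated", re.compile(r"^Denying authenticated client")),
    ("denied_unauthenticated", re.compile(r"^Denying unauthenticated client")),
    ("nil_method", re.compile(r"^undefined method .* for nil:NilClass")),
    ("nil_comparison", re.compile(r"^comparison of \w+ with nil failed")),
]


def tally(lines):
    """Count puppetmasterd messages per category; unmatched ones go to 'other'."""
    counts = Counter()
    for line in lines:
        found = LINE.search(line)
        if not found:
            continue  # not a puppetmasterd line
        message = found.group("message")
        for label, pattern in PATTERNS:
            if pattern.search(message):
                counts[label] += 1
                break
        else:
            counts["other"] += 1
    return counts
```

Comparing the counts for an ordinary hour against an hour in which corruption occurred would show whether the increase is real or just an impression.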

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>

Evan Borgstrom

2008-Feb-22 05:06 UTC

Re: File corruption while serving

> Is anyone having this problem with webrick?

Yes. I'm having the problem with webrick.

All machines are Debian etch. All are running ruby 1.8.5 (2006-08-25) 
[i486-linux]. All are running puppet 0.24.1-2 from the Debian testing 
repository. The puppetmaster is also 0.24.1-2 from the testing repository.

I have the exact same symptoms as Russ. The load on the puppetmaster 
host goes crazy then clients start to die with corruption. Most of the 
time, for me, the corrupted files become directories.

-- 
Evan Borgstrom <evan@fatbox.ca>
FatBox Inc.
720 King St West, Suite 126
Toronto, Ontario, M5V 3S5
t:416.833.3763 | f:888.829.5963
msn: evan@fatbox.ca | aim: evan@fatbox.ca

Luke Kanies

2008-Feb-22 05:10 UTC

Re: File corruption while serving

On Feb 22, 2008, at 12:02 AM, Russ Allbery wrote:

> Russ Allbery <rra@stanford.edu> writes:
>
>> This is with about 240 nodes and thousands of files provided through the
>> file server, with probably a good hundred pulled down by every node, all
>> checking every half-hour.
>
> Oh, and we're running ten instances of puppetmaster on the master server.
> We tried increasing it to 20 to see if that was what was causing the
> problem, but that didn't apparently make any difference.  When this
> problem happens, the whole service struggles; I'm not sure that every
> puppetmaster daemon is necessarily having problems, but clients are
> definitely not being successful in pulling manifests.
>
> We see the following types of errors from puppetmaster all the time, but
> when this problem happens, we seem to see more of them:
>
> Feb 20 04:33:06 henson puppetmasterd[8748]: Denying authenticated client r7-app1-uat.stanford.edu(171.67.41.90) access to puppetmaster.getconfig
> Feb 20 04:33:15 henson puppetmasterd[8728]: undefined method `<' for nil:NilClass
> Feb 20 04:33:16 henson puppetmasterd[9167]: comparison of Fixnum with nil failed
> Feb 20 04:34:13 henson puppetmasterd[9220]: Denying unauthenticated client shadow.stanford.edu(171.64.7.21) access to fileserver.list
>
> (Both of those clients do have signed certificates and normally have no
> trouble retrieving configurations.)

Would you be able to run the master in --trace mode with the output
going to a file, to hopefully get a stack trace of the problem?

-- 
It's not that I'm afraid to die. I just don't want to be there when it
happens. -- Woody Allen
---------------------------------------------------------------------
Luke Kanies | http://reductivelabs.com | http://madstop.com

Joel Wood

2008-Feb-22 05:16 UTC

Re: File corruption while serving

I just had this happen with (I think) webrick.  I am using whatever the
default is that puppetmasterd runs.

The file that was corrupted was /etc/resolv.conf.  Damn.

So I'm running 0.24.1 on both the server and the clients.

amd64 debian etch 2.6.18-5
ruby 1.8.5

A whole lot happens on the box that's running puppetmasterd, so the idea
that this happens under load seems pretty plausible.

-Joel

Russ Allbery

2008-Feb-22 06:01 UTC

Re: File corruption while serving

Luke Kanies <luke@madstop.com> writes:

> Would you be able to run the master in --trace mode with the output
> going to a file, to hopefully get a stack trace of the problem?

I'll see if someone in my group can get that set up.

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>

Nigel Kersten

2008-Feb-22 16:15 UTC

Re: File corruption while serving

On Thu, Feb 21, 2008 at 8:15 PM, Russ Allbery <rra@stanford.edu> wrote:

> - A less common but still frequent problem is that a configuration file is
>   replaced with a directory (so you get, for example, an /etc/crontab
>   directory instead of an /etc/crontab file).

We've seen this since moving the server to 0.24.1, but it seems to
have been fixed by adding ensure => file to those resource
definitions.

I'll see if I can reproduce it today with --trace on one of our test
servers.


-- 
Nigel Kersten
Systems Administrator
MacOps

Luke Kanies

2008-Feb-22 17:49 UTC

Re: File corruption while serving

On Feb 21, 2008, at 11:20 PM, Russ Allbery wrote:

> Russ Allbery <rra@stanford.edu> writes:
>
>> - Clients are 0.24.1-1 (Debian and Red Hat) -- the problem does not occur
>>   with 0.23.2-3.  It does occur with both Debian and Red Hat clients.
>
> Oh, one other note on this.  We find that with both 0.23.2 and 0.24.1
> clients the client puppetd generally dies when this happens.  We think
> that the reason why we're not seeing this with 0.23.2 may be that 0.23.2
> clients die more quickly and therefore die before they can act on bad
> data, whereas 0.24.1 clients keep reconnecting and persist and then act on
> the bad data (and then finally die anyway).

Okay.  I'll set up a test at home that hammers a server for a few days
and see what I can get.

Thanks.

-- 
I used to get high on life but lately I've built up a resistance.
---------------------------------------------------------------------
Luke Kanies | http://reductivelabs.com | http://madstop.com

Luke Kanies

2008-Feb-22 17:52 UTC

Re: File corruption while serving

On Feb 22, 2008, at 11:15 AM, Nigel Kersten wrote:

> We've seen this since moving the server to 0.24.1, but it seems to
> have been fixed by adding ensure => file to those resource
> definitions.

This is... interesting.  So specifying 'ensure' fixes the problem?

> I'll see if I can reproduce it today with --trace on one of our test
> servers.

That'd be great.

-- 
Risk! Risk anything! Care no more for the opinions of others, for those
voices. Do the hardest thing on earth for you. Act for yourself. Face
the truth. -- Katherine Mansfield
---------------------------------------------------------------------
Luke Kanies | http://reductivelabs.com | http://madstop.com

Kevin Stevenard [IT]

2008-Feb-23 07:38 UTC

Re: File corruption while serving

Hi, 

I'm also facing this problem, sometimes.

Here are details of my puppet architecture:
- 2 instances of mongrel puppetmasterd (0.24.1-1 debian package)
- 1 nginx
- clients (0.24.1-1 debian package), about 10 clients with a runinterval of 15
  minutes

Since I started having this corruption problem, I have limited the execution of
puppet to working hours, in order to quickly fix incidents caused by this issue.
And as I have reduced the working time of puppet, I have fewer incidents. If I
remember correctly, after a restart of all puppetmasterd instances I don't have
any problems for a while.

By the way, I don't know if it's useful information, but I'm using puppet over
very bad connections (I mean that sometimes I have a very slow connection,
~ 5 - 30KB/s, with very high latency, ~ 200 - 1500 ms).

Kevin STEVENARD 
System & network Administrator 
LinkInTime 
... Get Mobile ))) 

www.linkintime.com 

Mobile: (00967) 712 000 838 
Office: (00967) 1 427 377 
Fax : (00967) 1 428 851 

LinkInTime Ltd. 
Iran Street 
Haddah - Sana''a - P.O.Box. 16871, YEMEN 

_______________________________________________
Puppet-users mailing list
Puppet-users@madstop.com
https://mail.madstop.com/mailman/listinfo/puppet-users
