thr3ads.net - Puppet users - [Puppet Users] Mirror folder with large files [Jan 2011]

Daniel Piddock

2011-Jan-24 17:14 UTC

[Puppet Users] Mirror folder with large files

Dear list,

I'm attempting to mirror a folder containing a few large files from an
NFS location to the local drive. Subsequent runs take a lot longer than
I'd have expected, after the first run.

Using the following block a puppet apply run is currently taking 30 seconds:
file { '/usr/share/target':
        source   => 'file:///home/archive/source/',
        recurse  => true,
        backup   => false,
        checksum => mtime,
}

There are 42 files taking up 870MB. I'd have thought stating the files
in the source and target, comparing to each other (or a cache internal
to puppet as it doesn't set the mtime on files) would be a lot faster
than it is.
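
For illustration, the stat-only comparison described above could be sketched in Ruby roughly as follows. This is not how Puppet's file type works. The function name stale_files and the rule "missing, different size, or older than the source" are assumptions made for this example. A target copy that is newer than its source counts as up to date, since the copy's mtime is not reset to match.

```ruby
require 'find'

# Return the sorted relative paths under source_dir whose copy in target_dir
# is missing, has a different size, or is older than the source file.
# Only stat() metadata is consulted; no file is opened for reading.
# (Hypothetical helper for illustration, not part of Puppet.)
def stale_files(source_dir, target_dir)
  source_dir = File.expand_path(source_dir)
  stale = []
  Find.find(source_dir) do |src|
    next unless File.file?(src)
    # Strip the "source_dir/" prefix to get the path relative to the source.
    rel = src[(source_dir.length + 1)..-1]
    dst = File.join(target_dir, rel)
    if !File.exist?(dst)
      stale << rel
    else
      src_stat = File.stat(src)
      dst_stat = File.stat(dst)
      stale << rel if dst_stat.size != src_stat.size || dst_stat.mtime < src_stat.mtime
    end
  end
  stale.sort
end
```

Called as stale_files('/home/archive/source', '/usr/share/target'), it would list only the files that need copying, at the cost of two stat calls per file.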

I was curious about what puppet was up to, so ran it in strace. It's
reading every file every run, multiple times! Reads the target twice,
then the source twice before reading the target again. Considering I
wasn't expecting it to open any of the files at all this is total
overkill.

Is this horribly bugged or have I got a magic incantation that's causing
this behaviour? strace is rather verbose and I haven't exactly read all
80MB of the dump line by line.

Is there a neater way of just mirroring a folder based on modification
time? I suppose the easiest route would be an exec of rsync, at least I
have control over that.

I'm using Puppet 2.6.4.

Dan
I especially like the way Ruby searches for and loads the md5 library
every time it's used. What a performant language.

-- 
You received this message because you are subscribed to the Google Groups
"Puppet Users" group.
To post to this group, send email to puppet-users@googlegroups.com.
To unsubscribe from this group, send email to
puppet-users+unsubscribe@googlegroups.com.
For more options, visit this group at
http://groups.google.com/group/puppet-users?hl=en.

Patrick

2011-Jan-24 20:57 UTC

Re: [Puppet Users] Mirror folder with large files

On Jan 24, 2011, at 9:14 AM, Daniel Piddock wrote:
> Dear list,
> 
> I'm attempting to mirror a folder containing a few large files from an
> NFS location to the local drive. Subsequent runs take a lot longer than
> I'd have expected, after the first run.
> 
> Using the following block a puppet apply run is currently taking 30 seconds:
> file { '/usr/share/target':
>        source   => 'file:///home/archive/source/',
>        recurse  => true,
>        backup   => false,
>        checksum => mtime,
> }
> 
> There are 42 files taking up 870MB. I'd have thought stating the files
> in the source and target, comparing to each other (or a cache internal
> to puppet as it doesn't set the mtime on files) would be a lot faster
> than it is.
> 
> I was curious about what puppet was up to, so ran it in strace. It's
> reading every file every run, multiple times! Reads the target twice,
> then the source twice before reading the target again. Considering I
> wasn't expecting it to open any of the files at all this is total
> overkill.
> 
> Is this horribly bugged or have I got a magic incantation that's causing
> this behaviour? strace is rather verbose and I haven't exactly read all
> 80MB of the dump line by line.
> 
> Is there a neater way of just mirroring a folder based on modification
> time? I suppose the easiest route would be an exec of rsync, at least I
> have control over that.
> 
> I'm using Puppet 2.6.4.
> 
> Dan
> I especially like the way Ruby searches for and loads the md5 library
> every time it's used. What a performant language.
This sounds like a bug to me.  I do know that I never use recurse=true except
when necessary myself because it's too slow.  In your position, I would
replace it with an rsync and file a bug.
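
As a rough sketch of the rsync replacement, the command could be assembled and run from Ruby like this. The helper names rsync_mirror_command and mirror are hypothetical, and the optional --delete is an assumption about what "mirror" should mean here. The -a (archive) flag recurses and preserves modification times, which lets rsync skip files whose size and mtime already match.

```ruby
# Build the argument vector for mirroring the contents of source_dir into
# target_dir with rsync. (Hypothetical helper, for illustration only.)
def rsync_mirror_command(source_dir, target_dir, delete = false)
  cmd = ['rsync', '-a']
  # --delete removes files from the target that no longer exist in the source.
  cmd << '--delete' if delete
  # A trailing slash on the source means "copy the directory's contents",
  # not the directory itself. File.join(dir, '') guarantees exactly one.
  cmd << File.join(source_dir, '')
  cmd << target_dir
  cmd
end

# Run the mirror; returns true when rsync exits with status 0.
def mirror(source_dir, target_dir, delete = false)
  system(*rsync_mirror_command(source_dir, target_dir, delete))
end
```

Passing the arguments as a list rather than one string keeps the shell out of the picture, so paths with spaces need no quoting.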

Also, does it behave this badly when no changes are made or just when making
changes?

Daniel Piddock

2011-Jan-25 12:16 UTC

Re: [Puppet Users] Mirror folder with large files

On 24/01/11 20:57, Patrick wrote:
> On Jan 24, 2011, at 9:14 AM, Daniel Piddock wrote:
>
>> Dear list,
>>
>> I'm attempting to mirror a folder containing a few large files from an
>> NFS location to the local drive. Subsequent runs take a lot longer than
>> I'd have expected, after the first run.
>>
>> Using the following block a puppet apply run is currently taking 30 seconds:
>> file { '/usr/share/target':
>>        source   => 'file:///home/archive/source/',
>>        recurse  => true,
>>        backup   => false,
>>        checksum => mtime,
>> }
>>
>> There are 42 files taking up 870MB. I'd have thought stating the files
>> in the source and target, comparing to each other (or a cache internal
>> to puppet as it doesn't set the mtime on files) would be a lot faster
>> than it is.
>>
>> I was curious about what puppet was up to, so ran it in strace. It's
>> reading every file every run, multiple times! Reads the target twice,
>> then the source twice before reading the target again. Considering I
>> wasn't expecting it to open any of the files at all this is total
>> overkill.
>>
>> Is this horribly bugged or have I got a magic incantation that's causing
>> this behaviour? strace is rather verbose and I haven't exactly read all
>> 80MB of the dump line by line.
>>
>> Is there a neater way of just mirroring a folder based on modification
>> time? I suppose the easiest route would be an exec of rsync, at least I
>> have control over that.
>>
>> I'm using Puppet 2.6.4.
>>
>> Dan
>> I especially like the way Ruby searches for and loads the md5 library
>> every time it's used. What a performant language.
> This sounds like a bug to me.  I do know that I never use recurse=true except
> when necessary myself because it's too slow.  In your position, I would
> replace it with an rsync and file a bug.
>
> Also, does it behave this badly when no changes are made or just when making
> changes?
This happens every single run, even when source and target have not changed
state.

I tried stracing when just a single file is copied. Puppet is still
reading both source and target when checksum => mtime is used, although
just the once.

I think there might be two bugs here - checksum does not work with
timestamps and recurse is horribly broken.

Puppet issues 6003 and 6004 raised.

Dan

Brice Figureau

2011-Jan-25 12:45 UTC

Re: [Puppet Users] Mirror folder with large files

On Mon, 2011-01-24 at 17:14 +0000, Daniel Piddock wrote:
> Dear list,
> 
> I'm attempting to mirror a folder containing a few large files from an
> NFS location to the local drive. Subsequent runs take a lot longer than
> I'd have expected, after the first run.
> 
> Using the following block a puppet apply run is currently taking 30 seconds:
> file { '/usr/share/target':
>         source   => 'file:///home/archive/source/',
>         recurse  => true,
>         backup   => false,
>         checksum => mtime,
> }
> 
> There are 42 files taking up 870MB. I'd have thought stating the files
> in the source and target, comparing to each other (or a cache internal
> to puppet as it doesn't set the mtime on files) would be a lot faster
> than it is.
This is a naive view of the problem :)
The puppet file type is certainly the most complex resource abstraction
puppet embeds (just think about the fact that it handles dir, files,
link, remote recursion, local recursion, etc...).
> I was curious about what puppet was up to, so ran it in strace. It's
> reading every file every run, multiple times! Reads the target twice,
> then the source twice before reading the target again. Considering I
> wasn't expecting it to open any of the files at all this is total
> overkill.
> 
> Is this horribly bugged or have I got a magic incantation that's causing
> this behaviour? strace is rather verbose and I haven't exactly read all
> 80MB of the dump line by line.
> 
> Is there a neater way of just mirroring a folder based on modification
> time? I suppose the easiest route would be an exec of rsync, at least I
> have control over that.
Yes, I think rsync is the sanest way to do this.

Recursive file resources (and especially sourced ones) are really tough
for puppet to handle in the current way the code is working.

Puppet manages individual file resources, and each resource it manages
is held as an instance of this resource in memory.

For deep/large file hierarchies, Puppet has to create/manage an
individual resource per file/directory present in this hierarchy, which
consumes both cpu and ram (due to the way the ruby GC is poorly
implemented and the time it takes to create a ruby object).
And I won't even talk about the scalability issues of the generation and
handling of billions of "change" events coming up each time a file is
changed (which happens for instance the first time puppet runs).

I think I remember mtime is a checksum valid only for directories, and
puppet automatically switches to md5 for files (I don't really know the
reason, but I'm sure redmine knows it).

(One of) the problems is that puppet reads the file once to compute the
md5 sum, then it also reads it again to perform the copy when it detects
a change. I don't exactly know why it would write multiple times, but
I'm sure you can debug this by adding debug statements in
puppet/type/file/content.rb where all the writes happen.
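
To illustrate the cost being described, here is a minimal Ruby sketch of the alternative: computing the MD5 sum while copying, so the source is read only once. It is not Puppet's implementation and does not reflect the code in content.rb. The name copy_with_md5 and the default chunk size are assumptions for the example.

```ruby
require 'digest/md5'

# Copy src to dst in fixed-size chunks, feeding each chunk to the digest as
# it is written, so the source file is read exactly once.
# Returns the hex MD5 of the copied data.
# (Hypothetical helper for illustration, not Puppet code.)
def copy_with_md5(src, dst, chunk_size = 1024 * 1024)
  digest = Digest::MD5.new
  File.open(src, 'rb') do |input|
    File.open(dst, 'wb') do |output|
      # IO#read returns nil at end of file, which ends the loop.
      while (chunk = input.read(chunk_size))
        digest.update(chunk)
        output.write(chunk)
      end
    end
  end
  digest.hexdigest
end
```

Reading in chunks also keeps memory use bounded, which matters for files in the size range discussed in this thread.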

> I'm using Puppet 2.6.4.
> 
> Dan
> I especially like the way Ruby searches for and loads the md5 library
> every time it's used. What a performant language.
This certainly comes from this code in Puppet::Util::Checksums:
  # Calculate a checksum of a file's content using Digest::MD5.
  def md5_file(filename, lite = false)
    require 'digest/md5'

    digest = Digest::MD5.new
    checksum_file(digest, filename, lite)
  end

Notice how the "require" is in the function instead of being outside.
I'd think that ruby would be smart enough to understand the file has
already been "required" and not bother, but apparently it doesn't do
that for you. Can you give us what ruby version and what platform you're
using?
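
A minimal sketch of the alternative layout, with the require hoisted to file level so it runs once when the file is loaded. This md5_file is a stand-in written for the example: it reads the file itself rather than calling Puppet's checksum_file helper, and the chunk_size parameter is an assumption. Whether Ruby 1.8.7 still searches the load path on every in-function require is not confirmed here. What the sketch does show is that a repeated require does not load the library again.

```ruby
# Loaded once, when this file is first evaluated.
require 'digest/md5'

# Compute the hex MD5 of a file's content, reading it in chunks.
# (Stand-in for illustration; not Puppet::Util::Checksums#md5_file.)
def md5_file(filename, chunk_size = 4096)
  digest = Digest::MD5.new
  File.open(filename, 'rb') do |f|
    while (chunk = f.read(chunk_size))
      digest << chunk
    end
  end
  digest.hexdigest
end

# require returns false when the feature is already loaded, so the library
# code itself is not evaluated a second time.
puts require('digest/md5')  # prints "false"
```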
-- 
Brice Figureau
Follow the latest Puppet Community evolutions on www.planetpuppet.org!

Daniel Piddock

2011-Jan-25 14:40 UTC

Re: [Puppet Users] Mirror folder with large files

On 25/01/11 12:45, Brice Figureau wrote:
> On Mon, 2011-01-24 at 17:14 +0000, Daniel Piddock wrote:
>> Dear list,
>>
>> I'm attempting to mirror a folder containing a few large files from an
>> NFS location to the local drive. Subsequent runs take a lot longer than
>> I'd have expected, after the first run.
>>
>> Using the following block a puppet apply run is currently taking 30 seconds:
>> file { '/usr/share/target':
>>         source   => 'file:///home/archive/source/',
>>         recurse  => true,
>>         backup   => false,
>>         checksum => mtime,
>> }
>>
>> There are 42 files taking up 870MB. I'd have thought stating the files
>> in the source and target, comparing to each other (or a cache internal
>> to puppet as it doesn't set the mtime on files) would be a lot faster
>> than it is.
> This is a naive view of the problem :)
> The puppet file type is certainly the most complex resource abstraction
> puppet embeds (just think about the fact that it handles dir, files,
> link, remote recursion, local recursion, etc...).
Yes, it's a shame that "checksum => mtime" doesn't do what it says on
the tin, and that the documentation doesn't really mention anything
about how the checksums differ or function. However, md5summing
every file twice when recursing seems a bit broken.
>> I was curious about what puppet was up to, so ran it in strace. It's
>> reading every file every run, multiple times! Reads the target twice,
>> then the source twice before reading the target again. Considering I
>> wasn't expecting it to open any of the files at all this is total
>> overkill.
>>
>> Is this horribly bugged or have I got a magic incantation that's causing
>> this behaviour? strace is rather verbose and I haven't exactly read all
>> 80MB of the dump line by line.
>>
>> Is there a neater way of just mirroring a folder based on modification
>> time? I suppose the easiest route would be an exec of rsync, at least I
>> have control over that.
> Yes, I think rsync is the sanest way to do this.
>
> Recursive file resources (and especially sourced ones) are really tough
> for puppet to handle in the current way the code is working.
>
> Puppet manages individual file resources, and each resource it manages
> is held as an instance of this resource in memory.
>
> For deep/large file hierarchies, Puppet has to create/manage an
> individual resource per file/directory present in this hierarchy, which
> consumes both cpu and ram (due to the way the ruby GC is poorly
> implemented and the time it takes to create a ruby object).
> And I won't even talk about the scalability issues of the generation and
> handling of billions of "change" events coming up each time a file is
> changed (which happens for instance the first time puppet runs).
>
> I think I remember mtime is a checksum valid only for directories, and
> puppet automatically switches to md5 for files (I don't really know the
> reason, but I'm sure redmine knows it).
>
> (One of) the problems is that puppet reads the file once to compute the
> md5 sum, then it also reads it again to perform the copy when it detects
> a change. I don't exactly know why it would write multiple times, but
> I'm sure you can debug this by adding debug statements in
> puppet/type/file/content.rb where all the writes happen.
In recursion, the source file is read twice, then the target is tested
and, if it doesn't exist, the source is read again for the copy. If the
target does exist, it's read twice as well. It does not matter whether
the checksum was specified as md5 or mtime. I put more detail on issue
6003: http://projects.puppetlabs.com/issues/6003

Writing only happens once per changed file.
>> I'm using Puppet 2.6.4.
>>
>> Dan
>> I especially like the way Ruby searches for and loads the md5 library
>> every time it's used. What a performant language.
> This certainly comes from this code in Puppet::Util::Checksums:
>   # Calculate a checksum of a file's content using Digest::MD5.
>   def md5_file(filename, lite = false)
>     require 'digest/md5'
>
>     digest = Digest::MD5.new
>     checksum_file(digest, filename, lite)
>   end
>
> Notice how the "require" is in the function instead of being outside.
> I'd think that ruby would be smart enough to understand the file has
> already been "required" and not bother, but apparently it doesn't do
> that for you. Can you give us what ruby version and what platform you're
> using?
The client I'm using for testing is Fedora 14,
ruby-1.8.7.330-1.fc14.x86_64

Dan
