Dear list,

I'm attempting to mirror a folder containing a few large files from an NFS location to the local drive. Subsequent runs take a lot longer than I'd have expected, after the first run.

Using the following block a puppet apply run is currently taking 30 seconds:

    file { '/usr/share/target':
      source   => 'file:///home/archive/source/',
      recurse  => true,
      backup   => false,
      checksum => mtime,
    }

There are 42 files taking up 870MB. I'd have thought stating the files in the source and target, comparing to each other (or a cache internal to puppet as it doesn't set the mtime on files) would be a lot faster than it is.

I was curious about what puppet was up to, so ran it in strace. It's reading every file every run, multiple times! Reads the target twice, then the source twice before reading the target again. Considering I wasn't expecting it to open any of the files at all this is total overkill.

Is this horribly bugged or have I got a magic incantation that's causing this behaviour? strace is rather verbose and I haven't exactly read all 80MB of the dump line by line.

Is there a neater way of just mirroring a folder based on modification time? I suppose the easiest route would be an exec of rsync, at least I have control over that.

I'm using Puppet 2.6.4.

Dan
I especially like the way Ruby searches for and loads the md5 library every time it's used. What a performant language.

--
You received this message because you are subscribed to the Google Groups "Puppet Users" group.
To post to this group, send email to puppet-users@googlegroups.com.
To unsubscribe from this group, send email to puppet-users+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/puppet-users?hl=en.

On Jan 24, 2011, at 9:14 AM, Daniel Piddock wrote:

> Dear list,
>
> I'm attempting to mirror a folder containing a few large files from an
> NFS location to the local drive. Subsequent runs take a lot longer than
> I'd have expected, after the first run.
>
> Using the following block a puppet apply run is currently taking 30 seconds:
> file { '/usr/share/target':
>   source => 'file:///home/archive/source/',
>   recurse => true,
>   backup => false,
>   checksum => mtime,
> }
>
> There are 42 files taking up 870MB. I'd have thought stating the files
> in the source and target, comparing to each other (or a cache internal
> to puppet as it doesn't set the mtime on files) would be a lot faster
> than it is.
>
> I was curious about what puppet was up to, so ran it in strace. It's
> reading every file every run, multiple times! Reads the target twice,
> then the source twice before reading the target again. Considering I
> wasn't expecting it to open any of the files at all this is total overkill.
>
> Is this horribly bugged or have I got a magic incantation that's causing
> this behaviour? strace is rather verbose and I haven't exactly read all
> 80MB of the dump line by line.
>
> Is there a neater way of just mirroring a folder based on modification
> time? I suppose the easiest route would be an exec of rsync, at least I
> have control over that.
>
> I'm using Puppet 2.6.4.
>
> Dan
> I especially like the way Ruby searches for and loads the md5 library
> every time it's used. What a performant language.

This sounds like a bug to me. I do know that I never use recurse=true myself except when necessary, because it's too slow. In your position, I would replace it with an rsync and file a bug.

Also, does it behave this badly when no changes are made or just when making changes?

On 24/01/11 20:57, Patrick wrote:

> On Jan 24, 2011, at 9:14 AM, Daniel Piddock wrote:
>
>> Dear list,
>>
>> I'm attempting to mirror a folder containing a few large files from an
>> NFS location to the local drive. Subsequent runs take a lot longer than
>> I'd have expected, after the first run.
>>
>> Using the following block a puppet apply run is currently taking 30 seconds:
>> file { '/usr/share/target':
>>   source => 'file:///home/archive/source/',
>>   recurse => true,
>>   backup => false,
>>   checksum => mtime,
>> }
>>
>> There are 42 files taking up 870MB. I'd have thought stating the files
>> in the source and target, comparing to each other (or a cache internal
>> to puppet as it doesn't set the mtime on files) would be a lot faster
>> than it is.
>>
>> I was curious about what puppet was up to, so ran it in strace. It's
>> reading every file every run, multiple times! Reads the target twice,
>> then the source twice before reading the target again. Considering I
>> wasn't expecting it to open any of the files at all this is total overkill.
>>
>> Is this horribly bugged or have I got a magic incantation that's causing
>> this behaviour? strace is rather verbose and I haven't exactly read all
>> 80MB of the dump line by line.
>>
>> Is there a neater way of just mirroring a folder based on modification
>> time? I suppose the easiest route would be an exec of rsync, at least I
>> have control over that.
>>
>> I'm using Puppet 2.6.4.
>>
>> Dan
>> I especially like the way Ruby searches for and loads the md5 library
>> every time it's used. What a performant language.
> This sounds like a bug to me. I do know that I never use recurse=true
> myself except when necessary, because it's too slow. In your position,
> I would replace it with an rsync and file a bug.
>
> Also, does it behave this badly when no changes are made or just when
> making changes?

This happens every single run; source and target have not changed state.

I tried stracing when just a single file is copied. Puppet is still reading both source and target when checksum => mtime is used, although just the once.

I think there might be two bugs here - checksum does not work with timestamps and recurse is horribly broken. Puppet issues 6003 and 6004 raised.

Dan
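
For comparison, here is a minimal Ruby sketch of the stat-only mirroring the original post expected from `checksum => mtime`. It is not Puppet's code: `stale?` and `mirror` are hypothetical helper names, and the rule "copy when the target is missing, differs in size, or is older than the source" is an assumption about what a reasonable mtime comparison would be. File contents are only read when a copy actually happens.

```ruby
require 'fileutils'

# Hypothetical helper (not part of Puppet): decide from stat() metadata alone
# whether the target needs refreshing. Neither file is opened.
def stale?(source, target)
  return true unless File.exist?(target)
  src = File.stat(source)
  dst = File.stat(target)
  # Compare whole seconds so filesystems with coarser timestamp precision
  # do not make an up-to-date copy look old.
  src.size != dst.size || src.mtime.to_i > dst.mtime.to_i
end

# Hypothetical helper: copy stale files from source_dir into target_dir and
# return the relative paths that were copied. Dotfiles are skipped by the glob.
def mirror(source_dir, target_dir)
  copied = []
  Dir.glob('**/*', base: source_dir).sort.each do |rel|
    src = File.join(source_dir, rel)
    dst = File.join(target_dir, rel)
    if File.directory?(src)
      FileUtils.mkdir_p(dst)
    elsif stale?(src, dst)
      FileUtils.mkdir_p(File.dirname(dst))
      # preserve: true carries the source mtime over to the copy, so the
      # next run sees matching metadata and skips the file.
      FileUtils.cp(src, dst, preserve: true)
      copied << rel
    end
  end
  copied
end
```

On an unchanged tree a second call to `mirror` should return an empty list after issuing only stat calls, which is roughly what `rsync -a` does by default.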

On Mon, 2011-01-24 at 17:14 +0000, Daniel Piddock wrote:

> Dear list,
>
> I'm attempting to mirror a folder containing a few large files from an
> NFS location to the local drive. Subsequent runs take a lot longer than
> I'd have expected, after the first run.
>
> Using the following block a puppet apply run is currently taking 30 seconds:
> file { '/usr/share/target':
>   source => 'file:///home/archive/source/',
>   recurse => true,
>   backup => false,
>   checksum => mtime,
> }
>
> There are 42 files taking up 870MB. I'd have thought stating the files
> in the source and target, comparing to each other (or a cache internal
> to puppet as it doesn't set the mtime on files) would be a lot faster
> than it is.

This is a naive view of the problem :)
The puppet file type is certainly the most complex resource abstraction puppet embeds (just think about the fact that it handles dirs, files, links, remote recursion, local recursion, etc...).

> I was curious about what puppet was up to, so ran it in strace. It's
> reading every file every run, multiple times! Reads the target twice,
> then the source twice before reading the target again. Considering I
> wasn't expecting it to open any of the files at all this is total overkill.
>
> Is this horribly bugged or have I got a magic incantation that's causing
> this behaviour? strace is rather verbose and I haven't exactly read all
> 80MB of the dump line by line.
>
> Is there a neater way of just mirroring a folder based on modification
> time? I suppose the easiest route would be an exec of rsync, at least I
> have control over that.

Yes, I think rsync is the sanest way to do this.

Recursive file resources (and especially sourced ones) are really tough for puppet to handle in the current way the code is working.

Puppet manages individual file resources, and it holds every resource it manages as an instance in memory.

For deep/large file hierarchies, Puppet has to create/manage an individual resource per file/directory present in this hierarchy, which consumes both cpu and ram (due to the way the ruby GC is poorly implemented and the time it takes to create a ruby object).
And I'm not even talking about the scalability issues of generating and handling billions of "change" events each time a file is changed (which happens for instance the first time puppet runs).

I think I remember mtime is a checksum valid only for directories, and puppet automatically switches to md5 for files (I don't really know the reason, but I'm sure redmine knows it).

(One of) the problems is that puppet reads the file once to compute the md5 sum, then it also reads it again to perform the copy when it detects a change. I don't exactly know why it would write multiple times, but I'm sure you can debug this by adding debug statements in puppet/type/file/content.rb where all the writes happen.

> I'm using Puppet 2.6.4.
>
> Dan
> I especially like the way Ruby searches for and loads the md5 library
> every time it's used. What a performant language.

This certainly comes from this code in Puppet::Util::Checksums:

    # Calculate a checksum of a file's content using Digest::MD5.
    def md5_file(filename, lite = false)
      require 'digest/md5'

      digest = Digest::MD5.new
      checksum_file(digest, filename, lite)
    end

Notice how the "require" is in the function instead of being outside. I'd think that ruby would be smart enough to understand the file has already been "required" and not bother, but apparently it doesn't do that for you. Can you give us what ruby version and what platform you're using?

--
Brice Figureau
Follow the latest Puppet Community evolutions on www.planetpuppet.org!
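
As an illustration of the point about where the `require` sits, here is a minimal sketch with the `require` hoisted to file level. It is not the Puppet source: this `md5_file` is a simplified stand-in that drops the `lite` argument and inlines a chunked read in place of `checksum_file`, and the 4096-byte chunk size is an arbitrary assumption. Ruby does not reload a feature that is already loaded, and a repeated `require` returns `false`. Whether the repeated call still triggers a load-path search on a given Ruby version (as the strace output suggests) is not settled in this thread.

```ruby
require 'digest/md5' # resolved once, when this file is loaded

# Simplified stand-in for Puppet::Util::Checksums#md5_file (an assumption,
# not the real implementation): stream the file through the digest in chunks
# so large files are never held in memory whole.
def md5_file(filename)
  digest = Digest::MD5.new
  File.open(filename, 'rb') do |file|
    while (chunk = file.read(4096))
      digest << chunk
    end
  end
  digest.hexdigest
end

# A second require of an already-loaded feature is a no-op that returns false.
puts require('digest/md5') # prints "false"
```

Hoisting the call means any lookup cost is paid once per process rather than once per checksummed file.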

On 25/01/11 12:45, Brice Figureau wrote:

> On Mon, 2011-01-24 at 17:14 +0000, Daniel Piddock wrote:
>> Dear list,
>>
>> I'm attempting to mirror a folder containing a few large files from an
>> NFS location to the local drive. Subsequent runs take a lot longer than
>> I'd have expected, after the first run.
>>
>> Using the following block a puppet apply run is currently taking 30 seconds:
>> file { '/usr/share/target':
>>   source => 'file:///home/archive/source/',
>>   recurse => true,
>>   backup => false,
>>   checksum => mtime,
>> }
>>
>> There are 42 files taking up 870MB. I'd have thought stating the files
>> in the source and target, comparing to each other (or a cache internal
>> to puppet as it doesn't set the mtime on files) would be a lot faster
>> than it is.
> This is a naive view of the problem :)
> The puppet file type is certainly the most complex resource abstraction
> puppet embeds (just think about the fact that it handles dirs, files,
> links, remote recursion, local recursion, etc...).

Yes, it's a shame that "checksum => mtime" doesn't do what it says on the tin, and that the documentation doesn't really explain how the checksum types differ or function. However, md5summing every file twice when recursing seems a bit broken.

>> I was curious about what puppet was up to, so ran it in strace. It's
>> reading every file every run, multiple times! Reads the target twice,
>> then the source twice before reading the target again. Considering I
>> wasn't expecting it to open any of the files at all this is total overkill.
>>
>> Is this horribly bugged or have I got a magic incantation that's causing
>> this behaviour? strace is rather verbose and I haven't exactly read all
>> 80MB of the dump line by line.
>>
>> Is there a neater way of just mirroring a folder based on modification
>> time? I suppose the easiest route would be an exec of rsync, at least I
>> have control over that.
> Yes, I think rsync is the sanest way to do this.
>
> Recursive file resources (and especially sourced ones) are really tough
> for puppet to handle in the current way the code is working.
>
> Puppet manages individual file resources, and it holds every resource
> it manages as an instance in memory.
>
> For deep/large file hierarchies, Puppet has to create/manage an
> individual resource per file/directory present in this hierarchy, which
> consumes both cpu and ram (due to the way the ruby GC is poorly
> implemented and the time it takes to create a ruby object).
> And I'm not even talking about the scalability issues of generating and
> handling billions of "change" events each time a file is
> changed (which happens for instance the first time puppet runs).
>
> I think I remember mtime is a checksum valid only for directories, and
> puppet automatically switches to md5 for files (I don't really know the
> reason, but I'm sure redmine knows it).
>
> (One of) the problems is that puppet reads the file once to compute the
> md5 sum, then it also reads it again to perform the copy when it detects
> a change. I don't exactly know why it would write multiple times, but
> I'm sure you can debug this by adding debug statements in
> puppet/type/file/content.rb where all the writes happen.

In recursion, the source file is read twice, then the target is tested, and if it doesn't exist the source is read again for the copy. If the target did exist, it's read twice as well. It does not matter whether the checksum was specified as md5 or mtime. I put more detail on issue 6003: http://projects.puppetlabs.com/issues/6003

Writing only happens once per changed file.

>> I'm using Puppet 2.6.4.
>>
>> Dan
>> I especially like the way Ruby searches for and loads the md5 library
>> every time it's used. What a performant language.
> This certainly comes from this code in Puppet::Util::Checksums:
>   # Calculate a checksum of a file's content using Digest::MD5.
>   def md5_file(filename, lite = false)
>     require 'digest/md5'
>
>     digest = Digest::MD5.new
>     checksum_file(digest, filename, lite)
>   end
>
> Notice how the "require" is in the function instead of being outside.
> I'd think that ruby would be smart enough to understand the file has
> already been "required" and not bother, but apparently it doesn't do
> that for you. Can you give us what ruby version and what platform you're
> using?

The client I'm using for testing is Fedora 14, ruby-1.8.7.330-1.fc14.x86_64

Dan
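
The repeated reads described above (source twice, target twice) could in principle be avoided by remembering each digest for the duration of a run. The following is a minimal sketch of that idea only. `ChecksumCache` is a hypothetical class, not something Puppet provides, and keying the cache on path, size and whole-second mtime is an assumption about what counts as "unchanged".

```ruby
require 'digest/md5'

# Hypothetical per-run cache (not Puppet's implementation): each file's
# content is read at most once as long as its size and mtime stay the same.
class ChecksumCache
  attr_reader :reads # number of times file content was actually read

  def initialize
    @sums = {}
    @reads = 0
  end

  def md5(path)
    stat = File.stat(path)
    key = [path, stat.size, stat.mtime.to_i]
    @sums[key] ||= begin
      @reads += 1
      Digest::MD5.file(path).hexdigest
    end
  end
end
```

Asking such a cache for the same unchanged file twice costs one stat call the second time instead of a full read, which matters for a tree holding 870MB.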