Hi All,

I have an interesting question that may or may not be answerable from some internal ZFS semantics.

I have a Sun Messaging Server which has 5 ZFS based email stores. The Sun Messaging Server uses hard links to link identical messages together. Messages are stored in standard SMTP MIME format, so the binary attachments are included in the message ASCII. Each individual message is stored in a separate file.

So as an example, if a user sends an email with a 2MB attachment to the staff mailing list and there are 3 staff stores with 500 users on each, it will generate space usage like:

/store1 = 1 x 2MB + 499 x 1KB
/store2 = 1 x 2MB + 499 x 1KB
/store3 = 1 x 2MB + 499 x 1KB

So total storage used is around ~7.5MB due to the hard linking taking place on each store. If hard linking capability had been turned off, this same message would have used 1500 x 2MB = 3GB worth of storage.

My question is: are there any simple ways of determining the space savings on each of the stores from the usage of hard links? The reason I ask is that our educational institute wishes to migrate these stores to M$ Exchange 2010, which doesn't do message single instancing. I need to try and project what the storage requirement will be on the new target environment.

If anyone has any ideas, be it ZFS based or any useful scripts that could help here, I am all ears. I may post this to Sun Managers as well to see if anyone there might have any ideas on this.

Regards,

Scott.
On Sun, Jun 12, 2011 at 4:14 PM, Scott Lawson <Scott.Lawson at manukau.ac.nz> wrote:
> I have an interesting question that may or may not be answerable from some
> internal ZFS semantics.

This is really standard Unix filesystem semantics.

> [...]
>
> So total storage used is around ~7.5MB due to the hard linking taking place
> on each store.
>
> If hard linking capability had been turned off, this same message would have
> used 1500 x 2MB = 3GB worth of storage.
>
> My question is: are there any simple ways of determining the space savings on
> each of the stores from the usage of hard links? [...]

But... you just did! :) It's: number of hard links * (file size + sum(size of link names and/or directory slot size)). For sufficiently large files (say, larger than one disk block) you could approximate that as: number of hard links * file size. The key is the number of hard links, which will typically vary, but for e-mails that go to all users, well, you know the number of links then is the number of users.

You could write a script to do this -- just look at the size and hard-link count of every file in the store, apply the above formula, add up the inflated sizes, and you're done.

Nico

PS: Is it really the case that Exchange still doesn't deduplicate e-mails? Really? It's much simpler to implement dedup in a mail store than in a filesystem...
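A minimal sketch of that formula, assuming GNU find's -printf is available (/store1 stands in for a real store path). Every path to the same inode prints an identical "inode links size" line, so sort -u collapses the duplicates before summing:

  # inflated bytes = sum over unique inodes of (link count * file size)
  find /store1 -xdev -type f -printf '%i %n %s\n' | sort -u |
      awk '{ total += $2 * $3 } END { printf "%.0f\n", total }'

Subtract the store's actual du(1) total from that figure and you have the saving from hard links.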
Some time ago I wrote a script to find any "duplicate" files and replace them with hardlinks to one inode. Apparently this is only good for identical files which won't change separately in future, such as distro archives. I can send it to you offlist, but it would be slow in your case because it is not quite the tool for the job (it will start by calculating checksums of all of your files ;) )

What you might want to do and script up yourself is a recursive listing "find /var/opt/SUNWmsqsr/store/partition... -ls". This would print you the inode numbers, file sizes and link counts. Pipe it through something like this:

find ... -ls | awk '{print $1" "$4" "$7}' | sort | uniq

And you'd get 3 columns - inode, count, size.

My AWK math is a bit rusty today, so I present a monster-script like this to multiply and sum up the values:

( find ... -ls | awk '{print $1" "$4" "$7}' | sort | uniq | awk '{ print $2"*"$3"+\\" }'; echo 0 ) | bc

Can be done cleaner, i.e. in a PERL one-liner, and if you have many values that would probably complete faster too. But as a prototype this would do.

HTH,
//Jim

PS: Why are you replacing the cool Sun Mail? Is it about Oracle licensing and the now-required purchase and support cost?

2011-06-13 1:14, Scott Lawson wrote:
> Hi All,
>
> I have an interesting question that may or may not be answerable from
> some internal ZFS semantics.
>
> [...]
>
> My question is: are there any simple ways of determining the space savings
> on each of the stores from the usage of hard links? The reason I ask is
> that our educational institute wishes to migrate these stores to M$
> Exchange 2010 which doesn't do message single instancing. I need to try
> and project what the storage requirement will be on the new target
> environment.
>
> [...]
>
> Regards,
>
> Scott.

--
Jim Klimov, CTO, JSC COS&HT
+7-903-7705859 (cellular)   mailto:jimklimov at cos.ru
CC: admin at cos.ru, jimklimov at mail.ru
() ascii ribbon campaign - against html mail
/\                        - against microsoft attachments
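One possible shape for the Perl one-liner Jim mentions -- an untested sketch, with $STORE standing in for the real partition path. It keys a hash on the inode so each file counts once, then sums link count times size:

  find "$STORE" -type f -ls |
      perl -lane '$sum{$F[0]} ||= $F[3] * $F[6];
                  END { $t += $_ for values %sum; print $t }'

This avoids the sort | uniq pass and the bc continuation lines, since the dedup and the arithmetic happen in one process.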
2011-06-13 2:28, Nico Williams wrote:
> PS: Is it really the case that Exchange still doesn't deduplicate
> e-mails? Really? It's much simpler to implement dedup in a mail
> store than in a filesystem...

That's especially strange, because NTFS has hardlinks and softlinks... Not that Microsoft provided any tools for using that, but there are third-party programs like CygWin "ls" and the FAR File Manager.

Well, enough off-topicing ;)

//Jim
On 13/06/11 10:28 AM, Nico Williams wrote:
> On Sun, Jun 12, 2011 at 4:14 PM, Scott Lawson
> <Scott.Lawson at manukau.ac.nz> wrote:
>> I have an interesting question that may or may not be answerable from some
>> internal ZFS semantics.
>
> This is really standard Unix filesystem semantics.

I understand this; just wanting to see if there is any easy way before I trawl through 10 million little files.. ;)

>> [...]
>>
>> So total storage used is around ~7.5MB due to the hard linking taking place
>> on each store.
>>
>> If hard linking capability had been turned off, this same message would have
>> used 1500 x 2MB = 3GB worth of storage.
>>
>> My question is: are there any simple ways of determining the space savings on
>> each of the stores from the usage of hard links? [...]
>
> But... you just did! :) It's: number of hard links * (file size +
> sum(size of link names and/or directory slot size)). For sufficiently
> large files (say, larger than one disk block) you could approximate
> that as: number of hard links * file size. The key is the number of
> hard links, which will typically vary, but for e-mails that go to all
> users, well, you know the number of links then is the number of users.

Yes, this number varies based on the number of recipients, so it could be as many as the number of users on a store.

> You could write a script to do this -- just look at the size and
> hard-link count of every file in the store, apply the above formula,
> add up the inflated sizes, and you're done.

Looks like I will have to; just looking for a tried and tested method before I have to create my own one if possible. Just was looking for an easy option before I have to sit down and develop and test a script. I have resigned from my current job of 9 years, finish in 15 days, have a heck of a lot of documentation and knowledge transfer I need to do around other UNIX systems, and am running very short on time...

> Nico
>
> PS: Is it really the case that Exchange still doesn't deduplicate
> e-mails? Really? It's much simpler to implement dedup in a mail
> store than in a filesystem...

As a side note, Exchange 2002 + Exchange 2007 do do this. But apparently M$ decided in Exchange 2010 that they no longer wished to, and dropped the capability. Bizarre to say the least, but it may come down to changes they have made in the underlying store technology..
On 13/06/11 11:36 AM, Jim Klimov wrote:
> Some time ago I wrote a script to find any "duplicate" files and replace
> them with hardlinks to one inode. [...]
>
> What you might want to do and script up yourself is a recursive listing
> "find /var/opt/SUNWmsqsr/store/partition... -ls". This would print you
> the inode numbers, file sizes and link counts. Pipe it through
> something like this:
>
> find ... -ls | awk '{print $1" "$4" "$7}' | sort | uniq
>
> And you'd get 3 columns - inode, count, size.
>
> My AWK math is a bit rusty today, so I present a monster-script like
> this to multiply and sum up the values:
>
> ( find ... -ls | awk '{print $1" "$4" "$7}' | sort | uniq | awk '{
> print $2"*"$3"+\\" }'; echo 0 ) | bc

This looks something like what I thought would have to be done; I was just looking to see if there was something tried and tested before I had to invent something. I was really hoping that in zdb there might have been some magic information I could have tapped into.. ;)

> Can be done cleaner, i.e. in a PERL one-liner, and if you have
> many values that would probably complete faster too. But as
> a prototype this would do.
>
> HTH,
> //Jim
>
> PS: Why are you replacing the cool Sun Mail? Is it about Oracle
> licensing and the now-required purchase and support cost?

Yes, it is mostly about cost. We had Sun Mail for our staff and students, with 20,000+ students on it up until Christmas time as well. We have now migrated them to M$ Live at EDU. This leaves us with 1500 staff left, who all like to use LookOut. The Sun connector for LookOut is a bit flaky at best. But the Oracle licensing cost for Messaging and Calendar starts at 10,000 users plus, and so is now rather expensive for what mailboxes we have left. M$ also heavily discounts Exchange CALs to Edu, and Oracle is not very friendly the way Sun was with their JES licensing. So it is bye bye Sun Messaging Server for us.

> 2011-06-13 1:14, Scott Lawson wrote:
>> Hi All,
>>
>> I have an interesting question that may or may not be answerable from
>> some internal ZFS semantics.
>>
>> [...]
On Sun, Jun 12, 2011 at 5:28 PM, Nico Williams <nico at cryptonector.com> wrote:
> [...]
>
> PS: Is it really the case that Exchange still doesn't deduplicate
> e-mails? Really? It's much simpler to implement dedup in a mail
> store than in a filesystem...

MS has had SIS since Exchange 4.0. They dumped it in 2010 because it was a huge source of their small random I/Os. In an effort to allow Exchange to be more "storage friendly" (IE: more of a large sequential I/O profile), they've done away with SIS. The defense for it is that you can buy more "cheap" storage for less money than you'd save with SIS and 15k rpm disks. Whether that's factual I suppose is for the reader to decide.

--Tim
> If anyone has any ideas be it ZFS based or any useful scripts that
> could help here, I am all ears.

Something like this one-liner will show what would be allocated by everything if hardlinks weren't used:

# size=0; for i in `find . -type f -exec du {} \; | awk '{ print $1 }'`; do size=$(( $size + $i )); done; echo $size

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
On Mon, Jun 13, 2011 at 5:50 AM, Roy Sigurd Karlsbakk <roy at karlsbakk.net> wrote:
>> If anyone has any ideas be it ZFS based or any useful scripts that
>> could help here, I am all ears.
>
> Something like this one-liner will show what would be allocated by everything if hardlinks weren't used:
>
> # size=0; for i in `find . -type f -exec du {} \; | awk '{ print $1 }'`; do size=$(( $size + $i )); done; echo $size

Oh, you don't want to do that: you'll run into max argument list size issues. Try this instead:

(echo 0; find . -type f \! -links 1 | xargs stat -c " %b %B *+" $p; echo p) | dc

;) xargs is your friend (and so is dc... RPN FTW!). Note that I'm not printing the number of links because find will print a name for every link (well, if you do the find from the root of the relevant filesystem), so we'd be counting too much space. You'll need the GNU stat(1).

Or you could do something like this using the ksh stat builtin:

( echo 0
  find . -type f \! -links 1 | while read p; do
      stat -c " %b %B *+" "$p"
  done
  echo p ) | dc

Nico
--
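If dc feels too exotic, a rough awk equivalent of the first pipeline above -- again a sketch assuming GNU stat, and with the same count-every-link caveat:

  # sum st_blocks * block size over every matching path
  find . -type f \! -links 1 | xargs stat -c "%b %B" |
      awk '{ total += $1 * $2 } END { printf "%.0f\n", total }'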
On Mon, Jun 13, 2011 at 12:59 PM, Nico Williams <nico at cryptonector.com> wrote:
> Try this instead:
>
> (echo 0; find . -type f \! -links 1 | xargs stat -c " %b %B *+" $p; echo p) | dc

s/\$p//
And, without a sub-shell:

find . -type f \! -links 1 | xargs stat -c " %b %B *+p" /dev/null | dc 2>/dev/null | tail -1

(The stderr redirection is because otherwise dc whines once that the stack is empty, and the tail is because we print interim totals as we go.)

Also, this doesn't quite work, since it counts every link, when we want to count all but one link. This, then, is what will tell you how much space you saved due to hardlinks:

find . -type f \! -links 1 | xargs stat -c " 8k %b %B * %h 1 - * %h /+p" /dev/null 2>/dev/null | dc

Excuse my earlier brainfarts :)

Nico
--
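As a quick sanity check of that last pipeline against the numbers from the original post -- a hypothetical throwaway test under /tmp, GNU stat assumed, on a filesystem without compression: one 2MB message hard-linked into 500 mailboxes should report a saving of size * (links - 1) = 2MB * 499, about 998MB:

  # build a throwaway "store": one 2MB file with 500 links to it
  mkdir -p /tmp/linktest && cd /tmp/linktest
  dd if=/dev/urandom of=msg bs=1024k count=2 2>/dev/null
  i=2; while [ $i -le 500 ]; do ln msg msg.$i; i=$((i+1)); done
  # each of the 500 paths contributes size * (links - 1) / links,
  # so the final total printed is 2MB * 499 = 1046478848 bytes
  find . -type f \! -links 1 |
      xargs stat -c " 8k %b %B * %h 1 - * %h /+p" /dev/null 2>/dev/null |
      dc 2>/dev/null | tail -1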