Hi all

As I've said here on the list a few times earlier, most recently in the thread 'ZFS not usable (was ZFS Dedup question)', I've been doing some rather thorough testing of ZFS dedup, and as you can see from those posts, the results weren't very satisfactory. The docs claim 1-2GB of memory usage per terabyte stored, ARC or L2ARC, but from what I've seen, I don't find this very likely.

So, is there anyone in here using dedup for large storage (2TB? 10TB? more?) who can document sustained high performance?

The reason I ask is that if this is the case, something is badly wrong with my test setup.

The test box is a Supermicro thing with a Core2duo CPU, 8 gigs of RAM, 4 gigs of mirrored SLOG and some 150 gigs of L2ARC on 80GB x25-M drives. The data drives are 7 2TB drives in RAIDz2. We're getting down to 10-20MB/s on Bacula backups to this system, and that's streaming, which should be a good fit for RAIDz2. Since the writes are local (bacula-sd running on the box), async writes will be the main thing. Initial results showed pretty good I/O performance, but after about 2TB used, the I/O speed is down to the numbers I mentioned.

PS: I know those drives aren't optimal for this, but the box is a year old or so. Still, they should help out a bit.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
I'm not sure about *docs*, but here are my rough estimates: assume 1TB of actual used storage; assume a 64K block/slab size (not sure how realistic that is -- it depends totally on your data set); assume 300 bytes per DDT entry.

So we have (1024^4 / 65536) * 300 = 5033164800, or about 5GB of RAM for one TB of used disk space.

Dedup is *hungry* for RAM. 8GB is most likely not enough for your configuration! First guess: double the RAM and then you might have better luck.

The other takeaway here: dedup is the wrong technology for the typical small home server (e.g. systems that max out at 4 or even 8 GB). Look into compression and snapshot clones as better alternatives to reduce your disk space needs without incurring the huge RAM penalties associated with dedup.

Dedup is *great* for a certain type of data set on configurations that are extremely RAM-heavy. For everyone else, it's almost universally the wrong solution. Ultimately, disk is usually cheaper than RAM -- think hard before you enable dedup: are you making the right trade-off?

- Garrett

On Sun, 2011-01-30 at 22:53 +0100, Roy Sigurd Karlsbakk wrote:
> The docs claim 1-2GB memory usage per terabyte stored, ARC or L2ARC, but as
> you can read from the post, I don't find this very likely.
> [...]
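For anyone who wants to plug in their own numbers, here is a minimal shell sketch of the same back-of-the-envelope arithmetic; the 64K average block size and the 300 bytes per DDT entry are assumptions carried over from the estimate above, not measured values:

    # Back-of-the-envelope DDT memory estimate (POSIX shell / ksh arithmetic):
    #   (referenced bytes / average block size) * bytes per DDT entry
    # All three inputs below are assumptions -- adjust them for your own data set.
    USED_TB=1            # terabytes of referenced (pre-dedup) data
    BLOCKSIZE=65536      # assumed average block/slab size, in bytes
    ENTRY=300            # assumed in-core bytes per DDT entry

    USED_BYTES=$((USED_TB * 1024 * 1024 * 1024 * 1024))
    DDT_BYTES=$((USED_BYTES / BLOCKSIZE * ENTRY))
    echo "Estimated DDT footprint: ${DDT_BYTES} bytes (~$((DDT_BYTES / 1024 / 1024)) MB)"

With the defaults above this prints roughly 4800 MB per terabyte of referenced data, in line with the "about 5GB" figure.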
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Roy Sigurd Karlsbakk
>
> The test box is a supermicro thing with a Core2duo CPU, 8 gigs of RAM, 4 gigs
> of mirrored SLOG and some 150 gigs of L2ARC on 80GB x25-M drives. The data
> drives are 7 2TB drives in RAIDz2. We're getting down to 10-20MB/s on Bacula
> backup to this system, meaning streaming, which should be good for RAIDz2.
> Since the writes are local (bacula-sd running), async writes will be the main
> thing. Initial results show pretty good I/O performance, but after about 2TB
> used, the I/O speed is down to the numbers I mentioned

You probably know this already, but while you're doing async writes, neither the slog nor the L2ARC offers any benefit to you.

Also, your problem might be completely unrelated to your pool. You might try writing to /dev/null instead and just see what the performance is. If it's still slow, you know you're waiting on something else that isn't zpool related. Also, you might try making an old backup file available on something like an external disk, and simply copying it to the zpool in question. If it goes fast, once again you're eliminating the possibility that the problem is zpool related.

I don't know what Bacula uses in the background, but I know I've had terrible performance using dd to write to tape while dd would perform just fine writing to anything else ... and anything else would work fine writing to tape. My point is only that you should question precisely *what* is causing the performance bottleneck.
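A concrete sketch of that isolation test; the file and mountpoint names below are placeholders, not anything from Roy's actual setup:

    # 1) Can the source side (Bacula spool, network, client) deliver data quickly
    #    at all? Writing to /dev/null takes the zpool out of the picture entirely.
    dd if=/backup/spool/some-old-volume of=/dev/null bs=1024k

    # 2) Can the pool absorb a plain streaming write without Bacula involved?
    #    /tank is a placeholder for the dedup-enabled pool's mountpoint.
    dd if=/backup/spool/some-old-volume of=/tank/ddtest.out bs=1024k

    # Fast in (1) but slow in (2) points at the pool; slow in (1) points elsewhere.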
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Roy Sigurd Karlsbakk
>
> We're getting down to 10-20MB/s on

Oh, one more thing. How are you measuring the speed? Because if you have data which is highly compressible, or highly duplicated, you could be virtually writing tons of data really, really fast while the disks are barely active at all. For example, if you run

    dd if=/dev/zero bs=1024k count=1024 | pv | gzip | pv > zerofile.gz

then the data rate going through the first pv is about 1000 times higher than the data rate going through the second pv. If I were only looking at the data rate after compression, I might falsely conclude that I was getting bad performance.
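One way to check the physical side of that is to watch what actually reaches the vdevs while a backup job runs and compare it against whatever rate Bacula reports. A sketch, with 'tank' as a placeholder pool name:

    # Per-vdev bandwidth and IOPS, refreshed every 5 seconds, while the job runs.
    # If Bacula reports 10-20MB/s but the vdevs are far busier (or far idler),
    # the bottleneck is probably not raw disk throughput.
    zpool iostat -v tank 5

    # Device-level view from the OS side (Solaris iostat):
    iostat -xn 5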
> I'm not sure about *docs*, but my rough estimations:
>
> Assume 1TB of actual used storage. Assume 64K block/slab size. (Not
> sure how realistic that is -- it depends totally on your data set.)
> Assume 300 bytes per DDT entry.
>
> So we have (1024^4 / 65536) * 300 = 5033164800 or about 5GB RAM for
> one TB of used disk space.
>
> Dedup is *hungry* for RAM. 8GB is not enough for your configuration,
> most likely! First guess: double the RAM and then you might have
> better luck.

I know... that's why I use L2ARC.

> The other takeaway here: dedup is the wrong technology for typical
> small home server (e.g. systems that max out at 4 or even 8 GB).

This isn't a home server test.

> Look into compression and snapshot clones as better alternatives to
> reduce your disk space needs without incurring the huge RAM penalties
> associated with dedup.
>
> Dedup is *great* for a certain type of data set with configurations
> that are extremely RAM heavy. For everyone else, it's almost universally
> the wrong solution. Ultimately, disk is usually cheaper than RAM --
> think hard before you enable dedup -- are you making the right trade off?

Just what sort of configurations are you thinking of? I've been testing dedup on rather large ones, and the sum of it is that ZFS dedup doesn't scale well as of now.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
Roy Sigurd Karlsbakk
2011-Feb-01 00:48 UTC
[zfs-discuss] ZFS dedup success stories (take two)
> As I've said here on the list a few times earlier, the last on the
> thread 'ZFS not usable (was ZFS Dedup question)', I've been doing some
> rather thorough testing on zfs dedup, and as you can see from the
> posts, it wasn't very satisfactory. The docs claim 1-2GB memory usage
> per terabyte stored, ARC or L2ARC, but as you can read from the post,
> I don't find this very likely.

Sorry about the initial post - it was wrong. The hardware configuration was right, but for the initial tests I used NFS, meaning sync writes. This obviously stresses the ARC/L2ARC more than async writes would, but the result remains the same.

With 140GB of L2ARC on two X25-Ms, and the SLOG on 4GB partitions on the same devices, mirrored, the write speed was reduced to something like 20% of the original speed. This was with about 2TB used on the zpool and a single data stream, no parallelism whatsoever. Even with 8GB of ARC and 140GB of L2ARC on two SSDs, this speed is fairly low. I could not see substantially high CPU or I/O load during this test.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
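One sketch of a way to separate the sync-write (NFS) penalty from the dedup penalty; the paths below are placeholders, and since /dev/urandom is itself slow at generating a couple of GB, a pre-existing large, non-dedupable file works just as well:

    # Make some incompressible, non-dedupable test data (placeholder path):
    dd if=/dev/urandom of=/var/tmp/testdata bs=1024k count=2048

    # Local (async) write onto the dedup-enabled filesystem:
    dd if=/var/tmp/testdata of=/tank/backup/local.out bs=1024k

    # The same write over the NFS mount from a client (placeholder path).
    # A large gap between the two runs suggests sync-write/ZIL overhead rather
    # than DDT overhead, since the data and dedup settings are identical.
    dd if=/var/tmp/testdata of=/mnt/tank-over-nfs/nfs.out bs=1024k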
On 01/31/11 04:48 PM, Roy Sigurd Karlsbakk wrote:
> Sorry about the initial post - it was wrong. The hardware configuration was
> right, but for the initial tests I used NFS, meaning sync writes. This
> obviously stresses the ARC/L2ARC more than async writes, but the result
> remains the same.
>
> With 140GB of L2ARC on two X25-Ms and mirrored 4GB SLOG partitions on the
> same devices, the write speed was reduced to something like 20% of the
> original speed. This was with about 2TB used on the zpool and a single data
> stream, no parallelism whatsoever. Still, with 8GB ARC and 140GB of L2ARC on
> two SSDs, this speed is fairly low. I could not see substantially high CPU
> or I/O load during this test.

I would not expect good write performance with dedup... dedup isn't going to make writes fast - it's something you want on a system with a lot of duplicated data that sustains a lot of reads. (That said, highly duplicated data with a DDT that fits entirely in RAM might see a benefit from not having to write metadata frequently. But I suspect an SLOG here is going to be critical to get good performance, since you'll still have a lot of synchronous metadata writes.)

- Garrett
> > Dedup is *hungry* for RAM. 8GB is not enough for your configuration,
> > most likely! First guess: double the RAM and then you might have
> > better luck.
>
> I know... that's why I use L2ARC

What is zdb -D showing? Does this give you any clue:
http://blogs.sun.com/roch/entry/dedup_performance_considerations1

br,
syljua
--
This message posted from opensolaris.org
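For reference, a sketch of the zdb side of that question ('tank' is a placeholder pool name); zdb reads live pool metadata, so treat its numbers as approximate on a busy system:

    # Summary of the dedup table: entry counts, on-disk and in-core sizes,
    # and the overall dedup ratio for the pool.
    zdb -D tank

    # The same, plus a histogram of how many blocks are referenced 1x, 2x, 4x, ...
    zdb -DD tank

    # On a pool where dedup is NOT enabled, simulate it to see what the DDT
    # would look like before committing to turning dedup on.
    zdb -S tank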
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Roy Sigurd Karlsbakk
>
> > Dedup is *hungry* for RAM. 8GB is not enough for your configuration,
> > most likely! First guess: double the RAM and then you might have
> > better luck.
>
> I know... that's why I use L2ARC

L2ARC is not a substitute for RAM. In some cases it can improve disk performance in the absence of RAM, but it cannot be used for in-memory applications and the kernel. At best, what you're describing would be swap space on an SSD, and swap space is a substitute for RAM. Be aware that SSD performance is 1/100th the performance of RAM (or worse).

Garrett is right. Add more RAM if it is physically possible. And if it is not physically possible, think long and hard about upgrading your server so you can add more RAM.
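A sketch of how to see how much ARC the box actually has to work with, and how much of it is metadata (which is where the DDT would have to live). The kstat names below are the usual arcstats counters on Solaris-derived systems, so double-check them on your particular build:

    # Total ARC size and the metadata portion of it, in bytes:
    kstat -p zfs:0:arcstats:size
    kstat -p zfs:0:arcstats:arc_meta_used
    kstat -p zfs:0:arcstats:arc_meta_limit

    # A more readable summary, if mdb and the right privileges are available:
    echo ::arc | mdb -k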
Edward Ned Harvey
2011-Feb-01 12:41 UTC
[zfs-discuss] ZFS dedup success stories (take two)
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Roy Sigurd Karlsbakk
>
> Sorry about the initial post - it was wrong. The hardware configuration was
> right, but for initial tests, I used NFS, meaning sync writes. This obviously
> stresses the ARC/L2ARC more than async writes, but the result remains the
> same.

I'm sorry, that's not correct. The L2ARC is a read cache. The ZIL is what handles sync writes, and the ZIL always exists: if there is no dedicated ZIL log device, then blocks in the main storage pool are used for the ZIL.
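A quick way to confirm what the pool actually has in the way of dedicated log and cache devices ('tank' is a placeholder):

    # The 'logs' section lists the dedicated ZIL device (slog); if it is absent,
    # sync writes land on the main pool vdevs. The 'cache' section is the L2ARC.
    zpool status tank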
Two caveats inline ...

On 1 Feb 2011, at 01:05, Garrett D'Amore wrote:

> I would not expect good performance on dedup with write... dedup isn't
> going to make writes fast - it's something you want on a system with a
> lot of duplicated data that sustains a lot of reads. (That said, highly
> duplicated data with a DDT that fits entirely in RAM might see a benefit
> from not having to write metadata frequently. But I suspect an SLOG here
> is going to be critical to get good performance since you'll still have
> a lot of synchronous metadata writes.)

There is one circumstance where the write path could see an improvement: on a system whose data is highly dedupable *and* which is under heavy write load, it may be useful to forgo the large data write and instead convert it into smaller (and more frequent) metadata writes. SLOGs would then show more benefit, and we'd relieve pressure on the back end for throughput.

On a system with a high read ratio, deduped data currently would be quite efficient, but there is one pathology in current ZFS which impacts this somewhat: last time I looked, each ARC reference to a deduped block leads to an inflated ARC copy of the data, so a highly referenced block (20x, for instance) could exist 20 times in an inflated state in the ARC after reads of each occurrence. Dedup of inflated data in the ARC was a pending ZFS optimisation ...

Craig

--
Craig Morgan
Cinnabar Solutions Ltd
t: +44 (0)791 338 3190
f: +44 (0)870 705 1726
e: craig at cinnabar-solutions.com
w: www.cinnabar-solutions.com
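Related to the ARC pressure Craig describes, one commonly suggested tuning -- offered here only as a sketch, not something verified on Roy's box -- is to stop data blocks from competing for the L2ARC so the SSDs are free to hold metadata such as the DDT ('tank' is a placeholder):

    # Keep only metadata (including the dedup table) in the L2ARC; data blocks
    # are still cached in the in-RAM ARC as usual.
    zfs set secondarycache=metadata tank

    # Verify the current cache policy:
    zfs get primarycache,secondarycache tank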