Suppose I have a storage server that runs ZFS, presumably providing
file (NFS) and/or block (iSCSI, FC) services to other machines that
are running Solaris.  Some of the use will be for LDoms and zones[1],
which would create zpools on top of zfs (fs or zvol).  I have concerns
about variable block sizes and the implications for performance.

1. http://hub.opensolaris.org/bin/view/Community+Group+zones/zoss

Suppose that on the storage server, an NFS-shared dataset is created
without tuning the block size.  This implies that when the client
(ldom or zone v12n server) runs mkfile or similar to create the
backing store for a vdisk or a zpool, the file on the storage server
will be created with 128 KB blocks.  Then when Solaris or OpenSolaris
is installed into the vdisk or zpool, files of a wide variety of sizes
will be created.  At this layer they will be created with variable
block sizes (512 B to 128 KB).

The implications for a 512-byte write in the upper-level zpool (inside
a zone or ldom) seem to be:

- The 512-byte write turns into a 128 KB write at the storage server
  (a 256x multiplication in write size).
- To write that 128 KB block, the rest of the block needs to be read
  to recalculate the checksum.  That is, a read/modify/write cycle is
  forced.  (Less impact if the block is already in the ARC.)
- Deduplication is likely to be less effective, because it is unlikely
  that the same combination of small blocks in different zones/ldoms
  will be packed into the same 128 KB block.

Alternatively, the block size could be forced to something smaller at
the storage server.  Setting it to 512 bytes could eliminate the
read/modify/write cycle, but would presumably be less efficient (less
performant) with moderate to large files.  Setting it somewhere in
between may be desirable as well, but it is not clear where.  The key
competition in this area seems to have a fixed 4 KB block size.

Questions:

Are my basic assumptions correct, that a given file consists of only a
single block size, except for perhaps the final block?

Has any work been done to identify the performance characteristics in
this area?

Is there less to be concerned about from a performance standpoint if
the workload is primarily read?

To maximize the efficacy of dedup, would it be best to pick a fixed
block size and match it between the layers of zfs?

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Good question!  Additional thoughts below...

On Nov 24, 2009, at 6:37 AM, Mike Gerdts wrote:
[...]
> Are my basic assumptions correct, that a given file consists of only a
> single block size, except for perhaps the final block?

Yes, for a file system dataset.  Volumes are fixed block size, with
the default being 8 KB.  So in the iSCSI-over-volume case, OOB it can
be more efficient.  4 KB matches well with NTFS or some of the Linux
file systems.

> Has any work been done to identify the performance characteristics in
> this area?

None to my knowledge.  The performance teams know to set the block
size to match the application, so they don't waste time re-learning
this.

> Is there less to be concerned about from a performance standpoint if
> the workload is primarily read?

Sequential read: yes
Random read: no

> To maximize the efficacy of dedup, would it be best to pick a fixed
> block size and match it between the layers of zfs?

I don't think we know yet.  Until b128 arrives in binary, and folks
get some time to experiment, we just don't have much data... and there
are way too many variables at play to predict.  I can make one
prediction, though: dedup for mkfile or dd if=/dev/zero will scream :-)
 -- richard
On Tue, Nov 24, 2009 at 9:46 AM, Richard Elling
<richard.elling at gmail.com> wrote:
> On Nov 24, 2009, at 6:37 AM, Mike Gerdts wrote:
[...]
>> Are my basic assumptions correct, that a given file consists of only a
>> single block size, except for perhaps the final block?
>
> Yes, for a file system dataset.  Volumes are fixed block size, with
> the default being 8 KB.  So in the iSCSI-over-volume case, OOB it can
> be more efficient.  4 KB matches well with NTFS or some of the Linux
> file systems.

OOB is missing from my TLA translator.  Help, please.

>> Has any work been done to identify the performance characteristics in
>> this area?
>
> None to my knowledge.  The performance teams know to set the block
> size to match the application, so they don't waste time re-learning
> this.

That works great for certain workloads, particularly those with a
fixed record size or large sequential I/O.  If the workload is
"installing then running an operating system" the answer is harder to
define.

>> Is there less to be concerned about from a performance standpoint if
>> the workload is primarily read?
>
> Sequential read: yes
> Random read: no

I was thinking that random wouldn't be too much of a concern either,
assuming that the things that are commonly read are in cache.
I guess this does open the door for a small chunk of useful code in
the middle of a largely useless shared library to force a lot of that
shared library into the ARC, among other things.

>> To maximize the efficacy of dedup, would it be best to pick a fixed
>> block size and match it between the layers of zfs?
>
> I don't think we know yet.  Until b128 arrives in binary, and folks
> get some time to experiment, we just don't have much data... and there
> are way too many variables at play to predict.  I can make one
> prediction, though: dedup for mkfile or dd if=/dev/zero will scream :-)

We already have that optimization with compression.  Dedup just messes
up my method of repeatedly writing the same smallish (<1 MB) chunk of
random or already-compressed data to avoid the block-of-zeros
compression optimization.

Pretty soon filebench is going to need to add statistical methods to
mimic the level of duplicate data it is simulating.  Trying to write
simple benchmarks to test increasingly smart systems looks to be
problematic.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
On Nov 24, 2009, at 11:31 AM, Mike Gerdts wrote:
[...]
> OOB is missing from my TLA translator.  Help, please.

Out of box.

>> None to my knowledge.  The performance teams know to set the block
>> size to match the application, so they don't waste time re-learning
>> this.
>
> That works great for certain workloads, particularly those with a
> fixed record size or large sequential I/O.  If the workload is
> "installing then running an operating system" the answer is harder to
> define.

Running OSes don't create much work, post boot.

>>> Is there less to be concerned about from a performance standpoint if
>>> the workload is primarily read?
>>
>> Sequential read: yes
>> Random read: no
>
> I was thinking that random wouldn't be too much of a concern either,
> assuming that the things that are commonly read are in cache.
> I guess this does open the door for a small chunk of useful code in
> the middle of a largely useless shared library to force a lot of that
> shared library into the ARC, among other things.

This was much more of a problem years ago when memory was small.
Don't see it much on modern machines.

>>> To maximize the efficacy of dedup, would it be best to pick a fixed
>>> block size and match it between the layers of zfs?
>>
>> I don't think we know yet.  Until b128 arrives in binary, and folks
>> get some time to experiment, we just don't have much data... and
>> there are way too many variables at play to predict.  I can make one
>> prediction, though: dedup for mkfile or dd if=/dev/zero will
>> scream :-)
>
> We already have that optimization with compression.  Dedup just messes
> up my method of repeatedly writing the same smallish (<1 MB) chunk of
> random or already-compressed data to avoid the block-of-zeros
> compression optimization.
>
> Pretty soon filebench is going to need to add statistical methods to
> mimic the level of duplicate data it is simulating.  Trying to write
> simple benchmarks to test increasingly smart systems looks to be
> problematic.

:-)
Also, the performance of /dev/*random is not very good.  So prestaging
lots of random data will be particularly challenging.
 -- richard
On Tue, Nov 24, 2009 at 1:39 PM, Richard Elling
<richard.elling at gmail.com> wrote:
> On Nov 24, 2009, at 11:31 AM, Mike Gerdts wrote:
>> OOB is missing from my TLA translator.  Help, please.
>
> Out of box.

Looky there, it was in my TLA translator after all.  Not sure how I
missed it the first time.

[...]
>> That works great for certain workloads, particularly those with a
>> fixed record size or large sequential I/O.  If the workload is
>> "installing then running an operating system" the answer is harder to
>> define.
>
> Running OSes don't create much work, post boot.

Agreed, particularly if backups are pushed to the storage server.  I
suspect that most apps that shuffle bits between protocols but do
little disk I/O can piggy-back on this idea.  That is, a J2EE server
that just talks to the web and database tiers, with some log entries
and occasional app deployments, should be pretty safe too.

>> I was thinking that random wouldn't be too much of a concern either,
>> assuming that the things that are commonly read are in cache.  I
>> guess this does open the door for a small chunk of useful code in
>> the middle of a largely useless shared library to force a lot of
>> that shared library into the ARC, among other things.
>
> This was much more of a problem years ago when memory was small.
> Don't see it much on modern machines.

Probably so, particularly if deduped blocks are deduped in the ARC.
If 1000 virtual machines are each forcing the same 5 MB of extra stuff
into the ARC, that can be inconsequential or take up a somewhat
significant amount of the ARC.

[...]
> Also, the performance of /dev/*random is not very good.  So prestaging
> lots of random data will be particularly challenging.

I was thinking that a bignum library such as libgmp could be handy to
allow easy bit-shifting of large amounts of data.  That is, fill a
128 KB buffer with random data, then do bitwise rotations for each
successive use of the buffer.  Unless my math is wrong, it should
allow 128 KB of random data to write 128 GB of data with very little
deduplication or compression.  A much larger data set could be
generated with the use of a 128 KB linear feedback shift register...

http://en.wikipedia.org/wiki/Linear_feedback_shift_register
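If I were to code the whole-buffer rotation, it might look something
like this untested sketch (the initial random fill is left out, and a
plain C loop stands in for libgmp):

/*
 * Rotate the entire buffer left by one bit, wrapping the top bit
 * around to the end.  A 128 KB buffer has 2^20 bit positions, so this
 * yields 2^20 distinct blocks (128 GB of data) before repeating.
 */
#include <stdint.h>
#include <stddef.h>

static void
rotate_buffer(uint8_t *buf, size_t len)
{
        uint8_t carry = (buf[0] >> 7) & 1;      /* bit that wraps around */
        size_t i;

        for (i = 0; i < len - 1; i++)
                buf[i] = (uint8_t)((buf[i] << 1) | ((buf[i + 1] >> 7) & 1));
        buf[len - 1] = (uint8_t)((buf[len - 1] << 1) | carry);
}

One pass over the buffer per block, with no RNG calls in the steady
state, which is the appeal over regenerating random words each time.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/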
On 2009-Nov-24 14:07:06 -0600, Mike Gerdts <mgerdts at gmail.com> wrote:
> On Tue, Nov 24, 2009 at 1:39 PM, Richard Elling
> <richard.elling at gmail.com> wrote:
>> Also, the performance of /dev/*random is not very good.  So prestaging
>> lots of random data will be particularly challenging.

This depends on the random number generation algorithm used in the
kernel.  I get >50 MB/sec out of FreeBSD on a 3.2 GHz P4 (using
Yarrow).  In any case, you don't need crypto-grade random numbers,
just data that is different and incompressible - there are lots of
relatively simple RNGs that can deliver this with far greater speed.

> I was thinking that a bignum library such as libgmp could be handy to
> allow easy bit-shifting of large amounts of data.  That is, fill a
> 128 KB buffer with random data, then do bitwise rotations for each
> successive use of the buffer.  Unless my math is wrong, it should
> allow 128 KB of random data to write 128 GB of data with very little
> deduplication or compression.  A much larger data set could be
> generated with the use of a 128 KB linear feedback shift register...

This strikes me as much harder to use than just filling the buffer
with 8/32/64-bit random numbers from a linear congruential generator,
lagged Fibonacci generator, Mersenne twister or even random(3).

http://en.wikipedia.org/wiki/List_of_random_number_generators
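For example, a rough, untested sketch using a 64-bit LCG - the
multiplier and increment are Knuth's MMIX constants; the seed and the
helper names are just illustrative:

#include <stdint.h>
#include <stddef.h>

static uint64_t lcg_state = 88172645463325252ULL;  /* arbitrary seed */

/* Knuth's MMIX LCG: one multiply and one add per 64 bits of output. */
static uint64_t
lcg_next(void)
{
        lcg_state = lcg_state * 6364136223846793005ULL +
            1442695040888963407ULL;
        return (lcg_state);
}

/* Fill a buffer with incompressible, non-repeating data. */
static void
fill_buffer(uint64_t *buf, size_t nwords)
{
        size_t i;

        for (i = 0; i < nwords; i++)
                buf[i] = lcg_next();
}

Not crypto-grade, but more than good enough to defeat compression and
dedup, and far faster than /dev/*random.

-- 
Peter Jeremy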
On 25-Nov-09, at 4:31 PM, Peter Jeremy wrote:
> On 2009-Nov-24 14:07:06 -0600, Mike Gerdts <mgerdts at gmail.com> wrote:
>> ... fill a 128 KB buffer with random data, then do bitwise rotations
>> for each successive use of the buffer.  Unless my math is wrong, it
>> should allow 128 KB of random data to write 128 GB of data with very
>> little deduplication or compression.  A much larger data set could be
>> generated with the use of a 128 KB linear feedback shift register...
>
> This strikes me as much harder to use than just filling the buffer
> with 8/32/64-bit random numbers from a linear congruential generator,
> lagged Fibonacci generator, Mersenne twister or even random(3).

I think Mike's reasoning is that a single bit shift (and propagation)
is cheaper than generating a new random word.  After the whole buffer
is shifted, you have a new very-likely-unique block.  (This seems like
overkill if you know the dedup unit size in advance.)

--Toby
On Nov 26, 2009, at 1:20 PM, Toby Thain wrote:
[...]
> I think Mike's reasoning is that a single bit shift (and propagation)
> is cheaper than generating a new random word.  After the whole buffer
> is shifted, you have a new very-likely-unique block.  (This seems like
> overkill if you know the dedup unit size in advance.)

You should be able to get a unique block by shifting one word, as long
as the shift doesn't duplicate the word.  I don't think many of the
existing benchmarks do this, though.
 -- richard
On 26-Nov-09, at 8:57 PM, Richard Elling wrote:
[...]
> You should be able to get a unique block by shifting one word, as long
> as the shift doesn't duplicate the word.  I don't think many of the
> existing benchmarks do this, though.

That is true, but you will run out of permutations sooner.  (Rotating
a single 64-bit word repeats after at most 64 steps, where rotating
the whole 128 KB buffer gives 2^20.)

--Toby
On Thu, Nov 26, 2009 at 8:53 PM, Toby Thain
<toby at telegraphics.com.au> wrote:
> On 26-Nov-09, at 8:57 PM, Richard Elling wrote:
>> You should be able to get a unique block by shifting one word, as long
>> as the shift doesn't duplicate the word.
>
> That is true, but you will run out of permutations sooner.

Rather than shifting a word, you could just increment it.  In a
multi-threaded test, each thread picks the word corresponding to the
thread that is executing.  Assuming 32-bit words (64-bit is overkill),
this allows up to 128 threads with 512-byte blocks.  It also allows up
to 2 TB per thread per 512 bytes in a block.  That is, if 50 threads
are used and the block size is 8 KB, there should be no duplicates in
2 * 50 * 8192 / 512 = 1600 TB.

But... this leads us to the point where the workload generators are
too good at generating unique data.
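In C, the per-write stamping might look something like this untested
sketch (uniquify_block and its arguments are names I just made up):

#include <stdint.h>
#include <stddef.h>

#define SECTOR  512     /* one 32-bit slot per thread per 512 bytes */

/*
 * Stamp this thread's counter word into every 512-byte sector of the
 * block so no two blocks ever dedup against each other.  512 / 4 = 128
 * thread slots per sector, and 2^32 serials * 512 bytes = 2 TB of
 * unique data per thread per 512 bytes in the block.  Assumes block is
 * 4-byte aligned and thread_id < 128.
 */
static void
uniquify_block(uint32_t *block, size_t blocklen, unsigned int thread_id,
    uint32_t serial)
{
        size_t off;     /* byte offset of the current sector */

        for (off = 0; off + SECTOR <= blocklen; off += SECTOR)
                block[off / sizeof (uint32_t) + thread_id] = serial;
}

The caller bumps serial once per write, so two blocks can only collide
if thread and serial both repeat, which the math above rules out
within 1600 TB for the 50-thread, 8 KB case.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/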