A) Resilver = Defrag. True/false?
B) If I buy larger drives and resilver, does defrag happen?
C) Does zfs send zfs receive mean it will defrag?
-- This message posted from opensolaris.org
On Thu, Sep 9, 2010 at 1:04 PM, Orvar Korvar <knatte_fnatte_tjatte at yahoo.com> wrote:
> A) Resilver = Defrag. True/false?

False. Resilver just rebuilds a drive in a vdev based on the redundant data stored on the other drives in the vdev. Similar to how replacing a dead drive works in a hardware RAID array.

> B) If I buy larger drives and resilver, does defrag happen?

No.

> C) Does zfs send zfs receive mean it will defrag?

No. ZFS doesn't currently have a defragmenter. That will come when the legendary block pointer rewrite feature is committed.

-- 
Freddie Cash
fjwcash at gmail.com
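For readers unfamiliar with the mechanics, a minimal sketch of what a resilver looks like from the command line (pool and device names here are hypothetical):

  # replace a smaller or failed disk with a new one; ZFS resilvers the new device
  zpool replace tank c1t2d0 c1t5d0
  # watch resilver progress
  zpool status -v tank

The resilver only reconstructs the data that the existing block pointers already describe; nothing is laid out anew.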
I am speaking from my own observations and nothing scientific such as reading the code or designing the process.

> A) Resilver = Defrag. True/false?

False.

> B) If I buy larger drives and resilver, does defrag happen?

No. The first X sectors of the bigger drive are identical to the smaller drive, fragments and all.

> C) Does zfs send zfs receive mean it will defrag?

Yes. The data is laid out on the receiving side in a sane manner, until it later becomes fragmented.
-- This message posted from opensolaris.org
On Thu, Sep 9, 2010 at 1:26 PM, Freddie Cash <fjwcash at gmail.com> wrote:
> On Thu, Sep 9, 2010 at 1:04 PM, Orvar Korvar <knatte_fnatte_tjatte at yahoo.com> wrote:
>> A) Resilver = Defrag. True/false?
>
> False. Resilver just rebuilds a drive in a vdev based on the redundant data stored on the other drives in the vdev. Similar to how replacing a dead drive works in a hardware RAID array.
>
>> B) If I buy larger drives and resilver, does defrag happen?
>
> No.

Actually, thinking about it ... since the resilver is writing new data to an empty drive, in essence, the drive is defragmented.

>> C) Does zfs send zfs receive mean it will defrag?
>
> No.

Same here, but only if the receiving pool has never had any snapshots deleted or files deleted, so that there are no holes in the pool. Then the newly written data will be contiguous (not fragmented).

> ZFS doesn't currently have a defragmenter. That will come when the legendary block pointer rewrite feature is committed.

-- 
Freddie Cash
fjwcash at gmail.com
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Orvar Korvar
>
> A) Resilver = Defrag. True/false?

I think everyone will agree "false" on this question. However, more detail may be appropriate. See below.

> B) If I buy larger drives and resilver, does defrag happen?

Scores so far:
2 No
1 Yes

> C) Does zfs send zfs receive mean it will defrag?

Scores so far:
1 No
2 Yes

> ...

Does anybody here know what they're talking about? I'd feel good if perhaps Erik ... or Neil ... perhaps ... answered the question with actual knowledge.

Thanks...
On 09/09/10 20:08, Edward Ned Harvey wrote:
> Scores so far:
> 2 No
> 1 Yes

No. Resilver does not re-layout your data or change what's in the block pointers on disk. If it was fragmented before, it will be fragmented after.

>> C) Does zfs send zfs receive mean it will defrag?
>
> Scores so far:
> 1 No
> 2 Yes

"Maybe". If there is sufficient contiguous freespace in the destination pool, files may be less fragmented.

But if you do incremental sends of multiple snapshots, you may well replicate some or all of the fragmentation on the origin (because snapshots only copy the blocks that change, and receiving an incremental send does the same).

And if the destination pool is short on space you may end up more fragmented than the source.

- Bill
On 10/09/2010 04:24, Bill Sommerfeld wrote:
>>> C) Does zfs send zfs receive mean it will defrag?
>>
>> Scores so far:
>> 1 No
>> 2 Yes
>
> "Maybe". If there is sufficient contiguous freespace in the destination pool, files may be less fragmented.
>
> But if you do incremental sends of multiple snapshots, you may well replicate some or all of the fragmentation on the origin (because snapshots only copy the blocks that change, and receiving an incremental send does the same).
>
> And if the destination pool is short on space you may end up more fragmented than the source.

There is yet more "it depends". It depends on what you mean by fragmentation.

ZFS has "gang blocks", which are used when we need to store a block of size N but can't find a free region of that size, yet can make up that amount of storage from M smaller blocks that are available. Because zfs send|recv work at the DMU layer, they know nothing about gang blocks, which are a ZIO layer concept. As such, if your filesystem is heavily "fragmented" on the source because it uses gang blocks, that doesn't necessarily mean it will be using gang blocks at all, or of the same size, on the destination.

I very strongly recommend the original poster take a step back and ask: "Why are you even worried about fragmentation?" "Do you know you have a pool that is fragmented?" "Is it actually causing you a performance problem?"

-- 
Darren J Moffat
I am not really worried about fragmentation. I was just wondering if I attach new drives and zfs send receive to a new zpool, would that count as defrag. But apparently not. Anyway, thank you for your input!
-- This message posted from opensolaris.org
It really depends on your definition of "fragmentation." This term is used differently for various file systems. The UFS notion of fragmentation is closer to the ZFS notion of gangs.
 -- richard

On Sep 11, 2010, at 6:16 AM, Orvar Korvar <knatte_fnatte_tjatte at yahoo.com> wrote:
> I am not really worried about fragmentation. I was just wondering if I attach new drives and zfs send receive to a new zpool, would that count as defrag. But apparently not.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Orvar Korvar
>
> I am not really worried about fragmentation. I was just wondering if I
> attach new drives and zfs send receive to a new zpool, would that count
> as defrag. But apparently not.

"Apparently not in all situations" would be more appropriate.

The understanding I had was: If you do a single zfs send | receive, then it does effectively get defragmented, because the receiving filesystem is going to re-lay out the received filesystem, and there is nothing pre-existing to make the receiving filesystem dance around... But if you're sending some initial, plus incrementals, then you're actually repeating the same operations that probably caused the original filesystem to become fragmented in the first place. And in fact, it seems unavoidable...

Suppose you have a large file, which is all sequential on disk. You make a snapshot of it, which means all the individual blocks must not be overwritten. And then you overwrite a few bytes scattered randomly in the middle of the file. The nature of copy on write is such that, of course, it is impossible for the latest version of the filesystem to remain contiguous. Your only choices are: to read & write copies of the whole file, including multiple copies of what didn't change, or to leave the existing data in place where it is on disk and instead write your new random bytes to other non-contiguous locations on disk. Hence fragmentation.
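To make the copy-on-write scenario above concrete, here is a small illustrative sequence (dataset and file names are made up). Once the snapshot exists, the overwritten blocks have to land in new locations on disk while the snapshot pins the old ones in place:

  # create a large file, then snapshot the filesystem
  dd if=/dev/urandom of=/tank/fs/bigfile bs=1M count=1024
  zfs snapshot tank/fs@before
  # overwrite a few KB in the middle of the file without truncating it;
  # COW allocates the new blocks elsewhere, so the live file is no longer contiguous
  dd if=/dev/urandom of=/tank/fs/bigfile bs=4k count=1 seek=100000 conv=notrunc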
On Sep 12, 2010, at 8:27 PM, Edward Ned Harvey wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Orvar Korvar
>>
>> I am not really worried about fragmentation. I was just wondering if I attach new drives and zfs send receive to a new zpool, would that count as defrag. But apparently not.
>
> "Apparently not in all situations" would be more appropriate.
>
> The understanding I had was: If you do a single zfs send | receive, then it does effectively get defragmented, because the receiving filesystem is going to re-lay out the received filesystem, and there is nothing pre-existing to make the receiving filesystem dance around... But if you're sending some initial, plus incrementals, then you're actually repeating the same operations that probably caused the original filesystem to become fragmented in the first place. And in fact, it seems unavoidable...
>
> Suppose you have a large file, which is all sequential on disk. You make a snapshot of it, which means all the individual blocks must not be overwritten. And then you overwrite a few bytes scattered randomly in the middle of the file. The nature of copy on write is such that, of course, it is impossible for the latest version of the filesystem to remain contiguous. Your only choices are: to read & write copies of the whole file, including multiple copies of what didn't change, or to leave the existing data in place where it is on disk and instead write your new random bytes to other non-contiguous locations on disk. Hence fragmentation.

This operational definition of "fragmentation" comes from the single-user, single-tasking world (PeeCees). In that world, only one thread writes files from one application at one time. In those cases, there is a reasonable expectation that a single file's blocks might be contiguous on a single disk. That isn't the world we live in, where we have RAID, multi-user, or multi-threaded environments.
 -- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
Richard Elling
richard at nexenta.com +1-760-896-4422
Enterprise class storage for everyone
www.nexenta.com
> From: Richard Elling [mailto:Richard at Nexenta.com]
>
> This operational definition of "fragmentation" comes from the single-user, single-tasking world (PeeCees). In that world, only one thread writes files from one application at one time. In those cases, there is a reasonable expectation that a single file's blocks might be contiguous on a single disk. That isn't the world we live in, where we have RAID, multi-user, or multi-threaded environments.

I don't know what you're saying, but I'm quite sure I disagree with it.

Regardless of multithreading, multiprocessing, it's absolutely possible to have contiguous files, and/or file fragmentation. That's not a characteristic which depends on the threading model.

Also regardless of raid, it's possible to have contiguous or fragmented files. The same concept applies to multiple disks.
I was thinking to delete all zfs snapshots before zfs send receive to another new zpool. Then everything would be defragmented, I thought. (I assume snapshots works this way: I snapshot once and do some changes, say delete file "A" and edit file "B". When I delete the snapshot, the file "A" is still deleted and file "B" is still edited. In other words, deletion of snapshot does not revert back the changes. Therefore I just delete all snapshots and make my filesystem up to date before zfs send receive) -- This message posted from opensolaris.org
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Orvar Korvar
>
> I was thinking to delete all zfs snapshots before zfs send receive to
> another new zpool. Then everything would be defragmented, I thought.

You don't need to delete snaps before zfs send, if your goal is to "defragment" your filesystem. Just perform a single zfs send, and don't do any incrementals afterward. The receiving filesystem will lay out the filesystem as it wishes.

> (I assume snapshots works this way: I snapshot once and do some
> changes, say delete file "A" and edit file "B". When I delete the
> snapshot, the file "A" is still deleted and file "B" is still edited.
> In other words, deletion of snapshot does not revert back the changes.

You are correct. A snapshot is a read-only image of the filesystem, as it was, at some time in the past. If you destroy the snapshot, you've only destroyed the snapshot. You haven't destroyed the most recent "live" version of the filesystem.

If you wanted to, you could roll back, which destroys the "live" version of the filesystem and restores you back to some snapshot. But that is a very different operation. Rollback is not at all similar to destroying a snapshot. These two operations are basically opposites of each other.

All of this is discussed in the man pages. I suggest "man zpool" and "man zfs". Everything you need to know is written there.
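A minimal sketch of the operations being discussed, with hypothetical pool, dataset, and snapshot names:

  # one-shot replication: the receiving pool lays the data out fresh
  zfs snapshot tank/data@migrate
  zfs send tank/data@migrate | zfs receive newpool/data

  # destroying a snapshot removes only the snapshot; the live filesystem is untouched
  zfs destroy tank/data@migrate

  # rollback is the opposite operation: it discards live changes and reverts to the snapshot
  zfs rollback tank/data@earlier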
On Sep 13, 2010, at 5:14 AM, Edward Ned Harvey wrote:
>> From: Richard Elling [mailto:Richard at Nexenta.com]
>>
>> This operational definition of "fragmentation" comes from the single-user, single-tasking world (PeeCees). In that world, only one thread writes files from one application at one time. In those cases, there is a reasonable expectation that a single file's blocks might be contiguous on a single disk. That isn't the world we live in, where we have RAID, multi-user, or multi-threaded environments.
>
> I don't know what you're saying, but I'm quite sure I disagree with it.
>
> Regardless of multithreading, multiprocessing, it's absolutely possible to have contiguous files, and/or file fragmentation. That's not a characteristic which depends on the threading model.

Possible, yes. Probable, no. Consider that a file system is allocating space for multiple, concurrent file writers.

> Also regardless of raid, it's possible to have contiguous or fragmented files. The same concept applies to multiple disks.

RAID works against the efforts to gain performance by contiguous access because the access becomes non-contiguous.
 -- richard
> From: Richard Elling [mailto:Richard at Nexenta.com]
>
>> Regardless of multithreading, multiprocessing, it's absolutely possible to
>> have contiguous files, and/or file fragmentation. That's not a
>> characteristic which depends on the threading model.
>
> Possible, yes. Probable, no. Consider that a file system is allocating
> space for multiple, concurrent file writers.

Process A is writing. Suppose it starts writing at block 10,000 out of my 1,000,000 block device. Process B is also writing. Suppose it starts writing at block 50,000. These two processes write simultaneously, and no fragmentation occurs, unless Process A writes more than 40,000 blocks. In that case, A's file gets fragmented, and the 2nd fragment might begin at block 300,000.

The concept which causes fragmentation (not counting COW) is the size of the span of unallocated blocks. Most filesystems will allocate blocks from the largest unallocated contiguous area of the physical device, so as to minimize fragmentation. I can't say how ZFS behaves authoritatively, but I'd be extremely surprised if two processes writing different files as fast as possible result in all their blocks interleaved with each other on physical disk. I think this is possible if you have multiple processes lazily writing at less-than full speed, because then ZFS might remap a bunch of small writes into a single contiguous write.

>> Also regardless of raid, it's possible to have contiguous or fragmented
>> files. The same concept applies to multiple disks.
>
> RAID works against the efforts to gain performance by contiguous access
> because the access becomes non-contiguous.

These might as well have been words randomly selected from the dictionary to me - I recognize that it's a complete sentence, but you might have said "processors aren't needed in computers anymore," or something equally illogical.

Suppose you have a 3-disk raid stripe set, using traditional simple striping, because it's very easy to explain. Suppose a process is writing as fast as it can, and suppose it's going to write block 0 through block 99 of a virtual device.

virtual block 0 = block 0 of disk 0
virtual block 1 = block 0 of disk 1
virtual block 2 = block 0 of disk 2
virtual block 3 = block 1 of disk 0
virtual block 4 = block 1 of disk 1
virtual block 5 = block 1 of disk 2
virtual block 6 = block 2 of disk 0
virtual block 7 = block 2 of disk 1
virtual block 8 = block 2 of disk 2
virtual block 9 = block 3 of disk 0
...
virtual block 96 = block 32 of disk 0
virtual block 97 = block 32 of disk 1
virtual block 98 = block 32 of disk 2
virtual block 99 = block 33 of disk 0

Thanks to buffering and command queueing, the OS tells the RAID controller to write blocks 0-8, and the RAID controller tells disk 0 to write blocks 0-2, tells disk 1 to write blocks 0-2, and tells disk 2 to write blocks 0-2, simultaneously. So the total throughput is the sum of all 3 disks writing continuously and contiguously to sequential blocks. This accelerates performance for continuous sequential writes. It does not "work against efforts to gain performance by contiguous access." The same concept is true for raid-5 or raidz, but it's more complicated. The filesystem or RAID controller does in fact know how to write sequential filesystem blocks to sequential physical blocks on the physical devices for the sake of performance enhancement on contiguous read/write.

If you don't believe me, there's a very easy test to prove it: Create a zpool with 1 disk in it, and time writing 100G (or some amount of data larger than RAM). Create a zpool with several disks in a raidz set, and time writing 100G. The speed scales up linearly with the number of disks, until you reach some other hardware bottleneck, such as bus speed or something like that.
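A rough sketch of that test (device names are hypothetical; actual numbers depend on the controller, bus, recordsize, and caching, and zeros should be replaced with incompressible data if compression is enabled on the pool):

  # single-disk pool
  zpool create tank1 c1t1d0
  time dd if=/dev/zero of=/tank1/bigfile bs=1M count=102400   # ~100 GB
  zpool destroy tank1

  # 5-disk raidz pool; sequential write bandwidth should scale with the number of data disks
  zpool create tank2 raidz c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0
  time dd if=/dev/zero of=/tank2/bigfile bs=1M count=102400
  zpool destroy tank2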
On Mon, September 13, 2010 07:14, Edward Ned Harvey wrote:
>> From: Richard Elling [mailto:Richard at Nexenta.com]
>>
>> This operational definition of "fragmentation" comes from the single-user, single-tasking world (PeeCees). In that world, only one thread writes files from one application at one time. In those cases, there is a reasonable expectation that a single file's blocks might be contiguous on a single disk. That isn't the world we live in, where we have RAID, multi-user, or multi-threaded environments.
>
> I don't know what you're saying, but I'm quite sure I disagree with it.
>
> Regardless of multithreading, multiprocessing, it's absolutely possible to have contiguous files, and/or file fragmentation. That's not a characteristic which depends on the threading model.
>
> Also regardless of raid, it's possible to have contiguous or fragmented files. The same concept applies to multiple disks.

The attitude that it *matters* seems to me to have developed in, and be relevant only to, single-user computers.

Regardless of whether a file is contiguous or not, by the time you read the next chunk of it, in the multi-user world some other user is going to have moved the access arm of that drive. Hence, it doesn't matter if the file is contiguous or not.

-- 
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
To summarize,

A) resilver does not defrag.

B) zfs send receive to a new zpool means it will be defragged.

Correctly understood?
-- This message posted from opensolaris.org
On Sep 13, 2010, at 10:54 AM, Orvar Korvar wrote:
> To summarize,
>
> A) resilver does not defrag.
>
> B) zfs send receive to a new zpool means it will be defragged

Define "fragmentation"? If you follow the wikipedia definition of "defragmentation" then the answer is no, zfs send/receive does not change the location of files. Why? Because zfs sends objects, not files. The objects can be allocated in a (more) contiguous form on the receiving side, or maybe not, depending on the configuration and use of the receiving side.

A file may be wholly contained in an object, or not, depending on how it was created. For example, if a file is less than 128KB (by default) and is created at one time, then it will be wholly contained in one object. By contrast, UFS has an 8KB max block size and will use up to 16 different blocks to store the same file. These blocks may or may not be contiguous in UFS.
http://en.wikipedia.org/wiki/Defragmentation

> Correctly understood?

Clear as mud. I suggest deprecating the use of the term "defragmentation."
 -- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
Richard Elling
richard at nexenta.com +1-760-896-4422
Enterprise class storage for everyone
www.nexenta.com
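For reference, the block-size behaviour described above is governed by the dataset's recordsize property, which can be inspected or changed (the dataset name here is hypothetical):

  zfs get recordsize tank/data          # defaults to 128K
  zfs set recordsize=16K tank/data      # affects only newly written blocks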
Richard Elling wrote:
> On Sep 13, 2010, at 5:14 AM, Edward Ned Harvey wrote:
>>> From: Richard Elling [mailto:Richard at Nexenta.com]
>>>
>>> This operational definition of "fragmentation" comes from the single-user, single-tasking world (PeeCees). In that world, only one thread writes files from one application at one time. In those cases, there is a reasonable expectation that a single file's blocks might be contiguous on a single disk. That isn't the world we live in, where we have RAID, multi-user, or multi-threaded environments.
>>
>> I don't know what you're saying, but I'm quite sure I disagree with it.
>>
>> Regardless of multithreading, multiprocessing, it's absolutely possible to have contiguous files, and/or file fragmentation. That's not a characteristic which depends on the threading model.
>
> Possible, yes. Probable, no. Consider that a file system is allocating space for multiple, concurrent file writers.

With appropriate write caching and grouping or re-ordering of writes algorithms, it should be possible to minimize the amount of file interleaving and fragmentation on write that takes place. (Or at least optimize the amount of file interleaving. Years ago MFM hard drives had configurable sector interleave factors to better optimize performance, when no interleaving meant the drive had spun the platter far enough to be ready to give the next sector to the CPU before the CPU was ready, with the result that the platter had to be spun a second time around to wait for the CPU to catch up.)

>> Also regardless of raid, it's possible to have contiguous or fragmented files. The same concept applies to multiple disks.
>
> RAID works against the efforts to gain performance by contiguous access because the access becomes non-contiguous.

From what I've seen, defragmentation offers its greatest benefit when the tiniest reads are eliminated by grouping them into larger contiguous reads. Once the contiguous areas reach a certain size (somewhere in the few Mbytes to a few hundred Mbytes range), further defragmentation offers little additional benefit.

Full defragmentation is a useful goal when the option of using file carving based data recovery is desirable. Also remember that defragmentation is not limited to space used by files. It can also apply to free, unused space, which should also be defragmented to prevent future writes from being fragmented on write.

With regard to multiuser systems and how that negates the need to defragment, I think that is only partially true. As long as the files are defragmented enough so that each particular read request only requires one seek before it is time to service the next read request, further defragmentation may offer only marginal benefit. On the other hand, if files have been fragmented down to each sector being stored separately on the drive, then each read request is going to take that much longer to be completed (or will be interrupted by another read request because it has taken too long).

-hk
On Sep 13, 2010, at 9:41 PM, Haudy Kazemi wrote:
> Richard Elling wrote:
>> On Sep 13, 2010, at 5:14 AM, Edward Ned Harvey wrote:
>>>> From: Richard Elling [mailto:Richard at Nexenta.com]
>>>>
>>>> This operational definition of "fragmentation" comes from the single-user, single-tasking world (PeeCees). In that world, only one thread writes files from one application at one time. In those cases, there is a reasonable expectation that a single file's blocks might be contiguous on a single disk. That isn't the world we live in, where we have RAID, multi-user, or multi-threaded environments.
>>>
>>> I don't know what you're saying, but I'm quite sure I disagree with it.
>>>
>>> Regardless of multithreading, multiprocessing, it's absolutely possible to have contiguous files, and/or file fragmentation. That's not a characteristic which depends on the threading model.
>>
>> Possible, yes. Probable, no. Consider that a file system is allocating space for multiple, concurrent file writers.
>
> With appropriate write caching and grouping or re-ordering of writes algorithms, it should be possible to minimize the amount of file interleaving and fragmentation on write that takes place.

To some degree, ZFS already does this. The dynamic block sizing tries to ensure that a file is written into the largest block[1].

> (Or at least optimize the amount of file interleaving. Years ago MFM hard drives had configurable sector interleave factors to better optimize performance, when no interleaving meant the drive had spun the platter far enough to be ready to give the next sector to the CPU before the CPU was ready, with the result that the platter had to be spun a second time around to wait for the CPU to catch up.)

Reason #526 why SSDs kill HDDs on performance.

>>> Also regardless of raid, it's possible to have contiguous or fragmented files. The same concept applies to multiple disks.
>>
>> RAID works against the efforts to gain performance by contiguous access because the access becomes non-contiguous.
>
> From what I've seen, defragmentation offers its greatest benefit when the tiniest reads are eliminated by grouping them into larger contiguous reads. Once the contiguous areas reach a certain size (somewhere in the few Mbytes to a few hundred Mbytes range), further defragmentation offers little additional benefit.

For the wikipedia definition of defragmentation, this can only occur when the files themselves are hundreds of megabytes in size. This is not the general case for which I see defragmentation used. Also, ZFS has an intelligent prefetch algorithm that can hide some performance aspects of "defragmentation" on HDDs.

> Full defragmentation is a useful goal when the option of using file carving based data recovery is desirable. Also remember that defragmentation is not limited to space used by files. It can also apply to free, unused space, which should also be defragmented to prevent future writes from being fragmented on write.

This is why ZFS uses a first fit algorithm until space becomes too low, when it changes to a best fit algorithm. As long as available space is big enough for the block, then it will be used.

> With regard to multiuser systems and how that negates the need to defragment, I think that is only partially true. As long as the files are defragmented enough so that each particular read request only requires one seek before it is time to service the next read request, further defragmentation may offer only marginal benefit. On the other hand, if files have been fragmented down to each sector being stored separately on the drive, then each read request is going to take that much longer to be completed (or will be interrupted by another read request because it has taken too long).

Yes, so try to avoid running your ZFS pool at more than 96% full.
 -- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
Richard Elling
richard at nexenta.com +1-760-896-4422
Enterprise class storage for everyone
www.nexenta.com
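Checking how full a pool is (and therefore how close the allocator is to abandoning first-fit) takes one command; the pool name is hypothetical:

  zpool list tank    # the CAP column shows the percentage of pool capacity in use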
> From: Haudy Kazemi [mailto:kaze0010 at umn.edu]
>
> With regard to multiuser systems and how that negates the need to
> defragment, I think that is only partially true. As long as the files
> are defragmented enough so that each particular read request only
> requires one seek before it is time to service the next read request,
> further defragmentation may offer only marginal benefit. On the other

Here's a great way to quantify how much "fragmentation" is acceptable:

Suppose you want to ensure at least 99% efficiency of the drive. At most 1% of the time wasted by seeking.

Suppose you're talking about 7200rpm sata drives, which sustain 500 Mbit/s transfer, and have average seek time 8ms.

8ms is 1% of 800ms.
In 800ms, the drive could read 400 Mbit of sequential data.
That's 50 MB.

So as long as the "fragment" size of your files is approximately 50 MB or larger, then fragmentation has a negligible effect on performance. One seek per every 50 MB read/written will yield less than 1% performance impact.

For the heck of it, let's see how that would have computed with 15krpm SAS drives. Sustained transfer 1 Gbit/s, and average seek 3.5ms.

3.5ms is 1% of 350ms.
In 350ms, the drive could read 350 Mbit (call it 43 MB).

That's certainly in the same ballpark.
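Generalizing the arithmetic above: if a drive seeks in t seconds and streams at B MB/s, keeping seek overhead below a fraction f requires contiguous runs of at least B * t * (1 - f) / f. A quick check of the numbers from the post, as a one-liner:

  awk 'BEGIN {
    f = 0.01;                                                 # allow 1% of time lost to seeks
    # 7200rpm SATA: 8 ms seek, 500 Mbit/s ~= 62.5 MB/s
    printf "SATA: >= %.0f MB per seek\n", 62.5 * 0.008 * (1 - f) / f;
    # 15k rpm SAS: 3.5 ms seek, 1 Gbit/s ~= 125 MB/s
    printf "SAS:  >= %.0f MB per seek\n", 125 * 0.0035 * (1 - f) / f;
  }'

This prints roughly 50 MB for the SATA case and 43 MB for the SAS case.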
> From: Richard Elling [mailto:Richard at Nexenta.com]
>
>> With appropriate write caching and grouping or re-ordering of writes
>> algorithms, it should be possible to minimize the amount of file
>> interleaving and fragmentation on write that takes place.
>
> To some degree, ZFS already does this. The dynamic block sizing tries
> to ensure that a file is written into the largest block[1].

Yes, but the block sizes in question are typically up to 128K. As computed in my email 1 minute ago ... the "fragment" size needs to be on the order of 50 MB in order to effectively eliminate the performance loss of fragmentation.

> Also, ZFS has an intelligent prefetch algorithm that can hide some
> performance aspects of "defragmentation" on HDDs.

Unfortunately, prefetch can only hide fragmentation on systems that have idle disk time. Prefetch isn't going to help you if you actually need to transfer a whole file as fast as possible.
Richard Elling wrote:
> Define "fragmentation"?

Maybe this is the wrong thread. I have noticed that an old pool can take 4 hours to scrub, with a large portion of the time reading from the pool disks at the rate of 150+ MB/s, but zpool iostat reports 2 MB/s read speed. My naive interpretation is that the data the scrub is looking for has become fragmented.

Should I refresh the pool by zfs sending it to another pool and then zfs receiving the data back again, the same scrub can take less than an hour, with zpool iostat reporting more sane throughput.

On an old pool which had lots of snapshots come and go, the scrub throughput is awful. On that same data, refreshed via zfs send/receive, the throughput is much better. It would appear to me that this is an artifact of fragmentation, although I have nothing scientific on which to base this. Additional unscientific observations lead me to believe these same "refreshed" pools also perform better for non-scrub activities.
-- This message posted from opensolaris.org
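One way to put rough numbers behind observations like this is to time a scrub and watch the pool while it runs (pool name hypothetical):

  zpool scrub tank
  zpool status tank          # shows scrub progress and estimated completion
  zpool iostat -v tank 10    # per-vdev bandwidth and IOPS, sampled every 10 seconds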
The difference between multi-user thinking and single-user thinking is really quite dramatic in this area. I came up the time-sharing side (PDP-8, PDP-11, DECSYSTEM-20); TOPS-20 didn't have any sort of disk defragmenter, and nobody thought one was particularly desirable, because the normal access pattern of a busy system was spread all across the disk packs anyway.

On a desktop workstation, it makes some sense to think about loading big executable files fast -- that's something the user is sitting there waiting for, and there's often nothing else going on at that exact moment. (There *could* be significant things happening in the background, but quite often there aren't.) Similarly, loading a big "document" (single-file book manuscript, bitmap image, or whatever) happens at a point where the user has requested it and is waiting for it right then, and there's mostly nothing else going on.

But on really shared disk space (either on a timesharing system, or a network file server serving a good-sized user base), the user is competing for disk activity (either bandwidth or IOPs, depending on the access pattern of the users). Generally you don't get to load your big DLL in one read -- and to the extent that you don't, it doesn't matter much how it's spread around the disk, because the head won't be in the same spot when you get your turn again.

-- 
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
On Sep 14, 2010, at 4:58 AM, Edward Ned Harvey wrote:
>> From: Haudy Kazemi [mailto:kaze0010 at umn.edu]
>>
>> With regard to multiuser systems and how that negates the need to
>> defragment, I think that is only partially true. As long as the files
>> are defragmented enough so that each particular read request only
>> requires one seek before it is time to service the next read request,
>> further defragmentation may offer only marginal benefit. On the other
>
> Here's a great way to quantify how much "fragmentation" is acceptable:
>
> Suppose you want to ensure at least 99% efficiency of the drive. At most 1%
> of the time wasted by seeking.

This is practically impossible on a HDD. If you need this, use SSD. This phenomenon is why "short-stroking" became popular, until SSDs killed short-stroking.

> Suppose you're talking about 7200rpm sata drives, which sustain 500 Mbit/s
> transfer, and have average seek time 8ms.
>
> 8ms is 1% of 800ms.
> In 800ms, the drive could read 400 Mbit of sequential data.
> That's 50 MB.

In UFS we have cluster groups -- they don't survive the test of time. In ZFS we have metaslabs, perhaps with a better chance of longevity. The vdev is divided into a set of metaslabs and the allocator tries to use space in one metaslab before changing to another.

Several features work against HDD optimization. Redundant copies of the metadata are intentionally spread across the media, so that there is some resilience to media errors. Entries into the ZIL can also be of varying size and are allocated in the pool -- solved by using a separate log device. COW can lead to wikipedia disk [de]fragmentation for files which are larger than the recordsize.

Continuing to try to optimize for HDD performance is just a matter of changing the lipstick on the pig.
 -- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
Richard Elling
richard at nexenta.com +1-760-896-4422
Enterprise class storage for everyone
www.nexenta.com
> From: Richard Elling [mailto:Richard at Nexenta.com]
>
>> Suppose you want to ensure at least 99% efficiency of the drive. At most 1%
>> of the time wasted by seeking.
>
> This is practically impossible on a HDD. If you need this, use SSD.

Lately, Richard, you're saying some of the craziest illogical things I've ever heard, about fragmentation and/or raid.

It is absolutely not difficult to avoid fragmentation on a spindle drive, at the level I described. Just keep plenty of empty space in your drive, and you won't have a fragmentation problem. (Except as required by COW.) How on earth do you conclude this is "practically impossible?"

For example, if you start with an empty drive, and you write a large amount of data to it, you will have no fragmentation. (At least, no significant fragmentation; you may get a little bit based on random factors.) As life goes on, as long as you keep plenty of empty space on the drive, there's never any reason for anything to become significantly fragmented.

Again, except for COW. It is known that COW will cause fragmentation if you write randomly in the middle of a file that is protected by snapshots.
On Sep 15, 2010, at 2:18 PM, Edward Ned Harvey wrote:
>> From: Richard Elling [mailto:Richard at Nexenta.com]
>>
>>> Suppose you want to ensure at least 99% efficiency of the drive. At most 1%
>>> of the time wasted by seeking.
>>
>> This is practically impossible on a HDD. If you need this, use SSD.
>
> Lately, Richard, you're saying some of the craziest illogical things I've ever heard, about fragmentation and/or raid.
>
> It is absolutely not difficult to avoid fragmentation on a spindle drive, at the level I described. Just keep plenty of empty space in your drive, and you won't have a fragmentation problem. (Except as required by COW.) How on earth do you conclude this is "practically impossible?"

It is practically impossible to keep a drive from seeking. It is also practically impossible to keep from blowing a rev. Cute little piggy, eh? :-)

> For example, if you start with an empty drive, and you write a large amount of data to it, you will have no fragmentation. (At least, no significant fragmentation; you may get a little bit based on random factors.) As life goes on, as long as you keep plenty of empty space on the drive, there's never any reason for anything to become significantly fragmented.
>
> Again, except for COW. It is known that COW will cause fragmentation if you write randomly in the middle of a file that is protected by snapshots.

IFF the file is larger than recordsize.
 -- richard
On 09/16/10 09:18 AM, Edward Ned Harvey wrote:
>> From: Richard Elling [mailto:Richard at Nexenta.com]
>>
>>> Suppose you want to ensure at least 99% efficiency of the drive. At most 1%
>>> of the time wasted by seeking.
>>
>> This is practically impossible on a HDD. If you need this, use SSD.
>
> Lately, Richard, you're saying some of the craziest illogical things I've ever heard, about fragmentation and/or raid.
>
> It is absolutely not difficult to avoid fragmentation on a spindle drive, at the level I described. Just keep plenty of empty space in your drive, and you won't have a fragmentation problem. (Except as required by COW.) How on earth do you conclude this is "practically impossible?"

Drives seek, there isn't a lot you can do to stop that.

-- 
Ian.
On Wed, Sep 15, 2010 at 05:18:08PM -0400, Edward Ned Harvey wrote:
> It is absolutely not difficult to avoid fragmentation on a spindle drive, at
> the level I described. Just keep plenty of empty space in your drive, and
> you won't have a fragmentation problem. (Except as required by COW.) How
> on earth do you conclude this is "practically impossible?"

That's expensive. It's also approaching short-stroking (which is expensive). Which is what Richard said (in so many words, that it's expensive).

Can you make HDDs perform awesome? Yes, but you'll need lots of them, and you'll need to use them very inefficiently.

Nico
--
> From: Richard Elling [mailto:Richard at Nexenta.com]
>
> It is practically impossible to keep a drive from seeking. It is also

The first time somebody (Richard) said "you can't prevent a drive from seeking," I just decided to ignore it. But then it was said twice. (Ian.)

I don't get why anybody is saying "drives seek." Did anybody say drives don't seek?

I said you can quantify how much "fragmentation" is acceptable, given drive speed characteristics and a percentage of time you consider acceptable for seeking. I suggested "acceptable" was 99% efficiency and 1% of time wasted seeking. Roughly calculated, I came up with about 50 MB of sequential data per random seek to yield 99% efficiency.

For some situations, that's entirely possible and likely to be the norm. For other cases, it may be unrealistic, and you may suffer badly from fragmentation.

Is there some point we're talking about here? I don't get why the conversation seems to have taken such a tangent.
On Wed, September 15, 2010 16:18, Edward Ned Harvey wrote:
> For example, if you start with an empty drive, and you write a large amount
> of data to it, you will have no fragmentation. (At least, no significant
> fragmentation; you may get a little bit based on random factors.) As life
> goes on, as long as you keep plenty of empty space on the drive, there's
> never any reason for anything to become significantly fragmented.

Sure, if only a single thread is ever writing to the disk store at a time.

This situation doesn't exist with any kind of enterprise disk appliance, though; there are always multiple users doing stuff.

-- 
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
>>>>> "dd" == David Dyer-Bennet <dd-b at dd-b.net> writes:dd> Sure, if only a single thread is ever writing to the disk dd> store at a time. video warehousing is a reasonable use case that will have small numbers of sequential readers and writers to large files. virtual tape library is another obviously similar one. basically, things which used to be stored on tape. which are not uncommon. AIUI ZFS does not have a fragmentation problem for these cases unless you fill past 96%, though I''ve been trying to keep my pool below 80% because <general FUD>. dd> This situation doesn''t exist with any kind of enterprise disk dd> appliance, though; there are always multiple users doing dd> stuff. the point''s relevant, but I''m starting to tune out every time I hear the word ``enterprise.'''' seems it often decodes to: (1) ``fat sacks and no clue,'''' or (2) ``i can''t hear you i can''t hear you i have one big hammer in my toolchest and one quick answer to all questions, and everything''s perfect! perfect, I say. unless you''re offering an even bigger hammer I can swap for this one, I don''t want to hear it,'''' or (3) ``However of course I agree that hammers come in different colors, and a wise and experienced craftsman will always choose the color of his hammer based on the color of the nail he''s hitting, because the interface between hammers and nails doesn''t work well otherwise. We all know here how to match hammer and nail colors, but I don''t want to discuss that at all because it''s a private decision to make between you and your salesdroid. ``However, in this forum here we talk about GREEN NAILS ONLY. If you are hitting green nails with red hammers and finding they go into the wood anyway then you are being very unprofessional because that nail might have been a bank transaction. --posted from opensolaris.org'''' -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100916/deb8ff19/attachment.bin>
David Dyer-Bennet wrote:
> Sure, if only a single thread is ever writing to the
> disk store at a time.
>
> This situation doesn't exist with any kind of
> enterprise disk appliance,
> though; there are always multiple users doing stuff.

Ok, I'll bite.

Your assertion seems to be that "any kind of enterprise disk appliance" will always have enough simultaneous I/O requests queued that any sequential read from any application will be sufficiently broken up by requests from other applications, effectively rendering all read requests as random. If I follow your logic, since all requests are essentially random anyway, then where they fall on the disk is irrelevant.

I might challenge a couple of those assumptions. First, if the data is not fragmented, then ZFS would coalesce multiple contiguous read requests into a single large read request, increasing total throughput regardless of competing I/O requests (which also might benefit from the same effect). Second, I am unaware of an enterprise requirement that disk I/O run at 100% busy, any more than I am aware of the same requirement for full network link utilization, CPU utilization or PCI bus utilization.

What appears to be missing from this discussion is any shred of scientific evidence that fragmentation is good or bad and by how much. We also lack any detail on how much fragmentation does take place. Let's see if some people in the community can get some real numbers behind this stuff in real world situations.

Cheers,
Marty
-- This message posted from opensolaris.org
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of David Dyer-Bennet
>
>> For example, if you start with an empty drive, and you write a large amount
>> of data to it, you will have no fragmentation. (At least, no significant
>> fragmentation; you may get a little bit based on random factors.) As life
>> goes on, as long as you keep plenty of empty space on the drive, there's
>> never any reason for anything to become significantly fragmented.
>
> Sure, if only a single thread is ever writing to the disk store at a time.

This has already been discussed in this thread. The threading model doesn't affect the outcome of files being fragmented or unfragmented on disk. The OS is smart enough to know "these blocks written by process A are all sequential, and those blocks all written by process B are also sequential, but separate."
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Marty Scholes
>
> What appears to be missing from this discussion is any shred of
> scientific evidence that fragmentation is good or bad and by how much.
> We also lack any detail on how much fragmentation does take place.

Agreed.

I've been rather lazily asserting a few things here and there that I expected to be challenged, so I've been thinking up tests to verify/dispute my claims, but then nobody challenged. Specifically, "the blocks on disk are not interleaved just because multiple threads were writing at the same time." So there's at least one thing which is testable, if anyone cares. But there's also no way that I know of to measure fragmentation in a real system that's been in production for a year.
On Sep 16, 2010, at 12:33 PM, Marty Scholes wrote:
> David Dyer-Bennet wrote:
>> Sure, if only a single thread is ever writing to the
>> disk store at a time.
>>
>> This situation doesn't exist with any kind of
>> enterprise disk appliance,
>> though; there are always multiple users doing stuff.
>
> Ok, I'll bite.
>
> Your assertion seems to be that "any kind of enterprise disk appliance" will always have enough simultaneous I/O requests queued that any sequential read from any application will be sufficiently broken up by requests from other applications, effectively rendering all read requests as random. If I follow your logic, since all requests are essentially random anyway, then where they fall on the disk is irrelevant.

Allan and Neel did a study of this for MySQL.
http://www.youtube.com/watch?v=a31NhwzlAxs
 -- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
Richard Elling
richard at nexenta.com +1-760-896-4422
Enterprise class storage for everyone
www.nexenta.com
On Thu, September 16, 2010 14:04, Miles Nordin wrote:
>>>>>> "dd" == David Dyer-Bennet <dd-b at dd-b.net> writes:
>
> dd> Sure, if only a single thread is ever writing to the disk
> dd> store at a time.
>
> video warehousing is a reasonable use case that will have small
> numbers of sequential readers and writers to large files. virtual
> tape library is another obviously similar one. basically, things
> which used to be stored on tape. which are not uncommon.

Haven't encountered those kinds of things first-hand, so I didn't think of them. Yes, those sound like they'd have lower numbers of simultaneous users by a lot, for one reason or another.

> AIUI ZFS does not have a fragmentation problem for these cases unless
> you fill past 96%, though I've been trying to keep my pool below 80%
> because <general FUD>.

As various people have said recently, we have no way to measure it that we know of. I don't feel I have a problem in my own setup, but it's so low-stress that if ZFS doesn't work there, it wouldn't work anywhere.

> dd> This situation doesn't exist with any kind of enterprise disk
> dd> appliance, though; there are always multiple users doing
> dd> stuff.
>
> the point's relevant, but I'm starting to tune out every time I hear
> the word ``enterprise.'' seems it often decodes to:

Picked the phrase out of an orifice; trying to distinguish between storage for key corporate data assets, and other uses.

> (1) ``fat sacks and no clue,'' or
>
> (2) ``i can't hear you i can't hear you i have one big hammer in my
> toolchest and one quick answer to all questions, and everything's
> perfect! perfect, I say. unless you're offering an even bigger
> hammer I can swap for this one, I don't want to hear it,'' or
>
> (3) ``However of course I agree that hammers come in different
> colors, and a wise and experienced craftsman will always choose
> the color of his hammer based on the color of the nail he's
> hitting, because the interface between hammers and nails doesn't
> work well otherwise. We all know here how to match hammer and
> nail colors, but I don't want to discuss that at all because it's
> a private decision to make between you and your salesdroid.
>
> ``However, in this forum here we talk about GREEN NAILS ONLY. If
> you are hitting green nails with red hammers and finding they go
> into the wood anyway then you are being very unprofessional
> because that nail might have been a bank transaction. --posted
> from opensolaris.org''

#3 is particularly amusing!

-- 
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info