Alexandre Oliva
2013-Mar-18 21:14 UTC
corruption of active mmapped files in btrfs snapshots
For quite a while, I've experienced oddities with snapshotted Firefox _CACHE_00?_ files, whose checksums (and contents) would change after the btrfs snapshot was taken, and would even change depending on how the file was brought into memory (e.g., rsyncing it to backup storage vs checking its md5sum before or after the rsync). This only affected these cache files, so I didn't give it too much attention.

A similar problem seems to affect the leveldb databases maintained by ceph within the periodic snapshots it takes of its object storage volumes. I'm told others using ceph on filesystems other than btrfs are not observing this problem, which makes me think it's not memory corruption within ceph itself. I've looked into this for a bit, and I'm now inclined to believe it has to do with some bad interaction of mmap and snapshots; I'm not sure whether the fact that the filesystem has compression enabled has any effect, but that's certainly a possibility.

leveldb does not modify file contents once they're initialized, it only appends to files, ftruncate()ing them to about a MB early on, mmap()ping that in and memcpy()ing blocks of various sizes to the end of the output buffer, occasionally msync()ing the maps, or running fdatasync if it didn't msync a map before munmap()ping it. If it runs out of space in a map, it munmap()s the previously mapped range, truncates the file to a larger size, then maps in the new tail of the file, starting at the page it should append to next. (The pattern is sketched in code just below.)

What I'm observing is that some btrfs snapshots taken by ceph osds, containing the leveldb database, are corrupted, causing crashes during the use of the database.

I've scripted regular checks of osd snapshots, saving the last-known-good database along with the first one that displays the corruption. Studying about two dozen failures over the weekend, that took place on all of 13 btrfs-based osds on 3 servers running btrfs as in 3.8.3(-gnu), I noticed that all of the corrupted databases had a similar pattern: a stream of NULs of varying sizes at the end of a page, starting at a block boundary (leveldb doesn't do page-sized blocking, so blocks can start anywhere in a page), and ending close to the beginning of the next page, although not exactly at the page boundary; 20 bytes past the page boundary seemed to be the most common size, but the occasional presence of NULs in the database contents makes it harder to tell for sure.

The stream of NULs ended in the middle of a database block (meaning it was not the beginning of a subsequent database block written later; the beginning of the database block was partially replaced with NULs). Furthermore, the checksum fails to match on this one partially-NULed block. Since the checksum is computed just before the block and the checksum trailer are memcpy()ed to the mmap()ed area, it is a certainty that the block was copied entirely to the right place at some point, and if part of it became zeros, it's either because the modification was partially lost, or because the mmapped buffer was partially overwritten. The fact that all instances of corruption I looked at were correct right to the end of one block boundary, and then all zeros instead of the beginning of the subsequent block to the end of that page, makes a failure to write that modified page seem more likely in my mind (more so given the Firefox _CACHE_ file oddities in snapshots); intense memory pressure at the time of the corruption also seems to favor this possibility.
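In rough code, the append pattern above looks like this (a minimal sketch for illustration only, not leveldb's actual implementation; the real one is class PosixMmapFile in leveldb's util/env_posix.cc):

#define _FILE_OFFSET_BITS 64
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>

#define MAP_SIZE (1024 * 1024)	/* map roughly a MB at a time */

static int fd;			/* output file */
static off_t file_offset;	/* file offset corresponding to base */
static char *base, *dst;	/* current mapping and append position */

/* unmap the current region, grow the file, and map in the new tail,
 * starting at the (page-aligned) offset the next append goes to */
static void map_new_region(void)
{
	if (base) {
		/* leveldb may msync(base, MAP_SIZE, MS_SYNC) here if asked to */
		munmap(base, MAP_SIZE);
		file_offset += MAP_SIZE;
	}
	if (ftruncate(fd, file_offset + MAP_SIZE))
		abort();
	base = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
		    fd, file_offset);
	if (base == MAP_FAILED)
		abort();
	dst = base;
}

/* append by memcpy()ing into the mapping, remapping when it fills up */
static void append(const char *data, size_t n)
{
	while (n > 0) {
		size_t avail = base ? (size_t)(base + MAP_SIZE - dst) : 0;
		size_t chunk = n < avail ? n : avail;

		if (chunk == 0) {
			map_new_region();
			continue;
		}
		memcpy(dst, data, chunk);
		dst += chunk;
		data += chunk;
		n -= chunk;
	}
}

int main(void)
{
	fd = open("testfile", O_CREAT | O_RDWR | O_TRUNC, 0644);
	if (fd < 0)
		abort();
	append("hello", 5);
	/* on close, leveldb-style code trims the file back to the bytes
	 * actually written, and fdatasync()s if it didn't msync the map */
	ftruncate(fd, file_offset + (dst - base));
	fdatasync(fd);
	close(fd);
	return 0;
}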
Now, it could be that btrfs requires those who modify SHARED mmap()ed files to take some extra precaution to make sure the data makes it to a subsequent snapshot, along the lines of msync MS_ASYNC, and that leveldb does not take this sort of precaution. However, I noticed that the unexpected stream of zeros after a prior block and before the rest of the subsequent block *remains* in subsequent snapshots, which to me indicates the page update is effectively lost. This explains why even the running osd, which operates on the “current” subvolumes from which snapshots for recovery are taken, occasionally crashes because of database corruption, and will later fail to restart from an earlier snapshot due to that same corruption.

Does this problem sound familiar to anyone else?

Should mmaped-file writers in general do more than umount or msync to ensure changes make it to subsequent snapshots that are supposed to be consistent?

Any tips on where to start looking so as to fix the problem, or even to confirm that the problem is indeed in btrfs?

TIA,

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer
Alexandre Oliva
2013-Mar-18 22:43 UTC
Re: corruption of active mmapped files in btrfs snapshots
While I wrote the previous email, a smoking gun formed on one of my servers: a snapshot that had passed a database consistency check turned out to be corrupted when I tried to roll back to it! Since the snapshot was not modified in any way between the initial scripted check and the later manual check, the problem must be in btrfs.

On Mar 18, 2013, Alexandre Oliva <oliva@gnu.org> wrote:

> I've scripted regular checks of osd snapshots, saving the
> last-known-good database along with the first one that displays the
> corruption.  Studying about two dozen failures over the weekend, that
> took place on all of 13 btrfs-based osds on 3 servers running btrfs as
> in 3.8.3(-gnu), I noticed that all of the corrupted databases had a
> similar pattern: a stream of NULs of varying sizes at the end of a page,
> starting at a block boundary (leveldb doesn't do page-sized blocking, so
> blocks can start anywhere in a page), and ending close to the beginning
> of the next page, although not exactly at the page boundary; 20 bytes
> past the page boundary seemed to be the most common size, but the
> occasional presence of NULs in the database contents makes it harder to
> tell for sure.

Additional corrupted snapshots collected today have confirmed this pattern, except that today I got several corrupted files with non-NULs right at the beginning of the page following the one that marked the beginning of the corrupted database block.
Chris Mason
2013-Mar-18 22:52 UTC
Re: corruption of active mmapped files in btrfs snapshots
A few questions. Does leveldb use O_DIRECT and mmap together? (the source of a write being pages that are mmap'd from somewhere else) That's the most likely place for this kind of problem.

Also, you mention crc errors. Are those reported by btrfs, or are they application-level crcs?

Thanks for all the time you spent tracking it down this far.

-chris

Quoting Alexandre Oliva (2013-03-18 17:14:41)
> [...]
Alexandre Oliva
2013-Mar-19 05:20 UTC
Re: corruption of active mmapped files in btrfs snapshots
On Mar 18, 2013, Chris Mason <chris.mason@fusionio.com> wrote:

> A few questions.  Does leveldb use O_DIRECT and mmap together?

No, it doesn't use O_DIRECT at all. Its I/O interface is very simplified: it just opens each new file (database chunks limited to 2MB) with O_CREAT|O_RDWR|O_TRUNC, and then uses ftruncate, mmap, msync, munmap and fdatasync. It doesn't seem to modify data once it's written; it only appends. Reading data back from it uses a completely different class interface, using separate descriptors and pread only.

> (the source of a write being pages that are mmap'd from somewhere
> else)

AFAICT the source of the memcpy()s that append to the file is malloc()ed memory.

> That's the most likely place for this kind of problem.  Also, you
> mention crc errors.  Are those reported by btrfs or are they application
> level crcs.

These are CRCs leveldb computes and writes out after each db block. No btrfs CRC errors are reported in this process.
Chris Mason
2013-Mar-19 12:09 UTC
Re: corruption of active mmapped files in btrfs snapshots
Quoting Alexandre Oliva (2013-03-19 01:20:10)
> [...]
> These are CRCs leveldb computes and writes out after each db block.  No
> btrfs CRC errors are reported in this process.

Ok, so we have three moving pieces here:

1) leveldb truncating the files
2) leveldb using mmap to write
3) btrfs snapshots

My guess is the truncate is creating an orphan item that is being processed inside the snapshot.

Is it possible to create a smaller leveldb unit test that we might use to exercise all of this?

-chris
Sage Weil
2013-Mar-19
Re: corruption of active mmapped files in btrfs snapshots

On Tue, 19 Mar 2013, Chris Mason wrote:
> [...]
> Is it possible to create a smaller leveldb unit test that we might use
> to exercise all of this?

There is a set of unit tests in the leveldb source tree that ought to do the trick:

	git clone https://code.google.com/p/leveldb/

sage
Alexandre Oliva
2013-Mar-19 19:26 UTC
Re: corruption of active mmapped files in btrfs snapshots
On Mar 19, 2013, Chris Mason <clmason@fusionio.com> wrote:

> My guess is the truncate is creating an orphan item

Would it, even though the truncate is used to grow rather than to shrink the file?

> that is being processed inside the snapshot.

This doesn't explain why the master database occasionally gets similarly corrupted, does it?

> Is it possible to create a smaller leveldb unit test that we might use
> to exercise all of this?

I suppose we can even do away with leveldb altogether, using only a PosixMmapFile object, as created by PosixEnv::NewWritableFile (all of this is defined in leveldb's util/env_posix.cc), to exercise the creation and growth of multiple files, one at a time, taking btrfs snapshots at random in between the writes. This ought to suffice.

One thing I have yet to check is whether ceph uses the sync leveldb WriteOption, to determine whether or not to call the file object's Sync member function in the test; this would bring fdatasync and msync calls into the picture, that would otherwise be left entirely out of the test.
Alexandre Oliva
2013-Mar-19 19:26 UTC
Re: corruption of active mmapped files in btrfs snapshots
On Mar 19, 2013, Sage Weil <sage@inktank.com> wrote:

> There is a set of unit tests in the leveldb source tree that ought to do
> the trick:
>
> git clone https://code.google.com/p/leveldb/

But these don't create btrfs snapshots.
Alexandre Oliva
2013-Mar-20 01:58 UTC
Re: corruption of active mmapped files in btrfs snapshots
On Mar 19, 2013, Alexandre Oliva <oliva@gnu.org> wrote:

>> that is being processed inside the snapshot.

> This doesn't explain why the master database occasionally gets similarly
> corrupted, does it?

Actually, scratch this bit for now. I don't really have proof that the master database actually gets corrupted while it's in use, rather than having inherited corruption on a server restart, which rolls back to the most recent snapshot and replays the osd journal on it. It could be that the used snapshot is corrupted in a way that doesn't manifest itself immediately, or that it gets corrupted afterwards with your delayed-orphan theory.

I wrote a test that exercises leveldb's PosixMmapFile with highly compressible appends of varying sizes, as well as syncs and btrfs snapshots at random, but I haven't been able to trigger the problem with it (yet?). I'm now instrumenting the failing code to try to collect more data.

It looks like, even though ceph does use leveldb's sync option in some situations, the syncs don't seem to all reach the data files, only the leveldb logs.
Alexandre Oliva
2013-Mar-21 07:14 UTC
Re: corruption of active mmapped files in btrfs snapshots
On Mar 19, 2013, Alexandre Oliva <oliva@gnu.org> wrote:

> Actually, scratch this bit for now.  I don't really have proof that the
> master database actually gets corrupted while it's in use

Scratch the “scratch this”. The master database actually gets corrupted, and it's with recently-created files, created after earlier known-good snapshots. So, it can't really be orphan processing, can it?

Some more info from the errors and instrumentation:

- no data syncing on the affected files is taking place.  it's just
  memcpy()ing data in <4KiB-sized chunks onto mmap()ed areas,
  munmap()ing it, growing the file with ftruncate and mapping a
  subsequent chunk for further output

- the NULs at the end of pages do NOT occur at munmap/mmap boundaries as
  I suspected at first, but they do coincide with the end of extents
  that are smaller than the maximum compressed extent size (see the
  FIEMAP sketch below).  So, something's making btrfs flush pages to
  disk before the pages are completely written (which is fine in
  principle), but apparently failing to pick up subsequent changes to
  the pages (eek!)
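To correlate the zero runs with on-disk extent boundaries, one can dump the file's extent map with the FIEMAP ioctl (essentially what filefrag -v reports). A minimal dumper, offered as an illustrative sketch rather than the thread's original instrumentation (it only fetches the first batch of extents):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

#define MAX_EXTENTS 256	/* only the first batch; loop for bigger files */

int main(int ac, char **av)
{
	struct fiemap *fm;
	unsigned int i;
	int fd;

	if (ac < 2 || (fd = open(av[1], O_RDONLY)) < 0) {
		fprintf(stderr, "usage: extmap file\n");
		exit(1);
	}
	fm = calloc(1, sizeof(*fm) + MAX_EXTENTS * sizeof(struct fiemap_extent));
	fm->fm_start = 0;
	fm->fm_length = FIEMAP_MAX_OFFSET;
	fm->fm_flags = FIEMAP_FLAG_SYNC;	/* flush delalloc first */
	fm->fm_extent_count = MAX_EXTENTS;
	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("fiemap");
		exit(1);
	}
	for (i = 0; i < fm->fm_mapped_extents; i++) {
		struct fiemap_extent *e = &fm->fm_extents[i];
		printf("extent %u: logical 0x%llx len 0x%llx%s\n", i,
		       (unsigned long long)e->fe_logical,
		       (unsigned long long)e->fe_length,
		       (e->fe_flags & FIEMAP_EXTENT_ENCODED) ?
		       " (encoded/compressed)" : "");
	}
	return 0;
}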
Chris Mason
2013-Mar-21 18:06 UTC
Re: corruption of active mmapped files in btrfs snapshots
Quoting Alexandre Oliva (2013-03-21 03:14:02)

> Scratch the “scratch this”.  The master database actually gets
> corrupted, and it's with recently-created files, created after earlier
> known-good snapshots.  So, it can't really be orphan processing, can it?

Right, it can't be orphan processing.

> Some more info from the errors and instrumentation:
>
> - no data syncing on the affected files is taking place.  it's just
>   memcpy()ing data in <4KiB-sized chunks onto mmap()ed areas,
>   munmap()ing it, growing the file with ftruncate and mapping a
>   subsequent chunk for further output
>
> - the NULs at the end of pages do NOT occur at munmap/mmap boundaries as
>   I suspected at first, but they do coincide with the end of extents
>   that are smaller than the maximum compressed extent size.  So,
>   something's making btrfs flush pages to disk before the pages are
>   completely written (which is fine in principle), but apparently
>   failing to pick up subsequent changes to the pages (eek!)

With mmap the kernel can pick any given time to start writing out dirty pages. The idea is that if the application makes more changes the page becomes dirty again and the kernel writes it again.

So the question is, can you trigger this without snapshots being done at all? I'll try to make an mmap tester here that hammers on the related code. We usually test this with fsx, which catches all kinds of horrors.

-chris
Chris Mason
2013-Mar-21 23:06 UTC
Re: corruption of active mmapped files in btrfs snapshots
Quoting Chris Mason (2013-03-21 14:06:14)
> [...]
> With mmap the kernel can pick any given time to start writing out dirty
> pages.  The idea is that if the application makes more changes the page
> becomes dirty again and the kernel writes it again.
>
> So the question is, can you trigger this without snapshots being done
> at all?  I'll try to make an mmap tester here that hammers on the
> related code.  We usually test this with fsx, which catches all kinds of
> horrors.

So my test program creates an 8GB file in chunks of 1MB each, using truncate to extend the file and then mmap to write into the new hole. It is writing in 1MB chunks, ever so slightly not aligned. After creating the whole file, it reads it back to look for errors.

I'm running this with heavy memory pressure, but no snapshots. No corruptions yet, but I'll let it run a while longer.

-chris
Alexandre Oliva
2013-Mar-22 05:27 UTC
Re: corruption of active mmapped files in btrfs snapshots
On Mar 21, 2013, Chris Mason <chris.mason@fusionio.com> wrote:

> Quoting Chris Mason (2013-03-21 14:06:14)
>> With mmap the kernel can pick any given time to start writing out dirty
>> pages.  The idea is that if the application makes more changes the page
>> becomes dirty again and the kernel writes it again.

That's the theory. But what if there's some race between the time the page is frozen for compressing and the time it's marked as clean, or it's marked as clean after it's further modified, or a subsequent write to the same page ends up overridden by the background compression of the old contents of the page? These are all possibilities that come to mind without knowing much about btrfs inner workings.

>> So the question is, can you trigger this without snapshots being done
>> at all?

I haven't tried, but I now have a program that hit the error condition while taking snapshots in the background, with small time perturbations to increase the likelihood of hitting a race condition at the exact time. It uses leveldb's infrastructure for the mmapping, but it shouldn't be too hard to adapt it so that it doesn't.

> So my test program creates an 8GB file in chunks of 1MB each.

That's probably too large a chunk to write at a time. The bug is exercised with writes slightly smaller than a single page (although straddling across two consecutive pages).

This half-baked test program (hereby provided under the terms of the GNU GPLv3+) creates a btrfs subvolume and two files in it: one in which I/O will be performed with write()s, another that will get the same data appended with leveldb's mmap-based output interface. Random block sizes, as well as milli- and microsecond timing perturbations, are read from /dev/urandom, and the rest of the output buffer is filled with (char)1.

The test that actually failed (on the first try, after some other variations that didn't fail) didn't have any of the #ifdef options enabled (i.e., no -D* flags during compilation), but it triggered the exact failure observed with ceph: zeros at the end of a page where there should have been nonzero data, followed by nonzero data on the following page! That was within snapshots, not in the main subvol, but hopefully it's the same problem, just a bit harder to trigger.

I can't tell whether memory pressure is required to hit the problem. The system on which I hit the error was mostly otherwise idle while running the test, but starting so many shell commands in the background surely creates intense activity on the system, possibly increasing the odds that some race condition will hit. Two subsequent runs of the program failed to trigger the problem.
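The test program itself was attached to the original message and is not preserved in this archive. A rough sketch of the harness it describes, hypothetical and simplified (it shells out to the btrfs CLI, omits the write()-based twin file and the #ifdef variants, and leaves checking the snapshots for the NUL signature to a separate scanner), could look like:

#define _FILE_OFFSET_BITS 64
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_MAX 4096
#define ROUNDS 10000

int main(void)
{
	char buf[CHUNK_MAX];
	off_t size = 0;
	int fd, i;

	/* the btrfs CLI does the subvolume work in this sketch */
	system("btrfs subvolume create subvol");
	fd = open("subvol/db-file", O_CREAT | O_RDWR | O_TRUNC, 0600);
	if (fd < 0) {
		perror("open");
		exit(1);
	}
	srandom(getpid());
	memset(buf, 1, sizeof(buf));	/* highly compressible filler */

	if (fork() == 0) {	/* child: snapshots at perturbed intervals */
		char cmd[128];
		for (i = 0; i < 100; i++) {
			usleep(100000 + random() % 50000);
			snprintf(cmd, sizeof(cmd),
				 "btrfs subvolume snapshot subvol snap.%d >/dev/null", i);
			system(cmd);
		}
		exit(0);
	}

	for (i = 0; i < ROUNDS; i++) {	/* parent: append via mmap */
		size_t n = 1 + random() % (CHUNK_MAX - 1); /* < 1 page, may straddle two */
		off_t base = size & ~(off_t)4095;
		size_t extra = size - base;
		char *p;

		if (ftruncate(fd, size + n)) {
			perror("ftruncate");
			exit(1);
		}
		p = mmap(0, extra + n, PROT_READ | PROT_WRITE, MAP_SHARED, fd, base);
		if (p == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}
		memcpy(p + extra, buf, n);
		munmap(p, extra + n);
		size += n;
		if (random() % 64 == 0)
			usleep(random() % 2000);	/* timing perturbation */
	}
	close(fd);
	wait(NULL);
	return 0;
}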
Chris Mason
2013-Mar-22 12:07 UTC
Re: corruption of active mmapped files in btrfs snapshots
Quoting Alexandre Oliva (2013-03-22 01:27:42)

> That's the theory.  But what if there's some race between the time the
> page is frozen for compressing and the time it's marked as clean, or
> it's marked as clean after it's further modified, or a subsequent write
> to the same page ends up overridden by the background compression of the
> old contents of the page?  These are all possibilities that come to mind
> without knowing much about btrfs inner workings.

Definitely, there is a lot of room for racing. Are you using compression in btrfs or just in leveldb?

> The test that actually failed (on the first try, after some other
> variations that didn't fail) didn't have any of the #ifdef options
> enabled (i.e., no -D* flags during compilation), but it triggered the
> exact failure observed with ceph: zeros at the end of a page where there
> should have been nonzero data, followed by nonzero data on the following
> page!  That was within snapshots, not in the main subvol, but hopefully
> it's the same problem, just a bit harder to trigger.

I'd like to take snapshots out of the picture for a minute. We need some way to synchronize the leveldb with snapshotting because the snapshot is basically the same thing as a crash from a db point of view. Corrupting the main database file is a much different (and bigger) problem.

-chris
Alexandre Oliva
2013-Mar-22 14:17 UTC
Re: corruption of active mmapped files in btrfs snapshots
On Mar 22, 2013, Chris Mason <clmason@fusionio.com> wrote:

> Are you using compression in btrfs or just in leveldb?

btrfs lzo compression.

> I'd like to take snapshots out of the picture for a minute.

That's understandable, I guess, but I don't know that anyone has ever got the problem without snapshots. I mean, even when the master copy of the database got corrupted, snapshots of the subvol containing it were being taken every now and again, because that's the way ceph works. Even back when I noticed corruption of firefox _CACHE_* files, snapshots taken for archival were involved. So, unless the program happens to trigger the problem with the -DNOSNAPS option about as easily as it did without it, I guess we may not have a choice but to keep snapshots in the picture.

> We need some way to synchronize the leveldb with snapshotting

I purposefully refrained from doing that, because AFAICT ceph doesn't do that. Once I failed to trigger the problem with Sync calls, and determined ceph only syncs the leveldb logs before taking its snapshots, I went without syncing and finally succeeded in triggering the bug in snapshots, by simulating very similar snapshotting and mmapping conditions to those generated by ceph.

I haven't managed to trigger the corruption of the master subvol yet with the test program, but I already knew its corruption didn't occur as often as that of the snapshots, and since it smells like two slightly different symptoms of the same bug, I decided to leave the test program at that.
Chris Mason
2013-Mar-22 14:26 UTC
Re: corruption of active mmapped files in btrfs snapshots
Quoting Alexandre Oliva (2013-03-22 10:17:30)

>> Are you using compression in btrfs or just in leveldb?

> btrfs lzo compression.

Perfect, I'll focus on that part of things.

>> I'd like to take snapshots out of the picture for a minute.

> That's understandable, I guess, but I don't know that anyone has ever
> got the problem without snapshots.  I mean, even when the master copy of
> the database got corrupted, snapshots of the subvol containing it were
> being taken every now and again, because that's the way ceph works.

Hopefully Sage can comment, but the basic idea is that if you snapshot a database file the db must participate. If it doesn't, it really is the same effect as crashing the box.

Something is definitely broken if we're corrupting the source files (either with or without snapshots), but avoiding incomplete writes in the snapshot files requires synchronization with the db.

-chris
Samuel Just
2013-Mar-22 17:06 UTC
Re: corruption of active mmapped files in btrfs snapshots
Incomplete writes for leveldb should just result in lost updates, not corruption. Also, we do stop writes before the snapshot is initiated, so there should be no in-progress writes to leveldb other than leveldb compaction (though that might be something to investigate).

-Sam

On Fri, Mar 22, 2013 at 7:26 AM, Chris Mason <clmason@fusionio.com> wrote:
> [...]
> Something is definitely broken if we're corrupting the source files
> (either with or without snapshots), but avoiding incomplete writes in
> the snapshot files requires synchronization with the db.
David Sterba
2013-Mar-22 17:08 UTC
Re: corruption of active mmapped files in btrfs snapshots
On Fri, Mar 22, 2013 at 10:26:59AM -0400, Chris Mason wrote:

>>> I'd like to take snapshots out of the picture for a minute.

I've reproduced this without compression, with autodefrag on. The test was using snapshots (ie. the unmodified version) and ended with 1087 blocks, 4316779 total size:

snaptest.268/ca snaptest.268/db differ: char 4245170, line 16

after a few minutes. Before that, I was running the NOSNAPS mode for many minutes (up to 50k rounds) without a reported problem.

There was the same 'make clean && make -j 32' kernel compilation running in parallel; the box has 8 cpus and 4GB ram. Watching 'free' showed the memory going up to a few gigs and down to ~130MB.

david
Chris Mason
2013-Mar-22 17:12 UTC
Re: corruption of active mmapped files in btrfs snapshots
In this case, I think Alexandre is scanning for zeros in the file. The incomplete writes will definitely show that.

-chris

Quoting Samuel Just (2013-03-22 13:06:41)
> Incomplete writes for leveldb should just result in lost updates, not
> corruption.  Also, we do stop writes before the snapshot is initiated
> so there should be no in-progress writes to leveldb other than leveldb
> compaction (though that might be something to investigate).
> [...]
Sage Weil
2013-Mar-22
Re: corruption of active mmapped files in btrfs snapshots

On Fri, 22 Mar 2013, Chris Mason wrote:
> [...]
> Hopefully Sage can comment, but the basic idea is that if you snapshot a
> database file the db must participate.  If it doesn't, it really is the
> same effect as crashing the box.
>
> Something is definitely broken if we're corrupting the source files
> (either with or without snapshots), but avoiding incomplete writes in
> the snapshot files requires synchronization with the db.

In this case, we quiesce write activity, call leveldb's sync(), take the snapshot, and then continue.

(FWIW, this isn't the first time we've heard about leveldb corruption, but in each case we've looked into the user had the btrfs compression enabled.... so I suspect that's the right avenue of investigation!)

sage
Chris Mason
2013-Mar-22 18:07 UTC
Re: corruption of active mmapped files in btrfs snapshots
[ mmap corruptions with leveldb and btrfs compression ]

I ran this a number of times with compression off and wasn't able to trigger problems. With compress=lzo, I see errors on every run.

Compile: gcc -Wall -o mmap-trunc mmap-trunc.c
Run: ./mmap-trunc file_name

The basic idea is to create a 256MB file in steps. Each step ftruncates the file larger, and then mmaps a region for writing. It dirties some unaligned bytes (a little more than 8K), and then munmaps. Then a verify stage goes back through the file to make sure the data we wrote is really there. I'm using a simple rotating pattern of chars that compress very well.

I run it in batches of 100 with some memory pressure on the side:

for x in `seq 1 100` ; do (mmap-trunc f$x &) ; done

#define _FILE_OFFSET_BITS 64
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define FILE_SIZE ((loff_t)256 * 1024 * 1024)

/* make a painfully unaligned chunk size */
#define CHUNK_SIZE (8192 + 932)

#define mmap_align(x) (((x) + 4095) & ~4095)

char *file_name = NULL;

void mmap_one_chunk(int fd, loff_t *cur_size, unsigned char *file_buf)
{
	int ret;
	loff_t new_size = *cur_size + CHUNK_SIZE;
	loff_t pos = *cur_size;
	unsigned long map_size = mmap_align(CHUNK_SIZE) + 4096;
	char val = file_buf[0];
	char *p;
	int extra;

	/* step one, truncate out a hole */
	ret = ftruncate(fd, new_size);
	if (ret) {
		perror("truncate");
		exit(1);
	}
	if (val == 0 || val == 'z')
		val = 'a';
	else
		val++;
	memset(file_buf, val, CHUNK_SIZE);

	extra = pos & 4095;

	p = mmap(0, map_size, PROT_READ | PROT_WRITE, MAP_SHARED,
		 fd, pos - extra);
	if (p == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
	memcpy(p + extra, file_buf, CHUNK_SIZE);
	ret = munmap(p, map_size);
	if (ret) {
		perror("munmap");
		exit(1);
	}
	*cur_size = new_size;
}

void check_chunks(int fd)
{
	char *p;
	loff_t checked = 0;
	char val = 'a';
	int i;
	int errors = 0;
	int ret;
	int extra;
	unsigned long map_size = mmap_align(CHUNK_SIZE) + 4096;

	fprintf(stderr, "checking chunks\n");
	while (checked < FILE_SIZE) {
		extra = checked & 4095;
		p = mmap(0, map_size, PROT_READ, MAP_SHARED,
			 fd, checked - extra);
		if (p == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}
		for (i = 0; i < CHUNK_SIZE; i++) {
			if (p[i + extra] != val) {
				fprintf(stderr,
					"%s: bad val %x wanted %x offset 0x%llx\n",
					file_name, p[i + extra], val,
					(unsigned long long)checked + i);
				errors++;
			}
		}
		if (val == 'z')
			val = 'a';
		else
			val++;
		ret = munmap(p, map_size);
		if (ret) {
			perror("munmap");
			exit(1);
		}
		checked += CHUNK_SIZE;
	}
	printf("%s found %d errors\n", file_name, errors);
	if (errors)
		exit(1);
}

int main(int ac, char **av)
{
	unsigned char *file_buf;
	loff_t pos = 0;
	int ret;
	int fd;

	if (ac < 2) {
		fprintf(stderr, "usage: mmap-trunc filename\n");
		exit(1);
	}
	ret = posix_memalign((void **)&file_buf, 4096, CHUNK_SIZE);
	if (ret) {
		perror("cannot allocate memory\n");
		exit(1);
	}
	file_buf[0] = 0;
	file_name = av[1];

	fprintf(stderr, "running test on %s\n", file_name);

	unlink(file_name);
	fd = open(file_name, O_RDWR | O_CREAT, 0600);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	fprintf(stderr, "writing chunks\n");
	while (pos < FILE_SIZE) {
		mmap_one_chunk(fd, &pos, file_buf);
	}
	check_chunks(fd);
	return 0;
}
Chris Mason
2013-Mar-22 20:31 UTC
Re: corruption of active mmapped files in btrfs snapshots
Quoting Chris Mason (2013-03-22 14:07:05)

> [ mmap corruptions with leveldb and btrfs compression ]
>
> I ran this a number of times with compression off and wasn't able to
> trigger problems.  With compress=lzo, I see errors on every run.
>
> Compile: gcc -Wall -o mmap-trunc mmap-trunc.c
> Run: ./mmap-trunc file_name
>
> The basic idea is to create a 256MB file in steps.  Each step ftruncates
> the file larger, and then mmaps a region for writing.  It dirties some
> unaligned bytes (a little more than 8K), and then munmaps.
>
> Then a verify stage goes back through the file to make sure the data we
> wrote is really there.  I'm using a simple rotating pattern of chars
> that compress very well.

Going through the code here, when I change the test to truncate once in the very beginning, I still get errors. So, it isn't an interaction between mmap and truncate. It must be a problem between lzo and mmap.

-chris
Alexandre Oliva
2013-Mar-23 09:47 UTC
Re: corruption of active mmapped files in btrfs snapshots
On Mar 22, 2013, Chris Mason <clmason@fusionio.com> wrote:

> Quoting Samuel Just (2013-03-22 13:06:41)
>> Incomplete writes for leveldb should just result in lost updates, not
>> corruption.

> In this case, I think Alexandre is scanning for zeros in the file.

Yup, the symptom is zeros at the end of a page, with nonzeros on the subsequent page, which indicates that the writes to the previous page were dropped. What I actually do is to iterate over the entire database, which will error out when the block header is found to be corrupted. I use this program I wrote (also hereby provided under GNU GPLv3+) to check the database for corruption.
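The checker program was attached to the original message and is not preserved in this archive. As an illustrative stand-in (not the original program), the corruption signature described in this thread, a NUL run reaching the end of a page with nonzero data starting the next page, can be scanned for directly; the MIN_RUN threshold below is an arbitrary assumption to skip tiny legitimate NUL runs:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE 4096
#define MIN_RUN 16	/* assumed threshold; tune to taste */

int main(int ac, char **av)
{
	FILE *f;
	unsigned char prev[PAGE], cur[PAGE];
	size_t got_prev, got_cur;
	long long pageno = 0;
	int hits = 0;

	if (ac < 2 || !(f = fopen(av[1], "rb"))) {
		fprintf(stderr, "usage: nulscan file\n");
		exit(1);
	}
	got_prev = fread(prev, 1, PAGE, f);
	while (got_prev == PAGE && (got_cur = fread(cur, 1, PAGE, f)) > 0) {
		size_t run = 0;

		/* count the NUL run at the tail of the previous page */
		while (run < PAGE && prev[PAGE - 1 - run] == 0)
			run++;
		/* report a partial-page run followed by nonzero data;
		 * fully zero pages are skipped (they may be holes) */
		if (run >= MIN_RUN && run < PAGE && cur[0] != 0) {
			printf("%s: %zu NULs up to end of page %lld, "
			       "nonzero data follows\n", av[1], run, pageno);
			hits++;
		}
		memcpy(prev, cur, got_cur);
		got_prev = got_cur;
		pageno++;
	}
	fclose(f);
	return hits ? 1 : 0;
}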
Alexandre Oliva
2013-Mar-23 09:48 UTC
Re: corruption of active mmapped files in btrfs snapshots
On Mar 22, 2013, David Sterba <dsterba@suse.cz> wrote:

> I've reproduced this without compression, with autodefrag on.

I don't have autodefrag on, unless it's enabled by default on 3.8.3 or on the for-linus tree.
David Sterba
2013-Mar-25 15:33 UTC
Re: corruption of active mmapped files in btrfs snapshots
On Sat, Mar 23, 2013 at 06:48:38AM -0300, Alexandre Oliva wrote:

> On Mar 22, 2013, David Sterba <dsterba@suse.cz> wrote:
>
>> I've reproduced this without compression, with autodefrag on.
>
> I don't have autodefrag on, unless it's enabled by default on 3.8.3 or
> on the for-linus tree.

It's not on by default, but it seemed that I reproduced the same issue without compression (and autodefrag was the difference).

david
Chris Mason
2013-Mar-26 00:08 UTC
Re: corruption of active mmapped files in btrfs snapshots
Quoting Chris Mason (2013-03-22 16:31:42)

> Going through the code here, when I change the test to truncate once in
> the very beginning, I still get errors.  So, it isn't an interaction
> between mmap and truncate.  It must be a problem between lzo and mmap.

With compression off, we use clear_page_dirty_for_io to create a wall between applications using mmap and our crc code. Once we call clear_page_dirty_for_io, it means we're in the process of writing the page and anyone using mmap must wait (by calling page_mkwrite) before they are allowed to change the page.

We use it with compression on as well, but it only ends up protecting the crcs. It gets called after the compression is done, which allows applications to race in and modify the pages while we are compressing them.

This patch changes our compression code to call clear_page_dirty_for_io before we compress, and then redirty the pages if the compression fails.

Alexandre, many thanks for tracking this down into a well defined use case.

-chris

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f173c5a..cdee391 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1257,6 +1257,39 @@ int unlock_extent(struct extent_io_tree *tree, u64 start, u64 end)
 			GFP_NOFS);
 }
 
+int extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end)
+{
+	unsigned long index = start >> PAGE_CACHE_SHIFT;
+	unsigned long end_index = end >> PAGE_CACHE_SHIFT;
+	struct page *page;
+
+	while (index <= end_index) {
+		page = find_get_page(inode->i_mapping, index);
+		BUG_ON(!page); /* Pages should be in the extent_io_tree */
+		clear_page_dirty_for_io(page);
+		page_cache_release(page);
+		index++;
+	}
+	return 0;
+}
+
+int extent_range_redirty_for_io(struct inode *inode, u64 start, u64 end)
+{
+	unsigned long index = start >> PAGE_CACHE_SHIFT;
+	unsigned long end_index = end >> PAGE_CACHE_SHIFT;
+	struct page *page;
+
+	while (index <= end_index) {
+		page = find_get_page(inode->i_mapping, index);
+		BUG_ON(!page); /* Pages should be in the extent_io_tree */
+		account_page_redirty(page);
+		__set_page_dirty_nobuffers(page);
+		page_cache_release(page);
+		index++;
+	}
+	return 0;
+}
+
 /*
  * helper function to set both pages and extents in the tree writeback
  */
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 6068a19..258c921 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -325,6 +325,8 @@ int map_private_extent_buffer(struct extent_buffer *eb, unsigned long offset,
 		      unsigned long *map_len);
 int extent_range_uptodate(struct extent_io_tree *tree,
 			  u64 start, u64 end);
+int extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end);
+int extent_range_redirty_for_io(struct inode *inode, u64 start, u64 end);
 int extent_clear_unlock_delalloc(struct inode *inode,
 				struct extent_io_tree *tree,
 				u64 start, u64 end, struct page *locked_page,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ca1b767..88d4a18 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -353,6 +353,7 @@ static noinline int compress_file_range(struct inode *inode,
 	int i;
 	int will_compress;
 	int compress_type = root->fs_info->compress_type;
+	int redirty = 0;
 
 	/* if this is a small write inside eof, kick off a defrag */
 	if ((end - start + 1) < 16 * 1024 &&
@@ -415,6 +416,8 @@ again:
 		if (BTRFS_I(inode)->force_compress)
 			compress_type = BTRFS_I(inode)->force_compress;
 
+		extent_range_clear_dirty_for_io(inode, start, end);
+		redirty = 1;
 		ret = btrfs_compress_pages(compress_type,
 					   inode->i_mapping, start,
 					   total_compressed, pages,
@@ -554,6 +557,8 @@ cleanup_and_bail_uncompressed:
 			__set_page_dirty_nobuffers(locked_page);
 			/* unlocked later on in the async handlers */
 		}
+		if (redirty)
+			extent_range_redirty_for_io(inode, start, end);
 		add_async_extent(async_cow, start, end - start + 1,
 				 0, NULL, 0, BTRFS_COMPRESS_NONE);
 		*num_added += 1;
Alexandre Oliva
2013-Mar-29 09:56 UTC
Re: corruption of active mmapped files in btrfs snapshots
On Mar 25, 2013, Chris Mason <chris.mason@fusionio.com> wrote:

> This patch changes our compression code to call clear_page_dirty_for_io
> before we compress, and then redirty the pages if the compression fails.
>
> Alexandre, many thanks for tracking this down into a well defined use
> case.

Thanks for the patch; it's run flawlessly since I started gradually rolling it out onto my ceph OSDs on Monday! Ship it! :-)
Chris Mason
2013-Mar-29 11:35 UTC
Re: corruption of active mmapped files in btrfs snapshots
Quoting Alexandre Oliva (2013-03-29 05:56:06)

> Thanks for the patch; it's run flawlessly since I started gradually
> rolling it out onto my ceph OSDs on Monday!  Ship it! :-)

Awesome, it's in my for-linus and will go out to Linus today.

-chris